Machine Learning Reliability Engineering (MLRE) is an upcoming specialization of Site Reliability Engineering (SRE).
In this article, I'll explain why specialization is required in SRE and cover some of the specializations that already exist. I'll also talk about the roles and responsibilities of an MLRE and provide brief insight into how different engineering functions will interact with this new role.
Throughout this article, I'll use MLRE to refer both to the field of Machine Learning Reliability Engineering and to a Machine Learning Reliability Engineer.
Google first came up with the idea of SRE more than 15 years ago by applying the principles of software engineering to DevOps. Since then, the field has taken on a shape of its own, coexisting with DevOps. While DevOps has branched into several specializations like DataOps, DevSecOps, and MLOps, the field of SRE is yet to fully branch out into specialized domains at scale.
Over time, as other specialized fields like data science, machine learning, security engineering, and artificial intelligence mature, specialized infrastructure, tooling, and processes will also exist -- this will give rise to specializations within the field of SRE. Currently, the only branch of SRE that has flourished in the last couple of years is Database Reliability Engineering. That's because databases, data warehouses, data lakes, ETL, and other related technologies have been widely used at scale for many decades.
Just as a Database Reliability Engineer needs to have in-depth knowledge about high availability in databases, replication topologies, database migration, etc., an MLRE will need to have domain-specific knowledge related to machine learning. For instance, an MLRE needs to know how to set up monitoring and alerting specific to GPUs and TPUs. The roles and responsibilities of an MLRE are based on the same ideas as SRE.
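To make the GPU monitoring point a bit more concrete, here is a minimal sketch of what an accelerator-specific alert rule could look like. The `GpuSample` fields, the threshold defaults, and the alert messages are all assumptions made for illustration; in practice the samples would come from whatever GPU metrics exporter your monitoring stack uses.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GpuSample:
    """One scrape of a single GPU (field names are illustrative)."""
    utilization_pct: float   # 0-100
    memory_used_pct: float   # 0-100
    temperature_c: float


def evaluate_gpu_alerts(samples, max_memory_pct=90.0, max_temp_c=85.0,
                        min_util_pct=10.0):
    """Return alert messages for a window of samples from one GPU.

    The thresholds are placeholder defaults, not vendor recommendations.
    """
    if not samples:
        # Missing data is itself worth alerting on.
        return ["no data: GPU exporter may be down"]
    alerts = []
    if mean(s.memory_used_pct for s in samples) > max_memory_pct:
        alerts.append("memory pressure")
    if max(s.temperature_c for s in samples) > max_temp_c:
        alerts.append("overheating")
    if mean(s.utilization_pct for s in samples) < min_util_pct:
        alerts.append("idle GPU: paying for unused capacity")
    return alerts
```

The idle check is the one a general-purpose SRE dashboard is least likely to have: an accelerator that sits unused is a cost problem long before it is an availability problem.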
With that, let's take a quick look at the different branches of SRE that have popped up over the years:
|Branch||Focus Area||Emerged (approx.)||Related Functions|
|SRE||Application, Overall Infrastructure||2010||DevOps, Engineering|
|DBRE||Databases, Lakes, and Warehouses||2015||DataOps, Data Engineering|
|MLRE||Machine Learning||2020||MLOps, Machine Learning Engineering|
By design, reliability engineering demands generalists. Reliability engineers need to have a clear view of complete systems and the ability to understand and work on different components as and when required. Aside from the core responsibilities, MLREs also have responsibilities shared with other engineering functions.
- Making sure that machine learning infrastructure is highly available, reliable, and meets the service-level agreements (SLAs).
- Setting up a system to proactively monitor compute, memory, network latency, etc.
- Controlling costs of machine learning infrastructure by optimizing design and workflow.
- With machine learning engineers: making sure that the models are as accurate as possible by reducing feature drift, bias, fraud, etc.
- With other engineering functions: agreeing on a larger goal and making sure that the output of the work done by machine learning teams is useful and relevant to the business goal.
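Feature drift is one of the responsibilities above that can be checked automatically. The sketch below uses the Population Stability Index (PSI), which is one common way to compare a feature's training-time distribution with its serving-time distribution; it is my choice for illustration, not the only option. The function name, the bin count, and the `eps` floor are assumptions.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a training-time and a serving-time sample of one feature.

    Buckets are equal-width over the range of the training sample.
    """
    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 if the training sample is constant.
    width = (hi - lo) / bins or 1.0

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range serving values into the first or last bucket.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty buckets at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A PSI of zero means the two samples fall into the buckets in identical proportions. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but the right alerting threshold depends on the feature and should be agreed with the machine learning engineers who own the model.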
Now, let's talk about the principles behind MLRE that are the basis of the aforementioned roles and responsibilities.
As a branch of SRE, MLRE follows the same set of principles, such as:
- Reducing toil through automation
- Following Service Level Objectives (SLOs)
- Keeping the cost within the budget
- Ensuring smooth releases
- Working towards immutable infrastructure
- Owning the reliability of the complete infrastructure running machine learning projects, products, experiments, and so forth
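The SLO principle in the list above becomes actionable once you turn it into an error budget. Here is a small sketch of that arithmetic, assuming a request-based SLO for something like a model-serving endpoint; the function names are mine.

```python
def allowed_downtime_minutes(slo_target, days=30):
    """Minutes of full downtime an availability SLO permits over a window."""
    return (1.0 - slo_target) * days * 24 * 60


def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_fraction) for a request-based SLO.

    remaining_fraction is 1.0 when no budget is spent and negative once
    the budget is exhausted.
    """
    allowed = (1.0 - slo_target) * total_requests
    remaining = 1.0 - failed_requests / allowed if allowed else 0.0
    return allowed, remaining
```

For example, a 99.9% target over 30 days allows about 43.2 minutes of downtime, and 250 failed predictions out of a million requests would use up a quarter of the budget for that window.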
The principles listed above make it clear that an MLRE is essentially the infrastructure owner. Along with this, they're also responsible for controlling the cost, raising flags when the infrastructure could go over budget, etc. All of this requires a deep understanding of the underlying infrastructure that an MLRE has to deal with.
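Raising a flag before the infrastructure goes over budget can start with something as simple as a run-rate projection. The sketch below assumes spend accrues roughly linearly through the month, which is a strong simplification for bursty training workloads; the function names and the 90% warning ratio are illustrative.

```python
def projected_month_end_spend(spend_to_date, day_of_month, days_in_month):
    """Linear run-rate projection of spend at the end of the month."""
    daily_rate = spend_to_date / day_of_month
    return daily_rate * days_in_month


def budget_flag(spend_to_date, day_of_month, days_in_month, budget,
                warn_ratio=0.9):
    """Return 'over', 'warn', or 'ok' based on the projected spend."""
    projected = projected_month_end_spend(
        spend_to_date, day_of_month, days_in_month
    )
    if projected > budget:
        return "over"
    if projected > warn_ratio * budget:
        return "warn"
    return "ok"
```

With a budget of 15,000 and 6,000 already spent on day 10 of a 30-day month, the projection is 18,000, so the flag is raised twenty days before the bill arrives.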
With every major cloud platform introducing machine learning capabilities and with the advent of machine learning and AI-specific hardware, having a deep understanding of it all requires a dedicated effort. As cloud platforms introduce more and more services, the knowledge gap will increase. In data engineering, this has already given rise to cloud-specific positions like GCP Data Engineer, Azure Data Engineer, and AWS Data Engineer. A similar evolution is entirely possible in machine learning and reliability engineering too.
The other big idea is to make everything repeatable and automatable. This saves a significant amount of time and frustration in the long run. Carla Geisser, a Google SRE, says, "If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow."
As discussed in the previous section, an MLRE needs to understand the infrastructure well. They also need to understand the operational processes (MLOps) that make machine learning work possible. While a machine learning engineer must have in-depth knowledge of data collection, data verification, feature engineering, metadata management, model analysis, and so on, an MLOps Engineer needs to understand the DevOps side of things, which includes identity management, roles, grants, permissions, source control, and CI/CD pipelines.
MLOps applies the best practices from DevOps — collaboration, version control, automated testing, compliance, security, and CI/CD — to productionizing machine learning.
Although an MLOps Engineer is responsible for all the things mentioned above, they usually don't ensure that the underlying infrastructure is working. That's where an MLRE comes in. And this brings us to the things that an MLRE needs to know to do their job right.
All new and upcoming fields derive heavily from the existing ones. I've already talked about how SRE was derived from DevOps and how DBRE and MLRE have sprouted from SRE. There are a lot of other influences along the way. Due to this complex lineage, the skill sets required for working in these different fields overlap to a very high degree. So, although I've referred to MLRE as a specialization, it isn't one from a skills standpoint. A wide range of skills is required to perform even a seemingly specialized job. Robert A. Heinlein, a modern-age renaissance man, wrote about the need for having a wide range of life skills:
A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyze a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects.
Robert A. Heinlein's idea applies just as well to computer engineering, data science, machine learning, and artificial intelligence. This brings us to the various skill sets an MLRE needs to do their job efficiently and contribute value to the team. Let's dig in.
An MLRE should have a good understanding of the basics and purpose of machine learning. They have to be aware of the components that make up a machine learning system, especially the most critical and costly ones.
For instance, it helps to understand that data collection, wrangling, and initial processing can cost much less than more compute- and memory-intensive steps like model training and parameter tuning. A lot of machine learning projects are experimental and can be very costly. You can only understand the cost by understanding the underlying processes.
Scripting and Programming
A good understanding of Unix-based systems always comes in handy and is required in this job, which means an MLRE must be comfortable writing shell scripts. They also need to be familiar with software engineering. As mentioned earlier, the whole idea behind SRE is to apply the principles of software engineering to flush out unknowns before they hit production. An MLRE must be able to write code in order to build pipelines, configure machine learning stacks, spin up and tear down infrastructure, and much more.
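As a small illustration of the "spin up and tear down" part, here is a sketch of a context manager that guarantees teardown even when a training job fails partway through. The `provision` and `destroy` callables are hypothetical stand-ins for whatever your cloud SDK or infrastructure tool provides.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_cluster(provision, destroy):
    """Provision infrastructure for one job and always tear it down.

    provision() returns a handle; destroy(handle) releases it. Both are
    supplied by the caller (hypothetical hooks, not a real SDK).
    """
    handle = provision()
    try:
        yield handle
    finally:
        # Runs on success and on failure, so nothing is left running.
        destroy(handle)
```

A job would then run inside `with ephemeral_cluster(create_gpu_nodes, delete_gpu_nodes) as cluster:`, where both function names are placeholders. The point of the pattern is that forgetting to shut down expensive hardware stops being something a human has to remember.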
DevOps and MLOps
This brings us to one of the most critical areas in MLRE: DevOps and its machine learning counterpart, MLOps. This area requires you to learn about building machine learning pipelines using CI/CD tools while taking into consideration all the steps in machine learning, from data collection up to a productionized prediction model: model validation, feature storage, metadata management, and source control management, to name a few. While in-depth knowledge of every step is not required, as mentioned in an earlier section, it's vital to understand the flow of data in a machine learning pipeline and the underlying infrastructure.
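One way to reason about the flow through a pipeline is to model the stages as a dependency graph. The sketch below uses Python's standard-library `graphlib` (Python 3.9+). The stage names follow the steps mentioned in this article, but the exact dependency edges are my assumption, and real pipelines are rarely this linear.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (edges are assumed).
PIPELINE = {
    "data_collection": set(),
    "data_verification": {"data_collection"},
    "feature_engineering": {"data_verification"},
    "model_training": {"feature_engineering"},
    "model_validation": {"model_training"},
    "deployment": {"model_validation"},
}


def execution_order(pipeline):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


def affected_by(pipeline, failed_stage):
    """Return every stage that transitively depends on failed_stage."""
    affected = set()
    changed = True
    while changed:
        changed = False
        for stage, deps in pipeline.items():
            if stage in affected:
                continue
            if failed_stage in deps or deps & affected:
                affected.add(stage)
                changed = True
    return affected
```

For an MLRE, `affected_by` is the more interesting function: when a stage breaks, it answers which downstream stages are blocked without requiring deep knowledge of what each stage does internally.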
Spinning up infrastructure is not necessarily an MLRE's job, but it does fall under the purview of DevOps and SRE. Loosely speaking, providing access to resources is the job of an MLOps Engineer, and providing machine learning engineers with efficient ways to spin up infrastructure is the job of an MLRE. For this reason, MLREs need to know how to write infrastructure code using tools like Terraform, Pulumi, or AWS CloudFormation. Such tools enable MLREs to write reusable plugins, templates, and modules for machine learning engineers to use. This goes to the heart of why SRE was born.
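To give a feel for what a reusable template might look like, here is a sketch that generates Terraform's JSON configuration syntax from Python. This is just one possible approach, not how Terraform modules are normally written. The function names, the tag keys, and the AMI placeholder are assumptions; `aws_instance` with `ami`, `instance_type`, and `tags` is the standard AWS provider resource.

```python
import json


def gpu_training_node(name, instance_type, ami_id, team):
    """Return a Terraform JSON document describing one tagged EC2 instance.

    The tag keys are an illustrative convention, not a requirement.
    """
    return {
        "resource": {
            "aws_instance": {
                name: {
                    "ami": ami_id,
                    "instance_type": instance_type,
                    "tags": {
                        "Name": name,
                        "team": team,
                        "purpose": "ml-training",
                    },
                }
            }
        }
    }


def render(config):
    """Serialize the configuration for writing to a .tf.json file."""
    return json.dumps(config, indent=2, sort_keys=True)
```

A machine learning engineer would call something like `gpu_training_node("trainer_1", "p3.2xlarge", "ami-REPLACE_ME", "recsys")` and never have to think about tagging conventions, which is what lets the MLRE attribute cost per team later on.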
In a sense, data science, machine learning, and artificial intelligence derive from and are heavily dependent on data engineering. Whenever data moves from one place to another it must be processed, cleansed, and transformed. This is where the ideas of data engineering come into play. With many roles having significant overlap with many other functions, lines of responsibility have blurred. For this reason, an MLRE needs to have a good overview of how data engineering systems work. Working knowledge of SQL is essential. Experience with relational databases, data warehouses, and/or ETL (extract, transform, load) frameworks is a plus.
Testing machine learning models is quite different from how you test a typical software product. The fundamental difference is that machine learning models are non-deterministic.
On a basic level, there are two types of tests for machine learning models:
- Testing the coding logic of the model
- Testing the output/accuracy of the model
There are a lot of ways to test the latter. For example, you can hold out test data sets that correspond to your training data sets and check the model's accuracy against them, running checks both before and after training. On the other hand, you don't always have efficient or even correct ways of testing the model's coding logic. It comes back to the model being non-deterministic. When you don't know what to expect, how do you test the code?
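One partial answer is to pin the sources of randomness so that the logic test becomes repeatable, and to assert a floor rather than an exact value for the accuracy test. The sketch below shows both kinds of test on a deliberately tiny model (a one-feature threshold classifier on synthetic data). The data generator, the model, and the 0.8 floor are all made up for illustration.

```python
import random


def make_dataset(n, seed):
    """Synthetic data: label is 1 when x > 0.5, with 5% of labels flipped."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        y = int(x > 0.5)
        if rng.random() < 0.05:
            y = 1 - y  # simulate label noise
        data.append((x, y))
    return data


def accuracy(threshold, data):
    """Share of examples where the rule 'x > threshold' matches the label."""
    return sum(int(x > threshold) == y for x, y in data) / len(data)


def train_stump(data):
    """Pick the observed x value that maximizes training accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t, _ in data:
        acc = accuracy(t, data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def test_training_is_reproducible():
    # Logic test: the same seed must produce exactly the same model.
    first = train_stump(make_dataset(200, seed=7))
    second = train_stump(make_dataset(200, seed=7))
    assert first == second


def test_accuracy_meets_floor():
    # Output test: evaluate on held-out data generated with another seed.
    model = train_stump(make_dataset(500, seed=1))
    assert accuracy(model, make_dataset(500, seed=2)) >= 0.8
```

Real training code has more sources of randomness than a single seed (data shuffling, parallelism, hardware-level effects), so full reproducibility is often harder to achieve than this toy example suggests.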
For an in-depth read about testing machine learning models, please read Jeremy Jordan's blog post, Effective testing for machine learning systems.
Test sooner rather than later. This fundamental principle of shifting left has been in vogue for some time now. The idea is to ensure that integration testing is done in an automated fashion as early as possible in the development cycle. This reduces the number of unknowns that you'll have to deal with later. Shifting left also helps you avoid accumulating technical debt, because you make design and infrastructure changes early rather than late.
A conversation about the shared skill sets between various engineering functions is a nice segue into discussing how these different engineering functions collaborate. Some level of continuous collaboration is required for all the different teams. Apart from the overlap in skills, another level of complexity is introduced when different companies have different ideas of how a specific role functions. This is usually the reason that the processes of collaboration vary from company to company (and even team to team internally). Here's a bird's-eye view of the responsibilities of the various roles that we have discussed:
|Role||Responsibilities|
|DevOps Engineer||Identity management, access to resources, whatever else developers need.|
|MLOps Engineer||DevOps specific to machine learning engineers, building CI/CD pipelines.|
|SRE||Reliability of infrastructure, apps, services, databases, etc.|
|DBRE||Reliability of databases, data warehouses, and lake infrastructure, services, etc.|
|MLRE||Reliability of machine learning related infrastructure, apps, and services.|
|Data Engineer||Data availability for all the other teams needing the data.|
The output of the work of one team is often the input for the work of another team. In an ideal scenario, you would automate most of it.
We've talked about some of the reasons behind the birth of reliability engineering and how it has taken different shapes by merging with different engineering functions.
Machine Learning Reliability Engineering is an upcoming branch of SRE that we'll soon see from companies that want to build scalable solutions based on machine learning and artificial intelligence. The reliability team will focus on keeping the systems up and running while the machine learning engineers will focus on improving the models and making better predictions.