From Backend Engineer to ML Engineer — Roadmap

In the past few years, Machine Learning has become a leading field that many companies work on, whether directly or indirectly. The rise of ML has led to an evolution in the roles around it. One of the most intriguing roles discussed in the community today is that of the ML Engineer. What is the role of an ML engineer? What is required to become one? We will discuss those topics briefly and then lay the foundation for a roadmap to become one.
The list of technologies and concepts mentioned in this article is long. In most cases familiarity with the terms is enough; in others, deeper understanding and experience are required. But keep in mind that this is a proposal for a roadmap, which should be treated as a long journey that will never end (I hope!), as new technologies keep emerging constantly.

Why are ML Engineers important?

When discussing ML, the most obvious role that comes to mind is that of the Data Scientist. Data Scientists are the ones who build a model, applying knowledge of Mathematics combined with a profound understanding of the business. They research the available data, apply data manipulation and feature engineering, select the appropriate model and tune it. Once a model is ready, it is time to put it in production. Unfortunately, Data Scientists seldom have the engineering knowledge and capacity to bring a model into production. In the following diagram, the rather important part of the Data Science activity is depicted in the orange box in the middle. The surrounding boxes detail the activities necessary to put the model in production.

From “Hidden Technical Debt in Machine Learning Systems” https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

The engineering expertise required to build a production application around the model includes, among others, Data Engineering, Software Engineering practices, CI-CD, MLOps, Deployment, and Logging & Monitoring. This is where ML engineers come in handy. Those practices come mostly from the Software Engineering field, so the best match for the role of ML engineer is a software engineer who handles the backend side of applications. However, the development process of ML applications differs from that of standard applications, so it requires skills beyond those of an average software engineer.

Photo by Hitesh Choudhary on Unsplash

ML Application Lifecycle

ML-based applications have a lifecycle that is different from regular software applications: ML models require training. This step changes the way new versions are rolled out. It affects how the CI and CD processes are built and requires a different way of thinking compared to other software applications. The lifecycle of ML-based software includes the following blocks:

  1. Research: Data Scientists perform research to solve business problems using ML. They scan the available data and test different ML techniques. The result of this phase is usually experimental code that includes data manipulation and an untuned ML model. Tools such as Jupyter Lab come in very handy during this phase. ML engineers are less involved in this phase, but in some cases the Data Scientists need help with programming problems and environment issues that are usually part of a Software Engineer's knowledge.
  2. Training: Model training is a repetitive step in which both the data and the model are tuned. Data Scientists try different data features and data engineering. The appropriate model is selected and hyperparameter tuning cycles are applied. This phase is repeated for every software version rolled out. It requires frameworks and libraries from the MLOps discipline. ML Engineers help make this step iterative by providing reusable building blocks that are used in ML pipelines. The reusable blocks usually solve problems such as data retrieval, connectivity to sources, communication with the environment and more. These blocks are a good place to apply performance enhancements from the software engineering discipline.
  3. Prediction: The tuned models are used to make predictions on new data. The prediction can be performed in a recurring batch operation or by an online application that handles incoming data in a streaming fashion. ML Engineers are the ones building the software and processes that apply these predictions. Whether it is a batch prediction pipeline or a microservice for online prediction, the capabilities of a software engineer are required to build robust applications with tuned performance.
  4. Action: Predictions alone are not enough. For a prediction to influence the business, an action must be applied. This action requires, in addition to the prediction, the current state of the entities involved, configuration, etc. ML Engineers build the infrastructure to retrieve those states and configurations and also develop the rule engine that decides the appropriate action. This step can be done by ML engineers or, in better cases, by Data Scientists, provided ML engineers have built the appropriate infrastructure and building blocks (entity state and configuration retrieval, and an API to an external system to apply the action).
  5. Monitoring: When speaking about software monitoring, DevOps is the best resource for best practices. But in ML applications, not only the health of the software is monitored but also the accuracy and precision of the models. Models tend to drift over time. Proper model monitoring can detect drift at an early stage, suggesting that a change in the model is required, whether by re-training, adding features, or even replacing the model itself. ML Engineers are required to understand the model type being used and monitor the appropriate metrics accordingly. Monitoring the wrong metrics might result in poor decisions and actions taken by the model, and in poor remedies applied by the team.
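
As a rough illustration of the Monitoring step, the sketch below computes accuracy and precision for a binary classifier in two monitoring windows and flags the later window if either metric falls too far below the baseline. The labelled outcomes, the `tolerance` value and the function names are hypothetical, and a metric drop is only one possible signal of drift; the right metrics depend on the model type.

```python
from dataclasses import dataclass


@dataclass
class WindowMetrics:
    accuracy: float
    precision: float


def binary_metrics(y_true, y_pred):
    """Compute accuracy and precision for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    # Precision is undefined when nothing was predicted positive; report 0.0 here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return WindowMetrics(accuracy, precision)


def drifted(current, baseline, tolerance=0.1):
    """Flag a window whose accuracy or precision fell more than `tolerance` below baseline."""
    return (baseline.accuracy - current.accuracy > tolerance
            or baseline.precision - current.precision > tolerance)


# Hypothetical labelled outcomes collected in two monitoring windows.
baseline = binary_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 1, 1, 0, 0, 1, 1])
latest = binary_metrics([1, 0, 1, 1, 0, 0, 1, 0], [0, 1, 1, 0, 1, 0, 1, 1])

if drifted(latest, baseline):
    print("Model metrics dropped: consider re-training or revisiting features")
```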

As you can see, the phases differ from a regular software development life cycle. This requires the ML engineer to master not only classic software building techniques but also the ML application lifecycle and the process of ML modeling. The following section details what I believe should be the roadmap for a Backend engineer to become an ML engineer.

Photo by Mark König on Unsplash

Roadmap for becoming an ML engineer

This roadmap assumes you are already a software engineer with some experience in building software, but the sections below can also hint at where to focus for people who are just starting out in the software and ML worlds.

Python

Most of the models, frameworks, and infrastructure in the world of ML are written in Python. You might be a software engineer with experience in other languages such as Java, C++, Node.js (JavaScript), or Go, but to work in the ML field you must have a good knowledge of Python.

As a software engineer, you are probably familiar with cross-cutting concerns such as error handling, performance, logging, security, networking, and more. You should learn how to handle them all in Python.

Frameworks for building web applications are important as well, since online model serving is usually done behind a web server. There are SaaS services for model serving, but in the vast majority of cases you will need to add pre- and post-prediction logic, and most of the SaaS frameworks do not allow it without building a web application around the model. In such cases I found myself building microservices using Flask, and later FastAPI, to fulfill the requirement of pre- and post-processing logic.
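
To make the pre- and post-prediction logic concrete, here is a framework-agnostic sketch of what such a microservice wraps around a model. The `DummyModel` class, its scoring rule, the field names and the 0.5 threshold are all hypothetical; in a Flask or FastAPI service, something like `handle_request` would be called from the route handler.

```python
import json


class DummyModel:
    """Stand-in for a trained model; a real one would be loaded from storage."""

    def predict_proba(self, features):
        # Hypothetical scoring rule, only here to make the sketch runnable.
        amount, n_orders = features
        return min(1.0, amount / 1000 + n_orders * 0.05)


def preprocess(payload):
    """Validate the raw request and turn it into the model's feature vector."""
    if "amount" not in payload:
        raise ValueError("missing required field: amount")
    return [float(payload["amount"]), int(payload.get("n_orders", 0))]


def postprocess(score, threshold=0.5):
    """Translate the raw score into a response the caller can act on."""
    return {"score": round(score, 3), "label": "high" if score >= threshold else "low"}


def handle_request(body, model):
    """Body of a web route: parse JSON, pre-process, predict, post-process."""
    payload = json.loads(body)
    features = preprocess(payload)
    score = model.predict_proba(features)
    return json.dumps(postprocess(score))


print(handle_request('{"amount": 400, "n_orders": 3}', DummyModel()))
```

Keeping the three steps as separate functions makes each of them testable without starting a web server.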

Additionally, I strongly recommend learning Python's parallelism mechanisms. The Global Interpreter Lock (GIL) is an important topic to understand. Multithreading and multiprocessing are the building blocks of parallelism. Many think the GIL makes multithreading irrelevant, since it allows only one thread to execute Python bytecode at a time. However, threads release the GIL while they wait on external IO, and I found myself using multithreading in several such cases. Using dozens of threads, I managed to improve the performance of a single-core application tenfold and more!
Multiprocessing in Python is the way to utilize multi-core computers. Another option is to use Kubernetes' auto-scaling mechanism by building single-core Pods and letting K8s utilize multi-core machines by spawning more Pods according to the load applied to them.
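
The effect of threads on IO-bound work can be sketched with the standard library alone. In this minimal example, `fetch` simulates an external call with `time.sleep`; the function names, the 50 ms delay and the worker count are illustrative assumptions, not measurements from a real system.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(item_id):
    """Simulate an external I/O call (e.g. an HTTP request or a database query)."""
    time.sleep(0.05)  # while a thread waits here, the GIL is released
    return item_id * 2


def run_sequential(ids):
    return [fetch(i) for i in ids]


def run_threaded(ids, workers=10):
    # map() returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, ids))


def timed(fn, *args):
    """Run fn(*args) and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


ids = list(range(10))
seq_result, seq_time = timed(run_sequential, ids)
thr_result, thr_time = timed(run_threaded, ids)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

For CPU-bound work the same pattern would not help, because only one thread runs Python bytecode at a time; that is where multiprocessing or more Pods come in.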

Python development skills come in handy mostly in the Prediction and Action phases mentioned above, but they also help support the Training and Research phases carried out by the Data Scientists.

Photo by KOBU Agency on Unsplash

Data Engineering

Models need data. Although Data Scientists determine what data they need, in most cases they will leave it to the ML engineer to work out how to retrieve and prepare the data in production, both for training and prediction. Feature Stores are a great new way to serve that data, but someone still needs to develop the pipeline that handles retrieval and transformation before the data reaches the Feature Store. To do that, ML engineers should have experience with the data tools described below.

Organizations hold their data in Data Warehouses and Data Lakes. Famous data warehouses include Google's BigQuery, Amazon Redshift and Snowflake, which is also considered a Data Lake. Knowledge of and experience with some of these tools are essential for the ML engineer to build end-to-end infrastructure for ML projects. Orchestration tools for data pipelines are also useful. One of the most famous is Apache Airflow. Another rising star in the field of data warehouse management is dbt, which helps define schemas, run pipelines and test data quality.
In my company I helped build workflows using Airflow and BigQuery that create tables holding aggregated data for various entities. Those tables are the source for most of the models developed by my fellow data scientists. Preparing the aggregations in advance reduced the training jobs from 2 hours to less than 20 minutes. It allowed faster iterations of model tuning and reduced the time of batch predictions done in production.

At the base of all these tools lies the SQL language. Solid experience with SQL is required. Almost all of the systems revolving around data pipelines expose an SQL engine to interact with data. I cannot imagine my role as an ML engineer without my SQL knowledge.
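
The idea of preparing per-entity aggregations in advance can be sketched in plain SQL. The example below uses Python's built-in SQLite as a stand-in for a warehouse such as BigQuery; the `events` and `user_features` tables and their columns are hypothetical, and a real pipeline would run a similar statement on a schedule through an orchestrator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 5.0), (2, 15.0), (2, 40.0)],
)

# Materialize per-entity aggregates once, so training jobs read a small table
# instead of re-scanning the raw events every time.
conn.execute(
    """
    CREATE TABLE user_features AS
    SELECT user_id,
           COUNT(*)    AS n_events,
           SUM(amount) AS total_amount,
           AVG(amount) AS avg_amount
    FROM events
    GROUP BY user_id
    """
)

rows = conn.execute(
    "SELECT user_id, n_events, total_amount, avg_amount "
    "FROM user_features ORDER BY user_id"
).fetchall()
for row in rows:
    print(row)
```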

Other important data tools are the famous Python libraries Pandas and NumPy. Most ML use cases involve data processing with them. ML Engineers must master those libraries to be able to code inside an ML-based project. In cases where the data is big and cannot be handled in a single machine's memory, tools like Dask, Vaex, Ray and Modin might come in handy. Several of them offer an API similar to Pandas (or wrap Pandas) but are able to handle larger-than-memory dataframes in a shorter amount of time.

Photo by Jonathan Borba on Unsplash

MLOps

If you are familiar with the term, you might think it is a topic for DevOps only. However, the DevOps responsibility is changing over time, and more of its previous capabilities are being transferred to the developers. Issues such as CI-CD, ML Pipelines, Serving and Monitoring are different when it comes to ML. In many companies, the MLOps side of DevOps is not developed enough, and you might often find yourself pushing the topic alone. There are countless tools and companies in the field of MLOps, and an article like this is too short to mention them all.

Model Serving performance is crucial, especially in real-time cases. Learning the different model types and how they perform can help in selecting the appropriate model. Even if it is the Data Scientist's responsibility to select the model, discussing the matter with the Data Scientist and sharing your knowledge and experience as an ML engineer can prevent wasting precious time. Many tools for model serving exist in the market. Either selecting one or building an in-house serving solution is a task for an ML Engineer. In my company I built a microservice that performs online streaming prediction. The deployment is done on Kubernetes running on Google GKE.

When considering a Model Serving solution, keep in mind that models are heavy CPU consumers during prediction. They might also require more memory. Understanding the way different models use the CPU and memory can help in building the appropriate infrastructure for model serving.
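
One simple way to start understanding a model's CPU cost is to measure prediction latency before sizing the serving infrastructure. The sketch below is a minimal harness under stated assumptions: `predict` is a hypothetical CPU-bound stand-in for a real model's prediction call, and the number of runs and input size are arbitrary. Memory usage would need a separate measurement.

```python
import statistics
import time


def predict(features):
    """Hypothetical CPU-bound model: stands in for a real model's predict call."""
    return sum(x * x for x in features) ** 0.5


def measure_latency(fn, sample, runs=200):
    """Call fn(sample) repeatedly and return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(sample)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


latencies = measure_latency(predict, list(range(1000)))
# quantiles(n=100) returns 99 cut points; index 49 is the median, index 98 is p99.
cuts = statistics.quantiles(latencies, n=100)
print(f"p50={cuts[49]:.3f} ms  p99={cuts[98]:.3f} ms")
```

Looking at a high percentile rather than the average is a deliberate choice here, since real-time serving is usually judged by its slowest requests.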

Topics such as GPUs and how to use them in training and prediction can contribute to the overall performance. Understanding which libraries support GPUs and how a GPU influences overall performance helps a lot when building model pipelines for training and prediction.

Photo by Markus Winkler on Unsplash

ML Basic Concepts

Communication with Data Scientists is enhanced greatly by acquiring a basic understanding of the ML world.
Feature Engineering topics such as One-Hot Encoding, Label Encoding and Imputing are important for understanding the data needs of Data Scientists. The concept of Target Variables and how to handle them is crucial for data preparation during training.
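
To give a feel for these terms, here is a hand-rolled sketch of label encoding, one-hot encoding and mean imputation in plain Python. The function names and sample data are made up for illustration; in practice Data Scientists typically rely on library implementations (Scikit-Learn, for example, ships encoders and imputers) rather than code like this.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Turn each category into a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    return [[1 if v == cat else 0 for cat in categories] for v in values], categories


def impute_mean(values):
    """Replace missing numeric values (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


colors = ["red", "green", "red", "blue"]
ages = [30, None, 50, 40]

print(label_encode(colors)[0])    # one integer per row
print(one_hot_encode(colors)[0])  # one 0/1 vector per row
print(impute_mean(ages))          # None replaced by the mean of 30, 50, 40
```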

Understanding concepts such as Supervised Learning vs. Unsupervised Learning makes building ML applications a better and more engaging experience. ML engineers should also be familiar with complex fields such as Deep Learning, which can help in designing the proper software and hardware for Deep Learning training and prediction pipelines.

I started my journey to ML engineering long before the term became ubiquitous, by taking Andrew Ng's great course called simply Machine Learning. It is aimed at people who want to be Data Scientists, but thanks to this course I have often found myself understanding the ML concepts explained by the Data Scientists I work with. As mentioned above, different models have different metrics to measure their accuracy, and this course helped me understand which ones I need to pick for Model Monitoring. It contributed a lot to my understanding and, in some cases, to my decision making as an ML engineer.

ML Libraries

Data Scientists work with different ML libraries to run their models. In order to support production code, it is crucial to understand the structure of those libraries. One of the most famous getting-started libraries in the world of ML is Scikit-Learn. Its Pipeline concept makes the structure of classic ML easier to understand. The library includes most of the tools required in the life cycle of ML models, including transformers and Model Selection.
Another famous library for ML is TensorFlow. It specializes in Deep Learning and has strong facilities for model serving.

PyTorch and MXNet also have their share of users. I recommend reading this article to get a good overview of those libraries and their specialties.

Summary and Conclusions

To sum it up, an ML Engineer should acquire knowledge in the following fields:

  • Python programming language
  • Data Engineering
  • MLOps
  • ML Basic Concepts
  • ML Libraries

ML Engineering is an exciting role. It combines the worlds of software engineering and ML modeling, which makes it unique in the Software Engineering landscape.

The list of topics above comes from my own experience over the past 10 years of building software around ML. Think otherwise? I'll be glad to hear your opinion!


From Backend Engineer to ML Engineer — Roadmap was originally published in Everything Full Stack on Medium, where people are continuing the conversation by highlighting and responding to this story.

Big Data Architect & Machine Learning Engineer

Backend Group
