TECH CULTURE
10 minute read • February 29, 2024

sennder MLOps Primer

Author
Varun Chitale
About the author: Varun Chitale is a Machine Learning Engineering Manager with more than six years of experience. He has worked on ML applications of all kinds, with a particular focus on NLP and recommendation systems.

How We Explored And Selected MLOps Technologies

At sennder, we're proud to be at the forefront of machine learning and AI technology. Our team of skilled data scientists and engineers works tirelessly to develop cutting-edge models that deliver value to our customers and partners.

But as any seasoned data science practitioner knows, building a machine learning model is only half the battle. The real challenge lies in productionizing it: deploying it at scale, ensuring it's performant, and keeping it up to date as data and business requirements evolve.

That's why at sennder, we've invested heavily in the development of a robust machine learning engineering platform. Our platform includes several key components that are critical to productionizing an ML model, including an ML pipeline orchestrator, ML metadata and artefact tracking, a model registry, and tools for model training and serving.

To make sure we chose the right tools for each of these components, we followed the Markdown Any Decision Records (MADR) process, which helped us make well-documented, transparent decisions.

MADR (Markdown Any Decision Records) is a lightweight process for documenting the decisions made during a project or initiative. Each decision gets a simple Markdown file that outlines the decision itself, its rationale, and any relevant context or background. Because MADR files are easy to share and review, they keep team members informed and give future decisions a documented trail to build on, promoting transparency and collaboration.
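As an illustration, here's a minimal sketch of what a MADR file can look like, loosely following the template from the MADR project (the record below is a simplified stand-in, not one of our actual decision records):

```markdown
# Use Flyte as the ML pipeline orchestrator

## Context and Problem Statement

We need an orchestrator for our ML pipelines that supports pipeline
versioning and isolated task environments.

## Considered Options

* Airflow
* Flyte
* Kubeflow Pipelines

## Decision Outcome

Chosen option: "Flyte", because it natively supports pipeline versioning,
per-task dependency isolation, and Kubernetes.
```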

In this blog post, we'll dive deeper into each of these components, explore the decision-making process behind our tooling choices, and showcase some of the amazing work our team has done with this technology. We hope this post inspires other data science practitioners and potential team members to join us in our mission to push the boundaries of what's possible with machine learning.

The Components

ML Pipelines Orchestrator

An ML pipelines orchestrator is a tool or platform that manages the flow of data and computations during the machine learning process. Essentially, it helps you automate the steps involved in training and deploying an ML model, so that you can more easily manage the complexity of these processes. A common example of an ML pipelines orchestrator is Apache Airflow, which allows you to create DAGs (Directed Acyclic Graphs) to define the sequence of tasks that need to be executed.
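To make the DAG idea concrete, here's a minimal Airflow sketch with two dependent tasks (the task bodies are placeholders, not our production code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    # placeholder: pull and prepare training data
    print("extracting features")

def train_model():
    # placeholder: fit and persist a model
    print("training model")

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # run on manual trigger only
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)

    extract >> train  # training runs only after extraction succeeds
```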

Thanks to our use of MADR, we carefully evaluated and documented each candidate ML pipeline orchestrator, which made it possible to compile this comprehensive comparison for your reference.

| Features | Airflow | Flyte | Metaflow | Kubeflow Pipelines | Prefect | Kedro | Databricks Workflow |
|---|---|---|---|---|---|---|---|
| Environment & dependency isolation | Poor | Good | Good | Good | Poor | Poor | Good |
| Pipeline versioning | No | Yes | Yes | Yes | Yes | No | No |
| Data flow between tasks | No | Yes | Yes | No | Yes | No | No |
| Task-level resources | Yes (Kubernetes) | Yes | Yes | Yes | No | No | No |
| Graphical user interface (GUI) | Yes | Yes | Yes (not mature) | Yes | Yes | Yes (not mature) | Yes |
| Credentials management | Poor | Good | Good | Good | Poor | Poor | Good |
| Ecosystem / plugins | Great | Good | Poor | Good | Good | Poor | Good |
| Kubernetes support | Yes (not native) | Yes (native) | Yes (not native) | Yes (native) | Yes | Yes (not native) | Yes (vendor-managed) |
| Documentation* | Great | Good | Poor | Poor | Poor | Poor | Good |
| Release version | 2.4.3 | 1.2.1 | 2.7.14 | 1.6 | 2.6.9 | 0.18.3 | 11.3 |
| GitHub stars | 28.3k | 2.9k | 6.2k | 12.1k | 10.6k | 7.8k | - |
| License | Apache | Apache | Apache | Apache | Apache | Apache | Proprietary |
| Cloud/SaaS | Astronomer | UnionML | Outerbounds | - | PrefectHQ | - | Databricks |
| Additional notes | Rich ecosystem for data engineering | LF AI & Data graduate project | Not mature as a product offering | Complex deployment | Focus on data engineering | Not mature | Mature, paid service; vendor lock-in |

Model Training

Model training is the process of fitting your machine learning models to data so they learn to make accurate predictions. This involves selecting the appropriate algorithms and hyperparameters, preparing the data, and running the training process itself. A common example of a tool for model training is scikit-learn, which provides a wide range of algorithms along with tools for preprocessing data and running training.
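As a toy illustration (synthetic data and illustrative hyperparameters, purely for flavor), training and evaluating a classifier with scikit-learn looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic dataset standing in for real training data
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(f"accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```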

Table 1: Frameworks For Distributed Training

| Features | Ray | Apache Spark | Dask | Horovod | BigDL | Distributed TensorFlow | Mesh TensorFlow | GPipe | PyTorch | Microsoft DeepSpeed |
|---|---|---|---|---|---|---|---|---|---|---|
| Type of training | Data + model parallelism | Data parallelism | Data + model parallelism | Data parallelism | Pipeline parallelism | Data parallelism | Data + model parallelism | Pipeline parallelism | Data + model parallelism | Data + model + pipeline parallelism |
| Integrations | Yes | Yes | Yes | Yes | Yes | Yes | Unknown | None | Yes | Yes |
| Scalable training | Yes, via Train | Yes: standalone via MLlib, cluster via MapReduce | Yes, via parallelising numpy, scikit-learn, pandas | Yes, using different backends: MPI, Gloo, NCCL, oneCCL | Yes | Yes | Yes | Yes | Yes (e.g. via FairScale) | Yes |
| Scalable hyperparameter tuning | Yes, via Tune | Yes, via MLlib | Yes, built in via HyperbandSearchCV | Yes, via Ray Tune | Yes, via Orca | Yes, via external libs | No | n/a | Yes, via Ray Tune | Yes, built in + LAMB |
| Hardware + software support | CPU/GPU/TPU | CPU; GPU via RAPIDS Accelerator | CPU; GPU not natively supported | CPU/GPU | CPU | CPU/GPU/TPU | CPU/GPU/TPU | GPU/TPU/CPU | CPU/GPU/TPU | CPU/GPU |
| License | Apache 2.0 | Apache 2.0 | BSD 3-Clause | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | Open source | Modified BSD | MIT |
| Documentation* | Great | Great | Great | Great | Good | OK | OK | Bad | Great | Good |
| GitHub stars | 23.9k | 34.9k | 10.7k | 13k | 4.1k | n/a | 1.4k | 2.7k (part of Lingvo) | 62.3k | 8.6k |
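Since Ray (the first column above) ended up in our final stack, here's a minimal sketch of its core primitive: fanning work out as parallel remote tasks. The "training" function is a toy stand-in; Ray's Train and Tune libraries build on this same mechanism:

```python
import ray

ray.init()  # start a local Ray runtime; connects to a cluster in production

@ray.remote
def train_on_shard(shard_id: int) -> float:
    # toy stand-in for fitting a model on one data shard
    return 0.90 + shard_id * 0.01

# launch four "training" tasks in parallel and gather their results
scores = ray.get([train_on_shard.remote(i) for i in range(4)])
print(scores)  # [0.9, 0.91, 0.92, 0.93]
```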

Table 2: Platforms For Distributed Training

| Features | AWS Sagemaker | Google Cloud Platform | Microsoft Azure |
|---|---|---|---|
| Storage | S3, Redshift, RDS | GCS, BigQuery | Blob Storage, Data Lake |
| ETL | EMR, Glue, Sagemaker with PySpark | Cloud Dataflow, BigQuery | Azure Databricks, Synapse, Kusto |
| Visualisation | QuickSight | Data Studio | Power BI, Cognitive Services |
| Exploration | Athena, Sagemaker Autopilot | BigQuery, AutoML Tables | Azure ML Studio, Azure Databricks |
| Distributed training | Yes | Yes | Yes |
| Model versioning | Yes | Yes | Yes |
| Experiment tracking | Yes | Yes | Yes |
| Error analysis | Sagemaker Debugger | AutoML Tables with BigQuery | Azure ML |
| 1-click deployment | Yes | Yes | Yes |
| Batch prediction | Yes | Yes | Yes |
| Native pipelines for MLOps | No | No | Yes |
| SSO admin | Yes | Yes | Yes |
| Scaling options | Auto Scaling | Autoscaler | Azure Autoscale |
| Analytics | Amazon Kinesis | Cloud Dataflow | Azure Stream Analytics |

ML Metadata And Artefacts Tracking

ML metadata and artefacts tracking is the process of capturing and storing information about the data and models used in the machine learning process. This helps ensure reproducibility and auditability, as well as facilitating collaboration and experimentation. A common example of an ML metadata and artefacts tracking tool is MLflow, which allows you to log information about the data, models, and experiments you run during the machine learning process.
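For a flavor of what this looks like in code, here's a minimal MLflow tracking sketch (the experiment name, parameters, and metric values are illustrative):

```python
import mlflow

mlflow.set_experiment("demand-forecasting")  # illustrative experiment name

with mlflow.start_run():
    # record the configuration and outcome of one training run
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("val_accuracy", 0.93)
```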

You know the drill, here’s the compiled data for different tools for this task –

| Features | MLflow | DVC | Neptune AI | Valohai | Amazon Sagemaker | TensorFlow ml-metadata | Weights & Biases | ClearML | Polyaxon | Comet | Microsoft Azure |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Experiment tracking | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Model registry | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No (uses MLflow) |
| Data versioning | No (usually combined with DVC) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Yes |
| Lineage tracking | No | No | Not sure | Yes | Yes | Yes | Yes | No | No | No | No (uses Microsoft Purview) |
| CI/CD/CT integration | Yes | Via separate tool (CML) | Yes (GitHub Actions or Docker) | Yes (GitLab, Docker, GitHub) | Yes | Yes | Yes (early access) | Not sure | GitHub Actions, Jenkins, Argo, Airflow, Kafka, Zapier | Not sure | GitHub Actions |
| Integration with Python | Python | Python | Python (most ML libraries), Jupyter notebooks | Python, Jupyter notebooks | Python (most ML libraries) | Python (most ML libraries) | Python (most ML libraries) | Python (most ML libraries) | Python (most ML libraries) | Python (most ML libraries) | Python |
| Integrations (frameworks/tools) | Docker, Databricks | N/A | Kedro, ZenML, Sacred, Optuna, Arize, Sagemaker, Google Colab, Deepnote, etc. | Docker | Mostly with Sagemaker features | Kubeflow | Sagemaker, Kubeflow, Databricks | Sagemaker, Kubeflow, Kubernetes, Optuna, etc. | A lot of services | A lot of services | Other Azure services |
| Organising and searching experiments, models and related metadata | Yes | Yes | Yes (graphical interface) | Yes (graphical interface) | Yes | Yes (SQL queries) | Yes | Yes | Yes | Yes | Yes |
| UI / dashboard | Yes | Yes, via Iterative Studio | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes |
| API | Yes (Python, R, Java, REST) | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes |
| Part of larger ecosystem | Yes | No | No | No | Yes | ~ (Kubeflow uses it) | No | No | Yes | No | Yes |
| GitHub stars | 13.4k | 10.9k | N/A | N/A | N/A | 507 | N/A | 4k | 3.2k | N/A | N/A |
| License | Apache | Apache | Proprietary | Proprietary | Proprietary | Apache | Proprietary | Apache | Apache | Proprietary | Proprietary |
| Documentation* | Great | Good | Good | Great | Great | Ok | Good | Ok | Good | Good | Good |

Model Registry

A model registry is a tool or platform that allows you to store and manage different versions of your machine learning models. This helps you keep track of the changes made to your models over time, and makes it easier to deploy and manage them in production. A common example is the MLflow Model Registry, which provides a centralized repository for your models along with tools for managing model versions and stage transitions.
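For instance, registering a model that was logged by an earlier tracking run looks roughly like this with MLflow (the run ID and model name below are placeholders):

```python
import mlflow

# "runs:/<run_id>/model" refers to a model logged by a previous run;
# the run ID here is a placeholder
model_uri = "runs:/0a1b2c3d4e5f/model"
result = mlflow.register_model(model_uri, name="demand-forecaster")

# each registration under the same name creates a new version
print(result.name, result.version)
```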

You know the drill by now – here's the comparison:

| Features | MLflow | DVC | Valohai | Verta AI | Sagemaker | Dataiku | DataRobot | Azure ML | Comet | Weights & Biases | ModelDB | H2O MLOps | Neptune AI | Yatai |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Deployment type | Managed (self-hosted), fully-managed (via Databricks) | Managed (self-hosted) | Fully-managed | Fully-managed, enterprise deployment (on-premise or VPC) | Fully-managed | Self-hosted, fully-managed (SaaS), on-premise/on-cloud | Self-hosted, fully-managed, on-premise/on-cloud | Hosted in Azure ML Workspace, available through SDK, on-premise/on-cloud | Fully-managed, on-premise/on-cloud | Fully-managed, on-premise/on-cloud | Self-hosted, fully-managed | Fully-managed | Managed (self-hosted), fully-managed | Managed (self-hosted), on-cloud deployment (BentoML) |
| Experiment tracking | Yes | Yes | Yes | Yes | Yes | Yes | n/a | Yes | n/a | Yes | n/a | n/a | n/a | n/a |
| Part of larger ecosystem | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes (part of BentoML) |
| Team collaboration | No ❓ | Yes | Yes | Yes | Yes | N/A | N/A | Yes | Proprietary | Yes | n/a | n/a | Yes | Yes |
| Access management | No | Yes | Yes | Yes | Yes | N/A | N/A | Yes | Proprietary | Yes | n/a | n/a | Yes | Yes |
| Code versioning | No | Yes | Yes | N/A | Yes | N/A | N/A | Yes | Yes | Yes | Yes | n/a | No (logging only) | No |
| Data versioning | No | Yes | Yes | Yes | Yes | Yes | Yes | No | n/a | n/a | Yes | n/a | Yes (via dataset metadata logging) | No |
| API integration | Yes | Yes (Python only ❓) | No | Yes | Yes | Yes | Yes | Yes | Yes | n/a | Yes | Yes | n/a | n/a |
| UI / dashboards | Yes | Yes, via Iterative Studio | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Dashboards | n/a | Yes | Yes |
| CI/CD workflow integration | Yes | Via separate tool (CML) | Yes | Yes (Jenkins, Chef, GitOps…) | Yes ❓ | Yes (Jenkins) | N/A | Yes | n/a | n/a | No ❓ | n/a | Yes (Jenkins, CircleCI, GitHub Actions, etc.) | n/a |
| Model staging | Yes | Yes | Yes ❓ | Yes | Yes | N/A | N/A | Yes | Yes | Yes | n/a | n/a | Yes | No |
| Model promotion | Yes | N/A | Yes ❓ | Yes | N/A | N/A | N/A | Yes | Yes | n/a | n/a | n/a | Yes | No |
| Tool integration | N/A | No ❓ (but other tools have DVC integrations, e.g. Hydra, Hugging Face, DagsHub, VS Code) | N/A | Yes (Docker, Kubernetes, TensorFlow, PyTorch, etc.) | Yes | Yes | N/A | n/a | Yes | Yes (Kubeflow, Ray, ZenML, TensorBoard…) | Yes | n/a | Yes | Yes (TensorFlow, PyTorch, Keras, XGBoost, etc.) |
| Model deployment integration | N/A | Yes | N/A | Yes | Yes | Yes | Yes | Yes | Yes | n/a | n/a | n/a | n/a | Yes |
| Model training / experiments integration | N/A | Yes | N/A | No | Yes | Yes | Yes | Yes | Yes | Yes | n/a | n/a | n/a | Yes |
| GitHub stars | 13.3k | 10.9k | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | 1.5k | n/a | n/a | 493 |
| License | Apache | Apache | Proprietary | Proprietary | Proprietary | Proprietary | Proprietary | Proprietary | Proprietary | Proprietary | Apache | Proprietary | Proprietary | Apache |
| Documentation* | Good | Good | Good | Good | Good | Good | Ok | Good | Good | Ok | Ok | Bad | Good | Ok |

Model Serving

Model serving is the process of making your machine learning models available to users or applications for inference. This involves deploying the model to a production environment and providing an API or other interface for accessing it. A common example of a model serving tool is TensorFlow Serving, which allows you to deploy TensorFlow models to production environments and provides an API for making predictions.
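Once a model is deployed behind TensorFlow Serving, for example, clients can get predictions over its REST API; here's a minimal sketch (the model name, port, and input shape are placeholders for whatever you deploy):

```python
import requests

# TensorFlow Serving exposes predictions over REST at
# /v1/models/<model_name>:predict (port 8501 by default)
url = "http://localhost:8501/v1/models/my_model:predict"
payload = {"instances": [[1.0, 2.0, 5.0]]}  # one input row; shape depends on the model

response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json()["predictions"])
```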

| Features | BentoML | Cortex | TensorFlow Serving | TorchServe | Seldon MLServer | KFServing | Azure ML | Valohai | ForestFlow | Databricks | Sagemaker |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Multiple model serving | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| RESTful endpoint monitoring | N/A | Yes | No | Yes | No | Yes | Yes | Yes | No | Yes | Yes (via CloudWatch) |
| Control via UI | Yes (with Yatai) | No | No | Yes | No | No | Yes | Yes | No | Yes | Yes |
| Input/output distribution shift monitoring | No | No | No | No | No | No | Yes | No | No | No | Yes (via CloudWatch) |
| Score via UI | No | No | No | No | No | No | No | No | No | Yes | No |
| Serving cluster customization / autoscaling | Only with Clipper | Yes | Yes | Yes (via TorchX) | No | Yes | Yes | Yes | Yes | Yes | Yes |
| Model agnostic | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Dependency manager | Yes | N/A | Yes | Yes | Yes | No | Yes | Yes | N/A | Yes | Yes |
| Part of a larger ecosystem | No | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| A/B testing tools | No | No | Yes | Yes | No | Yes | Yes | Yes | Yes | No | Yes |
| Support for gRPC | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | N/A | Yes | Yes |
| Integrations | Airflow, MLflow, Spark | TensorFlow, Keras, PyTorch, scikit-learn, XGBoost | TensorFlow | PyTorch, MLflow, Kubeflow | N/A | Kubeflow, MLflow | N/A | N/A | TensorFlow, Spark ML | Spark ML, MLflow, scikit-learn | MLflow, Kubeflow, Spark ML |
| GitHub stars | 4.5k | 7.9k | 5.7k | 3.1k | 349 | 1.9k | N/A | N/A | 61 | N/A | N/A |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | Proprietary | Proprietary | Apache 2.0 | Proprietary | Proprietary |
| Documentation* | Good | Ok | Good | Good | Ok | Ok | Good | Good | Ok | Good | Good |

The Final Stack

Together, that is a lot of information to digest. In practice, we split these evaluations across the team so they could run in parallel.

Next, we used the popular MoSCoW method to evaluate the tools and decide the winners. Here's a crisp explanation of MoSCoW prioritization (a small filtering sketch follows the list):

  • MoSCoW prioritization is a project management technique for prioritizing requirements or features based on their importance.
  • It stands for Must have, Should have, Could have, and Won't have.
  • Must-have requirements are critical and necessary for the project's success.
  • Should-have requirements are important but not critical.
  • Could-have requirements are desirable but not essential.
  • Won't-have requirements are excluded from the project scope.
  • MoSCoW prioritization helps teams focus on the most important features and manage stakeholder expectations.
  • It also helps reduce scope creep and ensures the project delivers the most value with the available resources.
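And here is that filtering sketch: a hedged, hypothetical example of how must-haves cut the field before should-haves and could-haves are even weighed. The feature flags are loosely distilled from the orchestrator comparison above; the dictionary is illustrative, not our actual scoring sheet.

```python
# Hypothetical must-have filter over a few orchestrator candidates,
# with feature flags loosely distilled from the comparison table above.
candidates = {
    "Airflow":            {"pipeline_versioning": False, "dependency_isolation": False},
    "Flyte":              {"pipeline_versioning": True,  "dependency_isolation": True},
    "Kubeflow Pipelines": {"pipeline_versioning": True,  "dependency_isolation": True},
    "Kedro":              {"pipeline_versioning": False, "dependency_isolation": False},
}
must_haves = ["pipeline_versioning", "dependency_isolation"]

# any candidate missing a must-have is dropped outright
viable = [name for name, features in candidates.items()
          if all(features.get(f, False) for f in must_haves)]
print(viable)  # ['Flyte', 'Kubeflow Pipelines']
```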
| Component | Tool |
|---|---|
| ML pipeline orchestrator | Flyte |
| ML metadata and artefact tracking | Sagemaker |
| Model registry | Sagemaker |
| Model serving | Sagemaker (BentoML was a close runner-up) |
| Model training | Ray, with AWS as the platform |

Azure was a close runner-up in some cases; however, Sagemaker won out because of its wider applicability.
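Before wrapping up, here's a minimal sketch of what a pipeline looks like in Flyte, our chosen orchestrator (toy tasks with placeholder logic, not our production workflow):

```python
from flytekit import task, workflow

@task
def preprocess(rows: int) -> int:
    # toy stand-in for feature engineering
    return rows * 2

@task
def train(rows: int) -> str:
    # toy stand-in for model training
    return f"model trained on {rows} rows"

@workflow
def training_pipeline(rows: int = 100) -> str:
    # Flyte builds the DAG from these calls; tasks take keyword arguments
    return train(rows=preprocess(rows=rows))
```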

In conclusion, choosing the right frameworks for building a robust and scalable machine learning architecture can be a challenging task. The MADR approach is a valuable tool for streamlining it: by evaluating the candidate tools and documenting the criteria and rationale behind each choice, teams keep the process transparent and accountable. While such extensive research may seem time-consuming, it is essential for making an informed decision that aligns with the organization's goals and requirements. In the end, a well-designed machine learning architecture can significantly impact the success of ML projects, and investing in the right tools and frameworks can make all the difference.
