TECH CULTURE
13 minute read • December 20, 2023

Deploying machine learning models at scale at sennder

By Henrik Jilke
About the author: Henrik Jilke is a machine learning engineer with more than six years of professional experience developing modern, real-time, and scalable AI applications.

Introduction

At sennder, we leverage model serving engines to deploy our machine learning models into production seamlessly. These engines play a pivotal role in enabling our engineers to deploy and manage machine learning models efficiently at scale. Notable open-source projects include Seldon, vLLM, Cortex, and BentoML/Yatai. In this article, we will delve into BentoML/Yatai[^1] and provide insights on its usage.

Motivation

Developing custom APIs is a common way to deploy machine learning models, such as linear regression for predictive analytics or convolutional neural networks for image recognition, into production. Many companies use frameworks like Flask or FastAPI to create these APIs and run their models on Kubernetes as containerized applications. This approach allows for quick model deployment, generates business value early, and keeps the tech stack simple, which is a common strategy for AI-focused companies. However, it has limitations, particularly when dealing with large-scale ML: scaling up ML operations requires managing resources, orchestrating tasks, allocating specialized hardware like GPUs, implementing model CI/CD, and ensuring easy monitoring and automated scaling.

Therefore, building custom APIs for models can become cumbersome at scale. This is where model serving engines come in. Tools like BentoML/Yatai[^1] and Seldon make it possible to deploy thousands of models at scale on Kubernetes without the need for manual service creation. These tools also seamlessly integrate with standard monitoring systems and support automatic scaling.

In this post, we'll explore the relatively new player, BentoML/Yatai, and briefly compare it with Seldon, a popular and well-established model serving engine.

[^1]: BentoML/Yatai means using the BentoML Python library with the Yatai backend system for managing models. BentoML also works with different backends.

Seldon vs. BentoML/Yatai

To showcase the characteristics of BentoML/Yatai, we will compare it against Seldon, which, with a decade of experience, is a highly popular and reliable model serving engine. Seldon can handle various model formats such as Sklearn and PyTorch, as well as more generic ones like MLFlow models. It simplifies the process by automatically creating an API for the model and deploying it as individual Pods on Kubernetes.

The comparison below contrasts Seldon and BentoML/Yatai in terms of how they handle deployments, compatibility, model registration, and monitoring and operations.

Define Deployments

  Seldon: To deploy a model via Seldon, a user defines a SeldonDeployment, specifies the model URI (e.g. an S3 bucket containing a model format supported by Seldon), and then applies the manifest. Once the SeldonDeployment is applied, the Seldon orchestrator creates all relevant Kubernetes objects and your model is deployed.

  BentoML/Yatai: BentoML/Yatai likewise provides a CRD to define model services, called BentoDeployment. Similar to the Seldon orchestrator, a dedicated operator on Kubernetes creates all the required resources.

Compatibility

  Seldon: Supports many different model formats, such as Sklearn or XGBoost, out of the box; see the Seldon documentation for the full list.

  BentoML/Yatai: Requires the user to convert the model artifact into an intermediate format called a Bento Model.

Model Registry

  Seldon: Does not provide a registry, but is compatible with the MLFlow registry.

  BentoML/Yatai: Comes with a registry, called Yatai-(dashboard). Much like the MLflow registry, you can register models and deploy them via the UI.

Monitoring & Operations

  Both frameworks support monitoring and scaling, and allow engineers to run automated, continuous deployment of ML models.

In simple terms, BentoML/Yatai and Seldon are quite similar overall. One important difference, however, is how they handle different types of models: Seldon can directly load model files (like Sklearn models) using its built-in providers, whereas with BentoML/Yatai you need to convert the model into an intermediate format called a Bento Model first (more on this in the section How to train models using BentoML).

In the upcoming sections, we'll delve into using BentoML/Yatai, examine the backend setup, explore model monitoring with DataDog, and take a look at a model's continuous integration and continuous deployment (CICD) process.

What is BentoML/Yatai? Overview and terminology

Before we dive into the details, let's take a high-level look at BentoML/Yatai. BentoML, a Python library, allows you to convert model artifacts into Bento Models and define and customize your model APIs. Yatai handles the deployment and operation of these APIs on Kubernetes.

To simplify the terminology, here are the key concepts:

  1. Bento Model: This is the model artifact along with a YAML file containing metadata such as Python version and library versions.

  2. Bento: A packaged Bento Model that can be deployed, including API definitions, Docker files, and more.

  3. BentoML: The Python library for creating Bento Models and Bentos.

  4. Yatai-(dashboard): A service running on Kubernetes that registers Bentos, similar to the MLFlow model registry.

  5. Yatai-image-builder: This tool builds the image for a given Bento and can push it to a repository like ECR.

  6. Yatai-deployment: It's used to deploy a Bento, utilizing the image that has been pushed.

In the next section we will explore how to create Bento Models and Bentos using BentoML.

 

How to train models using BentoML

BentoML lets you create deployable machine learning applications in Python. To keep things simple, we follow the official BentoML tutorial here.

The initial step is to convert your model into a Bento Model. In the example below, we train a model using sklearn and then save it as a Bento Model.

import bentoml
from sklearn import svm
from sklearn import datasets

# Load training data set
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Train the model
clf = svm.SVC(gamma='scale')
clf.fit(X, y)

# Save model to the BentoML local Model Store
saved_model = bentoml.sklearn.save_model("iris_clf", clf)

Once you run the script, you'll have a newly saved Bento Model on your local system. At this point, the Bento Model contains a pickled model file and a YAML file with metadata, including Python versions.
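A saved model can be loaded back from the local model store at any time, for example to verify the artifact. A minimal sketch using BentoML's sklearn integration:

import bentoml

# Retrieve the original sklearn estimator from the local BentoML model store
clf = bentoml.sklearn.load_model("iris_clf:latest")

# Sanity check: predict the class of a single iris measurement
print(clf.predict([[5.9, 3.0, 5.1, 1.8]]))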

One of the great things about BentoML is that you can easily tailor your APIs. With BentoML's Services, we can create custom APIs. Let's get started!

import numpy as np
import bentoml
from bentoml.io import NumpyNdarray

iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    result = iris_clf_runner.predict.run(input_series)
    return result

You might have noticed the similarity to Flask or FastAPI.
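In fact, we can already start this service locally for testing and send a sample request. Below is a minimal sketch, assuming the service definition above is saved as service.py:

# Start a development server with hot reloading
bentoml serve service:svc --reload

# In a second terminal, send a sample iris measurement to the endpoint
curl -X POST \
   -H "content-type: application/json" \
   --data "[[5.9, 3, 5.1, 1.8]]" \
   http://127.0.0.1:3000/classify

However, to create a fully functional, deployable service we need to build a Bento. The build is configured through a bentofile.yaml: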

service: "service:svc"  # Same as the argument passed to `bentoml serve`
labels:
   owner: bentoml-team
   stage: dev
include:
- "*.py"  # A pattern for matching which files to include in the Bento
python:
   packages:  # Additional pip packages required by the Service
   - scikit-learn
   - pandas
models: # The model to be used for building the Bento.
- iris_clf:latest

We can build the Bento using the CLI command, which will store the Bento locally:

bentoml build
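After the build completes, the new Bento appears in the local Bento store together with an auto-generated version tag, which the deployment manifests in the following sections refer to:

# List locally stored Bentos and their version tags
bentoml list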

BentoML/Yatai Backend infrastructure on EKS

In the BentoML/Yatai ecosystem's backend, there are three primary services, all running on Kubernetes. First, there's a model registry with a dashboard called yatai-(dashboard). Second, there's a service/operator named yatai-image-builder, responsible for building and pushing images. Finally, there's a service/operator, yatai-deployment, for deploying these models.

You can refer to a tutorial for installing these services. If you're part of a larger organization and manage your infrastructure with tools like Terraform, you'll find the BentoML/Yatai ecosystem to be quite flexible. You can still create and manage your own infrastructure and configure BentoML/Yatai to use it. For example, you might want to set up your own Postgres database, ECR repository, and IAM roles in accordance with your company's guidelines.
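As a rough sketch, the three services can be installed with Helm. The chart and namespace names below follow the public Yatai Helm charts, but check the official installation guide for the exact repository URL and the chart values needed to wire in external infrastructure such as your own Postgres database:

# Add the Yatai Helm chart repository (URL as per the official docs)
helm repo add yatai https://bentoml.github.io/helm-charts
helm repo update

# Install the registry/dashboard
helm install yatai yatai/yatai -n yatai-system --create-namespace

# Install the image builder and the deployment operator
helm install yatai-image-builder yatai/yatai-image-builder -n yatai-image-builder --create-namespace
helm install yatai-deployment yatai/yatai-deployment -n yatai-deployment --create-namespace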

As an engineer, you don't necessarily need to use all of these components. In fact, you can often achieve your goals using just yatai-deployment. However, let's explore how these three components can be utilized.

Yatai-(dashboard)

Once you've trained a model and created a Bento, you can upload it to the yatai registry. This registry is similar to MLFlow's registry and provides a user-friendly interface showing a list of models and their deployment status. You can also deploy models directly from this interface.
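Uploading works from the command line as well. A small sketch, assuming you have already authenticated against your Yatai instance (e.g. via bentoml yatai login with your API token and endpoint):

# Push the Bento, including its models, to the Yatai registry
bentoml push iris_classifier:latest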

Yatai-image-builder

Before you can deploy a Bento, you need to containerize it and upload the resulting image to a registry like ECR. If you have your Bento stored in the Yatai registry or an S3 bucket, you can use the BentoRequest CRD to create a Bento CR. In practice, applying the manifest below builds an image corresponding to your Bento and pushes it to a registry such as AWS ECR. Note that you can also build and push the image as part of your CI process without using yatai-image-builder; a sketch follows after the manifest.

apiVersion: resources.yatai.ai/v1alpha1
kind: BentoRequest
metadata:
  name: my-bento
  namespace: my-namespace
spec:
  bentoTag: iris:1
  downloadUrl: s3://my-bucket/bentos/iris.tar.gz
  runners:
  - name: runner1
    runnableType: SklearnRunnable
    modelTags:
    - iris:1
  - name: runner2
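If you build images in your CI pipeline instead, a minimal sketch could look like the following; the ECR repository URL is a placeholder:

# Build an OCI image for the Bento using the local Docker daemon
bentoml containerize iris_classifier:latest -t 123456789012.dkr.ecr.eu-central-1.amazonaws.com/iris:1

# Push the image to your registry
docker push 123456789012.dkr.ecr.eu-central-1.amazonaws.com/iris:1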

Yatai-deployment

Yatai-deployment deploys the images defined via BentoDeployment resources. In the example below, we create a BentoRequest and deploy the service with a BentoDeployment. You can customize the deployment by adding annotations or labels, for instance when using DataDog.

The CRD also lets us scale the resources for Runners and API Services individually, so we can allocate additional resources to machine learning models with higher demands.

apiVersion: resources.yatai.ai/v1alpha1
kind: BentoRequest
metadata:
  name: iris-classifier
  namespace: yatai
spec:
  bentoTag: iris_classifier:3oevmqfvnkvwvuqj  # check the tag by `bentoml list iris_classifier`
---
apiVersion: serving.yatai.ai/v2alpha1
kind: BentoDeployment
metadata:
  name: my-bento-deployment
  namespace: yatai
spec:
  bento: iris-classifier
  ingress:
    enabled: true
  resources:
    limits:
      cpu: "500m"
      memory: "512Mi"
    requests:
      cpu: "250m"
      memory: "128Mi"
  autoscaling:
    maxReplicas: 10
    minReplicas: 2
  runners:
    - name: iris_clf
      resources:
        limits:
          cpu: "1000m"
          memory: "1Gi"
        requests:
          cpu: "500m"
          memory: "512Mi"
      autoscaling:
        maxReplicas: 4
        minReplicas: 1
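Applying the manifests is a plain kubectl operation, after which the yatai-deployment operator creates the underlying Kubernetes resources; the file name below is an example:

# Apply the BentoRequest and BentoDeployment
kubectl apply -f iris_deployment.yaml

# Watch the operator create deployments and pods in the yatai namespace
kubectl -n yatai get deployments,pods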

Monitoring using DataDog

As mentioned earlier, BentoML services expose Prometheus metrics. If your company uses DataDog, you need to annotate your BentoDeployments so the DataDog agent discovers and scrapes these metrics, as in this example:

apiVersion: serving.yatai.ai/v2alpha1
kind: BentoDeployment
metadata:
  name: my-bento-deployment
  namespace: yatai
spec:
  bento: iris-classifier
  ingress:
    enabled: true
  extraPodMetadata:
    annotations:
      ad.datadoghq.com/main.checks: |
        {
          "openmetrics": {
            "init_config": {},
            "instances": [
              {
                "openmetrics_endpoint": "http://%%host%%:%%port%%/metrics",
                "namespace": "bentoml",
                "metrics": [".*"]
              }
            ]
          }
        }
  resources:
    limits:
      cpu: "500m"
      memory: "512Mi"
    requests:
      cpu: "250m"
      memory: "128Mi"
  autoscaling:
    maxReplicas: 10
    minReplicas: 2
  runners:
    - name: iris_clf
      resources:
        limits:
          cpu: "1000m"
          memory: "1Gi"
        requests:
          cpu: "500m"
          memory: "512Mi"
      autoscaling:
        maxReplicas: 4
        minReplicas: 1

If you use New Relic instead, please follow the documentation here for details on how to fetch the metrics with New Relic.

In the last section, we will illustrate one way to design model continuous delivery (CD).

Ideas for Model Continuous Delivery

For testing, you can manually apply the above manifests using kubectl. However, it's recommended to establish a robust model CD process. This involves versioning your model manifests and metadata using platforms like GitHub and setting up automated CD pipelines, such as GitHub Actions. This approach makes it easy to redeploy your models in case of failures.

Many companies schedule model training using orchestrators like Apache Airflow. As demonstrated in section How to train models using BentoML, you can seamlessly integrate model training with Bento creation and storage in a registry or an S3 bucket. 

Subsequently, you can either manually or automatically open a pull request with the deployment manifest (refer to section BentoML/Yatai Backend infrastructure on EKS). Once merged, tools like ArgoCD can synchronize the new manifests and apply them to your Kubernetes cluster. If you're not using ArgoCD, you can define GitHub Actions and use kubectl to apply the manifests.
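As an illustration of the kubectl variant, here is a minimal GitHub Actions sketch; the deploy/ directory and the KUBE_CONFIG secret are hypothetical names for your manifest folder and cluster credentials:

name: deploy-models
on:
  push:
    branches: [main]
    paths: ["deploy/**"]

jobs:
  apply-manifests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/setup-kubectl@v3
      - name: Apply Bento deployment manifests
        env:
          KUBE_CONFIG: ${{ secrets.KUBE_CONFIG }}
        run: |
          echo "$KUBE_CONFIG" > kubeconfig
          kubectl --kubeconfig kubeconfig apply -f deploy/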

With these steps in place, you'll have a functional and automated model CD system!

Conclusion

In summary, model serving engines like BentoML/Yatai simplify the deployment of machine learning models. They handle resource isolation, hardware allocation, monitoring, and scaling automatically, especially at scale. However, setting up and maintaining these tools in your company environment can be complex. As a result, they are typically more beneficial for AI-savvy companies that deploy many models. If you're just starting with AI and have only a few models, it might be more straightforward to deploy them using tools like Flask or FastAPI.
