Deploying machine learning, LLM, and AI models to the cloud can be a complex and time-consuming process.
You either need extensive knowledge of specific cloud services and cloud resource configurations or a team of ops engineers to do the work for you.
It’s no wonder data scientists hate this step of the development process — their work stalls as they wait for someone else to deploy the model for them.
This is where Ulap’s Inference Engine comes in, providing an efficient, data-scientist-friendly platform for model serving.
We’ll show you how it works in this article.
What is Model Serving?
Model serving, simply put, is hosting machine-learning models and making their functionality available through an API so applications can incorporate AI into their systems.
When a data scientist has an ML model ready, the next step is to deploy it.
Your business cannot offer AI products to a large user base without making them accessible, either in the cloud or on-premises (though we recommend the cloud).
Common Serving Methods
In general, there are two common methods for deploying (or serving) models to the cloud.
Cloud Services
One of the most common methods of serving models is utilizing cloud services such as AWS Sagemaker, Azure ML, and GCP Vertex AI.
In this method, the user or team must be familiar with the specific services they are using and understand cloud resource configuration — something that requires extensive training and experience to do well.
This is usually the most expensive option and can lead to vendor lock-in. Once you choose a vendor, it is incredibly difficult and costly to switch, so choose wisely!
Inference Endpoints
The next method is to find a vendor that provides deployment of inference endpoints, such as Hugging Face, Weights and Biases, DataRobot, and Databricks.
These vendors do not require you to know how to configure cloud resources, since the deployment process is usually automated.
In this method, the user or team provides the model or the location of the model files, and the software containerizes and serves the model.
The Tech Behind Our Inference Engine
There is a third option for serving models to the cloud — Ulap’s Inference Engine.
We built the Inference Engine to free up developers and data scientists to deliver amazing AI/ML applications on cloud resources via deployment wizards and automated operations. In other words, we wanted to create a cost-effective and easy-to-use option to serve models to the cloud.
Open-Source Technology
Under the hood, our Inference Engine is built on open-source technologies and runs on Kubernetes, which manages and deploys your inference server. Kubernetes provides scalability, resource management, fault tolerance, and automation, which translates to a robust, production-ready architecture for your inference service.
We use two open-source tools to power the Inference Engine.
MLflow
First, we use MLflow to track production-ready models.
This open-source platform manages the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
It not only helps track training experiments throughout the model development process, it also tracks the list of production-ready models in its model registry.
Information on model performance, artifact location, and model requirements is stored for easy access.
KServe
The second application that powers the inference engine is KServe.
This application enables serverless inferencing on Kubernetes and provides performant, high abstraction interfaces for common machine learning (ML) frameworks to solve production model serving use cases.
We use KServe for scaling, networking, health-checking, and configuring the model server.
Under the hood, it leverages MLServer, TensorFlow Serving, and TorchServe to serve the framework-specific models.
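To give a sense of what this looks like from the client side, here is a minimal sketch of calling a deployed model over KServe’s V1 REST protocol. The host, model name, and input values are hypothetical placeholders, not a specific Ulap endpoint.

import requests

# Hypothetical endpoint exposed by a KServe inference service
url = "http://models.example.com/v1/models/my-sklearn-model:predict"

# The V1 inference protocol expects a JSON body with an "instances" list
payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}

response = requests.post(url, json=payload, timeout=30)
print(response.json())  # e.g. {"predictions": [0]}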
Preparing Models for Serving with Ulap’s Inference Engine
Serving your models with Ulap’s Inference Engine is simple.
Depending on the AI framework the model is built with (e.g. scikit-learn or PyTorch), the deployment steps will vary.
However, there are three overall steps to deploy and serve a model on our Inference Engine:
Step 1: Train and Track the Model
The first step involves tracking the training of the model using MLflow.
During the training process, the goal is to track not only model performance metrics but also all the information necessary to reproduce the model.
This can include information such as:
- Dataset used
- Training script
- Environment requirements
The most common way of tracking your model training is by wrapping the training code and starting an MLflow run using the code below:
import mlflow

with mlflow.start_run():
    # Training code, e.g. for scikit-learn:
    model.fit(X_train, y_train)
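Beyond wrapping the training code, you can log metrics, parameters, and the model itself in the same run. The snippet below is a sketch, assuming a scikit-learn classifier trained on a toy dataset; logging the model with registered_model_name also places it in the MLflow Model Registry (the model name is a placeholder).

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)

    # Track performance and the settings needed to reproduce the run
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", model.score(X_test, y_test))
    # mlflow.log_artifact("requirements.txt")  # environment requirements, if available

    # Log the model and register it in the MLflow Model Registry
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="MODEL_NAME")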
Step 2: Prepare Your Model Files
In this step, the model is packaged into a framework-dependent file, and additional deployment files are created.
For scikit, XGBoost, and TensorFlow, the model is saved as a .pickle, .bst, or .pb file, respectively.
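As a rough sketch of this packaging step, the framework-specific files could be written to a local deployment folder as shown below. A tiny placeholder scikit-learn model stands in for your trained model; the XGBoost and TensorFlow lines are left as comments since xgb_model and tf_model are assumed, not defined here.

import os
import pickle
from sklearn.linear_model import LogisticRegression

os.makedirs("deployment", exist_ok=True)

# scikit-learn: serialize the fitted estimator to a .pickle file
sklearn_model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
with open("deployment/model.pickle", "wb") as f:
    pickle.dump(sklearn_model, f)

# XGBoost: save the booster to a .bst file
# xgb_model.save_model("deployment/model.bst")

# TensorFlow: export a SavedModel directory containing saved_model.pb
# tf_model.save("deployment/saved_model")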
For PyTorch, however, the deployment process is a bit more complex as the framework allows for more serving configurability. Three files must be created:
- A handler file to define the model behavior (see the sketch after this list)
- A config file to define the TorchServe server configuration
- The model archive file
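As an illustration of the first of these files, a custom TorchServe handler is typically a small Python class that extends TorchServe’s base handler. The class below is a minimal, hypothetical sketch; the exact preprocessing and postprocessing depend on your model, and this is not the specific file Ulap’s wizard produces.

import torch
from ts.torch_handler.base_handler import BaseHandler

class MyModelHandler(BaseHandler):
    """Hypothetical handler that defines how requests become predictions."""

    def preprocess(self, data):
        # Pull the JSON payload out of each request and build an input tensor
        inputs = [row.get("data") or row.get("body") for row in data]
        return torch.as_tensor(inputs, dtype=torch.float32)

    def postprocess(self, outputs):
        # TorchServe expects one result per request in a plain Python list
        return outputs.argmax(dim=1).tolist()

The model archive (.mar) file is then typically built from the saved model weights and this handler with TorchServe’s torch-model-archiver tool, while the config file controls server settings such as ports and workers.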
These files are then uploaded to a deployment folder associated with the model’s run via MLflow using the following command:
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Upload model artifacts to the MLflow run from Step 1 (identified by run_id)
client.log_artifacts(run_id=run_id,
                     local_dir="deployment",
                     artifact_path="deployment")
Step 3: Stage Your Model
The last step in the process is to stage the model in MLflow, using the code below.
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(name="MODEL_NAME", version=1,
                                      stage="Staging")
Here the model can be transitioned to “Staging”, “Production”, or “Archived”. This lets users keep track of a model’s stage and, more specifically, tells the Inference Engine which models are ready to serve and where their artifacts are located, so the server can fetch the files during deployment.
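To illustrate how that stage information can be consumed downstream, here is a sketch of querying the latest “Staging” version of a model and its artifact location with the MLflow client. The model name is a placeholder, and this is not necessarily the exact call our engine makes.

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Look up the newest model version currently in the "Staging" stage
latest = client.get_latest_versions("MODEL_NAME", stages=["Staging"])[0]
print(latest.version, latest.source)  # version number and artifact location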
Make Your Deployments Easy
Deploying your machine-learning models shouldn’t be a daunting task.
With Ulap’s Inference Engine, you can streamline your deployment process to allow your developers and data scientists to focus on what matters — building applications your customers will love.
Want to try the Inference Engine out yourself?
Sign up for a free 30-day trial and start deploying your ML models today.