Optimizing AI/ML Models for Serving: Proven Techniques to Reduce Inference Times

Published by The Ulap Team on December 15, 2023

Ensuring your model will perform well in a production environment requires more than just training, preparing, and serving.

Your models and your server need to be optimized.

Picture it: A user goes to use your LLM on their smartphone and has to wait minutes for the model to respond. 

Not only will your LLM suck up all the memory and computing power on their phone, but it will also provide a poor user experience — something you can’t easily bounce back from.

To avoid that gut-wrenching scenario, you need to optimize your model.

We’ll dive into two ways you can optimize your LLM or Deep Learning models before serving.

Getting Started with Model Optimization

LLM and Deep Learning model optimization starts once your model is trained but before you serve it.

To ensure your model performs well in a production environment, there are two factors you should consider:

Model Size

It’s important to consider model size as the server hardware may not support the memory and storage requirements for the model.

Additionally, when models are served they must be uploaded and downloaded. If the model is large, serving time will increase — something to look out for when serving models on edge devices (such as smartphones, tablets, or laptops) that have memory and computing power constraints.

Pay attention to LLMs in particular: they can be tens of gigabytes in size.

Source: Inside language models (from GPT-4 to PaLM) – Dr Alan D. Thompson – Life Architect

Inference Latency

You also want to look at inference latency as real-world applications require fast inference times. This becomes especially important with real-time applications.

One easy way to reduce latency is to use GPUs. However, not every deployment target has a GPU, and even with one, very large models such as LLMs can still respond slowly. Slow inference not only creates a bad experience for the end user (imagine waiting minutes for ChatGPT to respond), it also increases the cost of serving the model.

Model Optimization Techniques

The goal of optimizing your model is to reduce both the memory and computational requirements of the model, which will accelerate its overall performance.

There are two common ways to accomplish those goals:

  • Reduce the model size
  • Compile the model

Let’s look at both.

Reducing Model Size

Reducing the size of your model is essential in optimizing it for a production environment, especially if your model will be accessed on edge devices, such as smartphones, tablets, and laptops that have limited space.

There are two ways to reduce the size of your model:

  • Model Quantization
  • Model Pruning

Model Quantization

The most popular way to reduce model size is a process called quantization. 

Models hold millions, or even billions, of weights and biases that serve as the model’s memory. These parameters, typically stored as 32-bit floating-point numbers, result in models that require massive amounts of memory.

Quantization works by reducing the precision of the weights, for example storing them as 8-bit integers instead of 32-bit floats. This drastically reduces the memory footprint and computational requirements of the model.

Source: The Ultimate Guide to Deep Learning Model Quantization and Quantization-Aware Training
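As a rough illustration of the idea (a toy sketch with made-up values, not any library's actual quantization scheme), the snippet below maps a float32 weight tensor onto 8-bit integers with a single scale factor; real schemes add zero-points, per-channel scales, and calibration:

```python
import torch

# Hypothetical float32 weights (in a real model these come from a trained layer)
weights = torch.tensor([0.12, -1.73, 0.004, 2.41, -0.58])

# Map the float range onto the signed 8-bit integer range [-127, 127]
scale = weights.abs().max() / 127.0
q_weights = torch.clamp((weights / scale).round(), -127, 127).to(torch.int8)

# At inference time the integers are scaled back to approximate floats
deq_weights = q_weights.to(torch.float32) * scale

print(q_weights)    # int8 values: 1 byte each instead of 4
print(deq_weights)  # close to the originals, with a small rounding error
```

The rounding error visible in the last line is exactly the information loss discussed next.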

The Downside of Quantization

There is a drawback to quantization: a reduction in model accuracy. 

Because quantization reduces the precision of the numerical values in the model, some information is lost, which can decrease model accuracy, especially on complex tasks.

In the case of LLMs, this can mean the response you get is gibberish.

Methods of Quantization

There are two ways to quantize a model, depending on where you are in the model development process:

Post-Training Quantization (PTQ)

According to TensorFlow, post-training quantization includes general techniques to reduce CPU and hardware accelerator latency, processing, power, and model size with little degradation in model accuracy.

This method of quantization is done after training the model (hence the name) and is easier to use.
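For instance, PyTorch supports post-training dynamic quantization through torch.quantization.quantize_dynamic. The sketch below uses a small stand-in model (the layer sizes are illustrative) and quantizes its linear layers to int8 after training:

```python
import torch
import torch.nn as nn

# A small stand-in model; substitute your own trained network here
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()  # PTQ is applied after training

# Post-training dynamic quantization: the weights of nn.Linear layers
# are stored as int8 and dequantized on the fly during inference
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement for inference
example_input = torch.randn(1, 128)
with torch.no_grad():
    output = quantized_model(example_input)
print(output.shape)  # torch.Size([1, 10])
```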

Quantization-Aware Training (QAT)

According to TensorFlow, quantization-aware training is done by creating a model that downstream tools will use to produce quantized models. The quantized models use lower precision, leading to benefits during deployment.

This method of quantization is used during the training process and is often better for model accuracy.
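As a sketch of what QAT looks like with PyTorch's eager-mode quantization API (a simplified stand-in for a real training pipeline, assuming an x86 CPU with the fbgemm backend), the model is wrapped with quant/dequant stubs, prepared for QAT, trained as usual, and then converted to int8:

```python
import torch
import torch.nn as nn

# Minimal model wrapped with quant/dequant stubs so fake quantization
# can be simulated at the model boundaries during training
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = SmallNet()
model.train()

# Attach a QAT config and insert fake-quantization observers
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)

# Your normal training loop goes here; a dummy loop with random data
# stands in for it so the observers see activations
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(10):
    x = torch.randn(32, 128)
    target = torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, convert the fake-quantized model into a real int8 model
model.eval()
quantized_model = torch.quantization.convert(model)

output = quantized_model(torch.randn(1, 128))
print(output.shape)  # torch.Size([1, 10])
```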

Quantization Libraries

We recommend the following libraries for quantization:

  • PyTorch for PyTorch framework models.
  • Transformers for LLMs, which offers a choice of quantization algorithms (see the sketch after this list).
  • ONNX for ONNX framework models.
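As one example with the Transformers library, a causal LLM can be loaded with 8-bit weights via the bitsandbytes integration. The model name below is a placeholder, and the call assumes the bitsandbytes and accelerate packages are installed along with a CUDA GPU:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-llm"  # placeholder model name

tokenizer = AutoTokenizer.from_pretrained(model_id)

# load_in_8bit quantizes the linear-layer weights to int8 at load time
# (requires the bitsandbytes and accelerate packages and a CUDA GPU)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
)

inputs = tokenizer("Model quantization reduces", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```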

Model Pruning

Model pruning is the technique of removing neurons within the neural network that do not improve a model’s performance.

Pruning these unimportant parameters reduces the number of weights which, effectively, makes the model smaller.

Methods of Pruning

Pruning can be done in two ways, both after the model has been trained.

Unstructured Pruning

Unstructured pruning, also called magnitude pruning, converts some of the parameters or weights with smaller magnitudes into zeros.

It removes individual weights from the model. 
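For example, PyTorch's torch.nn.utils.prune module can zero out a fraction of the smallest-magnitude weights in a layer; the layer size and the 30% ratio below are purely illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)  # stand-in for a trained layer

# L1 (magnitude) unstructured pruning: zero out the 30% of weights
# with the smallest absolute values
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent by folding the mask into the weight tensor
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Fraction of zeroed weights: {sparsity:.2f}")  # ~0.30
```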

Structured Pruning

Structured pruning removes a structure (building block) of the target neural network, such as:

  • Entire neurons
  • Channels or filters

Removing a structure from the network shrinks the parameter (weight) matrices.
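A structured variant in PyTorch zeroes whole rows (output neurons) of a weight matrix at once; the sketch below prunes 25% of the output neurons of an illustrative linear layer, ranked by L2 norm. Note that the utility only zeroes the structures; physically removing them to shrink the matrix requires an additional export or compaction step.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)  # stand-in for a trained layer

# Structured pruning: zero out 25% of the rows of the weight matrix
# (each row corresponds to one output neuron), ranked by L2 norm
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)
prune.remove(layer, "weight")

zero_rows = (layer.weight.abs().sum(dim=1) == 0).sum().item()
print(f"Pruned output neurons: {zero_rows} of {layer.weight.shape[0]}")  # 16 of 64
```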

The image below shows a fully connected network before and after pruning.

Source: Pruning in Deep Learning Model

A Caution for Pruning

Pruning a model can reduce inference times and memory usage, since the pruned model has fewer calculations to execute, and in some cases it can even improve model accuracy.

However, if not used with caution, pruning can also lead to a loss of accuracy, especially if too many weights or connections are removed from the model.

Pruning Libraries

Models can be pruned post-training using utilities built into the major frameworks; PyTorch, for example, ships the torch.nn.utils.prune module used in the sketches above.

Model Compilation

Compiling a model is the process of converting the model into an executable format that can be run on specific hardware.

Natively, the common deep learning frameworks, such as PyTorch, are written in Python, which is easy to understand and develop in but notoriously slow: the code is interpreted at runtime rather than compiled ahead of time like C/C++ programs, which can be one to two orders of magnitude faster.

Model compilation is completed after the model has been trained. It can be done either before or after quantization and pruning, but it is recommended to do it after. This allows the model to be optimized for the specific hardware platform, while still maintaining accuracy.  

Pros and Cons of Model Compilation

During the compilation process, the model is optimized for the specific hardware platform, which can lead to improved performance, reduced inference latency, and reduced memory usage. 

However, model compilation can be a time-consuming process, and it may require specialized knowledge and expertise. 

Additionally, compiled models may not support some features, so it is important to know the limitations when using this technique.

Model Compilation Libraries

This step comes after training your model and can be done using several libraries depending on the framework you are starting with, including:

  • PyTorch: PyTorch uses just-in-time (JIT) compilation, meaning the model is compiled at runtime. The PyTorch code is compiled into optimized kernels rather than run through the Python interpreter, as happens when you run the model in a Jupyter notebook (see the sketch after this list).
  • ONNX Runtime
  • Apache TVM
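As a minimal sketch of the PyTorch route (assuming PyTorch 2.x, where torch.compile JIT-compiles the model into optimized kernels the first time it runs; the stand-in model is illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for a trained model
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# torch.compile (PyTorch 2.x) JIT-compiles the model into optimized
# kernels on the first call; subsequent calls reuse the compiled code
compiled_model = torch.compile(model)

example_input = torch.randn(8, 128)
with torch.no_grad():
    output = compiled_model(example_input)  # first call triggers compilation
print(output.shape)  # torch.Size([8, 10])
```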

Mastering Model Optimization

Optimizing your model can drastically enhance model-serving effectiveness while retaining performance and accuracy in a production environment.

Leverage the techniques we discussed to reduce the model size and compile your model before serving it in the cloud.

Your end users will have a much better experience and you’ll save computing power, which ultimately saves you money.

Another way to save money with your models? Sign up for a free trial of the Ulap Inference Engine.