
MLServer is an open-source, lightweight inference server designed for DevOps and ML Engineers to efficiently deploy and serve machine learning models.
Vendor
Seldon
MLServer is an open-source tool built for DevOps and ML Engineers that enables the deployment of machine learning models in environments ranging from simple setups to complex production use-cases that require significant scaling. It simplifies the process of spinning up inference servers by relying on standardized protocols and handling the inherent scaling challenges. Models can be served efficiently over both REST and gRPC APIs, with standardized API definitions provided by the Open Inference Protocol for swift deployment.

At its core is a flexible Python inference server designed to serve ML models within Kubernetes-native frameworks, and it is easy to modify and extend to fit specific requirements. MLServer orchestrates the dependencies needed to execute each runtime, keeping deployments smooth and the operational environment streamlined. It offers out-of-the-box support for popular machine learning frameworks such as scikit-learn, XGBoost, MLlib, LightGBM, and MLflow, and also allows users to build custom runtimes, so models can be served precisely according to their needs.

MLServer also includes features that optimize performance and reduce operational costs. Parallel inference reduces latency and increases throughput by running multiple inference processes on a single server and passing requests to those separately running processes. Adaptive batching improves resource efficiency by grouping incoming requests, running predictions on the batch, and then splitting the responses back out to individual users. Multi-model serving can deliver significant infrastructure cost savings by running multiple models (different versions or entirely different models) on the same server, optimizing resource utilization.
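As a concrete illustration of the API-driven workflow, the sketch below sends an Open Inference Protocol (V2) request over REST to a locally running MLServer instance, assumed to have been started with the `mlserver start` CLI on its default HTTP port (8080). The model name `my-model`, the input name, and the tensor shape are placeholders for illustration, not part of the product description.

```python
import requests  # third-party HTTP client, used here for brevity

# Open Inference Protocol (V2) payload: a single FP32 tensor input.
# "my-model", "input-0" and the shape are illustrative placeholders.
inference_request = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [[0.1, 0.2, 0.3, 0.4]],
        }
    ]
}

# MLServer exposes the standard V2 endpoints; 8080 is its default HTTP port.
response = requests.post(
    "http://localhost:8080/v2/models/my-model/infer",
    json=inference_request,
)
response.raise_for_status()

# The response mirrors the protocol: a list of named output tensors.
for output in response.json()["outputs"]:
    print(output["name"], output["data"])
```

The same request could be made over gRPC instead, since both transports share the same Open Inference Protocol definitions.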
Features & Benefits
- Simple and Fast Implementation
- Easily spins up inference servers with standardized protocols, handling scaling challenges for production use-cases.
- API-Driven Model Serving
- Delivers swift deployment through standardized API definitions based on the Open Inference Protocol, with support for both REST and gRPC APIs.
- Flexible & Extensible Architecture
- A core Python inference server for serving ML models in Kubernetes-native frameworks, easy to modify and extend.
- Leverages popular frameworks including scikit-learn, XGBoost, MLlib, LightGBM, and MLflow out of the box.
- Allows building custom runtimes (see the custom-runtime sketch after this list).
- Streamlined Dependency Orchestration
- Orchestrates dependencies essential for the execution of your runtimes, ensuring a smooth and efficient operational environment.
- Performance Optimization
- Reduces latency and increases throughput with parallel inference, running multiple inference processes on a single server (see the configuration sketch after this list).
- Resource Efficiency
- Improves efficiency with adaptive batching to group requests, perform predictions on the batch, and separate responses.
- Infrastructure Cost Savings
- Reduces cost and optimizes resources with multi-model serving, running multiple models on the same server.
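To show what a custom runtime can look like, here is a minimal sketch following MLServer's documented `MLModel` interface: a subclass implements an asynchronous `load()` and `predict()`. The joblib artifact and the `predict`-style interface of the wrapped object are assumptions for illustration, and the exact codec helper names may vary slightly between MLServer versions.

```python
import joblib  # assumption: the artifact is a joblib-serialized model

from mlserver import MLModel
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    """Minimal custom-runtime sketch: load an artifact, answer V2 requests."""

    async def load(self) -> bool:
        # Load whatever artifact this runtime wraps; the URI comes from the
        # model's settings (e.g. a model-settings.json `parameters.uri` field).
        self._model = joblib.load(self.settings.parameters.uri)
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the request into a NumPy array, run the wrapped model,
        # then re-encode the result as a V2 output tensor.
        model_input = self.decode_request(payload, default_codec=NumpyCodec)
        model_output = self._model.predict(model_input)
        return InferenceResponse(
            model_name=self.name,
            outputs=[NumpyCodec.encode_output(name="predict", payload=model_output)],
        )
```

A runtime like this is wired in by pointing the `implementation` field of the model's `model-settings.json` at the class, after which MLServer serves it like any built-in runtime.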
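Parallel inference, adaptive batching, and multi-model serving are driven by MLServer's settings files rather than code changes. The configuration sketch below writes a server-level `settings.json` (with `parallel_workers`) and two per-model `model-settings.json` files (with `max_batch_size` and `max_batch_time` enabling adaptive batching) into a model repository folder; `mlserver start` then loads every model it finds there, so both models share one server. The folder layout, model names, and artifact paths are illustrative assumptions.

```python
import json
from pathlib import Path

# Hypothetical model repository layout; MLServer loads every
# model-settings.json it finds below the folder passed to `mlserver start`.
repo = Path("models")

# Server-level settings: run several inference worker processes in parallel.
server_settings = {"parallel_workers": 4}

# Two models served by the same server (multi-model serving). Adaptive
# batching is enabled per model via max_batch_size / max_batch_time.
model_settings = {
    "iris-sklearn": {
        "name": "iris-sklearn",
        "implementation": "mlserver_sklearn.SKLearnModel",
        "parameters": {"uri": "./model.joblib"},
        "max_batch_size": 32,
        "max_batch_time": 0.1,
    },
    "income-xgboost": {
        "name": "income-xgboost",
        "implementation": "mlserver_xgboost.XGBoostModel",
        "parameters": {"uri": "./model.json"},
        "max_batch_size": 32,
        "max_batch_time": 0.1,
    },
}

repo.mkdir(exist_ok=True)
(repo / "settings.json").write_text(json.dumps(server_settings, indent=2))

for name, settings in model_settings.items():
    model_dir = repo / name
    model_dir.mkdir(exist_ok=True)
    (model_dir / "model-settings.json").write_text(json.dumps(settings, indent=2))

# After placing the actual model artifacts next to each model-settings.json,
# the whole repository is served with: `mlserver start models`
```

Keeping these behaviours in configuration means the same model code can be scaled up, batched, or co-located with other models without modification.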