Amazon SageMaker Inference

Easily deploy and manage machine learning (ML) models for inference

Vendor

Amazon Web Services (AWS)

Product details

What is Amazon SageMaker Inference?

Amazon SageMaker AI makes it easier to deploy ML models, including foundation models (FMs), to serve inference requests at the best price performance for any use case. From low latency and high throughput to long-running inference, you can use SageMaker AI for all your inference needs. SageMaker AI is a fully managed service that integrates with MLOps tools, so you can scale your model deployment, reduce inference costs, manage models more effectively in production, and reduce operational burden.

Benefits of SageMaker Inference

Deploy models in production for inference for any use case

SageMaker AI caters to a wide range of inference requirements, from low-latency (a few milliseconds) and high-throughput (millions of transactions per second) scenarios to long-running inference for use cases such as multilingual text processing, text-image processing, multi-modal understanding, natural language processing, and computer vision. SageMaker AI provides a robust and scalable solution for all your inference needs.

Achieve optimal inference performance and cost

Amazon SageMaker AI offers more than 100 instance types with varying levels of compute and memory to suit different performance needs. To better utilize the underlying accelerators and reduce deployment cost, you can deploy multiple models to the same instance. To further optimize costs, you can use autoscaling, which automatically adjusts the number of instances based on traffic. It shuts down instances when there is no usage, thereby reducing inference costs.
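
As a rough sketch of the autoscaling setup, the following uses the Application Auto Scaling API through boto3 to attach a target-tracking policy to an endpoint variant. The endpoint and variant names ("my-endpoint", "AllTraffic"), capacity limits, and target value are illustrative placeholders, not a definitive configuration.

```python
import boto3

# Application Auto Scaling manages scaling for SageMaker endpoint variants.
autoscaling = boto3.client("application-autoscaling")

# Placeholder identifiers for an already-deployed endpoint and its variant.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# Register the variant's instance count as a scalable target (1 to 4 instances).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Track roughly 100 invocations per instance, scaling in and out with traffic.
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```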

Reduce operational burden using SageMaker MLOps capabilities

As a fully managed service, Amazon SageMaker AI takes care of setting up and managing instances, software version compatibility, and patching. With built-in integration with MLOps features, it helps offload the operational overhead of deploying, scaling, and managing ML models so you can get them to production faster.

Wide range of inference options

Real-Time Inference

Real-time, interactive, and low latency predictions for use cases with steady traffic patterns. You can deploy your model to an endpoint that is fully managed and supports autoscaling.
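
For illustration, here is a minimal sketch of calling an already-deployed real-time endpoint with the boto3 SageMaker Runtime client; the endpoint name and JSON payload are placeholder assumptions, and the expected request format depends on the model you deploy.

```python
import json

import boto3

# SageMaker Runtime is the data-plane client used to call deployed endpoints.
runtime = boto3.client("sagemaker-runtime")

# "my-realtime-endpoint" is a placeholder for an endpoint you have already deployed.
response = runtime.invoke_endpoint(
    EndpointName="my-realtime-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "example payload"}),
)

# The prediction is returned synchronously in the response body.
print(response["Body"].read().decode("utf-8"))
```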

Serverless Inference

Low latency and high throughput for use cases with intermittent traffic patterns. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies.
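
As a hedged sketch, the following creates a serverless endpoint configuration and endpoint with boto3. The model, configuration, and endpoint names are placeholders (the model must already be registered), and the memory size and concurrency limit are example values you would tune for your workload.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# "my-model" is a placeholder for a model already registered with CreateModel.
sagemaker.create_endpoint_config(
    EndpointConfigName="my-serverless-config",
    ProductionVariants=[
        {
            "ModelName": "my-model",
            "VariantName": "AllTraffic",
            # Serverless capacity: memory per worker and max concurrent invocations.
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,
                "MaxConcurrency": 10,
            },
        }
    ],
)

# The endpoint launches and scales compute automatically as traffic arrives.
sagemaker.create_endpoint(
    EndpointName="my-serverless-endpoint",
    EndpointConfigName="my-serverless-config",
)
```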

Asynchronous Inference

Near real-time latency for use cases with large payloads (up to 1 GB) or long processing times (up to one hour). Asynchronous Inference queues incoming requests and helps save costs by autoscaling the instance count to zero when there are no requests to process.
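
A minimal sketch of submitting an asynchronous request with boto3, assuming the input payload has already been uploaded to Amazon S3; the endpoint name and S3 URIs are placeholders.

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# The request payload must already be staged in S3; both names below are placeholders.
response = runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    InputLocation="s3://my-bucket/async-inputs/request.json",
    ContentType="application/json",
)

# The call returns immediately; the prediction is written to this S3 location
# once the queued request has been processed.
print(response["OutputLocation"])
```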

Batch Transform

Offline inference on data batches for use cases with large datasets. With Batch Transform, you can preprocess datasets to remove noise or bias, and associate input records with inferences to help with result interpretation.
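
For illustration, a sketch of starting a Batch Transform job with boto3 against a model that already exists; the job name, model name, S3 paths, content type, and instance settings are placeholder assumptions chosen for a line-delimited CSV dataset.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# All names and S3 paths below are placeholders; the model must already exist.
sagemaker.create_transform_job(
    TransformJobName="my-batch-job",
    ModelName="my-model",
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/batch-inputs/",
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line",  # send each line of the input files as one record
    },
    TransformOutput={
        "S3OutputPath": "s3://my-bucket/batch-outputs/",
        "AssembleWith": "Line",
    },
    TransformResources={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
    },
)
```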
