
High performance at the lowest cost in Amazon EC2 for the most demanding inference workloads
Vendor: Amazon Web Services (AWS)
Why Amazon EC2 Inf2 Instances?
Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances are purpose built for deep learning (DL) inference. They deliver high performance at the lowest cost in Amazon EC2 for generative artificial intelligence (AI) models, including large language models (LLMs) and vision transformers. You can use Inf2 instances to run your inference applications for text summarization, code generation, video and image generation, speech recognition, personalization, fraud detection, and more.

Inf2 instances are powered by AWS Inferentia2, the second-generation AWS Inferentia chip. Compared to Inf1 instances, they deliver 3x higher compute performance, 4x larger total accelerator memory, up to 4x higher throughput, and up to 10x lower latency. Inf2 instances are also the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between Inferentia chips, so you can efficiently and cost-effectively deploy models with hundreds of billions of parameters across multiple chips.

The AWS Neuron SDK helps developers deploy models on AWS Inferentia chips (and train them on AWS Trainium chips). It integrates natively with frameworks such as PyTorch and TensorFlow, so you can keep your existing workflows and application code and run them on Inf2 instances.
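As an illustration of that workflow, the sketch below compiles a small PyTorch model for Inferentia2 with the torch-neuronx package from the Neuron SDK. The model, input shapes, and file name are placeholders, and the exact API surface may differ between Neuron releases; treat this as a minimal sketch rather than a definitive recipe.

import torch
import torch_neuronx  # PyTorch interface of the AWS Neuron SDK

# Placeholder model standing in for your existing PyTorch code.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

example_input = torch.rand(1, 128)

# Compile (trace) the model ahead of time for the Inferentia2 chips.
neuron_model = torch_neuronx.trace(model, example_input)

# The result is a TorchScript module that runs on the Neuron device.
torch.jit.save(neuron_model, "model_neuron.pt")
print(neuron_model(example_input))

Running this requires an Inf2 instance with the Neuron SDK installed; the same application code that calls the model elsewhere can then load and invoke the saved TorchScript artifact.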
Benefits
Deploy 100B+ parameter, generative AI models at scale
Inf2 instances are the first inference-optimized instances in Amazon EC2 to support distributed inference at scale. You can now efficiently deploy models with hundreds of billions of parameters across multiple Inferentia chips on Inf2 instances, using the ultra-high-speed connectivity between the chips.
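For example, a decoder-only LLM can be sharded across the Inferentia2 chips in a single instance with tensor parallelism. The sketch below assumes the transformers-neuronx package and its LlamaForSampling class; the checkpoint path, tp_degree value, and sampling parameters are illustrative and should be checked against the current Neuron documentation for your release.

import torch
from transformers import AutoTokenizer
# Assumed import path; older transformers-neuronx releases expose
# the class under transformers_neuronx.llama.model instead.
from transformers_neuronx import LlamaForSampling

# Shard the model across 24 NeuronCores (tp_degree) on an inf2.48xlarge,
# casting weights to BF16 for inference.
model = LlamaForSampling.from_pretrained(
    "./llama-checkpoint",   # illustrative local checkpoint path
    tp_degree=24,
    amp="bf16",
    batch_size=1,
)
model.to_neuron()  # compile and load the sharded model onto the chips

tokenizer = AutoTokenizer.from_pretrained("./llama-checkpoint")
input_ids = tokenizer("Summarize the following text:", return_tensors="pt").input_ids
with torch.inference_mode():
    generated = model.sample(input_ids, sequence_length=256)
print(tokenizer.decode(generated[0]))

Because the shards exchange activations over NeuronLink rather than through the CPU, the tensor-parallel degree can be raised to fit models that exceed the memory of a single chip.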
Increase performance while significantly lowering inference costs
Inf2 instances are designed to deliver high performance at the lowest cost in Amazon EC2 for your DL deployments. They offer up to 4x higher throughput and up to 10x lower latency than Amazon EC2 Inf1 instances. Inf2 instances deliver up to 40% better price performance than other comparable Amazon EC2 instances.
Use your existing ML frameworks and libraries
Use the AWS Neuron SDK to extract the full performance of Inf2 instances. With Neuron, you can use your existing frameworks like PyTorch and TensorFlow and get optimized out-of-the-box performance for models in popular repositories like Hugging Face. Neuron supports runtime integrations with serving tools like TorchServe and TensorFlow Serving. It also helps you optimize performance with built-in profiling and debugging tools like neuron-top, and it integrates with popular visualization tools like TensorBoard.
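As a sketch of that workflow, the example below traces a Hugging Face Transformers model with torch-neuronx. The model name, sequence length, and the torchscript=True flag follow the public Neuron tutorials at the time of writing and should be treated as assumptions rather than a fixed recipe.

import torch
import torch_neuronx
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-cased-finetuned-mrpc"  # illustrative Hugging Face model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True).eval()

enc = tokenizer(
    "The company said quarterly profit rose.",
    "Quarterly profit increased, the company said.",
    max_length=128, padding="max_length", truncation=True, return_tensors="pt",
)
example = (enc["input_ids"], enc["attention_mask"], enc["token_type_ids"])

# Compile for Inferentia2, then save the TorchScript artifact for serving
# (for example, behind TorchServe).
neuron_model = torch_neuronx.trace(model, example)
torch.jit.save(neuron_model, "bert_neuron.pt")
print(neuron_model(*example))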
Meet your sustainability goals with an energy-efficient solution
Inf2 instances deliver up to 50% better performance per watt than other comparable Amazon EC2 instances. These instances and the underlying Inferentia2 chips use advanced silicon processes and hardware and software optimizations to deliver high energy efficiency when running DL models at scale. Use Inf2 instances to help meet your sustainability goals when deploying ultra-large models.
Features
Up to 2.3 petaflops with AWS Inferentia2
Inf2 instances are powered by up to 12 AWS Inferentia2 chips connected with ultra-high-speed NeuronLink for streamlined collective communications. They offer up to 2.3 petaflops of compute and up to 4x higher throughput and 10x lower latency than Inf1 instances.
Up to 384 GB high-bandwidth accelerator memory
To accommodate large DL models, Inf2 instances offer up to 384 GB of shared accelerator memory (32 GB HBM in every Inferentia2 chip, 4x larger than first-generation Inferentia) with 9.8 TB/s of total memory bandwidth (10x faster than first-generation Inferentia).
NeuronLink interconnect
For fast communication between Inferentia2 chips, Inf2 instances support 192 GB/s of NeuronLink, a high-speed, nonblocking interconnect. Inf2 is the only inference-optimized EC2 instance to offer this interconnect, a feature otherwise available only in more expensive training instances. For ultra-large models that do not fit into a single chip, data flows directly between chips over NeuronLink, bypassing the CPU completely. With NeuronLink, Inf2 instances support faster distributed inference and improve throughput and latency.
Optimized for novel data types with automatic casting
Inferentia2 supports FP32, TF32, BF16, FP16, UINT8, and the new configurable FP8 (cFP8) data type. AWS Neuron can take high-precision FP32 and FP16 models and autocast them to lower-precision data types while optimizing accuracy and performance. Autocasting reduces time to market by removing the need for lower-precision retraining and enabling higher-performance inference with smaller data types.
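A minimal sketch of how autocasting can be requested at compile time is shown below. It assumes the neuronx-cc compiler options --auto-cast and --auto-cast-type passed through torch_neuronx.trace; verify the exact flag names and accepted values against the Neuron compiler documentation for your release.

import torch
import torch_neuronx

model = torch.nn.Linear(1024, 1024).eval()   # placeholder FP32 model
example_input = torch.rand(4, 1024)

# Ask the Neuron compiler to autocast FP32 matrix-multiply operations to BF16.
# Flag names are assumptions based on the neuronx-cc documentation.
neuron_model = torch_neuronx.trace(
    model,
    example_input,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)
print(neuron_model(example_input))

No retraining of the FP32 model is needed; the compiler handles the precision reduction, which is the point the autocasting feature is making above.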
State-of-the-art DL optimizations
To support the fast pace of DL innovation, Inf2 instances include several innovations that make them flexible and extensible for deploying constantly evolving DL models. They have hardware optimizations and software support for dynamic input shapes. To allow for new operators in the future, they support custom operators written in C++. They also support stochastic rounding, a method of rounding probabilistically to achieve high performance and higher accuracy compared to legacy rounding modes.