Amazon EC2 Trn1 Instances

The best price performance for training deep learning models in the cloud

Vendor

Amazon Web Services (AWS)

Product details

High-performance, cost-effective training of generative AI models

Why Amazon EC2 Trn1 Instances?

Amazon Elastic Compute Cloud (EC2) Trn1 instances, powered by AWS Trainium chips, are purpose-built for high-performance deep learning (DL) training of generative AI models, including large language models (LLMs) and latent diffusion models. Trn1 instances offer up to 50% cost-to-train savings over comparable Amazon EC2 instances. You can use Trn1 instances to train DL and generative AI models with 100B+ parameters across a broad set of applications, such as text summarization, code generation, question answering, image and video generation, recommendation, and fraud detection. The AWS Neuron SDK helps developers train models on AWS Trainium and deploy them on AWS Inferentia chips. It integrates natively with frameworks such as PyTorch and TensorFlow, so you can continue using your existing code and workflows to train models on Trn1 instances.

Benefits

Reduce training times for 100B+ parameter models

Trn1 instances are purpose-built for high-performance DL training and reduce training times from months to weeks, or even days. With reduced training times, you can iterate faster, build more innovative models, and increase productivity. Trn1n instances deliver up to 20% faster time-to-train than Trn1 instances for models that benefit from increased network bandwidth.

Lower your fine-tuning and pre-training costs

Trn1 instances deliver high performance while offering up to 50% cost-to-train savings over comparable Amazon EC2 instances.

Use your existing ML frameworks and libraries

Use the AWS Neuron SDK to extract the full performance of Trn1 instances. With Neuron, you can use popular ML frameworks like PyTorch and TensorFlow and continue to use your existing code and workflows to train models on Trn1 instances. To quickly get started with Trn1 instances, see popular model examples in the Neuron documentation.
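As an illustration, here is a minimal sketch of that workflow, assuming the torch-neuronx package from the Neuron SDK is installed (it builds on PyTorch/XLA, which exposes each NeuronCore as an XLA device). The model, batch shapes, and hyperparameters are placeholders, not a recommended configuration:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # PyTorch/XLA, shipped with torch-neuronx

# On a Trn1 instance, xla_device() maps to a NeuronCore.
device = xm.xla_device()

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # Random placeholder batch; a real job would pull from a DataLoader.
    x = torch.randn(64, 784, device=device)
    y = torch.randint(0, 10, (64,), device=device)

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Steps the optimizer and triggers execution of the lazily built XLA graph.
    xm.optimizer_step(optimizer)
```

The structure is ordinary PyTorch; the only Trainium-specific pieces are the XLA device and xm.optimizer_step.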

Scale up to 6 exaflops with EC2 UltraClusters

Trn1 instances support up to 800 Gbps of second-generation Elastic Fabric Adapter (EFAv2) network bandwidth. Trn1n instances support up to 1600 Gbps of EFAv2 network bandwidth to deliver even higher performance for network-intensive models. Both instance types are deployed in EC2 UltraClusters that enable scaling up to 30,000 Trainium chips, interconnected with a nonblocking petabit-scale network to provide 6 exaflops of compute performance.

Features

Up to 3 petaflops with AWS Trainium

Trn1 instances are powered by up to 16 AWS Trainium chips that are purpose-built to accelerate DL training and deliver up to 3 petaflops of FP16/BF16 compute power. Each chip includes two second-generation NeuronCores.
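A quick back-of-the-envelope check of these published figures (illustrative arithmetic only; every input is a number quoted on this page):

```python
instance_pflops = 3.0        # up to 3 petaflops FP16/BF16 per instance
chips_per_instance = 16
cores_per_chip = 2           # second-generation NeuronCores

tflops_per_chip = instance_pflops * 1000 / chips_per_instance
print(f"~{tflops_per_chip:.1f} TFLOPS per Trainium chip")                # ~187.5
print(f"~{tflops_per_chip / cores_per_chip:.1f} TFLOPS per NeuronCore")  # ~93.8

# Consistency check against the UltraClusters figure above:
print(f"~{30_000 * tflops_per_chip / 1e6:.1f} exaflops at 30,000 chips")  # ~5.6, i.e. "6 exaflops"
```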

Up to 512 GB high-bandwidth accelerator memory

To support efficient data and model parallelism, each Trn1 instance has 512 GB of shared accelerator memory (HBM) with 9.8 TB/s of total memory bandwidth.
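Dividing the instance-level numbers by the 16 Trainium chips gives the per-chip figures, and relating total bandwidth to the 3 petaflops of BF16 compute gives a rough balance metric (again, arithmetic on the quoted figures only):

```python
hbm_gb, bw_tbs, chips = 512, 9.8, 16

print(f"{hbm_gb / chips:.0f} GB of HBM per chip")                      # 32 GB
print(f"~{bw_tbs / chips * 1000:.1f} GB/s of HBM bandwidth per chip")  # ~612.5 GB/s

# Bytes of HBM bandwidth available per BF16 FLOP at peak:
print(f"~{(bw_tbs * 1e12) / 3e15:.4f} bytes/FLOP")                     # ~0.0033
```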

High-performance networking and storage

To support training of network-intensive models, such as Mixture of Experts (MoE) and Generative Pre-Trained Transformers (GPT), each Trn1n instance delivers up to 1600 Gbps of EFAv2 networking bandwidth. Each Trn1 instance supports up to 800 Gbps of EFAv2 bandwidth. EFAv2 speeds up distributed training by delivering up to 50% improvement in collective communications performance over first-generation EFA. These instances also support up to 80 Gbps of Amazon Elastic Block Store (EBS) bandwidth and up to 8 TB of local NVMe solid state drive (SSD) storage for fast workload access to large datasets.
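As a sketch of provisioning one of these instances with an EFA interface via boto3 (the AMI, subnet, and security group IDs are hypothetical placeholders you must replace, and a real cluster setup would also involve a placement group and, on trn1.32xlarge, multiple EFA interfaces):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # pick a region that offers Trn1

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: a Neuron-enabled AMI for your region
    InstanceType="trn1.32xlarge",      # or "trn1n.32xlarge" for up to 1600 Gbps of EFAv2
    MinCount=1,
    MaxCount=1,
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "InterfaceType": "efa",                      # request an Elastic Fabric Adapter
        "SubnetId": "subnet-0123456789abcdef0",      # placeholder
        "Groups": ["sg-0123456789abcdef0"],          # placeholder
    }],
)
print(response["Instances"][0]["InstanceId"])
```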

NeuronLink interconnect

For fast connectivity between Trainium chips and streamlined collective communications, Trn1 instances support up to 768 GB/s of NeuronLink, a high-speed, nonblocking interconnect.

State-of-the-art data types and DL optimizations

To deliver high performance while meeting accuracy goals, Trn1 instances are optimized for FP32, TF32, BF16, FP16, UINT8, and the new configurable FP8 (cFP8) data type. To keep pace with DL innovation and generative AI, Trn1 instances include several features that make them flexible and extensible for training constantly evolving DL models. They have hardware optimizations and software support for dynamic input shapes. To accommodate operators that do not exist yet, they support custom operators written in C++. They also support stochastic rounding, a method of rounding probabilistically that achieves both high performance and higher accuracy than legacy round-to-nearest modes.
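To see why stochastic rounding helps, consider a toy number format that can only hold integers (a pure-Python illustration of the idea, not Trainium's implementation): round-to-nearest silently drops every small increment, while stochastic rounding is unbiased in expectation.

```python
import random

def round_nearest(x: float) -> float:
    return float(round(x))

def round_stochastic(x: float) -> float:
    lo = float(int(x))  # floor, for non-negative x
    frac = x - lo
    # Round up with probability equal to the fractional part,
    # so the expected result equals x.
    return lo + 1.0 if random.random() < frac else lo

def accumulate(rounder, increment=0.3, steps=1000):
    total = 0.0
    for _ in range(steps):
        total = rounder(total + increment)  # every add is rounded back to the integer grid
    return total

print(accumulate(round_nearest))     # 0.0: each +0.3 rounds straight back down
print(accumulate(round_stochastic))  # ~300 on average: small updates survive
```

The same effect arises when accumulating small gradient updates in low-precision formats such as BF16 or FP8, which is where hardware support for stochastic rounding pays off.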