
The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive, that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox networking across nodes. Leading deep learning frameworks such as Caffe2, Chainer, MXNet, PyTorch, and TensorFlow have integrated NCCL to accelerate deep learning training on multi-GPU, multi-node systems.
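As a sketch of what the API looks like in practice, the following single-process program performs a sum all-reduce across every local GPU, following the pattern shown in NCCL's documentation: one communicator per device created with `ncclCommInitAll`, and the per-GPU calls wrapped in `ncclGroupStart`/`ncclGroupEnd` so one thread can drive all devices. The buffer size (1M floats) is an arbitrary choice for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); \
  if (r != ncclSuccess) { \
    fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r)); exit(1); \
  } } while (0)

int main(void) {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);

  // One communicator per local GPU, all created by this single process.
  // Passing NULL as the device list uses devices 0..nDev-1.
  ncclComm_t *comms = (ncclComm_t *)malloc(nDev * sizeof(ncclComm_t));
  CHECK_NCCL(ncclCommInitAll(comms, nDev, NULL));

  size_t count = 1 << 20;  // 1M floats per GPU (arbitrary example size)
  float **sendbuf = (float **)malloc(nDev * sizeof(float *));
  float **recvbuf = (float **)malloc(nDev * sizeof(float *));
  cudaStream_t *streams = (cudaStream_t *)malloc(nDev * sizeof(cudaStream_t));
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Group the per-device collectives so a single thread can launch them
  // all without deadlocking; NCCL starts them at ncclGroupEnd.
  CHECK_NCCL(ncclGroupStart());
  for (int i = 0; i < nDev; ++i)
    CHECK_NCCL(ncclAllReduce(sendbuf[i], recvbuf[i], count,
                             ncclFloat, ncclSum, comms[i], streams[i]));
  CHECK_NCCL(ncclGroupEnd());

  // Collectives are asynchronous: wait for completion on each stream.
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }

  for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
  return 0;
}
```

After `ncclGroupEnd` returns and the streams synchronize, each GPU's `recvbuf` holds the element-wise sum of every GPU's `sendbuf`.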
Features
- Automatic Topology Detection: Detects high-bandwidth paths on AMD, ARM, PCIe Gen4, and IB HDR platforms.
- High Performance: Achieves up to 2x peak bandwidth with in-network all-reduce operations utilizing SHARPv2.
- Graph Search: Optimizes the set of rings and trees for highest bandwidth and lowest latency.
- Multi-Threaded and Multi-Process Support: Compatible with single-threaded, multi-threaded, and multi-process applications.
- InfiniBand and Networking Support: Supports InfiniBand verbs, libfabric, RoCE, and IP Socket internode communication.
- Adaptive Routing: Reroutes traffic to alleviate congested ports with InfiniBand adaptive routing.
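For the multi-process case (one rank per GPU, possibly spanning nodes), the usual pattern from NCCL's documentation is that rank 0 creates an `ncclUniqueId` and distributes it out-of-band before every rank calls `ncclCommInitRank`. The sketch below assumes MPI as the out-of-band transport and an illustrative 8 GPUs per node; any mechanism that delivers the id to all ranks works.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

// One MPI rank per GPU; rank 0 creates the NCCL id and MPI broadcasts it.
int main(int argc, char *argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // The unique id must be identical on every rank; it is opaque bytes,
  // so broadcasting it as MPI_BYTE is sufficient.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  cudaSetDevice(rank % 8);  // assumption: up to 8 GPUs per node
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  // ... issue collectives (ncclAllReduce, ncclBroadcast, ...) on comm ...

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```

The same `ncclCommInitRank` call works whether the ranks are threads in one process or processes on different nodes, which is what the multi-threaded and multi-process support above refers to.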
Benefits
- Ease of Programming: Uses a simple C API, accessible from various programming languages, and closely follows the popular collectives API defined by MPI.
- Compatibility: Compatible with virtually any multi-GPU parallelization model.
- Performance Optimization: Removes the need for developers to optimize their applications for specific machines, providing fast collectives over multiple GPUs both within and across nodes.
- Scalability: Massively scales deep learning training with optimized communication primitives.