NVIDIA cuBLAS

NVIDIA cuBLAS is a GPU-accelerated library for AI and HPC applications. It provides drop-in, industry-standard BLAS and GEMM APIs, along with API extensions supporting fusions that are highly optimized for NVIDIA GPUs. The library also includes extensions for batched operations, execution across multiple GPUs, and mixed- and low-precision execution, with additional tuning for the best performance.

Vendor

NVIDIA

Company Website

[Figures: cuBLAS performance charts, covering integer, FP64, strong-scaling, and mixed-precision workloads]
Product details

NVIDIA cuBLAS is a GPU-accelerated library designed to accelerate AI and HPC applications. It provides drop-in, industry-standard BLAS and GEMM APIs, along with API extensions supporting fusions that are highly optimized for NVIDIA GPUs. The library also includes extensions for batched operations, execution across multiple GPUs, and mixed- and low-precision execution, with additional tuning for the best performance. The cuBLAS library is included in both the NVIDIA HPC SDK and the CUDA Toolkit.
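At its core, cuBLAS implements the standard BLAS GEMM operation, C = alpha*A*B + beta*C, on column-major (Fortran-order) matrices. As an illustration of the semantics only, not of the library's GPU implementation, the following minimal plain-C sketch shows what a call like cublasSgemm computes when no transposes are requested; the function name sgemm_ref is hypothetical:

```c
#include <stddef.h>

/* Illustrative CPU reference for the operation cuBLAS's GEMM performs:
 * C = alpha * A * B + beta * C, all matrices stored column-major,
 * as BLAS requires. A is m x k, B is k x n, C is m x n;
 * lda/ldb/ldc are the leading dimensions of each matrix. */
static void sgemm_ref(int m, int n, int k,
                      float alpha, const float *A, int lda,
                      const float *B, int ldb,
                      float beta, float *C, int ldc)
{
    for (int j = 0; j < n; ++j) {        /* column of C */
        for (int i = 0; i < m; ++i) {    /* row of C */
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + (size_t)p * lda] * B[p + (size_t)j * ldb];
            C[i + (size_t)j * ldc] = alpha * acc
                                   + beta * C[i + (size_t)j * ldc];
        }
    }
}
```

The real library performs the same computation on the GPU, dispatching to Tensor Core kernels where the data types and sizes allow.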

Features

  • Complete Support for BLAS Routines: Supports all 152 standard BLAS routines, ensuring comprehensive functionality for various linear algebra operations.
  • Optimized GEMM Extensions: Includes GEMM and GEMM extensions with fusion optimized for Tensor Cores, providing high performance for matrix multiplication tasks.
  • Multi-GPU and Multi-Node Capabilities: Offers APIs for single-process multi-GPU (cuBLASXt) and multi-node multi-GPU (cuBLASMp) operations, enabling efficient dispatching of workloads across multiple GPUs.
  • Mixed- and Low-Precision Execution: Supports half-precision and integer matrix multiplication, leveraging Tensor Cores for acceleration.
  • CUDA Streams: Utilizes CUDA streams so independent operations and data transfers can run concurrently, improving throughput and GPU utilization.
  • Device-Side API Extensions: Provides device-side API extensions (cuBLASDx) for performing BLAS calculations inside CUDA kernels, reducing latency and improving performance.
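The batched extension listed above applies one GEMM to many independent, same-sized problems in a single call. The plain-C sketch below shows only the semantics of a pointer-array batched GEMM such as cublasSgemmBatched (the library itself executes the whole batch concurrently on the GPU); the name sgemm_batched_ref is illustrative:

```c
#include <stddef.h>

/* Illustrative CPU reference for batched GEMM semantics:
 * for each batch index b, C[b] = alpha * A[b] * B[b] + beta * C[b].
 * Every problem in the batch shares the same sizes and leading
 * dimensions; matrices are column-major, as in BLAS. */
static void sgemm_batched_ref(int m, int n, int k, float alpha,
                              const float * const A[], int lda,
                              const float * const B[], int ldb,
                              float beta, float * const C[], int ldc,
                              int batch_count)
{
    for (int b = 0; b < batch_count; ++b) {
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += A[b][i + (size_t)p * lda]
                         * B[b][p + (size_t)j * ldb];
                C[b][i + (size_t)j * ldc] = alpha * acc
                                          + beta * C[b][i + (size_t)j * ldc];
            }
        }
    }
}
```

Batching is what makes cuBLAS efficient on workloads dominated by many small matrices, where launching one kernel per GEMM would leave the GPU underutilized.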

Benefits

  • High Performance: Optimized for NVIDIA GPUs, leveraging Tensor Cores for accelerated matrix multiplication.
  • Scalability: Supports multi-GPU and multi-node operations, making it suitable for large-scale HPC applications.
  • Flexibility: Offers a range of APIs for different levels of operations, from vector-vector to matrix-matrix calculations.
  • Efficiency: Reduces latency and improves performance with device-side API extensions and CUDA streams.
  • Comprehensive Functionality: Supports a wide range of BLAS routines and GEMM extensions, ensuring versatility for various applications.