
NVIDIA NeMo™ Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides pre-built pipelines for generating synthetic data to customize and evaluate generative AI systems.
Vendor
NVIDIA




NVIDIA NeMo™ Curator processes large-scale text, image, and video data for training and customizing generative AI models, with the goal of improving model accuracy. Its pre-built pipelines for generating synthetic data help developers curate high-quality datasets and train accurate generative AI models across industries including finance, retail, telecommunications, automotive (autonomous vehicles, AV), and robotics. NeMo Curator is part of the NVIDIA Cosmos™ platform, where it offers video processing pipelines for building or customizing world foundation models (WFMs). Together with NeMo microservices, it helps create data flywheels that continuously optimize generative AI agents and improve the user experience.
Features
- Comprehensive Data Processing: Streamlines data downloading, extraction, cleaning, quality filtering, deduplication, and blending or shuffling, exposing each step through Pythonic APIs.
- Text Data Processing: Includes downloading data, cleaning, applying heuristic filters, deduplication, advanced quality filtering using classifier models, and data blending.
- Synthetic Data Generation: Offers tools for generating synthetic data using pre-built pipelines or custom models, compatible with the OpenAI API.
- Video Data Processing: Provides pipelines for video decoding, splitting, transcoding, captioning, and text embedding for downstream semantic search and deduplication.
- Image Data Processing: Supports downloading datasets, creating CLIP embeddings, applying NSFW and aesthetic filters, semantic deduplication, and creating high-quality datasets.
- Scalability: Capable of processing more than 100 PB of data, using NVIDIA RAPIDS™ libraries and Dask to scale across multi-node, multi-GPU environments.
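
To make the text processing steps listed above more concrete, here is a minimal sketch of cleaning, heuristic filtering, exact deduplication, and shuffling. It uses only the Python standard library and is not the NeMo Curator API; the function names (`clean`, `passes_heuristics`, `curate`) and the thresholds (`min_words`, `max_symbol_ratio`) are hypothetical choices made for illustration.

```python
import hashlib
import random


def clean(text: str) -> str:
    """Collapse runs of whitespace and strip leading/trailing space."""
    return " ".join(text.split())


def passes_heuristics(text: str, min_words: int = 5,
                      max_symbol_ratio: float = 0.3) -> bool:
    """Keep documents that are long enough and mostly alphanumeric.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if len(text.split()) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def curate(docs, seed: int = 0):
    """Clean, filter, exact-deduplicate, then shuffle a list of documents."""
    seen = set()
    kept = []
    for raw in docs:
        text = clean(raw)
        if not passes_heuristics(text):
            continue
        # Exact deduplication: hash the case-folded, cleaned text.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    # Seeded shuffle so the output order is reproducible.
    random.Random(seed).shuffle(kept)
    return kept


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog.",
        "the  quick brown fox jumps over   the lazy dog.",  # duplicate after cleaning
        "too short",
        "@@@ ### $$$ %%% ^^^ &&& ***",  # mostly symbols
        "Curated data helps models converge with fewer tokens.",
    ]
    for doc in curate(sample):
        print(doc)
```

This single-process sketch only catches exact duplicates after normalization. The fuzzy and semantic deduplication, classifier-based quality filtering, and multi-node, multi-GPU execution described in the feature list are outside its scope.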
Benefits
- Higher Accuracy with Less Data: Achieves higher model accuracy with less data and faster model convergence, reducing training time.
- Enhanced Performance: Provides up to 16X faster text processing and 89X faster video processing compared to alternatives.
- Customizable and Modular: Offers a customizable interface for building data processing pipelines tailored to specific needs.
- Industry Applications: Suitable for various industries, including finance, retail, telecommunications, automotive, and robotics.
- Continuous Optimization: Enables the creation of data flywheels to continuously optimize generative AI agents, improving user experience.