
NeMo Retriever Extraction is a scalable microservice for document content and metadata extraction, supporting various document types.
Vendor
NVIDIA
NeMo Retriever Extraction, also known as NVIDIA Ingest, is a scalable, performance-oriented microservice for document content and metadata extraction. It parses PDF, Word, and PowerPoint documents, using specialized NVIDIA image NIMs to find, contextualize, and extract text, tables, charts, and images for downstream generative applications. The service splits each document into pages and processes them in parallel: artifacts on each page are classified, extracted, and contextualized through optical character recognition (OCR), then emitted in a well-defined JSON schema. NeMo Retriever Extraction can also manage the computation of embeddings for the extracted content and store it in a vector database such as Milvus.
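The page-parallel flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the service's actual API: `extract_page`, `extract_document`, and the record fields (`document_id`, `page`, `type`, `content`) are hypothetical names, and the "OCR" step is a placeholder that just trims whitespace.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: a "document" is a list of pages, and each page is a
# list of (artifact_type, raw_content) pairs. A real service would call
# detection and OCR models here instead.
def extract_page(doc_id, page_number, page):
    """Turn one page into a list of schema-shaped records."""
    records = []
    for artifact_type, raw in page:
        records.append({
            "document_id": doc_id,
            "page": page_number,
            "type": artifact_type,   # "text", "table", "chart", or "image"
            "content": raw.strip(),  # placeholder for OCR output
        })
    return records

def extract_document(doc_id, pages, max_workers=4):
    """Process pages in parallel, keeping results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_page = pool.map(
            lambda item: extract_page(doc_id, item[0], item[1]),
            enumerate(pages, start=1),
        )
        # Flatten the per-page lists while the pool is still open.
        return [record for page_records in per_page for record in page_records]

if __name__ == "__main__":
    sample_pages = [
        [("text", "  Quarterly summary  ")],
        [("table", "region|revenue"), ("chart", "revenue by region")],
    ]
    print(json.dumps(extract_document("doc-1", sample_pages), indent=2))
```

Because `pool.map` yields results in input order, the flattened output stays sorted by page even though pages are processed concurrently.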
Features
- Scalable Extraction: Efficiently processes and extracts content from various document types.
- Parallel Processing: Splits documents into pages for parallel classification and extraction.
- OCR Integration: Uses OCR to contextualize extracted artifacts and emits them in a well-defined JSON schema.
- Embeddings Management: Optionally computes embeddings for extracted content.
- Vector Database Support: Stores extracted content in a vector database such as Milvus.
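To make the embeddings and vector database features concrete, here is a minimal Python sketch of the embed, store, and search cycle. Everything in it is an assumption for illustration: `toy_embedding` is a hash-based placeholder with no semantic meaning (a real deployment would call an embedding model), and `InMemoryVectorStore` is a hypothetical stand-in for a vector database such as Milvus, not its client API.

```python
import hashlib
import math

def toy_embedding(text, dim=8):
    """Deterministic placeholder vector; NOT a real embedding model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit length, so dot product = cosine

class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self.rows = []  # list of (vector, record) pairs

    def insert(self, record):
        # Embed the extracted content and keep the record alongside it.
        self.rows.append((toy_embedding(record["content"]), record))

    def search(self, query, top_k=1):
        # Rank stored records by cosine similarity to the query vector.
        q = toy_embedding(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), rec)
            for vec, rec in self.rows
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [rec for _, rec in scored[:top_k]]

if __name__ == "__main__":
    store = InMemoryVectorStore()
    store.insert({"page": 1, "type": "text", "content": "Quarterly summary"})
    store.insert({"page": 2, "type": "table", "content": "region|revenue"})
    print(store.search("region|revenue"))
```

With this placeholder embedding, only an exact-match query is guaranteed to rank its own record first; similarity between different strings is arbitrary.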
Benefits
- High Performance: Optimized for scalable and efficient document processing.
- Versatility: Handles multiple document formats and types.
- Contextualization: Provides detailed contextual extraction through OCR.
- Integration: Easily integrates with downstream applications and databases.
- Reliability: Ensures accurate and consistent extraction and storage of document content.