Apache CarbonData is a fully indexed columnar data format optimized for fast analytics on big data platforms. It supports advanced compression, multi-level indexing, and seamless integration with Spark, enabling efficient queries over petabytes of data.

Vendor

The Apache Software Foundation

Company Website

Product details

Apache CarbonData

Apache CarbonData is a high-performance, fully indexed columnar data format designed for big data analytics. It is native to the Hadoop ecosystem and deeply integrated with Apache Spark, enabling fast, interactive querying over petabyte-scale datasets. Originally developed by Huawei and now a top-level Apache Software Foundation project, CarbonData is optimized for analytical workloads through advanced indexing, compression, and encoding techniques.

Features

  • Columnar Storage Format: Efficient data layout for analytical queries.
  • Multi-Level Indexing: Reduces I/O and CPU usage by pruning unnecessary data scans.
  • Advanced Compression & Encoding: Improves storage efficiency and query performance.
  • Custom DDL/DML Support: Extended SQL syntax for table creation, data loading, updates, and deletes.
  • Segment Management: Handles incremental data loads with transactional capabilities.
  • Partitioning: Supports Hive-style and CarbonData-specific hash, list, and range partitions.
  • Compaction: Merges segments to optimize query performance.
  • External Table Support: Reads CarbonData files and infers schema for SQL querying.
  • Bloom Filter Index: Accelerates filtering operations.
  • Lucene Index: Enables efficient text data indexing.
  • Materialized Views (MV): Supports query rewriting for pre-aggregated or joined data.
  • Streaming Support: Near real-time data ingestion via Spark Structured Streaming and a streaming DSL.
  • SDK Access: Read/write CarbonData files from non-Spark applications.
  • Cloud & Distributed Storage: Compatible with S3, OBS, HDFS, and Alluxio.
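
To illustrate how a Bloom filter index can accelerate filtering, here is a minimal sketch in Python. It is a toy model and not CarbonData's implementation: the class `BloomFilter`, the helper `blocks_to_scan`, the block names, and the sample values are all hypothetical. The idea shown is that a small per-block filter can rule out blocks that cannot contain a value, so only the remaining blocks need to be read.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        return all(self.bits[pos] for pos in self._positions(value))


def blocks_to_scan(block_filters, value):
    """Return ids of blocks whose filter cannot rule out the value."""
    return [bid for bid, bf in block_filters.items() if bf.might_contain(value)]


# Hypothetical data: two blocks of a user-name column.
blocks = {"block-0": ["alice", "bob"], "block-1": ["carol", "dave"]}
filters = {}
for block_id, values in blocks.items():
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    filters[block_id] = bf

candidates = blocks_to_scan(filters, "carol")
```

A block that really holds the value is always among the candidates. A block that does not hold it is usually skipped, but it can occasionally appear as a false positive, which costs an extra read and never a wrong result.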

Capabilities

  • Big Data Query Acceleration: Targets sub-second response times for selective queries on terabyte-scale data.
  • Spark Integration: Deeply embedded in Spark’s DataSource API for optimized execution.
  • Flexible Data Management: Supports updates, deletes, and incremental loads.
  • Cross-Platform Compatibility: Works across Hadoop-based systems and cloud environments.
  • Custom Application Integration: SDKs for Java and C++ enable CarbonData usage outside Spark.
  • Interactive Analytics: Designed for fast, ad-hoc querying in analytical environments.
  • Data Pruning: Efficiently reduces scanned data using indexing and partitioning.
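
The data pruning idea above can be sketched with min/max statistics, a common technique in columnar formats. This is a simplified illustration under stated assumptions and not CarbonData's actual index structure: `BlockStats`, `prune_blocks`, and the sample ranges are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    """Per-block summary of one column: smallest and largest stored value."""
    block_id: str
    min_value: int
    max_value: int


def prune_blocks(stats, low, high):
    """Keep only blocks whose [min, max] range overlaps the filter range [low, high]."""
    return [s.block_id for s in stats if s.max_value >= low and s.min_value <= high]


# Hypothetical statistics for three blocks of an integer column.
stats = [
    BlockStats("block-0", 1, 100),
    BlockStats("block-1", 101, 200),
    BlockStats("block-2", 201, 300),
]

# A filter such as "value BETWEEN 150 AND 220" only needs two of the three blocks.
selected = prune_blocks(stats, 150, 220)
print(selected)  # ['block-1', 'block-2']
```

The more closely the data is sorted or partitioned on the filtered column, the narrower each block's range becomes and the more blocks a filter can skip.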

Benefits

  • Performance: Indexing and compression reduce the amount of data scanned, which speeds up queries.
  • Scalability: Designed to handle petabyte-scale datasets.
  • Cost Efficiency: Runs on commodity hardware with optimized resource usage.
  • Flexibility: Supports various data ingestion and querying patterns.
  • Open Source: Free to use under Apache License 2.0.
  • Ecosystem Integration: Works seamlessly with Spark, Hive, Presto, and cloud storage systems.
  • Developer-Friendly: Rich API support and SQL extensions for custom applications.