
Apache CarbonData
The Apache Software Foundation
Apache CarbonData is a fully indexed columnar data format optimized for fast analytics on big data platforms. It supports advanced compression, multi-level indexing, and seamless integration with Spark, enabling efficient queries over petabytes of data.
Vendor
The Apache Software Foundation



Product details
Apache CarbonData
Apache CarbonData is a high-performance, fully indexed columnar data format designed for big data analytics. It is native to the Hadoop ecosystem and deeply integrated with Apache Spark, enabling fast, interactive querying over petabyte-scale datasets. Originally developed by Huawei and now a top-level Apache Software Foundation project, CarbonData is optimized for analytical workloads with advanced indexing, compression, and encoding techniques.
Features
- Columnar Storage Format: Efficient data layout for analytical queries.
- Multi-Level Indexing: Reduces I/O and CPU usage by pruning unnecessary data scans.
- Advanced Compression & Encoding: Improves storage efficiency and query performance.
- Custom DDL/DML Support: Extended SQL syntax for table creation, data loading, updates, and deletes.
- Segment Management: Handles incremental data loads with transactional capabilities.
- Partitioning: Supports Hive-style and CarbonData-specific hash, list, and range partitions.
- Compaction: Merges segments to optimize query performance.
- External Table Support: Reads CarbonData files and infers schema for SQL querying.
- Bloom Filter Index: Accelerates filtering operations.
- Lucene Index: Enables efficient text data indexing.
- Materialized Views (MV): Supports query rewriting for pre-aggregated or joined data.
- Streaming Support: Near real-time data ingestion via Spark Streaming DSL.
- SDK Access: Read/write CarbonData files from non-Spark applications.
- Cloud & Distributed Storage: Compatible with S3, OBS, HDFS, and Alluxio.
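To illustrate how a Bloom filter index can accelerate filtering, here is a minimal, self-contained sketch in Python. It is a toy model and not CarbonData's implementation: the `BloomFilter` class, the `blocks_to_scan` helper, the block names, and the sizing parameters are all hypothetical. The idea is that each data block keeps a small bit array summarizing its values, so an equality filter can skip any block whose bit array rules the value out.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        return all(self.bits[pos] for pos in self._positions(value))


def blocks_to_scan(block_filters, value):
    """Return ids of blocks whose filter cannot rule out `value`."""
    return [bid for bid, bf in block_filters.items() if bf.might_contain(value)]


# Hypothetical blocks of a "name" column (illustrative data only).
blocks = {"block-0": ["alice", "bob"], "block-1": ["carol", "dave"]}
filters = {}
for block_id, values in blocks.items():
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    filters[block_id] = bf

# A block that really holds the value is always returned; other blocks
# are usually skipped, but a false positive can occasionally keep one.
print(blocks_to_scan(filters, "carol"))
```

The trade-off in this sketch is the usual one for Bloom filters: a larger bit array or a better-tuned number of hashes lowers the false-positive rate at the cost of more index storage.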
Capabilities
- Big Data Query Acceleration: Targets sub-second response times on terabyte-scale data.
- Spark Integration: Deeply embedded in Spark’s DataSource API for optimized execution.
- Flexible Data Management: Supports updates, deletes, and incremental loads.
- Cross-Platform Compatibility: Works across Hadoop-based systems and cloud environments.
- Custom Application Integration: SDKs for Java and C++ enable CarbonData usage outside Spark.
- Interactive Analytics: Designed for fast, ad-hoc querying in analytical environments.
- Data Pruning: Efficiently reduces scanned data using indexing and partitioning.
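The data pruning idea can be sketched with per-block min/max statistics. The following Python example is a simplified illustration under stated assumptions, not CarbonData's actual index layout: `BlockStats`, `prune_blocks`, and the sample ranges are hypothetical names and values. A range filter only needs to read blocks whose stored [min, max] interval overlaps the requested interval.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    """Hypothetical per-block statistics for one numeric column."""
    block_id: str
    min_value: int
    max_value: int


def prune_blocks(stats, low, high):
    """Keep only blocks whose [min, max] range overlaps the filter [low, high]."""
    return [
        s.block_id
        for s in stats
        if s.max_value >= low and s.min_value <= high
    ]


# Illustrative statistics for three blocks of a sorted column.
stats = [
    BlockStats("block-0", 1, 100),
    BlockStats("block-1", 101, 200),
    BlockStats("block-2", 201, 300),
]

# Filter: value BETWEEN 150 AND 180, so only block-1 can contain matches.
selected = prune_blocks(stats, 150, 180)
print(selected)  # ['block-1']
```

Pruning of this kind works best when the data is sorted or clustered on the filtered column, because the per-block ranges then overlap less and more blocks can be skipped.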
Benefits
- Performance: Queries are significantly faster due to indexing and compression.
- Scalability: Designed for petabyte-scale datasets.
- Cost Efficiency: Runs on commodity hardware with optimized resource usage.
- Flexibility: Supports various data ingestion and querying patterns.
- Open Source: Free to use under Apache License 2.0.
- Ecosystem Integration: Works seamlessly with Spark, Hive, Presto, and cloud storage systems.
- Developer-Friendly: Rich API support and SQL extensions for custom applications.