Name: Apache CarbonData
Brand: The Apache Software Foundation


Apache CarbonData is a fully indexed columnar data format optimized for fast analytics on big data platforms. It supports advanced compression, multi-level indexing, and seamless integration with Spark, enabling efficient queries over petabytes of data.

Vendor

The Apache Software Foundation

Company Website

https://carbondata.apache.org

YouTube

https://www.youtube.com/c/TheApacheFoundation

Product details

Apache CarbonData

Apache CarbonData is a high-performance, fully indexed columnar data format designed for big data analytics. It is native to the Hadoop ecosystem and deeply integrated with Apache Spark, enabling fast, interactive querying over petabyte-scale datasets. Originally developed by Huawei and now a top-level Apache Software Foundation project, CarbonData is optimized for analytical workloads through advanced indexing, compression, and encoding techniques.

Features

Columnar Storage Format: Efficient data layout for analytical queries.
Multi-Level Indexing: Reduces I/O and CPU usage by pruning unnecessary data scans.
Advanced Compression & Encoding: Improves storage efficiency and query performance.
Custom DDL/DML Support: Extended SQL syntax for table creation, data loading, updates, and deletes.
Segment Management: Handles incremental data loads with transactional capabilities.
Partitioning: Supports Hive-style and CarbonData-specific hash, list, and range partitions.
Compaction: Merges segments to optimize query performance.
External Table Support: Reads CarbonData files and infers schema for SQL querying.
Bloom Filter Index: Accelerates filtering operations.
Lucene Index: Enables efficient text data indexing.
Materialized Views (MV): Supports query rewriting for pre-aggregated or joined data.
Streaming Support: Near real-time data ingestion via Spark Streaming DSL.
SDK Access: Read/write CarbonData files from non-Spark applications.
Cloud & Distributed Storage: Compatible with S3, OBS, HDFS, and Alluxio.
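
To illustrate why a Bloom filter index can accelerate filtering, here is a minimal, generic sketch in Python. It is not CarbonData's implementation: the class name, the bit-array size, the number of hash functions, and the use of SHA-256 are all assumptions chosen for illustration. The idea is that a small summary built at load time can answer "this value is definitely not in this block", so the block can be skipped without being read.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical; for illustration only)."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; real filters pack bits.
        self.bits = bytearray(size_bits)

    def _positions(self, item: str):
        # Derive num_hashes positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


# Build one filter for the values stored in a (hypothetical) data block.
block_filter = BloomFilter()
for city in ["Shenzhen", "Bangalore", "Berlin"]:
    block_filter.add(city)

# A query engine would read the block only if the filter says "maybe".
should_read_block = block_filter.might_contain("Berlin")
```

A Bloom filter never reports a stored value as absent, but it can occasionally report an absent value as possibly present, so a positive answer still requires reading the block to confirm.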

Capabilities

Big Data Query Acceleration: Targets sub-second response times on terabyte-scale data.
Spark Integration: Deeply embedded in Spark’s DataSource API for optimized execution.
Flexible Data Management: Supports updates, deletes, and incremental loads.
Cross-Platform Compatibility: Works across Hadoop-based systems and cloud environments.
Custom Application Integration: SDKs for Java and C++ enable CarbonData usage outside Spark.
Interactive Analytics: Designed for fast, ad-hoc querying in analytical environments.
Data Pruning: Efficiently reduces scanned data using indexing and partitioning.
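
The data pruning idea above can be sketched with a generic min/max index, a common technique in columnar formats. This is a simplified illustration, not CarbonData's actual code: the `Block` class and `scan_equals` function are hypothetical names, and each block here holds one small integer column. A block is scanned only when the filter value falls inside its recorded range.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Toy stand-in for one stored block of a single integer column."""
    values: List[int]

    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def scan_equals(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocks actually scanned."""
    matches: List[int] = []
    scanned = 0
    for block in blocks:
        # Prune: the target cannot be in a block whose range excludes it.
        if target < block.min_value or target > block.max_value:
            continue
        scanned += 1
        matches.extend(v for v in block.values if v == target)
    return matches, scanned


blocks = [Block([1, 3, 5]), Block([10, 12, 15]), Block([20, 22, 22])]
matches, scanned = scan_equals(blocks, 22)
```

In this example only the third block overlaps the value 22, so the other two are skipped without reading their values. Pruning of this kind works best when data is sorted or clustered so that block ranges overlap as little as possible.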

Benefits

Performance: Indexing and compression reduce the data read per query, which speeds up analytical queries.
Scalability: Designed to handle petabyte-scale datasets.
Cost Efficiency: Runs on commodity hardware with optimized resource usage.
Flexibility: Supports various data ingestion and querying patterns.
Open Source: Free to use under Apache License 2.0.
Ecosystem Integration: Works seamlessly with Spark, Hive, Presto, and cloud storage systems.
Developer-Friendly: Rich API support and SQL extensions for custom applications.
