
Apache Celeborn
Celeborn is an intermediate data service for Big Data compute engines running ETL, OLAP, and streaming workloads. It manages shuffle and spilled data efficiently to boost engine performance, stability, and flexibility.
Vendor
The Apache Software Foundation
Product details
Apache Celeborn
Apache Celeborn is an intermediate data service that optimizes data exchange in distributed Big Data compute engines running ETL, OLAP, and streaming workloads. It addresses inefficiencies in traditional shuffle frameworks by reorganizing and managing shuffle and spilled data so that it can be written and read more efficiently at scale. Because Celeborn decouples this data from compute nodes, it enables disaggregated architectures and improves overall system flexibility and stability.
Features
- Efficient shuffle data management across distributed systems
- Support for multiple storage layers: memory, local disks, distributed file systems, and object stores
- High availability via Raft-based Master node architecture
- Integration with Apache Spark, Apache Flink, and Hadoop MapReduce
- Modular architecture with Master, Worker, and Client components
- Fine-grained control over shuffle lifecycle and metadata
- Optimized disk and network usage through data reorganization
- Fault-tolerant data handling and recovery mechanisms
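As an illustration of the Apache Spark integration listed above, the following minimal sketch builds the Spark properties that route shuffle traffic through a Celeborn cluster. The helper names (`celeborn_spark_conf`, `to_submit_args`) and the endpoint values are hypothetical. The property names are the ones commonly shown in Celeborn deployment guides and should be checked against the documentation for the deployed version.

```python
def celeborn_spark_conf(master_endpoints):
    """Return Spark properties that point a job at Celeborn for shuffle.

    master_endpoints: list of "host:port" strings for the Celeborn Master
    nodes (placeholder values supplied by the caller).
    """
    if not master_endpoints:
        raise ValueError("at least one Master endpoint is required")
    return {
        # Assumed property names; verify against the deployed Celeborn version.
        "spark.shuffle.manager":
            "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
        "spark.celeborn.master.endpoints": ",".join(master_endpoints),
        # Spark's external shuffle service is not used alongside Celeborn.
        "spark.shuffle.service.enabled": "false",
    }


def to_submit_args(conf):
    """Flatten a property dict into spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    conf = celeborn_spark_conf(["master-1:9097", "master-2:9097"])
    print(" ".join(to_submit_args(conf)))
```

Listing more than one Master endpoint matches the Raft-based high-availability setup, where a client can reach whichever Master is currently the leader.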
Capabilities
- Centralized shuffle data service decoupled from compute nodes
- Slot-based allocation and reservation for shuffle operations
- Logical partitioning of shuffle data for efficient access
- Sequential data reading with minimal network connections
- Dynamic partition splitting for large or failed data pushes
- LifecycleManager (control plane) and ShuffleClient (data plane) roles
- Compatibility with disaggregated compute-storage architectures
- Configurable storage strategies per Worker node
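The slot-based allocation and dynamic partition splitting listed above can be pictured with a toy model. This is a simplified sketch, not Celeborn's implementation: the `Worker` class, the most-free-slots placement policy, and the `split_threshold` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    free_slots: int
    partitions: list = field(default_factory=list)


def reserve_slots(workers, num_partitions):
    """Assign each partition id to the worker with the most free slots.

    Toy policy only; ties go to the worker that appears first in the list.
    """
    assignment = {}
    for pid in range(num_partitions):
        target = max(workers, key=lambda w: w.free_slots)
        if target.free_slots == 0:
            raise RuntimeError("no free slots left")
        target.free_slots -= 1
        target.partitions.append(pid)
        assignment[pid] = target.name
    return assignment


def needs_split(bytes_written, split_threshold):
    """A partition is split once its written size reaches the threshold."""
    return bytes_written >= split_threshold


if __name__ == "__main__":
    workers = [Worker("w1", free_slots=2), Worker("w2", free_slots=1)]
    print(reserve_slots(workers, 3))
    print(needs_split(bytes_written=2048, split_threshold=1024))
```

In this model a slot is just a unit of capacity on a Worker. Reserving slots before any data is pushed lets the service reject or rebalance a shuffle up front instead of failing in the middle of a write.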
Benefits
- Boosts performance of distributed compute engines by reducing shuffle overhead
- Enhances system stability and scalability through centralized data management
- Reduces local storage requirements on compute nodes
- Simplifies integration with existing Big Data frameworks
- Improves resource utilization and reduces network congestion
- Enables flexible deployment strategies across heterogeneous environments
- Facilitates efficient data access patterns for large-scale analytics