
Celeborn is an intermediate data service for Big Data compute engines powering ETL, OLAP, and streaming workloads, designed to boost performance, stability, and flexibility by managing shuffle and spilled data efficiently.

Vendor

The Apache Software Foundation

Company Website

Product details

Apache Celeborn

Apache Celeborn is an intermediate data service designed to optimize data exchange in distributed Big Data compute engines across ETL, OLAP, and streaming workloads. It addresses inefficiencies in traditional shuffle frameworks by reorganizing and managing shuffle and spilled data in a more performant and scalable way. Celeborn decouples shuffle data storage from compute nodes, enabling disaggregated architectures and improving overall system flexibility and stability.
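In practice, adopting Celeborn from an engine such as Apache Spark is largely a matter of client configuration: the engine's shuffle manager is pointed at Celeborn, and the client is told where the Celeborn Master lives. The sketch below is illustrative only; the exact property names and the shuffle-manager class vary by Celeborn and Spark version, and the endpoint host/port are placeholders.

```properties
# spark-defaults.conf (illustrative sketch; verify keys against your Celeborn version)

# Route Spark's shuffle through Celeborn instead of the built-in shuffle.
spark.shuffle.manager            org.apache.spark.shuffle.celeborn.SparkShuffleManager

# Celeborn Master endpoint(s); host and port here are placeholders.
spark.celeborn.master.endpoints  celeborn-master:9097

# Celeborn replaces Spark's external shuffle service.
spark.shuffle.service.enabled    false
```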

Features

  • Efficient shuffle data management across distributed systems
  • Support for multiple storage layers: memory, local disks, distributed file systems, and object stores
  • High availability via Raft-based Master node architecture
  • Integration with Apache Spark, Apache Flink, and Hadoop MapReduce
  • Modular architecture with Master, Worker, and Client components
  • Fine-grained control over shuffle lifecycle and metadata
  • Optimized disk and network usage through data reorganization
  • Fault-tolerant data handling and recovery mechanisms
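As a sketch of how the multi-tier storage and Raft-based HA features above surface in deployment, a cluster configuration might resemble the following. The key names reflect Celeborn's documented configuration surface but should be treated as assumptions; hostnames and paths are placeholders.

```properties
# celeborn-defaults.conf (illustrative sketch; verify keys against your Celeborn version)

# Raft-based Master HA: list every Master replica so clients and Workers
# can fail over if the leader goes down.
celeborn.master.endpoints    master-1:9097,master-2:9097,master-3:9097
celeborn.master.ha.enabled   true

# Per-Worker local-disk storage directories (placeholder paths); each Worker
# can be configured with its own storage layout.
celeborn.worker.storage.dirs /mnt/disk1/celeborn,/mnt/disk2/celeborn
```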

Capabilities

  • Centralized shuffle data service decoupled from compute nodes
  • Slot-based allocation and reservation for shuffle operations
  • Logical partitioning of shuffle data for efficient access
  • Sequential data reading with minimal network connections
  • Dynamic partition splitting for large or failed data pushes
  • LifecycleManager and ShuffleClient roles for control and data planes
  • Compatibility with disaggregated compute-storage architectures
  • Configurable storage strategies per Worker node
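The dynamic partition splitting capability above can be illustrated with a toy model. This is not Celeborn's API, just a minimal sketch of the idea: once a logical partition's accumulated bytes cross a threshold, subsequent pushes are redirected into a fresh split, so no single partition file grows without bound. The class name, threshold, and chunk sizes are all invented for illustration.

```python
# Toy sketch of dynamic partition splitting (conceptual, not Celeborn code).
SPLIT_THRESHOLD = 1024  # bytes per split; illustrative, real limits are far larger

class PartitionWriter:
    """Accumulates pushed chunks, rolling over to a new split at the threshold."""

    def __init__(self, partition_id: int):
        self.partition_id = partition_id
        self.splits = [[]]      # each split is a list of pushed chunks
        self.split_sizes = [0]  # bytes accumulated per split

    def push(self, chunk: bytes) -> None:
        # Redirect to a fresh split once the current one reaches the threshold.
        if self.split_sizes[-1] >= SPLIT_THRESHOLD:
            self.splits.append([])
            self.split_sizes.append(0)
        self.splits[-1].append(chunk)
        self.split_sizes[-1] += len(chunk)

    def num_splits(self) -> int:
        return len(self.splits)

writer = PartitionWriter(partition_id=0)
for _ in range(5):
    writer.push(b"x" * 512)  # five 512-byte pushes cross the 1024-byte threshold twice
print(writer.num_splits())   # -> 3
```

Real shuffle services make the same decision server-side and must also coordinate split metadata with readers; the sketch only shows the write-side rollover.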

Benefits

  • Boosts performance of distributed compute engines by reducing shuffle overhead
  • Enhances system stability and scalability through centralized data management
  • Reduces local storage requirements on compute nodes
  • Simplifies integration with existing Big Data frameworks
  • Improves resource utilization and reduces network congestion
  • Enables flexible deployment strategies across heterogeneous environments
  • Facilitates efficient data access patterns for large-scale analytics