
Apache Paimon is a lakehouse storage format that supports real-time and batch processing with engines like Flink and Spark. It combines a lake format with a log-structured merge-tree (LSM) structure to enable real-time streaming updates, flexible data management, and efficient querying for large-scale data architectures.
Vendor
The Apache Software Foundation


Apache Paimon
Apache Paimon is an open-source lakehouse storage format designed to support real-time and batch data processing. It combines the benefits of data lakes and data warehouses by enabling streaming updates, efficient querying, and scalable metadata management. Built to integrate with engines like Apache Flink and Apache Spark, Paimon supports both append-only and primary-key tables, making it suitable for a wide range of analytical and transactional workloads.
Features
- Real-time streaming updates with primary-key support
- Flexible update mechanisms via merge engines
- Changelog tracking for accurate stream analytics
- Append-only tables for large-scale batch and streaming processing
- Data skipping using min-max indexes for fast queries
- Full schema evolution and time travel capabilities
- Compaction with z-order sorting for optimized storage
- Integration with Flink, Spark, Hive, Trino, and other engines
- Unified table abstraction for batch and streaming modes
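
To make the "merge engine" idea above more concrete, here is a minimal conceptual sketch in Python. It is not Paimon's API or implementation. It only models what happens when several writes arrive for the same primary key and a merge function decides which values survive. The function names (`deduplicate`, `partial_update`, `aggregate_sum`, `merge_by_key`) and the sample columns are hypothetical and chosen for illustration; they loosely mirror the deduplicate, partial-update, and aggregation behaviors.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Row = Dict[str, Optional[int]]        # column name -> value (None means "not set")
MergeFn = Callable[[Row, Row], Row]   # (older row, newer row) -> merged row


def deduplicate(old: Row, new: Row) -> Row:
    """Keep only the newest row for a key."""
    return dict(new)


def partial_update(old: Row, new: Row) -> Row:
    """Overwrite a column only where the newer row carries a non-null value."""
    merged = dict(old)
    merged.update({col: val for col, val in new.items() if val is not None})
    return merged


def aggregate_sum(old: Row, new: Row) -> Row:
    """Sum every column, treating a missing or null value as 0."""
    cols = old.keys() | new.keys()
    return {col: (old.get(col) or 0) + (new.get(col) or 0) for col in cols}


def merge_by_key(writes: Iterable[Tuple[str, Row]], merge: MergeFn) -> Dict[str, Row]:
    """Fold writes (oldest first) into one row per primary key."""
    table: Dict[str, Row] = {}
    for key, row in writes:
        table[key] = merge(table[key], row) if key in table else dict(row)
    return table


# Hypothetical stream of upserts, oldest first.
writes = [
    ("user-1", {"clicks": 3, "orders": None}),
    ("user-2", {"clicks": 1, "orders": 1}),
    ("user-1", {"clicks": 4, "orders": None}),
    ("user-1", {"clicks": None, "orders": 2}),
]

latest = merge_by_key(writes, deduplicate)
patched = merge_by_key(writes, partial_update)
totals = merge_by_key(writes, aggregate_sum)

print(latest["user-1"])
print(patched["user-1"])
print(totals["user-1"])
```

The same write stream yields three different tables depending on the merge function, which is the design choice a merge engine exposes: the writer keeps appending changes, and the table definition decides how rows that share a key are reconciled.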
Capabilities
- Supports hybrid read modes: batch snapshots, streaming offsets, and incremental snapshots
- Enables CDC (Change Data Capture) ingestion from databases
- Provides high-performance OLAP queries over large datasets
- Stores columnar files with manifest-based metadata for efficient access
- Uses an LSM-tree structure for scalable updates and queries
- Compatible with object stores and distributed file systems
- Facilitates real-time analytics with sub-minute query latency
- Offers flexible partitioning and indexing strategies
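
The three read modes listed above (batch snapshots, streaming offsets, incremental snapshots) can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not Paimon code: real snapshots reference manifest and data files, whereas this hypothetical `Snapshot` class stores its changes inline, and the `"+I"`/`"-D"` change kinds and all function names are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

Change = Tuple[str, str, int]   # (kind, key, value); kind is "+I" (upsert) or "-D" (delete)


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    changes: Tuple[Change, ...]   # changes committed by this snapshot


def apply_changes(state: Dict[str, int], changes: Tuple[Change, ...]) -> None:
    """Apply one snapshot's changes to an in-memory key -> value state."""
    for kind, key, value in changes:
        if kind == "+I":
            state[key] = value
        else:
            state.pop(key, None)


def batch_read(log: List[Snapshot], snapshot_id: int) -> Dict[str, int]:
    """Table state as of snapshot_id; an older id behaves like time travel."""
    state: Dict[str, int] = {}
    for snap in log:
        if snap.snapshot_id <= snapshot_id:
            apply_changes(state, snap.changes)
    return state


def incremental_read(log: List[Snapshot], start_exclusive: int, end_inclusive: int) -> List[Change]:
    """All changes committed after start_exclusive, up to and including end_inclusive."""
    return [
        change
        for snap in log
        if start_exclusive < snap.snapshot_id <= end_inclusive
        for change in snap.changes
    ]


def streaming_read(log: List[Snapshot], last_consumed_id: int) -> Iterator[Tuple[int, List[Change]]]:
    """Yield (snapshot_id, changes) for each snapshot past the consumer's offset."""
    for snap in log:
        if snap.snapshot_id > last_consumed_id:
            yield snap.snapshot_id, list(snap.changes)


# Hypothetical commit history, oldest first.
log = [
    Snapshot(1, (("+I", "a", 1), ("+I", "b", 2))),
    Snapshot(2, (("+I", "a", 10),)),
    Snapshot(3, (("-D", "b", 2), ("+I", "c", 3))),
]

current = batch_read(log, 3)
as_of_first = batch_read(log, 1)
delta = incremental_read(log, 1, 3)
tail = list(streaming_read(log, 2))

print(current, as_of_first, delta, tail, sep="\n")
```

In this model a batch read materializes state up to one snapshot, an incremental read returns only the changes between two snapshots, and a streaming read resumes from a stored offset, so all three are views over the same commit history.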
Benefits
- Combines the flexibility of data lakes with the performance of data warehouses
- Reduces latency for real-time data ingestion and querying
- Simplifies data architecture with unified storage and access patterns
- Enhances scalability and reliability for enterprise-grade workloads
- Supports modern data engineering practices including stream processing and schema evolution
- Enables cost-effective storage with efficient compaction and indexing
- Promotes open-source collaboration and extensibility