Apache Hudi is an open-source data lakehouse platform that enables efficient, incremental data processing with ACID guarantees, time travel, and schema evolution. It supports streaming and batch workloads, offers high-performance indexing, and integrates with cloud-native and open data ecosystems.

Vendor

The Apache Software Foundation

Company Website

Product details

Apache Hudi

Apache Hudi is an open-source data lakehouse platform designed to bring database-like functionality to data lakes: record-level updates and deletes, ACID transactions, time travel, and schema evolution. It is built on an open table format and supports incremental processing for both streaming and batch workloads.

Features

  • Support for mutability across all workload types
  • Fast, pluggable indexing for updates and deletes
  • Incremental data processing for low-latency analytics
  • ACID transactional guarantees with snapshot isolation
  • Time travel capabilities for historical data analysis
  • Multi-cloud ecosystem compatibility
  • Automated table services for clustering, compaction, and cleaning
  • Multi-modal indexing for query acceleration
  • Schema evolution and enforcement for resilient pipelines
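
To make the incremental processing and time travel features concrete, here is a minimal sketch of how the corresponding read options might be assembled for the Spark datasource. The helper name `hudi_read_options` is hypothetical. The option keys are the ones documented for Hudi's Spark datasource in the 0.x releases and should be checked against the version in use.

```python
from typing import Dict, Optional

# Query types accepted by the Hudi Spark datasource.
QUERY_TYPES = {"snapshot", "incremental", "read_optimized"}


def hudi_read_options(
    query_type: str = "snapshot",
    begin_instant: Optional[str] = None,
    as_of_instant: Optional[str] = None,
) -> Dict[str, str]:
    """Build read options for a Hudi table (hypothetical helper, not part of Hudi)."""
    if query_type not in QUERY_TYPES:
        raise ValueError(f"unknown query type: {query_type!r}")

    opts = {"hoodie.datasource.query.type": query_type}

    if query_type == "incremental":
        # Incremental queries return only records written after this instant.
        if begin_instant is None:
            raise ValueError("incremental queries need a begin instant")
        opts["hoodie.datasource.read.begin.instanttime"] = begin_instant
    elif begin_instant is not None:
        raise ValueError("begin_instant only applies to incremental queries")

    if as_of_instant is not None:
        # Time travel: read the table as it was at the given instant.
        # This sketch restricts it to snapshot queries as a simplification.
        if query_type != "snapshot":
            raise ValueError("as_of_instant is only handled for snapshot queries here")
        opts["as.of.instant"] = as_of_instant

    return opts
```

Assuming a SparkSession with the Hudi bundle on its classpath, the resulting dictionary would be passed as `spark.read.format("hudi").options(**opts).load(base_path)`, where `base_path` is the table location.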

Capabilities

  • Efficient upserts and deletes for CDC and streaming data
  • Integration with popular engines like Spark, Flink, Hive, Presto, and Trino
  • Support for open data formats and cloud-native environments
  • Auto-ingestion from sources such as Kafka, including Debezium change data capture streams
  • Auto-sync with cloud data catalogs
  • Native Rust implementation (Hudi-rs) with Python bindings
  • Optimized file layout and table types (Copy-on-Write, Merge-on-Read)
  • Snapshot, incremental, and read-optimized query modes
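
As an illustration of the upsert capability and the two table types, the sketch below builds write options for the Spark datasource and wraps the write call. The function names `hudi_upsert_options` and `upsert` are hypothetical, and the example field names in the usage note are assumptions. The option keys follow Hudi's documented Spark write configuration for 0.x releases and may differ in newer versions.

```python
from typing import Dict, Optional

# The two Hudi table types named in the list above.
TABLE_TYPES = {"COPY_ON_WRITE", "MERGE_ON_READ"}


def hudi_upsert_options(
    table_name: str,
    record_key: str,
    precombine_field: str,
    partition_field: Optional[str] = None,
    table_type: str = "COPY_ON_WRITE",
) -> Dict[str, str]:
    """Build write options for an upsert (hypothetical helper, not part of Hudi)."""
    if table_type not in TABLE_TYPES:
        raise ValueError(f"unknown table type: {table_type!r}")

    opts = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": table_type,
        # Field that uniquely identifies a record within the table.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to pick the latest version when keys collide.
        "hoodie.datasource.write.precombine.field": precombine_field,
    }
    if partition_field is not None:
        opts["hoodie.datasource.write.partitionpath.field"] = partition_field
    return opts


def upsert(df, base_path: str, options: Dict[str, str]) -> None:
    """Write a PySpark DataFrame to a Hudi table at base_path.

    Assumes the Spark session was started with the Hudi bundle available.
    """
    df.write.format("hudi").options(**options).mode("append").save(base_path)
```

For example, `hudi_upsert_options("orders", "order_id", "updated_at", table_type="MERGE_ON_READ")` would target a Merge-on-Read table keyed on an assumed `order_id` column, with `updated_at` deciding which version of a record wins.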

Benefits

  • Accelerated data ingestion and processing
  • Reduced operational complexity with automated services
  • Improved data reliability and consistency
  • Enhanced query performance on large datasets
  • Flexibility to adapt to evolving data schemas
  • Proven scalability in production environments
  • Active open-source community and continuous innovation