Apache Iceberg is a high-performance open table format for large analytic datasets. It enables reliable SQL-like operations on big data and supports multiple engines like Spark, Flink, Trino, and Hive, allowing concurrent access and advanced features such as schema evolution, hidden partitioning, and time travel.

Vendor

The Apache Software Foundation

Product details

Apache Iceberg

Apache Iceberg is an open-source, high-performance table format designed for managing large analytic datasets. It brings the reliability and simplicity of SQL tables to big data environments, enabling multiple compute engines like Spark, Flink, Trino, Hive, and Impala to safely access and modify the same tables concurrently. Iceberg supports advanced features such as schema evolution, hidden partitioning, time travel, and rollback, making it ideal for modern data lake architectures.

Features

  • Full schema evolution including column add, drop, rename, and reorder
  • Hidden partitioning that derives partition values from column data, so queries are pruned without filtering on partition columns
  • Time travel and rollback for reproducible queries and error recovery
  • Row-level deletes and updates using position and equality delete files
  • Advanced filtering that prunes data files using partition-level and column-level statistics
  • Optimistic concurrency for safe multi-writer environments
  • Serializable isolation ensuring atomic and consistent table changes
  • Support for branching and tagging of table versions
  • REST catalog and multiple language APIs for integration flexibility
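
The optimistic concurrency and snapshot features above can be illustrated with a minimal in-memory sketch. This is not Iceberg's API: `ToyCatalog`, `CommitConflict`, and `append` are hypothetical names invented for illustration. It assumes only that a commit succeeds when the table's current snapshot is still the one the writer started from, and that earlier snapshots remain readable afterwards.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed after the writer read it."""


class ToyCatalog:
    """Hypothetical in-memory stand-in for a catalog tracking one table."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0              # id of the current snapshot
        self.snapshots = {0: []}      # snapshot id -> list of data file names

    def swap(self, expected, new_id, files):
        """Atomically publish a new snapshot if `expected` is still current."""
        with self._lock:
            if self.current != expected:
                raise CommitConflict(f"expected {expected}, found {self.current}")
            self.snapshots[new_id] = files
            self.current = new_id


def append(catalog, new_files, max_retries=3):
    """Optimistic append: read the base snapshot, then try to swap; retry on conflict."""
    for _ in range(max_retries):
        base = catalog.current
        files = catalog.snapshots[base] + new_files
        try:
            catalog.swap(base, base + 1, files)
            return base + 1
        except CommitConflict:
            continue  # another writer committed first; rebuild on the new base
    raise CommitConflict("gave up after retries")


catalog = ToyCatalog()
first = append(catalog, ["a.parquet"])
second = append(catalog, ["b.parquet"])
```

Because old snapshots are never overwritten in this sketch, reading `catalog.snapshots[first]` after the second commit still returns the earlier table state, which is the idea behind time travel and rollback.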

Capabilities

  • Manages petabyte-scale tables with efficient metadata tracking
  • Enables fast scan planning, with no distributed SQL engine needed to find the files a query must read
  • Supports multiple file formats including Parquet, Avro, and ORC
  • Integrates with cloud object stores and HDFS without relying on directory listings
  • Allows partition layout evolution, so a table's layout can be updated as data volume or query patterns change
  • Provides snapshot-based access to table states for consistency and auditability
  • Facilitates cost-based optimization through rich metadata
  • Compatible with various compute engines and deployment environments
  • Offers extensible specification for cross-language and cross-platform support
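
As a rough illustration of how metadata tracking lets scan planning skip files without listing directories or opening data, the sketch below keeps per-file minimum and maximum values for one column and selects only the files whose range can overlap a query filter. The `DataFile` class, its fields, and `plan_scan` are hypothetical and simplified; they are not Iceberg's actual manifest layout or API.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical per-file metadata entry: path plus min/max of one column."""
    path: str
    lower: int
    upper: int


def plan_scan(files, lo, hi):
    """Return the files whose [lower, upper] range overlaps the filter [lo, hi]."""
    return [f for f in files if f.upper >= lo and f.lower <= hi]


files = [
    DataFile("part-0.parquet", lower=1, upper=100),
    DataFile("part-1.parquet", lower=101, upper=200),
    DataFile("part-2.parquet", lower=201, upper=300),
]

# Filter: column value between 150 and 250 (inclusive).
selected = [f.path for f in plan_scan(files, 150, 250)]
```

Only the metadata entries are consulted here; the first file is excluded without being read because its value range cannot satisfy the filter.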

Benefits

  • Simplifies big data table management with SQL-like semantics
  • Reduces query latency and improves performance through metadata pruning
  • Enhances data reliability and correctness in distributed environments
  • Supports agile data modeling with safe and flexible schema changes
  • Enables reproducible analytics and debugging with time travel
  • Minimizes operational complexity with built-in compaction and isolation
  • Promotes open standards and avoids vendor lock-in
  • Scales efficiently with growing data volumes and concurrent workloads