Apache DataFusion is a fast, extensible query engine written in Rust that uses Apache Arrow as its in-memory format. It provides SQL and DataFrame APIs, reads multiple file formats, and runs queries on a vectorized, multi-threaded execution engine. DataFusion is well suited to building high-performance, data-centric systems and analytics platforms.

Vendor

The Apache Software Foundation

Product details

Apache DataFusion

Apache DataFusion is a fast, extensible query engine written in Rust, designed for building high-performance, data-centric systems. It uses Apache Arrow as its in-memory format and provides SQL and DataFrame APIs for efficient data processing. DataFusion supports a wide range of file formats and offers a vectorized, multi-threaded execution engine, making it ideal for analytics, machine learning, and streaming applications.

Features

  • SQL and DataFrame APIs for flexible query construction
  • Native support for CSV, Parquet, JSON, Avro, and Arrow formats
  • Columnar, streaming, multi-threaded, vectorized execution engine
  • Full-featured SQL parser and query planner
  • Advanced query optimizer with join reordering, predicate pushdown, and projection pruning
  • Support for nested types, window functions, subqueries, and set operations
  • User-defined functions and custom execution plans
  • Streaming and asynchronous I/O from cloud object stores such as Amazon S3, Azure Blob Storage, and Google Cloud Storage
  • Python bindings, plus integrations for other languages (C, Java, Ruby)
  • Modular architecture with extension points for custom data sources and operators
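
To make the optimizer bullet concrete, here is a minimal sketch of predicate pushdown and projection pruning on a toy logical plan. It uses only the Rust standard library. The `Plan`, `Predicate`, `optimize`, and `example_plan` names are hypothetical and invented for this illustration; they are not DataFusion's types or API, and the real optimizer handles far more cases.

```rust
// Toy logical plan for illustration only. These types are hypothetical
// and are NOT DataFusion's LogicalPlan or Expr.

/// A single predicate of the form `column > min_exclusive`.
#[derive(Debug, Clone, PartialEq)]
struct Predicate {
    column: String,
    min_exclusive: i64,
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Read `columns` from `table`, keeping rows that satisfy all `filters`.
    Scan { table: String, columns: Vec<String>, filters: Vec<Predicate> },
    /// Keep the rows of `input` that satisfy `predicate`.
    Filter { predicate: Predicate, input: Box<Plan> },
    /// Keep only `columns` of `input`.
    Project { columns: Vec<String>, input: Box<Plan> },
}

/// Rewrites a plan bottom-up, applying two rules:
/// 1. predicate pushdown: a Filter directly above a Scan moves into the Scan;
/// 2. projection pruning: a Project directly above a Scan narrows the Scan
///    to the columns that the projection or the scan's filters need.
fn optimize(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { predicate, input } => match optimize(*input) {
            Plan::Scan { table, columns, mut filters } => {
                filters.push(predicate);
                Plan::Scan { table, columns, filters }
            }
            other => Plan::Filter { predicate, input: Box::new(other) },
        },
        Plan::Project { columns, input } => match optimize(*input) {
            Plan::Scan { table, columns: scan_columns, filters } => {
                let needed: Vec<String> = scan_columns
                    .into_iter()
                    .filter(|c| columns.contains(c) || filters.iter().any(|f| &f.column == c))
                    .collect();
                let scan = Plan::Scan { table, columns: needed.clone(), filters };
                if needed == columns {
                    // The scan already yields exactly the projected columns.
                    scan
                } else {
                    Plan::Project { columns, input: Box::new(scan) }
                }
            }
            other => Plan::Project { columns, input: Box::new(other) },
        },
        scan => scan,
    }
}

/// Unoptimized plan for: SELECT id, amount FROM orders WHERE amount > 100
fn example_plan() -> Plan {
    let all_columns = ["id", "customer", "amount", "note"];
    Plan::Project {
        columns: vec!["id".to_string(), "amount".to_string()],
        input: Box::new(Plan::Filter {
            predicate: Predicate { column: "amount".to_string(), min_exclusive: 100 },
            input: Box::new(Plan::Scan {
                table: "orders".to_string(),
                columns: all_columns.iter().map(|c| c.to_string()).collect(),
                filters: vec![],
            }),
        }),
    }
}
```

Calling `optimize(example_plan())` collapses the three-node plan into a single `Scan` of `orders` that reads only `id` and `amount` and carries the `amount > 100` filter. In a columnar format such as Parquet, a scan shaped like this can skip unneeded columns and, where statistics allow, unneeded row groups.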

Capabilities

  • Executes queries in-process using the Apache Arrow memory model
  • Embeds in Rust applications or runs as a standalone SQL engine
  • Handles both batch and streaming workloads efficiently
  • Supports distributed execution through related subprojects: Ballista (a distributed query engine) and Comet (an accelerator for Apache Spark)
  • Enables real-time analytics and low-latency data processing
  • Integrates with cloud-native environments and big data ecosystems
  • Offers schema-aware query planning and execution
  • Facilitates development of custom databases, dataframes, and ML platforms
  • Provides tools for reading, sorting, and transcoding structured data
  • Compatible with Substrait query plans for cross-system interoperability
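
The in-process, batch-at-a-time execution style described in this list can be sketched with plain Rust iterators. This is an illustration under stated assumptions, not DataFusion code: `Batch` is a hypothetical single-column stand-in for an Arrow record batch, the `scan`, `filter_gt`, and `sum_values` operators are invented for this example, and the sketch is single-threaded.

```rust
/// Hypothetical stand-in for an Arrow record batch: one contiguous i64 column.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    values: Vec<i64>,
}

/// Source operator: lazily yields batches of at most `batch_size` values.
fn scan(data: Vec<i64>, batch_size: usize) -> impl Iterator<Item = Batch> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut rest = data.into_iter().peekable();
    std::iter::from_fn(move || {
        // Stop once the input is exhausted.
        rest.peek()?;
        Some(Batch { values: rest.by_ref().take(batch_size).collect() })
    })
}

/// Filter operator: processes a whole batch per call, keeping values > threshold.
fn filter_gt(
    input: impl Iterator<Item = Batch>,
    threshold: i64,
) -> impl Iterator<Item = Batch> {
    input.map(move |batch| Batch {
        values: batch.values.into_iter().filter(|v| *v > threshold).collect(),
    })
}

/// Aggregate operator: pulls batches one at a time and keeps a running sum,
/// so only one batch needs to be in memory at any moment.
fn sum_values(input: impl Iterator<Item = Batch>) -> i64 {
    input.map(|batch| batch.values.iter().sum::<i64>()).sum()
}
```

Operators compose by wrapping one another, for example `sum_values(filter_gt(scan(data, 8192), 100))`. Nothing is read until the aggregate pulls its first batch, which is what lets the same pipeline shape serve bounded (batch) and unbounded (streaming) inputs. The batch size of 8192 here is an arbitrary value chosen for the example.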

Benefits

  • Delivers high performance through Rust and Arrow optimizations
  • Reduces development effort with reusable components and APIs
  • Enhances scalability and responsiveness for data-intensive applications
  • Supports rapid prototyping and production deployment
  • Enables flexible integration with existing data platforms
  • Promotes modularity and maintainability in system design
  • Backed by a vibrant open-source community and Apache governance
  • Ideal for building modern analytical engines and data pipelines
  • Offers predictable performance and resource efficiency
  • Frees developers from reimplementing core query engine features