Logo
Sign in

Apache DataFu is a set of libraries for large-scale data processing in Hadoop, offering stable, well-tested tools for data mining, statistics, and incremental computation across Spark, Pig, and MapReduce environments.

Vendor

Vendor

The Apache Software Foundation

Company Website

Company Website

boxplot.png
Hourglass-MapCombineReduce.png
Hourglass-Concepts-Preserving.png
Product details

Apache DataFu

Apache DataFu is a suite of libraries designed for large-scale data processing in the Hadoop ecosystem. It provides robust, well-tested utilities for data mining, statistics, and incremental data processing. The project includes three main components: DataFu PigDataFu Spark, and DataFu Hourglass, each tailored to specific data processing needs.

Features

DataFu Pig

  • User-defined functions (UDFs) and macros for Apache Pig
  • Functions for statistics, sessionization, link analysis, set operations, and more
  • Included in Cloudera CDH and Apache Bigtop distributions
  • Used in production at LinkedIn since 2010

DataFu Spark

  • Utilities and UDFs for Apache Spark
  • Deduplication with ordering (e.g., keeping the most recent record)
  • Skewed joins for large datasets
  • Efficient distinct counting
  • Cross-language support: call Python from Scala and vice versa
  • Used in production at PayPal since 2017

DataFu Hourglass

  • Incremental processing framework for Hadoop MapReduce
  • Optimized for sliding window computations (e.g., daily/weekly tracking)
  • Reduces redundant computation, saving 50–95% in resources
  • Used in production at LinkedIn

Capabilities

  • Scalable Data Processing: Handles large-scale datasets efficiently across Hadoop, Pig, and Spark environments.
  • Incremental Computation: Hourglass enables efficient sliding window analytics without reprocessing entire datasets.
  • Advanced Analytics: Provides statistical functions, sessionization, link analysis (e.g., PageRank), and more.
  • Cross-Platform Integration: Supports integration with Spark and Pig, and allows interoperability between Scala and Python.
  • Production-Ready: All libraries are unit-tested and proven in enterprise environments like LinkedIn and PayPal.

Benefits

  • Efficiency: Reduces computational overhead through incremental processing and optimized joins.
  • Flexibility: Offers a wide range of functions for different analytical needs across multiple platforms.
  • Reliability: Built with stability and testing in mind, ensuring consistent performance in production.
  • Open Source: Freely available under the Apache 2.0 license, with active community contributions.
  • Enterprise Proven: Trusted by major companies for mission-critical data workflows.