Apache Griffin is an open-source data quality solution for big data systems. It supports both batch and streaming modes, allowing users to define, measure, and monitor data quality metrics across diverse sources using a flexible rule-based framework integrated with modern data platforms.

Vendor

The Apache Software Foundation

Product details

Apache Griffin

Apache Griffin is an open-source data quality solution for big data environments, supporting both batch and streaming modes. It provides a unified framework for defining, measuring, and reporting data quality metrics across diverse data sources. Griffin enables organizations to build trusted data assets by offering a domain-driven model and a flexible DSL for expressing quality rules. It is designed to integrate with modern big data stacks, including Spark, Hive, Kafka, and Hadoop.

Features

  • Supports batch and streaming data quality measurement.
  • Domain-specific language (DSL) for defining custom quality rules.
  • Built-in models for accuracy, completeness, timeliness, and profiling.
  • Integration with Spark, Hive, Kafka, Hadoop, and Elasticsearch.
  • Configurable sinks for outputting metrics to console, HDFS, or Elasticsearch.
  • Front-end interface for onboarding new data quality requirements.
  • Extensible architecture for custom rule definitions and connectors.
  • Real-time and scheduled execution of quality checks.
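
To make the rule-and-sink idea in the list above concrete, here is a minimal plain-Python sketch of metrics being computed from named rules and routed to interchangeable sinks. It is not Griffin's DSL, API, or configuration format: every name in it (completeness, distinctness, console_sink, memory_sink, run_measure) is hypothetical and chosen only for illustration, and Griffin itself evaluates rules on Spark rather than on in-memory lists.

```python
import json
from typing import Any, Callable

# Hypothetical types for this sketch; none of these names come from Griffin.
Record = dict[str, Any]
Rule = Callable[[list[Record]], float]
Sink = Callable[[str, float], None]


def completeness(field: str) -> Rule:
    """Build a rule: fraction of records whose `field` is present and not None."""
    def rule(records: list[Record]) -> float:
        if not records:
            return 1.0  # assumption: an empty dataset counts as fully complete
        return sum(r.get(field) is not None for r in records) / len(records)
    return rule


def distinctness(field: str) -> Rule:
    """Build a profiling-style rule: distinct non-null values / non-null values."""
    def rule(records: list[Record]) -> float:
        values = [r[field] for r in records if r.get(field) is not None]
        if not values:
            return 1.0  # assumption: no values means nothing is duplicated
        return len(set(values)) / len(values)
    return rule


def console_sink(name: str, value: float) -> None:
    """Write one metric to standard output as a JSON line."""
    print(json.dumps({"metric": name, "value": value}))


def memory_sink(store: dict[str, float]) -> Sink:
    """Build a sink that collects metrics in a dict (stand-in for a metric store)."""
    def sink(name: str, value: float) -> None:
        store[name] = value
    return sink


def run_measure(records: list[Record], rules: dict[str, Rule],
                sinks: list[Sink]) -> dict[str, float]:
    """Evaluate every named rule once and send each result to every sink."""
    results = {name: rule(records) for name, rule in rules.items()}
    for name, value in results.items():
        for sink in sinks:
            sink(name, value)
    return results


records = [
    {"id": 1, "email": "a@example.org"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.org"},
    {"id": 4, "email": "a@example.org"},
]
collected: dict[str, float] = {}
metrics = run_measure(
    records,
    {"email_completeness": completeness("email"),
     "email_distinctness": distinctness("email")},
    [console_sink, memory_sink(collected)],
)
```

Keeping rules and sinks as plain callables means a new check or a new output target can be added without touching run_measure, which is the kind of extensibility the feature list describes.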

Capabilities

  • Ingests data from various sources and applies quality checks dynamically.
  • Measures data quality based on user-defined rules and outputs metrics.
  • Supports checkpointing and persistence for streaming data validation.
  • Enables comparison between source and target datasets in both batch and streaming modes.
  • Provides APIs and configuration files for flexible deployment.
  • Operates in distributed environments using Spark and YARN.
  • Handles structured data in Hive tables and Kafka topics.
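
The source-to-target comparison and checkpointing items above can be illustrated with a small plain-Python sketch. It assumes record keys are unique within each stream and that a checkpoint is a single JSON file; the class StreamingAccuracy and its methods are hypothetical names for this example and do not reflect how Griffin implements streaming accuracy or persistence on Spark.

```python
import json
import tempfile
from pathlib import Path


class StreamingAccuracy:
    """Tracks, across micro-batches, how many source keys have reached the target.

    Hypothetical sketch; assumes keys are unique within each stream.
    """

    def __init__(self) -> None:
        self.pending: set[str] = set()    # source keys not yet seen in target
        self.unclaimed: set[str] = set()  # target keys that arrived before source
        self.matched = 0
        self.total = 0

    def process(self, source_keys: list[str], target_keys: list[str]) -> float:
        """Fold one micro-batch from each stream into the running state."""
        for key in source_keys:
            self.total += 1
            if key in self.unclaimed:
                self.unclaimed.discard(key)
                self.matched += 1
            else:
                self.pending.add(key)
        for key in target_keys:
            if key in self.pending:
                self.pending.discard(key)
                self.matched += 1
            else:
                self.unclaimed.add(key)
        return self.accuracy()

    def accuracy(self) -> float:
        """Matched source records divided by all source records seen so far."""
        return self.matched / self.total if self.total else 1.0

    def save(self, path: Path) -> None:
        """Persist the running state so a restarted job can resume."""
        path.write_text(json.dumps({
            "pending": sorted(self.pending),
            "unclaimed": sorted(self.unclaimed),
            "matched": self.matched,
            "total": self.total,
        }))

    @classmethod
    def load(cls, path: Path) -> "StreamingAccuracy":
        """Rebuild the running state from a checkpoint written by save()."""
        data = json.loads(path.read_text())
        state = cls()
        state.pending = set(data["pending"])
        state.unclaimed = set(data["unclaimed"])
        state.matched = data["matched"]
        state.total = data["total"]
        return state


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "accuracy_checkpoint.json"

    live = StreamingAccuracy()
    first = live.process(["a", "b", "c"], ["a"])  # b and c are still in flight
    live.save(checkpoint)

    # Simulate a restart: state comes back from the checkpoint, not from memory.
    resumed = StreamingAccuracy.load(checkpoint)
    second = resumed.process(["d"], ["b", "c"])   # late arrivals now match

print(first, second)
```

Because the unmatched keys are part of the checkpoint, the late-arriving target records for b and c are still credited after the simulated restart, leaving only d outstanding.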

Benefits

  • Improves trust in data assets through continuous quality monitoring.
  • Reduces manual effort in validating data across systems.
  • Enhances decision-making with reliable and timely data metrics.
  • Scales with enterprise data infrastructure and supports real-time use cases.
  • Encourages collaboration between data engineers and analysts.
  • Is open source and governed by the Apache Software Foundation.