
Apache Griffin is an open-source data quality solution for big data systems. It supports both batch and streaming modes, allowing users to define, measure, and monitor data quality metrics across diverse sources using a flexible rule-based framework integrated with modern data platforms.
Vendor
The Apache Software Foundation


Apache Griffin
Apache Griffin is an open-source data quality solution for big data environments, supporting both batch and streaming modes. It provides a unified framework for defining, measuring, and reporting data quality metrics across diverse data sources. Griffin enables organizations to build trusted data assets by offering a domain-driven model and a flexible DSL for expressing quality rules. It is designed to integrate with modern big data stacks, including Spark, Hive, Kafka, and Hadoop.
Features
- Supports batch and streaming data quality measurement.
- Domain-specific language (DSL) for defining custom quality rules.
- Built-in models for accuracy, completeness, timeliness, and profiling.
- Integration with Spark, Hive, Kafka, Hadoop, and Elasticsearch.
- Configurable sinks for outputting metrics to console, HDFS, or Elasticsearch.
- Front-end interface for onboarding new data quality requirements.
- Extensible architecture for custom rule definitions and connectors.
- Real-time and scheduled execution of quality checks.
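To make the built-in measure types more concrete, the following is a minimal conceptual sketch in plain Python of what a completeness check and a simple profiling pass compute. It does not use Griffin's API or DSL; the function names `completeness` and `profile`, the sample records, and the `age` field are all hypothetical and chosen only for illustration. In Griffin, equivalent logic would typically run on Spark over much larger datasets.

```python
from statistics import mean


def completeness(records, field):
    """Return the fraction of records whose `field` is present and not None."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)


def profile(records, field):
    """Return count, min, max, and mean over the non-null values of a numeric field."""
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


# Hypothetical sample data: one null value and one missing field.
rows = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 3, "age": 50},
    {"id": 4},
]

print(completeness(rows, "age"))  # 0.5 (2 of 4 records have a usable value)
print(profile(rows, "age"))
```

Both functions return plain numbers or dictionaries, which mirrors the idea of emitting metrics to a sink such as the console rather than modifying the data being checked.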
Capabilities
- Ingests data from various sources and applies quality checks dynamically.
- Measures data quality based on user-defined rules and outputs metrics.
- Supports checkpointing and persistence for streaming data validation.
- Enables comparison between source and target datasets in both batch and streaming modes.
- Provides APIs and configuration files for flexible deployment.
- Operates in distributed environments using Spark and YARN.
- Handles structured data in Hive tables and Kafka topics.
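The source-to-target comparison listed above can be illustrated with a small conceptual sketch in plain Python. It assumes that accuracy is measured as the share of source records that have a matching record in the target on a chosen set of fields. The function name `accuracy`, the result keys, and the sample records are hypothetical; this is not Griffin's implementation, which runs such comparisons on Spark.

```python
def accuracy(source, target, key_fields):
    """Count source records that have a match in target on all key_fields."""
    # Build a set of target keys for constant-time membership checks.
    target_keys = {tuple(r.get(f) for f in key_fields) for r in target}
    total = len(source)
    matched = sum(
        1 for r in source
        if tuple(r.get(f) for f in key_fields) in target_keys
    )
    return {
        "total": total,
        "matched": matched,
        "miss": total - matched,
        "accuracy": matched / total if total else 0.0,
    }


# Hypothetical data: record 2 differs in amount, record 4 is absent from target.
source = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
    {"id": 4, "amount": 40},
]
target = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 25},
    {"id": 3, "amount": 30},
]

result = accuracy(source, target, ["id", "amount"])
print(result)  # {'total': 4, 'matched': 2, 'miss': 2, 'accuracy': 0.5}
```

Matching on every compared field, rather than on the identifier alone, is what lets the check catch records that exist in both datasets but carry different values.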
Benefits
- Improves trust in data assets through continuous quality monitoring.
- Reduces manual effort in validating data across systems.
- Enhances decision-making with reliable and timely data metrics.
- Scales with enterprise data infrastructure and supports real-time use cases.
- Encourages collaboration between data engineers and analysts.
- Open-source and governed by the Apache Software Foundation.