
Apache Gobblin
The Apache Software Foundation
Apache Gobblin is a distributed data integration framework for ingesting, transforming, and managing large volumes of data from diverse sources in both streaming and batch environments. It supports scalability, fault tolerance, and metadata management across heterogeneous ecosystems.
Vendor
The Apache Software Foundation

Product details
Apache Gobblin
Apache Gobblin is a distributed data integration framework designed to simplify and automate the ingestion, replication, organization, and lifecycle management of data across both streaming and batch ecosystems. Originally developed by LinkedIn, Gobblin addresses the challenges of managing diverse data pipelines and provides a unified platform for extracting, transforming, and loading (ETL) data from various sources into big data environments like Hadoop.
Features
- Universal Data Ingestion: Supports ingestion from databases, REST APIs, FTP/SFTP servers, file systems, and more.
- Pluggable Architecture: Modular design allows easy extension and customization of components such as sources, extractors, converters, and writers (a minimal custom-converter sketch follows this list).
- Job and Task Scheduling: Built-in scheduling capabilities for managing ETL workflows.
- Error Handling and Retry Mechanisms: Robust fault tolerance with automatic retries and error tracking.
- State Management: Maintains job and task states across executions for consistency and recovery.
- Data Quality Assurance: Includes row-level and task-level quality checkers to validate data before publishing.
- Converter Chaining: Enables complex data transformations through composable converter chains.
- Forking and Multi-Sink Support: Allows branching of data flows to multiple destinations or formats.
- Monitoring and Metrics: Integrated tools for tracking performance, metrics, and job health.
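To make the pluggable component model concrete, the sketch below implements a custom converter against Gobblin's converter API (org.apache.gobblin.converter.Converter). The UpperCaseConverter class and its pass-through string schema are illustrative assumptions for this listing, not part of Gobblin itself.

```java
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.converter.Converter;
import org.apache.gobblin.converter.DataConversionException;
import org.apache.gobblin.converter.SchemaConversionException;
import org.apache.gobblin.converter.SingleRecordIterable;

/**
 * Hypothetical converter that upper-cases string records while passing the
 * schema through unchanged. The generic parameters are
 * <input schema, output schema, input record, output record>.
 */
public class UpperCaseConverter extends Converter<String, String, String, String> {

  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit)
      throws SchemaConversionException {
    // No schema change for this simple transformation.
    return inputSchema;
  }

  @Override
  public Iterable<String> convertRecord(String outputSchema, String inputRecord,
      WorkUnitState workUnit) throws DataConversionException {
    // A converter may emit zero, one, or many output records per input record;
    // this one emits exactly one.
    return new SingleRecordIterable<>(inputRecord.toUpperCase());
  }
}
```

A job configuration would reference such a class through the converter.classes property; listing several classes there chains them in order, which is how the converter-chaining feature above composes transformations.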
Capabilities
- Streaming and Batch Support: Handles both real-time and scheduled data ingestion scenarios.
- Multi-Mode Execution:
  - Standalone Mode: Runs on a single machine, suitable for lightweight or embedded use cases (a minimal embedded-mode sketch follows this list).
  - MapReduce Mode: Executes as a MapReduce job on Hadoop clusters, compatible with Azkaban.
  - Cluster/YARN Mode: Operates as a distributed cluster with high availability, suitable for large-scale deployments.
  - Cloud Mode: Deploys as an elastic cluster on public cloud infrastructure with high availability.
- Metadata Management: Centralized handling of metadata across diverse data sources.
- Extensibility: Easily integrates new data sources and sinks through custom adapters.
- Data Model Evolution Handling: Supports schema changes and evolution over time.
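For the standalone/embedded mode referenced above, Gobblin jobs can be launched in-process through the EmbeddedGobblin runner. This is a minimal sketch assuming org.apache.gobblin.runtime.embedded.EmbeddedGobblin; the source and converter class names are placeholders borrowed from Gobblin's examples module, and a real pipeline would need additional source-specific properties.

```java
import org.apache.gobblin.runtime.embedded.EmbeddedGobblin;

public class EmbeddedIngestExample {
  public static void main(String[] args) throws Exception {
    // Define a small ingestion job entirely in-process. The property keys are
    // standard Gobblin job-configuration settings; the class values below are
    // placeholders taken from the gobblin-example module.
    EmbeddedGobblin job = new EmbeddedGobblin("embedded-example-job")
        .setConfiguration("source.class",
            "org.apache.gobblin.example.simplejson.SimpleJsonSource")
        .setConfiguration("converter.classes",
            "org.apache.gobblin.example.simplejson.SimpleJsonConverter");

    // run() blocks until the job completes, which makes this pattern suitable
    // for lightweight or embedded use cases on a single machine.
    job.run();
  }
}
```

In principle, the same job properties can also be submitted through the MapReduce or cluster launchers, which is what makes the multi-mode execution capability above practical.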
Benefits
- Efficiency: Reduces development and operational overhead by automating common ETL tasks.
- Scalability: Adapts to growing data volumes and complex ingestion needs.
- Flexibility: Works across various environments—on-premises, cloud, and hybrid setups.
- Reliability: Ensures data integrity and consistency through robust error handling and quality checks.
- Self-Service: Empowers teams to manage their own data pipelines without deep infrastructure knowledge.
- Cost-Effective: Open-source and designed to optimize resource usage across deployments.