Apache Avro is a data serialization system for efficient, compact, schema-based data exchange across programming languages, with support for dynamic typing and schema evolution in distributed systems.

Vendor

The Apache Software Foundation

Company Website

Company Website

Product details

Apache Avro

Apache Avro™ is a data serialization system developed within the Apache Hadoop ecosystem. It is designed to provide compact, fast, binary data serialization with rich data structures and robust schema evolution capabilities. Avro is widely used in streaming data pipelines and supports multiple programming languages including Java, Python, C/C++, C#, PHP, Ruby, Rust, JavaScript, and Perl.
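As a concrete illustration, Avro schemas are JSON documents describing named types and their fields. A minimal schema for a hypothetical user record (the names here are invented for this example) might look like:

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "int"], "default": null}
  ]
}
```

The union type `["null", "int"]` is Avro's idiom for an optional field, and the `default` makes the field safe to add or remove as the schema evolves.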

Features

  • Compact Binary Format: Avro uses a highly efficient binary format for data serialization, minimizing storage and transmission overhead.
  • Rich Data Structures: Supports complex types such as records, enums, arrays, maps, unions, and fixed-size binary data.
  • Schema-Based Serialization: Every Avro file includes its schema, enabling self-describing data and seamless interoperability.
  • JSON-Based Schemas: Schemas are defined using JSON, making them easy to read and integrate with existing tools.
  • Dynamic Typing: No need for code generation to read or write data, simplifying integration with dynamic languages.
  • Remote Procedure Call (RPC): Built-in support for RPC with schema exchange during connection handshakes.
  • Cross-Language Support: Implementations available for a wide range of languages, facilitating multi-platform data exchange.
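The compact binary format mentioned above can be made concrete with one detail from the Avro specification: `int` and `long` values are zig-zag encoded and then written as variable-length base-128 bytes, so small magnitudes of either sign occupy a single byte. The following is a minimal Python sketch of that encoding (illustrative only, not the library's implementation):

```python
def zigzag_encode(n: int) -> int:
    # Zig-zag maps signed ints to unsigned so small magnitudes
    # (positive or negative) stay small: 0->0, -1->1, 1->2, -2->3, ...
    return (n << 1) ^ (n >> 63)

def encode_long(n: int) -> bytes:
    """Encode a signed 64-bit integer the way Avro serializes longs:
    zig-zag, then variable-length base-128 (7 bits per byte, with the
    high bit set on every byte except the last)."""
    z = zigzag_encode(n) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Small values fit in one byte regardless of sign:
print(encode_long(1).hex())   # -> "02"
print(encode_long(-1).hex())  # -> "01"
print(encode_long(64).hex())  # -> "8001" (two bytes)
```

Because values are untagged (the schema, not the data, says what comes next), this per-value encoding is essentially all the overhead a serialized number carries.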

Capabilities

  • Schema Evolution: Avro supports forward and backward compatibility between schema versions, allowing systems to evolve without breaking data pipelines.
  • Self-Describing Files: Avro container files embed the schema, enabling any system to interpret the data without external schema definitions.
  • Efficient Data Processing: Minimal per-value overhead and untagged data reduce serialization size and improve performance.
  • Symbolic Resolution: Schema differences are resolved using field names rather than manually assigned IDs.
  • Generic Data Handling: Enables construction of generic data-processing systems without relying on statically typed code.
  • Integration with Big Data Ecosystems: Commonly used with Apache Kafka, Apache Hive, Apache Spark, and other Hadoop-related tools.
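Schema evolution in practice typically hinges on defaults: if a newer version of a record adds a field with a default value, readers using the new schema can still decode data written with the old one. A hypothetical v2 of a `User` record, where the `email` field is newly added (names invented for illustration):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

When resolving v1 data against this schema, the reader fills `email` with its default of `null`; conversely, a reader still on v1 simply skips the extra field in v2 data. This is the forward/backward compatibility the Capabilities list refers to.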

Benefits

  • Interoperability: Facilitates seamless data exchange across different systems and languages.
  • Performance: Optimized for speed and compactness, making it ideal for high-throughput data pipelines.
  • Flexibility: Supports both static and dynamic typing, catering to a wide range of development environments.
  • Maintainability: Schema evolution and embedded metadata simplify long-term data management.
  • Developer Productivity: Reduces boilerplate code and simplifies serialization logic, especially in dynamic languages.
  • Reliability: Ensures consistent data interpretation through embedded schemas and robust specification adherence.