ZhangZhihui's Blog  

Apache Avro is an open-source data serialization system used primarily in big data ecosystems like Hadoop and Kafka. It transforms complex data structures into a compact, binary format for efficient storage and high-speed transmission across different programming languages.

Core Features

  • Schema-Based: Avro relies on schemas defined in JSON to structure data. Unlike JSON, which repeats field names in every record, Avro stores the schema separately (or once at the start of a file), making the actual data much smaller.

  • Schema Evolution: It is designed to handle changes over time. You can add or remove fields (and rename them via aliases) without breaking existing programs, as long as specific compatibility rules are followed.

  • Language Independent: It provides APIs for many languages, including Java, Python, C++, C#, and Ruby, allowing a Python program to easily read data written by a Java program.

  • No Code Generation Required: While you can generate classes for static languages like Java, Avro is also designed for scripting languages that need to process data dynamically without a pre-compiled schema.


Avro vs. Other Formats

Feature       | Apache Avro                | JSON                     | Apache Parquet
Format        | Binary (Compact)           | Text (Human-readable)    | Binary (Columnar)
Schema        | Required (JSON-based)      | Optional                 | Required
Primary Use   | Streaming & Data Ingestion | Web APIs & Configuration | Analytical Queries & Storage
Storage Style | Row-oriented               | Row-oriented             | Column-oriented

How it Works

  1. Schema Definition: You define your data structure (fields and types) in a .avsc file using JSON.

  2. Serialization: The Avro library uses that schema to convert your objects into a dense binary stream.

  3. Self-Describing Files: When stored as a file (an Object Container File), Avro includes the schema in the file header so any reader knows exactly how to decode it later.

  4. RPC Support: Avro also includes a Remote Procedure Call (RPC) framework, allowing different systems to communicate using these defined schemas.

Would you like me to show you an example of an Avro schema definition in JSON?

 

Yes, I would.

 

An Avro schema is a JSON object that defines the structure of your data. Think of it as a blueprint: it tells the system exactly what fields to expect, what type of data they hold (integers, strings, etc.), and which fields are optional.

Here is an example of what a schema for a User Account looks like:

{
  "type": "record",
  "name": "User",
  "namespace": "com.example.data",
  "fields": [
    {
      "name": "username",
      "type": "string",
      "doc": "The unique name for the user"
    },
    {
      "name": "age",
      "type": "int",
      "doc": "User age in years"
    },
    {
      "name": "email",
      "type": ["null", "string"],
      "default": null,
      "doc": "Optional email address"
    }
  ]
}

Why this format matters

  • The "Union" Type: In the email field above, ["null", "string"] is called a union. This is how Avro handles optional data—it tells the reader the field can either be empty (null) or a string.

  • Documentation: The "doc" attribute allows you to embed descriptions directly into the schema, which is helpful for other developers.

  • Compactness: Because the schema is defined like this, the actual data files don't need to store the words "username" or "email" over and over again. They just store the values in a specific order, saving a massive amount of storage space.

How the Schema is used

When you send data, the Writer uses this schema to turn your data into binary. When the Reader gets that binary, they use the exact same (or a compatible) schema to turn it back into a readable object.
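
To make that handshake concrete, here is a minimal in-memory sketch using the official avro Python package (the same library used in the file-based example further down; in some older releases avro.schema.parse is spelled Parse). It serializes one dict against the schema, decodes it with the same schema, and compares the size of the Avro bytes with the equivalent JSON text. The exact byte counts are illustrative only.

import io
import json

import avro.schema
from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

# The User schema from above (email is the optional union field).
schema = avro.schema.parse("""
{
  "type": "record",
  "name": "User",
  "namespace": "com.example.data",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
""")

record = {"username": "alyssa", "age": 25, "email": None}

# Writer: the schema plus a DatumWriter turn the dict into raw Avro bytes.
buffer = io.BytesIO()
DatumWriter(schema).write(record, BinaryEncoder(buffer))
avro_bytes = buffer.getvalue()

# Reader: the same schema turns those bytes back into a dict.
buffer.seek(0)
decoded = DatumReader(schema, schema).read(BinaryDecoder(buffer))
print(decoded)

# No field names are stored in the Avro bytes, so the payload is much smaller
# than the equivalent JSON text.
print(len(avro_bytes), "bytes as Avro vs", len(json.dumps(record)), "bytes as JSON")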

Would you like to see how a simple Python or Java snippet uses this schema to save data?

 

Python.

 

To use Avro in Python, you typically use the fastavro library (which is faster) or the official avro package. Below is a simple example using the official avro library.

1. Install the Library

First, you'll need to install the package:

Bash
 
pip install avro

2. The Python Script

This script performs two main steps: Serializing (writing) the data to a binary file and Deserializing (reading) it back.

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# 1. Define the Schema
schema_json = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""
schema = avro.schema.parse(schema_json)

# 2. Write Data (Serialization)
with open("users.avro", "wb") as f:
    writer = DataFileWriter(f, DatumWriter(), schema)
    writer.append({"username": "Alyssa", "age": 25})
    writer.append({"username": "Ben", "age": 32, "email": "ben@example.com"})
    writer.close()

# 3. Read Data (Deserialization)
with open("users.avro", "rb") as f:
    reader = DataFileReader(f, DatumReader())
    for user in reader:
        print(f"Name: {user['username']}, Age: {user['age']}, Email: {user['email']}")
    reader.close()

Key Takeaways for Python Developers

  • Dictionary Mapping: Notice how Avro records map directly to Python dictionaries. This makes it very natural to use compared to formats that require complex class definitions.

  • Binary Output: If you try to open the users.avro file in a text editor, you'll see the JSON schema at the top and "gibberish" (binary data) below it. That binary encoding of the records is what keeps the file compact.

  • Schema Enforcement: If you tried to append a record with a string for the age field, Python would throw an error during the append() call, ensuring your data stays clean.
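
Since fastavro was mentioned at the top of this answer as the faster alternative, here is roughly the same write/read round trip using it. This is a sketch, assuming fastavro is installed (pip install fastavro); the output file name users_fast.avro is just an example.

import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

records = [
    {"username": "Alyssa", "age": 25, "email": None},
    {"username": "Ben", "age": 32, "email": "ben@example.com"},
]

# fastavro.writer produces an Object Container File, just like DataFileWriter above.
with open("users_fast.avro", "wb") as out:
    fastavro.writer(out, schema, records)

# fastavro.reader yields one dict per record.
with open("users_fast.avro", "rb") as fo:
    for user in fastavro.reader(fo):
        print(user)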

 

1. The Schema File (.avsc)

The .avsc (Avro Schema) file is a plain text JSON file. It contains only the "blueprint" or the rules for the data.

  • Analogy: Like a SQL CREATE TABLE statement.

  • Usage: You usually write this once and use it to tell your Python script how to handle the data.

2. The Data File (.avro)

The .avro file (often called an Object Container File) is a binary file. This is what you saw in the Python code. It contains:

  • A copy of the schema (in the header).

  • The actual data records in a compact binary format (optionally compressed with a codec such as deflate or snappy).

  • Analogy: Like the actual .csv or database file that holds the real data.
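
If you want to see this for yourself, a couple of lines of Python are enough. This sketch assumes the users.avro file from the earlier example is in the current directory; it just prints the raw start of the file, where the embedded schema is still readable.

# Peek at the start of the binary container file written earlier.
with open("users.avro", "rb") as f:
    header = f.read(300)

# The file begins with the magic bytes b"Obj\x01", followed by metadata that
# embeds the JSON schema as plain text; everything after the header is binary.
print(header)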


The Typical Workflow

In a real-world project, you wouldn't usually keep the schema as a string inside your Python code like I did in the previous example. Instead, you would do this:

  1. Save your JSON structure into a file named user_schema.avsc.

  2. In Python, load that file to prepare your writer:

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Loading the schema from the .avsc file
with open("user_schema.avsc", "r") as f:
    schema_json = f.read()
    schema = avro.schema.parse(schema_json)

# Writing the data to the .avro binary container
with open("output_data.avro", "wb") as f:
    writer = DataFileWriter(f, DatumWriter(), schema)
    # ... append data ...
    writer.close()

Summary Table

Extension | File Type   | Content                     | Human Readable?
.avsc     | JSON / Text | Only the structure/metadata | Yes
.avro     | Binary      | Schema + actual data        | No (except the header)

Would you like me to show you how to "evolve" a schema, such as adding a new field to your .avsc file without breaking your old .avro data?

 

Yes, I would.

 

Schema Evolution is one of Avro's "killer features." It allows you to change your data structure (the .avsc file) over time without breaking your ability to read old data.

In a big data environment, you might have files from 2024 with "Schema V1" and new files from 2026 with "Schema V2." Avro handles this by looking for default values to fill in the gaps.

How it Works

When you read an Avro file, two schemas are involved:

  1. Writer's Schema: The schema used when the data was originally saved (stored in the .avro file header).

  2. Reader's Schema: The new, updated schema your application is currently using.


Example: Adding a "Phone Number"

Imagine your original schema only had username and age. Now, you want to add phone_number.

New Schema (V2):

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "age", "type": "int"},
    {
      "name": "phone_number", 
      "type": ["null", "string"], 
      "default": null
    }
  ]
}
  • Reading Old Data (V1): When the Python script reads an old file that doesn't have a phone number, it sees the default: null in your new schema and simply fills it in. The code doesn't crash.

  • Reading New Data (V2): It reads the phone number normally.
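
Here is a sketch of that resolution step with the official avro package, assuming the V2 schema above has been saved to a hypothetical file named user_schema_v2.avsc and that users.avro was written earlier with the old schema. (Recent avro releases spell the keyword readers_schema; older packages used reader_schema.)

import avro.schema
from avro.datafile import DataFileReader
from avro.io import DatumReader

# Load the new (V2) reader schema from its .avsc file.
with open("user_schema_v2.avsc", "r") as f:
    v2_schema = avro.schema.parse(f.read())

# The writer's (V1) schema is read from the users.avro header automatically;
# passing readers_schema tells Avro to resolve old records against V2.
with open("users.avro", "rb") as f:
    reader = DataFileReader(f, DatumReader(readers_schema=v2_schema))
    for user in reader:
        # phone_number is missing in the old data, so it comes back as the default (None).
        # Fields that exist only in the old writer's schema are simply skipped.
        print(user["username"], user.get("phone_number"))
    reader.close()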

Rules for Safe Evolution

To ensure you don't "break" your data pipeline, follow these best practices:

  • Always provide default values: If you add a new field, give it a default (like null or 0). This allows the reader to process old data that lacks that field (the sketch after this list shows what happens without a default).

  • Deleting fields: You can remove a field from your reader's schema, and Avro will simply skip over that data when reading old records.

  • Don't change types: Apart from a few safe promotions (such as int to long or float to double), you cannot change a field's type; an int cannot simply become a string without migrating the data.
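
To see why the first rule matters, here is a small negative example: the sketch below reads the old users.avro with a reader schema whose new field has no default, which fails during schema resolution (the exact exception class varies between avro versions, so it is caught broadly here).

import avro.schema
from avro.datafile import DataFileReader
from avro.io import DatumReader

# A "bad" V2 schema: phone_number is required and has no default.
bad_v2 = avro.schema.parse("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "phone_number", "type": "string"}
  ]
}
""")

try:
    with open("users.avro", "rb") as f:
        for user in DataFileReader(f, DatumReader(readers_schema=bad_v2)):
            print(user)
except Exception as exc:  # typically a schema-resolution error
    print("Old data cannot be resolved against this schema:", exc)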

Managing Schemas at Scale

In professional environments (like using Apache Kafka), developers don't just pass .avsc files around manually. They use a Schema Registry. This is a central server that stores all versions of your schemas and ensures that any change you make is "backward compatible" before you're allowed to save it.
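
As an illustration only, here is roughly what registering a schema looks like against a Confluent-compatible Schema Registry over its REST API. The registry URL, the subject name user-value, and the file name user_schema.avsc are all assumptions for this sketch; check your registry's documentation for the real endpoints and authentication.

import json

import requests

REGISTRY_URL = "http://localhost:8081"   # hypothetical registry address
SUBJECT = "user-value"                   # hypothetical subject name

with open("user_schema.avsc", "r") as f:
    schema_str = f.read()

# POST the schema; with backward compatibility enabled, the registry rejects
# any version that cannot read data written with earlier versions.
resp = requests.post(
    f"{REGISTRY_URL}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": schema_str}),
)
resp.raise_for_status()
print("Registered schema id:", resp.json()["id"])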

 

posted on 2026-01-11 19:16  ZhangZhihuiAAA