Apache Avro is an open-source data serialization system used primarily in big data ecosystems like Hadoop and Kafka. It transforms complex data structures into a compact, binary format for efficient storage and high-speed transmission across different programming languages.
Core Features
- Schema-Based: Avro relies on schemas defined in JSON to structure data. Unlike JSON, which repeats field names in every record, Avro stores the schema separately (or once at the start of a file), making the actual data much smaller.
- Schema Evolution: It is designed to handle changes over time. You can add, remove, or rename fields (renames via aliases) without breaking existing programs, as long as specific compatibility rules are followed.
- Language Independent: It provides APIs for many languages, including Java, Python, C++, C#, and Ruby, allowing a Python program to easily read data written by a Java program.
- No Code Generation Required: While you can generate classes for static languages like Java, Avro is also designed for scripting languages that need to process data dynamically without a pre-compiled schema.
Avro vs. Other Formats
| Feature | Apache Avro | JSON | Apache Parquet |
| --- | --- | --- | --- |
| Format | Binary (Compact) | Text (Human-readable) | Binary (Columnar) |
| Schema | Required (JSON-based) | Optional | Required |
| Primary Use | Streaming & Data Ingestion | Web APIs & Configuration | Analytical Queries & Storage |
| Storage Style | Row-oriented | Row-oriented | Column-oriented |
How it Works
- Schema Definition: You define your data structure (fields and types) in a .avsc file using JSON.
- Serialization: The Avro library uses that schema to convert your objects into a dense binary stream.
- Self-Describing Files: When stored as a file (an Object Container File), Avro includes the schema in the file header so any reader knows exactly how to decode it later.
- RPC Support: Avro also includes a Remote Procedure Call (RPC) framework, allowing different systems to communicate using these defined schemas.
Would you like me to show you an example of an Avro schema definition in JSON?
Yes, I would.
An Avro schema is a JSON object that defines the structure of your data. Think of it as a blueprint: it tells the system exactly what fields to expect, what type of data they hold (integers, strings, etc.), and which fields are optional.
Here is an example of what a schema for a User Account looks like:
{
  "type": "record",
  "name": "User",
  "namespace": "com.example.data",
  "fields": [
    {
      "name": "username",
      "type": "string",
      "doc": "The unique name for the user"
    },
    {
      "name": "age",
      "type": "int",
      "doc": "User age in years"
    },
    {
      "name": "email",
      "type": ["null", "string"],
      "default": null,
      "doc": "Optional email address"
    }
  ]
}
Why this format matters
- The "Union" Type: In the email field above, ["null", "string"] is called a union. This is how Avro handles optional data: it tells the reader the field can either be empty (null) or a string.
- Documentation: The "doc" attribute allows you to embed descriptions directly into the schema, which is helpful for other developers.
- Compactness: Because the schema is defined like this, the actual data files don't need to store the words "username" or "email" over and over again. They just store the values in a specific order, saving a massive amount of storage space.
How the Schema is used
When you send data, the Writer uses this schema to turn your data into binary. When the Reader gets that binary, it uses the exact same (or a compatible) schema to turn the bytes back into a readable object.
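If it helps to picture that handshake, here is a tiny in-memory sketch (in Python, since that is the language used later in this conversation) with the official avro package; the names and the record values are just examples:
import io
import json
import avro.schema
from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

# The same User schema shown above, parsed from its JSON definition
schema = avro.schema.parse(json.dumps({
    "type": "record",
    "name": "User",
    "namespace": "com.example.data",
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}))

record = {"username": "Alyssa", "age": 25, "email": None}

# Writer side: schema + record -> compact binary bytes
buffer = io.BytesIO()
DatumWriter(schema).write(record, BinaryEncoder(buffer))
payload = buffer.getvalue()

# Reader side: the same schema turns those bytes back into a dict
decoded = DatumReader(schema).read(BinaryDecoder(io.BytesIO(payload)))

print(len(payload), "bytes as Avro vs", len(json.dumps(record)), "bytes as JSON")
print(decoded)
Notice that the binary payload carries no field names at all, which is exactly where the size savings described in the "Compactness" point come from.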
Would you like to see how a simple Python or Java snippet uses this schema to save data?
Python.
To use Avro in Python, you typically use the fastavro library (which is faster) or the official avro package. Below is a simple example using the official avro library.
1. Install the Library
First, you'll need to install the package:
pip install avro
2. The Python Script
This script performs two main steps: Serializing (writing) the data to a binary file and Deserializing (reading) it back.
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# 1. Define the Schema
schema_json = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""
schema = avro.schema.parse(schema_json)

# 2. Write Data (Serialization)
with open("users.avro", "wb") as f:
    writer = DataFileWriter(f, DatumWriter(), schema)
    writer.append({"username": "Alyssa", "age": 25})
    writer.append({"username": "Ben", "age": 32, "email": "ben@example.com"})
    writer.close()

# 3. Read Data (Deserialization)
with open("users.avro", "rb") as f:
    reader = DataFileReader(f, DatumReader())
    for user in reader:
        print(f"Name: {user['username']}, Age: {user['age']}, Email: {user['email']}")
    reader.close()
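For comparison, here is a rough sketch of the same write/read flow using the third-party fastavro library mentioned above (pip install fastavro); the file name users_fast.avro is just an example:
from fastavro import parse_schema, reader, writer

user_schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

records = [
    {"username": "Alyssa", "age": 25, "email": None},
    {"username": "Ben", "age": 32, "email": "ben@example.com"},
]

# Write an Object Container File, then read it back
with open("users_fast.avro", "wb") as out:
    writer(out, user_schema, records)

with open("users_fast.avro", "rb") as fo:
    for user in reader(fo):
        print(user)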
Key Takeaways for Python Developers
- Dictionary Mapping: Notice how Avro records map directly to Python dictionaries. This makes it very natural to use compared to formats that require complex class definitions.
- Binary Output: If you try to open the users.avro file in a text editor, you'll see a mix of the JSON schema at the top and "gibberish" (binary data) below. This is why the file is so compact.
- Schema Enforcement: If you tried to append a record with a string for the age field, Python would throw an error during the append() call, ensuring your data stays clean (see the quick check below).
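As a quick check of that last point, continuing in the same session as the script above (so schema, DataFileWriter, and DatumWriter are already defined), a bad record is rejected instead of being written; this is only a sketch and the exact exception class can vary by library version:
import io

buffer = io.BytesIO()
writer = DataFileWriter(buffer, DatumWriter(), schema)
try:
    # "age" must be an int according to the schema, so this append is rejected
    writer.append({"username": "Cara", "age": "twenty-five"})
except Exception as exc:  # the avro library raises a type/validation error here
    print(f"Rejected bad record: {exc}")
finally:
    writer.close()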
1. The Schema File (.avsc)
The .avsc (Avro Schema) file is a plain text JSON file. It contains only the "blueprint" or the rules for the data.
- Analogy: Like a SQL CREATE TABLE statement.
- Usage: You usually write this once and use it to tell your Python script how to handle the data.
2. The Data File (.avro)
The .avro file (often called an Object Container File) is a binary file. This is what you saw in the Python code. It contains:
- A copy of the schema (in the header).
- The actual data records in a compact (optionally compressed) binary format.
- Analogy: Like the actual .csv or database file that holds the real info.
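You can see both parts by peeking at the raw bytes of the users.avro file written earlier (a small sketch; it assumes that file exists in the current directory):
with open("users.avro", "rb") as f:
    header = f.read(256)

print(header[:4])                  # b'Obj\x01' -- the container-file magic bytes
print(b"avro.schema" in header)    # True: the JSON schema is embedded in the header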
The Typical Workflow
In a real-world project, you wouldn't usually keep the schema as a string inside your Python code like I did in the previous example. Instead, you would do this:
- Save your JSON structure into a file named user_schema.avsc.
- In Python, load that file to prepare your writer:
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Loading the schema from the .avsc file
with open("user_schema.avsc", "r") as f:
    schema_json = f.read()
schema = avro.schema.parse(schema_json)

# Writing the data to the .avro binary container
with open("output_data.avro", "wb") as f:
    writer = DataFileWriter(f, DatumWriter(), schema)
    # ... append data ...
Summary Table
| Extension | File Type | Content | Human Readable? |
| --- | --- | --- | --- |
| .avsc | JSON / Text | Only the structure/metadata | Yes |
| .avro | Binary | Schema + Actual Data | No (except the header) |
Would you like me to show you how to "evolve" a schema, such as adding a new field to your .avsc file without breaking your old .avro data?
Yes, I would.
Schema Evolution is one of Avro's "killer features." It allows you to change your data structure (the .avsc file) over time without breaking your ability to read old data.
In a big data environment, you might have files from 2024 with "Schema V1" and new files from 2026 with "Schema V2." Avro handles this by looking for default values to fill in the gaps.
How it Works
When you read an Avro file, you provide two schemas:
- Writer's Schema: The schema used when the data was originally saved (stored in the .avro file header).
- Reader's Schema: The new, updated schema your application is currently using.
Example: Adding a "Phone Number"
Imagine your original schema only had username and age. Now, you want to add phone_number.
New Schema (V2):
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "age", "type": "int"},
    {
      "name": "phone_number",
      "type": ["null", "string"],
      "default": null
    }
  ]
}
- Reading Old Data (V1): When the Python script reads an old file that doesn't have a phone number, it sees the default: null in your new schema and simply fills it in. The code doesn't crash (see the sketch below).
- Reading New Data (V2): It reads the phone number normally.
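Here is a sketch of that read path using fastavro, which lets you pass the new schema as an explicit reader schema; the file name users_v1.avro is hypothetical and stands for a file written with the original V1 schema (username and age only):
from fastavro import parse_schema, reader

schema_v2 = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "phone_number", "type": ["null", "string"], "default": None},
    ],
})

with open("users_v1.avro", "rb") as fo:
    for user in reader(fo, reader_schema=schema_v2):
        # Old records come back with phone_number filled in from the default
        print(user["username"], user["phone_number"])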
Rules for Safe Evolution
To ensure you don't "break" your data pipeline, follow these best practices:
- Always provide default values: If you add a new field, give it a default (like null or 0). This allows the reader to process old data that lacks that field.
- Deleting fields: You can remove a field from your reader's schema, and Avro will simply skip over that data when reading old records.
- Don't change types: You generally cannot change a field from an int to a string without complex mapping.
Managing Schemas at Scale
In professional environments (like using Apache Kafka), developers don't just pass .avsc files around manually. They use a Schema Registry. This is a central server that stores all versions of your schemas and ensures that any change you make is "backward compatible" before you're allowed to save it.
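To give a feel for it, here is an illustrative sketch of registering a schema version with a Confluent-style Schema Registry over its REST API; the registry URL, the subject name users-value, and the use of the requests library are all assumptions for the example:
import requests

REGISTRY_URL = "http://localhost:8081"   # assumed local registry
SUBJECT = "users-value"                  # assumed subject name

with open("user_schema.avsc") as f:
    schema_str = f.read()

# Register a new schema version; the registry rejects it if it is not
# compatible with the versions already stored under this subject.
resp = requests.post(
    f"{REGISTRY_URL}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": schema_str},
)
print(resp.json())   # e.g. {"id": 1} -- the schema's global id in the registry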
