ZhangZhihui's Blog  

Avro is a data serialization framework developed within the Apache Hadoop ecosystem, widely used in Kafka and big data systems. It provides a compact, fast, and schema-based way to serialize structured data. Let’s go through it carefully.


1️⃣ What Avro Is

  • Data serialization system: Converts data structures (records, objects) into a compact binary or JSON format for storage or transmission.

  • Schema-based: Every Avro data file or message has a schema describing the fields, their types, and structure.

  • Language-neutral: Supports many programming languages: Java, Python, C, C++, Go, etc.

  • Used in Kafka: Often used with Kafka producers/consumers to encode messages in a structured, versioned way.


2️⃣ Key Features

| Feature | Description |
| --- | --- |
| Schema evolution | Fields can be added or removed without breaking consumers; schemas are versioned. |
| Compact | The binary format is smaller than JSON or XML. |
| Fast | Serialization and deserialization are optimized. |
| Language-neutral | The same Avro data can be read and written from different languages. |
| Self-describing with a schema registry | With Confluent Schema Registry, schemas are stored separately and messages carry only a schema ID. |
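To make schema evolution concrete, here is a toy sketch (plain Python, not the Avro library's actual implementation) of the resolution rule behind it: when the reader's schema has a field the writer's data lacks, the field's declared default fills the gap. The `email` field and its default are made-up example values.

```python
# Toy sketch of Avro's schema-resolution rule for an added field.
# (Illustration only -- the real logic lives inside the Avro library.)

def resolve(record, reader_fields):
    """Fill in defaults for fields the writer's data doesn't have."""
    resolved = {}
    for field in reader_fields:
        if field["name"] in record:
            resolved[field["name"]] = record[field["name"]]
        elif "default" in field:
            resolved[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return resolved

# Old data, written before the "email" field existed
old_record = {"name": "Alice", "age": 30}

# Newer reader schema adds "email" with a default, so old data still resolves
reader_fields = [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": "string", "default": ""},
]

print(resolve(old_record, reader_fields))
# {'name': 'Alice', 'age': 30, 'email': ''}
```

This is why adding a field is only backward compatible if you give it a default: without one, old records cannot be resolved against the new schema.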

3️⃣ How Avro Works

a) Schema

An Avro schema is usually written in JSON, for example:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}

b) Serialization

  • The data is converted to a compact binary representation according to the schema.

  • Example in Python:

import io

import avro.schema
from avro.io import DatumWriter, BinaryEncoder

# Parse the schema from the .avsc file shown above
schema = avro.schema.parse(open("user.avsc").read())

writer = DatumWriter(schema)
bytes_writer = io.BytesIO()
encoder = BinaryEncoder(bytes_writer)

# Encode a record that matches the schema
writer.write({"name": "Alice", "age": 30}, encoder)
data_bytes = bytes_writer.getvalue()

c) Deserialization

  • To read Avro data, you use the same (or compatible) schema:

from avro.io import DatumReader, BinaryDecoder

# Reuses `schema` and `data_bytes` from the serialization example above
bytes_reader = io.BytesIO(data_bytes)
decoder = BinaryDecoder(bytes_reader)
reader = DatumReader(schema)
record = reader.read(decoder)
print(record)  # {'name': 'Alice', 'age': 30}

 


4️⃣ Avro in Kafka

  • Avro is widely used in Kafka pipelines because:

    1. It provides strongly typed, schema-based messages.

    2. Schemas can be stored in Confluent Schema Registry, so producers and consumers can evolve independently.

    3. Its binary encoding saves space compared with JSON.

Typical flow:

Producer (Avro serialize) --> Kafka Topic --> Consumer (Avro deserialize)
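On the wire, Confluent's Avro serializers frame each message as a 1-byte magic marker (0), a 4-byte big-endian schema ID, and then the Avro binary payload; the consumer reads the ID, fetches that schema from the registry, and decodes the rest. The framing itself can be sketched in plain Python (schema ID 42 is a placeholder here; real IDs are assigned by the registry):

```python
import struct

MAGIC_BYTE = 0  # Confluent's wire-format marker

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """What a producer's Avro serializer prepends before sending."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes) -> tuple[int, bytes]:
    """What a consumer does first: recover the schema ID, then decode
    the remaining bytes with the schema fetched from the registry."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a Confluent-framed Avro message")
    return schema_id, message[5:]

msg = frame(42, b"\x0aAlice\x3c")  # Avro-encoded {"name": "Alice", "age": 30}
schema_id, payload = unframe(msg)
print(schema_id, len(payload))  # 42 7
```

Because the message carries only a 5-byte header instead of the full schema, producers and consumers can evolve their schemas independently while the registry enforces compatibility.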

 


5️⃣ Advantages over JSON

| Feature | JSON | Avro |
| --- | --- | --- |
| Size | Larger | Smaller (binary) |
| Schema | Implicit | Explicit (validated) |
| Performance | Slower | Faster |
| Evolution | Hard | Supported (forward/backward compatible) |
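The size difference is easy to see by hand. Avro's binary format writes an int as a zigzag-encoded varint and a string as a length prefix plus UTF-8 bytes, with no field names, quotes, or braces. A minimal sketch of that encoding for the `User` record above (the real encoder is the Avro library's, of course):

```python
import json

def zigzag_varint(n: int) -> bytes:
    """Avro's int encoding: zigzag, then variable-length base-128."""
    z = (n << 1) ^ (n >> 63)          # zigzag keeps small magnitudes small
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)   # set continuation bit
        else:
            out.append(byte)
            return bytes(out)

def encode_user(name: str, age: int) -> bytes:
    encoded = name.encode("utf-8")
    # A record is just its fields concatenated in schema order:
    # string (length + bytes), then int
    return zigzag_varint(len(encoded)) + encoded + zigzag_varint(age)

avro_bytes = encode_user("Alice", 30)
json_bytes = json.dumps({"name": "Alice", "age": 30}).encode("utf-8")

print(len(avro_bytes), len(json_bytes))  # 7 28
```

Seven bytes versus twenty-eight for the same record, and the gap widens as records grow, since JSON repeats every field name in every message.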

✅ Summary

  • Avro = schema-based serialization format for structured data.

  • Ensures compact, fast, cross-language, and versioned data exchange.

  • Perfect for Kafka messaging, data lakes, and big data pipelines.

 

posted on 2025-12-02 19:34  ZhangZhihuiAAA