An ORC (Optimized Row Columnar) file is a self-describing, type-aware columnar storage format designed for Hadoop workloads. Originally created for Apache Hive, it has become a standard for big data processing because it significantly improves performance when reading, writing, and processing data.

Think of it as a highly efficient way to organize a massive spreadsheet so that a computer can find specific information without having to read every single line.


How ORC Works: The Structure

Standard files (like CSV) store data row-by-row. If you only need to see "Total Sales," the computer still has to read the "Customer Name," "Address," and "Date" for every single row.

ORC flips this. It groups data into Stripes, and within those stripes, it stores data in Columns.

Key Components:

  • Stripes: The body of the file. Each stripe (usually 64MB–250MB) acts as an independent chunk of rows stored in a columnar format.

  • File Footer: Contains a list of stripes, the number of rows per stripe, and data types.

  • Postscript: A small section at the very end of the file that contains metadata like the compression parameters and the size of the Footer.

  • Index Data: Stored within each stripe, this includes min/max values for each column, allowing the system to "skip" entire blocks of data that don't match a query.
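
To make the structure concrete, here is a minimal Python sketch using pyarrow (assuming a build with ORC support; the file name and columns are made up for illustration). Writing a small table produces a real ORC file whose footer already knows its stripe count, row count, and schema:

import pyarrow as pa
import pyarrow.orc as orc   # requires a pyarrow build with ORC support

# Build a tiny in-memory table and write it out as an ORC file.
table = pa.table({
    "user_id": [1001, 1002, 1003],
    "total_price": [45.99, 120.50, 12.00],
    "status": ["Shipped", "Pending", "Shipped"],
})
orc.write_table(table, "orders.orc")          # hypothetical output path

# Reopen it: the stripe count, row count, and schema all come from the footer.
f = orc.ORCFile("orders.orc")
print("stripes:", f.nstripes, "rows:", f.nrows)
print(f.schema)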


Why Use ORC? (The Benefits)

Feature              Benefit
Columnar Storage     Reads only the specific columns needed for a query, reducing I/O.
High Compression     Combines type-aware encodings (run-length, dictionary) with a general-purpose codec such as Zlib or Snappy, leading to much smaller file sizes than CSV or JSON.
Predicate Pushdown   Uses the built-in indexes to skip stripes that don't contain the relevant data (e.g., "skip all stripes where Year < 2023").
ACID Support         Supports transactional updates (Insert/Update/Delete) in Hive.
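
Here is a small PySpark sketch of predicate pushdown in practice. The path and column names are assumptions for illustration; in Spark, ORC filter pushdown is governed by the spark.sql.orc.filterPushdown setting.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-pushdown-demo").getOrCreate()

# Spark hands the filter to the ORC reader, which checks each stripe's
# min/max statistics and skips stripes that cannot possibly match.
orders = spark.read.orc("/data/orders_orc")   # hypothetical directory of ORC files
recent = orders.filter("year >= 2023").select("total_price")
recent.show()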

ORC vs. Parquet

Both are columnar formats, and the choice often depends on your ecosystem:

  • ORC: Best if you are heavily invested in the Apache Hive or Hortonworks/Cloudera ecosystem. It typically compresses slightly better than Parquet and has strong native support for Hive's ACID transactions.

  • Parquet: Best for Apache Spark or Amazon Redshift users. It is generally considered more "universal" across different data processing tools.


Summary of Use Cases

  • Data Warehousing: Storing massive historical datasets.

  • High-Speed Analytics: When you need to run complex SQL queries on billions of rows.

  • Reducing Costs: Smaller file sizes mean lower storage costs in cloud environments like S3 or Azure Data Lake.

 

Could you give me an example of an ORC file so that I can see what its data looks like?

 

Since ORC is a binary format, you cannot open it in a text editor like Notepad to see the data—it would just look like garbled symbols. To "see" the data, you have to use a tool to decode it.

Here is what the data looks like from two perspectives: the Logical View (what you see when you query it) and the Internal Metadata View (what the computer sees to optimize the search).


1. The Logical View (The "Table" View)

When you query an ORC file using a tool like Apache Hive, Presto, or Spark, it looks just like a standard table.

user_id   order_date   total_price   status
1001      2026-01-10   45.99         Shipped
1002      2026-01-11   120.50        Pending
1003      2026-01-11   12.00         Shipped
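
You can reproduce this view locally with pyarrow (a sketch assuming a recent version with ORC support, the hypothetical orders.orc file from the earlier example, and pandas installed for display):

import pyarrow.orc as orc

# Decode the binary file back into an in-memory table; converting to pandas
# gives exactly the rows-and-columns view shown above.
table = orc.ORCFile("orders.orc").read()      # hypothetical path from the earlier sketch
print(table.to_pandas())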

2. The Internal View (Metadata & Statistics)

If you use a utility like orc-tools to peek inside the file's "brain," you'll see the metadata that makes ORC so fast. This is how the file organizes itself into Stripes and Indexes.

Metadata example (simplified, illustrative JSON in the spirit of what orc-tools meta reports):

{
  "rows": 10000,
  "compression": "ZLIB",
  "schema": "struct<user_id:int,order_date:timestamp,total_price:double,status:string>",
  "stripes": [
    {
      "stripeNumber": 0, "rows": 5000, "offset": 3,
      "columnStatistics": [
        { "column": "total_price", "min": 5.00, "max": 250.00, "sum": 45000.00 },
        { "column": "order_date", "min": "2026-01-01", "max": "2026-01-05" }
      ]
    }
  ]
}
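
If the Java orc-tools jar isn't handy, pyarrow exposes similar footer information programmatically. This is only a sketch, reusing the hypothetical orders.orc file from the earlier example:

import pyarrow.orc as orc

f = orc.ORCFile("orders.orc")                 # hypothetical path

# Footer-level metadata: the schema, total row count, and number of stripes.
print(f.schema)
print("rows:", f.nrows, "stripes:", f.nstripes)

# Each stripe is independently readable, which is what lets engines
# parallelize work and skip whole stripes during a query.
first_stripe = f.read_stripe(0)
print("rows in stripe 0:", first_stripe.num_rows)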

3. Visualizing the Columnar Storage

If you could "see" the raw bytes on the disk, they wouldn't be grouped by customer. Instead, they are grouped by category.

Instead of:

[1001, 2026-01-10, Shipped], [1002, 2026-01-11, Pending]

ORC stores it like this:

  • Column 1 (IDs): 1001, 1002, 1003...

  • Column 2 (Dates): 2026-01-10, 2026-01-11, 2026-01-11...

  • Column 3 (Status): Shipped, Pending, Shipped...

Why this matters: If you ask the computer, "What is the average total_price?", it only touches the "Price" block of memory. It completely ignores the "IDs," "Dates," and "Status" blocks, which is why it's so much faster than reading a whole CSV.
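
Here is what that column pruning looks like from Python, again a sketch using pyarrow and the hypothetical orders.orc file; only the total_price stream is decoded:

import pyarrow.orc as orc
import pyarrow.compute as pc

# Ask the reader for a single column; the other column streams are never
# decompressed or decoded.
prices = orc.ORCFile("orders.orc").read(columns=["total_price"])   # hypothetical path
print("average price:", pc.mean(prices.column("total_price")))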

 

 

A Parquet file is an open-source, columnar storage format designed for efficient data processing in the Hadoop ecosystem and modern cloud data warehouses. While ORC is very popular in the Hive world, Parquet is the industry standard for Apache Spark.

It is designed to handle complex data at scale while minimizing storage space and maximizing query speed.


1. How Parquet Stores Data

Like ORC, Parquet uses columnar storage. In a traditional format (CSV), data is stored row by row. In Parquet, data is grouped into Row Groups, and within those groups, each column is stored separately.

The Hierarchy:

  • Row Groups: A horizontal partitioning of the data into chunks.

  • Column Chunks: Within a row group, the data for a specific column is stored together.

  • Pages: Column chunks are divided into pages (typically around 1 MB), which are the smallest unit of encoding and compression.
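
A small pyarrow sketch shows the row-group layer in action; the file name and sizes are made up, and row_group_size is the knob that controls the horizontal partitioning described above:

import pyarrow as pa
import pyarrow.parquet as pq

# One million synthetic transactions.
table = pa.table({
    "transaction_id": list(range(1, 1_000_001)),
    "amount": [round((i % 5000) + 0.99, 2) for i in range(1_000_000)],
})

# row_group_size controls the horizontal partitioning: 200,000 rows per
# group yields 5 row groups for 1,000,000 rows.
pq.write_table(table, "transactions.parquet", row_group_size=200_000)   # hypothetical path

pf = pq.ParquetFile("transactions.parquet")
print("row groups:", pf.metadata.num_row_groups, "rows:", pf.metadata.num_rows)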


2. Why Parquet is Popular

  • Efficiency: Because it's columnar, if you only need 2 columns out of 100, the system only reads those 2 columns from the disk.

  • Metadata Power: Each Parquet file contains a footer with min/max statistics for every column in every row group.

    • Example: If you search for "Sale Price > 1000", Parquet looks at the metadata and skips entire row groups where the "max" price is only 500.

  • Compression: Since data in a column is of the same type (e.g., all integers), it compresses much better than mixed-row data.

  • Universal Compatibility: It works seamlessly with Spark, AWS Athena, Google BigQuery, Snowflake, and Azure Synapse.
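
To see the min/max skipping from the list above in code, here is a sketch using pyarrow's filters option on read_table, reusing the hypothetical transactions.parquet file. The exact skipping behavior varies by engine, but they all lean on the same row-group statistics:

import pyarrow.parquet as pq

# The filter is checked against each row group's min/max statistics first,
# so row groups whose maximum "amount" is below 1000 can be skipped outright.
expensive = pq.read_table(
    "transactions.parquet",                  # hypothetical path from the sketch above
    columns=["transaction_id", "amount"],    # column pruning
    filters=[("amount", ">", 1000)],         # row-group skipping + row filtering
)
print(expensive.num_rows)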


3. Parquet vs. ORC: The Key Differences

Feature          Apache Parquet                                   Apache ORC
Best Used With   Apache Spark, AWS Athena, Delta Lake             Apache Hive, Trino (Presto)
Data Model       Better for "nested" data (JSON-like structures)  Better for flat or simple nested data
ACID Support     Limited (usually handled by Delta Lake)          Strong native support in Hive
Ecosystem        Very broad (standard for cloud providers)        Specific to the Hadoop/Hive stack

4. What a Parquet File "Looks" Like

If you were to inspect the metadata of a Parquet file using a tool like parquet-tools, you would see a detailed breakdown of the schema and the storage stats:

# Parquet File Metadata
Creator: parquet-cpp-arrow version 14.0.0
Rows: 1,000,000
Number of Row Groups: 5
Column: 'transaction_id' (INT64) [Min: 1, Max: 1000000]
Column: 'amount' (DOUBLE) [Min: 0.99, Max: 5000.00]
Column: 'city' (BINARY/UTF8) [Encodings: PLAIN, RLE_DICTIONARY]
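
The same information is available programmatically; this sketch with pyarrow (reusing the hypothetical transactions.parquet file) walks the footer and prints the per-column statistics of the first row group:

import pyarrow.parquet as pq

md = pq.ParquetFile("transactions.parquet").metadata   # hypothetical path

print("rows:", md.num_rows, "row groups:", md.num_row_groups)

# Per-column statistics of the first row group: the min/max values that
# query engines consult to decide whether the group can be skipped.
rg = md.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    if col.statistics is not None:
        print(col.path_in_schema, col.statistics.min, col.statistics.max)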

Fun Fact: Parquet uses Dictionary Encoding. If you have a "City" column with 1 million rows but only 50 unique cities, Parquet stores the 50 names once and uses a tiny numeric index for the rest, saving massive amounts of space.
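
A quick way to see the effect is to write the same column twice, once with dictionary encoding and once without, and compare file sizes. This sketch uses pyarrow's use_dictionary option; the paths are made up and the exact savings depend on the data and the compression codec:

import os
import random
import pyarrow as pa
import pyarrow.parquet as pq

# One million rows, but only 50 distinct city names.
cities = [f"City_{i:02d}" for i in range(50)]
table = pa.table({"city": [random.choice(cities) for _ in range(1_000_000)]})

# Same data, written with and without dictionary encoding (hypothetical paths).
pq.write_table(table, "city_dict.parquet", use_dictionary=True)
pq.write_table(table, "city_plain.parquet", use_dictionary=False)

print("with dictionary:   ", os.path.getsize("city_dict.parquet"), "bytes")
print("without dictionary:", os.path.getsize("city_plain.parquet"), "bytes")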

 
