ZhangZhihui's Blog  

In Hive, what file formats support ACID?

 

In Hive, the file formats that support ACID (Atomicity, Consistency, Isolation, Durability) transactions depend on which "table type" you are using.

Hive has evolved from a legacy transactional system to a modern one integrated with Apache Iceberg, which significantly changes the answer.

1. Modern Hive (Integrated with Iceberg)

If you are using the modern Iceberg-based tables (like the code in your first question), you have much more flexibility. Iceberg manages the ACID properties at the table metadata level, which allows it to support multiple underlying formats:

  • ORC (Optimized Row Columnar)

  • Parquet

  • Avro

In this setup, you can perform full CRUD (Create, Read, Update, Delete) operations regardless of whether the physical files are Parquet or ORC.


2. Legacy "Native" Hive ACID Tables

Before Iceberg integration, Hive had its own internal ACID implementation (often called "Native ACID" or "CRUD Tables"). In this legacy mode, support is extremely restricted:

Format Full ACID (Updates/Deletes) Insert-Only ACID
ORC Supported (Only format supported) Supported
Parquet Not Supported Supported (Hive 3.x+)
Avro Not Supported Supported (Hive 3.x+)
TextFile Not Supported Supported (Hive 3.x+)

Why was it restricted to ORC?

Legacy Hive ACID relies on a specific internal structure where every row is assigned a hidden ROW__ID. This ID tracks the original transaction, bucket, and row number.5 Because ORC was developed alongside Hive, it was the only format built to store and index these hidden ACID columns efficiently.


Which one should you use?

If you are starting a new project in Hive 4.x, the industry standard is to use Iceberg with Parquet or ORC.

  • Choose ORC if your entire pipeline is Hive-centric, as it often provides the best compression and "Type-Native" ordering performance within Hive.

  • Choose Parquet if you need to share that data with other engines like Spark, Presto, or Trino, as Parquet has broader ecosystem support.

 

Starting with Hive 4.0.0, Iceberg is a first-class citizen.

 

Updates can only be performed on tables that support ACID.

Deletes can only be performed on tables that support ACID. 

Merge can only be performed on tables that support ACID.

 

The MERGE statement in HiveQL allows you to perform updates, deletes, and inserts in a single atomic operation. This is a hallmark of ACID functionality, commonly used for synchronizing a "source" table (new data) with a "target" table (existing master data).

In Hive 4.2.0, MERGE is highly optimized for Apache Iceberg tables, making it much more efficient than the legacy Hive ACID implementation.


Basic Syntax Structure

The MERGE statement follows a standard SQL pattern:

MERGE INTO target_table AS T
USING source_table AS S
ON T.id = S.id
WHEN MATCHED [AND extra_condition] THEN
  UPDATE SET T.col = S.col
WHEN MATCHED [AND extra_condition] THEN
  DELETE
WHEN NOT MATCHED [AND extra_condition] THEN
  INSERT (id, col) VALUES (S.id, S.col);

Key Clauses Explained

  • ON Clause: Defines the join condition. It determines if a row in the source already exists in the target.

  • WHEN MATCHED: Defines what to do when a record exists in both tables (usually an UPDATE or DELETE).

  • WHEN NOT MATCHED: Defines what to do when a record exists in the source but not the target (usually an INSERT).


Why use MERGE in Hive 4.2.0 with Iceberg?

Because you are using Iceberg, the MERGE operation is significantly more powerful than in older versions of Hive:

  1. Schema Evolution: If your source data has new columns, Iceberg can often handle the merge even if the target schema needs to evolve.

  2. Snapshot Isolation: While the MERGE is running, other users can still read the table. They will see the "old" data until the exact moment the MERGE commits.

  3. Position Deletes: Instead of rewriting entire data files (which is slow), Hive 4 + Iceberg can write small "delete files" that mark old rows as deleted, making the operation much faster.


Prerequisites for MERGE

To run a MERGE operation successfully in your environment, ensure the following:

  • ACID Enabled: The target table must be an Iceberg table or a Legacy ACID table.

  • Bucketing (Legacy Only): If you aren't using Iceberg, the legacy table must be bucketed. (With Iceberg, this is not required).

  • Tez Engine: MERGE operations are optimized for the Tez engine; ensure set hive.execution.engine=tez; is active.


Practical Example: Upserting Events

If you have your events table and a new batch of corrected data (events_staging), you can "upsert" (Update + Insert) like this:

MERGE INTO events AS T
USING events_staging AS S
ON T.event_id = S.event_id
WHEN MATCHED THEN
  UPDATE SET T.event_date = S.event_date, T.event_type = S.event_type
WHEN NOT MATCHED THEN
  INSERT VALUES (S.event_id, S.event_date, S.event_type);

 

posted on 2026-01-14 18:02  ZhangZhihuiAAA  阅读(3)  评论(0)    收藏  举报