In Hive, what file formats support ACID?
In Hive, the file formats that support ACID (Atomicity, Consistency, Isolation, Durability) transactions depend on which "table type" you are using.
Hive has evolved from a legacy transactional system to a modern one integrated with Apache Iceberg, which significantly changes the answer.
1. Modern Hive (Integrated with Iceberg)
If you are using the modern Iceberg-based tables (like the code in your first question), you have much more flexibility. Iceberg manages the ACID properties at the table metadata level, which allows it to support multiple underlying formats:
-
ORC (Optimized Row Columnar)
-
Parquet
-
Avro
In this setup, you can perform full CRUD (Create, Read, Update, Delete) operations regardless of whether the physical files are Parquet or ORC.
2. Legacy "Native" Hive ACID Tables
Before Iceberg integration, Hive had its own internal ACID implementation (often called "Native ACID" or "CRUD Tables"). In this legacy mode, support is extremely restricted:
| Format | Full ACID (Updates/Deletes) | Insert-Only ACID |
| ORC | Supported (Only format supported) | Supported |
| Parquet | Not Supported | Supported (Hive 3.x+) |
| Avro | Not Supported | Supported (Hive 3.x+) |
| TextFile | Not Supported | Supported (Hive 3.x+) |
Why was it restricted to ORC?
Legacy Hive ACID relies on a specific internal structure where every row is assigned a hidden ROW__ID. This ID tracks the original transaction, bucket, and row number.5 Because ORC was developed alongside Hive, it was the only format built to store and index these hidden ACID columns efficiently.
Which one should you use?
If you are starting a new project in Hive 4.x, the industry standard is to use Iceberg with Parquet or ORC.
-
Choose ORC if your entire pipeline is Hive-centric, as it often provides the best compression and "Type-Native" ordering performance within Hive.
-
Choose Parquet if you need to share that data with other engines like Spark, Presto, or Trino, as Parquet has broader ecosystem support.
Starting with Hive 4.0.0, Iceberg is a first-class citizen.
Updates can only be performed on tables that support ACID.
Deletes can only be performed on tables that support ACID.
Merge can only be performed on tables that support ACID.
The MERGE statement in HiveQL allows you to perform updates, deletes, and inserts in a single atomic operation. This is a hallmark of ACID functionality, commonly used for synchronizing a "source" table (new data) with a "target" table (existing master data).
In Hive 4.2.0, MERGE is highly optimized for Apache Iceberg tables, making it much more efficient than the legacy Hive ACID implementation.
Basic Syntax Structure
The MERGE statement follows a standard SQL pattern:
MERGE INTO target_table AS T
USING source_table AS S
ON T.id = S.id
WHEN MATCHED [AND extra_condition] THEN
UPDATE SET T.col = S.col
WHEN MATCHED [AND extra_condition] THEN
DELETE
WHEN NOT MATCHED [AND extra_condition] THEN
INSERT (id, col) VALUES (S.id, S.col);
Key Clauses Explained
-
ONClause: Defines the join condition. It determines if a row in the source already exists in the target. -
WHEN MATCHED: Defines what to do when a record exists in both tables (usually anUPDATEorDELETE). -
WHEN NOT MATCHED: Defines what to do when a record exists in the source but not the target (usually anINSERT).
Why use MERGE in Hive 4.2.0 with Iceberg?
Because you are using Iceberg, the MERGE operation is significantly more powerful than in older versions of Hive:
-
Schema Evolution: If your source data has new columns, Iceberg can often handle the merge even if the target schema needs to evolve.
-
Snapshot Isolation: While the
MERGEis running, other users can still read the table. They will see the "old" data until the exact moment theMERGEcommits. -
Position Deletes: Instead of rewriting entire data files (which is slow), Hive 4 + Iceberg can write small "delete files" that mark old rows as deleted, making the operation much faster.
Prerequisites for MERGE
To run a MERGE operation successfully in your environment, ensure the following:
-
ACID Enabled: The target table must be an Iceberg table or a Legacy ACID table.
-
Bucketing (Legacy Only): If you aren't using Iceberg, the legacy table must be bucketed. (With Iceberg, this is not required).
-
Tez Engine:
MERGEoperations are optimized for the Tez engine; ensureset hive.execution.engine=tez;is active.
Practical Example: Upserting Events
If you have your events table and a new batch of corrected data (events_staging), you can "upsert" (Update + Insert) like this:
MERGE INTO events AS T
USING events_staging AS S
ON T.event_id = S.event_id
WHEN MATCHED THEN
UPDATE SET T.event_date = S.event_date, T.event_type = S.event_type
WHEN NOT MATCHED THEN
INSERT VALUES (S.event_id, S.event_date, S.event_type);

浙公网安备 33010602011771号