
In Hive, a Skewed Table is a special type of table designed to handle "data skew"—a situation where one or two specific values appear much more frequently than others in a dataset.

 

Normally, Hive's hash partitioning spreads keys fairly evenly across reducers. However, if 90% of your data shares the same ID (e.g., a "Guest" user ID in a retail dataset), the one reducer handling that ID will work for hours while the other reducers finish in seconds. Skewed tables solve this by physically separating those "heavy" values into their own files or directories.


How Skewed Tables Work

When you define a table as skewed, Hive changes its storage strategy:

  1. Separation: The frequently occurring values (the "skewed" keys) are written into their own dedicated files.

  2. The Remainder: All other "normal" values are grouped together in a separate file.

  3. Optimization: During a query, Hive’s execution engine (especially in joins) knows exactly where the heavy data is. It can skip files or process the heavy and light data using different strategies, preventing a single reducer from becoming a bottleneck.


How to Create a Skewed Table

You define a skewed table using the SKEWED BY clause. You must tell Hive which column is skewed and exactly which values are the culprits.

CREATE TABLE orders (
    order_id INT,
    customer_id INT,
    amount DOUBLE
)
SKEWED BY (customer_id) ON (101, 102) -- Tell Hive IDs 101 and 102 are huge
STORED AS DIRECTORIES;               -- Optional: Creates separate sub-directories

In this example:

  • Data for customer_id = 101 goes into its own folder.

  • Data for customer_id = 102 goes into its own folder.

  • All other customers go into a default "others" folder.
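
With STORED AS DIRECTORIES, the resulting HDFS layout looks roughly like the sketch below (the paths are illustrative, and the exact default-directory name varies by Hive version and configuration):

/warehouse/orders/customer_id=101/    <- dedicated directory for heavy key 101
/warehouse/orders/customer_id=102/    <- dedicated directory for heavy key 102
/warehouse/orders/HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME/    <- all other keys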

You can also skew on multiple columns by listing tuples of values (STORED AS DIRECTORIES remains optional):

CREATE TABLE list_bucket_multiple (col1 STRING, col2 INT, col3 STRING)
  SKEWED BY (col1, col2) ON (('s1', 1), ('s3', 3), ('s13', 13), ('s78', 78))
  STORED AS DIRECTORIES;
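
Skew information is part of the table metadata, so you can also add it to (or remove it from) an existing table with ALTER TABLE; it affects data written afterwards:

-- Mark an existing table as skewed
ALTER TABLE orders SKEWED BY (customer_id) ON (101, 102) STORED AS DIRECTORIES;

-- Remove the skew information again
ALTER TABLE orders NOT SKEWED;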

 

Benefits of Skewed Tables

  • Faster Joins: Hive can use "Skew Join" optimization. It can process the skewed values via a Map-side join (loading the small matching dimension table into memory) while the rest of the data undergoes a standard Shuffle join.

  • Better Resource Usage: It prevents "Straggler Tasks"—those 99% complete jobs that hang for an hour because one reducer is overloaded.

  • Reduced I/O: Hive can skip entire directories when your query's filter only needs some of them, as in the sketch below.
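
For example, with the orders table above stored as directories and list bucketing enabled, a point lookup on a heavy key can, in principle, be served from that key's dedicated directory alone (a sketch; actual pruning depends on your Hive version and settings):

-- Can read only the directory for customer_id = 101,
-- skipping the "others" directory entirely
SELECT order_id, amount
FROM orders
WHERE customer_id = 101;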

When should you use them?

You should only use skewed tables if:

  1. You have a huge disparity in data (e.g., one value has millions of rows while others have hundreds).

  2. You know the skewed values in advance (you have to hard-code them into the DDL).

  3. You are experiencing performance issues during joins or aggregations on that specific column.

Would you like me to show you the Beeline/Hive settings you need to enable so that the execution engine actually uses these skewed optimizations during your joins?

 

Yes.

 

To get the most out of skewed tables, you need to enable specific settings in your Hive session (or within Beeline). Defining the table as skewed is only half the battle; these settings tell the optimizer to actually change its execution plan when it encounters those heavy values.

1. Enable Skew Join Optimization

This is the most important setting. It tells Hive to split a join into two parts: one for the skewed keys and one for everything else.

-- Enable the optimization
SET hive.optimize.skewjoin=true;

-- Set the threshold for what counts as "skewed"
-- (Example: 100,000 rows. Keys with more rows than this are treated as skewed)
SET hive.skewjoin.key=100000;
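
With these enabled, you write an ordinary join and the optimizer handles the split itself. A minimal sketch, assuming a hypothetical customers dimension table:

-- Keys whose row counts exceed hive.skewjoin.key are handled via an
-- in-memory map join; all other keys take the normal shuffle path.
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;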

2. List Bucketing Optimization

If you used the STORED AS DIRECTORIES clause when creating your skewed table, enable the list-bucketing optimization as well. It lets Hive "prune" (skip) the directories that cannot contain the data you are looking for.

SET hive.optimize.listbucketing=true;
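
When table data lives in sub-directories like this, two more settings are commonly needed so the input format recurses into them (verify against your Hive and Hadoop versions):

SET hive.mapred.supports.subdirectories=true;
SET mapred.input.dir.recursive=true;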

How "Skew Join" Works Internally

When these settings are on, Hive performs a "Map-side Join" for the skewed keys.

  1. The Heavy Keys: Hive takes the skewed keys (like customer_id = 101) and joins them by loading the small table into memory (Map Join). This avoids the slow "Shuffle" phase for the biggest chunk of data.

  2. The Rest: The remaining "normal" data is joined using a standard Shuffle Join (Map-Reduce).

  3. Union: Hive then combines (unions) the results of both paths into a single result set.
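
Conceptually, the rewrite resembles the query below. This is only an illustration of the plan, not something you write yourself; it reuses the hypothetical customers table from earlier:

-- Path 1: the heavy keys, joined with the small side held in memory
SELECT /*+ MAPJOIN(c) */ o.order_id, o.amount, c.customer_name
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
WHERE o.customer_id IN (101, 102)

UNION ALL

-- Path 2: all remaining keys, joined with a normal shuffle join
SELECT o.order_id, o.amount, c.customer_name
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
WHERE o.customer_id NOT IN (101, 102);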


Comparison: Skewed Tables vs. Standard Tables

| Feature | Standard Table | Skewed Table |
|---|---|---|
| Data Layout | All data mixed in files | Heavy values isolated in dedicated files |
| Reducer Load | One reducer gets overloaded (straggler) | Work is balanced across many tasks |
| Join Strategy | Standard shuffle join | Skew join (map join + shuffle join) |
| Maintenance | Low (automatic) | High (must know skewed values beforehand) |

A Quick Tip on "Dynamic" Skew

If you don't know your skewed values in advance (and therefore can't use SKEWED BY in your DDL), you can still tell Hive to watch for skew during the job itself:

-- Use this if you don't have a Skewed Table but suspect data skew
SET hive.groupby.skewindata=true;

Note: This will trigger two MapReduce jobs instead of one. The first job randomly distributes data to reducers to perform partial aggregation, and the second job performs the final aggregation. It’s slower for "clean" data but a lifesaver for "dirty" skewed data.
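
A minimal sketch of the kind of aggregation this setting targets, reusing the orders table from earlier:

SET hive.groupby.skewindata=true;

-- Job 1 pre-aggregates partial sums on randomly assigned reducers;
-- Job 2 merges those partial results by customer_id.
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id;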

Would you like to see how to troubleshoot a query that is hanging at 99% to confirm if data skew is actually the problem?

 

Yes.

 

To confirm if data skew is the reason your query is "stuck," you need to look at the individual task durations in the Tez or MapReduce UI. If 999 tasks finished in 30 seconds, but 1 task has been running for 45 minutes, you have a skew problem.

1. Identify the "Straggler" in Beeline

When running a query in Beeline, look at the status updates. If you see something like this:

Map 1: 100/100 Reduce 1: 499/500

The 499/500 means a single reducer (the "straggler") is holding up the entire job.

2. Check the Counters

In your job logs or the Tez UI, look for these specific counters to diagnose the severity:

  • REDUCE_INPUT_RECORDS: Compare the min, max, and median. If the Max is 100 million and the Median is 10,000, that specific reducer is overwhelmed.

  • FILE_BYTES_WRITTEN: A skewed reducer will write significantly more data to local disk (spilling) than its peers.


3. The "Detection" Query

If you suspect a column is skewed but aren't sure which values are the culprits, run this "profiling" query before setting up your skewed table:

SELECT column_name, COUNT(*) AS cnt
FROM your_table
GROUP BY column_name
ORDER BY cnt DESC
LIMIT 10;

If the top one or two values have counts that are orders of magnitude higher than the rest, those are the values you should use in your SKEWED BY clause.

4. How to Handle "Dynamic" Skew (No DDL changes)

If you cannot change the table schema but need a fix now, use the following session settings to force Hive to handle the imbalance:

| Setting | What it does |
|---|---|
| set hive.groupby.skewindata=true; | Splits the GROUP BY into two jobs to balance the load. |
| set hive.optimize.skewjoin=true; | Tells the join to treat heavy keys separately. |
| set hive.skewjoin.key=200000; | Sets the row-count threshold that triggers the skew join. |

Summary Checklist for a "Stuck" Query

  1. Check Reducer Progress: Is it stuck at exactly one or two remaining tasks?

  2. Verify Row Counts: Do certain keys have millions more rows than others?

  3. Apply Skew Settings: Enable hive.optimize.skewjoin or hive.groupby.skewindata.

  4. Consider Salting: If the above fails, you might need to "salt" your keys (adding a random prefix like 1_key, 2_key) to manually spread them across reducers; see the sketch below.
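
A minimal salting sketch, assuming the hypothetical customers dimension table from earlier. Each row of the big table gets a random salt in 0-3 appended to its key, and each dimension row is replicated once per salt value so the join still matches:

SELECT b.order_id, d.customer_name
FROM (
    -- Big side: append a random salt (0-3) to every key
    SELECT o.order_id, o.customer_id,
           CONCAT(CAST(FLOOR(RAND() * 4) AS STRING), '_',
                  CAST(o.customer_id AS STRING)) AS salted_key
    FROM orders o
) b
JOIN (
    -- Small side: replicate each row once per possible salt value
    SELECT c.customer_id, c.customer_name,
           CONCAT(CAST(s.salt AS STRING), '_',
                  CAST(c.customer_id AS STRING)) AS salted_key
    FROM customers c
    LATERAL VIEW EXPLODE(ARRAY(0, 1, 2, 3)) s AS salt
) d
ON b.salted_key = d.salted_key;

Each hot key now hashes to up to four different reducers instead of one, at the cost of a 4x blow-up of the small table.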

 
