公告

日历

Feature	ORDER BY	SORT BY
Ordering	Global (Total) Order. The entire result set is sorted from top to bottom.	Local Order. Data is only sorted within each individual reducer.
Reducers	Uses a single reducer to perform the final sort.	Uses multiple reducers to sort in parallel.
Performance	Slow for large datasets (bottlenecked by one reducer).	Fast because it leverages distributed processing.
Reliability	May cause "Out of Memory" errors on huge datasets.	Very stable; handles large-scale data easily.

1. ORDER BY (Total Ordering)

When you use ORDER BY, Hive forces all data to pass through one single reducer. This ensures that the final output is 100% sorted correctly across all rows.

Best for: Small datasets or when you absolutely need a perfectly ordered list (e.g., a Top 10 list).
Risk: If you have billions of rows, a single reducer will struggle to process them, leading to extremely long wait times or job failure.

2. SORT BY (Local Ordering)

SORT BY allows Hive to use multiple reducers. Each reducer sorts its own batch of data. While the rows inside each output file are sorted, the overall dataset is not.

Example: Reducer 1 might output [1, 5, 10] and Reducer 2 might output [2, 6, 11]. Both are sorted internally, but the combined result [1, 5, 10, 2, 6, 11] is not.
Best for: Improving performance on massive datasets where you just need "grouped" or "locally sorted" data before further processing.

3. When to use each?

Use ORDER BY when you are doing a final SELECT and need to see the results in order (usually paired with a LIMIT to avoid crashing the reducer).
Use SORT BY when you are preparing data for another step (like a join) or when you are using DISTRIBUTE BY to ensure specific keys go to the same reducer.

Related Tip: CLUSTER BY

If you find yourself using DISTRIBUTE BY (to group data) and SORT BY (to sort it) on the same column, you can use the shortcut CLUSTER BY.

CLUSTER BY X is equivalent to DISTRIBUTE BY X SORT BY X.

Would you like to see how to use these together with DISTRIBUTE BY to get globally sorted results while still using multiple reducers?

Yes, I would.

To get globally sorted results while leveraging the performance of multiple reducers, you combine DISTRIBUTE BY and SORT BY.

Here is how the workflow works and how to write the queries based on the mechanics of Hive:

1. The Strategy: Distribute + Sort

If you simply use SORT BY, your data is sorted inside each reducer, but since Reducer A might have values 1-10 and Reducer B might have values 2-11, the combined output is a mess.

By adding DISTRIBUTE BY, you force all rows with the same key (or range of keys) to go to the same reducer.

DISTRIBUTE BY: Ensures all "like" data goes to one place.
SORT BY: Orders that data once it arrives at the reducer.

The Query:

SELECT * FROM employee_table 
DISTRIBUTE BY dept_number 
SORT BY dept_number ASC;

2. The Shortcut: `CLUSTER BY`

If you are distributing and sorting on the exact same column, Hive provides a shortcut called CLUSTER BY. It is functionally identical to using both commands together.

The Query:

-- This does the exact same thing as the Distribute + Sort query above
SELECT * FROM employee_table 
CLUSTER BY dept_number;

3. Achieving "Total" Global Order with Multiple Reducers

While CLUSTER BY ensures that each reducer's output is sorted and that keys don't overlap, the final files (e.g., 000000_0, 000001_0) are still separate. To get one single perfectly sorted list:

Use ORDER BY with a LIMIT: This is the most common way to force a global sort safely.
Manually set Reducers: If you must have global order across multiple files, you often have to rely on ORDER BY, but ensure your Hive environment has enough memory assigned to that single reducer.

Summary Table of Advanced Sorting

Clause	Multiple Reducers?	Global Sort?	What it does
`SORT BY`	Yes	No	Sorts only within the reducer.
`DISTRIBUTE BY`	Yes	No	Moves similar keys to the same reducer.
`CLUSTER BY`	Yes	"Partial"	Combined Distribute + Sort on one column.
`ORDER BY`	No	Yes	Single reducer handles everything for a perfect sort.

Pro-Tip: If you are in Strict Mode, Hive will actually prevent you from using ORDER BY unless you also include a LIMIT clause to protect the cluster from crashing.

posted on 2026-01-13 16:46 ZhangZhihuiAAA 阅读(0) 评论(0) 收藏举报

刷新页面返回顶部

导航

Comparison at a Glance

1. ORDER BY (Total Ordering)

2. SORT BY (Local Ordering)

3. When to use each?

Related Tip: CLUSTER BY

1. The Strategy: Distribute + Sort

2. The Shortcut: `CLUSTER BY`

3. Achieving "Total" Global Order with Multiple Reducers

Summary Table of Advanced Sorting


博客园 © 2004-2026 浙公网安备 33010602011771号浙ICP备2021040463号-3

导航

Comparison at a Glance

1. ORDER BY (Total Ordering)

2. SORT BY (Local Ordering)

3. When to use each?

Related Tip: CLUSTER BY

1. The Strategy: Distribute + Sort

2. The Shortcut: CLUSTER BY

3. Achieving "Total" Global Order with Multiple Reducers

Summary Table of Advanced Sorting

2. The Shortcut: `CLUSTER BY`