In HiveQL, the primary difference between ORDER BY and SORT BY lies in how many reducers are used and whether the entire dataset is guaranteed to be sorted.
Comparison at a Glance
| Feature | ORDER BY | SORT BY |
| Ordering | Global (Total) Order. The entire result set is sorted from top to bottom. | Local Order. Data is only sorted within each individual reducer. |
| Reducers | Uses a single reducer to perform the final sort. | Uses multiple reducers to sort in parallel. |
| Performance | Slow for large datasets (bottlenecked by one reducer). | Fast because it leverages distributed processing. |
| Reliability | May cause "Out of Memory" errors on huge datasets. | Very stable; handles large-scale data easily. |
1. ORDER BY (Total Ordering)
When you use ORDER BY, Hive forces all data to pass through one single reducer. This ensures that the final output is 100% sorted correctly across all rows.
-
Best for: Small datasets or when you absolutely need a perfectly ordered list (e.g., a Top 10 list).
-
Risk: If you have billions of rows, a single reducer will struggle to process them, leading to extremely long wait times or job failure.
2. SORT BY (Local Ordering)
SORT BY allows Hive to use multiple reducers. Each reducer sorts its own batch of data. While the rows inside each output file are sorted, the overall dataset is not.
-
Example: Reducer 1 might output
[1, 5, 10]and Reducer 2 might output[2, 6, 11]. Both are sorted internally, but the combined result[1, 5, 10, 2, 6, 11]is not. -
Best for: Improving performance on massive datasets where you just need "grouped" or "locally sorted" data before further processing.
3. When to use each?
-
Use ORDER BY when you are doing a final
SELECTand need to see the results in order (usually paired with aLIMITto avoid crashing the reducer). -
Use SORT BY when you are preparing data for another step (like a join) or when you are using
DISTRIBUTE BYto ensure specific keys go to the same reducer.
Related Tip: CLUSTER BY
If you find yourself using DISTRIBUTE BY (to group data) and SORT BY (to sort it) on the same column, you can use the shortcut CLUSTER BY.
-
CLUSTER BY Xis equivalent toDISTRIBUTE BY X SORT BY X.
Would you like to see how to use these together with DISTRIBUTE BY to get globally sorted results while still using multiple reducers?
Yes, I would.
To get globally sorted results while leveraging the performance of multiple reducers, you combine DISTRIBUTE BY and SORT BY.
Here is how the workflow works and how to write the queries based on the mechanics of Hive:
1. The Strategy: Distribute + Sort
If you simply use SORT BY, your data is sorted inside each reducer, but since Reducer A might have values 1-10 and Reducer B might have values 2-11, the combined output is a mess.
By adding DISTRIBUTE BY, you force all rows with the same key (or range of keys) to go to the same reducer.
-
DISTRIBUTE BY: Ensures all "like" data goes to one place. -
SORT BY: Orders that data once it arrives at the reducer.
The Query:
SELECT * FROM employee_table
DISTRIBUTE BY dept_number
SORT BY dept_number ASC;
2. The Shortcut: CLUSTER BY
If you are distributing and sorting on the exact same column, Hive provides a shortcut called CLUSTER BY. It is functionally identical to using both commands together.
The Query:
-- This does the exact same thing as the Distribute + Sort query above
SELECT * FROM employee_table
CLUSTER BY dept_number;
3. Achieving "Total" Global Order with Multiple Reducers
While CLUSTER BY ensures that each reducer's output is sorted and that keys don't overlap, the final files (e.g., 000000_0, 000001_0) are still separate. To get one single perfectly sorted list:
-
Use
ORDER BYwith aLIMIT: This is the most common way to force a global sort safely. -
Manually set Reducers: If you must have global order across multiple files, you often have to rely on
ORDER BY, but ensure your Hive environment has enough memory assigned to that single reducer.
Summary Table of Advanced Sorting
| Clause | Multiple Reducers? | Global Sort? | What it does |
SORT BY |
Yes | No | Sorts only within the reducer. |
DISTRIBUTE BY |
Yes | No | Moves similar keys to the same reducer. |
CLUSTER BY |
Yes | "Partial" | Combined Distribute + Sort on one column. |
ORDER BY |
No | Yes | Single reducer handles everything for a perfect sort. |
Pro-Tip: If you are in Strict Mode, Hive will actually prevent you from using ORDER BY unless you also include a LIMIT clause to protect the cluster from crashing.

浙公网安备 33010602011771号