https://hive.apache.org/docs/latest/language/languagemanual-udf/
Built-in Functions
- Mathematical Functions
- Collection Functions
- Type Conversion Functions
- Date Functions
- Conditional Functions
- String Functions
- Data Masking Functions
Misc. Functions
| Return Type | Name(Signature) |
|---|---|
| varies | java_method(class, method[, arg1[, arg2..]]) |
| varies | reflect(class, method[, arg1[, arg2..]]) |
| int | hash(a1[, a2…]) |
| string | current_user() |
| string | logged_in_user() |
| string | current_database() |
| string | md5(string/binary) |
| string | sha1(string/binary), sha(string/binary) |
| bigint | crc32(string/binary) |
| string | sha2(string/binary, int) |
| binary | aes_encrypt(input string/binary, key string/binary) |
| binary | aes_decrypt(input binary, key string/binary) |
| string | version() |
| bigint | surrogate_key([write_id_bits, task_id_bits]) |
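A few of the functions above can be tried directly; a minimal sketch, assuming a Hive session where these built-ins are available (the md5/sha1 outputs are the standard test-vector digests):

```sql
SELECT md5('abc');   -- '900150983cd24fb0d6963f7d28e17f72'
SELECT sha1('abc');  -- 'a9993e364706816aba3e25717850c26c9cd0d89d'
SELECT sha2('abc', 256);  -- 64-character hex SHA-256 digest
-- reflect()/java_method() invoke a static Java method by reflection:
SELECT reflect('java.lang.Math', 'max', 2, 3);  -- 3
```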
xpath
The following functions are described in LanguageManual XPathUDF:
- xpath, xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_number, xpath_string
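A brief sketch of the xpath family (full details are in LanguageManual XPathUDF; the XML literals here are illustrative):

```sql
-- xpath returns a Hive array of matching string values
SELECT xpath('<a><b>b1</b><b>b2</b></a>', 'a/b/text()');  -- ["b1","b2"]
-- the typed variants evaluate the XPath expression to a scalar
SELECT xpath_int('<a><b>1</b><b>2</b></a>', 'sum(a/b)');  -- 3
```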
get_json_object
A limited version of JSONPath is supported:
- $ : Root object
- . : Child operator
- [] : Subscript operator for array
- * : Wildcard for []
Notable syntax that is not supported:
- '' : Zero-length string as key
- .. : Recursive descent
- @ : Current object/element
- () : Script expression
- ?() : Filter (script) expression.
- [,] : Union operator
- [start:end:step] : Array slice operator
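The supported operators can be sketched with a small example (the JSON literal is illustrative):

```sql
-- child and subscript operators
SELECT get_json_object('{"store":{"fruit":[{"type":"apple"},{"type":"pear"}]}}',
                       '$.store.fruit[0].type');   -- apple
-- wildcard inside the subscript collects every array element
SELECT get_json_object('{"store":{"fruit":[{"type":"apple"},{"type":"pear"}]}}',
                       '$.store.fruit[*].type');   -- JSON array containing both types
```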
Built-in Aggregate Functions (UDAF)
Built-in Table-Generating Functions (UDTF)
Normal user-defined functions, such as concat(), take in a single input row and output a single output row. In contrast, table-generating functions transform a single input row to multiple output rows.
| Row-set columns types | Name(Signature) | Description |
|---|---|---|
| T | explode(ARRAY a) | Explodes an array to multiple rows. Returns a row-set with a single column (col), one row for each element from the array. |
| Tkey,Tvalue | explode(MAP<Tkey,Tvalue> m) | Explodes a map to multiple rows. Returns a row-set with two columns (key, value), one row for each key-value pair from the input map. (As of Hive 0.8.0.) |
| int,T | posexplode(ARRAY a) | Explodes an array to multiple rows with additional positional column of int type (position of items in the original array, starting with 0). Returns a row-set with two columns (pos,val), one row for each element from the array. |
| T1,…,Tn | inline(ARRAY<STRUCT<f1:T1,...,fn:Tn>> a) | Explodes an array of structs to multiple rows. Returns a row-set with N columns (N = number of top-level elements in the struct), one row per struct from the array. (As of Hive 0.10.) |
| T1,…,Tn/r | stack(int r,T1 V1,…,Tn/r Vn) | Breaks up n values V1,…,Vn into r rows. Each row will have n/r columns. r must be constant. |
| string1,…,stringn | json_tuple(string jsonStr,string k1,…,string kn) | Takes JSON string and a set of n keys, and returns a tuple of n values. This is a more efficient version of the get_json_object UDF because it can get multiple keys with just one call. |
| string1,…,stringn | parse_url_tuple(string urlStr,string p1,…,string pn) | Takes URL string and a set of n URL parts, and returns a tuple of n values. This is similar to the parse_url() UDF but can extract multiple parts at once out of a URL. Valid part names are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY>. |
Usage Examples
explode (array)
select explode(array('A','B','C'));
select explode(array('A','B','C')) as col;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf as col;
explode (map)
select explode(map('A',10,'B',20,'C',30));
select explode(map('A',10,'B',20,'C',30)) as (key,value);
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf;
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf as key,value;
posexplode (array)
select posexplode(array('A','B','C'));
select posexplode(array('A','B','C')) as (pos,val);
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf as pos,val;
inline (array of structs)
select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02')));
select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) as (col1,col2,col3);
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf;
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf as col1,col2,col3;
stack (values)
select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01');
select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') as (col0,col1,col2);
select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf;
select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf as col0,col1,col2;
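The tuple-producing UDTFs from the table above follow the same pattern; a hedged sketch (JSON literal, URL, and column aliases are illustrative):

```sql
-- json_tuple: extract several keys with one call (cheaper than repeated get_json_object)
select json_tuple('{"a":1,"b":"two"}', 'a', 'b') as (ka, kb);
-- parse_url_tuple: extract several URL parts at once; QUERY:<KEY> pulls one query parameter
select parse_url_tuple('http://example.com/path?k=v#ref', 'HOST', 'PATH', 'QUERY:k')
       as (host, path, q);
```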
Using the syntax “SELECT udtf(col) AS colAlias…” has a few limitations:
- No other expressions are allowed in SELECT
- SELECT pageid, explode(adid_list) AS myCol… is not supported
- UDTFs can't be nested
- SELECT explode(explode(adid_list)) AS myCol… is not supported
- GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
- SELECT explode(adid_list) AS myCol … GROUP BY myCol is not supported
Please see LanguageManual LateralView for an alternative syntax that does not have these limitations.
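For instance, LATERAL VIEW lets the UDTF output be combined with the other columns of each input row, which plain SELECT udtf(...) cannot do (the pageAds table and its columns here are hypothetical):

```sql
-- one output row per (page, ad) pair; pageid is repeated for each exploded adid
SELECT pageid, adid
FROM pageAds
LATERAL VIEW explode(adid_list) adTable AS adid;
```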
0: jdbc:hive2://localhost:10000> SELECT inline(array(struct('A', 10, date '2015-01-01'), struct('B', 20, date '2016-02-02')));
+-------+-------+-------------+
| col1  | col2  |    col3     |
+-------+-------+-------------+
| A     | 10    | 2015-01-01  |
| B     | 20    | 2016-02-02  |
+-------+-------+-------------+
0: jdbc:hive2://localhost:10000> SELECT explode(array(struct('A', 10, date '2015-01-01'), struct('B', 20, date '2016-02-02')));
+---------------------------------------------+
|                     col                     |
+---------------------------------------------+
| {"col1":"A","col2":10,"col3":"2015-01-01"}  |
| {"col1":"B","col2":20,"col3":"2016-02-02"}  |
+---------------------------------------------+
0: jdbc:hive2://localhost:10000> SELECT stack(2, 'A', 10, date '2015-01-01', 'B', 20, date '2016-01-01');
+-------+-------+-------------+
| col0  | col1  |    col2     |
+-------+-------+-------------+
| A     | 10    | 2015-01-01  |
| B     | 20    | 2016-01-01  |
+-------+-------+-------------+
0: jdbc:hive2://localhost:10000> SELECT posexplode(array('A', 'B', 'C'));
+------+------+
| pos  | val  |
+------+------+
| 0    | A    |
| 1    | B    |
| 2    | C    |
+------+------+
0: jdbc:hive2://localhost:10000> SELECT explode(map('A', 10, 'B', 20, 'C', 30));
+------+--------+
| key  | value  |
+------+--------+
| A    | 10     |
| B    | 20     |
| C    | 30     |
+------+--------+
0: jdbc:hive2://localhost:10000> SELECT positive(-1);
+------+
| _c0  |
+------+
| -1   |
+------+
0: jdbc:hive2://localhost:10000> SELECT negative(-1);
+------+
| _c0  |
+------+
| 1    |
+------+
0: jdbc:hive2://localhost:10000> SELECT MAX(1, 2, 3);
Error: Error while compiling statement: FAILED: UDFArgumentTypeException Exactly one argument is expected.; Query ID: frank_20260114201220_ee4a83dd-1c59-4b86-88e9-fdf4321d8c53 (state=42000,code=40000)
0: jdbc:hive2://localhost:10000> SELECT greatest(1, 2, 3);
+------+
| _c0  |
+------+
| 3    |
+------+
0: jdbc:hive2://localhost:10000> SELECT MIN(1, 2, 3);
Error: Error while compiling statement: FAILED: UDFArgumentTypeException Exactly one argument is expected.; Query ID: frank_20260114201249_5375a975-62c4-4837-9480-3de1d3c3bab6 (state=42000,code=40000)
0: jdbc:hive2://localhost:10000> SELECT least(1, 2, 3);
+------+
| _c0  |
+------+
| 1    |
+------+
The width_bucket() function in Hive is used to create equi-width histograms. It assigns a value to a "bucket" number based on where it falls within a specific range.
How it Works
The function divides a range (from a minimum to a maximum) into a specific number of buckets of equal size.
Syntax:
width_bucket(expression, min_value, max_value, num_buckets)
- expression: The value you want to categorize (e.g., a price or age).
- min_value: The lower bound of the histogram.
- max_value: The upper bound of the histogram.
- num_buckets: How many buckets to create between the min and max.
Example 1: Categorizing Sales Amounts
Suppose you have a list of transaction amounts and you want to group them into 3 buckets ranging from 0 to 300.
| Amount | Calculation | Bucket Result |
|---|---|---|
| 50 | Falls in 0–100 range | 1 |
| 150 | Falls in 100–200 range | 2 |
| 250 | Falls in 200–300 range | 3 |
The Query:
SELECT
amount,
width_bucket(amount, 0, 300, 3) AS bucket_id
FROM transactions;
Example 2: Handling Outliers (Underflow and Overflow)
One of the most important things to know about width_bucket is how it handles values outside your range:
- Bucket 0: Reserved for values less than min_value (underflow).
- Bucket N+1: Reserved for values equal to or greater than max_value (overflow).
The Query:
SELECT
width_bucket(-50, 0, 100, 10) AS underflow, -- Returns 0
width_bucket(150, 0, 100, 10) AS overflow; -- Returns 11
Practical Use Case: Age Demographics
If you want to group users into age brackets (0–20, 21–40, 41–60, 61–80) for a report:
SELECT
user_id,
age,
width_bucket(age, 0, 80, 4) AS age_group
FROM users;
Key Rules to Remember
- Equal Width: Every bucket (except 0 and N+1) has exactly the same "width", calculated as $(max - min) / num\_buckets$.
- Direction: If $min > max$, the buckets will decrease in value as the expression increases.
- Nulls: If any input is NULL, the function returns NULL.
Here is an example of how you can use the width_bucket() function alongside a GROUP BY clause to generate a frequency distribution (a count of how many events fall into each "bucket").
The Scenario
Imagine your events table has a column for processing_time (in seconds). You want to see how many events fall into four performance tiers between 0 and 20 seconds.
The Query
SELECT
width_bucket(processing_time, 0, 20, 4) AS bucket_id,
COUNT(*) AS event_count
FROM events
GROUP BY width_bucket(processing_time, 0, 20, 4)
ORDER BY bucket_id;
Understanding the Results
In this example, the width of each bucket is 5 seconds ($20 / 4 = 5$). The output would look something like this:
| bucket_id | Range (Seconds) | event_count | Description |
|---|---|---|---|
| 0 | $< 0$ | 0 | Underflow (invalid times) |
| 1 | $0 \le x < 5$ | 1,200 | Fast |
| 2 | $5 \le x < 10$ | 450 | Average |
| 3 | $10 \le x < 15$ | 120 | Slow |
| 4 | $15 \le x < 20$ | 30 | Very Slow |
| 5 | $\ge 20$ | 5 | Overflow (critical) |
Pro-Tip: Making it Readable
Since bucket_id numbers aren't very descriptive for a final report, you can wrap the function in a CASE statement to provide labels:
SELECT
CASE width_bucket(processing_time, 0, 20, 4)
WHEN 1 THEN '0-5s (Fast)'
WHEN 2 THEN '5-10s (Avg)'
WHEN 3 THEN '10-15s (Slow)'
WHEN 4 THEN '15-20s (Very Slow)'
ELSE 'Out of Range'
END AS performance_category,
COUNT(*) AS total
FROM events
GROUP BY 1;
In HiveQL, Data Masking functions are used to hide or obfuscate sensitive information (like Credit Card numbers or PII) while still allowing the data to be queried.
These are typically part of the Hive Generic UDFs and are very useful for creating "Redacted" views of your data.
1. mask()
The most basic masking function. By default, it converts uppercase letters to 'X', lowercase letters to 'x', and numbers to 'n'.
- Syntax: mask(string value [, upper_char, lower_char, number_char])
- Example:
SELECT mask("Abc-123"); -- Result: Xxx-nnn
2. mask_first_n()
Masks only the first n characters of a string.
- Example: Masking the beginning of a username.
SELECT mask_first_n("ConfidentialData", 4); -- Result: XxxxidentialData
3. mask_last_n()
Masks only the last n characters. This is the "Industry Standard" for showing the end of a card or account number.
- Example: Masking a credit card number.
SELECT mask_last_n("1234-5678-9012-3456", 4); -- Result: 1234-5678-9012-nnnn
4. mask_show_first_n()
This "flips" the logic: it keeps the first $n$ characters visible and masks everything else.
- Example: Showing only the first 3 digits of a phone number.
SELECT mask_show_first_n("555-0199", 3); -- Result: 555-nnnn
5. mask_hash()
Returns a consistent SHA-256 hash of the value. This is powerful because it allows you to perform JOINs or GROUP BYs on masked data without ever seeing the original value.
- Example:
SELECT mask_hash("user_email@example.com"); -- Result: 876930... (long alphanumeric hash)
Summary Table: Masking in Action
| Function | Input | Result | Best Used For |
|---|---|---|---|
| mask() | A1-b2 | Xn-xn | Full redaction |
| mask_last_n() | 4567-8901 | 4567-nnnn | Credit cards |
| mask_show_first_n() | SecretMsg | SecxxxXxx | Partial ID visibility |
| mask_hash() | Password123 | Hash value | Secure joins / analysis |
Implementing with a View
A common practice in Hive is to hide the base table and provide a Masked View to analysts:
CREATE VIEW events_redacted AS
SELECT
event_id,
mask_show_first_n(user_email, 3) AS user_email,
event_date
FROM events;
