
https://hive.apache.org/docs/latest/language/languagemanual-udf/

 

Built-in Functions

Mathematical Functions

Collection Functions

Type Conversion Functions

Date Functions

Conditional Functions

String Functions

Data Masking Functions

Misc. Functions

Misc. Functions

Return Type | Name(Signature)
varies | java_method(class, method[, arg1[, arg2..]])
varies | reflect(class, method[, arg1[, arg2..]])
int | hash(a1[, a2…])
string | current_user()
string | logged_in_user()
string | current_database()
string | md5(string/binary)
string | sha1(string/binary), sha(string/binary)
bigint | crc32(string/binary)
string | sha2(string/binary, int)
binary | aes_encrypt(input string/binary, key string/binary)
binary | aes_decrypt(input binary, key string/binary)
string | version()
bigint | surrogate_key([write_id_bits, task_id_bits])
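
A few of these are easy to try directly. A minimal sketch (reflect() and java_method() are synonyms that invoke a Java static method by reflection; most results are omitted since they depend on the input and deployment):

SELECT md5('ABC'), sha1('ABC'), crc32('ABC'), sha2('ABC', 256);   -- hex digest strings; crc32 returns a bigint
SELECT current_user(), current_database(), version();             -- session and build metadata
SELECT reflect('java.net.URLDecoder', 'decode', 'a%20b', 'utf-8');  -- 'a b'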

 

xpath

The following functions are described in LanguageManual XPathUDF:

  • xpath, xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_number, xpath_string
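
For quick orientation, two illustrative calls (xpath() returns an array of strings; the typed variants such as xpath_double() return scalars):

select xpath('<a><b>b1</b><b>b2</b></a>', 'a/b/text()');    -- ["b1","b2"]
select xpath_double('<a><b>1</b><b>2</b></a>', 'sum(a/b)'); -- 3.0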

get_json_object

A limited version of JSONPath is supported (a usage sketch follows the two lists below):

  • $ : Root object
  • . : Child operator
  • [] : Subscript operator for array
  • * : Wildcard for []

Unsupported syntax worth noting:

  • '' : Zero-length string as key
  • .. : Recursive descent
  • @ : Current object/element
  • () : Script expression
  • ?() : Filter (script) expression
  • [,] : Union operator
  • [start:end:step] : Array slice operator
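
A quick sketch of the supported operators against an inline JSON literal (the document and its values are illustrative):

select get_json_object('{"owner":"amy","store":{"fruit":[{"type":"apple"},{"type":"pear"}]}}', '$.owner');                -- amy
select get_json_object('{"owner":"amy","store":{"fruit":[{"type":"apple"},{"type":"pear"}]}}', '$.store.fruit[0].type');  -- apple
select get_json_object('{"owner":"amy","store":{"fruit":[{"type":"apple"},{"type":"pear"}]}}', '$.store.fruit[*].type');  -- ["apple","pear"]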

 

Built-in Aggregate Functions (UDAF)

Built-in Table-Generating Functions (UDTF)

Normal user-defined functions, such as concat(), take in a single input row and output a single output row. In contrast, table-generating functions transform a single input row to multiple output rows.

Row-set columns types | Name(Signature) | Description
T | explode(ARRAY<T> a) | Explodes an array to multiple rows. Returns a row-set with a single column (col), one row for each element from the array.
Tkey,Tvalue | explode(MAP<Tkey,Tvalue> m) | Explodes a map to multiple rows. Returns a row-set with two columns (key, value), one row for each key-value pair from the input map. (As of Hive 0.8.0.)
int,T | posexplode(ARRAY<T> a) | Explodes an array to multiple rows with an additional positional column of int type (the position of items in the original array, starting with 0). Returns a row-set with two columns (pos, val), one row for each element from the array.
T1,…,Tn | inline(ARRAY<STRUCT<f1:T1,...,fn:Tn>> a) | Explodes an array of structs to multiple rows. Returns a row-set with N columns (N = number of top-level elements in the struct), one row per struct from the array. (As of Hive 0.10.)
T1,…,Tn/r | stack(int r, T1 V1, …, Tn/r Vn) | Breaks up n values V1,…,Vn into r rows. Each row will have n/r columns. r must be constant.
string1,…,stringn | json_tuple(string jsonStr, string k1, …, string kn) | Takes a JSON string and a set of n keys, and returns a tuple of n values. This is a more efficient version of the get_json_object UDF because it can get multiple keys with just one call.
string1,…,stringn | parse_url_tuple(string urlStr, string p1, …, string pn) | Takes a URL string and a set of n URL parts, and returns a tuple of n values. This is similar to the parse_url() UDF but can extract multiple parts at once out of a URL. Valid part names are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY>.

Usage Examples

explode (array)

select explode(array('A','B','C'));
select explode(array('A','B','C')) as col;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf as col;

explode (map)

select explode(map('A',10,'B',20,'C',30));
select explode(map('A',10,'B',20,'C',30)) as (key,value);
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf;
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf as key,value;

posexplode (array)

select posexplode(array('A','B','C'));
select posexplode(array('A','B','C')) as (pos,val);
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf as pos,val;

inline (array of structs)

select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02')));
select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) as (col1,col2,col3);
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf;
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf as col1,col2,col3;

stack (values)

select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01');
select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') as (col0,col1,col2);
select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf;
select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf as col0,col1,col2;
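
json_tuple / parse_url_tuple

The two tuple-producing UDTFs from the table above are not covered by the stock examples; a sketch following the same pattern, with inline literals:

select tf.* from (select 0) t lateral view json_tuple('{"a":1,"b":"x"}', 'a', 'b') tf as a, b;
select tf.* from (select 0) t lateral view parse_url_tuple('http://example.com/p1?k1=v1', 'HOST', 'PATH', 'QUERY:k1') tf as host, path, k1;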

 

Using the syntax “SELECT udtf(col) AS colAlias…” has a few limitations:

  • No other expressions are allowed in SELECT
    • SELECT pageid, explode(adid_list) AS myCol… is not supported
  • UDTF’s can’t be nested
    • SELECT explode(explode(adid_list)) AS myCol… is not supported
  • GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
    • SELECT explode(adid_list) AS myCol … GROUP BY myCol is not supported

Please see LanguageManual LateralView for an alternative syntax that does not have these limitations.
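
For instance, assuming a hypothetical table pageAds(pageid STRING, adid_list ARRAY<INT>), the lateral-view form supports both the extra column and the GROUP BY that the bare-UDTF syntax rejects:

select pageid, adid from pageAds lateral view explode(adid_list) adTable as adid;
select adid, count(1) from pageAds lateral view explode(adid_list) adTable as adid group by adid;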

 

 

0: jdbc:hive2://localhost:10000> SELECT inline(array(struct('A', 10, date '2015-01-01'), struct('B', 20, date '2016-02-02')));
+-------+-------+-------------+
| col1  | col2  |    col3     |
+-------+-------+-------------+
| A     | 10    | 2015-01-01  |
| B     | 20    | 2016-02-02  |
+-------+-------+-------------+
0: jdbc:hive2://localhost:10000> SELECT explode(array(struct('A', 10, date '2015-01-01'), struct('B', 20, date '2016-02-02')));
+---------------------------------------------+
|                     col                     |
+---------------------------------------------+
| {"col1":"A","col2":10,"col3":"2015-01-01"}  |
| {"col1":"B","col2":20,"col3":"2016-02-02"}  |
+---------------------------------------------+
0: jdbc:hive2://localhost:10000> SELECT stack(2, 'A', 10, date '2015-01-01', 'B', 20, date '2016-01-01');
+-------+-------+-------------+
| col0  | col1  |    col2     |
+-------+-------+-------------+
| A     | 10    | 2015-01-01  |
| B     | 20    | 2016-01-01  |
+-------+-------+-------------+
0: jdbc:hive2://localhost:10000> SELECT posexplode(array('A', 'B', 'C'));
+------+------+
| pos  | val  |
+------+------+
| 0    | A    |
| 1    | B    |
| 2    | C    |
+------+------+
0: jdbc:hive2://localhost:10000> SELECT explode(map('A', 10, 'B', 20, 'C', 30));
+------+--------+
| key  | value  |
+------+--------+
| A    | 10     |
| B    | 20     |
| C    | 30     |
+------+--------+

 

0: jdbc:hive2://localhost:10000> SELECT positive(-1);
+------+
| _c0  |
+------+
| -1   |
+------+
0: jdbc:hive2://localhost:10000> SELECT negative(-1);
+------+
| _c0  |
+------+
| 1    |
+------+
0: jdbc:hive2://localhost:10000> SELECT MAX(1, 2, 3);
Error: Error while compiling statement: FAILED: UDFArgumentTypeException Exactly one argument is expected.; Query ID: frank_20260114201220_ee4a83dd-1c59-4b86-88e9-fdf4321d8c53 (state=42000,code=40000)
0: jdbc:hive2://localhost:10000> SELECT greatest(1, 2, 3);
+------+
| _c0  |
+------+
| 3    |
+------+
0: jdbc:hive2://localhost:10000> SELECT MIN(1, 2, 3);
Error: Error while compiling statement: FAILED: UDFArgumentTypeException Exactly one argument is expected.; Query ID: frank_20260114201249_5375a975-62c4-4837-9480-3de1d3c3bab6 (state=42000,code=40000)
0: jdbc:hive2://localhost:10000> SELECT least(1, 2, 3);
+------+
| _c0  |
+------+
| 1    |
+------+
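
The errors above are expected: MAX() and MIN() are aggregate functions that take exactly one argument and operate across rows, while greatest() and least() are their row-wise counterparts, comparing multiple scalar arguments within a single row.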

 

The width_bucket() function in Hive is used to create equi-width histograms. It assigns each value a "bucket" number based on where it falls within a specified range.

How it Works

The function divides a range (from a minimum to a maximum) into a specific number of buckets of equal size.

Syntax:

width_bucket(expression, min_value, max_value, num_buckets)

  • expression: The value you want to categorize (e.g., a price or age).

  • min_value: The lower bound of the histogram.

  • max_value: The upper bound of the histogram.

  • num_buckets: How many buckets to create between the min and max.


Example 1: Categorizing Sales Amounts

Suppose you have a list of transaction amounts and you want to group them into 3 buckets ranging from 0 to 300.

Amount | Calculation | Bucket Result
50 | Falls in 0–100 range | 1
150 | Falls in 100–200 range | 2
250 | Falls in 200–300 range | 3

The Query:

SELECT 
  amount, 
  width_bucket(amount, 0, 300, 3) AS bucket_id
FROM transactions;

Example 2: Handling Outliers (Underflow and Overflow)

One of the most important things to know about width_bucket is how it handles values outside your range:

  • Bucket 0: Reserved for values less than the min_value (Underflow).

  • Bucket N+1: Reserved for values equal to or greater than the max_value (Overflow).

The Query:

SELECT 
  width_bucket(-50, 0, 100, 10) AS underflow, -- Returns 0
  width_bucket(150, 0, 100, 10) AS overflow;   -- Returns 11

Practical Use Case: Age Demographics

If you want to group users into age brackets (0–19, 20–39, 40–59, 60–79) for a report:

SELECT 
  user_id,
  age,
  width_bucket(age, 0, 80, 4) AS age_group
FROM users;

Key Rules to Remember

  1. Equal Width: Every bucket (except 0 and N+1) has exactly the same "width" calculated as $(max - min) / num\_buckets$.

  2. Direction: If $min > max$, the buckets will decrease in value as the expression increases.

  3. Nulls: If any input is NULL, the function returns NULL (see the sketch below).
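
A minimal sketch of rules 1 and 3, assuming a Hive version that ships width_bucket() (it was added in Hive 3.0): each bucket below is $(100 - 0) / 10 = 10$ wide, the boundary value 100 overflows, and NULL propagates.

SELECT
  width_bucket(99, 0, 100, 10)                AS in_range,    -- 10
  width_bucket(100, 0, 100, 10)               AS at_max,      -- 11 (overflow bucket)
  width_bucket(CAST(NULL AS INT), 0, 100, 10) AS null_input;  -- NULL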

Would you like me to show you how to combine width_bucket with a GROUP BY to create a count of how many items fall into each category?

 

Yes, I would.

 

Here is an example of how you can use the width_bucket() function alongside a GROUP BY clause to generate a frequency distribution (a count of how many events fall into each "bucket").

The Scenario

Imagine your events table has a column for processing_time (in seconds). You want to see how many events fall into four performance tiers between 0 and 20 seconds.

The Query

SELECT 
  width_bucket(processing_time, 0, 20, 4) AS bucket_id,
  COUNT(*) AS event_count
FROM events
GROUP BY width_bucket(processing_time, 0, 20, 4)
ORDER BY bucket_id;

Understanding the Results

In this example, the width of each bucket is 5 seconds ($20 / 4 = 5$). The output would look something like this:

bucket_id | Range (Seconds) | event_count | Description
0 | $< 0$ | 0 | Underflow (invalid times)
1 | $0 \le x < 5$ | 1,200 | Fast
2 | $5 \le x < 10$ | 450 | Average
3 | $10 \le x < 15$ | 120 | Slow
4 | $15 \le x < 20$ | 30 | Very slow
5 | $\ge 20$ | 5 | Overflow (critical)

Pro-Tip: Making it Readable

Since bucket_id numbers aren't very descriptive for a final report, you can wrap the function in a CASE statement to provide labels:

SELECT 
  CASE width_bucket(processing_time, 0, 20, 4)
    WHEN 1 THEN '0-5s (Fast)'
    WHEN 2 THEN '5-10s (Avg)'
    WHEN 3 THEN '10-15s (Slow)'
    WHEN 4 THEN '15-20s (Very Slow)'
    ELSE 'Out of Range'
  END AS performance_category,
  COUNT(*) AS total
FROM events
GROUP BY 1;  -- positional grouping; requires hive.groupby.position.alias=true
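
The ELSE branch collects both the underflow (bucket 0) and overflow (bucket 5) rows described earlier.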

 

In HiveQL, Data Masking functions are used to hide or obfuscate sensitive information (like Credit Card numbers or PII) while still allowing the data to be queried.

These are typically part of the Hive Generic UDFs and are very useful for creating "Redacted" views of your data.


1. mask()

The most basic masking function. By default, it converts uppercase letters to 'X', lowercase letters to 'x', and numbers to 'n'.

  • Syntax: mask(string value [, upper_char, lower_char, number_char])

  • Example:

    SELECT mask("Abc-123"); 
    -- Result: Xxx-nnn
    

2. mask_first_n()

Masks only the first n characters of a string.

  • Example: Masking the beginning of a Username.

    SELECT mask_first_n("ConfidentialData", 4);
    -- Result: XxxxidentialData
    

3. mask_last_n()

Masks only the last n characters, leaving the beginning visible. (To show only the last digits of a card while masking the rest, the more common industry pattern, use mask_show_last_n() instead.)

  • Example: Masking a Credit Card.

    SELECT mask_last_n("1234-5678-9012-3456", 4);
    -- Result: 1234-5678-9012-nnnn
    

4. mask_show_first_n()

This "flips" the logic: it keeps the first $n$ characters visible and masks everything else.

  • Example: Showing only the first 3 digits of a phone number.

    SELECT mask_show_first_n("555-0199", 3);
    -- Result: 555-nnnn
    

5. mask_hash()

Returns a consistent hash of the value (SHA-256 in recent Hive releases; older versions returned an MD5 hash). This is powerful because it allows you to perform JOINs or GROUP BYs on masked data without ever seeing the original value.

  • Example:

    SELECT mask_hash("user_email@example.com");
    -- Result: 876930... (long alphanumeric hash)
    

Summary Table: Masking in Action

Function | Input | Result | Best Used For
mask() | A1-b2 | Xn-xn | Full redaction
mask_last_n() | 4567-8901 | 4567-nnnn | Credit cards
mask_show_first_n() | SecretMsg | SecxxxXxx (n=3) | Partial ID visibility
mask_hash() | Password123 | Hash value | Secure joins / analysis

Implementing with a View

A common practice in Hive is to hide the base table and provide a Masked View to analysts:

CREATE VIEW events_redacted AS
SELECT 
  event_id,
  mask_show_first_n(user_email, 3) AS user_email,
  event_date
FROM events;
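
To complete the pattern, access to the base table can be restricted so analysts query only the view. A sketch, assuming SQL standard-based authorization is enabled and an analysts role exists:

GRANT SELECT ON TABLE events_redacted TO ROLE analysts;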

 
