
https://hive.apache.org/docs/latest/language/languagemanual-udf/

 

Built-in Functions

Mathematical Functions

Collection Functions

Type Conversion Functions

Date Functions

Conditional Functions

String Functions

Data Masking Functions

Misc. Functions

Misc. Functions

Return Type | Name(Signature)
varies | java_method(class, method[, arg1[, arg2..]])
varies | reflect(class, method[, arg1[, arg2..]])
int | hash(a1[, a2…])
string | current_user()
string | logged_in_user()
string | current_database()
string | md5(string/binary)
string | sha1(string/binary), sha(string/binary)
bigint | crc32(string/binary)
string | sha2(string/binary, int)
binary | aes_encrypt(input string/binary, key string/binary)
binary | aes_decrypt(input binary, key string/binary)
string | version()
bigint | surrogate_key([write_id_bits, task_id_bits])
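
A few of these are easy to try directly. A minimal sketch (reflect() and java_method() are synonyms that invoke a Java static method by reflection; most results are omitted since they depend on the input and deployment):

SELECT md5('ABC'), sha1('ABC'), crc32('ABC'), sha2('ABC', 256);   -- hex digest strings; crc32 returns a bigint
SELECT current_user(), current_database(), version();             -- session and build metadata
SELECT reflect('java.net.URLDecoder', 'decode', 'a%20b', 'utf-8');  -- 'a b'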

 

xpath

The following functions are described in LanguageManual XPathUDF:

  • xpath, xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_number, xpath_string
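
For quick orientation, two illustrative calls (xpath() returns an array of strings; the typed variants such as xpath_double() return scalars):

select xpath('<a><b>b1</b><b>b2</b></a>', 'a/b/text()');    -- ["b1","b2"]
select xpath_double('<a><b>1</b><b>2</b></a>', 'sum(a/b)'); -- 3.0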

get_json_object

A limited version of JSONPath is supported (a usage sketch follows the two lists below):

  • $ : Root object
  • . : Child operator
  • [] : Subscript operator for array
  • * : Wildcard for []

Unsupported syntax worth noting:

  • '' : Zero-length string as key
  • .. : Recursive descent
  • @ : Current object/element
  • () : Script expression
  • ?() : Filter (script) expression
  • [,] : Union operator
  • [start:end:step] : Array slice operator
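
A quick sketch of the supported operators against an inline JSON literal (the document and its values are illustrative):

select get_json_object('{"owner":"amy","store":{"fruit":[{"type":"apple"},{"type":"pear"}]}}', '$.owner');                -- amy
select get_json_object('{"owner":"amy","store":{"fruit":[{"type":"apple"},{"type":"pear"}]}}', '$.store.fruit[0].type');  -- apple
select get_json_object('{"owner":"amy","store":{"fruit":[{"type":"apple"},{"type":"pear"}]}}', '$.store.fruit[*].type');  -- ["apple","pear"]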

 

Built-in Aggregate Functions (UDAF)

Built-in Table-Generating Functions (UDTF)

Normal user-defined functions, such as concat(), take in a single input row and output a single output row. In contrast, table-generating functions transform a single input row to multiple output rows.

Row-set columns types | Name(Signature) | Description
T | explode(ARRAY<T> a) | Explodes an array to multiple rows. Returns a row-set with a single column (col), one row for each element from the array.
Tkey,Tvalue | explode(MAP<Tkey,Tvalue> m) | Explodes a map to multiple rows. Returns a row-set with two columns (key, value), one row for each key-value pair from the input map. (As of Hive 0.8.0.)
int,T | posexplode(ARRAY<T> a) | Explodes an array to multiple rows with an additional positional column of int type (the position of items in the original array, starting with 0). Returns a row-set with two columns (pos, val), one row for each element from the array.
T1,…,Tn | inline(ARRAY<STRUCT<f1:T1,...,fn:Tn>> a) | Explodes an array of structs to multiple rows. Returns a row-set with N columns (N = number of top-level elements in the struct), one row per struct from the array. (As of Hive 0.10.)
T1,…,Tn/r | stack(int r, T1 V1, …, Tn/r Vn) | Breaks up n values V1,…,Vn into r rows. Each row will have n/r columns. r must be constant.
string1,…,stringn | json_tuple(string jsonStr, string k1, …, string kn) | Takes a JSON string and a set of n keys, and returns a tuple of n values. This is a more efficient version of the get_json_object UDF because it can get multiple keys with just one call.
string1,…,stringn | parse_url_tuple(string urlStr, string p1, …, string pn) | Takes a URL string and a set of n URL parts, and returns a tuple of n values. This is similar to the parse_url() UDF but can extract multiple parts at once out of a URL. Valid part names are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY>.

Usage Examples

explode (array)

select explode(array('A','B','C'));
select explode(array('A','B','C')) as col;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf as col;

explode (map)

select explode(map('A',10,'B',20,'C',30));
select explode(map('A',10,'B',20,'C',30)) as (key,value);
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf;
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf as key,value;

posexplode (array)

select posexplode(array('A','B','C'));
select posexplode(array('A','B','C')) as (pos,val);
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf as pos,val;

inline (array of structs)

select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02')));
select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) as (col1,col2,col3);
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf;
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf as col1,col2,col3;

stack (values)

select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01');
select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') as (col0,col1,col2);
select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf;
select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf as col0,col1,col2;
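
json_tuple / parse_url_tuple

The two tuple-producing UDTFs from the table above are not covered by the stock examples; a sketch following the same pattern, with inline literals:

select tf.* from (select 0) t lateral view json_tuple('{"a":1,"b":"x"}', 'a', 'b') tf as a, b;
select tf.* from (select 0) t lateral view parse_url_tuple('http://example.com/p1?k1=v1', 'HOST', 'PATH', 'QUERY:k1') tf as host, path, k1;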

 

Using the syntax “SELECT udtf(col) AS colAlias…” has a few limitations:

  • No other expressions are allowed in SELECT
    • SELECT pageid, explode(adid_list) AS myCol… is not supported
  • UDTF’s can’t be nested
    • SELECT explode(explode(adid_list)) AS myCol… is not supported
  • GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
    • SELECT explode(adid_list) AS myCol … GROUP BY myCol is not supported

Please see LanguageManual LateralView for an alternative syntax that does not have these limitations.
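
For instance, assuming a hypothetical table pageAds(pageid STRING, adid_list ARRAY<INT>), the lateral-view form supports both the extra column and the GROUP BY that the bare-UDTF syntax rejects:

select pageid, adid from pageAds lateral view explode(adid_list) adTable as adid;
select adid, count(1) from pageAds lateral view explode(adid_list) adTable as adid group by adid;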

 

 

0: jdbc:hive2://localhost:10000> SELECT inline(array(struct('A', 10, date '2015-01-01'), struct('B', 20, date '2016-02-02')));
+-------+-------+-------------+
| col1  | col2  |    col3     |
+-------+-------+-------------+
| A     | 10    | 2015-01-01  |
| B     | 20    | 2016-02-02  |
+-------+-------+-------------+
0: jdbc:hive2://localhost:10000> SELECT explode(array(struct('A', 10, date '2015-01-01'), struct('B', 20, date '2016-02-02')));
+---------------------------------------------+
|                     col                     |
+---------------------------------------------+
| {"col1":"A","col2":10,"col3":"2015-01-01"}  |
| {"col1":"B","col2":20,"col3":"2016-02-02"}  |
+---------------------------------------------+
0: jdbc:hive2://localhost:10000> SELECT stack(2, 'A', 10, date '2015-01-01', 'B', 20, date '2016-01-01');
+-------+-------+-------------+
| col0  | col1  |    col2     |
+-------+-------+-------------+
| A     | 10    | 2015-01-01  |
| B     | 20    | 2016-01-01  |
+-------+-------+-------------+
0: jdbc:hive2://localhost:10000> SELECT posexplode(array('A', 'B', 'C'));
+------+------+
| pos  | val  |
+------+------+
| 0    | A    |
| 1    | B    |
| 2    | C    |
+------+------+
0: jdbc:hive2://localhost:10000> SELECT explode(map('A', 10, 'B', 20, 'C', 30));
+------+--------+
| key  | value  |
+------+--------+
| A    | 10     |
| B    | 20     |
| C    | 30     |
+------+--------+

 

0: jdbc:hive2://localhost:10000> SELECT positive(-1);
+------+
| _c0  |
+------+
| -1   |
+------+
0: jdbc:hive2://localhost:10000> SELECT negative(-1);
+------+
| _c0  |
+------+
| 1    |
+------+
0: jdbc:hive2://localhost:10000> SELECT MAX(1, 2, 3);
Error: Error while compiling statement: FAILED: UDFArgumentTypeException Exactly one argument is expected.; Query ID: frank_20260114201220_ee4a83dd-1c59-4b86-88e9-fdf4321d8c53 (state=42000,code=40000)
0: jdbc:hive2://localhost:10000> SELECT greatest(1, 2, 3);
+------+
| _c0  |
+------+
| 3    |
+------+
0: jdbc:hive2://localhost:10000> SELECT MIN(1, 2, 3);
Error: Error while compiling statement: FAILED: UDFArgumentTypeException Exactly one argument is expected.; Query ID: frank_20260114201249_5375a975-62c4-4837-9480-3de1d3c3bab6 (state=42000,code=40000)
0: jdbc:hive2://localhost:10000> SELECT least(1, 2, 3);
+------+
| _c0  |
+------+
| 1    |
+------+
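
The errors above are expected: MAX() and MIN() are aggregate functions that take exactly one argument and operate across rows, while greatest() and least() are their row-wise counterparts, comparing multiple scalar arguments within a single row.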

 

The width_bucket() function in Hive is used to create equi-width histograms. It assigns each value a "bucket" number based on where it falls within a specified range.

How it Works

The function divides a range (from a minimum to a maximum) into a specific number of buckets of equal size.

Syntax:

width_bucket(expression, min_value, max_value, num_buckets)

  • expression: The value you want to categorize (e.g., a price or age).

  • min_value: The lower bound of the histogram.

  • max_value: The upper bound of the histogram.

  • num_buckets: How many buckets to create between the min and max.


Example 1: Categorizing Sales Amounts

Suppose you have a list of transaction amounts and you want to group them into 3 buckets ranging from 0 to 300.

Amount | Calculation | Bucket Result
50 | Falls in 0–100 range | 1
150 | Falls in 100–200 range | 2
250 | Falls in 200–300 range | 3

The Query:

SELECT 
  amount, 
  width_bucket(amount, 0, 300, 3) AS bucket_id
FROM transactions;

Example 2: Handling Outliers (Underflow and Overflow)

One of the most important things to know about width_bucket is how it handles values outside your range:

  • Bucket 0: Reserved for values less than the min_value (Underflow).

  • Bucket N+1: Reserved for values equal to or greater than the max_value (Overflow).

The Query:

SELECT 
  width_bucket(-50, 0, 100, 10) AS underflow, -- Returns 0
  width_bucket(150, 0, 100, 10) AS overflow;   -- Returns 11

Practical Use Case: Age Demographics

If you want to group users into age brackets (0–19, 20–39, 40–59, 60–79) for a report:

SELECT 
  user_id,
  age,
  width_bucket(age, 0, 80, 4) AS age_group
FROM users;

Key Rules to Remember

  1. Equal Width: Every bucket (except 0 and N+1) has exactly the same "width" calculated as $(max - min) / num\_buckets$.

  2. Direction: If $min > max$, the buckets will decrease in value as the expression increases.

  3. Nulls: If any input is NULL, the function returns NULL (see the sketch below).
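
A minimal sketch of rules 1 and 3, assuming a Hive version that ships width_bucket() (it was added in Hive 3.0): each bucket below is $(100 - 0) / 10 = 10$ wide, the boundary value 100 overflows, and NULL propagates.

SELECT
  width_bucket(99, 0, 100, 10)                AS in_range,    -- 10
  width_bucket(100, 0, 100, 10)               AS at_max,      -- 11 (overflow bucket)
  width_bucket(CAST(NULL AS INT), 0, 100, 10) AS null_input;  -- NULL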

Would you like me to show you how to combine width_bucket with a GROUP BY to create a count of how many items fall into each category?

 

Yes, I would.

 

Here is an example of how you can use the width_bucket() function alongside a GROUP BY clause to generate a frequency distribution (a count of how many events fall into each "bucket").

The Scenario

Imagine your events table has a column for processing_time (in seconds). You want to see how many events fall into four performance tiers between 0 and 20 seconds.

The Query

SELECT 
  width_bucket(processing_time, 0, 20, 4) AS bucket_id,
  COUNT(*) AS event_count
FROM events
GROUP BY width_bucket(processing_time, 0, 20, 4)
ORDER BY bucket_id;

Understanding the Results

In this example, the width of each bucket is 5 seconds ($20 / 4 = 5$). The output would look something like this:

bucket_id | Range (Seconds) | event_count | Description
0 | $< 0$ | 0 | Underflow (invalid times)
1 | $0 \le x < 5$ | 1,200 | Fast
2 | $5 \le x < 10$ | 450 | Average
3 | $10 \le x < 15$ | 120 | Slow
4 | $15 \le x < 20$ | 30 | Very slow
5 | $\ge 20$ | 5 | Overflow (critical)

Pro-Tip: Making it Readable

Since bucket_id numbers aren't very descriptive for a final report, you can wrap the function in a CASE statement to provide labels:

SELECT 
  CASE width_bucket(processing_time, 0, 20, 4)
    WHEN 1 THEN '0-5s (Fast)'
    WHEN 2 THEN '5-10s (Avg)'
    WHEN 3 THEN '10-15s (Slow)'
    WHEN 4 THEN '15-20s (Very Slow)'
    ELSE 'Out of Range'
  END AS performance_category,
  COUNT(*) AS total
FROM events
GROUP BY 1;  -- positional grouping; requires hive.groupby.position.alias=true
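
The ELSE branch collects both the underflow (bucket 0) and overflow (bucket 5) rows described earlier.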

 

In HiveQL, Data Masking functions are used to hide or obfuscate sensitive information (like Credit Card numbers or PII) while still allowing the data to be queried.

These are typically part of the Hive Generic UDFs and are very useful for creating "Redacted" views of your data.


1. mask()

The most basic masking function. By default, it converts uppercase letters to 'X', lowercase letters to 'x', and numbers to 'n'.

  • Syntax: mask(string value [, upper_char, lower_char, number_char])

  • Example:

    SELECT mask("Abc-123"); 
    -- Result: Xxx-nnn
    

2. mask_first_n()

Masks only the first n characters of a string.

  • Example: Masking the beginning of a Username.

    SELECT mask_first_n("ConfidentialData", 4);
    -- Result: XxxxidentialData
    

3. mask_last_n()

Masks only the last n characters, leaving the beginning visible. (To show only the last digits of a card while masking the rest, the more common industry pattern, use mask_show_last_n() instead.)

  • Example: Masking a Credit Card.

    SELECT mask_last_n("1234-5678-9012-3456", 4);
    -- Result: 1234-5678-9012-nnnn
    

4. mask_show_first_n()

This "flips" the logic: it keeps the first $n$ characters visible and masks everything else.

  • Example: Showing only the first 3 digits of a phone number.

    SELECT mask_show_first_n("555-0199", 3);
    -- Result: 555-nnnn
    

5. mask_hash()

Returns a consistent hash of the value (SHA-256 in recent Hive releases; older versions returned an MD5 hash). This is powerful because it allows you to perform JOINs or GROUP BYs on masked data without ever seeing the original value.

  • Example:

    SELECT mask_hash("user_email@example.com");
    -- Result: 876930... (long alphanumeric hash)
    

Summary Table: Masking in Action

Function | Input | Result | Best Used For
mask() | A1-b2 | Xn-xn | Full redaction
mask_last_n() | 4567-8901 | 4567-nnnn | Credit cards
mask_show_first_n() | SecretMsg | SecxxxXxx (n=3) | Partial ID visibility
mask_hash() | Password123 | Hash value | Secure joins / analysis

Implementing with a View

A common practice in Hive is to hide the base table and provide a Masked View to analysts:

CREATE VIEW events_redacted AS
SELECT 
  event_id,
  mask_show_first_n(user_email, 3) AS user_email,
  event_date
FROM events;
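
To complete the pattern, access to the base table can be restricted so analysts query only the view. A sketch, assuming SQL standard-based authorization is enabled and an analysts role exists:

GRANT SELECT ON TABLE events_redacted TO ROLE analysts;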

 
