ZhangZhihui's Blog

Hive SQL - Remove non-Chinese characters

regexp_replace(zczb,'([^\\u4E00-\\u9FA5]+)','')*10000

1️⃣ `regexp_replace` function

Syntax: regexp_replace(string, pattern, replacement)

string: The input string (zczb in your case)
pattern: The regular expression to match
replacement: The string to replace each match with

It replaces all parts of the string that match the pattern with the replacement.
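As a minimal illustration of the replace-all behavior, the query below runs the function on a made-up literal instead of the zczb column. The input and pattern are chosen only for demonstration.

```sql
-- Every match of the pattern is replaced, not just the first one.
SELECT regexp_replace('foobar', 'oo|ar', '');
-- expected: fb
```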

2️⃣ The regular expression

Breaking down ([^\\u4E00-\\u9FA5]+):

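\\u4E00-\\u9FA5 → Unicode range covering the common Chinese characters (from 一 to 龥); the backslash is doubled because Hive unescapes the string literal before the regex engine sees it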
[^ ... ] → negation, i.e., anything not in this range
+ → one or more occurrences
() → capturing group (not needed here, because the replacement never refers back to the group)

✅ So ([^\\u4E00-\\u9FA5]+) matches any sequence of characters that are NOT Chinese characters.
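To see what the pattern keeps and what it drops, it can be run on literals. The sample values below are hypothetical and are not taken from the real zczb data.

```sql
-- Digits, the decimal point and any other non-Chinese characters are removed.
SELECT regexp_replace('注册资本100.5万元', '([^\\u4E00-\\u9FA5]+)', '');
-- expected: 注册资本万元

-- A value with no Chinese characters at all is reduced to an empty string.
SELECT regexp_replace('abc 123', '([^\\u4E00-\\u9FA5]+)', '');
-- expected: (empty string)
```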

3️⃣ Replacement string

The replacement is the empty string '', so every non-Chinese sequence is deleted and only the Chinese characters are kept.

4️⃣ Multiply by 10000

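After removing non-Chinese characters, only Chinese characters remain, so the result is not a numeric string. Hive implicitly casts the string to DOUBLE for the multiplication, and a non-numeric string becomes NULL, so the expression as written should evaluate to NULL.
If the goal is to turn a value such as “100万” into a plain number, the character class should not be negated: [\\u4E00-\\u9FA5]+ removes the Chinese characters and leaves the digits. Multiplying by 10000 then converts the value from units of 万 (ten thousand), which is common in financial data, to base units.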
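A quick check on a literal makes the effect of the multiplication visible. This is only a sketch: it assumes zczb holds values such as '100万人民币', which the post does not show, and the second query is a suggested variant rather than the original expression.

```sql
-- Hypothetical sample value; the real zczb data is not shown in the post.
-- Pattern as written: only the Chinese characters survive, so the implicit
-- cast to DOUBLE should yield NULL.
SELECT regexp_replace('100万人民币', '([^\\u4E00-\\u9FA5]+)', '') * 10000;

-- Variant without the ^: the Chinese characters are removed, the digits are
-- kept, and the explicit cast makes the numeric conversion visible.
SELECT cast(regexp_replace('100万人民币', '[\\u4E00-\\u9FA5]+', '') AS DOUBLE) * 10000;
-- expected: 1000000.0
```

The variant still returns NULL when anything other than a number is left after the replacement (for example a currency symbol), so real data may need a stricter pattern.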

posted on 2025-08-26 18:11 ZhangZhihuiAAA
