公告

日历

+-------+------------------------+------------+
|  id   |      random_value      |  username  |
+-------+------------------------+------------+
| 1     | 0.23125841924973844    | user_1     |
| 2     | 0.9572671270290934     | user_2     |
| 3     | 0.09378905685510308    | user_3     |
| 4     | 0.02584054246669054    | user_4     |
| 5     | 0.15518210029431245    | user_5     |
| 6     | 0.25702072226733286    | user_6     |
| 7     | 0.5590842146818326     | user_7     |
| 8     | 0.27759514143025454    | user_8     |
| 9     | 0.7628897453382623     | user_9     |
| 10    | 0.17675502130566112    | user_10    |
......
| 995   | 0.6802569230484555     | user_995   |
| 996   | 0.8416054555139822     | user_996   |
| 997   | 0.43476838750935265    | user_997   |
| 998   | 0.4968230174121445     | user_998   |
| 999   | 0.9535706150492017     | user_999   |
| 1000  | 0.9526080345475065     | user_1000  |
+-------+------------------------+------------+

SELECT split(space(9), ' ');

+----------------------------------+
|               _c0                |
+----------------------------------+
| ["","","","","","","","","",""]  |
+----------------------------------+

SELECT posexplode(split(space(9), ' '));

+------+------+
| pos  | val  |
+------+------+
| 0    |      |
| 1    |      |
| 2    |      |
| 3    |      |
| 4    |      |
| 5    |      |
| 6    |      |
| 7    |      |
| 8    |      |
| 9    |      |
+------+------+

3. Using Randomization Functions

To make the data feel "real," you can combine rand() with mathematical functions to create varied distributions.

Function	Use Case
rand()	Returns a random double between 0 and 1.
floor(rand() * 100)	Generates a random integer between 0 and 99.
if(rand() > 0.5, 'Active', 'Inactive')	Randomly assigns a categorical status.
date_add('2026-01-01', floor(rand() * 365))	Generates a random date within a specific year.

4. Best Practices for Mocking Data

Use Temporary Tables: Always create mock data in TEMPORARY tables first so you don't clutter the production schema.
Schema Matching: Ensure your mock column types (INT, STRING, TIMESTAMP) exactly match your target table to avoid casting errors later.
External Generators: If you need highly complex data (like realistic names or addresses), it is often easier to use a Python script with the Faker library to generate a CSV, then load that into Hive:

LOAD DATA LOCAL INPATH '/tmp/mock_data.csv' INTO TABLE my_table;

SELECT ascii('0');
--48

SELECT ascii('9');
--57

SELECT chr(58);
--:

SELECT chr(59);
--;

SELECT chr(60);
--<

SELECT chr(61);
--=

SELECT chr(62);
-->

SELECT chr(63);
--?

SELECT chr(64);
--@

SELECT ascii('A');
--65

SELECT ascii('Z');
--90

SELECT chr(91);
--[

SELECT chr(92);
--\

SELECT chr(93);
--]

SELECT chr(94);
--^

SELECT chr(95);
--_

SELECT chr(96);
--`

SELECT ascii('a');
--97

SELECT ascii('z');
--122

posted on 2026-02-04 21:30 ZhangZhihuiAAA 阅读(0) 评论(0) 收藏举报

刷新页面返回顶部

导航

1. Using `stack()` and `lateral view`

2. Generating Large Series with `posexplode()`

3. Using Randomization Functions

4. Best Practices for Mocking Data


博客园 © 2004-2026 浙公网安备 33010602011771号浙ICP备2021040463号-3

导航

1. Using stack() and lateral view

2. Generating Large Series with posexplode()

3. Using Randomization Functions

4. Best Practices for Mocking Data

1. Using `stack()` and `lateral view`

2. Generating Large Series with `posexplode()`