
1. Data types

Hive data types are categorized into two types: primitive and complex. String and Int are the most useful primitive types, which are supported by most HQL functions.

Primitive Types

  • TINYINT: 1 byte, from -128 to 127. The postfix is Y. Used for a small range of numbers. Example: 10Y
  • SMALLINT: 2 bytes, from -32,768 to 32,767. The postfix is S. Used as a regular descriptive number. Example: 10S
  • INT: 4 bytes, from -2,147,483,648 to 2,147,483,647. Example: 10
  • BIGINT: 8 bytes, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. The postfix is L. Example: 10L
  • FLOAT: a 4-byte single-precision floating-point number, from 1.40129846432481707e-45 to 3.40282346638528860e+38 (positive or negative). Scientific notation is not yet supported. It stores very close approximations of numeric values. Example: 1.2345679
  • DOUBLE: an 8-byte double-precision floating-point number, from 4.94065645841246544e-324d to 1.79769313486231570e+308d (positive or negative). Scientific notation is not yet supported. It stores very close approximations of numeric values. Example: 1.2345678901234567
  • BINARY: introduced in Hive 0.8.0; it only supports CAST to STRING and vice versa. Example: 1011
  • BOOLEAN: a TRUE or FALSE value. Example: true
  • STRING: characters expressed with either single quotes (') or double quotes ("). Hive uses C-style escaping within the strings. The max size is around 2 GB. Example: 'books' or "books"
  • CHAR: available starting with Hive 0.13.0; most UDFs work for this type after Hive 0.14.0. The maximum length is fixed at 255. Example: 'US' or "US"
  • VARCHAR: available starting with Hive 0.12.0; most UDFs work for this type after Hive 0.14.0. The maximum length is fixed at 65,535. If a string value being assigned to a varchar exceeds the specified length, it is silently truncated. Example: 'Books' or "Books"
  • DATE: a specific year, month, and day in the format YYYY-MM-DD. Available starting with Hive 0.12.0. The range of dates is from 0000-01-01 to 9999-12-31. Example: 2013-01-01
  • TIMESTAMP: a specific year, month, day, hour, minute, second, and fractional second in the format YYYY-MM-DD HH:MM:SS[.fff...]. Available starting with Hive 0.8.0. Example: 2013-01-01 12:00:01.345
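
A quick sketch of the integral postfixes in action (recent Hive versions accept SELECT without a FROM clause):

-- Literal postfixes choose the integral type
SELECT 10Y AS tiny_col, 10S AS small_col, 10 AS int_col, 10L AS big_col;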

 

The complex data types are as follows:

  • ARRAY: a list of items of the same type, such as [val1, val2, ...]. Access a value with array_name[index]; for example, fruit[0]="apple". The index starts from 0. Example: ["apple","orange","mango"]
  • MAP: a set of key-value pairs, such as {key1: val1, key2: val2, ...}. Access a value with map_name[key]; for example, fruit[1]="apple". Example: {1: "apple", 2: "orange"}
  • STRUCT: a user-defined structure of fields of any type, such as {val1, val2, val3, ...}. By default, the STRUCT field names are col1, col2, and so on. Access a value with struct_name.column_name; for example, fruit.col1=1. Example: {1, "apple"}
  • NAMED STRUCT: a user-defined structure of any number of typed fields, such as {name1: val1, name2: val2, ...}. Access a value with struct_name.column_name; for example, fruit.apple="gala". Example: {"apple":"gala","weightkg":1}
  • UNION: a structure that holds exactly one of its specified data types at a time. Available starting with Hive 0.7.0; it is not commonly used. Example: {2:["apple","orange"]}

For MAP, the types of all keys must be the same, as must the types of all values. However, STRUCT is more flexible: STRUCT is more like a table, whereas MAP is more like an ARRAY with a customized index.
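
The following sketch accesses each complex type, using the employee_internal table created later in section 5: ARRAY elements by index, MAP values by key, and STRUCT fields by name.

SELECT
work_place[0]      AS first_city,  -- ARRAY element by index, starting at 0
skills_score['DB'] AS db_score,    -- MAP value by key
gender_age.gender  AS gender       -- STRUCT field by name
FROM employee_internal;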

 

2. Data type conversions

Similar to SQL, HQL supports both implicit and explicit type conversion. Primitive-type conversion from a narrow to a wider type is known as implicit conversion. However, the reverse conversion is not allowed. All the integral numeric types, FLOAT , and STRING can be implicitly converted to DOUBLE , and TINYINT , SMALLINT , and INT can all be converted to FLOAT . BOOLEAN types cannot be converted to any other type.


Explicit-type conversion uses the CAST function with the CAST(value as TYPE) syntax. For example, CAST('100' as INT) converts the string '100' to the integer value 100. If the cast fails, as in CAST('INT' as INT), the function returns NULL. In addition, the BINARY type can only be cast to STRING first; the result can then be cast from STRING to other types if needed.
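
For example, a minimal sketch (the some_table table and its binary_col column are hypothetical):

-- Successful and failing explicit casts
SELECT CAST('100' AS INT);  -- returns 100
SELECT CAST('INT' AS INT);  -- returns NULL because the cast fails

-- BINARY must be cast to STRING first, then from STRING to the target type
SELECT CAST(CAST(binary_col AS STRING) AS INT) FROM some_table;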

 

3. Comments

HQL uses -- at the start of a single-line comment, and it did not support multiline comments until v2.3.0. Since v2.3.0, we can use bracketed single-line or multiline comments between /* and */.
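
For example:

-- This is a single-line comment
SELECT 1;

/* This is a bracketed comment
spanning multiple lines, supported since v2.3.0 */
SELECT 2;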

 

4. Database

The database in Hive describes a collection of tables that are used for a similar purpose or belong to the same group. If no database is specified, the default database is used, and it uses /user/hive/warehouse in HDFS as its root directory. This path is configurable through the hive.metastore.warehouse.dir property in hive-site.xml. Whenever a new database is created, Hive creates a directory for it under /user/hive/warehouse. For example, the myhivebook database is located at /user/hive/warehouse/myhivebook.db. In addition, DATABASE has a name alias, SCHEMA, meaning they are the same thing in HQL.

 

Since Hive v2.2.1, the ALTER DATABASE ... SET LOCATION statement can be used to modify the database's location, but it does not move the existing tables/partitions in the current database directory to the newly specified location. It only changes the location for tables added after the database is altered. This behavior is analogous to how changing a table directory does not move existing partitions to a different location.

The SHOW and DESC (or DESCRIBE) statements in Hive are used to show the definition of most objects, such as tables and partitions. The SHOW statement supports a wide range of Hive objects, such as tables, table properties, table DDL, indexes, partitions, columns, functions, locks, roles, configurations, transactions, and compactions. The DESC statement supports a smaller range of Hive objects, such as databases, tables, views, columns, and partitions. However, the DESC statement can provide more detailed information when combined with the EXTENDED or FORMATTED keywords. There is no dedicated section here for SHOW and DESC; instead, they are introduced in line with other HQL throughout the rest of this post. The following is the major DDL for database operations:

--Create the database with the location, comments, and metadata information
CREATE DATABASE IF NOT EXISTS myhivebook
COMMENT 'hive database demo'
LOCATION '/hdfs/directory'
WITH DBPROPERTIES ('creator'='lenmom','date'='2019-09-09');

-- Show the database DDL; SHOW CREATE DATABASE is supported since v2.1.0
SHOW CREATE DATABASE default;


--Show and describe the database with wildcards
SHOW DATABASES;
SHOW DATABASES LIKE 'my.*';
DESCRIBE DATABASE default;

--Switch to use one database or directly qualify the table name with the database name:
USE myhivebook;
SELECT * FROM myhivebook.table_name;

--Show the current database
SELECT current_database();

--Drop the database
--fails when the database is not empty
DROP DATABASE IF EXISTS myhivebook;

--drop database and tables
DROP DATABASE IF EXISTS myhivebook CASCADE;

--Alter the database properties. The ALTER DATABASE statement can only apply
--to dbproperties , owner , and location on the database. The other database
--properties cannot be changed:
ALTER DATABASE myhivebook SET DBPROPERTIES ('edited-by'='lenmom');
ALTER DATABASE myhivebook SET OWNER USER lenmom;
ALTER DATABASE myhivebook SET LOCATION '/tmp/data/myhivebook';

 

5. Tables

5.1 Table creation

5.1.1 Create a managed table and load data

Prepare data:

$ vi /home/hadoop/employee.txt
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead

Create the table:

CREATE TABLE IF NOT EXISTS employee_internal (
name          STRING         COMMENT 'this is an optional column comment',
work_place    ARRAY<STRING>,                   -- Table column names are NOT case sensitive
gender_age    STRUCT<gender:STRING,age:INT>,
skills_score  MAP<STRING,INT>,                 -- Column names are lower case
depart_title  MAP<STRING,ARRAY<STRING>>        -- No "," after the last column
)
COMMENT 'This is an internal table'   -- This is an optional table comment
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'              -- Symbol to separate columns
COLLECTION ITEMS TERMINATED BY ','    -- Symbol to separate collection elements
MAP KEYS TERMINATED BY ':'            -- Symbol to separate keys and values
STORED AS TEXTFILE;                   -- Table file format

Load the data into the table:

LOAD DATA INPATH '/tmp/hivedemo/data/employee.txt'
OVERWRITE INTO TABLE employee_internal;

If the folder path specified in the LOCATION property does not exist, Hive will create that folder. If there is another folder inside it, Hive will NOT report errors when creating the table, but will when querying the table.

5.1.2 Create an external table and load the data

CREATE EXTERNAL TABLE employee_external ( -- Use the EXTERNAL keyword
name string,
work_place ARRAY<string>,
gender_age STRUCT<gender:string,age:int>,
skills_score MAP<string,int>,
depart_title MAP<STRING,ARRAY<STRING>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED as TEXTFILE
LOCATION '/user/dayongd/employee'; -- Specify data folder location


LOAD DATA INPATH '/tmp/hivedemo/data/employee.txt'
OVERWRITE INTO TABLE employee_external;

Since v2.1.0, Hive supports primary and foreign key constraints. However, these constraints are not validated, so the upstream system needs to ensure data integrity before the data is loaded into Hive. The constraints may help some SQL tools generate more efficient queries, but they are not used very often.
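
A minimal sketch of declaring such constraints (the table and column names here are hypothetical); DISABLE NOVALIDATE tells Hive to neither enforce nor validate them:

CREATE TABLE departments (
dept_id INT,
dept_name STRING,
PRIMARY KEY (dept_id) DISABLE NOVALIDATE
);

CREATE TABLE dept_employee (
emp_id INT,
dept_id INT,
CONSTRAINT fk_dept FOREIGN KEY (dept_id)
REFERENCES departments(dept_id) DISABLE NOVALIDATE
);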

 

5.1.3 Create temporary tables

Hive also supports creating temporary tables. A temporary table is only visible to the current user session. It's automatically deleted at the end of the session. The data of the temporary table is stored in the user's scratch directory, such as /tmp/hive-<username> . Therefore, make sure the folder is properly configured or secured when you have sensitive data in temporary tables. Whenever a temporary table has the same name as a permanent table, the temporary table will be chosen rather than the permanent table. A temporary table does not support partitions and indexes. The following are three ways to create temporary tables:

CREATE TEMPORARY TABLE IF NOT EXISTS tmp_emp1 (
name string,
work_place ARRAY<string>,
gender_age STRUCT<gender:string,age:int>,
skills_score MAP<string,int>,
depart_title MAP<STRING,ARRAY<STRING>>
);

CREATE TEMPORARY TABLE tmp_emp2 as SELECT * FROM tmp_emp1;
CREATE TEMPORARY TABLE tmp_emp3 like tmp_emp1;

 

5.1.4 Create-Table-As-Select (CTAS)

Tables can also be created and populated by the results of a query in one statement, called Create-Table-As-Select (CTAS). The table created by CTAS is not visible to other users until all the query results are populated. CTAS has the following restrictions:

  • The table created cannot be a partitioned table
  • The table created cannot be an external table
  • The table created cannot be a list-bucketing table


A CTAS statement always triggers a yarn job to populate the data, even though the SELECT * statement itself does not trigger any yarn job. CTAS can also be used with CTE, which stands for Common Table Expression. A CTE is a temporary result set derived from a simple SELECT query specified in a WITH clause, followed by a SELECT or INSERT statement that builds the result set. A CTE is defined only within the execution scope of a single statement. One or more CTEs can be used in a nested or chained way with statements such as SELECT, INSERT, CREATE TABLE AS SELECT, or CREATE VIEW AS SELECT. Using CTEs in HQL makes queries more concise and clear than writing complex nested queries.

The following are examples using CTAS and CTE for table creation:

--1. Create a table with CTAS:
CREATE TABLE ctas_employee as SELECT * FROM employee_external;

--2. Create a table with both CTAS and CTE:
CREATE TABLE cte_employee as
WITH r1 as (
SELECT name FROM r2 WHERE name = 'Michael'
),
r2 as (
SELECT name FROM employee WHERE gender_age.gender= 'Male'
),
r3 as (
SELECT name FROM employee WHERE gender_age.gender= 'Female'
)
SELECT * FROM r1
UNION ALL
SELECT * FROM r3;

SELECT * FROM cte_employee;
+--------------------+
| cte_employee.name  |
+--------------------+
| Michael            |
| Shelley            |
| Lucy               |
+--------------------+


--3. Use CTAS to create an empty table by copying the schema from another table. It is empty because the where condition is false:
CREATE TABLE empty_ctas_employee as
SELECT * FROM employee_internal WHERE 1=2;


--4. In another way, we can also use CREATE TABLE LIKE to create an empty table. 
--   This is a faster way to copy the table schema since it does not trigger any jobs but only copies metadata:
CREATE TABLE empty_like_employee LIKE employee_internal;

 

5.2 Table description

--Show tables with regular expression filters:
SHOW TABLES; -- Show all tables in database
SHOW TABLES '*sam*'; -- Show tables whose names contain "sam"
SHOW TABLES '*sam|lily*'; -- Show tables whose names contain "sam" or "lily"


--List detailed table information for all tables matching the given regular expression
SHOW TABLE EXTENDED LIKE 'employee_int*'; 

--Show table-column information:
SHOW COLUMNS IN employee_internal;
DESC employee_internal;

--Show create-table DDL statements for the specified table
SHOW CREATE TABLE employee_internal;

--Show table properties for the specified table
SHOW TBLPROPERTIES employee_internal;

 

5.3 Table cleaning

Sometimes, we may need to clean up the table, either by deleting only the records or by deleting the table along with the records. There are two statements in HQL for such cleaning. One is the DROP TABLE statement, and the other is TRUNCATE TABLE. The drop-table statement on an internal table removes the table completely and moves the data to .Trash in the current user's directory, if the trash setting is configured. The drop-table statement on an external table only removes the table definition, but keeps the data:

DROP TABLE IF EXISTS empty_ctas_employee;
No rows affected (0.283 seconds)

On the other hand, the truncate table statement only removes data from the table. The table still exists, but is empty. Note, truncate table can only apply to an internal table:

TRUNCATE TABLE cte_employee; -- Only applies to internal tables
No rows affected (0.093 seconds)

SELECT name FROM cte_employee; -- No data left, but the empty table exists
+--------------------+
| cte_employee.name |
+--------------------+
+--------------------+
No rows selected (0.059 seconds)

 

5.4 Table alteration

--1. Rename a table with the ALTER statement. This is quite often used for data backup:
ALTER TABLE cte_employee RENAME TO cte_employee_backup;

--2. Change the table properties with TBLPROPERTIES 
ALTER TABLE c_employee SET TBLPROPERTIES
('comment'='New comments');

--3. Change the table's SerDe properties with SERDEPROPERTIES
ALTER TABLE employee_internal SET SERDEPROPERTIES
('field.delim' = '$');

--4. Change the table's location, a full URI of HDFS, with LOCATION
ALTER TABLE c_employee SET LOCATION
'hdfs://localhost:9000/tmp/employee';

--5. Change the table's file format with FILEFORMAT
ALTER TABLE c_employee SET FILEFORMAT RCFILE;

--6. Enable/Disable the table's protection: NO_DROP or OFFLINE. NO_DROP prevents
--a table from being dropped, while OFFLINE prevents data (not metadata) from
--being queried in a table
ALTER TABLE  c_employee ENABLE NO_DROP;
ALTER TABLE  c_employee DISABLE NO_DROP;
ALTER TABLE  c_employee ENABLE OFFLINE;
ALTER TABLE  c_employee DISABLE OFFLINE;

Since v0.8.0, Hive supports fast block-level merging of small RCFiles using the CONCATENATE option. Since v0.14.0, it also supports fast stripe-level merging of small ORC files using the CONCATENATE option. Other file formats are not supported yet. RCFiles merge at the block level, while ORC files merge at the stripe level, thereby avoiding the overhead of decompressing and decoding the data.

--7. Enable concatenation in an RCFile or ORC table if it has many small files
ALTER TABLE c_employee SET FILEFORMAT ORC; -- Convert to ORC
ALTER TABLE c_employee CONCATENATE;

 

Other alterations:

--8. Change the column's data type, position (with AFTER or FIRST ), and comment
ALTER TABLE employee_internal
CHANGE name employee_name string AFTER gender_age;

ALTER TABLE employee_internal
CHANGE employee_name name string COMMENT 'updated' FIRST;

--9. Add new columns to a table
ALTER TABLE c_employee ADD COLUMNS (work string);

--10. Replace all columns in a table with the new columns specified
ALTER TABLE c_employee REPLACE COLUMNS (name string);

 

6. Partitions

By default, a simple HQL query scans the whole table. This slows down performance when querying a big table. The issue can be resolved by creating partitions, which are very similar to those in an RDBMS. In Hive, each partition corresponds to a predefined partition column (or columns) and maps to a subdirectory in the table's directory in HDFS. When the table is queried, only the required partitions (directories) of data are read, so the I/O and time of the query are greatly reduced. Using partitions is a very easy and effective way to improve performance in Hive.

Partitions are not created automatically. We have to use the ALTER TABLE ADD PARTITION statement to add static partitions to a table. Here, static means the partition is added manually. This command changes the table's metadata but does not load data; if the data does not exist in the partition's location, queries will not return any results. To drop the partition metadata, use the ALTER TABLE ... DROP PARTITION statement.

For external tables, ALTER changes only metadata, not data: dropping a partition will not drop the data inside the partition. To remove the data, we can use the hdfs dfs -rm command to remove it from HDFS.

For internal tables, ALTER TABLE ... DROP PARTITION removes both the partition and its data. The following are more examples of common operations on partitioned tables.

CREATE TABLE employee_partitioned (
name STRING,
work_place ARRAY<STRING>,
gender_age STRUCT<gender:STRING,age:INT>,
skills_score MAP<STRING,INT>,
depart_title MAP<STRING,ARRAY<STRING>>
-- This is regular column
)
PARTITIONED BY (year INT, month INT)
-- Use lower case partition column
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

 

6.1 Perform partition operations, such as add, remove, and rename partitions

--Show all partitions in table employee_partitioned
SHOW PARTITIONS employee_partitioned; 

ALTER TABLE employee_partitioned ADD -- Add multiple static partitions
PARTITION (year=2018, month=11) 
PARTITION (year=2018, month=12);

-- Dropping a partition with PURGE at the end removes the data completely
-- Dropping a partition does NOT remove the data for external tables
-- Dropping a partition removes the data along with the partition for internal tables
ALTER TABLE employee_partitioned
DROP IF EXISTS PARTITION (year=2018, month=11);

-- Drop all partitions in 2017
ALTER TABLE employee_partitioned
DROP IF EXISTS PARTITION (year=2017); 

-- Drop all partitions where month is 9
ALTER TABLE employee_partitioned
DROP IF EXISTS PARTITION (month=9); 

-- Rename existing partition values
ALTER TABLE employee_partitioned 
PARTITION (year=2018, month=12)
RENAME TO 
PARTITION (year=2018,month=10);

-- The following statement fails,
-- because all partition columns must be specified when renaming a partition
--ALTER TABLE employee_partitioned PARTITION (year=2018)
--RENAME TO PARTITION (year=2017);

6.2 Load data into a table partition

LOAD DATA INPATH '/tmp/hivedemo/data/employee.txt'
OVERWRITE INTO TABLE employee_partitioned
PARTITION (year=2019, month=9);

To avoid manually adding static partitions, dynamic partition insert (or multi-partition insert) is designed to dynamically determine which partitions should be added and populated while scanning the input table. It requires setting hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict. To populate data in a partition, we can use the LOAD or INSERT statements; the statement only loads the data into the specified partition list.
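
A minimal sketch of a dynamic partition insert, reusing the employee_internal and employee_partitioned tables from earlier (the partition values are taken from the trailing columns of the SELECT list):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE employee_partitioned PARTITION (year, month)
SELECT name, work_place, gender_age, skills_score, depart_title,
2019 AS year, 9 AS month
FROM employee_internal;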

 

Although partition columns map to directory names rather than data, we can query or select them like regular columns in HQL to narrow down the result set.
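
For instance, the partition columns defined above can be selected and filtered just like regular columns:

-- Only the year=2019/month=9 directory is scanned
SELECT name, year, month
FROM employee_partitioned
WHERE year = 2019 AND month = 9;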

 

The use cases for static and dynamic partitions are quite different. A static partition is often used for an external table containing data newly landed in HDFS. In this case, the date, such as yyyyMMdd, is often used as the partition column. Whenever the data of a new day arrives, we add the day-specific static partition (by script) to the table, and the newly arrived data is immediately queryable from the table. A dynamic partition is often used for data transformation between internal tables, with partition columns derived from the data itself.
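
For the static-partition case, a minimal sketch of such a daily script (the raw_events table and its dt partition column are hypothetical):

-- Expose the newly landed HDFS folder for 2019-09-09 as a partition
ALTER TABLE raw_events ADD IF NOT EXISTS
PARTITION (dt='20190909')
LOCATION '/data/raw_events/20190909';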

 

6.3 Remove data from the partition

Removing data does not remove the partition information. To do a complete cleanup, we can drop the partition (as shown in section 6.1) after the data is removed.

-- For internal table, we use truncate
TRUNCATE TABLE employee_partitioned PARTITION (year=2018,month=12);

-- For external table, we have to use hdfs command
dfs -rm -r -f /user/lenmom/employee_partitioned;

 

6.4 Add regular columns to a partition table

We CANNOT add new columns as partition columns. There are two options when adding/removing columns from a partitioned table: CASCADE and RESTRICT. The commonly used CASCADE option cascades the same change to all the partitions in the table. RESTRICT is the default; it restricts the column change to the table metadata only, which means the change is applied only to new partitions rather than existing partitions:

ALTER TABLE employee_partitioned ADD COLUMNS (work string) CASCADE;

 

6.5 Change the existing partition column's data type

ALTER TABLE employee_partitioned PARTITION COLUMN(year string);

Right now, we can only change the partition column's data type. We cannot add or remove a column from the partition columns. If we have to change the partition design, we must back up and recreate the table, and then migrate the data, as sketched below. In addition, we are NOT able to change a non-partitioned table into a partitioned table directly.
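
A minimal sketch of such a migration (the employee_partitioned_v2 table name is hypothetical), using a dynamic partition insert to repopulate the data:

-- 1. Recreate the table with the new partition design
CREATE TABLE employee_partitioned_v2 (
name STRING,
work_place ARRAY<STRING>,
gender_age STRUCT<gender:STRING,age:INT>,
skills_score MAP<STRING,INT>,
depart_title MAP<STRING,ARRAY<STRING>>
)
PARTITIONED BY (year STRING, month STRING); -- partition columns changed to STRING

-- 2. Migrate the data with a dynamic partition insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE employee_partitioned_v2 PARTITION (year, month)
SELECT name, work_place, gender_age, skills_score, depart_title, year, month
FROM employee_partitioned;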

 

6.6 Changing the partition's other properties

ALTER TABLE employee_partitioned PARTITION (year=2018)
SET FILEFORMAT ORC;
ALTER TABLE employee_partitioned PARTITION (year=2018)
SET LOCATION '/tmp/data';
ALTER TABLE employee_partitioned PARTITION (year=2018) ENABLE NO_DROP;
ALTER TABLE employee_partitioned PARTITION (year=2018) ENABLE OFFLINE;
ALTER TABLE employee_partitioned PARTITION (year=2018) DISABLE NO_DROP;
ALTER TABLE employee_partitioned PARTITION (year=2018) DISABLE OFFLINE;
ALTER TABLE employee_partitioned PARTITION (year=2018) CONCATENATE;

 

7. Buckets

Besides partitions, the bucket is another technique for clustering datasets into more manageable parts to optimize query performance. Different from a partition, a bucket corresponds to segments of files in HDFS. For example, the employee_partitioned table from the previous section uses year and month as the top-level partitions. If there were a further request to use employee_id as a third level of partitioning, it would create many partition directories. Instead, we can bucket the employee_partitioned table using employee_id as the bucket column. The value of this column is hashed into a user-defined number of buckets, and records with the same employee_id are always stored in the same bucket (segment of files). Bucket columns are defined with the CLUSTERED BY keywords. They are quite different from partition columns, since partition columns refer to directories while bucket columns have to be actual table data columns. By using buckets, an HQL query can easily and efficiently do sampling, bucket map joins, and map-side joins. An example of creating a bucket table is shown as follows:

--Prepare table employee_id and its dataset to populate bucket table
CREATE TABLE employee_id (
name STRING,
employee_id INT,
work_place ARRAY<STRING>,
gender_age STRUCT<gender:STRING,age:INT>,
skills_score MAP<STRING,INT>,
depart_title MAP<STRING,ARRAY<STRING>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
No rows affected (0.101 seconds)

LOAD DATA INPATH
'/tmp/hivedemo/data/employee_id.txt'
OVERWRITE INTO TABLE employee_id;
No rows affected (0.112 seconds)

--Create the bucket table
CREATE TABLE employee_id_buckets (
name STRING,
employee_id INT, -- Use this table column as bucket column later
work_place ARRAY<STRING>,
gender_age STRUCT<gender:string,age:int>,
skills_score MAP<string,int>,
depart_title MAP<string,ARRAY<string >>
)
CLUSTERED BY (employee_id) INTO 2 BUCKETS -- More than one bucket column is supported
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
No rows affected (0.104 seconds)

To define the proper number of buckets, we should avoid having too much or too little data in each bucket. A better choice is somewhere near two blocks of data, such as 512 MB of data in each bucket. As a best practice, use 2^N as the number of buckets. Bucketing has a close dependency on the data-loading process. To properly load data into a bucket table, we need to either set the maximum number of reducers to the same number of buckets specified in the table creation (for example, 2), or enable enforced bucketing (recommended), as follows:

set mapred.reduce.tasks = 2;
No rows affected (0.026 seconds)

set hive.enforce.bucketing = true; -- This is recommended
No rows affected (0.002 seconds)

To populate the data into a bucket table, we cannot use the LOAD DATA statement, because it does not verify the data against the metadata. Instead, INSERT should always be used to populate a bucket table:

INSERT OVERWRITE TABLE employee_id_buckets SELECT * FROM employee_id;
No rows affected (75.468 seconds)

-- Verify the buckets in HDFS from the shell
$ hdfs dfs -ls /user/hive/warehouse/employee_id_buckets
Found 2 items
-rwxrwxrwx 1 hive hive 900 2018-07-02 10:54 /user/hive/warehouse/employee_id_buckets/000000_0 
-rwxrwxrwx 1 hive hive 582 2018-07-02 10:54 /user/hive/warehouse/employee_id_buckets/000001_0
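
As mentioned earlier, buckets make sampling efficient: with TABLESAMPLE on the bucket column, Hive reads only the chosen bucket files instead of scanning the whole table. A minimal sketch:

-- Read only the first of the two buckets
SELECT name, employee_id
FROM employee_id_buckets
TABLESAMPLE (BUCKET 1 OUT OF 2 ON employee_id);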

 

8. Views

Views are logical data structures that can be used to simplify queries by hiding the complexities, such as joins, subqueries, and filters. Views are called logical because they are only defined in the metastore, without a footprint in HDFS. Unlike what's in a relational database, views in HQL do not store data or get materialized. Once the view is created, its schema is frozen immediately. Subsequent changes to the underlying tables (for example, adding a column) will not be reflected in the view's schema. If an underlying table is dropped or changed, subsequent attempts to query the invalid view will fail. In addition, views are read-only and may not be used as the target of LOAD/INSERT/ALTER statements. The following is an example of a view creation statement:

CREATE VIEW IF NOT EXISTS employee_skills
AS
SELECT
name, skills_score['DB'] as DB,
skills_score['Perl'] as Perl,
skills_score['Python'] as Python,
skills_score['Sales'] as Sales,
skills_score['HR'] as HR
FROM employee;

When creating a view, no yarn job is triggered since this is only a metadata change. However, a job will be triggered when querying the view. To check the view definition, we can use the SHOW statement. To modify the view definition, we can use the ALTER VIEW statement. The following are some examples of showing, checking, and modifying views:

8.1 Show only views in the database.

This was introduced in Hive v2.2.0. In earlier versions of Hive, we can use the SHOW TABLES statement instead:

SHOW VIEWS;
SHOW VIEWS 'employee_*';

 

8.2 Show the view's definition

DESC FORMATTED employee_skills;
SHOW CREATE TABLE employee_skills;

 

8.3 Alter the view's properties

ALTER VIEW employee_skills SET TBLPROPERTIES ('comment'='A view');

 

8.4 Redefine the view

ALTER VIEW employee_skills as SELECT * from employee;

 

8.5 Drop the view

DROP VIEW employee_skills;

 

There is a special view in HQL called the lateral view. It is usually used with user-defined table-generating functions in Hive, such as explode(), for data normalization or for processing JSON data. LATERAL VIEW first applies the table-generating function to the data, and then joins the function's input and output together. See the following examples:

SELECT name, workplace FROM employee_internal
LATERAL VIEW explode(work_place) wp as workplace;
+---------+-----------+
|  name   | workplace |
+---------+-----------+
| Michael | Montreal  |
| Michael | Toronto   |
| Will    | Montreal  |
| Shelley | Montreal  |
| Lucy    | Vancouver |
+---------+-----------+
5 rows selected (6.693 seconds)

By adding OUTER after LATERAL VIEW , we can ensure we generate the result even if the table-generating function's output is NULL :

SELECT name, workplace FROM employee_internal
LATERAL VIEW explode(split(null, ',')) wp as workplace;
+---------+-----------+
|  name   | workplace |
+---------+-----------+
+---------+-----------+
No rows selected (5.499 seconds)
SELECT name, workplace FROM employee_internal
LATERAL VIEW OUTER explode(split(null, ',')) wp as workplace;
+---------+-----------+
|  name   | workplace |
+---------+-----------+
| Michael | NULL      |
| Michael | NULL      |
| Will    | NULL      |
| Shelley | NULL      |
| Lucy    | NULL      |
+---------+-----------+
5 rows selected (5.745 seconds)

 
