大数据学习(09)—— Hive语法
Hive官方网站上有详细的语法说明,参考LanguageManual。
这里我把最常用的几块列出来。
HIVE DDL
Database
建库语句 CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name [COMMENT database_comment] [LOCATION hdfs_path] [MANAGEDLOCATION hdfs_path] [WITH DBPROPERTIES (property_name=property_value, ...)];
关键字DATABASE和SCHEMA是一个意思。LOCATION指定了外部表的文件路径,MANAGEDLOCATION指定了管理表(普通表)的文件路径。要是搞不懂区别,往下看表部分的语句。
删库语句 DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
删库默认是RESTRICT,如果里面有表的话,就不能删。使用CASCADE可以级联删除库里所有内容。
切换当前数据库 USE database_name;
--切换到默认库
USE DEFAULT;
切换当前库后,所有之后执行的命令都对这个库生效。
要想知道现在用的是哪个库,可执行 SELECT current_database() 来查询。
Table操作
建表语句 CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later) [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])] [COMMENT table_comment] [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] [SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)] ON ((col_value, col_value, ...), (col_value, col_value, ...), ...) [STORED AS DIRECTORIES] [ [ROW FORMAT row_format] [STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later) ] [LOCATION hdfs_path] [TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later) [AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables) CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name LIKE existing_table_or_view_name [LOCATION hdfs_path]; data_type : primitive_type | array_type | map_type | struct_type | union_type -- (Note: Available in Hive 0.7.0 and later) primitive_type : TINYINT | SMALLINT | INT | BIGINT | BOOLEAN | FLOAT | DOUBLE | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later) | STRING | BINARY -- (Note: Available in Hive 0.8.0 and later) | TIMESTAMP -- (Note: Available in Hive 0.8.0 and later) | DECIMAL -- (Note: Available in Hive 0.11.0 and later) | DECIMAL(precision, scale) -- (Note: Available in Hive 0.13.0 and later) | DATE -- (Note: Available in Hive 0.12.0 and later) | VARCHAR -- (Note: Available in Hive 0.12.0 and later) | CHAR -- (Note: Available in Hive 0.13.0 and later) array_type : ARRAY < data_type > map_type : MAP < primitive_type, data_type > struct_type : STRUCT < col_name : data_type [COMMENT col_comment], ...> union_type : UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later) row_format : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] [NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later) | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)] file_format: : SEQUENCEFILE | TEXTFILE -- (Default, depending on hive.default.fileformat configuration) | RCFILE -- (Note: Available in Hive 0.6.0 and later) | ORC -- (Note: Available in Hive 0.11.0 and later) | PARQUET -- (Note: Available in Hive 0.13.0 and later) | AVRO -- (Note: Available in Hive 0.14.0 and later) | JSONFILE -- (Note: Available in Hive 4.0.0 and later) | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname column_constraint_specification: : [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ] default_value: : [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ] constraint_specification: : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ] [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ] [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE [, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ] [, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
这个建表语句看起来非常复杂,前半部分是建表语句,后半部分是对数据类型和约束的描述。刚开始学可以不用关注太多复杂的东西,就把SQL里经常用的拿过来做试验即可。
管理表和外部表
管理表是Hive默认创建的表,外部表在Create后加上EXTERNAL关键字。它们的区别是,对于管理表,既管元数据,又管数据;对于外部表,只管元数据,不管数据。
| Managed Tables | External Tables | |
| ARCHIVE/UNARCHIVE | 支持 | 不支持 | 
| DROP | 删除数据和元数据 | 默认不删除数据,只删除元数据 | 
| ACID/Transactional | 支持 | 不支持 | 
| Query Results Caching | 支持 | 不支持 | 
| RELY constraint | 支持 | 支持 | 
分区表
为了减少查询数据的量,可以按照查询条件对表做分区。分区表在HDFS的存储形式是按照分区字段创建存储目录,不同分区的数据放在不同的目录下。分区表在物理上按照分区字段独立开来。分区表有静态分区和动态分区两种形式,静态分区是人为给分区字段赋固定值,动态分区是根据分区字段的值自动分区。
删表语句 DROP TABLE [IF EXISTS] table_name [PURGE]; 如果外部表设置了 external.table.purge=true,那么DROP的时候也会删除数据。
除了上面提到的库定义和表定义,DDL里还包含索引定义、视图定义、宏定义、函数定义和角色权限定义,请参考官网说明。
HIVE DML
Loading files into tables
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (3.0 or later) HIVE导入数据文件的时候,并没有做任何转换。导入文件到管理表,只是把文件复制到HDFS的存储目录下。导入文件到外部表,连文件复制的操作都没有。 --例子 CREATE TABLE tab1 (col1 int, col2 int) PARTITIONED BY (col3 int) STORED AS ORC; LOAD DATA LOCAL INPATH 'filepath' INTO TABLE tab1;
Inserting data into Hive Tables from queries
Standard syntax: INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement; INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement; 这里除了标准语法,还有HIVE的扩展语法,对于初学者来说,掌握标准语法就行。 --例子 INSERT INTO TABLE page_view PARTITION(dt='2008-06-08', country) SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt FROM page_view_stg pvs
Inserting values into tables from SQL
Standard Syntax: INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...] --例子 CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2)); INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
Update
Standard Syntax: UPDATE tablename SET column = value [, column = value ...] [WHERE expression] 这个语法跟普通SQL没什么两样
Delete
Standard Syntax: DELETE FROM tablename [WHERE expression] 不要太简单
由于HIVE的数据存放在HDFS中,HDFS不支持数据的修改和删除。因此,在使用hive的过程中,一般不会产生删除和更新的操作,如果需要支持这些操作,必须修改hive-site.xml
//在hive的hive-site.xml中添加如下配置: <property> <name>hive.support.concurrency</name> <value>true</value> </property> <property> <name>hive.enforce.bucketing</name> <value>true</value> </property> <property> <name>hive.exec.dynamic.partition.mode</name> <value>nonstrict</value> </property> <property> <name>hive.txn.manager</name> <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value> </property> <property> <name>hive.compactor.initiator.on</name> <value>true</value> </property> <property> <name>hive.compactor.worker.threads</name> <value>1</value> </property>
 
                    
                     
                    
                 
                    
                 
                
            
         浙公网安备 33010602011771号
浙公网安备 33010602011771号 
