[Big Data Series] Apache Hive Official Documentation Translation

GettingStarted


Installation and Configuration

You can install a stable release of Hive by downloading a tarball, or you can download the source code and build Hive from that.

Requirements

  • Java 1.7
    Note: Hive versions 1.2 onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see HIVE-8607).
  • Hadoop 2.x (preferred), 1.x (not supported by Hive 2.0.0 onward).
  • Hive versions up to 0.13 also supported Hadoop 0.20.x and 0.23.x.
  • Hive is commonly used in production Linux and Windows environments. Mac is a commonly used development environment. The instructions in this document are applicable to Linux and Mac. Using it on Windows would require slightly different steps.
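
A quick way to confirm these requirements on the machine you are installing on (assuming java and hadoop are already on your PATH) is to check the reported versions:

  $ java -version
  $ hadoop version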

Installing Hive from a Stable Release

Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).

Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z (where x.y.z is the release number):

  $ tar -xzvf hive-x.y.z.tar.gz

Set the environment variable HIVE_HOME to point to the installation directory:

  $ cd hive-x.y.z
  $ export HIVE_HOME=$(pwd)

Finally, add $HIVE_HOME/bin to your PATH:

  $ export PATH=$HIVE_HOME/bin:$PATH

Building Hive from Source

The Hive GIT repository for the most recent Hive code is located here: git clone https://git-wip-us.apache.org/repos/asf/hive.git (the master branch).

All release versions are in branches named "branch-0.#" or "branch-1.#" or the upcoming "branch-2.#", with the exception of release 0.8.1 which is in "branch-0.8-r2". Any branches with other names are feature branches for works-in-progress. See Understanding Hive Branches for details.
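
For example, to work from a release line instead of master, you can list the remote branches and check one out by name; branch-1 below is only an illustration of the naming scheme described above:

  $ git branch -r          # list release and feature branches
  $ git checkout branch-1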

As of 0.13, Hive is built using Apache Maven.

Compile Hive on master

To build the current Hive code from the master branch:

  $ git clone https://git-wip-us.apache.org/repos/asf/hive.git
  $ cd hive
  $ mvn clean package -Pdist   # build the binary distribution
  $ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin
  $ ls
  LICENSE
  NOTICE
  README.txt
  RELEASE_NOTES.txt
  bin/ (all the shell scripts)
  lib/ (required jar files)
  conf/ (configuration files)
  examples/ (sample input and query files)
  hcatalog / (hcatalog installation)
  scripts / (upgrade scripts for hive-metastore)

Here, {version} refers to the current Hive version.

If building Hive source using Maven (mvn), we will refer to the directory "/packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin" as <install-dir> for the rest of the page.

Compile Hive on branch-1

In branch-1, Hive supports both Hadoop 1.x and 2.x. You will need to specify which version of Hadoop to build against via a Maven profile. To build against Hadoop 1.x use the profile hadoop-1; for Hadoop 2.x use hadoop-2. For example, to build against Hadoop 1.x, the above mvn command becomes:

  $ mvn clean package -Phadoop-1,dist
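
Likewise, to build branch-1 against Hadoop 2.x, use the hadoop-2 profile:

  $ mvn clean package -Phadoop-2,dist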

Compile Hive Prior to 0.13 on Hadoop 0.20

Prior to Hive 0.13, Hive was built using Apache Ant. To build an older version of Hive on Hadoop 0.20:

  $ svn co http://svn.apache.org/repos/asf/hive/branches/branch-{version} hive
  $ cd hive
  $ ant clean package
  $ cd build/dist
  $ ls
  LICENSE
  NOTICE
  README.txt
  RELEASE_NOTES.txt
  bin/ (all the shell scripts)
  lib/ (required jar files)
  conf/ (configuration files)
  examples/ (sample input and query files)
  hcatalog / (hcatalog installation)
  scripts / (upgrade scripts for hive-metastore)

If using Ant, we will refer to the directory "build/dist" as <install-dir>.

Compile Hive Prior to 0.13 on Hadoop 0.23

To build Hive in Ant against Hadoop 0.23, 2.0.0, or another version, build with the appropriate flag; some examples below:

  $ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
  $ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23

Running Hive

Hive uses Hadoop, so:

  • you must have Hadoop in your path OR 
  • export HADOOP_HOME=<hadoop-install-dir>

In addition, you must use the HDFS commands below to create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w before you can create a table in Hive.

  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

You may find it useful, though it's not necessary, to set HIVE_HOME:

  $ export HIVE_HOME=<hive-install-dir>

Running Hive CLI

To use the Hive command line interface (CLI) from the shell:

  $ $HIVE_HOME/bin/hive

Running HiveServer2 and Beeline

Starting from Hive 2.1, we need to run the schematool command below as an initialization step. For example, we can use "derby" as the db type.

  $ $HIVE_HOME/bin/schematool -dbType <db type> -initSchema
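
For example, initializing the schema with the embedded Derby database mentioned above looks like this:

  $ $HIVE_HOME/bin/schematool -dbType derby -initSchema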

HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. HiveCLI is now deprecated in favor of Beeline, as it lacks the multi-user, security, and other capabilities of HiveServer2. To run HiveServer2 and Beeline from the shell:

  $ $HIVE_HOME/bin/hiveserver2

  $ $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT

Beeline is started with the JDBC URL of the HiveServer2, which depends on the address and port where HiveServer2 was started. By default, it will be (localhost:10000), so the address will look like jdbc:hive2://localhost:10000.
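
So a typical connection to a HiveServer2 running locally on the default port would look like the sketch below; the -n flag supplies a username and may be unnecessary or insufficient depending on how authentication is configured:

  $ $HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -n $USER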

Or to start Beeline and HiveServer2 in the same process for testing purposes, for a similar user experience to HiveCLI:

  $ $HIVE_HOME/bin/beeline -u jdbc:hive2://

Running HCatalog

To run the HCatalog server from the shell in Hive release 0.11.0 and later:

  $ $HIVE_HOME/hcatalog/sbin/hcat_server.sh

To use the HCatalog command line interface (CLI) in Hive release 0.11.0 and later:

  $ $HIVE_HOME/hcatalog/bin/hcat

For more information, see HCatalog Installation from Tarball and HCatalog CLI in the HCatalog manual.

Running WebHCat (Templeton)

To run the WebHCat server from the shell in Hive release 0.11.0 and later:

  $ $HIVE_HOME/hcatalog/sbin/webhcat_server.sh

For more information, see WebHCat Installation in the WebHCat manual.

Configuration Management Overview

  • Hive by default gets its configuration from <install-dir>/conf/hive-default.xml
  • The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.
  • Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
  • Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
  • Hive configuration is an overlay on top of Hadoop – it inherits the Hadoop configuration variables by default.
  • Hive configuration can be manipulated by:
    • Editing hive-site.xml and defining any desired variables (including Hadoop variables) in it
    • Using the set command (see next section)
    • Invoking Hive (deprecated), Beeline or HiveServer2 using the syntax:
      • $ bin/hive --hiveconf x1=y1 --hiveconf x2=y2  //this sets the variables x1 and x2 to y1 and y2 respectively
      • $ bin/hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2  //this sets server-side variables x1 and x2 to y1 and y2 respectively
      • $ bin/beeline --hiveconf x1=y1 --hiveconf x2=y2  //this sets client-side variables x1 and x2 to y1 and y2 respectively.
    • Setting the HIVE_OPTS environment variable to "--hiveconf x1=y1 --hiveconf x2=y2", which does the same as above (see the sketch below).
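
As a minimal sketch of that last option, using the same placeholder variables x1 and x2 from the examples above:

  $ export HIVE_OPTS="--hiveconf x1=y1 --hiveconf x2=y2"
  $ bin/hive      # picks the variables up from HIVE_OPTS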

Runtime Configuration

  • Hive queries are executed using map-reduce queries and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.
  • The HiveCLI (deprecated) and Beeline command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For example:

    beeline> SET mapred.job.tracker=myhost.mycompany.com:50030;
    beeline> SET -v;

    The latter shows all the current settings. Without the -v option only the variables that differ from the base Hadoop configuration are displayed.

Hive, Map-Reduce and Local-Mode

The Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the Map-Reduce cluster indicated by the variable:

  mapred.job.tracker

While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets – in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow processing larger data sets.

Starting with release 0.7, Hive fully supports local mode execution. To enable this, the user can enable the following option:

  hive> SET mapreduce.framework.name=local;

In addition, mapred.local.dir should point to a path that's valid on the local machine (for example /tmp/<username>/mapred/local). (Otherwise, the user will get an exception allocating local disk space.)

Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local mode automatically. The relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max, and hive.exec.mode.local.auto.tasks.max:

  hive> SET hive.exec.mode.local.auto=false;

Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:

  • The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
  • The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)
  • The total number of reduce tasks required is 1 or 0.

So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.
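
For example, automatic local mode can be enabled for a session as follows; the two threshold settings are simply the documented defaults restated for illustration and can be omitted or tuned:

  hive> SET hive.exec.mode.local.auto=true;
  hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;
  hive> SET hive.exec.mode.local.auto.tasks.max=4;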

Note that there may be differences in the runtime environment of Hadoop server nodes and the machine running the Hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the Hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option hive.mapred.local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.

Hive Logging

Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The default logging level is WARN for Hive releases prior to 0.13.0. Starting with Hive 0.13.0, the default logging level is INFO.

The logs are stored in the directory /tmp/<user.name>:

  • /tmp/<user.name>/hive.log
    Note: In local mode, prior to Hive 0.13.0 the log file name was ".log" instead of "hive.log". This bug was fixed in release 0.13.0 (see HIVE-5528 and HIVE-5676).

To configure a different log location, set hive.log.dir in $HIVE_HOME/conf/hive-log4j.properties. Make sure the directory has the sticky bit set (chmod 1777 <dir>).

  • hive.log.dir=<other_location>

If the user wishes, the logs can be emitted to the console by adding the arguments shown below:

  • bin/hive --hiveconf hive.root.logger=INFO,console  //for HiveCLI (deprecated)
  • bin/hiveserver2 --hiveconf hive.root.logger=INFO,console

Alternatively, the user can change only the logging level by using:

  • bin/hive --hiveconf hive.root.logger=INFO,DRFA //for HiveCLI (deprecated)
  • bin/hiveserver2 --hiveconf hive.root.logger=INFO,DRFA

Another option for logging is TimeBasedRollingPolicy (applicable for Hive 0.15.0 and above, HIVE-9001) by providing the DAILY option as shown below:

  • bin/hive --hiveconf hive.root.logger=INFO,DAILY //for HiveCLI (deprecated)
  • bin/hiveserver2 --hiveconf hive.root.logger=INFO,DAILY

Note that setting hive.root.logger via the 'set' command does not change logging properties since they are determined at initialization time.

Hive also stores query logs on a per Hive session basis in /tmp/<user.name>/, but this can be configured in hive-site.xml with the hive.querylog.location property.

Logging during Hive execution on a Hadoop cluster is controlled by the Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task, stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.

When using local mode (using mapreduce.framework.name=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6, Hive uses hive-exec-log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server for example). Execution logs are invaluable for debugging run-time errors.

For information about WebHCat errors and logging, see Error Codes and Responses and Log Files in the WebHCat manual.

Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to hive-dev@hadoop.apache.org.

From Hive 2.1.0 onwards (with HIVE-13027), Hive uses Log4j2's asynchronous logger by default.

Setting hive.async.log.enabled to false will disable asynchronous logging and fall back to synchronous logging. Asynchronous logging can give a significant performance improvement, as logging will be handled in a separate thread that uses the LMAX disruptor queue for buffering log messages. Refer to https://logging.apache.org/log4j/2.x/manual/async.html for benefits and drawbacks.
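
For example, asynchronous logging can be switched back to synchronous logging at startup using the property named above, following the same --hiveconf pattern used earlier:

  $ bin/hiveserver2 --hiveconf hive.async.log.enabled=false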

HiveServer2 Logs

HiveServer2 operation logs are available to clients starting in Hive 0.14. See HiveServer2 Logging for configuration.

Audit Logs

Audit logs are logged from the Hive metastore server for every metastore API invocation.

An audit log has the function and some of the relevant function arguments logged in the metastore log file. It is logged at the INFO level of log4j, so you need to make sure that logging at the INFO level is enabled (see HIVE-3505). The name of the log entry is "HiveMetaStore.audit".

Audit logs were added in Hive 0.7 for secure client connections (HIVE-1948) and in Hive 0.10 for non-secure connections (HIVE-3277; also see HIVE-2797).

Perf Logger

In order to obtain the performance metrics via the PerfLogger, you need to set DEBUG level logging for the PerfLogger class (HIVE-12675). This can be achieved by setting the following in the log4j properties file:

  log4j.logger.org.apache.hadoop.hive.ql.log.PerfLogger=DEBUG

If the logger level has already been set to DEBUG at root via hive.root.logger, the above setting is not required to see the performance logs.

DDL Operations

 The Hive DDL operations are documented in Hive Data Definition Language.

Creating Hive Tables

  hive> CREATE TABLE pokes (foo INT, bar STRING);

creates a table called pokes with two columns, the first being an integer and the other a string.

  hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into.

By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a).
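
If your data uses a different delimiter, the row format can be declared explicitly when creating the table. The sketch below reuses the pokes schema with a tab delimiter; the table name pokes_tsv is just an illustration, and the ROW FORMAT clause is the same one used in the MovieLens example later on this page:

  hive> CREATE TABLE pokes_tsv (foo INT, bar STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        STORED AS TEXTFILE;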

Browsing through Tables

  hive> SHOW TABLES;

lists all the tables.

  hive> SHOW TABLES '.*s';

lists all the tables that end with 's'. The pattern matching follows Java regular expressions. Check out this link for documentation: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html.

  hive> DESCRIBE invites;

shows the list of columns.

Altering and Dropping Tables

Table names can be changed and columns can be added or replaced:

  hive> ALTER TABLE events RENAME TO 3koobecaf;
  hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
  hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
  hive> ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');

Note that REPLACE COLUMNS replaces all existing columns and only changes the table's schema, not the data. The table must use a native SerDe. REPLACE COLUMNS can also be used to drop columns from the table's schema:

  hive> ALTER TABLE invites REPLACE COLUMNS (foo INT COMMENT 'only keep the first column');

Dropping tables:

  hive> DROP TABLE pokes;

Metadata Store

Metadata is in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default this location is ./metastore_db (see conf/hive-default.xml).

Right now, in the default configuration, this metadata can only be seen by one user at a time.

Metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName. Refer to the JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in the JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.
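
As an illustrative sketch only (the MySQL URL, driver class, and credentials below are placeholders, not values taken from this page), pointing the metastore at an external RDBMS means overriding those two variables, plus the matching credentials, in hive-site.xml:

  <!-- hive-site.xml: placeholder values, adjust for your database -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://dbhost/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>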

In the future, the metastore itself can be a standalone server.

If you want to run the metastore as a network server so it can be accessed from multiple nodes, see Hive Using Derby in Server Mode.

DML Operations

The Hive DML operations are documented in Hive Data Manipulation Language.

Loading data from flat files into Hive:

  hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

Loads a file that contains two columns separated by ctrl-a into the pokes table. 'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted then it looks for the file in HDFS.

The keyword 'OVERWRITE' signifies that existing data in the table is deleted. If the 'OVERWRITE' keyword is omitted, data files are appended to existing data sets.
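
For example, omitting the keyword appends to the table instead of replacing its contents (same sample file and table as above):

  hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' INTO TABLE pokes;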

NOTES:

  • NO verification of data against the schema is performed by the load command.
  • If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
    The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.

  hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
  hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.

  hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

The above command will load data from an HDFS file/directory to the table.
Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.

SQL Operations

The Hive query operations are documented in Select.

Example Queries

Some example queries are shown below. They are available in build/dist/examples/queries.
More are available in the Hive sources at ql/src/test/queries/positive.

SELECTS and FILTERS

  hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

selects column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a Hive table, local directory or HDFS directory) is optional.

  hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

selects all rows from partition ds=2008-08-15 of the invites table into an HDFS directory. The result data is in files (depending on the number of mappers) in that directory.

NOTE: partition columns if any are selected by the use of *. They can also be specified in the projection clauses.

Partitioned tables must always have a partition selected in the WHERE clause of the statement.
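
For example, the partition column ds can be listed in the projection alongside the regular columns (a small variation of the query above):

  hive> SELECT a.foo, a.ds FROM invites a WHERE a.ds='2008-08-15';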

  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

selects all rows from the pokes table into a local directory.

  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

selects the sum of a column. The avg, min, or max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
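
For instance, the last query above could compute an average instead of a sum (the output directory name is just illustrative):

  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/avg' SELECT AVG(a.pc) FROM pc1 a;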

GROUP BY

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
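
On such older versions the second query above would be written as:

  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;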

JOIN

  hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

MULTITABLE INSERT

  FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

STREAMING

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (like Hadoop streaming).
Similarly – streaming can be used on the reduce side (please see the Hive Tutorial for examples).

Simple Example Use Cases

MovieLens User Ratings  

First, create a table with tab-delimited text file format:

CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Then, download the data files from MovieLens 100k on the GroupLens datasets page (which also has a README.txt file and an index of unzipped files):

wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

or:

curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip

Note:  If the link to GroupLens datasets does not work, please report it on HIVE-5341 or send a message to the user@hive.apache.org mailing list.

Unzip the data files:

unzip ml-100k.zip

And load u.data into the table that was just created:

LOAD DATA LOCAL INPATH '<path>/u.data'
OVERWRITE INTO TABLE u_data;

Count the number of rows in table u_data:

SELECT COUNT(*) FROM u_data;

Note that for older versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

Now we can do some complex data analysis on the table u_data:

Create weekday_mapper.py:

import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])

Use the mapper script:

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;

Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).

Apache Weblog Data

The format of Apache weblog is customizable, while most webmasters use the default.
For default Apache weblog, we can create a table with the following command.

More about RegexSerDe can be found here in HIVE-662 and HIVE-1719.

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
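
Once log data has been loaded into the table (for example with LOAD DATA as shown earlier), the parsed fields can be queried like any other columns; a minimal sketch:

SELECT host, status, request
FROM apachelog
LIMIT 10;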