Hadoop in Depth: Distributed Storage, MapReduce Computation, and the Hive Data Warehouse

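I. ZooKeeper environment setup
II. ZooKeeper shell client operations
III. Recompiling Hadoop: preparation
IV. HDFS command-line operations
- 1. Basic commands
- 2. Advanced commands
V. HDFS API operations
VI. MapReduce example
VII. MapReduce partitioning
VIII. Counters in MapReduce
- 1. Custom counters
  - (1) First approach
  - (2) Second approach
IX. Sorting and serialization in MapReduce
X. The Combiner
XI. MapReduce comprehensive example: step-by-step analysis of sum statistics
XII. MapReduce example: reduce-side join
XIII. Custom InputFormat for merging small files
XIV. Custom OutputFormat
- 1. The custom MyOutputFormat class
- 2. The custom MyRecordWritter class
XV. Custom grouping: finding the top N
XVI. The Hive data warehouse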

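I. ZooKeeper environment setup

ZooKeeper aims to provide distributed applications with a high-performance, highly available distributed coordination service with strictly ordered access control.

Server IP	Hostname	myid value
192.168.186.133	vmone	1
192.168.186.134	vmtwo	2
192.168.186.135	vmthree	3

All else being equal, the server with the higher myid is more likely to be elected leader: votes are compared by epoch and zxid first, and myid breaks ties.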
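
To make the role of myid concrete, here is a minimal sketch of how two votes might be ordered during leader election. It assumes the commonly described ordering (epoch first, then zxid, then myid); the `Vote` class and its `wins` method are hypothetical names for illustration, not ZooKeeper's own API.

```java
// Hypothetical sketch: orders two leader-election votes by (epoch, zxid, myid).
final class Vote {
    final long epoch; // election epoch
    final long zxid;  // id of the last transaction the server has seen
    final long myid;  // server id taken from the myid file

    Vote(long epoch, long zxid, long myid) {
        this.epoch = epoch;
        this.zxid = zxid;
        this.myid = myid;
    }

    // Returns true when this vote beats the other one.
    boolean wins(Vote other) {
        if (epoch != other.epoch) {
            return epoch > other.epoch;
        }
        if (zxid != other.zxid) {
            return zxid > other.zxid;
        }
        return myid > other.myid; // myid only decides when everything else is equal
    }
}
```

On a freshly installed cluster every server has the same epoch and zxid, which is why the higher myid tends to win there.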

First rename the three machines to vmone, vmtwo, and vmthree; for the exact commands see: https://blog.csdn.net/chen1092248901/article/details/81556774

1. Download the ZooKeeper archive from:

http://archive.apache.org/dist/zookeeper/zookeeper-3.4.10/

On the vmone virtual machine, place the ZooKeeper archive in /export/softwares, ready for installation.

2. Extract

Extract the ZooKeeper archive into /export/servers:

tar -zxvf zookeeper-3.4.10.tar.gz -C ../servers/

3. Edit the configuration file

Edit the configuration file on the first machine (vmone).

Go to /export/servers/zookeeper-3.4.10/conf and copy zoo_sample.cfg:

cp zoo_sample.cfg zoo.cfg
vim zoo.cfg

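# Data directory; the myid file is stored here
dataDir=/export/servers/zookeeper-3.4.10/zkdatas
# Number of snapshots to retain
autopurge.snapRetainCount=3
# Purge interval, in hours
autopurge.purgeInterval=1
# Server addresses in the cluster (add these lines if they are absent)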
server.1=vmone:2888:3888
server.2=vmtwo:2888:3888
server.3=vmthree:2888:3888

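4. Add the myid file

On the first machine, create a file named myid under /export/servers/zookeeper-3.4.10/zkdatas/ whose content is 1.

5. Distribute the installation and change myid

Distribute the installation to the other machines.

These are CentOS virtual machines: the first was cloned twice, and myid was set to 2 and 3 on the clones. The details are omitted.

6. Start ZooKeeper

The ZooKeeper service must run on all three machines, so run this command (from the bin directory) on each of them: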

./zkServer.sh start

Verify that it started

jps ships with the JDK; install the OpenJDK development package if the command is missing:

yum install java-1.8.0-openjdk-devel.x86_64

[root@localhost bin]# jps
3283 Jps
2996 QuorumPeerMain    <- the ZooKeeper process
[root@localhost bin]#

To check whether this server is the leader or a follower, run:

./zkServer.sh status

The ZooKeeper cluster installation is complete.

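II. ZooKeeper shell client operations

1. Connect with the ZooKeeper client

./zkCli.sh -server vmone:2181

If only one ZooKeeper server is running and the other two are stopped, connecting to the running server prints the error Unable to read additional data from server sessionid 0x0 endlessly. The reason is that zoo.cfg lists three machines but only one was started, so ZooKeeper treats the service as unavailable. ZooKeeper relies on an election algorithm that needs a majority: when fewer than a strict majority of the configured servers are running, the cluster is considered unavailable. With three machines, therefore, one running server cannot be connected to, while two or more can.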
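
The majority rule described above can be sketched in a few lines. This is only an illustration of the arithmetic; `QuorumCheck` and `hasQuorum` are hypothetical names, not part of ZooKeeper.

```java
// Hypothetical helper: a cluster is available only while a strict majority of servers is running.
final class QuorumCheck {
    static boolean hasQuorum(int ensembleSize, int running) {
        return running > ensembleSize / 2;
    }

    public static void main(String[] args) {
        // Walk through the three-server setup used in this article.
        for (int running = 0; running <= 3; running++) {
            System.out.println(running + " of 3 running -> available: " + hasQuorum(3, running));
        }
    }
}
```

The same rule explains why clusters usually have an odd number of servers: four servers tolerate only one failure, exactly like three.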

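If that does not solve the problem, check whether the Linux firewall is running and open the ports (here the firewall was simply turned off).

Also edit /etc/hosts and add the three hostnames with their corresponding IP addresses.

All three hosts need this configuration; with that in place the startup problem is solved.

1: Check the firewall status
systemctl status firewalld
service  iptables status

2: Stop the firewall temporarily
systemctl stop firewalld
service  iptables stop

3: Disable the firewall permanently
systemctl disable firewalld
chkconfig iptables off

4: Restart the firewall
systemctl restart firewalld
service iptables restart

Once connected, you can operate ZooKeeper from the terminal; type quit to disconnect.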

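2. Common ZooKeeper commands

Command	Description	Parameters
create [-s] [-e] path data acl	Create a znode	-s creates a sequential node; -e creates an ephemeral node
ls path [watch]	List all child znodes of path
get path [watch]	Get the data and attributes of the znode at path
ls2 path [watch]	List the child znodes of path together with the status (stat) of path
set path data [version]	Update a node	version: the data version
delete path [version]	Delete a node	version: the data version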

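3. Command examples

List all znodes under a path (here the ZooKeeper root):

ls /

Create a persistent node; hello is the node name and world is the data it carries:

create /hello world

Create an ephemeral node (ephemeral nodes cannot have children):

create -e /abc world

Create a persistent sequential node:

create -s /zhangsan boy

Create an ephemeral sequential node:

create -e -s /lisi boy

Update a node's data:

set /hello zookeeper

Delete a node; a node that has child znodes cannot be deleted this way:

delete /hello

Delete a node, recursively deleting any child znodes:

rmr /hello

Show the command history: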

history

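4. ZooKeeper's watch mechanism

A watch resembles a database trigger: set a watcher on a znode, and when the znode changes (it is deleted, modified, or created, or its children change), the WatchManager invokes the corresponding watcher and notifies the client.

A watch fires only once. To be notified again, you must register it again.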
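
The one-shot behaviour is easy to misread, so here is a small in-memory model of it. It is a sketch under stated assumptions: `OneShotWatchRegistry` and its methods are hypothetical and do not use the ZooKeeper client API; they only mimic "fire once, then forget".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory model of one-shot watches; not the ZooKeeper client API.
final class OneShotWatchRegistry {
    private final Map<String, Consumer<String>> watchers = new HashMap<>();

    // Register a watcher for a path; it replaces any previous watcher on that path.
    void watch(String path, Consumer<String> watcher) {
        watchers.put(path, watcher);
    }

    // Simulate a change: the watcher is removed before it is called, so it fires only once.
    boolean change(String path, String event) {
        Consumer<String> watcher = watchers.remove(path);
        if (watcher == null) {
            return false; // nobody is watching this path any more
        }
        watcher.accept(event);
        return true;
    }
}
```

A client that wants continuous notifications re-registers inside the callback, which is what higher-level helpers such as Curator's TreeCache (used in the next subsection) take care of.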

5. ZooKeeper Java API operations

Create a Spring Boot project and add the dependencies:

		<dependency>
            <groupId>org.apache.curator</groupId>
            <artifactId>curator-framework</artifactId>
            <version>2.12.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.curator</groupId>
            <artifactId>curator-recipes</artifactId>
            <version>2.12.0</version>
        </dependency>
        <dependency>
            <groupId>com.google.collections</groupId>
            <artifactId>google-collections</artifactId>
            <version>1.0</version>
        </dependency>

(1) Create a node

    @Test
    public void test() throws Exception{
        createZnode(2);
    }

    /**
     * Create a ZooKeeper node
     * param: 1 = persistent node, 2 = ephemeral node
     */
    public void createZnode(Integer param) throws Exception{
        // Define a retry policy
        /**
         * param1: base sleep time between retries, in milliseconds
         * param2: maximum number of retries
         */
        RetryPolicy retryPolicy=new ExponentialBackoffRetry(1000,1);
        // Obtain a client object
        /**
         * param1: list of ZooKeeper servers to connect to
         * param2: session timeout, in milliseconds
         * param3: connection timeout, in milliseconds
         * param4: retry policy
         */
        String connectionStr="192.168.186.133:2181,192.168.186.134:2181,192.168.186.135:2181";
        CuratorFramework client = CuratorFrameworkFactory.newClient(connectionStr, 8000, 8000, retryPolicy);
        // Start the client
        client.start();
        // Create the node
        if(param==1){
            client.create().creatingParentsIfNeeded().withMode(CreateMode.PERSISTENT).forPath("/hello2","world".getBytes());
        }else if(param==2){
            client.create().creatingParentsIfNeeded().withMode(CreateMode.EPHEMERAL).forPath("/hello_tmp","world".getBytes());
            // Sleep for 10 seconds: an ephemeral node exists only during the session and disappears
            // when the client closes, so the pause is only there to make the node observable
            Thread.sleep(10000);
        }
        // Close the client
        client.close();
    }

Use ls / in the terminal to check whether the node was created.

(2) Update node data

	/**
     * Update node data
     * Adding data to a node and modifying it work the same way: a node holds a single value, and new data overwrites the old
     */
    public void updateZnodeData() throws Exception{
        RetryPolicy retryPolicy=new ExponentialBackoffRetry(1000,1);
        String connectionStr="192.168.186.133:2181,192.168.186.134:2181,192.168.186.135:2181";
        CuratorFramework client = CuratorFrameworkFactory.newClient(connectionStr, 8000, 8000, retryPolicy);
        client.start();
        client.setData().forPath("/hello2","dataValue".getBytes());
        client.close();
    }

(3) Read node data

public void getZnodeData() throws Exception{
        RetryPolicy retryPolicy=new ExponentialBackoffRetry(1000,1);
        String connectionStr="192.168.186.133:2181,192.168.186.134:2181,192.168.186.135:2181";
        CuratorFramework client = CuratorFrameworkFactory.newClient(connectionStr, 8000, 8000, retryPolicy);
        client.start();
        byte[] bytes = client.getData().forPath("/hello2");
        // Convert the bytes to a String; concatenating the array itself would print only its reference
        System.out.println("Node data: "+new String(bytes));
        client.close();
    }

(4) Watching a node

public void watch() throws Exception{
        RetryPolicy retryPolicy=new ExponentialBackoffRetry(1000,1);
        String connectionStr="192.168.186.133:2181,192.168.186.134:2181,192.168.186.135:2181";
        CuratorFramework client = CuratorFrameworkFactory.newClient(connectionStr, 8000, 8000, retryPolicy);
        client.start();
        // Create a TreeCache for the node to monitor; the path must start with "/"
        TreeCache treeCache=new TreeCache(client,"/hello2");
        // Register a custom listener
        treeCache.getListenable().addListener(new TreeCacheListener() {
            @Override
            public void childEvent(CuratorFramework curatorFramework, TreeCacheEvent treeCacheEvent) throws Exception {
                ChildData data = treeCacheEvent.getData();
                if(data!=null){ // the listener was triggered by a node event
                    switch (treeCacheEvent.getType()){
                        case NODE_ADDED:
                            System.out.println("A node was added");
                            break;
                        case NODE_REMOVED:
                            System.out.println("A node was removed");
                            break;
                        case NODE_UPDATED:
                            System.out.println("A node was updated");
                            break;
                        default:
                            break;
                    }
                }
            }
        });
        // Start the cache so that the listener begins receiving events
        treeCache.start();
        Thread.sleep(1000000);// keeps the test alive so the events can be observed

        client.close();
    }

III. Recompiling Hadoop: preparation

Download the source release of Apache Hadoop 2.7.5:

http://archive.apache.org/dist/hadoop/core/hadoop-2.7.5/

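Recompiling the Apache release of Hadoop

1. Why recompile Hadoop?

The Hadoop package published by Apache does not provide the interfaces for access from C programs, so using the native libraries causes problems. The Hadoop source therefore has to be recompiled.

2. Prepare the build environment

A Linux environment with more than 4 GB of memory and a 64-bit operating system.

3. Connect the virtual machine to the network, stop the firewall, and disable SELinux

# Stop the firewall
systemctl stop firewalld
# Disable SELinux
vim /etc/selinux/config
# In that file, set SELINUX=disabled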

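4. Install JDK 1.7

Note that Hadoop 2.7.5 can only be compiled with JDK 1.7; JDK 1.8 causes build errors. Check whether a bundled OpenJDK is present and, if so, uninstall it.

Check for a bundled OpenJDK:

rpm -qa|grep java

Uninstall every OpenJDK package: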

rpm -e --nodeps java-1.8.0-openjdk-devel-1.8.0.262.b10-0.el7_8.x86_64 java-1.7.0-openjdk-headless-1.7.0.221-2.6.18.1.el7.x86_64 java-1.8.0-openjdk-headless-1.8.0.262.b10-0.el7_8.x86_64 java-1.7.0-openjdk-1.7.0.221-2.6.18.1.el7.x86_64 tzdata-java-2020a-1.el7.noarch python-javapackages-3.4.1-11.el7.noarch javapackages-tools-3.4.1-11.el7.noarch

Download the JDK 1.7 package

https://www.oracle.com/java/technologies/javase/javase7-archive-downloads.html#jdk-7u80-oth-JPR

After downloading, upload the package to /export/softwares

Extract

cd /export/softwares
tar -zxvf jdk-7u71-linux-x64.tar.gz -C ../servers/

Configure the environment variables

vim /etc/profile
# Add the following two lines
export JAVA_HOME=/export/servers/jdk1.7.0_71
export PATH=$JAVA_HOME/bin:$PATH
# Reload the profile
source /etc/profile

5. Install Maven

Any Maven 3.x release works; version 3.0.5 is strongly recommended.

Download Maven

http://archive.apache.org/dist/maven/maven-3/3.0.5/binaries/

Upload the Maven archive to /export/softwares, then extract it into /export/servers

Extract

tar -zxvf apache-maven-3.0.5-bin.tar.gz -C ../servers/

Configure the environment variables

vim /etc/profile
# Add the following lines
export MAVEN_HOME=/export/servers/apache-maven-3.0.5
export MAVEN_OPTS="-Xms4096m -Xmx4096m"
export PATH=$MAVEN_HOME/bin:$PATH
# Reload the profile
source /etc/profile
# Check the Maven version
mvn -version

6. Set up the local Maven repository

Create a directory named mvnrepository to hold the JAR files Maven downloads.

Then edit settings.xml in Maven's conf directory to point at the repository location:

<localRepository>/export/servers/mvnrepository</localRepository>

Then add the Aliyun mirror:

<mirror>
    <id>alimaven</id>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <mirrorOf>central</mirrorOf>
</mirror>

7. Install FindBugs

Download from:

https://sourceforge.net/projects/findbugs/files/findbugs/1.3.9/

Extract

tar -zxvf findbugs-1.3.9.tar.gz -C ../servers/

Configure the FindBugs environment variables

vim /etc/profile
# Add the following lines
export FINDBUGS_HOME=/export/servers/findbugs-1.3.9
export PATH=$FINDBUGS_HOME/bin:$PATH
# Reload the profile
source /etc/profile
# Check the version
findbugs -version

8. Install the dependency packages online

yum install autoconf automake libtool cmake
yum install ncurses-devel
yum install openssl-devel
yum install lzo-devel zlib-devel gcc gcc-c++

Package required for bzip2 compression:

yum install -y bzip2-devel

9. Install protobuf

Extract and build

tar -zxvf protobuf-2.5.0.tar.gz -C ../servers/
cd ../servers/protobuf-2.5.0/
./configure
make && make install

10. Install snappy

tar -zxvf snappy-1.1.1.tar.gz -C ../servers/
cd ../servers/snappy-1.1.1
./configure
make && make install

11. Compile the Hadoop source

Download the Hadoop source archive

Download from: http://archive.apache.org/dist/hadoop/core/hadoop-2.7.5/

Compile the source

tar -zxvf hadoop-2.7.5-src.tar.gz -C ../servers/
cd ../servers/hadoop-2.7.5-src/

Run the following command to build with snappy compression support:

mvn package -DskipTests -Pdist,native -Dtar -Drequire.snappy -e -X -Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true

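The virtual machine must have more than 4 GB of memory and about 20 GB of disk: with too little memory the JVM crashes, and with too little disk the build may fail with an out-of-space error.
The flags -Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true tell Maven to ignore SSL certificate problems; without them the build fails.

Reason: since the Aliyun repository moved to HTTPS, downloads require SSL verification, and without local certificate configuration Maven ends up using the default repository instead. The flags configure Maven Wagon, the component that implements Maven's HTTP access to remote repositories stored on HTTP servers.

After the build, hadoop-2.7.5.tar.gz appears under hadoop-2.7.5-src/hadoop-dist/target; this is the recompiled Hadoop installation package.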

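12. Install Hadoop

Only a single-node setup is shown here, using the vmtwo machine from above; that is enough for learning.

For a cluster installation see the link below (it amounts to copying the installation to the other machines):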

https://www.cnblogs.com/rqx-20181108/p/10278038.html

Upload the Hadoop package and extract it

tar -zxvf hadoop-2.7.5.tar.gz -C ../servers/

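13. Edit the configuration files

Many configuration files need changes, and editing them one by one with vim in Xshell is tedious, so Notepad++ is used here to connect to the virtual machine remotely and edit the files.

For a walkthrough of the connection, see the link below.

Editing virtual machine files remotely with Notepad++: https://zhuanlan.zhihu.com/p/56313557

Once connected, locate the configuration files to change and double-click them to start editing.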

core-site.xml

<configuration>
    <!-- File system type of the cluster: the distributed file system -->
	<property>
		<name>fs.default.name</name>
		<value>hdfs://192.168.186.133:8020</value>
	</property>
    <!-- Directory for temporary files -->
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/export/servers/hadoop-2.7.5/hadoopDatas/tempDatas</value>
	</property>
    <!-- Buffer size; in practice, tune it to the server's performance -->
	<property>
		<name>io.file.buffer.size</name>
		<value>4096</value>
	</property>
    <!-- Enable the Hadoop trash: deleted data can be recovered from the trash. Unit: minutes (10080 = 7 days) -->
	<property>
		<name>fs.trash.interval</name>
		<value>10080</value>
	</property>
</configuration>

hdfs-site.xml

<configuration>
	<property>
		<name>dfs.namenode.secondary.http-address</name>
		<value>vmtwo:50090</value>
	</property>
	<!-- Address and port of the NameNode web UI -->
	<property>
		<name>dfs.namenode.http-address</name>
		<value>vmtwo:50070</value>
	</property>
	<!-- Where the NameNode stores its metadata -->
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>file:///export/servers/hadoop-2.7.5/hadoopDatas/namenodeDatas,file:///export/servers/hadoop-2.7.5/hadoopDatas/namenodeDatas2</value>
	</property>
	<!-- Where the DataNode stores its data; in practice, decide on the disk mount points first, then list multiple directories separated by commas -->
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>file:///export/servers/hadoop-2.7.5/hadoopDatas/datanodeDatas,file:///export/servers/hadoop-2.7.5/hadoopDatas/datanodeDatas2</value>
	</property>
	<!-- Directory for the NameNode edit log files -->
	<property>
		<name>dfs.namenode.edits.dir</name>
		<value>file:///export/servers/hadoop-2.7.5/hadoopDatas/nn/edits</value>
	</property>
	
	<property>
		<name>dfs.namenode.checkpoint.dir</name>
		<value>file:///export/servers/hadoop-2.7.5/hadoopDatas/snn/name</value>
	</property>
	<property>
		<name>dfs.namenode.checkpoint.edits.dir</name>
		<value>file:///export/servers/hadoop-2.7.5/hadoopDatas/dfs/snn/edits</value>
	</property>
	<!-- Number of replicas of each block -->
	<property>
		<name>dfs.replication</name>
		<value>3</value>
	</property>
	<!-- Whether HDFS permission checking is enabled (false turns it off) -->
	<property>
		<name>dfs.permissions</name>
		<value>false</value>
	</property>
	<!-- Size of one block: 128 MB -->
	<property>
		<name>dfs.blocksize</name>
		<value>134217728</value>
	</property>
</configuration>

hadoop-env.sh

The JDK path in this file needs to be changed.

JDK 1.7 was installed above to recompile Hadoop; here the path is changed to JDK 1.8, so install that as well.

export JAVA_HOME=/export/servers/jdk1.8.0_231

mapred-site.xml

Rename mapred-site.xml.template to mapred-site.xml, then edit it:

<configuration>
	<!-- Enable uber mode for small MapReduce jobs -->
	<property>
		<name>mapreduce.job.ubertask.enable</name>
		<value>true</value>
	</property>
	<!-- Host and port of the job history server -->
	<property>
		<name>mapreduce.jobhistory.address</name>
		<value>vmtwo:10020</value>
	</property>
	<!-- Host and port of the job history web UI -->
	<property>
		<name>mapreduce.jobhistory.webapp.address</name>
		<value>vmtwo:19888</value>
	</property>
</configuration>

yarn-site.xml

<configuration>

	<!-- Location of the YARN master node -->
	<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>vmtwo</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
	<!-- Enable log aggregation -->
	<property>
		<name>yarn.log-aggregation-enable</name>
		<value>true</value>
	</property>
	<!-- How long aggregated logs are kept on HDFS, in seconds (604800 = 7 days) -->
	<property>
		<name>yarn.log-aggregation.retain-seconds</name>
		<value>604800</value>
	</property>
	<!-- Memory allocation settings for the YARN cluster -->
	<property>
		<name>yarn.nodemanager.resource.memory-mb</name>
		<value>20480</value>
	</property>
	<property>
		<name>yarn.scheduler.minimum-allocation-mb</name>
		<value>2048</value>
	</property>
	<property>
		<name>yarn.nodemanager.vmem-pmem-ratio</name>
		<value>2.1</value>
	</property>
</configuration>

mapred-env.sh

Configure the JDK 1.8 path

export JAVA_HOME=/export/servers/jdk1.8.0_231

slaves

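List the worker hostnames here. This single-node setup lists only vmtwo; a three-node cluster would list all three hostnames.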

vmtwo

14. Create the directories

These are the directories referenced in the configuration above; create them now.

In a cluster, create the directories below on every machine.

mkdir -p /export/servers/hadoop-2.7.5/hadoopDatas/tempDatas
mkdir -p /export/servers/hadoop-2.7.5/hadoopDatas/namenodeDatas
mkdir -p /export/servers/hadoop-2.7.5/hadoopDatas/namenodeDatas2
mkdir -p /export/servers/hadoop-2.7.5/hadoopDatas/datanodeDatas
mkdir -p /export/servers/hadoop-2.7.5/hadoopDatas/datanodeDatas2
mkdir -p /export/servers/hadoop-2.7.5/hadoopDatas/nn/edits
mkdir -p /export/servers/hadoop-2.7.5/hadoopDatas/snn/name
mkdir -p /export/servers/hadoop-2.7.5/hadoopDatas/dfs/snn/edits

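15. Configure the Hadoop environment variables

In a cluster, configure the Hadoop environment variables on every machine.

Run the following commands: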

vim /etc/profile

export HADOOP_HOME=/export/servers/hadoop-2.7.5
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Reload the profile so that the settings take effect:

source /etc/profile

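16. Start Hadoop (cluster startup is not demonstrated)

Starting Hadoop means starting two modules, HDFS and YARN.

Note: the first time HDFS is started it must be formatted. This is essentially cleanup and preparation work, because at that point HDFS does not yet physically exist.

Run the following commands: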

cd /export/servers/hadoop-2.7.5/
# Run this only on the first start (see the note above); never run it again, or the files will be lost
bin/hdfs namenode -format  
sbin/start-dfs.sh
sbin/start-yarn.sh
sbin/mr-jobhistory-daemon.sh start historyserver

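Because this is a single-node setup, the master node and the worker node are the same machine.

Three web pages to check:

http://vmtwo:50070/explorer.html#/ browse HDFS

http://vmtwo:8088/cluster view the YARN cluster

http://vmtwo:19888/jobhistory view completed jobs

Hadoop is now up and running.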

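IV. HDFS command-line operations

1. Basic commands

Command	Purpose	Format	Example
ls	List files	hdfs dfs -ls URI	hdfs dfs -ls /
lsr	Recursively list all files under a directory		hdfs dfs -lsr / (old form); hdfs dfs -ls -R / (new form)
mkdir	Create directories, taking the URIs in PATHS as arguments; with -p, parent directories are created as needed	hdfs dfs -mkdir [-p] PATHS	hdfs dfs -mkdir /test
put	Copy a single source file or several source files from the local file system to the target file system; can also read from standard input and write to the target file system	hdfs dfs -put LOCALSRC ... DST	hdfs dfs -put /local/a.txt /test
moveFromLocal	Move to HDFS: like put, but the source file is deleted afterwards		hdfs dfs -moveFromLocal /local/a.txt /test
get	Copy a file from HDFS to the local file system		hdfs dfs -get /a.txt /export/servers
mv	Move a file within HDFS (HDFS to HDFS only)		hdfs dfs -mv /dir1/a.txt /dir2
rm	Delete the specified files; several may be given. Deleting a directory requires -r. With -skipTrash the trash is bypassed and the data is deleted immediately	hdfs dfs -rm [-r] [-skipTrash] URI [URI...]	hdfs dfs -rm /dir/a.txt
cp	Copy files to the target path. -f overwrites the target if it exists; -p preserves attributes (timestamps, ownership, permissions, and so on)	hdfs dfs -cp URI [URI...] DEST	hdfs dfs -cp /dir1/a.txt /dir2
cat	Print the contents of the given file to stdout		hdfs dfs -cat /dir1/a.txt
chmod	Change file permissions		hdfs dfs -chmod -R 777 /dir/a.txt
chown	Change the owner and group of a file; with -R, apply recursively to a directory	hdfs dfs -chown [-R] URI [URI...]	hdfs dfs -chown -R username:groupname /dir/a.txt
appendToFile	Append one or more files to a file on HDFS; can also read from standard input. The source paths are local paths and the final path is the HDFS target file	hdfs dfs -appendToFile LOCALSRC ... DST	hdfs dfs -appendToFile a.txt b.txt /big.txt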

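2. Advanced commands

(1) HDFS quotas

When many people share HDFS without quota management, it is easy for someone to use up all the space and block everyone else. HDFS quotas apply to directories rather than to accounts, so a common approach is to let each account work in its own directory and set a quota on that directory.

HDFS quotas limit either the number of files or the total size of the data uploaded under a directory, much like a cloud drive such as Baidu Netdisk caps how much each user may upload.

hdfs dfs -count -q -h /user/root/dir  # show the quota information

Name quota

hdfs dfs -mkdir -p /user/root/dir  # create the directory
hdfs dfsadmin -setQuota 2 /user/root/dir # set a name quota of 2; only one file can be uploaded, because the directory itself counts toward the quota

hdfs dfsadmin -clrQuota /user/root/dir # clear the name quota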

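Space quota

A space quota must be at least block size × 3, that is, one full block for each of the three replicas.

hdfs dfsadmin -setSpaceQuota 384M /user/root/dir # limit the space to 384M (size of uploaded data, replicas included)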
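
The 384M figure follows from the values configured in hdfs-site.xml earlier. The sketch below only redoes that arithmetic; `SpaceQuota` and `minimumQuotaBytes` are hypothetical names, and the assumption is that HDFS charges one full block per replica when it allocates a block.

```java
// Sketch: the smallest useful space quota is block size multiplied by the replication factor.
final class SpaceQuota {
    static long minimumQuotaBytes(long blockSizeBytes, int replication) {
        return blockSizeBytes * replication;
    }

    public static void main(String[] args) {
        long blockSize = 134217728L; // dfs.blocksize from hdfs-site.xml (128 MB)
        int replication = 3;         // dfs.replication from hdfs-site.xml
        long quota = minimumQuotaBytes(blockSize, replication);
        System.out.println(quota / (1024 * 1024) + "M"); // prints 384M
    }
}
```

With a smaller quota, even a 2 MB file would presumably be rejected, because the check is made against whole blocks rather than the bytes actually written.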

Command to generate a file of arbitrary size:

dd if=/dev/zero of=1.txt bs=1M count=2 # creates a 2 MB file

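(2) HDFS safe mode

Safe mode is a protection mechanism in Hadoop: when the cluster starts, HDFS enters safe mode automatically and checks the integrity of the data blocks.

As the DataNodes report their blocks, the NameNode tracks what fraction of blocks has reached the minimum replication, and HDFS stays in safe mode until that fraction is at least 0.999, the default threshold. Separately, suppose the replication factor is 3 but a block has only 2 replicas, a ratio of 2/3 = 0.666: the system automatically copies the block to another DataNode to restore the missing replica.

In safe mode the file system accepts only read requests and rejects changes such as deletes and modifications. Once the whole system meets the safety criterion, HDFS leaves safe mode automatically.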
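
As a rough illustration of the exit criterion, the sketch below computes the fraction of blocks that have enough replicas and compares it with the threshold. It is a simplified model, not NameNode code: the class, method, and parameter names are hypothetical, and the real check has additional conditions.

```java
// Simplified model of the safe-mode exit check; names are hypothetical, not NameNode internals.
final class SafeModeCheck {
    // replicasPerBlock[i] is the number of replicas reported so far for block i.
    static boolean canLeaveSafeMode(int[] replicasPerBlock, int minReplication, double threshold) {
        if (replicasPerBlock.length == 0) {
            return true; // an empty file system has nothing to wait for
        }
        int safeBlocks = 0;
        for (int replicas : replicasPerBlock) {
            if (replicas >= minReplication) {
                safeBlocks++;
            }
        }
        return (double) safeBlocks / replicasPerBlock.length >= threshold;
    }
}
```

For example, with a minimum replication of 1 and a threshold of 0.999, a block with 2 of its 3 replicas still counts as safe, whereas a block with no reported replica keeps a three-block file system in safe mode.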

Safe mode commands

hdfs dfsadmin -safemode get # show the safe mode status
hdfs dfsadmin -safemode enter # enter safe mode
hdfs dfsadmin -safemode leave # leave safe mode

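(3) HDFS benchmarking

In a real production environment, the first step after setting up Hadoop is a stress test: measure the cluster's read and write speed, check whether the network bandwidth is sufficient, and run other baseline tests.

Test write speed

Write 10 files of 10 MB each to HDFS; the files are stored automatically under /benchmarks/TestDFSIO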

hadoop jar hadoop2.7.5/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.5.jar TestDFSIO -write -nrFiles 10 -fileSize 10MB

A log file is generated in the current directory; it is the test result report

cat TestDFSIO_results.log

Test read speed

Read 10 files of 10 MB each from HDFS

hadoop jar hadoop2.7.5/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.5.jar TestDFSIO -read -nrFiles 10 -fileSize 10MB

View the results (the same log file as above, updated with the new results)

Clean up the test data

hadoop jar hadoop2.7.5/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.5.jar TestDFSIO -clean

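V. HDFS API operations

1. Set up a Hadoop environment on Windows

A Hadoop environment must be configured on Windows; otherwise running the code directly fails.

Copy the Windows build of hadoop-2.7.5 to a path that contains no Chinese characters and no spaces
Configure the Hadoop environment variables
Copy hadoop.dll from the bin directory of hadoop-2.7.5 to C:\windows\system32 on the system drive
Restart Windows

If you cannot find a Windows build of hadoop-2.7.5, proceed as follows.

First download the official Hadoop 2.7.5 release: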

https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/

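Then download hadooponwindows-master.zip

https://pan.baidu.com/s/1vxtBxJyu7HNmOhsdjLZkYw extraction code: y9a4

After extracting hadoop-2.7.5.tar.gz, replace the bin and etc directories of hadoop-2.7.5 with those from hadooponwindows-master

2. Add the Maven coordinates

		<!-- Hadoop dependencies: start -->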
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.5</version>
        </dependency>
        <!-- Hadoop dependencies: end -->

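3. Access via URL

Download an HDFS file to the local machine (this approach is rarely used)

    @Test
    public void downLoad() throws IOException {
        // Register the hdfs:// URL handler
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
        // Open an input stream on the HDFS file
        InputStream inputStream = new URL("hdfs://192.168.186.133:8020/tmp/a.txt").openStream();
        // Open an output stream on the local file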
        FileOutputStream fileOutputStream=new FileOutputStream(new File("D:\\hello.txt"));
        IOUtils.copy(inputStream,fileOutputStream);
        IOUtils.closeQuietly(inputStream);
        IOUtils.closeQuietly(fileOutputStream);
    }

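4. Access data through the FileSystem API (the approach to master)

Classes involved:

Configuration: an object of this class encapsulates the client or server configuration

FileSystem: an object of this class represents a file system, and its methods operate on files; obtain one through the static method FileSystem.get

FileSystem fs=FileSystem.get(conf)

Ways to obtain a FileSystem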

First approach

    public void getFileSys() throws IOException{
        // Create the Configuration object
        Configuration configuration = new Configuration();
        // Set the file system type
        configuration.set("fs.defaultFS","hdfs://192.168.186.133:8020");
        // Obtain the specified file system
        FileSystem fileSystem = FileSystem.get(configuration);
        // Print it
        System.out.println(fileSystem);
    }

Second approach

    public void getFileSys() throws IOException, URISyntaxException {
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.186.133:8020"),new Configuration());
        // Print it
        System.out.println(fileSystem);
    }

Third approach

 public void getFileSys() throws IOException{
        // Create the Configuration object
        Configuration configuration = new Configuration();
        // Set the file system type
        configuration.set("fs.defaultFS","hdfs://192.168.186.133:8020");
        // Obtain the specified file system
        FileSystem fileSystem = FileSystem.newInstance(configuration);
        // Print it
        System.out.println(fileSystem.toString());
    }

Fourth approach

 public void getFileSys() throws IOException, URISyntaxException {
        FileSystem fileSystem = FileSystem.newInstance(new URI("hdfs://192.168.186.133:8020"),new Configuration());
        // Print it
        System.out.println(fileSystem);
    }

5. List all files

public void getFileSys() throws IOException, URISyntaxException {
        // Obtain a FileSystem instance
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.186.133:8020"),new Configuration());
        // Call listFiles to get information about every file under /
        RemoteIterator<LocatedFileStatus> locatedFileStatusRemoteIterator
                = fileSystem.listFiles(new Path("/"), true);// true = list recursively
        while (locatedFileStatusRemoteIterator.hasNext()){
            LocatedFileStatus fileStatus = locatedFileStatusRemoteIterator.next();
            // Print the path and the file name
            System.out.println(fileStatus.getPath()+"----"+fileStatus.getPath().getName());
            // Block information for the file
            BlockLocation[] blockLocations = fileStatus.getBlockLocations();
            System.out.println("block count: "+blockLocations.length);

        }
        fileSystem.close();
    }

hdfs://192.168.186.133:8020/tmp/a.txt----a.txt
block count: 1

6. Create a directory on HDFS

public void create() throws Exception{
        // Obtain a FileSystem instance
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.186.133:8020"),new Configuration());
        boolean mkdirs = fileSystem.mkdirs(new Path("/dir/dir123"));// mkdirs creates parent directories as needed
        fileSystem.close();
    }

7. Downloading and uploading HDFS files

Download a file from HDFS to the local machine

    public void create() throws Exception{
        // Obtain a FileSystem instance
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.186.133:8020"),new Configuration());
       fileSystem.copyToLocalFile(new Path("/tmp/a.txt"),new Path("D:\\test.txt"));
       fileSystem.close();
    }

Upload a file from the local machine to HDFS

 public void create() throws Exception{
        // Obtain a FileSystem instance
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.186.133:8020"),new Configuration());
       fileSystem.copyFromLocalFile(new Path("file:///D:\\工作\\学习资料\\我的账号.txt"),new Path("/myAccount.txt"));
       fileSystem.close();
    }

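8. HDFS permission control

The configuration files were covered in section III, step 13, where dfs.permissions determines whether HDFS permission checking is enabled. When it is true, operations by a user without permission are rejected.

One point worth noting: when Java code operates on files without specifying a user, that user does not match the file's Owner, so the operation fails even though the Permission bits would allow the Owner to act. The user can therefore be specified in Java: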

FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.186.133:8020"),new Configuration(),"root");

9. Merging small files

When uploading from Java, merge all the small files into one large file and upload that

public void create() throws Exception{
        // Obtain a FileSystem instance
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.186.133:8020"),new Configuration(),"root");
        // Open an output stream for the large file on HDFS
        FSDataOutputStream outputStream = fileSystem.create(new Path("big_txt.txt"));
        // Obtain the local file system
        LocalFileSystem local = FileSystem.getLocal(new Configuration());
        FileStatus[] fileStatuses = local.listStatus(new Path("D:\\input"));
        // For each file, open an input stream and append its contents to the output stream of the new HDFS file
        for(FileStatus fileStatus:fileStatuses){
            FSDataInputStream open = local.open(fileStatus.getPath());
            // Copy the small file's data into the large file
            IOUtils.copy(open,outputStream);
            IOUtils.closeQuietly(open);
        }
        IOUtils.closeQuietly(outputStream);
        local.close();
        fileSystem.close();
    }

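VI. MapReduce example

What is MapReduce? A job reads its input as K1/V1 pairs, the map phase turns them into K2/V2 pairs, and the reduce phase aggregates those into K3/V3 pairs, as the word count below shows.

1. WordCount: preparation

Requirement: count the total number of occurrences of every word in a set of text files

Prepare the data

(1) Create a new file

cd /export/servers
vim wordcount.txt

(2) Put the following content into it and save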

hello,world,hadoop
hive,sqoop,flume,hello
kitty,tom,jerry,world
hadoop

(3) Upload it to HDFS

hdfs dfs -mkdir /wordcount/
hdfs dfs -put wordcount.txt /wordcount/

2. Mapper

package com.ftx.zkp.java_zookeeper.test;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WordCountMapper extends Mapper<LongWritable, Text,Text,LongWritable> {
    // The map method converts K1, V1 into K2, V2

    /**
     * Parameters:
     * key: K1, the byte offset of the line
     * value: V1, the text of the line
     * context: the context object
     */
    /**
     * K1          V1
     * 0    hello,world,hadoop
     * 19   hdfs,hive,hello
     * ------ converted to ------
     * K2          V2
     * hello       1
     * world       1
     * hadoop      1
     * hdfs        1
     * hive        1
     * hello       1
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Text text=new Text();
        LongWritable longWritable=new LongWritable();
        // Split the line into words
        String[] words = value.toString().split(",");
        // Iterate over the words and build K2 and V2
        for(String word:words){
            // Write K2 and V2 to the context
            text.set(word);
            longWritable.set(1);
            context.write(text,longWritable);
        }
    }
}

3. Reducer

package com.ftx.zkp.java_zookeeper.test;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WordCountReducer extends Reducer<Text, LongWritable,Text,LongWritable> {
    // The reduce method converts the new K2, V2 into K3, V3

    /**
     * Parameters:
     * key: the new K2
     * values: the new V2
     * context: the context object
     */
    /**
     * new K2    new V2
     * hello    <1,1,1>
     * world    <1,1>
     * hadoop   <1>
     * ----- converted to -----
     * K3       V3
     * hello    3
     * world    2
     * hadoop   1
     */
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long count=0;
        // Iterate over the collection and add up the numbers
        for(LongWritable value:values){
            count+=value.get();
        }
        // Write K3 and V3 to the context
        context.write(key,new LongWritable(count));
    }
}

4. The JobMain main class

package com.ftx.zkp.java_zookeeper.test;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class JobMain extends Configured implements Tool {
    // This method defines one job
    @Override
    public int run(String[] strings) throws Exception {
        // Create a job object
        Job job = Job.getInstance(super.getConf(), "wordcount");
        // Needed when running from a packaged JAR; without it the job may fail
        job.setJarByClass(JobMain.class);
        // Configure the job (steps 1 to 9 below)
        // 1. Specify how and from where the input is read
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job,new Path("hdfs://192.168.186.133:8020/wordcount"));
        // 2. Specify the class that handles the map phase
        job.setMapperClass(WordCountMapper.class);
        // 3. Set the type of K2 for the map phase
        job.setMapOutputKeyClass(Text.class);
        // 4. Set the type of V2 for the map phase
        job.setMapOutputValueClass(LongWritable.class);
        // 5. Specify the class that handles the reduce phase
        job.setReducerClass(WordCountReducer.class);
        // 6. Set the type of K3
        job.setOutputKeyClass(Text.class);
        // 7. Set the type of V3
        job.setOutputValueClass(LongWritable.class);
        // 8. Set the output format
        job.setOutputFormatClass(TextOutputFormat.class);
        // 9. Set the output path
        TextOutputFormat.setOutputPath(job,new Path("hdfs://192.168.186.133:8020/wordcount_out"));
        // Wait for the job to finish
        boolean b = job.waitForCompletion(true);
        return b?0:1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration=new Configuration();
        // Launch the job
        int run = ToolRunner.run(configuration, new JobMain(), args);
        System.exit(run);
    }
}
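
To see what the three classes above compute, the data flow can be traced in plain Java without a cluster. This is a hedged sketch: `LocalWordCount` is a hypothetical class that imitates the map, shuffle, and reduce steps with ordinary collections and uses no Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java trace of the word count data flow; no Hadoop classes are involved.
final class LocalWordCount {
    static Map<String, Long> count(List<String> lines) {
        // Map phase: every line (V1) is split into (word, 1) pairs (K2, V2).
        // Shuffle: the pairs are grouped by word; a TreeMap keeps the keys sorted.
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(",")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1L);
            }
        }
        // Reduce phase: the list of ones for each word is summed into (K3, V3).
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : grouped.entrySet()) {
            long sum = 0;
            for (long one : entry.getValue()) {
                sum += one;
            }
            result.put(entry.getKey(), sum);
        }
        return result;
    }
}
```

Feeding it the four lines of wordcount.txt gives hello, world, and hadoop a count of 2 each and every other word a count of 1, which is what the job should write to /wordcount_out.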

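5. MapReduce run modes

Cluster mode

The MapReduce program is submitted to the YARN cluster and distributed to many nodes for concurrent execution
The input data and the output should reside on HDFS
Steps to submit to the cluster:
- Package the Spring Boot program written above as a JAR
- Upload it
- Start it on the cluster with the hadoop command
- When the job finishes, the result files appear on HDFS at the output path specified in the code

hadoop jar hadoop-1.0.jar com.ftx.zkp.java_zookeeper.test.JobMain

Local mode

Mainly for testing: the MapReduce program runs locally as a single process, and both the input data and the output live on the local file system.

Simply replace the input and output paths above with local paths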

TextInputFormat.addInputPath(job,new Path("file:///D:\\suibian\\mapreduce"));
TextOutputFormat.setOutputPath(job,new Path("file:///D:\\suibian\\qqqqqq"));

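Whether the job runs locally or on the cluster, it fails if the output directory already exists.

Solution: check whether the output directory exists and, if it does, delete it before the job starts.

        Path path = new Path("hdfs://192.168.186.133:8020/wordcount_out");
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.186.133:8020"), new Configuration());
        // Check whether the output directory exists
        boolean exists = fileSystem.exists(path);
        if(exists){
            // Delete the output directory
            fileSystem.delete(path,true);// true = delete recursively
        }
        TextOutputFormat.setOutputPath(job,path);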
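
For runs in local mode, the same cleanup can be done with the JDK alone. The sketch below is an assumption-laden illustration rather than part of the job: `LocalOutputCleaner` is a hypothetical helper that recursively removes a local directory if it exists.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for local runs: remove an existing output directory before the job starts.
final class LocalOutputCleaner {
    // Returns true if the directory existed and was deleted, false if there was nothing to delete.
    static boolean deleteIfExists(Path outputDir) {
        if (!Files.exists(outputDir)) {
            return false;
        }
        try (Stream<Path> paths = Files.walk(outputDir)) {
            // Reverse order puts children before their parents, so directories are empty when deleted.
            List<Path> ordered = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            for (Path p : ordered) {
                Files.delete(p);
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling it with the local output path before setting the output path on the job would avoid the "output directory already exists" failure on repeated test runs.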

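VII. MapReduce partitioning

In MapReduce, specifying partitions sends all the data of one partition to the same reducer, so that records of the same kind, which share some property, are processed together.

Partitioning steps

Lottery data serves as the example: a text file stores one winning record per line in the format below. Fields are separated by spaces. It is only an illustration, so please do not nitpick the sample code that follows.

2020-10-01 12:02:56 大乐斗 16 单 148750234

2020-10-01 11:02:20 大乐斗 12 双 148750234

2020-10-01 10:02:38 大乐斗 25 开 148750234

...

K1 and V1 in the Mapper: note that K1 is normally the byte offset of the line within the file and V1 is the text of the line

Define the Mapper class

This Mapper contains no logic and does not change the key-value data; it only receives records and passes them on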

package com.ftx.zkp.java_zookeeper.partition;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
 * K1: line offset
 * V1: line text (must contain the value to partition by)
 *
 * K2: line text
 * V2: NullWritable
 */
public class PartitionMapper extends Mapper<LongWritable, Text,Text, NullWritable> {
    // The map method converts K1, V1 into K2, V2
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Use V1 as K2 and set V2 to null
        context.write(value,NullWritable.get());
    }
}

Define the Partitioner class

The main logic lives here: the Partitioner routes records to different reducers.

package com.ftx.zkp.java_zookeeper.partition;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class MyPartitioner extends Partitioner<Text, NullWritable> {
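    /**
     * 1. Define the partitioning rule
     * 2. Return the corresponding partition number
     */
    @Override
    public int getPartition(Text text, NullWritable nullWritable, int i) {
        //Split the line on whitespace (the sample data is space-separated) and read the winning-number field
        String[] strings = text.toString().split("\\s+");
        String numStr=strings[3];//index 3: date, time, game name, winning number, ...
        //Split at 15: numbers greater than 15 go to partition 1, all others (15 or less) to partition 0
        if(Integer.parseInt(numStr)>15){
            return 1;
        }else {
            return 0;
        }
    }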
}

Define the Reducer logic

package com.ftx.zkp.java_zookeeper.partition;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
 * K2：Text
 * V2：NullWritable
 *
 * K3：Text
 * V3：NullWritable
 */
public class PartitionerReducer extends Reducer<Text, NullWritable,Text,NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        context.write(key,NullWritable.get());
    }
}

Set the partitioner class and the number of ReduceTasks in the main class

package com.ftx.zkp.java_zookeeper.partition;
import com.ftx.zkp.java_zookeeper.test.WordCountMapper;
import com.ftx.zkp.java_zookeeper.test.WordCountReducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import java.net.URI;
public class PartitionJobMain extends Configured implements Tool {
    //该方法指定一个job任务
    @Override
    public int run(String[] strings) throws Exception {
        //创建一个job任务对象
        Job job = Job.getInstance(super.getConf(), "wordcount");
        //如果打包运行出错，则需要增加该配置
        job.setJarByClass(PartitionJobMain.class);
        //配置job任务对象（8个步骤）
        //1，指定文件的读取方式和读取路径
        job.setInputFormatClass(TextInputFormat.class);
        Path path = new Path("file:///D:\\suibian\\mapreduce");
        //Never delete the input directory; the exists/delete check applies only to the output directory
        TextInputFormat.addInputPath(job,path);
        //2. Set the Mapper class for the map phase
        job.setMapperClass(PartitionMapper.class);
        //3，设置map阶段K2的类型
        job.setMapOutputKeyClass(Text.class);
        //4，设置map阶段V2的类型
        job.setMapOutputValueClass(NullWritable.class);

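        //######### partitioning settings: start ############
        //Set the partitioner class
        job.setPartitionerClass(MyPartitioner.class);
        //Set the number of ReduceTasks; splitting at 15 gives 2 partitions, so the count is 2
        job.setNumReduceTasks(2);
        //######### partitioning settings: end ############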

        //5. Set the Reducer class for the reduce phase
        job.setReducerClass(PartitionerReducer.class);
        //6，设置K3的类型
        job.setOutputKeyClass(Text.class);
        //7，设置V3的类型
        job.setOutputValueClass(NullWritable.class);
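        //8. Set the output format
        job.setOutputFormatClass(TextOutputFormat.class);
        //Set the output path, deleting it first if it already exists
        Path outputPath = new Path("hdfs://192.168.186.133:8020/wordcount_out");
        FileSystem outputFileSystem = outputPath.getFileSystem(super.getConf());
        if(outputFileSystem.exists(outputPath)){
            outputFileSystem.delete(outputPath,true);// true = recursive delete
        }
        TextOutputFormat.setOutputPath(job,outputPath);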
        //等待任务结束
        boolean b = job.waitForCompletion(true);
        return b?0:1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration=new Configuration();
        //启动job任务
        int run = ToolRunner.run(configuration, new PartitionJobMain(), args);
        System.exit(run);
    }
}

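Finally, package the project as a jar, upload it, and run it. Two result files are generated on HDFS: one lists the records whose number is 15 or less, the other the records whose number is greater than 15.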
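The routing rule can be tried out without a cluster. The following is a minimal plain-Java sketch, not the Hadoop job itself: the class PartitionSketch and its methods are hypothetical names, the sample lines use ASCII placeholders for the game name, and it assumes the winning number is the fourth whitespace-separated field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the partitioning rule; no Hadoop classes are involved.
class PartitionSketch {
    // Returns 1 when the winning number (index 3) is greater than 15, otherwise 0.
    static int partitionFor(String line) {
        String[] fields = line.trim().split("\\s+");
        int number = Integer.parseInt(fields[3]);
        return number > 15 ? 1 : 0;
    }

    // Collects the lines of each partition, the way each partition ends up at one reducer.
    static List<List<String>> route(List<String> lines, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            partitions.add(new ArrayList<>());
        }
        for (String line : lines) {
            partitions.get(partitionFor(line)).add(line);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2020-10-01 12:02:56 game 16 odd 148750234",
                "2020-10-01 11:02:20 game 12 even 148750234",
                "2020-10-01 10:02:38 game 25 open 148750234");
        List<List<String>> partitions = route(lines, 2);
        System.out.println("partition 0: " + partitions.get(0));
        System.out.println("partition 1: " + partitions.get(1));
    }
}
```

Each partition's list corresponds to one of the two result files, which is why the number of ReduceTasks has to match the number of partition ids the rule can return.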

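8. Counters in MapReduce

Hadoop's built-in counters include MapReduce task counters, file system counters, FileInputFormat counters, FileOutputFormat counters, and job counters.

1. Custom counters

(1) First approach: obtain the counter from the context by group name and counter name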

public class PartitionMapper extends Mapper<LongWritable, Text,Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //Get the counter; arguments: counter group name, counter name (description)
        Counter counter = context.getCounter("MR_COUNTER", "partition_counter");
        //Each call to map increments the counter by 1
        counter.increment(1L);
        
        context.write(value,NullWritable.get());
    }
}

(2) Second approach

Define the counters with an enum, here to count how many keys the reduce side receives

public class PartitionerReducer extends Reducer<Text, NullWritable,Text,NullWritable> {
    public static enum Counter{
        MY_INPUT_RECORDS,MY_INPUT_BYTES
    }
    
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        //Use the enum to identify the counter; one increment per reduce call, i.e. per distinct input key
        context.getCounter(Counter.MY_INPUT_RECORDS).increment(1L);
        context.write(key,NullWritable.get());
    }
}
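How enum-identified counters behave can be sketched in plain Java. This is only an illustration under assumptions: CounterSketch, runTask, and aggregate are hypothetical names, and the per-task tallies merged into job totals stand in for what the framework does with real counters.

```java
import java.nio.charset.StandardCharsets;
import java.util.EnumMap;
import java.util.Map;

// Simulates enum-identified counters: each task tallies locally, totals are summed per enum constant.
class CounterSketch {
    enum MyCounter { MY_INPUT_RECORDS, MY_INPUT_BYTES }

    // One simulated task: count the keys it sees and their UTF-8 byte length.
    static Map<MyCounter, Long> runTask(String[] keys) {
        Map<MyCounter, Long> counters = new EnumMap<>(MyCounter.class);
        for (String key : keys) {
            counters.merge(MyCounter.MY_INPUT_RECORDS, 1L, Long::sum);
            long bytes = key.getBytes(StandardCharsets.UTF_8).length;
            counters.merge(MyCounter.MY_INPUT_BYTES, bytes, Long::sum);
        }
        return counters;
    }

    // Job-level view: add the values of two tasks counter by counter.
    static Map<MyCounter, Long> aggregate(Map<MyCounter, Long> a, Map<MyCounter, Long> b) {
        Map<MyCounter, Long> total = new EnumMap<>(MyCounter.class);
        total.putAll(a);
        b.forEach((name, value) -> total.merge(name, value, Long::sum));
        return total;
    }

    public static void main(String[] args) {
        Map<MyCounter, Long> total = aggregate(
                runTask(new String[]{"a", "bb"}),
                runTask(new String[]{"ccc"}));
        System.out.println(total);
    }
}
```

Using an enum rather than two free-form strings means a misspelled counter name is caught at compile time.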

9. Sorting and serialization in MapReduce

Suppose a file has the following content

a 1

b 4

a 2

c 5

a 9

b 6

b 0

。。。

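The goal is to sort this file by the first column in alphabetical order and, where the letters are equal, by the second column in ascending numeric order.

The expected result:

Before sorting	After sorting
a 1	a 1
b 4	a 2
a 2	a 9
c 5	b 0
a 9	b 4
b 6	b 6
b 0	c 5
...	...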

1. Custom key type with its comparison rule

public class SortBean implements WritableComparable<SortBean> {

    private String word;
    private Integer num;

    //实现比较器，指定排序规则
    /**
     *规则：第一列按照字典顺序排序，如果字母相同，第二列再按照大小排序
     */
    @Override
    public int compareTo(SortBean sortBean) {
        //先对第一列排序，如果第一列相同，再按照第二列排序
        int i = this.word.compareTo(sortBean.getWord());
        if(i==0){//相同
            return this.num.compareTo(sortBean.getNum());
        }
        return i;//大于0则前者比后者大，小于0则反之
    }

    //实现序列化
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(word);
        dataOutput.writeInt(num);
    }
    //实现反序列化
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.word = dataInput.readUTF();
        this.num=dataInput.readInt();
    }
     @Override
    public String toString() {
        return  word + "---" + num;
    }
	//get、set方法省略
}

2，Mapper

public class SortMapper extends Mapper<LongWritable, Text,SortBean, NullWritable> {
    //map方法将K1，V1转为K2，V2
    /**
     * K1                     V1
     * 0                      a  3
     * 10                     b  7
     * ================================
     *  K2                  V2
     *  sortBean(a,3)     NullWritable
     *  sortBean(b,7)     NullWritable
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] split = value.toString().split("\t");
        SortBean sortBean = new SortBean();
        sortBean.setWord(split[0]);
        sortBean.setNum(Integer.parseInt(split[1]));
        context.write(sortBean,NullWritable.get());
    }
}

3，Reducer

public class SortReducer extends Reducer<SortBean, NullWritable,SortBean,NullWritable> {
    //reduce方法把K2，V2转为K3，V3
    @Override
    protected void reduce(SortBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        context.write(key,NullWritable.get());
    }
}

4. Main class

public class JobMain extends Configured implements Tool {
    @Override
    public int run(String[] strings) throws Exception {
        //创建job对象
        Job job = Job.getInstance(super.getConf(), "mapreduce_sort");
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job,new Path("file:///D:\\suibian\\mapreduce\\input.txt"));
        //设置mapper类和数据类型
        job.setMapperClass(SortMapper.class);
        job.setMapOutputKeyClass(SortBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        //设置reducer类和数据类型
        job.setReducerClass(SortReducer.class);
        job.setOutputKeyClass(SortBean.class);
        job.setOutputValueClass(NullWritable.class);
        //设置输出类和输出路径
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job,new Path("file:///D:\\suibian\\mapreduce222"));
        //等待任务结束
        boolean b = job.waitForCompletion(true);
        return b?0:1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration=new Configuration();
        //启动job任务
        int run = ToolRunner.run(configuration, new JobMain(), args);
        System.exit(run);
    }
}

Run locally; the result is written to the file D:/suibian/mapreduce222/part-r-00000:

a---1
a---2
a---9
b---0
b---4
b---6
c---5
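The ordering rule itself can be checked outside Hadoop. The sketch below assumes a hypothetical Pair class that mirrors SortBean's compareTo (first column as a string, then second column as a number) and sorts the sample lines in memory; in the real job the shuffle does this sorting on K2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortSketch {
    // Mirrors the SortBean comparison rule: word first, then num.
    static class Pair implements Comparable<Pair> {
        final String word;
        final int num;

        Pair(String word, int num) {
            this.word = word;
            this.num = num;
        }

        @Override
        public int compareTo(Pair other) {
            int byWord = word.compareTo(other.word);
            return byWord != 0 ? byWord : Integer.compare(num, other.num);
        }

        @Override
        public String toString() {
            return word + "---" + num;
        }
    }

    // Parses "word num" lines, sorts them, and renders them like SortBean.toString().
    static List<String> sortLines(List<String> lines) {
        List<Pair> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            pairs.add(new Pair(fields[0], Integer.parseInt(fields[1])));
        }
        Collections.sort(pairs);
        List<String> out = new ArrayList<>();
        for (Pair pair : pairs) {
            out.add(pair.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a 1", "b 4", "a 2", "c 5", "a 9", "b 6", "b 0");
        for (String line : sortLines(input)) {
            System.out.println(line);
        }
    }
}
```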

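10. Combiner (local reduction)

Each map task can produce a large amount of local output. A Combiner merges the map-side output once before it is sent on, which reduces the volume of data transferred between map and reduce nodes and improves network IO performance. It is one of the standard MapReduce optimizations.

The combine step runs on the map side!

- The combiner is a component that sits between the mapper and the reducer in an MR program
- The parent class of a combiner is Reducer
- The difference between a combiner and a reducer is where they run:
  - the combiner runs on the node of each MapTask
  - the reducer receives the output of all mappers globally
- The point of the combiner is to pre-aggregate the output of each MapTask locally and so reduce network traffic

Implementation steps

- Write a custom combiner that extends Reducer and overrides the reduce method
- Register it on the job with job.setCombinerClass(MyCombiner.class)

Taking the wordCount example above, add a combiner: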

public class MyCombiner extends Reducer<Text, LongWritable,Text,LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long count=0;
        for(LongWritable longWritable:values){
            count+=longWritable.get();
        }
        context.write(key,new LongWritable(count));
    }
}

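Effect, by way of illustration: without a combiner the reduce side might receive 100 records from the map side, while with a combiner it might receive only 30. The final result is the same; only the amount of data transferred shrinks, which improves network IO performance.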
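That claim (fewer shuffled records, identical result) can be demonstrated with a small in-memory simulation. This is a sketch under assumptions, not Hadoop code: CombinerSketch and its three methods are hypothetical, and each input string plays the role of one map task's split.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    // Map step: emit one (word, 1) pair per word in the split.
    static List<Map.Entry<String, Long>> mapOutput(String split) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1L));
        }
        return pairs;
    }

    // Combine step: sum the pairs of ONE map task by key before they are shuffled.
    static List<Map.Entry<String, Long>> combine(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> sums = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }

    // Reduce step: sum everything received from all map tasks by key.
    static Map<String, Long> reduce(List<List<Map.Entry<String, Long>>> shuffled) {
        Map<String, Long> totals = new TreeMap<>();
        for (List<Map.Entry<String, Long>> fromOneMapTask : shuffled) {
            for (Map.Entry<String, Long> pair : fromOneMapTask) {
                totals.merge(pair.getKey(), pair.getValue(), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> task1 = mapOutput("a b a a");
        List<Map.Entry<String, Long>> task2 = mapOutput("b a");
        System.out.println("shuffled without combiner: " + (task1.size() + task2.size()));
        System.out.println("shuffled with combiner: " + (combine(task1).size() + combine(task2).size()));
        System.out.println("result: " + reduce(Arrays.asList(combine(task1), combine(task2))));
    }
}
```

The equality only holds because summing is associative and commutative; an operation such as averaging cannot be reused as a combiner in this direct way.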

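11. MapReduce comprehensive example: traffic statistics (sum), step by step

A file records the traffic consumed by mobile phones visiting websites, together with IP addresses and other details, as shown below.

Goal: for each phone number (numbers repeat in the list below), compute the sums of the four traffic fields. These were shown in bold in the original post and are the fields the Mapper below reads at indexes 6 to 9.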

13758425815 00-FD-07-A4-72-B8:CMCC 120.196.100.82 taobao.com 淘宝商城 38 27 2481 12345 200
17611117038 00-FD-07-A4-72-B8:CMCC 120.196.100.83 tianmao.com 天猫商城 2 24 27 4950 43567 200
13910567960 00-FD-07-A4-72-B8:CMCC 120.196.100.84 huya.com 虎牙直播 4 230 27 5232 48535 200
19110392563 00-FD-07-A4-72-B8:CMCC 120.196.100.00 douyu.com 斗鱼视频 10 27 256 456 200

。。。

1. Custom JavaBean that holds the values to be summed

public class FlowBean implements Writable {
    //序列化
    @Override
    public void write(DataOutput dataOutput) throws IOException {
    dataOutput.writeInt(upFlow);
    dataOutput.writeInt(downFlow);
    dataOutput.writeInt(upCountFlow);
    dataOutput.writeInt(downCountFlow);
    }
    //反序列化
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.upFlow=dataInput.readInt();
        this.downFlow=dataInput.readInt();
        this.upCountFlow=dataInput.readInt();
        this.downCountFlow=dataInput.readInt();
    }
    private Integer upFlow;//上行流量
    private Integer downFlow;//下行流量
    private Integer upCountFlow;//上行流量总和
    private Integer downCountFlow;//下行流量总和
    @Override
    public String toString() {
        return upFlow+"\t"+downFlow+"\t"+upCountFlow+"\t"+downCountFlow;
    }
   //get、set方法省略
}

2. Define the FlowMapper class

public class FlowMapper extends Mapper<LongWritable, Text,Text,FlowBean> {
    //map方法K1，V1转为K2，V2
    /**
     *    K1                    V1
     *    0        18892837485  。。。  98  2325    2345    234556
     *    K2                     V2
     * 18892837485       98  2325    2345    234556
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] split = value.toString().split("\t");
        FlowBean flowBean = new FlowBean();
        flowBean.setUpFlow(Integer.parseInt(split[6]));
        flowBean.setDownFlow(Integer.parseInt(split[7]));
        flowBean.setUpCountFlow(Integer.parseInt(split[8]));
        flowBean.setDownCountFlow(Integer.parseInt(split[9]));
         //将K2和V2写入上下文
        context.write(new Text(split[1]),flowBean);
    }
}

3. Define the FlowReducer class

public class FlowReducer extends Reducer<Text,FlowBean,Text,FlowBean> {
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
        //遍历集合，对集合中的对应四个字段累加
        Integer upFlow=0;
        Integer downFlow=0;
        Integer upCountFlow=0;
        Integer downCountFlow=0;
        for(FlowBean flowBean:values){
            upFlow+=flowBean.getUpFlow();
            downFlow+=flowBean.getDownFlow();
            upCountFlow+=flowBean.getUpCountFlow();
            downCountFlow+=flowBean.getDownCountFlow();
        }
        //创建对象，给对象赋值
        FlowBean flowBean = new FlowBean();
        flowBean.setUpFlow(upFlow);
        flowBean.setDownFlow(downFlow);
        flowBean.setUpCountFlow(upCountFlow);
        flowBean.setDownCountFlow(downCountFlow);
        //将K3和V3写入上下文
        context.write(key,flowBean);
    }
}

4. The program's main function

public class FlowJobMain extends Configured implements Tool {
    @Override
    public int run(String[] strings) throws Exception {
        //创建job对象
        Job job = Job.getInstance(super.getConf(), "mapreduce_sort");
        job.setInputFormatClass(TextInputFormat.class);
        Path path = new Path("file:///D:\\suibian\\input.txt");
        TextInputFormat.addInputPath(job,path);
        //设置mapper类和数据类型
        job.setMapperClass(FlowMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        //设置reducer类和数据类型
        job.setReducerClass(FlowReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        //设置输出类和输出路径
        job.setOutputFormatClass(TextOutputFormat.class);
        Path path1 = new Path("file:///D:\\suibian\\mapreduce222\\output");
        LocalFileSystem local = FileSystem.getLocal(new Configuration());
        if(local.exists(path1)){
            local.delete(path1,true);
        }
        TextOutputFormat.setOutputPath(job,path1);
        //等待任务结束
        boolean b = job.waitForCompletion(true);
        return b?0:1;
    }
    public static void main(String[] args) throws Exception {
        Configuration configuration=new Configuration();
        //启动job任务
        int run = ToolRunner.run(configuration, new FlowJobMain(), args);
        System.exit(run);
    }
}

The job produces a result file in which the records of each phone number are merged and their four fields summed.
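The group-and-sum that FlowMapper and FlowReducer perform together can be sketched in plain Java. FlowSumSketch is a hypothetical class, and the input rows are assumed to be already reduced to phone number plus the four traffic values; the numbers in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FlowSumSketch {
    // Each row is {phone, upFlow, downFlow, upCountFlow, downCountFlow}.
    // Returns the four sums per phone number, keyed in sorted order like reducer input.
    static Map<String, long[]> sumByPhone(List<String[]> rows) {
        Map<String, long[]> totals = new TreeMap<>();
        for (String[] row : rows) {
            long[] sums = totals.computeIfAbsent(row[0], phone -> new long[4]);
            for (int i = 0; i < 4; i++) {
                sums[i] += Long.parseLong(row[i + 1]);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[]{"13758425815", "38", "27", "2481", "12345"});
        rows.add(new String[]{"13758425815", "2", "3", "19", "55"});
        rows.add(new String[]{"13910567960", "4", "230", "5232", "48535"});
        for (Map.Entry<String, long[]> entry : sumByPhone(rows).entrySet()) {
            System.out.println(entry.getKey() + "\t" + Arrays.toString(entry.getValue()));
        }
    }
}
```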

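5. Sorting by traffic

The four steps above compute the statistics. This step sorts those results.

FlowBean above implements the Writable interface; here it has to implement WritableComparable instead.

Difference between Writable and WritableComparable:

Implementing Writable lets the class be serialized and deserialized.

Implementing WritableComparable additionally lets the class define a sort order. MapReduce sorts by key during the shuffle, so FlowBean must be used as the map output key (K2) for compareTo to take effect.

Only the implemented interface changes; everything else is the same as the JavaBean above.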

public class FlowBean implements WritableComparable<FlowBean> {

    //Sort rule: descending by upFlow
    @Override
    public int compareTo(FlowBean flowBean) {
        return flowBean.getUpFlow()-this.upFlow;
    }
 //Everything else is the same as the JavaBean above
}

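12. MapReduce example: reduce-side join

Requirement:

Suppose the data volume is huge and both tables are stored as files on HDFS. A MapReduce program is needed to perform the following SQL query:

select a.id, b.pname from t_order a left join t_product b on a.pid=b.id

Product table t_product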

id	pname	category_id	price
p0001	小米5	1000	2000
p0002	锤子T1	1000	3000

Order table t_order

id	date	pid	amount
1001	20150710	p0001	2
1002	20150710	p0002	3

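In a database this requirement is just a join query, but when the data is very large and stored as files on HDFS, MapReduce is the tool for the job.

Unlike the map-reduce flows above, where the overridden map method only handles lines from a single file, here map has to handle two files, t_product.txt and t_order.txt. The file name can be read from the context to tell them apart. Both branches should emit the product id as K2 so that an order and its product meet in the same reduce call. The rest is ordinary processing logic and is not written out here.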

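    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //Find out which file the current line comes from
        FileSplit fileSplit=(FileSplit) context.getInputSplit();
        String name = fileSplit.getPath().getName();
        if(name.equals("t_product.txt")){
            //product line: handling omitted in the original
        }else if(name.equals("t_order.txt")){
            //order line: handling omitted in the original
        }
        //... remaining logic omitted
    }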
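Since the original leaves the join logic out, here is one possible way it could work, shown as a plain-Java simulation rather than as the author's implementation. ReduceJoinSketch, the "P:" and "O:" tags, and the ASCII product names are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceJoinSketch {
    // "Map" step: key every line by product id and tag the value with its source table.
    // Product lines: id pname category_id price. Order lines: id date pid amount.
    static Map<String, List<String>> mapPhase(List<String> productLines, List<String> orderLines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : productLines) {
            String[] f = line.trim().split("\\s+");
            grouped.computeIfAbsent(f[0], k -> new ArrayList<>()).add("P:" + f[1]);
        }
        for (String line : orderLines) {
            String[] f = line.trim().split("\\s+");
            grouped.computeIfAbsent(f[2], k -> new ArrayList<>()).add("O:" + f[0]);
        }
        return grouped;
    }

    // "Reduce" step: per product id, pair each order id with the product name.
    // A missing product yields "null", which matches left-join semantics.
    static List<String> reducePhase(Map<String, List<String>> grouped) {
        List<String> out = new ArrayList<>();
        for (List<String> values : grouped.values()) {
            String pname = null;
            List<String> orderIds = new ArrayList<>();
            for (String value : values) {
                if (value.startsWith("P:")) {
                    pname = value.substring(2);
                } else {
                    orderIds.add(value.substring(2));
                }
            }
            for (String orderId : orderIds) {
                out.add(orderId + "\t" + pname);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> products = Arrays.asList("p0001 xiaomi5 1000 2000", "p0002 chuiziT1 1000 3000");
        List<String> orders = Arrays.asList("1001 20150710 p0001 2", "1002 20150710 p0002 3");
        for (String row : reducePhase(mapPhase(products, orders))) {
            System.out.println(row);
        }
    }
}
```

The key design point is that both tables are keyed by the join column, so the grouping done by the shuffle is what actually performs the join.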

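13. Custom InputFormat for merging small files

A custom InputFormat is used here to merge small files into a binary file. A binary file is not human-readable, but it is still worth producing because it can be fed into further MapReduce processing (repeated map, shuffle, reduce).

A map phase does not have to be followed by a reduce phase!

1. Custom MyRecordReader class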

public class MyRecordReader extends RecordReader<NullWritable, BytesWritable> {

    Configuration configuration = null;
    FileSplit fileSplit=null;
    private boolean processed=false;//标志文件是否读取完
    BytesWritable bytesWritable=null;
    FileSystem fileSystem =null;
    FSDataInputStream inputStream = null;
    //初始化方法
    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
    //获取configuration对象
         configuration = taskAttemptContext.getConfiguration();
         //获取文件的切片
        fileSplit=(FileSplit)inputSplit;

    }
    //该方法用于获取K1 和 V1
    /**
     *K1:NullWritable
     * V1:BytesWritable
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if(!processed){
            //Get the FileSystem of the source file
             fileSystem = FileSystem.get(configuration);
            //Open a byte input stream on the file
             inputStream = fileSystem.open(fileSplit.getPath());
            //Read the whole source file into a plain byte array (byte[])
            byte[] bytes=new byte[(int)fileSplit.getLength()];
            IOUtils.readFully(inputStream,bytes,0,(int)fileSplit.getLength());
            //Wrap the byte array in Hadoop's BytesWritable; this is V1
             bytesWritable=new BytesWritable();
            bytesWritable.set(bytes,0,(int)fileSplit.getLength());
            this.processed=true;
            return true;
        }else{
            return false;
        }
    }
    //是用来返回K1的
    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }
    //是用来返回V1的
    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return bytesWritable;
    }
    //获取文件读取的进度
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }
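    //Release resources
    @Override
    public void close() throws IOException {
        if(inputStream!=null){
            inputStream.close();
        }
        //FileSystem.get returns a shared cached instance, so it is deliberately not closed here
    }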
}

2. Custom MyInputFormat class

public class MyInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        //创建自定义RecordReader对象
        MyRecordReader myRecordReader = new MyRecordReader();
        //将inputSplit和TaskAttemptContext传给myRecordReader
        myRecordReader.initialize(inputSplit,taskAttemptContext);
        return myRecordReader;
    }
    //Whether an input file may be split
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;//no: each small file must be read whole
    }
}

3. Mapper class

public class SequenceFileMapper extends Mapper<NullWritable, BytesWritable, Text,BytesWritable> {
    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        //获取文件名字，作为K2
        FileSplit fileSplit=(FileSplit) context.getInputSplit();
        String name = fileSplit.getPath().getName();
        //K2和V2写入上下文
        context.write(new Text(name),value);
    }
}

4. JobMain main class

The SequenceFileOutputFormat class writes its output as a binary file!

public class FormatJobMain extends Configured implements Tool {
    @Override
    public int run(String[] strings) throws Exception {
        //创建job对象
        Job job = Job.getInstance(super.getConf(), "自定义任务名字");
        job.setInputFormatClass(MyInputFormat.class);
        Path path = new Path("file:///D:\\suibian\\mapreduce\\input.txt");
        MyInputFormat.addInputPath(job,path);
        //设置mapper类和数据类型
        job.setMapperClass(SequenceFileMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        //不用设置reducer类，但是一定要设置数据类型，和mapper一样，否则会报错
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        //设置输出类和输出路径
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        Path path1 = new Path("file:///D:\\suibian\\mapreduce222\\output");
        SequenceFileOutputFormat.setOutputPath(job,path1);
        //等待任务结束
        boolean b = job.waitForCompletion(true);
        return b?0:1;
    }
    public static void main(String[] args) throws Exception {
        Configuration configuration=new Configuration();
        //启动job任务
        int run = ToolRunner.run(configuration, new FormatJobMain(), args);
        System.exit(run);
    }
}

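14. Custom OutputFormat

Scenario: a file stores user review data from Taobao. One field (read at index 9 after splitting on tabs in the code below) holds 0, 1, or 2, meaning good, neutral, and bad review respectively. A custom OutputFormat is needed that writes good and neutral reviews to one location and bad reviews to another.

1. Custom MyOutputFormat class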

public class MyOutputFormat extends FileOutputFormat<Text, NullWritable> {
    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext taskAttemptContext) throws IOException {
        //获取目标文件的输出流（两个）
        FileSystem fileSystem = FileSystem.get(taskAttemptContext.getConfiguration());
        //指定输出文件
        FSDataOutputStream goodCommentsOutputStream = fileSystem.create(new Path("file:///D:\\suibian\\good_out.txt"));
        FSDataOutputStream badCommentsOutputStream = fileSystem.create(new Path("file:///D:\\suibian\\bad_out.txt"));
        MyRecordWritter myRecordWritter = new MyRecordWritter(goodCommentsOutputStream, badCommentsOutputStream);
        //将输出流传给MyRecordWritter
        return myRecordWritter;
    }
}

2. Custom MyRecordWritter class

public class MyRecordWritter extends RecordWriter<Text, NullWritable> {
    private FSDataOutputStream goodCommentsOutputStream;
    private FSDataOutputStream badCommentsOutputStream;
    public MyRecordWritter(){}

    public MyRecordWritter(FSDataOutputStream goodCommentsOutputStream, FSDataOutputStream badCommentsOutputStream) {
        this.goodCommentsOutputStream = goodCommentsOutputStream;
        this.badCommentsOutputStream = badCommentsOutputStream;
    }

    //行文本内容
    @Override
    public void write(Text text, NullWritable nullWritable) throws IOException, InterruptedException {
        //从行文本中获取评论值
        String[] split = text.toString().split("\t");
        String numStr=split[9];
        //根据字段值判断评论类型然后将对应的数据写入不同的文件夹文件中
        if(Integer.parseInt(numStr)<=1){//好评+中评
        goodCommentsOutputStream.write(text.toString().getBytes());
        goodCommentsOutputStream.write("\r\n".getBytes());
        }else{//差评
        badCommentsOutputStream.write(text.toString().getBytes());
        badCommentsOutputStream.write("\r\n".getBytes());
        }
    }

    @Override
    public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        goodCommentsOutputStream.close();
        badCommentsOutputStream.close();
    }
}

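That is all. Finally, set the custom OutputFormat as the output format class in the main class, for example job.setOutputFormatClass(MyOutputFormat.class).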

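15. Custom grouping: finding the Top N

Grouping is a reduce-side component of MapReduce. It decides which records form one group and therefore share one call to reduce. By default every distinct key is its own group and reduce is called once per key. With a custom grouping rule, different keys can be treated as one group and handled in a single reduce call.

Requirement:

Given the following order data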

order_001 p_01 222.8

order_001 p_05 25.8

order_002 p_03 522.8

order_002 p_04 122.4

order_002 p_05 722.4

order_003 p_01 222.8

。。。

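The task is to find the transaction with the largest amount in each order.

Steps:

Records of the same order must be handled by the same reducer (partitioning)
Within an order, records are sorted by price (sorting)
Records with the same order ID are placed in the same group (custom grouping)

1. Define the OrderBean class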

public class OrderBean implements WritableComparable<OrderBean> {
    //实现对象序列化
    @Override
    public void write(DataOutput dataOutput) throws IOException {
    dataOutput.writeUTF(orderId);
    dataOutput.writeDouble(price);
    }
    //实现对象反序列化
    @Override
    public void readFields(DataInput dataInput) throws IOException {
    this.orderId=dataInput.readUTF();
    this.price=dataInput.readDouble();
    }
    //Sort rule
    @Override
    public int compareTo(OrderBean orderBean) {
        //Compare order ids first; for equal ids, sort by price in descending order
        int i = this.orderId.compareTo(orderBean.getOrderId());
        if(i==0){
             i = (this.price.compareTo(orderBean.getPrice()))*(-1);
        }
        return i;
    }

    private String orderId;
    private Double price;

    @Override
    public String toString() {
        return orderId+" "+price;
    }
//set、get方法省略
}

2. Define the OrderPartition partitioner class

public class OrderPartition extends Partitioner<OrderBean, Text> {
    //分区规则：根据订单的id实现分区
    /**
     *
     * @param orderBean
     * @param text
     * @param i ReduceTask个数
     * @return  返回分区的编号
     */
    @Override
    public int getPartition(OrderBean orderBean, Text text, int i) {
        return (orderBean.getOrderId().hashCode() & 2147483647) % i;
    }
}
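The expression in getPartition deserves a closer look: 2147483647 is Integer.MAX_VALUE, and the bitwise AND clears the sign bit so the remainder can never be negative. The sketch below (HashPartitionSketch is a hypothetical name) isolates that arithmetic so it can be run on its own.

```java
class HashPartitionSketch {
    // Same arithmetic as OrderPartition.getPartition, with the hash passed in directly.
    static int partition(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        // Equal order ids always hash to the same partition.
        System.out.println(partition("order_001".hashCode(), reducers));
        System.out.println(partition("order_001".hashCode(), reducers));
        // A negative hash still gives a valid partition number.
        System.out.println(partition(-1, reducers));
        // Math.abs would not help here: Math.abs(Integer.MIN_VALUE) is still negative.
        System.out.println(Math.abs(Integer.MIN_VALUE) % reducers);
        System.out.println(partition(Integer.MIN_VALUE, reducers));
    }
}
```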

3. Define the OrderGroupComparator grouping class

/**
 * 1. Extend WritableComparator
 * 2. Call the parent constructor that takes arguments
 * 3. Define the grouping rule
 */
public class OrderGroupComparator extends WritableComparator {
    public OrderGroupComparator(){
        super(OrderBean.class,true);
    }
    //指定分组的规则（重写方法）
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        //对参数做强制类型转换
        OrderBean first=(OrderBean)a;
        OrderBean second=(OrderBean)b;
        //指定分组规则
        return first.getOrderId().compareTo(second.getOrderId());
    }
}

4. Define the mapper class

public class OrderMapper extends Mapper<LongWritable, Text,OrderBean,Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //拆分行文本数据，得到订单id，订单金额
        String[] split = value.toString().split("\t");
        OrderBean orderBean = new OrderBean();
        orderBean.setOrderId(split[0]);
        orderBean.setPrice(Double.valueOf(split[2]));
        context.write(orderBean,value);
    }
}

5. Define the Reducer class

public class GroupReducer extends Reducer<OrderBean, Text,Text, NullWritable> {
    @Override
    protected void reduce(OrderBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        int i=0;
        for(Text text:values){
            context.write(text,NullWritable.get());
            i++;
            if(i>=1){
                break;
            }
        }
    }
}

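6. Define the JobMain main class

The full code is not listed here. The only difference from the JobMain classes above is that the partitioner class and the grouping class must be set.

//Set the partitioner class
job.setPartitionerClass(OrderPartition.class);
//Set the grouping class
job.setGroupingComparatorClass(OrderGroupComparator.class);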
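How sorting and grouping combine to give the top record per order can be simulated in memory. This is a sketch with hypothetical names (TopPerOrderSketch, Order), not the Hadoop job: the sort mirrors OrderBean.compareTo, and the "new group when the order id changes" check mirrors OrderGroupComparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TopPerOrderSketch {
    static class Order {
        final String orderId;
        final String productId;
        final double price;

        Order(String orderId, String productId, double price) {
            this.orderId = orderId;
            this.productId = productId;
            this.price = price;
        }
    }

    // Returns the highest-priced line of each order, in order-id sequence.
    static List<String> topPerOrder(List<Order> orders) {
        List<Order> sorted = new ArrayList<>(orders);
        // Sort step: order id ascending, then price descending.
        sorted.sort((a, b) -> {
            int byId = a.orderId.compareTo(b.orderId);
            return byId != 0 ? byId : Double.compare(b.price, a.price);
        });
        List<String> result = new ArrayList<>();
        String currentGroup = null;
        for (Order order : sorted) {
            // Grouping step: a new order id starts a new group; its first record is the maximum.
            if (!order.orderId.equals(currentGroup)) {
                currentGroup = order.orderId;
                result.add(order.orderId + " " + order.productId + " " + order.price);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
                new Order("order_001", "p_01", 222.8),
                new Order("order_001", "p_05", 25.8),
                new Order("order_002", "p_03", 522.8),
                new Order("order_002", "p_04", 122.4),
                new Order("order_002", "p_05", 722.4),
                new Order("order_003", "p_01", 222.8));
        for (String line : topPerOrder(orders)) {
            System.out.println(line);
        }
    }
}
```

To take the top N instead of the top 1, the loop would keep a per-group counter and stop adding once it reaches N, which is what the counter variable in GroupReducer does.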

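16. Hive data warehouse

The purpose of a data warehouse is to build an integrated, analysis-oriented data environment that supports business decision-making.

A data warehouse stores data. An enterprise loads all kinds of data into it, mainly so that the useful data can be analyzed. It later serves as the source of data for analysis and mining, or of data needed by data applications.

1. Introduction to Hive

Hive is a data warehouse tool built on Hadoop. It maps structured files to tables and provides SQL-like querying. In essence it translates SQL into MapReduce jobs for execution, with HDFS providing the underlying storage.

2. Why use Hive?

SQL-like syntax for working with data, which enables rapid development
No need to write MapReduce by hand
Easy to extend with new functionality

3. Installing Hive

Version 2.1.1 is used here. Download address: http://disk.tiger2.cn/

Hive does not need to be installed on every machine; installing it on one machine is enough.

Installing Hive takes 5 steps

(1) Upload and extract the installation package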

cd /export/softwares
tar -zxvf apache-hive-2.1.1-bin.tar.gz -C ../servers/

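(2) Install MySQL

Reference: https://www.cnblogs.com/fantongxue/p/12443575.html

MySQL is not installed locally here; an existing MySQL instance on a cloud server is used instead.

(3) Edit Hive's configuration files

Edit hive-env.sh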

cd /export/servers/apache-hive-2.1.1-bin/conf
cp hive-env.sh.template hive-env.sh

# Hadoop installation path
HADOOP_HOME=/export/servers/hadoop-2.7.5
# Path to Hive's configuration directory
export HIVE_CONF_DIR=/export/servers/apache-hive-2.1.1-bin/conf

Edit hive-site.xml (create it if it does not exist)

vim hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
	<property>
		<name>javax.jdo.option.ConnectionUserName</name>
		<value>root</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionPassword</name>
		<value>1234</value>
	</property>
    <!-- 云服务器上的mysql数据库就会多出一个hive数据库 -->
	<property>
		<name>javax.jdo.option.ConnectionURL</name>
		<value>jdbc:mysql://101.201.101.206/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionDriverName</name>
		<value>com.mysql.jdbc.Driver</value>
	</property>
	<property>
		<name>hive.metastore.schema.verification</name>
		<value>false</value>
	</property>
	<property>
		<name>datanucleus.schema.autoCreateAll</name>
		<value>true</value>
	</property>
	<property>
		<name>hive.server2.thrift.bind.host</name>
		<value>101.201.101.206</value>
	</property>
</configuration>

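(4) Add the MySQL JDBC driver jar to Hive's lib directory

Hive uses MySQL to store its metadata, so a MySQL JDBC driver jar has to be added to the lib directory under the Hive installation directory. After that Hive is ready to start.

(5) Configure Hive's environment variables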

sudo vim /etc/profile

export HIVE_HOME=/export/servers/apache-hive-2.1.1-bin
export PATH=$HIVE_HOME/bin:$PATH

source /etc/profile

4. Ways to interact with Hive

Before interacting with Hive, make sure the three Hadoop services started earlier are all running!

(1) First way: bin/hive

cd /export/servers/apache-hive-2.1.1-bin/
bin/hive

Create a database

create database if not exists mytest;


(2) Second way

Interact through HQL statements or an HQL script

Run an HQL statement directly without entering the Hive client

cd /export/servers/apache-hive-2.1.1-bin/
bin/hive -e "create database if not exists mytest;"

Alternatively, write the HQL statements into an HQL script and run the script

cd /export/servers
vim hive.sql

-- put some HQL statements in the file
create database if not exists mytest;
use mytest;
create table stu(id int,name string);

Then run the HQL script with hive -f

cd /export/servers/apache-hive-2.1.1-bin/
bin/hive -f /export/servers/hive.sql

5. Basic Hive operations

(1) Database operations

Create a database

create database if not exists mytest;

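Note: the location where Hive stores its databases and tables is set by the hive.metastore.warehouse.dir property in hive-site.xml.

If it is not set, the default is the /user/hive/warehouse directory on HDFS, for example /user/hive/warehouse/mytest.db. The mytest database created above is therefore really a directory named mytest.db.

Create a database and specify its location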

create database test location '/myhive';

Set key-value properties on a database

A database can carry descriptive key-value properties, added at creation time

create database foo with dbproperties('owner'='test','data'='20201010');

View the key-value properties of a database

describe database extended foo;

Modify the key-value properties of a database

alter database foo set dbproperties('owner'='test2');

View more detailed information about a database

desc database extended foo;

Drop a database

This drops an empty database. If the database still contains tables, an error is raised

drop database foo;

Force-drop a database together with the tables it contains

drop database foo cascade;

(2) Table operations

Syntax for creating a table (parts in square brackets are optional)

create [external] table [if not exists] tablename (
col_name data_type [comment 'column description'],
col_name data_type [comment 'column description'])
[comment 'table description']
[partitioned by (col_name data_type,...)]
[clustered by (col_name,...)
[sorted by (col_name [asc|desc], ...)] into num_buckets buckets]
[row format row_format]
[stored as ...]
[location 'path of the table']

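Notes:

create table

Creates a table

external

Creates an external table. When a table is dropped, a managed (internal) table has both its metadata and its data deleted, whereas an external table only has its metadata deleted and its data is kept

partitioned by

Uses table partitioning. A table can have one or more partitions, and each partition is stored in its own directory

clustered by

Within each table or partition, Hive can further organize the data into buckets

sorted by

Specifies the sort columns and the sort order

row format

Specifies the field delimiter of the table files

stored as

Specifies the storage format of the table files. Common formats are SEQUENCEFILE, TEXTFILE, and RCFILE. For plain text files use stored as TEXTFILE; if the data needs to be compressed, use SEQUENCEFILE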

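(3) Managed (internal) table operations

Rows inserted into a Hive table are stored as files on HDFS. If you download and open a table file, you will see a separator between the fields. This is Hive's default field delimiter, \001 (Ctrl-A)

Getting started with table creation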

use myhive;
create table stu(id int,name string);
insert into stu values(1,'test');
select * from stu;

Create a table and specify the field delimiter

create table if not exists stu2(id int,name string) row format delimited fields terminated by '\t';

Create a table and specify where its files are stored

create table if not exists stu2(id int,name string) row format delimited fields terminated by '\t' location '/usr/stu2';

Create a table from a query result

create table stu3 as select * from stu2;

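(4) External table operations

For the detailed differences between external and managed tables, look them up yourself

Load data (load the data of a local file into a Hive table)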

load data local inpath '/export/servers/hivedatas/student.csv' into table student;

Load data and overwrite the existing data

load data local inpath '/export/servers/hivedatas/student.csv' overwrite into table student;

Load data into a table from HDFS (the data file must be uploaded to HDFS beforehand)

cd /export/servers/hivedatas
hdfs dfs -mkdir -p /hivedatas
hdfs dfs -put teacher.csv /hivedatas/
load data inpath '/hivedatas/teacher.csv' into table teacher;

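(5) Partitioned table operations

In Hive, large data sets are split by month or by day into smaller files that are stored in separate directories.

Syntax for creating a partitioned table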

create table if not exists stu2(id int,name string) partitioned by (month string) row format delimited fields terminated by '\t';

Create a table with multiple partition columns

Multiple partition columns are stored on HDFS as nested directory levels!

partitioned by (month string,day string)

Load data into a partitioned table

load data local inpath '/export/servers/hivedatas/student.csv' into table student partition(month='201806');

Load data into a table with multiple partition columns

load data local inpath '/export/servers/hivedatas/student.csv' into table student partition(year='2020',month='06',day='20');

Union query across partitions

select * from score where month='201806' union all select * from score where month='201806';

Show partitions

show partitions score;

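(6) Bucketed table operations

Bucketing divides the data into multiple files according to a specified column. It is the same idea as partitioning in MapReduce.

Enable bucketing in Hive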

set hive.enforce.bucketing=true;

Set the number of reducers

set mapreduce.job.reduces=3;

Create a bucketed table

create table course(id int,name string) clustered by(id) into 3 buckets row format delimited fields terminated by '\t';

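Data can only be loaded into a bucketed table with insert overwrite (create a regular table, then load its data into the bucketed table through a query with insert overwrite)

Create the regular table (its columns must match those of the bucketed table)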

create table course_common(id int,name string) row format delimited fields terminated by '\t';

Load data into the ordinary table

load data local inpath '/export/servers/hivedatas/student.csv' into table course_common;

Load data into the bucketed table with insert overwrite

insert overwrite table course select * from course_common cluster by(id);
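The way a row is assigned to a bucket can be sketched in Java as a hash of the bucketing column modulo the number of buckets, which is the same rule MapReduce's default HashPartitioner applies to keys. This is a simplified sketch under assumptions: it treats the hash of an int column as the value itself, while the actual hash function depends on the Hive version and the column type. The class and method names are hypothetical.

```java
// Hypothetical sketch of bucket assignment for: clustered by(id) into 3 buckets.
class BucketAssignment {
    // Assumption: the hash of an int column is the int value itself.
    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int bucketFor(int id, int numBuckets) {
        int hash = Integer.hashCode(id);
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 3; // matches "into 3 buckets" in the course table
        for (int id = 1; id <= 6; id++) {
            System.out.println("id " + id + " -> bucket " + bucketFor(id, numBuckets));
        }
    }
}
```

With three buckets and three reducers, each reducer writes one bucket file, which is why the number of reducers is set to match the bucket count before the insert.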

(7) Altering the table structure

Rename a table

alter table old_name rename to new_name;

Add or modify columns

Show the table structure

desc tablename;

Add columns

alter table tablename add columns (age int,address string);

Change a column (rename it and/or change its type)

alter table tablename change column myso mysonew int;

Drop a table

drop table tablename;

(8) Common functions

HQL is very similar to SQL, so in most cases writing ordinary SQL gives a correct query.

count(*)
max(score)
min(age)
sum(score)
avg(score)

Query the rows whose score is between 80 and 100

select * from score where s_score between 80 and 100;

Find all scores that start with 8

select * from score where s_score like '8%';

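JOIN queries

Older Hive versions support only equi-joins; non-equality conditions in the ON clause are not supported (support for them was added in Hive 2.2.0).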

select a.id,a.name from score a join student stu on a.id=stu.id;

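(9) Distributing and sorting

Distribute by: similar to the partition step in MR. It decides which reducer each row goes to and is used together with sort by.

Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause.

When testing DISTRIBUTE BY, be sure to allocate several reducers; otherwise the effect is not visible.

Example: distribute the rows by student id, then sort them by score

1, Set the number of reducers, so that each s_id is assigned to its corresponding reducer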

set mapreduce.job.reduces=7;

2, Distribute the data with DISTRIBUTE BY

insert overwrite local directory '/export/servers/hivedatas/sort' select * from score DISTRIBUTE BY s_id sort by s_score;
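What the statement above does can be simulated in plain Java: rows are first routed to a reducer by hashing s_id, and each reducer then sorts only its own rows by s_score. This is a sketch of the idea, not Hive's implementation; the Row type, the hash rule, and the sample values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical simulation of: DISTRIBUTE BY s_id SORT BY s_score
class DistributeSortDemo {
    // Hypothetical row type mirroring the s_id and s_score columns.
    static final class Row {
        final String sId;
        final int sScore;

        Row(String sId, int sScore) {
            this.sId = sId;
            this.sScore = sScore;
        }

        @Override
        public String toString() {
            return sId + ":" + sScore;
        }
    }

    static List<List<Row>> distributeAndSort(List<Row> rows, int numReducers) {
        List<List<Row>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new ArrayList<>());
        }
        // DISTRIBUTE BY: all rows with the same s_id go to the same reducer.
        for (Row r : rows) {
            int target = (r.sId.hashCode() & Integer.MAX_VALUE) % numReducers;
            reducers.get(target).add(r);
        }
        // SORT BY: each reducer sorts its own rows; there is no global order.
        for (List<Row> part : reducers) {
            part.sort(Comparator.comparingInt((Row r) -> r.sScore));
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("01", 80)); // sample values are made up
        rows.add(new Row("02", 70));
        rows.add(new Row("01", 60));
        rows.add(new Row("02", 90));
        List<List<Row>> out = distributeAndSort(rows, 7);
        for (int i = 0; i < out.size(); i++) {
            System.out.println("reducer " + i + ": " + out.get(i));
        }
    }
}
```

With a single reducer every row lands in the same list, which is why the effect of DISTRIBUTE BY only becomes visible when several reducers are configured.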

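Cluster by: when the Distribute by and Sort by columns are the same, Cluster by can be used instead.

Cluster by combines the functions of Distribute by and Sort by, but the sort is always ascending; you cannot specify ASC or DESC.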

6, Hive shell parameters

(1) Hive command line

Syntax

bin/hive [-hiveconf x=y]* [<-i filename>]* [<-f filename>|<-e query-string>] [-S]

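Options

-i: run the HQL in the given file as initialization
-e: execute the HQL given on the command line
-f: execute an HQL script file
-v: print the executed HQL statements to the console
-p: specify the port number to connect to
-S: silent mode (suppress log output)
-hiveconf x=y: set a Hive runtime configuration parameter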

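When developing Hive applications you will inevitably need to set Hive parameters; tuning them can improve the execution efficiency of HQL.

General parameters can be configured in three ways

Configuration files
Command-line parameters
Parameter declarations

Configuration files: Hive's configuration files consist of the user-defined configuration file (Hive install directory/conf/hive-site.xml) and the default configuration file (Hive install directory/conf/hive-default.xml)

The user-defined configuration overrides the default configuration!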
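The override rule can be pictured as layered maps in which a later layer wins over an earlier one. The Java sketch below only illustrates that idea and is not how Hive loads its configuration; the class name, the method name, and the sample values are hypothetical. Passing further layers for command-line parameters and parameter declarations assumes they override the files in that order.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: later configuration layers override earlier ones.
class HiveConfLayers {
    @SafeVarargs
    static Map<String, String> resolve(Map<String, String>... layers) {
        Map<String, String> effective = new LinkedHashMap<>();
        for (Map<String, String> layer : layers) {
            effective.putAll(layer); // a later layer replaces values set by an earlier one
        }
        return effective;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        Map<String, String> defaults = Collections.singletonMap("mapreduce.job.reduces", "-1"); // hive-default.xml
        Map<String, String> site = Collections.singletonMap("mapreduce.job.reduces", "3");      // hive-site.xml
        System.out.println(resolve(defaults, site).get("mapreduce.job.reduces")); // prints 3
    }
}
```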

posted @ 2020-11-26 20:57 by 修电脑的
