大数据Hadoop简介、安装、使用
Hadoop
Hadoop是用java语言编写的,在分布式服务器集群上存储海量数据并运行分布式分析应用的开源框架,其核心部分是HDFS、MapReduce与Yarn
HDFS是分布式文件系统,引入存放文件元数据的服务器NameNode和实际存放数据的服务器DataNode,对数据进行分布式存储和读取
MapReduce是分布式计算框架,MapRuduce的核心思想是把计算任务分配给集群内的服务器执行,通过对计算任务的拆分(Map计算/Reduce计算)再根据任务调度器(JobTracher)对任务进行分布式计算。
Yarn是分布式资源框架,管理整个集群的资源(内存、CPU核数)分配调度集群的资源
把HDFS理解为一个分布式的,有冗余备份的,可以动态扩展的用来存储大规模数据的大硬盘
把MapRuduce理解成一个计算引擎,按照MapReduce的规则编写Map计算/Reduce计算的程序,可以完成计算任务。
怎么使用Hadoop
①Hadoop集群的搭建
无论是在windows上装几台虚拟机玩Hadoop,还是真实的服务器来玩,说简单点就是把Hadoop的安装包放在每一台服务器上,改改配置,启动就完成了Hadoop集群的搭建。
②上传文件到Hadoop集群,实现文件存储
Hadoop集群搭建好以后,可以通过web页面查看集群的情况,还可以通过Hadoop命令来上传文件到hdfs集群,通过Hadoop命令在hdfs集群上建立目录,通过Hadoop命令删除集群上的文件等等。
③编写map/reduce程序,完成计算任务
通过集成开发工具(例如eclipse)导入Hadoop相关的jar包,编写map/reduce程序,将程序打成jar包扔在集群上执行,运行后出计算结果。
Hadoop 单机部署安装
- 创建一个用于管理hadood的用户(可新建或者使用已有的用户)
- 安装并配置ssh免密码登陆
- 安装Java环境
- 下载hadoop并配置环境变量
- 配置相关的Hadoop配置
- 验证hadoop安装并启动
Hadoop 集群部署
基础系统环境准备
在VMware中创建3台centos7.6主机,空间50G:
- 配置/etc/hostname
- 修改/etc/sysconfig/network-scripts/ifcfg-ens33静态ip、网关(192.168.208.1)、DNS(8.8.8.8)
- 配置/etc/hosts , 测试是否相互ping通
主机名 | IP | 用户 | HDFS | Yarn |
---|---|---|---|---|
node-4 | 192.168.208.132 | hp | NameNode、DataNode | NodeManager、ResourceManager |
node-5 | 192.168.208.133 | hp | DataNode、SecondaryNameNode | NodeManager |
node-6 | 192.168.208.134 | hp | DataNode | NodeManager |
配置服务器之间的ssh免密码登录(master-> node)
ssh localhost # 会提示输入密码
cd ~/.ssh/
ssh-keygen -t rsa # 会有提示,都按回车就行
cat ./id_rsa.pub >> ./authorized_keys # 加入授权
现在再使用"ssh localhost",就可以不用输入密码登录ssh
将master节点(node-4)的公钥传给各个Slave节点
scp ~/.ssh/id_rsa.pub node-5:/home/hp
scp ~/.ssh/id_rsa.pub node-6:/home/hp
Java1.8、Hodoop 3.2.4下载\安装
配置 ~.bashrc
# java
export JAVA_HOME=/home/hp/jdk1.8.0_311
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
# hadoop
export HADOOP_HOME=/home/hp/hadoop-3.2.4
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
安装情况:
java版本
[hp@node-5 .ssh]$ java -version
java version "1.8.0_311"
Java(TM) SE Runtime Environment (build 1.8.0_311-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.311-b11, mixed mode)
Hadoop版本
[hp@node-4 ~]$ hadoop version
Hadoop 3.2.4
Source code repository Unknown -r 7e5d9983b388e372fe640f21f048f2f2ae6e9eba
Compiled by ubuntu on 2022-07-12T11:58Z
Compiled with protoc 2.5.0
From source with checksum ee031c16fe785bbb35252c749418712
This command was run using /home/hp/hadoop-3.2.4/share/hadoop/common/hadoop-common-3.2.4.jar
修改Hadoop配置
配置集群模式时,需要修改Hadoop解压包下配置文件,包括workers、core-site.xml、hdfs-site.xml、mapred-site.xml、yarn-site.xml
修改workers文件
node-4
node-5
node-6
修改core-site.xml文件
<configuration>
<!-- 指定NameNode的内部通信地址 -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://node-4:9000</value>
</property>
<!-- 指定hadoop集群在工作时存储的一些临时文件存放的目录 -->
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/hp/hadoop-data/tmp</value>
</property>
</configuration>
修改hdfs-site.xml
dfs.namenode.name.dir:namenode数据的存放位置,元数据存放位置
dfs.datanode.data.dir:datanode数据的存放位置,block块存放的位置
dfs.repliction:hdfs的副本数设置,默认为3
dfs.secondary.http.address:secondarynamenode运行节点的信息,应该和namenode存放在不同节点
<configuration>
<property>
<!-- namenode web端访问的地址 -->
<name>dfs.namenode.http-address</name>
<value>node-4:9870</value>
</property>
<property>
<!-- secondarynamenode(简称2nn) web端访问的地址 -->
<name>dfs.namenode.secondary.http-address</name>
<value>node-5:50090</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hp/hadoop-data/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hp/hadoop-data/data</value>
</property>
</configuration>
修改mapred-site.xml
mapreduce.framework.name:指定mapreduce框架为yarn方式
mapreduce.jobhistory.address:指定历史服务器的地址和端口
mapreduce.jobhistory.webapp.address:查看历史服务器已经运行完的Mapreduce作业记录的web地址,需要启动该服务才行
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>node-4:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node-4:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/home/hp/hadoop-3.2.4</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/home/hp/hadoop-3.2.4</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/home/hp/hadoop-3.2.4</value>
</property>
</configuration>
修改yarn-site文件
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node-4</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
修改hadoop-env.sh
# 新增
export JAVA_HOME=/home/hp/jdk1.8.0_311
# 用户配置(不一定需要)
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
修改完上面几个配置文件之后,同步这些文件到其他节点的hadoop配置上
HDFS初始化只能在主节点(node-4)进行
cd /home/hp/hadoop-3.2.4
./bin/hdfs namenode -format
执行部分结果:
2022-09-02 11:45:22,618 INFO snapshot.SnapshotManager: SkipList is disabled
2022-09-02 11:45:22,627 INFO util.GSet: Computing capacity for map cachedBlocks
2022-09-02 11:45:22,627 INFO util.GSet: VM type = 64-bit
2022-09-02 11:45:22,627 INFO util.GSet: 0.25% max memory 828.5 MB = 2.1 MB
2022-09-02 11:45:22,627 INFO util.GSet: capacity = 2^18 = 262144 entries
2022-09-02 11:45:22,656 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2022-09-02 11:45:22,657 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2022-09-02 11:45:22,657 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2022-09-02 11:45:22,681 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2022-09-02 11:45:22,681 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2022-09-02 11:45:22,684 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2022-09-02 11:45:22,684 INFO util.GSet: VM type = 64-bit
2022-09-02 11:45:22,684 INFO util.GSet: 0.029999999329447746% max memory 828.5 MB = 254.5 KB
2022-09-02 11:45:22,684 INFO util.GSet: capacity = 2^15 = 32768 entries
2022-09-02 11:45:22,729 INFO namenode.FSImage: Allocated new BlockPoolId: BP-354792982-192.168.208.132-1662090322715
2022-09-02 11:45:22,743 INFO common.Storage: Storage directory /home/hp/hadoop-data/name has been successfully formatted.
2022-09-02 11:45:22,794 INFO namenode.FSImageFormatProtobuf: Saving image file /home/hp/hadoop-data/name/current/fsimage.ckpt_0000000000000000000 using no compression
2022-09-02 11:45:22,985 INFO namenode.FSImageFormatProtobuf: Image file /home/hp/hadoop-data/name/current/fsimage.ckpt_0000000000000000000 of size 397 bytes saved in 0 seconds .
2022-09-02 11:45:22,995 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2022-09-02 11:45:23,023 INFO namenode.FSNamesystem: Stopping services started for active state
2022-09-02 11:45:23,024 INFO namenode.FSNamesystem: Stopping services started for standby state
2022-09-02 11:45:23,032 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2022-09-02 11:45:23,033 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node-4/192.168.208.132
************************************************************/
看到以下一句提示,说明初始化成功:
INFO common.Storage: Storage directory /home/hp/hadoop-data/name has been successfully formatted.
继续在主节点上运行:
./sbin/start-dfs.sh
./sbin/start-yarn.sh
# 3.2之后可以用: mapred --daemon start
./sbin/mr-jobhistory-daemon.sh start historyserver
之后可以看到一些进程:
[hp@node-4 hadoop-3.2.4]$ jps
20522 JobHistoryServer
19563 NameNode
19676 DataNode
20173 NodeManager
20718 Jps
20047 ResourceManager
说明主节点启动成功
node-5节点进程:
[hp@node-5 hadoop-3.2.4]$ jps
20098 NodeManager
19348 SecondaryNameNode
19608 DataNode
21192 Jps
node-6节点进程:
[hp@node-6 hadoop-3.2.4]$ jps
19858 NodeManager
19417 DataNode
19997 Jps
最后,可以在node-4节点查看汇总报告:
[hp@node-4 hadoop-3.2.4]$ ./bin/hdfs dfsadmin -report
Configured Capacity: 86905466880 (80.94 GB)
Present Capacity: 55142129664 (51.36 GB)
DFS Remaining: 55141244928 (51.35 GB)
DFS Used: 884736 (864 KB)
DFS Used%: 0.00%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (3):
Name: 192.168.208.132:9866 (node-4)
Hostname: node-4
Decommission Status : Normal
Configured Capacity: 28968488960 (26.98 GB)
DFS Used: 294912 (288 KB)
Non DFS Used: 10588753920 (9.86 GB)
DFS Remaining: 18379440128 (17.12 GB)
DFS Used%: 0.00%
DFS Remaining%: 63.45%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 02 14:45:58 CST 2022
Last Block Report: Fri Sep 02 11:46:30 CST 2022
Num of Blocks: 4
Name: 192.168.208.133:9866 (node-5)
Hostname: node-5
Decommission Status : Normal
Configured Capacity: 28968488960 (26.98 GB)
DFS Used: 294912 (288 KB)
Non DFS Used: 10587422720 (9.86 GB)
DFS Remaining: 18380771328 (17.12 GB)
DFS Used%: 0.00%
DFS Remaining%: 63.45%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 02 14:45:59 CST 2022
Last Block Report: Fri Sep 02 11:55:04 CST 2022
Num of Blocks: 4
Name: 192.168.208.134:9866 (node-6)
Hostname: node-6
Decommission Status : Normal
Configured Capacity: 28968488960 (26.98 GB)
DFS Used: 294912 (288 KB)
Non DFS Used: 10587160576 (9.86 GB)
DFS Remaining: 18381033472 (17.12 GB)
DFS Used%: 0.00%
DFS Remaining%: 63.45%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 02 14:45:59 CST 2022
Last Block Report: Fri Sep 02 13:30:26 CST 2022
Num of Blocks: 4
也可以通过浏览器访问:
HDFS: http://192.168.208.132:9870/dfshealth.html#tab-datanode
YARN: http://192.168.208.132:8088/cluster
测试执行分布式实例
在HDFS上创建一个文件夹/test/input
cd /home/hp/hadoop-3.2.4
./bin/hdfs dfs -mkdir -p /test/input
# 查看文件夹
./bin/hdfs dfs -ls /
Found 2 items
drwxr-xr-x - hp supergroup 0 2022-09-02 12:12 /test
drwxrwx--- - hp supergroup 0 2022-09-02 11:52 /tmp
创建一个~/word.txt测试文件
填充入一段英文文章
Be not alarmed, madam, on receiving this letter, by the apprehension of its containing any repetition of those
sentiments or renewal of those offers which were last night so disgusting to you. I write without any intention of
paining you, or humbling myself, by dwelling on wishes which, for the happiness of both, cannot be too soon
forgotten; and the effort which the formation and the perusal of this letter must occasion, should have been spared,
had not my character required it to be written and read. You must, therefore, pardon the freedom with which I
demand your attention; your feelings, I know, will bestow it unwillingly, but I demand it of your justice.
My objections to the marriage were not merely those which I last night acknowledged to have the utmost required
force of passion to put aside, in my own case; the want of connection could not be so great an evil to my friend as to
me. But there were other causes of repugnance; causes which, though still existing, and existing to an equal degree
in both instances, I had myself endeavored to forget, because they were not immediately before me. These causes
must be stated, though briefly. The situation of your mother's family, though objectionable, was nothing in
comparison to that total want of propriety so frequently, so almost uniformly betrayed by herself, by your three
younger sisters, and occasionally even by your father. Pardon me. It pains me to offend you. But amidst your
concern for the defects of your nearest relations, and your displeasure at this representation of them, let it give you
consolation to consider that, to have conducted yourselves so as to avoid any share of the like censure, is praise no
less generally bestowed on you and your eldersister, than it is honorable to the sense and disposition of both. I will
only say farther that from what passed that evening, my opinion of all parties was confirmed, and every inducement
heightened which could have led me before, to preserve my friend from what I esteemed a most unhappy
connection. He left Netherfield for London, on the day following, as you, I am certain, remember, with the design of
soon returning.
将word.txt上传到HDFS的/test/input文件夹中
./bin/hdfs dfs -put ~/word.txt /test/input
运行一个mapreduce的例子程序:wordcount
./bin/hadoop jar /home/hp/hadoop-3.2.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.4.jar wordcount /test/input /test/output
执行成功之后如下所示,输出相关信息:
2022-09-02 12:12:27,631 INFO client.RMProxy: Connecting to ResourceManager at node-4/192.168.208.132:8032
2022-09-02 12:12:28,232 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hp/.staging/job_1662090752653_0001
2022-09-02 12:12:28,534 INFO input.FileInputFormat: Total input files to process : 1
2022-09-02 12:12:28,669 INFO mapreduce.JobSubmitter: number of splits:1
2022-09-02 12:12:28,962 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1662090752653_0001
2022-09-02 12:12:28,963 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-09-02 12:12:29,352 INFO conf.Configuration: resource-types.xml not found
2022-09-02 12:12:29,352 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-09-02 12:12:29,785 INFO impl.YarnClientImpl: Submitted application application_1662090752653_0001
2022-09-02 12:12:29,819 INFO mapreduce.Job: The url to track the job: http://node-4:8088/proxy/application_1662090752653_0001/
2022-09-02 12:12:29,820 INFO mapreduce.Job: Running job: job_1662090752653_0001
2022-09-02 12:12:38,006 INFO mapreduce.Job: Job job_1662090752653_0001 running in uber mode : false
2022-09-02 12:12:38,007 INFO mapreduce.Job: map 0% reduce 0%
2022-09-02 12:12:43,076 INFO mapreduce.Job: map 100% reduce 0%
2022-09-02 12:12:50,115 INFO mapreduce.Job: map 100% reduce 100%
2022-09-02 12:12:51,125 INFO mapreduce.Job: Job job_1662090752653_0001 completed successfully
2022-09-02 12:12:51,206 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=2896
FILE: Number of bytes written=481867
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2269
HDFS: Number of bytes written=2014
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2884
Total time spent by all reduces in occupied slots (ms)=3521
Total time spent by all map tasks (ms)=2884
Total time spent by all reduce tasks (ms)=3521
Total vcore-milliseconds taken by all map tasks=2884
Total vcore-milliseconds taken by all reduce tasks=3521
Total megabyte-milliseconds taken by all map tasks=2953216
Total megabyte-milliseconds taken by all reduce tasks=3605504
Map-Reduce Framework
Map input records=21
Map output records=370
Map output bytes=3643
Map output materialized bytes=2896
Input split bytes=103
Combine input records=370
Combine output records=220
Reduce input groups=220
Reduce shuffle bytes=2896
Reduce input records=220
Reduce output records=220
Spilled Records=440
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=317
CPU time spent (ms)=1310
Physical memory (bytes) snapshot=478359552
Virtual memory (bytes) snapshot=5567832064
Total committed heap usage (bytes)=409468928
Peak Map Physical memory (bytes)=292597760
Peak Map Virtual memory (bytes)=2781896704
Peak Reduce Physical memory (bytes)=185761792
Peak Reduce Virtual memory (bytes)=2785935360
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=2166
File Output Format Counters
Bytes Written=2014
也可以在YARN Web界面,Applications栏目点击查看执行输出信息
查看运行结果(部分):
[hp@node-4 hadoop-3.2.4]$ ./bin/hdfs dfs -cat /test/output/*
Be 1
But 2
He 1
I 9
It 1
London, 1
My 1
Netherfield 1
Pardon 1
The 1
These 1
You 1
a 1
acknowledged 1
alarmed, 1
all 1
almost 1
am 1
amidst 1
an 2
and 9
any 3
apprehension 1
as 3
aside, 1
at 1
attention; 1
至此,可以看到单词统计结果有了,说明集群搭建起来了
关闭集群命令:
./sbin/stop-yarn.sh
./sbin/stop-dfs.sh
# 3.2之后可以用: mapred --daemon stop
./sbin/mr-jobhistory-daemon.sh stop historyserver
部分错误处理:
warn library警告问题
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable警告问题
这个是因为hadoop程序内置的java类,需要读取/home/hp/hadoop-3.2.4/lib/native内的libhadoop.so,但系统变量没有设置,补充设置:
vi ~/.bashrc
export JAVA_LIBRARY_PATH=/home/hp/hadoop-3.2.4/lib/native
source ~/.bashrc
也可以通过配置core-site.xml来解决(没验证)
<property>
<name>hadoop.native.lib</name>
<value>false</value>
<description>Should native hadoop libraries, if present, be used.</description>
</property>
非root用户下,hadoop解压文件的权限问题
切换到root用户,chown修改用户组权限
chown -R hp:hp /home/hp/hadoop-3.2.4
参考资料:
Hadoop: Setting up a Single Node Cluster.
Hadoop Cluster Setup
HDFS简介和部分架构图
Hadoop单机版安装
HDFS简介与部署
HDFS 的Java API操作
Hadoop集群集群搭建