HDFS Read/Write Process and API
1. Read/Write Mechanism
Write flow (file upload):

- The client contacts the NameNode and requests to upload a file;
- The NameNode checks whether the target file already exists and whether the parent directory exists;
- The NameNode responds to the client, indicating whether the upload may proceed;
- The client asks the NameNode for the DataNodes that should store the first block (Block01);
- The NameNode responds with three DataNode nodes;
- The client opens an output stream and establishes a transfer channel to DataNode01;
- DataNode01 calls DataNode02, and DataNode02 calls DataNode03; the communication pipeline is now established;
- DataNode01, DataNode02, and DataNode03 acknowledge the client level by level (the ACK travels back along the pipeline);
- The client uploads the first block (Block01) to DataNode01;
- As DataNode01 receives the data it forwards it to DataNode02, and DataNode02 forwards it to DataNode03;
- After Block01 has been transferred, the client asks the NameNode again for locations for the second block, and the steps above repeat (a minimal client-side sketch follows this list).
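On the API side this whole pipeline is hidden behind FileSystem.create(): the client only writes to an output stream, and the HDFS client library negotiates block locations with the NameNode and drives the DataNode pipeline internally. A minimal sketch, reusing the hdfs://hadoop102:9000 address and atguigu user from the API example in section 2; the path /sketch/hello.txt is made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Address and user follow the example in section 2 of this post
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop102:9000"), conf, "atguigu");

        // create() asks the NameNode for block locations and sets up the DataNode pipeline
        FSDataOutputStream out = fs.create(new Path("/sketch/hello.txt"));
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        out.close(); // flushes the last packet and waits for the pipeline acknowledgements

        fs.close();
    }
}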
Read flow (file download):

- The client sends a download request for a file to the NameNode;
- The NameNode looks up the file's metadata and returns it;
- From the metadata, the client obtains the DataNode addresses that hold the file's blocks;
- Following the nearest-replica principle, the client picks a DataNode server and requests the data;
- The DataNode streams the data back to the client;
- The client assembles the target file locally (a minimal client-side sketch follows this list).
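The read path maps to FileSystem.open(): the returned stream fetches the block locations from the NameNode and reads each block from the nearest replica. A minimal sketch under the same assumptions (the path /sketch/hello.txt is again hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.net.URI;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop102:9000"), conf, "atguigu");

        // open() retrieves the block metadata from the NameNode;
        // the stream then pulls each block from the closest DataNode replica
        FSDataInputStream in = fs.open(new Path("/sketch/hello.txt"));
        IOUtils.copyBytes(in, System.out, 1024, true); // also closes the stream

        fs.close();
    }
}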
2. Basic API Example
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.atguigu</groupId>
    <artifactId>hdfs</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <scala.version>2.11</scala.version>
        <scala.compat.version>2.11</scala.compat.version>
        <spark.version>2.2.0</spark.version>
        <hadoop.version>2.7.2</hadoop.version>
        <hbase.version>1.0</hbase.version>
        <mysql.version>8.0.15</mysql.version>
    </properties>

    <repositories>
        <repository>
            <id>nexus-aliyun</id>
            <name>Nexus aliyun</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.8.2</version>
        </dependency>
        <!-- hadoop -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <!-- Hadoop client, used to access HDFS -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-auth</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-api</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-server-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-server-web-proxy</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>
</project>
package com.atguigu.hdfsclient;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;
import org.junit.Test;

import java.io.FileInputStream;
import java.net.URI;

public class HdfsClient {

    @Test
    public void put() throws Exception {
        // TODO 1. Obtain a FileSystem object that wraps HDFS
        // Set up the configuration
        Configuration configuration = new Configuration();
        configuration.setInt("dfs.replication", 2);
        FileSystem fileSystem = FileSystem.get(URI.create("hdfs://hadoop102:9000"), configuration, "atguigu");

        // TODO 2. Operate on the file system through this object
        // Upload a file: copyFromLocalFile
        fileSystem.copyFromLocalFile(new Path("input/2.txt"), new Path("/"));

        // Download a file: copyToLocalFile
        fileSystem.copyToLocalFile(new Path("/README.txt"), new Path("input/"));

        // Rename a file: rename
        fileSystem.rename(new Path("/1.txt"), new Path("/2.txt"));

        // Delete a file or directory: delete (the boolean enables recursive deletion)
        fileSystem.delete(new Path("/2.txt"), true);

        // Stream copy: append the local file input/2.txt to /1.txt
        FSDataOutputStream out = fileSystem.append(new Path("/1.txt"), 1024);
        FileInputStream in = new FileInputStream("input/2.txt");
        IOUtils.copyBytes(in, out, 1024, true);

        // List entries: listStatus (returns both files and directories)
        FileStatus[] fileStatuses = fileSystem.listStatus(new Path("/"));
        for (FileStatus fileStatus : fileStatuses) {
            if (fileStatus.isFile()) {
                System.out.println("file:" + fileStatus.getPath());
            } else if (fileStatus.isDirectory()) {
                System.out.println("directory:" + fileStatus.getPath());
            }
        }

        // List files: listFiles (files only, but exposes block location information)
        RemoteIterator<LocatedFileStatus> files = fileSystem.listFiles(new Path("/"), true);
        while (files.hasNext()) {
            LocatedFileStatus file = files.next();
            System.out.println("****************");
            System.out.println(file.getPath());
            BlockLocation[] blockLocations = file.getBlockLocations();
            for (BlockLocation blockLocation : blockLocations) {
                String[] hosts = blockLocation.getHosts();
                for (String host : hosts) {
                    System.out.println(host);
                }
            }
            System.out.println();
        }

        // TODO 3. Close the file system
        fileSystem.close();
    }
}
3. Rack Awareness
- The first replica is placed on the node where the client runs; if the client is outside the cluster, the first node is chosen at random.
- The second replica is placed on a randomly chosen node on the same rack as the first replica.
- The third replica is placed on a randomly chosen node on a different rack. This reduces inter-rack write traffic and generally improves write performance, and because the probability of a rack failure is far lower than that of a node failure, the policy does not compromise data reliability.
Distance(/D1/R1/N1, /D1/R1/N1) = 0  same node
Distance(/D1/R1/N1, /D1/R1/N2) = 2  different nodes on the same rack
Distance(/D1/R1/N1, /D1/R2/N1) = 4  DataNodes on different racks in the same IDC
Distance(/D1/R1/N1, /D2/R3/N1) = 6  DataNodes in different IDCs
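These distances follow HDFS's tree-shaped network topology: the distance between two nodes is the number of tree edges up to their closest common ancestor, counted on both sides. The following is an illustrative sketch only (not Hadoop's own NetworkTopology implementation), applying that rule to topology path strings of the form /datacenter/rack/node:

// Illustrative only: computes the tree distance between two topology paths
// of the form /datacenter/rack/node, matching the examples above.
public class TopologyDistance {

    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int depth = pa.length;                 // both paths are assumed to have the same depth
        int common = 0;
        while (common < depth && pa[common].equals(pb[common])) {
            common++;                          // count shared ancestors from the root
        }
        // each side climbs (depth - common) levels to reach the common ancestor
        return 2 * (depth - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/D1/R1/N1", "/D1/R1/N1")); // 0
        System.out.println(distance("/D1/R1/N1", "/D1/R1/N2")); // 2
        System.out.println(distance("/D1/R1/N1", "/D1/R2/N1")); // 4
        System.out.println(distance("/D1/R1/N1", "/D2/R3/N1")); // 6
    }
}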