Hive dynamic partition debug
Debug
Stepping through loadDynamicPartitions in a debugger shows the flow for loading a Hive partitioned table:
Note: dynamic partitions are loaded after the job has ended. In other words, the jobs shown on the Spark UI have already completed, and at that point the files Spark produced are still sitting in a temporary directory.
See the attached sequence diagram.
- Spark first writes the completed files to a temporary directory, e.g.
hdfs://localhost:9000/tmp/hive/mrbear/5fa62547-f5c1-4366-b420-bba9c0c4f317/hive_2019-01-13_16-30-41_375_1028567113987977727-1/-ext-10000/
- Validate the partition paths under that directory
// Collect the files under all leaf-partition paths, validate each path via getParent(), then add it to Set<Path> validPartitions
FileSystem fs = loadPath.getFileSystem(conf);
- For every path that passes validation, call the loadPartition method, which moves the files from the temporary path into the Hive table's path:
Partition newPartition = loadPartition(partPath, tbl, fullPartSpec, replace,
    holdDDLTime, true, listBucketingEnabled, false, isAcid);
- Iterate over the contents of validPartitions
Iterator<Path> iter = validPartitions.iterator();
// This loops over all leaf-partition paths; anything that is a directory goes into loadPartition, so every leaf partition ends up calling loadPartition
while (iter.hasNext()) {
// get the dynamically created directory
Path partPath = iter.next();
assert fs.getFileStatus(partPath).isDir():
"partitions " + partPath + " is not a directory !";
LinkedHashMap<String, String> fullPartSpec = new LinkedHashMap<String, String>(partSpec);
Warehouse.makeSpecFromName(fullPartSpec, partPath);
// load the partition
Partition newPartition = loadPartition(partPath, tbl, fullPartSpec, replace,
holdDDLTime, true, listBucketingEnabled, false, isAcid);
partitionsMap.put(fullPartSpec, newPartition);
LOG.info("New loading path = " + partPath + " with partSpec " + fullPartSpec);
}
- For example:
partPath: hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/.hive-staging_hive_2019-01-27_16-15-07_042_3692139049516127444-1/-ext-10000/month=201608/day=20160823
tbl: hive_debug_pd_tt2
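In the loop above, fullPartSpec is filled from the directory names by Warehouse.makeSpecFromName. A minimal sketch of that parsing idea, assuming plain key=value directory names (an illustration only, not Hive's actual implementation, which also handles escaping):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartSpecSketch {
    // Walk a relative partition path like "month=201608/day=20160823" and
    // build an ordered key/value spec, mirroring what makeSpecFromName does.
    public static Map<String, String> makeSpecFromName(String relPath) {
        Map<String, String> spec = new LinkedHashMap<>();
        for (String dir : relPath.split("/")) {
            int eq = dir.indexOf('=');
            if (eq > 0) {
                spec.put(dir.substring(0, eq), dir.substring(eq + 1));
            }
        }
        return spec;
    }

    public static void main(String[] args) {
        // {month=201608, day=20160823}
        System.out.println(makeSpecFromName("month=201608/day=20160823"));
    }
}
```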
- Inside it, the replaceFiles method is called; this method invokes inheritFromTable regardless of whether the previous partition path exists:
Hive.replaceFiles(tbl.getPath(), loadPath, newPartPath, oldPartPath, getConf(),
    isSrcLocal);
loadPath:hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/.hive-staging_hive_2019-01-27_16-15-07_042_3692139049516127444-1/-ext-10000/month=201608/day=20160823
newPartPath:hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201608/day=20160823
oldPartPath: does not exist for a newly created partition; for an existing partition it has a path, the same as destfp
if (oldPath != null) {
try {
FileSystem fs2 = oldPath.getFileSystem(conf);
if (fs2.exists(oldPath)) {
if (FileUtils.isSubDir(oldPath, destf, fs2)) {
FileUtils.trashFilesUnderDir(fs2, oldPath, conf);
}
if (inheritPerms) {
inheritFromTable(tablePath, destf, conf, destFs);
}
}
} catch (Exception e) {
//swallow the exception
LOG.warn("Directory " + oldPath.toString() + " cannot be removed: " + e, e);
}
}
// rename src directory to destf
//srcs = srcFs.globStatus(srcf);
//srcf:hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/.hive-staging_hive_2019-01-28_13-45-07_142_6949073873152197880-1/-ext-10000/month=201701/day=20170105
// there is exactly one entry, and it is a directory
if (srcs.length == 1 && srcs[0].isDir()) {
// rename can fail if the parent doesn't exist
//destfp:
//hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201701/day=20170105
Path destfp = destf.getParent();
if (!destFs.exists(destfp)) {
boolean success = destFs.mkdirs(destfp);
if (!success) {
LOG.warn("Error creating directory " + destf.toString());
}
if (inheritPerms && success) {
//this method applies permissions to the directory
inheritFromTable(tablePath, destfp, conf, destFs);
}
}
// Copy/move each file under the source directory to avoid to delete the destination
// directory if it is the root of an HDFS encryption zone.
// result = checkPaths(conf, destFs, srcs, srcFs, destf, true); in short, a mapping from temporary paths to target paths, e.g. sdpair[
//hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/.hive-staging_hive_2019-01-28_13-45-07_142_6949073873152197880-1/-ext-10000/month=201701/day=20170105/000000_0,
//hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201701/day=20170105/000000_0]
for (List<Path[]> sdpairs : result) {
for (Path[] sdpair : sdpairs) {
Path destParent = sdpair[1].getParent();
FileSystem destParentFs = destParent.getFileSystem(conf);
if (!destParentFs.isDirectory(destParent)) {
boolean success = destFs.mkdirs(destParent);
if (!success) {
LOG.warn("Error creating directory " + destParent);
}
if (inheritPerms && success) {
inheritFromTable(tablePath, destParent, conf, destFs);
}
}
//moveFile moves the file via rename; on overwrite it first deletes (-r) the old path, then renames, then sets permissions on that file only; there is no recursive upward permission walk here, so it is not analyzed further
if (!moveFile(conf, sdpair[0], sdpair[1], destFs, true, isSrcLocal)) {
throw new IOException("Unable to move file/directory from " + sdpair[0] +
" to " + sdpair[1]);
}
}
}
} else {
// srcf is a file or a pattern containing wildcards
if (!destFs.exists(destf)) {
boolean success = destFs.mkdirs(destf);
if (!success) {
LOG.warn("Error creating directory " + destf.toString());
}
if (inheritPerms && success) {
//this method applies permissions to the directory
inheritFromTable(tablePath, destf, conf, destFs);
}
}
// srcs must be a list of files -- ensured by LoadSemanticAnalyzer
for (List<Path[]> sdpairs : result) {
for (Path[] sdpair : sdpairs) {
if (!moveFile(conf, sdpair[0], sdpair[1], destFs, true,
isSrcLocal)) {
throw new IOException("Error moving: " + sdpair[0] + " into: " + sdpair[1]);
}
}
}
}
} catch (IOException e) {
throw new HiveException(e.getMessage(), e);
}
- Analysis of the inheritFromTable method
inheritFromTable(tablePath, destfp, conf, destFs);
tablePath: hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2
destf: hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201610/day=20161001
/**
* This method sets all paths from tablePath to destf (including destf) to have same permission as tablePath.
* @param tablePath path of table
* @param destf path of table-subdir.
* @param conf
* @param fs
*/
private static void inheritFromTable(Path tablePath, Path destf, HiveConf conf, FileSystem fs) {
if (!FileUtils.isSubDir(destf, tablePath, fs)) {
//partition may not be under the parent.
return;
}
HadoopShims shims = ShimLoader.getHadoopShims();
//Calculate all the paths from the table dir, to destf
//At end of this loop, currPath is table dir, and pathsToSet contain list of all those paths.
Path currPath = destf;
List<Path> pathsToSet = new LinkedList<Path>();
//this collects every path level between the table directory and the target partition path, e.g.:
//hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201610/day=20161001
//hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201610
while (!currPath.equals(tablePath)) {
pathsToSet.add(currPath);
currPath = currPath.getParent();
}
try {
HadoopShims.HdfsFileStatus fullFileStatus = shims.getFullFileStatus(conf, fs, currPath);
for (Path pathToSet : pathsToSet) {
//this sets permissions on the path; internally it applies them recursively
shims.setFullFileStatus(conf, fullFileStatus, fs, pathToSet);
}
} catch (Exception e) {
LOG.warn("Error setting permissions or group of " + destf, e);
}
}
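The while loop above can be sketched with plain strings standing in for org.apache.hadoop.fs.Path (an illustration only; the real code uses Path.getParent() and Path.equals()):

```java
import java.util.LinkedList;
import java.util.List;

public class InheritPathsSketch {
    // String stand-in for Path.getParent(): strip the last path component.
    static String parent(String path) {
        return path.substring(0, path.lastIndexOf('/'));
    }

    // Walk up from the partition path to (but not including) the table root,
    // collecting every intermediate path, as inheritFromTable does.
    public static List<String> pathsToSet(String tablePath, String destf) {
        List<String> paths = new LinkedList<>();
        String curr = destf;
        while (!curr.equals(tablePath)) {
            paths.add(curr);
            curr = parent(curr);
        }
        return paths;
    }

    public static void main(String[] args) {
        String table = "/user/mrbear/hive_debug_pd_tt2";
        for (String p : pathsToSet(table, table + "/month=201610/day=20161001")) {
            System.out.println(p);
        }
        // /user/mrbear/hive_debug_pd_tt2/month=201610/day=20161001
        // /user/mrbear/hive_debug_pd_tt2/month=201610
    }
}
```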
- Analysis of shims.setFullFileStatus(conf, fullFileStatus, fs, pathToSet);
For example, pathToSet takes the values:
- hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201610/day=20161001
- hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201610
@Override
public void setFullFileStatus(Configuration conf, HdfsFileStatus sourceStatus,
FileSystem fs, Path target) throws IOException {
String group = sourceStatus.getFileStatus().getGroup();
//use FsShell to change group, permissions, and extended ACL's recursively
try {
FsShell fsShell = new FsShell();
fsShell.setConf(conf);
//this performs a recursive (-R) permission change on the given path, e.g.
//hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2 -R
//hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201610/day=20161001 -R
run(fsShell, new String[]{"-chgrp", "-R", group, target.toString()});
if (isExtendedAclEnabled(conf)) {
//Attempt extended ACL operations only if it's enabled, but don't fail the operation regardless.
try {
AclStatus aclStatus = ((Hadoop23FileStatus) sourceStatus).getAclStatus();
if (aclStatus != null) {
List<AclEntry> aclEntries = aclStatus.getEntries();
removeBaseAclEntries(aclEntries);
//the ACL api's also expect the tradition user/group/other permission in the form of ACL
FsPermission sourcePerm = sourceStatus.getFileStatus().getPermission();
aclEntries.add(newAclEntry(AclEntryScope.ACCESS, AclEntryType.USER, sourcePerm.getUserAction()));
aclEntries.add(newAclEntry(AclEntryScope.ACCESS, AclEntryType.GROUP, sourcePerm.getGroupAction()));
aclEntries.add(newAclEntry(AclEntryScope.ACCESS, AclEntryType.OTHER, sourcePerm.getOtherAction()));
//construct the -setfacl command
String aclEntry = Joiner.on(",").join(aclStatus.getEntries());
//since isExtendedAclEnabled(conf) is true (by configuration), this branch also applies permissions recursively
run(fsShell, new String[]{"-setfacl", "-R", "--set", aclEntry, target.toString()});
}
} catch (Exception e) {
LOG.info("Skipping ACL inheritance: File system for path " + target + " " +
"does not support ACLs but dfs.namenode.acls.enabled is set to true: " + e, e);
}
} else {
String permission = Integer.toString(sourceStatus.getFileStatus().getPermission().toShort(), 8);
//when isExtendedAclEnabled(conf) is false (by configuration), this branch also applies permissions recursively
run(fsShell, new String[]{"-chmod", "-R", permission, target.toString()});
}
} catch (Exception e) {
throw new IOException("Unable to set permissions of " + target, e);
}
try {
if (LOG.isDebugEnabled()) { //some trace logging
getFullFileStatus(conf, fs, target).debugLog();
}
} catch (Exception e) {
//ignore.
}
}
Summary
To sum up the debug session above:
Suppose there are 30 first-level partitions and 300 leaf partitions,
where a first-level partition is /user/mrbear/hive_debug_pd_tt1/year=201601/
and the second-level partitions are
/user/mrbear/hive_debug_pd_tt1/year=201601/day=20160101
/user/mrbear/hive_debug_pd_tt1/year=201601/day=20160102
/user/mrbear/hive_debug_pd_tt1/year=201601/day=20160103
......
If a new partition is inserted:
/user/mrbear/hive_debug_pd_tt1/year=201601/day=20160104
the program then runs a recursive (-R) permission change on both
/user/mrbear/hive_debug_pd_tt1/year=201601/
/user/mrbear/hive_debug_pd_tt1/year=201601/day=20160104
The change on the parent directory in particular re-applies permissions to every pre-existing partition and all of its files, which drives up the cost of the load.
Suppose each leaf partition holds m files and there are n pre-existing leaf partitions: inserting one partition and running -R then causes at least m × n + n operations.
With 300 leaf partitions under one parent and 15 files per partition, a single partition insert triggers at least 4800 -chgrp calls and 4800 -setfacl calls.
Under dynamic insertion, repeatedly inserting partitions puts a considerable load on the NameNode.
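A quick back-of-the-envelope check of the m × n + n estimate above (the formula is this write-up's approximation, not an exact HDFS accounting):

```java
public class PermOpCountSketch {
    // A recursive chgrp/setfacl on the parent partition touches every file in
    // every existing leaf partition (m * n) plus each leaf directory itself (n).
    public static long recursiveOps(long filesPerLeaf, long leafPartitions) {
        return filesPerLeaf * leafPartitions + leafPartitions;
    }

    public static void main(String[] args) {
        // 300 leaf partitions with 15 files each: at least 4800 operations
        // per recursive command (-chgrp -R, then -setfacl -R).
        System.out.println(recursiveOps(15, 300)); // 4800
    }
}
```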
The sequence diagram is below.
