Debugging Hive dynamic partition loading

debug

Stepping through loadDynamicPartitions in a debugger shows the flow for loading a Hive partitioned table:

Note: dynamic partitions are loaded after the job ends. That is, the job shown in the Spark UI has already finished, and at that point the files Spark produced still sit in a temporary directory.

See the attached sequence diagram.

  1. Spark first writes the finished output files to a temporary directory, for example:

    hdfs://localhost:9000/tmp/hive/mrbear/5fa62547-f5c1-4366-b420-bba9c0c4f317/hive_2019-01-13_16-30-41_375_1028567113987977727-1/-ext-10000/

  2. Validate the partition paths under that directory.

  3. Collect the files under all leaf-node partition paths, validate each path via getParent(), and put the leaf directories into Set<Path> validPartitions:
    FileSystem fs = loadPath.getFileSystem(conf);
    
  4. For each validated path, call loadPartition, which moves the files from the temporary path into the Hive table's directory:

Partition newPartition = loadPartition(partPath, tbl, fullPartSpec, replace,
            holdDDLTime, true, listBucketingEnabled, false, isAcid)

Iterate over the contents of validPartitions:

Iterator<Path> iter = validPartitions.iterator();
//This loops over the leaf partition paths; anything that is a directory goes into loadPartition, so every leaf partition ends up calling loadPartition
while (iter.hasNext()) {
  // get the dynamically created directory
  Path partPath = iter.next();
  assert fs.getFileStatus(partPath).isDir():
    "partitions " + partPath + " is not a directory !";
  LinkedHashMap<String, String> fullPartSpec = new LinkedHashMap<String, String>(partSpec);
  Warehouse.makeSpecFromName(fullPartSpec, partPath);
    //load the partition
  Partition newPartition = loadPartition(partPath, tbl, fullPartSpec, replace,
      holdDDLTime, true, listBucketingEnabled, false, isAcid);
  partitionsMap.put(fullPartSpec, newPartition);
  LOG.info("New loading path = " + partPath + " with partSpec " + fullPartSpec);
}
  • For example:
  • partPath:hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/.hive-staging_hive_2019-01-27_16-15-07_042_3692139049516127444-1/-ext-10000/month=201608/day=20160823
  • tbl:hive_debug_pd_tt2
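Inside the loop, Warehouse.makeSpecFromName turns each leaf path into an ordered partition spec (fullPartSpec). A minimal standalone sketch of that key=value parsing — the class and method here are hypothetical illustrations, not Hive's actual implementation:

```java
import java.util.LinkedHashMap;

// Hypothetical sketch: parse "key=value" components of a leaf partition
// path into an ordered partition spec, like Warehouse.makeSpecFromName.
public class PartSpecSketch {
    public static LinkedHashMap<String, String> makeSpec(String leafPath) {
        LinkedHashMap<String, String> spec = new LinkedHashMap<>();
        for (String component : leafPath.split("/")) {
            int eq = component.indexOf('=');
            if (eq > 0) {
                // e.g. "month=201608" -> ("month", "201608")
                spec.put(component.substring(0, eq), component.substring(eq + 1));
            }
        }
        return spec;
    }
}
```

For the example partPath above, "month=201608/day=20160823" yields the spec {month=201608, day=20160823}.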
  5. The replaceFiles method: whether or not an old partition path exists, it will call inheritFromTable:
Hive.replaceFiles(tbl.getPath(), loadPath, newPartPath, oldPartPath, getConf(),
    isSrcLocal);

loadPath:hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/.hive-staging_hive_2019-01-27_16-15-07_042_3692139049516127444-1/-ext-10000/month=201608/day=20160823

newPartPath:hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201608/day=20160823

oldPartPath: does not exist for a newly created partition; for an existing partition it is the current partition path, the same as the destination

	if (oldPath != null) {
        try {
          FileSystem fs2 = oldPath.getFileSystem(conf);
          if (fs2.exists(oldPath)) {
            if (FileUtils.isSubDir(oldPath, destf, fs2)) {
              FileUtils.trashFilesUnderDir(fs2, oldPath, conf);
            }
            if (inheritPerms) {
              inheritFromTable(tablePath, destf, conf, destFs);
            }
          }
        } catch (Exception e) {
          //swallow the exception
          LOG.warn("Directory " + oldPath.toString() + " cannot be removed: " + e, e);
        }
      }

      // rename src directory to destf
//srcs = srcFs.globStatus(srcf);
//srcf:hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/.hive-staging_hive_2019-01-28_13-45-07_142_6949073873152197880-1/-ext-10000/month=201701/day=20170105
// there is exactly one entry, and it is a directory
      if (srcs.length == 1 && srcs[0].isDir()) {
        // rename can fail if the parent doesn't exist
     //destfp:
      //hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201701/day=20170105
        Path destfp = destf.getParent();
        if (!destFs.exists(destfp)) {
          boolean success = destFs.mkdirs(destfp);
          if (!success) {
            LOG.warn("Error creating directory " + destf.toString());
          }
          if (inheritPerms && success) {
             //this method sets permissions on the directory
            inheritFromTable(tablePath, destfp, conf, destFs);
          }
        }

        // Copy/move each file under the source directory to avoid to delete the destination
        // directory if it is the root of an HDFS encryption zone.
    // result = checkPaths(conf, destFs, srcs, srcFs, destf, true); in short, a mapping from temp paths to destination paths, e.g. sdpair = [
 //hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/.hive-staging_hive_2019-01-28_13-45-07_142_6949073873152197880-1/-ext-10000/month=201701/day=20170105/000000_0,
 //hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201701/day=20170105/000000_0]
        for (List<Path[]> sdpairs : result) {
          for (Path[] sdpair : sdpairs) {
            Path destParent = sdpair[1].getParent();
            FileSystem destParentFs = destParent.getFileSystem(conf);
            if (!destParentFs.isDirectory(destParent)) {
              boolean success = destFs.mkdirs(destParent);
              if (!success) {
                LOG.warn("Error creating directory " + destParent);
              }
              if (inheritPerms && success) {
                inheritFromTable(tablePath, destParent, conf, destFs);
              }
            }
              //moveFile moves the file via rename; for an overwrite it first deletes (-r) the old path, then renames, then sets permissions on that one file only, with no recursive walk upward, so it is not analyzed further here
            if (!moveFile(conf, sdpair[0], sdpair[1], destFs, true, isSrcLocal)) {
              throw new IOException("Unable to move file/directory from " + sdpair[0] +
                  " to " + sdpair[1]);
            }
          }
        }
      } else { 
          // srcf is a file or a pattern containing wildcards
        if (!destFs.exists(destf)) {
          boolean success = destFs.mkdirs(destf);
          if (!success) {
            LOG.warn("Error creating directory " + destf.toString());
          }
          if (inheritPerms && success) {
              //this method sets permissions on the directory
            inheritFromTable(tablePath, destf, conf, destFs);
          }
        }
        // srcs must be a list of files -- ensured by LoadSemanticAnalyzer
        for (List<Path[]> sdpairs : result) {
          for (Path[] sdpair : sdpairs) {
            if (!moveFile(conf, sdpair[0], sdpair[1], destFs, true,
                isSrcLocal)) {
              throw new IOException("Error moving: " + sdpair[0] + " into: " + sdpair[1]);
            }
          }
        }
      }
    } catch (IOException e) {
      throw new HiveException(e.getMessage(), e);
    }
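The sdpair mapping in the comments above pairs each file under the staging directory with the same file name under the destination partition. A standalone sketch of that pairing — the class here is a hypothetical illustration, not Hive's checkPaths, which also validates existence and conflicts:

```java
// Hypothetical sketch of the src->dest pairing produced by checkPaths:
// each file under the staging directory maps to the same file name
// under the destination partition directory.
public class SdPairSketch {
    public static String[] pair(String srcDir, String destDir, String fileName) {
        return new String[]{srcDir + "/" + fileName, destDir + "/" + fileName};
    }
}
```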
  6. Analyzing the inheritFromTable method:

inheritFromTable(tablePath, destfp, conf, destFs);

tablePath:hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2

destf:hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201610/day=20161001

 /**
   * This method sets all paths from tablePath to destf (including destf) to have same permission as tablePath.
   * @param tablePath path of table
   * @param destf path of table-subdir.
   * @param conf
   * @param fs
   */
  private static void inheritFromTable(Path tablePath, Path destf, HiveConf conf, FileSystem fs) {
    if (!FileUtils.isSubDir(destf, tablePath, fs)) {
      //partition may not be under the parent.
      return;
    }
    HadoopShims shims = ShimLoader.getHadoopShims();
    //Calculate all the paths from the table dir, to destf
    //At end of this loop, currPath is table dir, and pathsToSet contain list of all those paths.
    Path currPath = destf;
    List<Path> pathsToSet = new LinkedList<Path>();
      //This loop collects every path from the leaf directory up to (but excluding) the table directory, e.g.:
      //hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201610/day=20161001
      //hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201610
    while (!currPath.equals(tablePath)) {
      pathsToSet.add(currPath);
      currPath = currPath.getParent();
    }

    try {
      HadoopShims.HdfsFileStatus fullFileStatus = shims.getFullFileStatus(conf, fs, currPath);
      for (Path pathToSet : pathsToSet) {
          //this call applies permissions to the path, recursively inside
        shims.setFullFileStatus(conf, fullFileStatus, fs, pathToSet);
      }
    } catch (Exception e) {
      LOG.warn("Error setting permissions or group of " + destf, e);
    }
  }
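The while loop above can be sketched on plain strings to show exactly which paths end up in pathsToSet — the class here is a hypothetical illustration on strings, not Hadoop's Path API:

```java
import java.util.LinkedList;
import java.util.List;

// Hypothetical string-based sketch of the while loop in inheritFromTable:
// collect every path from destf up to (but excluding) tablePath.
public class PathChainSketch {
    public static List<String> pathsToSet(String tablePath, String destf) {
        List<String> paths = new LinkedList<>();
        String curr = destf;
        while (!curr.equals(tablePath)) {
            paths.add(curr);
            curr = curr.substring(0, curr.lastIndexOf('/')); // like Path.getParent()
        }
        return paths;
    }
}
```

For the example table and destf above this yields exactly the two paths listed below, each of which later gets a recursive permission pass.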

  7. Analyzing shims.setFullFileStatus(conf, fullFileStatus, fs, pathToSet);

For example:

pathToSet:

  • hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201610/day=20161001

  • hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201610

@Override
  public void setFullFileStatus(Configuration conf, HdfsFileStatus sourceStatus,
    FileSystem fs, Path target) throws IOException {
    String group = sourceStatus.getFileStatus().getGroup();
    //use FsShell to change group, permissions, and extended ACL's recursively
    try {
      FsShell fsShell = new FsShell();
      fsShell.setConf(conf);
        //This runs a recursive chgrp on the path passed in, e.g.:
        //hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2 -R
        //hdfs://localhost:9000/user/mrbear/hive_debug_pd_tt2/month=201610/day=20161001 -R
      run(fsShell, new String[]{"-chgrp", "-R", group, target.toString()});

      if (isExtendedAclEnabled(conf)) {
        //Attempt extended Acl operations only if its enabled, but don't fail the operation regardless.
        try {
          AclStatus aclStatus = ((Hadoop23FileStatus) sourceStatus).getAclStatus();
          if (aclStatus != null) {
            List<AclEntry> aclEntries = aclStatus.getEntries();
            removeBaseAclEntries(aclEntries);

            //the ACL api's also expect the tradition user/group/other permission in the form of ACL
            FsPermission sourcePerm = sourceStatus.getFileStatus().getPermission();
            aclEntries.add(newAclEntry(AclEntryScope.ACCESS, AclEntryType.USER, sourcePerm.getUserAction()));
            aclEntries.add(newAclEntry(AclEntryScope.ACCESS, AclEntryType.GROUP, sourcePerm.getGroupAction()));
            aclEntries.add(newAclEntry(AclEntryScope.ACCESS, AclEntryType.OTHER, sourcePerm.getOtherAction()));

            //construct the -setfacl command
            String aclEntry = Joiner.on(",").join(aclStatus.getEntries());
         
              //when isExtendedAclEnabled(conf) is true, ACLs are also applied recursively here
            run(fsShell, new String[]{"-setfacl", "-R", "--set", aclEntry, target.toString()});
          }
        } catch (Exception e) {
          LOG.info("Skipping ACL inheritance: File system for path " + target + " " +
                  "does not support ACLs but dfs.namenode.acls.enabled is set to true: " + e, e);
        }
      } else {
        String permission = Integer.toString(sourceStatus.getFileStatus().getPermission().toShort(), 8);
          //when isExtendedAclEnabled(conf) is false, a recursive chmod is applied here instead
        run(fsShell, new String[]{"-chmod", "-R", permission, target.toString()});
      }
    } catch (Exception e) {
      throw new IOException("Unable to set permissions of " + target, e);
    }
    try {
      if (LOG.isDebugEnabled()) {  //some trace logging
        getFullFileStatus(conf, fs, target).debugLog();
      }
    } catch (Exception e) {
      //ignore.
    }
  }
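The non-ACL branch formats the source permission with Integer.toString(perm.toShort(), 8), i.e. it renders the FsPermission short as the octal string passed to -chmod. A minimal sketch of that conversion (the class name is hypothetical):

```java
// Sketch of the permission formatting in the non-ACL branch:
// an FsPermission short (e.g. rwxr-xr-x = 493 decimal) becomes
// the octal string passed to "fs -chmod -R".
public class PermOctalSketch {
    public static String toOctal(short perm) {
        return Integer.toString(perm, 8);
    }
}
```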

Summary

Summarizing the debugging above:

Suppose there are 30 first-level partitions and 300 leaf partitions.

A first-level partition is, for example, /user/mrbear/hive_debug_pd_tt1/year=201601/

The second-level partitions are:

/user/mrbear/hive_debug_pd_tt1/year=201601/day=20160101

/user/mrbear/hive_debug_pd_tt1/year=201601/day=20160102

/user/mrbear/hive_debug_pd_tt1/year=201601/day=20160103

......

If a new partition is inserted:

/user/mrbear/hive_debug_pd_tt1/year=201601/day=20160104

the program then runs recursive (-R) permission operations on

/user/mrbear/hive_debug_pd_tt1/year=201601/

/user/mrbear/hive_debug_pd_tt1/year=201601/day=20160104

The operation on the parent directory in particular re-applies permissions to every pre-existing partition and all of its files, which drives up the cost considerably.

Suppose each leaf partition holds m files and there are n pre-existing leaf partitions.

After inserting one partition, the -R operations then cause at least m × n + n filesystem operations.

If a parent partition has 300 leaf partitions with 15 files each, a single partition insert triggers at least 4800 -chgrp operations and 4800 -setfacl operations (15 × 300 + 300 = 4800).
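The m × n + n bound above can be checked with a small sketch (the class name is a hypothetical illustration):

```java
// Lower bound on filesystem operations: a recursive chgrp/setfacl on the
// parent partition touches every pre-existing leaf partition (n) plus
// all of their files (m * n), per command.
public class AclOpCountSketch {
    public static long minOps(long filesPerPartition, long existingPartitions) {
        return filesPerPartition * existingPartitions + existingPartitions;
    }
}
```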

Under dynamic inserts that repeatedly add partitions, this puts a considerable load on the NameNode.

The sequence diagram is below.

posted on 2019-06-25 15:14 by 小熊先生不开心