Summary of the Distributed Parallel Module

Distributed Parallel Module

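1. Advantages of distributed parallel training

As deep learning has developed, datasets and models have both grown exponentially in pursuit of better learning capacity and generalization. In NLP, stacking more Transformer layers does improve accuracy, but the memory needed to hold the model parameters quickly reaches the hardware limit. In face recognition, the parameter count of the fully connected layer scales with the number of classes, so large-scale face recognition is also constrained by memory. In recommendation, growing online activity produces a very large number of input features, so in classic models such as wide&deep the Embedding layer at the head of the model becomes enormous.

Distributed parallel training lowers the memory and compute demands placed on each device, which makes it an important optimization for training.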

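2. Types of parallelism

1. Data parallelism: the data is partitioned, usually along the batch dimension, and each shard is assigned to a compute unit that runs the model on it.

2. Model parallelism: the model is partitioned. MindSpore supports intra-layer model parallelism, in which parameters are sliced and the slices are assigned to the compute units for training.

3. Hybrid parallelism: a mode that combines data parallelism and model parallelism.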
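
To make the difference between the two kinds of slicing concrete, here is a minimal sketch. `SliceShape` is a hypothetical helper, not a MindSpore API; it assumes every dimension divides evenly by the number of parts it is cut into.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: the shape of the slice held by each device when
// dimension i of `shape` is cut into strategy[i] equal parts.
std::vector<int64_t> SliceShape(const std::vector<int64_t> &shape, const std::vector<int64_t> &strategy) {
  assert(shape.size() == strategy.size());
  std::vector<int64_t> slice;
  slice.reserve(shape.size());
  for (size_t i = 0; i < shape.size(); ++i) {
    // Assumption of this sketch: only even splits are allowed.
    assert(strategy[i] > 0 && shape[i] % strategy[i] == 0);
    slice.push_back(shape[i] / strategy[i]);
  }
  return slice;
}
```

With 8 devices, data parallelism on a [32, 1024] input corresponds to `SliceShape({32, 1024}, {8, 1})`, which gives a [4, 1024] slice per device. Intra-layer model parallelism on a [1024, 4096] weight could be `SliceShape({1024, 4096}, {1, 8})`, which gives [1024, 512] per device.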

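3. Parallel modes supported by MindSpore

MindSpore also provides distributed parallel training. It supports several modes, including:

DATA_PARALLEL: data parallel mode.
AUTO_PARALLEL: automatic parallel mode, a distributed mode that combines data parallelism, model parallelism, and hybrid parallelism. It builds a cost model automatically and selects a parallel mode for the user. The cost model estimates training time from the memory-based computation cost and the communication cost on the Ascend 910 chip, and an efficient algorithm then searches for a parallel strategy with a short training time.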
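
The idea behind a cost model can be sketched as follows. This is only an illustration under simplifying assumptions: the `StrategyCost` struct, the `SelectStrategy` function, and the rule "total cost = computation + communication, subject to a memory limit" are all hypothetical, and MindSpore's actual cost model and search algorithm are more elaborate.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical candidate: one way of sharding the network, with estimated costs.
struct StrategyCost {
  std::string name;
  double compute_cost;  // estimated computation time
  double comm_cost;     // estimated communication time
  double memory;        // estimated per-device memory
};

// Returns the index of the cheapest candidate that fits in memory, or -1 if none fits.
int SelectStrategy(const std::vector<StrategyCost> &candidates, double memory_limit) {
  assert(memory_limit > 0.0);
  int best = -1;
  double best_cost = 0.0;
  for (size_t i = 0; i < candidates.size(); ++i) {
    const StrategyCost &c = candidates[i];
    if (c.memory > memory_limit) {
      continue;  // This strategy would not fit on a device.
    }
    double total = c.compute_cost + c.comm_cost;
    if (best < 0 || total < best_cost) {
      best = static_cast<int>(i);
      best_cost = total;
    }
  }
  return best;
}
```

The memory check comes first because a strategy that does not fit on the device cannot run at all, however cheap it looks.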

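4. Required environment configuration

When running distributed training on bare metal (as opposed to the cloud, that is, with Ascend 910 AI processors available locally), you need to provide a networking information file that describes the current multi-device setup. MindSpore's distributed parallel training communicates through the Huawei Collective Communication Library (HCCL), which can be found in the software package that accompanies the Ascend AI processor. In addition, mindspore.communication.management wraps the collective communication interfaces provided by HCCL so that users can configure distributed information easily.

5. Walkthrough of selected code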

Status L2NormalizeInfo::InferMirrorOps() {
  mirror_ops_.clear();  // Reset any previously inferred mirror ops
  Shape input_tensor_map = inputs_tensor_map_.at(0);
  std::vector<Group> input_group;
  // Fail if the communication group cannot be built from the tensor map
  if (CreateGroupByTensorMap(input_tensor_map, &input_group) != SUCCESS) {
    MS_LOG(ERROR) << name_ << " : Create group failed.";
    return FAILED;
  }
  OperatorVector op_for_weight;
  // An empty group means no mirror op is needed
  if (input_group.empty()) {
    MS_LOG(INFO) << name_ << " : The mirror ops is empty.";
    return SUCCESS;
  } else {
    op_for_weight = CreateMirrorOps(input_group[0].name(), input_group[0].GetDevNum());
    mirror_ops_.push_back(op_for_weight);
    // Mirror ops created; the group is input_group[0]
    MS_LOG(INFO) << name_ << " : Create the mirror ops success, the group is " << input_group[0].name();
  }
  return SUCCESS;
}
Status L2NormalizeInfo::GenerateStrategies(int64_t stage_id) {
  // Fail if the operator attributes cannot be read
  if (GetAttrs() != SUCCESS) {
    MS_LOG(ERROR) << name_ << " : GetAttrs failed.";
    return FAILED;
  }
  Shape input0_split(inputs_shape_[0].size() - 1, 1);
  int64_t axis_index = axis_;
  if (axis_ < 0) {
    size_t input_dim = inputs_shape_.at(0).size();
    axis_index = static_cast<int64_t>(input_dim) + axis_;
  }
  (void)input0_split.insert(input0_split.begin() + axis_index, 0);
  Shapes splittable_inputs = {input0_split};
  std::vector<StrategyPtr> sp_vector;
  // Fail if strategy generation fails
  if (GenerateStrategiesForIndependentInputs(stage_id, inputs_shape_, splittable_inputs, &sp_vector) != SUCCESS) {
    MS_LOG(ERROR) << name_ << " : Generate strategies failed.";
    return FAILED;
  }
  size_t success = 0;
  for (auto &sp : sp_vector) {  // Try to cost every candidate strategy
    if (SetCostUnderStrategy(sp) == SUCCESS) {
      success++;
      // The strategy was costed successfully
      MS_LOG(INFO) << name_ << " : Successfully generated " << success << " strategy.";
      PrintStrategy(sp);
    }
  }
  return SUCCESS;
}
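
The index arithmetic in GenerateStrategies can be isolated in a small standalone sketch. `BuildSplitMask` is a hypothetical name, and the reading that 1 marks a dimension that may be split while 0 marks the normalization axis that must stay whole is an assumption about how GenerateStrategiesForIndependentInputs interprets its input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a per-dimension mask for a tensor of the given rank.
// Assumed meaning: 1 = dimension may be split, 0 = dimension must stay whole.
std::vector<int64_t> BuildSplitMask(size_t rank, int64_t axis) {
  assert(rank > 0);
  // A negative axis counts from the last dimension, as in the listing above.
  int64_t axis_index = axis < 0 ? static_cast<int64_t>(rank) + axis : axis;
  assert(axis_index >= 0 && axis_index < static_cast<int64_t>(rank));
  // Start with rank - 1 ones, then insert a single 0 at the axis position.
  std::vector<int64_t> mask(rank - 1, 1);
  (void)mask.insert(mask.begin() + axis_index, 0);
  return mask;
}
```

For a rank-4 input, `BuildSplitMask(4, -1)` gives {1, 1, 1, 0} and `BuildSplitMask(4, 1)` gives {1, 0, 1, 1}.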

bool StepAllreduceFusion(const FuncGraphPtr &root, const opt::OptimizerPtr &optimizer) {
  // Reject null arguments and a missing parallel context
  MS_EXCEPTION_IF_NULL(root);
  MS_EXCEPTION_IF_NULL(optimizer);
  MS_EXCEPTION_IF_NULL(ParallelContext::GetInstance());
  std::string parallel_mode = ParallelContext::GetInstance()->parallel_mode();
  bool enable_all_reduce_fusion = ParallelContext::GetInstance()->enable_all_reduce_fusion();
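  // Assume the graph is left unchanged
  bool changes = false;
  // Skip unless the graph runs in (semi-)auto parallel mode, fusion is enabled, and the pass has not run yet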
  if (!root->has_flag(AUTO_PARALLEL) || ((parallel_mode != AUTO_PARALLEL) && (parallel_mode != SEMI_AUTO_PARALLEL)) ||
      (!enable_all_reduce_fusion) || (root->has_flag(ALLREDUCE_FUSION_RUN_ONCE_ONLY))) {
    return changes;
  }
#if defined(_WIN32) || defined(_WIN64)  // Windows: time with std::chrono
  auto start_time = std::chrono::steady_clock::now();
#else  // Other platforms: time with gettimeofday
  struct timeval start_time, end_time;
  (void)gettimeofday(&start_time, nullptr);
#endif
  // Entering allreduce fusion
  MS_LOG(INFO) << "Now entering allreduce fusion";
  DumpGraph(root, std::string(ALLREDUCE_FUSION_BEGIN));
  pipeline::ResourceBasePtr res = optimizer->resource();
  MS_EXCEPTION_IF_NULL(res);
  FuncGraphManagerPtr manager = res->manager();
  MS_EXCEPTION_IF_NULL(manager);
  CNodePtr ret = root->get_return();
  MS_EXCEPTION_IF_NULL(ret);
  AllreduceFusion allreduce_fusion;
  // Abort if allreduce fusion fails
  if (allreduce_fusion.ProcessAllreduceFusion(ret) != SUCCESS) {
    MS_LOG(EXCEPTION) << "ProcessAllreduceFusion failed";
  }
  DumpGraph(root, std::string(ALLREDUCE_FUSION_END));
  // Mark the graph so that allreduce fusion runs only once
  root->set_flag(ALLREDUCE_FUSION_RUN_ONCE_ONLY, true);
  res->results()[pipeline::kStepParallelGraph] = root;
#if defined(_WIN32) || defined(_WIN64)  // Windows
  auto end_time = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::ratio<1, 1000000>> cost = end_time - start_time;
  // Log the elapsed time on leaving allreduce fusion
  MS_LOG(INFO) << "Now leaving allreduce fusion, used time: " << cost.count() << " us";
#else  // Other platforms
  (void)gettimeofday(&end_time, nullptr);
  uint64_t time = 1000000 * static_cast<uint64_t>(end_time.tv_sec - start_time.tv_sec);
  time += static_cast<uint64_t>(end_time.tv_usec - start_time.tv_usec);
  // Log the elapsed time on leaving allreduce fusion
  MS_LOG(INFO) << "Now leaving allreduce fusion, used time: " << time << " us";
#endif
  return changes;
}
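
The listing keeps two timing branches, std::chrono on Windows and gettimeofday elsewhere. Since std::chrono::steady_clock is part of standard C++11, one possible simplification is a single portable helper. This is a sketch of that alternative, not what MindSpore does; `ElapsedMicroseconds` is a hypothetical name.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>

// Runs `work` and returns the elapsed time in whole microseconds,
// measured with a monotonic clock.
int64_t ElapsedMicroseconds(const std::function<void()> &work) {
  assert(work != nullptr);
  auto start = std::chrono::steady_clock::now();
  work();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

steady_clock is monotonic, so the result cannot go negative if the system clock is adjusted while the pass runs.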

// Nested namespaces: mindspore::parallel
namespace mindspore {
namespace parallel {
Status AllreduceNode::AddNext(const AllreduceNodePtr &next_node) {
  if (next_node == nullptr) {
    MS_LOG(ERROR) << "next_node is nullptr!";
    return FAILED;
  }
  next_.emplace_back(next_node);  // Non-null: append as a successor
  return SUCCESS;
}
Status AllreduceNode::AddPrev(const AllreduceNodePtr &prev_node, double dist, double *max) {
  if (prev_node == nullptr) {
    MS_LOG(ERROR) << "prev_node is nullptr!";
    return FAILED;
  }
  // dist must be positive
  if (dist <= 0) {
    MS_LOG(ERROR) << "dist must be positive! dist: " << dist;
    return FAILED;
  }
  prev_.emplace_back(prev_node);  // Append as a predecessor
  double add_dist = prev_node->depend_feat_size() + dist;
  depend_feat_size_ += add_dist;
  if (depend_feat_size_ > *max) {
    *max = depend_feat_size_;  // Update the running maximum through the pointer
  }
  std::queue<AllreduceNodePtr> next_queue;  // Work queue for breadth-first propagation
  for (auto &next : next_) {  // Seed the queue with the direct successors
    next_queue.push(next);
  }
  while (!next_queue.empty()) {
    auto ele = next_queue.front();
    ele->AddDependFeatSize(add_dist);
    if (ele->depend_feat_size() > *max) {  // Keep the maximum up to date
      *max = ele->depend_feat_size();
    }
    for (auto &next : ele->next()) {
      next_queue.push(next);
    }
    next_queue.pop();  // Remove the processed node from the front of the queue
  }
  return SUCCESS;
}
Status AllreduceNode::Init(const CNodePtr &cnode_ptr) {
  // Reject a null node
  if (cnode_ptr == nullptr) {
    MS_LOG(ERROR) << "cnode_ptr is nullptr!";
    return FAILED;
  }
  cnode_ptr_ = cnode_ptr;
  return SUCCESS;
}
Status AllreduceNode::AddPara(const AnfNodePtr &node_ptr) {
  // Reject a null node
  if (node_ptr == nullptr) {
    MS_LOG(ERROR) << "node_ptr is nullptr!";
    return FAILED;
  }
  if (!node_ptr->isa<Parameter>()) {
    // node_ptr is not a Parameter
    MS_LOG(ERROR) << "node_ptr is not a ParameterPtr!";
    return FAILED;
  }
  auto para_ptr = node_ptr->cast<ParameterPtr>();
  MS_EXCEPTION_IF_NULL(para_ptr);
  auto layout_ptr = para_ptr->user_data<TensorLayout>();
  if (layout_ptr == nullptr) {
    MS_LOG(ERROR) << "layout_ptr is nullptr!";
    return FAILED;
  }
  auto emplace_return = paras_.emplace(node_ptr);  // Insert into the parameter set; .second is false if it was already present
  if (emplace_return.second) {
    double para_size = static_cast<double>(layout_ptr->slice_shape().size());
    curr_para_size_ += para_size;
    para_size_map_[node_ptr] = para_size;
  } else {
    // The node is already recorded
    MS_LOG(INFO) << "node already exist!";
  }
  return SUCCESS;
}
Status AllreduceNode::RemovePara(const AnfNodePtr &node_ptr) {
  // Reject a null node
  if (node_ptr == nullptr) {
    MS_LOG(ERROR) << "node_ptr is nullptr!";
    return FAILED;
  }
  auto erase_num = paras_.erase(node_ptr);
  // The parameter was not in the set
  if (erase_num == 0) {
    MS_LOG(ERROR) << "para not find!";
    return FAILED;
  }
  curr_para_size_ -= para_size_map_[node_ptr];
  return SUCCESS;
}
void AllreduceNode::ToString() const {
  MS_LOG(INFO) << "cnode: " << cnode_ptr_->DebugString() << "para size: " << paras_.size();
  for (auto &para : paras_) {
    // Name and size of each parameter
    MS_LOG(INFO) << "para name: " << para->fullname_with_scope() << " size: " << para_size_map_.at(para);
  }
  // Accumulated dependent feature size and current parameter size
  MS_LOG(INFO) << "depend_feat_size: " << depend_feat_size_ << " curr_para_size: " << curr_para_size_;
}
}  // namespace parallel
}  // namespace mindspore
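
The core of AddPrev is a breadth-first walk that pushes an added distance down to every successor while tracking the largest accumulated value. The sketch below isolates that pattern. `Node` and `Propagate` are hypothetical, simplified stand-ins for AllreduceNode and its method, written only to illustrate the traversal.

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <vector>

// Hypothetical, simplified stand-in for AllreduceNode.
struct Node {
  double depend_feat_size = 0.0;
  std::vector<std::shared_ptr<Node>> next;
};

// Adds `delta` to `start` and to every node reachable through `next`,
// and returns the largest accumulated size seen (at least `max_so_far`).
double Propagate(const std::shared_ptr<Node> &start, double delta, double max_so_far) {
  assert(start != nullptr);
  std::queue<std::shared_ptr<Node>> pending;
  pending.push(start);
  while (!pending.empty()) {
    std::shared_ptr<Node> node = pending.front();
    pending.pop();
    node->depend_feat_size += delta;
    if (node->depend_feat_size > max_so_far) {
      max_so_far = node->depend_feat_size;
    }
    for (const auto &successor : node->next) {
      pending.push(successor);
    }
  }
  return max_so_far;
}
```

As in the listing, there is no visited set, so a node that can be reached along several paths is updated once per path. The sketch also assumes the graph is acyclic; with a cycle the loop would not terminate.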

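Summary

Main difficulties:

Partitioning the model is hard. Slicing along different dimensions introduces different communication volumes and therefore different performance. Picking a well-performing strategy out of the huge number of candidates is difficult and requires expert experience. The memory limit has to be respected so that each sub-model fits on a device, and the compute load of the sub-models has to stay roughly balanced so that no single one becomes the bottleneck.
The topology of the underlying hardware network has to be understood, including how devices are distributed within and across nodes, so that sub-models that exchange a lot of data are placed within a node and those that exchange little are placed across nodes, which improves network utilization. A cluster network typically has three tiers: the interconnect between devices inside a server, between servers inside a rack, and between racks. Going from the first tier to the last, bandwidth decreases and latency increases.
On the coding side, manual model parallelism requires the user to write a lot of explicit device-placement and communication code, which is tedious. The parallel logic becomes entangled with the algorithm logic, adding to the workload of algorithm scientists.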
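
The effect of the network tier on communication can be illustrated with a first-order estimate: transfer time is roughly latency plus payload divided by bandwidth. The `Link` struct and `TransferMicroseconds` function below are hypothetical, and any figures plugged into them are illustrative rather than measurements of a real cluster.

```cpp
#include <cassert>

// Hypothetical description of one network tier.
struct Link {
  double bytes_per_us;  // bandwidth in bytes per microsecond
  double latency_us;    // fixed per-transfer latency in microseconds
};

// First-order estimate of the time needed to move `bytes` over `link`.
double TransferMicroseconds(double bytes, const Link &link) {
  assert(bytes >= 0.0 && link.bytes_per_us > 0.0);
  return link.latency_us + bytes / link.bytes_per_us;
}
```

Under this model, a payload sent over a tier with lower bandwidth and higher latency always takes longer, which is why sub-models that exchange the most data are best kept inside one server.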

posted @ 2021-12-20 15:06 MS小白
