OneFlow: Where the Data for Computation Comes From

Preface

In the previous post, we analyzed how the Runtime starts, focusing in particular on how threads are launched; threads are the abstraction of computation. In this post we turn to storage: when the Runtime starts, it hands a Plan to the RegstMgr, and the RegstMgr allocates memory according to that Plan.

Recap

At the end of the previous post, we saw how a Kernel is ultimately invoked for execution; both the sources and the destinations of its data live on the KernelComputeContext. Compute is where the data is finally consumed, so that is the data's destination. Where, then, does the data come from?

// oneflow/user/kernels/add_n_kernel.cpp: 41
void Compute(user_op::KernelComputeContext* ctx) const override {
  size_t in_num = ctx->inputs().size();

  user_op::Tensor* out = ctx->Tensor4ArgNameAndIndex("out", 0);
  int64_t n = out->shape().elem_cnt();
  T* out_dptr = out->mut_dptr<T>();

  std::vector<const T*> in_dptrs(in_num);
  for (int32_t i = 0; i < in_num; ++i) {
    in_dptrs.at(i) = ctx->Tensor4ArgNameAndIndex("in", i)->dptr<T>();
  }

  cpu_add<T>(n, out_dptr, in_dptrs);
}
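
The cpu_add helper is not shown in this excerpt. As a minimal sketch, assuming a plain element-wise accumulation (my own reconstruction, not the actual OneFlow implementation):

#include <cstdint>
#include <vector>

// Hypothetical cpu_add: sum all input arrays element-wise into out.
template<typename T>
void cpu_add(const int64_t n, T* out, const std::vector<const T*>& in_dptrs) {
  for (int64_t i = 0; i < n; ++i) {
    T sum = static_cast<T>(0);
    for (const T* src : in_dptrs) { sum += src[i]; }
    out[i] = sum;
  }
}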

The add_n script

First, let's write a script that calls add_n; it works with OneFlow version 0.5.0. It first multiplies two matrices, then adds all three matrices together.

from oneflow.compatible import single_client as flow
from oneflow.compatible.single_client import linalg as linalg
from oneflow.compatible.single_client.ops import math_ops as math_ops
from oneflow.compatible.single_client import typing as tp
import numpy as np


@flow.global_function()
def matmul(
        x: tp.Numpy.Placeholder((3, 3), dtype=flow.float),
        y: tp.Numpy.Placeholder((3, 3), dtype=flow.float),
) -> tp.Numpy:
    return linalg.matmul(x, y)


@flow.global_function()
def add_n(
        x: tp.Numpy.Placeholder((3, 3), dtype=flow.float),
        y: tp.Numpy.Placeholder((3, 3), dtype=flow.float),
        z: tp.Numpy.Placeholder((3, 3), dtype=flow.float),
) -> tp.Numpy:
    return math_ops.add_n([x, y, z])


if __name__ == '__main__':
    x = np.arange(0, 9).reshape(3, 3).astype(np.float32)
    y = np.arange(9, 18).reshape(3, 3).astype(np.float32)
    z = matmul(x, y)
    print(z)

    a = add_n(x, y, z)
    print(a)
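
For reference, working these out by hand: matmul gives z = x @ y = [[42, 45, 48], [150, 162, 174], [258, 279, 297]], and add_n gives a = x + y + z = [[51, 56, 61], [165, 179, 193], [279, 302, 322]].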

Flow analysis

Let's try taking a bottom-up view of the analysis.

  • The Compute function is where the data is finally consumed; it calls Tensor4ArgNameAndIndex on the KernelComputeContext to fetch its inputs and outputs.
// oneflow/user/kernels/add_n_kernel.cpp: 42
void Compute(user_op::KernelComputeContext* ctx) const override {
  std::cout << std::this_thread::get_id() << std::endl;
  size_t in_num = ctx->inputs().size();

  user_op::Tensor* out = ctx->Tensor4ArgNameAndIndex("out", 0);
  int64_t n = out->shape().elem_cnt();
  T* out_dptr = out->mut_dptr<T>();

  std::vector<const T*> in_dptrs(in_num);
  for (int32_t i = 0; i < in_num; ++i) {
    in_dptrs.at(i) = ctx->Tensor4ArgNameAndIndex("in", i)->dptr<T>();
  }

  cpu_add<T>(n, out_dptr, in_dptrs);
}
  • UserKernelComputeContext initializes those pairs, but it does not initialize the BlobTensorView inside them. So where, and when, does the corresponding memory get initialized?
// oneflow/core/kernel/user_kernel.cpp: 457
struct BnTensorPair {
  std::string bn;
  std::unique_ptr<user_op::BlobTensorView> tensor;
};

BnTensorPair MakeBnTensorPair(const std::string& bn) {
  BnTensorPair pair;
  pair.bn = bn;
  return pair;
}

// oneflow/core/kernel/user_kernel.cpp: 478
explicit UserKernelComputeContext(DeviceCtx* device_ctx, const KernelConf& kernel_conf,
                                  const JobDesc& job_desc)
    : user_op_conf_(kernel_conf.op_attribute().op_conf()),
      device_ctx_(device_ctx),
      base_ctx_(kernel_conf, job_desc) {
  auto InitInOrOut = [&](const PbMap<std::string, UserOpConf::ListString>& arg_map) {
    for (const auto& it : arg_map) {
      const std::string& arg_name = it.first;
      for (int32_t i = 0; i < it.second.s_size(); ++i) {
        arg2bn_tensor_pair_.emplace(std::make_pair(arg_name, i),
                                    MakeBnTensorPair(GenRepeatedBn(arg_name, i)));
      }
    }
  };
  InitInOrOut(kernel_conf.op_attribute().op_conf().user_conf().input());
  InitInOrOut(kernel_conf.op_attribute().op_conf().user_conf().output());
  arg2bn_tensor_pair_.emplace(std::make_pair("tmp_buffer", 0),
                              MakeBnTensorPair(GenRepeatedBn("tmp_buffer", 0)));
}
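
GenRepeatedBn is not shown above. Judging from how the blob names are consumed later, it presumably just joins the arg name and the index; here is a toy reproduction of the resulting map for the add_n kernel (the helper body is my assumption):

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Assumed behavior of GenRepeatedBn: ("in", 1) -> "in_1".
std::string GenRepeatedBn(const std::string& arg_name, int32_t i) {
  return arg_name + "_" + std::to_string(i);
}

int main() {
  // Mirrors the constructor above for an add_n op with three inputs:
  std::map<std::pair<std::string, int32_t>, std::string> arg2bn;
  for (int32_t i = 0; i < 3; ++i) { arg2bn[{"in", i}] = GenRepeatedBn("in", i); }
  arg2bn[{"out", 0}] = GenRepeatedBn("out", 0);
  arg2bn[{"tmp_buffer", 0}] = GenRepeatedBn("tmp_buffer", 0);
  for (const auto& kv : arg2bn) {
    std::cout << kv.first.first << "[" << kv.first.second << "] -> " << kv.second << "\n";
  }
  return 0;
}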
  • UserKernelComputeContext provides an update method, and presumably this is what performs the initialization and subsequent updates. The logic is as follows: the caller passes in a function that maps a blob name to a Blob. For each pair stored in the context, take the blob name from the pair and call the function to look up the Blob. If the Blob is a null pointer, reset the current tensor to null. Otherwise, write the Blob into the tensor: if the tensor already exists, update it; if it does not exist yet, initialize it.
// oneflow/core/kernel/user_kernel.cpp: 513
bool UpdateTensorWithCorrBlob(const std::function<Blob*(const std::string&)>& BnInOp2Blob) {
  bool updated = false;
  for (auto& pair : arg2bn_tensor_pair_) {
    std::unique_ptr<user_op::BlobTensorView>* arg_tensor_ptr = &pair.second.tensor;
    Blob* blob = BnInOp2Blob(pair.second.bn);
    if (blob == nullptr) {
      if (*arg_tensor_ptr) {
        arg_tensor_ptr->reset(nullptr);
        updated = true;
      }
    } else {
      if (*arg_tensor_ptr) {
        if (arg_tensor_ptr->get()->blob() != blob) {
          arg_tensor_ptr->get()->Reset(blob);
          updated = true;
        } else {
          if (blob->blob_desc().is_dynamic()) { updated = true; }
        }
      } else {
        arg_tensor_ptr->reset(new user_op::BlobTensorView(blob));
        updated = true;
      }
    }
  }
  return updated;
}
  • So when is UpdateTensorWithCorrBlob called? In lazy mode, it is called from ForwardUserKernel.
// oneflow/core/kernel/user_kernel.cpp: 635
void UserKernel::ForwardUserKernel(const std::function<Blob*(const std::string&)>& BnInOp2Blob,
                                   user_op::OpKernelState* opkernel_state) const {
  const bool updated = ctx_->UpdateTensorWithCorrBlob(BnInOp2Blob);

#ifdef WITH_CUDA_GRAPHS
  bool capturing = false;
  if (cuda_graph_ctx_) {
    if (!cuda_graph_ctx_->IsCapturing()) {
      if (cuda_graph_ctx_->IsCaptured() && (!updated)) {
        cuda_graph_ctx_->Launch();
        return;
      }
      capturing = true;
      cuda_graph_ctx_->BeginCapture();
    }
  }
#endif  // WITH_CUDA_GRAPHS

  kernel_->Compute(ctx_.get(), opkernel_state);

#ifdef WITH_CUDA_GRAPHS
  if (cuda_graph_ctx_ && capturing) {
    cuda_graph_ctx_->EndCapture();
    cuda_graph_ctx_->Launch();
  }
#endif  // WITH_CUDA_GRAPHS
}
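
A note on the WITH_CUDA_GRAPHS branch above: if a graph has already been captured and nothing about the tensors changed (updated is false), the kernel launch is replayed from the captured graph instead of being re-issued. The underlying CUDA pattern presumably looks roughly like this (a sketch with error handling omitted, not OneFlow code):

// Raw CUDA Graphs capture/replay pattern (illustrative only).
cudaStream_t stream = nullptr;
cudaStreamCreate(&stream);
cudaGraph_t graph = nullptr;
cudaGraphExec_t instance = nullptr;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
// ... enqueue the kernel launches to be captured on `stream` ...
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&instance, graph, /*pErrorNode=*/nullptr, /*pLogBuffer=*/nullptr, /*bufferSize=*/0);
cudaGraphLaunch(instance, stream);  // later runs just launch again, much cheaper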
  • ForwardUserKernel was already analyzed earlier: it is called from ForwardDataContent, i.e. it fires when a Kernel runs forward, which happens in Launch, or more precisely in AsyncLaunchKernel. The lambda passed to AsyncLaunchKernel is exactly the function used above to map a blob name to its Blob. Let's analyze it.
  • Its input is a blob name and its output is a Blob. It first looks up the BlobInfo for the blob name, reads the regst_desc_id from it, uses that id to locate the Regst, and finally fetches the Blob from the Regst.
// oneflow/core/actor/actor.cpp: 470
void Actor::AsyncLaunchKernel(const KernelCtx& kernel_ctx,
                              std::function<Regst*(int64_t)> Regst4RegstDescId) {
  for (const ExecKernel& ek : exec_kernel_vec_) {
    ek.kernel->Launch(kernel_ctx, [&](const std::string& bn_in_op) -> Blob* {
      const auto blob_info_it = ek.bn_in_op2blob_info.find(bn_in_op);
      if (blob_info_it == ek.bn_in_op2blob_info.cend()) { return nullptr; }
      const BlobInfo& info = blob_info_it->second;
      if (info.regst_desc_id == -1) { return nullptr; }
      Regst* regst;
      if (info.rs != nullptr) {
        regst = info.rs->Front(info.regst_desc_id);
      } else {
        regst = Regst4RegstDescId(info.regst_desc_id);
      }
      if (regst == nullptr) { return nullptr; }
      if (info.ordinal >= 0) {
        return regst->GetBlobByOrdinal(info.ordinal);
      } else {
        return regst->GetBlobByLbi(info.lbi);
      }
    });
  }
}
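
The BlobInfo fields can be reconstructed from the call sites above (the actual definition lives in actor.h and may differ in detail):

// Reconstructed from usage; comments are my reading of the lookup code.
struct BlobInfo {
  LogicalBlobId lbi;      // used by GetBlobByLbi when ordinal < 0
  int64_t regst_desc_id;  // -1 means this bn has no backing regst
  int64_t ordinal;        // >= 0 enables the GetBlobByOrdinal fast path
  RegstSlot* rs;          // if non-null, take the regst from the slot's front
};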
  • The Blob is fetched from a Regst, so where do the Regst's Blobs come from? There are two possibilities: a Blob can hold user input, or the output of the previous Actor's run. Regst provides SetBlobByOrdinal for installing a Blob at a given position. SetBlobByOrdinal is only called from RegstMgr's NewBlobsInOneRegst method, NewBlobsInOneRegst is called from NewRegsts, and NewRegsts is called during Actor Init. This tells us that the memory backing the Blobs is already allocated at initialization time; the previous Actor only has to write its data, and the next Actor will come along and consume it.

The code below is long; each method shown earlier is called by the method that follows it. Note also that NewRegsts creates register_num() Regst objects per descriptor, advancing main_mem_ptr by MainByteSize4OneRegst() each time, so several regsts of the same descriptor occupy adjacent slices of the same memory block.

// oneflow/core/register/register.cpp: 52
void Regst::SetBlobByOrdinal(int64_t ordinal, std::unique_ptr<Blob>&& blob) {
  CHECK(!sorted_blob_vec_.at(ordinal));
  sorted_blob_vec_.at(ordinal).swap(blob);
}

// oneflow/core/register/register_manager.cpp: 191
void RegstMgr::NewBlobsInOneRegst(const std::vector<LbiBlobDescPair>& lbis, Regst* regst,
                                  const RtRegstDesc* rt_regst_desc, char* main_mem_ptr,
                                  char* separated_header_mem_ptr) {
  size_t separated_header_mem_size = rt_regst_desc->SeparatedHeaderByteSize4OneRegst();
  char* cur_body_pointer = nullptr;
  char* cur_header_pointer = nullptr;
  if (separated_header_mem_size > 0) {
    MemoryCase host_mem_case;
    host_mem_case.mutable_host_mem();
    if (separated_header_mem_ptr == nullptr) {
      separated_header_mem_ptr =
          Global<MemoryAllocator>::Get()->Allocate(host_mem_case, separated_header_mem_size);
    }
    cur_header_pointer = separated_header_mem_ptr;
    cur_body_pointer = main_mem_ptr;
  } else {
    CHECK(separated_header_mem_ptr == nullptr);
    cur_header_pointer = main_mem_ptr;
    if (main_mem_ptr == nullptr) {
      cur_body_pointer = nullptr;
    } else {
      cur_body_pointer =
          main_mem_ptr + rt_regst_desc->GetSoleBlobDesc()->AlignedByteSizeOfBlobHeader();
    }
  }
  regst->set_main_mem_ptr(main_mem_ptr);
  regst->set_separated_header_mem_ptr(separated_header_mem_ptr);
  rt_regst_desc->ForEachBlobDescOffsetInOnRegst([&](int64_t ordinal, const LogicalBlobId& lbi,
                                                    const BlobDesc* blob_desc, int64_t body_offset,
                                                    int64_t header_offset) {
    std::unique_ptr<Blob> blob_ptr;
    if (cur_body_pointer == nullptr) {
      blob_ptr.reset(new Blob(regst->regst_desc()->mem_case(), blob_desc,
                              cur_header_pointer + header_offset, nullptr));
    } else {
      blob_ptr.reset(new Blob(regst->regst_desc()->mem_case(), blob_desc,
                              cur_header_pointer + header_offset, cur_body_pointer + body_offset));
      InitNonPODTypeBlobIfNeed(Global<MemoryAllocator>::Get(), blob_ptr.get());
    }
    regst->SetBlobByOrdinal(ordinal, std::move(blob_ptr));
    const int64_t regst_desc_id = rt_regst_desc->regst_desc_id();
    const auto& parallel_ctx = regst_desc_id2parallel_ctx_.at(regst_desc_id);
    if (parallel_ctx.has_parallel_id()) {
      const int64_t parallel_id = parallel_ctx.parallel_id();
      {
        std::lock_guard<std::mutex> lock(mutex_);
        lbi2parallel_id2blob_[lbi][parallel_id] = regst->GetBlobByOrdinal(ordinal);
      }
    }
  });
}
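
To make the pointer arithmetic concrete, here is a toy model of how one regst's memory is carved up: every blob gets a header slice and a body slice at some offset inside its region, mirroring NewBlobsInOneRegst. All sizes are made up; the real offsets come from RtRegstDesc::ForEachBlobDescOffsetInOnRegst.

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  const int64_t kAlignedHeaderBytes = 64;  // assumed aligned blob header size
  const int64_t kBodyBytes = 3 * 3 * 4;    // a 3x3 float32 blob body
  const int kNumBlobs = 2;
  std::vector<char> block(kNumBlobs * (kAlignedHeaderBytes + kBodyBytes));
  char* header_base = block.data();                                  // cur_header_pointer
  char* body_base = block.data() + kNumBlobs * kAlignedHeaderBytes;  // cur_body_pointer
  for (int i = 0; i < kNumBlobs; ++i) {
    // Each Blob view is constructed over (header_base + header_offset,
    // body_base + body_offset); no data is copied, only pointers are assigned.
    std::printf("blob %d: header at +%lld, body at +%lld\n", i,
                (long long)(header_base + i * kAlignedHeaderBytes - block.data()),
                (long long)(body_base + i * kBodyBytes - block.data()));
  }
  return 0;
}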

// oneflow/core/register/register_manager.cpp: 150
void RegstMgr::NewRegsts(const RegstDescProto& regst_desc_proto,
                         std::function<void(Regst*)> OneRegstDone) {
  const int64_t regst_desc_id = regst_desc_proto.regst_desc_id();
  const RegstDescTypeProto& regst_desc_type = regst_desc_proto.regst_desc_type();
  const RtRegstDesc* rt_regst_desc = regst_desc_id2rt_regst_desc_.at(regst_desc_id).get();
  char* main_mem_ptr = nullptr;
  char* separated_header_mem_ptr = nullptr;
  int64_t mem_block_id = regst_desc_proto.mem_block_id();
  int64_t header_block_id = regst_desc_proto.separated_header_mem_block_id();
  if (mem_block_id != -1 && mem_block_id2ptr_.find(mem_block_id) != mem_block_id2ptr_.end()) {
    main_mem_ptr = mem_block_id2ptr_.at(mem_block_id) + regst_desc_proto.mem_block_offset();
  }
  if (header_block_id != -1 && mem_block_id2ptr_.find(header_block_id) != mem_block_id2ptr_.end()) {
    separated_header_mem_ptr = mem_block_id2ptr_.at(header_block_id);
  }
  std::vector<LbiBlobDescPair> lbi_pairs;
  if (regst_desc_type.has_data_regst_desc()) {
    for (const LbiBlobDescPair& pair : regst_desc_type.data_regst_desc().lbi2blob_desc()) {
      lbi_pairs.push_back(pair);
    }
    std::sort(lbi_pairs.begin(), lbi_pairs.end(), &CompareLbiBlobDescPair);
    CHECK(!lbi_pairs.empty());
  }
  for (int64_t i = 0; i < rt_regst_desc->register_num(); ++i) {
    Regst* regst = new Regst;
    regst->set_regst_desc(rt_regst_desc);
    if (regst_desc_type.has_data_regst_desc()) {
      NewBlobsInOneRegst(lbi_pairs, regst, rt_regst_desc, main_mem_ptr, separated_header_mem_ptr);
      if (main_mem_ptr != nullptr) { main_mem_ptr += rt_regst_desc->MainByteSize4OneRegst(); }
      if (separated_header_mem_ptr != nullptr) {
        separated_header_mem_ptr += rt_regst_desc->SeparatedHeaderByteSize4OneRegst();
      }
    } else if (regst_desc_type.has_ctrl_regst_desc()) {
      // do nothing
    } else {
      UNIMPLEMENTED();
    }
    OneRegstDone(regst);
  }
}

// oneflow/core/actor/actor.cpp: 40
void Actor::Init(const JobDesc* job_desc, const TaskProto& task_proto,
                 const ThreadCtx& thread_ctx) {
  job_desc_ = job_desc;
  actor_id_ = task_proto.task_id();
  thrd_id_ = Global<IDMgr>::Get()->ThrdId4ActorId(actor_id_);
  job_id_ = task_proto.job_id();
  InitDeviceCtx(thread_ctx);
  if (task_proto.has_parallel_ctx()) {
    parallel_ctx_.reset(new ParallelContext(task_proto.parallel_ctx()));
  }
  for (const ExecNodeProto& node : task_proto.exec_sequence().exec_node()) {
    ExecKernel ek;
    ek.kernel = ConstructKernel(job_desc_, node.kernel_conf(), device_ctx_.get());
    exec_kernel_vec_.push_back(std::move(ek));
  }

  is_kernel_launch_synchronized_ =
      std::all_of(exec_kernel_vec_.cbegin(), exec_kernel_vec_.cend(),
                  [](const ExecKernel& ek) { return ek.kernel->IsKernelLaunchSynchronized(); });
  if (!is_kernel_launch_synchronized_) { CHECK_EQ(exec_kernel_vec_.size(), 1); }

  remaining_eord_cnt_ = 0;
  msg_handler_ = nullptr;
  eord_regst_desc_ids_.clear();

  for (const auto& pair : task_proto.produced_regst_desc()) {
    Global<RegstMgr>::Get()->NewRegsts(pair.second, [this](Regst* regst) {
      produced_regsts_[regst->regst_desc_id()].emplace_back(regst);
    });
    int64_t regst_desc_id = pair.second.regst_desc_id();
    CHECK(name2regst_desc_id_.insert({pair.first, {regst_desc_id}}).second);
    if (pair.second.regst_desc_type().has_ctrl_regst_desc()) {
      produced_ctrl_regst_desc_ids_.insert(regst_desc_id);
    }
  }
  for (const auto& pair : produced_regsts_) {
    for (const auto& regst : pair.second) { produced_regst2reading_cnt_[regst.get()] = 0; }
  }

  for (const auto& pair : task_proto.consumed_regst_desc_id()) {
    CHECK(name2regst_desc_id_.find(pair.first) == name2regst_desc_id_.end());
    std::vector<int64_t>& regst_desc_id_vec = name2regst_desc_id_[pair.first];
    for (int64_t regst_desc_id : pair.second.regst_desc_id()) {
      regst_desc_id_vec.push_back(regst_desc_id);
    }
    remaining_eord_cnt_ += pair.second.regst_desc_id_size();
    if (pair.first == "in_ctrl") {
      consumed_ctrl_regst_desc_ids_.insert(regst_desc_id_vec.begin(), regst_desc_id_vec.end());
    }
  }

  total_reading_cnt_ = 0;
  is_inplace_consumed_eord_ = false;
  CheckInplaceRegstDescId(task_proto);
  TakeOverInplaceConsumedAndProduced(task_proto.produced_regst_desc());
  is_naive_consumed_eord_ = false;
  TakeOverNaiveConsumed(task_proto.consumed_regst_desc_id());
  TakeOverNaiveProduced(task_proto.produced_regst_desc());
  InitBnInOp2BlobInfo(task_proto);
  VirtualActorInit(task_proto);
}
  • Since the next Actor automatically consumes whatever the previous Actor outputs, we really only need to look at the Actor that receives the input data. The data is pushed in through a Push Job, and on the Python side we have to provide a function that copies the numpy data into a Blob.

The flow below comes from InferenceSession, which pushes data by running a Push Job. On the Python side, a C++ class is subclassed to implement the PushBlob method; when PushBlob is invoked, it wraps the raw pointer into an OfBlob and then invokes the push callback.

# python/oneflow/compatible/single_client/serving/inference_session.py: 427
def _run_push_jobs(self, **kwargs):
    for (
        input_name,
        push_job_name,
    ) in self.inter_user_job_info_.input_or_var_op_name2push_job_name.items():
        if input_name not in kwargs:
            raise ValueError('input "{}" is absent'.format(input_name))
        input_numpy = kwargs[input_name]
        if not isinstance(input_numpy, np.ndarray):
            raise ValueError('input "{}" requires numpy.ndarray'.format(input_name))
        push_fn = input_blob_util._MakePushNdarrayCallback(input_numpy)
        push_job_inst = job_instance_util.MakePushJobInstance(
            push_job_name, input_name, push_fn
        )
        self._run_job(push_job_inst)

# python/oneflow/compatible/single_client/framework/input_blob_def.py: 249
def _MakePushNdarrayCallback(ndarray):
    copied = np.copy(ndarray, order="C")

    def Copy(ofblob):
        capacity = reduce(lambda x, y: x * y, ofblob.static_shape, 1)
        elem_cnt = reduce(lambda x, y: x * y, copied.shape, 1)
        assert elem_cnt <= capacity, "%s v.s. %s" % (copied.shape, ofblob.static_shape)
        ofblob.CopyFromNdarray(copied)

    return Copy

# python/oneflow/compatible/single_client/framework/job_instance.py: 106
def PushBlob(self, of_blob_ptr):
    try:
        self.push_cb_(ofblob.OfBlob(of_blob_ptr))
    except Exception as e:
        print(traceback.format_exc())
        raise e
  • When does the push job instance get launched? First, Python calls a C++ API to hand over the job instance, which is then placed into the Buffer Manager, waiting to be taken out. At the end of LaunchJob, the kBufferNameGlobalWaitJobId buffer is fetched from the Buffer Manager and the job id is sent into it, which kicks off the foreign input job. The ForeignInputKernel then runs: its ForwardDataContent builds an OfBlob and hands it to the job instance so that the instance can push its data into the blob.
// oneflow/api/python/framework/framework.h: 67
inline Maybe<void> LaunchJob(const std::shared_ptr<oneflow::JobInstance>& cb) {
  CHECK_OR_RETURN(GlobalProcessCtx::IsThisProcessMaster());
  CHECK_NOTNULL_OR_RETURN(Global<Oneflow>::Get());
  const auto& job_name = cb->job_name();
  auto* buffer_mgr = Global<BufferMgr<std::shared_ptr<JobInstance>>>::Get();
  int64_t job_id = Global<JobName2JobId>::Get()->at(job_name);
  if (IsPullJob(job_name, *Global<InterUserJobInfo>::Get())) {
    buffer_mgr->Get(GetForeignOutputBufferName(job_name))->Send(cb);
  }
  if (IsPushJob(job_name, *Global<InterUserJobInfo>::Get())) {
    buffer_mgr->Get(GetForeignInputBufferName(job_name))->Send(cb);
  }
  buffer_mgr->Get(GetCallbackNotifierBufferName(job_name))->Send(cb);
  Global<BufferMgr<int64_t>>::Get()->Get(kBufferNameGlobalWaitJobId)->Send(job_id);
  return Maybe<void>::Ok();
}

// oneflow/core/kernel/foreign_input_kernel.cpp: 23
void ForeignInputKernel::ForwardDataContent(
    const KernelCtx& ctx, const std::function<Blob*(const std::string&)>& BnInOp2Blob) const {
  const auto& buffer_name = op_conf().foreign_input_conf().ofblob_buffer_name();
  std::shared_ptr<JobInstance> foreign_job_instance;
  BufferStatus buffer_status = Global<BufferMgr<std::shared_ptr<JobInstance>>>::Get()
                                   ->Get(buffer_name)
                                   ->TryReceive(&foreign_job_instance);
  CHECK_NE(buffer_status, kBufferStatusEmpty);
  if (buffer_status == kBufferStatusSuccess) {
    OfBlob ofblob(ctx.device_ctx, BnInOp2Blob("out"));
    foreign_job_instance->PushBlob(reinterpret_cast<uint64_t>(&ofblob));
  }
}
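
LaunchJob and ForeignInputKernel communicate through Buffer objects obtained from the BufferMgr. The class itself is not shown here, but the Send/TryReceive calls suggest channel semantics roughly like this toy version (an assumed interface; the real one presumably also supports blocking receives and closing):

#include <deque>
#include <mutex>

enum BufferStatus { kBufferStatusSuccess, kBufferStatusEmpty };

// Toy, non-blocking stand-in for the Buffer<T> implied by the code above.
template<typename T>
class ToyBuffer {
 public:
  void Send(const T& item) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push_back(item);
  }
  BufferStatus TryReceive(T* item) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (queue_.empty()) { return kBufferStatusEmpty; }
    *item = queue_.front();
    queue_.pop_front();
    return kBufferStatusSuccess;
  }

 private:
  std::mutex mutex_;
  std::deque<T> queue_;
};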

Summary

This post roughly summarized where the data needed for computation comes from and where it goes. Memory is allocated according to the Plan when the runtime starts; once the previous Actor produces its output, the next Actor can use that output for its own computation. When Python needs to push frontend data over to the C++ side, a JobInstance carries out the push.

One detail was left vague, though. During LaunchJob, an id is sent to the BufferMgr. How is that id received? And once it is received, how is the corresponding Job started? The next post will dig into this.
