[源码解析] PyTorch 如何使用GPU

torch.cuda用于设置 cuda 和运行cuda操作。它跟踪当前选定的GPU，默认情况下，用户分配的所有CUDA张量都将在该设备上创建。用户可以使用 torch.cuda.device 来修改所选设备。一旦分配了张量，您可以对其执行操作，而不考虑所选设备，PyTorch 会把运行结果与原始张量放在同一设备上。

默认情况下，除了~torch.Tensor.copy_和其他具有类似复制功能的方法（如~torch.Tensor.to和~torch.Tensor.cuda）之外，不允许跨GPU操作，除非启用对等（peer-to-peer）内存访问。

我们从源码之中找出一个具体示例如下，大家可以看到，张量可以在设备上被创建，操作。

    cuda = torch.device('cuda')     # Default CUDA device
    cuda0 = torch.device('cuda:0')
    cuda2 = torch.device('cuda:2')  # GPU 2 (these are 0-indexed)

    x = torch.tensor([1., 2.], device=cuda0)
    # x.device is device(type='cuda', index=0)
    y = torch.tensor([1., 2.]).cuda()
    # y.device is device(type='cuda', index=0)

    with torch.cuda.device(1):
        # allocates a tensor on GPU 1
        a = torch.tensor([1., 2.], device=cuda)

        # transfers a tensor from CPU to GPU 1
        b = torch.tensor([1., 2.]).cuda()
        # a.device and b.device are device(type='cuda', index=1)

        # You can also use ``Tensor.to`` to transfer a tensor:
        b2 = torch.tensor([1., 2.]).to(device=cuda)
        # b.device and b2.device are device(type='cuda', index=1)

        c = a + b
        # c.device is device(type='cuda', index=1)

        z = x + y
        # z.device is device(type='cuda', index=0)

        # even within a context, you can specify the device
        # (or give a GPU index to the .cuda call)
        d = torch.randn(2, device=cuda2)
        e = torch.randn(2).to(cuda2)
        f = torch.randn(2).cuda(cuda2)
        # d.device, e.device, and f.device are all device(type='cuda', index=2)

2.2 Module

深度学习的模型可以看做是一种参数的容器，运行模型其实就是对输入参数做了一些基本的矩阵运算。一般来说，用户定义的模型都是派生自 nn.modules.module 类。而分布式训练涉及到同步更新参数和把模型拷贝到多个worker之上，所以我们首先需要看看Module的状况。从定义中可以看出来，Module的成员变量主要分为状态参数和hooks函数。

class Module:
    dump_patches: bool = False
    _version: int = 1
    training: bool
    _is_full_backward_hook: Optional[bool]

    def __init__(self):
        """
        Initializes internal Module state, shared by both nn.Module and ScriptModule.
        """
        torch._C._log_api_usage_once("python.nn_module")

        self.training = True
        self._parameters = OrderedDict() # 在训练过程中会随着 BP 而更新的参数
        self._buffers = OrderedDict() # 在训练过程中不会随着 BP 而更新的参数
        self._non_persistent_buffers_set = set()
        self._backward_hooks = OrderedDict()
        self._is_full_backward_hook = None
        self._forward_hooks = OrderedDict()
        self._forward_pre_hooks = OrderedDict()
        self._state_dict_hooks = OrderedDict()
        self._load_state_dict_pre_hooks = OrderedDict()
        self._modules = OrderedDict()

我们主要对状态参数进行说明。状态参数之中，主要有四种：

self.training
- 本网络是否正在训练。
self._modules
- 是本网络下属的子模块，采取迭代的方式进行定义。
self._parameters
- 网络的参数。是在训练过程中会随着 BP 而更新的参数，就是梯度更新的对象。
self._buffers
- 在训练过程中，不会随着BP更新的参数，但需要被保存，比如BatchNorm中的moving mean and variance，其优化不是通过梯度反向传播而是通过其他途径。

从本质上讲，当一个模型的网络结构被定义之后，self._parameters 和 self._buffers的组合是一个模型的具体状态。如果需要拷贝一个模型：

self._modules属于网络结构的一部分，当我们拷贝模型到其他workers时，会一起拷贝过来。
而self._parameters 和 self._buffers 都需要显式拷贝到其他worker，这样才能在不同的Python进程之中维持相同的状态。

那么，这是不是意味着我们只需要拷贝 self._modules，self._parameters 和 self._buffers 这些就可以了？让我们继续往下看。

2.3 移动

2.3.1 示例

前面看到了如何在 GPU 上操作张量，我们接下来看看如何把模型放置到 GPU 之上。

首先我们定义了一个模型。

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

然后通过如下方式使用模型。

model = ToyModel().cuda(device_ids[0]) # 这里复制模型到 GPU 之上
ddp_model = DDP(model, device_ids)

loss_fn = nn.MSELoss() # 接着进行训练
optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = ddp_model(torch.randn(20, 10))
labels = torch.randn(20, 5).to(device_ids[0])
loss_fn(outputs, labels).backward()
optimizer.step()

2.3.2 操作

示例之中使用了 cuda 方法把模型复制到 GPU 之上，注释中指出了是把模型的 parameters 和 buffers 移动到 GPU 之上。代码中实际就是使用 self._apply 来调用 cuda(device)。

def cuda(self: T, device: Optional[Union[int, device]] = None) -> T:
    r"""Moves all model parameters and buffers to the GPU.

    This also makes associated parameters and buffers different objects. So
    it should be called before constructing optimizer if the module will
    live on GPU while being optimized.

    .. note::
        This method modifies the module in-place.

    Args:
        device (int, optional): if specified, all parameters will be
            copied to that device

    Returns:
        Module: self
    """
    return self._apply(lambda t: t.cuda(device))

我们再看大家熟悉的另外一些函数。

首先，to 方法其实本质也是使用 self._apply 来调用 to(device)，我们省略了一些检验代码。

def to(self, *args, **kwargs):
    r"""Moves and/or casts the parameters and buffers.

    This can be called as

    .. function:: to(device=None, dtype=None, non_blocking=False)
    .. function:: to(dtype, non_blocking=False)
    .. function:: to(tensor, non_blocking=False)
    .. function:: to(memory_format=torch.channels_last)

    Its signature is similar to :meth:`torch.Tensor.to`, but only accepts
    floating point or complex :attr:`dtype`s. In addition, this method will
    only cast the floating point or complex parameters and buffers to :attr:`dtype`
    (if given). The integral parameters and buffers will be moved
    :attr:`device`, if that is given, but with dtypes unchanged. When
    :attr:`non_blocking` is set, it tries to convert/move asynchronously
    with respect to the host if possible, e.g., moving CPU Tensors with
    pinned memory to CUDA devices.

    See below for examples.

    .. note::
        This method modifies the module in-place.

    Args:
        device (:class:`torch.device`): the desired device of the parameters
            and buffers in this module
        dtype (:class:`torch.dtype`): the desired floating point or complex dtype of
            the parameters and buffers in this module
        tensor (torch.Tensor): Tensor whose dtype and device are the desired
            dtype and device for all parameters and buffers in this module
        memory_format (:class:`torch.memory_format`): the desired memory
            format for 4D parameters and buffers in this module (keyword
            only argument)

    Returns:
        Module: self
    """

    device, dtype, non_blocking, convert_to_format = torch._C._nn._parse_to(*args, **kwargs)

    def convert(t):
        if convert_to_format is not None and t.dim() in (4, 5):
            return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
                        non_blocking, memory_format=convert_to_format)
        return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

    return self._apply(convert)

其次，cpu 方法也是使用 self._apply 来调用 cpu(device)。

def cpu(self: T) -> T:
    r"""Moves all model parameters and buffers to the CPU.

    .. note::
        This method modifies the module in-place.

    Returns:
        Module: self
    """
    return self._apply(lambda t: t.cpu())

因此，我们需要分析一下 _apply 方法。

2.3.3 _apply 方法

我们可以看到其主要逻辑是：

遍历 _parameters：
- 对参数调用fn进行处理，得到param_applied。
  - 用 param_applied 重新设置参数。
- 如果参数有梯度，则：
  - 对参数的grad调用fn进行处理，得到grad_applied。
  - 用 grad_applied 重新设置参数的梯度。
遍历 _buffers：
- 对buf调用fn进行处理。

def _apply(self, fn):
    for module in self.children():
        module._apply(fn)

    def compute_should_use_set_data(tensor, tensor_applied):
        if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
            # If the new tensor has compatible tensor type as the existing tensor,
            # the current behavior is to change the tensor in-place using `.data =`,
            # and the future behavior is to overwrite the existing tensor. However,
            # changing the current behavior is a BC-breaking change, and we want it
            # to happen in future releases. So for now we introduce the
            # `torch.__future__.get_overwrite_module_params_on_conversion()`
            # global flag to let the user control whether they want the future
            # behavior of overwriting the existing tensor or not.
            return not torch.__future__.get_overwrite_module_params_on_conversion()
        else:
            return False

    # 遍历 _parameters
    for key, param in self._parameters.items():
        if param is not None:
            # Tensors stored in modules are graph leaves, and we don't want to
            # track autograd history of `param_applied`, so we have to use
            # `with torch.no_grad():`
            with torch.no_grad():
                param_applied = fn(param) # 对参数调用fn进行处理，得到param_applied
            should_use_set_data = compute_should_use_set_data(param, param_applied)
            if should_use_set_data:
                param.data = param_applied # 用 param_applied 重新设置
            else:
                assert isinstance(param, Parameter)
                assert param.is_leaf
                # # 用 param_applied 重新设置
                self._parameters[key] = Parameter(param_applied, param.requires_grad)

            if param.grad is not None: # 如果参数有梯度
                with torch.no_grad():
                    grad_applied = fn(param.grad) # 对参数的grad调用fn进行处理
                should_use_set_data = compute_should_use_set_data(param.grad, grad_applied)
                if should_use_set_data:
                    param.grad.data = grad_applied # 用 grad_applied 重新设置
                else:
                    assert param.grad.is_leaf
                    self._parameters[key].grad = grad_applied.requires_grad_(param.grad.requires_grad) # 用 grad_applied 重新设置

    # 遍历 _buffers                
    for key, buf in self._buffers.items():
        if buf is not None:
            self._buffers[key] = fn(buf) # 对buf调用fn进行处理

    return self

因此我们可以看到，移动模型到GPU，其实就是把模型的self._parameters 和 self._buffers 移动到 GPU，并没有对 self._modules 进行移动。我们对模型进行 .cuda() 处理，是将模型的参数放到显存上去（实际使用的时候也是通过这些参数做运算）。

比如原来模型在下图左侧，进行 Module.cuda() 操作之后，模型如右边所示。

                                        +
                                        |
+---------------------------------+     |     +----------------------------------+
| CPU                             |     |     | CPU                              |
|  +--------------+               |     |     |       +--------------------+     |
|  |Module        |               |     |     |       | Module             |     |
|  |              |               |     |     |       |                    |     |
|  | _parameters+----> Parameters |     |     |       |     _parameters ------+  |
|  |              |               |     |     |       |                    |  |  |
|  | _buffers +------> Buffers    |     |     |     +-----+ _buffers       |  |  |
|  |              |               |     |     |     | |                    |  |  |
|  | _modules     |               |     |     |     | |     _modules       |  |  |
|  |              |               |     |     |     | |                    |  |  |
|  +--------------+               |     |     |     | +--------------------+  |  |
|                                 |     |     |     |                         |  |
+---------------------------------+     |     +----------------------------------+
                                        |           |                         |
                                        +           |                         |
+------------------------------->  Module.cuda() +---------------------------------> Time
                                        +           |                         |
                                        |           |                         |
+---------------------------------+     |     +----------------------------------+
| GPU                             |     |     | GPU |                         |  |
|                                 |     |     |     |                         |  |
|                                 |     |     |     |       Parameters  <-----+  |
|                                 |     |     |     |                            |
|                                 |     |     |     |                            |
|                                 |     |     |     +---->  Buffers              |
|                                 |     |     |                                  |
|                                 |     |     |                                  |
+---------------------------------+     |     +----------------------------------+
                                        |
                                        +

为什么 self._modules 没有被移动？这是因为没有必要，因为_modules 可以认为是一个list，其主要起到了桥梁作用，对其递归遍历可以被用来获取网络所有的 parameters。而这个功能在后续操作之中不是必须的。

DP 就是在每次网络传播开始前，会把master节点上的parameters和buffer广播给其他节点，以此来维持状态的统一。

2.4 小结

现在我们可以回答了第一个问题：移动模型到GPU这个动作的背后究竟做了哪些操作？

答案时：调用 cuda 或者 to 方法来移动模型到GPU，其实就是把模型的self._parameters 和 self._buffers 移动到 GPU，并没有对 self._modules 进行移动。这个移动过程是递归调用的，是把模型每个叶子都移动到了 GPU 之上。

0x03 在GPU之上调用函数

3.1 CUDA编程模型基础

我们首先介绍一下CUDA编程模型基础。

3.1.1 异构模型

CUDA编程模型是一个异构模型。程序运行在一个异构系统之上，这个异构系统由CPU和GPU构成，它们之间由总线分开，程序运行时候是由CPU和GPU协同工作。

在CUDA之中，有两个重要概念：host和device。

Host ：CPU及其内存。
Device ：GPU及其内存。

因此，CUDA 架构下的一个程序也对应分为两个部份：Host 代码和Device代码，它们分别在CPU和GPU上运行。host与device之间可以通信进行数据拷贝。

主机代码（Host Code）：在 CPU 上执行的部份，使用Linux（GNU gcc）和Windows（Microsoft Visual C）编译器来编译。大致可以认为认为C语言工作对象是CPU和内存条。
设备代码（Device Code）：在GPU上执行的部份，使用 NVIDIA NVCC 编译器来编译。大致可以认为 CUDA C工作对象是GPU及GPU上内存（也叫设备内存）。

+-------------------+        +--------------------+
|                   |        |                    |
|   +----------+    |        |    +----------+    |
|   |          |    |        |    |          |    |
|   |   RAM    |    |        |    |   RAM    |    |
|   |          |    |        |    |          |    |
|   +----+-----+    |        |    +----+-----+    |
|        |          +--------+         |          |
|        |          |        |         |          |
|   +----+-----+    |        |    +----+-----+    |
|   |          |    |        |    |          |    |
|   |   CPU    |    |        |    |   GPU    |    |
|   |          |    |        |    |          |    |
|   +----------+    |        |    +----------+    |
|                   |        |                    |
+-------------------+        +--------------------+

      Host                           Device

3.1.2 并行思想

CUDA 编程的思路是并行思想，大致如下：

把一个很大的执行任务划分成若干个简单的可以重复的操作，然后使用若干个线程来分别执行这些操作，达到并行的目的。
执行任务处理的数据也要对应分组成多个小数据块。比如一个大数据分成若干个GPU组，每个GPU组要再次分成多个线程组，线程组内的张量可能需要再细分为张量处理器能处理的小组。

因此，一个典型的CUDA程序包括串行代码和并行代码。

串行代码是标准C代码，由host执行。
并行代码是CUDA C代码，在device中执行。

CUDA 主程序由CPU开始，即程序由host执行串行代码开始，当遇到需要数据并行处理的部分，则由device执行并行代码来作为补足。device可以独立于host进行大部分操作。当一个device代码启动之后，控制权会立刻返还给CPU来执行其他任务，所以这是一个异步过程。

图来自 https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html 。

3.1.3 处理流程

典型的CUDA程序的执行流程如下：

分配host内存空间并且初始化数据。
分配device显存空间。
将要计算的数据从Host内存之上复制到device显存之上。
调用CUDA核函数在device上完成用户指定的运算。
将计算后GPU内存上的结果复制到Host内存上。
释放device和host上分配的内存。

具体可以参见下图。

3.2 函数

3.2.1 核函数

核函数是在device线程中并行执行的函数。在 CUDA 程序中，主程序在调用GPU内核之前需要对核进行执行配置，以确定线程块数，每个线程块中线程数和共享内存大小。比如在调用时需要用<<参数1,参数2>>来指定核函数需要的线程数量以及线程是如何组织，这样在GPU之中就会启动若干个线程来并行执行这个核函数，每个线程被分配一个唯一的线程号。

CUDA通过函数类型限定词来区别host和device上的函数，主要的三个函数类型限定词为：

限定符	执行	调用	备注
`__global__`	设备端执行	可以从主机调用也可以从某些特定设备调用	异步操作，host 将并行计算任务发射到GPU的任务调用单之后，不会等待kernel执行完就执行下一步
`__device__`	设备端执行	设备端调用	不可以和`__global__`同时用
`__host__`	主机端执行	主机调用	可省略，不可和`__global__`同时用，可和`__device__`同时用，此时函数在device和host都编译。

具体如下：

+------------------------+   +------------------------+
|                        |   |                        |
|                        |   |                        |
| __host__   __global__  |   |          __device__    |
|    +           +       |   |                        |
|    |           |       |   |              +         |
|    |           |       |   |              |         |
|    |           v--------------->          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |           +<--------------v          |         |
|    |           |       |   |              |         |
|    |           |       |   |              |         |
|    |           |       |   |              |         |
|    v           v       |   |              v         |
|                        |   |                        |
+------------------------+   +------------------------+

      Host                           Device

这三个限定词其实也是 CUDA 中常见的三种运行场景。其中，device 函数和global函数因为需要在GPU上运行，因此不能调用常见的一些 C/C++ 函数（因为这些函数没有对应的 GPU 实现）。

如下代码是 NVIDIA 的例子，使用内置的 threadIdx 变量，把 A 和 B 两个张量进行相加，得到 C。因此，N 个线程之中每个都会执行 VecAdd() 。

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

3.2.2 PyTorch 样例

我们从 third_party/cub/cub/device/dispatch/dispatch_reduce.cuh 找一个核函数例子来看看。

/**
 * Reduce region kernel entry point (multi-block).  Computes privatized reductions, one per thread block.
 */
template <
    typename                ChainedPolicyT,             ///< Chained tuning policy
    typename                InputIteratorT,             ///< Random-access input iterator type for reading input items \iterator
    typename                OutputIteratorT,            ///< Output iterator type for recording the reduced aggregate \iterator
    typename                OffsetT,                    ///< Signed integer type for global offsets
    typename                ReductionOpT>               ///< Binary reduction functor type having member <tt>T operator()(const T &a, const T &b)</tt>
__launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS))
__global__ void DeviceReduceKernel(
    InputIteratorT          d_in,                       ///< [in] Pointer to the input sequence of data items
    OutputIteratorT         d_out,                      ///< [out] Pointer to the output aggregate
    OffsetT                 num_items,                  ///< [in] Total number of input data items
    GridEvenShare<OffsetT>  even_share,                 ///< [in] Even-share descriptor for mapping an equal number of tiles onto each thread block
    ReductionOpT            reduction_op)               ///< [in] Binary reduction functor
{
    // The output value type
    typedef typename If<(Equals<typename std::iterator_traits<OutputIteratorT>::value_type, void>::VALUE),  // OutputT =  (if output iterator's value type is void) ?
        typename std::iterator_traits<InputIteratorT>::value_type,                                          // ... then the input iterator's value type,
        typename std::iterator_traits<OutputIteratorT>::value_type>::Type OutputT;                          // ... else the output iterator's value type

    // Thread block type for reducing input tiles
    typedef AgentReduce<
            typename ChainedPolicyT::ActivePolicy::ReducePolicy,
            InputIteratorT,
            OutputIteratorT,
            OffsetT,
            ReductionOpT>
        AgentReduceT;

    // Shared memory storage
    __shared__ typename AgentReduceT::TempStorage temp_storage;

    // Consume input tiles
    OutputT block_aggregate = AgentReduceT(temp_storage, d_in, reduction_op).ConsumeTiles(even_share);

    // Output result
    if (threadIdx.x == 0)
        d_out[blockIdx.x] = block_aggregate;
}

3.3 小结

目前我们知道了，PyTorch 其实可以通过调用 __global__ 方法来在GPU之上执行并行操作。这回答了我们的第二个问题：如何在 CPU 之上调用 GPU 操作？

0x04 在GPU/CPU之间切换

我们接下来分析如何在GPU/CPU之间切换。

由示例代码可以知道，只要调用了 cuda 函数把模型移动到 GPU 之上，我们就可以使用 CUDA global 核函数在GPU上进行并行运算。

model = ToyModel().cuda(device_ids[0]) # 这里复制模型到 GPU 之上
ddp_model = DDP(model, device_ids)

loss_fn = nn.MSELoss()
optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = ddp_model(torch.randn(20, 10))

但是我们忽略了一个问题，就是 PyTorch 怎么知道此时应该调用GPU对应的 global 核函数？为什么 PyTorch 就不调用 CPU 函数或者其他设备的函数了？这就是我们接下来需要分析的。

4.1 Dispatcher 机制

此处我们主要借鉴 http://blog.ezyang.com/2020/09/lets-talk-about-the-pytorch-dispatcher/ 。

4.1.1 问题

在PyTorch中，operator 所表现出预期行为是由很多机制共同作用导致的，比如：

做实际工作的kernel。
是否支持反向自动微分，例如，使 loss.backward() 正常工作的标记位。
是否启用了torch.jit.trace。
如果你正在vmap调用中，所运行operator将会表现出不同的批处理行为。

因此，我们知道有太多不同的方式可以对PyTorch operator进行不同的解释，如果我们试图在一个名为add的单一函数里面处理所有的行为，我们的实现代码会很快演变成一个不可维护的混乱局面。

所以我们需要有一个机制来解决这个问题，这个机制不仅仅是一个if语句这么简单，而是PyTorch内部一个非常重要的抽象，而且它必须在尽可能不降低PyTorch性能的情况下做到这一点。这个机制就是 Dispatcher。

4.1.2 什么是 Dispatcher

什么是dispatcher？dispatcher对于每个operator都会维护一个函数指针表，这些函数为每个dispatch key提供了对应的实现，这套机制大致对应于PyTorch中的一个横切关注点。在上图中，你可以看到在这个表中有针对不同后端（CPU、CUDA、XLA）以及更高级概念（例如 autograd 和跟踪）的dispatch条目。dispatcher的工作是根据输入的tensor和其他一些东西（比如参数个数，返回值类型等等）来计算出一个dispatch key，然后跳转到函数指针表所指向的函数。

熟悉 C++ 的人可能会注意到，这个函数指针表与C++中的虚表非常相似。在C++中，对象的虚函数是通过将每个对象与一个虚表的指针相关联来实现的，该虚表包含了有关对象上每个虚函数的实现。在PyTorch中，我们基本上重新实现了虚拟表，但有一些区别。

dispatch表之中包括了 dispatch key 和其对应的函数指针，我们可以发现，dispatch key不仅仅有后端（CPU、CUDA、XLA），也有一些更高级的概念（例如 autograd 和跟踪）。
dispatch表是按operator分配的，而虚表是按类分配的。这意味着我们可以通过分配一个新的dispatch表来扩展所支持的operator集。与其不同的是，对于一个C++对象，你可以通过继承子类来扩展类型，但你不能轻易添加虚函数。与普通的面向对象系统不同，PyTorch大部分的可扩展性在于定义新的operator（而不是新的子类），所以这种权衡是合理的。此外，dispatch key的种类不是公开可扩展的，我们希望那些想添加新dispatch key的使用者通过向PyTorch核心团队提交一个补丁来添加他们的dispatch key。
我们的dispatch key的计算考虑了operator的所有参数（multiple dispatch）以及线程本地状态（TLS）。这与虚表不同，在虚表中只有第一个对象（this指针）很重要。
最后，dispatcher支持boxing和unboxing作为op的调用约定的一部分。在文章的最后部分会有更多关于这个的内容。

有趣的历史笔记：我们曾经使用虚函数来实现动态dispatch，当我们意识到需要比虚表更多的能力时，我们重新实现了动态dispatch。

4.1.3 如何计算key

那么，我们究竟是如何计算dispatch key的呢？我们是基于dispatch key set来完成的，dispatch key set是一个基本抽象，它是dispatch key的一个bitset。大致来讲，我们综合来自不同来源的dispatch key sets（在某些情况下屏蔽一些key）来得到一个最终的dispatch key set。然后我们在这个set中挑选优先级最高的key（dispatch keys按某些优先级隐式排序），这就是我们这次应该调用的结果。那么，这些dispatch key sets的来源是什么？

每个张量输入都有一个由该张量上的所有dispatch key组成的dispatch key set（直观地说，这些dispatch key的值会是类似 “CPU”字符串这样的东西，这告诉我们该张量是一个CPU张量，所以应该由dispatch表中的CPU handler来处理）。
我们还有一个local include set，用于 "模态（modal） "功能，例如tracing，它不与任何张量关联，而是某种线程的本地模态，用户可以在某些范围内打开或关闭。
最后，我们有一个global set，它包含了始终应该被考虑的dispatch key（自从写下这张PPT以来，Autograd已经从global set转移到了张量之上。然而系统的高级结构并没有改变）。

除了这些，还有一个local exclude set，其用从dispatch排除某些dispatch key。一个常见的场景是一个handler负责处理一个key，然后通过local exclude set将自己屏蔽掉，这样我们以后就不会尝试重新处理这个key。

4.1.4 注册

我们接下来看看如何注册这个dispatch key 到 dispatch 表之中。这个过程通过operator registration API来实现。操作符注册 API 有三种主要方式：

为operator定义模式。
然后在对应的key上注册实现。
最后，有一个 fallback 方法，用户可以使用它为某个key对应的所有运算符定义同一个处理程序。

为了可视化 operator registration的工作，让我们想象一下，所有op的dispatch表共同形成一个二维网格，像这样：

纵轴上是PyTorch中支持的每个op。
横轴上是系统支持的每个dispatch key。

operator registration 行为就是在这两个轴定义出的单元格中填写对应的实现。

在一个特定的dispatch key上为一个operator注册kernel函数时，我们会填写一个单元格（下面的蓝色）的内容。比如下图就是一个 cpu kernel mul 算子。

用户也可以使用 "catch-all" 来为所有的 dispatch keys 注册同一个kernel，比如下图的红色行。

用户也可以为下图的 "aten::add"，"aten::mul"，"aten::sub" 这样的kernel 指定同一个 dispatch key，如下图绿色列。

这些注册形式有一个优先级：特定的内核实现具有最高优先级，然后是 catch，最后是 fallback，如下图的 1，2，3 顺序，首先选择1，然后是 2，最后是 3。

4.2 Dispatcher 代码

我们接下来通过源码来看看。

4.2.1 虚函数表

4.2.1.1 Schema 例子

每个kernel 算子（虚函数）都有一个对应的schema，我们可以从 aten/src/ATen/native/native_functions.yaml 之中找到一些虚函数 schema 的例子，这些都是以字符串的形式呈现。我们可以看到，schema 包括算子名称（比如zero_sparse_），输入参数个数和类型，返回值类型，是否需要check，如何分发等等。

# zero 操作对应的虚函数表
- func: zero_(Tensor(a!) self) -> Tensor(a!) 
  device_check: NoCheck   # TensorIterator
  variants: method, function
  dispatch:
    CPU, CUDA: zero_
    Meta: zero_meta_
    SparseCPU, SparseCUDA: zero_sparse_
    MkldnnCPU: mkldnn_zero_

# sub.out 对应的虚函数表
- func: sub.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  structured: True
  structured_inherits: TensorIteratorBase
  dispatch:
    CPU, CUDA: sub_out
    SparseCPU, SparseCUDA: sub_out_sparse

# sub.Tensor 对应的虚函数表
- func: sub.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  device_check: NoCheck   # TensorIterator
  variants: function, method
  structured_delegate: sub.out
  dispatch:
    SparseCPU, SparseCUDA: sub_sparse

4.2.1.2 Operator的实现

我们可以看看 zero 的两个实现，下面是MkldnnCPU的实现。

Tensor& mkldnn_zero_(Tensor& self) {
  using Vec = vec::Vectorized<float>;

  ideep::tensor& x = itensor_from_mkldnn(self);

  auto n = x.get_nelems();
  auto* x_ = static_cast<float*>(x.get_data_handle());
  parallel_for(0, n, 2048, [x_](int64_t begin, int64_t end) {
    vec::map(
        [](Vec /* unused */) { return 0.0; },
        x_ + begin,
        x_ + begin,
        end - begin);
  });

  return self;
}

又比如下面是SparseCPU, SparseCUDA 的对应实现：

// --------------------------------------------------------------------
// zero_(SparseTensor)
// --------------------------------------------------------------------
// hummu hummu
SparseTensor& zero_sparse_(SparseTensor& self) {
  AT_ASSERT(self.is_sparse());
  at::zeros_out(self, get_sparse_impl(self)->sizes());
  return self._coalesced_(true);
}

4.2.2 Dispatcher 定义

我们接下来看看Dispatcher的定义，这里只给出部分成员变量。

class TORCH_API Dispatcher final {
private:
  // For direct access to backend fallback information
  friend class impl::OperatorEntry;

  struct OperatorDef final {
    explicit OperatorDef(OperatorName&& op_name)
    : op(std::move(op_name)) {}
    impl::OperatorEntry op;
    size_t def_count = 0;
    size_t def_and_impl_count = 0;
  };
  friend class OperatorHandle;
  template<class> friend class TypedOperatorHandle;

public:

  static Dispatcher& realSingleton();

  //存储所有的算子，并在其成员变量中存储了每个算子的不同版本，比如cpu，cuda，autograd....
  std::list<OperatorDef> operators_;
  //注册算子时会将算子名称和方法也存储在这个里面, 这样就可以快速的通过名字查找到算子方法(其中包含了成员OperatorDef) 
  LeftRight<ska::flat_hash_map<OperatorName, OperatorHandle>> operatorLookupTable_;
  // Map from namespace to debug string (saying, e.g., where the library was defined)
  ska::flat_hash_map<std::string, std::string> libraries_;
  std::array<impl::AnnotatedKernel, static_cast<uint8_t>(DispatchKey::NumDispatchKeys)> backendFallbackKernels_;
  std::unique_ptr<detail::RegistrationListenerList> listeners_;
  std::mutex mutex_;
};

逻辑大致如下，operators_ 存储了所有的算子：

+--------------------------------------------+
| Dispatcher                                 |
|                                            |
|                                            |
|                                            |
|     std::list<OperatorDef> operators_      |
|                                            |
|     operatorLookupTable_                   |
|                                            |
+--------------------------------------------+

4.2.3 注册

我们接下来给出注册虚函数表的方法。

RegistrationHandleRAII Dispatcher::registerImpl(
  OperatorName op_name,
  c10::optional<DispatchKey> dispatch_key,
  KernelFunction kernel,
  c10::optional<impl::CppSignature> cpp_signature,
  std::unique_ptr<FunctionSchema> inferred_function_schema,
  std::string debug
) {
  std::lock_guard<std::mutex> lock(mutex_);
  auto op = findOrRegisterName_(op_name);
  auto handle = op.operatorDef_->op.registerKernel( // 进行注册
    *this,
    dispatch_key,
    std::move(kernel),
    std::move(cpp_signature),
    std::move(inferred_function_schema),
    std::move(debug)
  );

  ++op.operatorDef_->def_and_impl_count;

  return RegistrationHandleRAII([this, op, op_name, dispatch_key, handle] {
    deregisterImpl_(op, op_name, dispatch_key, handle);
  });
}

4.2.3.1 注册表

OperatorEntry代表了一个算子，以及该算子的dispatch table，这里只给出成员变量。

class TORCH_API OperatorEntry final { //代表了一个算子，以及该算子的dispatch table
public:
  OperatorName name_;
  c10::optional<AnnotatedSchema> schema_;
  //存储了不同key对应的算子实现版本，比如cpu，cuda，autograd 等等，所有的算子版本都会在这个table里面
  std::array<KernelFunction, static_cast<uint8_t>(DispatchKey::NumDispatchKeys)> dispatchTable_;
  DispatchKeyExtractor dispatchKeyExtractor_;
  //不同 DispatchKey对应了不同的版本的kernel算子实现版本
  ska::flat_hash_map<DispatchKey, std::list<AnnotatedKernel>> kernels_;
};

逻辑如下：

+---------------------------+     +------------------------------------------+
| OperatorEntry             |     |                                          |
|                           |     | std::array<KernelFunction, uint8_t>      |
|                           |     |                                          |
|                           |     |                                          |
|                           |     |           int('CPU') : CPU_kernel        |
|        dispatchTable_ +-------> |                                          |
|                           |     |           int('GPU') : GPU_kernel        |
|                           |     |                                          |
|                           |     |           ......                         |
|                           |     |                                          |
|                           |     |           int('Metal') : Metal_kernel    |
|                           |     |                                          |
+---------------------------+     +------------------------------------------+

4.2.3.2 注册行为

最终注册行为就是往 dispatchTable_ 之中设置。

void OperatorEntry::updateDispatchTableEntry_(const c10::Dispatcher& dispatcher, DispatchKey dispatch_key) {
  auto dispatch_ix = static_cast<uint8_t>(dispatch_key);
  dispatchTable_[dispatch_ix] = computeDispatchTableEntry(dispatcher, dispatch_key);
  dispatchKeyExtractor_.setOperatorHasFallthroughForKey(dispatch_key, dispatchTable_[dispatch_ix].isFallthrough());
}

所以 Dispatcher 数据结构拓展近似如下，这里包含了两个OperatorEntry，分别对应了op1和op2，就是说，目前系统中一共有两个operator，每个 operator 有4个kernel函数，分别对应了CPU，GPU等四个后端。

+-----------------------------------------+
| Dispatcher                              |
|                                         |
|                                         |
|   std::list<OperatorDef> operators_ +--------+
|                                         |    |
|                                         |    |
|   operatorLookupTable_                  |    |
|                                         |    |
+-----------------------------------------+    |
                                               |
                                               |
                                               v
           +-----------------------------------+------------------------------------------+
           |  +---------------------------+     +--------------------------------------+  |
           |  | OperatorEntry             |     |                                      |  |
           |  |                           |     | std::array<KernelFunction, uint8_t>  |  |
           |  |                           |     |                                      |  |
           |  |        name_ = op1        |     |                                      |  |
           |  |                           |     |           int('CPU') : op1_cpu       |  |
           |  |        dispatchTable_ +-------> |                                      |  |
           |  |                           |     |           int('GPU') : op1_gpu       |  |
           |  |                           |     |                                      |  |
           |  |                           |     |           int('XLA') : op1_xla       |  |
           |  |                           |     |                                      |  |
           |  |                           |     |           int('Metal') : op1_metal   |  |
           |  |                           |     |                                      |  |
           |  +---------------------------+     +--------------------------------------+  |
           |                                                                              |
           |                                                                              |
           |  +---------------------------+     +--------------------------------------+  |
           |  | OperatorEntry             |     |                                      |  |
           |  |                           |     | std::array<KernelFunction, uint8_t>  |  |
           |  |                           |     |                                      |  |
           |  |        name_ = op2        |     |                                      |  |
           |  |                           |     |           int('CPU') : op2_cpu       |  |
           |  |        dispatchTable_ +-------> |                                      |  |
           |  |                           |     |           int('GPU') : op2_gpu       |  |
           |  |                           |     |                                      |  |
           |  |                           |     |           int('XLA') : op2_xla       |  |
           |  |                           |     |                                      |  |
           |  |                           |     |           int('Metal') : op2_metal   |  |
           |  |                           |     |                                      |  |
           |  +---------------------------+     +--------------------------------------+  |
           +------------------------------------------------------------------------------+

4.2.4 如何dispatch

4.2.4.1 调度依据

PyTorch 之中会依据dtype、device和layout的不同来调度不同的operator。

大多数类型（比如int32）可以使用模版方式直接进行映射，但是某些operator 不支持模版功能，就需要dispatcher这样的动态调度器。
PyTorch的tensor不仅可以运行在CPU上，还可以跑在GPU，mkldnn和xla等设备，这也需要动态调度。
layout是指tensor中元素的排布，这就有strided layout和sparse layout的区别，所以也需要动态调度。

4.2.4.2 调度代码

我们这里这是给出部分代码，有兴趣的读者继续继续深入。

算子调度的逻辑是：

通过 dispatcher 类 + operator name + 操作类型等联合的形式来查找对应的算子 schema，算子的schema 定义了本算子的输入/输出/参数等等的相关信息。
调用 dispatcher::call 完成算子操作。
1. 得到 dispatcher 中的 dispatchKetSet。
2. 利用 op.lookup 找到最高优先级的 key，并且依据 key 找到对应的 KernelFunction。
3. 调用 kernel。

首先，具体以range的定义来看看如何查找schema，具体在 findSchemaOrThrow 内部是通过operatorLookupTable_来查找op：

at::Tensor range::call(const at::Scalar & start, const at::Scalar & end, c10::optional<at::ScalarType> dtype, c10::optional<at::Layout> layout, c10::optional<at::Device> device, c10::optional<bool> pin_memory) {
    static auto op = c10::Dispatcher::singleton()
        .findSchemaOrThrow("aten::range", "")
        .typed<at::Tensor (const at::Scalar &, const at::Scalar &, c10::optional<at::ScalarType>, c10::optional<at::Layout>, c10::optional<at::Device>, c10::optional<bool>)>();
    return op.call(start, end, dtype, layout, device, pin_memory);
}

其次，Dispatcher::call 定义如下：

template<class Return, class... Args>
C10_DISPATCHER_INLINE_UNLESS_MOBILE Return Dispatcher::call(const TypedOperatorHandle<Return(Args...)>& op, Args... args) const {
  detail::unused_arg_(args...);  
 
  // 得到key set
  auto dispatchKeySet = op.operatorDef_->op.dispatchKeyExtractor()
    .template getDispatchKeySetUnboxed<Args...>(args...);
  TORCH_INTERNAL_ASSERT_DEBUG_ONLY(!c10::isAliasDispatchKey(dispatchKeySet.highestPriorityTypeId()));
  
  // 得到算子
  const KernelFunction& kernel = op.operatorDef_->op.lookup(dispatchKeySet.highestPriorityTypeId());
  
  // 进行调度
#ifndef PYTORCH_DISABLE_PER_OP_PROFILING
  bool pre_sampled = false;
  if (C10_UNLIKELY(at::shouldRunRecordFunction(&pre_sampled))) {
    return callWithDispatchKeySlowPath<Return, Args...>(op, pre_sampled, dispatchKeySet, kernel, std::forward<Args>(args)...);
  }
#endif  // PYTORCH_DISABLE_PER_OP_PROFILING
  return kernel.template call<Return, Args...>(op, dispatchKeySet, std::forward<Args>(args)...);
}

4.2.4.3 key

我们接下来看看key的定义，因为太多，所以我们只给出部分数值。

enum class DispatchKey : uint8_t {
  CPU, // registered at build/aten/src/ATen/RegisterCPU.cpp
  CUDA, // registered at build/aten/src/ATen/RegisterCUDA.cpp
  HIP, // NB: I think this is not actually used, due to Note [Masquerading as
  // CUDA]
  FPGA, // Xilinx support lives out of tree at
  // https://gitlab.com/pytorch-complex/vitis_kernels
  MSNPU, // unused externally, but tested at
  // test/cpp_extensions/msnpu_extension.cpp
  XLA, // lives out of tree at https://github.com/pytorch/xla
  MLC, // lives out of tree at https://github.com/pytorch/MLCompute
  Vulkan,
  Metal,
  XPU, // For out of tree Intel's heterogeneous computing plug-in
  HPU, // For out of tree & closed source integration of HPU / Habana
  VE, // For out of tree & closed source integration of SX-Aurora / NEC
  Lazy, // For lazy tensor backends
  // A meta tensor is a tensor without any data associated with it.  (They
  // have also colloquially been referred to as tensors on the "null" device).
  // A meta tensor can be used to dry run operators without actually doing any
  // computation, e.g., add on two meta tensors would give you another meta
  // tensor with the output shape and dtype, but wouldn't actually add anything.
  Meta,
  // Here are backends which specify more specialized operators
  // based on the dtype of the tensor.
  QuantizedCPU, // registered at build/aten/src/ATen/RegisterQuantizedCPU.cpp
  QuantizedCUDA, // registered at build/aten/src/ATen/RegisterQuantizedCUDA.cpp
  QuantizedXPU, // For out of tree Intel's heterogeneous computing plug-in
  // This backend is to support custom RNGs; it lets you go
  // to a different kernel if you pass in a generator that is not a
  // traditional CPUGeneratorImpl/CUDAGeneratorImpl.  To make use of this
  // key:
  //  1) set it as a second parameter of at::Generator constructor call in
  //     the user-defined PRNG class.
  //  2) use it as a dispatch key while registering custom kernels
  //     (templatized kernels specialized for user-defined PRNG class)
  // intended for out of tree use; tested by aten/src/ATen/test/rng_test.cpp
  CustomRNGKeyId,

  // Here are backends which specify more specialized operators
  // based on the layout of the tensor.  Note that the sparse backends
  // are one case where ordering matters: sparse multi-dispatches with
  // the corresponding dense tensors, and must be handled before them.
  MkldnnCPU, // registered at build/aten/src/ATen/RegisterMkldnnCPU.cpp
  // NB: not to be confused with MKLDNN, which is Caffe2 only
  SparseCPU, // registered at build/aten/src/ATen/RegisterSparseCPU.cpp
  SparseCUDA, // registered at build/aten/src/ATen/RegisterSparseCUDA.cpp
  SparseHIP, // TODO: I think this is not actually used, due to Note
  // [Masquerading as CUDA]
  SparseXPU, // For out of tree Intel's heterogeneous computing plug-in
  SparseVE, // For out of tree & closed source integration of SX-Aurora / NEC
  SparseCsrCPU,
  SparseCsrCUDA,

  AutogradOther,
  AutogradCPU,
  AutogradCUDA,
  AutogradXLA,
  AutogradLazy,
  AutogradXPU,
  AutogradMLC,
  AutogradHPU,

  ......
};

4.2.4.4 key的使用

因为篇幅所限，我们无法深入分析每一种情况，这里只给出从 DeviceType 出发的情景。我们从下面函数可以看到，如何从 DeviceType 映射到 DispatchKey 类型。

template <typename Func>
inline CppFunction dispatch(c10::DeviceType type, Func&& raw_f) {
  auto deviceTypeToDispatchKey = [](c10::DeviceType t){
    switch (t) {
      // This list is synchronized with the k-constants in c10/core/DeviceType.h
      case c10::DeviceType::CPU:
        return c10::DispatchKey::CPU;
      case c10::DeviceType::CUDA:
        return c10::DispatchKey::CUDA;
      case c10::DeviceType::XLA:
        return c10::DispatchKey::XLA;
      case c10::DeviceType::Lazy:
        return c10::DispatchKey::Lazy;
      case c10::DeviceType::MLC:
        return c10::DispatchKey::MLC;
      case c10::DeviceType::Meta:
        return c10::DispatchKey::Meta;
      case c10::DeviceType::HIP:
        return c10::DispatchKey::HIP;
      case c10::DeviceType::MSNPU:
        return c10::DispatchKey::MSNPU;
      case c10::DeviceType::HPU:
        return c10::DispatchKey::HPU;
      default:
        TORCH_CHECK(false,
          "Device type ", t, " cannot be overloaded at dispatch time, "
          "please file a bug report explaining what you were trying to do.");
    }
  };
  return dispatch(deviceTypeToDispatchKey(type), std::forward<Func>(raw_f));
}

4.3 小结

至此，我们知道，通过 Dispatcher 机制，PyTorch 可以依据dtype、device和layout的不同来调度不同的operator。这就解答了我们第三个问题：如何在 CPU，GPU 操作之间无缝切换？

关于第四个问题：是否需要把损失函数移动到 GPU 之上？，我们也有了解答：

损失函数的参数是前向传播的outputs和label，outputs已经在GPU之上（因为训练数据已经在GPU之上），label 也被用户手动设置到GPU之上。所以损失函数的参数都已经在GPU之上，这样 Dispather 就依据device会调用到GPU对应的operator，所以不需要把损失函数移动到GPU之上。

我们整理一个总体逻辑如下，序列是：

把训练数据 inputs 移动到GPU。
进行前向操作，假设只有一个operator，就是 op1，使用 device='GPU' 这个 dispatch key 去 Dispatcher 查找。
找到了 op1-gpu 这个operator，进行计算，得出 outputs。
outputs 就自动存在于 GPU 之上。
把 Labels 也放到 GPU 之上。
进行损失函数运算，假设只有一个 operator，就是 op2，此时损失函数的参数都在GPU之上，所以使用 device= 'GPU' 这个 dispatch key 去 Dispatcher 查找。
找到了 op2-gpu 这个operator，进行计算，得出 loss。

                           +--------------------+
         +-----------+     | Forward            |      +------------+     +------------------+
         | GPU       |     |                    |      | GPU        |     | Loss Function    |
         |           +---> |    op1   op1-gpu() +----> |            +---> |                  |   +--------+
         |   Inputs  | 1   |                    |  4   |   Outputs  |     |                  |   | GPU    |
         |           |     |     +        ^     |      |            |     |                  |   |        |
         +-----------+     |     |        |     |      +------------+     |  op2   op2-gpu() +-->+  loss  |
                           |     |        |     |                         |                  |   |        |
                           +--------------------+      +------------+     |   +        ^     |   |        |
                                 |        |            | GPU        | 5   |   |        |     |   +--------+
                                 |        |            |            +---> |   | 6      | 7   |
                               2 |        | 3          |   Labels   |     |   |        |     |
                                 |        |            |            |     |   |        |     |
                                 |        |            +------------+     +------------------+
    +----------------------------+        +--------------------------------+  |        |
    |                                                                      |  |        |
+-----------------------------------------------------------------------------+        |
|   |                                                                      |           |
|   |          +-------------------------------------------------------+   |           |
|   |          | Dispather                                             |   |           |
|   |          |       +          +          +            +            |   |           |
|   |          |       |   XLA    |   CPU    |    Metal   |    GPU     |   |           |
|   |          | +---------------------------------------------------+ |   |           |
|   |          |       |          |          |            |            |   |           |
|   +--------> |   OP1 | op1-xla  | op1-cpu  |  op1-metal |  op1-gpu   +---+           |
| 'device=GPU' |       |          |          |            |  +------+  |               |
|              | +---------------------------------------------------+ |               |
|              |       |          |          |            |            |               |
+------------> |   OP2 | op2-xla  | op2-cpu  |  op2-metal |  op2-gpu   +---------------+
  'device=GPU' |       |          |          |            |  +------+  |
               | +---------------------------------------------------+ |
               |       |          |          |            |            |
               |   OP3 | op3-xla  | op3-cpu  |  op3-metal |  op3-gpu   |
               |       |          |          |            |            |
               | +---------------------------------------------------+ |
               +-------------------------------------------------------+

手机如下：