PyTorch Profiler:简介、主要功能、数据流、环境配置和测试分析@CUDA
1. PyTorch Profiler 简介
1.1 什么是 PyTorch Profiler?
PyTorch Profiler 是 PyTorch 官方提供的统一性能分析工具,目标是帮助开发者:
- 精确定位性能瓶颈(CPU/GPU)
- 分析显存使用
- 查看算子执行顺序与耗时
- 分析分布式训练与通信瓶颈
- 输出可视化 trace(TensorBoard / Chrome)
核心组件来自 Kineto,可捕获 Python、C++、CUDA、NCCL、Memory 等层面的事件。

输出 JSON 事件示例:
{ "schemaVersion": 1, "deviceProperties": [ { "id": 0, "name": "NVIDIA GeForce RTX 4090", "totalGlobalMem": 25250627584, "computeMajor": 8, "computeMinor": 9, "maxThreadsPerBlock": 1024, "maxThreadsPerMultiprocessor": 1536, "regsPerBlock": 65536, "warpSize": 32, "sharedMemPerBlock": 49152, "numSms": 128 , "regsPerMultiprocessor": 65536, "sharedMemPerBlockOptin": 101376, "sharedMemPerMultiprocessor": 102400... ], ... "with_flops": 1, "record_shapes": 1, "with_stack": 1, "profile_memory": 1, "traceEvents": [ { "ph": "X", "cat": "cpu_op", "name": "autograd::engine::evaluate_function: NllLossBackward0", "pid": 586214, "tid": 589500, "ts": 7700400446421.433, "dur": 124.742, "args": { "External id": 513,"Record function id": 0, "Sequence number": 32, "Fwd thread id": 1, "Ev Idx": 0 } }, ... { "name": "Record Window End", "ph": "i", "s": "g", "pid": "", "tid": "", "ts": 7700400500584.044 } ], "traceName": "./logs/training_profile/powerleader_586214.1767000474530444484.pt.trace.json", "displayTimeUnit": "ms", "baseTimeNanoseconds": 1759300074000000000 }
1.2 核心价值
PyTorch Profiler 的目标是“从算子级别理解 PyTorch 模型在 CPU/GPU 上的真实运行性能瓶颈”。
核心价值总结:
| 价值点 | 说明 |
| --- | --- |
| 精准定位性能瓶颈 | 自动捕获每一层(算子)的执行时间、内存占用、调用栈,帮助定位最耗时的算子。 |
| 跨设备分析 (CPU/GPU) | 同时支持 CPU、CUDA、ROCm Profiling,查看算子在哪个设备上耗时。 |
| 显存使用追踪 | 记录模型在每一时刻的显存分配与释放,快速发现内存泄漏或溢出。 |
| 多维度可视化分析 | 提供多种可视化后端(TensorBoard、Chrome、Perfetto),支持火焰图、timeline、表格分析等。 |
| 可与训练循环无缝集成 | 可嵌入 training loop 中自动捕获 forward/backward/optimizer step 各阶段性能,支持 schedule 控制(warmup、active),用法见下方示例。 |
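下面是 schedule 的一个最小用法示意(假设是任意一个会重复调用 `prof.step()` 的循环,这里用随机矩阵乘法占位,完整训练循环的用法见第 5 节):

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

# 一个周期:跳过 1 step -> 预热 1 step -> 记录 3 step,共执行 1 轮
sched = schedule(wait=1, warmup=1, active=3, repeat=1)

with profile(activities=[ProfilerActivity.CPU], schedule=sched) as prof:
    for step in range(5):
        torch.randn(128, 128) @ torch.randn(128, 128)  # 占位的“训练”负载
        prof.step()  # 通知 profiler 当前 step 结束

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```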
1.3 支持平台
| 平台 | 支持状态 | 主要后端 |
| --- | --- | --- |
| NVIDIA GPU | ✅ 完整支持 | CUPTI |
| AMD GPU | ✅ 支持 | ROCtracer |
| Intel GPU | ✅ 支持 | VTune/XPU |
| CPU | ✅ 完整支持 | perf_event |
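无论哪种 GPU 平台,用户侧 API 基本一致。下面是一个按可用后端选择 activities 的小示意(注:ROCm 构建同样通过 `torch.cuda` 接口暴露):

```python
import torch
from torch.profiler import ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():  # CUDA 与 ROCm 构建均可通过该接口判断
    activities.append(ProfilerActivity.CUDA)
print("Profiling activities:", activities)
```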
2. 主要功能特性
2.1 功能矩阵
| 功能类别 | 具体能力 | 说明 |
| --- | --- | --- |
| ⏱️ 性能统计 | 分析每个算子耗时(CPU/GPU) | 找出最慢算子 |
| 💾 内存分析 | CPU/GPU 内存分配与释放 | profile_memory=True |
| 📉 时间线追踪 | 多线程/多GPU 时间轴 | Chrome trace / TensorBoard 可视化 |
| 🧩 Kernel 层分析 | CUDA Kernel 执行、重叠情况 | 通过 Kineto 捕获 |
| 🔗 通信分析 | DDP/NCCL 通信开销 | 通信与计算 overlap |
| 🔍 调用栈追踪 | 反向映射到 Python 代码 | with_stack=True |
| 📦 数据加载分析 | DataLoader 线程与 batch 时间 | 发现数据瓶颈 |
| 🔧 自定义事件标记 | torch.profiler.record_function("my_block") | 用于自定义代码块的 profiling |
| 🧭 逐步分析 (schedule) | 定义 warmup/active/skip 阶段 | 控制采样窗口,减少开销 |
| 📊 可视化导出 | TensorBoard / Chrome trace / JSON | 支持多种格式导出 |
| 🔄 分布式模式支持 | 多 rank trace 同步 | 用于多 GPU 训练分析 |
Profiler 能分析的主要模块:
| 模块类别 | 内容 | 说明 |
| --- | --- | --- |
| 🧠 算子 (Operators) | aten::conv2d, aten::matmul, aten::relu 等 | 分析每个算子的执行时间、显存、调用次数 |
| ⚡ GPU Kernel (CUDA Kernels) | CUDA kernel 执行时间线 | 对应 GPU 上的实际执行 kernel,显示 launch、执行时间 |
| 🧩 内存 (Memory) | CPU/GPU 内存分配、释放、峰值 | 启用 profile_memory=True 后追踪 |
| 🔗 通信 (NCCL / Distributed) | ncclAllReduce, ncclBroadcast 等 | 用于多 GPU 分布式训练分析通信开销 |
| 📦 DataLoader / I/O | 数据加载线程、CPU→GPU 数据拷贝 | 分析 I/O 或数据预处理瓶颈 |
| ⚙️ Autograd Engine | 前向/反向计算、梯度计算耗时 | 对每层反向传播开销分析 |
| 🧵 Python 调用栈 | 每个算子的 Python 层调用位置 | 启用 with_stack=True 可追踪 |
| 🕓 时间线 (Timeline) | 多线程/多设备执行时间分布 | 用于可视化各事件交叠与同步 |
| 🌐 分布式 (DDP) | 各 rank 训练时间线对齐分析 | 可发现通信等待或负载不均 |
| 🔬 自定义事件 (Custom Markers) | 用户自定义区域标记 | 用于细分自定义逻辑的耗时 |
2.2 架构总览

3. 数据收集流程

4. 环境配置与隔离
首先安装 Miniconda,再按下面的步骤创建隔离环境。
4.1 创建隔离的 Python 环境
NVIDIA GPU 环境(主机驱动为 CUDA 13.0,这里安装 cu121 版 PyTorch wheel,驱动向下兼容运行时):
```bash
conda create -n torch-profiler python=3.10 -y
conda activate torch-profiler
pip install --upgrade pip
pip install torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/cu121
pip install tensorboard

# 验证安装
python - << 'EOF'
import torch
print("torch:", torch.__version__)
print("compiled cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("gpu:", torch.cuda.get_device_name(0))
EOF
```
4.2 环境验证脚本
最小 CUDA Kernel Trace(sanity 测试):
```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1024, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]
) as prof:
    y = x @ x

print(prof.key_averages().table(sort_by="cuda_time_total"))
```
得到:
```
---------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ----------
Name                                                       Self CPU %    Self CPU      CPU total %   CPU total     CPU time avg  Self CUDA     Self CUDA %   CUDA total    CUDA time avg  # of Calls
---------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ----------
aten::matmul                                               0.13%         32.038us      99.98%        24.583ms      24.583ms      0.000us       0.00%         48.128us      48.128us       1
aten::mm                                                   32.33%        7.950ms       99.85%        24.551ms      24.551ms      48.128us      100.00%       48.128us      48.128us       1
void cutlass::Kernel<cutlass_80_simt_sgemm_128x64_8x...    0.00%         0.000us       0.00%         0.000us       0.000us       48.128us      100.00%       48.128us      48.128us       1
cudaFree                                                   55.61%        13.675ms      55.61%        13.675ms      6.837ms       0.000us       0.00%         0.000us       0.000us        2
cudaDeviceGetAttribute                                     0.02%         5.566us       0.02%         5.566us       0.371us       0.000us       0.00%         0.000us       0.000us        15
cudaGetSymbolAddress                                       0.50%         124.003us     0.50%         124.003us     124.003us     0.000us       0.00%         0.000us       0.000us        1
cudaMalloc                                                 0.83%         203.027us     0.83%         203.027us     67.676us      0.000us       0.00%         0.000us       0.000us        3
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...    10.45%        2.569ms       10.45%        2.569ms       1.285ms       0.000us       0.00%         0.000us       0.000us        2
cudaLaunchKernel                                           0.10%         24.312us      0.10%         24.312us      24.312us      0.000us       0.00%         0.000us       0.000us        1
cudaDeviceSynchronize                                      0.02%         5.363us       0.02%         5.363us       5.363us       0.000us       0.00%         0.000us       0.000us        1
---------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ----------
Self CPU time total: 24.588ms
Self CUDA time total: 48.128us
```
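可以看到 cudaFree、cudaMalloc、cudaOccupancy... 等 CUDA runtime 调用占据了绝大部分 CPU 时间,这主要是首次 CUDA 调用触发的上下文初始化与缓存分配器的一次性开销;真正的 GEMM kernel(cutlass sgemm)只有 48.128us。如果先做一次预热再 profile,CPU 侧占比会显著下降。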
环境检测:
```python
# environment_check.py
import torch
import sys

def check_environment():
    print("=" * 50)
    print("Environment Check Report")
    print("=" * 50)

    # Python 环境
    print(f"Python version: {sys.version}")
    print(f"PyTorch version: {torch.__version__}")

    # CUDA/ROCm 环境
    cuda_available = torch.cuda.is_available()
    print(f"CUDA/ROCm available: {cuda_available}")
    if cuda_available:
        print(f"GPU device count: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"CUDA version: {torch.version.cuda}")

        # 测试基本功能
        x = torch.randn(3, 3).cuda()
        y = torch.randn(3, 3).cuda()
        z = x + y
        print(f"GPU computation test: {z.shape}")

    # Profiler 可用性
    try:
        from torch.profiler import profile, ProfilerActivity
        print("Profiler module: ✅ Available")
    except ImportError:
        print("Profiler module: ❌ Not available")

    print("=" * 50)

if __name__ == "__main__":
    check_environment()
```
得到:
```
==================================================
Environment Check Report
==================================================
Python version: 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0]
PyTorch version: 2.5.1+cu121
CUDA/ROCm available: True
GPU device count: 2
  GPU 0: NVIDIA GeForce RTX 4090
  GPU 1: NVIDIA GeForce RTX 4090
CUDA version: 12.1
GPU computation test: torch.Size([3, 3])
Profiler module: ✅ Available
==================================================
```
5. 使用方法详解
5.1 基础使用模式
模式一:简单上下文分析
```python
import torch
import torch.profiler

def simple_profiling():
    model = torch.nn.Linear(1000, 500).cuda()
    x = torch.randn(64, 1000).cuda()

    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        profile_memory=True,
        record_shapes=True
    ) as prof:
        for _ in range(10):
            y = model(x)
            loss = y.sum()
            loss.backward()

    # 控制台查看结果
    print(prof.key_averages().table(
        sort_by="cuda_time_total",
        row_limit=10
    ))

if __name__ == "__main__":
    simple_profiling()
```
运行 python test_simple_context.py,结果如下:
```
-------------------------------------------------------  ----------  ----------  -----------  ----------  ------------  ---------  -----------  ----------  -------------  --------  ------------  ---------  -------------  ----------
Name                                                     Self CPU %  Self CPU    CPU total %  CPU total   CPU time avg  Self CUDA  Self CUDA %  CUDA total  CUDA time avg  CPU Mem   Self CPU Mem  CUDA Mem   Self CUDA Mem  # of Calls
-------------------------------------------------------  ----------  ----------  -----------  ----------  ------------  ---------  -----------  ----------  -------------  --------  ------------  ---------  -------------  ----------
autograd::engine::evaluate_function: AddmmBackward0      0.22%       153.712us   14.73%       10.136ms    1.014ms       0.000us    0.00%        91.260us    9.126us        0 b       0 b           27.22 Mb   0 b            10
aten::linear                                             0.11%       78.697us    45.20%       31.097ms    3.110ms       0.000us    0.00%        73.024us    7.302us        0 b       0 b           9.35 Mb    0 b            10
aten::addmm                                              12.26%      8.433ms     44.81%       30.830ms    3.083ms       73.024us   29.67%       73.024us    7.302us        0 b       0 b           9.35 Mb    9.35 Mb        10
aten::sum                                                0.81%       557.580us   20.54%       14.129ms    706.441us     71.168us   28.92%       71.168us    3.558us        0 b       0 b           25.00 Kb   25.00 Kb       20
void at::native::reduce_kernel<512, 1, at::native::R...  0.00%       0.000us     0.00%        0.000us     0.000us       71.168us   28.92%       71.168us    3.558us        0 b       0 b           0 b        0 b            20
AddmmBackward0                                           0.17%       115.482us   14.08%       9.687ms     968.705us     0.000us    0.00%        62.878us    6.288us        0 b       0 b           27.20 Mb   0 b            10
aten::mm                                                 0.80%       549.090us   13.74%       9.452ms     945.165us     49.023us   19.92%       62.878us    6.288us        0 b       0 b           27.20 Mb   25.98 Mb       10
ampere_sgemm_64x32_sliced1x4_tn                          0.00%       0.000us     0.00%        0.000us     0.000us       55.455us   22.53%       55.455us    5.545us        0 b       0 b           0 b        0 b            10
ampere_sgemm_32x128_nn                                   0.00%       0.000us     0.00%        0.000us     0.000us       49.023us   19.92%       49.023us    4.902us        0 b       0 b           0 b        0 b            10
autograd::engine::evaluate_function: torch::autograd...  0.17%       114.779us   13.40%       9.222ms     461.098us     0.000us    0.00%        30.241us    1.512us        0 b       0 b           -17.19 Mb  0 b            20
-------------------------------------------------------  ----------  ----------  -----------  ----------  ------------  ---------  -----------  ----------  -------------  --------  ------------  ---------  -------------  ----------
Self CPU time total: 68.804ms
Self CUDA time total: 246.111us
```
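由于开启了 record_shapes=True,还可以按输入形状聚合统计,方便区分同名算子在不同 shape 下的开销(示意,需在 simple_profiling() 中拿到 prof 对象后调用):

```python
print(prof.key_averages(group_by_input_shape=True).table(
    sort_by="cuda_time_total", row_limit=10
))
```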
模式二:完整训练循环分析
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def get_sample_dataloader(batch_size=32):
    """创建示例数据加载器"""
    data = torch.randn(1000, 3, 224, 224)
    targets = torch.randint(0, 10, (1000,))
    dataset = TensorDataset(data, targets)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)

def build_model():
    """创建简单模型"""
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(64, 128, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(128, 10)
    )

def training_profiling():
    model = build_model().cuda()
    optimizer = torch.optim.Adam(model.parameters())
    dataloader = get_sample_dataloader()

    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        schedule=torch.profiler.schedule(
            wait=1,      # 跳过第1个step
            warmup=1,    # 预热1个step(不记录)
            active=3,    # 记录3个step
            repeat=2     # 重复2轮
        ),
        on_trace_ready=torch.profiler.tensorboard_trace_handler(
            './logs/training_profile'
        ),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,   # 记录调用栈(可定位到代码行)
        with_flops=True    # 估算FLOPs
    ) as prof:
        for step, (inputs, targets) in enumerate(dataloader):
            if step >= 10:  # 只分析前10个step
                break
            inputs, targets = inputs.cuda(), targets.cuda()

            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            prof.step()  # 重要:通知profiler步骤完成

    return prof

if __name__ == "__main__":
    prof = training_profiling()
    # 启动TensorBoard查看结果
    print("Run: tensorboard --logdir=./logs/training_profile")
```
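这里 schedule 的一个周期为 wait + warmup + active = 1 + 1 + 3 = 5 个 step,repeat=2 即两个周期共 10 个 step,这正是循环在 step >= 10 时退出的原因;每个周期的 active 阶段结束时,tensorboard_trace_handler 会各写出一份 trace 文件。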
执行结果如下:

5.2 高级使用技巧
5.2.1 自定义代码段标记
```python
# test_custom_ranges.py
import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity, record_function

def build_model():
    return models.resnet18()

def get_dataloader(batch_size=16):
    x = torch.randn(batch_size, 3, 224, 224)
    y = torch.randint(0, 1000, (batch_size,))
    yield x, y

def test_custom_ranges():
    device = "cuda"
    model = build_model().to(device)
    model.train()

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
    ) as prof:
        for x, y in get_dataloader():
            x, y = x.to(device), y.to(device)

            with record_function("data_preprocessing"):
                x = (x - x.mean()) / x.std()

            with record_function("model_inference"):
                with record_function("forward"):
                    out = model(x)
                with record_function("loss"):
                    loss = torch.nn.functional.cross_entropy(out, y)
                with record_function("backward"):
                    loss.backward()

            break  # 只跑一个 batch

    prof.export_chrome_trace("custom_ranges.json")
    print("Exported custom_ranges.json")
    print(
        prof.key_averages()
        .table(sort_by="cuda_time_total", row_limit=10)
    )

if __name__ == "__main__":
    test_custom_ranges()
```
测试结果如下:
```
Exported custom_ranges.json
-------------------------------------------------------  ----------  ----------  -----------  ----------  ------------  ----------  -----------  ----------  -------------  --------  ------------  ----------  -------------  ----------
Name                                                     Self CPU %  Self CPU    CPU total %  CPU total   CPU time avg  Self CUDA   Self CUDA %  CUDA total  CUDA time avg  CPU Mem   Self CPU Mem  CUDA Mem    Self CUDA Mem  # of Calls
-------------------------------------------------------  ----------  ----------  -----------  ----------  ------------  ----------  -----------  ----------  -------------  --------  ------------  ----------  -------------  ----------
forward                                                  0.00%       0.000us     0.00%        0.000us     0.000us       111.210ms   1517.90%     111.210ms   55.605ms       0 b       0 b           0 b         0 b            2
data_preprocessing                                       0.00%       0.000us     0.00%        0.000us     0.000us       29.841ms    407.30%      29.841ms    29.841ms       0 b       0 b           0 b         0 b            1
loss                                                     0.00%       0.000us     0.00%        0.000us     0.000us       4.794ms     65.43%       4.794ms     4.794ms        0 b       0 b           0 b         0 b            1
autograd::engine::evaluate_function: ConvolutionBack...  0.08%       353.866us   10.43%       46.475ms    2.324ms       0.000us     0.00%        3.573ms     178.654us      0 b       0 b           -54.71 Mb   -223.25 Mb     20
ConvolutionBackward0                                     0.04%       188.077us   10.31%       45.948ms    2.297ms       0.000us     0.00%        3.472ms     173.601us      0 b       0 b           168.54 Mb   0 b            20
aten::convolution_backward                               3.16%       14.065ms    10.27%       45.760ms    2.288ms       3.472ms     47.39%       3.472ms     173.601us      0 b       0 b           168.54 Mb   124.56 Mb      20
model_inference                                          0.02%       91.217us    52.06%       232.036ms   232.036ms     0.000us     0.00%        1.977ms     1.977ms        0 b       0 b           341.13 Mb   0 b            1
forward                                                  0.83%       3.694ms     48.43%       215.872ms   215.872ms     0.000us     0.00%        1.973ms     1.973ms        0 b       0 b           341.07 Mb   -10.94 Mb      1
aten::conv2d                                             0.02%       110.351us   37.87%       168.806ms   8.440ms       0.000us     0.00%        1.459ms     72.973us       0 b       0 b           153.44 Mb   0 b            20
aten::convolution                                        0.07%       291.180us   37.85%       168.696ms   8.435ms       0.000us     0.00%        1.459ms     72.973us       0 b       0 b           153.44 Mb   0 b            20
-------------------------------------------------------  ----------  ----------  -----------  ----------  ------------  ----------  -----------  ----------  -------------  --------  ------------  ----------  -------------  ----------
Self CPU time total: 445.709ms
Self CUDA time total: 7.327ms
```
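注意表中 forward、data_preprocessing 等自定义区间的 Self CUDA % 超过了 100%(如 1517.90%):这通常是因为用户标记区间会把其包裹的子算子所发起的 kernel 时间计入自身,与底层算子行存在重复统计,因此该百分比只适合在同层级的自定义区间之间横向比较。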

5.2.2 内存泄漏调试
```python
# test_memory_debug.py
import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity

def build_model():
    return models.resnet18()

def get_dataloader(batch_size=32, steps=5):
    for _ in range(steps):
        x = torch.randn(batch_size, 3, 224, 224)
        y = torch.randint(0, 1000, (batch_size,))
        yield x, y

def test_memory_debug():
    device = "cuda"

    # 开启 CUDA allocator 记录
    torch.cuda.memory._record_memory_history(True)

    try:
        model = build_model().to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

        with profile(
            activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
            profile_memory=True,
            with_stack=True,
            record_shapes=True,
        ) as prof:
            for epoch in range(3):
                for x, y in get_dataloader():
                    x, y = x.to(device), y.to(device)
                    out = model(x)
                    loss = torch.nn.functional.cross_entropy(out, y)
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

                print(
                    f"[Epoch {epoch}] "
                    f"allocated={torch.cuda.memory_allocated() / 1024**2:.1f} MB, "
                    f"peak={torch.cuda.max_memory_allocated() / 1024**2:.1f} MB"
                )
    finally:
        torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
        torch.cuda.memory._record_memory_history(False)
        print("Saved memory_snapshot.pickle")

    print(
        prof.key_averages()
        .table(sort_by="self_cuda_memory_usage", row_limit=10)
    )

if __name__ == "__main__":
    test_memory_debug()
```
查看结果时,可以使用 `python -m torch.cuda._memory_viz trace_plot memory_snapshot.pickle -o memory.html` 生成可视化页面(可视化 CLI 位于 torch.cuda._memory_viz 模块),也可以直接把 snapshot 文件拖入 https://pytorch.org/memory_viz 在浏览器中查看。
5.3 结果导出与分析
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def get_sample_dataloader(batch_size=32):
    """创建示例数据加载器"""
    data = torch.randn(1000, 3, 224, 224)
    targets = torch.randint(0, 10, (1000,))
    dataset = TensorDataset(data, targets)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)

def build_model():
    """创建简单模型"""
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(64, 128, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(128, 10)
    )

def training_profiling():
    # 与 5.1 模式二相同的训练 profiling 流程
    model = build_model().cuda()
    optimizer = torch.optim.Adam(model.parameters())
    dataloader = get_sample_dataloader()

    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        schedule=torch.profiler.schedule(
            wait=1, warmup=1, active=3, repeat=2
        ),
        on_trace_ready=torch.profiler.tensorboard_trace_handler(
            './logs/training_profile'
        ),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
        with_flops=True
    ) as prof:
        for step, (inputs, targets) in enumerate(dataloader):
            if step >= 10:
                break
            inputs, targets = inputs.cuda(), targets.cuda()
            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            prof.step()

    return prof

def safe_analyze_results(prof):
    """完全安全的分析函数 - 兼容所有PyTorch版本"""
    print("=" * 80)
    print("PYTORCH PROFILER SAFE ANALYSIS REPORT")
    print("=" * 80)

    # 1. 首先检查可用的属性
    sample_event = None
    for event in prof.key_averages():
        sample_event = event
        break

    if sample_event:
        print("🔍 Available event attributes:")
        for attr in dir(sample_event):
            if not attr.startswith('_') and not callable(getattr(sample_event, attr)):
                value = getattr(sample_event, attr)
                if value and attr in ['cpu_time_total', 'cuda_time_total', 'key', 'count']:
                    print(f"  {attr}: {value}")

    # 2. 使用表格输出(最兼容的方法)
    print("\n🔹 Performance Summary Table:")
    print("-" * 60)
    table_output = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
    print(table_output)

    # 3. 手动计算统计信息
    print(f"\n📊 Manual Statistics:")
    total_events = list(prof.key_averages())
    total_cpu_time = 0
    total_cuda_time = 0
    event_count = 0

    for event in total_events:
        event_count += 1
        # 安全地获取CPU时间
        if hasattr(event, 'cpu_time_total'):
            total_cpu_time += event.cpu_time_total
        # 安全地获取CUDA时间 - 尝试不同的属性名
        cuda_time = 0
        if hasattr(event, 'cuda_time_total'):
            cuda_time = event.cuda_time_total
        elif hasattr(event, 'self_cuda_time_total'):  # 旧版本属性名
            cuda_time = event.self_cuda_time_total
        total_cuda_time += cuda_time

    print(f"  Total events processed: {event_count}")
    print(f"  Total CPU time: {total_cpu_time / 1000:.2f} ms")
    print(f"  Total CUDA time: {total_cuda_time / 1000:.2f} ms")

    total_time = total_cpu_time + total_cuda_time
    if total_time > 0:
        cpu_percent = (total_cpu_time / total_time) * 100
        cuda_percent = (total_cuda_time / total_time) * 100
        print(f"  CPU percentage: {cpu_percent:.1f}%")
        print(f"  CUDA percentage: {cuda_percent:.1f}%")

    # 4. 内存统计
    print(f"\n💾 Memory Statistics:")
    current_memory = torch.cuda.memory_allocated() / 1024**2
    max_memory = torch.cuda.max_memory_allocated() / 1024**2
    print(f"  Current GPU memory: {current_memory:.2f} MB")
    print(f"  Peak GPU memory: {max_memory:.2f} MB")

    # 5. 导出结果
    try:
        prof.export_chrome_trace("trace.json")
        print(f"\n💾 Chrome trace exported to: trace.json")
    except Exception as e:
        print(f"⚠️ Could not export trace: {e}")

    print("=" * 80)
    return {
        'total_cpu_time': total_cpu_time,
        'total_cuda_time': total_cuda_time,
        'peak_memory_mb': max_memory
    }

# 使用示例
if __name__ == "__main__":
    prof = training_profiling()
    safe_analyze_results(prof)
```
6. NVIDIA GPU (CUDA 13.0) 实测示例
6.1 ResNet50 深度性能分析
```python
import torch
import torchvision.models as models
import torch.profiler
import time
import os

def ensure_dir(path):
    os.makedirs(path, exist_ok=True)
    return path

def resnet50_profiling(batch_sizes=[1, 2, 4]):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Device: {device}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name()}")

    results = {}

    for batch_size in batch_sizes:
        print(f"\nBatch Size: {batch_size}")

        model = models.resnet50().to(device)
        model.train()
        inputs = torch.randn(batch_size, 3, 224, 224).to(device)
        targets = torch.randint(0, 1000, (batch_size,)).to(device)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        criterion = torch.nn.CrossEntropyLoss()

        # 预热
        for i in range(2):
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.reset_peak_memory_stats()

        # 性能分析
        start_time = time.time()
        chrome_trace_file = f'./profiling_results/resnet50_bs{batch_size}.json'
        ensure_dir('./profiling_results')

        with torch.profiler.profile(
            activities=[
                torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA,
            ] if torch.cuda.is_available() else [
                torch.profiler.ProfilerActivity.CPU,
            ],
            schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
            record_shapes=True,
            profile_memory=True,
            with_stack=True,
        ) as prof:
            for step in range(5):
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                prof.step()

        prof.export_chrome_trace(chrome_trace_file)
        end_time = time.time()

        if torch.cuda.is_available():
            peak_memory = torch.cuda.max_memory_allocated() / 1024**3
        else:
            peak_memory = 0

        total_time = end_time - start_time
        throughput = batch_size / total_time if total_time > 0 else 0

        # 汇总 profiler 统计
        events = prof.key_averages()
        cpu_time = sum(getattr(event, 'cpu_time_total', 0) for event in events) / 1000
        cuda_time = sum(getattr(event, 'cuda_time_total', 0) for event in events) / 1000

        results[batch_size] = {
            'total_time': total_time,
            'throughput': throughput,
            'peak_memory_gb': peak_memory,
            'cpu_time_ms': cpu_time,
            'cuda_time_ms': cuda_time,
            'chrome_trace': chrome_trace_file
        }

        print(f"Time: {total_time:.3f}s, Throughput: {throughput:.1f} samples/s")
        print(f"Memory: {peak_memory:.3f} GB, CPU: {cpu_time:.1f}ms, CUDA: {cuda_time:.1f}ms")
        print(f"Trace: {chrome_trace_file}")

    return results

def print_summary(results):
    print(f"\n{'Batch':<8} {'Time(s)':<10} {'Throughput':<12} {'Memory(GB)':<12} {'CPU(ms)':<10} {'CUDA(ms)':<10}")
    print("-" * 70)
    for batch_size, result in sorted(results.items()):
        print(f"{batch_size:<8} {result['total_time']:<10.3f} {result['throughput']:<12.1f} "
              f"{result['peak_memory_gb']:<12.3f} {result['cpu_time_ms']:<10.1f} {result['cuda_time_ms']:<10.1f}")

if __name__ == "__main__":
    results = resnet50_profiling([1, 2, 4])
    print_summary(results)
```
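注意:CUDA kernel 的发射是异步的,上面的脚本只在预热结束后调用了一次 torch.cuda.synchronize()。若希望 end_time 更精确地反映 GPU 实际完成时间,可以在计时结束前再同步一次(示意改动):

```python
if torch.cuda.is_available():
    torch.cuda.synchronize()  # 等待所有已发射的 kernel 完成
end_time = time.time()
```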
6.2 NVIDIA 平台典型分析结果
控制台输出:
```
Device: cuda
GPU: NVIDIA GeForce RTX 4090

Batch Size: 1
Time: 0.514s, Throughput: 1.9 samples/s
Memory: 0.301 GB, CPU: 247.0ms, CUDA: 0.0ms
Trace: ./profiling_results/resnet50_bs1.json

Batch Size: 2
Time: 0.502s, Throughput: 4.0 samples/s
Memory: 0.389 GB, CPU: 260.7ms, CUDA: 0.0ms
Trace: ./profiling_results/resnet50_bs2.json

Batch Size: 4
Time: 0.472s, Throughput: 8.5 samples/s
Memory: 0.548 GB, CPU: 263.2ms, CUDA: 0.0ms
Trace: ./profiling_results/resnet50_bs4.json

Batch    Time(s)    Throughput   Memory(GB)   CPU(ms)    CUDA(ms)
----------------------------------------------------------------------
1        0.514      1.9          0.301        247.0      0.0
2        0.502      4.0          0.389        260.7      0.0
4        0.472      8.5          0.548        263.2      0.0
```
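另外,表中 CUDA 时间恒为 0.0ms 并不代表 GPU 没有参与计算:更可能的原因是事件对象的耗时属性在不同 PyTorch 版本中命名不同(较新版本使用 device_time_total / self_device_time_total),使得 getattr(event, 'cuda_time_total', 0) 取到了默认值 0。排查时可以在求和处同时尝试这几个属性名,或直接查看导出的 Chrome trace。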
PyTorch Profiler 火焰图总结:
- 横轴宽度:表示函数耗时占比,越宽越可能是瓶颈。
- 纵轴深度:表示函数调用栈层次。
- 颜色:仅用于区分函数,无性能含义。
分析重点:
- 找最宽的函数:定位耗时最多的操作。
- 关注数据加载、模型计算、GPU 同步等关键环节。
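如果想在 TensorBoard 之外生成经典的 FlameGraph 火焰图,可以在开启 with_stack=True 的前提下导出堆栈文件,再交给 Brendan Gregg 的 flamegraph.pl 脚本处理(示意,flamegraph.pl 需自行从 https://github.com/brendangregg/FlameGraph 获取):

```python
# 前提:prof 来自一个开启了 with_stack=True 的 torch.profiler.profile(...)
prof.export_stacks("profiler_stacks.txt", "self_cuda_time_total")
# 随后在 shell 中执行:
#   ./flamegraph.pl --title "CUDA time" --countname "us" \
#       profiler_stacks.txt > flamegraph.svg
```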
导出的 trace JSON 也可以在 https://ui.perfetto.dev/ 中打开,进行交互式分析。