Debugging Metal Errors
Metal API Validation
The API Validation layer catches most problems with Metal API calls and reports much more useful information when an error occurs.
Without API Validation, all you get is the basic error report:
GPU Errors
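Discarded (victim of GPU error/recovery) (IOAF code 5)
Note: this only tells you that a GPU error occurred, not where it occurred.
With API Validation enabled, detailed errors for Metal API calls are printed to help you locate the problem: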
API Errors
-[MTLDebugComputeCommandEncoder setBuffer:offset:atIndex:]
'offset(200) must be < [buffer length] (100).'
* frame #6: MPSRayTracing`-[AAPLRenderer drawInMTKView:] at AAPLRenderer.mm:410:5
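To make the message above concrete, here is a minimal sketch of the kind of precondition a validation layer evaluates before a call reaches the driver. The function name `checkBufferOffset` is hypothetical and this is not Metal's actual implementation; it only reproduces the rule stated in the message (the offset must be smaller than the buffer length).

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a validation-layer precondition; not Metal's real code.
// Returns an empty string when the binding is valid, otherwise a message
// shaped like the one in the report above.
std::string checkBufferOffset(std::size_t offset, std::size_t bufferLength) {
    if (offset < bufferLength) {
        return std::string();  // offset lies inside the buffer, nothing to report
    }
    return "offset(" + std::to_string(offset) + ") must be < [buffer length] (" +
           std::to_string(bufferLength) + ").";
}
```

Because the check runs on the CPU at the call site, the failure arrives with a stack trace that points at the offending line, which is what the frame in the report above shows.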
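However, API Validation cannot effectively locate errors that happen on the GPU itself, for example:
① Timeouts, e.g. a long-running shader, a loop with a huge iteration count, or an infinite loop
② Out-of-bounds access, e.g. indexing past the end of an array in global (device) memory or shared (threadgroup) memory
③ Nil resource access, e.g. reading from a null texture
④ Invalid resource residency, e.g. forgetting to call useResource when using argument buffers, so the resource is not resident
Enabling in Xcode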

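Enabling via environment variable
Set the environment variable MTL_DEBUG_LAYER=1 before the device is created.
Enhanced Command Buffer Errors
This feature is built into Metal and has very low overhead. It can be enabled for individual command buffers.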
Error Domain=MTLCommandBufferErrorDomain Code=1 UserInfo={
    NSLocalizedDescription=Discarded (victim of GPU error/recovery) (IOAF code 5),
    MTLCommandBufferEncoderInfoErrorKey=(
        "<errorState: MTLCommandEncoderErrorStateCompleted, label: RenderOccluders>",
        "<errorState: MTLCommandEncoderErrorStateCompleted, label: DepthPrepass+Lightculling>",
        "<errorState: MTLCommandEncoderErrorStateCompleted, label: Shadow Cascade 0>",
        "<errorState: MTLCommandEncoderErrorStateFaulted, label: GBuffer Pass>",
        "<errorState: MTLCommandEncoderErrorStateAffected, label: Forward Pass>"
    )
}
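When reading a report like this, the first question is usually which encoder is marked Faulted, since that is the one reported as responsible, while Affected encoders may only have been caught up in the failure. The sketch below shows that triage step in plain C++. `EncoderState`, `EncoderInfo`, `faultedEncoders`, and `sampleReport` are hypothetical stand-ins that only mirror the state names and labels from the sample output; they are not Metal types.

```cpp
#include <string>
#include <vector>

// Hypothetical mirror of the MTLCommandEncoderErrorState names, for illustration only.
enum class EncoderState { Unknown, Completed, Affected, Pending, Faulted };

// Hypothetical stand-in for one entry of the encoder info array.
struct EncoderInfo {
    std::string label;
    EncoderState state;
};

// Collect the labels of every encoder reported as Faulted.
std::vector<std::string> faultedEncoders(const std::vector<EncoderInfo>& infos) {
    std::vector<std::string> labels;
    for (const EncoderInfo& info : infos) {
        if (info.state == EncoderState::Faulted) {
            labels.push_back(info.label);
        }
    }
    return labels;
}

// The encoder list from the sample error shown above.
std::vector<EncoderInfo> sampleReport() {
    return {
        {"RenderOccluders", EncoderState::Completed},
        {"DepthPrepass+Lightculling", EncoderState::Completed},
        {"Shadow Cascade 0", EncoderState::Completed},
        {"GBuffer Pass", EncoderState::Faulted},
        {"Forward Pass", EncoderState::Affected},
    };
}
```

Applied to the sample data, `faultedEncoders(sampleReport())` yields a single label, "GBuffer Pass". This only works well when every encoder has a meaningful label, which is a good reason to label encoders consistently.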
How to enable
// MTLCommandBufferDescriptor (available from macOS 11 / iOS 14) configures per-command-buffer behavior.
MTLCommandBufferDescriptor* desc = [MTLCommandBufferDescriptor new];
// The command buffer keeps strong references to the resources it uses until it completes.
// YES is already the default; NO is an optimization that leaves resource lifetimes to the app.
desc.retainedReferences = YES;
// Collect per-encoder execution status when an error occurs, which helps pinpoint the failing pass.
desc.errorOptions = MTLCommandBufferErrorOptionEncoderExecutionStatus;
// cmdQueue is an id<MTLCommandQueue>.
id<MTLCommandBuffer> cmdBuffer = [cmdQueue commandBufferWithDescriptor:desc];
Logging error details when a failure occurs
Formatted output when an error occurs:
static const TCHAR* StringFromCommandEncoderError(MTLCommandEncoderErrorState ErrorState)
{
    switch (ErrorState)
    {
        case MTLCommandEncoderErrorStateUnknown:   return TEXT("Unknown");   // state is not known
        case MTLCommandEncoderErrorStateAffected:  return TEXT("Affected");  // affected (may have been hit by the fault)
        case MTLCommandEncoderErrorStateCompleted: return TEXT("Completed"); // finished
        case MTLCommandEncoderErrorStateFaulted:   return TEXT("Faulted");   // faulted
        case MTLCommandEncoderErrorStatePending:   return TEXT("Pending");   // not yet finished
    }
    return TEXT("Unknown");
}

// Print the per-encoder error info. CompletedBuffer is an mtlpp::CommandBuffer.
if (&MTLCommandBufferEncoderInfoErrorKey != nullptr)
{
    if (NSArray<id<MTLCommandBufferEncoderInfo>>* EncoderInfoArray =
            [CompletedBuffer.GetError() userInfo][MTLCommandBufferEncoderInfoErrorKey])
    {
        UE_LOG(LogMetal, Warning, TEXT("GPU Encoder Crash Info:"));
        for (id<MTLCommandBufferEncoderInfo> EncoderInfo in EncoderInfoArray)
        {
            UE_LOG(LogMetal, Warning, TEXT("MTLCommandBufferEncoder - Label: %s, State: %s"),
                   *FString(EncoderInfo.label), StringFromCommandEncoderError(EncoderInfo.errorState));
            if (EncoderInfo.debugSignposts.count > 0)
            {
                UE_LOG(LogMetal, Warning, TEXT(" Signposts:"));
                for (NSString* Signpost in EncoderInfo.debugSignposts)
                {
                    UE_LOG(LogMetal, Warning, TEXT(" - %s"), *FString(Signpost));
                }
            }
        }
    }
}
You can also print the error directly:
NSError *error = commandBuffer.error;
if (error != nil) {
    NSLog(@"%@", error);
}
Out-of-bounds (OOB) GPU memory access

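Note: if an access to allocation A goes out of bounds and lands in an unallocated region, Enhanced Command Buffer Errors can catch the problem.
If the out-of-bounds access lands in another allocated region B instead, Enhanced Command Buffer Errors can no longer catch it. Shader Validation is needed to catch this class of problem.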
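The difference between the two cases can be illustrated with a small CPU-side model. This is a deliberately simplified sketch, not a description of how the GPU or Metal is implemented: `Allocation`, `hardwareFaults`, and `boundsCheckFails` are hypothetical names, and the model assumes that a fault is raised only when an address belongs to no allocation at all, while a per-access bounds check compares the address against the allocation the shader meant to touch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model: an allocation is the half-open address range [base, base + size).
struct Allocation {
    std::size_t base;
    std::size_t size;

    bool contains(std::size_t address) const {
        return address >= base && address < base + size;
    }
};

// Models a fault: raised only when the address falls inside no allocation at all.
bool hardwareFaults(const std::vector<Allocation>& mapped, std::size_t address) {
    for (const Allocation& allocation : mapped) {
        if (allocation.contains(address)) {
            return false;  // some allocation owns this address, so nothing faults
        }
    }
    return true;
}

// Models a per-access bounds check: the access must stay inside the intended allocation.
bool boundsCheckFails(const Allocation& intended, std::size_t address) {
    return !intended.contains(address);
}
```

With A covering addresses 0 to 99 and B covering 100 to 199, an access meant for A that hits address 150 passes the fault model silently (B owns that address) but fails the bounds check, while address 500 is flagged by both. That mirrors the split described above between the two tools.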

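Metal Shader Validation
Like API Validation, this is a validation layer, but for shaders. It instruments Metal shaders to detect errors, then locates and classifies them. When it detects an operation that would cause undefined behavior, it prevents the operation and creates a log entry, which makes shader problems easier to track down.
It can detect the following problems:
① Out-of-bounds global memory access (device and constant address spaces)
② Out-of-bounds threadgroup memory access
③ Null texture access
④ Infinite loops. Note: Enhanced Command Buffer Errors also help with this class of problem
⑤ Resource residency. Note: Enhanced Command Buffer Errors also help with this class of problem
Enabling Shader Validation has a significant performance cost. It may also change the values of maxTotalThreadsPerThreadgroup and threadExecutionWidth so that the instrumented shaders can still run.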
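Because these two values can change under Shader Validation, dispatch code should derive its threadgroup size from the values it reads at run time instead of hard-coding it. The sketch below shows the usual width-by-height calculation in plain C++. `PipelineLimits`, `ThreadgroupSize`, and `threadgroupSizeFor` are hypothetical stand-ins for values you would read from the compute pipeline state.

```cpp
#include <cstddef>

// Hypothetical stand-in for the two limits read from a compute pipeline state.
struct PipelineLimits {
    std::size_t threadExecutionWidth;
    std::size_t maxTotalThreadsPerThreadgroup;
};

struct ThreadgroupSize {
    std::size_t width;
    std::size_t height;
};

// Derive a 2D threadgroup size from the queried limits:
// one execution width across, and as many rows as the total limit allows.
ThreadgroupSize threadgroupSizeFor(const PipelineLimits& limits) {
    if (limits.threadExecutionWidth == 0) {
        return {1, 1};  // defensive fallback for an invalid limit
    }
    std::size_t width = limits.threadExecutionWidth;
    std::size_t height = limits.maxTotalThreadsPerThreadgroup / width;
    return {width, height};
}
```

With made-up limits of 32 and 1024 this gives a 32 by 32 threadgroup. If validation lowered the total to 512, the same code would produce 32 by 16, whereas a hard-coded 32 by 32 would exceed the limit. The specific numbers are illustrative only.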
Enabling in Xcode
Click the arrow button next to Shader Validation to set up the runtime breakpoint.

Check Enable Runtime Issue Breakpoint.

Enabling via environment variable
Set the environment variable MTL_SHADER_VALIDATION=1 before the device is created.
Example


Logging shader information when an error occurs
Command buffer log API:
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    // Each entry in cb.logs is an id<MTLFunctionLog>.
    for (id<MTLFunctionLog> log in cb.logs) {
        NSString *encoderLabel = log.encoderLabel ?: @"Unknown Label";
        NSLog(@"Faulting encoder \"%@\"", encoderLabel);
        id<MTLFunctionLogDebugLocation> debugLocation = log.debugLocation;
        NSString *functionName = debugLocation.functionName;
        if (debugLocation && functionName) {
            NSLog(@"Faulting function %@:%lu:%lu", functionName,
                  (unsigned long)debugLocation.line, (unsigned long)debugLocation.column);
        }
    }
}];
Printing the logs without parsing them:
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    for (id<MTLFunctionLog> log in cb.logs) {
        NSLog(@"%@", [log description]);
    }
}];
Run the following in Terminal to filter the log stream down to these messages:
log stream --predicate "subsystem = 'com.apple.Metal' and category = 'GPUDebug'"
OSLog:
MPSRayTracing: (MetalTools) [com.apple.Metal:GPUDebug] Invalid device memory read executing kernel function "shadeKernel" encoder: "2", dispatch: 0, at offset 27202592 AAPLShaders.metal:265:28 - shadeKernel()
Compiling shaders with debug symbols
xcrun -sdk macosx metal -g -c shader1.metal -o shader1.air
xcrun -sdk macosx metal -g -c shader2.metal -o shader2.air
xcrun -sdk macosx metallib shader1.air shader2.air -o final.metallib
Use #line in the shader source so that a crash report identifies which file the shader came from:
// Full filename if reading from disk
#line 0 "/Path/To/My/File.metal"
// A unique identifier if not
#line 0 "AUniqueFileName.metal"
Limitations of Shader Validation
① Binary function pointers are not supported
② Dynamic linking is not supported
③ On MTLGPUFamilyMac1, MTLGPUFamilyApple5, and older devices, pointers from argument buffers are not instrumented, so accesses through them are not checked