On AMD platform, torch.profiler's schedule(wait=x, warmup=y, active=z) option counts GPU kernel time (y+z) times rather than z times #102141
Labels
module: rocm
AMD GPU support for Pytorch
oncall: profiler
profiler-related issues (cpu, gpu, kineto)
🐛 Describe the bug
When I used torch.profiler on an AMD platform to profile GPT model inference, I found the raw data was unreasonable: the total "self CUDA" time of the CPU operations was far less than the total "self CUDA" time of the GPU kernels.
After experimenting and analysing the raw data, I found that the GPU kernels' "self CUDA" time is accumulated over both the warmup and active steps (y + z steps), rather than over the active steps alone.
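For reference, the intended semantics of `torch.profiler.schedule` are that only the `active` steps are recorded, so kernel time should be counted z times. A minimal sketch of that step state machine, assuming the documented wait/warmup/active behaviour (the `Action` enum and `schedule` function below are illustrative reimplementations, not the torch API):

```python
from enum import Enum

class Action(Enum):
    """Illustrative stand-in for torch.profiler.ProfilerAction."""
    NONE = 0
    WARMUP = 1
    RECORD = 2
    RECORD_AND_SAVE = 3

def schedule(wait: int, warmup: int, active: int):
    """Return a function mapping a step number to the expected profiler action."""
    def fn(step: int) -> Action:
        cycle = step % (wait + warmup + active)  # cycles repeat indefinitely
        if cycle < wait:
            return Action.NONE
        if cycle < wait + warmup:
            return Action.WARMUP
        if cycle < wait + warmup + active - 1:
            return Action.RECORD
        return Action.RECORD_AND_SAVE  # last active step also saves the trace
    return fn

# Same parameters as the repro script: wait=1, warmup=1, active=2, 4 steps total.
fn = schedule(wait=1, warmup=1, active=2)
actions = [fn(step) for step in range(4)]

# Only RECORD / RECORD_AND_SAVE steps should contribute to "self CUDA" totals.
recorded = [s for s, a in enumerate(actions)
            if a in (Action.RECORD, Action.RECORD_AND_SAVE)]
print(recorded)  # [2, 3] — exactly the z=2 active steps, not the y+z=3 warmup+active steps
```

If the reported totals instead correspond to y + z steps, the warmup steps' kernel time is leaking into the aggregation, which matches the numbers observed above.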
Python script:

```python
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler(dir_name='./logs'),
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    with_modules=True,
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    with torch.no_grad():
        for i in range(4):
            logits = model.generate(**input_ids, do_sample=True, num_beams=1,
                                    min_length=12, max_new_tokens=12,
                                    pad_token_id=50256)
            prof.step()

print(prof.key_averages(group_by_input_shape=True).table(
    row_limit=1000000, sort_by='self_cuda_time_total'))
```
Versions
model: GPT-J-6B with FP16 inference
H/W: 8× MI100
S/W: PyTorch 1.13.0, ROCm 5.4, Transformers 4.29.2
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @robieta @chaekit @aaronenyeshi @ngimel @nbcsm @guotuofeng @guyang3532 @gaoteng-git @tiffzhaofb @dzhulgakov @davidberard98