
[Inductor] [CPU] performance regression with TORCHINDUCTOR_FREEZING=1 #104952

Status: Closed

ESI-SYD opened this issue Jul 11, 2023 · 9 comments

Labels: oncall: cpu inductor (CPU Inductor issues for Intel team to triage), oncall: pt2, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


ESI-SYD commented Jul 11, 2023

🐛 Describe the bug

There are 6 performance regressions from #93531 (comment):

2023-07-09 nightly vs. 2023-07-06 nightly, and their comparison (Result Comp):

| model | batch_size (07-09) | speedup (07-09) | inductor (07-09) | eager (07-09) | batch_size (07-06) | speedup (07-06) | inductor (07-06) | eager (07-06) | speedup ratio | eager ratio | inductor ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Background_Matting | 1 | 0.782193 | 0.350721461 | 0.274331872 | 1 | 1.00737 | 0.279395653 | 0.281454799 | 0.78 | 1.03 | 0.8 |
| doctr_det_predictor | 1 | 1.090279 | 0.148405348 | 0.161803234 | 1 | 1.713053 | 0.095406578 | 0.163436525 | 0.64 | 1.01 | 0.64 |
| functorch_dp_cifar10 | 64 | 0.622732 | 0.009190167 | 0.005723011 | 64 | 1.008348 | 0.005596095 | 0.005642811 | 0.62 | 0.99 | 0.61 |
| gmlp_s16_224 | 128 | 1.068468 | 0.658434753 | 0.703516464 | 128 | 1.227975 | 0.587295424 | 0.721184098 | 0.87 | 1.03 | 0.89 |
| resmlp_12_224 | 128 | 0.749039 | 0.415152565 | 0.310965462 | 128 | 1.237528 | 0.259625741 | 0.321294124 | 0.61 | 1.03 | 0.63 |
| tnt_s_patch16_224 | 1 | 1.173545 | 0.094649226 | 0.111075126 | 1 | 1.367958 | 0.081854891 | 0.111974053 | 0.86 | 1.01 | 0.86 |
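
For clarity, the ratio columns can be read as follows (my interpretation of the dashboard output, not stated explicitly above): the speedup ratio compares the new speedup against the old one, while the eager and inductor ratios compare old absolute time against new, so values below 1 indicate a slowdown. A small sketch of that arithmetic for the Background_Matting row:

```python
# Sketch of how the ratio columns above appear to be derived (assumption: the
# "inductor"/"eager" columns are absolute times, and speedup = eager / inductor).
def regression_ratios(new, old):
    return {
        "speedup ratio": new["speedup"] / old["speedup"],     # < 1: the Inductor speedup dropped
        "eager ratio": old["eager"] / new["eager"],            # stability of the eager baseline
        "inductor ratio": old["inductor"] / new["inductor"],   # < 1: Inductor itself got slower
    }

# Background_Matting row: prints roughly 0.78, 1.03, 0.80, matching the table.
print(regression_ratios(
    new={"speedup": 0.782193, "inductor": 0.350721461, "eager": 0.274331872},  # 2023-07-09
    old={"speedup": 1.00737,  "inductor": 0.279395653, "eager": 0.281454799},  # 2023-07-06
))
```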

SW information:

| SW | Nightly commit | Main commit |
| --- | --- | --- |
| Pytorch | 9b5a84f | dd6c38c |
| Torchbench | / | 8526eabb |
| torchaudio | a233cc1 | 1e117f5 |
| torchtext | 90ea46c | 8546bbb |
| torchvision | 2ab2f74 | 657027f |
| torchdata | 9ed0325 | 901b483 |
| dynamo_benchmarks | 6226b7d | / |

Versions

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"

export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
```

```bash
python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/torchbench.py --inference --performance --float32 -dcpu -n50 --inductor --no-skip --dashboard --only Background_Matting --cold_start_latency

python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/torchbench.py --inference --performance --float32 -dcpu -n50 --inductor --no-skip --dashboard --only doctr_det_predictor --cold_start_latency

python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/torchbench.py --inference --performance --float32 -dcpu -n50 --inductor --no-skip --dashboard --only functorch_dp_cifar10 --cold_start_latency

python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/torchbench.py --inference --performance --float32 -dcpu -n50 --inductor --no-skip --dashboard --only gmlp_s16_224 --cold_start_latency

python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/timm_models.py --inference --performance --float32 -dcpu -n50 --inductor --no-skip --dashboard --only resmlp_12_224 --cold_start_latency

python -m torch.backends.xeon.run_cpu --core_list 0 --ncores_per_instance 1 benchmarks/dynamo/timm_models.py --inference --performance --float32 -dcpu -n50 --inductor --no-skip --dashboard --only tnt_s_patch16_224 --cold_start_latency
```
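
For anyone who wants to poke at one of these models without the dashboard harness, here is a minimal sketch. It assumes timm is installed and that setting torch._inductor.config.freezing programmatically is equivalent to TORCHINDUCTOR_FREEZING=1 (my understanding of how the env var is consumed); timings from such a micro-benchmark will not match the dashboard numbers exactly.

```python
# Minimal FP32 CPU sketch comparing Inductor with and without freezing.
# Assumes `timm` is installed; model and batch size taken from the table above.
import time
import torch
import timm

def bench(freezing: bool, iters: int = 50) -> float:
    torch._dynamo.reset()                        # clear caches so each config recompiles
    torch._inductor.config.freezing = freezing   # programmatic counterpart of TORCHINDUCTOR_FREEZING
    model = timm.create_model("resmlp_12_224", pretrained=False).eval()
    x = torch.randn(128, 3, 224, 224)            # batch_size=128 as in the report
    compiled = torch.compile(model)
    with torch.no_grad():
        compiled(x)                              # warm-up / trigger compilation
        start = time.perf_counter()
        for _ in range(iters):
            compiled(x)
    return (time.perf_counter() - start) / iters

print("freezing=0:", bench(False))
print("freezing=1:", bench(True))
```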

cc @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519

XiaobingSuper self-assigned this Jul 12, 2023
shunting314 added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jul 26, 2023
penguinwu added the oncall: cpu inductor label (CPU Inductor issues for Intel team to triage) and removed the module: cpu inductor label Dec 2, 2023
@chuanqi129 (Collaborator) commented:

I have double-checked against the latest test results; the 4 models below still show regressions. cc @zxd1997066 to help find the guilty commit for each of these 4 models. These are FP32, static-shape, default-wrapper tests; the first three models are tested with multiple threads, while the last one, tnt_s_patch16_224, is single-thread.

2023-07-09 nightly vs. 2023-07-06 nightly, and their comparison (Result Comp):

| model | batch_size (07-09) | speedup (07-09) | inductor (07-09) | eager (07-09) | batch_size (07-06) | speedup (07-06) | inductor (07-06) | eager (07-06) | speedup ratio | eager ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Background_Matting | 1 | 0.782193 | 0.350721461 | 0.274331872 | 1 | 1.00737 | 0.279395653 | 0.281454799 | 0.78 | 1.03 |
| functorch_dp_cifar10 | 64 | 0.622732 | 0.009190167 | 0.005723011 | 64 | 1.008348 | 0.005596095 | 0.005642811 | 0.62 | 0.99 |
| resmlp_12_224 | 128 | 0.749039 | 0.415152565 | 0.310965462 | 128 | 1.237528 | 0.259625741 | 0.321294124 | 0.61 | 1.03 |
| tnt_s_patch16_224 | 1 | 1.173545 | 0.094649226 | 0.111075126 | 1 | 1.367958 | 0.081854891 | 0.111974053 | 0.86 | 1.01 |

@leslie-fang-intel (Collaborator) commented:

@zxd1997066 Please help to find the guilty commit for each regression, so we can take a look.

leslie-fang-intel removed their assignment Dec 25, 2023
@zxd1997066 (Contributor) commented:

I cannot reproduce the 2023-07-06 nightly results on my side for these 4 models.

> I have double-checked against the latest test results; the 4 models below still show regressions. cc @zxd1997066 to help find the guilty commit for each of these 4 models. These are FP32, static-shape, default-wrapper tests; the first three models are tested with multiple threads, while the last one, tnt_s_patch16_224, is single-thread.
>
> | model | batch_size (07-09) | speedup (07-09) | inductor (07-09) | eager (07-09) | batch_size (07-06) | speedup (07-06) | inductor (07-06) | eager (07-06) | speedup ratio | eager ratio |
> | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
> | Background_Matting | 1 | 0.782193 | 0.350721461 | 0.274331872 | 1 | 1.00737 | 0.279395653 | 0.281454799 | 0.78 | 1.03 |
> | functorch_dp_cifar10 | 64 | 0.622732 | 0.009190167 | 0.005723011 | 64 | 1.008348 | 0.005596095 | 0.005642811 | 0.62 | 0.99 |
> | resmlp_12_224 | 128 | 0.749039 | 0.415152565 | 0.310965462 | 128 | 1.237528 | 0.259625741 | 0.321294124 | 0.61 | 1.03 |
> | tnt_s_patch16_224 | 1 | 1.173545 | 0.094649226 | 0.111075126 | 1 | 1.367958 | 0.081854891 | 0.111974053 | 0.86 | 1.01 |

@leslie-fang-intel (Collaborator) commented:

@zxd1997066 @chuanqi129 will check the performance data from before the regression.

@zxd1997066 (Contributor) commented:

Update: verified on the 2023-07-06 nightly (13763f5) for Background_Matting.

Without freezing (good):

```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,Background_Matting,1,0.997587,279.893412,43.882071,0.987273,478.472192,484.640358,183,1,0,0
```

With freezing (bad):

```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,Background_Matting,1,0.787942,356.164509,46.301095,0.985237,473.252659,480.344064,183,1,0,0
```

For commits before the 7/6 nightly, running with TORCHINDUCTOR_FREEZING=1 crashes:
[crash screenshot]

But with the 2024-01-29 nightly (890d8e6), performance is bad both with and without freezing:

```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cpu,Background_Matting,1,0.854602,326.251965,15.495688,0.982960,477.204070,485.476762,183,1,0,0,0,0
```
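
Reading the abs_latency column (in milliseconds, if I read the harness output correctly): on the 2023-07-06 nightly, freezing gives 356.16 ms vs. 279.89 ms without, i.e. roughly a 27% slowdown (356.16 / 279.89 ≈ 1.27), consistent with the speedup dropping from ~1.00 to ~0.79. The single row pasted for the 2024-01-29 nightly shows 326.25 ms (speedup ≈ 0.85), still well above the July baseline.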

@leslie-fang-intel (Collaborator) commented:

> Update: verified on the 2023-07-06 nightly (13763f5) for Background_Matting.
>
> Without freezing (good):
> dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
> cpu,Background_Matting,1,0.997587,279.893412,43.882071,0.987273,478.472192,484.640358,183,1,0,0
>
> With freezing (bad):
> dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
> cpu,Background_Matting,1,0.787942,356.164509,46.301095,0.985237,473.252659,480.344064,183,1,0,0

Thanks @chuanqi129 @zxd1997066. For Background_Matting, do you mean the regression is due to a change in our testing semantics (TORCHINDUCTOR_FREEZING=1 vs. TORCHINDUCTOR_FREEZING=0) rather than a code check-in?

@zxd1997066 (Contributor) commented:

It is hard to say, since it is a very early report. But per my verification, TORCHINDUCTOR_FREEZING=1 and TORCHINDUCTOR_FREEZING=0 do make a difference on the same commit (13763f5).

BTW, when using TORCHINDUCTOR_FREEZING=0, tnt_s_patch16_224, functorch_dp_cifar10, and Background_Matting show performance regressions with the latest PyTorch. tnt_s_patch16_224 and functorch_dp_cifar10 share the same suspected guilty commit (7e098f9), while Background_Matting's guilty commit is 7c97c94. I will submit separate issues for them.

@zxd1997066 (Contributor) commented:

- tnt_s_patch16_224 and functorch_dp_cifar10 regression: #119178
- Background_Matting regression: #119181
- resmlp_12_224 has no regression when using TORCHINDUCTOR_FREEZING=0
- the gap between TORCHINDUCTOR_FREEZING=1 and TORCHINDUCTOR_FREEZING=0: #119183

@leslie-fang-intel (Collaborator) commented:

Closing this issue, as it is now tracked in the new issues above, grouped by guilty commit.
