PyTorch hangs at import when used together with TensorFlow #102360
Comments
Interestingly, TensorFlow 2.10 does not seem to cause the hang in PyTorch? At least I don't get the hang then. However, I don't have a proper CUDA env setup for this, so TF fails to load some CUDA libs, which might also influence the behavior. Or maybe TF 2.12 behaves a bit differently w.r.t. the CUDA libs and loads them more lazily. I'm not sure.
Why did I try TF 2.10? Because that is what we use in our GitHub CI, and it works there. In addition, I have read here that TF has changed something in recent versions, and they mention:
So it mentions exactly the error I saw (but I saw this error in PyTorch...).
This sort of problem tends to be quite difficult to diagnose, but one thing you could try is building PyTorch and TF from source with the same compiler toolchain.
Going back to …
This doesn't seem to reproduce with nightlies (e.g. …).
My friend and I encountered the same issue here: the process became unresponsive while waiting on a blocked 4-byte read from standard input (fd=0).

System Info:

Temporary Solution:

Unfortunately, the import is entangled with my project, so I'm not able to provide a script for reproducing the error. Here are some other findings from debugging the script with GDB:

Then I took a look at what … The _fileno=4 here points to …

However, it went down a different path and read from fd=0.

I was unable to determine the preprocessor macro, as it had been optimized out during compilation. However, system-wide POSIX I/O is fully supported. While traversing the …

This occurrence is quite perplexing to me: within the same process, different branches of the macro definition were taken. If anyone has insights into this situation, I would greatly appreciate any input. Thank you.
Does this mean we're dealing with two conflicting versions of the lib, one compiled with the macro set and one without?

I can reproduce this, by the way, just by doing the import below with decord==0.6.0. I've tried this with Python 3.8.16 and 3.11.4. If I import torch first, and then decord, the hang doesn't happen.
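A minimal sketch of the two import orders described above; the exact original snippet isn't preserved in this thread, so the form below is inferred from the comment:

```python
# Reported to hang (decord==0.6.0, tried with Python 3.8.16 and 3.11.4):
import decord  # imported first
import torch   # the hang occurs during this import

# The reverse order reportedly does not hang:
#   import torch
#   import decord
```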
@DaveyBiggers Dave is correct -- thank you for the clue! With the 'import' example Dave shared above, I confirmed that random_device::_M_init and random_device::_M_getval were resolved to references in two different dynamic libraries:
This wouldn't happen at compile time due to C++'s one-definition rule (ODR), but it can cause trouble during dynamic linking, as we see in this example. The solutions (other than downgrading torch) are:
Cheers.
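As a follow-up, here is a rough sketch of how one might check which shared objects loaded into a process export std::random_device symbols. It assumes Linux (/proc/self/maps) and binutils' nm being available; the torch import is only an example of loading the libraries to inspect:

```python
# Sketch: after importing the suspect packages, scan the shared objects mapped
# into this process and report which ones export std::random_device symbols.
import subprocess

import torch  # noqa: F401  -- load whichever packages you want to inspect

libs = set()
with open("/proc/self/maps") as f:
    for line in f:
        path = line.split()[-1]
        if ".so" in path:
            libs.add(path)

for lib in sorted(libs):
    nm = subprocess.run(["nm", "-D", lib], capture_output=True, text=True)
    if "random_device" in nm.stdout:  # matches the mangled symbol names
        print(lib)
```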
In the TF discussion I linked above (here), they don't mention …
Recently I encountered a similar issue when using both tensorboard and torch.

    device = torch.device(opt.device if torch.cuda.is_available() else 'cpu')
    if torch.cuda.is_available():
        model_t = model_t.to(device)
        model_t = nn.DataParallel(model_t, device_ids=opt.device_id)
        model_s = model_s.to(device)
        model_s = nn.DataParallel(model_s, device_ids=opt.device_id)
        # criterion = criterion.to(device)
        cudnn.benchmark = True

This code raised an error message in the terminal.
Magically, changing the import order from

    import tensorboard_logger as tb_logger
    import torch

to

    import torch
    import tensorboard_logger as tb_logger

made the error go away. I don't know much about the mechanism behind the solution, but I guess it is related to the discussions above.
My issue got resolved after downgrading to a PyTorch version below 2. Please refer to the link below for the compatibility matrix.
🐛 Describe the bug
Code:
This hangs in some cases in the import torch. Importing it the other way around, or also just importing Torch, does not hang. However, I'm reporting this here because the stacktrace still looks suspicious.
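A minimal sketch of the import order that triggers this (the exact code block is not reproduced above, so take the form below as illustrative):

```python
# Illustrative reproducer: TensorFlow imported first, then PyTorch.
# With TF 2.12 and PyTorch 2.0.1 on Ubuntu 22.04 (distribution Python 3.10/3.11),
# the second import reportedly blocks inside std::random_device initialization.
import tensorflow as tf  # imported first

import torch  # the hang happens during this import

print(tf.__version__, torch.__version__)
```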
Specifically, on my system (Ubuntu 22.04, using the distribution Python 3.10), I have TensorFlow 2.12 and PyTorch 2.0.1. The same happens with Python 3.11.
The stacktrace of the hang:
The fd=0, coming from std::random_device::_M_getval, looks very suspicious to me. It looks like the std::random_device is not properly initialized? Code here and here.

In other cases, I have also seen the error "random_device could not be read". This seems to be very related; maybe it got another uninitialized _M_fd value.

I also reported this here: rwth-i6/returnn#1339
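As a side note, a generic way to confirm at which Python statement the process is stuck is to arm a faulthandler watchdog before the imports. This is a sketch, not part of the original report; the 30-second timeout is arbitrary:

```python
import faulthandler
import sys

# Dump all thread tracebacks to stderr if we are still stuck after 30 seconds.
faulthandler.dump_traceback_later(30, exit=False, file=sys.stderr)

import tensorflow  # noqa: F401
import torch       # noqa: F401  -- if this hangs, the dump fires after 30 s

faulthandler.cancel_dump_traceback_later()
```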
Some related issues:
https://discuss.pytorch.org/t/random-device-could-not-be-read/138697 (very related)
JohnSnowLabs/spark-nlp#5943
https://discuss.tensorflow.org/t/tensorflow-linux-wheels-are-being-upgraded-to-manylinux2014/8339
h2oai/datatable#2453
robjinman/pro_office_calc#5
boostorg/fiber#249
microsoft/LightGBM#1516
Versions
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: 15.0.7
CMake version: version 3.26.3
Libc version: glibc-2.35
Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 980
Nvidia driver version: 530.41.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
CPU family: 6
Model: 158
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
Stepping: 10
CPU(s) scaling MHz: 42%
CPU max MHz: 4100.0000
CPU min MHz: 800.0000
BogoMIPS: 6000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
L1d cache: 192 KiB (6 instances)
L1i cache: 192 KiB (6 instances)
L2 cache: 1.5 MiB (6 instances)
L3 cache: 9 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-5
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Mitigation; Microcode
Vulnerability Tsx async abort: Mitigation; TSX disabled
Versions of relevant libraries:
[pip3] flake8==4.0.1
[pip3] numpy==1.23.5
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2
[pip3] torchdata==0.6.1
[pip3] triton==2.0.0
[conda] Could not collect