I recently ran into an issue with PyTorch Distributed Data Parallel (DDP) in which the process hangs without any error message. I don't know the exact cause, but I believe it is related to NCCL peer-to-peer (P2P) communication in the distributed setting.
Setting the following environment variable before launching training solved the problem in my case:
export NCCL_P2P_DISABLE=1
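
The variable can also be set from inside Python, as long as it happens before the NCCL process group is initialized. Below is a minimal sketch of a DDP setup with P2P disabled, meant to be launched with torchrun; the model and the script name are placeholders, not part of the original fix.

import os

# Disable NCCL peer-to-peer transport. This must run before
# torch.distributed.init_process_group creates the NCCL communicator.
os.environ["NCCL_P2P_DISABLE"] = "1"

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for each process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Placeholder model; wrap the real model the same way.
    model = torch.nn.Linear(10, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... training loop goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched, for example, with: torchrun --nproc_per_node=2 train.py. Exporting the variable in the shell, as above, works just as well, since torchrun passes the environment on to every rank it spawns.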