    How to prevent PyTorch Distributed Data Parallel (DDP) from freezing

    March 7, 2022

    Junyong Lee,
    Ph.D. in CSE

    Staff Research Scientist @ AIC Toronto

    I recently ran into an issue with PyTorch Distributed Data Parallel (DDP) in which the training process hangs without any error message. I don’t know exactly what causes it, but I believe it is triggered by NCCL in the distributed setting.

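    Setting the following environment variable in the shell before launching training solved the problem for me. It disables NCCL’s peer-to-peer (P2P) transport, which uses direct GPU-to-GPU access over NVLink or PCIe, so communication between GPUs may become slower: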

    export NCCL_P2P_DISABLE=1
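
If exporting the variable in the shell is inconvenient, the same setting can probably be applied from inside the training script, as long as it happens before NCCL is initialized. The snippet below is a minimal sketch under that assumption. The helper name `disable_nccl_p2p` is hypothetical, and turning on `NCCL_DEBUG=INFO` is an optional extra (not part of the original fix) that makes NCCL log what it is doing, which can help show where a hang occurs.

```python
import os


def disable_nccl_p2p(debug: bool = False) -> None:
    """Set NCCL environment variables for the current process.

    Hypothetical helper: call it before the NCCL process group is
    created, since NCCL reads these variables when it initializes.
    """
    # Same effect as `export NCCL_P2P_DISABLE=1` in the shell.
    os.environ["NCCL_P2P_DISABLE"] = "1"
    if debug:
        # Optional: ask NCCL to print informational logs.
        os.environ["NCCL_DEBUG"] = "INFO"


disable_nccl_p2p(debug=True)

# Only after this point would the training script set up DDP, e.g.:
#   import torch.distributed as dist
#   dist.init_process_group(backend="nccl")
```

Each DDP worker is a separate process, so the helper has to run in every worker (for example at the top of the script that the launcher starts), not only in the parent process.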
    

    References

    1. Distributed data parallel freezes without error message

    Updated: March 7, 2022
