    How to prevent PyTorch Distributed Data Parallel (DDP) from freezing

    I recently ran into an issue with PyTorch Distributed Data Parallel (DDP) in which the training process hangs without any error message. I don’t know exactly what causes it, but I believe it is triggered by NCCL in the distributed setting.

    March 7, 2022

    Junyong Lee, Ph.D. in CSE
    Research Scientist @ SAIC Toronto

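    Setting the following environment variable in the shell before launching training solved the problem for me. It disables NCCL's peer-to-peer (P2P) transport between GPUs, so NCCL falls back to other transports, which may be slower: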

    export NCCL_P2P_DISABLE=1
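
If editing the launch shell is inconvenient, the same variable can be set from inside the training script. The snippet below is a minimal sketch, not a tested fix: `configure_nccl_env` is a hypothetical helper (not part of PyTorch), and it assumes it is called before `torch.distributed.init_process_group(backend="nccl")`, since NCCL reads its environment variables when communicators are created. The optional `NCCL_DEBUG=INFO` setting is an addition to the original tip that makes NCCL log its setup, which may help when tracking down a silent hang.

```python
import os


def configure_nccl_env(disable_p2p: bool = True, debug: bool = False) -> dict:
    """Set NCCL environment variables for the current process.

    Hypothetical helper for illustration. Call it before
    torch.distributed.init_process_group(backend="nccl").
    """
    settings = {}
    if disable_p2p:
        # Same effect as `export NCCL_P2P_DISABLE=1` in the launching shell.
        settings["NCCL_P2P_DISABLE"] = "1"
    if debug:
        # Optional: ask NCCL to log its initialization details.
        settings["NCCL_DEBUG"] = "INFO"
    os.environ.update(settings)
    return settings


# Apply the workaround at the very top of the training script.
applied = configure_nccl_env()
print(applied)
```

Each DDP worker is a separate process, so the helper has to run in every worker (for example at the top of the script started by the launcher), not only in the parent process.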
    

    References

    1. Distributed data parallel freezes without error message
