Error
"Default process group has not been initialized, please make sure to call init_process_group" [1]
- Environment
- Ubuntu 20.04.6 LTS
- CUDA 11.8
- Python 3.10
The error occurred when I tried to reimplement VideoChat2 on the server; I had modified the execution script because I wanted to launch the code on a single machine with 2 GPUs (or a single GPU).
The modified script is as follows:
python tasks/train.py \
$(dirname $0)/config.py \
output_dir ${OUTPUT_DIR}
The original script is as follows:
srun -p ${PARTITION} \
-n${NNODE} \
--gres=gpu:${NUM_GPUS} \
--ntasks-per-node=1 \
--cpus-per-task=${NUM_CPU} \
bash torchrun.sh \
--nnodes=${NNODE} \
--nproc_per_node=${NUM_GPUS} \
--rdzv_backend=c10d \
tasks/train.py \
$(dirname $0)/config.py \
output_dir ${OUTPUT_DIR}
Rationale
[1] explains that the default process group has to be set up first. My code does not use DDP directly but instead relies on the torch.distributed module, so I referred to [2], because the code in [1] initializes the default process group by executing
dist.init_process_group("gloo", rank=rank, world_size=world_size)
To fully understand [2], I had to refer to [3], which provides an overview of the PyTorch distributed module. It covers the parallelism APIs: DDP, FSDP, TP, and PP.
Data Parallelism
Data parallelism replicates the model across multiple GPUs, and each replica computes local gradients on its own subset of the data. Therefore, it does not reduce memory usage per GPU, but it can reduce training time.
We can adopt it by using the DistributedDataParallel (DDP) module in our code.
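A minimal sketch of what DDP usage looks like, assuming the script is launched with torchrun (the toy model here is mine, not VideoChat2's):
# DDP sketch (assumption: launched with torchrun, which sets RANK,
# WORLD_SIZE, and LOCAL_RANK in the environment).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")            # env:// rendezvous by default
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda(local_rank)   # full replica on every GPU
model = DDP(model, device_ids=[local_rank])

out = model(torch.randn(8, 10, device=f"cuda:{local_rank}"))
out.sum().backward()                               # gradients are all-reduced across ranks

dist.destroy_process_group()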
Model Parallelism (Sharded Data Parallelism)
When a model is too large to fit on a single GPU, I can split it into several pieces and distribute the shards across GPUs. This can be accomplished with FullyShardedDataParallel (FSDP).
Tensor Parallel (TP) and Pipeline Parallel (PP) seem to be other methods for achieving this.
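A rough sketch of FSDP usage, again assuming the process group has already been initialized and using a toy model:
# FSDP sketch: parameters, gradients, and optimizer state are sharded across
# ranks, so each GPU only holds a piece of the model at a time.
# Assumption: torch.distributed is already initialized (e.g., via torchrun).
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = torch.nn.Sequential(                       # toy model for illustration
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).cuda()
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)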
Backends
In [2], three types of backends are specified. In this context, a backend refers to the library or framework used for communication in distributed computation.
- Gloo
Gloo is a library developed by Facebook for deep learning and other large-scale distributed tasks. It works with both CPUs and GPUs and supports multi-node, multi-GPU training. (In the distributed-computing context, a node means an individual computer. [4])
- MPI (Message Passing Interface)
MPI is a standardized message-passing system designed for parallel computing. It is known to be highly portable.
- NCCL (NVIDIA Collective Communication Library)
NCCL (pronounced like 'nickel') is a library developed by NVIDIA. It provides optimized communication routines for multi-GPU and multi-node settings. Note that it is optimized specifically for NVIDIA hardware.
→ Rule of thumb [2]
- Use the NCCL backend for distributed GPU training
- Use the Gloo backend for distributed CPU training
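PyTorch also lets you check which backends your build supports, so the choice can be made in code (a small sketch):
# Check which backends this PyTorch build supports, then follow the rule of thumb.
import torch.distributed as dist

print("gloo available:", dist.is_gloo_available())
print("mpi available: ", dist.is_mpi_available())
print("nccl available:", dist.is_nccl_available())

backend = "nccl" if dist.is_nccl_available() else "gloo"   # NCCL for GPUs, Gloo for CPUs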
Now I understand what a backend is. But what exactly does torch.distributed do?
torch.distributed is a package within PyTorch that provides support for distributed training. In short, it leverages the different backends to communicate between processes.
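For example, once the process group exists, collectives such as all_reduce go through the chosen backend (a toy sketch; initialization is covered in the next section):
# Toy collective: every rank contributes its rank index; after all_reduce every
# rank holds the sum. With the NCCL backend the tensor must live on the GPU.
import torch
import torch.distributed as dist

t = torch.tensor([float(dist.get_rank())])
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()} sees {t.item()}")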
Initialization
The package should be initialized using torch.distributed.init_process_group() or torch.distributed.device_mesh.init_device_mesh() before calling any other methods. [2]
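As a sketch, the two initialization paths look like this (the 2-GPU mesh shape is an assumption matching my machine):
# Two ways to initialize before calling other torch.distributed methods (sketch).
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Option 1: create the default process group; with no rank/world_size arguments
# it uses env:// and reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE.
dist.init_process_group(backend="nccl")

# Option 2 (newer API): describe the GPUs as a device mesh; this sets up the
# default process group under the hood if it does not exist yet.
mesh = init_device_mesh("cuda", (2,))              # assumption: 2 GPUs on one node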
How does the original code initialize?
By using torchrun.sh. torchrun was introduced in PyTorch 1.10; it is a utility that simplifies launching distributed training jobs. It spawns the worker processes and sets the environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK) that init_process_group() reads, which is why running train.py directly led to the uninitialized process group error.
Solution
Include the torchrun.sh invocation in your shell script, or manually add the initialization call for the torch.distributed module (see the sketch after the snippet below).
bash torchrun.sh \
--nnodes=${NNODE} \
--nproc_per_node=${NUM_GPUS} \
--rdzv_backend=c10d \
tasks/train.py \
$(dirname $0)/config.py \
output_dir ${OUTPUT_DIR}
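If you go with the second option instead, a minimal sketch of the manual initialization for a single machine (the address, port, and default values are assumptions):
# Hypothetical manual setup (sketch): provide the values torchrun would normally
# set, then create the default process group before any other dist call.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # assumption: local rendezvous
os.environ.setdefault("MASTER_PORT", "29500")      # assumption: any free port
rank = int(os.environ.get("RANK", 0))              # 0 for a single-process run
world_size = int(os.environ.get("WORLD_SIZE", 1))

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend, rank=rank, world_size=world_size)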
References
[1] https://discuss.pytorch.org/t/getting-a-eerror-default-process-group-has-not-been-initialized-please-make-sure-to-call-init-process-group/73513/1
[2] https://pytorch.org/docs/stable/distributed.html
[3] https://pytorch.org/tutorials/beginner/dist_overview.html
[4] https://www.supermicro.com/en/glossary/distributed-computing#:~:text=A%20distributed%20system%20is%20a,its%20own%20set%20of%20data.