Error
"Default process group has not been initialized, please make sure to call init_process_group" [1]
- Environment
- Ubuntu 20.04.6 LTS
- CUDA 11.8
- Python 3.10
The error occurred when I tried to reimplement VideoChat2 on the server; I had modified the execution script because I wanted to launch the code on a single machine with 2 GPUs (or a single GPU).
The modified script is as follows:
python tasks/train.py \
$(dirname $0)/config.py \
output_dir ${OUTPUT_DIR}
The original script is as follows:
srun -p ${PARTITION} \
-n${NNODE} \
--gres=gpu:${NUM_GPUS} \
--ntasks-per-node=1 \
--cpus-per-task=${NUM_CPU} \
bash torchrun.sh \
--nnodes=${NNODE} \
--nproc_per_node=${NUM_GPUS} \
--rdzv_backend=c10d \
tasks/train.py \
$(dirname $0)/config.py \
output_dir ${OUTPUT_DIR}
Rationale
[1] explains that the default process group has to be set up first. My code does not use DDP directly but instead relies on the torch.distributed module, so I referred to [2], because the code in [1] initializes the default process group by executing
dist.init_process_group("gloo", rank=rank, world_size=world_size)
To fully understand [2], I had to refer to [3], which provides an overview of the PyTorch distributed module. It covers the parallelism APIs: DDP, FSDP, TP, and PP.
Data Parallelism
Data parallelism replicates the model across multiple GPUs, and each replica computes local gradients on its own subset of the data. Therefore, it does not reduce memory usage per GPU, but it can reduce training time.
We can adopt it by using the DistributedDataParallel (DDP) module in our code.
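A minimal sketch of what DDP usage looks like, assuming the script is launched with torchrun (the toy model here is mine, not VideoChat2's):
# DDP sketch (assumption: launched with torchrun, which sets RANK,
# WORLD_SIZE, and LOCAL_RANK in the environment).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")            # env:// rendezvous by default
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda(local_rank)   # full replica on every GPU
model = DDP(model, device_ids=[local_rank])

out = model(torch.randn(8, 10, device=f"cuda:{local_rank}"))
out.sum().backward()                               # gradients are all-reduced across ranks

dist.destroy_process_group()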
Model Parallelism (Sharded Data Parallelism)
When a model is too large to fit on a single GPU, I can split it into several pieces and distribute the shards across GPUs. This can be accomplished with FullyShardedDataParallel (FSDP).
Tensor Parallel (TP) and Pipeline Parallel (PP) seem to be other methods for achieving this.
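A rough sketch of FSDP usage, again assuming the process group has already been initialized and using a toy model:
# FSDP sketch: parameters, gradients, and optimizer state are sharded across
# ranks, so each GPU only holds a piece of the model at a time.
# Assumption: torch.distributed is already initialized (e.g., via torchrun).
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = torch.nn.Sequential(                       # toy model for illustration
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).cuda()
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)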
Backends
In [2], three types of backends are specified. In this context, a backend refers to the library or framework used for communication in distributed computation.
- Gloo
Gloo is a library developed by Facebook for deep learning and other large-scale distributed tasks. It works with both CPUs and GPUs and supports multi-node, multi-GPU training. (In the distributed-computing context, a node means an individual computer. [4])
- MPI (Message Passing Interface)
MPI is a standardized message-passing system designed for parallel computing. It is known to be highly portable.
- NCCL (NVIDIA Collective Communication Library)
NCCL (pronounced like 'nickel') is a library developed by NVIDIA. It provides optimized communication routines for multi-GPU and multi-node settings. Note that it is optimized specifically for NVIDIA hardware.
→ Rule of thumb [2]
- Use the NCCL backend for distributed GPU training
- Use the Gloo backend for distributed CPU training
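PyTorch also lets you check which backends your build supports, so the choice can be made in code (a small sketch):
# Check which backends this PyTorch build supports, then follow the rule of thumb.
import torch.distributed as dist

print("gloo available:", dist.is_gloo_available())
print("mpi available: ", dist.is_mpi_available())
print("nccl available:", dist.is_nccl_available())

backend = "nccl" if dist.is_nccl_available() else "gloo"   # NCCL for GPUs, Gloo for CPUs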
Now I understand what a backend is. But what exactly does torch.distributed do?
torch.distributed is a package within PyTorch that provides support for distributed training. In short, it leverages the different backends to communicate between processes.
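For example, once the process group exists, collectives such as all_reduce go through the chosen backend (a toy sketch; initialization is covered in the next section):
# Toy collective: every rank contributes its rank index; after all_reduce every
# rank holds the sum. With the NCCL backend the tensor must live on the GPU.
import torch
import torch.distributed as dist

t = torch.tensor([float(dist.get_rank())])
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()} sees {t.item()}")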
Initialization
The package should be initialized using torch.distributed.init_process_group() or torch.distributed.device_mesh.init_device_mesh() before calling any other methods. [2]
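As a sketch, the two initialization paths look like this (the 2-GPU mesh shape is an assumption matching my machine):
# Two ways to initialize before calling other torch.distributed methods (sketch).
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Option 1: create the default process group; with no rank/world_size arguments
# it uses env:// and reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE.
dist.init_process_group(backend="nccl")

# Option 2 (newer API): describe the GPUs as a device mesh; this sets up the
# default process group under the hood if it does not exist yet.
mesh = init_device_mesh("cuda", (2,))              # assumption: 2 GPUs on one node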
How does the original code initialize?
By using torchrun.sh. torchrun was introduced in PyTorch 1.10; it is a utility that simplifies launching distributed training jobs. It spawns the worker processes and sets the environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK) that init_process_group() reads, which is why running train.py directly led to the uninitialized process group error.
Solution
Include the torchrun.sh invocation in your shell script, or manually add the initialization call for the torch.distributed module (see the sketch after the snippet below).
bash torchrun.sh \
--nnodes=${NNODE} \
--nproc_per_node=${NUM_GPUS} \
--rdzv_backend=c10d \
tasks/train.py \
$(dirname $0)/config.py \
output_dir ${OUTPUT_DIR}
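If you go with the second option instead, a minimal sketch of the manual initialization for a single machine (the address, port, and default values are assumptions):
# Hypothetical manual setup (sketch): provide the values torchrun would normally
# set, then create the default process group before any other dist call.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # assumption: local rendezvous
os.environ.setdefault("MASTER_PORT", "29500")      # assumption: any free port
rank = int(os.environ.get("RANK", 0))              # 0 for a single-process run
world_size = int(os.environ.get("WORLD_SIZE", 1))

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend, rank=rank, world_size=world_size)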
References
[1] https://discuss.pytorch.org/t/getting-a-eerror-default-process-group-has-not-been-initialized-please-make-sure-to-call-init-process-group/73513/1
[2] https://pytorch.org/docs/stable/distributed.html
[3] https://pytorch.org/tutorials/beginner/dist_overview.html
[4] https://www.supermicro.com/en/glossary/distributed-computing#:~:text=A%20distributed%20system%20is%20a,its%20own%20set%20of%20data.