LLaMA 2 (and GPT-3) Usage Notes and Error Log

Distributed package doesn't have NCCL built in

Traceback (most recent call last):
  File "example_chat_completion.py", line 104, in <module>
    fire.Fire(main)
  File "/home/csjihwanh/Desktop/projects/sggVQA/llama/env/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/csjihwanh/Desktop/projects/sggVQA/llama/env/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/csjihwanh/Desktop/projects/sggVQA/llama/env/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example_chat_completion.py", line 35, in main
    generator = Llama.build(
  File "/home/csjihwanh/Desktop/projects/sggVQA/llama/llama/generation.py", line 85, in build
    torch.distributed.init_process_group("nccl")
  File "/home/csjihwanh/Desktop/projects/sggVQA/llama/env/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/csjihwanh/Desktop/projects/sggVQA/llama/env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/csjihwanh/Desktop/projects/sggVQA/llama/env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in

 

 

NCCL stands for NVIDIA Collective Communications Library; it provides the GPU communication primitives that DL libraries rely on to use GPUs efficiently. Since I'm working in a conda environment, it has to be installed into that environment.
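As a quick sanity check, torch can report whether the installed build ships NCCL at all (a minimal sketch using torch's public API):

import torch
import torch.distributed as dist

# Both return False on a CPU-only torch build, which is what this error indicates.
print(dist.is_nccl_available())
print(torch.cuda.is_available())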

 

Attempt 1

Installing it with conda install -c conda-forge nccl seemed like it should work. [1]

-> Not resolved.

 

Attempt 2

https://discuss.pytorch.org/t/runtimeerror-distributed-package-doesnt-have-nccl-built-in/176744/3

suggests installing a GPU-compatible PyTorch build, so I tried that.

For installation, refer to https://pytorch.org/get-started/locally/.

-> Not resolved.

 

Attempt 3

nvcc --version prints the base environment's nvcc, not the conda environment's, so the environment should be inspected with python -m torch.utils.collect_env instead.

Checking the torch version from Python, though, shows that the installed torch is a CPU-only (non-GPU) build.
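Roughly, the check from the Python REPL looks like this (a sketch; torch.version.cuda is None on CPU-only builds):

import torch

print(torch.__version__)   # CPU-only pip wheels carry a "+cpu" suffix, e.g. "2.1.1+cpu"
print(torch.version.cuda)  # None on a CPU-only build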

 

Uninstalling and reinstalling torch (this time a CUDA-enabled build) resolved it.
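For example (a sketch only, not the definitive command; pick the exact one for your OS and CUDA version from the selector at https://pytorch.org/get-started/locally/), inside the conda environment:

pip uninstall -y torch
pip install torch --index-url https://download.pytorch.org/whl/cu118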

 


RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

[W CUDAFunctions.cpp:108] Warning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (function operator())
Traceback (most recent call last):
  File "/home/csjihwanh/Desktop/projects/sggVQA/llama/example_chat_completion.py", line 104, in <module>
    fire.Fire(main)
  File "/home/csjihwanh/anaconda3/envs/llama2_mode/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/csjihwanh/anaconda3/envs/llama2_mode/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/csjihwanh/anaconda3/envs/llama2_mode/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/csjihwanh/Desktop/projects/sggVQA/llama/example_chat_completion.py", line 35, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "/home/csjihwanh/Desktop/projects/sggVQA/llama/llama/generation.py", line 85, in build
    torch.distributed.init_process_group("nccl")
  File "/home/csjihwanh/anaconda3/envs/llama2_mode/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/home/csjihwanh/anaconda3/envs/llama2_mode/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/csjihwanh/anaconda3/envs/llama2_mode/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1279, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(backend_prefix_store, group_rank, group_size, pg_options)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

 

It says that no GPUs were found.

 

Attempt 1

https://blog.csdn.net/xu823508091/article/details/122340013

 


Nothing from Googling matched my situation, so I followed the Chinese post linked above. It says to first check whether CUDA is available to torch; trying that produces the following error.

>>> torch.cuda.is_available()
/home/csjihwanh/anaconda3/envs/llama2_mode/lib/python3.11/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1699449183005/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False

 

In other words, the CUDA error comes from an incorrectly set-up environment.

 

(llama2_mode) csjihwanh@csjihwanh-System-Product-Name:~/Desktop/projects/sggVQA/llama$ nvidia-smi
Thu Nov 30 18:30:06 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   27C    P8    19W / 170W |    532MiB / 12045MiB |     32%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1728      G   /usr/lib/xorg/Xorg                244MiB |
|    0   N/A  N/A      1842    C+G   ...ome-remote-desktop-daemon      101MiB |
|    0   N/A  N/A      1882      G   /usr/bin/gnome-shell               40MiB |
|    0   N/A  N/A     41469    C+G   ...333830932431792870,262144      116MiB |
|    0   N/A  N/A     46163      G   gnome-control-center                2MiB |
+-----------------------------------------------------------------------------+

I figured the error could be a mismatch with the NVIDIA driver's CUDA version: nvidia-smi shows a 470 driver whose supported CUDA version is 11.4, while the installed torch is 2.1.1 built against CUDA 11.8.

 


torch.cuda.OutOfMemoryError: CUDA out of memory.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 11.75 GiB of which 93.31 MiB is free. Process 1828 has 104.12 MiB memory in use. Process 3393 has 62.85 MiB memory in use. Including non-PyTorch memory, this process has 11.14 GiB memory in use. Of the allocated memory 11.02 GiB is allocated by PyTorch, and 1.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-12-03 20:28:32,435] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5436) of binary: /home/csjihwanh/anaconda3/envs/llama2_mode/bin/python

 

This is a VRAM shortage issue. Meta's page lists the hardware requirements for the model.

The card I'm using is an RTX 3060 12GB; if cutting batch_size to the minimum doesn't help, the only fix is literally a different graphics card, so for now this can't be solved on my end.

 

Claims about the minimum required VRAM differ from community to community, anywhere from 8 GB to 20 GB. As a rough sanity check, the 7B model's weights alone occupy about 7B parameters × 2 bytes ≈ 14 GB in fp16, already more than 12 GB unless the model is quantized or partially offloaded.

 

-> Worked around it by loading the model from Hugging Face rather than running the weights downloaded directly from Meta.

-> In the end I switched to GPT-3 anyway.
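For reference, the Hugging Face route looks roughly like this (a minimal sketch; the dtype and device_map choices are my assumptions, the meta-llama checkpoints require accepting the license on the Hub first, and device_map="auto" needs the accelerate package installed):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision: ~14 GB of weights instead of ~28 GB
    device_map="auto",          # let accelerate place the weights on the available GPU(s)
)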


Tokenization format

* Tokenization of the string query

tokenized query tensor([[    1, 11699,   278, 17455,  1063,  5700, 29973],
        [    1,  1317,   540,  7432, 29973,     2,     2]]) <class 'torch.Tensor'>
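For reference, this output comes from a tokenizer call along these lines (a sketch; note that in LLaMA's vocabulary id 1 is BOS and id 2 is EOS, which doubles as the padding token here):

tokenized_query = self.tokenizer(query, return_tensors='pt', padding=True, truncation=True)
print(tokenized_query['input_ids'])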

 

* CLIP's output

embed tensor([[ 0.1096, -1.6534,  1.6312,  ...,  1.4251,  0.2069, -0.3935],
        [-0.1330, -1.3853,  1.4392,  ...,  1.3921,  0.1346, -0.1105]],

-> These are real-valued embeddings, not token ids, so they have to be turned into something the LLM's tokenizer can consume.

# LLaMA's tokenizer ships without a pad token; reuse EOS for padding
self.tokenizer.pad_token = self.tokenizer.eos_token

# Render each CLIP embedding vector as a space-separated string of numbers
embed_str = [" ".join(str(element) for element in sublist) for sublist in embed.tolist()]

# Concatenate each query with its stringified image embedding
# combined_input: list of strings, length batch_size
combined_input = [
    f"the query is : {a} \n\n and the image embedding is {b}"
    for a, b in zip(query, embed_str)
]
 

 

This is the final form of the input.
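Concretely, each batch element becomes one long prompt string of this shape (embedding values taken from the CLIP output above, abbreviated):

the query is : <query text> \n\n and the image embedding is 0.1096 -1.6534 1.6312 ...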

 

RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

This error occurs because float16 (Half) matrix multiplication is not implemented for CPU tensors, so a half-precision model can't run on the CPU.
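A minimal reproduction (assuming, as in my environment, a torch build without CPU float16 matmul kernels):

import torch

layer = torch.nn.Linear(4, 4).half()        # fp16 weights on CPU
x = torch.randn(1, 4, dtype=torch.float16)
layer(x)  # RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'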

 

https://life-is-potatoo.tistory.com/116

 

Following the link above, I tried the change below:

 

mix_model = MixModel()
mix_model = mix_model.float()  # cast every parameter back to float32

 

but in my case this didn't resolve it.

 

What did work was switching the LLM pipeline to

torch_dtype=torch.bfloat16,

since bfloat16 matmul, unlike float16, is implemented on the CPU.
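In context the construction looks roughly like this (a sketch; the model id is a placeholder for whatever checkpoint is being loaded):

import torch
import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model id
    torch_dtype=torch.bfloat16,             # implemented on CPU, unlike float16
)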


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

Wondering what had failed to land on CUDA, I realized the tokenizer is not an nn.Module, so it doesn't inherit the model's device placement. Normally the tokenized inputs would have to be moved to CUDA by hand after tokenization, but that isn't possible when going through a pipeline. So:

self.pipeline = transformers.pipeline(
    "text-generation",
    model=self.model,          # the model instance, already on cuda:0
    tokenizer=self.tokenizer,  # assumed: a tokenizer must accompany a model instance
    device=self.device,        # torch.device object; the pipeline moves inputs to it
)

Passing the model and the device to the pipeline this way resolved it; self.device is a torch.device object.
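For completeness, without a pipeline the same problem is fixed by moving the tokenized inputs over by hand (a sketch reusing the names above):

inputs = self.tokenizer(query, return_tensors='pt', padding=True)
inputs = {k: v.to(self.device) for k, v in inputs.items()}  # input_ids, attention_mask -> cuda:0
outputs = self.model.generate(**inputs)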

 

 


References

[1] https://anaconda.org/conda-forge/nccl
