RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons:
1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.
2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Error Specification
This error occurred when I tried to unfreeze the parameters of ViT. At first, I applied _set_static_graph()
to my DistributedDataParallel module; however, soon after applying the change, I got this:
RuntimeError: Your training graph has changed in this iteration, e.g., one parameter is unused in first iteration, but then got used in the second iteration. this is not compatible with static_graph set to True.
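For context, this is roughly how the workaround is applied. This is a minimal sketch with a stand-in model, not my actual training code (it assumes the script is launched with torchrun so that LOCAL_RANK is set):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])

model = torch.nn.Linear(16, 16).cuda(local_rank)  # stand-in for the actual ViT-based model
ddp_model = DDP(model, device_ids=[local_rank])

# The workaround suggested by the error message: declare the autograd graph static.
# In my case this only traded the first error for the "training graph has changed" one above.
ddp_model._set_static_graph()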
After reading the discussion in [1], I got the idea that the problem is related to the find_unused_parameters
parameter. According to [2], when this setting is True
, DDP figures out which parameters are involved in the backward pass and which are not.
When I ran this code with video-only data, there was no problem as long as find_unused_parameters
was set to True. Setting it to False, however, caused an error, although I can't recall the exact details.
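The flag itself is just a constructor argument of the DDP wrapper. A short sketch, reusing the names from the snippet above:

# find_unused_parameters=True makes DDP traverse the autograd graph from the model outputs
# after every forward pass and mark the parameters that did not participate, so their
# gradients are not waited for in the all-reduce. It adds overhead, but it tolerates
# frozen or conditionally-used parameters.
ddp_model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)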
The code has a checkpoint switch in its configuration file, but at first I couldn't find any clue that gradient checkpointing was actually used anywhere in the code.
A replier in [1] said that setting find_unused_parameters=True
for only the first model and leaving the other models set to False
would solve the problem. Others said that setting use_reentrant=True
would help, though at that point the code did not seem to use torch.utils.checkpoint
at all.
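If I read the first suggestion correctly, it amounts to wrapping each sub-model in its own DDP instance with different flags. A sketch with made-up sub-model names (I did not end up taking this route):

# Hypothetical two-model setup: only the model whose set of used parameters can change
# between iterations gets find_unused_parameters=True; the other keeps the cheaper default.
video_encoder = DDP(video_encoder, device_ids=[local_rank], find_unused_parameters=True)
text_encoder = DDP(text_encoder, device_ids=[local_rank], find_unused_parameters=False)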
According to [3], it is necessary to turn off gradient checkpointing. So I had to find out where gradient checkpointing is actually enabled...
gradient_checkpointing_enable
I finally found the part where the damn gradient checkpoint
is used. In the ViT code, there was a forward loop like this:
for idx, blk in enumerate(self.blocks):
    if self.use_checkpoint and idx < self.checkpoint_num:
        # Gradient checkpointing: recompute the activations of the first
        # `checkpoint_num` blocks during backward instead of storing them.
        x_vis = checkpoint.checkpoint(blk, x_vis)
    else:
        x_vis = blk(x_vis)
and use_checkpoint was set to True
by default. Simply turning it off would be the obvious fix, but that causes CUDA OOM in my case :(
use_reentrant
In [4], setting use_reentrant
to False is said to possibly help. In my case, I changed the checkpoint call to:
x_vis = checkpoint.checkpoint(blk, x_vis, use_reentrant=False)
It worked!
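For completeness, here is the same ViT forward loop with only that one change applied:

for idx, blk in enumerate(self.blocks):
    if self.use_checkpoint and idx < self.checkpoint_num:
        # The non-reentrant checkpoint implementation does not trip DDP's
        # "mark a variable ready only once" check the way the reentrant one does.
        x_vis = checkpoint.checkpoint(blk, x_vis, use_reentrant=False)
    else:
        x_vis = blk(x_vis)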
Conclusion
gradient_checkpointing
does not work correctly with DDP out of the box.
- Turning off the checkpointing may resolve your issue (if you can afford the extra memory).
- Or just setting use_reentrant=False
will solve the issue.
References
[1] https://discuss.pytorch.org/t/finding-the-cause-of-runtimeerror-expected-to-mark-a-variable-ready-only-once/124428
[2] https://pytorch.org/docs/stable/notes/ddp.html
[3] https://github.com/huggingface/accelerate/issues/389
[4] https://discuss.huggingface.co/t/ddp-gradient-checkpoint-crashes/58432/3