
RuntimeError: Expected to mark a variable ready only once.

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 
1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.
2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.

Error Specification

This error occurred when I tried to unfreeze the parameters of a ViT. At first, I applied _set_static_graph() to my DistributedDataParallel module, but soon after the change I got this:

RuntimeError: Your training graph has changed in this iteration, e.g., one parameter is unused in first iteration, but then got used in the second iteration. this is not compatible with static_graph set to True.
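
For reference, this is roughly how I had applied it. A minimal sketch; MyViT and local_rank stand in for my actual model and device setup:

from torch.nn.parallel import DistributedDataParallel as DDP

# assumes torch.distributed.init_process_group() has already been called
model = DDP(MyViT().cuda(local_rank), device_ids=[local_rank])
model._set_static_graph()  # promise DDP that the autograd graph never changes between iterations

Recent PyTorch versions expose the same switch as the static_graph=True constructor argument of DDP.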

 

After reading [1], I got an idea related to the find_unused_parameters option. According to [2], when this flag is set to True, DDP traverses the autograd graph after the forward pass to figure out which parameters are involved in the backward pass and which are not.
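
In code, this is just a constructor argument of DDP. A sketch, with model and local_rank as placeholders:

from torch.nn.parallel import DistributedDataParallel as DDP

# assumes torch.distributed is already initialized
model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)

The extra graph traversal costs some time every iteration, which is presumably why the flag defaults to False.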

 

When I ran this code with video-only data, there was no problem as long as find_unused_parameters was set to True. Setting it to False, however, caused an error, although I can't recall the exact details.

 

This codebase has a checkpoint switch in its configuration file, but at first I couldn't find any clue that the code actually uses gradient checkpointing.

 

A replier in [1] said that setting find_unused_parameters=True for only the first model, and leaving it False for the other models, would solve the problem. Some also said that setting use_reentrant=True would help, though at that point this code did not seem to use torch.utils.checkpoint.
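
If I understand that reply correctly, it means wrapping the sub-models separately, something like this (model_a and model_b are hypothetical names):

from torch.nn.parallel import DistributedDataParallel as DDP

# only the first model detects unused parameters; the others do not
model_a = DDP(model_a, device_ids=[local_rank], find_unused_parameters=True)
model_b = DDP(model_b, device_ids=[local_rank], find_unused_parameters=False)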

 

In [3], the fix was to turn off gradient checkpointing entirely. I guess I gotta find out where gradient checkpointing is used... searching for gradient_checkpointing_enable.
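
For reference, on a Hugging Face transformers model that switch is a pair of methods. A sketch, assuming a transformers-style API (which, as it turned out, this codebase does not use):

model.gradient_checkpointing_enable()   # turn gradient checkpointing on
model.gradient_checkpointing_disable()  # the "turn it off" workaround from [3]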

 

I finally found the part where the damn gradient checkpointing is used. In the ViT, there was code as follows:

 

# checkpoint the first `checkpoint_num` blocks to trade memory for recompute
for idx, blk in enumerate(self.blocks):
    if self.use_checkpoint and idx < self.checkpoint_num:
        # reentrant by default: recomputes the block inside a nested backward pass
        x_vis = checkpoint.checkpoint(blk, x_vis)
    else:
        x_vis = blk(x_vis)

and use_checkpoint was set to True by default. But turning checkpointing off, as [3] suggests, causes CUDA OOM in my case :(
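
For the record, turning it off would just mean flipping that flag when the model is built. A hypothetical sketch; the constructor argument mirrors the self.use_checkpoint attribute above:

vit = VisionTransformer(use_checkpoint=False)  # hypothetical constructor argument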

use_reentrant

In [4], setting use_reentrant to False is believed to be helpful. In my case, I tried:

checkpoint.checkpoint(blk, x_vis, use_reentrant=False)
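
In context, the patched ViT loop becomes:

for idx, blk in enumerate(self.blocks):
    if self.use_checkpoint and idx < self.checkpoint_num:
        # non-reentrant checkpointing keeps the recomputation inside a single
        # autograd graph, so DDP marks each parameter ready exactly once
        x_vis = checkpoint.checkpoint(blk, x_vis, use_reentrant=False)
    else:
        x_vis = blk(x_vis)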

 

It worked!

Conclusion

  • reentrant gradient checkpointing does not work correctly with DDP
  • turning off checkpointing may fix the issue, at the cost of activation memory
  • or simply setting use_reentrant=False will solve it (see the sketch below)
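
A minimal, single-process sketch of the working pattern (the DDP wrapping and distributed launch are omitted; BlockStack is a stand-in for the ViT blocks above):

import torch
import torch.nn as nn
from torch.utils import checkpoint

class BlockStack(nn.Module):
    """Stand-in for the ViT encoder blocks."""
    def __init__(self, depth=4, dim=16):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            # non-reentrant checkpointing: safe to combine with DDP
            x = checkpoint.checkpoint(blk, x, use_reentrant=False)
        return x

model = BlockStack()
out = model(torch.randn(2, 16))
out.sum().backward()  # each parameter is marked ready exactly once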

References

[1] https://discuss.pytorch.org/t/finding-the-cause-of-runtimeerror-expected-to-mark-a-variable-ready-only-once/124428
[2] https://pytorch.org/docs/stable/notes/ddp.html
[3] https://github.com/huggingface/accelerate/issues/389
[4] https://discuss.huggingface.co/t/ddp-gradient-checkpoint-crashes/58432/3
