I am running a 3D image segmentation deep learning training pipeline on a GCloud VM and am noticing a stepwise decrease in GPU utilization after about 25 epochs, followed by an out-of-memory error after 32 epochs. Since such a training pipeline is essentially the same loop over the data repeated every epoch, and since none of the other main metrics show a comparable change, I don't understand why the first epochs run fine and the problem then appears suddenly.
Could this be some kind of memory leak on the GPU? Could GCloud apply some kind of throttling based on the GPU temperature?
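One way to check the leak hypothesis is to log free GPU memory once per epoch and watch whether usage keeps creeping up; here is a minimal CUDA.jl sketch (the `log_gpu_memory` helper is just an illustrative name, not part of any package):

```julia
using CUDA

# Rough sketch only: log free vs. total device memory once per epoch to see
# whether used memory grows from epoch to epoch.
function log_gpu_memory(epoch)
    free  = CUDA.available_memory() / 2^30   # bytes -> GiB
    total = CUDA.total_memory() / 2^30
    used  = round(total - free, digits = 2)
    println("epoch $epoch: $used GiB used of $(round(total, digits = 2)) GiB")
end
```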
Some context info:
- I am using Julia 1.9.0 with FastAI.jl 0.5.1, Flux.jl 0.13.16, CUDA.jl 4.2.0
- The VM is Ubuntu 22.04 (x86_64) with CUDA toolkit 12.1, NVIDIA driver 530.30.02, and an NVIDIA Tesla T4 GPU with 16 GB of memory
- The model is a residual U-Net with approximately 9.5 million parameters, the input data are 3D Float32 images with size (96, 96, 96) and I am using a batch size of 2.
Some things I've tried:
- I can reproduce the behaviour reliably; it happens every time after the same number of epochs.
- If I decrease the input image size, it still happens, but later (around epoch 60).
- If I decrease the model size, it happens earlier (this I especially don't understand).
- I've set `JULIA_CUDA_MEMORY_POOL` to `none` and added a callback after each epoch that executes `GC.gc(true)` and `CUDA.reclaim()` (sketched below).
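The cleanup step looks roughly like this; shown as a plain loop rather than the actual FastAI.jl callback, so the training call is commented out and `nepochs` is just an illustrative value:

```julia
# Set before CUDA.jl is loaded/initialised, e.g. in the shell:
#   export JULIA_CUDA_MEMORY_POOL=none

using CUDA

nepochs = 40                      # illustrative value
for epoch in 1:nepochs
    # train_epoch!(model, data)   # placeholder for one FastAI.jl training epoch
    GC.gc(true)                   # force a full Julia garbage collection
    CUDA.reclaim()                # return cached GPU memory to the driver
end
```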

The problem was resolved by changing my optimiser from `Flux.Nesterov` to `Optimisers.Nesterov`, as suggested here. Apparently the Flux optimisers gather some kind of state, whereas the ones from Optimisers.jl do not.
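For completeness, the change looked roughly like this (a sketch using the explicit-style API; the `Chain` is a placeholder model, `grads` is a placeholder for gradients, and wiring the rule into the FastAI.jl Learner may differ):

```julia
using Flux, Optimisers

model = Chain(Dense(3 => 3))            # placeholder model

# Before: implicit-style Flux optimiser, which keeps its momentum state in an
# IdDict keyed by the parameter arrays.
# opt = Flux.Nesterov(0.001, 0.9)

# After: explicit Optimisers.jl rule; its state lives in a tree created once
# with `setup` and threaded through each update.
rule  = Optimisers.Nesterov(0.001, 0.9)
state = Optimisers.setup(rule, model)

# One explicit update step (given gradients `grads` from Zygote):
# state, model = Optimisers.update!(state, model, grads)
```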