MPSNDArray error when training a TensorFlow model on a MacBook with M2 chip

I'm training reinforcement learning models using TensorFlow (Python), but for the past few weeks I haven't been able to run my code on my MacBook Air (macOS Monterey 12.5) with an M2 chip.

I get this error:

/AppleInternal/Library/BuildRoots/20d6c351-ee94-11ec-bcaf-7247572f23b4/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:88: failed assertion `[MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[0] (7)'

When I run my code on Google Colab, I notice that a lot of RAM is being used and that the usage increases linearly over time. I don't know how this works on a Mac with an M2, but judging by the error, it looks like it could be a memory issue.

I'm trying to profile my code with the Fil profiler or the memory-profiler library, but they can't output anything at the end of the run because the program crashes. The only output I get is when I press Ctrl+C before the crash, and that doesn't tell me much because I never catch the moment the memory leak happens.

I've also tried debugging the code and manually checking the RAM usage at the beginning and end of each training iteration, but I don't see any trend there either. By training iteration, I mean a rollout of episodes + forward pass + gradient update. Usage seems to stay around 75% of RAM (up to 6.1 GB).

This is the code I run at each debug step (using the psutil library):

    import psutil

    # Take one snapshot so both figures come from the same reading
    mem = psutil.virtual_memory()
    # Percentage of RAM in use (the `percent` field, index 2)
    print('RAM memory % used:', mem.percent)
    # RAM in use, converted from bytes to GB (the `used` field, index 3)
    print('RAM Used (GB):', mem.used / 1e9)
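
Since the profilers lose their output when the process crashes, one option might be to write a memory reading to a file at every iteration and flush it immediately, so the readings taken before the crash are still on disk afterwards. The sketch below is only an illustration under assumptions: `train_step` is a hypothetical stand-in for one training iteration, `log_memory_per_iteration` and the CSV layout are made up for this example, and it uses the standard-library `tracemalloc`, which only sees Python-level allocations and not native TensorFlow or Metal buffers.

```python
import csv
import tracemalloc


def log_memory_per_iteration(train_step, n_iterations, log_path):
    """Run `train_step` repeatedly and append one memory reading per iteration.

    `train_step` is a hypothetical zero-argument callable standing in for one
    training iteration (rollout + forward pass + gradient update).
    """
    tracemalloc.start()
    try:
        with open(log_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["iteration", "current_bytes", "peak_bytes"])
            for i in range(n_iterations):
                train_step()
                # Python-level memory currently traced, and the peak so far
                current, peak = tracemalloc.get_traced_memory()
                writer.writerow([i, current, peak])
                # Flush right away so rows written so far survive a hard crash
                f.flush()
    finally:
        tracemalloc.stop()


if __name__ == "__main__":
    leaked = []  # simulated leak: grows by one buffer per iteration

    def fake_step():
        leaked.append(bytearray(100_000))

    log_memory_per_iteration(fake_step, n_iterations=5, log_path="memory_log.csv")
```

If the growth happens in native memory rather than in Python objects, the `tracemalloc` reading could be swapped for a whole-process figure such as `psutil.Process().memory_info().rss`, keeping the same write-and-flush pattern.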

Has anyone encountered this error? Or do you have any hints about what it means and what I should look for?

Thanks a lot!
