Understanding sudden drop in VM and/or persistent disk performance


I am using a GCP VM instance (n1-standard-96) with 4 T4 GPUs for running machine learning model inference. The model is a neural network which uses many input layers. I read the input layers from the persistent disk prior to moving the data to the GPUs and predicting.

I have noticed a significant slowdown in the inference time compared to a few weeks ago: a script that previously took about 1 hour to run now takes 7 hours. Note that this is using the exact same hardware, conda environment, input data, code, etc.

I am trying to understand what has changed and how to get back to the performance I was seeing previously. Looking at the monitoring, it appears that there has been a significant change in the performance of my persistent disk. Previously, it was attaining about 0.6-0.7 k IOPS and 41 MiB/s of throughput. Now, it is showing about 0.1-0.2 k IOPS and 4 MiB/s of throughput. Might this be the cause of the slowdown? What can I try to get back to the performance I was seeing before? Note that the persistent disk may not be the problem; I'm very open to other possible problems/solutions. Thanks in advance.


There is 1 answer

Ray John Navarro On

Disk performance can be affected by several factors. Check the disk type and size first, since persistent disk IOPS and throughput limits depend on both. Also look at the other components involved, such as the VM machine type and the I/O pattern of your workload. This guide on disk performance may be useful for your case.[1] I have also linked an overview of the different persistent disk types.[2]

[1] https://cloud.google.com/compute/docs/disks/performance

[2] https://cloud.google.com/compute/docs/disks/
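
To check whether the disk itself is the bottleneck, it can help to time a plain sequential read of one of the input files, outside the inference script, and compare the result with the throughput shown in monitoring. The following is only a rough sketch: the function name `measure_read_throughput` and the 1 MiB block size are my own choices for illustration, and the number it reports is not directly comparable to the IOPS figures in the console.

```python
import os
import sys
import time


def measure_read_throughput(path, block_size=1024 * 1024):
    """Read `path` sequentially and return (bytes_read, seconds, MiB_per_s).

    Hypothetical helper for illustration; block_size is an arbitrary choice.
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 disables Python's own buffer; the OS page cache still applies,
    # so a second run over the same file may be served from memory.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    mib = total / (1024 * 1024)
    mib_per_s = mib / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, mib_per_s


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python read_check.py /path/to/one/input/file
    n_bytes, seconds, rate = measure_read_throughput(sys.argv[1])
    print(f"{os.path.basename(sys.argv[1])}: {n_bytes} bytes "
          f"in {seconds:.2f} s ({rate:.1f} MiB/s)")
```

Run it on a file that has not been read recently, since cached data will make the disk look much faster than it is. If the measured rate is well above the roughly 4 MiB/s you see during inference, the disk is probably not the limiting factor and the slowdown is more likely elsewhere in the pipeline (for example, data loading or preprocessing feeding the disk reads slowly).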