MATLAB trainNetwork CNN training pauses intermittently at random iterations, then continues

I'm training a DnCNN network on a dataset of grayscale image patches that I've collected and aggregated into training and validation imageDatastore objects, and I'm using trainNetwork to run the training. With 50,000 training files and 5,000 validation files, every iteration appears to take about the same time: each mini-batch of 128 completes in less than 1 second before training moves on to the next one.
However, when I increase the imageDatastore objects passed to trainNetwork to 350,000 training files and 35,000 validation files, seemingly random iterations hang: a "paused" iteration takes 20-30 seconds longer than the normal ~1 second. The pauses are intermittent but frequent, and they significantly increase my training time. I don't understand why. RAM and GPU memory both have plenty of headroom during training, and changing the mini-batch size, learning rate, or optimizer (Adam, SGDM) does not eliminate the pauses. The problem appears to be directly related to the number of files in the imageDatastore objects used for training.
Has anyone dealt with this before? Is trainNetwork performing some kind of data cleanup that causes iterations to pause at random when the imageDatastore objects contain a large number of files?
Any insight would be greatly appreciated! Thanks


Joss Knight on 11 Aug 2022
Is the pause associated with a validation measurement being added to the training plot? With 7 times as much validation data it will take 7 times longer to take a validation measurement.
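As a rough check of this idea, the arithmetic below estimates how much work one validation measurement involves at each dataset size. It is only a sketch: the file counts and mini-batch size come from the question, the validation frequency of 50 iterations is the documented trainingOptions default and is assumed here because the question does not say what was used, and the variable names are illustrative.

```matlab
% Sizes from the question. validationFrequency = 50 is the trainingOptions
% default and is an assumption; the question does not state the value used.
miniBatchSize       = 128;
validationFrequency = 50;
numTrainFiles       = 350000;
numValFilesSmall    = 5000;
numValFilesLarge    = 35000;

% Training iterations per epoch (an incomplete final mini-batch is assumed dropped).
itersPerEpoch = floor(numTrainFiles / miniBatchSize);

% Forward passes needed for one validation measurement at each dataset size.
valBatchesSmall = ceil(numValFilesSmall / miniBatchSize);
valBatchesLarge = ceil(numValFilesLarge / miniBatchSize);

% Number of validation measurements in one epoch at the assumed frequency.
checksPerEpoch = floor(itersPerEpoch / validationFrequency);

fprintf("Iterations per epoch: %d\n", itersPerEpoch);
fprintf("Validation mini-batches per measurement: %d (small) vs %d (large)\n", ...
    valBatchesSmall, valBatchesLarge);
fprintf("Validation measurements per epoch: %d\n", checksPerEpoch);
```

If the pauses line up with the validation points on the training plot, one option is to validate less often, for example by passing "ValidationFrequency", itersPerEpoch to trainingOptions so that validation runs about once per epoch. Pauses that occur at other iterations would point to a different cause.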
Nicholas Hopkins on 12 Aug 2022
Joss, copy all, and thank you for the quick responses and troubleshooting tips. The iteration pausing is definitely an interesting deviation from normal training behavior once the datastores scale up in size. I'll analyze the imageDatastore objects using the troubleshooting tips you suggested and will hopefully update this thread with an explanation for this behavior. However, it may be a few days before I get back to running this code and focusing on this area of my research, so I'll hold off on accepting your answer until I've looked into what you suggested.
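One way to analyze a datastore for slow reads is to time each read call and look for outliers. The sketch below is an assumption about how such a check could look, not a reconstruction of the tips mentioned above. To stay self-contained it writes 20 small synthetic grayscale patches to a temporary folder; in practice the temporary folder would be replaced by the real training folder, and the patch size and file names here are made up.

```matlab
% Build a small throwaway dataset so the sketch runs on its own.
% Folder, file names, and the 50x50 patch size are hypothetical.
tmpDir = tempname;
mkdir(tmpDir);
numFiles = 20;
for k = 1:numFiles
    patch = uint8(randi([0 255], 50, 50));   % synthetic grayscale patch
    imwrite(patch, fullfile(tmpDir, sprintf('patch_%03d.png', k)));
end

imds = imageDatastore(tmpDir);

% Time every read individually; slow reads show up as outliers.
readTimes = zeros(numel(imds.Files), 1);
reset(imds);
k = 0;
while hasdata(imds)
    k = k + 1;
    t = tic;
    img = read(imds);        %#ok<NASGU> only the timing matters here
    readTimes(k) = toc(t);
end

fprintf("Read %d files: median %.4f s, max %.4f s\n", ...
    k, median(readTimes), max(readTimes));

% Remove the temporary files again.
rmdir(tmpDir, 's');
```

If the maximum read time on the real data is far above the median, slow file access (for example from a network drive or a cold disk cache) would be worth investigating alongside the validation timing.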

