Hi, I am trying to use a learning rate scheduler, but training never starts (the first iteration never begins).
I am training ImageNet with MirroredStrategy because the dataset is large, and I batch the dataset ahead of time:
for example, train_ds_preprocessed = train_ds.batch(256), using 4 GPUs. In the model.fit call I set the batch size to the same 256 I used when batching the dataset.
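For context, the batching looks roughly like this (train_ds / val_ds come out of my preprocessing code, which I have omitted here):

total_batch_size = 256  # global batch size; MirroredStrategy splits each batch
                        # across the 4 GPUs, so each replica sees 256 / 4 = 64 examples
train_ds_preprocessed = train_ds.batch(total_batch_size)
val_ds_preprocessed = val_ds.batch(total_batch_size)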
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
initial_lr = 0.01
num_epochs = 100
warmup_steps = 5

def lr_schedule(epoch):
    print(" lr_schedule if first")
    if epoch < warmup_steps:
        # linear warmup over the first `warmup_steps` epochs
        print("if in ")
        print("epoch : ", epoch)
        print("warmup_steps : ", warmup_steps)
        print("return : ", (epoch + 1) / warmup_steps * initial_lr)
        return (epoch + 1) / warmup_steps * initial_lr
    else:
        # linear decay from initial_lr down to 0 over the remaining epochs
        print("else in ")
        return initial_lr * (1.0 - (epoch - warmup_steps) / (num_epochs - warmup_steps))

learning_rate_scheduler = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
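Calling the schedule by hand (outside of fit, and ignoring its own debug prints) returns the values I expect, so the function itself seems fine:

for e in [0, 1, 4, 5, 50, 99]:
    print(e, "->", lr_schedule(e))
# 0 -> 0.002, 1 -> 0.004, 4 -> 0.01 (end of warmup),
# 5 -> 0.01, 50 -> ~0.00526, 99 -> ~0.000105 (linear decay)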
#%%
# strategy = tf.distribute.MirroredStrategy(devices=["GPU:0", "GPU:1", "GPU:2", "GPU:3"])
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # model building and compiling happen inside the strategy scope
    model_builder = MyModel(input_shape=input_image_shape, num_classes=num_classes,
                            defined_out_channels=defined_out_channels)
    model = model_builder.build()
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.summary()
    # keras.utils.plot_model(model, "001_iamgenet_mymodel.png", show_shapes=True)

history = model.fit(
    train_ds_preprocessed,
    epochs=100,
    batch_size=total_batch_size,
    validation_data=val_ds_preprocessed,
    callbacks=[checkpoint_callback, wandb_callbacks, learning_rate_scheduler]
)
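One possibly relevant detail: the first warning below complains about an assert_cardinality transformation in the pipeline. This diagnostic (run separately, not part of the training script) is how I check how many batches Keras thinks one epoch has:

# -1 means INFINITE_CARDINALITY, -2 means UNKNOWN_CARDINALITY
print(train_ds_preprocessed.cardinality().numpy())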
The messages stop at this point:
2023-08-26 07:18:12.707335: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:549] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
lr_schedule if first
if in
epoch : 0
warmup_steps : 5
return : 0.002
Epoch 1/100
INFO:tensorflow:batch_all_reduce: 77 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 77 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 77 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 77 all-reduces with algorithm = nccl, num_packs = 1
2023-08-26 07:18:35.281135: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8301
2023-08-26 07:18:36.819182: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8301
2023-08-26 07:18:37.407587: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-08-26 07:18:37.759992: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:630] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-08-26 07:18:38.781774: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8301
2023-08-26 07:18:39.815530: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8301
Is there any advice you can give me?
If I do not use the learning rate scheduler, and instead compile with
model.compile(optimizer='sgd', loss=...)
then the iterations run perfectly. Any advice?
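That is, the run that works uses the same compile call as above, only with the optimizer passed as a string (loss and metrics unchanged):

model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['accuracy'])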