Inconsistent Results in TensorFlow/Keras Training on Kubernetes Cluster


I have Python code (parallelized with Dask) that trains multiple neural networks on a Kubernetes cluster. I want deterministic results, so I seed each worker before training the networks. The problem shows up when I give all workers the same seed: they should then produce identical results, yet some workers generate different ones.
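For context, the overall pattern is roughly the following. This is a minimal, self-contained sketch rather than my actual code: the scheduler address, model, and data are placeholders made up to illustrate the per-worker seeding.

```python
import numpy as np
import tensorflow as tf
from dask.distributed import Client

def train_model(seed):
    # Seed Python, NumPy and TensorFlow RNGs, and request deterministic ops.
    tf.keras.utils.set_random_seed(seed)
    tf.config.experimental.enable_op_determinism()

    # Placeholder data and model standing in for the real training job.
    x = np.random.rand(256, 8).astype("float32")
    y = np.random.rand(256, 1).astype("float32")
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(x, y, epochs=5, batch_size=32, shuffle=False, verbose=0)
    return float(model.evaluate(x, y, verbose=0))

if __name__ == "__main__":
    client = Client("tcp://dask-scheduler:8786")   # hypothetical scheduler address
    futures = [client.submit(train_model, 0, pure=False) for _ in range(14)]
    for i, result in enumerate(client.gather(futures)):
        print(f"Worker {i}: Seed 0: result:: {result}")
```

With pure=False, Dask treats each submission as a distinct task, so all 14 runs are scheduled even though the arguments are identical.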

Example Output:

Worker 0:  Seed 0: result:: -2.4244613650037827 <-- running on Node A
Worker 1:  Seed 0: result:: -2.4244613650037827 <-- running on Node A
Worker 2:  Seed 0: result:: -2.4259607599960265 <-- running on Node X **
Worker 3:  Seed 0: result:: -2.4259607599960265 <-- running on Node Y **
Worker 4:  Seed 0: result:: -2.4259607599960265 <-- running on Node X **
Worker 5:  Seed 0: result:: -2.4244613650037827 <-- running on Node B
Worker 6:  Seed 0: result:: -2.4244613650037827 <-- running on Node B
Worker 7:  Seed 0: result:: -2.4244613650037827 <-- running on Node A
Worker 8:  Seed 0: result:: -2.4244613650037827 <-- running on Node C
Worker 9:  Seed 0: result:: -2.4244613650037827 <-- running on Node A
Worker 10: Seed 0: result:: -2.4244613650037827 <-- running on Node A
Worker 11: Seed 0: result:: -2.4244613650037827 <-- running on Node A
Worker 12: Seed 0: result:: -2.4244613650037827 <-- running on Node A
Worker 13: Seed 0: result:: -2.4259607599960265 <-- running on Node Y **

The workers producing different results are all running on the same nodes (nodes X and Y in the example above). Note that these workers produce identical results among themselves: pods running on nodes X and Y agree with each other, but their result differs from the one produced by pods on nodes A, B, and C.

The code runs on CPUs only; no GPUs are involved. I've tried to identify differences in processor type or architecture between nodes X and Y and nodes A, B, and C, but haven't been able to find any.
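One way to gather that information directly from the pods is sketched below. It assumes Linux workers that expose /proc/cpuinfo; the scheduler address is again a placeholder.

```python
import platform
from dask.distributed import Client

def cpu_report():
    # Collect CPU model name and instruction-set flags (e.g. AVX-512) on this worker.
    info = {"machine": platform.machine(), "model": None, "avx512": None}
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("model name") and info["model"] is None:
                    info["model"] = line.split(":", 1)[1].strip()
                elif line.startswith("flags") and info["avx512"] is None:
                    flags = line.split(":", 1)[1].split()
                    info["avx512"] = any(fl.startswith("avx512") for fl in flags)
    except OSError:
        pass
    return info

client = Client("tcp://dask-scheduler:8786")   # hypothetical scheduler address
for worker, report in client.run(cpu_report).items():
    print(worker, report)
```

client.run executes the function on every worker and returns a dict keyed by worker address, so the reports can be matched to the nodes in the output above. Differing instruction-set support across nodes (for example AVX-512 on only some of them) can change which vectorized CPU kernels TensorFlow selects at run time, which is one plausible source of small numerical drift between node groups.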

Additional information: I've verified that the data on all workers is identical immediately before the neural networks are created and trained. The random operations performed with NumPy and scikit-learn before initializing the networks are also consistent across workers. I seed TensorFlow and Keras with tf.keras.utils.set_random_seed(seed), and although only CPUs are used I have also enabled tf.config.experimental.enable_op_determinism().
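To make the data/RNG consistency check concrete, here is a hedged sketch (fingerprint and its arguments are illustrative names, not my actual code) that hashes the training arrays and the first RNG draws on each worker so the digests can be compared:

```python
import hashlib
import numpy as np
import tensorflow as tf

def fingerprint(seed, x_train, y_train):
    # Hash the training data plus the first NumPy and TensorFlow random draws
    # after seeding; identical inputs and RNG state yield identical digests.
    tf.keras.utils.set_random_seed(seed)
    h = hashlib.sha256()
    h.update(np.ascontiguousarray(x_train).tobytes())
    h.update(np.ascontiguousarray(y_train).tobytes())
    h.update(np.random.rand(8).tobytes())                # global NumPy stream
    h.update(tf.random.uniform([8]).numpy().tobytes())   # global TF stream
    return h.hexdigest()
```

If the digests agree across all workers immediately before model construction, the divergence has to originate inside model building or training rather than in the inputs or the seeded RNG streams.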

Environment: Python 3.9, TensorFlow 2.10.0 (with its bundled Keras)

Question: Has anyone encountered a similar issue, and what steps can I take to ensure consistent results among workers, especially when running on Kubernetes with CPU-only configurations?
