I am losing my mind. I am using the official tensorflow/tensorflow:1.13.1-gpu-py3 image to run very basic code, and for some reason it fails on me. I found out that it passes for low values of the first dimension and fails for higher ones. On my RTX 3090 (24 GB VRAM) it breaks at 17 and above and works for 16 and below. These should not be high numbers; the actual project I need to run needs 4000.
import tensorflow as tf

# Create a session configuration with GPU memory growth
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True

# Create a session with the configured options
with tf.Session(config=config) as sess:
    # Create two smaller random matrices
    matrix_a = tf.random.normal(shape=(17, 64), dtype=tf.float32)
    matrix_b = tf.random.normal(shape=(64, 128), dtype=tf.float32)

    # Perform a matrix multiplication using BLAS
    result = tf.matmul(matrix_a, matrix_b)

    # Run the operation to perform matrix multiplication
    output = sess.run(result)
    print("BLAS operation successful!")

    # Check the result
    print("Result:")
    print(output.shape)
When I run it, I get this:
2024-01-01 02:04:56.923916: E tensorflow/stream_executor/cuda/cuda_blas.cc:698] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(17, 64), b.shape=(64, 128), m=17, n=128, k=64
   [[{{node MatMul}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 9, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(17, 64), b.shape=(64, 128), m=17, n=128, k=64
   [[node MatMul (defined at <stdin>:7) ]]

Caused by op 'MatMul', defined at:
  File "<stdin>", line 7, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 2455, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 5333, in mat_mul
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(17, 64), b.shape=(64, 128), m=17, n=128, k=64
   [[node MatMul (defined at <stdin>:7) ]]
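The failure seems to depend only on the first dimension, so I narrowed it down with a sweep along those lines. Here is a minimal sketch of such a sweep (same TF 1.x session API as the repro above; catching tf.errors.InternalError is just my assumption about how the failure surfaces) in case anyone wants to check the threshold on their own GPU:

import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True

# Sweep the first dimension and report where the GEMM starts failing
for m in range(1, 33):
    tf.reset_default_graph()
    matrix_a = tf.random.normal(shape=(m, 64), dtype=tf.float32)
    matrix_b = tf.random.normal(shape=(64, 128), dtype=tf.float32)
    result = tf.matmul(matrix_a, matrix_b)
    with tf.Session(config=config) as sess:
        try:
            sess.run(result)
            print("m=%d ok" % m)
        except tf.errors.InternalError as e:
            print("m=%d failed: %s" % (m, e.message.splitlines()[0]))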
I was trying to run a Graph Neural Network sample that uses TF1, and I tracked the problem down to matmul; I could reproduce it with the short code above. I changed the container I use from tensorflow/tensorflow to the ones provided by NVIDIA: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow/tags