I am doing a so-called "tensor contraction" using the cublasSgemmStridedBatched API. I have a tensor A of shape 60000*20*9 and a tensor B of shape 9*32; both are row-major. By definition, C = A * B should give a result tensor C of shape 60000*20*32. The code I wrote is as follows:
int batch_count = 60000;
int M = 20;
int K = 9;
int N = 32;
cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.0;
float beta = 0.0;
int strideA = 20 * 9;
int strideB = 0;
int strideC = 20 * 32;
// A(60000 * 20 * 9) * B(9 * 32) = C(60000 * 20 * 32)
cublasStatus_t ret = cublasSgemmStridedBatched(
handle,
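CUBLAS_OP_N, // no explicit transpose: row-major B is read as column-major B^T
CUBLAS_OP_N, // no explicit transpose: row-major A is read as column-major A^T
N, // m: rows of B^T and of C^T
M, // n: columns of A^T and of C^T
K, // k: shared inner dimension
&alpha,
B.data<float>(), // first operand, already on the GPU
N, // lda: leading dimension of B^T (N x K)
strideB, // 0: the same B is reused for every batch
A.data<float>(), // second operand, already on the GPU
K, // ldb: leading dimension of each A^T (K x M)
strideA,
&beta,
C.data<float>(), // output, already on the GPU
N, // ldc: leading dimension of each C^T (N x M)
strideC,
batch_count);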
cublasDestroy(handle);
if(ret != CUBLAS_STATUS_SUCCESS){
printf("cublasSgemmStridedBatched failed %d line (%d)\n", ret, __LINE__);
}
The above code does not work: it keeps printing "cublasSgemmStridedBatched failed 7", which according to the manual corresponds to CUBLAS_STATUS_INVALID_VALUE. Any help or suggestions would be appreciated!
Here is a minimal version that works and tests the result:
It reports a maximum relative error of 2.5e-7.