Let's say I have two matrices, tf_t (shape: 5x3) and tf_b (shape: 3x3). I compute y_tf = tf.matmul(tf_t, tf_b) and then compute dy/dt using the tf.GradientTape API:
import tensorflow as tf

mat = [[0.8363, 0.4719, 0.9783],
       [0.3379, 0.6548, 0.3835],
       [0.7846, 0.9173, 0.2393],
       [0.5418, 0.3875, 0.4276],
       [0.0948, 0.2637, 0.8039]]
another_mat = [[0.43842274, -0.53439844, -0.07710262],
               [1.5658046, -0.1012345, -0.2744976],
               [1.4204658, 1.2609464, -0.43640924]]

tf_t = tf.Variable(tf.convert_to_tensor(mat))
tf_b = tf.Variable(tf.convert_to_tensor(another_mat))

with tf.GradientTape(persistent=True) as tape:  # persistent, so the tape can be queried again later
    tape.watch(tf_t)
    y_tf = tf.matmul(tf_t, tf_b)
    y_t0 = y_tf[0, 0]

dy_dx = tape.gradient(y_tf, tf_t)
print(dy_dx)
I am getting the matrix below as dy/dx:
tf.Tensor(
[[-0.17307831 1.1900724 2.245003 ]
[-0.17307831 1.1900724 2.245003 ]
[-0.17307831 1.1900724 2.245003 ]
[-0.17307831 1.1900724 2.245003 ]
[-0.17307831 1.1900724 2.245003 ]], shape=(5, 3), dtype=float32)
The above matrix does not look right, because of how the element y_tf[0,0] is computed.
Note : y_tf[0,0] = tf_t[0,0]*tf_b[0,0] + tf_t[0,1]*tf_b[1,0] + tf_t[0,2]*tf_b[2,0]
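(That element can be checked directly with a throwaway snippet like this, reusing mat and another_mat from above; t and b are just temporary names for the check:)
t = tf.constant(mat)
b = tf.constant(another_mat)
print((t[0, 0] * b[0, 0] + t[0, 1] * b[1, 0] + t[0, 2] * b[2, 0]).numpy())  # manual dot product
print(tf.matmul(t, b)[0, 0].numpy())                                        # same value from matmul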
If I perform
tape.gradient(y_t0, tf_t)
I get this matrix:
tf.Tensor(
[[0.43842274 1.5658046 1.4204658 ]
[0. 0. 0. ]
[0. 0. 0. ]
[0. 0. 0. ]
[0. 0. 0. ]], shape=(5, 3), dtype=float32)
The 1st row above is the 1st column of matrix tf_b, which makes sense given how matrix multiplication works, and if I were to sum up those numbers it would be 3.424693. However, the result I got as dy_dx has its first element dy_dx[0,0] as -0.17307831, which is the summation of the 1st row of tf_b (sum(tf_b[0,:]))!
So can anyone please explain how the gradient of y_tf[0,0] with respect to tf_t is reduced to -0.17307831 and not 3.424693?
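For reference, the two candidate numbers can be produced directly from tf_b (continuing from the code above):
print(tf.reduce_sum(tf_b[:, 0]).numpy())   # 3.424693    -- sum of the 1st column, what I expected
print(tf.reduce_sum(tf_b[0, :]).numpy())   # -0.17307831 -- sum of the 1st row, what dy_dx[0,0] actually is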
The question may appear similar to this one, but the answer I'm looking for is not addressed there with a clear picture.
The key notion to understand here is that tf.gradients computes the gradients of the sum of the output(s) with respect to the input(s). That is, dy_dx represents the scale by which the sum of all elements of y_tf changes as each element of tf_t changes.

So, if you take tf_t[0, 0], that value is used to compute y_tf[0, 0], y_tf[0, 1] and y_tf[0, 2], in each case with coefficients tf_b[0, 0], tf_b[0, 1] and tf_b[0, 2]. So, if I increased tf_t[0, 0] by one, the sum of y_tf would increase by tf_b[0, 0] + tf_b[0, 1] + tf_b[0, 2], which is the value of dy_dx[0, 0]. Continuing with the same reasoning, each value tf_t[i, j] is in fact multiplied by all the values in tf_b[j, :], so dy_dx is a repetition of the sums of the rows of tf_b.

When you compute the gradient of y_t0 with respect to tf_t, then changes in tf_t[0, 0] would change the sum of the result by a factor of tf_b[0, 0], so that is the value of the gradient in that case.
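To see this numerically, here is a minimal sketch using the same matrices as in the question; tape.jacobian is used at the end to get the full per-element derivatives, which is probably what you were expecting from tape.gradient:
import tensorflow as tf

tf_t = tf.Variable([[0.8363, 0.4719, 0.9783],
                    [0.3379, 0.6548, 0.3835],
                    [0.7846, 0.9173, 0.2393],
                    [0.5418, 0.3875, 0.4276],
                    [0.0948, 0.2637, 0.8039]])
tf_b = tf.Variable([[0.43842274, -0.53439844, -0.07710262],
                    [1.5658046, -0.1012345, -0.2744976],
                    [1.4204658, 1.2609464, -0.43640924]])

with tf.GradientTape(persistent=True) as tape:  # persistent: the tape is queried twice below
    y_tf = tf.matmul(tf_t, tf_b)

# Gradient of the implicit sum of y_tf: every row of dy_dx equals the row sums of tf_b.
dy_dx = tape.gradient(y_tf, tf_t)
print(tf.reduce_sum(tf_b, axis=1).numpy())  # [-0.17307831  1.1900724  2.245003 ]
print(dy_dx.numpy())                        # the same three numbers repeated in all 5 rows

# Full per-element derivatives, shape (5, 3, 5, 3): jac[i, j, k, l] = d y_tf[i, j] / d tf_t[k, l].
jac = tape.jacobian(y_tf, tf_t)
print(jac[0, 0].numpy())                    # gradient of y_tf[0, 0] alone: 1st row is tf_b[:, 0], rest zeros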