In inverse image projection, do I need to add or subtract the t vector to my camera coordinates after multiplying by R inverse?


I am currently working on the transformation between object, camera and world coordinates in an inverse image projection task. I have the following information available:

  • the image coordinates of an object in homogeneous form (u, v, 1)

  • the Euler angles, converted to a rotation matrix (R)

  • the translation vector (t), representing the distances between the camera and the GNSS receiver

  • the camera intrinsic matrix (K)

  • the GNSS position

To convert the image coordinates to world coordinates, I have followed the steps outlined in the literature. First, I computed the camera coordinates (Xc, Yc, Zc) by inverting the intrinsic matrix K:

λ * [Xc, Yc, Zc] = K^(-1) * [u, v, 1]
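
A minimal numpy sketch of this step; the K entries and the pixel below are hypothetical placeholders for your own calibration and detection:

```python
import numpy as np

# Hypothetical intrinsics; substitute your own calibration matrix.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

uv1 = np.array([350.0, 260.0, 1.0])  # homogeneous pixel coordinates (u, v, 1)

# Back-projection: this yields the camera-frame coordinates only up to
# the unknown scale λ, i.e. the direction of the viewing ray.
ray_cam = np.linalg.inv(K) @ uv1
```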

Next, I want to transform the camera coordinates (Xc, Yc, Zc) to world coordinates (X, Y, Z) by inverting the extrinsic matrix [R | t]. Is the following equation correct?

[X, Y, Z] = R^(-1) * ( [Xc, Yc, Zc] * t )

Here is where my confusion arises. In my specific case, the desired world coordinates correspond to the objects detected by the camera. Therefore, I believe that I should add the translation vector (t) to my camera coordinates as follows:

[X, Y, Z] = R^(-1) * ( [Xc, Yc, Zc] + t )

However, from my understanding of the documentation, it seems that t is typically subtracted instead:

[X, Y, Z] = R^(-1) * ( [Xc, Yc, Zc] - t )

I would appreciate clarification on whether my understanding is correct and whether I should add or subtract or even multiply the translation vector in my case.

It is important to note that the translation vector (t) I have was obtained by manually measuring the distances between the camera and the GNSS receiver; I did not obtain it through solvePnP. Do I need to use the solvePnP-generated t vector, or is the manually measured t vector sufficient for my purposes?

1 Answer

Answered by Lelouch:

The usual camera equation is given by

s * [u, v, 1] = K * (R * X + t)

where s is an arbitrary nonzero scalar, (u, v) is the pixel coordinate of your object, R and t are the extrinsic parameters of the camera, and K is the intrinsic matrix.

Hence you cannot, in general, recover the 3D position from the 2D position alone (which is intuitive); you can only find the direction of the ray that hit the corresponding pixel. This ambiguity is modeled by the scalar s.
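
As a sanity check, here is a minimal numpy sketch of this forward model; the function name and inputs are illustrative, not from the original post:

```python
import numpy as np

def project(K, R, t, X):
    """Forward model s * [u, v, 1] = K @ (R @ X + t): map a world point X to a pixel."""
    p = K @ (R @ X + t)
    return p[:2] / p[2]  # dividing by p[2] removes the arbitrary scale s
```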

Hence, if you want to recover the 3D position, you first have to solve for X:

X = R^(-1) * (s * K^(-1) * [u, v, 1] - t)
X = s * R^(-1) * K^(-1) * [u, v, 1] - R^(-1) * t

Now you have to use your "distances between the camera and the GNSS receiver", which I will denote by d. Do NOT confuse it with the t used above; they do not mean the same thing.

The distance between the camera center and the object is then d, and it is also the norm of the difference between the object position X and the camera center:

d = ||X - (-R^(-1) * t)|| = ||s * R^(-1) * K^(-1) * [u, v, 1]||

because -R^(-1) * t is the camera center expressed in world coordinates.

Hence you can find the value of s (more precisely, the absolute value of s: s could also be negative, but then the object would be behind the camera, which is probably not what you are looking for):

s = d / ||R^(-1) * K^(-1) * [u, v, 1]||

Now you can substitute s into the equation above to find the value of X:

X = d * (R^(-1) * K^(-1) * [u, v, 1]) / ||R^(-1) * K^(-1) * [u, v, 1]|| - R^(-1) * t
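
Putting the derivation together, a hedged numpy sketch; the function and variable names are assumptions, with R built from your Euler angles, t the extrinsic translation, and d your measured distance:

```python
import numpy as np

def backproject(K, R, t, uv, d):
    """Recover the world position X of the object at pixel uv,
    given its distance d from the camera center."""
    uv1 = np.array([uv[0], uv[1], 1.0])
    ray_world = np.linalg.inv(R) @ np.linalg.inv(K) @ uv1  # R^(-1) * K^(-1) * [u, v, 1]
    s = d / np.linalg.norm(ray_world)                      # scale recovered from the distance
    cam_center = -np.linalg.inv(R) @ t                     # camera center in world coordinates
    return s * ray_world + cam_center                      # X = s * ray - R^(-1) * t
```

Note that for a proper rotation matrix, np.linalg.inv(R) equals R.T, which is cheaper and numerically more stable.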

Feel free to check my computations; I wrote this very quickly.