As far as I know, for a specific policy \pi, temporal difference learning lets us compute the expected return when following that policy \pi, but what is the point of knowing the value of a specific policy?
Shouldn't we be trying to find the optimal policy for a given environment? What is the point of evaluating a specific \pi with temporal difference learning at all?
As you said, finding only the value function for a given policy is not very useful in the general case, where the goal is to find an optimal policy. However, several classical algorithms, such as SARSA or Q-learning, can be viewed as special cases of generalized policy iteration, where the most difficult part is finding the value function of a policy. Once you know the value function, it is easy to find a better policy, then find the value function of the newly computed policy again, and so on. Under some conditions, this process converges to the optimal policy.

In summary, temporal difference learning is a key step in other algorithms that allow us to find an optimal policy.
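To make the generalized policy iteration idea concrete, here is a minimal SARSA sketch in Python. It uses a small, purely hypothetical chain environment (states 0..4, reward 1 at the goal) that I made up for illustration; the point is only to show how the TD evaluation step (updating Q toward the bootstrapped target) is interleaved with policy improvement (acting epsilon-greedily with respect to the current Q), so the policy keeps improving toward the optimal one.

```python
import random

# Hypothetical 1-D chain environment for illustration:
# states 0..4, actions: 0 = left, 1 = right; reaching state 4 gives reward +1.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    """Move left or right; the episode ends with reward 1 at the goal state."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def epsilon_greedy(Q, state, epsilon):
    """Policy improvement step: act (mostly) greedily w.r.t. the current estimates."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

def sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # action-value estimates
    for _ in range(episodes):
        state = 0
        action = epsilon_greedy(Q, state, epsilon)
        done = False
        while not done:
            next_state, reward, done = step(state, action)
            next_action = epsilon_greedy(Q, next_state, epsilon)
            # TD policy-evaluation step: move Q(s, a) toward the bootstrapped target.
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q

if __name__ == "__main__":
    Q = sarsa()
    greedy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
    print("Greedy action per state:", greedy)  # should prefer action 1 (right)
```

After enough episodes, the greedy policy extracted from Q moves right in every state, i.e. the evaluation/improvement loop has converged to the optimal policy for this toy environment.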