TensorFlow Serving Prometheus metrics are unclear and latency is high


Background

We have deployed a service with a TensorFlow Serving (TFServing) container to Kubernetes, with server-side batching enabled. When the service receives an inference request, it calls TFServing over gRPC to run inference, and once it receives the result from TFServing, it returns the result to the service caller. In addition, we emit a few metrics, listed below:

  1. The duration between the service sending the request to TFServing and the service receiving the response from TFServing (let's call this "TFServing inference time"); see the timing sketch after this list.
  2. The Prometheus metrics provided by the TFServing framework, including :tensorflow:serving:batching_session:queuing_latency, :tensorflow:serving:request_latency, :tensorflow:serving:runtime_latency, and :tensorflow:serving:batching_session:processed_batch_size. (Reference)
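
For context, the "TFServing inference time" above is measured on the client side around the gRPC Predict call, roughly as in the sketch below. The endpoint, model name, signature name, and input tensor name are placeholders, not our real values.

```python
import time

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Placeholder endpoint and model/tensor names; substitute your own.
channel = grpc.insecure_channel("tf-serving.default.svc.cluster.local:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"
request.model_spec.signature_name = "serving_default"
request.inputs["input"].CopyFrom(tf.make_tensor_proto([[1.0, 2.0, 3.0]]))

start = time.perf_counter()
response = stub.Predict(request, timeout=1.0)  # blocking unary gRPC call
elapsed_ms = (time.perf_counter() - start) * 1000.0

# elapsed_ms is what we report as "TFServing inference time"; it includes
# request/response serialization and network time in addition to whatever
# TFServing measures internally.
print(f"TFServing inference time: {elapsed_ms:.2f} ms")
```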

Issues

Recently, we found that the "TFServing inference time", i.e. the inference latency, is high at high QPS (2000~4000 requests/second). Before we used TFServing, we used the TensorFlow library to load the model directly into the service's memory and run inference, and the P99 latency was less than 10ms; with TFServing, the P99 "TFServing inference time" is around 20ms. We would like to understand what is causing this.
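
For comparison, the previous in-process setup looked roughly like the sketch below (the SavedModel path and input name are placeholders); it avoids the gRPC hop, serialization, and the batching queue entirely.

```python
import time

import tensorflow as tf

# Placeholder path; previously the model was loaded directly into the
# service process instead of a separate TFServing container.
model = tf.saved_model.load("/models/my_model/1")
infer = model.signatures["serving_default"]

x = tf.constant([[1.0, 2.0, 3.0]])

start = time.perf_counter()
result = infer(input=x)  # keyword must match the signature's input name
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"In-process inference time: {elapsed_ms:.2f} ms")
```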

We compared that inference time with :tensorflow:serving:request_latency, which we thought should refer to the time TFServing takes to run inference. The Datadog screenshots are pasted below.

  - "TFServing inference time": P99 20ms
  - :tensorflow:serving:request_latency: P99 3.5ms
  - :tensorflow:serving:runtime_latency: P99 3.5ms
  - :tensorflow:serving:batching_session:queuing_latency: P99 600 microseconds

My assumption about how TFServing makes inferences with the server-side batching feature enabled is (a simplified sketch of this flow follows the list):

  1. The inference request is pushed into the TFServing queue.
  2. TFServing pulls multiple inference requests from the queue and runs inference on them in one shot.
  3. The results are returned to the TFServing callers.
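
As a mental model only (this is not TFServing's actual implementation), the batching loop behaves roughly like the sketch below. The knobs max_batch_size and batch_timeout_micros correspond to real fields in TFServing's batching configuration, but the values here are made up; the time a request spends waiting in the queue is what I would expect :tensorflow:serving:batching_session:queuing_latency to cover.

```python
import queue
import threading
import time

# Correspond to max_batch_size and batch_timeout_micros in TFServing's
# batching configuration; the values here are illustrative only.
MAX_BATCH_SIZE = 32
BATCH_TIMEOUT_SECONDS = 0.001  # 1000 microseconds

request_queue = queue.Queue()

def run_model(payloads):
    # Stand-in for one batched model execution (what runtime_latency covers).
    time.sleep(0.003)
    return [f"result for {p}" for p in payloads]

def batching_loop():
    while True:
        batch = [request_queue.get()]  # block until at least one request arrives
        deadline = time.monotonic() + BATCH_TIMEOUT_SECONDS
        # Keep filling the batch until it is full or the timeout expires;
        # the time a request waits here is its queuing latency.
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        payloads, events, holders = zip(*batch)
        outputs = run_model(payloads)
        for event, holder, output in zip(events, holders, outputs):
            holder.append(output)  # hand the result back to the waiting caller
            event.set()

threading.Thread(target=batching_loop, daemon=True).start()

def infer(payload):
    # Enqueue the request and wait for the batching loop to fill in the result.
    done, holder = threading.Event(), []
    request_queue.put((payload, done, holder))
    done.wait()
    return holder[0]

print(infer("request-1"))
```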

So if my understanding is correct, the "TFServing inference time" should equal :tensorflow:serving:request_latency + :tensorflow:serving:batching_session:queuing_latency. However, according to the screenshots pasted above, it apparently doesn't, and something unaccounted for is causing the 20ms latency.
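
To make the gap concrete with the P99 numbers above: :tensorflow:serving:request_latency (≈3.5 ms) + :tensorflow:serving:batching_session:queuing_latency (≈0.6 ms) is only about 4.1 ms, which leaves roughly 16 ms of the 20 ms end-to-end P99 unaccounted for.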

  1. Where does the 20ms of latency come from, and is it possible to emit the missing metrics?

  2. Does :tensorflow:serving:batching_session:queuing_latency refer to the time a message stays in the queue, or the time TFServing takes to push a message into the queue? If it's the latter, I guess we are missing the duration that a message stays in the TFServing queue?


There are 0 answers