How to Implement Asynchronous Request Handling in TorchServe for High-Latency Inference Jobs?

I'm currently developing a Rails application that interacts with a TorchServe instance for machine learning inference. The TorchServe server is hosted on-premises and equipped with 4 GPUs. We're working with stable diffusion models, and each inference request is expected to take around 30 seconds due to the complexity of the models.

Given the high latency per job, I'm exploring the best way to implement asynchronous request handling in TorchServe. The primary goal is to manage a large volume of incoming prediction requests efficiently without having each client blocked waiting for a response.

Here's the current setup and challenges:

  • Rails Application: This acts as the client sending prediction requests to TorchServe.
  • TorchServe Server: Running on an on-prem server with 4 GPUs.
  • Model Complexity: Due to stable diffusion processing, each request takes about 30 seconds.

I'm looking for insights or guidance on the following:

  1. Native Asynchronous Support: Does TorchServe natively support asynchronous request handling? If so, how can it be configured?
  2. Queue Management: If TorchServe does not support this natively, what are the best practices for implementing a queue system on the server side to handle requests asynchronously?
  3. Client-Side Implementation: Tips for managing asynchronous communication in the Rails application. Should I implement a polling mechanism, or are there better approaches?
  4. Resource Management: How to effectively utilize the 4 GPUs in an asynchronous setup to ensure optimal processing and reduced wait times for clients.

Any advice, experiences, or pointers to relevant documentation would be greatly appreciated. I'm aiming to make this process as efficient and scalable as possible, considering the high latency of each inference job.

Thank you in advance for your help!

1 Answer

VonC:

pytorch/serve issue 544 suggests some form of asynchronous request handling should be supported, through batching and setting up a larger job queue.

# TorchServe config.properties example (illustrative values)
job_queue_size=500              # enlarge the internal job queue
default_workers_per_model=4     # e.g. one worker per GPU
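
Note that batch_size and max_batch_delay are per-model settings. One way to apply them (a sketch, assuming the default management port 8081 and a hypothetical archive named stable_diffusion.mar) is to register the model through the management API:

# Register the model with batching enabled via the TorchServe management API
import requests

resp = requests.post(
    'http://torchserve_server:8081/models',
    params={
        'url': 'stable_diffusion.mar',  # assumption: your model archive name
        'batch_size': 4,                # requests aggregated into one GPU batch
        'max_batch_delay': 5000,        # ms to wait while filling a batch
        'initial_workers': 4,           # one worker per GPU
    },
)
print(resp.status_code, resp.text)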

But this question illustrates that Python's multiprocessing does not help much here.

So you might consider implementing a custom queue system. Tools like Redis (sudo apt-get install redis-server) can be used for managing the job queue.

The client side (the Rails application) would enqueue an inference request into the Redis queue.

# Rails application - enqueue a job and return its id to the caller
require 'redis'
require 'json'
require 'securerandom'

def enqueue_job(data)
  # In a real app, reuse a shared connection or connection pool instead
  redis = Redis.new(host: 'localhost', port: 6379, db: 0)
  job_id = SecureRandom.uuid
  job_data = { job_id: job_id, data: data }.to_json
  redis.rpush('torchserve_queue', job_data)
  job_id
end

On the server side, a separate worker process would continuously monitor the Redis queue and process new jobs.

# Python worker script - process jobs from the queue
import json

import redis
import requests

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def process_job(job_data):
    # Unpack job data
    job_id, data = job_data['job_id'], job_data['data']
    # Send the job to TorchServe and wait for the (long-running) inference
    response = requests.post('http://torchserve_server:8080/predictions/model', data=data)
    # One simple option: store the result under the job id so the client can fetch it later
    redis_client.set(f'result:{job_id}', response.content)

def main():
    while True:
        # blpop blocks until a job is available on the queue
        _, job_data_json = redis_client.blpop('torchserve_queue')
        job_data = json.loads(job_data_json)
        process_job(job_data)

if __name__ == '__main__':
    main()
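
Once the worker has stored the response under result:<job_id>, the result can be fetched by job id. A minimal sketch in Python (the Rails application would do the equivalent with the redis gem):

# Fetch a finished result by job id; returns None while the job is still running
import redis

def fetch_result(job_id):
    redis_client = redis.Redis(host='localhost', port=6379, db=0)
    return redis_client.get(f'result:{job_id}')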

The workflow would be:

  • the Rails application enqueues a job into the Redis queue.
  • a worker script continuously pulls jobs from the queue and processes them.
  • each job is sent to the TorchServe server for inference.
  • upon completion, the client can be notified (e.g., via a callback or another mechanism).
+-------------------+                 +--------------------+                 +---------------------+
| Rails Application |                 | Redis Queue        |                 | TorchServe Server   |
| (Client)          | 1. Enqueue job  | (Job Management)   | 2. Worker pulls |  - 4 GPUs           |
|                   | --------------> |                    |    job and      |  - Stable Diffusion |
|  - Enqueues job   |                 |  - Worker pulls    | --------------> |                     |
|    to queue       |                 |    and processes   |    processes it |  3. Notify client   |
|                   |                 |    jobs            |                 |     on completion   |
+-------------------+                 +--------------------+                 +---------------------+

The client application is never blocked while waiting for a response: it enqueues a job, receives a job id immediately, and later polls for (or is otherwise notified of) the stored result. The worker model provides a scalable way to process many inference requests by distributing them across the available TorchServe workers and GPUs.
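
To keep all four GPUs busy, TorchServe can run one worker per GPU (as in the registration sketch above) while the queue consumer keeps up to four requests in flight at once. A minimal sketch, assuming the worker script above is saved as worker.py; the consumers spend almost all their time waiting on HTTP responses, so plain threads are sufficient:

# Run several queue consumers so up to four requests are in flight at once,
# matching the four GPU-backed TorchServe workers.
import threading

from worker import main  # assumption: the worker script above lives in worker.py

NUM_CONSUMERS = 4  # one in-flight request per GPU

if __name__ == '__main__':
    threads = [threading.Thread(target=main) for _ in range(NUM_CONSUMERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()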