Increasing Throughput of Single-GPU Inference

Vasu Sharma
6 min read · Oct 22, 2024


Python web servers written with FastAPI can handle multiple requests concurrently through asynchronous processing, but GPU-intensive tasks such as ML model inference become the bottleneck: the requests end up being processed almost sequentially. This post explores some of the reasons for this, looks at potential solutions, and proposes an optimization based on GPU batching.

Experiments

CPython has a Global Interpreter Lock (GIL), which ensures that only one native thread can execute Python bytecode at a time, even on multi-core processors. For our use case, this means the compute-intensive GPU inference operation cannot run in parallel while multiple web requests are being handled asynchronously by the FastAPI ASGI server, making the entire system behave synchronously!
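To make this concrete, here is a minimal sketch of the kind of endpoint this post is about (the endpoint path, the toy model, and the get_tokens stub below are hypothetical stand-ins, not the actual service): even though the handler is async, the forward pass holds the GIL for its whole duration, so concurrent requests end up being handled one after another.

from fastapi import FastAPI, Request
import torch

app = FastAPI()

# Hypothetical stand-ins for the real model and tokenizer used in this post.
model = torch.nn.Linear(16, 4).eval()
if torch.cuda.is_available():
    model = model.to("cuda")

def get_tokens(text: str) -> torch.Tensor:
    # Dummy "tokenizer": returns a feature vector on the model's device.
    device = next(model.parameters()).device
    return torch.randn(1, 16, device=device)

@app.post("/predict")
async def predict(request: Request):
    body = await request.json()
    with torch.no_grad():
        # Tokenization and the GPU forward pass both hold the GIL and block the
        # event loop, so concurrent requests are effectively serialized.
        result = model(get_tokens(body["input"]))
    return {"output": result.tolist()}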

To get around this bottleneck, I tried several approaches; let's dive into some of them:

CUDA Streams

It occurred to me that even if we had a way to issue inference requests in parallel, how would the GPU handle them? Do we have GPU-side constructs analogous to CPU threads? I came across CUDA streams, which allow CUDA operations to run concurrently on the GPU. One way to implement this in code is with PyTorch's stream API:

import itertools
import time

import torch

curr = itertools.count()
REQS = 10  # Let's try to parallelise 10 requests on the GPU?
streams = [torch.cuda.Stream() for _ in range(REQS)]

# Inside the request handler: round-robin each request onto one of the streams.
with torch.no_grad():
    with torch.cuda.stream(streams[next(curr) % REQS]):
        inp = request["input"]
        st = time.perf_counter()
        result = model(get_tokens(inp))
        end = time.perf_counter()
        print("Time elapsed:", end - st)

But to my dismay, this didn't give me any performance improvement, and the system still behaved sequentially. I found related issues on GitHub where others hit the same thing; one hunch there was that the individual computations are not heavy enough to benefit from parallelism on the GPU (after all, streams only interleave operations, and the overlapping operations need to be close enough on the time axis to gain anything from this).

Multi-Processing

One classic hack in Python to escape the GIL is multi-processing (and it is the classic workaround for compute-intensive tasks as well), so it was worth a try.

import multiprocessing as mp
import time

def run(q, inp, model):
    q.put(model(get_tokens(inp)))

...

# Inside the request handler: run the inference in a separate process.
inp = request["input"]
st = time.perf_counter()
q = mp.Queue()
p = mp.Process(target=run, args=(q, inp, model))
p.start()
p.join()
en = time.perf_counter()
print("time elapsed:", en - st)
result = q.get()

Unfortunately, this didn't work out either. I wouldn't expect the GIL to be the culprit here, but something was certainly blocking the parallel GPU inference calls from succeeding. I am yet to figure this out completely (stay tuned for another Medium post on this :) ).

Breaking the GIL

What if we could get past the GIL entirely? Python 3.13 makes it possible to turn the GIL off. This is very new, and many libraries such as Hugging Face tokenizers are still catching up on it (as I write this).

Unfortunately, I was not able to test this fully with my current setup, which depends heavily on libraries that are not yet Python 3.13 compatible, so this exploration continues. (Stay tuned for another post if I try this out with a simpler model.)
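For reference, here is roughly how one would experiment with this once the dependencies catch up (a minimal sketch, assuming a free-threaded CPython 3.13 build is installed; the python3.13t interpreter name and the flags below follow the upstream conventions, not anything specific to my setup):

# Run with a free-threaded build, e.g.:
#   PYTHON_GIL=0 python3.13t app.py
# or
#   python3.13t -X gil=0 app.py
import sys
import sysconfig

# True if the interpreter was compiled with free-threading support.
print("Free-threaded build:", bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

# On 3.13 free-threaded builds, this reports whether the GIL is enabled at runtime.
if hasattr(sys, "_is_gil_enabled"):
    print("GIL enabled at runtime:", sys._is_gil_enabled())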

Request Batching on GPU

The reason we don't usually think about batching multiple inference requests on the GPU (like we do during training) is that different requests land on the server at different times, and you don't really want to starve the early requests while waiting for a batch to fill up.

Theoretically, though, if the web server sees a high enough throughput of incoming requests, we can accumulate them over a short interval and process them all at once on the GPU as a batch. This pays off whenever forming and processing the batch is faster than processing the same requests sequentially on the GPU. For instance (hypothetical numbers), if a single forward pass takes ~20 ms and a batch of five takes ~25 ms, five queued requests finish in ~25 ms instead of ~100 ms!

We just need to be a little careful with the accumulation interval we choose, so that earlier requests don't starve and we don't end up with worse latency than the sequential processing we started with.

The only way to know is to try it out:

At a high level, I simply keep a queue in memory and add requests to it as they arrive. A background thread busy-loops, checking whether there are any entries on the queue to process, forms a batch, and runs all of them through the GPU at once.

We could optionally wait inside the busy-loop, but under high throughput enough requests accumulate in the queue during the time a batch is being processed anyway. Skipping the deliberate wait also ensures that requests don't get starved when only one-off requests are hitting the server.

import asyncio
import queue
import uuid

request_queue = queue.Queue()
response = {}

@app.post("/process")
async def process(request: Request):
    body = await request.json()
    uid = uuid.uuid4()
    # Enqueue the request and poll until the background worker has published a result.
    request_queue.put({"id": uid, "inp": body["input"]})
    while uid not in response:
        await asyncio.sleep(0.01)
    resp = response[uid]
    del response[uid]
    return resp


def process_requests():
    while True:
        # time.sleep(0.01)  # optional deliberate wait?
        if request_queue.empty():
            continue
        # Drain everything currently in the queue into one batch.
        requests = []
        while not request_queue.empty():
            requests.append(request_queue.get())

        inputs = []
        with torch.no_grad():
            for request in requests:
                inputs.append(request["inp"])
            st = time.perf_counter()
            results = model(get_tokens_list(inputs).to("cuda"))
            en = time.perf_counter()
            # Publish each result under its request id for the handler to pick up.
            for i in range(len(requests)):
                response[requests[i]["id"]] = results[i]
            print("time elapsed:", en - st)

While this code can be optimized quite a bit with proper locks and thread-safe dictionaries, those optimizations are kept out of scope for this post!
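One detail not shown above is how process_requests actually gets started. A minimal sketch (using a daemon thread kicked off from a FastAPI startup hook is my assumption here, not necessarily how the original service wires it up):

import threading

@app.on_event("startup")
def start_batch_worker():
    # Run the batching busy-loop on a daemon thread so it doesn't block shutdown.
    threading.Thread(target=process_requests, daemon=True).start()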

Results

I definitely observed improvements this time under high QPS to the server, while the low-QPS performance numbers remained the same!

Under a steady load of 250 QPS, keeping all other variables (underlying deployment infrastructure, GPU type, number of pods in the replica set) the same, the results were:

Without queueing: the p50 response time was 160 ms and the p95 was 400 ms.

With queueing: the p50 response time was 100 ms and the p95 was 240 ms.

That is an improvement of roughly 40% in response times: not dramatic, but a great way to harness more of the GPU's capacity (most of the GPU memory remained empty after loading the model, and with sequential processing it wasn't getting used at all).

I didn't see the batch size ever go beyond 5 or 6 requests, and the inference time remained the same after batching (thanks to vectorized matrix multiplications on the GPU, it doesn't really make a difference if the batch dimension grows from 1 to 5 or 6).

Closing Thoughts

To be honest, I expected more than a 40% improvement (I had done some back-of-the-envelope maths beforehand, taking the single-request inference time, adding some overhead for a request missing a processing boat, and working out upper bounds), but it turns out a lot of time was spent by Python on context switching, busy waiting, and probably repeated key misses on the response dictionary. And the GIL was probably a showstopper here as well.

All in all, there are still improvements that can be made to this code to make it more efficient (applying proper threading best practices), and more exploration can be done along the lines of CUDA streams and breaking the GIL. I strongly suspect a combination of those two approaches would extract much more throughput from a single GPU. (Stay tuned for more such explorations.)

Thanks for reading. I'm looking forward to your comments, feedback, and interesting discussions/ideas on this.
