A server that sits in front of LiteLLM and queues requests.
LiteLLM has no queueing mechanism for incoming requests, so every request hits the inference endpoints immediately and concurrently. This is fine for hosted endpoints like OpenAI or Anthropic, but endpoints backed by inference servers such as llama.cpp are quickly overwhelmed.
This is a simple queueing server that sits in front of your LiteLLM server and reads the `model` header of incoming requests to route them to per-model queues. For example, you can limit the model `gpt-4.1` to 4 concurrent requests, or limit the model `Beepo-22B` on your llama.cpp backend to only 1 concurrent request.
This is not a perfect solution, and hopefully queueing will be added to LiteLLM itself soon. Every effort is made to keep streaming smooth.
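The real schema is whatever `config.sample.yaml` defines. Purely as an illustration of the idea, a `config.yaml` might look something like the sketch below, where every key name, the listen address, and the LiteLLM upstream URL are assumptions rather than the actual format:

```yaml
# Illustrative sketch only; the authoritative schema is config.sample.yaml.
# All key names and addresses below are assumptions, not the real format.
listen: 127.0.0.1:8080            # address litellm-queue listens on
upstream: http://127.0.0.1:4000   # your LiteLLM server

models:
  gpt-4.1:
    max_concurrent: 4             # hosted API: allow several in-flight requests
  Beepo-22B:
    max_concurrent: 1             # llama.cpp backend: one request at a time
```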
- Download the latest release from the releases tab
- Copy `config.sample.yaml` to `config.yaml`
- Start the `litellm-queue` server (see the sketch after this list)
- Update your reverse proxy for LiteLLM to point to the listen address of `litellm-queue`
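For reference, the copy-and-start steps might look like this; the exact invocation and the assumption that the binary reads `config.yaml` from its working directory are guesses, so defer to the release notes.

```sh
# Assumes the release archive was unpacked into the current directory and
# that litellm-queue picks up config.yaml from its working directory.
cp config.sample.yaml config.yaml
$EDITOR config.yaml     # set your per-model concurrency limits
./litellm-queue         # exact invocation may differ between releases
```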
Example Nginx Config:

```nginx
location ~ ^/(v1/)?(chat/)?completions {
    proxy_pass http://127.0.0.1:8080;
}
```
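To confirm that traffic flows through the queue, you can send an OpenAI-style chat completion request directly to the listen address used above; the API key and model name are placeholders.

```sh
# Request goes to litellm-queue, which queues it per model and forwards it
# to LiteLLM. Swap in a model name and key that your LiteLLM setup accepts.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-litellm-key" \
  -d '{"model": "gpt-4.1", "messages": [{"role": "user", "content": "Hello"}]}'
```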
An example systemd service is provided.
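If you would rather write your own unit than use the provided one, a minimal sketch could look like the following; the install path and working directory are assumptions.

```ini
# Minimal sketch; the example service shipped with the project is the
# authoritative version. Paths below are assumptions.
[Unit]
Description=litellm-queue
After=network.target

[Service]
WorkingDirectory=/opt/litellm-queue
ExecStart=/opt/litellm-queue/litellm-queue
Restart=on-failure

[Install]
WantedBy=multi-user.target
```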
To build from source, run:

```sh
./build.sh
```

- The compiled binary will be at `dist/litellm-queue-0.0.0-...`