vLLM

vLLM is a high-throughput inference engine for large language models. It implements PagedAttention, a memory management technique for the KV cache that cuts wasted GPU memory during inference, and provides an OpenAI-compatible API server out of the box.

https://docs.vllm.ai/

Why It Exists

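Naive transformer inference wastes GPU memory because each request's KV cache is pre-allocated as one contiguous region, typically sized for the maximum sequence length, so most of the reservation sits unused. PagedAttention instead stores the KV cache in fixed-size blocks that need not be contiguous (much like pages in OS virtual memory) and allocates them on demand. More requests then fit in GPU memory at once, which lets continuous batching keep larger batches running and raises throughput under concurrent load.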

The practical result: vLLM can serve significantly more requests per second per GPU than a naive implementation.
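
The paging idea can be illustrated with a toy allocator. This is a minimal sketch of block-table bookkeeping only, not vLLM's implementation: the class `PagedKVCache`, the helper `contiguous_slots_reserved`, the block size of 16 tokens, and the 2048-token maximum are all illustrative assumptions.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, not a vLLM setting)


def contiguous_slots_reserved(max_seq_len: int, num_requests: int) -> int:
    """Token slots reserved when every request pre-allocates max_seq_len."""
    return max_seq_len * num_requests


@dataclass
class PagedKVCache:
    """Toy block allocator: each request maps to a list of physical block ids."""
    num_blocks: int
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # request id -> [block ids]
    lengths: dict = field(default_factory=dict)       # request id -> tokens stored

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, request_id: str) -> None:
        """Account for one more token, taking a new block only when needed."""
        length = self.lengths.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if length % BLOCK_SIZE == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

    def slots_reserved(self) -> int:
        """Token slots currently held across all requests."""
        return sum(len(t) for t in self.block_tables.values()) * BLOCK_SIZE


cache = PagedKVCache(num_blocks=64)
for request_id, tokens in {"a": 20, "b": 5, "c": 33}.items():
    for _ in range(tokens):
        cache.append_token(request_id)

paged = cache.slots_reserved()
contiguous = contiguous_slots_reserved(max_seq_len=2048, num_requests=3)
print(paged, contiguous)  # 96 6144
```

In this toy setup the three requests hold six blocks (96 token slots) instead of three full-length reservations, and blocks freed by a finished request go straight back to the pool for the next one.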

Key Features

OpenAI-compatible REST API — drop-in replacement for the OpenAI completions and chat endpoints
Continuous batching — new requests join mid-flight rather than waiting for a full batch to finish
Quantisation — GPTQ, AWQ, bitsandbytes (int8/int4)
Tensor parallelism — split a model across multiple GPUs
Model support — Llama, Mistral, Qwen, Phi, Gemma, and most HuggingFace-compatible architectures
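
To make the continuous batching item above concrete, here is a small simulation that counts decode steps under two scheduling policies. It is a sketch under simplifying assumptions (one token per request per step, no prefill cost, unlimited memory); the function names and the request lengths are made up for illustration and do not reflect vLLM's scheduler.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps needed when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Steps needed when a freed slot is refilled at the next decode step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before decoding.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step yields one token per in-flight request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [8, 2, 2, 2, 8, 2]  # tokens to generate per request (made-up values)
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)  # 18 14
```

With static batching, short requests sit idle until the longest request in their batch finishes; refilling slots as they free up removes that idle time.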

Quick Start

pip install vllm

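python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --port 8000

To serve a quantised model, point --model at a checkpoint that has already been quantised in that format and add the matching flag, for example --quantization awq.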

Then use it like the OpenAI API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
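
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server from the Quick Start is listening on localhost:8000; the helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the standard OpenAI chat completions response shape.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (needs the server to be running, so it is left commented out):
# print(chat("http://localhost:8000", "meta-llama/Llama-3.2-3B-Instruct", "Hello"))
```

Because the endpoint follows the OpenAI schema, existing OpenAI client libraries can usually be pointed at the same base URL instead of hand-building requests.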

When to Use vLLM

You serve multiple users or concurrent requests
You need production-grade throughput
You need tensor parallelism across multiple GPUs
You want quantised model support with a clean API

For single-user or low-concurrency use with a constrained GPU, Ollama may be simpler to operate. vLLM’s advantage shows under concurrent load.

Related

vLLM documentation
Ollama — simpler alternative for single-user or low-concurrency use
LLM inference in the homelab
