vLLM

Thu, 14 May 2026 00:00:00 +0000

vLLM is a high-throughput inference engine for large language models. It implements PagedAttention — a memory management technique that dramatically improves GPU utilisation during inference — and provides an OpenAI-compatible API out of the box.

https://docs.vllm.ai/

Why It Exists

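Standard transformer inference wastes GPU memory because the KV cache for each request is pre-allocated as one contiguous region, typically sized for the maximum sequence length, so short requests strand most of their reservation. PagedAttention stores the KV cache in fixed-size blocks that need not be contiguous (like pages in OS virtual memory) and allocates them on demand. The reclaimed memory lets more requests share the GPU at once, which is what makes continuous batching pay off and raises throughput under concurrent load.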
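The paging idea can be sketched with a toy block allocator. This is an illustration only, not vLLM's implementation: the class name, the block size of 16 tokens, and the pool of 8 blocks are all assumptions chosen for the example.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, not a vLLM default claim)


class PagedKVCache:
    """Toy block allocator: hands out fixed-size blocks on demand."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical block ids
        self.lengths: dict[str, int] = {}  # request id -> tokens stored so far

    def append_token(self, request_id: str) -> None:
        """Record one more token, taking a new block only when the last one is full."""
        length = self.lengths.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if length % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


def contiguous_blocks_needed(max_len: int) -> int:
    """Blocks reserved up front when sizing for the maximum sequence length."""
    return math.ceil(max_len / BLOCK_SIZE)


cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("req-a")  # 20 tokens occupy 2 blocks
for _ in range(5):
    cache.append_token("req-b")  # 5 tokens occupy 1 block
```

With paging, the two requests hold three blocks between them, whereas a contiguous reservation sized for a 2048-token limit would claim `contiguous_blocks_needed(2048)` blocks per request regardless of how many tokens are actually generated.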

The practical result: vLLM can serve significantly more requests per second per GPU than a naive implementation.

Key Features

OpenAI-compatible REST API — drop-in replacement for the OpenAI completions and chat endpoints
Continuous batching — new requests join mid-flight rather than waiting for a full batch to finish
Quantisation — GPTQ, AWQ, bitsandbytes (int8/int4)
Tensor parallelism — split a model across multiple GPUs
Model support — Llama, Mistral, Qwen, Phi, Gemma, and most HuggingFace-compatible architectures
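To make the continuous batching point concrete, here is a small simulation that counts decode steps under two scheduling policies. It is a simplified model under stated assumptions (one token per request per step, fixed slot count, every request generates at least one token) and the function names and request lengths are made up; it does not reflect how vLLM's scheduler is written.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A waiting request takes a slot as soon as one frees up."""
    waiting = list(lengths)
    running: list[int] = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill free slots before every decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        # One decode step: every running request emits one token.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


request_lengths = [2, 8, 2, 8]  # tokens to generate per request (made-up numbers)
static_steps = static_batching_steps(request_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(request_lengths, batch_size=2)
```

In this toy workload the short requests no longer hold a slot idle while the long ones finish, so the continuous policy completes in fewer steps than the static one.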

Quick Start

pip install vllm

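python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-3.2-3B-Instruct \
 --port 8000

Add --quantization awq only when the model is an AWQ-quantised checkpoint; the stock Llama 3.2 weights are not, so the flag is omitted here.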

Then use it like the OpenAI API:

curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
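The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the Quick Start is listening on localhost:8000; the helper names `build_chat_request` and `chat` are invented for this example, and the response is read using the OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches --port 8000 in the server command
MODEL = "meta-llama/Llama-3.2-3B-Instruct"


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build the same POST request as the curl command."""
    payload = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(chat("Hello"))
```

Splitting request construction from sending keeps the payload easy to inspect without a live server.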

When to Use vLLM

Multi-user or concurrent inference load
Production-grade throughput matters
You need tensor parallelism across multiple GPUs
You want quantised model support with a clean API

For single-user or low-concurrency use with a constrained GPU, Ollama may be simpler to operate. vLLM’s advantage shows under concurrent load.

vLLM documentation
Ollama — simpler alternative for single-user or low-concurrency use
LLM inference in the homelab

GPU Inventory

Wed, 13 May 2026 00:00:00 +0000

GPU Catalog

Component ID	Manufacturer	Model	Quantity	VRAM	Interface	Compute Cap	Notes
GPU-001	NVIDIA/Dell	Quadro 600	3	1 GB	PCIe	sm_21
GPU-002	EVGA	GTX 770	1	2 GB	PCIe	sm_30	Requires 6+8 pin power

GPU Placement

Asset ID	Hostname	Component ID	Slot / Location	Role	Notes

GPU Overviews

Brief overviews of the GPUs in the inventory, covering their typical uses and characteristics.

NVIDIA Quadro 600 (e.g., GPU-001)

96 CUDA cores · 1GB DDR3 · low-profile · 40W TDP

The NVIDIA Quadro 600 is an entry-level professional graphics card from the Fermi generation (circa 2010-2011). Designed primarily for CAD, DCC (Digital Content Creation), and basic scientific visualization, it is not optimized for gaming workloads. Equipped with 1GB of VRAM and typically presented in a low-profile form factor, these cards are well-suited for providing display output in servers that lack integrated graphics, or for light compute tasks that can utilize NVIDIA’s CUDA architecture, though performance will be limited by their vintage.

EVGA GTX 770 (e.g., GPU-002)

1536 CUDA cores · 2GB GDDR5 256-bit · 3.2 TFLOPS · 230W TDP · Compute Capability sm_30

The NVIDIA GeForce GTX 770, sold in partner variants such as the EVGA GTX 770, was a high-end gaming graphics card released in 2013 on the Kepler architecture. With 2GB (or 4GB) of GDDR5 VRAM, it delivered strong performance for its era. In a homelab, a GTX 770 can be repurposed for video transcoding, for machine learning experiments on older CUDA toolchains that still support Kepler, or for graphical output on a dedicated workstation attached to a server. Its external power connectors (typically 6+8 pin) reflect its 230W TDP.

CUDA compatibility: sm_30 (Kepler) is below the minimum for most current ML tooling — PyTorch 2.x requires sm_37, vLLM requires sm_70, and pre-built Ollama packages target sm_50+. GPU-accelerated inference with off-the-shelf tools is unlikely without custom builds. CPU fallback is the practical path.
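The comparison in the paragraph above can be expressed as a small lookup. This is a sketch only: the minimums are copied from that paragraph rather than checked against each project's current release notes, and the helper names are hypothetical.

```python
# Minimum compute capability per tool, copied from the paragraph above.
MINIMUM_SM = {
    "PyTorch 2.x": "sm_37",
    "vLLM": "sm_70",
    "Ollama (pre-built)": "sm_50",
}


def parse_sm(tag: str) -> tuple[int, int]:
    """Turn 'sm_30' into (3, 0): the last digit is the minor version."""
    digits = tag.removeprefix("sm_")
    return int(digits[:-1]), int(digits[-1])


def supported_tools(gpu_sm: str) -> list[str]:
    """List the tools whose stated minimum the GPU meets."""
    gpu = parse_sm(gpu_sm)
    return [tool for tool, minimum in MINIMUM_SM.items() if gpu >= parse_sm(minimum)]


# Compute capabilities from the GPU Catalog table.
inventory = {"GPU-001": "sm_21", "GPU-002": "sm_30"}
results = {gpu_id: supported_tools(sm) for gpu_id, sm in inventory.items()}
```

For both catalogued cards the result list is empty, which matches the conclusion that CPU fallback is the practical path.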
