Ollama

Thu, 14 May 2026 00:00:00 +0000

Ollama is a local LLM runner. Single binary, model library, REST API. The fastest path from zero to a running model on your own hardware.

https://ollama.com/

What It Does

Ollama wraps llama.cpp (and other backends) with a clean CLI and REST API. You pull a model by name, it downloads the pre-quantized weights, and you’re serving completions from localhost. An OpenAI-compatible API is included.

ollama pull llama3.2:3b
ollama run llama3.2:3b

Or via API:

curl http://localhost:11434/api/generate \
 -d '{"model": "llama3.2:3b", "prompt": "Hello"}'
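The same call can be made from Python with only the standard library. This is a minimal sketch, assuming a local Ollama server on the default port shown above; it sets "stream" to false so the reply arrives as one JSON object rather than a stream of chunks. The helper names build_generate_request and generate are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address, as in the curl example


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/generate."""
    # "stream": False asks for a single JSON reply instead of chunked output.
    body = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, base_url: str = OLLAMA_URL,
             timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    req = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

With the server running and the model pulled, generate("llama3.2:3b", "Hello") should return the completion text.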

Key Concepts

Modelfile — a Dockerfile-like definition for customising a model: system prompt, parameters, template.

FROM llama3.2:3b
SYSTEM "You are a helpful assistant for infrastructure tasks."
PARAMETER temperature 0.7
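If Modelfiles are generated rather than hand-written, a small renderer is enough for the three instructions used above. The sketch below is illustrative only: render_modelfile is a hypothetical helper, and it deliberately rejects prompts containing double quotes or newlines rather than guessing at escaping rules.

```python
def render_modelfile(base: str, system: str, parameters: dict[str, object]) -> str:
    """Render a minimal Modelfile with FROM, SYSTEM and PARAMETER lines."""
    # Escaping is out of scope for this sketch, so refuse prompts that would need it.
    if '"' in system or "\n" in system:
        raise ValueError("this sketch only handles single-line prompts without double quotes")
    lines = [f"FROM {base}", f'SYSTEM "{system}"']
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2:3b",
    "You are a helpful assistant for infrastructure tasks.",
    {"temperature": 0.7},
)
print(text)
```

Saving the output as Modelfile and running ollama create infra-assistant -f Modelfile registers the customised model; infra-assistant is a made-up name for illustration.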

Model library — https://ollama.com/library — pre-quantized models ready to pull. Includes Llama, Mistral, Phi, Qwen, Gemma, and many others.

GGUF format — Ollama uses GGUF (llama.cpp’s model format). Q4_K_M quantization is the common balance of quality vs size.
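A rough way to see the size side of that trade-off is parameters times bits per weight, divided by eight. The figures below are assumptions for illustration: about 3.2 billion parameters for a "3B" model and roughly 4.85 bits per weight on average for Q4_K_M (it mixes block types, so the average sits above 4). The estimate covers weights only, not the KV cache or runtime overhead.

```python
def estimate_weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return n_params * bits_per_weight / 8


N_PARAMS = 3.2e9     # assumed parameter count for a "3B" model
Q4_K_M_BPW = 4.85    # assumed average bits per weight for Q4_K_M
FP16_BPW = 16.0      # unquantized half precision, for comparison

q4 = estimate_weight_bytes(N_PARAMS, Q4_K_M_BPW)
fp16 = estimate_weight_bytes(N_PARAMS, FP16_BPW)
print(f"Q4_K_M: {q4 / 1e9:.2f} GB, FP16: {fp16 / 1e9:.2f} GB")
# prints "Q4_K_M: 1.94 GB, FP16: 6.40 GB"
```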

GPU Support

Ollama detects and uses available GPUs automatically (NVIDIA CUDA, Apple Metal, AMD ROCm). Falls back to CPU if no GPU is available — slower but functional.

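For NVIDIA on Linux, the host needs the NVIDIA driver; Ollama ships its own CUDA runtime libraries, so a separate CUDA toolkit install is not normally required. When running in a container, the NVIDIA Container Toolkit must also be available to the container runtime. Under Talos, where nothing can be installed on the host directly, this means using the Talos NVIDIA system extensions together with the NVIDIA device plugin or GPU Operator.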
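To confirm that a loaded model actually landed on the GPU, the list of running models can be inspected. This sketch assumes the /api/ps endpoint returns a "models" array whose entries carry "size" and "size_vram" in bytes; treat those field names as assumptions to check against the API reference. The helper names are hypothetical.

```python
import json
import urllib.request


def gpu_fraction(model_entry: dict) -> float:
    """Fraction of a loaded model's memory held in VRAM (0.0 means CPU only)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def running_models(base_url: str = "http://localhost:11434") -> list[dict]:
    """Fetch the currently loaded models from /api/ps."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return json.load(resp).get("models", [])


def report(models: list[dict]) -> list[str]:
    """One human-readable line per loaded model."""
    return [f"{m.get('name', '?')}: {gpu_fraction(m):.0%} in VRAM" for m in models]
```

Calling report(running_models()) after a first request has loaded the model gives one line per model; a value well below 100% suggests part of the model fell back to CPU.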

When to Use Ollama

Single-user or development use
You want minimal setup overhead
Constrained GPU or CPU-only inference
Quick model experimentation before committing to a production stack

For concurrent multi-user serving or production throughput, vLLM is the better choice.

Ollama model library
vLLM — production serving alternative for concurrent workloads
LLM inference in the homelab

vLLM

Thu, 14 May 2026 00:00:00 +0000

vLLM is a high-throughput inference engine for large language models. It implements PagedAttention — a memory management technique that dramatically improves GPU utilisation during inference — and provides an OpenAI-compatible API out of the box.

https://docs.vllm.ai/

Why It Exists

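Standard transformer inference wastes GPU memory because the KV cache for each request is pre-allocated as one contiguous region sized for the maximum sequence length, even when most requests finish far short of it. PagedAttention manages the KV cache in fixed-size, non-contiguous blocks (like pages in OS virtual memory) and allocates them on demand, so more requests fit in memory at once. That is what makes continuous batching across requests effective and yields much higher throughput under concurrent load.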

The practical result: vLLM can serve significantly more requests per second per GPU than a naive implementation.
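The memory argument can be made concrete with a toy model. The sketch below is not vLLM's implementation; it only counts reserved token slots under the two strategies and shows a minimal block allocator with one block table per request. The block size of 16 tokens and the 2048-token maximum are assumed values for illustration.

```python
import math


def contiguous_slots(seq_lens: list[int], max_len: int) -> int:
    """Slots reserved when every request gets max_len tokens up front."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens: list[int], block_size: int) -> int:
    """Slots reserved when each request holds only the blocks it has started."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


class BlockAllocator:
    """Hands out fixed-size blocks from a shared pool, one block table per request."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, request_id: str) -> None:
        table = self.tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free.pop())  # blocks need not be adjacent
        self.lengths[request_id] = length + 1

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free.extend(self.tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


seq_lens = [30, 500, 120]  # tokens actually used by three requests
print(contiguous_slots(seq_lens, max_len=2048))  # 6144
print(paged_slots(seq_lens, block_size=16))      # 672
```

The three requests use 650 tokens in total, so paging wastes at most part of one block per request, while up-front allocation reserves nearly ten times what is needed.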

Key Features

OpenAI-compatible REST API — drop-in replacement for the OpenAI completions and chat endpoints
Continuous batching — new requests join mid-flight rather than waiting for a full batch to finish
Quantisation — GPTQ, AWQ, bitsandbytes (int8/int4)
Tensor parallelism — split a model across multiple GPUs
Model support — Llama, Mistral, Qwen, Phi, Gemma, and most HuggingFace-compatible architectures
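Continuous batching is easiest to see in a small simulation. The sketch below is a simplified model, not vLLM's scheduler: each request is just a number of tokens still to generate, every decode step emits one token for each running request, and prefill cost is ignored.

```python
from collections import deque


def static_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(output_lens), batch_size):
        total += max(output_lens[i:i + batch_size])
    return total


def continuous_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """A waiting request takes over a slot as soon as one frees up."""
    waiting = deque(output_lens)
    running: list[int] = []
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step yields one token per running request
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


lens = [100, 10, 10, 10]  # tokens to generate per request
print(static_batching_steps(lens, batch_size=2))      # 110
print(continuous_batching_steps(lens, batch_size=2))  # 100
```

In the static case the second batch cannot start until the 100-token request is done; in the continuous case the short requests cycle through the second slot while the long one is still running.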

Quick Start

pip install vllm

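# Add --quantization awq only when --model points at an AWQ-quantized checkpoint;
# the stock Llama-3.2-3B-Instruct weights are not AWQ, so the flag is omitted here.
python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-3.2-3B-Instruct \
 --port 8000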

Then use it like the OpenAI API:

curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
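Because the endpoint follows the OpenAI chat format, a client only needs a base URL and a model name. This is a minimal standard-library sketch; build_chat_request and chat are hypothetical helper names, and the base URL is assumed to include the /v1 prefix.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str,
                       user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST to <base_url>/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": user_message}]}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_message: str, timeout: float = 120.0) -> str:
    """Send one user message and return the assistant's reply text."""
    req = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

For the server started above, chat("http://localhost:8000/v1", "meta-llama/Llama-3.2-3B-Instruct", "Hello") returns the reply. Pointing the same function at http://localhost:11434/v1 with model llama3.2:3b should work against Ollama's OpenAI-compatible endpoint, which makes it easy to switch between the two.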

When to Use vLLM

Multi-user or concurrent inference load
Production-grade throughput matters
You need tensor parallelism across multiple GPUs
You want quantised model support with a clean API

For single-user or low-concurrency use with a constrained GPU, Ollama may be simpler to operate. vLLM’s advantage shows under concurrent load.

vLLM documentation
Ollama — simpler alternative for single-user or low-concurrency use
LLM inference in the homelab
