<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Serving on Backend Engineering Strategy Tools</title><link>https://backend-engineering-strategy-tools.github.io/site/tags/serving/</link><description>Recent content in Serving on Backend Engineering Strategy Tools</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Thu, 14 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://backend-engineering-strategy-tools.github.io/site/tags/serving/index.xml" rel="self" type="application/rss+xml"/><item><title>Ollama</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/ollama/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/ollama/</guid><description>&lt;p&gt;Ollama is a local LLM runner. Single binary, model library, REST API. The fastest path from zero to a running model on your own hardware.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://ollama.com/" target="_blank" rel="noopener"
 &gt;https://ollama.com/&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="what-it-does"&gt;What It Does
&lt;/h2&gt;&lt;p&gt;Ollama wraps llama.cpp (and other backends) with a clean CLI and REST API. You pull a model by name, Ollama downloads the pre-quantized weights from its library, and you&amp;rsquo;re serving completions from localhost. An OpenAI-compatible API is included.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama pull llama3.2:3b
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama run llama3.2:3b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Or via API:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl http://localhost:11434/api/generate &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -d &lt;span style="color:#e6db74"&gt;&amp;#39;{&amp;#34;model&amp;#34;: &amp;#34;llama3.2:3b&amp;#34;, &amp;#34;prompt&amp;#34;: &amp;#34;Hello&amp;#34;}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="key-concepts"&gt;Key Concepts
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Modelfile&lt;/strong&gt; — a Dockerfile-like definition for customising a model: system prompt, parameters, template.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;FROM llama3.2:3b
SYSTEM &amp;#34;You are a helpful assistant for infrastructure tasks.&amp;#34;
PARAMETER temperature 0.7
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;Model library&lt;/strong&gt; — &lt;a class="link" href="https://ollama.com/library" target="_blank" rel="noopener"
 &gt;https://ollama.com/library&lt;/a&gt; — pre-quantized models ready to pull. Includes Llama, Mistral, Phi, Qwen, Gemma, and many others.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GGUF format&lt;/strong&gt; — Ollama uses GGUF (llama.cpp&amp;rsquo;s model format). Q4_K_M quantization is a common balance of quality vs size.&lt;/p&gt;
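&lt;p&gt;To see why a 4-bit quantization matters on constrained hardware, a rough size estimate is parameters times bits per weight, divided by eight. The sketch below is back-of-envelope arithmetic only: the 3 billion parameter count is a round number chosen for illustration, and the 4.85 bits per weight used for Q4_K_M is an approximate, assumed figure (real GGUF files vary by architecture and also carry metadata).&lt;/p&gt;

```python
def estimate_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10**9 bytes).

    Ignores file metadata, the KV cache, and runtime overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Assumed, illustrative inputs: a 3B-parameter model.
N_PARAMS = 3e9
FP16_BITS = 16.0
Q4_K_M_BITS = 4.85  # approximate average, not an exact constant

fp16_gb = estimate_weight_size_gb(N_PARAMS, FP16_BITS)
q4_gb = estimate_weight_size_gb(N_PARAMS, Q4_K_M_BITS)

print(f"fp16:   {fp16_gb:.2f} GB")
print(f"Q4_K_M: {q4_gb:.2f} GB")
```

&lt;p&gt;Under these assumptions the quantized weights come out at roughly a third of the fp16 size, which is often the difference between fitting on a small GPU and not fitting at all.&lt;/p&gt;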
&lt;hr&gt;
&lt;h2 id="gpu-support"&gt;GPU Support
&lt;/h2&gt;&lt;p&gt;Ollama detects and uses available GPUs automatically (NVIDIA CUDA, Apple Metal, AMD ROCm). Falls back to CPU if no GPU is available — slower but functional.&lt;/p&gt;
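&lt;p&gt;For NVIDIA on Linux, the NVIDIA driver must be installed on the host; Ollama ships its own CUDA runtime libraries, so the full CUDA toolkit is not required. In a container, the GPU must also be exposed to the container, typically via the NVIDIA Container Toolkit. Under Talos, this means using the NVIDIA GPU Operator.&lt;/p&gt;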
&lt;hr&gt;
&lt;h2 id="when-to-use-ollama"&gt;When to Use Ollama
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Single-user or development use&lt;/li&gt;
&lt;li&gt;You want minimal setup overhead&lt;/li&gt;
&lt;li&gt;Constrained GPU or CPU-only inference&lt;/li&gt;
&lt;li&gt;Quick model experimentation before committing to a production stack&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For concurrent multi-user serving or production throughput, &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/vllm/" &gt;vLLM&lt;/a&gt; is the better choice.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="related"&gt;Related
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://ollama.com/library" target="_blank" rel="noopener"
 &gt;Ollama model library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/vllm/" &gt;vLLM&lt;/a&gt; — production serving alternative for concurrent workloads&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/llm-inference/" &gt;LLM inference in the homelab&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>vLLM</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/vllm/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/vllm/</guid><description>&lt;p&gt;vLLM is a high-throughput inference engine for large language models. It implements PagedAttention — a memory management technique that dramatically improves GPU utilisation during inference — and provides an OpenAI-compatible API out of the box.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://docs.vllm.ai/" target="_blank" rel="noopener"
 &gt;https://docs.vllm.ai/&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="why-it-exists"&gt;Why It Exists
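&lt;/h2&gt;&lt;p&gt;Standard transformer inference wastes GPU memory because the KV cache for each request is pre-allocated as one contiguous region sized for the maximum sequence length, so much of it sits unused or fragmented. PagedAttention manages the KV cache in fixed-size, non-contiguous blocks (like pages in OS virtual memory) that are allocated on demand. More requests fit in memory at once, which lets continuous batching deliver much higher throughput under concurrent load.&lt;/p&gt;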
&lt;p&gt;The practical result: vLLM can serve significantly more requests per second per GPU than a naive implementation.&lt;/p&gt;
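&lt;p&gt;The paging idea can be illustrated with a toy allocator. This is a minimal sketch of on-demand, block-based KV cache bookkeeping, not vLLM&amp;rsquo;s actual implementation: the class name, method names, and block sizes are all hypothetical, and no tensors are involved.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy bookkeeping for a KV cache split into fixed-size blocks."""

    num_blocks: int
    block_size: int = 16  # tokens per block (assumed value for illustration)
    free_blocks: list[int] = field(default_factory=list, init=False)
    block_tables: dict[str, list[int]] = field(default_factory=dict, init=False)
    token_counts: dict[str, int] = field(default_factory=dict, init=False)

    def __post_init__(self) -> None:
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, request_id: str) -> None:
        """Reserve room for one more token; allocate a block only when needed."""
        count = self.token_counts.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if count % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.token_counts[request_id] = count + 1

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(4):
    cache.append_token("req-a")  # fills req-a's first block
cache.append_token("req-b")      # req-b takes the next free block
cache.append_token("req-a")      # req-a grows into a non-adjacent block

print(cache.block_tables)
```

&lt;p&gt;Each request&amp;rsquo;s block table maps its logical token positions to whichever physical blocks happened to be free, so memory is claimed one block at a time instead of being reserved up front for the longest possible sequence.&lt;/p&gt;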
&lt;hr&gt;
&lt;h2 id="key-features"&gt;Key Features
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OpenAI-compatible REST API&lt;/strong&gt; — drop-in replacement for the OpenAI completions and chat endpoints&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous batching&lt;/strong&gt; — new requests join mid-flight rather than waiting for a full batch to finish&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantisation&lt;/strong&gt; — GPTQ, AWQ, bitsandbytes (int8/int4)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tensor parallelism&lt;/strong&gt; — split a model across multiple GPUs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model support&lt;/strong&gt; — Llama, Mistral, Qwen, Phi, Gemma, and most HuggingFace-compatible architectures&lt;/li&gt;
&lt;/ul&gt;
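&lt;p&gt;The effect of continuous batching can be sketched with a toy step counter. This is a simplified model, not vLLM&amp;rsquo;s scheduler: each request is reduced to a hypothetical number of decode steps, every running request produces one token per step, and prefill cost is ignored.&lt;/p&gt;

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A waiting request takes over a slot as soon as one frees up."""
    pending = list(lengths)
    running: list[int] = []  # remaining decode steps per active request
    steps = 0
    while pending or running:
        # Fill any free slots before the next decode step.
        while pending and batch_size > len(running):
            running.append(pending.pop(0))
        steps += 1
        # Every running request emits one token; finished ones leave the batch.
        running = [left - 1 for left in running if left > 1]
    return steps


# Assumed workload: two long and two short requests, two batch slots.
workload = [8, 2, 2, 8]
print("static:    ", static_batching_steps(workload, batch_size=2))
print("continuous:", continuous_batching_steps(workload, batch_size=2))
```

&lt;p&gt;In this made-up workload the short requests no longer hold a slot hostage while a long neighbour finishes, so the same work completes in fewer decode steps.&lt;/p&gt;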
&lt;hr&gt;
&lt;h2 id="quick-start"&gt;Quick Start
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;pip install vllm
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;python -m vllm.entrypoints.openai.api_server &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --model meta-llama/Llama-3.2-3B-Instruct &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --port &lt;span style="color:#ae81ff"&gt;8000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then use it like the OpenAI API:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl http://localhost:8000/v1/chat/completions &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -H &lt;span style="color:#e6db74"&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -d &lt;span style="color:#e6db74"&gt;&amp;#39;{&amp;#34;model&amp;#34;: &amp;#34;meta-llama/Llama-3.2-3B-Instruct&amp;#34;, &amp;#34;messages&amp;#34;: [{&amp;#34;role&amp;#34;: &amp;#34;user&amp;#34;, &amp;#34;content&amp;#34;: &amp;#34;Hello&amp;#34;}]}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="when-to-use-vllm"&gt;When to Use vLLM
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Multi-user or concurrent inference load&lt;/li&gt;
&lt;li&gt;Production-grade throughput matters&lt;/li&gt;
&lt;li&gt;You need tensor parallelism across multiple GPUs&lt;/li&gt;
&lt;li&gt;You want quantised model support with a clean API&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For single-user or low-concurrency use with a constrained GPU, &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/ollama/" &gt;Ollama&lt;/a&gt; may be simpler to operate. vLLM&amp;rsquo;s advantage shows under concurrent load.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="related"&gt;Related
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.vllm.ai/" target="_blank" rel="noopener"
 &gt;vLLM documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/ollama/" &gt;Ollama&lt;/a&gt; — simpler alternative for single-user or low-concurrency use&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/llm-inference/" &gt;LLM inference in the homelab&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>