<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Vllm on Backend Engineering Strategy Tools</title><link>https://backend-engineering-strategy-tools.github.io/site/tags/vllm/</link><description>Recent content in Vllm on Backend Engineering Strategy Tools</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Thu, 14 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://backend-engineering-strategy-tools.github.io/site/tags/vllm/index.xml" rel="self" type="application/rss+xml"/><item><title>vLLM</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/vllm/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/vllm/</guid><description>&lt;p&gt;vLLM is a high-throughput inference engine for large language models. It implements PagedAttention, a KV-cache memory management technique that cuts wasted GPU memory during inference, and provides an OpenAI-compatible API out of the box.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://docs.vllm.ai/" target="_blank" rel="noopener"
 &gt;https://docs.vllm.ai/&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="why-it-exists"&gt;Why It Exists
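&lt;/h2&gt;&lt;p&gt;Standard transformer inference wastes GPU memory because the KV cache for each request is pre-allocated as one contiguous region, typically sized for the longest sequence the request might reach. PagedAttention instead stores the KV cache in fixed-size pages that need not be contiguous (like OS virtual memory) and allocates them on demand. Less memory is wasted, so more requests fit in a batch, which makes continuous batching more effective and gives much higher throughput under concurrent load.&lt;/p&gt;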
&lt;p&gt;The practical result: vLLM can serve significantly more requests per second per GPU than an implementation that reserves a contiguous, maximum-length KV cache for every request.&lt;/p&gt;
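&lt;p&gt;The memory saving can be sketched with simple block arithmetic. The block size and maximum sequence length below are assumed values chosen for illustration, not vLLM defaults or measurements, and the function names are hypothetical. The sketch only counts reserved token slots per request; it does not model the actual allocator.&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical numbers for illustration; not vLLM defaults or measurements.
BLOCK_SIZE=16        # tokens per KV-cache page (assumed)
MAX_SEQ_LEN=4096     # tokens a contiguous allocator reserves per request (assumed)

# Pages needed to hold a sequence: ceil(tokens / BLOCK_SIZE).
blocks_needed() {
  echo $(( ($1 + BLOCK_SIZE - 1) / BLOCK_SIZE ))
}

# Token slots reserved for one request when the cache is pre-allocated contiguously.
contiguous_slots() {
  echo "$MAX_SEQ_LEN"
}

# Token slots reserved for one request when pages are allocated on demand.
paged_slots() {
  echo $(( $(blocks_needed "$1") * BLOCK_SIZE ))
}

# Compare the two reservations for a few hypothetical sequence lengths.
for tokens in 100 700 2500; do
  echo "tokens=$tokens contiguous=$(contiguous_slots) paged=$(paged_slots "$tokens")"
done
```

&lt;p&gt;Under these assumptions the paged scheme over-reserves by at most one partially filled page per request, while the contiguous scheme reserves the full maximum regardless of the actual length.&lt;/p&gt;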
&lt;hr&gt;
&lt;h2 id="key-features"&gt;Key Features
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OpenAI-compatible REST API&lt;/strong&gt; — drop-in replacement for the OpenAI completions and chat endpoints&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous batching&lt;/strong&gt; — new requests join mid-flight rather than waiting for a full batch to finish&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantisation&lt;/strong&gt; — GPTQ, AWQ, bitsandbytes (int8/int4)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tensor parallelism&lt;/strong&gt; — split a model across multiple GPUs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model support&lt;/strong&gt; — Llama, Mistral, Qwen, Phi, Gemma, and most HuggingFace-compatible architectures&lt;/li&gt;
&lt;/ul&gt;
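&lt;p&gt;As a rough sketch of how tensor parallelism is switched on, the helper below prints a server launch command instead of running it. The helper name is hypothetical and the GPU count of 2 is an assumption about the host; --tensor-parallel-size is the vLLM server flag that sets how many GPUs the model is split across.&lt;/p&gt;

```shell
#!/bin/sh
# Print (do not run) a server launch command that splits the model across GPUs.
# serve_cmd is a hypothetical helper; the GPU count is an assumption about the host.
serve_cmd() {
  model="$1"
  gpus="$2"
  port="$3"
  printf '%s\n' "python -m vllm.entrypoints.openai.api_server --model $model --tensor-parallel-size $gpus --port $port"
}

# Example: two GPUs, same model and port as the rest of this note.
serve_cmd meta-llama/Llama-3.2-3B-Instruct 2 8000
```

&lt;p&gt;The GPU count generally has to divide the model's attention head count evenly, so check the model configuration before picking a value.&lt;/p&gt;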
&lt;hr&gt;
&lt;h2 id="quick-start"&gt;Quick Start
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;pip install vllm
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;python -m vllm.entrypoints.openai.api_server &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --model meta-llama/Llama-3.2-3B-Instruct &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --port &lt;span style="color:#ae81ff"&gt;8000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then use it like the OpenAI API:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl http://localhost:8000/v1/chat/completions &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -H &lt;span style="color:#e6db74"&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -d &lt;span style="color:#e6db74"&gt;&amp;#39;{&amp;#34;model&amp;#34;: &amp;#34;meta-llama/Llama-3.2-3B-Instruct&amp;#34;, &amp;#34;messages&amp;#34;: [{&amp;#34;role&amp;#34;: &amp;#34;user&amp;#34;, &amp;#34;content&amp;#34;: &amp;#34;Hello&amp;#34;}]}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="when-to-use-vllm"&gt;When to Use vLLM
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Multi-user or concurrent inference load&lt;/li&gt;
&lt;li&gt;Production-grade throughput matters&lt;/li&gt;
&lt;li&gt;You need tensor parallelism across multiple GPUs&lt;/li&gt;
&lt;li&gt;You want quantised model support with a clean API&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For single-user or low-concurrency use with a constrained GPU, &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/ollama/" &gt;Ollama&lt;/a&gt; may be simpler to operate. vLLM&amp;rsquo;s advantage shows under concurrent load.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="related"&gt;Related
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.vllm.ai/" target="_blank" rel="noopener"
 &gt;vLLM documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/ollama/" &gt;Ollama&lt;/a&gt; — simpler alternative for single-user or low-concurrency use&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/llm-inference/" &gt;LLM inference in the homelab&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>