<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Gpu on Backend Engineering Strategy Tools</title><link>https://backend-engineering-strategy-tools.github.io/site/tags/gpu/</link><description>Recent content in Gpu on Backend Engineering Strategy Tools</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Thu, 14 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://backend-engineering-strategy-tools.github.io/site/tags/gpu/index.xml" rel="self" type="application/rss+xml"/><item><title>vLLM</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/vllm/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/vllm/</guid><description>&lt;p&gt;vLLM is a high-throughput inference engine for large language models. It implements PagedAttention — a memory management technique that dramatically improves GPU utilisation during inference — and provides an OpenAI-compatible API out of the box.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://docs.vllm.ai/" target="_blank" rel="noopener"
 &gt;https://docs.vllm.ai/&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="why-it-exists"&gt;Why It Exists
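&lt;/h2&gt;&lt;p&gt;Naive transformer inference wastes GPU memory because the KV cache for each request is pre-allocated as one contiguous region sized for the longest sequence the request might produce, so most of that reservation sits unused. PagedAttention instead stores the KV cache in fixed-size, non-contiguous pages (much like OS virtual memory) and allocates them on demand. The memory this frees lets more requests share the GPU at once, which is what makes continuous batching pay off under concurrent load.&lt;/p&gt;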
&lt;p&gt;The practical result: vLLM can serve significantly more requests per second per GPU than a naive implementation.&lt;/p&gt;
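&lt;p&gt;To make the saving concrete, here is a rough back-of-the-envelope sketch in shell arithmetic. Every number in it is an illustrative assumption, not a figure taken from vLLM or from any particular model: a hypothetical model with 28 layers, 8 KV heads, head dimension 128 and fp16 values, an 8192-token maximum length, a 300-token request, and a 16-token page size. It compares reserving the cache for the maximum length up front with reserving only the pages the request touches.&lt;/p&gt;

```shell
#!/bin/sh
# Back-of-the-envelope KV cache sizing. Every number here is an assumed,
# illustrative value, not a measurement from vLLM.

# Bytes of KV cache per token: keys and values (factor 2) for every layer.
kv_bytes_per_token() {
    layers=$1; kv_heads=$2; head_dim=$3; dtype_bytes=$4
    echo $(( 2 * layers * kv_heads * head_dim * dtype_bytes ))
}

# Tokens reserved when a request of $1 tokens is stored in pages of $2 tokens
# (rounds up to a whole number of pages).
paged_tokens() {
    tokens=$1; page=$2
    echo $(( (tokens + page - 1) / page * page ))
}

per_token=$(kv_bytes_per_token 28 8 128 2)   # hypothetical model shape, fp16
max_len=8192                                 # assumed maximum sequence length
used=300                                     # assumed actual request length
page=16                                      # assumed page size in tokens

contiguous=$(( max_len * per_token ))
paged=$(( $(paged_tokens "$used" "$page") * per_token ))

echo "per token:  $per_token bytes"
echo "contiguous: $(( contiguous / 1048576 )) MiB reserved"
echo "paged:      $(( paged / 1048576 )) MiB reserved"
```

&lt;p&gt;Under these assumptions the contiguous reservation is 896 MiB per request while the paged one is about 33 MiB. The gap depends entirely on how far real requests fall short of the maximum length.&lt;/p&gt;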
&lt;hr&gt;
&lt;h2 id="key-features"&gt;Key Features
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OpenAI-compatible REST API&lt;/strong&gt; — drop-in replacement for the OpenAI completions and chat endpoints&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous batching&lt;/strong&gt; — new requests join mid-flight rather than waiting for a full batch to finish&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantisation&lt;/strong&gt; — GPTQ, AWQ, bitsandbytes (int8/int4)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tensor parallelism&lt;/strong&gt; — split a model across multiple GPUs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model support&lt;/strong&gt; — Llama, Mistral, Qwen, Phi, Gemma, and most HuggingFace-compatible architectures&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="quick-start"&gt;Quick Start
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;pip install vllm
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;python -m vllm.entrypoints.openai.api_server &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --model meta-llama/Llama-3.2-3B-Instruct &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --port &lt;span style="color:#ae81ff"&gt;8000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then use it like the OpenAI API:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl http://localhost:8000/v1/chat/completions &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -H &lt;span style="color:#e6db74"&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -d &lt;span style="color:#e6db74"&gt;&amp;#39;{&amp;#34;model&amp;#34;: &amp;#34;meta-llama/Llama-3.2-3B-Instruct&amp;#34;, &amp;#34;messages&amp;#34;: [{&amp;#34;role&amp;#34;: &amp;#34;user&amp;#34;, &amp;#34;content&amp;#34;: &amp;#34;Hello&amp;#34;}]}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="when-to-use-vllm"&gt;When to Use vLLM
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;You are serving multiple users or concurrent requests&lt;/li&gt;
&lt;li&gt;Throughput per GPU matters in production&lt;/li&gt;
&lt;li&gt;You need tensor parallelism across multiple GPUs&lt;/li&gt;
&lt;li&gt;You want quantised model support behind an OpenAI-compatible API&lt;/li&gt;
&lt;/ul&gt;
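&lt;p&gt;For the multi-GPU case, the Quick Start launch command takes one extra flag. The sketch below is a dry run: it only prints the command it would execute, so it can be checked on a machine without a GPU. The flag --tensor-parallel-size sets how many GPUs the model is sharded across; the helper name vllm_launch_cmd and the choice of two GPUs are assumptions made for illustration.&lt;/p&gt;

```shell
#!/bin/sh
# Dry run: print a vLLM launch command instead of executing it.
# vllm_launch_cmd is a hypothetical helper, not part of vLLM.

vllm_launch_cmd() {
    model=$1; gpus=$2; port=$3
    printf '%s\n' "python -m vllm.entrypoints.openai.api_server --model $model --tensor-parallel-size $gpus --port $port"
}

# Assumed setup: two GPUs in one host, same model and port as the Quick Start.
vllm_launch_cmd meta-llama/Llama-3.2-3B-Instruct 2 8000
```

&lt;p&gt;Printing the command first makes it easy to review the flags before committing GPU memory to a real launch.&lt;/p&gt;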
&lt;p&gt;For single-user or low-concurrency use with a constrained GPU, &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/ollama/" &gt;Ollama&lt;/a&gt; may be simpler to operate. vLLM&amp;rsquo;s advantage shows under concurrent load.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="related"&gt;Related
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.vllm.ai/" target="_blank" rel="noopener"
 &gt;vLLM documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/ai/ollama/" &gt;Ollama&lt;/a&gt; — simpler alternative for single-user or low-concurrency use&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/llm-inference/" &gt;LLM inference in the homelab&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>GPU Inventory</title><link>https://backend-engineering-strategy-tools.github.io/site/homelab/inventory/gpu/</link><pubDate>Wed, 13 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/homelab/inventory/gpu/</guid><description>&lt;h1 id="gpu-catalog"&gt;GPU Catalog
&lt;/h1&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Component ID&lt;/th&gt;
 &lt;th&gt;Manufacturer&lt;/th&gt;
 &lt;th&gt;Model&lt;/th&gt;
 &lt;th&gt;Quantity&lt;/th&gt;
 &lt;th&gt;VRAM&lt;/th&gt;
 &lt;th&gt;Interface&lt;/th&gt;
 &lt;th&gt;Compute Capability&lt;/th&gt;
 &lt;th&gt;Notes&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;GPU-001&lt;/td&gt;
 &lt;td&gt;NVIDIA/Dell&lt;/td&gt;
 &lt;td&gt;Quadro 600&lt;/td&gt;
 &lt;td&gt;3&lt;/td&gt;
 &lt;td&gt;1 GB&lt;/td&gt;
 &lt;td&gt;PCIe&lt;/td&gt;
 &lt;td&gt;sm_21&lt;/td&gt;
 &lt;td&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;GPU-002&lt;/td&gt;
 &lt;td&gt;EVGA&lt;/td&gt;
 &lt;td&gt;GTX 770&lt;/td&gt;
 &lt;td&gt;1&lt;/td&gt;
 &lt;td&gt;2 GB&lt;/td&gt;
 &lt;td&gt;PCIe&lt;/td&gt;
 &lt;td&gt;sm_30&lt;/td&gt;
 &lt;td&gt;Requires 6+8 pin power&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
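&lt;p&gt;The compute capability column above can be checked against a tool's minimum with a few lines of shell. This is a minimal sketch that assumes capabilities are written as sm_NN, as in the table. The minimums are the ones quoted in the CUDA compatibility note on this page and may change between releases; the helper names cc_at_least and report are hypothetical.&lt;/p&gt;

```shell
#!/bin/sh
# Compare compute capability strings such as "sm_30" numerically.
# cc_at_least and report are hypothetical helpers; the minimums are the ones
# quoted in the CUDA compatibility note on this page, not authoritative values.

# Succeeds (exit 0) when capability $1 is at least minimum $2.
cc_at_least() {
    have=${1#sm_}   # strip the "sm_" prefix, leaving e.g. 30
    need=${2#sm_}
    [ "$have" -ge "$need" ]
}

# Print one line per tool for the given card.
report() {
    card=$1; cap=$2
    for entry in pytorch:sm_37 ollama:sm_50 vllm:sm_70; do
        tool=${entry%%:*}   # text before the colon
        min=${entry#*:}     # text after the colon
        if cc_at_least "$cap" "$min"; then
            echo "$card ($cap): $tool ok"
        else
            echo "$card ($cap): $tool needs $min"
        fi
    done
}

report GPU-001 sm_21
report GPU-002 sm_30
```

&lt;p&gt;Both catalog entries fall below every minimum listed, which matches the conclusion that CPU inference is the practical path for this hardware.&lt;/p&gt;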
&lt;hr&gt;
&lt;h1 id="gpu-placement"&gt;GPU Placement
&lt;/h1&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Asset ID&lt;/th&gt;
 &lt;th&gt;Hostname&lt;/th&gt;
 &lt;th&gt;Component ID&lt;/th&gt;
 &lt;th&gt;Slot / Location&lt;/th&gt;
 &lt;th&gt;Role&lt;/th&gt;
 &lt;th&gt;Notes&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h1 id="gpu-overviews"&gt;GPU Overviews
&lt;/h1&gt;&lt;p&gt;The sections below summarize each GPU in the inventory: what it was built for, its key specifications, and realistic homelab uses.&lt;/p&gt;
&lt;h3 id="nvidia-quadro-600-eg-gpu-001"&gt;NVIDIA Quadro 600 (e.g., GPU-001)
&lt;/h3&gt;&lt;p&gt;&lt;em&gt;96 CUDA cores · 1GB DDR3 · low-profile · 40W TDP&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The NVIDIA Quadro 600 is an entry-level professional graphics card from the Fermi generation (circa 2010-2011). It was designed for CAD, digital content creation (DCC), and basic scientific visualization rather than gaming. With 1GB of VRAM and a low-profile form factor, these cards are well suited to providing display output in servers that lack integrated graphics. They can also run light CUDA workloads, but only on legacy software: CUDA 9 dropped support for Fermi (sm_2x), so current toolkits no longer target them.&lt;/p&gt;
&lt;h3 id="evga-gtx-770-eg-gpu-002--nvidia-specs"&gt;EVGA GTX 770 (e.g., GPU-002) — &lt;a class="link" href="https://www.nvidia.com/en-us/geforce/graphics-cards/geforce-gtx-770/specifications/" target="_blank" rel="noopener"
 &gt;NVIDIA specs&lt;/a&gt;
&lt;/h3&gt;&lt;p&gt;&lt;em&gt;1536 CUDA cores · 2GB GDDR5 256-bit · 3.2 TFLOPS · 230W TDP · Compute Capability sm_30&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The NVIDIA GeForce GTX 770, represented in this inventory by an EVGA board, was a high-end gaming card released in 2013 on the Kepler architecture. It shipped with 2GB or 4GB of GDDR5; the unit in this inventory has 2GB. In a homelab it can be repurposed for video transcoding, small machine learning experiments on legacy software stacks, or display output for a workstation. The external power connectors (typically 6+8 pin) reflect its 230W TDP.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;CUDA compatibility&lt;/strong&gt;: sm_30 (Kepler) is below the minimum for most current ML tooling — PyTorch 2.x requires sm_37, vLLM requires sm_70, and pre-built Ollama packages target sm_50+. GPU-accelerated inference with off-the-shelf tools is unlikely without custom builds. CPU fallback is the practical path.&lt;/p&gt;</description></item></channel></rss>