<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Slurm on Backend Engineering Strategy Tools</title><link>https://backend-engineering-strategy-tools.github.io/site/tags/slurm/</link><description>Recent content in Slurm on Backend Engineering Strategy Tools</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Fri, 15 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://backend-engineering-strategy-tools.github.io/site/tags/slurm/index.xml" rel="self" type="application/rss+xml"/><item><title>ASGARD — the blade cluster</title><link>https://backend-engineering-strategy-tools.github.io/site/homelab/asgard-blades/</link><pubDate>Fri, 15 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/homelab/asgard-blades/</guid><description>&lt;p&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/inventory/systems/" &gt;ASGARD (SYS-007)&lt;/a&gt; is the HP BladeSystem C7000 with 16× BL460c Gen8 blades. The reason to use it is profile switching: boot a blade as a Slurm compute node, run the experiment, reimage it as a Talos worker, run the next one. The same iPXE boot menu already set up for &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/talos-omni/" &gt;ODEN&lt;/a&gt; works here — the C7000 Onboard Administrator lets you configure boot order per blade slot, so switching roles is a BIOS setting and a PXE entry, not a reinstall.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="power-reality"&gt;Power reality
&lt;/h2&gt;&lt;p&gt;Before committing to blades as the permanent always-on platform, it&amp;rsquo;s worth being honest about the enclosure overhead. The C7000 has fixed costs regardless of how many blades are populated: 10 fans, dual OA modules, 2 interconnect switches, backplane management. It doesn&amp;rsquo;t scale down gracefully.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Setup&lt;/th&gt;
 &lt;th&gt;Approx power&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;C7000 enclosure alone (no blades)&lt;/td&gt;
 &lt;td&gt;200–400W&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;C7000 + 1 blade&lt;/td&gt;
 &lt;td&gt;350–550W&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;C7000 + 3 blades&lt;/td&gt;
 &lt;td&gt;500–800W&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;ODEN alone (1U M3, Talos)&lt;/td&gt;
 &lt;td&gt;100–150W&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;HEIMDAL alone (Sun X4150, router)&lt;/td&gt;
 &lt;td&gt;150–200W&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;ODEN + HEIMDAL&lt;/td&gt;
 &lt;td&gt;250–350W&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Two pizza boxes beat three blades in the enclosure on power. The overhead only amortises at 8+ populated slots. For a permanent minimal setup, the 1U rack servers win. For experiments where you want to run 8–16 nodes at once, ASGARD earns its place.&lt;/p&gt;
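&lt;p&gt;As a rough check on that comparison, the sketch below divides the totals from the table by the number of machines powered on. The &lt;code&gt;per_node&lt;/code&gt; helper is hypothetical, and the inputs are the approximate ranges above, not measurements.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-bash" data-lang="bash"&gt;#!/bin/bash
# Watts per powered-on machine, using the approximate ranges from the table.
# per_node is an illustrative helper; results are integers, rounded down.
per_node() {
  local total_watts=$1 machines=$2
  echo $(( total_watts / machines ))
}

# ODEN + HEIMDAL: 250-350W across 2 machines
echo "rack servers: $(per_node 250 2)-$(per_node 350 2) W per machine"
# C7000 + 3 blades: 500-800W across 3 blades
echo "three blades: $(per_node 500 3)-$(per_node 800 3) W per machine"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these inputs the two rack servers come out at 125–175W per machine and the three blades at 166–266W, so the enclosure overhead still dominates at three blades.&lt;/p&gt;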
&lt;hr&gt;
&lt;h2 id="what-each-role-actually-needs"&gt;What each role actually needs
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Role&lt;/th&gt;
 &lt;th&gt;RAM&lt;/th&gt;
 &lt;th&gt;Disk&lt;/th&gt;
 &lt;th&gt;Network&lt;/th&gt;
 &lt;th&gt;Limiting factor&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Talos / K8s worker&lt;/td&gt;
 &lt;td&gt;32–64GB&lt;/td&gt;
 &lt;td&gt;1× OSD disk&lt;/td&gt;
 &lt;td&gt;1GbE fine&lt;/td&gt;
 &lt;td&gt;RAM — current blades too thin&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;OpenStack compute&lt;/td&gt;
 &lt;td&gt;32–64GB&lt;/td&gt;
 &lt;td&gt;local ephemeral&lt;/td&gt;
 &lt;td&gt;1GbE fine&lt;/td&gt;
 &lt;td&gt;RAM&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;OpenStack control&lt;/td&gt;
 &lt;td&gt;32GB+&lt;/td&gt;
 &lt;td&gt;small&lt;/td&gt;
 &lt;td&gt;1GbE fine&lt;/td&gt;
 &lt;td&gt;RAM&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Slurm compute&lt;/td&gt;
 &lt;td&gt;as much as possible&lt;/td&gt;
 &lt;td&gt;fast scratch&lt;/td&gt;
 &lt;td&gt;1GbE mediocre&lt;/td&gt;
 &lt;td&gt;network&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Ceph OSD&lt;/td&gt;
 &lt;td&gt;16–32GB&lt;/td&gt;
 &lt;td&gt;more / bigger disks&lt;/td&gt;
 &lt;td&gt;1GbE&lt;/td&gt;
 &lt;td&gt;disk count&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The network note matters for Slurm: blade LOM connects to the enclosure switch backplane at &lt;strong&gt;1GbE&lt;/strong&gt;, not 10GbE. The switch has 10GbE uplinks going out, but blade-to-blade traffic inside the enclosure goes through the switch at 1GbE. For Talos and OpenStack this is fine. For MPI jobs exchanging large datasets between Slurm nodes it&amp;rsquo;s a real bottleneck — HPC wants InfiniBand, which the empty interconnect bays 5–8 could take (plus matching mezzanine cards in each blade), but that&amp;rsquo;s a separate cost. For learning Slurm, 1GbE is workable.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="current-blade-state"&gt;Current blade state
&lt;/h2&gt;&lt;p&gt;Most blades are underpowered for any of the roles above. CPUs are also unknown across all 16 slots — the OA web GUI reports CPU model and core count per blade and should be checked first. The E5-2600 v1 range runs from E5-2603 (4c, 80W) to E5-2690 (8c/16t, 135W), which matters significantly for role assignment.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Slot&lt;/th&gt;
 &lt;th&gt;RAM&lt;/th&gt;
 &lt;th&gt;Disk&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-001&lt;/td&gt;
 &lt;td&gt;4GB&lt;/td&gt;
 &lt;td&gt;2× 146GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-002&lt;/td&gt;
 &lt;td&gt;14GB (mixed, odd count)&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-003&lt;/td&gt;
 &lt;td&gt;32GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-004&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-005&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;1× 146GB + 1× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-006&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-007&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;2× 900GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-008&lt;/td&gt;
 &lt;td&gt;16GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-009&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-010&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-011&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-012&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-013&lt;/td&gt;
 &lt;td&gt;32GB&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-014&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-015&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-016&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;BLD-003 and BLD-013 are already at 32GB and are natural candidates for control-plane or master roles once CPUs are confirmed.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="suggested-configuration-from-existing-stock"&gt;Suggested configuration from existing stock
&lt;/h2&gt;&lt;p&gt;Available spare hardware:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;14× RAM-007 (8GB DDR3 1600MHz ECC Reg) — unassigned&lt;/li&gt;
&lt;li&gt;2× HDD-004 (120GB SATA SSD) — spare&lt;/li&gt;
&lt;li&gt;6× HDD-002 (146GB 10K SAS) — spare&lt;/li&gt;
&lt;li&gt;Embedded P220i on each blade (can be set to JBOD/passthrough for Ceph)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;&amp;ldquo;Fat&amp;rdquo; nodes × 2&lt;/strong&gt; — Talos control plane, OpenStack control, Slurm master:
Add 4× RAM-007 to each blade. Both candidates, BLD-006 and BLD-010, start at 8GB, so each ends up at 40GB, and both have 2× 300GB SAS for local storage. Costs 8 of the 14 spare sticks. Install a spare 120GB SSD as the boot disk in each.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;ldquo;Medium&amp;rdquo; nodes × 3&lt;/strong&gt; — Talos workers, OpenStack compute, Slurm compute:
Add 2× RAM-007 to each → 24GB from the 8GB base. Candidates: BLD-008 (already 16GB, gets to 32GB), BLD-011, BLD-012. All three have 300GB SAS for scratch or Ceph OSDs. Costs the remaining 6 spare sticks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rest&lt;/strong&gt; — thin compute, storage expansion, or powered off:
Leave at current RAM. BLD-007&amp;rsquo;s 900GB SAS pair is better used elsewhere (see below). BLD-003 and BLD-013 at 32GB can step up to fat-node role once CPUs are confirmed.&lt;/p&gt;
&lt;p&gt;That leaves 5 blades properly kitted and 11 available for experiments or idle.&lt;/p&gt;
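&lt;p&gt;The stick budget for this plan can be checked with a few lines of shell arithmetic. This is only a sketch: the numbers are taken from the text above and the variable names are illustrative.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-bash" data-lang="bash"&gt;#!/bin/bash
# Check the RAM-007 budget for the fat/medium plan. Numbers come from the
# text; the variable names are illustrative only.
stick_gb=8          # RAM-007 is an 8GB DIMM
spare_sticks=14

fat_nodes=2;    fat_added=4      # sticks added per fat node
medium_nodes=3; medium_added=2   # sticks added per medium node

used=$(( fat_nodes * fat_added + medium_nodes * medium_added ))
echo "sticks used: $used of $spare_sticks"

# Resulting RAM per blade = current RAM + added sticks * 8GB
echo "fat (8GB base):      $(( 8 + fat_added * stick_gb ))GB"
echo "medium (8GB base):   $(( 8 + medium_added * stick_gb ))GB"
echo "BLD-008 (16GB base): $(( 16 + medium_added * stick_gb ))GB"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It uses all 14 spare sticks and yields 40GB fat nodes, 24GB medium nodes, and 32GB on BLD-008.&lt;/p&gt;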
&lt;p&gt;BL460c Gen8 DIMM rule: populate both CPUs identically and spread the DIMMs evenly across each CPU&amp;rsquo;s memory channels, in matched pairs or quads per CPU, for best throughput. Avoid odd DIMM counts.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="storage--what-moves-where"&gt;Storage — what moves where
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Pull the 900GB SAS drives from BLD-007 now.&lt;/strong&gt; HDD-013 (HGST 900GB) and HDD-014 (Toshiba 900GB) are the two largest drives in the blade pool and they&amp;rsquo;re sitting in a blade that may end up as a thin compute worker. Move them into ODEN or LOKE as permanent Ceph OSDs. This immediately gives the always-on cluster substantially more storage than the current 120GB SSDs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MIMIR&lt;/strong&gt; (SYS-004, 15× 1TB SAS) is the Ceph expansion story for later. To connect it: install CTRL-006 (ServeRAID-8e, two unplaced on hand) into a server with a free PCIe slot, then cable it with an SFF-8470 → SFF-8088 cable (not currently owned, inexpensive). TOR is the natural host: it already has CTRL-003 in HBA mode and free PCIe slots. Not urgent, but the hardware is almost all there.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;What&lt;/th&gt;
 &lt;th&gt;Goes to&lt;/th&gt;
 &lt;th&gt;When&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;900GB SAS ×2 from BLD-007&lt;/td&gt;
 &lt;td&gt;ODEN or LOKE, permanent Ceph OSDs&lt;/td&gt;
 &lt;td&gt;Now&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;120GB SSD ×2 spare&lt;/td&gt;
 &lt;td&gt;BLD fat node boot disks&lt;/td&gt;
 &lt;td&gt;Before Talos on blades&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;300GB SAS in blades&lt;/td&gt;
 &lt;td&gt;Local scratch or blade Ceph OSDs&lt;/td&gt;
 &lt;td&gt;During ASGARD experiments&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;MIMIR 15× 1TB SAS&lt;/td&gt;
 &lt;td&gt;TOR via CTRL-006, Ceph expansion&lt;/td&gt;
 &lt;td&gt;Later (needs cable)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id="three-things-to-do-before-blades-can-boot-anything"&gt;Three things to do before blades can boot anything
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Identify CPUs.&lt;/strong&gt; Connect to the OA management port, open the web GUI, check CPU model per slot. Ten minutes. Everything else depends on this.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Network uplink.&lt;/strong&gt; The blade switches in bays 1 and 2 have 4× RJ45 1GbE uplinks (ports 22–25). Run a patch cable from one to any available switch — MODI, MAGNI, whatever&amp;rsquo;s reachable from the cable box. That&amp;rsquo;s enough for blades to reach DHCP and iPXE.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RAM redistribution.&lt;/strong&gt; Pull the 14 spare RAM-007 sticks and install them in the chosen fat and medium nodes per the profile above.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2 id="the-permanent-vs-experiment-split"&gt;The permanent vs experiment split
&lt;/h2&gt;&lt;pre tabindex="0"&gt;&lt;code&gt;Always on (~350–500W total):
 HEIMDAL → OPNsense router, Sun X4150, ~150–200W
 ODEN → Talos, Minecraft + small services, ~100–150W
 LOKE → 2nd Talos node (needs RAM-007 × 8 + SSD boot), ~100–150W

Experiments (fire up, learn, power off):
 ASGARD → 3–16 blades for Slurm / OpenStack / larger Talos cluster
 TYR+TOR+FREJA → Proxmox cluster (M1 DDR2, temporary)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Once the Proxmox experiment wraps, TYR, TOR, and FREJA can be powered down permanently. If ASGARD blades eventually become the long-term compute platform, OPNsense can move to a VM on a blade at that point, but not before the blades are stable and trusted. Don&amp;rsquo;t consolidate the router onto experimental infrastructure.&lt;/p&gt;</description></item><item><title>Slurm</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/slurm/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/slurm/</guid><description>&lt;p&gt;Slurm (originally the Simple Linux Utility for Resource Management) is an open-source workload manager and job scheduler. It originated in HPC and is now widely used for ML training clusters; AWS, GCP, and Azure all offer Slurm-based cluster options.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://slurm.schedmd.com/" target="_blank" rel="noopener"
 &gt;https://slurm.schedmd.com/&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="core-concepts"&gt;Core Concepts
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Node&lt;/strong&gt; — a compute host registered with Slurm. Can have CPU, GPU, and memory resources defined.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Partition&lt;/strong&gt; — a logical group of nodes (like a queue). You can have separate partitions for GPU and CPU nodes, or different priority tiers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Job&lt;/strong&gt; — a workload submitted to the queue. Slurm allocates resources, places the job, and tracks it to completion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Allocation&lt;/strong&gt; — the reserved resources (nodes, CPUs, GPUs, memory) for a running job.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="key-commands"&gt;Key Commands
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sbatch job.sh &lt;span style="color:#75715e"&gt;# submit a batch job script&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;squeue &lt;span style="color:#75715e"&gt;# view the job queue&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sacct -j &amp;lt;jobid&amp;gt; &lt;span style="color:#75715e"&gt;# job accounting / history&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sinfo &lt;span style="color:#75715e"&gt;# view partition and node state&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;scancel &amp;lt;jobid&amp;gt; &lt;span style="color:#75715e"&gt;# cancel a job&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;srun --pty bash &lt;span style="color:#75715e"&gt;# interactive allocation&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A minimal batch script:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#!/bin/bash
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --job-name=train&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --nodes=1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --gpus=1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --time=02:00:00&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;python train.py
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="slurm-vs-kubernetes-for-training"&gt;Slurm vs Kubernetes for Training
&lt;/h2&gt;&lt;p&gt;The fundamental difference is what each system optimises for:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt; optimises for uptime. It keeps services running, reschedules failed pods, and manages long-lived workloads. That&amp;rsquo;s the right model for inference serving, APIs, and anything that needs to stay up.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Slurm&lt;/strong&gt; optimises for utilisation. It packs jobs onto nodes as tightly as possible, queues work when resources are busy, and gets out of the way when a job finishes. That&amp;rsquo;s the right model for batch training — you want every GPU busy, not reserved for availability.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;&lt;/th&gt;
 &lt;th&gt;Slurm&lt;/th&gt;
 &lt;th&gt;Kubernetes&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Optimises for&lt;/td&gt;
 &lt;td&gt;Maximum utilisation&lt;/td&gt;
 &lt;td&gt;Uptime and availability&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Scheduling model&lt;/td&gt;
 &lt;td&gt;Job queue, batch-first&lt;/td&gt;
 &lt;td&gt;Long-running services + batch (via operators)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;GPU allocation&lt;/td&gt;
 &lt;td&gt;Native, fine-grained&lt;/td&gt;
 &lt;td&gt;Requires a vendor device plugin, often installed via a GPU operator&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Multi-node training&lt;/td&gt;
 &lt;td&gt;First-class (MPI, &lt;code&gt;srun&lt;/code&gt;)&lt;/td&gt;
 &lt;td&gt;Possible via Kubeflow, PyTorchJob&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Preemption&lt;/td&gt;
 &lt;td&gt;Built-in&lt;/td&gt;
 &lt;td&gt;Requires configuration&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Operational overhead&lt;/td&gt;
 &lt;td&gt;Low on bare metal&lt;/td&gt;
 &lt;td&gt;Higher — requires cluster management&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Ecosystem&lt;/td&gt;
 &lt;td&gt;HPC, academia, major cloud HPC&lt;/td&gt;
 &lt;td&gt;ML platforms, cloud-native&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
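&lt;p&gt;To make the multi-node row concrete, here is a minimal sketch of a helper that prints a two-node batch script so the directives can be reviewed before submitting. &lt;code&gt;render_job&lt;/code&gt;, the job name, and &lt;code&gt;./mpi_app&lt;/code&gt; are placeholders, not part of Slurm; the &lt;code&gt;#SBATCH&lt;/code&gt; options and &lt;code&gt;srun&lt;/code&gt; are standard.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-bash" data-lang="bash"&gt;#!/bin/bash
# render_job is a hypothetical helper: it prints a multi-node batch script
# to stdout. ./mpi_app is a placeholder for your own MPI binary.
render_job() {
  local nodes=$1 tasks_per_node=$2
  printf '%s\n' \
    '#!/bin/bash' \
    '#SBATCH --job-name=mpi-test' \
    "#SBATCH --nodes=${nodes}" \
    "#SBATCH --ntasks-per-node=${tasks_per_node}" \
    '#SBATCH --time=00:10:00' \
    '' \
    '# srun starts nodes x ntasks-per-node copies across the allocation' \
    'srun ./mpi_app'
}

# Print a script for 2 nodes with 4 tasks each
render_job 2 4
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On a cluster, the output could be saved to a file and submitted with &lt;code&gt;sbatch&lt;/code&gt;, or piped to it directly, since &lt;code&gt;sbatch&lt;/code&gt; reads the script from standard input when no file is given.&lt;/p&gt;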
&lt;p&gt;&lt;strong&gt;The short version:&lt;/strong&gt; use Slurm for pure batch training on bare metal. Use Kubernetes when you&amp;rsquo;re mixing training with inference serving or need container-native workflows throughout — or when the cluster already exists.&lt;/p&gt;</description></item></channel></rss>