<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Scheduler on Backend Engineering Strategy Tools</title><link>https://backend-engineering-strategy-tools.github.io/site/tags/scheduler/</link><description>Recent content in Scheduler on Backend Engineering Strategy Tools</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Thu, 14 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://backend-engineering-strategy-tools.github.io/site/tags/scheduler/index.xml" rel="self" type="application/rss+xml"/><item><title>Slurm</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/slurm/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/slurm/</guid><description>&lt;p&gt;Slurm (originally the Simple Linux Utility for Resource Management) is an open-source workload manager and job scheduler. It originated in HPC and is now widely used for ML training clusters; AWS, GCP, and Azure all offer Slurm-based cluster deployments.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://slurm.schedmd.com/" target="_blank" rel="noopener"
 &gt;https://slurm.schedmd.com/&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="core-concepts"&gt;Core Concepts
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Node&lt;/strong&gt;: a compute host registered with Slurm. The CPU, GPU, and memory resources of each node are declared in the Slurm configuration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Partition&lt;/strong&gt;: a logical group of nodes that acts as a job queue. Typical setups use separate partitions for GPU and CPU nodes, or for different priority tiers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Job&lt;/strong&gt;: a workload submitted to a partition. Slurm queues it, allocates resources, starts it, and tracks it to completion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Allocation&lt;/strong&gt;: the set of resources (nodes, CPUs, GPUs, memory) reserved for a job while it runs.&lt;/p&gt;
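&lt;p&gt;To make the node and partition relationship concrete, here is a minimal sketch that totals nodes per partition. The helper name &lt;code&gt;nodes_per_partition&lt;/code&gt; and the sample lines are made up for illustration; the input is assumed to have the shape printed by &lt;code&gt;sinfo -h -o "%P %D %T"&lt;/code&gt; (partition name, node count, node state), where the default partition carries a trailing &lt;code&gt;*&lt;/code&gt;.&lt;/p&gt;

```shell
#!/bin/bash
# Hypothetical helper: sum node counts per partition.
# Reads "partition count state" lines on stdin and prints "partition total".
nodes_per_partition() {
  awk '{ sub(/\*$/, "", $1); total[$1] += $2 }
       END { for (p in total) print p, total[p] }' | sort
}

# The sample input below is invented. On a real cluster the lines could come from:
#   sinfo -h -o "%P %D %T"
printf '%s\n' 'gpu* 4 idle' 'gpu* 2 allocated' 'cpu 8 mixed' | nodes_per_partition
```

&lt;p&gt;The same partition appears on several lines because &lt;code&gt;sinfo&lt;/code&gt; groups nodes by state, which is why the counts are summed.&lt;/p&gt;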
&lt;hr&gt;
&lt;h2 id="key-commands"&gt;Key Commands
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sbatch job.sh &lt;span style="color:#75715e"&gt;# submit a batch job script&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;squeue &lt;span style="color:#75715e"&gt;# view the job queue&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sacct -j &amp;lt;jobid&amp;gt; &lt;span style="color:#75715e"&gt;# job accounting / history&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sinfo &lt;span style="color:#75715e"&gt;# view partition and node state&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;scancel &amp;lt;jobid&amp;gt; &lt;span style="color:#75715e"&gt;# cancel a job&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;srun --pty bash &lt;span style="color:#75715e"&gt;# start an interactive shell on a compute node&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A minimal batch script:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#!/bin/bash
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --job-name=train&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --nodes=1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --gpus=1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --time=02:00:00&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;python train.py
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="slurm-vs-kubernetes-for-training"&gt;Slurm vs Kubernetes for Training
&lt;/h2&gt;&lt;p&gt;The fundamental difference is what each system optimises for:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt; optimises for uptime. It keeps services running, reschedules failed pods, and manages long-lived workloads. That&amp;rsquo;s the right model for inference serving, APIs, and anything that needs to stay up.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Slurm&lt;/strong&gt; optimises for utilisation. It packs jobs onto nodes as tightly as possible, queues work when resources are busy, and gets out of the way when a job finishes. That&amp;rsquo;s the right model for batch training — you want every GPU busy, not reserved for availability.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;&lt;/th&gt;
 &lt;th&gt;Slurm&lt;/th&gt;
 &lt;th&gt;Kubernetes&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Optimises for&lt;/td&gt;
 &lt;td&gt;Maximum utilisation&lt;/td&gt;
 &lt;td&gt;Uptime and availability&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Scheduling model&lt;/td&gt;
 &lt;td&gt;Job queue, batch-first&lt;/td&gt;
 &lt;td&gt;Long-running services + batch (via operators)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;GPU allocation&lt;/td&gt;
 &lt;td&gt;Native via GRES, fine-grained&lt;/td&gt;
 &lt;td&gt;Requires a device plugin (e.g. installed by the NVIDIA GPU Operator)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Multi-node training&lt;/td&gt;
 &lt;td&gt;First-class (MPI, &lt;code&gt;srun&lt;/code&gt;)&lt;/td&gt;
 &lt;td&gt;Possible via Kubeflow (e.g. PyTorchJob)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Preemption&lt;/td&gt;
 &lt;td&gt;Built-in (partition- or QOS-based)&lt;/td&gt;
 &lt;td&gt;Built-in via PriorityClass; priority classes must be defined first&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Operational overhead&lt;/td&gt;
 &lt;td&gt;Low on bare metal&lt;/td&gt;
 &lt;td&gt;Higher — requires cluster management&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Ecosystem&lt;/td&gt;
 &lt;td&gt;HPC, academia, major cloud HPC&lt;/td&gt;
 &lt;td&gt;ML platforms, cloud-native&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
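&lt;p&gt;As an illustration of the multi-node row, the sketch below prints a batch script that requests several nodes and launches the training command with &lt;code&gt;srun&lt;/code&gt;. The helper name &lt;code&gt;render_multinode_job&lt;/code&gt;, the job name, and the resource counts are assumptions for the example, and &lt;code&gt;train.py&lt;/code&gt; is assumed to set up its own distributed communication with one task per GPU.&lt;/p&gt;

```shell
#!/bin/bash
# Hypothetical helper: print a multi-node batch script to stdout.
# Arguments: number of nodes, GPUs per node (one task is started per GPU).
render_multinode_job() {
  local nodes="$1" gpus_per_node="$2"
  printf '%s\n' \
    '#!/bin/bash' \
    '#SBATCH --job-name=train-multinode' \
    "#SBATCH --nodes=${nodes}" \
    "#SBATCH --ntasks-per-node=${gpus_per_node}" \
    "#SBATCH --gpus-per-node=${gpus_per_node}" \
    '#SBATCH --time=02:00:00' \
    '' \
    '# srun launches every task in the allocation (nodes x tasks per node)' \
    'srun python train.py'
}

# Print a 2-node, 4-GPUs-per-node script for inspection.
render_multinode_job 2 4
```

&lt;p&gt;On a real cluster the output could be piped straight into &lt;code&gt;sbatch&lt;/code&gt;, which reads the script from standard input when no file name is given.&lt;/p&gt;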
&lt;p&gt;&lt;strong&gt;The short version:&lt;/strong&gt; use Slurm for pure batch training on bare metal. Use Kubernetes when you&amp;rsquo;re mixing training with inference serving, need container-native workflows throughout, or already run a Kubernetes cluster.&lt;/p&gt;</description></item></channel></rss>