<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Cloud &amp; Infrastructure on Backend Engineering Strategy Tools</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/</link><description>Recent content in Cloud &amp; Infrastructure on Backend Engineering Strategy Tools</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Thu, 14 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/index.xml" rel="self" type="application/rss+xml"/><item><title>Ceph</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/ceph/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/ceph/</guid><description>&lt;p&gt;Ceph is an open-source distributed storage platform providing object, block, and file storage in a single unified system. It runs across multiple nodes and has no single point of failure.&lt;/p&gt;
&lt;p&gt;The core idea: placement is computed, not assigned by hand. Data is not pinned to specific disks on specific nodes. Instead, the CRUSH algorithm distributes data across all available OSDs (Object Storage Daemons) based on a placement map. Add nodes and the cluster rebalances automatically. Lose a node and Ceph re-replicates from surviving copies without operator intervention.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="storage-types"&gt;Storage types
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Type&lt;/th&gt;
 &lt;th&gt;Interface&lt;/th&gt;
 &lt;th&gt;Typical use&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Block (RBD)&lt;/td&gt;
 &lt;td&gt;Kernel block device / iSCSI&lt;/td&gt;
 &lt;td&gt;Kubernetes PVCs, VM disks&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Object (RGW)&lt;/td&gt;
 &lt;td&gt;S3-compatible API&lt;/td&gt;
 &lt;td&gt;Backups, artifacts, media&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;File (CephFS)&lt;/td&gt;
 &lt;td&gt;POSIX filesystem / NFS&lt;/td&gt;
 &lt;td&gt;Shared filesystems, home dirs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For Kubernetes workloads, RBD block storage via a StorageClass is the common path.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="components"&gt;Components
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;MON (Monitor)&lt;/strong&gt; — maintains the cluster map. Monitors form a quorum by majority, so run an odd number (typically 3 or 5). Not in the data path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OSD (Object Storage Daemon)&lt;/strong&gt; — one per disk; handles actual data reads/writes and replication.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MGR (Manager)&lt;/strong&gt; — collects metrics, hosts the dashboard, runs modules (balancer, prometheus, etc.).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MDS (Metadata Server)&lt;/strong&gt; — only required for CephFS; manages the filesystem namespace.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="single-node-constraint"&gt;Single-node constraint
&lt;/h2&gt;&lt;p&gt;A single-node Ceph cluster can be made to run (&lt;code&gt;allowMultiplePerNode: true&lt;/code&gt; in Rook, replication &lt;code&gt;size: 1&lt;/code&gt;), but it provides no actual redundancy. There is nothing to replicate to. This is fine for testing concepts; it is not a valid storage setup for anything you care about.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="related"&gt;Related
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.ceph.com/" target="_blank" rel="noopener"
 &gt;Ceph documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/kubernetes/rook/" &gt;Rook&lt;/a&gt; — Kubernetes operator that manages Ceph clusters inside K8s&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/proxmox/" &gt;Proxmox&lt;/a&gt; — Ceph is a native storage backend in Proxmox clusters&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/rook-ceph/" &gt;Rook + Ceph in the homelab&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>OpenStack</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/openstack/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/openstack/</guid><description>&lt;p&gt;OpenStack is an open-source IaaS platform — it turns a pool of bare-metal servers into a self-service cloud: virtual machines, block storage, networking, and object storage, all driven by API.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.openstack.org/" target="_blank" rel="noopener"
 &gt;https://www.openstack.org/&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="scale-and-fit"&gt;Scale and fit
&lt;/h2&gt;&lt;p&gt;There is a rough spectrum of virtualization tools, and picking the wrong tier is a common mistake:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Proxmox / VMware / Hyper-V&lt;/strong&gt; — the right choice when you want to run virtual machines: SMB, homelab, or a small ops team managing infrastructure directly. Reasonable setup cost, manageable operational overhead, one or a few admins in control. Proxmox fills the VMware-replacement role in this tier.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OpenStack&lt;/strong&gt; — the right choice when you are &lt;em&gt;building a cloud&lt;/em&gt;, not just running VMs. Multi-tenant infrastructure where teams self-service their own compute, networking, and storage via API. The operational complexity is real and significant; it pays off when the cloud-like abstraction is the actual product, or when the scale justifies the overhead.&lt;/p&gt;
&lt;p&gt;The rule of thumb: if the question is &amp;ldquo;how do I replace VMware?&amp;rdquo;, the answer is Proxmox. If the question is &amp;ldquo;how do I build a private cloud platform?&amp;rdquo;, the answer might be OpenStack.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="core-components"&gt;Core Components
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Service&lt;/th&gt;
 &lt;th&gt;Code Name&lt;/th&gt;
 &lt;th&gt;What it does&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Compute&lt;/td&gt;
 &lt;td&gt;Nova&lt;/td&gt;
 &lt;td&gt;Schedules and manages VM lifecycle&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Networking&lt;/td&gt;
 &lt;td&gt;Neutron&lt;/td&gt;
 &lt;td&gt;Virtual networks, routers, floating IPs, security groups&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Block Storage&lt;/td&gt;
 &lt;td&gt;Cinder&lt;/td&gt;
 &lt;td&gt;Persistent volumes attached to VMs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Image Service&lt;/td&gt;
 &lt;td&gt;Glance&lt;/td&gt;
 &lt;td&gt;Stores and serves OS images&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Identity&lt;/td&gt;
 &lt;td&gt;Keystone&lt;/td&gt;
 &lt;td&gt;Auth, service catalog, RBAC&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Dashboard&lt;/td&gt;
 &lt;td&gt;Horizon&lt;/td&gt;
 &lt;td&gt;Web UI (optional)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Object Storage&lt;/td&gt;
 &lt;td&gt;Swift&lt;/td&gt;
 &lt;td&gt;S3-like object storage (optional)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Bare Metal&lt;/td&gt;
 &lt;td&gt;Ironic&lt;/td&gt;
 &lt;td&gt;Provisions physical machines instead of VMs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;You do not need all of them. A minimal useful deployment is Nova + Neutron + Cinder + Glance + Keystone.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="openstack-on-kubernetes"&gt;OpenStack on Kubernetes
&lt;/h2&gt;&lt;p&gt;OpenStack services are just applications — and they can run as Kubernetes workloads. Two projects make this practical:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a class="link" href="https://github.com/openstack/openstack-helm" target="_blank" rel="noopener"
 &gt;OpenStack-Helm&lt;/a&gt;&lt;/strong&gt; — official Helm charts for deploying OpenStack services on an existing Kubernetes cluster. Each service (Nova, Neutron, Cinder, etc.) becomes a Helm release. Upgrades follow standard rolling deployment patterns.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a class="link" href="https://github.com/vexxhost/atmosphere" target="_blank" rel="noopener"
 &gt;Atmosphere&lt;/a&gt;&lt;/strong&gt; (by VEXXHOST) — a higher-level operator built on top of OpenStack-Helm. Adds Ansible automation, health checks, and a more opinionated deployment model. Targets production use.&lt;/p&gt;
&lt;p&gt;The practical implication: you can run a Talos cluster and deploy OpenStack on top of it — OpenStack as a tenant of Kubernetes rather than a separate platform. This inverts the usual relationship (where Kubernetes runs on top of OpenStack) and is an interesting architectural option for homelab and small private cloud deployments.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.fairbanks.nl/" target="_blank" rel="noopener"
 &gt;Fairbanks&lt;/a&gt; (Dutch hosting company specialising in sovereign private clouds) does exactly this in production. Their talk &lt;a class="link" href="https://www.youtube.com/watch?v=zU8mT2f2Hxc" target="_blank" rel="noopener"
 &gt;OpenStack on Talos Linux&lt;/a&gt; is the clearest real-world example of the pattern.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="deployment-options"&gt;Deployment Options
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Kolla-Ansible&lt;/strong&gt;&lt;br&gt;
&lt;a class="link" href="https://docs.openstack.org/kolla-ansible/latest/" target="_blank" rel="noopener"
 &gt;https://docs.openstack.org/kolla-ansible/latest/&lt;/a&gt;&lt;br&gt;
Containerised OpenStack deployed via Ansible. Production-grade, well-maintained. The practical choice for homelab and small-scale production deployments. Each service runs in its own container.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DevStack&lt;/strong&gt;&lt;br&gt;
&lt;a class="link" href="https://docs.openstack.org/devstack/latest/" target="_blank" rel="noopener"
 &gt;https://docs.openstack.org/devstack/latest/&lt;/a&gt;&lt;br&gt;
All-in-one development install. Not for production or anything you want to survive a reboot. Good for learning the API surface.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Canonical OpenStack (Juju / Sunbeam)&lt;/strong&gt;&lt;br&gt;
&lt;a class="link" href="https://ubuntu.com/openstack" target="_blank" rel="noopener"
 &gt;https://ubuntu.com/openstack&lt;/a&gt;&lt;br&gt;
Ubuntu-opinionated deployment. Sunbeam is a newer, minimal-footprint option. Good if you&amp;rsquo;re already in the Ubuntu/Juju ecosystem.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="concepts-worth-understanding"&gt;Concepts Worth Understanding
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Flavors&lt;/strong&gt; — VM sizing templates (vCPU, RAM, disk). You define these; instances pick from them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Security Groups&lt;/strong&gt; — stateful firewall rules applied per-port. Default-deny inbound.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Floating IPs&lt;/strong&gt; — externally routable IPs that can be associated with, or disassociated from, instances at any time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Availability Zones&lt;/strong&gt; — logical groupings of compute nodes. Useful for fault isolation even at small scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hypervisors&lt;/strong&gt; — Nova supports KVM (default), QEMU, VMware, and others. KVM on Linux is the standard.&lt;/p&gt;
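&lt;p&gt;As a rough illustration of how flavors, security groups, and floating IPs look from the &lt;code&gt;openstack&lt;/code&gt; CLI, here is a minimal dry-run sketch. The names &lt;code&gt;m1.example&lt;/code&gt;, &lt;code&gt;public&lt;/code&gt;, &lt;code&gt;demo-vm&lt;/code&gt;, and the addresses are assumptions for the example, not values from any real deployment; by default the script only prints the commands.&lt;/p&gt;

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to run them against a real cloud (needs the openstack CLI
# and credentials). All resource names and addresses below are hypothetical.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

provision_example() {
    # Flavor: a sizing template (2 vCPU, 4096 MB RAM, 40 GB disk).
    run openstack flavor create --vcpus 2 --ram 4096 --disk 40 m1.example

    # Security group rule: inbound is denied by default, so allow SSH explicitly.
    run openstack security group rule create --ingress --protocol tcp \
        --dst-port 22 --remote-ip 203.0.113.0/24 default

    # Floating IP: allocate from an external network, then attach to an instance.
    run openstack floating ip create public
    run openstack server add floating ip demo-vm 203.0.113.10
}

provision_example
```

&lt;p&gt;Keeping the dry-run default makes it safe to read the command sequence before pointing it at a real project.&lt;/p&gt;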
&lt;hr&gt;
&lt;h2 id="relevance-to-the-lab"&gt;Relevance to the Lab
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/llm-training/" &gt;LLM training experiment&lt;/a&gt; plans to use OpenStack as the IaaS layer over the &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/inventory/systems/" &gt;blade nodes&lt;/a&gt; in ASGARD — Nova for compute scheduling, Neutron for cluster networking, Cinder for shared model/dataset storage backed by Ceph.&lt;/p&gt;</description></item><item><title>Proxmox VE</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/proxmox/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/proxmox/</guid><description>&lt;p&gt;Proxmox VE (Virtual Environment) is an open-source Type 1 hypervisor built on Debian. It runs KVM for full virtual machines and LXC for lightweight containers, managed through a web UI or API. The subscription model is optional — the community edition is fully functional without a paid license; the subscription gives access to the enterprise update repository and support.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="comparison"&gt;Comparison
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Platform&lt;/th&gt;
 &lt;th&gt;License&lt;/th&gt;
 &lt;th&gt;VMs (KVM)&lt;/th&gt;
 &lt;th&gt;Containers&lt;/th&gt;
 &lt;th&gt;Clustering&lt;/th&gt;
 &lt;th&gt;Web UI&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Proxmox VE&lt;/td&gt;
 &lt;td&gt;Open-source (optional sub)&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;td&gt;Yes (LXC)&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;VMware ESXi&lt;/td&gt;
 &lt;td&gt;Commercial&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;td&gt;No&lt;/td&gt;
 &lt;td&gt;Yes (vCenter)&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Standalone KVM&lt;/td&gt;
 &lt;td&gt;Open-source&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;td&gt;No&lt;/td&gt;
 &lt;td&gt;Manual&lt;/td&gt;
 &lt;td&gt;No&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;oVirt&lt;/td&gt;
 &lt;td&gt;Open-source&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;td&gt;No&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Proxmox is the practical choice when you want VMware-style management without the licensing cost, or when you want to run both VMs and containers on the same node.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="core-concepts"&gt;Core concepts
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Node&lt;/strong&gt; — a physical host running Proxmox VE. Managed independently or as part of a cluster.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cluster&lt;/strong&gt; — multiple nodes joined together. Share a unified management view and allow live migration of VMs between nodes. Uses &lt;a class="link" href="https://corosync.github.io/corosync/" target="_blank" rel="noopener"
 &gt;Corosync&lt;/a&gt; for distributed consensus.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Quorum&lt;/strong&gt; — clusters require a majority of nodes to be reachable to avoid split-brain. Minimum useful cluster size is 3 nodes (loss of one node still leaves a majority). Two-node clusters need a quorum device (&lt;code&gt;qdevice&lt;/code&gt;) to function safely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;VM&lt;/strong&gt; — full virtual machine backed by QEMU/KVM. Hardware-level isolation. Arbitrary OS.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Container (CT)&lt;/strong&gt; — LXC container. Shares the host kernel; lower overhead than a VM. Linux-only. Useful for services where you want process-level isolation without a full OS.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Storage pool&lt;/strong&gt; — where disks and images live. Supported backends: local directory, LVM, LVM-thin, ZFS, NFS, CIFS, and Ceph (via &lt;code&gt;rbd&lt;/code&gt;). ZFS and Ceph are the most capable options for a cluster — ZFS for local redundancy, Ceph for shared storage across nodes.&lt;/p&gt;
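&lt;p&gt;The quorum rule above is plain majority arithmetic. A minimal shell sketch, assuming one vote per node (the function names &lt;code&gt;quorum&lt;/code&gt; and &lt;code&gt;survivable_losses&lt;/code&gt; are illustrative, not Proxmox or Corosync commands):&lt;/p&gt;

```shell
#!/bin/sh
# Illustrative arithmetic only; Corosync does this internally.

# Votes needed for quorum: a strict majority of all expected votes.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of votes that can be lost while the cluster stays quorate.
survivable_losses() {
    echo $(( $1 - $(quorum "$1") ))
}

for votes in 2 3 5; do
    echo "votes=$votes quorum=$(quorum "$votes") survivable_losses=$(survivable_losses "$votes")"
done
```

&lt;p&gt;Two nodes give 2 votes and tolerate no loss. A qdevice adds a third vote, so a two-node cluster can then lose one node and stay quorate.&lt;/p&gt;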
&lt;hr&gt;
&lt;h2 id="related"&gt;Related
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://pve.proxmox.com/pve-docs/" target="_blank" rel="noopener"
 &gt;Proxmox VE documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://forum.proxmox.com/" target="_blank" rel="noopener"
 &gt;Proxmox community forum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://corosync.github.io/corosync/" target="_blank" rel="noopener"
 &gt;Corosync documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/ceph/" &gt;Ceph&lt;/a&gt; — distributed storage backend for Proxmox clusters&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/openstack/" &gt;OpenStack&lt;/a&gt; — the next tier up the scale spectrum&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/proxmox-cluster/" &gt;Proxmox cluster in the homelab&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Slurm</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/slurm/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/slurm/</guid><description>&lt;p&gt;Slurm (originally an acronym for Simple Linux Utility for Resource Management) is an open-source workload manager and job scheduler. It originated in HPC and is now widely used for ML training clusters; AWS, GCP, and Azure all offer Slurm-based HPC cluster tooling.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://slurm.schedmd.com/" target="_blank" rel="noopener"
 &gt;https://slurm.schedmd.com/&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="core-concepts"&gt;Core Concepts
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Node&lt;/strong&gt; — a compute host registered with Slurm. Can have CPU, GPU, and memory resources defined.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Partition&lt;/strong&gt; — a logical group of nodes (like a queue). You can have separate partitions for GPU and CPU nodes, or different priority tiers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Job&lt;/strong&gt; — a workload submitted to the queue. Slurm allocates resources, places the job, and tracks it to completion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Allocation&lt;/strong&gt; — the reserved resources (nodes, CPUs, GPUs, memory) for a running job.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="key-commands"&gt;Key Commands
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sbatch job.sh &lt;span style="color:#75715e"&gt;# submit a batch job script&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;squeue &lt;span style="color:#75715e"&gt;# view the job queue&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sacct -j &amp;lt;jobid&amp;gt; &lt;span style="color:#75715e"&gt;# job accounting / history&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sinfo &lt;span style="color:#75715e"&gt;# view partition and node state&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;scancel &amp;lt;jobid&amp;gt; &lt;span style="color:#75715e"&gt;# cancel a job&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;srun --pty bash &lt;span style="color:#75715e"&gt;# interactive allocation&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A minimal batch script:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#!/bin/bash
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --job-name=train&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --nodes=1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --gpus=1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --time=02:00:00&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;python train.py
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="slurm-vs-kubernetes-for-training"&gt;Slurm vs Kubernetes for Training
&lt;/h2&gt;&lt;p&gt;The fundamental difference is what each system optimises for:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt; optimises for uptime. It keeps services running, reschedules failed pods, and manages long-lived workloads. That&amp;rsquo;s the right model for inference serving, APIs, and anything that needs to stay up.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Slurm&lt;/strong&gt; optimises for utilisation. It packs jobs onto nodes as tightly as possible, queues work when resources are busy, and gets out of the way when a job finishes. That&amp;rsquo;s the right model for batch training — you want every GPU busy, not reserved for availability.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;&lt;/th&gt;
 &lt;th&gt;Slurm&lt;/th&gt;
 &lt;th&gt;Kubernetes&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Optimises for&lt;/td&gt;
 &lt;td&gt;Maximum utilisation&lt;/td&gt;
 &lt;td&gt;Uptime and availability&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Scheduling model&lt;/td&gt;
 &lt;td&gt;Job queue, batch-first&lt;/td&gt;
 &lt;td&gt;Long-running services + batch (via operators)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;GPU allocation&lt;/td&gt;
 &lt;td&gt;Native, fine-grained&lt;/td&gt;
 &lt;td&gt;Requires a device plugin (e.g. via the NVIDIA GPU Operator)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Multi-node training&lt;/td&gt;
 &lt;td&gt;First-class (MPI, &lt;code&gt;srun&lt;/code&gt;)&lt;/td&gt;
 &lt;td&gt;Possible via Kubeflow (e.g. PyTorchJob)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Preemption&lt;/td&gt;
 &lt;td&gt;Built-in&lt;/td&gt;
 &lt;td&gt;Requires configuration&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Operational overhead&lt;/td&gt;
 &lt;td&gt;Low on bare metal&lt;/td&gt;
 &lt;td&gt;Higher — requires cluster management&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Ecosystem&lt;/td&gt;
 &lt;td&gt;HPC, academia, major cloud HPC&lt;/td&gt;
 &lt;td&gt;ML platforms, cloud-native&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
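&lt;p&gt;To make the multi-node row concrete, here is a hedged sketch of a two-node batch script. The node and GPU counts are assumptions, and &lt;code&gt;train.py&lt;/code&gt; is a hypothetical script that would have to set up its own distributed communication. The &lt;code&gt;launch&lt;/code&gt; function and the fallback message exist only so the script degrades gracefully outside a Slurm cluster.&lt;/p&gt;

```shell
#!/bin/bash
#SBATCH --job-name=train-multinode
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --time=02:00:00

launch() {
    # Slurm exports these inside an allocation; the fallbacks apply outside one.
    echo "job ${SLURM_JOB_ID:-none} on ${SLURM_JOB_NUM_NODES:-0} node(s)"

    if command -v srun >/dev/null; then
        # srun starts one task per node across the whole allocation.
        srun python train.py
    else
        echo "srun not found; submit this script with sbatch on a Slurm cluster"
    fi
}

launch
```

&lt;p&gt;Submitted with &lt;code&gt;sbatch&lt;/code&gt;, the directives request the allocation and &lt;code&gt;srun&lt;/code&gt; fans the task out across it.&lt;/p&gt;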
&lt;p&gt;&lt;strong&gt;The short version:&lt;/strong&gt; use Slurm for pure batch training on bare metal. Use Kubernetes when you&amp;rsquo;re mixing training with inference serving or need container-native workflows throughout — or when the cluster already exists.&lt;/p&gt;</description></item><item><title>LVM — Logical Volume Manager</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/lvm/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/lvm/</guid><description>&lt;p&gt;LVM adds a virtualisation layer between physical disks and filesystems. Instead of formatting a disk partition directly, you assemble physical volumes into a volume group and carve logical volumes out of the pool. This makes resizing, snapshots, and spanning volumes across multiple disks straightforward operations rather than destructive partition table surgery.&lt;/p&gt;
&lt;h2 id="layers"&gt;Layers
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Layer&lt;/th&gt;
 &lt;th&gt;Description&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Physical Volume (PV)&lt;/td&gt;
 &lt;td&gt;A disk or partition initialised for LVM use (&lt;code&gt;pvcreate&lt;/code&gt;)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Volume Group (VG)&lt;/td&gt;
 &lt;td&gt;A pool of storage assembled from one or more PVs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Logical Volume (LV)&lt;/td&gt;
 &lt;td&gt;A virtual partition carved from a VG, formatted and mounted like a regular disk&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Initialise two disks as PVs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;pvcreate /dev/sdb /dev/sdc
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Create a VG from both&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;vgcreate data-vg /dev/sdb /dev/sdc
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Create an LV using all available space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;lvcreate -l 100%FREE -n data-lv data-vg
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Format and mount&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;mkfs.ext4 /dev/data-vg/data-lv
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;mount /dev/data-vg/data-lv /mnt/data
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="resizing"&gt;Resizing
&lt;/h2&gt;&lt;p&gt;The practical benefit over raw partitions: extend a logical volume online without unmounting:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Extend the LV by 50GB&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;lvextend -L +50G /dev/data-vg/data-lv
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Grow the filesystem to fill the new space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;resize2fs /dev/data-vg/data-lv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="snapshots"&gt;Snapshots
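&lt;/h2&gt;&lt;p&gt;LVM supports copy-on-write snapshots. A snapshot captures the LV state at a point in time and only consumes space for blocks that change afterwards, so it can be much smaller than the origin, but it becomes invalid if its reserved space fills up. The example reserves 10 GiB:&lt;/p&gt;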
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;lvcreate -L 10G -s -n data-snap /dev/data-vg/data-lv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Used for consistent backups of live filesystems — snapshot, back up the snapshot, remove it. Rook/Ceph and cloud providers use similar snapshot semantics at the storage layer.&lt;/p&gt;
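&lt;p&gt;The snapshot, back up, remove cycle can be sketched as follows. This is a dry-run outline under stated assumptions: the mount point &lt;code&gt;/mnt/data-snap&lt;/code&gt; and the archive path &lt;code&gt;/backup/data.tar.gz&lt;/code&gt; are hypothetical, and the volume names reuse the examples above. By default it only prints the commands.&lt;/p&gt;

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 and run as root to execute them. Paths are hypothetical.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# The body runs in a subshell with set -e so a failed step stops the sequence.
backup_via_snapshot() (
    set -e

    # 1. Snapshot: freeze a point-in-time view of the live LV.
    run lvcreate -L 10G -s -n data-snap /dev/data-vg/data-lv

    # 2. Back up the snapshot, mounted read-only, while the origin stays in use.
    run mkdir -p /mnt/data-snap
    run mount -o ro /dev/data-vg/data-snap /mnt/data-snap
    run tar -czf /backup/data.tar.gz -C /mnt/data-snap .
    run umount /mnt/data-snap

    # 3. Remove the snapshot so it stops consuming copy-on-write space.
    run lvremove -y /dev/data-vg/data-snap
)

backup_via_snapshot
```

&lt;p&gt;Removing the snapshot promptly matters because every write to the origin costs extra copy-on-write work while the snapshot exists.&lt;/p&gt;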
&lt;h2 id="resources"&gt;Resources
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://sourceware.org/lvm2/" target="_blank" rel="noopener"
 &gt;LVM2 documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/configuring_and_managing_logical_volumes/" target="_blank" rel="noopener"
 &gt;Red Hat LVM administration guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>