Cloud & Infrastructure on Backend Engineering Strategy Tools

Ceph

Thu, 14 May 2026 00:00:00 +0000

Ceph is an open-source distributed storage platform providing object, block, and file storage in a single unified system. It runs across multiple nodes and is designed to have no single point of failure.

The core idea: data is not stored on specific disks on specific nodes. Instead, the CRUSH algorithm computes where each object lives, spreading data across all available OSDs (Object Storage Daemons) according to the cluster's CRUSH map. Add nodes and the cluster rebalances automatically. Lose a node and Ceph re-replicates from surviving copies without operator intervention.
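To make the placement idea concrete, here is a minimal Python sketch. It does not implement CRUSH. It uses rendezvous (highest-random-weight) hashing as a simpler stand-in that shares the key property: anyone can compute where an object lives from the object name and the list of OSDs, with no lookup table. The OSD names, object names, and the `score` and `place` functions are hypothetical.

```python
import hashlib


def score(obj: str, osd: str) -> int:
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{obj}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(obj: str, osds: list, replicas: int = 3) -> list:
    """Pick `replicas` distinct OSDs for an object by ranking all OSDs."""
    ranked = sorted(osds, key=lambda osd: score(obj, osd), reverse=True)
    return ranked[:replicas]


osds = [f"osd.{i}" for i in range(6)]
objects = [f"obj-{i}" for i in range(1000)]

before = {o: place(o, osds) for o in objects}

# Add an OSD: only objects that rank the new OSD in their top 3 change.
after_add = {o: place(o, osds + ["osd.6"]) for o in objects}
moved = sum(1 for o in objects if before[o] != after_add[o])

# Lose an OSD: surviving replicas stay put, one replacement is chosen.
survivors = [x for x in osds if x != "osd.2"]
after_loss = {o: place(o, survivors) for o in objects}

print(f"placements changed after adding one OSD: {moved} of {len(objects)}")
```

Real CRUSH additionally accounts for device weights and failure domains (host, rack), which this sketch ignores.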

Storage types

Type	Interface	Typical use
Block (RBD)	Kernel block device / iSCSI	Kubernetes PVCs, VM disks
Object (RGW)	S3-compatible API	Backups, artifacts, media
File (CephFS)	POSIX filesystem / NFS	Shared filesystems, home dirs

For Kubernetes workloads, RBD block storage via a StorageClass is the common path.

Components

MON (Monitor) — maintains the cluster map; quorum-based, so an odd number is recommended (typically 3 or 5). Not in the data path.

OSD (Object Storage Daemon) — one per disk; handles actual data reads/writes and replication.

MGR (Manager) — collects metrics, hosts the dashboard, runs modules (balancer, prometheus, etc.).

MDS (Metadata Server) — only required for CephFS; manages the filesystem namespace.
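Why an odd number of monitors: quorum needs a strict majority, so an even count adds a machine without adding failure tolerance. A small Python sketch of the arithmetic (the function names are illustrative, not part of any Ceph API):

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority of `monitors`."""
    return monitors // 2 + 1


def failures_tolerated(monitors: int) -> int:
    """How many monitors can be down while a majority is still up."""
    return monitors - quorum_size(monitors)


for n in range(1, 6):
    print(f"{n} MONs: quorum={quorum_size(n)}, tolerates {failures_tolerated(n)} down")
```

Four monitors tolerate one failure, the same as three; the next real improvement comes at five.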

Single-node constraint

A single-node Ceph cluster can be made to run (allowMultiplePerNode: true in Rook, replication size: 1), but it provides no actual redundancy. There is nothing to replicate to. This is fine for testing concepts; it is not a valid storage setup for anything you care about.
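The trade-off behind the replication size can be put in numbers. This is a rough Python sketch that assumes a replicated pool and ignores real-world overheads (full ratios, metadata, uneven placement); the function name and the 12 TB figure are made up for illustration.

```python
def replicated_pool(raw_tb: float, size: int) -> dict:
    """Rough usable capacity and copy-loss tolerance for a replicated pool."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return {
        "usable_tb": raw_tb / size,
        # Copies of an object that can be lost before the data is gone.
        "copies_that_can_be_lost": size - 1,
    }


print(replicated_pool(12.0, 3))
print(replicated_pool(12.0, 1))
```

With size 1 all raw capacity is usable, and the first lost disk loses data.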

Related

Ceph documentation
Rook — Kubernetes operator that manages Ceph clusters inside K8s
Proxmox — Ceph is a native storage backend in Proxmox clusters
Rook + Ceph in the homelab

OpenStack

Thu, 14 May 2026 00:00:00 +0000

OpenStack is an open-source IaaS platform — it turns a pool of bare-metal servers into a self-service cloud: virtual machines, block storage, networking, and object storage, all driven by API.

https://www.openstack.org/

Scale and fit

There is a rough spectrum of virtualization tools, and picking the wrong tier is a common mistake:

Proxmox / VMware / Hyper-V — the right choice when you want to run virtual machines. SMB, homelab, or a small ops team managing infrastructure directly. Reasonable setup cost, manageable operational overhead, one or a few admins in control. Think of Proxmox as a VMware replacement.

OpenStack — the right choice when you are building a cloud, not just running VMs. Multi-tenant infrastructure where teams self-service their own compute, networking, and storage via API. The operational complexity is real and significant; it pays off when the cloud-like abstraction is the actual product, or when the scale justifies the overhead.

The rule of thumb: if the question is “how do I replace VMware?”, the answer is Proxmox. If the question is “how do I build a private cloud platform?”, the answer might be OpenStack.

Core Components

Service	Code Name	What it does
Compute	Nova	Schedules and manages VM lifecycle
Networking	Neutron	Virtual networks, routers, floating IPs, security groups
Block Storage	Cinder	Persistent volumes attached to VMs
Image Service	Glance	Stores and serves OS images
Identity	Keystone	Auth, service catalog, RBAC
Dashboard	Horizon	Web UI (optional)
Object Storage	Swift	S3-like object storage (optional)
Bare Metal	Ironic	Provisions physical machines instead of VMs

You do not need all of them. A minimal useful deployment is Nova + Neutron + Cinder + Glance + Keystone.

OpenStack on Kubernetes

OpenStack services are just applications — and they can run as Kubernetes workloads. Two projects make this practical:

OpenStack-Helm — official Helm charts for deploying OpenStack services on an existing Kubernetes cluster. Each service (Nova, Neutron, Cinder, etc.) becomes a Helm release. Upgrades follow standard rolling deployment patterns.

Atmosphere (by VEXXHOST) — a higher-level operator built on top of OpenStack-Helm. Adds Ansible automation, health checks, and a more opinionated deployment model. Targets production use.

The practical implication: you can run a Talos cluster and deploy OpenStack on top of it — OpenStack as a tenant of Kubernetes rather than a separate platform. This inverts the usual relationship (where Kubernetes runs on top of OpenStack) and is an interesting architectural option for homelab and small private cloud deployments.

Fairbanks (Dutch hosting company specialising in sovereign private clouds) does exactly this in production. Their talk "OpenStack on Talos Linux" is the clearest real-world example of the pattern.

Deployment Options

Kolla-Ansible
https://docs.openstack.org/kolla-ansible/latest/
Containerised OpenStack deployed via Ansible. Production-grade, well-maintained. The practical choice for homelab and small-scale production deployments. Each service runs in its own container.

DevStack
https://docs.openstack.org/devstack/latest/
All-in-one development install. Not for production or anything you want to survive a reboot. Good for learning the API surface.

Canonical OpenStack (Juju / Sunbeam)
https://ubuntu.com/openstack
Ubuntu-opinionated deployment. Sunbeam is a newer minimal-footprint option. Good if you’re already in the Ubuntu/Juju ecosystem.

Concepts Worth Understanding

Flavors — VM sizing templates (vCPU, RAM, disk). You define these; instances pick from them.

Security Groups — stateful firewall rules applied per-port. Default-deny inbound.

Floating IPs — externally routable IPs that can be associated/disassociated from instances dynamically.

Availability Zones — logical groupings of compute nodes. Useful for fault isolation even at small scale.

Hypervisors — Nova supports KVM (default), QEMU, VMware, and others. KVM on Linux is the standard.
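Flavors are easiest to understand as a fixed menu. The Python sketch below models a hypothetical flavor catalogue and a helper that picks the smallest entry satisfying a request. This is illustration only: the flavor names and sizes are invented, and `smallest_fit` is the kind of helper tooling might add, not something Nova does for you.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flavor:
    name: str
    vcpus: int
    ram_mb: int
    disk_gb: int


# Hypothetical flavor catalogue; real names and sizes are operator-defined.
FLAVORS = [
    Flavor("small", vcpus=1, ram_mb=2048, disk_gb=20),
    Flavor("medium", vcpus=2, ram_mb=4096, disk_gb=40),
    Flavor("large", vcpus=4, ram_mb=8192, disk_gb=80),
]


def smallest_fit(vcpus: int, ram_mb: int, disk_gb: int) -> Optional[Flavor]:
    """Return the smallest flavor that satisfies all three requirements."""
    fits = [
        f for f in FLAVORS
        if f.vcpus >= vcpus and f.ram_mb >= ram_mb and f.disk_gb >= disk_gb
    ]
    return min(fits, key=lambda f: (f.vcpus, f.ram_mb, f.disk_gb), default=None)


print(smallest_fit(vcpus=2, ram_mb=3000, disk_gb=30))
```

A request that no flavor covers returns `None`: instances cannot ask for arbitrary sizes, only for an existing flavor.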

Relevance to the Lab

The LLM training experiment plans to use OpenStack as the IaaS layer over the blade nodes in ASGARD — Nova for compute scheduling, Neutron for cluster networking, Cinder for shared model/dataset storage backed by Ceph.

Proxmox VE

Thu, 14 May 2026 00:00:00 +0000

Proxmox VE (Virtual Environment) is an open-source Type 1 hypervisor built on Debian. It runs KVM for full virtual machines and LXC for lightweight containers, managed through a web UI or API. A subscription is optional: the software is fully functional without a paid license, and a subscription adds access to the enterprise update repository and support.

Comparison

Platform	License	VMs (KVM)	Containers	Clustering	Web UI
Proxmox VE	Open-source (optional sub)	Yes	Yes (LXC)	Yes	Yes
VMware ESXi	Commercial	Yes	No	Yes (vCenter)	Yes
Standalone KVM	Open-source	Yes	No	Manual	No
oVirt	Open-source	Yes	No	Yes	Yes

Proxmox is the practical choice when you want VMware-style management without the licensing cost, or when you want to run both VMs and containers on the same node.

Core concepts

Node — a physical host running Proxmox VE. Managed independently or as part of a cluster.

Cluster — multiple nodes joined together. Share a unified management view and allow live migration of VMs between nodes. Uses Corosync for distributed consensus.

Quorum — clusters require a majority of nodes to be reachable to avoid split-brain. Minimum useful cluster size is 3 nodes (loss of one node still leaves a majority). Two-node clusters need a quorum device (qdevice) to function safely.

VM — full virtual machine backed by QEMU/KVM. Hardware-level isolation. Arbitrary OS.

Container (CT) — LXC container. Shares the host kernel; lower overhead than a VM. Linux-only. Useful for services where you want process-level isolation without a full OS.

Storage pool — where disks and images live. Supported backends: local directory, LVM, LVM-thin, ZFS, NFS, CIFS, and Ceph (via rbd). ZFS and Ceph are the most capable options for a cluster — ZFS for local redundancy, Ceph for shared storage across nodes.
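The quorum rule above is simple vote counting. The Python sketch below assumes one vote per node and one vote for a qdevice, and shows which side of a network split stays quorate. The function names are illustrative; this models the rule, not Corosync itself.

```python
from typing import Optional


def is_quorate(votes_present: int, total_votes: int) -> bool:
    """A partition is quorate when it holds a strict majority of all votes."""
    return votes_present > total_votes // 2


def split(side_a: int, side_b: int, qdevice_side: Optional[str] = None) -> dict:
    """Quorum state of both sides after a network split.

    side_a / side_b: number of nodes on each side (one vote each).
    qdevice_side: "a" or "b" if a qdevice exists and votes with that side.
    """
    total = side_a + side_b + (1 if qdevice_side else 0)
    votes_a = side_a + (1 if qdevice_side == "a" else 0)
    votes_b = side_b + (1 if qdevice_side == "b" else 0)
    return {"a": is_quorate(votes_a, total), "b": is_quorate(votes_b, total)}


print(split(2, 1))                    # three nodes, one isolated
print(split(1, 1))                    # two nodes, no qdevice
print(split(1, 1, qdevice_side="a"))  # two nodes plus a qdevice
```

In the two-node case without a qdevice neither side has a majority, so the whole cluster stops accepting changes. With the qdevice, exactly one side keeps running.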

Related

Proxmox VE documentation
Proxmox community forum
Corosync documentation
Ceph — distributed storage backend for Proxmox clusters
OpenStack — the next tier up the scale spectrum
Proxmox cluster in the homelab

Slurm

Thu, 14 May 2026 00:00:00 +0000

Slurm (originally the Simple Linux Utility for Resource Management) is a workload manager and job scheduler. It originated in HPC and is now widely used for ML training clusters; the major clouds all offer Slurm-based cluster options (for example AWS ParallelCluster and Azure CycleCloud).

https://slurm.schedmd.com/

Core Concepts

Node — a compute host registered with Slurm. Can have CPU, GPU, and memory resources defined.

Partition — a logical group of nodes (like a queue). You can have separate partitions for GPU and CPU nodes, or different priority tiers.

Job — a workload submitted to the queue. Slurm allocates resources, places the job, and tracks it to completion.

Allocation — the reserved resources (nodes, CPUs, GPUs, memory) for a running job.

Key Commands

sbatch job.sh # submit a batch job script
squeue # view the job queue
sacct -j <jobid> # job accounting / history
sinfo # view partition and node state
scancel <jobid> # cancel a job
srun --pty bash # interactive allocation

A minimal batch script:

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --time=02:00:00

python train.py
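For parameter sweeps it is common to generate scripts like the one above instead of writing them by hand. A minimal Python sketch, using only the directives already shown; `render_sbatch` is a hypothetical helper, not part of Slurm.

```python
def render_sbatch(job_name: str, nodes: int, gpus: int,
                  time_limit: str, command: str) -> str:
    """Render a batch script using the same directives as the example above."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gpus={gpus}",
        f"#SBATCH --time={time_limit}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


script = render_sbatch("train", nodes=1, gpus=1,
                       time_limit="02:00:00", command="python train.py")
print(script)
```

Write the result to a file and submit it with `sbatch` as usual.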

Slurm vs Kubernetes for Training

The fundamental difference is what each system optimises for:

Kubernetes optimises for uptime. It keeps services running, reschedules failed pods, and manages long-lived workloads. That’s the right model for inference serving, APIs, and anything that needs to stay up.

Slurm optimises for utilisation. It packs jobs onto nodes as tightly as possible, queues work when resources are busy, and gets out of the way when a job finishes. That’s the right model for batch training — you want every GPU busy, not reserved for availability.

	Slurm	Kubernetes
Optimises for	Maximum utilisation	Uptime and availability
Scheduling model	Job queue, batch-first	Long-running services + batch (via operators)
GPU allocation	Native, fine-grained	Requires GPU operator + device plugin
Multi-node training	First-class (MPI, `srun`)	Possible via Kubeflow (PyTorchJob)
Preemption	Built-in	Requires configuration
Operational overhead	Low on bare metal	Higher — requires cluster management
Ecosystem	HPC, academia, major cloud HPC	ML platforms, cloud-native

The short version: use Slurm for pure batch training on bare metal. Use Kubernetes when you’re mixing training with inference serving or need container-native workflows throughout — or when the cluster already exists.
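The "pack jobs, queue the rest" behaviour can be sketched in a few lines of Python. This is a deliberately naive first-fit placement for single-node GPU jobs, not Slurm's actual algorithm (which uses priorities and backfill); the node names, job names, and GPU counts are invented.

```python
from collections import deque


def schedule(free_gpus: dict, jobs: list) -> tuple:
    """First-fit placement in submission order; jobs that do not fit wait.

    free_gpus: node name -> free GPU count (mutated in place).
    jobs: list of (job name, GPUs needed).
    Returns (placements, queue of jobs still waiting).
    """
    placements = {}
    waiting = deque()
    for name, need in jobs:
        node = next((n for n, free in free_gpus.items() if free >= need), None)
        if node is None:
            waiting.append((name, need))  # stays queued until GPUs free up
        else:
            free_gpus[node] -= need
            placements[name] = node
    return placements, waiting


nodes = {"node1": 4, "node2": 4}
jobs = [("a", 3), ("b", 2), ("c", 1), ("d", 2), ("e", 4)]
placed, waiting = schedule(nodes, jobs)
print(placed, list(waiting), nodes)
```

Every GPU ends up busy and job `e` waits in the queue. Nothing is held back for availability, which is the opposite of how a service-oriented scheduler is usually run.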

LVM — Logical Volume Manager

Mon, 01 Jan 2024 00:00:00 +0000

LVM adds a virtualisation layer between physical disks and filesystems. Instead of formatting a disk partition directly, you assemble physical volumes into a volume group and carve logical volumes out of the pool. This makes resizing, snapshots, and spanning volumes across multiple disks straightforward operations rather than destructive partition table surgery.

Layers

Layer	Description
Physical Volume (PV)	A disk or partition initialised for LVM use (`pvcreate`)
Volume Group (VG)	A pool of storage assembled from one or more PVs
Logical Volume (LV)	A virtual partition carved from a VG, formatted and mounted like a regular disk

# Initialise two disks as PVs
pvcreate /dev/sdb /dev/sdc

# Create a VG from both
vgcreate data-vg /dev/sdb /dev/sdc

# Create an LV using all available space
lvcreate -l 100%FREE -n data-lv data-vg

# Format and mount
mkfs.ext4 /dev/data-vg/data-lv
mount /dev/data-vg/data-lv /mnt/data

Resizing

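The practical benefit over raw partitions: a logical volume can be extended online, without unmounting, as long as the volume group still has free extents. The example above allocated 100%FREE, so the group needs another PV before the extend will succeed:

# Add another disk to the VG first (/dev/sdd is a placeholder)
vgextend data-vg /dev/sdd

# Extend the LV by 50 GiB
lvextend -L +50G /dev/data-vg/data-lv

# Grow the ext4 filesystem to fill the new space (XFS uses xfs_growfs instead)
resize2fs /dev/data-vg/data-lv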
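The layering and the resize can be modelled as a pool with bookkeeping. This Python toy ignores extents and on-disk layout and tracks sizes only; the class, the disk sizes, and `/dev/sdd` are assumptions for illustration.

```python
class VolumeGroup:
    """Toy model of a VG: a pool of capacity shared by logical volumes."""

    def __init__(self, name: str, pv_sizes_gb: dict):
        self.name = name
        self.pvs = dict(pv_sizes_gb)  # PV device -> size in GB
        self.lvs = {}                 # LV name -> size in GB

    @property
    def total_gb(self) -> int:
        return sum(self.pvs.values())

    @property
    def free_gb(self) -> int:
        return self.total_gb - sum(self.lvs.values())

    def create_lv(self, name: str, size_gb: int) -> None:
        if size_gb > self.free_gb:
            raise ValueError("not enough free space in VG")
        self.lvs[name] = size_gb

    def extend_lv(self, name: str, extra_gb: int) -> None:
        if extra_gb > self.free_gb:
            raise ValueError("not enough free space in VG")
        self.lvs[name] += extra_gb

    def add_pv(self, device: str, size_gb: int) -> None:
        self.pvs[device] = size_gb    # analogous to vgextend


vg = VolumeGroup("data-vg", {"/dev/sdb": 100, "/dev/sdc": 100})
vg.create_lv("data-lv", 120)  # larger than either disk: spans both PVs
vg.add_pv("/dev/sdd", 100)
vg.extend_lv("data-lv", 50)
print(vg.lvs, vg.free_gb)
```

The LV never needs to know which disk its space comes from. That indirection is what makes spanning and online growth cheap.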

Snapshots

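LVM supports copy-on-write snapshots. A snapshot captures the LV state at a point in time and stores only the original contents of blocks that change afterwards. The size passed to lvcreate (10G below) is the space reserved for those preserved blocks; if it fills up, the snapshot becomes invalid: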

lvcreate -L 10G -s -n data-snap /dev/data-vg/data-lv

Snapshots are used for consistent backups of live filesystems: take the snapshot, back it up, then remove it. Rook/Ceph and cloud providers use similar snapshot semantics at the storage layer.
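Copy-on-write is easier to see in code than in prose. The Python sketch below models a volume as a list of blocks: the snapshot starts empty and only saves a block's old contents the first time that block is overwritten. The class names are hypothetical and the model ignores snapshot size limits.

```python
class Volume:
    def __init__(self, blocks: list):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self) -> "Snapshot":
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index: int, data: str) -> None:
        # Copy-on-write: preserve the old block for each snapshot
        # the first time that block changes after the snapshot was taken.
        for snap in self.snapshots:
            snap.saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data


class Snapshot:
    def __init__(self, origin: Volume):
        self.origin = origin
        self.saved = {}  # block index -> original contents

    def read(self, index: int) -> str:
        # Changed blocks come from the saved copy, the rest from the origin.
        return self.saved.get(index, self.origin.blocks[index])


vol = Volume(["a", "b", "c", "d"])
snap = vol.snapshot()
vol.write(1, "B")
vol.write(1, "B2")
print(vol.blocks, [snap.read(i) for i in range(4)], snap.saved)
```

Two writes to the same block cost the snapshot one saved block, which is why a snapshot can be much smaller than its origin as long as it is short-lived.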
