Bare-Metal on Backend Engineering Strategy Tools

IPMI

Fri, 22 May 2026 00:00:00 +0000

IPMI (Intelligent Platform Management Interface) is a hardware-level management standard built into most server-class hardware. It runs on a dedicated processor on the motherboard — the BMC (Baseboard Management Controller) — independently of the host OS. The BMC has its own NIC, its own firmware, and its own IP address. You can power a server on or off, read sensor data, and access a serial console even if the host is completely dead.

Current version is IPMI 2.0, which added encryption and stronger authentication over 1.5.

BMC implementations by vendor

IPMI is the standard; each vendor ships their own BMC firmware on top of it:

Vendor	BMC / OOB product	Notes
Dell	iDRAC (Integrated Dell Remote Access Controller)	iDRAC 6/7/8/9; newer versions add Redfish
HP / HPE	iLO (Integrated Lights-Out)	iLO 2/3/4/5; iLO 4+ adds Redfish
Sun / Oracle	ILOM (Integrated Lights-Out Manager)	Sun Fire series (X4150, X4450, etc.)
Supermicro	IPMI / BMC	Web UI + IPMI; newer boards also Redfish
Lenovo / IBM	XClarity / IMM	IMM2 on older systems
HP BladeSystem	Onboard Administrator (OA)	Enclosure-level management (C7000, C3000) — separate from individual blade iLO

Most also expose a web UI and some form of virtual KVM (keyboard/video/mouse over network) in addition to IPMI over LAN.

Network setup

The BMC NIC is either shared with a host NIC (shared/failover mode) or dedicated; dedicated is preferred for management. Configure it via BIOS/UEFI or the vendor’s setup utility before the OS boots.

Assign a static IP — a BMC on DHCP is workable but inconvenient. Keep BMCs on a dedicated management VLAN if possible; they have historically had security issues and shouldn’t be exposed to general traffic.

ipmitool

The standard CLI for IPMI over LAN. Available in most Linux package repos.

# Power control
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> power status
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> power on
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> power off
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> power cycle
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> power reset

# Sensor readings (temperatures, voltages, fan speeds)
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> sensor list

# System Event Log
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> sel list
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> sel clear

# Serial over LAN (SoL) — console access without KVM
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> sol activate
# Exit SoL: ~.

Use -I lanplus (IPMI 2.0 with encryption) rather than -I lan (IPMI 1.5, unencrypted) where supported.
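For scripting, it helps to keep the password off the command line, where it shows up in the process list. A minimal sketch, assuming ipmitool's -E flag (read the password from the IPMI_PASSWORD environment variable); the function names and the example address are hypothetical.

```python
import os
import subprocess


def ipmitool_argv(host, user, *command):
    """Build an ipmitool command line for IPMI 2.0 (lanplus).

    -E tells ipmitool to read the password from the IPMI_PASSWORD
    environment variable, so it never appears in `ps` output.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *command]


def run_ipmi(host, user, password, *command):
    """Run one ipmitool command and return its stdout (raises on failure)."""
    env = dict(os.environ, IPMI_PASSWORD=password)
    result = subprocess.run(
        ipmitool_argv(host, user, *command),
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout


# Example (hypothetical address and credentials, not executed here):
# print(run_ipmi("192.168.1.10", "admin", "secret", "power", "status"))
```

Splitting argv construction from execution makes the wrapper easy to check without a BMC on the network.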

Serial over LAN (SoL)

SoL forwards the server’s serial port over the IPMI connection — giving you a text console to the host without a KVM or physical access. Requires the host OS to have serial console enabled:

# Add to GRUB_CMDLINE_LINUX in /etc/default/grub
console=tty0 console=ttyS1,115200n8

# Enable serial getty
systemctl enable serial-getty@ttyS1.service

Baud rate must match what’s configured in the BIOS/BMC (typically 115200).

Security

IPMI has a poor security history:

IPMI 1.5 sends credentials in cleartext
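IPMI 2.0 has known authentication flaws: the RAKP handshake hands a salted password hash to unauthenticated clients for offline cracking, and cipher suite 0 allows login without a valid password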
The BMC itself runs independent firmware that may have unpatched CVEs
Default credentials (admin/admin, ADMIN/ADMIN) are common and widely known

Minimum steps:

Change default credentials immediately
Use IPMI 2.0 (lanplus) only
Disable cipher suite 0: ipmitool -I lanplus ... lan set 1 cipher_privs Xaaaaaaaaaaaaaa (15 characters, one per cipher suite 0–14; X disables a suite, a allows it up to admin)
Isolate BMC network from internet and untrusted hosts — management VLAN with no external exposure
Keep BMC firmware updated
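A cipher_privs string is easier to audit once decoded. A rough sketch, assuming the format described in the ipmitool man page (one character per cipher suite, starting at suite 0; X = unused, c/u/o/a/O = callback/user/operator/admin/OEM). The function names are illustrative only.

```python
# Privilege letters as documented for `ipmitool lan set <channel> cipher_privs`.
PRIVILEGE = {
    "X": "disabled",
    "c": "callback",
    "u": "user",
    "o": "operator",
    "a": "admin",
    "O": "oem",
}


def decode_cipher_privs(privlist):
    """Map each cipher suite number to its maximum allowed privilege."""
    unknown = set(privlist) - set(PRIVILEGE)
    if unknown:
        raise ValueError(f"unknown privilege letters: {sorted(unknown)}")
    return {suite: PRIVILEGE[ch] for suite, ch in enumerate(privlist)}


def cipher_zero_disabled(privlist):
    """True when cipher suite 0 (no authentication, no encryption) is unusable."""
    return decode_cipher_privs(privlist)[0] == "disabled"


# Suite 0 switched off, admin allowed on suites 1-14.
print(cipher_zero_disabled("Xaaaaaaaaaaaaaa"))
```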

Redfish — the modern REST API replacement for IPMI
Out-of-band management overview
Hardware provisioning — PXE boot and bare-metal provisioning

Redfish

Fri, 22 May 2026 00:00:00 +0000

Redfish is a DMTF standard that defines a RESTful API for out-of-band server management. It replaces IPMI’s aging binary protocol with JSON over HTTPS — same capabilities (power control, sensors, firmware, console), but with a proper API, role-based access control, and standard authentication. Supported by all major server vendors on current-generation hardware.

Why Redfish over IPMI

	IPMI	Redfish
Protocol	Binary, UDP 623	HTTPS (REST/JSON)
Auth	RAKP (has CVEs)	HTTP Basic / Session tokens
Encryption	Optional (IPMI 2.0)	Always (TLS)
Discoverability	No	Yes (hypermedia)
Scripting	ipmitool flags	curl, Python, any HTTP client
Extensibility	Vendor OEM extensions	Structured OEM namespaces
Maturity	Established, aging	Modern, actively developed

Redfish is not universally available — older hardware (pre-2015 roughly) has IPMI only. Both coexist on many current systems; IPMI is still useful for compatibility. See IPMI.

Vendor implementations

Vendor	BMC	Redfish support
Dell	iDRAC 8+	Full, v1.0+
HPE	iLO 4+	Full (iLO 5 most complete)
Supermicro	BMC (X11+)	Full
Lenovo	XClarity	Full
Intel	BMC on server boards	Partial
OpenBMC	Open-source BMC firmware	Full (used by Facebook, Google infra)
AMI MegaRAC	OEM BMC firmware	Full

API structure

Redfish uses a consistent URL hierarchy rooted at /redfish/v1/. Navigation is hypermedia-driven — the root returns links to subsystems, and you follow them.

/redfish/v1/
├── Systems/                  ← compute systems (servers)
│   └── 1/
│       ├── Processors/
│       ├── Memory/
│       ├── Storage/
│       └── Actions/ComputerSystem.Reset
├── Chassis/                  ← physical chassis, power, thermal
│   └── 1/
│       ├── Power/            ← PSU status, power consumption
│       └── Thermal/          ← temperatures, fan speeds
├── Managers/                 ← the BMC itself
│   └── 1/
│       └── EthernetInterfaces/
└── UpdateService/            ← firmware updates
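Member IDs are not always 1 (some vendors use longer names), so it is safer to follow the links than to hard-code paths. A minimal sketch of that navigation, assuming a caller-supplied fetch function that returns decoded JSON; the in-memory FAKE_BMC and its member ID are made up for illustration.

```python
def list_systems(fetch, root="/redfish/v1/"):
    """Follow links from the service root instead of hard-coding /Systems/1.

    `fetch` is any callable that takes a Redfish path and returns the decoded
    JSON body as a dict (for example a thin wrapper around an HTTP client).
    """
    service_root = fetch(root)
    systems_path = service_root["Systems"]["@odata.id"]
    collection = fetch(systems_path)
    return [member["@odata.id"] for member in collection["Members"]]


# In-memory stand-in for a BMC, shaped like Redfish responses.
FAKE_BMC = {
    "/redfish/v1/": {"Systems": {"@odata.id": "/redfish/v1/Systems"}},
    "/redfish/v1/Systems": {
        "Members": [{"@odata.id": "/redfish/v1/Systems/System.Embedded.1"}]
    },
}

print(list_systems(FAKE_BMC.__getitem__))
```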

Usage with curl

BMC="https://192.168.1.10"
USER="admin"
PASS="password"

# Get system overview
curl -sk -u "$USER:$PASS" "$BMC/redfish/v1/Systems/1" | jq .

# Power state
curl -sk -u "$USER:$PASS" "$BMC/redfish/v1/Systems/1" | jq .PowerState

# Power on
curl -sk -u "$USER:$PASS" -X POST \
 -H "Content-Type: application/json" \
 -d '{"ResetType":"On"}' \
 "$BMC/redfish/v1/Systems/1/Actions/ComputerSystem.Reset"

# Power off (graceful)
curl -sk -u "$USER:$PASS" -X POST \
 -H "Content-Type: application/json" \
 -d '{"ResetType":"GracefulShutdown"}' \
 "$BMC/redfish/v1/Systems/1/Actions/ComputerSystem.Reset"

# Force off
curl -sk -u "$USER:$PASS" -X POST \
 -H "Content-Type: application/json" \
 -d '{"ResetType":"ForceOff"}' \
 "$BMC/redfish/v1/Systems/1/Actions/ComputerSystem.Reset"

# Thermal — CPU temps, fan speeds
curl -sk -u "$USER:$PASS" "$BMC/redfish/v1/Chassis/1/Thermal" | jq '.Temperatures[] | {Name, ReadingCelsius}'

Reset types vary by vendor — check AllowableValues in the action schema:

curl -sk -u "$USER:$PASS" \
 "$BMC/redfish/v1/Systems/1" | jq '.Actions["#ComputerSystem.Reset"]["ResetType@Redfish.AllowableValues"]'

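In a script, the same check can pick a reset type the BMC actually supports before posting. A small sketch working on the decoded Systems response; the example body is trimmed and its values are invented.

```python
def allowable_reset_types(system):
    """Return the ResetType values this system advertises (may be empty)."""
    action = system.get("Actions", {}).get("#ComputerSystem.Reset", {})
    return action.get("ResetType@Redfish.AllowableValues", [])


def pick_reset_type(system, preferences):
    """Return the first preferred ResetType the BMC supports, else None."""
    allowed = allowable_reset_types(system)
    for candidate in preferences:
        if candidate in allowed:
            return candidate
    return None


# Trimmed example of a Systems/<id> response body (values invented).
example_system = {
    "PowerState": "On",
    "Actions": {
        "#ComputerSystem.Reset": {
            "target": "/redfish/v1/Systems/1/Actions/ComputerSystem.Reset",
            "ResetType@Redfish.AllowableValues": ["On", "ForceOff", "ForceRestart"],
        }
    },
}

# Prefer a graceful shutdown, fall back to a hard power-off.
print(pick_reset_type(example_system, ["GracefulShutdown", "ForceOff"]))
```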
Python — sushy

sushy is a Python Redfish client library maintained by the OpenStack Ironic project:

import sushy

client = sushy.Sushy("https://192.168.1.10", username="admin", password="password", verify=False)

system = client.get_system("/redfish/v1/Systems/1")
print(system.power_state) # On / Off
system.reset_system(sushy.RESET_ON)
system.reset_system(sushy.RESET_FORCE_OFF)

Session-based auth

For scripts making many requests, create a session to avoid re-authenticating on every call:

# Create session
SESSION=$(curl -sk -X POST \
 -H "Content-Type: application/json" \
 -d '{"UserName":"admin","Password":"password"}' \
 "https://192.168.1.10/redfish/v1/SessionService/Sessions" \
 -D -)

TOKEN=$(echo "$SESSION" | grep -i X-Auth-Token | awk '{print $2}' | tr -d '\r')

# Use token
curl -sk -H "X-Auth-Token: $TOKEN" \
 "https://192.168.1.10/redfish/v1/Systems/1" | jq .PowerState

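The session response carries two headers worth keeping: X-Auth-Token for later requests and Location, the URI of the session resource, which can be deleted to log out. A minimal sketch that parses both from a raw header dump such as the one curl -D - prints; the sample values are invented.

```python
def parse_session_headers(raw_headers):
    """Extract the session token and session URI from raw HTTP response headers.

    Redfish returns the token in X-Auth-Token and the URI of the new session
    resource in Location; a DELETE on that URI logs the session out.
    """
    found = {}
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            found[name.strip().lower()] = value.strip()
    return found.get("x-auth-token"), found.get("location")


# Invented sample of what a BMC might send back.
sample = (
    "HTTP/1.1 201 Created\r\n"
    "X-Auth-Token: 0123456789abcdef\r\n"
    "Location: /redfish/v1/SessionService/Sessions/42\r\n"
    "\r\n"
)
token, session_uri = parse_session_headers(sample)
print(token, session_uri)
```

Cleaning up sessions matters on BMCs, which often allow only a handful of concurrent sessions.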
Firmware updates

Redfish standardises firmware update via UpdateService:

# Check current firmware
curl -sk -u "$USER:$PASS" "$BMC/redfish/v1/UpdateService/FirmwareInventory" | jq .

# Push update (raw binary POST; the push URI is vendor-specific, so read HttpPushUri from /redfish/v1/UpdateService first)
curl -sk -u "$USER:$PASS" -X POST \
 -H "Content-Type: application/octet-stream" \
 --data-binary @firmware.bin \
 "$BMC/redfish/v1/UpdateService/update"

Vendor tooling (Dell racadm, HPE iLOrest) is often more reliable than raw curl for firmware updates.

IPMI — older binary protocol, still needed for pre-Redfish hardware
Out-of-band management overview
Hardware provisioning — PXE boot and bare-metal provisioning

ASGARD — the blade cluster

Fri, 15 May 2026 00:00:00 +0000

ASGARD (SYS-007) is the HP BladeSystem C7000 with 16× BL460c Gen8 blades. The reason to use it is profile switching: boot a blade as a Slurm compute node, run the experiment, reimage it as a Talos worker, run the next one. The same iPXE boot menu already set up for ODEN works here — the C7000 Onboard Administrator lets you configure boot order per blade slot, so switching roles is a BIOS setting and a PXE entry, not a reinstall.

Power reality

Before committing to blades as the permanent always-on platform, it’s worth being honest about the enclosure overhead. The C7000 has fixed costs regardless of how many blades are populated: 10 fans, dual OA modules, 2 interconnect switches, backplane management. It doesn’t scale down gracefully.

Setup	Approx power
C7000 enclosure alone (no blades)	200–400W
C7000 + 1 blade	350–550W
C7000 + 3 blades	500–800W
ODEN alone (1U M3, Talos)	100–150W
HEIMDAL alone (Sun X4150, router)	150–200W
ODEN + HEIMDAL	250–350W

Two pizza boxes beat three blades in the enclosure on power. The overhead only amortises at 8+ populated slots. For a permanent minimal setup, the 1U rack servers win. For experiments where you want to run 8–16 nodes at once, ASGARD earns its place.
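The amortisation argument is simple arithmetic. A back-of-envelope sketch, assuming round midpoints read off the table (about 300W of enclosure overhead and about 125W per populated blade); these are assumptions, not measurements.

```python
# Assumed round numbers, read off the midpoints of the table above.
ENCLOSURE_OVERHEAD_W = 300   # empty C7000: 200-400 W
PER_BLADE_W = 125            # rough increment per populated blade
RACK_SERVER_W = 150          # (ODEN + HEIMDAL) / 2, midpoint of 250-350 W


def watts_per_blade(blades):
    """Total enclosure draw divided across the populated blades."""
    return (ENCLOSURE_OVERHEAD_W + blades * PER_BLADE_W) / blades


for blades in (1, 3, 8, 16):
    print(f"{blades:2d} blades: {watts_per_blade(blades):6.1f} W per node "
          f"(1U server: ~{RACK_SERVER_W} W)")
```

With these assumptions the per-node figure only gets close to a 1U server at around 8 blades, and drops below it with a full enclosure.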

What each role actually needs

Role	RAM	Disk	Network	Limiting factor
Talos / K8s worker	32–64GB	1× OSD disk	1GbE fine	RAM — current blades too thin
OpenStack compute	32–64GB	local ephemeral	1GbE fine	RAM
OpenStack control	32GB+	small	1GbE fine	RAM
Slurm compute	as much as possible	fast scratch	1GbE mediocre	network
Ceph OSD	16–32GB	more / bigger disks	1GbE	disk count

The network note matters for Slurm: blade LOM connects to the enclosure switch backplane at 1GbE, not 10GbE. The switch has 10GbE uplinks going out, but blade-to-blade traffic inside the enclosure goes through the switch at 1GbE. For Talos and OpenStack this is fine. For MPI jobs exchanging large datasets between Slurm nodes it’s a real bottleneck — HPC wants InfiniBand, which the empty interconnect bays 5–8 could take (plus matching mezzanine cards in each blade), but that’s a separate cost. For learning Slurm, 1GbE is workable.
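To put the 1GbE limit in numbers, here is a rough sketch of theoretical transfer time at line rate. The 10 GB exchange size is a hypothetical example, and real throughput will be lower once protocol overhead is included.

```python
def transfer_seconds(gigabytes, link_gbit_per_s, efficiency=1.0):
    """Time to move `gigabytes` (decimal GB) over a link of the given speed.

    `efficiency` scales the line rate to allow for protocol overhead;
    1.0 means the theoretical maximum.
    """
    bits = gigabytes * 8e9
    return bits / (link_gbit_per_s * 1e9 * efficiency)


# Hypothetical 10 GB exchange between two Slurm nodes.
for name, speed in (("1GbE", 1), ("10GbE", 10)):
    print(f"{name}: {transfer_seconds(10, speed):.0f} s")
```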

Current blade state

Most blades are underpowered for any of the roles above. CPUs are also unknown across all 16 slots — the OA web GUI reports CPU model and core count per blade and should be checked first. The E5-2600 v1 range runs from E5-2603 (4c, 80W) to E5-2690 (8c/16t, 135W), which matters significantly for role assignment.

Slot	RAM	Disk
BLD-001	4GB	2× 146GB SAS
BLD-002	14GB (mixed, odd count)	—
BLD-003	32GB	2× 300GB SAS
BLD-004	8GB	—
BLD-005	8GB	1× 146GB + 1× 300GB SAS
BLD-006	8GB	2× 300GB SAS
BLD-007	8GB	2× 900GB SAS
BLD-008	16GB	2× 300GB SAS
BLD-009	8GB	—
BLD-010	8GB	2× 300GB SAS
BLD-011	8GB	2× 300GB SAS
BLD-012	8GB	2× 300GB SAS
BLD-013	32GB	—
BLD-014	8GB	—
BLD-015	8GB	2× 300GB SAS
BLD-016	8GB	—

BLD-003 and BLD-013 are already at 32GB and are natural candidates for control-plane or master roles once CPUs are confirmed.

Suggested configuration from existing stock

Available spare hardware:

14× RAM-007 (8GB DDR3 1600MHz ECC Reg) — unassigned
2× HDD-004 (120GB SATA SSD) — spare
6× HDD-002 (146GB 10K SAS) — spare
Embedded P220i on each blade (can be set to JBOD/passthrough for Ceph)

“Fat” nodes × 2 (Talos control plane, OpenStack control, Slurm master): Add 4× RAM-007 to each blade. Both candidates, BLD-006 and BLD-010, start at 8GB, so that gives 40GB each, and both have 2× 300GB SAS for local storage. Costs 8 of 14 spare sticks. Install a spare 120GB SSD as boot disk in each.

“Medium” nodes × 3 (Talos workers, OpenStack compute, Slurm compute): Add 2× RAM-007 to each → 24GB from the 8GB base. Candidates: BLD-008 (already 16GB, gets to 32GB), BLD-011, BLD-012. All three have 300GB SAS for scratch or Ceph OSDs. Costs the remaining 6 spare sticks.

Rest (thin compute, storage expansion, or powered off): Leave at current RAM. BLD-007’s 900GB SAS pair is better used elsewhere (see below). BLD-003 and BLD-013 at 32GB can step up to fat-node role once CPUs are confirmed.

That leaves 5 blades properly kitted and 11 available for experiments or idle.
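A quick sanity check of the stick budget, using the blade RAM figures from the table and the plan above. It assumes the spare DIMMs are added alongside the existing ones rather than replacing them.

```python
STICK_GB = 8          # RAM-007 is an 8GB DIMM
SPARE_STICKS = 14

# (blade, current GB, sticks to add), taken from the plan above
plan = [
    ("BLD-006", 8, 4), ("BLD-010", 8, 4),                       # fat
    ("BLD-008", 16, 2), ("BLD-011", 8, 2), ("BLD-012", 8, 2),   # medium
]

used = sum(sticks for _, _, sticks in plan)
result = {blade: base + sticks * STICK_GB for blade, base, sticks in plan}

print(f"sticks used: {used} of {SPARE_STICKS}")
for blade, total in result.items():
    print(f"{blade}: {total}GB")
```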

BL460c Gen8 DIMM rule: populate per-CPU symmetrically — pairs or quads per memory channel — for best throughput. Don’t mix odd counts.

Storage — what moves where

Pull the 900GB SAS drives from BLD-007 now. HDD-013 (HGST 900GB) and HDD-014 (Toshiba 900GB) are the two largest drives in the blade pool and they’re sitting in a blade that may end up as a thin compute worker. Move them into ODEN or LOKE as permanent Ceph OSDs. This immediately gives the always-on cluster substantially more storage than the current 120GB SSDs.

MIMIR (SYS-004, 15× 1TB SAS) is the Ceph expansion story for later. To connect it: install CTRL-006 (ServeRAID-8e, have 2 unplaced) into a server with a free PCIe slot, then cable it with a SFF-8470 → SFF-8088 cable (not currently owned, inexpensive). TOR is the natural host — it already has CTRL-003 in HBA mode and free PCIe slots. Not urgent, but the hardware is almost all there.

What	Goes to	When
900GB SAS ×2 from BLD-007	ODEN or LOKE, permanent Ceph OSDs	Now
120GB SSD ×2 spare	BLD fat node boot disks	Before Talos on blades
300GB SAS in blades	Local scratch or blade Ceph OSDs	During ASGARD experiments
MIMIR 15× 1TB SAS	TOR via CTRL-006, Ceph expansion	Later (needs cable)

Three things to do before blades can boot anything

Identify CPUs. Connect to the OA management port, open the web GUI, check CPU model per slot. Ten minutes. Everything else depends on this.
Network uplink. The blade switches in bays 1 and 2 have 4× RJ45 1GbE uplinks (ports 22–25). Run a patch cable from one to any available switch — MODI, MAGNI, whatever’s reachable from the cable box. That’s enough for blades to reach DHCP and iPXE.
RAM redistribution. Pull the 14 spare RAM-007 sticks and install into the chosen fat and medium nodes per the profile above.

The permanent vs experiment split

Always on (~350–500W total):
 HEIMDAL → OPNsense router, Sun X4150, ~150–200W
 ODEN → Talos, Minecraft + small services, ~100–150W
 LOKE → 2nd Talos node (needs RAM-007 × 8 + SSD boot), ~100–150W

Experiments (fire up, learn, power off):
 ASGARD → 3–16 blades for Slurm / OpenStack / larger Talos cluster
 TYR+TOR+FREJA → Proxmox cluster (M1 DDR2, temporary)

Once the Proxmox experiment wraps, TYR, TOR, and FREJA can be powered down permanently. If ASGARD blades eventually become the long-term compute platform, OPNsense can move to a VM on a blade at that point — but not before the blades are stable and trusted. Don’t consolidate the router onto experimental infrastructure.

OpenStack

Thu, 14 May 2026 00:00:00 +0000

OpenStack is an open-source IaaS platform — it turns a pool of bare-metal servers into a self-service cloud: virtual machines, block storage, networking, and object storage, all driven by API.

https://www.openstack.org/

Scale and fit

There is a rough spectrum of virtualization tools, and picking the wrong tier is a common mistake:

Proxmox / VMware / Hyper-V — the right choice when you want to run virtual machines. SMB, homelab, or a small ops team managing infrastructure directly. Reasonable setup cost, manageable operational overhead, one or a few admins in control. Think of it as a VMware replacement.

OpenStack — the right choice when you are building a cloud, not just running VMs. Multi-tenant infrastructure where teams self-service their own compute, networking, and storage via API. The operational complexity is real and significant; it pays off when the cloud-like abstraction is the actual product, or when the scale justifies the overhead.

The rule of thumb: if the question is “how do I replace VMware?”, the answer is Proxmox. If the question is “how do I build a private cloud platform?”, the answer might be OpenStack.

Core Components

Service	Code Name	What it does
Compute	Nova	Schedules and manages VM lifecycle
Networking	Neutron	Virtual networks, routers, floating IPs, security groups
Block Storage	Cinder	Persistent volumes attached to VMs
Image Service	Glance	Stores and serves OS images
Identity	Keystone	Auth, service catalog, RBAC
Dashboard	Horizon	Web UI (optional)
Object Storage	Swift	S3-like object storage (optional)
Bare Metal	Ironic	Provisions physical machines instead of VMs

You do not need all of them. A minimal useful deployment is Nova + Neutron + Cinder + Glance + Keystone (Nova also depends on the Placement service, which is not listed above).

OpenStack on Kubernetes

OpenStack services are just applications — and they can run as Kubernetes workloads. Two projects make this practical:

OpenStack-Helm — official Helm charts for deploying OpenStack services on an existing Kubernetes cluster. Each service (Nova, Neutron, Cinder, etc.) becomes a Helm release. Upgrades follow standard rolling deployment patterns.

Atmosphere (by VEXXHOST) — a higher-level operator built on top of OpenStack-Helm. Adds Ansible automation, health checks, and a more opinionated deployment model. Targets production use.

The practical implication: you can run a Talos cluster and deploy OpenStack on top of it — OpenStack as a tenant of Kubernetes rather than a separate platform. This inverts the usual relationship (where Kubernetes runs on top of OpenStack) and is an interesting architectural option for homelab and small private cloud deployments.

Fairbanks (Dutch hosting company specialising in sovereign private clouds) does exactly this in production. Their talk OpenStack on Talos Linux is the clearest real-world example of the pattern.

Deployment Options

Kolla-Ansible
https://docs.openstack.org/kolla-ansible/latest/
Containerised OpenStack deployed via Ansible. Production-grade, well-maintained. The practical choice for homelab and small-scale production deployments. Each service runs in its own container.

DevStack
https://docs.openstack.org/devstack/latest/
All-in-one development install. Not for production or anything you want to survive a reboot. Good for learning the API surface.

Canonical OpenStack (Juju / Sunbeam)
https://ubuntu.com/openstack
Ubuntu-opinionated deployment. Sunbeam is a newer minimal footprint option. Good if you’re already in the Ubuntu/Juju ecosystem.

Concepts Worth Understanding

Flavors — VM sizing templates (vCPU, RAM, disk). You define these; instances pick from them.

Security Groups — stateful firewall rules applied per-port. Default-deny inbound.

Floating IPs — externally routable IPs that can be associated/disassociated from instances dynamically.

Availability Zones — logical groupings of compute nodes. Useful for fault isolation even at small scale.

Hypervisors — Nova supports KVM (default), QEMU, VMware, and others. KVM on Linux is the standard.

Relevance to the Lab

The LLM training experiment plans to use OpenStack as the IaaS layer over the blade nodes in ASGARD — Nova for compute scheduling, Neutron for cluster networking, Cinder for shared model/dataset storage backed by Ceph.

Proxmox Cluster in the homelab

Thu, 14 May 2026 00:00:00 +0000

Getting a three-node Proxmox VE cluster running in the homelab.

The goal is a shared virtualization platform for running VMs and LXC containers across the rack. It is also a good excuse to kick the tires on Proxmox itself, so naturally let’s needlessly complicate things with some self-imposed constraints:

run it clustered
don’t use any hardware already earmarked for other projects

Hardware

I am going to try to use three IBM rack servers from the inventory.

Asset ID	Hostname	Model	Form Factor	RAM	CPU
SYS-001	FREJA	IBM System x3550 M1 (7978)	1U	24GB	single
SYS-002	TYR	IBM System x3650 M1 (7979)	2U	64GB	dual
SYS-003	TOR	IBM System x3650 M1 (7979)	2U	64GB	dual

Three nodes satisfy Corosync quorum without needing a qdevice: losing one node still leaves a majority.
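The majority rule behind that is short enough to write down. A minimal sketch, assuming one vote per node and no qdevice.

```python
def quorum(nodes):
    """Votes needed for a strict majority, one vote per node."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can be lost while the rest still hold quorum."""
    return nodes - quorum(nodes)


for n in (2, 3, 4, 5):
    print(f"{n} nodes: quorum {quorum(n)}, can lose {tolerated_failures(n)}")
```

Note that four nodes tolerate no more failures than three, which is why odd cluster sizes are the usual choice.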

Installation

In progress.

Cluster formation

In progress.

Proxmox VE

Thu, 14 May 2026 00:00:00 +0000

Proxmox VE (Virtual Environment) is an open-source Type 1 hypervisor built on Debian. It runs KVM for full virtual machines and LXC for lightweight containers, managed through a web UI or API. The subscription model is optional — the community edition is fully functional without a paid license; the subscription gives access to the enterprise update repository and support.

Comparison

Platform	License	VMs (KVM)	Containers	Clustering	Web UI
Proxmox VE	Open-source (optional sub)	Yes	Yes (LXC)	Yes	Yes
VMware ESXi	Commercial	Yes	No	Yes (vCenter)	Yes
Standalone KVM	Open-source	Yes	No	Manual	No
oVirt	Open-source	Yes	No	Yes	Yes

Proxmox is the practical choice when you want VMware-style management without the licensing cost, or when you want to run both VMs and containers on the same node.

Core concepts

Node — a physical host running Proxmox VE. Managed independently or as part of a cluster.

Cluster — multiple nodes joined together. Share a unified management view and allow live migration of VMs between nodes. Uses Corosync for distributed consensus.

Quorum — clusters require a majority of nodes to be reachable to avoid split-brain. Minimum useful cluster size is 3 nodes (loss of one node still leaves a majority). Two-node clusters need a quorum device (qdevice) to function safely.

VM — full virtual machine backed by QEMU/KVM. Hardware-level isolation. Arbitrary OS.

Container (CT) — LXC container. Shares the host kernel; lower overhead than a VM. Linux-only. Useful for services where you want process-level isolation without a full OS.

Storage pool — where disks and images live. Supported backends: local directory, LVM, LVM-thin, ZFS, NFS, CIFS, and Ceph (via rbd). ZFS and Ceph are the most capable options for a cluster — ZFS for local redundancy, Ceph for shared storage across nodes.

Proxmox VE documentation
Proxmox community forum
Corosync documentation
Ceph — distributed storage backend for Proxmox clusters
OpenStack — the next tier up the scale spectrum
Proxmox cluster in the homelab

PXE Booting with OPNsense + iPXE

Thu, 14 May 2026 00:00:00 +0000

How to configure OPNsense as a PXE boot server using its built-in DHCP and TFTP services, and how to write an iPXE boot menu that can chainload Talos Linux (or anything else).

OPNSense DHCP — Network Boot Fields

Services → ISC DHCPv4 → [LAN] → Network Booting

Field	Value
Enable network booting	✓
Next-server IP	`192.168.1.1` (OPNsense LAN address)
Default BIOS filename	`undionly.kpxe`
x86 UEFI (32-bit) filename	`ipxe.efi`
x64 UEFI/EBC (64-bit) filename	`ipxe.efi`
iPXE boot filename	`default.ipxe`

The DHCP server serves the correct boot file based on client architecture. BIOS clients get undionly.kpxe; UEFI clients get ipxe.efi. Both then chainload default.ipxe.
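The selection logic can be sketched as below. This mimics what the fields express and is not OPNsense's actual implementation: it assumes the DHCP server keys on option 93 (client architecture, values per RFC 4578) and on the iPXE user class that iPXE sends once it is running.

```python
# DHCP option 93 (client system architecture) values from RFC 4578.
BIOS = 0
UEFI_IA32 = 6
UEFI_X64 = (7, 9)


def boot_filename(arch, user_class=""):
    """Pick the boot file the way the network boot fields above are filled in."""
    if user_class == "iPXE":
        # Second pass: iPXE itself is asking, so hand it the menu script.
        return "default.ipxe"
    if arch == BIOS:
        return "undionly.kpxe"
    if arch == UEFI_IA32 or arch in UEFI_X64:
        return "ipxe.efi"
    raise ValueError(f"unhandled client architecture: {arch}")


print(boot_filename(BIOS))                  # firmware PXE ROM, legacy BIOS
print(boot_filename(9))                     # firmware PXE ROM, 64-bit UEFI
print(boot_filename(9, user_class="iPXE"))  # iPXE after it has loaded
```

The user-class check comes first because it is what breaks the loop: without it, iPXE would be handed iPXE again on its own DHCP request.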

TFTP — Downloading the Boot Files

OPNsense runs a TFTP server rooted at /usr/local/tftp. SSH in and fetch the iPXE binaries:

fetch -o /usr/local/tftp/undionly.kpxe https://boot.ipxe.org/undionly.kpxe
fetch -o /usr/local/tftp/ipxe.efi https://boot.ipxe.org/ipxe.efi

iPXE Boot Script

Save as /usr/local/tftp/default.ipxe. This example has a boot menu with options for netboot.xyz, a Talos Omni boot, and a debug shell.

#!ipxe

dhcp
set menu-timeout 5000

:start
menu PXE Boot Menu
item --gap -- ----------------------------
item netboot Boot netboot.xyz
item talos Boot Talos (Omni)
item shell iPXE Shell
item --gap -- ----------------------------
choose target && goto ${target}

:netboot
chain http://boot.netboot.xyz || goto failed
goto start

:talos
echo Booting Talos via Omni...

set api https://<your-omni-instance>.omni.siderolabs.io
set token <join-token>
set event [<siderolink-ipv6>]:8090
set log tcp://[<siderolink-ipv6>]:8092

kernel tftp://${next-server}/talos/vmlinuz-omni \
 ima_template=ima-ng \
 ima_appraise=fix \
 ima_hash=sha512 \
 selinux=1 \
 consoleblank=0 \
 nvme_core.io_timeout=4294967295 \
 initrd=initramfs.xz \
 init_on_alloc=1 \
 slab_nomerge \
 pti=on \
 console=tty0 \
 console=ttyS0 \
 printk.devkmsg=on \
 talos.platform=metal \
 siderolink.api=${api}?jointoken=${token} \
 talos.events.sink=${event} \
 talos.logging.kernel=${log}

initrd tftp://${next-server}/talos/initramfs-omni.xz
boot || goto failed

:shell
shell

:failed
echo Boot failed, press Enter to continue...
read fake
goto start

The api, token, event, and log values come from the Omni console when you generate a join link.

Talos Kernel and Initramfs — Image Factory

The standard Talos release binaries do not include firmware for all hardware. Since Talos 1.6, several older NIC drivers (including Broadcom BNX2 / BCM5709) were removed from the mainline image and made available as extensions via the image factory.

Generate a custom image at factory.talos.dev with the extensions you need (e.g. siderolabs/bnx2), then download the PXE artifacts:

mkdir -p /usr/local/tftp/talos

fetch -o /usr/local/tftp/talos/vmlinuz-omni \
 "https://pxe.factory.talos.dev/image/<IMAGE_ID>/v1.10.1/kernel-amd64"

fetch -o /usr/local/tftp/talos/initramfs-omni.xz \
 "https://pxe.factory.talos.dev/image/<IMAGE_ID>/v1.10.1/initramfs-amd64.xz"

Replace <IMAGE_ID> with the schematic ID from the image factory, and adjust the version tag as needed.
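When bumping versions regularly, building the two URLs from the schematic ID and version tag avoids copy-paste mistakes. A small sketch that follows the URL pattern shown above; the function name is hypothetical and the schematic ID is left as a placeholder.

```python
PXE_BASE = "https://pxe.factory.talos.dev/image"


def pxe_artifact_urls(schematic_id, version, arch="amd64"):
    """Return the kernel and initramfs URLs for one schematic and version."""
    prefix = f"{PXE_BASE}/{schematic_id}/{version}"
    return {
        "kernel": f"{prefix}/kernel-{arch}",
        "initramfs": f"{prefix}/initramfs-{arch}.xz",
    }


# "<IMAGE_ID>" is left as a placeholder; substitute your own schematic ID.
for name, url in pxe_artifact_urls("<IMAGE_ID>", "v1.10.1").items():
    print(name, url)
```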

Gotchas

UEFI boot and NIC memory limits: ipxe.efi can be too large to fit in the NIC’s PXE memory buffer on some older hardware. If the UEFI chain fails silently, switch to BIOS/legacy mode and use undionly.kpxe instead.

DHCP options 66/67 conflict: If you have previously set DHCP options 66 (next-server) and 67 (boot file) as raw additional options, remove them. OPNsense’s built-in network boot fields already set these; defining both causes conflicts.

BIOS boot order after first boot: Talos writes its own bootloader to disk on first boot. If the BIOS still has PXE as the primary boot device, the machine will land in the PXE menu on every subsequent reboot instead of booting from disk. Set disk as the primary boot device once the cluster is up.

Rook + Ceph on ODEN

Thu, 14 May 2026 00:00:00 +0000

Attempting to add persistent block storage to the ODEN single-node Talos cluster using Rook and Ceph. This did not fully succeed — the setup reached the point of a bound PVC and a working write test, but the cluster was not left in a clean stable state. Notes are here for completeness.

This builds on the Talos cluster setup on ODEN.

Hardware

ODEN has five storage devices:

Device	Type	Size	Role
`/dev/sdb`	Kingston SA400S3 SSD (SATA)	120 GB	Boot disk — leave alone
`/dev/nvme0n1`	Samsung 970 EVO NVMe	500 GB	OSD
`/dev/sdc`	Kingston SA400S3 SSD (SATA)	120 GB	OSD
`/dev/sdd`	Kingston SA400S3 SSD (SATA)	120 GB	OSD
`/dev/sde`	Kingston SA400S3 SSD (SATA)	120 GB	OSD

Do not add /dev/sdb to Ceph. It is the boot disk.

Step 1 — Install the Rook operator

kubectl apply -f https://raw.githubusercontent.com/rook/rook/refs/tags/v1.17.9/deploy/examples/crds.yaml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/refs/tags/v1.17.9/deploy/examples/common.yaml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/refs/tags/v1.17.9/deploy/examples/operator.yaml

Wait for the operator pod to be running in the rook-ceph namespace before continuing.

Step 2 — CephCluster (single-node)

Single-node requires allowMultiplePerNode: true and explicit disk selection. The cluster-test example from the Rook repo is a reasonable starting point:

storage:
  useAllNodes: false
  nodes:
    - name: "192.168.1.171"
      devices:
        - name: "nvme0n1"
        - name: "sdc"
        - name: "sdd"
        - name: "sde"

Reference: cluster-test.yaml

Step 3 — CephBlockPool and StorageClass

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  replicated:
    size: 1

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
 name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
 clusterID: rook-ceph
 pool: replicapool
 imageFormat: "2"
 imageFeatures: layering
reclaimPolicy: Delete
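With replicated size 1 every object is stored once, so the pool gets the full raw capacity but no redundancy. A rough sketch of what other replica sizes would cost, using the OSD sizes from the hardware table; it ignores Ceph's own overhead and full-ratio headroom.

```python
# OSD device sizes in GB, from the hardware table (boot disk excluded).
OSD_SIZES_GB = {"nvme0n1": 500, "sdc": 120, "sdd": 120, "sde": 120}


def usable_gb(raw_gb, replica_size):
    """Upper bound on usable capacity for a replicated pool."""
    return raw_gb / replica_size


raw = sum(OSD_SIZES_GB.values())
for size in (1, 2, 3):
    print(f"size={size}: about {usable_gb(raw, size):.0f} GB usable of {raw} GB raw")
```

These are upper bounds; real usable space is lower, and on a single node a larger replica size only protects against disk failure, not host failure.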

Step 4 — PVC test

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 10Gi

PVC reached Bound. A BusyBox pod mounting it could write to /mnt. The Ceph dashboard (kubectl -n rook-ceph port-forward svc/rook-ceph-mgr-dashboard 7000:7000) showed OSDs active and the pool present.

What did not work

The cluster ran but was not left stable. Single-node Ceph produces health warnings by design (no redundancy, no failure domain separation). More importantly, the setup was not revisited after initial testing and there are unresolved questions about:

CSI driver behaviour on Talos (Talos has specific requirements for CSI socket paths)
Whether the dashboard warnings were cosmetic or indicated real issues
Long-term stability under actual workloads

This is left as a draft until there is time to run it properly — ideally on more than one node.

Slurm

Thu, 14 May 2026 00:00:00 +0000

Slurm (Simple Linux Utility for Resource Management) is a workload manager and job scheduler. It originated in HPC and is now widely used for ML training clusters as well; AWS, GCP, and Azure all offer Slurm-based cluster tooling for HPC and GPU workloads.

https://slurm.schedmd.com/

Core Concepts

Node — a compute host registered with Slurm. Can have CPU, GPU, and memory resources defined.

Partition — a logical group of nodes (like a queue). You can have separate partitions for GPU and CPU nodes, or different priority tiers.

Job — a workload submitted to the queue. Slurm allocates resources, places the job, and tracks it to completion.

Allocation — the reserved resources (nodes, CPUs, GPUs, memory) for a running job.

Key Commands

sbatch job.sh # submit a batch job script
squeue # view the job queue
sacct -j <jobid> # job accounting / history
sinfo # view partition and node state
scancel <jobid> # cancel a job
srun --pty bash # interactive allocation

A minimal batch script:

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --time=02:00:00

python train.py
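When jobs are submitted from a pipeline rather than by hand, it is handy to generate the script and capture the job ID. A minimal sketch, assuming sbatch's usual "Submitted batch job <id>" message; the helper names are hypothetical and nothing here calls Slurm.

```python
import re


def render_batch_script(job_name, command, nodes=1, gpus=1, time_limit="02:00:00"):
    """Render a batch script with the same #SBATCH directives as above."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gpus={gpus}",
        f"#SBATCH --time={time_limit}",
        "",
        command,
        "",
    ])


def parse_job_id(sbatch_stdout):
    """Pull the job ID out of sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_stdout!r}")
    return int(match.group(1))


print(render_batch_script("train", "python train.py"))
print(parse_job_id("Submitted batch job 4242\n"))
```

The parsed ID can then be passed to sacct -j or scancel.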

Slurm vs Kubernetes for Training

The fundamental difference is what each system optimises for:

Kubernetes optimises for uptime. It keeps services running, reschedules failed pods, and manages long-lived workloads. That’s the right model for inference serving, APIs, and anything that needs to stay up.

Slurm optimises for utilisation. It packs jobs onto nodes as tightly as possible, queues work when resources are busy, and gets out of the way when a job finishes. That’s the right model for batch training — you want every GPU busy, not reserved for availability.

	Slurm	Kubernetes
Optimises for	Maximum utilisation	Uptime and availability
Scheduling model	Job queue, batch-first	Long-running services + batch (via operators)
GPU allocation	Native, fine-grained	Requires GPU operator + device plugin
Multi-node training	First-class (MPI, `srun`)	Possible via Kubeflow (PyTorchJob)
Preemption	Built-in	Requires configuration
Operational overhead	Low on bare metal	Higher — requires cluster management
Ecosystem	HPC, academia, major cloud HPC	ML platforms, cloud-native

The short version: use Slurm for pure batch training on bare metal. Use Kubernetes when you’re mixing training with inference serving or need container-native workflows throughout — or when the cluster already exists.

Talos Linux + Omni

Thu, 14 May 2026 00:00:00 +0000

Talos Linux is an immutable, minimal operating system designed specifically for running Kubernetes. There is no shell, no SSH, no package manager. The entire OS is read-only and managed via a gRPC API (talosctl). Node configuration is declarative YAML applied over the API; changes that require a reboot take effect on the next boot.

The tradeoff is rigidity for operational simplicity. You cannot log into a Talos node and fix something by hand. In return, nodes are deterministic, reproducible, and there is no configuration drift.

Comparison to other installs

Method	OS	Config	Mutable
kubeadm	Ubuntu / RHEL / etc	Manual + scripts	Yes
k3s	Any Linux	Minimal	Yes
Talos	Talos Linux	Declarative API	No

k3s and kubeadm give you more flexibility and a familiar Linux environment. Talos is the right choice when you want the cluster nodes to behave like appliances — provisioned, never touched.

Omni

Omni is a cluster management platform by Sidero Labs built on top of Talos. It handles:

Node registration (nodes boot and phone home to the Omni API)
Cluster creation and machine assignment
Kubernetes upgrades (one action in the UI)
talosctl and kubeconfig access via the Omni CLI

Nodes register via a join token embedded in the kernel command line at PXE boot time. The cluster itself, including the Kubernetes control plane, runs on your hardware; Omni only provides the management layer.

Hobby tier: 10 nodes, non-commercial use, free. Sidero Labs also offers a self-hosted version.

Image Factory

factory.talos.dev generates custom Talos images with hardware extensions included. Notable extensions:

siderolabs/bnx2 — Broadcom NetXtreme II (BCM5708/BCM5709) NIC firmware, required on some enterprise hardware (IBM x3550 M3, HP Gen 6/7 blades)
siderolabs/intel-ucode — Intel microcode updates
siderolabs/nvidia-* — NVIDIA GPU support

The factory produces both ISO and PXE artifacts (kernel + initramfs). See the OPNSense + iPXE reference for how to serve these over TFTP.
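
As a rough sketch of how the factory is driven: you describe the extensions in a small YAML "schematic", POST it to the factory's /schematics endpoint to get an ID back, and then fetch artifacts from URLs built from that ID and a Talos version. The Python below only renders the schematic and the URLs, with no network calls. The URL layout and YAML keys are as I understand the factory's API, so treat them as assumptions and check them against the factory documentation. The extension names are the ones listed above; the schematic ID and version in the usage comment are placeholders.

```python
FACTORY = "https://factory.talos.dev"


def schematic_yaml(extensions: list[str]) -> str:
    """Render a schematic requesting the given official extensions."""
    lines = [
        "customization:",
        "  systemExtensions:",
        "    officialExtensions:",
    ]
    lines += [f"      - {name}" for name in extensions]
    return "\n".join(lines) + "\n"


def pxe_artifact_urls(schematic_id: str, talos_version: str,
                      arch: str = "amd64") -> dict[str, str]:
    """Build the kernel and initramfs URLs for one schematic and version."""
    base = f"{FACTORY}/image/{schematic_id}/{talos_version}"
    return {
        "kernel": f"{base}/kernel-{arch}",
        "initramfs": f"{base}/initramfs-{arch}.xz",
    }


# Usage (placeholder ID and version):
#   print(schematic_yaml(["siderolabs/bnx2", "siderolabs/intel-ucode"]))
#   print(pxe_artifact_urls("<schematic-id>", "<talos-version>"))
```

The same schematic always maps to the same ID, so the schematic file can live in version control next to the iPXE script.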

Supporting Sidero Labs

Talos and Omni are built by Sidero Labs — good people doing good work. I sponsor them via GitHub Sponsors at the fanboi tier.

Relevant links

Talos Linux in the homelab via Omni

Thu, 14 May 2026 00:00:00 +0000

Getting Talos Linux running in the homelab via PXE boot and Omni — starting with ODEN (SYS-005), an IBM System x3550 M3. The full OPNSense + iPXE configuration lives in the reference note; this covers what actually happened, in order.

Setup

Hardware: ODEN (SYS-005) — IBM x3550 M3, Broadcom BNX2 NICs (BCM5709)
Network: OPNSense router on LAN; ODEN connected via one NIC (start with one — removes variables)
Target: Single-node Talos cluster registered in Omni

Step 1 — OPNSense DHCP and TFTP

Enable network booting on the LAN DHCP server and download the iPXE binaries to the TFTP root. Full field values in the iPXE reference note.

One thing to check first: if you previously set DHCP options 66 and 67 as raw additional options, remove them. OPNSense’s built-in network boot fields do the same job and having both causes conflicts.

Step 2 — iPXE boot script

Write default.ipxe to /usr/local/tftp/. Include a boot menu with at minimum a Talos option and a shell fallback — the shell is genuinely useful when something fails and you need to debug from the boot prompt. Full script in the reference note.

The Talos entry in the menu needs the Omni join token from your Omni console. Generate a join link in Omni; it provides the API endpoint, token, and SideroLink addresses.
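
For illustration, this is roughly how the join details end up on the kernel line. The sketch renders a Talos menu entry as text; it is not the script from the reference note. The kernel argument names (siderolink.api, talos.events.sink, talos.logging.kernel) are what I believe Omni hands out, so copy the exact line from your own Omni console rather than trusting this. The :talos label and all values in the usage comment are hypothetical placeholders.

```python
def omni_kernel_args(api_url: str, join_token: str,
                     events_sink: str, kernel_log: str) -> str:
    """Join the Omni-provided values into kernel command-line arguments."""
    return " ".join([
        f"siderolink.api={api_url}?jointoken={join_token}",
        f"talos.events.sink={events_sink}",
        f"talos.logging.kernel={kernel_log}",
    ])


def talos_ipxe_entry(kernel: str, initrd: str, extra_args: str) -> str:
    """Render an iPXE menu target that boots Talos with the given arguments."""
    return "\n".join([
        ":talos",  # hypothetical label, must match the menu item
        f"kernel {kernel} talos.platform=metal {extra_args}",
        f"initrd {initrd}",
        "boot",
    ]) + "\n"


# Usage (all values are placeholders):
#   args = omni_kernel_args("https://<omni-endpoint>", "<token>",
#                           "<events-sink-addr>", "<kernel-log-addr>")
#   print(talos_ipxe_entry("<kernel-path>", "<initramfs-path>", args))
```

Generating the entry from variables keeps the token out of the hand-edited script, which helps when the join token is rotated.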

Step 3 — Talos kernel and initramfs

The standard Talos release binaries do not include BNX2 firmware. Since around Talos 1.6 the firmware has been available as a system extension but not in the mainline image. Without it, the node boots, fails to initialise the NIC, and produces "can't load firmware bnx2" errors. Everything else looks fine until you notice the node never gets an IP and never appears in Omni.

Fix: generate a custom image at factory.talos.dev with the siderolabs/bnx2 extension included, then download the PXE kernel and initramfs from the factory URL. Commands in the reference note.

Step 4 — First boot

Go into BIOS and set the boot device to PXE. On the M3, UEFI boot with ipxe.efi fails silently — the image is too large for the NIC’s PXE memory buffer. Switch to legacy/BIOS mode and use undionly.kpxe instead.

The machine takes a while to POST and boot. This is normal for old enterprise hardware. It is also why demos typically use virtual machines.

Step 5 — Static IP

After the BNX2 fix the node boots Talos successfully but still does not appear in Omni. The DHCP assignment for the node is not being picked up during early boot. Workaround: add a static IP via kernel params in the iPXE script:

ip=192.168.1.171::192.168.1.1:255.255.255.0::eth0:off

Add this to the kernel line in the Talos iPXE entry. The format is ip=<client-ip>::<gateway>:<netmask>::<iface>:off.
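
The empty fields are easy to get wrong by hand (the second field is the boot server IP and the fifth is the hostname, both left blank here). A small Python helper can build the string and catch an off-subnet gateway before it costs a reboot. This is a sketch for the static IPv4 case only; the function name is my own.

```python
import ipaddress


def static_ip_param(client: str, gateway: str, netmask: str, iface: str) -> str:
    """Build ip=<client-ip>::<gateway>:<netmask>::<iface>:off for the kernel line."""
    # strict=False lets us pass a host address together with its netmask.
    network = ipaddress.ip_network(f"{client}/{netmask}", strict=False)
    if ipaddress.ip_address(gateway) not in network:
        raise ValueError(f"gateway {gateway} is not inside {network}")
    # Server IP and hostname fields are intentionally empty; "off" disables autoconf.
    return f"ip={client}::{gateway}:{netmask}::{iface}:off"


# Usage with the values from this step:
#   static_ip_param("192.168.1.171", "192.168.1.1", "255.255.255.0", "eth0")
```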

Step 6 — Omni registration

With a working NIC and an IP, the node contacts the Omni API using the join token. It appears in the Omni console as an unallocated machine. Create a cluster, assign the machine, and let Omni configure it. The initial cluster bootstrap takes a few minutes.

Step 7 — Fix the BIOS boot order

After the cluster is up, change the BIOS boot order so the disk is first. If PXE remains the primary boot device, every reboot drops the machine back to the iPXE menu instead of booting the installed Talos. Discovered on first reboot. Worth noting here so you don’t make the same trip to the garage.

Upgrade

Omni makes single-node upgrades straightforward: open the cluster in the Omni console, select a new Talos version, apply. The node reboots once. Single-node means the cluster has downtime during the reboot; that is expected. Nothing else to do.

Result

Single-node Kubernetes cluster running on ODEN, managed via Omni. kubectl and talosctl access via the Omni CLI. Next experiment: Rook + Ceph for persistent storage.

System Inventory

Wed, 13 May 2026 00:00:00 +0000

Systems

Asset ID	Hostname	Manufacturer	Model	Form Factor	Notes
SYS-001	FREJA	IBM	System x3550 M1 Type 7978	1U	Rack server (S/N: KDHPPNN); 1/2 CPU slots populated
SYS-002	TYR	IBM	System x3650 M1 Type 7979	2U	Rack server
SYS-003	TOR	IBM	System x3650 M1 Type 7979	2U	Rack server
SYS-004	MIMIR	Dell	PowerVault MD1200	2U	Disk shelf
SYS-005	ODEN	IBM	System x3550 M3	1U	Mixed DDR3 1333+1600 ECC Reg; PCIe x16 riser (FRU 43V7066)
SYS-006	LOKE	IBM	System x3550 M3	1U	M3 board in M2 chassis; no RAM; CPU unknown
SYS-007	ASGARD	HP	BladeSystem C7000	10U	Blade enclosure (Hosts 1-16)
SYS-008	BALDER	HP	ProLiant DL320 G5p	1U	Dual 250GB SATA
SYS-009	HEIMDAL	Sun	Sun Fire X4150	1U	2× Xeon E5430 (8c/8t); 4× onboard GbE; OPNsense
SYS-010	VIDAR	HP	ProCurve 1800-24G	1U	Fanless/Silent Switch (J9028A)
SYS-011	GUNGNIR	ZyXEL	ZyWALL 110	1U	Security Gateway / Firewall
SYS-012	BIFROST-01	Edge-Core	ECS4510-28F	1U	28-Port SFP Fiber Switch
SYS-013	BIFROST-02	Edge-Core	ECS4510-28F	1U	28-Port SFP Fiber Switch
SYS-014	MODI	HP	V1910-24G-PoE	1U	365W PoE Switch (JE007A)
SYS-015	MAGNI	Cisco	Catalyst 2960G	1U	24-Port Managed Gig Switch
SYS-016	VALI	HP	ProCurve 1800-24G	1U	Fanless/Silent Switch (J9028A)
SYS-017	RATATOSK	Avocent	KVM Switch	1U	Rackmount KVM
SYS-018	SURTR-01	APC	Back-UPS CS 650	Desktop	UPS Unit 1
SYS-019	SURTR-02	APC	Back-UPS CS 650	Desktop	UPS Unit 2
SYS-020	MUNINN	Cisco	Catalyst 2960 WS-C2960-24TC-L	1U	24× 10/100 + 4× uplink
SYS-021	BIFROST	Raspberry Pi	Raspberry Pi 1 Model B	SBC	Jump node; Raspbian; port-forward 22222→22; rack-mounted
SYS-022	—	Raspberry Pi	Raspberry Pi 1 Model B	SBC	Spare
SYS-023	—	Raspberry Pi	Raspberry Pi 1 Model B	SBC	Spare

Blade Nodes (Inside ASGARD)

Asset ID	Hostname	Manufacturer	Model	Slot
BLD-001	BLADE-01	HP	BL460c Gen8	1
BLD-002	BLADE-02	HP	BL460c Gen8	2
BLD-003	BLADE-03	HP	BL460c Gen8	3
BLD-004	BLADE-04	HP	BL460c Gen8	4
BLD-005	BLADE-05	HP	BL460c Gen8	5
BLD-006	BLADE-06	HP	BL460c Gen8	6
BLD-007	BLADE-07	HP	BL460c Gen8	7
BLD-008	BLADE-08	HP	BL460c Gen8	8
BLD-009	BLADE-09	HP	BL460c Gen8	9
BLD-010	BLADE-10	HP	BL460c Gen8	10
BLD-011	BLADE-11	HP	BL460c Gen8	11
BLD-012	BLADE-12	HP	BL460c Gen8	12
BLD-013	BLADE-13	HP	BL460c Gen8	13
BLD-014	BLADE-14	HP	BL460c Gen8	14
BLD-015	BLADE-15	HP	BL460c Gen8	15
BLD-016	BLADE-16	HP	BL460c Gen8	16

System Overviews

Brief overviews of selected systems, for context on their typical roles and notable features.

IBM System x3550 Type 7978 / x3650 Type 7979 Series — x3550 overview · x3650 overview

1U (x3550) / 2U (x3650) · dual Xeon (Clovertown/Harpertown) · DDR2 ECC FBDIMM up to 32GB · SAS/SATA

These were enterprise-grade rack servers, popular in the late 2000s, powered by Intel Xeon 5000-series processors (e.g., Clovertown and Harpertown generations; Nehalem and Westmere arrived with the later M2/M3 models). The x3550 is a compact 1U server, ideal for general-purpose computing, while the x3650 is a 2U model offering greater expansion capabilities for storage or PCIe cards. They served as reliable workhorses for various data center applications, including virtualization and database hosting.

HP BladeSystem C7000 — QuickSpecs · BL460c Gen8 QuickSpecs

10U · up to 16 half-height blades · shared power/cooling/networking via backplane · Onboard Administrator

The C7000 is a substantial 10U blade enclosure designed to host up to 16 server blades, along with storage blades and integrated networking/management modules. It provides a consolidated infrastructure for power, cooling, and network connectivity, significantly simplifying cable management and enabling high-density computing environments. These systems were foundational for many enterprise virtualization platforms.

The BL460c Gen8 blades have onboard LOM providing 1GbE connectivity. No mezzanine cards are currently installed — 10GbE requires FlexibleLOM adapters.

Sun Fire X4150

1U · dual Xeon (Harpertown) · 16 DIMM slots · 4× GbE network interfaces

A 1U rackmount server from Sun Microsystems, the Sun Fire X4150 typically featured Intel Xeon processors. Sun’s x86 server line was recognized for its build quality and integration, often running Solaris or Linux. I use it as a dedicated firewall / network appliance (OPNsense), utilizing its robust hardware for network security and routing tasks.

Dell PowerVault MD1200 — specs

2U DAS · 12× LFF (3.5") hot-swap SAS/SATA bays · 6Gb/s SAS · up to 24TB raw

The PowerVault MD1200 is a direct-attached storage (DAS) enclosure, designed to expand the storage capacity of compatible servers (such as Dell PowerEdge servers or others equipped with suitable SAS HBAs). This 2U unit can accommodate up to 12 LFF (3.5-inch) SAS/SATA drives, providing an expandable and cost-effective solution for adding raw storage to a homelab environment.

ZyXEL ZyWALL 110

2× GbE WAN · 4× GbE LAN · VPN gateway · IPS/IDS

The ZyWALL 110 is a professional-grade security gateway and VPN firewall. It delivers comprehensive network security features, including intrusion prevention, content filtering, and strong VPN capabilities. This appliance is well-suited for establishing a secure perimeter for a homelab network or segmenting different network environments for enhanced control and protection. However, since I don’t have a license for it, it is currently not in use.

Hardware Provisioning: PXE Booting and Tooling

Tue, 12 May 2026 00:00:00 +0000

When moving beyond manual installs, managing hardware lifecycle through PXE (Preboot Execution Environment) becomes essential. This is a breakdown of common tools for automating the “power-on to OS ready” process.

Common starting points

Tool	Focus	Complexity	Best for
Cobbler	PXE/repo server	Low–Medium	Stable, static environments needing reliable kickstart or seed installs
Foreman	Full lifecycle mgmt	High	Single pane of glass for provisioning + ongoing config management (Puppet/Ansible)
Digital Rebar	Infrastructure-as-Code	Medium	Modern DevOps teams wanting cloud-like speed on physical gear; evolved from Crowbar
Ironic / Bifrost	BMaaS / scale	High	Bare Metal as a Service at scale; Bifrost runs Ironic standalone without full OpenStack

Broader landscape

Classic PXE / Provisioning

Tool	Type	Strengths	Weaknesses
Cobbler	PXE provisioning server	Simple, mature, easy to understand	Old architecture, static workflows
Foreman	Lifecycle/provisioning platform	Powerful, enterprise-capable, large ecosystem	Heavy footprint, Rails monolith
Uyuni	Systems management	Enterprise lifecycle management (SUSE/Spacewalk lineage)	Less modern provisioning architecture

Dynamic / Policy-Driven

Tool	Type	Strengths	Weaknesses
Razor	Policy-driven provisioning	Dynamic node discovery, elegant lifecycle model	Effectively dormant
Digital Rebar	Workflow provisioning platform	Architecturally modern and flexible	Partially commercialized

Cloud / Hyperscale Bare Metal

Tool	Type	Strengths	Weaknesses
Ironic	OpenStack bare-metal service	Extremely scalable, API-driven	High operational complexity
Bifrost	Standalone Ironic deployment	Easier entry into Ironic ecosystem	Inherits Ironic complexity
MAAS	Bare metal cloud platform	Excellent UX, API-first, machine discovery	Larger footprint, Ubuntu-centric

Kubernetes-Native / Cloud-Native

Tool	Type	Strengths	Weaknesses
Tinkerbell	Cloud-native provisioning	Modern architecture, composable workflows	Microservice complexity
Metal3	Kubernetes operator	Native Kubernetes integration	Requires Kubernetes infrastructure
Omni	Talos cluster orchestration	Very modern UX and lifecycle management	Talos/Kubernetes specific
Matchbox	Minimal PXE/ignition service	Elegant, simple, iPXE-first	Narrow immutable-infra focus

Boot Infrastructure / PXE Utilities

Tool	Type	Strengths	Weaknesses
iPXE	Network boot firmware	Flexible, fast, programmable (HTTP + scripting)	Requires infrastructure around it
netboot.xyz	Dynamic network boot menu	Extremely useful and lightweight	Not a provisioning orchestrator

Architectural Styles

Style	Example Tools	Characteristics
Static config-driven	Cobbler	Profiles + templates + PXE configs
Policy/state-driven	Razor, Digital Rebar	Nodes discovered dynamically, assigned via policies
Cloud resource model	Ironic, MAAS	Bare metal treated as cloud infrastructure
Kubernetes-native	Tinkerbell, Metal3	Bare metal managed via Kubernetes APIs
Immutable OS orchestration	Omni, Matchbox	Minimal provisioning around immutable operating systems

The Gap

There is still no widely adopted FOSS solution that is simultaneously:

lightweight
modern
self-hostable
API-first
iPXE-native
distro-agnostic
easy to operate
single-binary deployable
workflow-capable
not tied to Kubernetes/OpenStack

Most existing systems drift toward enterprise complexity, cloud platform assumptions, Kubernetes dependency, immutable OS specialization, or monolithic lifecycle management.

“A modern lightweight provisioning orchestrator for reproducible bare-metal infrastructure.”

Kubernetes Across the Stack

Mon, 16 Mar 2026 00:00:00 +0000

A documented comparison of running Kubernetes across every major hosting model: cloud managed, self-managed on cloud, private cloud, and bare metal at home. The goal is an honest, practical reference for each environment: what it costs you in time and money, where the rough edges are, and how the networking story differs between them.

The thread running through all of it is Talos Linux — an immutable, API-driven OS built specifically for Kubernetes. No SSH, no shell, no config drift. The same OS everywhere means the operational model stays consistent regardless of what is running underneath.

Environment	Approach	Status
OpenStack — Cleura	Talos & Terraform	draft exists
OpenStack — Cleura	Talos, with Omni	maybe ?
OpenStack — ElastX	Talos & Terraform	draft exists
OpenStack — ElastX	Talos, with Omni	maybe ?
Homelab — bare metal	Talos + Pixieboot + Omni	draft exists
Homelab — bare metal	Talos + Pixieboot without Omni	maybe ?
Homelab — OpenStack	OpenStack on bare metal, Talos running on top	(stretch)
Homelab — OpenStack	Talos on bare metal, OpenStack inside cluster	(stretch)
AWS	Talos on EC2	(stretch)
Azure	Talos on VMs	(stretch)
GCP	Talos on Compute Engine	(stretch)

Stretch goals

AWS, Azure, GCP — same Talos approach, different underlying infrastructure. Interesting eventually, but not the priority.

Omni

Omni is Sidero’s management plane for Talos clusters, so it is worth documenting both with and without it. Without Omni gives you the full picture of what Talos management looks like manually; with Omni shows what the managed layer buys you.

Homelab provisioning

Nodes provisioned via Pixieboot — no USB sticks, no manual installations. A node powers on, boots from the network, and registers. The goal is a fully reproducible cluster from scratch with minimal human steps.

Scope

Cluster provisioning and bootstrap for each environment
Networking — CNI choices, ingress, cross-cluster connectivity
Storage — what you get managed vs what you have to bring yourself
Operational differences — upgrades, node management, observability
Cost and trade-off summary across environments

Making it usable

Getting a cluster running is the easy part. Making it usable is where environments diverge. Each environment needs an answer for ingress, DNS, and storage — and the answer varies significantly depending on what the underlying platform provides.

On managed cloud you can lean on load balancers and block storage from the provider. On OpenStack you have those options if the provider exposes them. On bare metal at home you are on your own — MetalLB or similar for load balancer IPs, a local DNS solution, and either local storage or something like Rook/Ceph. Same Kubernetes, very different operational story underneath.

Notes exist in various states — pulling them together, testing, and documenting properly is the work.
