<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Bare-Metal on Backend Engineering Strategy Tools</title><link>https://backend-engineering-strategy-tools.github.io/site/tags/bare-metal/</link><description>Recent content in Bare-Metal on Backend Engineering Strategy Tools</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Fri, 22 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://backend-engineering-strategy-tools.github.io/site/tags/bare-metal/index.xml" rel="self" type="application/rss+xml"/><item><title>IPMI</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/out-of-band/ipmi/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/out-of-band/ipmi/</guid><description>&lt;p&gt;IPMI (Intelligent Platform Management Interface) is a hardware-level management standard built into most server-class hardware. It runs on a dedicated processor on the motherboard — the &lt;strong&gt;BMC (Baseboard Management Controller)&lt;/strong&gt; — independently of the host OS. The BMC has its own NIC, its own firmware, and its own IP address. You can power a server on or off, read sensor data, and access a serial console even if the host is completely dead.&lt;/p&gt;
&lt;p&gt;Current version is IPMI 2.0, which added encryption and stronger authentication over 1.5.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="bmc-implementations-by-vendor"&gt;BMC implementations by vendor
&lt;/h2&gt;&lt;p&gt;IPMI is the standard; each vendor ships their own BMC firmware on top of it:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Vendor&lt;/th&gt;
 &lt;th&gt;BMC / OOB product&lt;/th&gt;
 &lt;th&gt;Notes&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Dell&lt;/td&gt;
 &lt;td&gt;iDRAC (Integrated Dell Remote Access Controller)&lt;/td&gt;
 &lt;td&gt;iDRAC 6/7/8/9; newer versions add Redfish&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;HP / HPE&lt;/td&gt;
 &lt;td&gt;iLO (Integrated Lights-Out)&lt;/td&gt;
 &lt;td&gt;iLO 2/3/4/5; iLO 4+ adds Redfish&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Sun / Oracle&lt;/td&gt;
 &lt;td&gt;ILOM (Integrated Lights-Out Manager)&lt;/td&gt;
 &lt;td&gt;Sun Fire series (X4150, X4450, etc.)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Supermicro&lt;/td&gt;
 &lt;td&gt;IPMI / BMC&lt;/td&gt;
 &lt;td&gt;Web UI + IPMI; newer boards also Redfish&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Lenovo / IBM&lt;/td&gt;
 &lt;td&gt;XClarity / IMM&lt;/td&gt;
 &lt;td&gt;IMM2 on older systems&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;HP BladeSystem&lt;/td&gt;
 &lt;td&gt;Onboard Administrator (OA)&lt;/td&gt;
 &lt;td&gt;Enclosure-level management (C7000, C3000) — separate from individual blade iLO&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Most also expose a web UI and some form of virtual KVM (keyboard/video/mouse over network) in addition to IPMI over LAN.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="network-setup"&gt;Network setup
&lt;/h2&gt;&lt;p&gt;The BMC either shares a port with a host NIC (shared/failover mode) or uses a dedicated port (preferred for management). Configure this in BIOS/UEFI or the vendor&amp;rsquo;s setup utility before the OS boots.&lt;/p&gt;
&lt;p&gt;Assign a static IP — a BMC on DHCP is workable but inconvenient. Keep BMCs on a dedicated management VLAN if possible; they have historically had security issues and shouldn&amp;rsquo;t be exposed to general traffic.&lt;/p&gt;
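&lt;p&gt;When the host OS is already running, the same static settings can usually be applied in-band with the &lt;code&gt;ipmitool lan set&lt;/code&gt; subcommands. The sketch below only builds the command lines. LAN channel 1 and the example addresses are assumptions; check the channel with &lt;code&gt;ipmitool lan print&lt;/code&gt; on your hardware.&lt;/p&gt;

```python
import ipaddress


def bmc_lan_commands(ip: str, netmask: str, gateway: str, channel: int = 1):
    """Return ipmitool argument lists that give the BMC a static address.

    Meant to be run locally on the host (in-band), so no -H/-U flags.
    """
    # Fail early on typos; ipaddress raises ValueError for malformed input.
    network = ipaddress.ip_network(f"{ip}/{netmask}", strict=False)
    if ipaddress.ip_address(gateway) not in network:
        raise ValueError(f"gateway {gateway} is outside {network}")

    chan = str(channel)
    return [
        ["ipmitool", "lan", "set", chan, "ipsrc", "static"],
        ["ipmitool", "lan", "set", chan, "ipaddr", ip],
        ["ipmitool", "lan", "set", chan, "netmask", netmask],
        ["ipmitool", "lan", "set", chan, "defgw", "ipaddr", gateway],
    ]


if __name__ == "__main__":
    # Hypothetical management-VLAN addresses, printed rather than executed.
    for argv in bmc_lan_commands("10.0.99.21", "255.255.255.0", "10.0.99.1"):
        print(" ".join(argv))
```

&lt;p&gt;Returning argument lists instead of running them keeps the sketch safe to try; pass each list to &lt;code&gt;subprocess.run&lt;/code&gt; once the values look right.&lt;/p&gt;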
&lt;hr&gt;
&lt;h2 id="ipmitool"&gt;ipmitool
&lt;/h2&gt;&lt;p&gt;The standard CLI for IPMI over LAN. Available in most Linux package repos.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Power control&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ipmitool -I lanplus -H &amp;lt;bmc-ip&amp;gt; -U &amp;lt;user&amp;gt; -P &amp;lt;pass&amp;gt; power status
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ipmitool -I lanplus -H &amp;lt;bmc-ip&amp;gt; -U &amp;lt;user&amp;gt; -P &amp;lt;pass&amp;gt; power on
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ipmitool -I lanplus -H &amp;lt;bmc-ip&amp;gt; -U &amp;lt;user&amp;gt; -P &amp;lt;pass&amp;gt; power off
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ipmitool -I lanplus -H &amp;lt;bmc-ip&amp;gt; -U &amp;lt;user&amp;gt; -P &amp;lt;pass&amp;gt; power cycle
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ipmitool -I lanplus -H &amp;lt;bmc-ip&amp;gt; -U &amp;lt;user&amp;gt; -P &amp;lt;pass&amp;gt; power reset
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Sensor readings (temperatures, voltages, fan speeds)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ipmitool -I lanplus -H &amp;lt;bmc-ip&amp;gt; -U &amp;lt;user&amp;gt; -P &amp;lt;pass&amp;gt; sensor list
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# System Event Log&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ipmitool -I lanplus -H &amp;lt;bmc-ip&amp;gt; -U &amp;lt;user&amp;gt; -P &amp;lt;pass&amp;gt; sel list
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ipmitool -I lanplus -H &amp;lt;bmc-ip&amp;gt; -U &amp;lt;user&amp;gt; -P &amp;lt;pass&amp;gt; sel clear
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Serial over LAN (SoL) — console access without KVM&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ipmitool -I lanplus -H &amp;lt;bmc-ip&amp;gt; -U &amp;lt;user&amp;gt; -P &amp;lt;pass&amp;gt; sol activate
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Exit SoL: ~.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Use &lt;code&gt;-I lanplus&lt;/code&gt; (IPMI 2.0 with encryption) rather than &lt;code&gt;-I lan&lt;/code&gt; (IPMI 1.5, unencrypted) where supported.&lt;/p&gt;
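&lt;p&gt;For scripting, a thin wrapper avoids repeating the connection flags. This sketch uses ipmitool&amp;rsquo;s &lt;code&gt;-E&lt;/code&gt; flag, which reads the password from the &lt;code&gt;IPMI_PASSWORD&lt;/code&gt; environment variable, so the password does not show up in the process list as it does with &lt;code&gt;-P&lt;/code&gt;. The function names and the BMC address are hypothetical.&lt;/p&gt;

```python
import os
import subprocess


def ipmi_argv(host: str, user: str, *command: str):
    """Build an ipmitool command line for IPMI 2.0 over LAN.

    -E makes ipmitool read the password from the IPMI_PASSWORD
    environment variable instead of taking it on the command line.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *command]


def run_ipmi(host: str, user: str, password: str, *command: str):
    """Run one ipmitool command and return its stdout as text."""
    env = dict(os.environ, IPMI_PASSWORD=password)
    result = subprocess.run(
        ipmi_argv(host, user, *command),
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical BMC address and user; prints the command without running it.
    print(" ".join(ipmi_argv("10.0.99.21", "admin", "power", "status")))
```

&lt;p&gt;With a reachable BMC, &lt;code&gt;run_ipmi(host, user, password, "sensor", "list")&lt;/code&gt; would return the same text as the manual command.&lt;/p&gt;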
&lt;hr&gt;
&lt;h2 id="serial-over-lan-sol"&gt;Serial over LAN (SoL)
&lt;/h2&gt;&lt;p&gt;SoL forwards the server&amp;rsquo;s serial port over the IPMI connection — giving you a text console to the host without a KVM or physical access. Requires the host OS to have serial console enabled:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Add to GRUB_CMDLINE_LINUX in /etc/default/grub&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;console&lt;span style="color:#f92672"&gt;=&lt;/span&gt;tty0 console&lt;span style="color:#f92672"&gt;=&lt;/span&gt;ttyS1,115200n8
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Enable serial getty&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;systemctl enable serial-getty@ttyS1.service
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Baud rate must match what&amp;rsquo;s configured in the BIOS/BMC (typically 115200).&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="security"&gt;Security
&lt;/h2&gt;&lt;p&gt;IPMI has a poor security history:&lt;/p&gt;
&lt;ul&gt;
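&lt;li&gt;IPMI 1.5 has no session encryption, and depending on the authentication type the password can cross the network in cleartext&lt;/li&gt;
&lt;li&gt;IPMI 2.0 has known authentication weaknesses: cipher suite 0 lets a client skip authentication, and the RAKP handshake hands a salted password hash to any client that names a valid user, which allows offline cracking&lt;/li&gt;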
&lt;li&gt;The BMC itself runs independent firmware that may have unpatched CVEs&lt;/li&gt;
&lt;li&gt;Default credentials (&lt;code&gt;admin&lt;/code&gt;/&lt;code&gt;admin&lt;/code&gt;, &lt;code&gt;ADMIN&lt;/code&gt;/&lt;code&gt;ADMIN&lt;/code&gt;) are common and widely known&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Minimum steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Change default credentials immediately&lt;/li&gt;
&lt;li&gt;Use IPMI 2.0 (&lt;code&gt;lanplus&lt;/code&gt;) only&lt;/li&gt;
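&lt;li&gt;Disable cipher suite 0: &lt;code&gt;ipmitool -I lanplus ... lan set 1 cipher_privs Xaaaaaaaaaaaaaa&lt;/code&gt; (15 characters, one per cipher suite 0–14; &lt;code&gt;X&lt;/code&gt; disables a suite, &lt;code&gt;a&lt;/code&gt; caps it at ADMIN)&lt;/li&gt;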
&lt;li&gt;Isolate BMC network from internet and untrusted hosts — management VLAN with no external exposure&lt;/li&gt;
&lt;li&gt;Keep BMC firmware updated&lt;/li&gt;
&lt;/ul&gt;
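&lt;p&gt;The &lt;code&gt;cipher_privs&lt;/code&gt; argument is easy to get wrong by hand. The ipmitool man page describes it as a 15-character string, one position per cipher suite, using the letters &lt;code&gt;X&lt;/code&gt; (unused), &lt;code&gt;c&lt;/code&gt;, &lt;code&gt;u&lt;/code&gt;, &lt;code&gt;o&lt;/code&gt;, &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;O&lt;/code&gt;. A small helper, sketched under that assumption, builds the string from an allow-list so that suite 0 stays off unless it is requested explicitly. The helper name is hypothetical and some BMCs may behave differently.&lt;/p&gt;

```python
# X=unused, c=CALLBACK, u=USER, o=OPERATOR, a=ADMIN, O=OEM
VALID_LEVELS = set("XcuoaO")


def cipher_privs(allowed: dict, suites: int = 15):
    """Build the privilege string for `ipmitool lan set CHANNEL cipher_privs`.

    `allowed` maps a cipher suite number to a privilege letter. Every
    suite not listed is marked X (unused), so suite 0 is disabled by default.
    """
    chars = ["X"] * suites
    for suite, level in allowed.items():
        if suite not in range(suites):
            raise ValueError(f"cipher suite {suite} out of range")
        if level not in VALID_LEVELS:
            raise ValueError(f"unknown privilege letter {level!r}")
        chars[suite] = level
    return "".join(chars)


if __name__ == "__main__":
    # Example policy: allow ADMIN on suite 3 only, disable everything else.
    print(cipher_privs({3: "a"}))
```

&lt;p&gt;Which suites to leave enabled depends on what the BMC and your clients support; suite 3 is used here only as an example.&lt;/p&gt;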
&lt;hr&gt;
&lt;h2 id="related"&gt;Related
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="redfish/" &gt;Redfish&lt;/a&gt; — the modern REST API replacement for IPMI&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="./" &gt;Out-of-band management overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/hardware-provisioning/" &gt;Hardware provisioning&lt;/a&gt; — PXE boot and bare-metal provisioning&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Redfish</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/out-of-band/redfish/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/out-of-band/redfish/</guid><description>&lt;p&gt;Redfish is a DMTF standard that defines a RESTful API for out-of-band server management. It replaces IPMI&amp;rsquo;s aging binary protocol with JSON over HTTPS — same capabilities (power control, sensors, firmware, console), but with a proper API, role-based access control, and standard authentication. Supported by all major server vendors on current-generation hardware.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="why-redfish-over-ipmi"&gt;Why Redfish over IPMI
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;&lt;/th&gt;
 &lt;th&gt;IPMI&lt;/th&gt;
 &lt;th&gt;Redfish&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Protocol&lt;/td&gt;
 &lt;td&gt;Binary, UDP 623&lt;/td&gt;
 &lt;td&gt;HTTPS (REST/JSON)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Auth&lt;/td&gt;
 &lt;td&gt;RAKP (has CVEs)&lt;/td&gt;
 &lt;td&gt;HTTP Basic / Session tokens&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Encryption&lt;/td&gt;
 &lt;td&gt;Optional (IPMI 2.0)&lt;/td&gt;
 &lt;td&gt;Always (TLS)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Discoverability&lt;/td&gt;
 &lt;td&gt;No&lt;/td&gt;
 &lt;td&gt;Yes (hypermedia)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Scripting&lt;/td&gt;
 &lt;td&gt;ipmitool flags&lt;/td&gt;
 &lt;td&gt;curl, Python, any HTTP client&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Extensibility&lt;/td&gt;
 &lt;td&gt;Vendor OEM extensions&lt;/td&gt;
 &lt;td&gt;Structured OEM namespaces&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Maturity&lt;/td&gt;
 &lt;td&gt;Established, aging&lt;/td&gt;
 &lt;td&gt;Modern, actively developed&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Redfish is not universally available — older hardware (pre-2015 roughly) has IPMI only. Both coexist on many current systems; IPMI is still useful for compatibility. See &lt;a class="link" href="ipmi/" &gt;IPMI&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="vendor-implementations"&gt;Vendor implementations
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Vendor&lt;/th&gt;
 &lt;th&gt;BMC&lt;/th&gt;
 &lt;th&gt;Redfish support&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Dell&lt;/td&gt;
 &lt;td&gt;iDRAC 8+&lt;/td&gt;
 &lt;td&gt;Full, v1.0+&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;HPE&lt;/td&gt;
 &lt;td&gt;iLO 4+&lt;/td&gt;
 &lt;td&gt;Full (iLO 5 most complete)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Supermicro&lt;/td&gt;
 &lt;td&gt;BMC (X11+)&lt;/td&gt;
 &lt;td&gt;Full&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Lenovo&lt;/td&gt;
 &lt;td&gt;XClarity&lt;/td&gt;
 &lt;td&gt;Full&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Intel&lt;/td&gt;
 &lt;td&gt;BMC on server boards&lt;/td&gt;
 &lt;td&gt;Partial&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;OpenBMC&lt;/td&gt;
 &lt;td&gt;Open-source BMC firmware&lt;/td&gt;
 &lt;td&gt;Full (used by Facebook, Google infra)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;AMI MegaRAC&lt;/td&gt;
 &lt;td&gt;OEM BMC firmware&lt;/td&gt;
 &lt;td&gt;Full&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id="api-structure"&gt;API structure
&lt;/h2&gt;&lt;p&gt;Redfish uses a consistent URL hierarchy rooted at &lt;code&gt;/redfish/v1/&lt;/code&gt;. Navigation is hypermedia-driven — the root returns links to subsystems, and you follow them.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;/redfish/v1/
├── Systems/                  ← compute systems (servers)
│   └── 1/
│       ├── Processors/
│       ├── Memory/
│       ├── Storage/
│       └── Actions/ComputerSystem.Reset
├── Chassis/                  ← physical chassis, power, thermal
│   └── 1/
│       ├── Power/            ← PSU status, power consumption
│       └── Thermal/          ← temperatures, fan speeds
├── Managers/                 ← the BMC itself
│   └── 1/
│       └── EthernetInterfaces/
└── UpdateService/            ← firmware updates
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
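&lt;p&gt;Member IDs in this hierarchy differ between vendors, so a hardcoded &lt;code&gt;/Systems/1&lt;/code&gt; may not exist on a given BMC. Following the links is more portable: each collection lists its members under &lt;code&gt;Members&lt;/code&gt;, each with an &lt;code&gt;@odata.id&lt;/code&gt;. A minimal sketch that works on an already-fetched payload; the sample data and the function name are illustrative.&lt;/p&gt;

```python
import json


def member_uris(collection: dict):
    """Return the @odata.id of every member in a Redfish collection payload."""
    return [m["@odata.id"] for m in collection.get("Members", [])]


# Sample payload shaped like a GET /redfish/v1/Systems response.
# The member ID below is illustrative; real IDs are vendor-specific.
SAMPLE = json.loads("""
{
  "@odata.id": "/redfish/v1/Systems",
  "Members@odata.count": 1,
  "Members": [{"@odata.id": "/redfish/v1/Systems/System.Embedded.1"}]
}
""")

if __name__ == "__main__":
    for uri in member_uris(SAMPLE):
        print(uri)
```

&lt;p&gt;The same function applies to &lt;code&gt;Chassis&lt;/code&gt; and &lt;code&gt;Managers&lt;/code&gt;, since all Redfish collections share this shape.&lt;/p&gt;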
&lt;h2 id="usage-with-curl"&gt;Usage with curl
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;BMC&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;https://192.168.1.10&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;USER&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;admin&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;PASS&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;password&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Get system overview&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -sk -u &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$USER&lt;span style="color:#e6db74"&gt;:&lt;/span&gt;$PASS&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$BMC&lt;span style="color:#e6db74"&gt;/redfish/v1/Systems/1&amp;#34;&lt;/span&gt; | jq .
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Power state&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -sk -u &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$USER&lt;span style="color:#e6db74"&gt;:&lt;/span&gt;$PASS&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$BMC&lt;span style="color:#e6db74"&gt;/redfish/v1/Systems/1&amp;#34;&lt;/span&gt; | jq .PowerState
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Power on&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -sk -u &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$USER&lt;span style="color:#e6db74"&gt;:&lt;/span&gt;$PASS&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; -X POST &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -H &lt;span style="color:#e6db74"&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -d &lt;span style="color:#e6db74"&gt;&amp;#39;{&amp;#34;ResetType&amp;#34;:&amp;#34;On&amp;#34;}&amp;#39;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$BMC&lt;span style="color:#e6db74"&gt;/redfish/v1/Systems/1/Actions/ComputerSystem.Reset&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Power off (graceful)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -sk -u &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$USER&lt;span style="color:#e6db74"&gt;:&lt;/span&gt;$PASS&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; -X POST &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -H &lt;span style="color:#e6db74"&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -d &lt;span style="color:#e6db74"&gt;&amp;#39;{&amp;#34;ResetType&amp;#34;:&amp;#34;GracefulShutdown&amp;#34;}&amp;#39;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$BMC&lt;span style="color:#e6db74"&gt;/redfish/v1/Systems/1/Actions/ComputerSystem.Reset&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Force off&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -sk -u &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$USER&lt;span style="color:#e6db74"&gt;:&lt;/span&gt;$PASS&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; -X POST &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -H &lt;span style="color:#e6db74"&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -d &lt;span style="color:#e6db74"&gt;&amp;#39;{&amp;#34;ResetType&amp;#34;:&amp;#34;ForceOff&amp;#34;}&amp;#39;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$BMC&lt;span style="color:#e6db74"&gt;/redfish/v1/Systems/1/Actions/ComputerSystem.Reset&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Thermal — CPU temps, fan speeds&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -sk -u &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$USER&lt;span style="color:#e6db74"&gt;:&lt;/span&gt;$PASS&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$BMC&lt;span style="color:#e6db74"&gt;/redfish/v1/Chassis/1/Thermal&amp;#34;&lt;/span&gt; | jq &lt;span style="color:#e6db74"&gt;&amp;#39;.Temperatures[] | {Name, ReadingCelsius}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Reset types vary by vendor — check &lt;code&gt;AllowableValues&lt;/code&gt; in the action schema:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -sk -u &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$USER&lt;span style="color:#e6db74"&gt;:&lt;/span&gt;$PASS&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$BMC&lt;span style="color:#e6db74"&gt;/redfish/v1/Systems/1&amp;#34;&lt;/span&gt; | jq &lt;span style="color:#e6db74"&gt;&amp;#39;.Actions[&amp;#34;#ComputerSystem.Reset&amp;#34;][&amp;#34;ResetType@Redfish.AllowableValues&amp;#34;]&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="python--sushy"&gt;Python — sushy
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;sushy&lt;/code&gt; is a Python client library for Redfish, maintained under OpenStack and used by Ironic (DMTF publishes its own &lt;code&gt;python-redfish-library&lt;/code&gt; as well):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; sushy
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;client &lt;span style="color:#f92672"&gt;=&lt;/span&gt; sushy&lt;span style="color:#f92672"&gt;.&lt;/span&gt;Sushy(&lt;span style="color:#e6db74"&gt;&amp;#34;https://192.168.1.10&amp;#34;&lt;/span&gt;, username&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;admin&amp;#34;&lt;/span&gt;, password&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;password&amp;#34;&lt;/span&gt;, verify&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;False&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;system &lt;span style="color:#f92672"&gt;=&lt;/span&gt; client&lt;span style="color:#f92672"&gt;.&lt;/span&gt;get_system(&lt;span style="color:#e6db74"&gt;&amp;#34;/redfish/v1/Systems/1&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;print(system&lt;span style="color:#f92672"&gt;.&lt;/span&gt;power_state) &lt;span style="color:#75715e"&gt;# On / Off&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;system&lt;span style="color:#f92672"&gt;.&lt;/span&gt;reset_system(sushy&lt;span style="color:#f92672"&gt;.&lt;/span&gt;RESET_ON)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;system&lt;span style="color:#f92672"&gt;.&lt;/span&gt;reset_system(sushy&lt;span style="color:#f92672"&gt;.&lt;/span&gt;RESET_FORCE_OFF)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="session-based-auth"&gt;Session-based auth
&lt;/h2&gt;&lt;p&gt;For scripts making many requests, create a session to avoid re-authenticating on every call:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Create session&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;SESSION&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;$(&lt;/span&gt;curl -sk -X POST &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -H &lt;span style="color:#e6db74"&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -d &lt;span style="color:#e6db74"&gt;&amp;#39;{&amp;#34;UserName&amp;#34;:&amp;#34;admin&amp;#34;,&amp;#34;Password&amp;#34;:&amp;#34;password&amp;#34;}&amp;#39;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;https://192.168.1.10/redfish/v1/SessionService/Sessions&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -D -&lt;span style="color:#66d9ef"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;TOKEN&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;$(&lt;/span&gt;echo &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$SESSION&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; | grep -i X-Auth-Token | awk &lt;span style="color:#e6db74"&gt;&amp;#39;{print $2}&amp;#39;&lt;/span&gt; | tr -d &lt;span style="color:#e6db74"&gt;&amp;#39;\r&amp;#39;&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Use token&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -sk -H &lt;span style="color:#e6db74"&gt;&amp;#34;X-Auth-Token: &lt;/span&gt;$TOKEN&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;https://192.168.1.10/redfish/v1/Systems/1&amp;#34;&lt;/span&gt; | jq .PowerState
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="firmware-updates"&gt;Firmware updates
&lt;/h2&gt;&lt;p&gt;Redfish standardises firmware update via &lt;code&gt;UpdateService&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Check current firmware&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -sk -u &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$USER&lt;span style="color:#e6db74"&gt;:&lt;/span&gt;$PASS&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$BMC&lt;span style="color:#e6db74"&gt;/redfish/v1/UpdateService/FirmwareInventory&amp;#34;&lt;/span&gt; | jq .
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Push update (raw HTTP push; the target URI is vendor-specific, read HttpPushUri from UpdateService)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -sk -u &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$USER&lt;span style="color:#e6db74"&gt;:&lt;/span&gt;$PASS&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; -X POST &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -H &lt;span style="color:#e6db74"&gt;&amp;#34;Content-Type: application/octet-stream&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --data-binary @firmware.bin &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;$BMC&lt;span style="color:#e6db74"&gt;/redfish/v1/UpdateService/update&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Vendor tooling (Dell racadm, HPE iLOrest) is often more reliable than raw curl for firmware updates.&lt;/p&gt;
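&lt;p&gt;The upload URI is not fixed by the standard. The &lt;code&gt;UpdateService&lt;/code&gt; resource advertises it in &lt;code&gt;HttpPushUri&lt;/code&gt; (raw body upload) and, on newer implementations, &lt;code&gt;MultipartHttpPushUri&lt;/code&gt; (multipart form upload). A sketch that reads those properties from an already-fetched payload; the function name and sample payload are hypothetical.&lt;/p&gt;

```python
def push_targets(update_service: dict):
    """Map upload style to URI for the push properties a BMC advertises.

    Returns an empty dict when neither property is present, in which
    case vendor tooling is probably the only supported path.
    """
    styles = {"HttpPushUri": "raw", "MultipartHttpPushUri": "multipart"}
    return {
        style: update_service[prop]
        for prop, style in styles.items()
        if update_service.get(prop)
    }


if __name__ == "__main__":
    # Sample shaped like a GET /redfish/v1/UpdateService response.
    sample = {"HttpPushUri": "/redfish/v1/UpdateService/update"}
    print(push_targets(sample))
```

&lt;p&gt;Keeping the two styles apart matters because they need different request bodies: a raw push sends the image as the body, while a multipart push wraps it in form parts.&lt;/p&gt;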
&lt;hr&gt;
&lt;h2 id="related"&gt;Related
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="ipmi/" &gt;IPMI&lt;/a&gt; — older binary protocol, still needed for pre-Redfish hardware&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="./" &gt;Out-of-band management overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/hardware-provisioning/" &gt;Hardware provisioning&lt;/a&gt; — PXE boot and bare-metal provisioning&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>ASGARD — the blade cluster</title><link>https://backend-engineering-strategy-tools.github.io/site/homelab/asgard-blades/</link><pubDate>Fri, 15 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/homelab/asgard-blades/</guid><description>&lt;p&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/inventory/systems/" &gt;ASGARD (SYS-007)&lt;/a&gt; is the HP BladeSystem C7000 with 16× BL460c Gen8 blades. The reason to use it is profile switching: boot a blade as a Slurm compute node, run the experiment, reimage it as a Talos worker, run the next one. The same iPXE boot menu already set up for &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/talos-omni/" &gt;ODEN&lt;/a&gt; works here — the C7000 Onboard Administrator lets you configure boot order per blade slot, so switching roles is a BIOS setting and a PXE entry, not a reinstall.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="power-reality"&gt;Power reality
&lt;/h2&gt;&lt;p&gt;Before committing to blades as the permanent always-on platform, it&amp;rsquo;s worth being honest about the enclosure overhead. The C7000 has fixed costs regardless of how many blades are populated: 10 fans, dual OA modules, 2 interconnect switches, backplane management. It doesn&amp;rsquo;t scale down gracefully.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Setup&lt;/th&gt;
 &lt;th&gt;Approx power&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;C7000 enclosure alone (no blades)&lt;/td&gt;
 &lt;td&gt;200–400W&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;C7000 + 1 blade&lt;/td&gt;
 &lt;td&gt;350–550W&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;C7000 + 3 blades&lt;/td&gt;
 &lt;td&gt;500–800W&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;ODEN alone (1U M3, Talos)&lt;/td&gt;
 &lt;td&gt;100–150W&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;HEIMDAL alone (Sun X4150, router)&lt;/td&gt;
 &lt;td&gt;150–200W&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;ODEN + HEIMDAL&lt;/td&gt;
 &lt;td&gt;250–350W&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Two pizza boxes beat three blades in the enclosure on power. The overhead only amortises at 8+ populated slots. For a permanent minimal setup, the 1U rack servers win. For experiments where you want to run 8–16 nodes at once, ASGARD earns its place.&lt;/p&gt;
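&lt;p&gt;Dividing the table figures by node count makes the comparison concrete. The sketch below uses only the approximate ranges from the table above and splits each evenly across the nodes in that setup.&lt;/p&gt;

```python
# (low, high) power in watts and node count, taken from the table above.
SETUPS = {
    "C7000 + 1 blade": ((350, 550), 1),
    "C7000 + 3 blades": ((500, 800), 3),
    "ODEN + HEIMDAL": ((250, 350), 2),
}


def watts_per_node(power_range, nodes: int):
    """Return the (low, high) power range divided evenly across nodes."""
    low, high = power_range
    return (low / nodes, high / nodes)


if __name__ == "__main__":
    for name, (power_range, nodes) in SETUPS.items():
        low, high = watts_per_node(power_range, nodes)
        print(f"{name}: {low:.0f}-{high:.0f} W per node")
```

&lt;p&gt;On these figures, three blades come to roughly 167–267 W per node against 125–175 W per node for the two rack servers.&lt;/p&gt;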
&lt;hr&gt;
&lt;h2 id="what-each-role-actually-needs"&gt;What each role actually needs
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Role&lt;/th&gt;
 &lt;th&gt;RAM&lt;/th&gt;
 &lt;th&gt;Disk&lt;/th&gt;
 &lt;th&gt;Network&lt;/th&gt;
 &lt;th&gt;Limiting factor&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Talos / K8s worker&lt;/td&gt;
 &lt;td&gt;32–64GB&lt;/td&gt;
 &lt;td&gt;1× OSD disk&lt;/td&gt;
 &lt;td&gt;1GbE fine&lt;/td&gt;
 &lt;td&gt;RAM — current blades too thin&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;OpenStack compute&lt;/td&gt;
 &lt;td&gt;32–64GB&lt;/td&gt;
 &lt;td&gt;local ephemeral&lt;/td&gt;
 &lt;td&gt;1GbE fine&lt;/td&gt;
 &lt;td&gt;RAM&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;OpenStack control&lt;/td&gt;
 &lt;td&gt;32GB+&lt;/td&gt;
 &lt;td&gt;small&lt;/td&gt;
 &lt;td&gt;1GbE fine&lt;/td&gt;
 &lt;td&gt;RAM&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Slurm compute&lt;/td&gt;
 &lt;td&gt;as much as possible&lt;/td&gt;
 &lt;td&gt;fast scratch&lt;/td&gt;
 &lt;td&gt;1GbE mediocre&lt;/td&gt;
 &lt;td&gt;network&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Ceph OSD&lt;/td&gt;
 &lt;td&gt;16–32GB&lt;/td&gt;
 &lt;td&gt;more / bigger disks&lt;/td&gt;
 &lt;td&gt;1GbE&lt;/td&gt;
 &lt;td&gt;disk count&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The network note matters for Slurm: blade LOM connects to the enclosure switch backplane at &lt;strong&gt;1GbE&lt;/strong&gt;, not 10GbE. The switch has 10GbE uplinks going out, but blade-to-blade traffic inside the enclosure goes through the switch at 1GbE. For Talos and OpenStack this is fine. For MPI jobs exchanging large datasets between Slurm nodes it&amp;rsquo;s a real bottleneck — HPC wants InfiniBand, which the empty interconnect bays 5–8 could take (plus matching mezzanine cards in each blade), but that&amp;rsquo;s a separate cost. For learning Slurm, 1GbE is workable.&lt;/p&gt;
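&lt;p&gt;To put the 1GbE limit in numbers, the sketch below computes the best-case time to move a dataset between two nodes at line rate. It ignores protocol overhead and contention, so real transfers will be slower. The 100 GB dataset size is an arbitrary example, not a figure from the lab.&lt;/p&gt;

```shell
#!/bin/sh
# Best-case transfer time at line rate, ignoring protocol overhead.
# transfer_seconds SIZE_GB LINK_GBPS
#   SIZE_GB   dataset size in gigabytes (decimal)
#   LINK_GBPS link speed in gigabits per second
transfer_seconds() {
    # gigabytes * 8 = gigabits; gigabits / (gigabits per second) = seconds
    echo $(( $1 * 8 / $2 ))
}

# Hypothetical 100 GB dataset exchanged between two Slurm nodes.
echo "1GbE:  $(transfer_seconds 100 1) s"
echo "10GbE: $(transfer_seconds 100 10) s"
```

&lt;p&gt;That is roughly 13 minutes against 80 seconds for the same exchange, which is tolerable for learning Slurm but painful for jobs that shuffle data every iteration.&lt;/p&gt;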
&lt;hr&gt;
&lt;h2 id="current-blade-state"&gt;Current blade state
&lt;/h2&gt;&lt;p&gt;Most blades are underpowered for any of the roles above. CPUs are also unknown across all 16 slots — the OA web GUI reports CPU model and core count per blade and should be checked first. The E5-2600 v1 range runs from E5-2603 (4c, 80W) to E5-2690 (8c/16t, 135W), which matters significantly for role assignment.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Slot&lt;/th&gt;
 &lt;th&gt;RAM&lt;/th&gt;
 &lt;th&gt;Disk&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-001&lt;/td&gt;
 &lt;td&gt;4GB&lt;/td&gt;
 &lt;td&gt;2× 146GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-002&lt;/td&gt;
 &lt;td&gt;14GB (mixed, odd count)&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-003&lt;/td&gt;
 &lt;td&gt;32GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-004&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-005&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;1× 146GB + 1× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-006&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-007&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;2× 900GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-008&lt;/td&gt;
 &lt;td&gt;16GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-009&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-010&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-011&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-012&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-013&lt;/td&gt;
 &lt;td&gt;32GB&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-014&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-015&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;2× 300GB SAS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-016&lt;/td&gt;
 &lt;td&gt;8GB&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;BLD-003 and BLD-013 are already at 32GB and are natural candidates for control-plane or master roles once CPUs are confirmed.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="suggested-configuration-from-existing-stock"&gt;Suggested configuration from existing stock
&lt;/h2&gt;&lt;p&gt;Available spare hardware:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;14× RAM-007 (8GB DDR3 1600MHz ECC Reg) — unassigned&lt;/li&gt;
&lt;li&gt;2× HDD-004 (120GB SATA SSD) — spare&lt;/li&gt;
&lt;li&gt;6× HDD-002 (146GB 10K SAS) — spare&lt;/li&gt;
&lt;li&gt;Embedded P220i on each blade (can be set to JBOD/passthrough for Ceph)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;&amp;ldquo;Fat&amp;rdquo; nodes × 2&lt;/strong&gt; — Talos control plane, OpenStack control, Slurm master:
Add 4× RAM-007 to each blade. From the 8GB base that gives 40GB. Candidates: BLD-006 and BLD-010, which both have 2× 300GB SAS for local storage. That uses 8 of the 14 spare sticks. Install a spare 120GB SSD as boot disk in each.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;ldquo;Medium&amp;rdquo; nodes × 3&lt;/strong&gt; — Talos workers, OpenStack compute, Slurm compute:
Add 2× RAM-007 to each → 24GB from the 8GB base. Candidates: BLD-008 (already 16GB, gets to 32GB), BLD-011, BLD-012. All three have 300GB SAS for scratch or Ceph OSDs. Costs the remaining 6 spare sticks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rest&lt;/strong&gt; — thin compute, storage expansion, or powered off:
Leave at current RAM. BLD-007&amp;rsquo;s 900GB SAS pair is better used elsewhere (see below). BLD-003 and BLD-013 at 32GB can step up to fat-node role once CPUs are confirmed.&lt;/p&gt;
&lt;p&gt;That leaves 5 blades properly kitted and 11 available for experiments or idle.&lt;/p&gt;
&lt;p&gt;BL460c Gen8 DIMM rule: populate both CPUs identically and spread DIMMs evenly across each CPU&amp;rsquo;s four memory channels for best throughput. Avoid odd DIMM counts.&lt;/p&gt;
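&lt;p&gt;The RAM plan above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every RAM-007 stick is 8GB and uses the base sizes from the blade table. The function names are made up for illustration.&lt;/p&gt;

```shell
#!/bin/sh
# Check the RAM plan against the spare pool: 14x RAM-007, 8 GB each.
STICK_GB=8
SPARE_STICKS=14

# ram_after BASE_GB STICKS_ADDED: resulting RAM in GB
ram_after() {
    echo $(( $1 + $2 * STICK_GB ))
}

# sticks_used FAT_NODES MEDIUM_NODES: 4 sticks per fat node, 2 per medium
sticks_used() {
    echo $(( $1 * 4 + $2 * 2 ))
}

echo "fat node (8GB base):    $(ram_after 8 4) GB"
echo "medium node (8GB base): $(ram_after 8 2) GB"
echo "BLD-008 (16GB base):    $(ram_after 16 2) GB"
echo "sticks used: $(sticks_used 2 3) of $SPARE_STICKS"
```

&lt;p&gt;Two fat nodes and three medium nodes use exactly the 14 spare sticks, so nothing is left over for other machines.&lt;/p&gt;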
&lt;hr&gt;
&lt;h2 id="storage--what-moves-where"&gt;Storage — what moves where
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Pull the 900GB SAS drives from BLD-007 now.&lt;/strong&gt; HDD-013 (HGST 900GB) and HDD-014 (Toshiba 900GB) are the two largest drives in the blade pool and they&amp;rsquo;re sitting in a blade that may end up as a thin compute worker. Move them into ODEN or LOKE as permanent Ceph OSDs. This immediately gives the always-on cluster substantially more storage than the current 120GB SSDs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MIMIR&lt;/strong&gt; (SYS-004, 15× 1TB SAS) is the Ceph expansion story for later. To connect it: install CTRL-006 (ServeRAID-8e, two unplaced) into a server with a free PCIe slot, then connect it with an SFF-8470 → SFF-8088 cable (not currently owned, inexpensive). TOR is the natural host: it already has CTRL-003 in HBA mode and free PCIe slots. Not urgent, but the hardware is almost all there.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;What&lt;/th&gt;
 &lt;th&gt;Goes to&lt;/th&gt;
 &lt;th&gt;When&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;900GB SAS ×2 from BLD-007&lt;/td&gt;
 &lt;td&gt;ODEN or LOKE, permanent Ceph OSDs&lt;/td&gt;
 &lt;td&gt;Now&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;120GB SSD ×2 spare&lt;/td&gt;
 &lt;td&gt;BLD fat node boot disks&lt;/td&gt;
 &lt;td&gt;Before Talos on blades&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;300GB SAS in blades&lt;/td&gt;
 &lt;td&gt;Local scratch or blade Ceph OSDs&lt;/td&gt;
 &lt;td&gt;During ASGARD experiments&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;MIMIR 15× 1TB SAS&lt;/td&gt;
 &lt;td&gt;TOR via CTRL-006, Ceph expansion&lt;/td&gt;
 &lt;td&gt;Later (needs cable)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id="three-things-to-do-before-blades-can-boot-anything"&gt;Three things to do before blades can boot anything
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Identify CPUs.&lt;/strong&gt; Connect to the OA management port, open the web GUI, check CPU model per slot. Ten minutes. Everything else depends on this.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Network uplink.&lt;/strong&gt; The blade switches in bays 1 and 2 have 4× RJ45 1GbE uplinks (ports 22–25). Run a patch cable from one to any available switch — MODI, MAGNI, whatever&amp;rsquo;s reachable from the cable box. That&amp;rsquo;s enough for blades to reach DHCP and iPXE.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RAM redistribution.&lt;/strong&gt; Pull the 14 spare RAM-007 sticks and install them in the chosen fat and medium nodes per the profile above.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2 id="the-permanent-vs-experiment-split"&gt;The permanent vs experiment split
&lt;/h2&gt;&lt;pre tabindex="0"&gt;&lt;code&gt;Always on (~350–500W total):
 HEIMDAL → OPNsense router, Sun X4150, ~150–200W
 ODEN → Talos, Minecraft + small services, ~100–150W
 LOKE → 2nd Talos node (needs RAM-007 × 8 + SSD boot), ~100–150W

Experiments (fire up, learn, power off):
 ASGARD → 3–16 blades for Slurm / OpenStack / larger Talos cluster
 TYR+TOR+FREJA → Proxmox cluster (M1 DDR2, temporary)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Once the Proxmox experiment wraps, TYR, TOR, and FREJA can be powered down permanently. If ASGARD blades eventually become the long-term compute platform, OPNsense can move to a VM on a blade at that point — but not before the blades are stable and trusted. Don&amp;rsquo;t consolidate the router onto experimental infrastructure.&lt;/p&gt;</description></item><item><title>OpenStack</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/openstack/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/openstack/</guid><description>&lt;p&gt;OpenStack is an open-source IaaS platform — it turns a pool of bare-metal servers into a self-service cloud: virtual machines, block storage, networking, and object storage, all driven by API.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.openstack.org/" target="_blank" rel="noopener"
 &gt;https://www.openstack.org/&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="scale-and-fit"&gt;Scale and fit
&lt;/h2&gt;&lt;p&gt;There is a rough spectrum of virtualization tools, and picking the wrong tier is a common mistake:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Proxmox / VMware / Hyper-V&lt;/strong&gt; — the right choice when you want to run virtual machines. SMB, homelab, or a small ops team managing infrastructure directly. Reasonable setup cost, manageable operational overhead, one or a few admins in control. Think of it as a VMware replacement.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OpenStack&lt;/strong&gt; — the right choice when you are &lt;em&gt;building a cloud&lt;/em&gt;, not just running VMs. Multi-tenant infrastructure where teams self-service their own compute, networking, and storage via API. The operational complexity is real and significant; it pays off when the cloud-like abstraction is the actual product, or when the scale justifies the overhead.&lt;/p&gt;
&lt;p&gt;The rule of thumb: if the question is &amp;ldquo;how do I replace VMware?&amp;rdquo;, the answer is Proxmox. If the question is &amp;ldquo;how do I build a private cloud platform?&amp;rdquo;, the answer might be OpenStack.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="core-components"&gt;Core Components
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Service&lt;/th&gt;
 &lt;th&gt;Code Name&lt;/th&gt;
 &lt;th&gt;What it does&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Compute&lt;/td&gt;
 &lt;td&gt;Nova&lt;/td&gt;
 &lt;td&gt;Schedules and manages VM lifecycle&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Networking&lt;/td&gt;
 &lt;td&gt;Neutron&lt;/td&gt;
 &lt;td&gt;Virtual networks, routers, floating IPs, security groups&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Block Storage&lt;/td&gt;
 &lt;td&gt;Cinder&lt;/td&gt;
 &lt;td&gt;Persistent volumes attached to VMs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Image Service&lt;/td&gt;
 &lt;td&gt;Glance&lt;/td&gt;
 &lt;td&gt;Stores and serves OS images&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Identity&lt;/td&gt;
 &lt;td&gt;Keystone&lt;/td&gt;
 &lt;td&gt;Auth, service catalog, RBAC&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Dashboard&lt;/td&gt;
 &lt;td&gt;Horizon&lt;/td&gt;
 &lt;td&gt;Web UI (optional)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Object Storage&lt;/td&gt;
 &lt;td&gt;Swift&lt;/td&gt;
 &lt;td&gt;S3-like object storage (optional)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Bare Metal&lt;/td&gt;
 &lt;td&gt;Ironic&lt;/td&gt;
 &lt;td&gt;Provisions physical machines instead of VMs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;You do not need all of them. A minimal useful deployment is Nova + Neutron + Cinder + Glance + Keystone, plus the Placement service that Nova depends on.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="openstack-on-kubernetes"&gt;OpenStack on Kubernetes
&lt;/h2&gt;&lt;p&gt;OpenStack services are just applications — and they can run as Kubernetes workloads. Two projects make this practical:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a class="link" href="https://github.com/openstack/openstack-helm" target="_blank" rel="noopener"
 &gt;OpenStack-Helm&lt;/a&gt;&lt;/strong&gt; — official Helm charts for deploying OpenStack services on an existing Kubernetes cluster. Each service (Nova, Neutron, Cinder, etc.) becomes a Helm release. Upgrades follow standard rolling deployment patterns.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a class="link" href="https://github.com/vexxhost/atmosphere" target="_blank" rel="noopener"
 &gt;Atmosphere&lt;/a&gt;&lt;/strong&gt; (by VEXXHOST) — a higher-level operator built on top of OpenStack-Helm. Adds Ansible automation, health checks, and a more opinionated deployment model. Targets production use.&lt;/p&gt;
&lt;p&gt;The practical implication: you can run a Talos cluster and deploy OpenStack on top of it — OpenStack as a tenant of Kubernetes rather than a separate platform. This inverts the usual relationship (where Kubernetes runs on top of OpenStack) and is an interesting architectural option for homelab and small private cloud deployments.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.fairbanks.nl/" target="_blank" rel="noopener"
 &gt;Fairbanks&lt;/a&gt; (Dutch hosting company specialising in sovereign private clouds) does exactly this in production. Their talk &lt;a class="link" href="https://www.youtube.com/watch?v=zU8mT2f2Hxc" target="_blank" rel="noopener"
 &gt;OpenStack on Talos Linux&lt;/a&gt; is the clearest real-world example of the pattern.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="deployment-options"&gt;Deployment Options
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Kolla-Ansible&lt;/strong&gt;&lt;br&gt;
&lt;a class="link" href="https://docs.openstack.org/kolla-ansible/latest/" target="_blank" rel="noopener"
 &gt;https://docs.openstack.org/kolla-ansible/latest/&lt;/a&gt;&lt;br&gt;
Containerised OpenStack deployed via Ansible. Production-grade, well-maintained. The practical choice for homelab and small-scale production deployments. Each service runs in its own container.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DevStack&lt;/strong&gt;&lt;br&gt;
&lt;a class="link" href="https://docs.openstack.org/devstack/latest/" target="_blank" rel="noopener"
 &gt;https://docs.openstack.org/devstack/latest/&lt;/a&gt;&lt;br&gt;
All-in-one development install. Not for production or anything you want to survive a reboot. Good for learning the API surface.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Canonical OpenStack (Juju / Sunbeam)&lt;/strong&gt;&lt;br&gt;
&lt;a class="link" href="https://ubuntu.com/openstack" target="_blank" rel="noopener"
 &gt;https://ubuntu.com/openstack&lt;/a&gt;&lt;br&gt;
Ubuntu-opinionated deployment. Sunbeam is a newer minimal footprint option. Good if you&amp;rsquo;re already in the Ubuntu/Juju ecosystem.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="concepts-worth-understanding"&gt;Concepts Worth Understanding
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Flavors&lt;/strong&gt; — VM sizing templates (vCPU, RAM, disk). You define these; instances pick from them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Security Groups&lt;/strong&gt; — stateful firewall rules applied per-port. Default-deny inbound.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Floating IPs&lt;/strong&gt; — externally routable IPs that can be associated/disassociated from instances dynamically.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Availability Zones&lt;/strong&gt; — logical groupings of compute nodes. Useful for fault isolation even at small scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hypervisors&lt;/strong&gt; — Nova supports KVM (default), QEMU, VMware, and others. KVM on Linux is the standard.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="relevance-to-the-lab"&gt;Relevance to the Lab
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/llm-training/" &gt;LLM training experiment&lt;/a&gt; plans to use OpenStack as the IaaS layer over the &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/inventory/systems/" &gt;blade nodes&lt;/a&gt; in ASGARD — Nova for compute scheduling, Neutron for cluster networking, Cinder for shared model/dataset storage backed by Ceph.&lt;/p&gt;</description></item><item><title>Proxmox Cluster in the homelab</title><link>https://backend-engineering-strategy-tools.github.io/site/homelab/proxmox-cluster/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/homelab/proxmox-cluster/</guid><description>&lt;p&gt;Getting a three-node &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/proxmox/" &gt;Proxmox VE&lt;/a&gt; cluster running in the homelab.&lt;/p&gt;
&lt;p&gt;The goal is a shared virtualization platform for running VMs and LXC containers across the rack. It is also a good excuse to kick the tires on Proxmox itself, so, naturally, let&amp;rsquo;s needlessly complicate things with some self-imposed constraints:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;run it clustered&lt;/li&gt;
&lt;li&gt;don&amp;rsquo;t use any hardware already earmarked for other projects&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2 id="hardware"&gt;Hardware
&lt;/h2&gt;&lt;p&gt;I am going to try to use three IBM rack servers from the &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/inventory/systems/" &gt;inventory&lt;/a&gt;.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Asset ID&lt;/th&gt;
 &lt;th&gt;Hostname&lt;/th&gt;
 &lt;th&gt;Model&lt;/th&gt;
 &lt;th&gt;Form Factor&lt;/th&gt;
 &lt;th&gt;RAM&lt;/th&gt;
 &lt;th&gt;CPU&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-001&lt;/td&gt;
 &lt;td&gt;FREJA&lt;/td&gt;
 &lt;td&gt;IBM System x3550 M1 (7978)&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;24GB&lt;/td&gt;
 &lt;td&gt;single&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-002&lt;/td&gt;
 &lt;td&gt;TYR&lt;/td&gt;
 &lt;td&gt;IBM System x3650 M1 (7979)&lt;/td&gt;
 &lt;td&gt;2U&lt;/td&gt;
 &lt;td&gt;64GB&lt;/td&gt;
 &lt;td&gt;dual&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-003&lt;/td&gt;
 &lt;td&gt;TOR&lt;/td&gt;
 &lt;td&gt;IBM System x3650 M1 (7979)&lt;/td&gt;
 &lt;td&gt;2U&lt;/td&gt;
 &lt;td&gt;64GB&lt;/td&gt;
 &lt;td&gt;dual&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Three nodes satisfy Corosync quorum without needing a &lt;code&gt;qdevice&lt;/code&gt;: losing one node still leaves a majority.&lt;/p&gt;
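&lt;p&gt;The majority rule behind that statement is simple enough to tabulate. The sketch below assumes one vote per node and no quorum device. The function names are illustrative and this is not how Corosync itself is queried.&lt;/p&gt;

```shell
#!/bin/sh
# Majority quorum for an N-node cluster with one vote per node.
# quorum N: votes needed for a majority
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: nodes that can be lost while keeping quorum
tolerated_failures() {
    echo $(( $1 - $(quorum "$1") ))
}

for n in 2 3 4 5; do
    echo "$n nodes: quorum $(quorum "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

&lt;p&gt;Two nodes tolerate no failures and four tolerate only one, which is why three is the smallest size worth clustering and why even node counts buy little.&lt;/p&gt;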
&lt;hr&gt;
&lt;h2 id="installation"&gt;Installation
&lt;/h2&gt;&lt;p&gt;&lt;em&gt;In progress.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="cluster-formation"&gt;Cluster formation
&lt;/h2&gt;&lt;p&gt;&lt;em&gt;In progress.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Proxmox VE</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/proxmox/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/proxmox/</guid><description>&lt;p&gt;Proxmox VE (Virtual Environment) is an open-source Type 1 hypervisor built on Debian. It runs KVM for full virtual machines and LXC for lightweight containers, managed through a web UI or API. The subscription model is optional — the community edition is fully functional without a paid license; the subscription gives access to the enterprise update repository and support.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="comparison"&gt;Comparison
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Platform&lt;/th&gt;
 &lt;th&gt;License&lt;/th&gt;
 &lt;th&gt;VMs (KVM)&lt;/th&gt;
 &lt;th&gt;Containers&lt;/th&gt;
 &lt;th&gt;Clustering&lt;/th&gt;
 &lt;th&gt;Web UI&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Proxmox VE&lt;/td&gt;
 &lt;td&gt;Open-source (optional sub)&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;td&gt;Yes (LXC)&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;VMware ESXi&lt;/td&gt;
 &lt;td&gt;Commercial&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;td&gt;No&lt;/td&gt;
 &lt;td&gt;Yes (vCenter)&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Standalone KVM&lt;/td&gt;
 &lt;td&gt;Open-source&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;td&gt;No&lt;/td&gt;
 &lt;td&gt;Manual&lt;/td&gt;
 &lt;td&gt;No&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;oVirt&lt;/td&gt;
 &lt;td&gt;Open-source&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;td&gt;No&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Proxmox is the practical choice when you want VMware-style management without the licensing cost, or when you want to run both VMs and containers on the same node.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="core-concepts"&gt;Core concepts
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Node&lt;/strong&gt; — a physical host running Proxmox VE. Managed independently or as part of a cluster.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cluster&lt;/strong&gt; — multiple nodes joined together. Share a unified management view and allow live migration of VMs between nodes. Uses &lt;a class="link" href="https://corosync.github.io/corosync/" target="_blank" rel="noopener"
 &gt;Corosync&lt;/a&gt; for distributed consensus.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Quorum&lt;/strong&gt; — clusters require a majority of nodes to be reachable to avoid split-brain. Minimum useful cluster size is 3 nodes (loss of one node still leaves a majority). Two-node clusters need a quorum device (&lt;code&gt;qdevice&lt;/code&gt;) to function safely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;VM&lt;/strong&gt; — full virtual machine backed by QEMU/KVM. Hardware-level isolation. Arbitrary OS.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Container (CT)&lt;/strong&gt; — LXC container. Shares the host kernel; lower overhead than a VM. Linux-only. Useful for services where you want process-level isolation without a full OS.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Storage pool&lt;/strong&gt; — where disks and images live. Supported backends: local directory, LVM, LVM-thin, ZFS, NFS, CIFS, and Ceph (via &lt;code&gt;rbd&lt;/code&gt;). ZFS and Ceph are the most capable options for a cluster — ZFS for local redundancy, Ceph for shared storage across nodes.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="related"&gt;Related
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://pve.proxmox.com/pve-docs/" target="_blank" rel="noopener"
 &gt;Proxmox VE documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://forum.proxmox.com/" target="_blank" rel="noopener"
 &gt;Proxmox community forum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://corosync.github.io/corosync/" target="_blank" rel="noopener"
 &gt;Corosync documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/ceph/" &gt;Ceph&lt;/a&gt; — distributed storage backend for Proxmox clusters&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/openstack/" &gt;OpenStack&lt;/a&gt; — the next tier up the scale spectrum&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/proxmox-cluster/" &gt;Proxmox cluster in the homelab&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>PXE Booting with OPNSense + iPXE</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/hardware-provisioning/ipxe-opnsense/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/hardware-provisioning/ipxe-opnsense/</guid><description>&lt;p&gt;How to configure OPNSense as a PXE boot server using its built-in DHCP and TFTP services, and how to write an iPXE boot menu that can chainload Talos Linux (or anything else).&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="opnsense-dhcp--network-boot-fields"&gt;OPNSense DHCP — Network Boot Fields
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Services → ISC DHCPv4 → [LAN] → Network Booting&lt;/code&gt;&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Field&lt;/th&gt;
 &lt;th&gt;Value&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Enable network booting&lt;/td&gt;
 &lt;td&gt;✓&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Next-server IP&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;192.168.1.1&lt;/code&gt; (OPNSense LAN address)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Default BIOS filename&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;undionly.kpxe&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;x86 UEFI (32-bit) filename&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ipxe.efi&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;x64 UEFI/EBC (64-bit) filename&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ipxe.efi&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;iPXE boot filename&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;default.ipxe&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The DHCP server serves the correct boot file based on client architecture. BIOS clients get &lt;code&gt;undionly.kpxe&lt;/code&gt;; UEFI clients get &lt;code&gt;ipxe.efi&lt;/code&gt;. Both then chainload &lt;code&gt;default.ipxe&lt;/code&gt;.&lt;/p&gt;
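&lt;p&gt;The selection can be modelled roughly as follows. This is a sketch of the decision only, under the assumption that the server keys on the DHCP client architecture code (option 93, RFC 4578) and on the user class that iPXE sends on its second request. It is not a copy of the configuration OPNSense generates, and the fallback branch is an assumption.&lt;/p&gt;

```shell
#!/bin/sh
# Illustrative model of the boot-file decision, not OPNSense's implementation.
# boot_file ARCH USER_CLASS
#   ARCH       DHCP option 93 client architecture code (RFC 4578)
#   USER_CLASS DHCP option 77 user class; iPXE identifies itself as "iPXE"
boot_file() {
    # A client already running iPXE gets the script, which breaks the loop.
    if [ "$2" = "iPXE" ]; then
        echo "default.ipxe"
        return
    fi
    case "$1" in
        0)   echo "undionly.kpxe" ;;  # legacy BIOS
        6)   echo "ipxe.efi" ;;       # 32-bit UEFI field
        7|9) echo "ipxe.efi" ;;       # 64-bit UEFI field
        *)   echo "undionly.kpxe" ;;  # assumption: fall back to the BIOS file
    esac
}

boot_file 0 ""      # first request from a BIOS machine
boot_file 7 ""      # first request from a UEFI machine
boot_file 7 "iPXE"  # second request, now from iPXE itself
```

&lt;p&gt;The user-class check comes first on purpose: without it, iPXE would be handed another copy of itself on every request and never reach the menu.&lt;/p&gt;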
&lt;hr&gt;
&lt;h2 id="tftp--downloading-the-boot-files"&gt;TFTP — Downloading the Boot Files
&lt;/h2&gt;&lt;p&gt;OPNSense runs a TFTP server rooted at &lt;code&gt;/usr/local/tftp&lt;/code&gt;. SSH in and fetch the iPXE binaries:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;fetch -o /usr/local/tftp/undionly.kpxe https://boot.ipxe.org/undionly.kpxe
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;fetch -o /usr/local/tftp/ipxe.efi https://boot.ipxe.org/ipxe.efi
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="ipxe-boot-script"&gt;iPXE Boot Script
&lt;/h2&gt;&lt;p&gt;Save as &lt;code&gt;/usr/local/tftp/default.ipxe&lt;/code&gt;. This example has a boot menu with options for netboot.xyz, a Talos Omni boot, and a debug shell.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-ipxe" data-lang="ipxe"&gt;#!ipxe

dhcp
set menu-timeout 5000

:start
menu PXE Boot Menu
item --gap -- ----------------------------
item netboot Boot netboot.xyz
item talos Boot Talos (Omni)
item shell iPXE Shell
item --gap -- ----------------------------
choose --timeout ${menu-timeout} target &amp;amp;&amp;amp; goto ${target}

:netboot
chain http://boot.netboot.xyz || goto failed
goto start

:talos
echo Booting Talos via Omni...

set api https://&amp;lt;your-omni-instance&amp;gt;.omni.siderolabs.io
set token &amp;lt;join-token&amp;gt;
set event [&amp;lt;siderolink-ipv6&amp;gt;]:8090
set log tcp://[&amp;lt;siderolink-ipv6&amp;gt;]:8092

kernel tftp://${next-server}/talos/vmlinuz-omni \
 ima_template=ima-ng \
 ima_appraise=fix \
 ima_hash=sha512 \
 selinux=1 \
 consoleblank=0 \
 nvme_core.io_timeout=4294967295 \
 initrd=initramfs-omni.xz \
 init_on_alloc=1 \
 slab_nomerge \
 pti=on \
 console=tty0 \
 console=ttyS0 \
 printk.devkmsg=on \
 talos.platform=metal \
 siderolink.api=${api}?jointoken=${token} \
 talos.events.sink=${event} \
 talos.logging.kernel=${log}

initrd tftp://${next-server}/talos/initramfs-omni.xz
boot || goto failed

:shell
shell
goto start

:failed
echo Boot failed, press Enter to continue...
read fake
goto start
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;code&gt;api&lt;/code&gt;, &lt;code&gt;token&lt;/code&gt;, &lt;code&gt;event&lt;/code&gt;, and &lt;code&gt;log&lt;/code&gt; values come from the Omni console when you generate a join link.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="talos-kernel-and-initramfs--image-factory"&gt;Talos Kernel and Initramfs — Image Factory
&lt;/h2&gt;&lt;p&gt;The standard Talos release binaries do not include firmware for all hardware. Since Talos 1.6, several older NIC drivers (including Broadcom BNX2 / BCM5709) were removed from the mainline image and made available as extensions via the image factory.&lt;/p&gt;
&lt;p&gt;Generate a custom image at &lt;a class="link" href="https://factory.talos.dev" target="_blank" rel="noopener"
 &gt;factory.talos.dev&lt;/a&gt; with the extensions you need (e.g. &lt;code&gt;siderolabs/bnx2-bnx2x&lt;/code&gt;), then download the PXE artifacts:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;mkdir -p /usr/local/tftp/talos
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;fetch -o /usr/local/tftp/talos/vmlinuz-omni &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;https://pxe.factory.talos.dev/image/&amp;lt;IMAGE_ID&amp;gt;/v1.10.1/kernel-amd64&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;fetch -o /usr/local/tftp/talos/initramfs-omni.xz &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;https://pxe.factory.talos.dev/image/&amp;lt;IMAGE_ID&amp;gt;/v1.10.1/initramfs-amd64.xz&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Replace &lt;code&gt;&amp;lt;IMAGE_ID&amp;gt;&lt;/code&gt; with the schematic ID from the image factory, and adjust the version tag as needed.&lt;/p&gt;
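&lt;p&gt;When the version changes often, it can help to build the two URLs from variables instead of editing them by hand. The helper below is a small sketch that follows the URL pattern shown above. The function name and the placeholder schematic ID are hypothetical.&lt;/p&gt;

```shell
#!/bin/sh
# Build Image Factory PXE artifact URLs from a schematic ID and version.
# factory_url SCHEMATIC_ID VERSION ARTIFACT
factory_url() {
    echo "https://pxe.factory.talos.dev/image/$1/$2/$3"
}

# Placeholder values: substitute your own schematic ID and Talos version.
SCHEMATIC_ID="REPLACE_WITH_SCHEMATIC_ID"
TALOS_VERSION="v1.10.1"

factory_url "$SCHEMATIC_ID" "$TALOS_VERSION" kernel-amd64
factory_url "$SCHEMATIC_ID" "$TALOS_VERSION" initramfs-amd64.xz
```

&lt;p&gt;The printed URLs can be passed straight to the &lt;code&gt;fetch -o&lt;/code&gt; commands above.&lt;/p&gt;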
&lt;hr&gt;
&lt;h2 id="gotchas"&gt;Gotchas
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;UEFI boot and NIC memory limits&lt;/strong&gt; — &lt;code&gt;ipxe.efi&lt;/code&gt; can be too large to fit in the NIC&amp;rsquo;s PXE memory buffer on some older hardware. If the UEFI chain fails silently, switch to BIOS/legacy mode and use &lt;code&gt;undionly.kpxe&lt;/code&gt; instead.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DHCP options 66/67 conflict&lt;/strong&gt; — If you have previously set DHCP options 66 (next-server) and 67 (boot file) as raw additional options, remove them. OPNSense&amp;rsquo;s built-in network boot fields handle this; having both causes conflicts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;BIOS boot order after first boot&lt;/strong&gt; — Talos writes its own bootloader on first boot. If the BIOS is set to PXE as the primary device, the machine will fall back to the PXE menu on every subsequent reboot. Set disk as the primary boot device once the cluster is up.&lt;/p&gt;</description></item><item><title>Rook + Ceph on ODEN</title><link>https://backend-engineering-strategy-tools.github.io/site/homelab/rook-ceph/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/homelab/rook-ceph/</guid><description>&lt;p&gt;Attempting to add persistent block storage to the &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/inventory/systems/" &gt;ODEN&lt;/a&gt; single-node Talos cluster using &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/kubernetes/rook/" &gt;Rook&lt;/a&gt; and &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/ceph/" &gt;Ceph&lt;/a&gt;. This did not fully succeed — the setup reached the point of a bound PVC and a working write test, but the cluster was not left in a clean stable state. Notes are here for completeness.&lt;/p&gt;
&lt;p&gt;This builds on the &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/talos-omni/" &gt;Talos cluster setup on ODEN&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="hardware"&gt;Hardware
&lt;/h2&gt;&lt;p&gt;ODEN has five storage devices:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Device&lt;/th&gt;
 &lt;th&gt;Type&lt;/th&gt;
 &lt;th&gt;Size&lt;/th&gt;
 &lt;th&gt;Role&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;/dev/sdb&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Kingston SA400S3 SSD (SATA)&lt;/td&gt;
 &lt;td&gt;120 GB&lt;/td&gt;
 &lt;td&gt;Boot disk — leave alone&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;/dev/nvme0n1&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Samsung 970 EVO NVMe&lt;/td&gt;
 &lt;td&gt;500 GB&lt;/td&gt;
 &lt;td&gt;OSD&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;/dev/sdc&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Kingston SA400S3 SSD (SATA)&lt;/td&gt;
 &lt;td&gt;120 GB&lt;/td&gt;
 &lt;td&gt;OSD&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;/dev/sdd&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Kingston SA400S3 SSD (SATA)&lt;/td&gt;
 &lt;td&gt;120 GB&lt;/td&gt;
 &lt;td&gt;OSD&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;/dev/sde&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Kingston SA400S3 SSD (SATA)&lt;/td&gt;
 &lt;td&gt;120 GB&lt;/td&gt;
 &lt;td&gt;OSD&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Do not add &lt;code&gt;/dev/sdb&lt;/code&gt; to Ceph. It is the boot disk.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="step-1--install-the-rook-operator"&gt;Step 1 — Install the Rook operator
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl apply -f https://raw.githubusercontent.com/rook/rook/refs/tags/v1.17.9/deploy/examples/crds.yaml
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl apply -f https://raw.githubusercontent.com/rook/rook/refs/tags/v1.17.9/deploy/examples/common.yaml
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl apply -f https://raw.githubusercontent.com/rook/rook/refs/tags/v1.17.9/deploy/examples/operator.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Wait for the operator pod to be running in the &lt;code&gt;rook-ceph&lt;/code&gt; namespace before continuing.&lt;/p&gt;
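&lt;p&gt;One way to block until the operator is ready, sketched under the assumption that the upstream manifests created the default &lt;code&gt;rook-ceph-operator&lt;/code&gt; Deployment. The five-minute timeout is an arbitrary choice, not a value from these notes:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-sh" data-lang="sh"&gt;# Assumes the Deployment name and label from the upstream operator.yaml; adjust if yours differ.
kubectl -n rook-ceph rollout status deployment/rook-ceph-operator --timeout=300s

# Confirm the pod itself reports Running.
kubectl -n rook-ceph get pods -l app=rook-ceph-operator
&lt;/code&gt;&lt;/pre&gt;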
&lt;hr&gt;
&lt;h2 id="step-2--cephcluster-single-node"&gt;Step 2 — CephCluster (single-node)
&lt;/h2&gt;&lt;p&gt;Single-node requires &lt;code&gt;allowMultiplePerNode: true&lt;/code&gt; and explicit disk selection. The cluster-test example from the Rook repo is a reasonable starting point:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;storage&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;useAllNodes&lt;/span&gt;: &lt;span style="color:#66d9ef"&gt;false&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;nodes&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    - &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;192.168.1.171&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;devices&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        - &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;nvme0n1&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        - &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;sdc&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        - &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;sdd&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        - &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;sde&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Reference: &lt;a class="link" href="https://github.com/rook/rook/blob/release-1.17/deploy/examples/cluster-test.yaml" target="_blank" rel="noopener"
 &gt;cluster-test.yaml&lt;/a&gt;&lt;/p&gt;
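&lt;p&gt;For orientation, a minimal sketch of where that &lt;code&gt;storage&lt;/code&gt; fragment and &lt;code&gt;allowMultiplePerNode&lt;/code&gt; sit inside a full &lt;code&gt;CephCluster&lt;/code&gt; resource. The cluster name, Ceph image tag, and &lt;code&gt;dataDirHostPath&lt;/code&gt; are assumptions modelled on the upstream example, not values recorded in these notes; check them against the referenced file before applying:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# Sketch only: fields marked "assumed" are not from the original notes.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: my-cluster                  # assumed
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v19    # assumed tag; use the one pinned in cluster-test.yaml
  dataDirHostPath: /var/lib/rook    # assumed
  mon:
    count: 1
    allowMultiplePerNode: true
  mgr:
    count: 1
    allowMultiplePerNode: true
  dashboard:
    enabled: true
  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
      - name: "192.168.1.171"
        devices:
          - name: "nvme0n1"
          - name: "sdc"
          - name: "sdd"
          - name: "sde"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Listing devices explicitly and setting &lt;code&gt;useAllDevices: false&lt;/code&gt; keeps Rook away from the boot disk.&lt;/p&gt;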
&lt;hr&gt;
&lt;h2 id="step-3--cephblockpool-and-storageclass"&gt;Step 3 — CephBlockPool and StorageClass
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;apiVersion&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;ceph.rook.io/v1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;kind&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;CephBlockPool&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;metadata&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;replicapool&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;namespace&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;rook-ceph&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;spec&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;replicated&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;size&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;apiVersion&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;storage.k8s.io/v1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;kind&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;StorageClass&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;metadata&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;rook-ceph-block&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;provisioner&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;rook-ceph.rbd.csi.ceph.com&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;parameters&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;clusterID&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;rook-ceph&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;pool&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;replicapool&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;imageFormat&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;2&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;imageFeatures&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;layering&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;reclaimPolicy&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;Delete&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="step-4--pvc-test"&gt;Step 4 — PVC test
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;apiVersion&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;v1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;kind&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;metadata&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;test-pvc&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;spec&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;accessModes&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    - &lt;span style="color:#ae81ff"&gt;ReadWriteOnce&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;storageClassName&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;rook-ceph-block&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;resources&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;requests&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;storage&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;10Gi&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;PVC reached &lt;code&gt;Bound&lt;/code&gt;. A BusyBox pod mounting it could write to &lt;code&gt;/mnt&lt;/code&gt;. The Ceph dashboard (&lt;code&gt;kubectl -n rook-ceph port-forward svc/rook-ceph-mgr-dashboard 7000:7000&lt;/code&gt;) showed OSDs active and the pool present.&lt;/p&gt;
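&lt;p&gt;The write test can be reproduced with a pod along these lines. This is a sketch rather than the manifest actually used: the pod name, image tag, and probe file name are hypothetical, and only the claim name and the &lt;code&gt;/mnt&lt;/code&gt; mount path come from the notes above:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# Sketch only: pod name, image tag, and probe file are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: test-pvc-writer
spec:
  restartPolicy: Never
  containers:
    - name: busybox
      image: busybox:1.36
      # Write a timestamp to the volume and echo it back to the pod log.
      command: ["sh", "-c", "date | tee /mnt/probe"]
      volumeMounts:
        - name: data
          mountPath: /mnt
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-pvc
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the pod completes and &lt;code&gt;kubectl logs test-pvc-writer&lt;/code&gt; prints a timestamp, the volume was provisioned, attached, and writable.&lt;/p&gt;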
&lt;hr&gt;
&lt;h2 id="what-did-not-work"&gt;What did not work
&lt;/h2&gt;&lt;p&gt;The cluster ran but was not left stable. Single-node Ceph produces health warnings by design (no redundancy, no failure domain separation). More importantly, the setup was not revisited after initial testing and there are unresolved questions about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CSI driver behaviour on Talos (Talos has specific requirements for CSI socket paths)&lt;/li&gt;
&lt;li&gt;Whether the dashboard warnings were cosmetic or indicated real issues&lt;/li&gt;
&lt;li&gt;Long-term stability under actual workloads&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is left as a draft until there is time to run it properly — ideally on more than one node.&lt;/p&gt;</description></item><item><title>Slurm</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/slurm/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/cloud-infrastructure/slurm/</guid><description>&lt;p&gt;Slurm (Simple Linux Utility for Resource Management) is a workload manager and job scheduler. It originated in HPC and is now widely used as the scheduler for ML training clusters; AWS, GCP, and Azure all offer Slurm-based HPC cluster products.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://slurm.schedmd.com/" target="_blank" rel="noopener"
 &gt;https://slurm.schedmd.com/&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="core-concepts"&gt;Core Concepts
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Node&lt;/strong&gt; — a compute host registered with Slurm. Can have CPU, GPU, and memory resources defined.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Partition&lt;/strong&gt; — a logical group of nodes (like a queue). You can have separate partitions for GPU and CPU nodes, or different priority tiers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Job&lt;/strong&gt; — a workload submitted to the queue. Slurm allocates resources, places the job, and tracks it to completion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Allocation&lt;/strong&gt; — the reserved resources (nodes, CPUs, GPUs, memory) for a running job.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="key-commands"&gt;Key Commands
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sbatch job.sh &lt;span style="color:#75715e"&gt;# submit a batch job script&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;squeue &lt;span style="color:#75715e"&gt;# view the job queue&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sacct -j &amp;lt;jobid&amp;gt; &lt;span style="color:#75715e"&gt;# job accounting / history&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sinfo &lt;span style="color:#75715e"&gt;# view partition and node state&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;scancel &amp;lt;jobid&amp;gt; &lt;span style="color:#75715e"&gt;# cancel a job&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;srun --pty bash &lt;span style="color:#75715e"&gt;# interactive allocation&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A minimal batch script:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#!/bin/bash
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --job-name=train&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --nodes=1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --gpus=1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#SBATCH --time=02:00:00&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;python train.py
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="slurm-vs-kubernetes-for-training"&gt;Slurm vs Kubernetes for Training
&lt;/h2&gt;&lt;p&gt;The fundamental difference is what each system optimises for:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt; optimises for uptime. It keeps services running, reschedules failed pods, and manages long-lived workloads. That&amp;rsquo;s the right model for inference serving, APIs, and anything that needs to stay up.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Slurm&lt;/strong&gt; optimises for utilisation. It packs jobs onto nodes as tightly as possible, queues work when resources are busy, and gets out of the way when a job finishes. That&amp;rsquo;s the right model for batch training — you want every GPU busy, not reserved for availability.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;&lt;/th&gt;
 &lt;th&gt;Slurm&lt;/th&gt;
 &lt;th&gt;Kubernetes&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Optimises for&lt;/td&gt;
 &lt;td&gt;Maximum utilisation&lt;/td&gt;
 &lt;td&gt;Uptime and availability&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Scheduling model&lt;/td&gt;
 &lt;td&gt;Job queue, batch-first&lt;/td&gt;
 &lt;td&gt;Long-running services + batch (via operators)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;GPU allocation&lt;/td&gt;
 &lt;td&gt;Native, fine-grained&lt;/td&gt;
 &lt;td&gt;Requires GPU operator + device plugin&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Multi-node training&lt;/td&gt;
 &lt;td&gt;First-class (MPI, &lt;code&gt;srun&lt;/code&gt;)&lt;/td&gt;
 &lt;td&gt;Possible via KubeFlow, PyTorchJob&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Preemption&lt;/td&gt;
 &lt;td&gt;Built-in&lt;/td&gt;
 &lt;td&gt;Requires configuration&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Operational overhead&lt;/td&gt;
 &lt;td&gt;Low on bare metal&lt;/td&gt;
 &lt;td&gt;Higher — requires cluster management&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Ecosystem&lt;/td&gt;
 &lt;td&gt;HPC, academia, major cloud HPC&lt;/td&gt;
 &lt;td&gt;ML platforms, cloud-native&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;The short version:&lt;/strong&gt; use Slurm for pure batch training on bare metal. Use Kubernetes when you&amp;rsquo;re mixing training with inference serving or need container-native workflows throughout — or when the cluster already exists.&lt;/p&gt;</description></item><item><title>Talos Linux + Omni</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/kubernetes/talos/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/kubernetes/talos/</guid><description>&lt;p&gt;Talos Linux is an immutable, minimal operating system designed specifically for running Kubernetes. There is no shell, no SSH, no package manager. The entire OS is read-only and managed via a gRPC API (&lt;code&gt;talosctl&lt;/code&gt;). Node configuration is declarative YAML applied over the API; changes that require a reboot take effect on the next boot.&lt;/p&gt;
&lt;p&gt;The tradeoff is rigidity for operational simplicity. You cannot log into a Talos node and fix something by hand. In return, nodes are deterministic, reproducible, and there is no configuration drift.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="comparison-to-other-installs"&gt;Comparison to other installs
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Method&lt;/th&gt;
 &lt;th&gt;OS&lt;/th&gt;
 &lt;th&gt;Config&lt;/th&gt;
 &lt;th&gt;Mutable&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;kubeadm&lt;/td&gt;
 &lt;td&gt;Ubuntu / RHEL / etc&lt;/td&gt;
 &lt;td&gt;Manual + scripts&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;k3s&lt;/td&gt;
 &lt;td&gt;Any Linux&lt;/td&gt;
 &lt;td&gt;Minimal&lt;/td&gt;
 &lt;td&gt;Yes&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Talos&lt;/td&gt;
 &lt;td&gt;Talos Linux&lt;/td&gt;
 &lt;td&gt;Declarative API&lt;/td&gt;
 &lt;td&gt;No&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;k3s and kubeadm give you more flexibility and a familiar Linux environment. Talos is the right choice when you want the cluster nodes to behave like appliances — provisioned, never touched.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="omni"&gt;Omni
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://omni.siderolabs.com" target="_blank" rel="noopener"
 &gt;Omni&lt;/a&gt; is a cluster management platform by Sidero Labs built on top of Talos. It handles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Node registration (nodes boot and phone home to the Omni API)&lt;/li&gt;
&lt;li&gt;Cluster creation and machine assignment&lt;/li&gt;
&lt;li&gt;Kubernetes upgrades (one action in the UI)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;talosctl&lt;/code&gt; and &lt;code&gt;kubeconfig&lt;/code&gt; access via the Omni CLI&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Nodes register via a join token embedded in the kernel command line at PXE boot time. The cluster, including its Kubernetes control plane, runs on your hardware; Omni only provides the management layer on top of it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hobby tier&lt;/strong&gt;: 10 nodes, non-commercial use, free. Sidero Labs also offers a self-hosted version.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="image-factory"&gt;Image Factory
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://factory.talos.dev" target="_blank" rel="noopener"
 &gt;factory.talos.dev&lt;/a&gt; generates custom Talos images with hardware extensions included. Notable extensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;siderolabs/bnx2&lt;/code&gt; — Broadcom NetXtreme II (BCM5708/BCM5709) NIC firmware, required on some enterprise hardware (IBM x3550 M3, HP Gen 6/7 blades)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;siderolabs/intel-ucode&lt;/code&gt; — Intel microcode updates&lt;/li&gt;
&lt;li&gt;&lt;code&gt;siderolabs/nvidia-*&lt;/code&gt; — NVIDIA GPU support&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The factory produces both ISO and PXE artifacts (kernel + initramfs). See the &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/hardware-provisioning/ipxe-opnsense/" &gt;OPNSense + iPXE reference&lt;/a&gt; for how to serve these over TFTP.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="supporting-sidero-labs"&gt;Supporting Sidero Labs
&lt;/h2&gt;&lt;p&gt;Talos and Omni are built by &lt;a class="link" href="https://github.com/siderolabs" target="_blank" rel="noopener"
 &gt;Sidero Labs&lt;/a&gt; — good people doing good work. I sponsor them via &lt;a class="link" href="https://github.com/sponsors/siderolabs" target="_blank" rel="noopener"
 &gt;GitHub Sponsors&lt;/a&gt; at the fanboi tier.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="relevant-links"&gt;Relevant links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.talos.dev/latest/" target="_blank" rel="noopener"
 &gt;Talos Linux docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://omni.siderolabs.com/docs" target="_blank" rel="noopener"
 &gt;Omni docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://factory.talos.dev" target="_blank" rel="noopener"
 &gt;Image factory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/siderolabs" target="_blank" rel="noopener"
 &gt;Sidero Labs GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/sponsors/siderolabs" target="_blank" rel="noopener"
 &gt;Sponsor Sidero Labs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Talos Linux in the homelab via Omni</title><link>https://backend-engineering-strategy-tools.github.io/site/homelab/talos-omni/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/homelab/talos-omni/</guid><description>&lt;p&gt;Getting &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/kubernetes/talos/" &gt;Talos Linux&lt;/a&gt; running in the homelab via PXE boot and &lt;a class="link" href="https://omni.siderolabs.com" target="_blank" rel="noopener"
 &gt;Omni&lt;/a&gt; — starting with &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/inventory/systems/" &gt;ODEN (SYS-005)&lt;/a&gt;, an IBM System x3550 M3. The full OPNSense + iPXE configuration lives in the &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/hardware-provisioning/ipxe-opnsense/" &gt;reference note&lt;/a&gt;; this covers what actually happened, in order.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="setup"&gt;Setup
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Hardware&lt;/strong&gt;: ODEN (SYS-005) — IBM x3550 M3, Broadcom BNX2 NICs (BCM5709)&lt;br&gt;
&lt;strong&gt;Network&lt;/strong&gt;: OPNSense router on LAN; ODEN connected via one NIC (start with one — removes variables)&lt;br&gt;
&lt;strong&gt;Target&lt;/strong&gt;: Single-node Talos cluster registered in Omni&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="step-1--opnsense-dhcp-and-tftp"&gt;Step 1 — OPNSense DHCP and TFTP
&lt;/h2&gt;&lt;p&gt;Enable network booting on the LAN DHCP server and download the iPXE binaries to the TFTP root. Full field values in the &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/hardware-provisioning/ipxe-opnsense/" &gt;iPXE reference note&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One thing to check first: if you previously set DHCP options 66 and 67 as raw additional options, remove them. OPNSense&amp;rsquo;s built-in network boot fields do the same job and having both causes conflicts.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="step-2--ipxe-boot-script"&gt;Step 2 — iPXE boot script
&lt;/h2&gt;&lt;p&gt;Write &lt;code&gt;default.ipxe&lt;/code&gt; to &lt;code&gt;/usr/local/tftp/&lt;/code&gt;. Include a boot menu with at minimum a Talos option and a shell fallback — the shell is genuinely useful when something fails and you need to debug from the boot prompt. Full script in the &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/hardware-provisioning/ipxe-opnsense/" &gt;reference note&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Talos entry in the menu needs the Omni join token from your Omni console. Generate a join link in Omni; it provides the API endpoint, token, and SideroLink addresses.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="step-3--talos-kernel-and-initramfs"&gt;Step 3 — Talos kernel and initramfs
&lt;/h2&gt;&lt;p&gt;The standard Talos release binaries do not include BNX2 firmware. Since around Talos 1.6 those drivers are available as extensions but not in the mainline image. Without them, the node boots, fails to initialise the NIC, and produces &lt;code&gt;can't load firmware bnx2&lt;/code&gt; errors — everything else looks fine until you notice the node never gets an IP and never appears in Omni.&lt;/p&gt;
&lt;p&gt;Fix: generate a custom image at &lt;a class="link" href="https://factory.talos.dev" target="_blank" rel="noopener"
 &gt;factory.talos.dev&lt;/a&gt; with the &lt;code&gt;siderolabs/bnx2&lt;/code&gt; extension included, then download the PXE kernel and initramfs from the factory URL. Commands in the &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/hardware-provisioning/ipxe-opnsense/" &gt;reference note&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="step-4--first-boot"&gt;Step 4 — First boot
&lt;/h2&gt;&lt;p&gt;Go into BIOS and set the boot device to PXE. On the M3, UEFI boot with &lt;code&gt;ipxe.efi&lt;/code&gt; fails silently — the image is too large for the NIC&amp;rsquo;s PXE memory buffer. Switch to legacy/BIOS mode and use &lt;code&gt;undionly.kpxe&lt;/code&gt; instead.&lt;/p&gt;
&lt;p&gt;The machine takes a while to POST and boot. This is normal for old enterprise hardware. It is also why demos typically use virtual machines.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="step-5--static-ip"&gt;Step 5 — Static IP
&lt;/h2&gt;&lt;p&gt;After the BNX2 fix the node boots Talos successfully but still does not appear in Omni. The DHCP assignment for the node is not being picked up during early boot. Workaround: add a static IP via kernel params in the iPXE script:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-ipxe" data-lang="ipxe"&gt;ip=192.168.1.171::192.168.1.1:255.255.255.0::eth0:off
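&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Add this to the &lt;code&gt;kernel&lt;/code&gt; line in the Talos iPXE entry. The format is &lt;code&gt;ip=&amp;lt;client-ip&amp;gt;::&amp;lt;gateway&amp;gt;:&amp;lt;netmask&amp;gt;::&amp;lt;iface&amp;gt;:off&lt;/code&gt;. The two empty fields are the server IP and the hostname, and the trailing &lt;code&gt;off&lt;/code&gt; disables automatic configuration (DHCP) on that interface.&lt;/p&gt;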
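&lt;p&gt;A rough sketch of how the parameter sits in an iPXE entry. This is not the script from the reference note: it assumes the kernel and initramfs are served from a &lt;code&gt;talos/&lt;/code&gt; directory next to &lt;code&gt;default.ipxe&lt;/code&gt;, and it leaves out the Omni join parameters, which must be copied from your own join link:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-ipxe" data-lang="ipxe"&gt;#!ipxe
# Sketch only. Append the Omni join parameters from your join link to the kernel line.
# Relative paths are resolved against the location this script was loaded from.
kernel talos/vmlinuz-omni talos.platform=metal ip=192.168.1.171::192.168.1.1:255.255.255.0::eth0:off
initrd talos/initramfs-omni.xz
boot
&lt;/code&gt;&lt;/pre&gt;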
&lt;hr&gt;
&lt;h2 id="step-6--omni-registration"&gt;Step 6 — Omni registration
&lt;/h2&gt;&lt;p&gt;With a working NIC and an IP, the node contacts the Omni API using the join token. It appears in the Omni console as an unallocated machine. Create a cluster, assign the machine, and let Omni configure it. The initial cluster bootstrap takes a few minutes.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="step-7--fix-the-bios-boot-order"&gt;Step 7 — Fix the BIOS boot order
&lt;/h2&gt;&lt;p&gt;After the cluster is up, change the BIOS boot order so the disk is first. If PXE remains the primary boot device, every reboot drops the machine back to the iPXE menu instead of booting the installed Talos. This was discovered on the first reboot and is noted here so you don&amp;rsquo;t make the same trip to the garage.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="upgrade"&gt;Upgrade
&lt;/h2&gt;&lt;p&gt;Omni makes single-node upgrades straightforward: open the cluster in the Omni console, select a new Talos version, apply. The node reboots once. Single-node means the cluster has downtime during the reboot; that is expected. Nothing else to do.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="result"&gt;Result
&lt;/h2&gt;&lt;p&gt;Single-node Kubernetes cluster running on ODEN, managed via Omni. &lt;code&gt;kubectl&lt;/code&gt; and &lt;code&gt;talosctl&lt;/code&gt; access via the Omni CLI. Next experiment: &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/homelab/rook-ceph/" &gt;Rook + Ceph&lt;/a&gt; for persistent storage.&lt;/p&gt;</description></item><item><title>System Inventory</title><link>https://backend-engineering-strategy-tools.github.io/site/homelab/inventory/systems/</link><pubDate>Wed, 13 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/homelab/inventory/systems/</guid><description>&lt;h1 id="systems"&gt;Systems
&lt;/h1&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Asset ID&lt;/th&gt;
 &lt;th&gt;Hostname&lt;/th&gt;
 &lt;th&gt;Manufacturer&lt;/th&gt;
 &lt;th&gt;Model&lt;/th&gt;
 &lt;th&gt;Form Factor&lt;/th&gt;
 &lt;th&gt;Notes&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-001&lt;/td&gt;
 &lt;td&gt;FREJA&lt;/td&gt;
 &lt;td&gt;IBM&lt;/td&gt;
 &lt;td&gt;System x3550 M1 Type 7978&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;Rack server (S/N: KDHPPNN); 1 of 2 CPU sockets populated&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-002&lt;/td&gt;
 &lt;td&gt;TYR&lt;/td&gt;
 &lt;td&gt;IBM&lt;/td&gt;
 &lt;td&gt;System x3650 M1 Type 7979&lt;/td&gt;
 &lt;td&gt;2U&lt;/td&gt;
 &lt;td&gt;Rack server&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-003&lt;/td&gt;
 &lt;td&gt;TOR&lt;/td&gt;
 &lt;td&gt;IBM&lt;/td&gt;
 &lt;td&gt;System x3650 M1 Type 7979&lt;/td&gt;
 &lt;td&gt;2U&lt;/td&gt;
 &lt;td&gt;Rack server&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-004&lt;/td&gt;
 &lt;td&gt;MIMIR&lt;/td&gt;
 &lt;td&gt;Dell&lt;/td&gt;
 &lt;td&gt;PowerVault MD1200&lt;/td&gt;
 &lt;td&gt;2U&lt;/td&gt;
 &lt;td&gt;Disk shelf&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-005&lt;/td&gt;
 &lt;td&gt;ODEN&lt;/td&gt;
 &lt;td&gt;IBM&lt;/td&gt;
 &lt;td&gt;System x3550 M3&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;Mixed DDR3 1333+1600 ECC Reg; PCIe x16 riser (FRU 43V7066)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-006&lt;/td&gt;
 &lt;td&gt;LOKE&lt;/td&gt;
 &lt;td&gt;IBM&lt;/td&gt;
 &lt;td&gt;System x3550 M3&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;M3 board in M2 chassis; no RAM; CPU unknown&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-007&lt;/td&gt;
 &lt;td&gt;ASGARD&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BladeSystem C7000&lt;/td&gt;
 &lt;td&gt;10U&lt;/td&gt;
 &lt;td&gt;Blade enclosure (holds BLADE-01 to BLADE-16)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-008&lt;/td&gt;
 &lt;td&gt;BALDER&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;ProLiant DL320 G5p&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;Dual 250GB SATA&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-009&lt;/td&gt;
 &lt;td&gt;HEIMDAL&lt;/td&gt;
 &lt;td&gt;Sun&lt;/td&gt;
 &lt;td&gt;Sun Fire X4150&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;2× Xeon E5430 (8c/8t); 4× onboard GbE; OPNsense&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-010&lt;/td&gt;
 &lt;td&gt;VIDAR&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;ProCurve 1800-24G&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;Fanless/Silent Switch (J9028A)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-011&lt;/td&gt;
 &lt;td&gt;GUNGNIR&lt;/td&gt;
 &lt;td&gt;ZyXEL&lt;/td&gt;
 &lt;td&gt;ZyWALL 110&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;Security Gateway / Firewall&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-012&lt;/td&gt;
 &lt;td&gt;BIFROST-01&lt;/td&gt;
 &lt;td&gt;Edge-Core&lt;/td&gt;
 &lt;td&gt;ECS4510-28F&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;28-Port SFP Fiber Switch&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-013&lt;/td&gt;
 &lt;td&gt;BIFROST-02&lt;/td&gt;
 &lt;td&gt;Edge-Core&lt;/td&gt;
 &lt;td&gt;ECS4510-28F&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;28-Port SFP Fiber Switch&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-014&lt;/td&gt;
 &lt;td&gt;MODI&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;V1910-24G-PoE&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;365W PoE Switch (JE007A)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-015&lt;/td&gt;
 &lt;td&gt;MAGNI&lt;/td&gt;
 &lt;td&gt;Cisco&lt;/td&gt;
 &lt;td&gt;Catalyst 2960G&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;24-Port Managed Gig Switch&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-016&lt;/td&gt;
 &lt;td&gt;VALI&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;ProCurve 1800-24G&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;Fanless/Silent Switch (J9028A)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-017&lt;/td&gt;
 &lt;td&gt;RATATOSK&lt;/td&gt;
 &lt;td&gt;Avocent&lt;/td&gt;
 &lt;td&gt;KVM Switch&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;Rackmount KVM&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-018&lt;/td&gt;
 &lt;td&gt;SURTR-01&lt;/td&gt;
 &lt;td&gt;APC&lt;/td&gt;
 &lt;td&gt;Back-UPS CS 650&lt;/td&gt;
 &lt;td&gt;Desktop&lt;/td&gt;
 &lt;td&gt;UPS Unit 1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-019&lt;/td&gt;
 &lt;td&gt;SURTR-02&lt;/td&gt;
 &lt;td&gt;APC&lt;/td&gt;
 &lt;td&gt;Back-UPS CS 650&lt;/td&gt;
 &lt;td&gt;Desktop&lt;/td&gt;
 &lt;td&gt;UPS Unit 2&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-020&lt;/td&gt;
 &lt;td&gt;MUNINN&lt;/td&gt;
 &lt;td&gt;Cisco&lt;/td&gt;
 &lt;td&gt;Catalyst 2960 WS-C2960-24TC-L&lt;/td&gt;
 &lt;td&gt;1U&lt;/td&gt;
 &lt;td&gt;24× 10/100 + 2× dual-purpose GbE uplinks (RJ45 or SFP)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-021&lt;/td&gt;
 &lt;td&gt;BIFROST&lt;/td&gt;
 &lt;td&gt;Raspberry Pi&lt;/td&gt;
 &lt;td&gt;Raspberry Pi 1 Model B&lt;/td&gt;
 &lt;td&gt;SBC&lt;/td&gt;
 &lt;td&gt;Jump node; Raspbian; port-forward 22222→22; rack-mounted&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-022&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;td&gt;Raspberry Pi&lt;/td&gt;
 &lt;td&gt;Raspberry Pi 1 Model B&lt;/td&gt;
 &lt;td&gt;SBC&lt;/td&gt;
 &lt;td&gt;Spare&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;SYS-023&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;td&gt;Raspberry Pi&lt;/td&gt;
 &lt;td&gt;Raspberry Pi 1 Model B&lt;/td&gt;
 &lt;td&gt;SBC&lt;/td&gt;
 &lt;td&gt;Spare&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h1 id="blade-nodes-inside-asgard"&gt;Blade Nodes (Inside ASGARD)
&lt;/h1&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Asset ID&lt;/th&gt;
 &lt;th&gt;Hostname&lt;/th&gt;
 &lt;th&gt;Manufacturer&lt;/th&gt;
 &lt;th&gt;Model&lt;/th&gt;
 &lt;th&gt;Slot&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-001&lt;/td&gt;
 &lt;td&gt;BLADE-01&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-002&lt;/td&gt;
 &lt;td&gt;BLADE-02&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;2&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-003&lt;/td&gt;
 &lt;td&gt;BLADE-03&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;3&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-004&lt;/td&gt;
 &lt;td&gt;BLADE-04&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;4&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-005&lt;/td&gt;
 &lt;td&gt;BLADE-05&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;5&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-006&lt;/td&gt;
 &lt;td&gt;BLADE-06&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;6&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-007&lt;/td&gt;
 &lt;td&gt;BLADE-07&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;7&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-008&lt;/td&gt;
 &lt;td&gt;BLADE-08&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;8&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-009&lt;/td&gt;
 &lt;td&gt;BLADE-09&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;9&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-010&lt;/td&gt;
 &lt;td&gt;BLADE-10&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;10&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-011&lt;/td&gt;
 &lt;td&gt;BLADE-11&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;11&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-012&lt;/td&gt;
 &lt;td&gt;BLADE-12&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;12&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-013&lt;/td&gt;
 &lt;td&gt;BLADE-13&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;13&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-014&lt;/td&gt;
 &lt;td&gt;BLADE-14&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;14&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-015&lt;/td&gt;
 &lt;td&gt;BLADE-15&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;15&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;BLD-016&lt;/td&gt;
 &lt;td&gt;BLADE-16&lt;/td&gt;
 &lt;td&gt;HP&lt;/td&gt;
 &lt;td&gt;BL460c Gen8&lt;/td&gt;
 &lt;td&gt;16&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h1 id="system-overviews"&gt;System Overviews
&lt;/h1&gt;&lt;p&gt;Brief overviews of selected systems, covering their typical roles and notable features.&lt;/p&gt;
&lt;h3 id="ibm-system-x3550-type-7978--x3650-type-7979-series--x3550-overview--x3650-overview"&gt;IBM System x3550 Type 7978 / x3650 Type 7979 Series — &lt;a class="link" href="https://www.ibm.com/support/pages/overview-ibm-system-x3550-type-7978" target="_blank" rel="noopener"
 &gt;x3550 overview&lt;/a&gt; · &lt;a class="link" href="https://www.ibm.com/support/pages/overview-ibm-system-x3650-type-1914-7979" target="_blank" rel="noopener"
 &gt;x3650 overview&lt;/a&gt;
&lt;/h3&gt;&lt;p&gt;&lt;em&gt;1U (x3550) / 2U (x3650) · dual Xeon (Clovertown/Harpertown) · DDR2 ECC FBDIMM up to 32GB · SAS/SATA&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;These were enterprise-grade rack servers, popular in the late 2000s, powered by Intel Xeon processors of the Core microarchitecture generations (e.g., Clovertown, Harpertown). The x3550 is a compact 1U server, ideal for general-purpose computing, while the x3650 is a 2U model offering greater expansion capabilities for storage or PCIe cards. They served as reliable workhorses for various data center applications, including virtualization and database hosting.&lt;/p&gt;
&lt;h3 id="hp-bladesystem-c7000--quickspecs--bl460c-gen8-quickspecs"&gt;HP BladeSystem C7000 — &lt;a class="link" href="https://www.hpe.com/psnow/doc/c04128339" target="_blank" rel="noopener"
 &gt;QuickSpecs&lt;/a&gt; · &lt;a class="link" href="https://www.hpe.com/psnow/doc/c04123239" target="_blank" rel="noopener"
 &gt;BL460c Gen8 QuickSpecs&lt;/a&gt;
&lt;/h3&gt;&lt;p&gt;&lt;em&gt;10U · up to 16 half-height blades · shared power/cooling/networking via backplane · Onboard Administrator&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The C7000 is a substantial 10U blade enclosure designed to host up to 16 server blades, along with storage blades and integrated networking/management modules. It provides a consolidated infrastructure for power, cooling, and network connectivity, significantly simplifying cable management and enabling high-density computing environments. These systems were foundational for many enterprise virtualization platforms.&lt;/p&gt;
&lt;p&gt;The BL460c Gen8 blades have onboard LOM providing 1GbE connectivity. No mezzanine cards are currently installed — 10GbE requires FlexibleLOM adapters.&lt;/p&gt;
&lt;h3 id="sun-fire-x4150"&gt;Sun Fire X4150
&lt;/h3&gt;&lt;p&gt;&lt;em&gt;1U · dual Xeon (Harpertown) · 16 DIMM slots · 4× GbE network interfaces&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The Sun Fire X4150 is a 1U rackmount server from Sun Microsystems built around dual Intel Xeon processors (two E5430s in this unit). Sun&amp;rsquo;s x86 server line was recognized for its build quality and integration, often running Solaris or Linux. I use it as a dedicated firewall / network appliance running OPNsense, with the four onboard GbE ports handling network security and routing tasks.&lt;/p&gt;
&lt;h3 id="dell-powervault-md1200--specs"&gt;Dell PowerVault MD1200 — &lt;a class="link" href="https://www.dell.com/support/kbdoc/en-us/000124452/dell-powervault-md1200-md1220-direct-attached-storage" target="_blank" rel="noopener"
 &gt;specs&lt;/a&gt;
&lt;/h3&gt;&lt;p&gt;&lt;em&gt;2U DAS · 12× LFF (3.5&amp;quot;) hot-swap SAS/SATA bays · 6Gb/s SAS · up to 24TB raw&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The PowerVault MD1200 is a direct-attached storage (DAS) enclosure, designed to expand the storage capacity of compatible servers (such as Dell PowerEdge servers or others equipped with suitable SAS HBAs). This 2U unit can accommodate up to 12 LFF (3.5-inch) SAS/SATA drives, providing an expandable and cost-effective solution for adding raw storage to a homelab environment.&lt;/p&gt;
&lt;h3 id="zyxel-zywall-110"&gt;ZyXEL ZyWALL 110
&lt;/h3&gt;&lt;p&gt;&lt;em&gt;2× GbE WAN · 4× GbE LAN · VPN gateway · IPS/IDS&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The ZyWALL 110 is a professional-grade security gateway and VPN firewall. It delivers comprehensive network security features, including intrusion prevention, content filtering, and strong VPN capabilities. This appliance is well-suited for establishing a secure perimeter for a homelab network or segmenting different network environments for enhanced control and protection. However, since I don&amp;rsquo;t have a license for it, it is currently not in use.&lt;/p&gt;</description></item><item><title>Hardware Provisioning: PXE Booting and Tooling</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/hardware-provisioning/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/hardware/hardware-provisioning/</guid><description>&lt;p&gt;When moving beyond manual installs, managing hardware lifecycle through PXE (Preboot Execution Environment) becomes essential. This note is a breakdown of common tools for automating the &amp;ldquo;power-on to OS ready&amp;rdquo; process.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="common-starting-points"&gt;Common starting points
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Tool&lt;/th&gt;
 &lt;th&gt;Focus&lt;/th&gt;
 &lt;th&gt;Complexity&lt;/th&gt;
 &lt;th&gt;Best for&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://cobbler.github.io/" target="_blank" rel="noopener"
 &gt;Cobbler&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;PXE/repo server&lt;/td&gt;
 &lt;td&gt;Low–Medium&lt;/td&gt;
 &lt;td&gt;Stable, static environments needing reliable Kickstart or preseed installs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://theforeman.org/" target="_blank" rel="noopener"
 &gt;Foreman&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;Full lifecycle mgmt&lt;/td&gt;
 &lt;td&gt;High&lt;/td&gt;
 &lt;td&gt;Single pane of glass for provisioning + ongoing config management (Puppet/Ansible)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://rebar.digital/" target="_blank" rel="noopener"
 &gt;Digital Rebar&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;Infrastructure-as-Code&lt;/td&gt;
 &lt;td&gt;Medium&lt;/td&gt;
 &lt;td&gt;Modern DevOps teams wanting cloud-like speed on physical gear; evolved from Crowbar&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://wiki.openstack.org/wiki/Ironic" target="_blank" rel="noopener"
 &gt;Ironic&lt;/a&gt; / &lt;a class="link" href="https://docs.openstack.org/bifrost/latest/" target="_blank" rel="noopener"
 &gt;Bifrost&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;BMaaS / scale&lt;/td&gt;
 &lt;td&gt;High&lt;/td&gt;
 &lt;td&gt;Bare Metal as a Service at scale; Bifrost runs Ironic standalone without full OpenStack&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id="broader-landscape"&gt;Broader landscape
&lt;/h2&gt;&lt;h3 id="classic-pxe--provisioning"&gt;Classic PXE / Provisioning
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Tool&lt;/th&gt;
 &lt;th&gt;Type&lt;/th&gt;
 &lt;th&gt;Strengths&lt;/th&gt;
 &lt;th&gt;Weaknesses&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Cobbler&lt;/td&gt;
 &lt;td&gt;PXE provisioning server&lt;/td&gt;
 &lt;td&gt;Simple, mature, easy to understand&lt;/td&gt;
 &lt;td&gt;Old architecture, static workflows&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Foreman&lt;/td&gt;
 &lt;td&gt;Lifecycle/provisioning platform&lt;/td&gt;
 &lt;td&gt;Powerful, enterprise-capable, large ecosystem&lt;/td&gt;
 &lt;td&gt;Heavy footprint, Rails monolith&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Uyuni&lt;/td&gt;
 &lt;td&gt;Systems management&lt;/td&gt;
 &lt;td&gt;Enterprise lifecycle management (SUSE/Spacewalk lineage)&lt;/td&gt;
 &lt;td&gt;Less modern provisioning architecture&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="dynamic--policy-driven"&gt;Dynamic / Policy-Driven
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Tool&lt;/th&gt;
 &lt;th&gt;Type&lt;/th&gt;
 &lt;th&gt;Strengths&lt;/th&gt;
 &lt;th&gt;Weaknesses&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Razor&lt;/td&gt;
 &lt;td&gt;Policy-driven provisioning&lt;/td&gt;
 &lt;td&gt;Dynamic node discovery, elegant lifecycle model&lt;/td&gt;
 &lt;td&gt;Effectively dormant&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Digital Rebar&lt;/td&gt;
 &lt;td&gt;Workflow provisioning platform&lt;/td&gt;
 &lt;td&gt;Architecturally modern and flexible&lt;/td&gt;
 &lt;td&gt;Partially commercialized&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="cloud--hyperscale-bare-metal"&gt;Cloud / Hyperscale Bare Metal
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Tool&lt;/th&gt;
 &lt;th&gt;Type&lt;/th&gt;
 &lt;th&gt;Strengths&lt;/th&gt;
 &lt;th&gt;Weaknesses&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Ironic&lt;/td&gt;
 &lt;td&gt;OpenStack bare-metal service&lt;/td&gt;
 &lt;td&gt;Extremely scalable, API-driven&lt;/td&gt;
 &lt;td&gt;High operational complexity&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Bifrost&lt;/td&gt;
 &lt;td&gt;Standalone Ironic deployment&lt;/td&gt;
 &lt;td&gt;Easier entry into Ironic ecosystem&lt;/td&gt;
 &lt;td&gt;Inherits Ironic complexity&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;MAAS&lt;/td&gt;
 &lt;td&gt;Bare metal cloud platform&lt;/td&gt;
 &lt;td&gt;Excellent UX, API-first, machine discovery&lt;/td&gt;
 &lt;td&gt;Larger footprint, Ubuntu-centric&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="kubernetes-native--cloud-native"&gt;Kubernetes-Native / Cloud-Native
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Tool&lt;/th&gt;
 &lt;th&gt;Type&lt;/th&gt;
 &lt;th&gt;Strengths&lt;/th&gt;
 &lt;th&gt;Weaknesses&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Tinkerbell&lt;/td&gt;
 &lt;td&gt;Cloud-native provisioning&lt;/td&gt;
 &lt;td&gt;Modern architecture, composable workflows&lt;/td&gt;
 &lt;td&gt;Microservice complexity&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Metal3&lt;/td&gt;
 &lt;td&gt;Kubernetes operator&lt;/td&gt;
 &lt;td&gt;Native Kubernetes integration&lt;/td&gt;
 &lt;td&gt;Requires Kubernetes infrastructure&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Omni&lt;/td&gt;
 &lt;td&gt;Talos cluster orchestration&lt;/td&gt;
 &lt;td&gt;Very modern UX and lifecycle management&lt;/td&gt;
 &lt;td&gt;Talos/Kubernetes specific&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Matchbox&lt;/td&gt;
 &lt;td&gt;Minimal PXE/ignition service&lt;/td&gt;
 &lt;td&gt;Elegant, simple, iPXE-first&lt;/td&gt;
 &lt;td&gt;Narrow immutable-infra focus&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="boot-infrastructure--pxe-utilities"&gt;Boot Infrastructure / PXE Utilities
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Tool&lt;/th&gt;
 &lt;th&gt;Type&lt;/th&gt;
 &lt;th&gt;Strengths&lt;/th&gt;
 &lt;th&gt;Weaknesses&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;iPXE&lt;/td&gt;
 &lt;td&gt;Network boot firmware&lt;/td&gt;
 &lt;td&gt;Flexible, fast, programmable (HTTP + scripting)&lt;/td&gt;
 &lt;td&gt;Requires infrastructure around it&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;netboot.xyz&lt;/td&gt;
 &lt;td&gt;Dynamic network boot menu&lt;/td&gt;
 &lt;td&gt;Extremely useful and lightweight&lt;/td&gt;
 &lt;td&gt;Not a provisioning orchestrator&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id="architectural-styles"&gt;Architectural Styles
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Style&lt;/th&gt;
 &lt;th&gt;Example Tools&lt;/th&gt;
 &lt;th&gt;Characteristics&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Static config-driven&lt;/td&gt;
 &lt;td&gt;Cobbler&lt;/td&gt;
 &lt;td&gt;Profiles + templates + PXE configs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Policy/state-driven&lt;/td&gt;
 &lt;td&gt;Razor, Digital Rebar&lt;/td&gt;
 &lt;td&gt;Nodes discovered dynamically, assigned via policies&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Cloud resource model&lt;/td&gt;
 &lt;td&gt;Ironic, MAAS&lt;/td&gt;
 &lt;td&gt;Bare metal treated as cloud infrastructure&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Kubernetes-native&lt;/td&gt;
 &lt;td&gt;Tinkerbell, Metal3&lt;/td&gt;
 &lt;td&gt;Bare metal managed via Kubernetes APIs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Immutable OS orchestration&lt;/td&gt;
 &lt;td&gt;Omni, Matchbox&lt;/td&gt;
 &lt;td&gt;Minimal provisioning around immutable operating systems&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id="the-gap"&gt;The Gap
&lt;/h2&gt;&lt;p&gt;There is still no widely adopted FOSS solution that is simultaneously:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;lightweight&lt;/li&gt;
&lt;li&gt;modern&lt;/li&gt;
&lt;li&gt;self-hostable&lt;/li&gt;
&lt;li&gt;API-first&lt;/li&gt;
&lt;li&gt;iPXE-native&lt;/li&gt;
&lt;li&gt;distro-agnostic&lt;/li&gt;
&lt;li&gt;easy to operate&lt;/li&gt;
&lt;li&gt;single-binary deployable&lt;/li&gt;
&lt;li&gt;workflow-capable&lt;/li&gt;
&lt;li&gt;not tied to Kubernetes/OpenStack&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most existing systems drift toward enterprise complexity, cloud platform assumptions, Kubernetes dependency, immutable OS specialization, or monolithic lifecycle management.&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;&amp;ldquo;A modern lightweight provisioning orchestrator for reproducible bare-metal infrastructure.&amp;rdquo;&lt;/p&gt;

 &lt;/blockquote&gt;</description></item><item><title>Kubernetes Across the Stack</title><link>https://backend-engineering-strategy-tools.github.io/site/projects/kubernetes-stack/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/projects/kubernetes-stack/</guid><description>&lt;p&gt;A documented comparison of running Kubernetes across every major hosting model — cloud managed, self-managed on cloud, private cloud, and bare metal at home. The goal is an honest, practical reference for each environment: what it costs you in time and money, where the rough edges are, and how the networking story differs between them.&lt;/p&gt;
&lt;p&gt;The thread running through all of it is &lt;a class="link" href="https://www.talos.dev/" target="_blank" rel="noopener"
 &gt;Talos Linux&lt;/a&gt; — an immutable, API-driven OS built specifically for Kubernetes. No SSH, no shell, no config drift. The same OS everywhere means the operational model stays consistent regardless of what is running underneath.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Environment&lt;/th&gt;
 &lt;th&gt;Approach&lt;/th&gt;
 &lt;th&gt;&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;OpenStack — &lt;a class="link" href="https://cleura.com/" target="_blank" rel="noopener"
 &gt;Cleura&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;Talos &amp;amp; Terraform&lt;/td&gt;
 &lt;td&gt;draft exists&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;OpenStack — &lt;a class="link" href="https://cleura.com/" target="_blank" rel="noopener"
 &gt;Cleura&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;Talos, with Omni&lt;/td&gt;
 &lt;td&gt;maybe?&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;OpenStack — &lt;a class="link" href="https://elastx.se/" target="_blank" rel="noopener"
 &gt;ElastX&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;Talos &amp;amp; Terraform&lt;/td&gt;
 &lt;td&gt;draft exists&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;OpenStack — &lt;a class="link" href="https://elastx.se/" target="_blank" rel="noopener"
 &gt;ElastX&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;Talos, with Omni&lt;/td&gt;
 &lt;td&gt;maybe?&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Homelab — bare metal&lt;/td&gt;
 &lt;td&gt;Talos + Pixieboot + Omni&lt;/td&gt;
 &lt;td&gt;draft exists&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Homelab — bare metal&lt;/td&gt;
 &lt;td&gt;Talos + Pixieboot without Omni&lt;/td&gt;
 &lt;td&gt;maybe?&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Homelab — OpenStack&lt;/td&gt;
 &lt;td&gt;OpenStack on bare metal, Talos running on top&lt;/td&gt;
 &lt;td&gt;&lt;em&gt;(stretch)&lt;/em&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Homelab — OpenStack&lt;/td&gt;
 &lt;td&gt;Talos on bare metal, OpenStack inside cluster&lt;/td&gt;
 &lt;td&gt;&lt;em&gt;(stretch)&lt;/em&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;AWS&lt;/td&gt;
 &lt;td&gt;Talos on EC2&lt;/td&gt;
 &lt;td&gt;&lt;em&gt;(stretch)&lt;/em&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Azure&lt;/td&gt;
 &lt;td&gt;Talos on VMs&lt;/td&gt;
 &lt;td&gt;&lt;em&gt;(stretch)&lt;/em&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;GCP&lt;/td&gt;
 &lt;td&gt;Talos on Compute Engine&lt;/td&gt;
 &lt;td&gt;&lt;em&gt;(stretch)&lt;/em&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Stretch goals&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;AWS, Azure, GCP — same Talos approach, different underlying infrastructure. Interesting eventually, but not the priority.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Omni&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://omni.siderolabs.com/" target="_blank" rel="noopener"
 &gt;Omni&lt;/a&gt; is Sidero&amp;rsquo;s managed control plane for Talos clusters — worth documenting both with and without it. Without Omni gives you the full picture of what Talos management looks like manually; with Omni shows what the managed layer buys you.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Homelab provisioning&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Nodes provisioned via Pixieboot — no USB sticks, no manual installations. A node powers on, boots from the network, and registers. The goal is a fully reproducible cluster from scratch with minimal human steps.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cluster provisioning and bootstrap for each environment&lt;/li&gt;
&lt;li&gt;Networking — CNI choices, ingress, cross-cluster connectivity&lt;/li&gt;
&lt;li&gt;Storage — what you get managed vs what you have to bring yourself&lt;/li&gt;
&lt;li&gt;Operational differences — upgrades, node management, observability&lt;/li&gt;
&lt;li&gt;Cost and trade-off summary across environments&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Making it usable&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Getting a cluster running is the easy part. Making it usable is where environments diverge. Each environment needs an answer for ingress, DNS, and storage — and the answer varies significantly depending on what the underlying platform provides.&lt;/p&gt;
&lt;p&gt;On managed cloud you can lean on load balancers and block storage from the provider. On OpenStack you have those options if the provider exposes them. On bare metal at home you are on your own — MetalLB or similar for load balancer IPs, a local DNS solution, and either local storage or something like Rook/Ceph. Same Kubernetes, very different operational story underneath.&lt;/p&gt;
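&lt;p&gt;As a rough illustration of the bare-metal case, a minimal MetalLB layer-2 setup could look something like the fragment below. This is a sketch under assumptions, not the configuration used in this project: the resource names and the address range are hypothetical, and the range has to be a block of free addresses on your own LAN that the DHCP server does not hand out. It also assumes MetalLB is already installed in the &lt;code&gt;metallb-system&lt;/code&gt; namespace.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-yaml"&gt;# Pool of addresses MetalLB may assign to Services of type LoadBalancer.
# The range is a placeholder; replace it with unused addresses on your LAN.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: homelab-pool          # hypothetical name
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250
---
# Announce addresses from the pool on the local network in layer-2 mode.
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: homelab-l2            # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
    - homelab-pool
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With something like this applied, a Service of type &lt;code&gt;LoadBalancer&lt;/code&gt; should receive an address from the pool, which is roughly the role a provider load balancer plays on managed cloud.&lt;/p&gt;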
&lt;p&gt;&lt;em&gt;Notes exist in various states — pulling them together, testing, and documenting properly is the work.&lt;/em&gt;&lt;/p&gt;</description></item></channel></rss>