<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Observability on Backend Engineering Strategy Tools</title><link>https://backend-engineering-strategy-tools.github.io/site/tags/observability/</link><description>Recent content in Observability on Backend Engineering Strategy Tools</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Mon, 01 Jan 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://backend-engineering-strategy-tools.github.io/site/tags/observability/index.xml" rel="self" type="application/rss+xml"/><item><title>Elasticsearch &amp; Kibana</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/frameworks-tools/elk/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/frameworks-tools/elk/</guid><description>&lt;p&gt;Elasticsearch is a distributed search and analytics engine built on Apache Lucene. Kibana is its web UI for querying, visualising, and exploring the data stored in Elasticsearch. Together they form the search and analysis layer of the ELK stack — typically with Logstash or Beats collecting and shipping data into Elasticsearch, and Kibana on top for humans to interact with it.&lt;/p&gt;
&lt;h2 id="elasticsearch"&gt;Elasticsearch
&lt;/h2&gt;&lt;p&gt;A document store where every document is JSON and every field is indexed by default. Queries are also JSON, using a rich query DSL that supports full-text search, structured filters, aggregations, and geospatial queries. Elasticsearch is horizontally scalable — an index is split into shards, shards are distributed across nodes, and replicas provide redundancy. Adding nodes increases both capacity and query throughput.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Index a document&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;POST /logs/_doc
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;&amp;#34;timestamp&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;2026-06-04T12:00:00Z&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;&amp;#34;level&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;error&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;&amp;#34;service&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;api&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;&amp;#34;message&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;connection refused&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Search with filter&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;GET /logs/_search
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;&amp;#34;query&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;&amp;#34;bool&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;&amp;#34;filter&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; { &lt;span style="color:#f92672"&gt;&amp;#34;term&amp;#34;&lt;/span&gt;: { &lt;span style="color:#f92672"&gt;&amp;#34;level&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;error&amp;#34;&lt;/span&gt; } },
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; { &lt;span style="color:#f92672"&gt;&amp;#34;term&amp;#34;&lt;/span&gt;: { &lt;span style="color:#f92672"&gt;&amp;#34;service&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;api&amp;#34;&lt;/span&gt; } }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; ]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;At scale, index lifecycle management (ILM) policies handle the hot-warm-cold tiering automatically — recent indices stay on fast nodes, older indices roll to cheaper storage, and expired indices are deleted.&lt;/p&gt;
&lt;h2 id="kibana"&gt;Kibana
&lt;/h2&gt;&lt;p&gt;The interface to Elasticsearch. Kibana&amp;rsquo;s core is two apps: &lt;strong&gt;Discover&lt;/strong&gt;, a time-series log explorer with free-text search and field filtering, and &lt;strong&gt;Dashboards&lt;/strong&gt;, composable visualisations (time series, bar charts, pie charts, data tables, maps) that query Elasticsearch directly. For log aggregation and observability use cases, a typical workflow is: ship logs into Elasticsearch via Filebeat or Logstash, explore them in Discover, build dashboards for the signals that matter, and set up alerting rules on those patterns.&lt;/p&gt;
&lt;p&gt;Kibana also hosts the Elastic APM UI (application performance monitoring), the SIEM app (security event correlation), and the Lens visual editor for building dashboards without writing aggregation queries by hand.&lt;/p&gt;
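&lt;p&gt;As a rough illustration of the log-shipping step in the workflow described above, a minimal Filebeat configuration might look like the sketch below. The input id, log path, and Elasticsearch host are placeholder assumptions, not values from this note, and authentication and TLS settings are omitted.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# filebeat.yml (sketch; id, path, and host are hypothetical)
filebeat.inputs:
  - type: filestream        # tails files line by line
    id: app-logs
    paths:
      - /var/log/app/*.log

output.elasticsearch:
  hosts: [&amp;#34;https://elasticsearch.example.internal:9200&amp;#34;]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Once documents arrive, they appear in Discover under whichever data view matches the target index.&lt;/p&gt;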
&lt;h2 id="elk-vs-the-grafana-stack"&gt;ELK vs the Grafana stack
&lt;/h2&gt;&lt;p&gt;The Grafana stack (&lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/loki/" &gt;Loki&lt;/a&gt; + &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/prometheus/" &gt;Prometheus&lt;/a&gt; + &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/grafana/" &gt;Grafana&lt;/a&gt;) has become the common alternative for cloud-native environments. The key difference: Loki indexes only log metadata (labels), not the full log content — it is cheaper to run and query at scale, but full-text search across log bodies is slower. Elasticsearch indexes everything and full-text search is fast, but the storage and memory cost is significantly higher. For log volumes in the hundreds of GB/day and above, the operational cost of Elasticsearch becomes the dominant factor. For environments that need fast full-text search across structured and unstructured data — logs, documents, events — Elasticsearch earns its cost.&lt;/p&gt;
&lt;h2 id="resources"&gt;Resources
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.elastic.co/docs/solutions/search" target="_blank" rel="noopener"
 &gt;Elasticsearch documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.elastic.co/docs/solutions/observability" target="_blank" rel="noopener"
 &gt;Kibana documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.elastic.co/guide/en/ecs/current/" target="_blank" rel="noopener"
 &gt;Elastic common schema (ECS)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Grafana</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/grafana/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/grafana/</guid><description>&lt;p&gt;Prometheus shows you the spike. It tells you memory climbed at 14:32, error rate crossed 5% at 14:35, and latency hit 2 seconds at 14:37. But raw PromQL results are numbers in a table. You cannot see the shape of an incident in a table. You cannot hand a table to a product manager and explain what happened.&lt;/p&gt;
&lt;p&gt;So you use Grafana. It connects to Prometheus (and Loki, and a dozen other data sources) and turns those numbers into dashboards. You see the spike, the timeline, the correlation between services — all on one screen.&lt;/p&gt;
&lt;h2 id="data-sources"&gt;Data sources
&lt;/h2&gt;&lt;p&gt;Grafana is a visualisation layer, not a storage layer. It queries data sources and renders the results. In a Kubernetes observability stack, a typical setup looks like this:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Data source&lt;/th&gt;
 &lt;th&gt;What it provides&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Prometheus&lt;/td&gt;
 &lt;td&gt;Metrics — CPU, memory, request rates, error rates, latency&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Loki&lt;/td&gt;
 &lt;td&gt;Logs — searchable, filterable, correlated with metrics by time&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Jaeger / Tempo&lt;/td&gt;
 &lt;td&gt;Traces — individual request journeys across services&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Adding a data source takes a few fields in the UI, or a provisioning file mounted from a ConfigMap if you manage Grafana as code.&lt;/p&gt;
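&lt;p&gt;For the as-code route, a provisioning file along these lines is one way to declare the Prometheus and Loki data sources. This is a sketch: the service URLs are assumptions about in-cluster DNS names and will differ per installation.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# datasources.yaml, read from /etc/grafana/provisioning/datasources/
# (sketch; the URLs below are hypothetical in-cluster addresses)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy           # Grafana's backend makes the requests
    url: http://prometheus-server.monitoring.svc:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.monitoring.svc:3100
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Grafana reads this file at startup, so the data sources exist before anyone opens the UI.&lt;/p&gt;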
&lt;h2 id="dashboards"&gt;Dashboards
&lt;/h2&gt;&lt;p&gt;A dashboard is a collection of panels. Each panel runs a query against a data source and renders the result as a graph, gauge, stat, table, or heatmap.&lt;/p&gt;
&lt;p&gt;The fastest way to get useful dashboards is &lt;a class="link" href="https://grafana.com/grafana/dashboards/" target="_blank" rel="noopener"
 &gt;grafana.com/grafana/dashboards&lt;/a&gt; — a library of community dashboards for almost every common component. Import by ID:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;1860&lt;/strong&gt; — Node Exporter Full (host metrics: CPU, memory, disk, network)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;6417&lt;/strong&gt; — Kubernetes cluster overview&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;7362&lt;/strong&gt; — MySQL overview&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;9628&lt;/strong&gt; — Postgres overview&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Import these on day one and you have coverage before writing a single PromQL query.&lt;/p&gt;
&lt;h2 id="variables"&gt;Variables
&lt;/h2&gt;&lt;p&gt;Dashboard variables make panels reusable across namespaces, clusters, or services. A variable populated from a Prometheus label query:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;label_values(kube_pod_info{namespace=~&amp;#34;.+&amp;#34;}, namespace)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Now every panel can use &lt;code&gt;$namespace&lt;/code&gt; in its query, and a dropdown at the top of the dashboard filters the whole view.&lt;/p&gt;
&lt;h2 id="alerting"&gt;Alerting
&lt;/h2&gt;&lt;p&gt;Grafana has its own alert engine that evaluates queries on a schedule and routes alerts through contact points (Slack, PagerDuty, email). For Kubernetes setups already using Alertmanager, it is usually cleaner to define alert rules in Prometheus and use Grafana purely for visualisation — one place for alert rules, not two.&lt;/p&gt;
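&lt;p&gt;A Prometheus-side alert rule could be sketched as follows. The metric name &lt;code&gt;http_requests_total&lt;/code&gt; and its &lt;code&gt;status&lt;/code&gt; label are assumptions about how the application is instrumented, and the threshold and durations are illustrative only.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# rules.yaml (sketch; metric and label names are hypothetical)
groups:
  - name: api-availability
    rules:
      - alert: HighErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~&amp;#34;5..&amp;#34;}[5m]))
            / sum(rate(http_requests_total[5m])) &amp;gt; 0.05
        for: 5m               # must hold for 5 minutes before firing
        labels:
          severity: page
        annotations:
          summary: 5xx error rate above 5% for 5 minutes
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Alertmanager then routes on the &lt;code&gt;severity&lt;/code&gt; label, and Grafana only needs to chart the same expression.&lt;/p&gt;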
&lt;h2 id="managing-grafana-as-code"&gt;Managing Grafana as code
&lt;/h2&gt;&lt;p&gt;Dashboards built in the UI are fragile — they live in a database and disappear if you rebuild the stack. Two better approaches:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Grafana provisioning&lt;/strong&gt; — mount dashboard JSON files via ConfigMap. Grafana loads them on startup and they survive restarts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Grafonnet / Jsonnet&lt;/strong&gt; — generate dashboard JSON programmatically. Verbose but version-controllable and reviewable in pull requests.&lt;/p&gt;
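&lt;p&gt;For the provisioning approach, a dashboard provider file roughly like the one below tells Grafana where to find the mounted JSON files. The provider name and the mount path are assumptions; the path must match wherever the ConfigMap is mounted in the Grafana pod.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# dashboards.yaml, read from /etc/grafana/provisioning/dashboards/
# (sketch; provider name and path are hypothetical)
apiVersion: 1
providers:
  - name: default
    type: file
    disableDeletion: true     # keep dashboards even if a file goes missing
    allowUiUpdates: false     # the files stay the source of truth
    options:
      path: /var/lib/grafana/dashboards
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;With UI updates disabled, every dashboard change goes through the repository that holds the JSON.&lt;/p&gt;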
&lt;h2 id="the-observability-trio"&gt;The observability trio
&lt;/h2&gt;&lt;p&gt;Grafana is the front end for the full observability stack:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prometheus&lt;/strong&gt; — something is wrong, here are the numbers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Loki&lt;/strong&gt; — here are the log lines from that time window&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Jaeger&lt;/strong&gt; — here is the exact request that failed and where it slowed down&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each answers a different question. Grafana is where you look at all three in one place.&lt;/p&gt;
&lt;h2 id="resources"&gt;Resources
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://grafana.com/docs/grafana/latest/" target="_blank" rel="noopener"
 &gt;Grafana documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://grafana.com/grafana/dashboards/" target="_blank" rel="noopener"
 &gt;Grafana dashboard library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack" target="_blank" rel="noopener"
 &gt;kube-prometheus-stack&lt;/a&gt; — installs Prometheus + Grafana together&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Istio</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/kubernetes/istio/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/kubernetes/istio/</guid><description>&lt;p&gt;Istio is a service mesh for Kubernetes. It injects a sidecar proxy (Envoy) into every pod, and all traffic between pods flows through these proxies rather than directly between containers. This gives the mesh control over traffic routing, security, and observability without any changes to application code.&lt;/p&gt;
&lt;h2 id="what-it-solves"&gt;What it solves
&lt;/h2&gt;&lt;p&gt;In a large microservice deployment, every service needs to handle retries, timeouts, circuit breaking, mutual TLS, and metrics collection — or skip them and accept the risk. Without a mesh, each team implements this differently, or not at all. Istio moves these concerns out of the application and into the infrastructure layer, where they are configured once and applied uniformly.&lt;/p&gt;
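&lt;p&gt;As an illustration of moving retries and timeouts into the mesh, a &lt;code&gt;VirtualService&lt;/code&gt; along these lines sets both for every caller of one service. The &lt;code&gt;ratings&lt;/code&gt; service name and the specific values are assumptions chosen for the sketch.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# Sketch; the ratings host and the timing values are hypothetical
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
    - ratings
  http:
    - route:
        - destination:
            host: ratings
      timeout: 2s                 # overall deadline per request
      retries:
        attempts: 3
        perTryTimeout: 500ms
        retryOn: 5xx,connect-failure
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;No client library needs to know about these settings; the sidecars apply them to each call.&lt;/p&gt;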
&lt;h2 id="traffic-management"&gt;Traffic management
&lt;/h2&gt;&lt;p&gt;Istio&amp;rsquo;s &lt;code&gt;VirtualService&lt;/code&gt; and &lt;code&gt;DestinationRule&lt;/code&gt; CRDs give fine-grained control over how traffic is routed:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;apiVersion&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;kind&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;VirtualService&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;metadata&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;reviews&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;spec&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;hosts&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#ae81ff"&gt;reviews&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;http&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;match&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;headers&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;end-user&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;exact&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;test-user&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;route&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;destination&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;host&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;reviews&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;subset&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;v2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;route&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;destination&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;host&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;reviews&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;subset&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;v1&lt;/span&gt;
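&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This routes a specific user to &lt;code&gt;v2&lt;/code&gt; of a service while everyone else gets &lt;code&gt;v1&lt;/code&gt;, which allows targeted testing of a new version without a load balancer rule or code change. The &lt;code&gt;v1&lt;/code&gt; and &lt;code&gt;v2&lt;/code&gt; subsets must be defined in a &lt;code&gt;DestinationRule&lt;/code&gt; for the same host.&lt;/p&gt;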
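&lt;p&gt;The subsets named in the &lt;code&gt;VirtualService&lt;/code&gt; resolve through a &lt;code&gt;DestinationRule&lt;/code&gt;. A minimal sketch, assuming the pods of each version carry a &lt;code&gt;version&lt;/code&gt; label (the label key and values are assumptions about how the workloads are labelled):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# Sketch; assumes pods are labelled version=v1 / version=v2
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
&lt;/code&gt;&lt;/pre&gt;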
&lt;h2 id="mtls"&gt;mTLS
&lt;/h2&gt;&lt;p&gt;Istio issues and rotates certificates for every workload and enforces mutual TLS between services automatically. Services authenticate each other&amp;rsquo;s identity, not just encrypt the connection. A &lt;code&gt;PeerAuthentication&lt;/code&gt; policy can enforce strict mTLS across a namespace, ensuring no plaintext traffic is accepted.&lt;/p&gt;
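&lt;p&gt;A namespace-wide strict policy can be sketched like this. The &lt;code&gt;payments&lt;/code&gt; namespace is a hypothetical example.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# Sketch; the namespace name is hypothetical
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT      # reject plaintext connections to workloads in this namespace
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;A common rollout path is to start in &lt;code&gt;PERMISSIVE&lt;/code&gt; mode, confirm that all callers are in the mesh, and then switch to &lt;code&gt;STRICT&lt;/code&gt;.&lt;/p&gt;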
&lt;h2 id="observability"&gt;Observability
&lt;/h2&gt;&lt;p&gt;Because all traffic flows through Envoy sidecars, Istio generates L7 metrics (request rate, error rate, latency percentiles), distributed traces, and access logs for every service-to-service call — without instrumentation in the services themselves. This integrates with &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/prometheus/" &gt;Prometheus&lt;/a&gt;, &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/grafana/" &gt;Grafana&lt;/a&gt;, and &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/jaeger/" &gt;Jaeger&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="cost"&gt;Cost
&lt;/h2&gt;&lt;p&gt;Istio adds latency (two extra proxy hops per call) and resource overhead (a sidecar per pod). For clusters with tens of services, the operational benefit is clear. For small clusters or teams early in a microservices journey, the complexity may outweigh the gains.&lt;/p&gt;
&lt;h2 id="resources"&gt;Resources
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://istio.io/latest/docs/" target="_blank" rel="noopener"
 &gt;Istio documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://istio.io/latest/docs/concepts/" target="_blank" rel="noopener"
 &gt;Istio concepts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Jaeger</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/jaeger/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/jaeger/</guid><description>&lt;p&gt;Metrics tell you something is wrong. Logs tell you what happened on one service. Distributed tracing tells you what happened across all the services involved in a single request — where the time went, which service called which, and where a failure or latency spike originated.&lt;/p&gt;
&lt;p&gt;Jaeger is an open source distributed tracing system originally from Uber, now a CNCF graduated project. It collects trace data from instrumented services, stores it, and provides a UI for querying and visualising traces. A trace is a tree of spans — each span represents one operation (an HTTP request, a database query, a cache lookup), with start time, duration, and metadata. Jaeger assembles the spans from all services involved in a request into a single trace and lets you follow a request from the frontend through every microservice it touched.&lt;/p&gt;
&lt;h2 id="how-it-works"&gt;How it works
&lt;/h2&gt;&lt;p&gt;Services emit spans using the OpenTelemetry SDK (the modern standard) or the legacy Jaeger client libraries, which are now deprecated in favour of OpenTelemetry. Spans are sent to a Jaeger Collector, stored in a backend (Elasticsearch, Cassandra, or in-memory for development), and queried via the Jaeger UI or API.&lt;/p&gt;
&lt;p&gt;The typical deployment in Kubernetes uses the Jaeger Operator:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A minimal all-in-one instance (suitable for development):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;apiVersion&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;jaegertracing.io/v1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;kind&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;Jaeger&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;metadata&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;jaeger&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;spec&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;strategy&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;allInOne&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For production, use the &lt;code&gt;production&lt;/code&gt; strategy with a separate Collector, Query, and storage backend.&lt;/p&gt;
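&lt;p&gt;A production instance backed by Elasticsearch might be declared roughly as follows. The instance name and the Elasticsearch URL are assumptions, and credentials, index settings, and resource limits are left out of the sketch.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# Sketch; name and server URL are hypothetical
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-prod
spec:
  strategy: production        # separate Collector and Query deployments
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch.observability.svc:9200
&lt;/code&gt;&lt;/pre&gt;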
&lt;h2 id="opentelemetry"&gt;OpenTelemetry
&lt;/h2&gt;&lt;p&gt;Jaeger is increasingly used as a backend rather than as the instrumentation library. OpenTelemetry (OTel) is the standard for instrumenting applications: it provides language SDKs, auto-instrumentation agents, and a Collector that receives, processes, and exports telemetry. OTel exports traces over OTLP to Jaeger, Tempo, or any other OTLP-compatible backend, and the Collector can also export to Zipkin in its native format. The practical consequence: instrument once with OTel, route to whichever backend fits.&lt;/p&gt;
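&lt;p&gt;The instrument-once idea shows up in the Collector configuration: applications send OTLP to the Collector, and only the exporter section decides where traces end up. A minimal sketch, assuming a Jaeger Collector reachable at the hypothetical address below and plaintext gRPC inside the cluster:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# OpenTelemetry Collector config (sketch; the Jaeger endpoint is hypothetical)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}                   # group spans before export

exporters:
  otlp:
    endpoint: jaeger-collector.observability.svc:4317
    tls:
      insecure: true          # assumption: no TLS between Collector and Jaeger

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Switching to another OTLP-compatible backend means changing the exporter endpoint, not the application code.&lt;/p&gt;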
&lt;h2 id="in-the-observability-stack"&gt;In the observability stack
&lt;/h2&gt;&lt;p&gt;Jaeger covers the tracing pillar alongside &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/prometheus/" &gt;Prometheus&lt;/a&gt; (metrics) and &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/loki/" &gt;Loki&lt;/a&gt; (logs). &lt;a class="link" href="https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/grafana/" &gt;Grafana&lt;/a&gt; can query Jaeger directly — a trace ID in a log line becomes a clickable link that opens the trace in Grafana&amp;rsquo;s trace explorer, connecting all three pillars in a single investigation workflow. Grafana Tempo is an alternative tracing backend that integrates more tightly with the Grafana stack, but Jaeger remains the standalone tracing solution with the longest track record.&lt;/p&gt;
&lt;h2 id="resources"&gt;Resources
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.jaegertracing.io/docs/" target="_blank" rel="noopener"
 &gt;Jaeger documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://opentelemetry.io/docs/" target="_blank" rel="noopener"
 &gt;OpenTelemetry documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/jaegertracing/jaeger-operator" target="_blank" rel="noopener"
 &gt;Jaeger Operator&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Loki</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/loki/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/loki/</guid><description>&lt;p&gt;&lt;a class="link" href="../prometheus/" &gt;Prometheus&lt;/a&gt; tells you &lt;em&gt;that&lt;/em&gt; something is wrong and &lt;em&gt;when&lt;/em&gt; it started. Loki tells you &lt;em&gt;what&lt;/em&gt; happened — it is the log aggregation layer of the observability stack. Logs from every pod across every node are collected, indexed, and made searchable in one place. Grafana is the front end for both.&lt;/p&gt;
&lt;h2 id="how-it-works"&gt;How it works
&lt;/h2&gt;&lt;p&gt;Loki stores logs as compressed chunks, indexed only by labels (not by content). This makes it cheap to store and fast to query by label — namespace, pod name, app — but slower for full-text search than something like Elasticsearch. The trade-off is intentional: label-scoped queries cover the vast majority of real operational use, and the storage cost is dramatically lower.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Promtail&lt;/strong&gt; runs as a DaemonSet on every node, tails log files from &lt;code&gt;/var/log/pods/&lt;/code&gt;, attaches Kubernetes labels, and ships to Loki. Grafana queries Loki directly.&lt;/p&gt;
&lt;h2 id="deployment-modes"&gt;Deployment modes
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;SingleBinary&lt;/strong&gt;: ingestion, querying, and management all run in a single instance. It is simple to deploy and has minimal operational overhead, but it is also a single point of failure: if it goes down, ingestion stops and logs can be lost. It is the right starting point for most clusters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SimpleScalable&lt;/strong&gt;: responsibilities are split into separate read, write, and backend pods, each running at least two instances for HA, so ingestion, querying, and background work such as compaction can be scaled independently. This adds significantly more operational overhead, but it is fault-tolerant and tunable under load. It is the right move for production once you have volume and reliability requirements.&lt;/p&gt;
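&lt;p&gt;With the Grafana &lt;code&gt;loki&lt;/code&gt; Helm chart, the mode is chosen in the values file. The fragment below is a sketch of the SingleBinary case only: it assumes a recent 6.x chart, and it omits the storage and schema settings that a working install also needs.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# loki-values.yaml fragment (sketch; storage and schema config omitted)
deploymentMode: SingleBinary

loki:
  commonConfig:
    replication_factor: 1     # a single instance cannot replicate

singleBinary:
  replicas: 1

# Scale the SimpleScalable targets to zero so only the single binary runs
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Moving to SimpleScalable later is mostly a matter of changing &lt;code&gt;deploymentMode&lt;/code&gt; and giving the read, write, and backend targets non-zero replica counts, along with shared object storage.&lt;/p&gt;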
&lt;h2 id="getting-started"&gt;Getting started
&lt;/h2&gt;&lt;p&gt;The fastest path to a working stack is deploying Loki alongside &lt;code&gt;kube-prometheus-stack&lt;/code&gt;, which brings up Prometheus, Grafana, and Alertmanager together. See the &lt;a class="link" href="../prometheus/" &gt;Prometheus&lt;/a&gt; note for the kube-prometheus-stack setup and the ArgoCD CRD workaround.&lt;/p&gt;
&lt;p&gt;Loki and Promtail are installed as a separate ArgoCD Application, using multiple Helm sources with values pulled from the cluster config repo:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;apiVersion&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;kind&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;Application&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;metadata&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;log-ingestion&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;namespace&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;argo-cd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;spec&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;project&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;default&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;sources&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;# Loki&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;repoURL&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;https://grafana.github.io/helm-charts&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;chart&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;loki&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;targetRevision&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;6.55.0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;helm&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;releaseName&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;loki&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;valueFiles&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#ae81ff"&gt;$values/cluster/testing/overlay/monitoring/helm/loki-values.yaml&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;# Promtail&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;repoURL&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;https://grafana.github.io/helm-charts&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;chart&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;promtail&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;targetRevision&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;6.17.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;helm&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;releaseName&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;promtail&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;valueFiles&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#ae81ff"&gt;$values/cluster/testing/overlay/monitoring/helm/promtail-values.yaml&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;# Values source — cluster config repo&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;repoURL&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#39;git@github.com:example-org/cluster-config.git&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;targetRevision&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;HEAD&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;ref&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;values&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;destination&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;server&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;https://kubernetes.default.svc&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;namespace&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;monitoring&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;syncPolicy&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;automated&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;selfHeal&lt;/span&gt;: &lt;span style="color:#66d9ef"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;prune&lt;/span&gt;: &lt;span style="color:#66d9ef"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;syncOptions&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#ae81ff"&gt;CreateNamespace=true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#ae81ff"&gt;ServerSideApply=true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note: &lt;code&gt;targetRevision: HEAD&lt;/code&gt; is fine for testing environments. Pin to a tag for staging and production.&lt;/p&gt;
&lt;h2 id="promtail-deprecation"&gt;Promtail deprecation
&lt;/h2&gt;&lt;p&gt;Promtail was deprecated in February 2025 and is now in long-term support (LTS): security and critical bug fixes only, no new features. Grafana has announced end of life for 2 March 2026.&lt;/p&gt;
&lt;p&gt;The Grafana-recommended replacement is &lt;strong&gt;&lt;a class="link" href="https://grafana.com/docs/alloy/latest/" target="_blank" rel="noopener"
 &gt;Grafana Alloy&lt;/a&gt;&lt;/strong&gt;, a more capable collector that handles metrics, logs, and traces in a single agent. The migration path is not yet settled enough for a confident recommendation — worth waiting for clear community consensus before moving. Until then, Promtail continues to work and the LTS window gives time to plan.&lt;/p&gt;
&lt;h2 id="grafana-integration"&gt;Grafana integration
&lt;/h2&gt;&lt;p&gt;Add Loki as a data source in Grafana and logs become queryable alongside metrics. A useful starting point is a simple app-oriented logs dashboard — filter by namespace and pod, tail in near-real-time, correlate timestamps with Prometheus spikes.&lt;/p&gt;
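&lt;p&gt;One way to add the data source declaratively is Grafana&amp;rsquo;s provisioning file format. A minimal sketch, assuming Loki is reachable inside the cluster at &lt;code&gt;http://loki-gateway.monitoring.svc.cluster.local&lt;/code&gt; (a hypothetical address; use whatever Service your Loki release actually exposes):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# Grafana data source provisioning file (sketch)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    # Assumed in-cluster address, not taken from the setup above
    url: http://loki-gateway.monitoring.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;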
&lt;p&gt;LogQL, Loki&amp;rsquo;s query language, mirrors PromQL in style:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-logql" data-lang="logql"&gt;# All error logs from a namespace
{namespace=&amp;#34;production&amp;#34;} |= &amp;#34;error&amp;#34;

# Parse and filter structured logs
{app=&amp;#34;my-api&amp;#34;} | json | status &amp;gt;= 500

# Rate of error log lines over time
rate({namespace=&amp;#34;production&amp;#34;} |= &amp;#34;error&amp;#34; [5m])
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id="resources"&gt;Resources
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://grafana.com/docs/loki/latest/" target="_blank" rel="noopener"
 &gt;Loki documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://grafana.com/docs/alloy/latest/" target="_blank" rel="noopener"
 &gt;Grafana Alloy documentation&lt;/a&gt; — future Promtail replacement&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/grafana/helm-charts/tree/main/charts/loki-stack" target="_blank" rel="noopener"
 &gt;loki-stack Helm chart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack" target="_blank" rel="noopener"
 &gt;kube-prometheus-stack&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Prometheus</title><link>https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/prometheus/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://backend-engineering-strategy-tools.github.io/site/public-notes/observability/prometheus/</guid><description>&lt;p&gt;Something is wrong. Pods are restarting, latency is climbing, and a request that usually takes 50ms is now taking 2 seconds. You know something happened — users are complaining — but you have no numbers, no history, and no way to know when it started or which service caused it.&lt;/p&gt;
&lt;p&gt;The instinct is to grep the logs. And with one service on one server, that works. But once you have 20 services running across 40 pods, grepping logs to understand system behaviour does not scale. You are looking at individual events trying to infer aggregate trends — the wrong tool for the question. Logs tell you what happened in one place at one moment. Metrics tell you how the system is behaving across all of it, over time.&lt;/p&gt;
&lt;p&gt;So you use Prometheus. It scrapes metrics from every pod, node, and cluster component at a regular interval and stores them as time series. Now you have the spike, the exact minute it started, and a number attached to every symptom.&lt;/p&gt;
&lt;h2 id="how-it-works"&gt;How it works
&lt;/h2&gt;&lt;p&gt;Prometheus is pull-based: it reaches out to your services and scrapes a &lt;code&gt;/metrics&lt;/code&gt; endpoint on a schedule. Services expose metrics in a simple text format; Prometheus stores them and makes them queryable.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;# Endpoint exposed by your app
http://my-service:8080/metrics

# Prometheus scrapes this every 15s and stores:
http_requests_total{method=&amp;#34;POST&amp;#34;, status=&amp;#34;500&amp;#34;} 42
http_request_duration_seconds{quantile=&amp;#34;0.99&amp;#34;} 1.847
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Most infrastructure components (Kubernetes itself, NGINX, Postgres, Redis, JVM) either expose Prometheus metrics natively or have an exporter that does it for them.&lt;/p&gt;
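&lt;p&gt;Outside Kubernetes, the pull model is configured directly in &lt;code&gt;prometheus.yml&lt;/code&gt;. A minimal sketch for the example endpoint above; the job name is a placeholder:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# prometheus.yml (sketch)
global:
  scrape_interval: 15s        # how often every target is scraped
scrape_configs:
  - job_name: my-service      # placeholder job name
    metrics_path: /metrics    # the default, shown for clarity
    static_configs:
      - targets: [&amp;#34;my-service:8080&amp;#34;]
&lt;/code&gt;&lt;/pre&gt;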
&lt;h2 id="in-kubernetes--kube-prometheus-stack"&gt;In Kubernetes — kube-prometheus-stack
&lt;/h2&gt;&lt;p&gt;Running Prometheus in Kubernetes manually is fiddly. The fastest path to a working Prometheus + Grafana + Alertmanager stack is the &lt;code&gt;kube-prometheus-stack&lt;/code&gt; Helm chart — it installs the Prometheus Operator, Grafana, Alertmanager, node exporters, and a set of default dashboards and alert rules in one go. Add &lt;a class="link" href="../loki/" &gt;Loki&lt;/a&gt; on top for logs.&lt;/p&gt;
&lt;h3 id="crd-workaround"&gt;CRD workaround
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;kube-prometheus-stack&lt;/code&gt; ships with a large set of CRDs. When managed through ArgoCD, applying CRDs and the chart in the same sync can cause ordering failures. The standard workaround is &lt;code&gt;skipCrds: true&lt;/code&gt; in the ArgoCD Application, with CRDs applied via a separate kustomize source in the same Application:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;- &lt;span style="color:#f92672"&gt;repoURL&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;oci://ghcr.io/prometheus-community/charts/kube-prometheus-stack&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;chart&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;kube-prometheus-stack&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;targetRevision&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;82.14.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;helm&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;releaseName&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;kube-prometheus-stack&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;valueFiles&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#ae81ff"&gt;$values/cluster/development/overlay/monitoring/helm/kube-prometheus-stack.yaml&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;skipCrds&lt;/span&gt;: &lt;span style="color:#66d9ef"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This keeps the CRDs in Git and lets ArgoCD manage them, while avoiding the race condition on first install.&lt;/p&gt;
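&lt;p&gt;The snippet above covers only the Helm source. One possible shape for the CRD source, assuming the CRD manifests are vendored into the cluster config repo under a hypothetical &lt;code&gt;crds&lt;/code&gt; directory that contains a &lt;code&gt;kustomization.yaml&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# Additional entry under spec.sources (sketch; repo and path are assumptions)
- repoURL: &amp;#39;git@github.com:example-org/cluster-config.git&amp;#39;
  targetRevision: HEAD
  # Hypothetical directory holding the vendored CRDs and a kustomization.yaml
  path: cluster/development/overlay/monitoring/crds
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;These CRDs are large enough that client-side apply can hit the annotation size limit, so &lt;code&gt;ServerSideApply=true&lt;/code&gt; under &lt;code&gt;syncOptions&lt;/code&gt; is usually worth adding as well.&lt;/p&gt;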
&lt;h3 id="servicemonitor"&gt;ServiceMonitor
&lt;/h3&gt;&lt;p&gt;The &lt;strong&gt;Prometheus Operator&lt;/strong&gt; (installed by the stack) manages scrape config via CRDs. The key one is &lt;code&gt;ServiceMonitor&lt;/code&gt; — it tells Prometheus which services to scrape without editing Prometheus config directly:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;apiVersion&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;monitoring.coreos.com/v1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;kind&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;ServiceMonitor&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;metadata&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;my-app&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;spec&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;selector&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;matchLabels&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;app&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;my-app&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;endpoints&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;port&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;metrics&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;interval&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;30s&lt;/span&gt;
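&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Deploy this alongside your app and Prometheus picks it up automatically, provided the ServiceMonitor matches the &lt;code&gt;serviceMonitorSelector&lt;/code&gt; of the Prometheus resource. With default &lt;code&gt;kube-prometheus-stack&lt;/code&gt; values that means carrying a &lt;code&gt;release&lt;/code&gt; label equal to the Helm release name.&lt;/p&gt;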
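&lt;p&gt;A ServiceMonitor selects Services by label and refers to the port by name, not by number. A sketch of a matching Service, assuming the app serves metrics on port 8080 (the port number and pod label are illustrative):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app          # matched by spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: my-app          # assumed pod label
  ports:
    - name: metrics      # matched by endpoints[].port in the ServiceMonitor
      port: 8080         # assumed metrics port
      targetPort: 8080
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;By default a ServiceMonitor only looks at Services in its own namespace, so keep the two together unless you set &lt;code&gt;namespaceSelector&lt;/code&gt;.&lt;/p&gt;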
&lt;h2 id="promql"&gt;PromQL
&lt;/h2&gt;&lt;p&gt;Prometheus Query Language lets you slice and aggregate metrics. A few patterns worth knowing:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Request rate over last 5 minutes&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;rate&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;http_requests_total[&lt;span style="color:#e6db74"&gt;5m&lt;/span&gt;]&lt;span style="color:#f92672"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 99th percentile latency&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;histogram_quantile&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;0.99&lt;/span&gt;, &lt;span style="color:#66d9ef"&gt;rate&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;http_request_duration_seconds_bucket[&lt;span style="color:#e6db74"&gt;5m&lt;/span&gt;]&lt;span style="color:#f92672"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Error ratio (sum both sides so the label sets match)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;sum&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;rate&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;http_requests_total{status&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;5..&lt;/span&gt;&amp;#34;}[&lt;span style="color:#e6db74"&gt;5m&lt;/span&gt;]&lt;span style="color:#f92672"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;/&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;sum&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;rate&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;http_requests_total[&lt;span style="color:#e6db74"&gt;5m&lt;/span&gt;]&lt;span style="color:#f92672"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Memory usage per pod&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;container_memory_working_set_bytes{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;production&lt;/span&gt;&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="alerting"&gt;Alerting
&lt;/h2&gt;&lt;p&gt;Prometheus evaluates alert rules at every evaluation interval and sends firing alerts to Alertmanager, which handles routing, grouping, and silencing:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;- &lt;span style="color:#f92672"&gt;alert&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;HighErrorRate&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;expr&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;sum by (service) (rate(http_requests_total{status=~&amp;#34;5..&amp;#34;}[5m])) / sum by (service) (rate(http_requests_total[5m])) &amp;gt; 0.05&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;for&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;5m&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;labels&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;severity&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;critical&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;annotations&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;summary&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;Error rate above 5% for {{ $labels.service }}&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt; means the condition must hold for 5 minutes before firing — avoids noisy alerts on brief spikes.&lt;/p&gt;
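&lt;p&gt;With the Prometheus Operator, rules like this live inside a &lt;code&gt;PrometheusRule&lt;/code&gt; resource. A sketch with a latency alert built from the p99 query in the PromQL section; the resource name, group name, threshold, and &lt;code&gt;release&lt;/code&gt; label are assumptions to adapt:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-rules                   # hypothetical name
  labels:
    release: kube-prometheus-stack     # assumed to match the rule selector
spec:
  groups:
    - name: my-app.latency             # hypothetical group name
      rules:
        - alert: HighP99Latency
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
            ) &amp;gt; 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: &amp;#34;p99 latency above 1s&amp;#34;
&lt;/code&gt;&lt;/pre&gt;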
&lt;h2 id="metrics-server"&gt;Metrics Server
&lt;/h2&gt;&lt;p&gt;Metrics Server is a separate, lightweight component that serves real-time CPU and memory metrics through the Kubernetes Metrics API. &lt;code&gt;kubectl top pods&lt;/code&gt; and &lt;code&gt;kubectl top nodes&lt;/code&gt; read from it, and HPA uses it for resource-based scaling decisions. It is not Prometheus: it keeps only the most recent sample, stores no history, and has no query language. It exists purely to feed the Kubernetes control plane.&lt;/p&gt;
&lt;p&gt;Install via ArgoCD:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;apiVersion&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;kind&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;Application&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;metadata&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;metrics-server&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;namespace&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;argo-cd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;spec&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;project&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;default&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;sources&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;repoURL&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;https://kubernetes-sigs.github.io/metrics-server&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;chart&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;metrics-server&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;targetRevision&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;3.13.0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;helm&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;releaseName&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;metrics-server&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;destination&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;server&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;https://kubernetes.default.svc&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;namespace&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;kube-system&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;syncPolicy&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;automated&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;selfHeal&lt;/span&gt;: &lt;span style="color:#66d9ef"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;prune&lt;/span&gt;: &lt;span style="color:#66d9ef"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;syncOptions&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#ae81ff"&gt;CreateNamespace=true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Install this early: an HPA that targets CPU or memory cannot scale without it, and &lt;code&gt;kubectl top&lt;/code&gt; is the first thing you reach for when something looks wrong.&lt;/p&gt;
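&lt;p&gt;For reference, a resource-based HPA that depends on Metrics Server looks roughly like this. The Deployment name, replica bounds, and the 70% target are illustrative:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                  # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # percent of the pods&amp;#39; CPU requests
&lt;/code&gt;&lt;/pre&gt;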
&lt;h2 id="what-prometheus-is-not"&gt;What Prometheus is not
&lt;/h2&gt;&lt;p&gt;Prometheus stores metrics — numbers over time. It does not store logs (see &lt;a class="link" href="../loki/" &gt;Loki&lt;/a&gt;) and it does not trace individual requests across services (see &lt;a class="link" href="../jaeger/" &gt;Jaeger&lt;/a&gt;). Metrics tell you &lt;em&gt;that&lt;/em&gt; something is wrong and &lt;em&gt;when&lt;/em&gt; it started. The other two tell you &lt;em&gt;what&lt;/em&gt; happened and &lt;em&gt;where&lt;/em&gt;.&lt;/p&gt;
&lt;h2 id="resources"&gt;Resources
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://prometheus.io/docs/" target="_blank" rel="noopener"
 &gt;Prometheus documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack" target="_blank" rel="noopener"
 &gt;kube-prometheus-stack Helm chart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://promlabs.com/promql-cheat-sheet/" target="_blank" rel="noopener"
 &gt;PromQL cheat sheet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>