Monitoring Namespace Recovery

Service: monitoring
Tier: Critical
Last Updated: 2026-05-18
Incident Reference: 2026-05-18 Monitoring Namespace Deletion

Overview

This runbook covers recovery of the complete k3s monitoring namespace after accidental deletion or corruption. The monitoring namespace contains all core observability services:

Prometheus Operator + Prometheus
Grafana (with persistent dashboards)
Loki + Alloy (log collection)
Thanos (long-term metrics storage)
Jaeger + Cassandra (distributed tracing)
Backstage (service catalog)
Alertmanager
Unpoller (UniFi metrics)
Syslogs receiver

When to Use This Runbook

Symptoms:

Monitoring namespace missing or stuck in Terminating
All monitoring services unavailable
Grafana, Prometheus, or Backstage unreachable
Alert delivery stopped

Do NOT use this runbook for:

Individual service failures (use service-specific runbooks)
Power-cut recovery (use the Barn Door Protocol skill instead)
Partial namespace corruption (troubleshoot specific component)
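
To tell a missing namespace apart from one stuck in Terminating, the phase can be read directly. A minimal sketch, assuming a hypothetical helper named namespace_state (not part of any repo referenced here); it only wraps kubectl get namespace:

```shell
# Hypothetical helper: report whether a namespace is Active, Terminating, or missing.
namespace_state() {
  ns="${1:-monitoring}"
  # .status.phase is "Active" or "Terminating". kubectl exits non-zero when the
  # namespace does not exist (and also when the API server cannot be reached).
  if phase="$(kubectl get namespace "$ns" -o jsonpath='{.status.phase}' 2>/dev/null)"; then
    echo "$phase"
  else
    echo "missing"
  fi
}
```

Example: namespace_state monitoring prints Active, Terminating, or missing. A result of missing can also mean the cluster is unreachable, so confirm kubectl access first.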

Prerequisites

kubectl access to k3s cluster
Access to Longhorn UI for PVC restoration
MinIO S3 credentials (Loki + Thanos storage)
UniFi Dream Machine credentials (Unpoller)
GitHub OAuth credentials (Backstage authentication)
GitHub PAT (Backstage catalog integration)

Required Credentials

MinIO S3:

Access Key: (stored in ~/source/prometheus-setup/.credentials)
Secret Key: (stored in ~/source/prometheus-setup/.credentials)
Endpoint: http://192.168.55.107:9000
Buckets: loki, longhorn-backups

UniFi Dream Machine:

URL: https://192.168.50.1
User: dan
Password: (stored in ~/source/unfi-config/.credentials)

Backstage GitHub OAuth:

Client ID: (stored in ~/source/golden-signals/.credentials)
Client Secret: (stored in ~/source/golden-signals/.credentials)

GitHub PAT (Backstage):

Token: (stored in ~/source/golden-signals/.credentials)
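
Before starting recovery, it may be worth confirming that the credential files listed above are present. A minimal sketch, assuming a hypothetical helper named check_credential_files; it checks readability only and does not inspect file contents, since their format is not described here:

```shell
# Hypothetical helper: print each credentials file that is missing or unreadable.
# Returns non-zero if any file fails the check.
check_credential_files() {
  missing=0
  for f in "$@"; do
    if [ ! -r "$f" ]; then
      echo "missing or unreadable: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example (paths taken from the list above):
# check_credential_files ~/source/prometheus-setup/.credentials \
#   ~/source/unfi-config/.credentials ~/source/golden-signals/.credentials
```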

Recovery Steps

1. Force Delete Namespace (if stuck)

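If the namespace is stuck in Terminating, clear its finalizers (this skips normal cleanup, so use it only when deletion is genuinely stuck):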

kubectl get namespace monitoring -o json | \
  jq '.spec.finalizers = []' | \
  kubectl replace --raw /api/v1/namespaces/monitoring/finalize -f -

Wait for namespace deletion to complete:

kubectl wait --for=delete namespace/monitoring --timeout=5m || true

2. Redeploy Prometheus Operator

cd ~/source/prometheus-setup
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install prometheus-operator prometheus-community/kube-prometheus-stack \
  -f prometheus-operator/values.yaml \
  -n monitoring --create-namespace

Wait for pods:

kubectl wait --for=condition=ready pod \
  -l app=kube-prometheus-stack-operator \
  -n monitoring --timeout=5m

3. Restore Grafana from Longhorn Backup

Manual Longhorn UI Restore:

Open Longhorn UI (https://longhorn.foulkes.cloud)
Navigate to Backup tab
Find latest Grafana backup (labeled app=grafana)
Click Restore and name the volume grafana (the name must match volumeHandle in the PV below)
Wait for volume creation

Create PV + PVC pointing to restored volume:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-restored-pv
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeHandle: grafana
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-operator-grafana
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  volumeName: grafana-restored-pv
  resources:
    requests:
      storage: 10Gi
EOF

Restart Grafana to mount restored volume:

kubectl rollout restart statefulset/prometheus-operator-grafana -n monitoring

Verify restoration:

kubectl exec prometheus-operator-grafana-0 -n monitoring -c grafana -- \
  du -sh /var/lib/grafana/grafana.db
# Should show ~14-15MB if dashboards restored

4. Deploy Loki

cd ~/source/loki_setup/loki_helm
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki -f values.yaml -n monitoring

Create MinIO secret:

kubectl create secret generic loki-s3-secret -n monitoring \
  --from-literal=access-key-id=<ACCESS_KEY> \
  --from-literal=secret-access-key=<SECRET_KEY>

Verify Loki Compactor:

kubectl logs -n monitoring loki-compactor-0 --tail=20
# Should show successful S3 connection

5. Deploy Alloy (Log Collection)

cd ~/source/loki_setup/alloy_helm
helm upgrade --install alloy grafana/alloy -f values.yaml -n monitoring

Verify log collection:

kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=10
# Should show "opened log stream" messages

6. Deploy Syslogs Receiver

cd ~/source/loki_setup/syslogs_helm
helm upgrade --install syslogs . -n monitoring

7. Deploy Thanos Components

Create S3 secret:

kubectl create secret generic thanos-objstore-config -n monitoring \
  --from-literal=objstore.yml="type: S3
config:
  bucket: loki
  endpoint: 192.168.55.107:9000
  access_key: <ACCESS_KEY>
  secret_key: <SECRET_KEY>
  insecure: true"
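
As an alternative to pasting keys into the command line, the same objstore.yml can be rendered from environment variables. A minimal sketch, assuming hypothetical variable names MINIO_ACCESS_KEY and MINIO_SECRET_KEY and a hypothetical helper render_objstore; the bucket and endpoint are copied from the command above:

```shell
# Hypothetical helper: print a Thanos objstore.yml using the values from this runbook.
# MINIO_ACCESS_KEY and MINIO_SECRET_KEY are assumed variable names; the expansion
# fails with an error if either is unset or empty.
render_objstore() {
  cat <<EOF
type: S3
config:
  bucket: loki
  endpoint: 192.168.55.107:9000
  access_key: ${MINIO_ACCESS_KEY:?set MINIO_ACCESS_KEY}
  secret_key: ${MINIO_SECRET_KEY:?set MINIO_SECRET_KEY}
  insecure: true
EOF
}
```

One way to use it: redirect the output to a temporary file, pass that file to kubectl create secret generic with --from-file=objstore.yml=<file> in place of --from-literal, and delete the file afterwards.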

Deploy components:

cd ~/source/prometheus-setup
kubectl apply -f thanos-query.yaml
kubectl apply -f thanos-store-gateway.yaml
kubectl apply -f thanos-compactor.yaml

⚠️ Known Issue: Thanos Compactor Longhorn CSI Bug

If Compactor gets stuck in ContainerCreating with error:

MountVolume.MountDevice failed: format of disk failed: device apparently in use

Root cause: Longhorn CSI v1.8.1 bug - fresh volumes fail to format.

Fix: Switch to emptyDir (Compactor data is ephemeral anyway):

kubectl scale deployment thanos-compactor -n monitoring --replicas=0
kubectl delete pvc thanos-compactor-data -n monitoring
kubectl patch deployment thanos-compactor -n monitoring --type=json -p='[
  {
    "op": "replace",
    "path": "/spec/template/spec/volumes/1",
    "value": {
      "name": "data",
      "emptyDir": {
        "sizeLimit": "50Gi"
      }
    }
  }
]'
kubectl scale deployment thanos-compactor -n monitoring --replicas=1
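
The patch above assumes the data volume sits at index 1 of the pod's volume list. A minimal sketch for confirming that index before patching, assuming the volume is named data (as in the patch) and using a hypothetical helper named volume_index:

```shell
# Hypothetical helper: read volume names (one per line) on stdin and print the
# zero-based index of the one equal to $1. Exits non-zero if no name matches.
volume_index() {
  awk -v name="$1" '$0 == name { print NR - 1; found = 1; exit } END { exit !found }'
}

# Example: list the Deployment's volume names and look up "data".
# kubectl get deployment thanos-compactor -n monitoring \
#   -o jsonpath='{range .spec.template.spec.volumes[*]}{.name}{"\n"}{end}' \
#   | volume_index data
```

If the printed index is not 1, adjust the path in the patch accordingly.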

Also ensure deployment strategy is Recreate (not RollingUpdate):

kubectl patch deployment thanos-compactor -n monitoring --type=json -p='[
  {"op":"remove","path":"/spec/strategy/rollingUpdate"},
  {"op":"replace","path":"/spec/strategy/type","value":"Recreate"}
]'

Verify Thanos Store Gateway:

kubectl logs -n monitoring -l app.kubernetes.io/name=thanos-store-gateway --tail=20
# Should show "block loaded" messages

8. Deploy Jaeger + Cassandra

cd ~/source/Jaeger
kubectl apply -f jaeger-cassandra-statefulset.yaml
kubectl wait --for=condition=ready pod/jaeger-cassandra-0 -n monitoring --timeout=10m
kubectl apply -f jaeger-deployment.yaml

Restore Cassandra from Longhorn (optional): Follow the same Longhorn restore process as for Grafana in step 3. The volume should be labeled app=jaeger.

Note: If the Cassandra keyspace is missing from the backup, Jaeger will run but will not store traces.
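
The keyspace can be checked after a restore. A minimal sketch, assuming cqlsh is available inside the Cassandra container and using a hypothetical helper named cassandra_has_keyspace; the keyspace name is not stated in this runbook, so it is passed in as an argument:

```shell
# Hypothetical helper: succeed if the named keyspace appears as a whole word in
# the output of "DESCRIBE KEYSPACES". Assumes cqlsh exists in the container.
cassandra_has_keyspace() {
  kubectl exec jaeger-cassandra-0 -n monitoring -- cqlsh -e "DESCRIBE KEYSPACES" \
    | grep -qwF -- "$1"
}

# Example (the keyspace name here is an assumption; substitute the real one):
# cassandra_has_keyspace jaeger_v1_dc1 && echo "keyspace present"
```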

9. Deploy Unpoller

helm repo add k8s-at-home https://k8s-at-home.com/charts/
helm upgrade --install unpoller k8s-at-home/unifi-poller \
  --set config.unifi.url=https://192.168.50.1 \
  --set config.unifi.user=dan \
  --set config.unifi.pass="<PASSWORD>" \
  -n monitoring

Verify scraping:

kubectl logs -n monitoring -l app.kubernetes.io/name=unifi-poller --tail=10
# Should show "Connected to UniFi controller"

10. Deploy Backstage

Create PostgreSQL secret:

kubectl create secret generic backstage-postgresql-secret \
  --namespace monitoring \
  --from-literal=admin-password="$(openssl rand -base64 24)" \
  --from-literal=backstage-password="$(openssl rand -base64 24)"

Create GitHub secrets:

kubectl create secret generic backstage-github-secret \
  --namespace monitoring \
  --from-literal=GITHUB_TOKEN=<GITHUB_PAT>

kubectl create secret generic backstage-github-oauth \
  --namespace monitoring \
  --from-literal=AUTH_GITHUB_CLIENT_ID=<OAUTH_CLIENT_ID> \
  --from-literal=AUTH_GITHUB_CLIENT_SECRET=<OAUTH_CLIENT_SECRET>

Deploy Backstage:

cd ~/source/golden-signals/platform/backstage
helm repo add backstage https://backstage.github.io/charts
helm upgrade --install backstage backstage/backstage \
  --namespace monitoring \
  --values values.yaml

Wait for PostgreSQL + Backstage:

kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=postgresql \
  -n monitoring --timeout=10m

# If Backstage pods started before PostgreSQL, restart them:
kubectl delete pod -n monitoring -l app.kubernetes.io/name=backstage

11. Configure Grafana Datasources

Loki datasource:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  loki-datasource.yaml: |-
    apiVersion: 1
    datasources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki-gateway.monitoring.svc.cluster.local
      jsonData:
        httpHeaderName1: 'X-Scope-OrgID'
      secureJsonData:
        httpHeaderValue1: 'fake'
EOF

Restart Grafana:

kubectl rollout restart statefulset/prometheus-operator-grafana -n monitoring

12. Configure TLS Certificates

Grafana:

kubectl annotate ingress prometheus-operator-grafana -n monitoring \
  cert-manager.io/cluster-issuer=letsencrypt-production

Backstage:

kubectl annotate ingress backstage -n monitoring \
  cert-manager.io/cluster-issuer=letsencrypt-production

Verify certificates:

kubectl get certificate -n monitoring
# Should show grafana-cert-production-tls and backstage-tls as Ready

13. Configure Longhorn Recurring Backups

# Grafana backup
kubectl label volume grafana -n longhorn-system app=grafana

# Cassandra backup
kubectl label volume <cassandra-volume-name> -n longhorn-system app=jaeger

Verify recurring jobs:

kubectl get recurringjob -n longhorn-system
# Should show jobs with matching labels

Post-Recovery Verification

Check All Pods Running

kubectl get pods -n monitoring

Expected state:

prometheus-operator-* (1/1)
prometheus-prometheus-operator-kube-p-prometheus-0 (3/3)
prometheus-operator-grafana-0 (3/3)
loki-* (all Running)
alloy-* (2/2 on each node)
thanos-* (all Running)
jaeger-* (1/1)
jaeger-cassandra-0 (1/1)
unpoller-* (1/1)
alertmanager-* (Running)
backstage-* (1/1)
backstage-postgresql-0 (1/1)
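
Comparing the pod list against the expected state by eye is error-prone during an incident. A minimal sketch that filters the list down to pods needing attention, assuming the default kubectl get pods column order (NAME, READY, STATUS) and a hypothetical helper named not_ready_pods:

```shell
# Hypothetical helper: read "kubectl get pods --no-headers" output on stdin and
# print pods whose READY column is not N/N or whose STATUS is not Running.
# Pods in Completed status (finished Jobs) are skipped.
not_ready_pods() {
  awk '$3 == "Completed" { next }
       { split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") print $1, $2, $3 }'
}

# Example:
# kubectl get pods -n monitoring --no-headers | not_ready_pods
```

Empty output means every pod is Running with all containers ready.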

Verify Services

# Grafana
curl -k https://grafana.foulkes.cloud

# Backstage
curl -k https://backstage.foulkes.cloud

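# Thanos Query (port-forward blocks, so run it in the background)
kubectl port-forward -n monitoring svc/thanos-query 9090:9090 &
PF_PID=$!
sleep 2  # give the tunnel a moment to open
curl http://localhost:9090/-/healthy
kill "$PF_PID"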

Verify Data Collection

Prometheus targets:

kubectl port-forward -n monitoring prometheus-prometheus-operator-kube-p-prometheus-0 9090:9090
# Open http://localhost:9090/targets (stop any earlier port-forward on local port 9090 first)

Loki logs:

kubectl exec prometheus-operator-grafana-0 -n monitoring -c grafana -- \
  wget -qO- --header="X-Scope-OrgID: fake" \
  "http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/labels"
# Should return labels (not empty)

Thanos long-term storage:

kubectl logs -n monitoring -l app.kubernetes.io/name=thanos-store-gateway --tail=20
# Should show "block loaded" messages

Common Issues

Prometheus Pods Stuck Pending

Symptoms: Prometheus StatefulSet 0/3, Longhorn VolumeAttachment errors

Fix:

# Identify overloaded node
kubectl top nodes

# Cordon and drain
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Wait for pods to reschedule
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=prometheus \
  -n monitoring --timeout=10m

# Uncordon
kubectl uncordon <node-name>

Loki "Timestamp Too Old" Errors

Symptoms: Alloy logs showing timestamp too old, no recent logs in Grafana

Explanation: Alloy is catching up on old pod logs, and Loki rejects entries older than its accepted window. The errors stop once the backlog has been processed.

Action: Wait 5-10 minutes. Fresh logs flow once the backlog clears.
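
To confirm the backlog is actually draining, the error count can be sampled a few minutes apart. A minimal sketch, assuming the Alloy logs contain the literal phrase "timestamp too old" and using a hypothetical helper named count_too_old:

```shell
# Hypothetical helper: count log lines on stdin that mention "timestamp too old".
# grep -c exits non-zero when the count is 0, so "|| true" keeps the helper's
# exit status at zero either way.
count_too_old() {
  grep -c 'timestamp too old' || true
}

# Example: count errors from the last minute across Alloy pods. --tail=-1 is
# needed because kubectl logs limits output to 10 lines per pod with a selector.
# kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --since=1m --tail=-1 \
#   | count_too_old
```

A count that falls toward 0 over successive runs suggests the backlog is clearing.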

Backstage Not Ready

Symptoms: Backstage pod Running but not Ready (0/1), 503 on readiness probe

Cause: Backstage started before PostgreSQL was ready