Monitoring Namespace Recovery
Service: monitoring
Tier: Critical
Last Updated: 2026-05-18
Incident Reference: 2026-05-18 Monitoring Namespace Deletion
Overview
This runbook covers recovery of the complete k3s monitoring namespace after accidental deletion or corruption. The monitoring namespace contains all core observability services:
Prometheus Operator + Prometheus
Grafana (with persistent dashboards)
Loki + Alloy (log collection)
Thanos (long-term metrics storage)
Jaeger + Cassandra (distributed tracing)
Backstage (service catalog)
Alertmanager
Unpoller (UniFi metrics)
Syslogs receiver
When to Use This Runbook
Symptoms:
Monitoring namespace missing or stuck in Terminating
All monitoring services unavailable
Grafana, Prometheus, or Backstage unreachable
Alert delivery stopped
Do NOT use this runbook for:
Individual service failures (use service-specific runbooks)
Powercut recovery (use Barn Door Protocol skill instead)
Partial namespace corruption (troubleshoot specific component)
Prerequisites
kubectl access to k3s cluster
Access to Longhorn UI for PVC restoration
MinIO S3 credentials (Loki + Thanos storage)
UniFi Dream Machine credentials (Unpoller)
GitHub OAuth credentials (Backstage authentication)
GitHub PAT (Backstage catalog integration)
Required Credentials
MinIO S3:
Access Key: (stored in ~/source/prometheus-setup/.credentials)
Secret Key: (stored in ~/source/prometheus-setup/.credentials)
Endpoint: http://192.168.55.107:9000
Buckets: loki, longhorn-backups
UniFi Dream Machine:
URL: https://192.168.50.1
User: dan
Password: (stored in ~/source/unfi-config/.credentials)
Backstage GitHub OAuth:
Client ID: (stored in ~/source/golden-signals/.credentials)
Client Secret: (stored in ~/source/golden-signals/.credentials)
GitHub PAT (Backstage):
Token: (stored in ~/source/golden-signals/.credentials)
Recovery Steps
1. Force Delete Namespace (if stuck)
If namespace is stuck in Terminating:
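A minimal sketch of the usual way to release a namespace stuck in Terminating: clear `spec.finalizers` and submit the result to the namespace's `finalize` subresource. It assumes `jq` is installed on the machine running kubectl. The helper name `force_finalize_namespace` is hypothetical and does not come from the original runbook.

```shell
# Hypothetical helper: clears finalizers on a namespace stuck in Terminating.
# Assumes jq is available alongside kubectl.
force_finalize_namespace() {
  # $1 = namespace name, e.g. monitoring
  kubectl get namespace "$1" -o json \
    | jq '.spec.finalizers = []' \
    | kubectl replace --raw "/api/v1/namespaces/$1/finalize" -f -
}

# Usage (only after confirming nothing in the namespace is still needed):
#   force_finalize_namespace monitoring
```

Clearing finalizers skips whatever cleanup they were guarding, so this is best treated as a last resort once normal deletion has clearly stalled.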
Wait for namespace deletion to complete:
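One way to block until the namespace is gone is `kubectl wait --for=delete`. The sketch below wraps it in a hypothetical helper. The 120s default timeout is an arbitrary choice, not a value from the original runbook.

```shell
# Hypothetical helper: blocks until the namespace object no longer exists,
# or fails when the timeout expires.
wait_for_namespace_gone() {
  # $1 = namespace, $2 = timeout (defaults to 120s, an assumed value)
  kubectl wait --for=delete "namespace/$1" --timeout="${2:-120s}"
}

# Usage:
#   wait_for_namespace_gone monitoring 180s
```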
2. Redeploy Prometheus Operator
Wait for pods:
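A possible way to wait for the operator pods, sketched as a small wrapper around `kubectl wait`. The helper name and the label selector in the usage line are assumptions: the selector is inferred from the `prometheus-operator` release name visible in the pod names listed later in this runbook, so check it against the actual labels first.

```shell
# Hypothetical helper: waits until every pod matching a selector is Ready.
wait_for_pods_ready() {
  # $1 = namespace, $2 = label selector, $3 = timeout (defaults to 300s, assumed)
  kubectl wait --for=condition=Ready pod -n "$1" -l "$2" --timeout="${3:-300s}"
}

# Usage (the selector is an assumption, verify with `kubectl get pods --show-labels`):
#   wait_for_pods_ready monitoring app.kubernetes.io/instance=prometheus-operator
```

`kubectl wait` exits with an error straight away if no pods match the selector yet, so it may need a retry shortly after the deploy.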
3. Restore Grafana from Longhorn Backup
Manual Longhorn UI Restore:
Open Longhorn UI (https://longhorn.foulkes.cloud)
Navigate to Backup tab
Find latest Grafana backup (labeled app=grafana)
Click Restore, name volume grafana
Wait for volume creation
Create PV + PVC pointing to restored volume:
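The following is a sketch of what such a PV + PVC pair could look like for a restored Longhorn volume, not the manifests used during the incident. The helper name, the `longhorn-static` storage class name, the `ext4` filesystem, and the PVC name and size passed in are all assumptions. The PVC name must match whatever claim the Grafana chart expects, and the size must match the restored volume.

```shell
# Hypothetical helper: prints a PV + PVC pair bound to an existing Longhorn volume.
render_longhorn_pv_pvc() {
  # $1 = Longhorn volume name, $2 = PVC name, $3 = size (e.g. 10Gi)
  cat <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: $1
spec:
  capacity:
    storage: $3
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-static   # assumed; PV and PVC only need to agree
  csi:
    driver: driver.longhorn.io
    fsType: ext4                      # assumed filesystem
    volumeHandle: $1                  # must equal the Longhorn volume name
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $2
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-static
  volumeName: $1
  resources:
    requests:
      storage: $3
EOF
}

# Usage (PVC name and size are placeholders to fill in):
#   render_longhorn_pv_pvc grafana <pvc-name> <size> | kubectl apply -f -
```

Printing the manifest first makes it easy to review before piping it to `kubectl apply`.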
Restart Grafana to mount restored volume:
Verify restoration:
4. Deploy Loki
Create MinIO secret:
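A hedged sketch of creating the secret idempotently with `kubectl create secret generic`. The helper name, the secret name, and the key names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) are assumptions. Use whatever names the Loki Helm values actually reference.

```shell
# Hypothetical helper: creates or updates a secret holding MinIO credentials.
create_minio_secret() {
  # $1 = secret name, $2 = access key, $3 = secret key
  # --dry-run=client renders the manifest locally; apply makes the call repeatable.
  kubectl create secret generic "$1" -n monitoring \
    --from-literal=AWS_ACCESS_KEY_ID="$2" \
    --from-literal=AWS_SECRET_ACCESS_KEY="$3" \
    --dry-run=client -o yaml \
    | kubectl apply -f -
}

# Usage (values come from ~/source/prometheus-setup/.credentials):
#   create_minio_secret <secret-name> "$ACCESS_KEY" "$SECRET_KEY"
```

Passing credentials as arguments exposes them to the local process list for a moment. On a shared host, `--from-env-file` may be a better fit.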
Verify Loki Compactor:
5. Deploy Alloy (Log Collection)
Verify log collection:
6. Deploy Syslogs Receiver
7. Deploy Thanos Components
Create S3 secret:
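Thanos components read their bucket settings from an object storage config file. The sketch below renders one for the MinIO endpoint listed under Required Credentials. The helper name, the bucket argument, and the secret name in the usage lines are assumptions, since the runbook does not name the Thanos bucket or secret.

```shell
# Hypothetical helper: prints a Thanos objstore config for the MinIO endpoint.
render_thanos_objstore() {
  # $1 = bucket, $2 = access key, $3 = secret key
  cat <<EOF
type: S3
config:
  bucket: $1
  endpoint: 192.168.55.107:9000
  access_key: $2
  secret_key: $3
  insecure: true   # the MinIO endpoint is plain HTTP
EOF
}

# Usage (bucket and secret name are placeholders):
#   render_thanos_objstore <bucket> "$ACCESS_KEY" "$SECRET_KEY" > objstore.yml
#   kubectl create secret generic <secret-name> -n monitoring --from-file=objstore.yml
```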
Deploy components:
⚠️ Known Issue: Thanos Compactor Longhorn CSI Bug
If Compactor gets stuck in ContainerCreating with error:
Root cause: Longhorn CSI v1.8.1 bug - fresh volumes fail to format.
Fix: Switch to emptyDir (Compactor data is ephemeral anyway):
Also ensure deployment strategy is Recreate (not RollingUpdate):
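The strategy change can be sketched as a merge patch. The helper name and the deployment name in the usage line are assumptions, so confirm the real name with `kubectl get deploy -n monitoring` first.

```shell
# Hypothetical helper: switches a Deployment to the Recreate strategy.
set_recreate_strategy() {
  # $1 = deployment name
  # rollingUpdate must be nulled out; the API rejects Recreate when it is still set.
  kubectl patch deployment "$1" -n monitoring --type merge \
    -p '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'
}

# Usage (deployment name is an assumption):
#   set_recreate_strategy thanos-compactor
```

If the Compactor is managed by Helm, the same change also belongs in the chart values so the next upgrade does not revert it.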
Verify Thanos Store Gateway:
8. Deploy Jaeger + Cassandra
Restore Cassandra from Longhorn (optional): Follow the same Longhorn restore process as for Grafana. The backup should be labeled app=jaeger.
Note: If Cassandra keyspace is missing from backup, Jaeger will run but won't store traces.
9. Deploy Unpoller
Verify scraping:
10. Deploy Backstage
Create PostgreSQL secret:
Create GitHub secrets:
Deploy Backstage:
Wait for PostgreSQL + Backstage:
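A sketch of waiting for both workloads in order, PostgreSQL first. The workload kinds and names (`statefulset/backstage-postgresql`, `deployment/backstage`) are inferred from the pod names in the expected-state list and are assumptions. The helper name and the default timeout are also assumed.

```shell
# Hypothetical helper: waits for PostgreSQL, then for Backstage.
wait_for_backstage() {
  # $1 = timeout per workload (defaults to 300s, assumed)
  kubectl rollout status statefulset/backstage-postgresql -n monitoring --timeout="${1:-300s}" &&
    kubectl rollout status deployment/backstage -n monitoring --timeout="${1:-300s}"
}

# Usage:
#   wait_for_backstage 600s
```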
11. Configure Grafana Datasources
Loki datasource:
Restart Grafana:
12. Configure TLS Certificates
Grafana:
Backstage:
Verify certificates:
13. Configure Longhorn Recurring Backups
Verify recurring jobs:
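Longhorn stores recurring jobs as `RecurringJob` custom resources, so they can be listed with kubectl. This sketch assumes Longhorn runs in its default `longhorn-system` namespace. The helper name is hypothetical.

```shell
# Hypothetical helper: lists Longhorn recurring jobs.
list_longhorn_recurring_jobs() {
  # $1 = Longhorn namespace (defaults to longhorn-system, the usual install location)
  kubectl get recurringjobs.longhorn.io -n "${1:-longhorn-system}"
}

# Usage:
#   list_longhorn_recurring_jobs
```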
Post-Recovery Verification
Check All Pods Running
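As a quick first pass, a field selector can surface pods that are not in the Running phase. The helper name is hypothetical.

```shell
# Hypothetical helper: lists pods whose phase is anything other than Running.
list_unhealthy_pods() {
  # $1 = namespace (defaults to monitoring)
  kubectl get pods -n "${1:-monitoring}" --field-selector='status.phase!=Running'
}

# Usage:
#   list_unhealthy_pods
```

A Running phase does not guarantee readiness, so the READY column of a plain `kubectl get pods -n monitoring` still needs comparing against the expected state.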
Expected state:
prometheus-operator-* (1/1)
prometheus-prometheus-operator-kube-p-prometheus-0 (3/3)
prometheus-operator-grafana-0 (3/3)
loki-* (all Running)
alloy-* (2/2 on each node)
thanos-* (all Running)
jaeger-* (1/1)
jaeger-cassandra-0 (1/1)
unpoller-* (1/1)
alertmanager-* (Running)
backstage-* (1/1)
backstage-postgresql-0 (1/1)
Verify Services
Verify Data Collection
Prometheus targets:
Loki logs:
Thanos long-term storage:
Common Issues
Prometheus Pods Stuck Pending
Symptoms: Prometheus StatefulSet 0/3, Longhorn VolumeAttachment errors
Fix:
Loki "Timestamp Too Old" Errors
Symptoms: Alloy logs showing timestamp too old, no recent logs in Grafana
Explanation: Alloy catching up on old pod logs. Errors will stop after processing backlog.
Action: Wait 5-10 minutes. Fresh logs will flow once backlog cleared.
Backstage Not Ready
Symptoms: Backstage pod Running but not Ready (0/1), 503 on readiness probe
Cause: Backstage started before PostgreSQL was ready
Fix:
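Given the stated cause, one plausible fix is to restart Backstage once PostgreSQL reports Ready. The sketch below assumes the Backstage workload is a Deployment named `backstage`. The pod name `backstage-postgresql-0` comes from the expected-state list, while the helper name and timeout are hypothetical.

```shell
# Hypothetical helper: restarts Backstage after PostgreSQL is Ready.
restart_backstage_after_postgres() {
  # $1 = timeout (defaults to 300s, assumed)
  kubectl wait --for=condition=Ready pod/backstage-postgresql-0 -n monitoring --timeout="${1:-300s}" &&
    kubectl rollout restart deployment/backstage -n monitoring &&
    kubectl rollout status deployment/backstage -n monitoring --timeout="${1:-300s}"
}

# Usage:
#   restart_backstage_after_postgres
```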
Jaeger CrashLoopBackOff
Symptoms: Jaeger pod restarting, logs show "keyspace not found"
Options:
Accept trace data loss - Jaeger runs but doesn't store traces
Recreate keyspace:
kubectl exec -n monitoring jaeger-cassandra-0 -- cqlsh -e "CREATE KEYSPACE jaeger_v1_dc1 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
kubectl delete pod -n monitoring -l app=jaeger
Data Loss Assessment
Recoverable (if backups exist)
✅ Grafana dashboards (Longhorn backup, weekly)
✅ Cassandra traces (Longhorn backup, weekly)
✅ Long-term metrics (Thanos S3, indefinite retention)
Non-Recoverable
❌ Prometheus local metrics (7-day window)
❌ Loki logs (no backup configured)
❌ Recent traces (gap between backup and incident)
Prevention
Namespace Protection
Longhorn Backup Coverage
Ensure all stateful volumes have recurring backups:
Configuration in Git
All Helm values and manifests must be in git:
prometheus-setup repo
loki_setup repo
Jaeger repo
golden-signals repo (Backstage)
Related Runbooks
Barn Door Protocol - Powercut recovery (see homelab docs)
Prometheus Recovery - Prometheus-specific issues
Grafana Recovery - Grafana-specific issues
Incident History
2026-05-18: Complete namespace deletion during GitOps blog post preparation
Duration: 1h 51min (04:24 - 06:15 UTC)
Data loss: 7 days Prometheus metrics, all Loki logs, 3 days Jaeger traces
Recovery: Grafana dashboards fully restored, services operational