Operations

This page covers day-to-day operational tasks: health monitoring, upgrading, rolling back, backup, and understanding the security posture of your SLOpilot deployment.


1. Health Monitoring

SLOpilot exposes two HTTP endpoints that Kubernetes uses to manage pod lifecycle. You can query these manually to diagnose issues.

Health endpoint (/health)

Used as the liveness probe. Returns HTTP 200 when the application server process is running. A non-200 response causes Kubernetes to restart the pod.

kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
    wget -qO- http://localhost:8080/health

Readiness endpoint (/ready)

Used as the readiness probe. Returns HTTP 200 only when all of the following conditions are met:

  • Prometheus is reachable and responding to queries
  • The license is valid (or within the offline grace period)
  • Kubernetes informers have completed their initial sync

Until all conditions are satisfied, the pod is excluded from Service load balancing. A 503 response from this endpoint indicates which dependency is not yet ready.

kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
    wget -qO- http://localhost:8080/ready

Default probe configuration

Probe      Path     Initial Delay  Period  Timeout  Failure Threshold
Liveness   /health  10s            30s     5s       3
Readiness  /ready   5s             10s     10s      3

You can override probe parameters via livenessProbe and readinessProbe Helm values if your cluster's resource constraints require longer startup times.
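For example, a values override for a slow-starting cluster might look like the sketch below. The top-level livenessProbe and readinessProbe keys are named above; the nested fields are standard Kubernetes probe fields, and this assumes the chart passes the maps through to the container spec unchanged:

```yaml
# values-probes.yaml -- hypothetical override file
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60   # allow extra startup time before the first check
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3
```

Apply it with helm upgrade ... -f values-probes.yaml (or the installer's equivalent flag, if one exists).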


2. Monitoring the Deployment

# Check pod status
kubectl get pods -n slopilot

# Watch pod events and conditions
kubectl describe pod -n slopilot -l app.kubernetes.io/name=slopilot-rightsizing

# Stream application logs
kubectl logs -n slopilot -l app.kubernetes.io/name=slopilot-rightsizing -f

# Check bundled Prometheus pods
kubectl get pods -n slopilot -l app.kubernetes.io/name=prometheus

Logs are also written to /data/logs/slopilot.log inside the pod (configurable via log.file_path). This file persists across pod restarts because /data is backed by a PVC.
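Because the log file lives on the PVC, you can inspect it directly inside the running pod, for example to review entries written before the most recent restart (this assumes the default deployment name and log path from above):

```shell
# Tail the persisted application log inside the pod
kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
    tail -n 200 /data/logs/slopilot.log
```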


3. Upgrading

Re-run the installer with the new version tag. The installer calls helm upgrade --atomic, which automatically rolls back if the new version does not pass health checks within the timeout.

./slopilot-install.sh \
    --username <ghcr-username> \
    --password <ghcr-token> \
    --license-key "SLOPILOT-XXXX" \
    --tag vNEW.VERSION
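After the installer finishes, you can confirm that the Deployment is running the tag you requested. This is a sketch assuming the default deployment name and a single container in the pod template:

```shell
# Print the image currently specified in the Deployment's pod template
kubectl get deploy slopilot-rightsizing -n slopilot \
    -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# Wait for the rollout to complete
kubectl rollout status deploy/slopilot-rightsizing -n slopilot
```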

What is preserved across upgrades

  • Authentication credentials (stored in Kubernetes Secrets)
  • The application data PVC (user accounts, configuration, logs)
  • The Prometheus data PVC (metric time-series history)

Brief downtime expected

The deployment uses the Recreate strategy. The old pod is terminated before the new one starts, resulting in a brief period of unavailability during the pod restart.


4. Rolling Back

Use standard Helm rollback commands to revert to a previous revision.

# View release history
helm history slopilot-rightsizing -n slopilot

# Roll back to the previous revision
helm rollback slopilot-rightsizing -n slopilot

# Roll back to a specific revision number
helm rollback slopilot-rightsizing <REVISION> -n slopilot

Tip

After a rollback, Kubernetes pulls the older image tag, but the PVC data remains on the newer schema. If the older version is incompatible with that schema, contact Valuematic support before rolling back.


5. Backup and Restore

Both PersistentVolumeClaims are annotated with helm.sh/resource-policy: keep and are not deleted when you run helm uninstall. Your data survives uninstalls and reinstalls automatically.

What to back up

PVC                                            Contents
data-slopilot-rightsizing                      User accounts, monitoring configuration, application logs
slopilot-rightsizing-prometheus-stack-server   Prometheus TSDB (metric time-series history)

Use your cluster's standard PVC backup procedures. Common approaches:

  • VolumeSnapshots (CSI): take a consistent snapshot without downtime on supported storage backends
  • Velero: cluster-wide backup including PVC data, Secrets, and ConfigMaps
  • Manual copy: scale the deployment to zero, copy data with kubectl cp or a sidecar job, then scale back up
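As a sketch of the VolumeSnapshot approach, the manifest below snapshots the application data PVC. The volumeSnapshotClassName shown is a placeholder; replace it with a snapshot class actually available in your cluster:

```yaml
# snapshot-slopilot-data.yaml -- requires a CSI driver with snapshot support
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: slopilot-data-backup
  namespace: slopilot
spec:
  volumeSnapshotClassName: csi-snapclass   # placeholder: use your cluster's class
  source:
    persistentVolumeClaimName: data-slopilot-rightsizing
```

Repeat with persistentVolumeClaimName set to the Prometheus PVC to capture the metric history as well.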

Restore procedure (manual copy example)

# Scale down to avoid write conflicts
kubectl scale deploy slopilot-rightsizing -n slopilot --replicas=0

# Copy backup archive into the PVC via a temporary pod
kubectl run restore-helper --rm -it --restart=Never -n slopilot \
    --image=busybox \
    --overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"data-slopilot-rightsizing"}}],"containers":[{"name":"r","image":"busybox","command":["sh"],"stdin":true,"tty":true,"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}'

# (Inside the pod: extract your backup archive to /data)

# Scale back up
kubectl scale deploy slopilot-rightsizing -n slopilot --replicas=1

6. Prometheus Guardrail Alerts

The bundled Prometheus includes pre-configured alerting rules that monitor ingestion safety. Alertmanager is disabled by default, but alerts are evaluated continuously and are visible in the Prometheus UI under Alerts, as well as via the ALERTS metric.

  • SLOpilotTSDBHeadSeriesWarning (Warning): active time series exceed the expected baseline. Recommended action: review the namespace monitoring scope.
  • SLOpilotTSDBHeadSeriesCritical (Critical): active time series are significantly above budget. Recommended action: reduce monitored namespaces or workloads.
  • SLOpilotScrapeVolumeHigh (Warning): a scrape job is returning an unusually large number of samples. Recommended action: check for namespace explosion or label cardinality issues.
  • SLOpilotSampleLimitExceeded (Critical): scrape sample limits are being hit, causing data loss. Recommended action: investigate label cardinality immediately.
  • SLOpilotTSDBStorageHigh (Warning): TSDB storage is approaching the configured retention limit. Expected on large clusters; consider increasing the PVC size.
  • SLOpilotPrometheusDiskUsageHigh (Warning): disk usage exceeds 80% of the Prometheus PVC. Recommended action: increase the PVC size or reduce retention.

View active alerts in the Prometheus UI:

kubectl port-forward -n slopilot \
    svc/slopilot-rightsizing-prometheus-stack-server 9090:80
# Open http://localhost:9090/alerts
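With the port-forward active, you can also list firing alerts from the command line via the standard Prometheus HTTP API (this assumes jq is installed locally):

```shell
# List currently firing alerts by name (requires the port-forward above)
curl -s http://localhost:9090/api/v1/alerts \
    | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname'
```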

Tip

If you override prometheus-stack.server.persistentVolume.size to a value other than 40Gi, the SLOpilotPrometheusDiskUsageHigh threshold will need to be adjusted manually, as it is calibrated to the 40 Gi default.


7. Security Posture

SLOpilot is deployed with the following security hardening applied by default. These settings are visible in the Helm chart and can be audited at any time.

  • Non-root execution: runs as UID/GID 1000 with runAsNonRoot: true
  • Read-only root filesystem: the container root filesystem is read-only; only /tmp (emptyDir) and /data (PVC) are writable
  • Dropped capabilities: all Linux capabilities are dropped (drop: [ALL])
  • No privilege escalation: allowPrivilegeEscalation: false
  • Network policies: strict ingress/egress rules restrict traffic to necessary endpoints only (enabled by default; see Configuration — Network Policy)
  • RBAC: cluster-scoped roles are read-only for Kubernetes discovery resources; limited write access is scoped to the bundled Prometheus deployment
  • Secure cookies: authentication cookies carry the HttpOnly and Secure flags in production builds
  • Security headers: HSTS, X-Frame-Options, and Content-Security-Policy headers are applied to all responses
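You can audit these settings against the live Deployment at any time. The commands below are a sketch assuming the default deployment name and a single container in the pod template:

```shell
# Inspect the container-level securityContext (capabilities, readOnlyRootFilesystem, etc.)
kubectl get deploy slopilot-rightsizing -n slopilot \
    -o jsonpath='{.spec.template.spec.containers[0].securityContext}{"\n"}'

# Inspect the pod-level securityContext (runAsUser, runAsNonRoot, etc.)
kubectl get deploy slopilot-rightsizing -n slopilot \
    -o jsonpath='{.spec.template.spec.securityContext}{"\n"}'
```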

Warning

Loosening securityContext settings (e.g. setting readOnlyRootFilesystem: false or restoring dropped capabilities) reduces the security posture of the deployment and is not supported by Valuematic.