Operations¶
This page covers day-to-day operational tasks: health monitoring, upgrading, rolling back, backup, and understanding the security posture of your SLOpilot deployment.
1. Health Monitoring¶
SLOpilot exposes two HTTP endpoints that Kubernetes uses to manage pod lifecycle. You can query these manually to diagnose issues.
Health endpoint (/health)¶
Used as the liveness probe. Returns HTTP 200 when the application server process is running. A non-200 response causes Kubernetes to restart the pod.
Readiness endpoint (/ready)¶
Used as the readiness probe. Returns HTTP 200 only when all of the following conditions are met:
- Prometheus is reachable and responding to queries
- The license is valid (or within the offline grace period)
- Kubernetes informers have completed their initial sync
Until all conditions are satisfied, the pod is excluded from Service load balancing. A 503 response from this endpoint indicates which dependency is not yet ready.
Default probe configuration¶
| Probe | Path | Initial Delay | Period | Timeout | Failure Threshold |
|---|---|---|---|---|---|
| Liveness | /health | 10s | 30s | 5s | 3 |
| Readiness | /ready | 5s | 10s | 10s | 3 |
You can override probe parameters via the `livenessProbe` and `readinessProbe` Helm values if your cluster's resource constraints require longer startup times.
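For example, a values override relaxing both probes might look like the sketch below. The key names shown are the conventional probe fields; confirm the exact layout against the chart's `values.yaml` before applying.

```yaml
# values-override.yaml — illustrative sketch; verify key names against the chart
livenessProbe:
  initialDelaySeconds: 60   # allow a slower startup before liveness checks begin
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3
```

Apply with `helm upgrade --reuse-values -f values-override.yaml` (or your usual values pipeline).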
2. Monitoring the Deployment¶
```bash
# Check pod status
kubectl get pods -n slopilot

# Watch pod events and conditions
kubectl describe pod -n slopilot -l app.kubernetes.io/name=slopilot-rightsizing

# Stream application logs
kubectl logs -n slopilot -l app.kubernetes.io/name=slopilot-rightsizing -f

# Check bundled Prometheus pods
kubectl get pods -n slopilot -l app.kubernetes.io/name=prometheus
```
Logs are also written to `/data/logs/slopilot.log` inside the pod (configurable via `log.file_path`). This file persists across pod restarts because `/data` is backed by a PVC.
3. Upgrading¶
Re-run the installer with the new version tag. The installer calls `helm upgrade --atomic`, which automatically rolls back if the new version does not pass health checks within the timeout.
```bash
./slopilot-install.sh \
  --username <ghcr-username> \
  --password <ghcr-token> \
  --license-key "SLOPILOT-XXXX" \
  --tag vNEW.VERSION
```
What is preserved across upgrades
- Authentication credentials (stored in Kubernetes Secrets)
- The application data PVC (user accounts, configuration, logs)
- The Prometheus data PVC (metric time-series history)
Brief downtime expected
The deployment uses the Recreate strategy. The old pod is terminated before the new one starts, resulting in a brief period of unavailability during the pod restart.
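For reference, the strategy appears in the rendered Deployment manifest roughly as follows (a sketch of the relevant stanza, not the full manifest):

```yaml
spec:
  strategy:
    type: Recreate   # old pod is fully terminated before the new pod is created
```

This avoids two replicas writing to the same PVC concurrently, at the cost of a short availability gap.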
4. Rolling Back¶
Use standard Helm rollback commands to revert to a previous revision.
```bash
# View release history
helm history slopilot-rightsizing -n slopilot

# Roll back to the previous revision
helm rollback slopilot-rightsizing -n slopilot

# Roll back to a specific revision number
helm rollback slopilot-rightsizing <REVISION> -n slopilot
```
Tip
After a rollback Kubernetes will pull the older image tag. The PVC data remains on the newer schema; if the older version is incompatible with the newer database schema, contact Valuematic support before rolling back.
5. Backup and Restore¶
Both PersistentVolumeClaims are annotated with `helm.sh/resource-policy: keep` and are not deleted when you run `helm uninstall`. Your data automatically survives uninstalls and reinstalls.
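The keep policy is expressed as a Helm annotation on each PVC; the relevant fragment looks roughly like this:

```yaml
metadata:
  annotations:
    helm.sh/resource-policy: keep   # Helm skips deleting this resource on uninstall
```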
What to back up¶
| PVC | Contents |
|---|---|
| `data-slopilot-rightsizing` | User accounts, monitoring configuration, application logs |
| `slopilot-rightsizing-prometheus-stack-server` | Prometheus TSDB (metric time-series database) |
Use your cluster's standard PVC backup procedures. Common approaches:
- VolumeSnapshots (CSI): take a consistent snapshot without downtime on supported storage backends
- Velero: cluster-wide backup including PVC data, Secrets, and ConfigMaps
- Manual copy: scale the deployment to zero, copy data with `kubectl cp` or a sidecar job, then scale back up
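As a sketch of the CSI approach, a VolumeSnapshot for the application data PVC could look like the following. This assumes your cluster has a CSI driver with snapshot support; the class name `csi-snapclass` is a placeholder.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: slopilot-data-backup
  namespace: slopilot
spec:
  volumeSnapshotClassName: csi-snapclass   # placeholder; use a VolumeSnapshotClass that exists in your cluster
  source:
    persistentVolumeClaimName: data-slopilot-rightsizing
```

A second snapshot targeting `slopilot-rightsizing-prometheus-stack-server` covers the Prometheus data.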
Restore procedure (manual copy example)¶
```bash
# Scale down to avoid write conflicts
kubectl scale deploy slopilot-rightsizing -n slopilot --replicas=0

# Copy backup archive into the PVC via a temporary pod
kubectl run restore-helper --rm -it --restart=Never -n slopilot \
  --image=busybox \
  --overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"data-slopilot-rightsizing"}}],"containers":[{"name":"r","image":"busybox","command":["sh"],"stdin":true,"tty":true,"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}'

# (Inside the pod: extract your backup archive to /data)

# Scale back up
kubectl scale deploy slopilot-rightsizing -n slopilot --replicas=1
```
6. Prometheus Guardrail Alerts¶
The bundled Prometheus includes pre-configured alerting rules that monitor ingestion safety. Alertmanager is disabled by default, but alerts are evaluated continuously and visible in the Prometheus UI under Alerts, and via the ALERTS metric.
| Alert | Severity | What It Means | Recommended Action |
|---|---|---|---|
| `SLOpilotTSDBHeadSeriesWarning` | Warning | Active time series exceeding the expected baseline | Review namespace monitoring scope |
| `SLOpilotTSDBHeadSeriesCritical` | Critical | Active time series significantly above budget | Reduce monitored namespaces or workloads |
| `SLOpilotScrapeVolumeHigh` | Warning | A scrape job returning an unusually large number of samples | Check for namespace explosion or label cardinality issues |
| `SLOpilotSampleLimitExceeded` | Critical | Scrape sample limits are being hit, causing data loss | Investigate label cardinality immediately |
| `SLOpilotTSDBStorageHigh` | Warning | TSDB storage approaching the configured retention limit | Expected on large clusters; consider increasing PVC size |
| `SLOpilotPrometheusDiskUsageHigh` | Warning | Disk usage exceeding 80% of the Prometheus PVC | Increase PVC size or reduce retention |
View active alerts in the Prometheus UI:
```bash
kubectl port-forward -n slopilot \
  svc/slopilot-rightsizing-prometheus-stack-server 9090:80
# Open http://localhost:9090/alerts
```
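With the port-forward in place you can also query the `ALERTS` metric directly in the Prometheus expression browser, for example to list only firing guardrail alerts:

```promql
# All currently firing SLOpilot guardrail alerts
ALERTS{alertstate="firing", alertname=~"SLOpilot.*"}
```

`ALERTS` and the `alertstate` label are standard Prometheus behavior; the `alertname` regex simply filters to the rules listed above.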
Tip
If you override `prometheus-stack.server.persistentVolume.size` to a value other than 40Gi, the `SLOpilotPrometheusDiskUsageHigh` threshold will need to be adjusted manually, as it is calibrated to the 40 Gi default.
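To illustrate why the threshold is size-dependent, a disk-usage rule of this general shape compares used bytes against a hard-coded capacity. This is a hypothetical sketch, not the shipped rule; the actual expression may differ.

```yaml
# Hypothetical shape of the disk-usage alert; the bundled rule's expression may differ
- alert: SLOpilotPrometheusDiskUsageHigh
  expr: prometheus_tsdb_storage_blocks_bytes > 0.8 * 40 * 1024^3   # 80% of a 40 Gi volume
  labels:
    severity: warning
```

If the PVC is resized, the `40` in the expression no longer matches the actual capacity, so the rule fires too early or too late until the threshold is updated.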
7. Security Posture¶
SLOpilot is deployed with the following security hardening applied by default. These settings are visible in the Helm chart and can be audited at any time.
| Control | Setting |
|---|---|
| Non-root execution | Runs as UID/GID 1000; runAsNonRoot: true |
| Read-only root filesystem | Container root filesystem is read-only; only /tmp (emptyDir) and /data (PVC) are writable |
| Dropped capabilities | All Linux capabilities are dropped (drop: [ALL]) |
| No privilege escalation | allowPrivilegeEscalation: false |
| Network policies | Strict ingress/egress rules restrict traffic to necessary endpoints only (enabled by default; see Configuration — Network Policy) |
| RBAC | Cluster-scoped roles are read-only for Kubernetes discovery resources; limited write access is scoped to the bundled Prometheus deployment |
| Secure cookies | Authentication cookies carry HttpOnly and Secure flags in production builds |
| Security headers | HSTS, X-Frame-Options, and Content-Security-Policy headers are applied to all responses |
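The pod-level and container-level controls in the table correspond to a `securityContext` of roughly this shape (a sketch for audit reference; the authoritative version is the one rendered by the chart):

```yaml
spec:
  securityContext:            # pod level
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
  containers:
    - name: slopilot          # container name is illustrative
      securityContext:        # container level
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```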
Warning
Loosening securityContext settings (e.g. setting readOnlyRootFilesystem: false or restoring dropped capabilities) reduces the security posture of the deployment and is not supported by Valuematic.