Operations

This page covers day-to-day operational tasks: health monitoring, upgrading, rolling back, backup, and understanding the security posture of your SLOpilot deployment.


1. Health Monitoring

SLOpilot exposes two HTTP endpoints that Kubernetes uses to manage pod lifecycle. You can query these manually to diagnose issues.

Health endpoint (/health)

Used as the liveness probe. Returns HTTP 200 when the application server process is running. A non-200 response causes Kubernetes to restart the pod.

kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
    wget -qO- http://localhost:8080/health

Readiness endpoint (/ready)

Used as the readiness probe. Returns HTTP 200 only when all of the following conditions are met:

  • Prometheus is reachable and responding to queries
  • The license is valid (or within the offline grace period)
  • Kubernetes informers have completed their initial sync

Until all conditions are satisfied, the pod is excluded from Service load balancing. A 503 response from this endpoint indicates which dependency is not yet ready.

kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
    wget -qO- http://localhost:8080/ready

Default probe configuration

Probe      Path     Initial Delay  Period  Timeout  Failure Threshold
Liveness   /health  10s            30s     5s       3
Readiness  /ready   5s             10s     10s      3

You can override probe parameters via livenessProbe and readinessProbe Helm values if your cluster's resource constraints require longer startup times.
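For example, a values override for a slow-starting cluster might look like the sketch below. The top-level livenessProbe and readinessProbe keys are named above; the nested fields are standard Kubernetes probe fields, and this assumes the chart passes the maps through to the container spec unchanged:

```yaml
# values-probes.yaml -- hypothetical override file
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60   # allow extra startup time before the first check
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3
```

Apply it with helm upgrade ... -f values-probes.yaml (or the installer's equivalent flag, if one exists).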


2. Monitoring the Deployment

# Check pod status
kubectl get pods -n slopilot

# Watch pod events and conditions
kubectl describe pod -n slopilot -l app.kubernetes.io/name=slopilot-rightsizing

# Stream application logs
kubectl logs -n slopilot -l app.kubernetes.io/name=slopilot-rightsizing -f

# Check bundled Prometheus pods
kubectl get pods -n slopilot -l app.kubernetes.io/name=prometheus

Logs are also written to /data/logs/slopilot.log inside the pod (configurable via log.file_path). This file persists across pod restarts because /data is backed by a PVC.
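Because the log file lives on the PVC, you can inspect it directly inside the running pod, for example to review entries written before the most recent restart (this assumes the default deployment name and log path from above):

```shell
# Tail the persisted application log inside the pod
kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
    tail -n 200 /data/logs/slopilot.log
```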


3. Upgrading

Re-run the installer with the new version tag. The installer calls helm upgrade --atomic, which automatically rolls back if the new version does not pass health checks within the timeout.

./slopilot-install.sh \
    --username <ghcr-username> \
    --password <ghcr-token> \
    --license-key "SLOPILOT-XXXX" \
    --tag vNEW.VERSION
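After the installer finishes, you can confirm that the Deployment is running the tag you requested. This is a sketch assuming the default deployment name and a single container in the pod template:

```shell
# Print the image currently specified in the Deployment's pod template
kubectl get deploy slopilot-rightsizing -n slopilot \
    -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# Wait for the rollout to complete
kubectl rollout status deploy/slopilot-rightsizing -n slopilot
```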

What is preserved across upgrades

  • Authentication credentials (stored in Kubernetes Secrets)
  • The application data PVC (user accounts, configuration, logs)
  • The Prometheus data PVC (metric time-series history)

Brief downtime expected

The deployment uses the Recreate strategy. The old pod is terminated before the new one starts, resulting in a brief period of unavailability during the pod restart.


4. Rolling Back

Use standard Helm rollback commands to revert to a previous revision.

# View release history
helm history slopilot-rightsizing -n slopilot

# Roll back to the previous revision
helm rollback slopilot-rightsizing -n slopilot

# Roll back to a specific revision number
helm rollback slopilot-rightsizing <REVISION> -n slopilot

Tip

After a rollback, Kubernetes pulls the older image tag, but the PVC data remains on the newer schema. If the older version is incompatible with that schema, contact Valuematic support before rolling back.


5. Backup and Restore

Both PersistentVolumeClaims are annotated with helm.sh/resource-policy: keep and are not deleted when you run helm uninstall. Your data survives uninstalls and reinstalls automatically.

What to back up

PVC                                            Contents
data-slopilot-rightsizing                      User accounts, monitoring configuration, application logs
slopilot-rightsizing-prometheus-stack-server   Prometheus TSDB (metric time-series history)

Use your cluster's standard PVC backup procedures. Common approaches:

  • VolumeSnapshots (CSI): take a consistent snapshot without downtime on supported storage backends
  • Velero: cluster-wide backup including PVC data, Secrets, and ConfigMaps
  • Manual copy: scale the deployment to zero, copy data with kubectl cp or a sidecar job, then scale back up
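As a sketch of the VolumeSnapshot approach, the manifest below snapshots the application data PVC. The volumeSnapshotClassName shown is a placeholder; replace it with a snapshot class actually available in your cluster:

```yaml
# snapshot-slopilot-data.yaml -- requires a CSI driver with snapshot support
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: slopilot-data-backup
  namespace: slopilot
spec:
  volumeSnapshotClassName: csi-snapclass   # placeholder: use your cluster's class
  source:
    persistentVolumeClaimName: data-slopilot-rightsizing
```

Repeat with persistentVolumeClaimName set to the Prometheus PVC to capture the metric history as well.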

Restore procedure (manual copy example)

# Scale down to avoid write conflicts
kubectl scale deploy slopilot-rightsizing -n slopilot --replicas=0

# Copy backup archive into the PVC via a temporary pod
kubectl run restore-helper --rm -it --restart=Never -n slopilot \
    --image=busybox \
    --overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"data-slopilot-rightsizing"}}],"containers":[{"name":"r","image":"busybox","command":["sh"],"stdin":true,"tty":true,"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}'

# (Inside the pod: extract your backup archive to /data)

# Scale back up
kubectl scale deploy slopilot-rightsizing -n slopilot --replicas=1

6. Prometheus Guardrail Alerts

The bundled Prometheus includes pre-configured alerting rules that monitor ingestion safety. Alertmanager is disabled by default, but alerts are evaluated continuously and are visible in the Prometheus UI under Alerts, as well as via the ALERTS metric.

  • SLOpilotTSDBHeadSeriesWarning (Warning): active time series exceed the expected baseline. Recommended action: review the namespace monitoring scope.
  • SLOpilotTSDBHeadSeriesCritical (Critical): active time series are significantly above budget. Recommended action: reduce monitored namespaces or workloads.
  • SLOpilotScrapeVolumeHigh (Warning): a scrape job is returning an unusually large number of samples. Recommended action: check for namespace explosion or label cardinality issues.
  • SLOpilotSampleLimitExceeded (Critical): scrape sample limits are being hit, causing data loss. Recommended action: investigate label cardinality immediately.
  • SLOpilotTSDBStorageHigh (Warning): TSDB storage is approaching the configured retention limit. Expected on large clusters; consider increasing the PVC size.
  • SLOpilotPrometheusDiskUsageHigh (Warning): disk usage exceeds 80% of the Prometheus PVC. Recommended action: increase the PVC size or reduce retention.

View active alerts in the Prometheus UI:

kubectl port-forward -n slopilot \
    svc/slopilot-rightsizing-prometheus-stack-server 9090:80
# Open http://localhost:9090/alerts
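With the port-forward active, you can also list firing alerts from the command line via the standard Prometheus HTTP API (this assumes jq is installed locally):

```shell
# List currently firing alerts by name (requires the port-forward above)
curl -s http://localhost:9090/api/v1/alerts \
    | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname'
```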

Tip

If you override prometheus-stack.server.persistentVolume.size to a value other than 40Gi, the SLOpilotPrometheusDiskUsageHigh threshold will need to be adjusted manually, as it is calibrated to the 40 Gi default.


7. Security Posture

SLOpilot is deployed with the following security hardening applied by default. These settings are visible in the Helm chart and can be audited at any time.

  • Non-root execution: runs as UID/GID 1000 with runAsNonRoot: true
  • Read-only root filesystem: the container root filesystem is read-only; only /tmp (emptyDir) and /data (PVC) are writable
  • Dropped capabilities: all Linux capabilities are dropped (drop: [ALL])
  • No privilege escalation: allowPrivilegeEscalation: false
  • Network policies: strict ingress/egress rules restrict traffic to necessary endpoints only (enabled by default; see Configuration — Network Policy)
  • RBAC: cluster-scoped roles are read-only for Kubernetes discovery resources; limited write access is scoped to the bundled Prometheus deployment
  • Secure cookies: authentication cookies carry the HttpOnly and Secure flags in production builds
  • Security headers: HSTS, X-Frame-Options, and Content-Security-Policy headers are applied to all responses
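You can audit these settings against the live Deployment at any time. The commands below are a sketch assuming the default deployment name and a single container in the pod template:

```shell
# Inspect the container-level securityContext (capabilities, readOnlyRootFilesystem, etc.)
kubectl get deploy slopilot-rightsizing -n slopilot \
    -o jsonpath='{.spec.template.spec.containers[0].securityContext}{"\n"}'

# Inspect the pod-level securityContext (runAsUser, runAsNonRoot, etc.)
kubectl get deploy slopilot-rightsizing -n slopilot \
    -o jsonpath='{.spec.template.spec.securityContext}{"\n"}'
```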

Warning

Loosening securityContext settings (e.g. setting readOnlyRootFilesystem: false or restoring dropped capabilities) reduces the security posture of the deployment and is not supported by Valuematic.