Troubleshooting¶
This page documents common symptoms, their causes, and remediation steps. Each section is organized as symptom → cause → fix.
For configuration reference see Configuration. For operational procedures see Operations.
Pod is in CrashLoopBackOff¶
Retrieve the logs from the previous (crashed) container instance:
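A minimal sketch of that retrieval, assuming the Deployment is named slopilot-rightsizing and runs in the slopilot namespace (the names used by other commands on this page). The previous_logs function is a hypothetical convenience wrapper, not something shipped with the product:

```shell
# Assumed defaults, taken from other commands on this page; override as needed.
NAMESPACE="${NAMESPACE:-slopilot}"
DEPLOYMENT="${DEPLOYMENT:-slopilot-rightsizing}"

# Hypothetical helper: print logs from the previously terminated container.
# Extra arguments (for example --tail=200) are passed through to kubectl logs.
previous_logs() {
  kubectl logs -n "$NAMESPACE" "deploy/$DEPLOYMENT" --previous "$@"
}
```

Calling previous_logs --tail=200 prints the last 200 lines written before the crash. If the Deployment runs more than one pod, kubectl picks one of them, so pass a pod name to kubectl logs directly to target a specific instance.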
Common log messages and their remediation:
| Log Message | Cause | Remediation |
|---|---|---|
| `License key not found` | The license Secret is missing or has the wrong key name | Re-run the installer, or manually create the secret: `kubectl create secret generic slopilot-rightsizing-license -n slopilot --from-literal=license-key=SLOPILOT-XXXX` |
| `JWT secret must be at least 32 characters` | The authentication Secret is corrupt or was created with an invalid value | Delete the Secret and re-run the installer: `kubectl delete secret slopilot-rightsizing-users -n slopilot` |
| `Cannot read config` | The ConfigMap is missing | Verify the Helm release is installed: `helm list -n slopilot` |
Readiness Probe Failing (503)¶
The /ready endpoint returns 503 when any required dependency is not available. Diagnose the specific failure:
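One way to query the endpoint from a workstation is sketched here. The container port 8080 is an assumption (this page does not state it), and check_ready is a hypothetical helper name; the namespace and Deployment name follow the other commands on this page:

```shell
NAMESPACE="${NAMESPACE:-slopilot}"
DEPLOYMENT="${DEPLOYMENT:-slopilot-rightsizing}"
# ASSUMPTION: the application listens on port 8080; substitute the real container port.
APP_PORT="${APP_PORT:-8080}"

# Hypothetical helper: forward the port locally, request /ready, then clean up.
check_ready() {
  kubectl port-forward -n "$NAMESPACE" "deploy/$DEPLOYMENT" "$APP_PORT:$APP_PORT" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2  # give the port-forward a moment to establish
  # -i prints the status line and headers, so the 503 and the body are both visible
  curl -sS -i "http://127.0.0.1:$APP_PORT/ready"
  rc=$?
  kill "$pf_pid" 2>/dev/null || true
  return "$rc"
}
```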
The response body describes which check failed.
Common causes:
- Prometheus not ready: Normal during the first 1–2 minutes after a fresh deployment. Wait for the Prometheus pod to become ready.
- License server unreachable: Verify that outbound HTTPS (port 443) to `license.slopilot.eu` is allowed by your network policies and firewall rules.
- Kubernetes informers not synced: Check that the ServiceAccount has the correct ClusterRole bindings:
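The Prometheus and informer checks above can be sketched as follows. Two values here are assumptions that this page does not confirm: the label selector of the bundled Prometheus pod and the ServiceAccount name. Both function names are hypothetical helpers:

```shell
NAMESPACE="${NAMESPACE:-slopilot}"
# ASSUMPTION: label selector of the bundled Prometheus pod; confirm with
#   kubectl get pods -n slopilot --show-labels
PROM_SELECTOR="${PROM_SELECTOR:-app.kubernetes.io/name=prometheus}"
# ASSUMPTION: ServiceAccount name; confirm with
#   kubectl get serviceaccounts -n slopilot
SERVICE_ACCOUNT="${SERVICE_ACCOUNT:-slopilot-rightsizing}"

# Block until the Prometheus pod reports Ready, or give up after 3 minutes.
wait_for_prometheus() {
  kubectl wait pod -n "$NAMESPACE" -l "$PROM_SELECTOR" --for=condition=Ready --timeout=180s
}

# Ask the API server whether the ServiceAccount may list and watch a resource
# cluster-wide (informers need both verbs). Prints "yes" or "no" per verb.
check_informer_access() {
  resource="${1:-pods}"
  for verb in list watch; do
    printf '%s %s: ' "$verb" "$resource"
    kubectl auth can-i "$verb" "$resource" --all-namespaces \
      --as="system:serviceaccount:$NAMESPACE:$SERVICE_ACCOUNT"
  done
}
```

Run check_informer_access once per resource type of interest, for example check_informer_access deployments. A "no" for either verb points at a missing or incorrect ClusterRoleBinding.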
"Collecting..." Displayed for All Workloads¶
No action required
This is expected behavior in the first week or two after installation. The analysis engine requires a minimum period of metric history before producing recommendations. Recommendations will appear automatically as data accumulates, with confidence increasing over time.
If "Collecting..." persists beyond two weeks on workloads that have been running continuously, check the Prometheus pod logs for scrape errors and verify that the bundled Prometheus is running correctly.
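A rough sketch of that check, assuming the bundled Prometheus pod carries the label shown below (an assumption; confirm the real labels with kubectl get pods -n slopilot --show-labels). The function name is a hypothetical helper:

```shell
NAMESPACE="${NAMESPACE:-slopilot}"
# ASSUMPTION: label selector of the bundled Prometheus pod.
PROM_SELECTOR="${PROM_SELECTOR:-app.kubernetes.io/name=prometheus}"

# Show pod status, then recent log lines that mention warnings, errors, or scraping.
prometheus_scrape_errors() {
  kubectl get pods -n "$NAMESPACE" -l "$PROM_SELECTOR"
  kubectl logs -n "$NAMESPACE" -l "$PROM_SELECTOR" --all-containers --tail=500 \
    | grep -iE 'level=(warn|error)|scrape' \
    || echo "no matching log lines in the last 500 lines"
}
```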
License Errors on the Settings Page¶
- Expired license: Contact your Valuematic representative to renew your license key.
- Network error / license server unreachable: Verify connectivity from the pod to the license server:
    kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
      wget -qO- --spider https://license.slopilot.eu/api/v1/health
  If this fails, check your network policies and firewall egress rules. The license server requires outbound TCP 443 to `license.slopilot.eu`.
- Invalid license key: Ensure the license key stored in the Secret exactly matches what was provided by Valuematic. Retrieve the current value:
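A minimal sketch for reading the stored value, assuming the Secret name slopilot-rightsizing-license and the key license-key shown in the CrashLoopBackOff table on this page. The function name is a hypothetical helper:

```shell
NAMESPACE="${NAMESPACE:-slopilot}"
LICENSE_SECRET="${LICENSE_SECRET:-slopilot-rightsizing-license}"

# Print the decoded value stored under the license-key entry of the Secret.
# Secret data is base64-encoded by the API, so it is decoded before printing.
show_license_key() {
  kubectl get secret "$LICENSE_SECRET" -n "$NAMESPACE" \
    -o go-template='{{index .data "license-key"}}' | base64 --decode
  echo  # end the line, since the stored value may not end with a newline
}
```

The go-template with index is used because the key name contains a hyphen. The command prints the license key in clear text, so avoid running it in shared terminals or logged sessions.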
No Workloads Visible¶
- Namespace restriction: Your license may restrict analysis to specific namespaces. Open the Settings page in the SLOpilot UI and review the namespace configuration.
- Informer sync delay: After startup, the Kubernetes informers may take a few seconds to complete their initial sync. Check the logs for informer-related messages.
- RBAC misconfiguration: Verify the ClusterRole was created and contains the required read permissions:
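The log and ClusterRole checks above can be sketched as follows. The exact ClusterRole name is not given on this page, so the sketch assumes it contains "slopilot" and searches by that fragment; both function names are hypothetical helpers:

```shell
NAMESPACE="${NAMESPACE:-slopilot}"
DEPLOYMENT="${DEPLOYMENT:-slopilot-rightsizing}"

# Print recent log lines that mention informers (case-insensitive).
informer_logs() {
  kubectl logs -n "$NAMESPACE" "deploy/$DEPLOYMENT" --tail=1000 | grep -i 'informer'
}

# List ClusterRoles whose name contains the given fragment, then describe each
# one so the granted verbs and resources can be reviewed.
# ASSUMPTION: the ClusterRole name contains "slopilot".
show_cluster_roles() {
  fragment="${1:-slopilot}"
  kubectl get clusterrole -o name | grep "$fragment" | while read -r role; do
    kubectl describe "$role"
  done
}
```

If show_cluster_roles prints nothing, no ClusterRole matched the fragment, which suggests the role was never created or uses a different name.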
Installer RBAC Preflight Failure¶
The installer validates permissions before proceeding. If the preflight check fails, your kubeconfig context does not have sufficient permissions.
Required permissions:
- `create` on `namespaces` (cluster-scoped)
- `create` on `secrets` in the target namespace
- `create` on `clusterroles` (cluster-scoped)
- `create` on `clusterrolebindings` (cluster-scoped)
Remediation: Switch to a kubeconfig context with cluster-admin or equivalent permissions before running the installer.
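The four required permissions can be checked before running the installer with kubectl auth can-i. This is a sketch of an equivalent manual check, not the installer's own preflight logic; the function name is a hypothetical helper and the target namespace is assumed to be slopilot:

```shell
NAMESPACE="${NAMESPACE:-slopilot}"

# Check each permission the installer requires against the current kubeconfig
# context. kubectl auth can-i prints "yes" or "no" and exits non-zero on "no".
check_installer_permissions() {
  missing=0
  printf 'context: %s\n' "$(kubectl config current-context)"
  for resource in namespaces clusterroles clusterrolebindings; do
    printf 'create %s (cluster-scoped): ' "$resource"
    kubectl auth can-i create "$resource" || missing=1
  done
  printf 'create secrets in %s: ' "$NAMESPACE"
  kubectl auth can-i create secrets -n "$NAMESPACE" || missing=1
  return "$missing"
}
```

The function returns a non-zero status if any answer is "no", so it can gate a scripted install, for example check_installer_permissions && echo "preflight should pass".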
Helm Deploy Timeout (--atomic Rollback)¶
The --atomic flag causes Helm to roll back automatically when pods do not reach the Ready state within the timeout. Check events to identify the root cause:
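A minimal sketch for listing those events, assuming the release is installed in the slopilot namespace. Both function names are hypothetical helpers:

```shell
NAMESPACE="${NAMESPACE:-slopilot}"

# List events sorted by timestamp, so the most recent ones appear at the bottom.
# Extra arguments are passed through to kubectl get events.
recent_events() {
  kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp "$@"
}

# Only warnings, which is usually where scheduling, image pull, and volume
# errors show up.
warning_events() {
  recent_events --field-selector type=Warning
}
```

Events outlive the rolled-back resources but are retained only for a limited time (one hour by default on most clusters), so run this soon after the failed deploy.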
Common causes:
- PVC not binding: A StorageClass supporting `ReadWriteOnce` may not exist, or no storage capacity is available. Ensure the StorageClass has a provisioner that supports `ReadWriteOnce` and has available capacity.
- Image pull failure: The `registry-pull-secret` credentials may be expired or incorrect. Re-run the installer with valid GHCR credentials.
- Insufficient node capacity: Default resource requests total approximately 1.1 CPU and 2.3 Gi memory for SLOpilot itself, plus 1 CPU and 2 Gi for Prometheus. Verify nodes have sufficient allocatable capacity:
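Taken together, the default requests come to roughly 2.1 CPU (1.1 + 1) and 4.3 Gi of memory (2.3 + 2). The storage and capacity checks above can be sketched as follows, assuming the slopilot namespace; the function names are hypothetical helpers:

```shell
NAMESPACE="${NAMESPACE:-slopilot}"

# PVC status (Pending means no volume has been bound yet) and the
# StorageClasses available in the cluster.
check_storage() {
  kubectl get pvc -n "$NAMESPACE"
  kubectl get storageclass
}

# Allocatable CPU and memory per node. Allocatable is the schedulable total,
# not the free remainder; compare it with the "Allocated resources" section
# of kubectl describe node to see what is already requested.
check_node_capacity() {
  kubectl get nodes -o custom-columns='NODE:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}
```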
Admin Password Lost¶
The initial admin password is always Changeme1!. If you changed it through the UI and have since lost it, reset it by deleting the application data PVC and restarting the pod:
# Delete the app data PVC (this resets user accounts only, NOT Prometheus metrics)
kubectl delete pvc slopilot-rightsizing-data -n slopilot
# Restart the pod to re-seed the admin user
kubectl rollout restart deployment slopilot-rightsizing -n slopilot
After the pod restarts, log in with admin / Changeme1! and set a new password.
Warning
Resetting the application data PVC deletes all user accounts, not just the admin. All users will need to be recreated.