Troubleshooting¶
This page documents common symptoms, their causes, and remediation steps. Each section is organized as symptom → cause → fix.
For configuration reference see Configuration. For operational procedures see Operations.
Pod is in CrashLoopBackOff¶
Retrieve the logs from the previous (crashed) container instance:
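A minimal sketch of that lookup, wrapped in a small helper so the namespace can be overridden. It assumes the workload is the Deployment slopilot-rightsizing in the slopilot namespace, as used elsewhere on this page; the function name is illustrative only.

```shell
# Print logs from the previous (crashed) container instance.
# Assumption: the Deployment is named slopilot-rightsizing; the namespace
# defaults to slopilot and can be overridden with the first argument.
slopilot_previous_logs() {
  ns="${1:-slopilot}"
  kubectl logs -n "$ns" deploy/slopilot-rightsizing --previous
}
```

Call slopilot_previous_logs with no arguments for the default namespace. If the container has not restarted yet, kubectl reports that no previous container exists.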
Common log messages and their remediation:
| Log Message | Cause | Remediation |
|---|---|---|
| License key not found | The license Secret is missing or has the wrong key name | Re-run the installer, or manually create the secret: kubectl create secret generic slopilot-rightsizing-license -n slopilot --from-literal=license-key=SLOPILOT-XXXX |
| JWT secret must be at least 32 characters | The authentication Secret is corrupt or was created with an invalid value | Delete the Secret and re-run the installer: kubectl delete secret slopilot-rightsizing-users -n slopilot |
| Cannot read config | The ConfigMap is missing | Verify the Helm release is installed: helm list -n slopilot |
Readiness Probe Failing (503)¶
The /ready endpoint returns 503 when any required dependency is unavailable. Query the endpoint to diagnose the specific failure; the response body describes which check failed.
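One way to read that response body is through a temporary port-forward. The sketch below is an assumption-heavy illustration: the container port is not stated on this page, so it is a required argument, and the local port 18080 and the function name are arbitrary choices.

```shell
# Query /ready through a temporary port-forward and print headers plus body.
# Assumptions: container_port is the port SLOpilot listens on (not documented
# here, so it must be supplied); local_port is any free local port.
slopilot_ready_check() {
  container_port="$1"
  local_port="${2:-18080}"
  kubectl port-forward -n slopilot deploy/slopilot-rightsizing \
    "${local_port}:${container_port}" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2   # give the tunnel a moment to come up
  curl -sS -i "http://127.0.0.1:${local_port}/ready"
  rc=$?
  kill "$pf_pid" 2>/dev/null || true   # tear the tunnel down again
  return "$rc"
}
```

curl prints the body even for a 503 response, which is why it is used here instead of the wget call shown elsewhere on this page.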
Common causes:
- Prometheus not ready: Normal during the first 1–2 minutes after a fresh deployment. Wait for the Prometheus pod to become ready.
- License server unreachable: Verify that outbound HTTPS (port 443) to license.slopilot.eu is allowed by your network policies and firewall rules.
- Kubernetes informers not synced: Check that the ServiceAccount has the correct ClusterRole bindings.
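The three causes above can be checked in one pass. This is a sketch under stated assumptions: everything runs in the slopilot namespace, and the ClusterRoleBinding name contains "slopilot" (the exact name is not given on this page). The health URL is the one used in the license section of this page.

```shell
# Run one check per cause listed above; later checks run even if one fails.
# Assumption: the ClusterRoleBinding name contains "slopilot".
slopilot_dependency_checks() {
  # 1. Pod status: the Prometheus pod should report READY after a few minutes.
  kubectl get pods -n slopilot
  # 2. License server reachability from inside the SLOpilot pod.
  kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
    wget -qO- --spider https://license.slopilot.eu/api/v1/health
  # 3. ClusterRoleBindings and the ServiceAccounts they bind.
  kubectl get clusterrolebindings -o wide | grep -i slopilot
}
```

An empty result from the last step suggests that no matching binding exists, although a binding under a different name would also produce no output.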
"Collecting..." Displayed for All Workloads¶
No action required
This is expected behavior in the first week or two after installation. The analysis engine requires a minimum period of metric history before producing recommendations. Recommendations will appear automatically as data accumulates, with confidence increasing over time.
If "Collecting..." persists beyond two weeks on workloads that have been running continuously, check the Prometheus pod logs for scrape errors and verify that the bundled Prometheus is running correctly.
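A possible starting point for the log check is sketched below. The Prometheus pod name is not given on this page, so it is passed as an argument, and the filter assumes Prometheus writes a level= field in its log lines; both are assumptions.

```shell
# Show warning and error lines from the last hour of a Prometheus pod's logs.
# Assumptions: the pod name is supplied by the caller, and log lines carry a
# "level=" field (matched case-insensitively).
prometheus_warnings() {
  pod="$1"
  kubectl logs -n slopilot "$pod" --since=1h | grep -iE 'level=(warn|error)'
}
```

Find the pod name first with kubectl get pods -n slopilot, then pass it as the only argument. No output means no warnings or errors were logged in the last hour.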
License Errors on the Settings Page¶
- Expired license: Contact your Valuematic representative to renew your license key.
- Network error / license server unreachable: Verify connectivity from the pod to the license server:
  kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
    wget -qO- --spider https://license.slopilot.eu/api/v1/health
  If this fails, check your network policies and firewall egress rules. The license server requires outbound TCP 443 to license.slopilot.eu.
- Invalid license key: Ensure the license key stored in the Secret exactly matches what was provided by Valuematic. Retrieve the current value from the Secret and compare it with the key you were issued.
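The stored key can be read back the same way as the admin password later on this page. The sketch assumes the Secret name slopilot-rightsizing-license and the key license-key shown in the CrashLoopBackOff table; the function names are illustrative.

```shell
# Decode the license key currently stored in the Secret.
# Assumption: Secret slopilot-rightsizing-license with data key license-key.
slopilot_license_key() {
  kubectl get secret slopilot-rightsizing-license -n slopilot \
    -o jsonpath='{.data.license-key}' | base64 -d
}

# Succeed only if the stored key is byte-for-byte equal to the expected one.
slopilot_license_matches() {
  [ "$(slopilot_license_key)" = "$1" ]
}
```

The exact comparison in slopilot_license_matches also catches stray spaces inside or after the key, a common result of copy and paste.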
No Workloads Visible¶
- Namespace restriction: Your license may restrict analysis to specific namespaces. Open the Settings page in the SLOpilot UI and review the namespace configuration.
- Informer sync delay: After startup, the Kubernetes informers may take a few seconds to complete their initial sync. Check the logs for informer-related messages.
- RBAC misconfiguration: Verify that the ClusterRole was created and contains the required read permissions.
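The log and RBAC checks above might look like the following sketch. It assumes that informer messages contain the word "informer" and that the ClusterRole name contains "slopilot"; neither is confirmed on this page.

```shell
# Show informer-related lines from the SLOpilot logs.
# Assumption: such messages contain the word "informer" (any case).
slopilot_informer_logs() {
  kubectl logs -n slopilot deploy/slopilot-rightsizing | grep -i informer
}

# Print the rules of every ClusterRole whose name contains "slopilot".
slopilot_clusterrole_rules() {
  kubectl get clusterroles -o name | grep -i slopilot | while read -r role; do
    kubectl describe "$role"
  done
}
```

In the describe output, look for read verbs (get, list, watch) on the workload resources you expect SLOpilot to analyze.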
Installer RBAC Preflight Failure¶
The installer validates permissions before proceeding. If the preflight check fails, your kubeconfig context does not have sufficient permissions.
Required permissions:
- create on namespaces (cluster-scoped)
- create on secrets in the target namespace
- create on clusterroles (cluster-scoped)
- create on clusterrolebindings (cluster-scoped)
Remediation: Switch to a kubeconfig context with cluster-admin or equivalent permissions before running the installer.
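Before re-running the installer, the required permissions can be verified against the current kubeconfig context with kubectl auth can-i. The helper below is a sketch, not the installer's own preflight logic; the function name and message wording are illustrative.

```shell
# Report which of the required permissions the current context lacks.
# Returns 0 when all four checks pass, 1 otherwise.
preflight_can_i() {
  ns="${1:-slopilot}"
  missing=0
  for resource in namespaces clusterroles clusterrolebindings; do
    kubectl auth can-i create "$resource" >/dev/null 2>&1 ||
      { echo "missing: create $resource (cluster-scoped)"; missing=1; }
  done
  kubectl auth can-i create secrets -n "$ns" >/dev/null 2>&1 ||
    { echo "missing: create secrets in namespace $ns"; missing=1; }
  return "$missing"
}
```

Run preflight_can_i after switching contexts. No output and a zero exit status mean the context holds all four permissions listed above.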
Helm Deploy Timeout (--atomic Rollback)¶
The --atomic flag causes Helm to roll back automatically when pods do not reach the Ready state within the timeout. Check events to identify the root cause:
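A minimal sketch for listing the most recent events, assuming the release was deployed to the slopilot namespace. The 30-line limit is an arbitrary choice.

```shell
# List the 30 most recent events in the namespace, oldest first.
# Assumption: the namespace defaults to slopilot.
slopilot_recent_events() {
  ns="${1:-slopilot}"
  kubectl get events -n "$ns" --sort-by=.lastTimestamp | tail -n 30
}
```

Run it soon after the rollback, because Kubernetes retains events only for a limited time.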
Common causes:
- PVC not binding: A StorageClass supporting ReadWriteOnce may not exist, or no storage capacity is available. Ensure the StorageClass has a provisioner that supports ReadWriteOnce and has available capacity.
- Image pull failure: The registry-pull-secret credentials may be expired or incorrect. Re-run the installer with valid GHCR credentials.
- Insufficient node capacity: Default resource requests total approximately 1.1 CPU and 2.3 Gi memory for SLOpilot itself, plus 1 CPU and 2 Gi for Prometheus. Verify that nodes have sufficient allocatable capacity.
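Taken together, the default requests come to roughly 2.1 CPU and 4.3 Gi of memory. The sketch below gathers the storage and capacity facts relevant to the causes above; it assumes the slopilot namespace, and the function name is illustrative.

```shell
# Collect storage and node-capacity facts for a failed deployment.
slopilot_capacity_report() {
  # PVCs stuck in Pending point to a StorageClass or capacity problem.
  kubectl get pvc -n slopilot
  # Available StorageClasses and their provisioners.
  kubectl get storageclass
  # Allocatable CPU and memory per node.
  kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
}
```

Allocatable capacity is the node total available to pods, not what is currently free. For the amount already requested by running pods, see the "Allocated resources" section of kubectl describe nodes.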
Admin Password Lost¶
If the auto-generated admin password was not saved, retrieve it from the Secret:
kubectl get secret slopilot-rightsizing-users -n slopilot \
-o jsonpath='{.data.default-admin-password}' | base64 -d
If the password was changed through the UI after the initial installation and is now unknown, delete the users Secret and re-run the installer with an explicit --admin-password:
kubectl delete secret slopilot-rightsizing-users -n slopilot
./slopilot-install.sh \
--username <ghcr-username> \
--password <ghcr-token> \
--license-key "SLOPILOT-XXXX" \
--tag vX.Y.Z \
--admin-password "your-new-password"
Warning
Deleting the users Secret also regenerates internal authentication credentials. All active user sessions will be invalidated and users will need to log in again.