Troubleshooting

This page documents common symptoms, their causes, and remediation steps. Each section is organized as symptom → cause → fix.

For configuration reference see Configuration. For operational procedures see Operations.

Pod is in CrashLoopBackOff

Retrieve the logs from the previous (crashed) container instance:

kubectl logs -n slopilot \
    -l app.kubernetes.io/name=slopilot-rightsizing \
    --previous

Common log messages and their remediation:

| Log Message | Cause | Remediation |
| --- | --- | --- |
| `License key not found` | The license Secret is missing or has the wrong key name | Re-run the installer, or manually create the secret: `kubectl create secret generic slopilot-rightsizing-license -n slopilot --from-literal=license-key=SLOPILOT-XXXX` |
| `JWT secret must be at least 32 characters` | The authentication Secret is corrupt or was created with an invalid value | Delete the Secret and re-run the installer: `kubectl delete secret slopilot-rightsizing-users -n slopilot` |
| `Cannot read config` | The ConfigMap is missing | Verify the Helm release is installed: `helm list -n slopilot` |

Readiness Probe Failing (503)

The /ready endpoint returns 503 when any required dependency is not available. Diagnose the specific failure:

kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
    wget -qO- http://localhost:8080/ready

The response body describes which check failed.

Common causes:

Prometheus not ready: Normal during the first 1–2 minutes after a fresh deployment. Wait for the Prometheus pod to become ready:
```
kubectl get pods -n slopilot -l app.kubernetes.io/name=prometheus -w
```
License server unreachable: Verify that outbound HTTPS (port 443) to license.slopilot.eu is allowed by your network policies and firewall rules:
```
kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
    wget -qO- --spider https://license.slopilot.eu/api/v1/health
```
Kubernetes informers not synced: Check that the ServiceAccount has the correct ClusterRole bindings:
```
kubectl get clusterrolebinding \
    -l app.kubernetes.io/instance=slopilot-rightsizing
```
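Because the Prometheus dependency normally becomes ready within a minute or two, it can be convenient to poll `/ready` instead of re-running the check by hand. The following is a minimal sketch, not part of the product: `retry` and `ready_check` are hypothetical helper names, and the attempt count and delay are illustrative values.

```shell
# Retry a command until it succeeds or the attempt limit is reached.
# Usage: retry <attempts> <delay-seconds> <command...>
retry() {
    attempts="$1"; delay="$2"; shift 2
    n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            echo "gave up after $n attempts" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Probe /ready from inside the pod (same command as above).
# wget exits non-zero on an HTTP error such as 503, and kubectl exec
# passes that exit code through.
ready_check() {
    kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
        wget -qO- http://localhost:8080/ready >/dev/null
}

# Example: poll every 10 seconds for up to about 2 minutes.
#   retry 12 10 ready_check && echo "ready"
```

If the loop gives up, run the `/ready` request once more without redirecting its output and read the response body to see which check is still failing.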

"Collecting..." Displayed for All Workloads

No action required

This is expected behavior in the first week or two after installation. The analysis engine requires a minimum period of metric history before producing recommendations. Recommendations will appear automatically as data accumulates, with confidence increasing over time.

If "Collecting..." persists beyond two weeks on workloads that have been running continuously, check the Prometheus pod logs for scrape errors and verify that the bundled Prometheus is running correctly.
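One way to look for scrape problems is to filter the Prometheus pod logs for warning and error entries. This is a sketch under an assumption: it expects the `level=...` field that Prometheus normally writes in its log lines, and `prometheus_problems` is a hypothetical helper name. The label selector is the one used elsewhere on this page.

```shell
# Keep only warning and error lines from a Prometheus log stream (stdin).
# Assumes log lines carry a level=<severity> field; matching is
# case-insensitive so both "level=warn" and "level=WARN" are caught.
prometheus_problems() {
    grep -iE 'level=(warn|error)'
}

# Example:
#   kubectl logs -n slopilot -l app.kubernetes.io/name=prometheus \
#       --tail=500 | prometheus_problems
```

An empty result suggests that Prometheus itself is healthy and that the workloads simply have not accumulated enough history yet.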

License Errors on the Settings Page

Expired license: Contact your Valuematic representative to renew your license key.
Network error / license server unreachable: Verify connectivity from the pod to the license server:
```
kubectl exec -n slopilot deploy/slopilot-rightsizing -- \
    wget -qO- --spider https://license.slopilot.eu/api/v1/health
```
If this fails, check your network policies and firewall egress rules. License validation requires outbound TCP 443 to license.slopilot.eu.
Invalid license key: Ensure the license key stored in the Secret exactly matches what was provided by Valuematic. Retrieve the current value:
```
kubectl get secret slopilot-rightsizing-license -n slopilot \
    -o jsonpath='{.data.license-key}' | base64 -d
```
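Comparing keys by eye can miss a trailing newline or stray space. The sketch below compares the stored value with the expected key byte for byte without printing either one. `verify_license_key` is a hypothetical helper name, and `SLOPILOT-XXXX` stands in for your real key.

```shell
# Compare a base64-encoded Secret value (stdin) with the expected key ($1).
# Both sides are converted to hex so that whitespace differences, including
# a trailing newline, count as a mismatch.
verify_license_key() {
    expected="$1"
    stored_hex=$(base64 -d | od -An -tx1 | tr -d ' \n')
    expected_hex=$(printf '%s' "$expected" | od -An -tx1 | tr -d ' \n')
    if [ "$stored_hex" = "$expected_hex" ]; then
        echo "match"
    else
        echo "mismatch"
        return 1
    fi
}

# Example:
#   kubectl get secret slopilot-rightsizing-license -n slopilot \
#       -o jsonpath='{.data.license-key}' | verify_license_key "SLOPILOT-XXXX"
```

A `mismatch` result with a key that looks correct usually points to invisible characters, for example a Secret created from a file that ends in a newline.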

No Workloads Visible

Namespace restriction: Your license may restrict analysis to specific namespaces. Open the Settings page in the SLOpilot UI and review the namespace configuration.
Informer sync delay: After startup, the Kubernetes informers may take a few seconds to complete their initial sync. Check the logs for informer-related messages:
```
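# With a label selector, kubectl logs returns only the last 10 lines per pod
# by default; --tail=-1 returns the full log so startup messages are included.
kubectl logs -n slopilot \
    -l app.kubernetes.io/name=slopilot-rightsizing \
    --tail=-1 \
    | grep -i informer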
```

RBAC misconfiguration: Verify the ClusterRole was created and contains the required read permissions:

kubectl get clusterrole \
    -l app.kubernetes.io/instance=slopilot-rightsizing
kubectl describe clusterrole \
    -l app.kubernetes.io/instance=slopilot-rightsizing
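Beyond inspecting the ClusterRole, you can ask the API server whether the ServiceAccount is actually allowed to read a resource by impersonating it with `kubectl auth can-i --as`. This is a sketch under assumptions: the ServiceAccount name and the resource list are placeholders that you must supply (this page does not state them), `check_sa_read_access` is a hypothetical helper name, and your own user needs permission to impersonate ServiceAccounts.

```shell
# Check get/list/watch access for a ServiceAccount in the slopilot namespace.
# Usage: check_sa_read_access <serviceaccount-name> <resource>...
check_sa_read_access() {
    sa="$1"; shift
    subject="system:serviceaccount:slopilot:$sa"
    denied=0
    for resource in "$@"; do
        for verb in get list watch; do
            # can-i prints "yes" or "no"; it exits non-zero on "no"
            answer=$(kubectl auth can-i "$verb" "$resource" \
                --as="$subject" --all-namespaces 2>/dev/null) || true
            if [ "$answer" != "yes" ]; then
                echo "DENIED $verb $resource"
                denied=1
            fi
        done
    done
    [ "$denied" -eq 0 ] && echo "all checks passed"
    return "$denied"
}

# Example (placeholder names):
#   check_sa_read_access <serviceaccount-name> pods deployments
```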

Installer RBAC Preflight Failure

The installer validates permissions before proceeding. If the preflight check fails, your kubeconfig context does not have sufficient permissions.

Required permissions:

create on namespaces (cluster-scoped)
create on secrets in the target namespace
create on clusterroles (cluster-scoped)
create on clusterrolebindings (cluster-scoped)

Remediation: Switch to a kubeconfig context with cluster-admin or equivalent permissions before running the installer.
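To see which of the permissions listed above is missing in the current context, you can run the same kind of check manually with `kubectl auth can-i`. This is a sketch of an equivalent check, not the installer's own preflight logic. `check_installer_permissions` is a hypothetical helper name, and the namespace defaults to `slopilot`.

```shell
# Check the four "create" permissions the installer requires.
# Usage: check_installer_permissions [namespace]
check_installer_permissions() {
    ns="${1:-slopilot}"
    missing=0
    # resource:scope pairs; only secrets are checked inside the namespace
    for item in namespaces:cluster secrets:namespaced \
                clusterroles:cluster clusterrolebindings:cluster; do
        resource="${item%%:*}"
        scope="${item##*:}"
        if [ "$scope" = "namespaced" ]; then
            answer=$(kubectl auth can-i create "$resource" -n "$ns" 2>/dev/null) || true
        else
            answer=$(kubectl auth can-i create "$resource" 2>/dev/null) || true
        fi
        if [ "$answer" = "yes" ]; then
            echo "ok      create $resource"
        else
            echo "MISSING create $resource"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_installer_permissions slopilot
```

The function returns a non-zero status if any permission is missing, so it can also gate a scripted installation.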

Helm Deploy Timeout (--atomic Rollback)

The --atomic flag causes Helm to roll back automatically when pods do not reach the Ready state within the timeout. Check events to identify the root cause:

kubectl get events -n slopilot \
    --sort-by='.lastTimestamp'

Common causes:

PVC not binding: A StorageClass supporting ReadWriteOnce may not exist, or no storage capacity is available.
```
kubectl get pvc -n slopilot
kubectl get storageclass
```
Ensure the StorageClass has a provisioner that supports ReadWriteOnce and has available capacity.
Image pull failure: The registry-pull-secret credentials may be expired or incorrect.
```
kubectl get events -n slopilot \
    --sort-by='.lastTimestamp' | grep -i pull
```
Re-run the installer with valid GHCR credentials.
Insufficient node capacity: Default resource requests total approximately 1.1 CPU and 2.3 Gi of memory for SLOpilot itself, plus 1 CPU and 2 Gi for Prometheus, or roughly 2.1 CPU and 4.3 Gi combined. Verify that nodes have sufficient allocatable capacity:
```
kubectl describe nodes | grep -A5 "Allocated resources"
```
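For a quick first pass across many nodes, the allocatable CPU of each node can be compared against a threshold. The sketch below is illustrative only: `cpu_to_millicores` and `nodes_with_cpu` are hypothetical helper names, and the 1100 millicore threshold is simply the approximate SLOpilot request quoted above. Allocatable capacity is an upper bound and does not account for pods already scheduled, so the "Allocated resources" output remains the authoritative view.

```shell
# Convert a Kubernetes CPU quantity ("2", "0.5", "1500m") to millicores.
cpu_to_millicores() {
    case "$1" in
        *m) echo "${1%m}" ;;
        *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
    esac
}

# Print nodes whose allocatable CPU is at least <millicores>.
# Reads "NAME CPU" lines on stdin.
nodes_with_cpu() {
    need="$1"
    while read -r name cpu; do
        have=$(cpu_to_millicores "$cpu")
        if [ "$have" -ge "$need" ]; then
            echo "$name ${have}m"
        fi
    done
}

# Example:
#   kubectl get nodes --no-headers \
#       -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu \
#       | nodes_with_cpu 1100
```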

Admin Password Lost

If the auto-generated admin password was not saved, retrieve it from the Secret:

kubectl get secret slopilot-rightsizing-users -n slopilot \
    -o jsonpath='{.data.default-admin-password}' | base64 -d

If the password was changed through the UI after the initial installation and is now unknown, delete the users Secret and re-run the installer with an explicit --admin-password:

kubectl delete secret slopilot-rightsizing-users -n slopilot

./slopilot-install.sh \
    --username <ghcr-username> \
    --password <ghcr-token> \
    --license-key "SLOPILOT-XXXX" \
    --tag vX.Y.Z \
    --admin-password "your-new-password"

Warning

Deleting and re-creating the users Secret also regenerates the internal authentication credentials. All active user sessions are invalidated, and users must log in again.