Troubleshooting
Quick Diagnostic Commands
# Overall pod status
kubectl get pods -n watchlight
# Events (recent scheduling/pull/crash issues)
kubectl get events -n watchlight --sort-by='.lastTimestamp'
# Describe a specific pod
kubectl describe pod -n watchlight <pod-name>
# Logs for a service
kubectl logs -n watchlight -l app.kubernetes.io/name=wl-registry --tail=100
# Follow logs in real time
kubectl logs -n watchlight -l app.kubernetes.io/name=wl-registry -f
Health Check Endpoints
Each service exposes a health endpoint. Use port-forwarding to verify:
| Service | Port | Health Path | Command |
|---|---|---|---|
| wl-registry | 8080 | /health | kubectl port-forward -n watchlight svc/beacon-wl-registry 8080:8080 |
| wl-apdp | 8081 | /health | kubectl port-forward -n watchlight svc/beacon-wl-apdp 8081:8081 |
| wl-proxy | 8079 | / | kubectl port-forward -n watchlight svc/beacon-wl-proxy 8079:8079 |
| wl-secrets-broker | 8082 | /health | kubectl port-forward -n watchlight svc/beacon-wl-secrets-broker 8082:8082 |
| wl-registry-frontend | 80 | / | kubectl port-forward -n watchlight svc/beacon-wl-registry-frontend 3001:80 |
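Then, in a second terminal, test with the local port of the port-forward (3001 for wl-registry-frontend) and the health path from the table:
curl http://localhost:<local-port><health-path>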
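To check several endpoints in one pass once the port-forwards are running, a small helper can save some typing. This is a minimal sketch, assuming curl is installed locally; the function name check_health is hypothetical and not part of Beacon.

```shell
# check_health NAME PORT PATH
# Prints "NAME ok" when the endpoint answers without an HTTP error
# (curl -f treats status 400 and above as failure), "NAME FAILED" otherwise.
check_health() {
  name=$1 port=$2 path=$3
  if curl -fsS --max-time 5 -o /dev/null "http://localhost:${port}${path}"; then
    echo "${name} ok"
  else
    echo "${name} FAILED"
  fi
}

# Example calls, using the local ports from the table above:
#   check_health wl-registry 8080 /health
#   check_health wl-registry-frontend 3001 /
```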
Pod Crash Loops
Symptoms
Pod status shows CrashLoopBackOff or repeated restarts.
Diagnosis
# Check the exit code and reason
kubectl describe pod -n watchlight <pod-name>
# View logs from the previous (crashed) container instance
kubectl logs -n watchlight <pod-name> --previous
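The exit code reported under Last State in the describe output often narrows down the cause. By convention, codes above 128 mean the process was terminated by a signal (code minus 128). The helper below is an illustrative sketch; explain_exit_code is a hypothetical name.

```shell
# explain_exit_code CODE
# Maps a container exit code to a short hint. Codes above 128 follow the
# convention 128 + signal number.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "killed by signal 9 (SIGKILL), often the OOM killer" ;;
      15) echo "killed by signal 15 (SIGTERM)" ;;
      *)  echo "killed by signal ${sig}" ;;
    esac
  else
    echo "application error (exit code ${code})"
  fi
}

# The exit code of the last terminated container can be read with:
#   kubectl get pod -n watchlight <pod-name> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```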
Common Causes
Missing environment variables or secrets:
The pod logs will show an error such as "required environment variable DATABASE_URL is not set". Verify that the database secret exists:
kubectl get secrets -n watchlight
Database migration failure:
If wl-registry crash-loops on startup, check for migration errors:
kubectl logs -n watchlight -l app.kubernetes.io/name=wl-registry --tail=50
A checksum mismatch means a migration file was modified after it had been applied. Fix it by adding a new migration file; never edit a migration that has already been applied.
Insufficient resources:
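If pods are OOMKilled, increase the memory limit. The --reuse-values flag keeps the overrides from the previous release; without it (or your values file passed with -f), helm upgrade falls back to chart defaults for everything not set on the command line:
helm upgrade beacon ./deploy/helm/watchlight-beacon \
--namespace watchlight \
--reuse-values \
--set wl-registry.resources.limits.memory=1Gi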
Database Connection Issues
Symptoms
wl-registry logs show "error connecting to database" or "connection refused".
Built-in PostgreSQL
Verify the PostgreSQL pod is running:
kubectl get pods -n watchlight -l app.kubernetes.io/name=postgresql
Check PostgreSQL logs:
kubectl logs -n watchlight -l app.kubernetes.io/name=postgresql --tail=50
Verify the service is reachable from within the cluster:
kubectl run -n watchlight debug --rm -it --image=postgres:16-alpine -- \
psql "postgres://beacon@beacon-postgresql:5432/beacon"
Check that the password secret exists and matches:
kubectl get secret -n watchlight beacon-postgresql -o jsonpath='{.data.postgres-password}' | base64 -d
BYO Database (External)
Verify the connection string in your values:
helm get values beacon -n watchlight | grep -A2 externalDatabase
Test connectivity from inside the cluster:
kubectl run -n watchlight debug --rm -it --image=postgres:16-alpine -- \
psql "$YOUR_DATABASE_URL"
Common issues:
- Security group / firewall does not allow traffic from cluster nodes to the database.
- IAM authentication requires the pod's service account to have the correct role bindings.
- SSL mode may need to be specified in the connection string (for example, ?sslmode=require).
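When testing the SSL mode, the parameter has to be joined with ? or & depending on whether the URL already carries a query string. The sketch below handles both cases; with_sslmode is a hypothetical helper name, and the example URLs are placeholders.

```shell
# with_sslmode URL [MODE]
# Appends sslmode=MODE (default: require) to a PostgreSQL connection URL.
# A URL that already sets sslmode is printed unchanged.
with_sslmode() {
  url=$1 mode=${2:-require}
  case "$url" in
    *sslmode=*) printf '%s\n' "$url" ;;
    *\?*)       printf '%s\n' "${url}&sslmode=${mode}" ;;
    *)          printf '%s\n' "${url}?sslmode=${mode}" ;;
  esac
}

# Example, reusing the debug pod from above:
#   kubectl run -n watchlight debug --rm -it --image=postgres:16-alpine -- \
#     psql "$(with_sslmode "$YOUR_DATABASE_URL")"
```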
Image Pull Errors
Symptoms
Pod status shows ImagePullBackOff or ErrImagePull.
Diagnosis
kubectl describe pod -n watchlight <pod-name> | grep -A5 "Events"
Look for messages such as "unauthorized" or "not found".
Fixes
Missing or expired GHCR credentials:
# Verify the secret exists
kubectl get secret -n watchlight ghcr-credentials
# Recreate if needed
kubectl delete secret -n watchlight ghcr-credentials
kubectl create secret docker-registry ghcr-credentials \
--namespace watchlight \
--docker-server=ghcr.io \
--docker-username=YOUR_GITHUB_USERNAME \
--docker-password=YOUR_GHCR_TOKEN
Secret not referenced in the Helm values:
Ensure global.imagePullSecrets is set:
global:
  imagePullSecrets:
    - name: ghcr-credentials
Wrong image tag:
Verify the tag exists in the registry:
# Using GitHub CLI
gh api orgs/watchlight-ai-beacon/packages/container/wl-registry/versions \
--jq '.[].metadata.container.tags[]'
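To compare the registry's tag list with what the cluster actually requests, it can help to extract the tag from each image reference in the pod specs. The following is a rough sketch; image_tag is a hypothetical helper, and the tag in the example comment is made up for illustration.

```shell
# image_tag IMAGE
# Prints the tag of an image reference, or "latest" when no tag is present.
# Simplification: a digest-only reference (name@sha256:...) also prints
# "latest", although the digest is what pins the image in that case.
image_tag() {
  ref=${1%%@*}      # drop any @sha256:... digest
  last=${ref##*/}   # final path component, so a registry port is not mistaken for a tag
  case "$last" in
    *:*) printf '%s\n' "${last##*:}" ;;
    *)   printf '%s\n' "latest" ;;
  esac
}

# List the images requested by the pods, one pod per line:
#   kubectl get pods -n watchlight \
#     -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}'
# Then, for one of them (hypothetical tag):
#   image_tag ghcr.io/watchlight-ai-beacon/wl-registry:1.4.2
```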
TLS Certificate Problems
Symptoms
Services return TLS errors, browsers show certificate warnings, or inter-service communication fails with "certificate verify failed".
Ingress TLS
Verify the TLS secret exists and contains valid certificate data:
kubectl get secret -n watchlight <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -dates
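Reading the dates by eye works, but an expiry check is easier to script with the -checkend option of openssl x509, which exits non-zero when the certificate expires within the given number of seconds. A minimal sketch, assuming openssl and base64 are available locally; tls_secret_valid_for is a hypothetical helper name.

```shell
# tls_secret_valid_for NAMESPACE SECRET SECONDS
# Returns 0 when the certificate in the secret stays valid for at least
# SECONDS more, 1 when it expires sooner (or cannot be parsed), and 2 when
# the secret could not be read or has no tls.crt data.
tls_secret_valid_for() {
  cert=$(kubectl get secret -n "$1" "$2" -o jsonpath='{.data.tls\.crt}' | base64 -d) || return 2
  [ -n "$cert" ] || return 2
  printf '%s\n' "$cert" | openssl x509 -noout -checkend "$3" >/dev/null
}

# Example: warn when the certificate expires within 14 days (1209600 seconds):
#   tls_secret_valid_for watchlight <tls-secret-name> 1209600 || echo "renew soon"
```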
Check that the ingress resource references the correct secret:
kubectl get ingress -n watchlight -o yaml | grep -A5 tls
wl-discover TLS to Registry
If wl-discover cannot connect to the registry over TLS, check:
- The CA certificate path is configured:
  wl-discover:
    tls:
      caCertPath: "/etc/ssl/certs/ca.crt"
- The CA certificate is mounted into the pod (via extraVolumes and extraVolumeMounts).
- Verify from inside the pod:
  kubectl exec -n watchlight <wl-discover-pod> -- \
  wget -q --spider https://beacon-wl-registry:8080/health
wl-discover eBPF Permission Failures
Symptoms
wl-discover logs show permission denied for BPF operations, or eBPF probes fail to attach.
Root Cause
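eBPF requires specific Linux capabilities, and the wl-discover container must run as root (UID 0). Kubernetes does not set ambient capabilities, so a process started as a non-root user ends up with an empty effective capability set even when the security context adds the capabilities.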
Verify Security Context
kubectl get daemonset -n watchlight beacon-wl-discover -o jsonpath='{.spec.template.spec.containers[0].securityContext}' | jq .
Required settings:
securityContext:
  runAsUser: 0
  capabilities:
    add:
      - BPF
      - PERFMON
      - SYS_RESOURCE
    drop:
      - ALL
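The security context shows what was requested; the CapEff line in /proc/<pid>/status shows what the running process actually holds. The sketch below decodes that hex mask for the three capabilities listed above. It assumes the wl-discover image ships grep and that PID 1 in the container is the wl-discover process; has_cap and report_ebpf_caps are hypothetical helper names.

```shell
# has_cap HEXMASK BIT
# Returns 0 when capability number BIT is set in HEXMASK, the hex value
# from the CapEff line of /proc/<pid>/status.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# report_ebpf_caps HEXMASK
# Capability numbers from linux/capability.h:
# CAP_SYS_RESOURCE=24, CAP_PERFMON=38, CAP_BPF=39.
report_ebpf_caps() {
  for entry in SYS_RESOURCE:24 PERFMON:38 BPF:39; do
    name=${entry%%:*} bit=${entry##*:}
    if has_cap "$1" "$bit"; then
      echo "${name} present"
    else
      echo "${name} MISSING"
    fi
  done
}

# Example:
#   mask=$(kubectl exec -n watchlight <wl-discover-pod> -- \
#     grep CapEff /proc/1/status | awk '{print $2}')
#   report_ebpf_caps "$mask"
```

A mask of all zeros for a container whose security context adds the capabilities usually points to the process running as a non-root user.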
Verify Kernel Support
eBPF requires Linux 5.8+ with BTF. Check from a node:
kubectl debug node/<node-name> -it --image=busybox -- \
ls /sys/kernel/btf/vmlinux
If the file does not exist, the node's kernel does not support BTF and eBPF discovery will not function. Kubernetes API-based discovery (kubernetes.enabled=true) will still work.
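The kernel version of a node can also be read from the Kubernetes API without starting a debug pod. The sketch below compares it against the 5.8 minimum; kernel_at_least is a hypothetical helper name. A recent version alone does not guarantee BTF, so treat this as a complement to the /sys/kernel/btf/vmlinux check rather than a replacement.

```shell
# kernel_at_least VERSION MAJOR MINOR
# Returns 0 when VERSION (for example "5.15.0-91-generic") is MAJOR.MINOR
# or newer, 1 otherwise.
kernel_at_least() {
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%[!0-9]*}   # keep only the leading digits of the minor part
  [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# Example:
#   ver=$(kubectl get node <node-name> -o jsonpath='{.status.nodeInfo.kernelVersion}')
#   kernel_at_least "$ver" 5 8 || echo "kernel $ver is older than 5.8"
```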
Disable eBPF (Fallback)
If eBPF cannot be enabled, disable it and rely on Kubernetes API discovery:
wl-discover:
  ebpf:
    enabled: false
  hostNetwork: false
  hostPID: false
  securityContext:
    runAsUser: 10001
    runAsNonRoot: true
    capabilities:
      drop:
        - ALL
Networking Issues
Pods Cannot Reach Each Other
Verify services are created:
kubectl get svc -n watchlight
Test connectivity between services:
kubectl run -n watchlight debug --rm -it --image=busybox -- \
wget -q -O- http://beacon-wl-registry:8080/health
Ingress Not Working
# Check ingress status
kubectl get ingress -n watchlight
# Verify the ingress controller is running
kubectl get pods -n ingress-nginx # or your controller's namespace
# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50
Common kubectl Commands Reference
# List all Beacon resources
kubectl get all -n watchlight
# Watch pod status in real time
kubectl get pods -n watchlight -w
# Get resource usage (requires metrics-server)
kubectl top pods -n watchlight
# Execute a shell in a running pod
kubectl exec -n watchlight <pod-name> -it -- /bin/sh
# View ConfigMap contents
kubectl get configmap -n watchlight <configmap-name> -o yaml
# View Secret keys (names only, not values)
kubectl get secret -n watchlight <secret-name> -o jsonpath='{.data}' | jq 'keys'
# Restart a deployment (rolling restart)
kubectl rollout restart -n watchlight deployment/beacon-wl-registry
# Check rollout status
kubectl rollout status -n watchlight deployment/beacon-wl-registry
Getting Help
If the issue persists after following this guide:
- Collect a diagnostic bundle:
  kubectl get pods -n watchlight -o wide > diag-pods.txt
  kubectl get events -n watchlight --sort-by='.lastTimestamp' > diag-events.txt
  kubectl logs -n watchlight -l app.kubernetes.io/instance=beacon --all-containers --tail=200 > diag-logs.txt
- Contact Watchlight AI support at watchlight.ai/partner with the diagnostic output.
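The three collection commands can be wrapped so the output ends up in a single archive. This is a sketch under the assumption that tar is available locally; collect_diag and the beacon-diag-<timestamp> directory name are hypothetical, not a tool shipped with Beacon.

```shell
# collect_diag [NAMESPACE] [OUTDIR]
# Runs the three diagnostic commands from this guide, writes their output
# (including any kubectl errors) into OUTDIR, and prints the name of the
# resulting .tar.gz archive.
collect_diag() {
  ns=${1:-watchlight}
  out=${2:-beacon-diag-$(date +%Y%m%d-%H%M%S)}
  mkdir -p "$out" || return 1
  kubectl get pods -n "$ns" -o wide > "$out/diag-pods.txt" 2>&1
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' > "$out/diag-events.txt" 2>&1
  kubectl logs -n "$ns" -l app.kubernetes.io/instance=beacon \
    --all-containers --tail=200 > "$out/diag-logs.txt" 2>&1
  tar -czf "$out.tar.gz" "$out" && echo "$out.tar.gz"
}

# Example:
#   collect_diag watchlight
```

Logs can contain hostnames or other sensitive details, so it is worth reviewing the files before sending the archive.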