Troubleshooting

Quick Diagnostic Commands

# Overall pod status
kubectl get pods -n watchlight

# Events (recent scheduling/pull/crash issues)
kubectl get events -n watchlight --sort-by='.lastTimestamp'

# Describe a specific pod
kubectl describe pod -n watchlight <pod-name>

# Logs for a service
kubectl logs -n watchlight -l app.kubernetes.io/name=wl-registry --tail=100

# Follow logs in real time
kubectl logs -n watchlight -l app.kubernetes.io/name=wl-registry -f

Health Check Endpoints

Each service exposes a health endpoint. Use port-forwarding to verify:

| Service | Port | Health Path | Command |
| --- | --- | --- | --- |
| wl-registry | 8080 | /health | kubectl port-forward -n watchlight svc/beacon-wl-registry 8080:8080 |
| wl-apdp | 8081 | /health | kubectl port-forward -n watchlight svc/beacon-wl-apdp 8081:8081 |
| wl-proxy | 8079 | / | kubectl port-forward -n watchlight svc/beacon-wl-proxy 8079:8079 |
| wl-secrets-broker | 8082 | /health | kubectl port-forward -n watchlight svc/beacon-wl-secrets-broker 8082:8082 |
| wl-registry-frontend | 80 | / | kubectl port-forward -n watchlight svc/beacon-wl-registry-frontend 3001:80 |

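Then, from a second terminal, test against the local port of the port-forward (3001 for wl-registry-frontend). The health paths listed above already include the leading slash:

curl http://localhost:<local-port><path>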
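A small helper can make this check repeatable. The sketch below assumes a port-forward is already running; the function names `health_url` and `check_health` are illustrative and not part of Beacon.

```shell
# Build the local URL for a forwarded port and a health path.
# Accepts the path with or without a leading slash.
health_url() {
  local port="$1" path="${2:-/}"
  path="/${path#/}"            # ensure exactly one leading slash
  printf 'http://localhost:%s%s\n' "$port" "$path"
}

# Return 0 if the endpoint answers without an HTTP error (status below 400)
# within 5 seconds; curl prints the reason on failure.
check_health() {
  curl --fail --silent --show-error --max-time 5 --output /dev/null \
    "$(health_url "$1" "$2")"
}
```

For example, `check_health 8080 /health && echo "wl-registry is healthy"` after forwarding port 8080.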

Pod Crash Loops

Symptoms

Pod status shows CrashLoopBackOff or repeated restarts.

Diagnosis

# Check the exit code and reason
kubectl describe pod -n watchlight <pod-name>

# View logs from the previous (crashed) container instance
kubectl logs -n watchlight <pod-name> --previous
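The termination reason and exit code that `kubectl describe` reports can also be read directly from the pod status. The following is a minimal sketch: both function names are hypothetical, and the exit-code mapping only covers the common convention that codes above 128 mean "128 plus the signal number".

```shell
# Print name, reason, and exit code of each container's last termination.
last_termination() {
  local ns="$1" pod="$2"
  kubectl get pod -n "$ns" "$pod" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Translate a container exit code into a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    137) echo "killed by SIGKILL (often OOMKilled)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "application error (exit code $1)"
      fi ;;
  esac
}
```

Run `last_termination watchlight <pod-name>` first, then pass the reported code to `explain_exit_code`.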

Common Causes

Missing environment variables or secrets:

The pod logs show an error such as "required environment variable DATABASE_URL is not set". Verify that the database secret exists:

kubectl get secrets -n watchlight

Database migration failure:

If wl-registry crash-loops on startup, check for migration errors:

kubectl logs -n watchlight -l app.kubernetes.io/name=wl-registry --tail=50

A checksum mismatch means a migration file was modified after it was applied. Fix this by adding a new migration file; never edit applied migrations.

Insufficient resources:

If pods are OOMKilled, increase memory limits:

helm upgrade beacon ./deploy/helm/watchlight-beacon \
  --namespace watchlight \
  --set wl-registry.resources.limits.memory=1Gi
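Before raising a limit, it can help to confirm which pods were actually OOMKilled. A rough sketch, assuming the default `watchlight` namespace; the function names are illustrative.

```shell
# List each pod with the last termination reason of its containers
# (tab-separated, one pod per line).
list_termination_reasons() {
  kubectl get pods -n "${1:-watchlight}" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# Keep only the pods whose reason column mentions OOMKilled.
filter_oomkilled() {
  awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

Usage: `list_termination_reasons watchlight | filter_oomkilled` prints the names of affected pods, if any.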

Database Connection Issues

Symptoms

wl-registry logs show "error connecting to database" or "connection refused".

Built-in PostgreSQL

Verify the PostgreSQL pod is running:

kubectl get pods -n watchlight -l app.kubernetes.io/name=postgresql

Check PostgreSQL logs:

kubectl logs -n watchlight -l app.kubernetes.io/name=postgresql --tail=50

Verify the service is reachable from within the cluster:

kubectl run -n watchlight debug --rm -it --image=postgres:16-alpine -- \
  psql "postgres://beacon@beacon-postgresql:5432/beacon"

Check that the password secret exists and that its value matches the password the services are configured to use:

kubectl get secret -n watchlight beacon-postgresql -o jsonpath='{.data.postgres-password}' | base64 -d

BYO Database (External)

Verify the connection string in your values:

helm get values beacon -n watchlight | grep -A2 externalDatabase

Test connectivity from inside the cluster:

kubectl run -n watchlight debug --rm -it --image=postgres:16-alpine -- \
  psql "$YOUR_DATABASE_URL"

Common issues:

  • Security group / firewall does not allow traffic from cluster nodes to the database.
  • IAM authentication requires the pod's service account to have the correct role bindings.
  • SSL mode may need to be specified in the connection string (for example, ?sslmode=require).
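For the SSL mode case, the parameter has to be appended with `?` or `&` depending on whether the URL already carries query parameters. A minimal sketch of that logic; `with_sslmode` is a hypothetical helper name, and `require` is used as the default only because it is the example given above.

```shell
# Append sslmode=<mode> to a PostgreSQL URL unless one is already present.
with_sslmode() {
  local url="$1" mode="${2:-require}"
  case "$url" in
    *sslmode=*) printf '%s\n' "$url" ;;                    # already set, leave untouched
    *\?*)       printf '%s&sslmode=%s\n' "$url" "$mode" ;; # has other query parameters
    *)          printf '%s?sslmode=%s\n' "$url" "$mode" ;; # no query string yet
  esac
}
```

The result can be passed straight to the connectivity test, for example `psql "$(with_sslmode "$YOUR_DATABASE_URL")"`.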

Image Pull Errors

Symptoms

Pod status shows ImagePullBackOff or ErrImagePull.

Diagnosis

kubectl describe pod -n watchlight <pod-name> | grep -A5 "Events"

Look for messages like unauthorized or not found.
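The two message types point to different fixes, so a quick classification can save a step. The sketch below is illustrative only: the function names are hypothetical, and the extra match strings beyond "unauthorized" and "not found" (`denied`, `manifest unknown`) are assumptions about typical registry responses.

```shell
# Map an image pull error message to the fix that most likely applies.
classify_pull_error() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    *unauthorized*|*denied*)            echo "credentials" ;;
    *"not found"*|*"manifest unknown"*) echo "image-tag" ;;
    *)                                  echo "unknown" ;;
  esac
}

# Print the messages of the Failed events recorded for one pod.
pull_failures() {
  kubectl get events -n "$1" --field-selector "involvedObject.name=$2,reason=Failed" \
    -o jsonpath='{range .items[*]}{.message}{"\n"}{end}'
}
```

A result of `credentials` points to the GHCR secret fixes below; `image-tag` points to the tag check.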

Fixes

Missing or expired GHCR credentials:

# Verify the secret exists
kubectl get secret -n watchlight ghcr-credentials

# Recreate if needed (the delete is a no-op when the secret is absent)
kubectl delete secret -n watchlight ghcr-credentials --ignore-not-found
kubectl create secret docker-registry ghcr-credentials \
  --namespace watchlight \
  --docker-server=ghcr.io \
  --docker-username=YOUR_GITHUB_USERNAME \
  --docker-password=YOUR_GHCR_TOKEN
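To confirm that an existing secret targets the right registry, its Docker config can be decoded and inspected. A rough sketch with hypothetical function names; the check is a plain string match rather than a JSON parse, and the decoded output contains the token, so avoid printing it to shared terminals.

```shell
# Decode the Docker config stored in a docker-registry secret
# (data key ".dockerconfigjson").
pull_secret_config() {
  kubectl get secret -n "$1" "$2" -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
}

# Succeed if the config on stdin mentions the registry as a quoted string.
has_registry() {
  grep -q -F "\"$1\""
}
```

Usage: `pull_secret_config watchlight ghcr-credentials | has_registry ghcr.io || echo "secret does not reference ghcr.io"`.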

Secret not referenced in the Helm values:

Ensure global.imagePullSecrets is set:

global:
  imagePullSecrets:
    - name: ghcr-credentials

Wrong image tag:

Verify the tag exists in the registry:

# Using GitHub CLI
gh api orgs/watchlight-ai-beacon/packages/container/wl-registry/versions \
  --jq '.[].metadata.container.tags[]'

TLS Certificate Problems

Symptoms

Services return TLS errors, browsers show certificate warnings, or inter-service communication fails with certificate verify failed.

Ingress TLS

Verify the TLS secret exists and contains valid certificate data:

kubectl get secret -n watchlight <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -dates

Check that the ingress resource references the correct secret:

kubectl get ingress -n watchlight -o yaml | grep -A5 tls
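Reading the dates by eye works, but an expiry check can also be scripted with the `-checkend` option of `openssl x509`. A minimal sketch; `cert_valid_for_days` is a hypothetical helper name and the 30-day default is an arbitrary choice.

```shell
# Read a PEM certificate on stdin and succeed only if it remains valid
# for at least the given number of days (default 30).
cert_valid_for_days() {
  openssl x509 -noout -checkend "$(( ${1:-30} * 86400 ))" >/dev/null
}
```

Usage: `kubectl get secret -n watchlight <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | cert_valid_for_days 14 || echo "certificate expires within 14 days"`.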

wl-discover TLS to Registry

If wl-discover cannot connect to the registry over TLS, check:

  1. The CA certificate path is configured:

    wl-discover:
      tls:
        caCertPath: "/etc/ssl/certs/ca.crt"
  2. The CA certificate is mounted into the pod (via extraVolumes and extraVolumeMounts).

  3. Verify from inside the pod:

    kubectl exec -n watchlight <wl-discover-pod> -- \
    wget -q --spider https://beacon-wl-registry:8080/health

wl-discover eBPF Permission Failures

Symptoms

wl-discover logs show permission denied for BPF operations, or eBPF probes fail to attach.

Root Cause

Loading eBPF programs requires specific Linux capabilities, and the wl-discover container must run as root (UID 0). A container that runs as a non-root user receives zero effective capabilities even if the security context grants them.

Verify Security Context

kubectl get daemonset -n watchlight beacon-wl-discover -o jsonpath='{.spec.template.spec.containers[0].securityContext}' | jq .

Required settings:

securityContext:
  runAsUser: 0
  capabilities:
    add:
      - BPF
      - PERFMON
      - SYS_RESOURCE
    drop:
      - ALL
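The comparison against the required settings can be automated in a rough way. The sketch below reads the security context JSON from stdin and reports any of the three capabilities that do not appear; `missing_capabilities` is a hypothetical name, and because it is a plain string match it cannot tell the `add` list from the `drop` list.

```shell
# Read a container securityContext as JSON on stdin and print each
# required capability that does not appear in it.
missing_capabilities() {
  local ctx cap
  ctx="$(cat)"
  for cap in BPF PERFMON SYS_RESOURCE; do
    case "$ctx" in
      *"\"$cap\""*) ;;          # present as a quoted string
      *) echo "$cap" ;;         # absent, report it
    esac
  done
}
```

Pipe the output of the `kubectl get daemonset ... -o jsonpath=...` command above into `missing_capabilities`; no output means all three capabilities are listed.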

Verify Kernel Support

eBPF discovery requires Linux 5.8 or later with BTF (BPF Type Format) available. Check from a node:

kubectl debug node/<node-name> -it --image=busybox -- \
  ls /sys/kernel/btf/vmlinux

If the file does not exist, the node's kernel does not support BTF and eBPF discovery will not function. Kubernetes API-based discovery (kubernetes.enabled=true) will still work.
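The kernel version half of the requirement can be checked for all nodes at once, since each node reports its kernel release in its status. A minimal sketch with hypothetical function names; it compares only the major and minor numbers and does not check for BTF.

```shell
# Succeed if a kernel release string (as printed by `uname -r`)
# is at least MAJOR.MINOR.
kernel_at_least() {
  local want_major="${1%%.*}" want_minor="${1#*.}" release="$2"
  local major="${release%%.*}" rest="${release#*.}"
  local minor="${rest%%[!0-9]*}"   # digits up to the first non-digit
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "${minor:-0}" -ge "$want_minor" ]; }
}

# Print every node with its kernel release.
node_kernels() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion'
}
```

Usage: `node_kernels | while read -r node kernel; do kernel_at_least 5.8 "$kernel" || echo "$node: kernel $kernel is too old"; done`.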

Disable eBPF (Fallback)

If eBPF cannot be enabled, disable it and rely on Kubernetes API discovery:

wl-discover:
  ebpf:
    enabled: false
  hostNetwork: false
  hostPID: false
  securityContext:
    runAsUser: 10001
    runAsNonRoot: true
    capabilities:
      drop:
        - ALL

Networking Issues

Pods Cannot Reach Each Other

Verify services are created:

kubectl get svc -n watchlight

Test connectivity between services:

kubectl run -n watchlight debug --rm -it --image=busybox -- \
  wget -q -O- http://beacon-wl-registry:8080/health

Ingress Not Working

# Check ingress status
kubectl get ingress -n watchlight

# Verify the ingress controller is running
kubectl get pods -n ingress-nginx # or your controller's namespace

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50

Common kubectl Commands Reference

# List all Beacon resources
kubectl get all -n watchlight

# Watch pod status in real time
kubectl get pods -n watchlight -w

# Get resource usage (requires metrics-server)
kubectl top pods -n watchlight

# Execute a shell in a running pod
kubectl exec -n watchlight <pod-name> -it -- /bin/sh

# View ConfigMap contents
kubectl get configmap -n watchlight <configmap-name> -o yaml

# View Secret keys (names only, not values)
kubectl get secret -n watchlight <secret-name> -o jsonpath='{.data}' | jq 'keys'

# Restart a deployment (rolling restart)
kubectl rollout restart -n watchlight deployment/beacon-wl-registry

# Check rollout status
kubectl rollout status -n watchlight deployment/beacon-wl-registry

Getting Help

If the issue persists after following this guide:

  1. Collect a diagnostic bundle:

    kubectl get pods -n watchlight -o wide > diag-pods.txt
    kubectl get events -n watchlight --sort-by='.lastTimestamp' > diag-events.txt
    kubectl logs -n watchlight -l app.kubernetes.io/instance=beacon --all-containers --prefix --tail=200 > diag-logs.txt
  2. Contact Watchlight AI support at watchlight.ai/partner with the diagnostic output.
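The collection step can be wrapped in one function that writes the three files into a directory and archives it. This is a sketch under assumptions: the function name `collect_diag` and the `beacon-diag-<timestamp>` directory name are illustrative, not a Watchlight convention. Logs may contain sensitive values, so review the files before sending them.

```shell
# Collect pod status, events, and logs into a directory, archive it,
# and print the path of the resulting tarball.
collect_diag() {
  local ns="${1:-watchlight}" out="${2:-beacon-diag-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$out" || return 1
  kubectl get pods -n "$ns" -o wide > "$out/diag-pods.txt" 2>&1
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' > "$out/diag-events.txt" 2>&1
  # --prefix labels each log line with its pod and container name
  kubectl logs -n "$ns" -l app.kubernetes.io/instance=beacon \
    --all-containers --prefix --tail=200 > "$out/diag-logs.txt" 2>&1
  tar -czf "$out.tar.gz" "$out" && echo "$out.tar.gz"
}
```

Running `collect_diag` with no arguments uses the `watchlight` namespace and a timestamped directory in the current working directory.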