Troubleshooting

Quick Diagnostic Commands

# Overall pod status
kubectl get pods -n watchlight

# Events (recent scheduling/pull/crash issues)
kubectl get events -n watchlight --sort-by='.lastTimestamp'

# Describe a specific pod
kubectl describe pod -n watchlight <pod-name>

# Logs for a service
kubectl logs -n watchlight -l app.kubernetes.io/name=wl-registry --tail=100

# Follow logs in real time
kubectl logs -n watchlight -l app.kubernetes.io/name=wl-registry -f

Health Check Endpoints

Each service exposes a health endpoint. Use port-forwarding to verify:

Service	Port	Health Path	Command
wl-registry	8080	`/health`	`kubectl port-forward -n watchlight svc/beacon-wl-registry 8080:8080`
wl-apdp	8081	`/health`	`kubectl port-forward -n watchlight svc/beacon-wl-apdp 8081:8081`
wl-proxy	8079	`/`	`kubectl port-forward -n watchlight svc/beacon-wl-proxy 8079:8079`
wl-secrets-broker	8082	`/health`	`kubectl port-forward -n watchlight svc/beacon-wl-secrets-broker 8082:8082`
wl-registry-frontend	80	`/`	`kubectl port-forward -n watchlight svc/beacon-wl-registry-frontend 3001:80`

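Then, in a second terminal, request the health path on the local port (3001 for wl-registry-frontend). The health path already begins with a slash:

curl http://localhost:<local-port><health-path>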
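The port-forward and the request can also be combined into one step. The following is a minimal sketch, assuming `curl` is installed locally; `check_health` is a hypothetical helper name, not something shipped with Beacon, and the two-second wait is an arbitrary allowance for the tunnel to come up.

```shell
# Hypothetical helper (not part of Beacon): forward a service port,
# probe its health path once, then tear the tunnel down.
# Usage: check_health <service> <local-port> <service-port> <health-path>
# Example: check_health beacon-wl-registry 8080 8080 /health
check_health() {
  svc=$1 local_port=$2 svc_port=$3 path=$4

  # Start the tunnel in the background and remember its PID.
  kubectl port-forward -n watchlight "svc/$svc" "$local_port:$svc_port" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2  # assumed to be long enough for the tunnel to be ready

  # -f makes curl return non-zero on HTTP errors (4xx/5xx).
  if curl -fsS -o /dev/null "http://localhost:${local_port}${path}"; then
    echo "$svc: healthy"
    rc=0
  else
    echo "$svc: unhealthy"
    rc=1
  fi

  # Stop the tunnel; ignore errors if it already exited.
  kill "$pf_pid" 2>/dev/null || true
  wait "$pf_pid" 2>/dev/null || true
  return "$rc"
}
```

The function returns a non-zero status when the probe fails, so it can be chained across all services in the table.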

Pod Crash Loops

Symptoms

Pod status shows CrashLoopBackOff or repeated restarts.

Diagnosis

# Check the exit code and reason
kubectl describe pod -n watchlight <pod-name>

# View logs from the previous (crashed) container instance
kubectl logs -n watchlight <pod-name> --previous
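The exit code and reason can also be read directly from the pod status instead of scanning the `describe` output. The sketch below assumes the container of interest is the first one in the pod; `last_termination` and `explain_exit_code` are hypothetical helper names, and the explanations are general Linux conventions (codes above 128 are 128 plus the signal number), not Beacon-specific behavior.

```shell
# Hypothetical helpers (not part of Beacon).

# Print "<reason> <exitCode>" for the last termination of a pod's first container.
# Usage: last_termination <pod-name>
last_termination() {
  kubectl get pod -n watchlight "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}

# Map a container exit code to a likely meaning.
# Usage: explain_exit_code <code>
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    137) echo "killed by SIGKILL (often the OOM killer; look for reason OOMKilled)" ;;
    139) echo "segmentation fault (SIGSEGV)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)   echo "application error; check the logs" ;;
  esac
}
```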

Common Causes

Missing environment variables or secrets:

The pod logs show an error such as "required environment variable DATABASE_URL is not set". Verify that the database secret exists:

kubectl get secrets -n watchlight

Database migration failure:

If wl-registry crash-loops on startup, check for migration errors:

kubectl logs -n watchlight -l app.kubernetes.io/name=wl-registry --tail=50

A checksum mismatch means a migration file was modified after it had been applied. Never edit applied migrations; put the correction in a new migration file.

Insufficient resources:

If pods are OOMKilled, increase memory limits:

helm upgrade beacon ./deploy/helm/watchlight-beacon \
  --namespace watchlight \
  --set wl-registry.resources.limits.memory=1Gi
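Before raising limits, it can help to confirm which pods were actually OOMKilled. This is a minimal sketch that reads the last termination reason of every container in the namespace; `list_oomkilled` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of Beacon): print the names of pods in the
# watchlight namespace that have a container whose last termination
# reason was OOMKilled.
# Usage: list_oomkilled
list_oomkilled() {
  kubectl get pods -n watchlight \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
    awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

An empty result suggests the restarts have another cause, in which case a higher memory limit is unlikely to help.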

Database Connection Issues

Symptoms

wl-registry logs show "error connecting to database" or "connection refused".

Built-in PostgreSQL

Verify the PostgreSQL pod is running:

kubectl get pods -n watchlight -l app.kubernetes.io/name=postgresql

Check PostgreSQL logs:

kubectl logs -n watchlight -l app.kubernetes.io/name=postgresql --tail=50

Verify the service is reachable from within the cluster:

kubectl run -n watchlight debug --rm -it --image=postgres:16-alpine -- \
  psql "postgres://beacon@beacon-postgresql:5432/beacon"

Check that the password secret exists and that its value matches the password used in the connection settings:

kubectl get secret -n watchlight beacon-postgresql -o jsonpath='{.data.postgres-password}' | base64 -d

BYO Database (External)

Verify the connection string in your values:

helm get values beacon -n watchlight | grep -A2 externalDatabase

Test connectivity from inside the cluster:

kubectl run -n watchlight debug --rm -it --image=postgres:16-alpine -- \
  psql "$YOUR_DATABASE_URL"

Common issues:

Security group / firewall does not allow traffic from cluster nodes to the database.
IAM authentication requires the pod's service account to have the correct role bindings.
SSL mode may need to be specified in the connection string (for example, ?sslmode=require).
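For the SSL mode case, the parameter has to be appended with `?` or `&` depending on whether the connection string already has a query part. The sketch below shows one way to do that; `with_sslmode` is a hypothetical helper name, and it assumes the connection string is a URI rather than a keyword/value string.

```shell
# Hypothetical helper (not part of Beacon): append sslmode=require to a
# PostgreSQL connection URI unless an sslmode is already present.
# Usage: psql "$(with_sslmode "$YOUR_DATABASE_URL")"
with_sslmode() {
  case "$1" in
    *sslmode=*) printf '%s\n' "$1" ;;                  # already set, leave unchanged
    *\?*)       printf '%s\n' "$1&sslmode=require" ;;  # other parameters exist
    *)          printf '%s\n' "$1?sslmode=require" ;;  # no query part yet
  esac
}
```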

Image Pull Errors

Symptoms

Pod status shows ImagePullBackOff or ErrImagePull.

Diagnosis

kubectl describe pod -n watchlight <pod-name> | grep -A5 "Events"

Look for messages like unauthorized or not found.

Fixes

Missing or expired GHCR credentials:

# Verify the secret exists
kubectl get secret -n watchlight ghcr-credentials

# Recreate if needed
kubectl delete secret -n watchlight ghcr-credentials
kubectl create secret docker-registry ghcr-credentials \
  --namespace watchlight \
  --docker-server=ghcr.io \
  --docker-username=YOUR_GITHUB_USERNAME \
  --docker-password=YOUR_GHCR_TOKEN

Secret not referenced in the Helm values:

Ensure global.imagePullSecrets is set:

global:
  imagePullSecrets:
    - name: ghcr-credentials

Wrong image tag:

Verify the tag exists in the registry:

# Using GitHub CLI
gh api orgs/watchlight-ai-beacon/packages/container/wl-registry/versions \
  --jq '.[].metadata.container.tags[]'

TLS Certificate Problems

Symptoms

Services return TLS errors, browsers show certificate warnings, or inter-service communication fails with certificate verify failed.

Ingress TLS

Verify the TLS secret exists and contains valid certificate data:

kubectl get secret -n watchlight <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -dates
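The dates printed above have to be compared by eye. As a sketch, `openssl x509 -checkend` can turn that into a pass/fail check; `cert_valid_for_days` is a hypothetical helper name, and the 30-day threshold in the usage comment is only an example.

```shell
# Hypothetical helper (not part of Beacon): read a PEM certificate on stdin
# and succeed (exit 0) only if it remains valid for at least <days> more days.
# Usage:
#   kubectl get secret -n watchlight <tls-secret-name> -o jsonpath='{.data.tls\.crt}' \
#     | base64 -d | cert_valid_for_days 30 || echo "certificate expires within 30 days"
cert_valid_for_days() {
  # -checkend takes a number of seconds; 86400 seconds = 1 day.
  openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null
}
```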

Check that the ingress resource references the correct secret:

kubectl get ingress -n watchlight -o yaml | grep -A5 tls

wl-discover TLS to Registry

If wl-discover cannot connect to the registry over TLS, check:

The CA certificate path is configured:

wl-discover:
  tls:
    caCertPath: "/etc/ssl/certs/ca.crt"

The CA certificate is mounted into the pod (via extraVolumes and extraVolumeMounts).

Verify from inside the pod:

kubectl exec -n watchlight <wl-discover-pod> -- \
  wget -q --spider https://beacon-wl-registry:8080/health

wl-discover eBPF Permission Failures

Symptoms

wl-discover logs show permission denied for BPF operations, or eBPF probes fail to attach.

Root Cause

eBPF requires specific Linux capabilities, and the wl-discover container must run as root (UID 0). A non-root process does not receive the added capabilities in its effective set, even if the security context grants them.

Verify Security Context

kubectl get daemonset -n watchlight beacon-wl-discover -o jsonpath='{.spec.template.spec.containers[0].securityContext}' | jq .

Required settings:

securityContext:
  runAsUser: 0
  capabilities:
    add:
      - BPF
      - PERFMON
      - SYS_RESOURCE
    drop:
      - ALL

Verify Kernel Support

eBPF requires Linux 5.8+ with BTF. Check from a node:

kubectl debug node/<node-name> -it --image=busybox -- \
  ls /sys/kernel/btf/vmlinux

If the file does not exist, the node's kernel does not support BTF and eBPF discovery will not function. Kubernetes API-based discovery (kubernetes.enabled=true) will still work.
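To screen many nodes at once, the kernel version each node reports can be compared against the 5.8 minimum. This is a rough sketch only: a new enough kernel does not prove that BTF is enabled, so the `/sys/kernel/btf/vmlinux` check remains the deciding test. `kernel_at_least_5_8` and `check_node_kernels` are hypothetical helper names.

```shell
# Hypothetical helpers (not part of Beacon).

# Succeed if a kernel release string (as printed by `uname -r`) is 5.8 or newer.
# Usage: kernel_at_least_5_8 "5.15.0-91-generic"
kernel_at_least_5_8() {
  major=${1%%.*}            # text before the first dot
  rest=${1#*.}              # text after the first dot
  minor=${rest%%[!0-9]*}    # leading digits of the remainder
  [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "${minor:-0}" -ge 8 ]; }
}

# Print one line per node with the kernel version it reports and a verdict.
# Usage: check_node_kernels
check_node_kernels() {
  kubectl get nodes --no-headers \
    -o custom-columns=NODE:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion |
    while read -r node kernel; do
      if kernel_at_least_5_8 "$kernel"; then
        echo "$node $kernel ok"
      else
        echo "$node $kernel too old for eBPF discovery"
      fi
    done
}
```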

Disable eBPF (Fallback)

If eBPF cannot be enabled, disable it and rely on Kubernetes API discovery:

wl-discover:
  ebpf:
    enabled: false
  hostNetwork: false
  hostPID: false
  securityContext:
    runAsUser: 10001
    runAsNonRoot: true
    capabilities:
      drop:
        - ALL

Networking Issues

Pods Cannot Reach Each Other

Verify services are created:

kubectl get svc -n watchlight

Test connectivity between services:

kubectl run -n watchlight debug --rm -it --image=busybox -- \
  wget -q -O- http://beacon-wl-registry:8080/health
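If a Service exists but requests to it still fail, it may have no ready pods behind it. The sketch below lists Services in the namespace whose Endpoints object is empty; `services_without_endpoints` is a hypothetical helper name, and it assumes `kubectl get endpoints` prints `<none>` for an empty endpoint list, which may differ or produce a deprecation warning on newer clusters that favor EndpointSlices.

```shell
# Hypothetical helper (not part of Beacon): print the names of Services in
# the watchlight namespace that currently have no endpoints.
# Usage: services_without_endpoints
services_without_endpoints() {
  kubectl get endpoints -n watchlight --no-headers |
    awk '$2 == "<none>" { print $1 }'
}
```

A Service listed here usually points to a selector mismatch or to pods that are not passing their readiness checks.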

Ingress Not Working

# Check ingress status
kubectl get ingress -n watchlight

# Verify the ingress controller is running
kubectl get pods -n ingress-nginx  # or your controller's namespace

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50

Common kubectl Commands Reference

# List all Beacon resources
kubectl get all -n watchlight

# Watch pod status in real time
kubectl get pods -n watchlight -w

# Get resource usage (requires metrics-server)
kubectl top pods -n watchlight

# Execute a shell in a running pod
kubectl exec -n watchlight <pod-name> -it -- /bin/sh

# View ConfigMap contents
kubectl get configmap -n watchlight <configmap-name> -o yaml

# View Secret keys (names only, not values)
kubectl get secret -n watchlight <secret-name> -o json | jq '.data | keys'

# Restart a deployment (rolling restart)
kubectl rollout restart -n watchlight deployment/beacon-wl-registry

# Check rollout status
kubectl rollout status -n watchlight deployment/beacon-wl-registry

Getting Help

If the issue persists after following this guide:

Collect a diagnostic bundle:

kubectl get pods -n watchlight -o wide > diag-pods.txt
kubectl get events -n watchlight --sort-by='.lastTimestamp' > diag-events.txt
kubectl logs -n watchlight -l app.kubernetes.io/instance=beacon --all-containers --tail=200 > diag-logs.txt

Contact Watchlight AI support at watchlight.ai/partner with the diagnostic output.
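The three bundle commands above can be wrapped so that the output lands in one directory and one archive. This is a minimal sketch, assuming `tar` is available locally; `collect_diag` and the default directory name are hypothetical, not a format that support requires.

```shell
# Hypothetical helper (not part of Beacon): write the diagnostic files into
# a directory, archive it, and print the archive name.
# Usage: collect_diag [output-directory]
collect_diag() {
  out_dir=${1:-beacon-diag-$(date +%Y%m%d-%H%M%S)}
  mkdir -p "$out_dir" || return 1

  kubectl get pods -n watchlight -o wide > "$out_dir/diag-pods.txt"
  kubectl get events -n watchlight --sort-by='.lastTimestamp' > "$out_dir/diag-events.txt"
  kubectl logs -n watchlight -l app.kubernetes.io/instance=beacon \
    --all-containers --tail=200 > "$out_dir/diag-logs.txt"

  # Pack the directory and report where the archive was written.
  tar -czf "$out_dir.tar.gz" "$out_dir" && echo "$out_dir.tar.gz"
}
```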
