Kubernetes Troubleshooting Guide for Operators and Administrators

Most Kubernetes issues in production are caused by misconfiguration, not bugs. This guide gives you a systematic approach to diagnosing and fixing the most common problems in production clusters -- from pods that won't start to nodes that go NotReady and storage that won't bind.

The three golden rules:

  1. Events tell you what happened
  2. Logs tell you why it happened
  3. Resource state tells you what is configured

Troubleshooting Methodology

Every Kubernetes problem follows the same diagnostic loop:

1. OBSERVE   -- kubectl get (what is the current state?)
2. EVENTS    -- kubectl describe (what happened?)
3. LOGS      -- kubectl logs (what did components report?)
4. STATE     -- kubectl get -o yaml (full resource configuration)
5. DIAGNOSE  -- form a hypothesis
6. TEST      -- kubectl exec, debug, port-forward
7. RESOLVE   -- apply fix
8. VERIFY    -- confirm resolution
9. DOCUMENT  -- record and update runbooks
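The first four steps of the loop lend themselves to a small wrapper. A minimal sketch (the `triage` helper name and the exact command selection are assumptions, not a standard tool) that prints the observe/events/logs/state commands to run for a given pod:

```shell
# Hypothetical helper: print the observe/events/logs/state commands
# for a pod, in the order of the diagnostic loop above.
triage() {
  pod=$1
  ns=${2:-default}
  printf '%s\n' \
    "kubectl get pod $pod -n $ns -o wide" \
    "kubectl describe pod $pod -n $ns" \
    "kubectl logs $pod -n $ns --previous" \
    "kubectl get pod $pod -n $ns -o yaml"
}

triage web-7d4b9c payments
```

Piping the output to `sh` would execute the loop; printing first keeps the helper safe to run anywhere.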

Essential Tools

Tool               Purpose
kubectl get        List resource status
kubectl describe   Detailed info and events
kubectl logs       Container output
kubectl exec       Run commands inside containers
kubectl debug      Ephemeral debug container (1.23+)
kubectl top        CPU and memory metrics
kubectl events     Cluster event stream
kubectl explain    Inline resource documentation

Architecture Layers

Understanding which layer is failing halves your debugging time:

User / kubectl
     |
Control Plane        -- API Server, Scheduler, Controller Manager, etcd
     |
Networking Layer     -- CNI (Calico/Flannel/Cilium), kube-proxy, CoreDNS
     |
Node / Data Plane    -- Kubelet, containerd, Pods
     |
Storage Layer        -- CSI Driver, Volume Plugins
     |
Infrastructure       -- Cloud / Bare Metal

Pod and Container Issues

Pod Stuck in Pending

Symptoms: Pod never transitions to Running, READY shows 0/N, workload never starts.

Diagnostic steps:

# Step 1 -- check events (always start here)
kubectl describe pod <pod-name> -n <namespace>
# Read the Events section at the bottom

# Step 2 -- check node resources
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

# Step 3 -- check resource requests on the pod
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 resources:

# Step 4 -- check node selectors
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeSelector}'
kubectl get nodes --show-labels

# Step 5 -- check taints
kubectl describe node <node-name> | grep Taints
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'

# Step 6 -- check resource quotas
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota -n <namespace>

Common event messages and fixes:

Event Message                    Cause                 Fix
0/N nodes: Insufficient cpu      Not enough CPU        Scale cluster or reduce requests
0/N nodes: Insufficient memory   Not enough RAM        Add nodes or reduce requests
node(s) had taints               Pod lacks toleration  Add toleration to pod
didn't match node selector       Label mismatch        Fix node labels or pod nodeSelector
PVC not found                    PVC doesn't exist     Create the PVC

# Fix: label a node to match selector
kubectl label node <node-name> disktype=ssd

# Fix: remove taint from node
kubectl taint nodes <node-name> key1=value1:NoSchedule-

# Fix: add toleration to deployment
kubectl edit deployment <deployment-name> -n <namespace>
# Add under spec.template.spec:
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"

Pod Stuck in CrashLoopBackOff

Symptoms: Pod status shows CrashLoopBackOff, RESTARTS count climbing, application never available.

Diagnostic steps:

# Step 1 -- get previous crash logs (most important step)
kubectl logs <pod-name> -n <namespace> --previous

# Step 2 -- check exit code
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Exit codes:
# 1   = general application error
# 137 = 128+9  (SIGKILL -- usually OOMKilled)
# 139 = 128+11 (SIGSEGV -- segmentation fault)
# 143 = 128+15 (SIGTERM -- graceful termination requested)

# Step 3 -- check if OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"

# Step 4 -- check liveness probe config
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 livenessProbe

# Step 5 -- check env vars and config
kubectl exec <pod-name> -n <namespace> -- env
kubectl exec <pod-name> -n <namespace> -- cat /path/to/config

Fix: OOMKilled (exit code 137)

kubectl patch deployment <deployment-name> -n <namespace> --type='json' \
  -p='[
    {"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"1Gi"},
    {"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/memory","value":"512Mi"}
  ]'

Fix: liveness probe too aggressive

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give app time to start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
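With the settings above, the kubelet waits initialDelaySeconds before the first probe, then restarts the container only after failureThreshold consecutive failures. The worst-case time from container start to a probe-triggered restart, as shell arithmetic:

```shell
# Worst-case seconds from container start to a probe-triggered restart,
# using the probe values shown above.
initial_delay=30
period=10
failures=3
echo $(( initial_delay + failures * period ))   # prints: 60
```

If your application takes longer than this to come up, raise initialDelaySeconds (or use a startupProbe) before touching anything else.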

Fix: missing dependency (database not ready)

# Test connectivity from a throwaway debug pod
kubectl run test --image=busybox -it --rm -n <namespace> -- sh
# Then, inside the pod's shell:
wget -O- http://service-name:port
nslookup service-name

Fix: configuration error

# Verify ConfigMap and Secret exist
kubectl get configmap,secret -n <namespace>

# Fix config and rollout restart
kubectl edit configmap <name> -n <namespace>
kubectl rollout restart deployment <deployment-name> -n <namespace>

Pod Stuck in ImagePullBackOff

Symptoms: Pod never reaches Running, events show image pull errors.

# Check what image is being pulled
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# Check events for specific error
kubectl describe pod <pod-name> -n <namespace>

Fix: wrong image name or tag

kubectl set image deployment/<name> <container>=<correct-image>:<tag> -n <namespace>

Fix: private registry -- create pull secret

kubectl create secret docker-registry registry-secret \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>

# Reference the secret in the workload's pod template:
spec:
  template:
    spec:
      imagePullSecrets:
      - name: registry-secret

Fix: Docker Hub rate limiting

# Use authenticated pulls
kubectl create secret docker-registry dockerhub-secret \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>
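Rather than adding imagePullSecrets to every pod spec, the secret can be attached to the namespace's default ServiceAccount so all pods in that namespace inherit it. A sketch (apply with kubectl apply -f):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: <namespace>
imagePullSecrets:
- name: dockerhub-secret
```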

Container OOMKilled

Symptoms: Exit code 137, OOMKilled reason in describe output, pod restarts frequently.

# Confirm OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"

# Check current memory limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'

# Check live usage (if pod stays up long enough)
kubectl top pod <pod-name> -n <namespace>

Fix: increase memory limits

resources:
  limits:
    memory: "1Gi"
  requests:
    memory: "512Mi"

Fix: Java memory leak debugging

env:
- name: JAVA_OPTS
  value: "-Xmx512m -Xms256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
# Retrieve heap dump for analysis
kubectl cp <pod-name>:/tmp/heap-dump.hprof ./heap-dump.hprof -n <namespace>

Fix: horizontal autoscaling for traffic spikes

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
  namespace: <namespace>
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <deployment-name>
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70

Service and Networking Issues

Service Not Accessible

Symptoms: Cannot reach service from inside cluster, connection timeout, curl fails.

# Step 1 -- check service exists
kubectl get svc <service-name> -n <namespace>

# Step 2 -- check endpoints (critical)
kubectl get endpoints <service-name> -n <namespace>
# If ENDPOINTS shows <none>, selector doesn't match any pods

# Step 3 -- verify selector matches pod labels
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -l app=<label-value> -n <namespace> --show-labels

# Step 4 -- check pod readiness
kubectl get pods -l app=<label> -n <namespace>

# Step 5 -- test from inside the cluster
kubectl run test --image=nicolaka/netshoot -it --rm -n <namespace> -- bash
# Then, inside the pod's shell:
curl http://<service-name>:<port>
nslookup <service-name>

# Step 6 -- check if app is actually listening
kubectl exec <pod-name> -n <namespace> -- ss -tlnp
# Must show LISTEN on 0.0.0.0:<port>, not 127.0.0.1:<port>

Fix: selector mismatch

# Service selector must exactly match pod labels
spec:
  selector:
    app: myapp        # must match pod label
    version: v1       # all keys must match

Fix: port mismatch

spec:
  ports:
  - port: 80          # what clients connect to
    targetPort: 8080  # what the container listens on
    protocol: TCP

Fix: NetworkPolicy blocking traffic

# Allow all ingress temporarily for testing
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-ingress
  namespace: <namespace>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - {}

DNS Resolution Failures

Symptoms: Pods can't resolve service names, "Name or service not known" errors, can reach services by IP but not name.

# Test DNS from a pod
kubectl run test-dns --image=busybox -it --rm -- nslookup kubernetes.default
kubectl run test-dns --image=busybox -it --rm -- nslookup <service-name>.<namespace>

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# Check DNS config inside a pod
kubectl exec <pod-name> -n <namespace> -- cat /etc/resolv.conf
# Should show:
# nameserver 10.x.x.x
# search <namespace>.svc.cluster.local svc.cluster.local cluster.local

# Check kube-dns service
kubectl get svc -n kube-system kube-dns
kubectl get endpoints -n kube-system kube-dns
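The search line in resolv.conf is why a bare <service-name> resolves at all: for names with fewer dots than the ndots option (default 5), the resolver appends each search suffix in turn. A sketch of the candidate FQDNs tried for a short name, assuming namespace myns (names are illustrative):

```shell
# Candidate FQDNs the resolver tries for the short name "myservice"
# in namespace "myns", following the search suffixes in order.
name=myservice
for suffix in myns.svc.cluster.local svc.cluster.local cluster.local; do
  echo "$name.$suffix"
done
```

This is also why cross-namespace lookups need at least <service>.<namespace>: the first suffix only covers the pod's own namespace.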

Fix: CoreDNS not running

kubectl rollout restart deployment coredns -n kube-system
kubectl scale deployment coredns -n kube-system --replicas=2

Fix: NetworkPolicy blocking port 53

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: <namespace>
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Fix: wrong DNS policy on pod

spec:
  dnsPolicy: ClusterFirst  # default, recommended

Storage and PVC Issues

PVC Stuck in Pending

Symptoms: PVC status shows Pending, pod referencing it also stuck in Pending.

# Check PVC status and events
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

# Check StorageClass exists
kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.spec.storageClassName}'
kubectl get storageclass

# Check available PVs
kubectl get pv
kubectl get pv -o custom-columns=NAME:.metadata.name,CAPACITY:.spec.capacity.storage,ACCESS:.spec.accessModes,STATUS:.status.phase

# Check provisioner logs
kubectl get pods -n kube-system | grep csi
kubectl logs -n kube-system <csi-provisioner-pod> --tail=50

Fix: create a StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Fix: manually create a PV (static provisioning)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-001
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual
  hostPath:
    path: /mnt/data

Note: Most cloud block storage (EBS, GCP PD, Azure Disk) only supports ReadWriteOnce. Use NFS or cloud file storage for ReadWriteMany.
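To bind the static PV above, create a PVC whose storageClassName, accessModes, and requested size are compatible with it. A sketch (the claim name is an assumption):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
  namespace: <namespace>
spec:
  storageClassName: manual
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```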


Node Issues

Node in NotReady State

Symptoms: kubectl get nodes shows NotReady, pods not scheduling on node.

# Check node conditions
kubectl describe node <node-name>
# Look at Conditions section:
# Ready:          False
# MemoryPressure: True/False
# DiskPressure:   True/False
# PIDPressure:    True/False

# SSH to node and check kubelet
ssh <node-address>
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago"

# Check container runtime
systemctl status containerd
crictl ps

# Check disk
df -h && df -i

# Check memory
free -h

# Test API server connectivity from node
curl -k https://<api-server-ip>:6443/healthz

Fix: kubelet not running

sudo systemctl start kubelet
sudo systemctl enable kubelet

# If failing to start
journalctl -u kubelet --no-pager | tail -50
# Check /var/lib/kubelet/config.yaml and /etc/kubernetes/kubelet.conf
sudo systemctl restart kubelet

Fix: disk pressure

# Clean unused container images
crictl rmi --prune

# Remove exited containers (running ones are left alone)
crictl rm $(crictl ps -a -q --state exited)

# Clean logs
sudo journalctl --vacuum-time=2d
sudo find /var/log -name "*.log" -mtime +7 -delete
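To spot which filesystems are actually driving disk pressure, a quick filter over df output helps (the 85% threshold mirrors the kubelet's default imageGCHighThresholdPercent, but treat the exact number as an assumption for your cluster):

```shell
# Print mount points above 85% usage; $5 is the Use% column ("90%"),
# and +0 coerces it to a number for the comparison.
df -P | awk 'NR > 1 && $5+0 > 85 {print $6, $5}'
```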

Fix: certificate expired

# Renew certificates (kubeadm clusters)
kubeadm certs renew all
kubeadm certs check-expiration
# The control plane static pods must be restarted to load the new certs,
# e.g. by moving the manifests out of /etc/kubernetes/manifests and back,
# then restart the kubelet
systemctl restart kubelet

Fix: drain, repair, uncordon

# Safely drain a node before maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# After fixing the issue
kubectl uncordon <node-name>

Quick Reference: Symptom to Action

Pod Status Decision Tree

Pending
├── "Insufficient cpu/memory"     → scale cluster or reduce requests
├── "node(s) had taints"          → add tolerations to pod
├── "didn't match node selector"  → fix node labels or pod nodeSelector
└── "PVC not found/pending"       → fix PVC

CrashLoopBackOff
├── exit code 137 (OOMKilled)     → increase memory limits
├── application error in logs     → fix app config/code
├── missing dependency            → check service/secret availability
└── liveness probe failing        → increase initialDelaySeconds

ImagePullBackOff
├── "manifest unknown"            → fix image name or tag
├── "unauthorized"                → add imagePullSecrets
├── "rate limited"                → use authenticated pulls
└── network error                 → check registry connectivity

Running but not working
├── check logs                    → application-level error
├── check endpoints               → kubectl get endpoints svc-name
└── empty endpoints               → service selector mismatch

Service Connectivity Decision Tree

Cannot access service?
1. Does service exist?          → kubectl get svc
2. Does it have endpoints?      → kubectl get endpoints svc-name
   Empty → selector doesn't match pods
3. Are pods ready?              → kubectl get pods -l app=label
   Not ready → check readiness probe
4. Test from inside cluster     → kubectl run test --image=busybox -it --rm
5. Check NetworkPolicy          → kubectl get networkpolicy -n <namespace>
6. Is app listening on 0.0.0.0? → kubectl exec pod -- ss -tlnp

Essential kubectl Reference

Inspection

# List with useful flags
kubectl get pods -A                                    # all namespaces
kubectl get pods -o wide                               # show IPs and nodes
kubectl get pods --show-labels                         # show labels
kubectl get pods -l app=nginx                          # filter by label
kubectl get pods --sort-by=.metadata.creationTimestamp # sort by age
kubectl get pods -w                                    # watch mode

# Custom columns
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP

# All non-running pods
kubectl get pods -A -o jsonpath='{range .items[?(@.status.phase!="Running")]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'

Logs

kubectl logs <pod>                         # current logs
kubectl logs <pod> --previous              # after crash
kubectl logs <pod> -c <container>          # specific container
kubectl logs <pod> --tail=50               # last 50 lines
kubectl logs <pod> --since=1h              # last hour
kubectl logs <pod> -f                      # follow
kubectl logs -l app=nginx --all-containers # all pods matching label

Debugging

# Interactive shell
kubectl exec <pod> -it -- sh

# Port forward
kubectl port-forward <pod> 8080:80
kubectl port-forward svc/<service> 8080:80

# Ephemeral debug container (1.23+)
kubectl debug <pod> -it --image=nicolaka/netshoot --target=<container>

# Debug copy of pod
kubectl debug <pod> -it --copy-to=debug-pod --image=ubuntu

# Debug node
kubectl debug node/<node-name> -it --image=ubuntu

# Copy files
kubectl cp <pod>:/path/file ./local-file
kubectl cp ./local-file <pod>:/path/file

Editing and Patching

# Edit live resource
kubectl edit deployment <name>

# Scale
kubectl scale deployment <name> --replicas=3

# Set image
kubectl set image deployment/<name> container=image:tag

# Rollout commands
kubectl rollout status deployment/<name>
kubectl rollout history deployment/<name>
kubectl rollout undo deployment/<name>
kubectl rollout restart deployment/<name>

# JSON patch
kubectl patch deployment <name> --type='json' \
  -p='[{"op":"replace","path":"/spec/replicas","value":3}]'

Permissions Check

kubectl auth can-i create pods
kubectl auth can-i create pods --as=<serviceaccount>
kubectl auth can-i --list

Log Locations Reference

Control Plane (self-hosted / kubeadm)

kubectl logs -n kube-system kube-apiserver-<node>
kubectl logs -n kube-system kube-controller-manager-<node>
kubectl logs -n kube-system kube-scheduler-<node>
kubectl logs -n kube-system etcd-<node>

Control Plane (systemd)

journalctl -u kube-apiserver
journalctl -u kube-controller-manager
journalctl -u kube-scheduler
journalctl -u etcd

Node Components (on the node)

journalctl -u kubelet
journalctl -u containerd

Add-ons

# CoreDNS
kubectl logs -n kube-system -l k8s-app=kube-dns

# CNI (Calico example)
kubectl logs -n kube-system -l k8s-app=calico-node

# kube-proxy
kubectl logs -n kube-system -l k8s-app=kube-proxy

# NGINX Ingress Controller
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

Common Event Reasons

Reason                  Meaning                        Action
FailedScheduling        Pod can't be placed            Check resources, taints, selectors
BackOff                 Container crashing repeatedly  Check logs
Unhealthy               Health check failing           Fix probe or application
FailedMount             Volume mount failed            Check PVC, PV, StorageClass
FailedCreatePodSandBox  Network sandbox failed         Check CNI plugin
ImagePullBackOff        Image pull failing             Fix image name or auth
OOMKilled               Memory limit exceeded          Increase limits

Control Plane Health Checks

# API server health
kubectl get --raw='/healthz?verbose'
kubectl get --raw='/readyz?verbose'

# Control plane pods
kubectl get pods -n kube-system -l tier=control-plane

# etcd health (kubeadm clusters)
kubectl -n kube-system exec etcd-<node> -- etcdctl \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  endpoint health

# Certificate expiration
kubeadm certs check-expiration

Production Operations Checklist

Monitoring Alerts to Set Up

  • Node NotReady for more than 2 minutes
  • Pod in CrashLoopBackOff for more than 5 minutes
  • PVC usage above 85%
  • API server request latency above threshold
  • Certificate expiration within 30 days
  • etcd leader changes (unexpected)
  • Node CPU or memory above 85%
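The first two alerts can be expressed as Prometheus rules. A sketch, assuming kube-state-metrics is installed (the metric names below come from kube-state-metrics; verify them against your deployed version):

```yaml
groups:
- name: cluster-health
  rules:
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 2m
    labels:
      severity: critical
  - alert: PodCrashLooping
    expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
    for: 5m
    labels:
      severity: warning
```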

Incident Response Workflow

# 1. Assess scope
kubectl get pods -A | grep -v Running
kubectl get nodes
kubectl top nodes

# 2. Quick mitigations
kubectl rollout undo deployment/<name>
kubectl scale deployment/<name> --replicas=N
kubectl delete pod <crashing-pod>  # force reschedule

# 3. Check recent events
kubectl get events --sort-by='.lastTimestamp' -A | tail -30

Production Best Practices

  • Set resource requests and limits on all containers
  • Use readiness probes -- they prevent bad deploys from killing traffic
  • Use liveness probes -- but with generous initialDelaySeconds
  • Use PodDisruptionBudgets to maintain availability during node drains
  • Use NetworkPolicies -- default deny, explicitly allow
  • Rotate certificates automatically (set up renewal before 90-day expiry)
  • Use specific image tags -- never latest in production
  • Back up etcd regularly -- it is your entire cluster state
  • Reserve system resources in kubelet config so the node never fully overcommits

Helpful Third-Party Tools

Tool                  Purpose
k9s                   Terminal UI -- navigate cluster visually
stern                 Tail logs from multiple pods at once
kubectx / kubens      Fast context and namespace switching
netshoot              Network debugging container (DNS, curl, tcpdump)
kubectl-neat          Clean up YAML output for readability
Prometheus + Grafana  Metrics and dashboards
Loki + Grafana        Log aggregation
Jaeger                Distributed tracing

Resources