Kubernetes Troubleshooting Guide for Operators and Administrators

Most Kubernetes issues in production are caused by misconfiguration, not bugs. This guide gives you a systematic approach to diagnosing and fixing the most common problems in production clusters -- from pods that won't start to nodes that go NotReady and storage that won't bind.

The three golden rules:

  1. Events tell you what happened
  2. Logs tell you why it happened
  3. Resource state tells you what is configured

Troubleshooting Methodology

Every Kubernetes problem follows the same diagnostic loop:

1. OBSERVE   -- kubectl get (what is the current state?)
2. EVENTS    -- kubectl describe (what happened?)
3. LOGS      -- kubectl logs (what did components report?)
4. STATE     -- kubectl get -o yaml (full resource configuration)
5. DIAGNOSE  -- form a hypothesis
6. TEST      -- kubectl exec, debug, port-forward
7. RESOLVE   -- apply fix
8. VERIFY    -- confirm resolution
9. DOCUMENT  -- record and update runbooks
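The first four steps of the loop lend themselves to a small wrapper. A minimal sketch (the `triage` helper name and the exact command selection are assumptions, not a standard tool) that prints the observe/events/logs/state commands to run for a given pod:

```shell
# Hypothetical helper: print the observe/events/logs/state commands
# for a pod, in the order of the diagnostic loop above.
triage() {
  pod=$1
  ns=${2:-default}
  printf '%s\n' \
    "kubectl get pod $pod -n $ns -o wide" \
    "kubectl describe pod $pod -n $ns" \
    "kubectl logs $pod -n $ns --previous" \
    "kubectl get pod $pod -n $ns -o yaml"
}

triage web-7d4b9c payments
```

Piping the output to `sh` would execute the loop; printing first keeps the helper safe to run anywhere.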

Essential Tools

Tool               Purpose
kubectl get        List resource status
kubectl describe   Detailed info and events
kubectl logs       Container output
kubectl exec       Run commands inside containers
kubectl debug      Ephemeral debug container (1.23+)
kubectl top        CPU and memory metrics
kubectl events     Cluster event stream
kubectl explain    Inline resource documentation

Architecture Layers

Understanding which layer is failing halves your debugging time:

User / kubectl
     |
Control Plane        -- API Server, Scheduler, Controller Manager, etcd
     |
Networking Layer     -- CNI (Calico/Flannel/Cilium), kube-proxy, CoreDNS
     |
Node / Data Plane    -- Kubelet, containerd, Pods
     |
Storage Layer        -- CSI Driver, Volume Plugins
     |
Infrastructure       -- Cloud / Bare Metal

Pod and Container Issues

Pod Stuck in Pending

Symptoms: Pod never transitions to Running, READY shows 0/N, workload never starts.

Diagnostic steps:

# Step 1 -- check events (always start here)
kubectl describe pod <pod-name> -n <namespace>
# Read the Events section at the bottom

# Step 2 -- check node resources
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

# Step 3 -- check resource requests on the pod
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 resources:

# Step 4 -- check node selectors
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeSelector}'
kubectl get nodes --show-labels

# Step 5 -- check taints
kubectl describe node <node-name> | grep Taints
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'

# Step 6 -- check resource quotas
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota -n <namespace>

Common event messages and fixes:

Event Message                    Cause                 Fix
0/N nodes: Insufficient cpu      Not enough CPU        Scale cluster or reduce requests
0/N nodes: Insufficient memory   Not enough RAM        Add nodes or reduce requests
node(s) had taints               Pod lacks toleration  Add toleration to pod
didn't match node selector       Label mismatch        Fix node labels or pod nodeSelector
PVC not found                    PVC doesn't exist     Create the PVC

# Fix: label a node to match selector
kubectl label node <node-name> disktype=ssd

# Fix: remove taint from node
kubectl taint nodes <node-name> key1=value1:NoSchedule-

# Fix: add toleration to deployment
kubectl edit deployment <deployment-name> -n <namespace>
# Add under spec.template.spec:
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"

Pod Stuck in CrashLoopBackOff

Symptoms: Pod status shows CrashLoopBackOff, RESTARTS count climbing, application never available.

Diagnostic steps:

# Step 1 -- get previous crash logs (most important step)
kubectl logs <pod-name> -n <namespace> --previous

# Step 2 -- check exit code
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Exit codes:
# 1   = general application error
# 137 = 128+9  (SIGKILL -- usually OOMKilled)
# 139 = 128+11 (SIGSEGV -- segmentation fault)
# 143 = 128+15 (SIGTERM -- graceful termination requested)

# Step 3 -- check if OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"

# Step 4 -- check liveness probe config
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 livenessProbe

# Step 5 -- check env vars and config
kubectl exec <pod-name> -n <namespace> -- env
kubectl exec <pod-name> -n <namespace> -- cat /path/to/config

Fix: OOMKilled (exit code 137)

kubectl patch deployment <deployment-name> -n <namespace> --type='json' \
  -p='[
    {"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"1Gi"},
    {"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/memory","value":"512Mi"}
  ]'

Fix: liveness probe too aggressive

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give app time to start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
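With the settings above, the kubelet waits initialDelaySeconds before the first probe, then restarts the container only after failureThreshold consecutive failures. The worst-case time from container start to a probe-triggered restart, as shell arithmetic:

```shell
# Worst-case seconds from container start to a probe-triggered restart,
# using the probe values shown above.
initial_delay=30
period=10
failures=3
echo $(( initial_delay + failures * period ))   # prints: 60
```

If your application takes longer than this to come up, raise initialDelaySeconds (or use a startupProbe) before touching anything else.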

Fix: missing dependency (database not ready)

# Test connectivity from a throwaway debug pod
kubectl run test --image=busybox -it --rm -n <namespace> -- sh
# Then, inside the pod's shell:
wget -O- http://service-name:port
nslookup service-name

Fix: configuration error

# Verify ConfigMap and Secret exist
kubectl get configmap,secret -n <namespace>

# Fix config and rollout restart
kubectl edit configmap <name> -n <namespace>
kubectl rollout restart deployment <deployment-name> -n <namespace>

Pod Stuck in ImagePullBackOff

Symptoms: Pod never reaches Running, events show image pull errors.

# Check what image is being pulled
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# Check events for specific error
kubectl describe pod <pod-name> -n <namespace>

Fix: wrong image name or tag

kubectl set image deployment/<name> <container>=<correct-image>:<tag> -n <namespace>

Fix: private registry -- create pull secret

kubectl create secret docker-registry registry-secret \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>

# Reference the secret in the workload's pod template:
spec:
  template:
    spec:
      imagePullSecrets:
      - name: registry-secret

Fix: Docker Hub rate limiting

# Use authenticated pulls
kubectl create secret docker-registry dockerhub-secret \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>
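Rather than adding imagePullSecrets to every pod spec, the secret can be attached to the namespace's default ServiceAccount so all pods in that namespace inherit it. A sketch (apply with kubectl apply -f):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: <namespace>
imagePullSecrets:
- name: dockerhub-secret
```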

Container OOMKilled

Symptoms: Exit code 137, OOMKilled reason in describe output, pod restarts frequently.

# Confirm OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"

# Check current memory limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'

# Check live usage (if pod stays up long enough)
kubectl top pod <pod-name> -n <namespace>

Fix: increase memory limits

resources:
  limits:
    memory: "1Gi"
  requests:
    memory: "512Mi"

Fix: Java memory leak debugging

env:
- name: JAVA_OPTS
  value: "-Xmx512m -Xms256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
# Retrieve heap dump for analysis
kubectl cp <pod-name>:/tmp/heap-dump.hprof ./heap-dump.hprof -n <namespace>

Fix: horizontal autoscaling for traffic spikes

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
  namespace: <namespace>
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <deployment-name>
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70

Service and Networking Issues

Service Not Accessible

Symptoms: Cannot reach service from inside cluster, connection timeout, curl fails.

# Step 1 -- check service exists
kubectl get svc <service-name> -n <namespace>

# Step 2 -- check endpoints (critical)
kubectl get endpoints <service-name> -n <namespace>
# If ENDPOINTS shows <none>, selector doesn't match any pods

# Step 3 -- verify selector matches pod labels
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -l app=<label-value> -n <namespace> --show-labels

# Step 4 -- check pod readiness
kubectl get pods -l app=<label> -n <namespace>

# Step 5 -- test from inside the cluster
kubectl run test --image=nicolaka/netshoot -it --rm -n <namespace> -- bash
# Then, inside the pod's shell:
curl http://<service-name>:<port>
nslookup <service-name>

# Step 6 -- check if app is actually listening
kubectl exec <pod-name> -n <namespace> -- ss -tlnp
# Must show LISTEN on 0.0.0.0:<port>, not 127.0.0.1:<port>

Fix: selector mismatch

# Service selector must exactly match pod labels
spec:
  selector:
    app: myapp        # must match pod label
    version: v1       # all keys must match

Fix: port mismatch

spec:
  ports:
  - port: 80          # what clients connect to
    targetPort: 8080  # what the container listens on
    protocol: TCP

Fix: NetworkPolicy blocking traffic

# Allow all ingress temporarily for testing
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-ingress
  namespace: <namespace>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - {}

DNS Resolution Failures

Symptoms: Pods can't resolve service names, "Name or service not known" errors, can reach services by IP but not name.

# Test DNS from a pod
kubectl run test-dns --image=busybox -it --rm -- nslookup kubernetes.default
kubectl run test-dns --image=busybox -it --rm -- nslookup <service-name>.<namespace>

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# Check DNS config inside a pod
kubectl exec <pod-name> -n <namespace> -- cat /etc/resolv.conf
# Should show:
# nameserver 10.x.x.x
# search <namespace>.svc.cluster.local svc.cluster.local cluster.local

# Check kube-dns service
kubectl get svc -n kube-system kube-dns
kubectl get endpoints -n kube-system kube-dns
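The search line in resolv.conf is why a bare <service-name> resolves at all: for names with fewer dots than the ndots option (default 5), the resolver appends each search suffix in turn. A sketch of the candidate FQDNs tried for a short name, assuming namespace myns (names are illustrative):

```shell
# Candidate FQDNs the resolver tries for the short name "myservice"
# in namespace "myns", following the search suffixes in order.
name=myservice
for suffix in myns.svc.cluster.local svc.cluster.local cluster.local; do
  echo "$name.$suffix"
done
```

This is also why cross-namespace lookups need at least <service>.<namespace>: the first suffix only covers the pod's own namespace.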

Fix: CoreDNS not running

kubectl rollout restart deployment coredns -n kube-system
kubectl scale deployment coredns -n kube-system --replicas=2

Fix: NetworkPolicy blocking port 53

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: <namespace>
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Fix: wrong DNS policy on pod

spec:
  dnsPolicy: ClusterFirst  # default, recommended

Storage and PVC Issues

PVC Stuck in Pending

Symptoms: PVC status shows Pending, pod referencing it also stuck in Pending.

# Check PVC status and events
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

# Check StorageClass exists
kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.spec.storageClassName}'
kubectl get storageclass

# Check available PVs
kubectl get pv
kubectl get pv -o custom-columns=NAME:.metadata.name,CAPACITY:.spec.capacity.storage,ACCESS:.spec.accessModes,STATUS:.status.phase

# Check provisioner logs
kubectl get pods -n kube-system | grep csi
kubectl logs -n kube-system <csi-provisioner-pod> --tail=50

Fix: create a StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Fix: manually create a PV (static provisioning)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-001
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual
  hostPath:
    path: /mnt/data

Note: Most cloud block storage (EBS, GCP PD, Azure Disk) only supports ReadWriteOnce. Use NFS or cloud file storage for ReadWriteMany.
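To bind the static PV above, create a PVC whose storageClassName, accessModes, and requested size are compatible with it. A sketch (the claim name is an assumption):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
  namespace: <namespace>
spec:
  storageClassName: manual
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```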


Node Issues

Node in NotReady State

Symptoms: kubectl get nodes shows NotReady, pods not scheduling on node.

# Check node conditions
kubectl describe node <node-name>
# Look at Conditions section:
# Ready:          False
# MemoryPressure: True/False
# DiskPressure:   True/False
# PIDPressure:    True/False

# SSH to node and check kubelet
ssh <node-address>
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago"

# Check container runtime
systemctl status containerd
crictl ps

# Check disk
df -h && df -i

# Check memory
free -h

# Test API server connectivity from node
curl -k https://<api-server-ip>:6443/healthz

Fix: kubelet not running

sudo systemctl start kubelet
sudo systemctl enable kubelet

# If failing to start
journalctl -u kubelet --no-pager | tail -50
# Check /var/lib/kubelet/config.yaml and /etc/kubernetes/kubelet.conf
sudo systemctl restart kubelet

Fix: disk pressure

# Clean unused container images
crictl rmi --prune

# Remove exited containers (running ones are left alone)
crictl rm $(crictl ps -a -q --state exited)

# Clean logs
sudo journalctl --vacuum-time=2d
sudo find /var/log -name "*.log" -mtime +7 -delete
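To spot which filesystems are actually driving disk pressure, a quick filter over df output helps (the 85% threshold mirrors the kubelet's default imageGCHighThresholdPercent, but treat the exact number as an assumption for your cluster):

```shell
# Print mount points above 85% usage; $5 is the Use% column ("90%"),
# and +0 coerces it to a number for the comparison.
df -P | awk 'NR > 1 && $5+0 > 85 {print $6, $5}'
```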

Fix: certificate expired

# Renew certificates (kubeadm clusters)
kubeadm certs renew all
kubeadm certs check-expiration
# The control plane static pods must be restarted to load the new certs,
# e.g. by moving the manifests out of /etc/kubernetes/manifests and back,
# then restart the kubelet
systemctl restart kubelet

Fix: drain, repair, uncordon

# Safely drain a node before maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# After fixing the issue
kubectl uncordon <node-name>

Quick Reference: Symptom to Action

Pod Status Decision Tree

Pending
├── "Insufficient cpu/memory"     → scale cluster or reduce requests
├── "node(s) had taints"          → add tolerations to pod
├── "didn't match node selector"  → fix node labels or pod nodeSelector
└── "PVC not found/pending"       → fix PVC

CrashLoopBackOff
├── exit code 137 (OOMKilled)     → increase memory limits
├── application error in logs     → fix app config/code
├── missing dependency            → check service/secret availability
└── liveness probe failing        → increase initialDelaySeconds

ImagePullBackOff
├── "manifest unknown"            → fix image name or tag
├── "unauthorized"                → add imagePullSecrets
├── "rate limited"                → use authenticated pulls
└── network error                 → check registry connectivity

Running but not working
├── check logs                    → application-level error
├── check endpoints               → kubectl get endpoints svc-name
└── empty endpoints               → service selector mismatch

Service Connectivity Decision Tree

Cannot access service?
1. Does service exist?          → kubectl get svc
2. Does it have endpoints?      → kubectl get endpoints svc-name
   Empty → selector doesn't match pods
3. Are pods ready?              → kubectl get pods -l app=label
   Not ready → check readiness probe
4. Test from inside cluster     → kubectl run test --image=busybox -it --rm
5. Check NetworkPolicy          → kubectl get networkpolicy -n <namespace>
6. Is app listening on 0.0.0.0? → kubectl exec pod -- ss -tlnp

Essential kubectl Reference

Inspection

# List with useful flags
kubectl get pods -A                                    # all namespaces
kubectl get pods -o wide                               # show IPs and nodes
kubectl get pods --show-labels                         # show labels
kubectl get pods -l app=nginx                          # filter by label
kubectl get pods --sort-by=.metadata.creationTimestamp # sort by age
kubectl get pods -w                                    # watch mode

# Custom columns
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP

# All non-running pods
kubectl get pods -A -o jsonpath='{range .items[?(@.status.phase!="Running")]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'

Logs

kubectl logs <pod>                         # current logs
kubectl logs <pod> --previous              # after crash
kubectl logs <pod> -c <container>          # specific container
kubectl logs <pod> --tail=50               # last 50 lines
kubectl logs <pod> --since=1h              # last hour
kubectl logs <pod> -f                      # follow
kubectl logs -l app=nginx --all-containers # all pods matching label

Debugging

# Interactive shell
kubectl exec <pod> -it -- sh

# Port forward
kubectl port-forward <pod> 8080:80
kubectl port-forward svc/<service> 8080:80

# Ephemeral debug container (1.23+)
kubectl debug <pod> -it --image=nicolaka/netshoot --target=<container>

# Debug copy of pod
kubectl debug <pod> -it --copy-to=debug-pod --image=ubuntu

# Debug node
kubectl debug node/<node-name> -it --image=ubuntu

# Copy files
kubectl cp <pod>:/path/file ./local-file
kubectl cp ./local-file <pod>:/path/file

Editing and Patching

# Edit live resource
kubectl edit deployment <name>

# Scale
kubectl scale deployment <name> --replicas=3

# Set image
kubectl set image deployment/<name> container=image:tag

# Rollout commands
kubectl rollout status deployment/<name>
kubectl rollout history deployment/<name>
kubectl rollout undo deployment/<name>
kubectl rollout restart deployment/<name>

# JSON patch
kubectl patch deployment <name> --type='json' \
  -p='[{"op":"replace","path":"/spec/replicas","value":3}]'

Permissions Check

kubectl auth can-i create pods
kubectl auth can-i create pods --as=<serviceaccount>
kubectl auth can-i --list

Log Locations Reference

Control Plane (self-hosted / kubeadm)

kubectl logs -n kube-system kube-apiserver-<node>
kubectl logs -n kube-system kube-controller-manager-<node>
kubectl logs -n kube-system kube-scheduler-<node>
kubectl logs -n kube-system etcd-<node>

Control Plane (systemd)

journalctl -u kube-apiserver
journalctl -u kube-controller-manager
journalctl -u kube-scheduler
journalctl -u etcd

Node Components (on the node)

journalctl -u kubelet
journalctl -u containerd

Add-ons

# CoreDNS
kubectl logs -n kube-system -l k8s-app=kube-dns

# CNI (Calico example)
kubectl logs -n kube-system -l k8s-app=calico-node

# kube-proxy
kubectl logs -n kube-system -l k8s-app=kube-proxy

# NGINX Ingress Controller
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

Common Event Reasons

Reason                  Meaning                        Action
FailedScheduling        Pod can't be placed            Check resources, taints, selectors
BackOff                 Container crashing repeatedly  Check logs
Unhealthy               Health check failing           Fix probe or application
FailedMount             Volume mount failed            Check PVC, PV, StorageClass
FailedCreatePodSandBox  Network sandbox failed         Check CNI plugin
ImagePullBackOff        Image pull failing             Fix image name or auth
OOMKilled               Memory limit exceeded          Increase limits

Control Plane Health Checks

# API server health
kubectl get --raw='/healthz?verbose'
kubectl get --raw='/readyz?verbose'

# Control plane pods
kubectl get pods -n kube-system -l tier=control-plane

# etcd health (kubeadm clusters)
kubectl -n kube-system exec etcd-<node> -- etcdctl \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  endpoint health

# Certificate expiration
kubeadm certs check-expiration

Production Operations Checklist

Monitoring Alerts to Set Up

  • Node NotReady for more than 2 minutes
  • Pod in CrashLoopBackOff for more than 5 minutes
  • PVC usage above 85%
  • API server request latency above threshold
  • Certificate expiration within 30 days
  • etcd leader changes (unexpected)
  • Node CPU or memory above 85%
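The first two alerts can be expressed as Prometheus rules. A sketch, assuming kube-state-metrics is installed (the metric names below come from kube-state-metrics; verify them against your deployed version):

```yaml
groups:
- name: cluster-health
  rules:
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 2m
    labels:
      severity: critical
  - alert: PodCrashLooping
    expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
    for: 5m
    labels:
      severity: warning
```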

Incident Response Workflow

# 1. Assess scope
kubectl get pods -A | grep -v Running
kubectl get nodes
kubectl top nodes

# 2. Quick mitigations
kubectl rollout undo deployment/<name>
kubectl scale deployment/<name> --replicas=N
kubectl delete pod <crashing-pod>  # force reschedule

# 3. Check recent events
kubectl get events --sort-by='.lastTimestamp' -A | tail -30

Production Best Practices

  • Set resource requests and limits on all containers
  • Use readiness probes -- they prevent bad deploys from killing traffic
  • Use liveness probes -- but with generous initialDelaySeconds
  • Use PodDisruptionBudgets to maintain availability during node drains
  • Use NetworkPolicies -- default deny, explicitly allow
  • Rotate certificates automatically (set up renewal before 90-day expiry)
  • Use specific image tags -- never latest in production
  • Back up etcd regularly -- it is your entire cluster state
  • Reserve system resources in kubelet config so the node never fully overcommits

Helpful Third-Party Tools

Tool                  Purpose
k9s                   Terminal UI -- navigate cluster visually
stern                 Tail logs from multiple pods at once
kubectx / kubens      Fast context and namespace switching
netshoot              Network debugging container (DNS, curl, tcpdump)
kubectl-neat          Clean up YAML output for readability
Prometheus + Grafana  Metrics and dashboards
Loki + Grafana        Log aggregation
Jaeger                Distributed tracing

Resources