Kubernetes Troubleshooting Guide for Operators and Administrators
A comprehensive, production-focused Kubernetes troubleshooting guide covering pods, services, storage, nodes, networking, RBAC, and control plane issues, with runbook-style diagnostic steps and resolution strategies. This guide gives you a systematic approach to diagnosing and fixing the most common problems in production clusters -- from pods that won't start to nodes that disappear and storage that won't bind.

The three golden rules:
1. Kubernetes issues are almost always caused by misconfiguration, not bugs.
2. Every Kubernetes problem follows the same diagnostic loop (see Troubleshooting Methodology).
3. Understanding which layer is failing halves your debugging time (see Architecture Layers).

Symptom index:
- Pod stuck in Pending -- pod never transitions to Running, READY shows 0/N, workload never starts.
- CrashLoopBackOff -- RESTARTS count climbing, application never available. Fixes: OOMKilled (exit code 137), liveness probe too aggressive, missing dependency (database not ready), configuration error.
- ImagePullBackOff -- pod never reaches Running, events show image pull errors. Fixes: wrong image name or tag, private registry pull secret, Docker Hub rate limiting.
- Container OOMKilled -- exit code 137. Fixes: increase memory limits, Java memory leak debugging, horizontal autoscaling for traffic spikes.
- Service not accessible -- cannot reach the service from inside the cluster, connection timeouts, curl fails. Fixes: selector mismatch, port mismatch, NetworkPolicy blocking traffic.
- DNS resolution failures -- pods can't resolve service names, "Name or service not known" errors, services reachable by IP but not by name. Fixes: CoreDNS not running, NetworkPolicy blocking port 53, wrong DNS policy on pod.
- PVC stuck in Pending -- pod referencing it also stuck in Pending. Fixes: create a StorageClass, manually create a PV (static provisioning). Note: most cloud block storage (EBS, GCP PD, Azure Disk) only supports ReadWriteOnce.
- Node NotReady. Fixes: kubelet not running, disk pressure, certificate expired; drain, repair, uncordon.

Troubleshooting Methodology
1. OBSERVE -- kubectl get (what is the current state?)
2. EVENTS -- kubectl describe (what happened?)
3. LOGS -- kubectl logs (what did components report?)
4. STATE -- kubectl get -o yaml (full resource configuration)
5. DIAGNOSE -- form a hypothesis
6. TEST -- kubectl exec, debug, port-forward
7. RESOLVE -- apply fix
8. VERIFY -- confirm resolution
9. DOCUMENT -- record and update runbooks

Essential Tools
Tool -- Purpose
kubectl get -- List resource status
kubectl describe -- Detailed info and events
kubectl logs -- Container output
kubectl exec -- Run commands inside containers
kubectl debug -- Ephemeral debug container (1.23+)
kubectl top -- CPU and memory metrics
kubectl events -- Cluster event stream
kubectl explain -- Inline resource documentation

Architecture Layers
User / kubectl
|
Control Plane -- API Server, Scheduler, Controller Manager, etcd
|
Networking Layer -- CNI (Calico/Flannel/Cilium), kube-proxy, CoreDNS
|
Node / Data Plane -- Kubelet, containerd, Pods
|
Storage Layer -- CSI Driver, Volume Plugins
|
Infrastructure -- Cloud / Bare Metal

Pod and Container Issues
Pod Stuck in Pending
# Step 1 -- check events (always start here)
kubectl describe pod <pod-name> -n <namespace>
# Read the Events section at the bottom
# Step 2 -- check node resources
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
# Step 3 -- check resource requests on the pod
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 resources:
# Step 4 -- check node selectors
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeSelector}'
kubectl get nodes --show-labels
# Step 5 -- check taints
kubectl describe node <node-name> | grep Taints
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'
# Step 6 -- check resource quotas
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota -n <namespace>

Event Message -- Cause -- Fix
0/N nodes: Insufficient cpu -- Not enough CPU -- Scale cluster or reduce requests
0/N nodes: Insufficient memory -- Not enough RAM -- Add nodes or reduce requests
node(s) had taints -- Pod lacks toleration -- Add toleration to pod
didn't match node selector -- Label mismatch -- Fix node labels or pod nodeSelector
PVC not found -- PVC doesn't exist -- Create the PVC

# Fix: label a node to match selector
kubectl label node <node-name> disktype=ssd
# Fix: remove taint from node (the trailing "-" removes the taint)
kubectl taint nodes <node-name> key1=value1:NoSchedule-
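The effect after the colon determines how strictly a taint is enforced. As a quick reference, here is an illustrative helper (`taint_effect` is not a kubectl command, just a summary in shell form):

```shell
# Illustrative only -- summarizes what each taint effect does to scheduling.
taint_effect() {
  case "$1" in
    NoSchedule)       echo "new pods without a matching toleration are not scheduled" ;;
    PreferNoSchedule) echo "scheduler avoids the node when it can" ;;
    NoExecute)        echo "running pods without a matching toleration are evicted" ;;
    *)                echo "unknown effect" ;;
  esac
}

taint_effect NoExecute   # -> running pods without a matching toleration are evicted
```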
# Fix: add toleration to deployment
kubectl edit deployment <deployment-name> -n <namespace>

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"

Pod Stuck in CrashLoopBackOff
# Step 1 -- get previous crash logs (most important step)
kubectl logs <pod-name> -n <namespace> --previous
# Step 2 -- check exit code
kubectl get pod <pod-name> -n <namespace> \
-o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Exit codes:
# 1 = general application error
# 137 = SIGKILL (OOMKilled)
# 139 = segfault
# 143 = SIGTERM
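The exit-code table above lends itself to a tiny triage helper; `explain_exit_code` is a hypothetical convenience function for scripts, not part of kubectl:

```shell
# Map common container exit codes to their usual cause (hypothetical helper).
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "general application error" ;;
    137) echo "SIGKILL (OOMKilled)" ;;
    139) echo "segfault" ;;
    143) echo "SIGTERM" ;;
    *)   echo "unknown exit code: $1" ;;
  esac
}

# Feed it the code extracted by the jsonpath query above:
explain_exit_code 137   # -> SIGKILL (OOMKilled)
```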
# Step 3 -- check if OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"
# Step 4 -- check liveness probe config
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 livenessProbe
# Step 5 -- check env vars and config
kubectl exec <pod-name> -n <namespace> -- env
kubectl exec <pod-name> -n <namespace> -- cat /path/to/config

# Fix: OOMKilled (exit code 137)
kubectl patch deployment <deployment-name> -n <namespace> --type='json' \
  -p='[
    {"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"1Gi"},
    {"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/memory","value":"512Mi"}
  ]'

# Fix: liveness probe too aggressive
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give app time to start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

# Fix: missing dependency (database not ready)
# Test connectivity from debug pod
kubectl run test --image=busybox -it --rm -n <namespace> -- sh
wget -O- http://service-name:port
nslookup service-name

# Fix: configuration error
# Verify ConfigMap and Secret exist
kubectl get configmap,secret -n <namespace>
# Fix config and rollout restart
kubectl edit configmap <name> -n <namespace>
kubectl rollout restart deployment <deployment-name> -n <namespace>

Pod Stuck in ImagePullBackOff
# Check what image is being pulled
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'
# Check events for specific error
kubectl describe pod <pod-name> -n <namespace>

# Fix: wrong image name or tag
kubectl set image deployment/<name> <container>=<correct-image>:<tag> -n <namespace>

# Fix: private registry -- create pull secret
kubectl create secret docker-registry registry-secret \
--docker-server=<registry-url> \
--docker-username=<username> \
--docker-password=<password> \
  -n <namespace>

# Reference the secret in the pod spec
spec:
  template:
    spec:
      imagePullSecrets:
      - name: registry-secret

# Fix: Docker Hub rate limiting -- use authenticated pulls
kubectl create secret docker-registry dockerhub-secret \
--docker-server=https://index.docker.io/v1/ \
--docker-username=<user> \
--docker-password=<password> \
  -n <namespace>

Container OOMKilled
Symptoms: OOMKilled reason in describe output, pod restarts frequently.

# Confirm OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"
# Check current memory limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
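When comparing the limits returned above with `kubectl top` output, remember the suffixes are binary (Ki/Mi/Gi), not decimal. A rough converter, assuming plain integer quantities (decimal suffixes like K/M/G are not handled here):

```shell
# Convert a Kubernetes binary memory quantity (Ki/Mi/Gi) to bytes.
# Illustrative helper; assumes integer values with binary suffixes only.
to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *)   echo "$1" ;;
  esac
}

to_bytes 1Gi     # -> 1073741824
to_bytes 512Mi   # -> 536870912
```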
# Check live usage (if pod stays up long enough)
kubectl top pod <pod-name> -n <namespace>

# Fix: increase memory limits
resources:
  limits:
    memory: "1Gi"
  requests:
    memory: "512Mi"

# Fix: Java memory leak debugging
env:
- name: JAVA_OPTS
  value: "-Xmx512m -Xms256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"

# Retrieve heap dump for analysis
kubectl cp <pod-name>:/tmp/heap-dump.hprof ./heap-dump.hprof -n <namespace>

# Fix: horizontal autoscaling for traffic spikes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
  namespace: <namespace>
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <deployment-name>
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70

Service and Networking Issues
Service Not Accessible
# Step 1 -- check service exists
kubectl get svc <service-name> -n <namespace>
# Step 2 -- check endpoints (critical)
kubectl get endpoints <service-name> -n <namespace>
# If ENDPOINTS shows <none>, selector doesn't match any pods
# Step 3 -- verify selector matches pod labels
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -l app=<label-value> -n <namespace> --show-labels
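The matching rule itself is simple: every key/value pair in the service selector must appear in the pod's labels; extra pod labels are ignored. A rough sketch of that rule in shell (illustrative only -- not how the endpoints controller is actually implemented):

```shell
# Return success if every k=v pair in $1 (selector) appears in $2 (pod labels).
# Both arguments are comma-separated k=v lists; illustrative helper.
selector_matches() {
  for kv in $(echo "$1" | tr ',' ' '); do
    case ",$2," in
      *",$kv,"*) ;;
      *) return 1 ;;
    esac
  done
  return 0
}

selector_matches "app=myapp,version=v1" "app=myapp,version=v1,pod-hash=abc" && echo "endpoints populated"
selector_matches "app=myapp,version=v2" "app=myapp,version=v1" || echo "ENDPOINTS would show <none>"
```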
# Step 4 -- check pod readiness
kubectl get pods -l app=<label> -n <namespace>
# Step 5 -- test from inside cluster
kubectl run test --image=nicolaka/netshoot -it --rm -n <namespace> -- bash
curl http://<service-name>:<port>
nslookup <service-name>
# Step 6 -- check if app is actually listening
kubectl exec <pod-name> -n <namespace> -- ss -tlnp
# Must show LISTEN on 0.0.0.0:<port>, not 127.0.0.1:<port>

# Fix: selector mismatch -- service selector must exactly match pod labels
spec:
  selector:
    app: myapp      # must match pod label
    version: v1     # all keys must match

# Fix: port mismatch
spec:
  ports:
  - port: 80          # what clients connect to
    targetPort: 8080  # what the container listens on
    protocol: TCP

# Fix: NetworkPolicy blocking traffic
# Allow all ingress temporarily for testing
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-ingress
  namespace: <namespace>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - {}

DNS Resolution Failures
# Test DNS from a pod
kubectl run test-dns --image=busybox -it --rm -- nslookup kubernetes.default
kubectl run test-dns --image=busybox -it --rm -- nslookup <service-name>.<namespace>
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check DNS config inside a pod
kubectl exec <pod-name> -n <namespace> -- cat /etc/resolv.conf
# Should show:
# nameserver 10.x.x.x
# search <namespace>.svc.cluster.local svc.cluster.local cluster.local
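The search domains above are why the short name `<service>` resolves inside the same namespace; the fully qualified name works from anywhere in the cluster. A helper that builds it, assuming the default `cluster.local` cluster domain:

```shell
# Build the FQDN of a Service: <name>.<namespace>.svc.<cluster-domain>.
# Assumes the default cluster domain unless a third argument is given.
svc_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}

svc_fqdn kube-dns kube-system   # -> kube-dns.kube-system.svc.cluster.local
```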
# Check kube-dns service
kubectl get svc -n kube-system kube-dns
kubectl get endpoints -n kube-system kube-dns

# Fix: CoreDNS not running
kubectl rollout restart deployment coredns -n kube-system
kubectl scale deployment coredns -n kube-system --replicas=2

# Fix: NetworkPolicy blocking port 53
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: <namespace>
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

# Fix: wrong DNS policy on pod
spec:
  dnsPolicy: ClusterFirst  # default, recommended

Storage and PVC Issues
PVC Stuck in Pending
# Check PVC status and events
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
# Check StorageClass exists
kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.spec.storageClassName}'
kubectl get storageclass
# Check available PVs
kubectl get pv
kubectl get pv -o custom-columns=NAME:.metadata.name,CAPACITY:.spec.capacity.storage,ACCESS:.spec.accessModes,STATUS:.status.phase
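The ACCESS column in the listing above uses abbreviations, and a PVC only binds to a PV whose access modes include the one it requested. A quick decoder (illustrative helper, not a kubectl command):

```shell
# Decode the access-mode abbreviations shown by `kubectl get pv`.
access_mode() {
  case "$1" in
    RWO)  echo "ReadWriteOnce: read-write by a single node" ;;
    ROX)  echo "ReadOnlyMany: read-only by many nodes" ;;
    RWX)  echo "ReadWriteMany: read-write by many nodes" ;;
    RWOP) echo "ReadWriteOncePod: read-write by a single pod (newer clusters)" ;;
    *)    echo "unknown access mode" ;;
  esac
}

access_mode RWO   # -> ReadWriteOnce: read-write by a single node
```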
# Check provisioner logs
kubectl get pods -n kube-system | grep csi
kubectl logs -n kube-system <csi-provisioner-pod> --tail=50

# Fix: create a StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

# Fix: manually create a PV (static provisioning)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-001
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual
  hostPath:
    path: /mnt/data

Note: Most cloud block storage (EBS, GCP PD, Azure Disk) only supports ReadWriteOnce. Use NFS or cloud file storage for ReadWriteMany.

Node Issues
Node in NotReady State
Symptoms: kubectl get nodes shows NotReady, pods not scheduling on the node.

# Check node conditions
kubectl describe node <node-name>
# Look at Conditions section:
# Ready: False
# MemoryPressure: True/False
# DiskPressure: True/False
# PIDPressure: True/False
# SSH to node and check kubelet
ssh <node-address>
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago"
# Check container runtime
systemctl status containerd
crictl ps
# Check disk
df -h && df -i
# Check memory
free -h
# Test API server connectivity from node
curl -k https://<api-server-ip>:6443/healthz

# Fix: kubelet not running
sudo systemctl start kubelet
sudo systemctl enable kubelet
# If failing to start
journalctl -u kubelet --no-pager | tail -50
# Check /var/lib/kubelet/config.yaml and /etc/kubernetes/kubelet.conf
sudo systemctl restart kubelet

# Fix: disk pressure
# Clean unused container images
crictl rmi --prune
# Clean old containers
crictl rm $(crictl ps -a -q)
# Clean logs
sudo journalctl --vacuum-time=2d
sudo find /var/log -name "*.log" -mtime +7 -delete

# Fix: certificate expired
# Renew certificates (kubeadm clusters)
kubeadm certs renew all
kubeadm certs check-expiration
systemctl restart kubelet

# Fix: drain, repair, uncordon
# Safely drain a node before maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# After fixing the issue
kubectl uncordon <node-name>

Quick Reference: Symptom to Action
Pod Status Decision Tree
Pending
├── "Insufficient cpu/memory" → scale cluster or reduce requests
├── "node(s) had taints" → add tolerations to pod
├── "didn't match node selector" → fix node labels or pod nodeSelector
└── "PVC not found/pending" → fix PVC
CrashLoopBackOff
├── exit code 137 (OOMKilled) → increase memory limits
├── application error in logs → fix app config/code
├── missing dependency → check service/secret availability
└── liveness probe failing → increase initialDelaySeconds
ImagePullBackOff
├── "manifest unknown" → fix image name or tag
├── "unauthorized" → add imagePullSecrets
├── "rate limited" → use authenticated pulls
└── network error → check registry connectivity
Running but not working
├── check logs → application-level error
├── check endpoints → kubectl get endpoints svc-name
└── empty endpoints → service selector mismatch

Service Connectivity Decision Tree
Cannot access service?
1. Does service exist? → kubectl get svc
2. Does it have endpoints? → kubectl get endpoints svc-name
Empty → selector doesn't match pods
3. Are pods ready? → kubectl get pods -l app=label
Not ready → check readiness probe
4. Test from inside cluster → kubectl run test --image=busybox -it --rm
5. Check NetworkPolicy → kubectl get networkpolicy -n <namespace>
6. Is app listening on 0.0.0.0? → kubectl exec pod -- ss -tlnp

Essential kubectl Reference
Inspection
# List with useful flags
kubectl get pods -A # all namespaces
kubectl get pods -o wide # show IPs and nodes
kubectl get pods --show-labels # show labels
kubectl get pods -l app=nginx # filter by label
kubectl get pods --sort-by=.metadata.creationTimestamp # sort by age
kubectl get pods -w # watch mode
# Custom columns
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP
# All non-running pods
kubectl get pods -A -o jsonpath='{range .items[?(@.status.phase!="Running")]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'

Logs
kubectl logs <pod> # current logs
kubectl logs <pod> --previous # after crash
kubectl logs <pod> -c <container> # specific container
kubectl logs <pod> --tail=50 # last 50 lines
kubectl logs <pod> --since=1h # last hour
kubectl logs <pod> -f # follow
kubectl logs -l app=nginx --all-containers # all pods matching label

Debugging
# Interactive shell
kubectl exec <pod> -it -- sh
# Port forward
kubectl port-forward <pod> 8080:80
kubectl port-forward svc/<service> 8080:80
# Ephemeral debug container (1.23+)
kubectl debug <pod> -it --image=nicolaka/netshoot --target=<container>
# Debug copy of pod
kubectl debug <pod> -it --copy-to=debug-pod --image=ubuntu
# Debug node
kubectl debug node/<node-name> -it --image=ubuntu
# Copy files
kubectl cp <pod>:/path/file ./local-file
kubectl cp ./local-file <pod>:/path/file

Editing and Patching
# Edit live resource
kubectl edit deployment <name>
# Scale
kubectl scale deployment <name> --replicas=3
# Set image
kubectl set image deployment/<name> container=image:tag
# Rollout commands
kubectl rollout status deployment/<name>
kubectl rollout history deployment/<name>
kubectl rollout undo deployment/<name>
kubectl rollout restart deployment/<name>
# JSON patch
kubectl patch deployment <name> --type='json' \
-p='[{"op":"replace","path":"/spec/replicas","value":3}]'

Permissions Check
kubectl auth can-i create pods
kubectl auth can-i create pods --as=<serviceaccount>
kubectl auth can-i --list

Log Locations Reference
Control Plane (self-hosted / kubeadm)
kubectl logs -n kube-system kube-apiserver-<node>
kubectl logs -n kube-system kube-controller-manager-<node>
kubectl logs -n kube-system kube-scheduler-<node>
kubectl logs -n kube-system etcd-<node>

Control Plane (systemd)
journalctl -u kube-apiserver
journalctl -u kube-controller-manager
journalctl -u kube-scheduler
journalctl -u etcd

Node Components (on the node)
journalctl -u kubelet
journalctl -u containerd

Add-ons
# CoreDNS
kubectl logs -n kube-system -l k8s-app=kube-dns
# CNI (Calico example)
kubectl logs -n kube-system -l k8s-app=calico-node
# kube-proxy
kubectl logs -n kube-system -l k8s-app=kube-proxy
# NGINX Ingress Controller
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

Common Event Reasons
Reason -- Meaning -- Action
FailedScheduling -- Pod can't be placed -- Check resources, taints, selectors
BackOff -- Container crashing repeatedly -- Check logs
Unhealthy -- Health check failing -- Fix probe or application
FailedMount -- Volume mount failed -- Check PVC, PV, StorageClass
FailedCreatePodSandBox -- Network sandbox failed -- Check CNI plugin
ImagePullBackOff -- Image pull failing -- Fix image name or auth
OOMKilled -- Memory limit exceeded -- Increase limits

Control Plane Health Checks
# API server health
kubectl get --raw='/healthz?verbose'
kubectl get --raw='/readyz?verbose'
# Control plane pods
kubectl get pods -n kube-system -l tier=control-plane
# etcd health (kubeadm clusters)
kubectl -n kube-system exec etcd-<node> -- etcdctl \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
endpoint health
# Certificate expiration
kubeadm certs check-expiration

Production Operations Checklist
Monitoring Alerts to Set Up
- Node NotReady, or disk/memory pressure conditions
- Pods in CrashLoopBackOff or restart spikes
- OOMKilled events
- PVCs stuck in Pending
- Control plane certificate expiration
- API server /healthz failures
Incident Response Workflow
# 1. Assess scope
kubectl get pods -A | grep -v Running
kubectl get nodes
kubectl top nodes
# 2. Quick mitigations
kubectl rollout undo deployment/<name>
kubectl scale deployment/<name> --replicas=N
kubectl delete pod <crashing-pod> # force reschedule
# 3. Check recent events
kubectl get events --sort-by='.lastTimestamp' -A | tail -30

Production Best Practices
- Give liveness probes an adequate initialDelaySeconds
- Use PodDisruptionBudgets to maintain availability during node drains
- Apply NetworkPolicies -- default deny, explicitly allow
- Never run the latest tag in production

Recommended Debugging Tools
Tool -- Purpose
k9s -- Terminal UI: navigate the cluster visually
stern -- Tail logs from multiple pods at once
kubectx / kubens -- Fast context and namespace switching
netshoot -- Network debugging container (DNS, curl, tcpdump)
kubectl-neat -- Clean up YAML output for readability
Prometheus + Grafana -- Metrics and dashboards
Loki + Grafana -- Log aggregation
Jaeger -- Distributed tracing

Resources