Kubernetes rarely breaks in obvious ways.
Most production incidents start quietly:
- a pod stuck in Pending
- a container restarting endlessly
- a service returning 503 with no clear error
This post is a real‑world Kubernetes troubleshooting guide, focused on the issues engineers actually debug every week in production, not certification theory.
How to Debug Kubernetes (Before You Panic)
Always debug in this order:
- Cluster health
- Pod scheduling
- Container lifecycle
- Networking
- Storage
Random debugging wastes time.
1. Pod Stuck in Pending
What this means
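A pod in Pending state usually means Kubernetes cannot schedule it onto any node. (A pod also shows as Pending while its image is still being pulled; that case is covered under ImagePullBackOff.)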
This is not an application bug.
Common reasons
- Insufficient CPU or memory
- No suitable node
- PVC not bound
- Node taints or selectors blocking scheduling
What to check first
kubectl get pod <pod-name>
kubectl describe pod <pod-name>
kubectl get nodes
kubectl get pvc
👉 Always read the Events section — Kubernetes usually tells you exactly why scheduling failed.
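On a busy pod the Events section can be long. Here is a minimal sketch that pulls out only the scheduler's failure events. The function name why_pending is made up for this post, and it assumes the pod lives in your current namespace.

```shell
# Hypothetical helper: show only FailedScheduling events for one pod.
# Assumes the pod is in the current namespace (add -n <namespace> otherwise).
why_pending() {
  pod="$1"
  kubectl get events \
    --field-selector "involvedObject.name=${pod},reason=FailedScheduling" \
    --sort-by=.lastTimestamp
}

# Usage: why_pending <pod-name>
```

The event message normally names the blocker, such as insufficient memory or an untolerated taint.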
2. CrashLoopBackOff
What this means
The container starts, crashes, and Kubernetes keeps restarting it.
Kubernetes is doing its job.
Your application is failing.
Common reasons
- App crashes on startup
- Missing environment variables
- Dependency (DB, API) not reachable
- Wrong command or arguments
What to check
kubectl logs <pod-name>
kubectl logs <pod-name> --previous
If logs show application errors, fix the app, not Kubernetes.
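When the logs are empty, the exit code of the previous run is often the next best clue. The sketch below reads it from the pod status and maps a few well-known codes to a first guess. Both function names are invented for illustration, and the hints follow common shell and signal conventions (128 plus the signal number), not anything specific to Kubernetes.

```shell
# Hypothetical helper: exit code of the previous run of each container.
last_exit_codes() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
}

# Hypothetical helper: translate an exit code into a first guess.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the process is meant to keep running" ;;
    1)   echo "generic application error; read the logs" ;;
    126) echo "command found but not executable; check permissions" ;;
    127) echo "command not found; check command and args" ;;
    137) echo "killed by SIGKILL; often an OOM kill" ;;
    143) echo "terminated by SIGTERM" ;;
    *)   echo "unrecognized code; read the logs" ;;
  esac
}

# Usage: explain_exit_code "$(last_exit_codes <pod-name>)"   # single-container pod
```

Treat the output as a starting point only; the application logs remain the authority.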
3. ImagePullBackOff
What this means
Kubernetes cannot pull the container image, so the pod never starts.
Common reasons
- Wrong image name or tag
- Private registry authentication missing
- Image deleted or renamed
What to check
kubectl describe pod <pod-name>
Look for errors like:
- ErrImagePull
- unauthorized
- repository not found
4. Pod Running but Service Not Reachable
What this means
Pods are healthy, but traffic is not reaching them.
This is almost always a Service configuration issue.
Common reasons
- Service selector doesn’t match pod labels
- Wrong targetPort
- Application listening on a different port
What to check
kubectl get svc
kubectl describe svc <service-name>
kubectl get endpoints <service-name>
👉 If endpoints are empty, traffic has nowhere to go.
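Empty endpoints usually come down to a selector that matches nothing. A rough sketch for checking that directly: read the selector from the Service, then ask which pods carry those labels. The function name pods_for_service is hypothetical, and it assumes the SELECTOR column is the last one printed by kubectl get svc -o wide.

```shell
# Hypothetical helper: list the pods that a Service's selector actually matches.
# Relies on SELECTOR being the last column of `kubectl get svc -o wide`.
pods_for_service() {
  selector=$(kubectl get svc "$1" -o wide --no-headers | awk '{print $NF}')
  if [ -z "$selector" ] || [ "$selector" = "<none>" ]; then
    echo "service $1 has no selector"
    return 1
  fi
  # An empty result here means no pod carries the labels the Service expects.
  kubectl get pods -l "$selector" --show-labels
}

# Usage: pods_for_service <service-name>
```

If this lists pods but traffic still fails, move on to targetPort and the port the application really listens on.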
5. Ingress Returns 503 or 404
What this means
Ingress is reachable, but backend services are not responding correctly.
Common reasons
- Backend service misconfigured
- Readiness probe failing
- Wrong service name or port
What to check
kubectl get ingress
kubectl describe ingress <ingress-name>
kubectl logs -n ingress-nginx <controller-pod>
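Ingress problems are common because the Ingress sits at the edge of the cluster, so a mistake in any layer behind it (Service, endpoints, readiness probes) surfaces there as a 503 or 404.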
6. Pod Gets OOMKilled
What this means
The container used more memory than allowed and was killed by the kernel.
Common reasons
- Memory limits too low
- Application memory spike
- Memory leak
What to check
kubectl describe pod <pod-name>
kubectl top pod <pod-name>
Look for: Reason: OOMKilled
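In a multi-container pod it helps to know which container was killed and what limit it was running under. A small sketch, with made-up function names, that reads both from the pod object:

```shell
# Hypothetical helper: names of containers whose previous run ended in OOMKilled.
oom_killed_containers() {
  kubectl get pod "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}' \
    | awk -F'\t' '$2 == "OOMKilled" {print $1}'
}

# Hypothetical helper: configured memory limit per container (blank if unset).
memory_limits() {
  kubectl get pod "$1" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}

# Usage: oom_killed_containers <pod-name>; memory_limits <pod-name>
```

Compare the limit with what kubectl top pod reports under normal load before deciding whether to raise the limit or hunt for a leak.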
7. Node Goes NotReady
What this means
The node has stopped reporting a healthy status to the control plane. This is a node-level failure, so it affects every pod on that node.
Common reasons
- Disk full
- Network issues
- Kubelet crash
- Cloud provider problems
What to check
kubectl get nodes
kubectl describe node <node-name>
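If the node cannot be recovered, move workloads off it. This evicts the node's pods, so treat it as a fix, not a check:
kubectl drain <node-name> --ignore-daemonsets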
Node issues often cause cascading failures across namespaces.
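The describe output for a node is long, and the useful part is usually the Conditions table. Here is a sketch that prints only the conditions in a bad state, assuming the usual convention that Ready should be True and the other conditions (such as DiskPressure or MemoryPressure) should be False. The function name is invented for this post.

```shell
# Hypothetical helper: print node conditions that are in an unhealthy state.
# Ready should be True; every other condition should be False.
bad_node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}' \
    | awk -F= '($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 == "True")'
}

# Usage: bad_node_conditions <node-name>
```

No output means the conditions look healthy, and the problem is more likely on the network or cloud provider side.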
8. DNS Not Working Inside the Cluster
What this means
Services cannot discover each other by name.
Common reasons
- CoreDNS crash
- Network policies blocking DNS
- Resource starvation in kube-system
What to test
kubectl exec -it <pod-name> -- nslookup kubernetes.default
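If this fails from several pods, DNS is likely broken cluster-wide. If it fails from only one pod or namespace, suspect a network policy or that pod's own DNS configuration. (The test needs an image that ships nslookup.)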
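Once a lookup fails, the DNS pods themselves are the next stop. A minimal sketch, assuming the DNS pods carry the label k8s-app=kube-dns, as CoreDNS does in kubeadm-style clusters; your distribution may use a different label. The function name is hypothetical.

```shell
# Hypothetical helper: show the cluster DNS pods and their recent logs.
# Assumes the label k8s-app=kube-dns; adjust it if your cluster differs.
check_cluster_dns() {
  kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
}

# Usage: check_cluster_dns
```

Restarting or crashing DNS pods point at CoreDNS itself; healthy pods with failing lookups point at network policies or the path between the pod and the DNS Service.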
9. PVC Stuck in Pending
What this means
Kubernetes cannot provision or bind storage.
Common reasons
- No matching StorageClass
- Provisioner failure
- Storage quota exhausted
What to check
kubectl get pvc
kubectl describe pvc <pvc-name>
Stateful workloads make this issue very common in modern clusters.
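A quick way to rule out the first reason is to check that the StorageClass the PVC asks for exists. The sketch below does that; the function name and messages are invented for illustration.

```shell
# Hypothetical helper: check that the StorageClass a PVC requests exists.
pvc_storageclass_check() {
  sc=$(kubectl get pvc "$1" -o jsonpath='{.spec.storageClassName}')
  if [ -z "$sc" ]; then
    echo "PVC $1 names no StorageClass; it relies on a default class or a pre-created PV"
    kubectl get storageclass
  elif kubectl get storageclass "$sc" >/dev/null 2>&1; then
    echo "StorageClass $sc exists; check the events on the PVC for provisioner errors"
  else
    echo "StorageClass $sc not found"
  fi
}

# Usage: pvc_storageclass_check <pvc-name>
```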
10. Deployment Rollout Stuck
What this means
A new version never becomes ready, blocking the rollout.
Common reasons
- Readiness probe failing
- Resource limits too tight
- Broken image
What to check
kubectl rollout status deployment <deployment-name>
kubectl describe deployment <deployment-name>
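Without a timeout, rollout status can sit there for a long time. A sketch that waits a bounded time and then points at the next steps; the 60s timeout and the function name are arbitrary choices for illustration.

```shell
# Hypothetical helper: wait briefly for a rollout, then suggest next steps.
# The 60s timeout is an arbitrary value for this example.
rollout_or_explain() {
  deploy="$1"
  if kubectl rollout status "deployment/${deploy}" --timeout=60s; then
    echo "rollout complete"
  else
    echo "rollout not complete; ReplicaSets in this namespace, oldest first:"
    kubectl get rs --sort-by=.metadata.creationTimestamp
    echo "to roll back: kubectl rollout undo deployment/${deploy}"
  fi
}

# Usage: rollout_or_explain <deployment-name>
```

The newest ReplicaSet is the one to describe: its pods carry the failing probe, tight limit, or broken image.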
Quick Production Debug Checklist
When something breaks, always start here:
kubectl get nodes
kubectl get pods -A | grep -v Running
kubectl get events -A --sort-by=.lastTimestamp
This surfaces most real production issues fast.
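One caveat with grep -v Running: it hides pods that are Running but not Ready (for example 0/1), and it lists Completed Job pods as if they were a problem. A slightly stricter filter can be sketched with awk. The function name is made up, and it assumes the default column order NAMESPACE, NAME, READY, STATUS.

```shell
# Hypothetical filter: read `kubectl get pods -A --no-headers` on stdin and
# print pods that look unhealthy. Columns: NAMESPACE NAME READY STATUS ...
unhealthy_pods() {
  awk '
    $4 == "Completed" { next }            # finished Job pods are not a problem
    {
      split($3, r, "/")                   # READY looks like "1/2"
      if ($4 != "Running" || r[1] != r[2]) print $1, $2, $3, $4
    }'
}

# Usage: kubectl get pods -A --no-headers | unhealthy_pods
```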
11. Pod Stuck in Initializing (Init Containers)
What this means
A pod in Initializing state (kubectl shows it as Init:0/1, Init:Error, PodInitializing, or similar) means:
- The pod is scheduled on a node ✅
- The main container has NOT started yet
- One or more init containers are still running or failing
This is not a CrashLoopBackOff, and logs of the main container will be empty.
Common real‑world reasons
- Init container waiting for a dependency (DB, service, DNS)
- Init container script failing
- Image pull issues in init container
- Permission or volume mount issues
- Network not ready yet
First thing to check (very important)
kubectl get pod <pod-name>
kubectl describe pod <pod-name>
👉 Look for the Init Containers section and check:
- Which init container is running
- Which one failed
- Exit codes and events
How to check init container logs (this is the key)
By default, kubectl logs shows only the main container, which is why people get confused.
You must explicitly specify the init container name.
kubectl logs <pod-name> -c <init-container-name>
kubectl logs <pod-name> -c <init-container-name> --previous
This is the most missed Kubernetes debugging step.
How to list init container names quickly
kubectl get pod <pod-name> -o jsonpath='{.spec.initContainers[*].name}'
Use the name from this output in the log command above.
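The two steps above (list the names, then fetch logs per name) can be combined into one loop. A minimal sketch; the function name init_logs and the 20-line tail are arbitrary choices for this post.

```shell
# Hypothetical helper: print the last lines of every init container's logs.
init_logs() {
  pod="$1"
  for c in $(kubectl get pod "$pod" -o jsonpath='{.spec.initContainers[*].name}'); do
    echo "=== init container: $c ==="
    # Containers that have not started yet return an error here; that is expected.
    kubectl logs "$pod" -c "$c" --tail=20
  done
}

# Usage: init_logs <pod-name>
```

Init containers run in order, so the first one with errors or no logs is usually where the pod is stuck.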
Typical production example
- Init container checks DB readiness
- DB is slow or unreachable
- Pod stays in Initializing forever
- Main app never starts
- No logs visible unless you check init container logs
If the init container is stuck (but not failing)
Check if it’s:
- Waiting on a network call
- Sleeping or retrying endlessly
- Blocked by a volume or permission issue
Useful checks:
kubectl describe pod <pod-name>
kubectl get events --sort-by=.lastTimestamp
InfraDecode Tip
If a pod is:
- Pending → scheduling issue
- Initializing → init container issue
- CrashLoopBackOff → application issue
Knowing this distinction alone saves hours of debugging.
InfraDecode Takeaway
Kubernetes rarely fails randomly.
It fails because something was misconfigured, under‑provisioned, or misunderstood.
Good engineers debug systematically, not emotionally.
— InfraDecode
