Kubernetes Troubleshooting in Production: The 10 Issues Engineers Face Every Week

Kubernetes rarely breaks in obvious ways.

Most production incidents start quietly:

  • a pod stuck in Pending
  • a container restarting endlessly
  • a service returning 503 with no clear error

This post is a real‑world Kubernetes troubleshooting guide, focused on the issues engineers actually debug every week in production, not certification theory.


How to Debug Kubernetes (Before You Panic)

Always debug in this order:

  1. Cluster health
  2. Pod scheduling
  3. Container lifecycle
  4. Networking
  5. Storage

Random debugging wastes time.


1. Pod Stuck in Pending

What this means

A pod stuck in Pending usually means the scheduler has not been able to place it on any node.
This is not an application bug.

Common reasons

  • Insufficient CPU or memory
  • No suitable node
  • PVC not bound
  • Node taints or selectors blocking scheduling

What to check first

kubectl get pod <pod-name>
kubectl describe pod <pod-name>
kubectl get nodes
kubectl get pvc

👉 Always read the Events section — Kubernetes usually tells you exactly why scheduling failed.
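
To read only the events that belong to one pod, the event list can be filtered by the object it refers to. A minimal sketch, assuming the namespace is passed as the second argument and defaults to default; the function name pod_events is made up for this example:

```shell
# pod_events POD [NAMESPACE]
# Prints only the events that refer to the given pod, oldest first.
pod_events() {
  kubectl get events -n "${2:-default}" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}
```

Messages such as FailedScheduling in this output normally name the blocking resource, taint, or unbound claim.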


2. CrashLoopBackOff

What this means

The container starts, crashes, and Kubernetes keeps restarting it.

Kubernetes is doing its job.
Your application is failing.

Common reasons

  • App crashes on startup
  • Missing environment variables
  • Dependency (DB, API) not reachable
  • Wrong command or arguments

What to check

kubectl logs <pod-name>
kubectl logs <pod-name> --previous

If logs show application errors, fix the app, not Kubernetes.
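
When the logs are empty or inconclusive, the exit code of the last crashed container is a useful second clue. It can be read with kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'. The helper below is a rough sketch that maps common codes to likely causes. The name explain_exit_code is hypothetical, and the interpretations follow general Unix conventions (128 + signal number), not anything Kubernetes-specific:

```shell
# explain_exit_code CODE
# Maps a container exit code to a likely cause (a heuristic, not a diagnosis).
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the process is meant to keep running" ;;
    1)   echo "general application error; read the logs" ;;
    126) echo "command found but not executable; check file permissions" ;;
    127) echo "command not found; check command, args, and image contents" ;;
    137) echo "killed by SIGKILL (128+9); often an out-of-memory kill" ;;
    143) echo "terminated by SIGTERM (128+15); something asked the container to stop" ;;
    *)   echo "application-specific exit code; read the logs" ;;
  esac
}
```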


3. ImagePullBackOff

What this means

Kubernetes cannot pull the container image, so the pod never starts.

Common reasons

  • Wrong image name or tag
  • Private registry authentication missing
  • Image deleted or renamed

What to check

kubectl describe pod <pod-name>

Look for errors like:

  • ErrImagePull
  • unauthorized
  • repository not found
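
It also helps to see exactly which image references and pull secrets the pod spec contains, since a typo is easier to spot there than in the event text. A small sketch, assuming the namespace defaults to default; pod_images is a hypothetical helper name, and init container images are not included:

```shell
# pod_images POD [NAMESPACE]
# Prints the images of the pod's main containers and its imagePullSecrets.
pod_images() {
  echo "images:"
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .spec.containers[*]}{.image}{"\n"}{end}'
  echo "imagePullSecrets:"
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.spec.imagePullSecrets[*].name}{"\n"}'
}
```

An empty imagePullSecrets line for an image hosted in a private registry points at missing authentication.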

4. Pod Running but Service Not Reachable

What this means

Pods are healthy, but traffic is not reaching them.

This is almost always a Service configuration issue.

Common reasons

  • Service selector doesn’t match pod labels
  • Wrong targetPort
  • Application listening on a different port

What to check

kubectl get svc
kubectl describe svc <service-name>
kubectl get endpoints <service-name>

👉 If endpoints are empty, traffic has nowhere to go.
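
One way to confirm a selector mismatch is to ask for pods using the Service's own selector. The sketch below assumes a kubectl version that prints the selector map as JSON (older versions print a different format), and selector_to_labels is a made-up helper name:

```shell
# Reads a selector printed as JSON, e.g. {"app":"web","tier":"frontend"},
# and prints it in the key=value,key=value form that `kubectl get pods -l` accepts.
selector_to_labels() {
  sed -e 's/[{}"]//g' -e 's/:/=/g'
}

# Usage (names in angle brackets are placeholders):
#   sel=$(kubectl get svc <service-name> -o jsonpath='{.spec.selector}' | selector_to_labels)
#   kubectl get pods -l "$sel" --show-labels
```

If the second command returns no pods, the Service selector and the pod labels do not match.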


5. Ingress Returns 503 or 404

What this means

Ingress is reachable, but backend services are not responding correctly.

Common reasons

  • Backend service misconfigured
  • Readiness probe failing
  • Wrong service name or port

What to check

kubectl get ingress
kubectl describe ingress <ingress-name>
kubectl logs -n ingress-nginx <controller-pod>

Ingress problems are common because the ingress controller sits at the edge of the cluster, so a mistake in any backend Service or probe shows up there first.


6. Pod Gets OOMKilled

What this means

The container used more memory than allowed and was killed by the kernel.

Common reasons

  • Memory limits too low
  • Application memory spike
  • Memory leak

What to check

kubectl describe pod <pod-name>
kubectl top pod <pod-name>

Look for: Reason: OOMKilled
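
In a pod with several containers it is not always obvious which one was killed. A minimal sketch that lists the containers whose last termination reason was OOMKilled; the function name oomkilled_containers is hypothetical, and it only looks at lastState, so a container that has not restarted yet will not appear:

```shell
# oomkilled_containers POD [NAMESPACE]
# Prints the names of containers whose previous run ended with OOMKilled.
oomkilled_containers() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}' |
    awk -F '\t' '$2 == "OOMKilled" { print $1 }'
}
```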


7. Node Goes NotReady

What this means

The node has stopped reporting a healthy status to the control plane, which affects every pod running on it.

Common reasons

  • Disk full
  • Network issues
  • Kubelet crash
  • Cloud provider problems

What to check

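kubectl get nodes
kubectl describe node <node-name>

If the node does not recover, drain it so its pods are rescheduled elsewhere. This evicts workloads, so treat it as a fix, not a check:

kubectl drain <node-name> --ignore-daemonsets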

Node issues often cause cascading failures across namespaces.
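
The Conditions block in the describe output is where the cause usually shows up. The filter below is a sketch that prints only the conditions that look unhealthy. It assumes the standard kubelet conditions, where Ready should be True and every other condition (MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable) should be False. The name unhealthy_conditions is made up for this example:

```shell
# Reads "Type=Status" lines on stdin and prints the conditions that look unhealthy.
unhealthy_conditions() {
  awk -F '=' '
    NF != 2 { next }                                    # skip blank or malformed lines
    $1 == "Ready" && $2 != "True"  { print $1 " is " $2 }
    $1 != "Ready" && $2 != "False" { print $1 " is " $2 }
  '
}

# Usage:
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | unhealthy_conditions
```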


8. DNS Not Working Inside the Cluster

What this means

Services cannot discover each other by name.

Common reasons

  • CoreDNS crash
  • Network policies blocking DNS
  • Resource starvation in kube-system

What to test

kubectl exec -it <pod-name> -- nslookup kubernetes.default

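If this fails, cluster DNS is likely broken, or something is blocking this pod from reaching it. The container image must include nslookup for the test to work; repeat it from a second pod before concluding the problem is cluster-wide.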
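
When the lookup fails, the next stop is usually the DNS pods themselves. A short sketch, assuming CoreDNS runs in kube-system with the label k8s-app=kube-dns (the default in many distributions, but worth verifying in your cluster); coredns_status is a hypothetical helper name:

```shell
# coredns_status
# Shows the DNS pods and the last few log lines from each of them.
coredns_status() {
  kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
}
```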


9. PVC Stuck in Pending

What this means

Kubernetes cannot provision or bind storage.

Common reasons

  • No matching StorageClass
  • Provisioner failure
  • Storage quota exhausted

What to check

kubectl get pvc
kubectl describe pvc <pvc-name>

Stateful workloads make this issue very common in modern clusters.
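
A quick way to rule out the first reason is to check that the StorageClass named by the claim actually exists. A minimal sketch, with pvc_storageclass_check as a made-up function name; an empty result can mean either that the field is unset or that it was set to an empty string on purpose:

```shell
# pvc_storageclass_check PVC [NAMESPACE]
# Reports whether the StorageClass referenced by the PVC exists.
pvc_storageclass_check() {
  sc=$(kubectl get pvc "$1" -n "${2:-default}" -o jsonpath='{.spec.storageClassName}')
  if [ -z "$sc" ]; then
    echo "PVC $1 has no storageClassName (or it is empty); check for a default StorageClass"
  elif kubectl get storageclass "$sc" >/dev/null 2>&1; then
    echo "StorageClass $sc exists"
  else
    echo "StorageClass $sc not found"
  fi
}
```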


10. Deployment Rollout Stuck

What this means

A new version never becomes ready, blocking the rollout.

Common reasons

  • Readiness probe failing
  • Resource limits too tight
  • Broken image

What to check

kubectl rollout status deployment <deployment-name>
kubectl describe deployment <deployment-name>
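
In scripts and pipelines, rollout status is more useful with a timeout, so a stuck rollout fails fast instead of hanging. A sketch under assumptions: the 120s default is arbitrary, check_rollout is a hypothetical name, and whether to roll back is left to the operator:

```shell
# check_rollout DEPLOYMENT [NAMESPACE] [TIMEOUT]
# Waits for the rollout to finish; returns non-zero if it does not finish in time.
check_rollout() {
  if kubectl rollout status "deployment/$1" -n "${2:-default}" --timeout="${3:-120s}"; then
    echo "rollout complete"
  else
    echo "rollout not complete after ${3:-120s}; inspect the new pods before deciding to run: kubectl rollout undo deployment/$1"
    return 1
  fi
}
```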

Quick Production Debug Checklist

When something breaks, always start here:

kubectl get nodes
kubectl get pods -A | grep -v Running
kubectl get events -A --sort-by=.lastTimestamp

This surfaces most real production issues fast.
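
One caveat: grep -v Running also hides pods that are Running but not ready, and it shows finished Jobs as noise. The filter below is a sketch of a stricter version. It assumes the column layout of kubectl get pods -A (READY in column 3, STATUS in column 4), and unhealthy_pods is a made-up name:

```shell
# Reads `kubectl get pods -A` output on stdin and prints pods that are not
# fully healthy: any status other than Running/Completed, or Running pods
# whose READY column shows fewer ready containers than total.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }              # keep the header row
    $4 == "Completed" { next }           # finished Jobs are not a problem
    {
      split($3, ready, "/")              # READY looks like "1/2"
      if ($4 != "Running" || ready[1] != ready[2]) print
    }
  '
}

# Usage:
#   kubectl get pods -A | unhealthy_pods
```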

11. Pod Stuck in Initializing (Init Containers)

What this means

A pod stuck initializing (kubectl get pod shows a status such as Init:0/1) means:

  • The pod is scheduled on a node ✅
  • The main container has NOT started yet
  • One or more init containers are still running or failing

This is not a CrashLoopBackOff, and the main container has no logs yet because it has not started.


Common real‑world reasons

  • Init container waiting for a dependency (DB, service, DNS)
  • Init container script failing
  • Image pull issues in init container
  • Permission or volume mount issues
  • Network not ready yet

First thing to check (very important)

kubectl get pod <pod-name>
kubectl describe pod <pod-name>

👉 Look for the Init Containers section and check:

  • Which init container is running
  • Which one failed
  • Exit codes and events

How to check init container logs (this is the key)

By default, kubectl logs shows only the main container, which is why people get confused.

You must explicitly specify the init container name.

kubectl logs <pod-name> -c <init-container-name>
kubectl logs <pod-name> -c <init-container-name> --previous

This is one of the most commonly missed Kubernetes debugging steps.


How to list init container names quickly

kubectl get pod <pod-name> -o jsonpath='{.spec.initContainers[*].name}'

Use the name from this output in the log command above.
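
The two steps can be combined so that every init container's logs are printed in one go. A minimal sketch, assuming the namespace defaults to default and that 50 lines per container is enough; init_logs is a hypothetical helper name:

```shell
# init_logs POD [NAMESPACE]
# Prints the last 50 log lines of every init container in the pod.
init_logs() {
  for c in $(kubectl get pod "$1" -n "${2:-default}" \
               -o jsonpath='{.spec.initContainers[*].name}'); do
    echo "=== init container: $c ==="
    kubectl logs "$1" -n "${2:-default}" -c "$c" --tail=50
  done
}
```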


Typical production example

  • Init container checks DB readiness
  • DB is slow or unreachable
  • Pod stays in Initializing forever
  • Main app never starts
  • No logs visible unless you check init container logs

If the init container is stuck (but not failing)

Check if it’s:

  • Waiting on a network call
  • Sleeping or retrying endlessly
  • Blocked by a volume or permission issue

Useful checks:

kubectl describe pod <pod-name>
kubectl get events --sort-by=.lastTimestamp

InfraDecode Tip

If a pod is:

  • Pending → scheduling issue
  • Initializing → init container issue
  • CrashLoopBackOff → application issue

Knowing this distinction alone saves hours of debugging.
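
The same mapping can be written as a tiny lookup for use in scripts or runbooks. This is only a sketch of the rule of thumb above: likely_cause is a made-up name, and the input is assumed to be the STATUS column from kubectl get pods, where init phases appear as Init:N/M.

```shell
# likely_cause STATUS
# Maps a pod status string to the most likely class of problem.
likely_cause() {
  case "$1" in
    Pending)          echo "scheduling issue" ;;
    Init:*)           echo "init container issue" ;;
    CrashLoopBackOff) echo "application issue" ;;
    *)                echo "no quick mapping; describe the pod" ;;
  esac
}
```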


InfraDecode Takeaway

Kubernetes rarely fails randomly.
It fails because something was misconfigured, under‑provisioned, or misunderstood.

Good engineers debug systematically, not emotionally.

— InfraDecode

