Kubernetes Troubleshooting in Production: The 10 Issues Engineers Face Every Week

Kubernetes rarely breaks in obvious ways.

Most production incidents start quietly:

a pod stuck in Pending
a container restarting endlessly
a service returning 503 with no clear error

This post is a real‑world Kubernetes troubleshooting guide, focused on the issues engineers actually debug every week in production, not certification theory.

How to Debug Kubernetes (Before You Panic)

Always debug in this order:

Cluster health
Pod scheduling
Container lifecycle
Networking
Storage

Random debugging wastes time.

1. Pod Stuck in `Pending`

What this means

A pod stuck in Pending usually means the scheduler cannot place it on any node.
This is not an application bug.

Common reasons

Insufficient CPU or memory
No suitable node
PVC not bound
Node taints or selectors blocking scheduling

What to check first

kubectl get pod <pod-name>
kubectl describe pod <pod-name>
kubectl get nodes
kubectl get pvc

👉 Always read the Events section — Kubernetes usually tells you exactly why scheduling failed.
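As a rough illustration of reading those events, the sketch below sorts scheduler messages into the four reasons listed above. The function name `classify_pending` is hypothetical, and the matched phrases are typical scheduler wording that can differ between Kubernetes versions, so treat the patterns as assumptions.

```shell
# Hypothetical helper: reads scheduler event text on stdin and prints one
# category per reason it recognises. A single event line can carry several
# reasons, so each pattern is checked independently.
classify_pending() {
  while IFS= read -r line; do
    case "$line" in *Insufficient*) echo "resources" ;; esac
    case "$line" in *taint*) echo "taints" ;; esac
    case "$line" in *PersistentVolumeClaim*) echo "storage" ;; esac
    case "$line" in *"node affinity"*|*"node selector"*) echo "selector" ;; esac
  done
}

# Usage against a live cluster (not run here):
#   kubectl describe pod <pod-name> | grep FailedScheduling | classify_pending
```

Lines that match none of the patterns print nothing, which is a hint to read the raw event text yourself.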

2. CrashLoopBackOff

What this means

The container starts, crashes, and Kubernetes keeps restarting it.

Kubernetes is doing its job.
Your application is failing.

Common reasons

App crashes on startup
Missing environment variables
Dependency (DB, API) not reachable
Wrong command or arguments

What to check

kubectl logs <pod-name>
kubectl logs <pod-name> --previous

If logs show application errors, fix the app, not Kubernetes.
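When the logs are empty, the exit code of the last crash is often the next best clue. The sketch below maps a few exit codes to their conventional shell meanings. The function name `explain_exit_code` is hypothetical, and your application may use these codes differently.

```shell
# Hypothetical helper: prints the conventional meaning of a container exit code.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check why the process ends instead of staying up" ;;
    1)   echo "generic application error; read the logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found; check command and args" ;;
    137) echo "killed by SIGKILL (128+9); often OOMKilled" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)   echo "application-specific exit code $1" ;;
  esac
}

# Fetch the last exit code from a live cluster (not run here):
#   kubectl get pod <pod-name> \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
```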

3. ImagePullBackOff

What this means

Kubernetes cannot pull the container image, so the pod never starts.

Common reasons

Wrong image name or tag
Private registry authentication missing
Image deleted or renamed

What to check

kubectl describe pod <pod-name>

Look for errors like:

ErrImagePull
unauthorized
repository not found
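A wrong tag is easy to miss because an image reference without a tag is pulled as `latest`. The sketch below shows which tag or digest a reference resolves to. The function name `image_tag` and the example references are hypothetical.

```shell
# Hypothetical helper: prints the tag or digest an image reference points at.
# Only the last path component is inspected, so a registry port such as
# registry.local:5000 is not mistaken for a tag.
image_tag() {
  last=${1##*/}
  case "$last" in
    *@*) echo "digest ${last#*@}" ;;
    *:*) echo "tag ${last#*:}" ;;
    *)   echo "tag latest (implicit)" ;;
  esac
}

# Read the image a pod is actually using (not run here):
#   kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].image}'
```

For example, `image_tag registry.local:5000/team/app` reports an implicit `latest`, which fails if that tag was never pushed.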

4. Pod Running but Service Not Reachable

What this means

Pods are healthy, but traffic is not reaching them.

This is almost always a Service configuration issue.

Common reasons

Service selector doesn’t match pod labels
Wrong targetPort
Application listening on a different port

What to check

kubectl get svc
kubectl describe svc <service-name>
kubectl get endpoints <service-name>

👉 If endpoints are empty, traffic has nowhere to go.
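Empty endpoints usually come down to a selector that does not match the pod labels. The sketch below checks that every `key=value` pair in a selector appears in a pod's label list. The function name `selector_matches` is hypothetical, and both arguments are assumed to be comma-separated `key=value` strings, the format `kubectl get pods --show-labels` prints.

```shell
# Hypothetical helper: succeeds only if every key=value pair in the selector
# ($1) is present in the pod's labels ($2). Prints the first missing pair.
selector_matches() {
  selector=$1
  labels=",$2,"
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;
      *) IFS=$old_ifs; echo "missing: $pair"; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}

# Gather the inputs from a live cluster (not run here):
#   kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
#   kubectl get pods --show-labels
```

A quicker live check is `kubectl get pods -l <key>=<value>` using the Service's selector: if it returns no pods, the Service has nothing to route to.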

5. Ingress Returns `503` or `404`

What this means

Ingress is reachable, but backend services are not responding correctly.

Common reasons

Backend service misconfigured
Readiness probe failing
Wrong service name or port

What to check

kubectl get ingress
kubectl describe ingress <ingress-name>
kubectl logs -n ingress-nginx <controller-pod>

Ingress problems are common because the Ingress sits at the edge of the cluster, in front of every backend service.

6. Pod Gets `OOMKilled`

What this means

The container used more memory than allowed and was killed by the kernel.

Common reasons

Memory limits too low
Application memory spike
Memory leak

What to check

kubectl describe pod <pod-name>
kubectl top pod <pod-name>

Look for: Reason: OOMKilled
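To judge whether a limit is simply too low, compare current usage with the limit. The sketch below does that arithmetic for whole-number values in Ki, Mi, or Gi. The function names `to_mib` and `mem_pct` are hypothetical, and other units or decimal values are deliberately rejected or unsupported.

```shell
# Hypothetical helper: converts a whole-number Kubernetes memory quantity to MiB.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: prints usage as an integer percentage of the limit.
mem_pct() {
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}

# Gather the inputs from a live cluster (not run here):
#   kubectl top pod <pod-name>
#   kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources.limits.memory}'
```

A pod that sits close to 100 percent in normal operation has no headroom for spikes, which points to a low limit rather than a leak.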

7. Node Goes `NotReady`

What this means

The node has stopped reporting a healthy status to the control plane. This is a node-level failure that affects every pod on that node.

Common reasons

Disk full
Network issues
Kubelet crash
Cloud provider problems

What to check

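kubectl get nodes
kubectl describe node <node-name>

Once you have diagnosed the node and need to take it out of service, drain it. This evicts its pods, so treat it as remediation rather than a check:

kubectl drain <node-name> --ignore-daemonsets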

Node issues often cause cascading failures across namespaces.
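In a large cluster it helps to reduce the node list to the nodes that need attention. The sketch below filters `kubectl get nodes` output down to names whose STATUS does not start with `Ready`. The function name `not_ready_nodes` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get nodes` output on stdin and prints
# the names of nodes whose STATUS column does not start with "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are treated as ready.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage against a live cluster (not run here):
#   kubectl get nodes | not_ready_nodes
```

Feed each name it prints into `kubectl describe node <node-name>` and read the Conditions section for disk, memory, or PID pressure.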

8. DNS Not Working Inside the Cluster

What this means

Services cannot discover each other by name.

Common reasons

CoreDNS crash
Network policies blocking DNS
Resource starvation in kube-system

What to test

kubectl exec -it <pod-name> -- nslookup kubernetes.default

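If this fails from more than one pod, DNS is probably broken cluster-wide. A failure from a single pod can also mean its image lacks nslookup or a network policy applies only to that pod.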
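To follow up on the DNS pods themselves, the sketch below bundles three checks. The function name `dns_checks` is hypothetical. The `k8s-app=kube-dns` label and the `kube-dns` Service name are assumptions that hold on many clusters, including kubeadm-style ones, but your distribution may use different names.

```shell
# Hypothetical helper: shows the state, recent logs, and Service of the
# cluster DNS pods. Label and Service name are assumptions; adjust as needed.
dns_checks() {
  kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
  kubectl get svc -n kube-system kube-dns
}

# Usage against a live cluster (not run here):
#   dns_checks
```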

9. PVC Stuck in `Pending`

What this means

Kubernetes cannot provision or bind storage.

Common reasons

No matching StorageClass
Provisioner failure
Storage quota exhausted

What to check

kubectl get pvc
kubectl describe pvc <pvc-name>

Stateful workloads make this issue very common in modern clusters.

10. Deployment Rollout Stuck

What this means

A new version never becomes ready, blocking the rollout.

Common reasons

Readiness probe failing
Resource limits too tight
Broken image

What to check

kubectl rollout status deployment <deployment-name>
kubectl describe deployment <deployment-name>
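Without a timeout, `kubectl rollout status` can wait for a long time on a stuck rollout. The sketch below adds a timeout and prints next steps instead of hanging. The function name `rollout_check` and the 120s default are hypothetical, and it only suggests a rollback rather than running one.

```shell
# Hypothetical helper: waits for a rollout with a timeout and, if it does not
# finish, prints the commands to inspect it or roll it back.
rollout_check() {
  name=$1
  timeout=${2:-120s}
  if kubectl rollout status deployment "$name" --timeout="$timeout"; then
    echo "rollout of $name finished"
  else
    echo "rollout of $name did not finish within $timeout"
    echo "inspect:   kubectl describe deployment $name"
    echo "roll back: kubectl rollout undo deployment $name"
    return 1
  fi
}

# Usage against a live cluster (not run here):
#   rollout_check <deployment-name> 60s
```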

Quick Production Debug Checklist

When something breaks, always start here:

kubectl get nodes
kubectl get pods -A | grep -v Running
kubectl get events -A --sort-by=.lastTimestamp

This surfaces most real production issues fast.
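One caveat with `grep -v Running`: it hides pods that are Running but not ready, and it still lists Completed jobs. The sketch below is one possible stricter filter. The function name `unhealthy_pods` is hypothetical, and it assumes the column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and keeps
# the header plus every pod that is not Running or not fully ready.
# Completed pods are skipped because finished jobs are expected to stop.
unhealthy_pods() {
  awk 'NR == 1 { print; next }
       $4 == "Completed" { next }
       { split($3, ready, "/") }
       $4 != "Running" || ready[1] != ready[2] { print }'
}

# Usage against a live cluster (not run here):
#   kubectl get pods -A | unhealthy_pods
```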

11. Pod Stuck in `Initializing` (Init Containers)

What this means

A pod in Initializing state (kubectl shows this as a status such as Init:0/1) means:

The pod is scheduled on a node ✅
The main container has NOT started yet
One or more init containers are still running or failing

This is not a CrashLoopBackOff, and logs of the main container will be empty.

Common real‑world reasons

Init container waiting for a dependency (DB, service, DNS)
Init container script failing
Image pull issues in init container
Permission or volume mount issues
Network not ready yet

First thing to check (very important)

kubectl get pod <pod-name>
kubectl describe pod <pod-name>

👉 Look for the Init Containers section and check:

Which init container is running
Which one failed
Exit codes and events

How to check init container logs (this is the key)

By default, kubectl logs shows only the main container, which is why people get confused.

You must explicitly specify the init container name.

kubectl logs <pod-name> -c <init-container-name>
kubectl logs <pod-name> -c <init-container-name> --previous

This is one of the most commonly missed Kubernetes debugging steps.

How to list init container names quickly

kubectl get pod <pod-name> -o jsonpath='{.spec.initContainers[*].name}'

Use the name from this output in the log command above.
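The two steps above can be combined into a loop that prints the logs of every init container in turn. This is a minimal sketch: the function name `init_logs` and the 50-line tail are hypothetical choices.

```shell
# Hypothetical helper: lists a pod's init containers and prints recent logs
# for each one under a separator line.
init_logs() {
  pod=$1
  for name in $(kubectl get pod "$pod" -o jsonpath='{.spec.initContainers[*].name}'); do
    echo "=== init container: $name ==="
    kubectl logs "$pod" -c "$name" --tail=50
  done
}

# Usage against a live cluster (not run here):
#   init_logs <pod-name>
```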

Typical production example

Init container checks DB readiness
DB is slow or unreachable
Pod stays in Initializing forever
Main app never starts
No logs visible unless you check init container logs

If the init container is stuck (but not failing)

Check if it’s:

Waiting on a network call
Sleeping or retrying endlessly
Blocked by a volume or permission issue

Useful checks:

kubectl describe pod <pod-name>
kubectl get events --sort-by=.lastTimestamp

InfraDecode Tip

If a pod is:

Pending → scheduling issue
Initializing → init container issue
CrashLoopBackOff → application issue

Knowing this distinction alone saves hours of debugging.
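The same rule of thumb as a small lookup, in case you want it in a triage script. The function name `likely_cause` is hypothetical, and matching `Init:*` assumes the way kubectl usually renders init container states (for example `Init:0/1` or `Init:Error`).

```shell
# Hypothetical helper: maps a pod STATUS value to the likely problem area.
likely_cause() {
  case "$1" in
    Pending)          echo "scheduling issue" ;;
    Init:*)           echo "init container issue" ;;
    CrashLoopBackOff) echo "application issue" ;;
    *)                echo "not covered by this rule of thumb" ;;
  esac
}

# Usage against a live cluster (not run here):
#   likely_cause "$(kubectl get pod <pod-name> --no-headers | awk '{ print $3 }')"
```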

InfraDecode Takeaway

Kubernetes rarely fails randomly.
It fails because something was misconfigured, under‑provisioned, or misunderstood.

Good engineers debug systematically, not emotionally.

— InfraDecode
