EKS Troubleshooting: Key Steps for Effective Debugging

Introduction

EKS failures are tricky because you debug two layers at once:

Kubernetes layer (Pods, Nodes, Services, Events)
AWS layer (IAM auth, VPC networking, NodeGroup bootstrap, control plane logs)

This post is a diagnosis-first playbook: quick checks → root-cause signals → what to fix.

1) Start with identity: “Who am I to AWS right now?”

A surprising number of EKS issues are simply “wrong AWS identity” (wrong profile, expired SSO, different role than expected). AWS re:Post explicitly recommends verifying the caller identity when you hit access errors.

Run this first:

aws sts get-caller-identity --query "Arn" --output text

If this ARN is not the identity you expected, stop here and fix your AWS credentials/profile before touching Kubernetes.
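To make this check repeatable, one option is a small shell helper that fails loudly when the ARN does not contain the fragment you expect. This is only a sketch: check_identity is a hypothetical name, and the fragment you pass (a role or user name) is whatever you expect to be running as.

```shell
# Hypothetical helper: succeed only if the current caller ARN contains
# the fragment passed as $1 (for example a role or user name).
check_identity() {
  expected="$1"
  # Same STS call as above; a failure usually means expired or missing credentials.
  arn="$(aws sts get-caller-identity --query "Arn" --output text)" || {
    echo "STS call failed: check your profile or SSO session" >&2
    return 2
  }
  case "$arn" in
    *"$expected"*) echo "OK: $arn" ;;
    *) echo "MISMATCH: $arn does not contain $expected" >&2
       return 1 ;;
  esac
}
```

Used as a guard, something like check_identity DevOps && kubectl get nodes stops you from running Kubernetes commands under the wrong identity.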

2) Confirm you’re pointing kubectl at the right cluster

If kubeconfig is wrong, kubectl will mislead you: it answers for whichever cluster the current context points at (or shows nothing).

Update kubeconfig:

aws eks --region <region> update-kubeconfig --name <cluster-name>

Quick sanity checks:

kubectl config current-context
kubectl get nodes

If kubectl get nodes shows nothing, you may be connected to the wrong cluster or nodes never joined (next section).
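The context check can also be scripted. The sketch below assumes the context was created by aws eks update-kubeconfig without a custom alias, in which case the context name is the cluster ARN and ends in /<cluster-name>. The function name on_cluster is hypothetical.

```shell
# Hypothetical helper: succeed only if the active kubectl context ends in
# "/<cluster-name>", which is the shape of a cluster-ARN context name.
on_cluster() {
  want="$1"
  ctx="$(kubectl config current-context)" || return 2
  # Strip everything up to the last "/" and compare with the wanted name.
  if [ "${ctx##*/}" = "$want" ]; then
    echo "OK: $ctx"
  else
    echo "WRONG CONTEXT: $ctx (wanted cluster $want)" >&2
    return 1
  fi
}
```

If your context uses an alias that has no slash in it, the whole alias is compared against the name you pass, so this only helps when the alias equals the cluster name.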

3) The most common EKS symptom: “Nodes never joined the cluster”

AWS documents this as a common EKS troubleshooting category and calls out aws-auth mapping as a key cause.

3.1 Check node status

kubectl get nodes -o wide

If this returns no nodes, the cluster control plane may exist but node bootstrap/joining failed.

3.2 Check aws-auth / IAM identity mapping

AWS EKS docs state that managed nodegroups add entries to aws-auth, and if those were removed/modified, nodes won’t join. They also reference using eksctl IAM identity mapping commands.

Inspect IAM identity mappings:

eksctl get iamidentitymapping --cluster <cluster-name>

Inspect aws-auth ConfigMap:

kubectl get configmap aws-auth -n kube-system -o yaml

If you changed or deleted mappings, nodes can’t authenticate and register.
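A quick way to confirm a suspected missing mapping is to look for the node role ARN in the mapRoles section of aws-auth. The sketch below is an illustration under assumptions: role_is_mapped is a hypothetical name, and it expects unquoted rolearn values, which is how they commonly appear in the ConfigMap.

```shell
# Hypothetical helper: check whether a role ARN ($1) appears as a rolearn
# entry in the mapRoles section of the aws-auth ConfigMap.
role_is_mapped() {
  role_arn="$1"
  roles="$(kubectl get configmap aws-auth -n kube-system \
    -o jsonpath='{.data.mapRoles}')" || return 2
  # Exact match on the last field of each "rolearn:" line, so that
  # ".../role/node" does not match ".../role/node-extra".
  if printf '%s\n' "$roles" |
     awk -v arn="$role_arn" '/rolearn:/ && $NF == arn { found = 1 } END { exit !found }'; then
    echo "mapped: $role_arn"
  else
    echo "NOT mapped: $role_arn" >&2
    return 1
  fi
}
```

For a managed nodegroup, the role ARN to pass in can be read with aws eks describe-nodegroup --cluster-name <cluster-name> --nodegroup-name <nodegroup-name> --query "nodegroup.nodeRole" --output text. A node mapping normally also carries the system:bootstrappers and system:nodes groups, so check those by eye in the YAML output above.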

Why this is “EKS-only”: vanilla Kubernetes doesn’t have aws-auth; EKS uses IAM↔RBAC integration and aws-auth is part of that legacy flow. AWS re:Post explains that access errors happen when IAM identities don’t map to Kubernetes RBAC permissions and that aws-auth is involved.

4) Debug nodegroup creation issues without SSH (CloudFormation signal)

Most people waste time trying to SSH into nodes. Nodegroups provisioned by eksctl (or other CloudFormation-based tooling) are created via CloudFormation stacks, so the fastest “why did it fail” signal is in the stack events.

If you created the cluster with eksctl, inspect stacks:

eksctl utils describe-stacks --region <region> --cluster <cluster-name>

AWS eksctl troubleshooting docs specifically discuss failed stack creation and how CloudFormation behaviors affect debugging.

If stacks keep rolling back and you need to see the failure state, eksctl docs mention using --cfn-disable-rollback to stop CloudFormation from rolling back immediately (useful during troubleshooting).
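Once you know the stack name, the failed events can be pulled directly with the AWS CLI. The sketch below is an illustration, not something from the eksctl docs: failed_stack_events is a hypothetical name, and the stack name is an assumption (eksctl nodegroup stacks are typically named like eksctl-<cluster-name>-nodegroup-<nodegroup-name>; confirm the real name in the describe-stacks output).

```shell
# Hypothetical helper: print FAILED events for one CloudFormation stack
# as tab-separated rows: time, resource, status, reason.
# $1 = region, $2 = stack name.
failed_stack_events() {
  aws cloudformation describe-stack-events \
    --region "$1" \
    --stack-name "$2" \
    --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[Timestamp, LogicalResourceId, ResourceStatus, ResourceStatusReason]" \
    --output text
}
```

The earliest FAILED row is usually the interesting one; later failures are often just the rollback cascading.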

5) Turn on the missing visibility: EKS control plane logs

By default, EKS does not send control plane logs to CloudWatch, so you end up debugging blind. AWS documents how to enable them and lists the log types (api, audit, authenticator, controllerManager, scheduler). [docs.aws.amazon.com], [eksworkshop.com]

Enable ALL control plane logs with AWS CLI

aws eks update-cluster-config \
--region <region> \
--name <cluster-name> \
--logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

This exact pattern is shown in EKS Workshop docs and AWS EKS logging guidance. [eksworkshop.com], [docs.aws.amazon.com]

Why authenticator logs matter in EKS: AWS notes authenticator logs are unique to EKS and relate to IAM-based RBAC authentication. That’s gold when “kubectl access denied” happens. [docs.aws.amazon.com]
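Before changing anything, it can help to see which log types are still off. The sketch below reads the cluster's logging configuration and prints the types that are not enabled. The function name missing_log_types is hypothetical, and the query assumes the usual describe-cluster response shape (cluster.logging.clusterLogging).

```shell
# Hypothetical helper: print each control plane log type that is NOT
# currently enabled. $1 = region, $2 = cluster name.
missing_log_types() {
  enabled="$(aws eks describe-cluster --region "$1" --name "$2" \
    --query "cluster.logging.clusterLogging[?enabled].types[]" \
    --output text)" || return 2
  for t in api audit authenticator controllerManager scheduler; do
    # Unquoted $enabled collapses tabs and newlines into single spaces.
    case " $(echo $enabled) " in
      *" $t "*) ;;          # already enabled
      *) echo "$t" ;;       # missing
    esac
  done
}
```

No output means all five types are already being sent to CloudWatch.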

6) “kubectl logs/exec doesn’t work” in private clusters — the gotcha

AWS eksctl troubleshooting docs mention a specific failure mode where kubectl logs / kubectl run can fail with an authorization error when nodes are in private subnets, and point to VPC settings like DNS hostnames.

If you see errors like “unable to upgrade connection” / proxy authorization errors during logs/exec, don’t assume your app is broken — validate the cluster networking prerequisites first.
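One of those prerequisites, the VPC's DNS hostnames setting, can be read from the CLI. The sketch below looks up the cluster's VPC and prints the enableDnsHostnames attribute. vpc_dns_hostnames is a hypothetical name, and this covers only that one setting, not every networking prerequisite.

```shell
# Hypothetical helper: print the cluster's VPC ID and whether the VPC has
# DNS hostnames enabled. $1 = region, $2 = cluster name.
vpc_dns_hostnames() {
  vpc_id="$(aws eks describe-cluster --region "$1" --name "$2" \
    --query "cluster.resourcesVpcConfig.vpcId" --output text)" || return 2
  enabled="$(aws ec2 describe-vpc-attribute --region "$1" --vpc-id "$vpc_id" \
    --attribute enableDnsHostnames \
    --query "EnableDnsHostnames.Value" --output text)" || return 2
  echo "$vpc_id enableDnsHostnames=$enabled"
}
```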

7) LoadBalancer stuck in Pending (EKS + AWS integration reality)

This is another EKS-specific pain point: Kubernetes Service says LoadBalancer, but AWS LB never appears / External IP stays pending.

A practical EKS incident write-up shows this symptom and ties it to AWS LB controller and subnet tagging.

Core checks:

kubectl get svc -A
kubectl describe svc <service-name> -n <namespace>

Then look at whether you’re using AWS Load Balancer Controller and if it’s installed correctly.
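In a busy cluster, the first core check is easier to read if you filter it down to the Services that are actually stuck. A minimal sketch, assuming the default kubectl table layout; pending_loadbalancers is a hypothetical name.

```shell
# Hypothetical helper: list LoadBalancer Services that have no external
# address yet, printed as "namespace/name".
pending_loadbalancers() {
  # With -A and --no-headers the columns are:
  # NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
  kubectl get svc -A --no-headers |
    awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1 "/" $2 }'
}
```

Run kubectl describe svc on each result and read the Events section at the bottom; that is typically where provisioning errors surface.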

Internal deployment note (specific to your environment): your internal guide explicitly calls out needing CSI drivers (EFS/EBS) and an Ingress controller (NLB via ingress-nginx, or the ALB controller via Helm), plus permissions to create/update/delete CRDs and cluster roles. That’s a real-world “why is my ingress/LB stuck” prerequisite many blogs skip.
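The permission part of that prerequisite can be checked up front with kubectl auth can-i, which prints yes or no and exits non-zero on no. The sketch below loops over the verbs and resources named above; can_manage_cluster_objects is a hypothetical name, and your controller install may need more than these.

```shell
# Hypothetical helper: report whether the current identity may create,
# update, and delete CRDs and cluster roles. Returns 1 if any check fails.
can_manage_cluster_objects() {
  result=0
  for resource in customresourcedefinitions clusterroles; do
    for verb in create update delete; do
      # kubectl prints "yes" or "no"; a non-zero exit marks the run as failed.
      answer="$(kubectl auth can-i "$verb" "$resource" 2>/dev/null)" || result=1
      echo "$verb $resource: ${answer:-no}"
    done
  done
  return "$result"
}
```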

8) The “EKS debugging triangle”: Nodes, Storage, Ingress

When EKS problems feel random, they usually fall into one of these:

A) Nodes / bootstrap / aws-auth

No nodes
NodeGroup create failed
auth / RBAC mapping issues
Use Sections 1–4. [docs.aws.amazon.com], [repost.aws]

B) Storage (PVC Pending / CSI)

Your internal guide explicitly points to EFS/EBS CSI requirements on EKS and suggests verifying storage classes (kubectl get sc).

Useful checks:

kubectl get sc
kubectl get pvc -A
kubectl describe pvc <pvc-name> -n <namespace>
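To go straight to the claims that need attention, the PVC listing can be filtered by status. A minimal sketch, assuming the default kubectl table layout; pending_pvcs is a hypothetical name.

```shell
# Hypothetical helper: list PersistentVolumeClaims stuck in Pending,
# printed as "namespace/name".
pending_pvcs() {
  # With -A and --no-headers the first columns are: NAMESPACE NAME STATUS
  kubectl get pvc -A --no-headers |
    awk '$3 == "Pending" { print $1 "/" $2 }'
}
```

For each result, kubectl describe pvc shows the provisioning events. It is also worth running kubectl get csidrivers to confirm the driver your StorageClass expects (for example ebs.csi.aws.com or efs.csi.aws.com) is registered.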

C) Ingress / Load Balancer

Your internal guide calls out ALB controller via Helm and stresses correct access/permissions for installation and config.
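A quick way to tell "controller not installed" apart from "controller installed but unhealthy" is to read the Deployment's ready replica count. The sketch below is an illustration: controller_ready is a hypothetical name, and the default Deployment name and namespace are assumptions based on a common Helm install, so pass your own if they differ.

```shell
# Hypothetical helper: succeed only if the named Deployment has at least
# one ready replica. $1 = deployment name, $2 = namespace (both optional).
controller_ready() {
  name="${1:-aws-load-balancer-controller}"   # assumed default name
  ns="${2:-kube-system}"                      # assumed default namespace
  ready="$(kubectl get deployment "$name" -n "$ns" \
    -o jsonpath='{.status.readyReplicas}')" || {
    echo "could not read deployment $ns/$name" >&2
    return 2
  }
  # readyReplicas is omitted from status when nothing is ready.
  if [ "${ready:-0}" -ge 1 ]; then
    echo "$ns/$name ready replicas: $ready"
  else
    echo "$ns/$name has no ready replicas" >&2
    return 1
  fi
}
```

If the Deployment exists but is not ready, kubectl logs -n kube-system deployment/aws-load-balancer-controller is usually the next stop.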

9) Minimal “EKS First Response” command set (copy/paste)

Use this exact sequence during incidents:

aws sts get-caller-identity --query "Arn" --output text
aws eks --region <region> update-kubeconfig --name <cluster-name>

kubectl get nodes -o wide
kubectl get pods -A
kubectl get events -A --sort-by=.metadata.creationTimestamp

kubectl get configmap aws-auth -n kube-system -o yaml
eksctl get iamidentitymapping --cluster <cluster-name>

eksctl utils describe-stacks --region <region> --cluster <cluster-name>

This set is “EKS-aware”: it includes aws-auth/iam mapping + CloudFormation stack view, which most generic Kubernetes troubleshooting lists don’t cover. [docs.aws.amazon.com], [repost.aws]
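During an incident it helps if one failing command does not stop the rest, and if the output is saved for later. The sketch below wraps each command so it is captured to a file and the sequence keeps going. run_step and OUT_DIR are hypothetical names, and the directory naming is just one possible convention.

```shell
# Hypothetical wrapper: run one command, save stdout and stderr to
# "$OUT_DIR/<label>.txt", and keep going even if the command fails.
OUT_DIR="${OUT_DIR:-eks-triage-$(date +%Y%m%d-%H%M%S)}"

run_step() {
  label="$1"
  shift
  mkdir -p "$OUT_DIR"
  # "$@" is the command and its arguments, passed through unchanged.
  if "$@" > "$OUT_DIR/$label.txt" 2>&1; then
    echo "ok    $label"
  else
    echo "FAIL  $label (see $OUT_DIR/$label.txt)"
  fi
}
```

Each line of the command set then becomes a call such as run_step identity aws sts get-caller-identity --query "Arn" --output text or run_step nodes kubectl get nodes -o wide, and the resulting directory can be attached to the incident ticket.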

🚀 InfraDecode Takeaway (4 lines)

EKS issues are usually IAM (aws-auth), Nodes (bootstrap), or AWS integrations (LB/CSI).
Start with who am I (STS) and am I even on the right cluster (kubeconfig).
Then use aws-auth + CloudFormation stack events to avoid guessing.
Enable control plane logs early: EKS authenticator logs save hours. [docs.aws.amazon.com]
