From Chatbot to Real Agent: Building a Dynamic Tool‑Calling Kubernetes Troubleshooting Agent

🧠 Introduction

Most of us start using LLMs like this:

Ask question → get answer

But in real-world DevOps, that’s not enough.

When a pod fails, you don’t just need an answer — you need a system that:

Checks logs
Reads pod status
Looks at events
Decides what to do next

👉 That’s where AI agents come in.

In this post, we’ll build a dynamic tool-calling Kubernetes troubleshooting agent that:

Decides which action to take
Calls tools like kubectl logs (simulated)
Uses results to find root cause

💡 What is Tool Calling?

By default, LLMs only generate text.

But with tool calling:

👉 Instead of answering directly, the model can say:

Call this tool → with these arguments

Example:

{
  "tool": "get_logs",
  "arguments": {
    "pod": "myapp"
  }
}

👉 Your code executes it → returns result → model reasons again.
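As a minimal sketch of that hand-off, the snippet below parses a tool call shaped like the example above and routes it to a Python function. The `get_logs` stub and the `REGISTRY` name are placeholders for illustration, not part of any library.

```python
import json

# Hypothetical stub standing in for a real log fetcher.
def get_logs(pod):
    return {"pod": pod, "logs": "ERROR: connection refused to DB"}

# Map tool names the model may emit to the Python functions that implement them.
REGISTRY = {"get_logs": get_logs}

# The model's reply arrives as text; here it is the example from above.
raw_call = '{"tool": "get_logs", "arguments": {"pod": "myapp"}}'

call = json.loads(raw_call)
result = REGISTRY[call["tool"]](**call["arguments"])

# This string is what gets sent back to the model for the next reasoning step.
tool_message = json.dumps(result)
print(tool_message)
```

The model never touches `get_logs` itself. It only names the tool, and the dictionary lookup decides what actually runs.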

🔧 Architecture We’re Building

User Input
   ↓
LLM (decision)
   ↓
Tool Call (JSON)
   ↓
Python executes tool
   ↓
Result back to LLM
   ↓
Final answer
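The flow above can be exercised without an API key by swapping the LLM for a scripted stand-in. In this sketch, `scripted_model` is a hypothetical fake that requests one tool and then answers; a real model would make those decisions itself.

```python
import json

def describe_pod(pod):
    # Canned data; a real tool would query the cluster.
    return {"pod": pod, "status": "CrashLoopBackOff"}

TOOLS_BY_NAME = {"describe_pod": describe_pod}

def scripted_model(messages):
    """Hypothetical stand-in for the LLM: ask for a tool first, then answer."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool": "describe_pod", "arguments": {"pod": "myapp"}}
    status = json.loads(tool_results[-1]["content"])["status"]
    return {"answer": f"Pod myapp is in {status}."}

def run(user_input, max_steps=3):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        decision = scripted_model(messages)        # LLM (decision)
        if "answer" in decision:
            return decision["answer"]              # Final answer
        func = TOOLS_BY_NAME[decision["tool"]]     # Tool call (JSON)
        result = func(**decision["arguments"])     # Python executes tool
        # Result back to LLM
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "No answer within the step limit."

print(run("Pod myapp is crashing"))
```

The `max_steps` cap is a design choice worth keeping in a real agent too, since it stops a confused model from calling tools forever.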

⚙️ Step 1: Setup (Colab + Groq)

!pip install groq

from groq import Groq

client = Groq(api_key="YOUR_API_KEY")
MODEL = "llama-3.1-8b-instant"

🛠️ Step 2: Define Tools (Simulating Kubernetes)

These functions stand in for real kubectl commands. Each one returns canned data so the agent can run without a cluster.

def get_logs(pod, namespace="default", tail_lines=200):
    return {
        "pod": pod,
        "logs": "ERROR: connection refused to DB\nERROR startup failed"
    }

def describe_pod(pod, namespace="default"):
    return {
        "pod": pod,
        "status": "CrashLoopBackOff",
        "restartCount": 5
    }

def get_events(namespace="default"):
    return {
        "events": ["Readiness probe failed", "Back-off restarting container"]
    }

📐 Step 3: Tool Schema (How LLM Understands Tools)

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_logs",
            "description": "Fetch logs for troubleshooting",
            "parameters": {
                "type": "object",
                "properties": {
                    "pod": {"type": "string"}
                },
                "required": ["pod"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "describe_pod",
            "description": "Describe pod status",
            "parameters": {
                "type": "object",
                "properties": {
                    "pod": {"type": "string"}
                },
                "required": ["pod"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_events",
            "description": "Fetch cluster events",
            "parameters": {
                "type": "object",
                "properties": {
                    "namespace": {"type": "string"}
                },
                "required": ["namespace"]
            }
        }
    }
]
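Hand-written schemas can drift from the functions they describe. For instance, `get_logs` in Step 2 accepts `namespace` and `tail_lines`, which the schema above never mentions. One way to keep the two aligned is to derive the schema from the function signature. The helper below is a rough sketch: `schema_from_function` is a made-up name, and it assumes every parameter is either a string or an integer.

```python
import inspect

def schema_from_function(func, description):
    """Build a tool schema in the format used above from a function signature.

    Assumption: parameters with an int default are integers; all others are strings.
    """
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        is_int = isinstance(param.default, int) and not isinstance(param.default, bool)
        properties[name] = {"type": "integer" if is_int else "string"}
        # Parameters without a default must be supplied by the model.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }

def get_logs(pod, namespace="default", tail_lines=200):
    return {"pod": pod, "logs": ""}

schema = schema_from_function(get_logs, "Fetch logs for troubleshooting")
print(schema["function"]["parameters"])
```

Descriptions still have to be written by hand, and they matter: the model relies on them to pick the right tool.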

🔁 Step 4: Dynamic Agent Loop (Core Logic)

This is where the “agent behavior” happens.

import json

TOOL_REGISTRY = {
    "get_logs": get_logs,
    "describe_pod": describe_pod,
    "get_events": get_events
}

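def dynamic_tool_agent(user_input, max_steps=5):

    messages = [
        {"role": "system", "content": "You are a Kubernetes troubleshooting agent."},
        {"role": "user", "content": user_input}
    ]

    # Loop so the model can chain several tool calls before answering.
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )

        msg = response.choices[0].message

        # No tool requested: the model has produced its final answer.
        if not msg.tool_calls:
            return msg.content

        # Record the assistant turn that requested the tools. It must appear
        # in the history before the matching tool results.
        messages.append({
            "role": "assistant",
            "content": msg.content or "",
            "tool_calls": [
                {
                    "id": tc.id,
                    "type": "function",
                    "function": {
                        "name": tc.function.name,
                        "arguments": tc.function.arguments
                    }
                }
                for tc in msg.tool_calls
            ]
        })

        for tool_call in msg.tool_calls:
            tool_name = tool_call.function.name
            args = json.loads(tool_call.function.arguments)

            # Look up the Python function by name and call it with the parsed arguments.
            result = TOOL_REGISTRY[tool_name](**args)

            # tool_call_id ties this result to the request it answers.
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })

    return "Stopped after reaching the step limit without a final answer."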

🧪 Step 5: Run the Agent

response = dynamic_tool_agent("Pod myapp is crashing repeatedly")

print(response)

🔥 What’s Happening Internally

When you run:

"Pod myapp is crashing repeatedly"

👉 A typical run looks like this (the model chooses the order, so yours may differ):

Calls describe_pod
Sees CrashLoopBackOff
Calls get_logs
Finds DB issue
Returns fix

🚀 Why This is Powerful

✅ Before (Static logic)

IF crash → logs
IF issue → describe

✅ Now (Dynamic AI Agent)

LLM decides ✅
Tools executed dynamically ✅
Multi-step reasoning ✅

🧠 Real DevOps Mapping

Agent Tool	Real Command
get_logs()	kubectl logs
describe_pod()	kubectl describe pod
get_events()	kubectl get events
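To move from simulation to a real cluster, each tool can shell out to the matching command. The sketch below does this for `get_logs`. It assumes `kubectl` is installed and already pointed at the right cluster, and the helper name `build_logs_command` is just for illustration.

```python
import subprocess

def build_logs_command(pod, namespace="default", tail_lines=200):
    # Build an argument list (not a shell string) so a pod name
    # cannot inject extra shell commands.
    return ["kubectl", "logs", pod, "-n", namespace, f"--tail={tail_lines}"]

def get_logs(pod, namespace="default", tail_lines=200):
    """Real counterpart of the simulated get_logs; requires kubectl on PATH."""
    cmd = build_logs_command(pod, namespace, tail_lines)
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # kubectl missing or the call hung: report it as data, not a crash.
        return {"pod": pod, "error": str(exc)}
    if proc.returncode != 0:
        return {"pod": pod, "error": proc.stderr.strip()}
    return {"pod": pod, "logs": proc.stdout}

# Only the command is printed here, so this runs without a cluster.
print(build_logs_command("myapp"))
```

Returning errors as a dictionary keeps the agent loop alive: the model sees the failure and can try a different tool or namespace.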

⚠️ Key Learnings

✅ LLM does NOT execute tools

It only returns structured intent

✅ Tool results must go back to LLM

Otherwise, no reasoning loop

✅ JSON must be parsed safely

Never trust LLM outputs blindly
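
A defensive dispatcher might look like the sketch below. It rejects malformed JSON, unknown tool names, and unexpected arguments, and returns the error as data so the model can see it and retry. `safe_call` and `ALLOWED_ARGS` are illustrative names, not part of the Groq SDK.

```python
import json

def get_logs(pod):
    return {"pod": pod, "logs": "ERROR: connection refused to DB"}

TOOL_REGISTRY = {"get_logs": get_logs}
# Hypothetical allow-list of argument names per tool.
ALLOWED_ARGS = {"get_logs": {"pod"}}

def safe_call(tool_name, raw_arguments):
    """Validate a model-issued tool call before running it."""
    if tool_name not in TOOL_REGISTRY:
        return {"error": f"unknown tool: {tool_name}"}
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return {"error": f"invalid JSON arguments: {exc.msg}"}
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}
    unexpected = set(args) - ALLOWED_ARGS[tool_name]
    if unexpected:
        return {"error": f"unexpected arguments: {sorted(unexpected)}"}
    try:
        return TOOL_REGISTRY[tool_name](**args)
    except TypeError as exc:
        # Raised, for example, when a required argument is missing.
        return {"error": str(exc)}

print(safe_call("get_logs", '{"pod": "myapp"}'))
print(safe_call("delete_cluster", "{}"))
print(safe_call("get_logs", "{not json"))
```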

InfraDecode Takeaway

“Chatbots answer questions.
Agents solve problems.”

— InfraDecode
