From Chatbot to Real Agent: Building a Dynamic Tool‑Calling Kubernetes Troubleshooting Agent


🧠 Introduction

Most of us start using LLMs like this:

Ask question → get answer

But in real-world DevOps, that’s not enough.

When a pod fails, you don’t just need an answer — you need a system that:

  • Checks logs
  • Reads pod status
  • Looks at events
  • Decides what to do next

👉 That’s where AI agents come in.

In this post, we’ll build a dynamic tool-calling Kubernetes troubleshooting agent that:

  • Decides which action to take
  • Calls tools like kubectl logs (simulated)
  • Uses the results to find the root cause

💡 What is Tool Calling?

By default, LLMs only generate text.

But with tool calling:

👉 Instead of answering directly, the model can say:

Call this tool → with these arguments

Example:

{
  "tool": "get_logs",
  "arguments": {
    "pod": "myapp"
  }
}

👉 Your code executes it → returns result → model reasons again.
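A minimal sketch of that round trip in plain Python. The model response is hard-coded here, and `get_logs` and `REGISTRY` are hypothetical stand-ins used only for illustration:

```python
import json

# Hypothetical stand-in for a real tool; it returns canned data.
def get_logs(pod):
    return {"pod": pod, "logs": "ERROR: connection refused to DB"}

# Map the tool names the model may emit to Python callables.
REGISTRY = {"get_logs": get_logs}

# What the model would send back instead of a plain-text answer.
model_output = '{"tool": "get_logs", "arguments": {"pod": "myapp"}}'

# Parse the request, look up the function by name, and call it.
call = json.loads(model_output)
result = REGISTRY[call["tool"]](**call["arguments"])

# This string is what gets sent back to the model for the next round.
tool_message = json.dumps(result)
print(tool_message)
```

The model never touches `get_logs` itself. It only names the tool and its arguments, and your code decides whether and how to run it.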


🔧 Architecture We’re Building

User Input
   ↓
LLM (decision)
   ↓
Tool Call (JSON)
   ↓
Python executes tool
   ↓
Result back to LLM
   ↓
Final answer

⚙️ Step 1: Setup (Colab + Groq)

!pip install groq
from groq import Groq

client = Groq(api_key="YOUR_API_KEY")
MODEL = "llama-3.1-8b-instant"

🛠️ Step 2: Define Tools (Simulating Kubernetes)

These functions stand in for real kubectl commands. Each one returns canned data, so the agent can be tested without a cluster.

def get_logs(pod, namespace="default", tail_lines=200):
    # Simulated: namespace and tail_lines are accepted but ignored; the logs are canned
    return {
        "pod": pod,
        "logs": "ERROR: connection refused to DB\nERROR startup failed"
    }

def describe_pod(pod, namespace="default"):
    return {
        "pod": pod,
        "status": "CrashLoopBackOff",
        "restartCount": 5
    }

def get_events(namespace="default"):
    return {
        "events": ["Readiness probe failed", "Back-off restarting container"]
    }

📐 Step 3: Tool Schema (How LLM Understands Tools)

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_logs",
            "description": "Fetch logs for troubleshooting",
            "parameters": {
                "type": "object",
                "properties": {
                    "pod": {"type": "string"}
                },
                "required": ["pod"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "describe_pod",
            "description": "Describe pod status",
            "parameters": {
                "type": "object",
                "properties": {
                    "pod": {"type": "string"}
                },
                "required": ["pod"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_events",
            "description": "Fetch cluster events",
            "parameters": {
                "type": "object",
                "properties": {
                    "namespace": {"type": "string"}
                },
                "required": ["namespace"]
            }
        }
    }
]
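Because the schema and the Python function are written separately, they can drift apart. The following is one possible sanity check, not part of the Groq API: `schema_problems` is a hypothetical helper that compares a schema entry with a function signature using the standard `inspect` module.

```python
import inspect

# Same signature as the simulated tool in Step 2.
def get_logs(pod, namespace="default", tail_lines=200):
    return {"pod": pod, "logs": "..."}

SCHEMA = {
    "type": "function",
    "function": {
        "name": "get_logs",
        "parameters": {
            "type": "object",
            "properties": {"pod": {"type": "string"}},
            "required": ["pod"],
        },
    },
}

def schema_problems(schema, func):
    """Return a list of mismatches between a tool schema and a function."""
    params = inspect.signature(func).parameters
    spec = schema["function"]["parameters"]
    required = spec.get("required", [])
    problems = []
    # Every property the model may send must be an accepted keyword.
    for name in spec.get("properties", {}):
        if name not in params:
            problems.append(f"schema property {name!r} is not a parameter")
    # Every parameter without a default must be marked required.
    for name, p in params.items():
        if p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
            continue  # *args / **kwargs never need to be listed
        if p.default is inspect.Parameter.empty and name not in required:
            problems.append(f"parameter {name!r} has no default but is not required")
    return problems

print(schema_problems(SCHEMA, get_logs))
```

An empty list means every schema property maps to a real parameter and every mandatory parameter is marked as required.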

🔁 Step 4: Dynamic Agent Loop (Core Logic)

This is where the “agent behavior” happens.

import json

TOOL_REGISTRY = {
    "get_logs": get_logs,
    "describe_pod": describe_pod,
    "get_events": get_events
}

def dynamic_tool_agent(user_input):

    messages = [
        {"role": "system", "content": "You are a Kubernetes troubleshooting agent."},
        {"role": "user", "content": user_input}
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=TOOLS,
        tool_choice="auto"
    )

    msg = response.choices[0].message

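    if msg.tool_calls:
        # Only the first requested tool call is handled here
        tool_call = msg.tool_calls[0]

        tool_name = tool_call.function.name
        args = json.loads(tool_call.function.arguments)

        # Look up the Python function by name and call it with the model's arguments
        result = TOOL_REGISTRY[tool_name](**args)

        # The assistant message that requested the tool must come before the tool result
        messages.append({
            "role": "assistant",
            "content": msg.content or "",
            "tool_calls": [{
                "id": tool_call.id,
                "type": "function",
                "function": {
                    "name": tool_name,
                    "arguments": tool_call.function.arguments
                }
            }]
        })

        # tool_call_id links this result to the request it answers
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })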

        final = client.chat.completions.create(
            model=MODEL,
            messages=messages
        )

        return final.choices[0].message.content

    return msg.content

🧪 Step 5: Run the Agent

response = dynamic_tool_agent("Pod myapp is crashing repeatedly")

print(response)

🔥 What’s Happening Internally

When you run:

"Pod myapp is crashing"

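👉 A full troubleshooting run looks like this. The function in Step 4 handles only the first tool call, so getting past step 2 means repeating the tool-call step until the model stops asking for tools: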

  1. Calls describe_pod
  2. Sees CrashLoopBackOff
  3. Calls get_logs
  4. Finds DB issue
  5. Returns fix
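The sequence above can be sketched as a loop. To keep the example runnable offline, the LLM is replaced by `scripted_model`, a hypothetical function that picks the next step from the message history. `run_agent`, `REGISTRY`, and the `max_steps` limit are likewise assumptions made for illustration.

```python
import json

# Simulated tools, same shape as the ones in Step 2.
def describe_pod(pod):
    return {"pod": pod, "status": "CrashLoopBackOff", "restartCount": 5}

def get_logs(pod):
    return {"pod": pod, "logs": "ERROR: connection refused to DB"}

REGISTRY = {"describe_pod": describe_pod, "get_logs": get_logs}

def scripted_model(messages):
    """Hypothetical stand-in for the LLM: chooses the next step from history."""
    tools_used = [m["name"] for m in messages if m["role"] == "tool"]
    if "describe_pod" not in tools_used:
        return {"tool": "describe_pod", "arguments": {"pod": "myapp"}}
    if "get_logs" not in tools_used:
        return {"tool": "get_logs", "arguments": {"pod": "myapp"}}
    return {"answer": "The pod cannot reach its database; check the DB service."}

def run_agent(user_input, model, max_steps=5):
    messages = [{"role": "user", "content": user_input}]
    # Keep asking the model what to do until it answers or the limit is hit.
    for _ in range(max_steps):
        decision = model(messages)
        if "answer" in decision:
            return decision["answer"], messages
        result = REGISTRY[decision["tool"]](**decision["arguments"])
        # Feed the tool result back so the next decision can use it.
        messages.append({
            "role": "tool",
            "name": decision["tool"],
            "content": json.dumps(result),
        })
    return "Stopped: step limit reached.", messages

answer, history = run_agent("Pod myapp is crashing", scripted_model)
print([m["name"] for m in history if m["role"] == "tool"])
print(answer)
```

The step limit is a design choice worth keeping when a real model is plugged in, since nothing else stops a model that keeps requesting tools.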

🚀 Why This is Powerful

✅ Before (Static logic)

IF crash → logs
IF issue → describe

✅ Now (Dynamic AI Agent)

  • LLM decides ✅
  • Tools executed dynamically ✅
  • Multi-step reasoning ✅

🧠 Real DevOps Mapping

Agent Tool        Real Command
get_logs()        kubectl logs
describe_pod()    kubectl describe pod
get_events()      kubectl get events
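To move from simulated tools to a real cluster, each tool could build the matching kubectl command and run it with `subprocess`. This is a sketch under assumptions: the helper names (`logs_cmd`, `describe_cmd`, `events_cmd`, `run`) are hypothetical, and it presumes kubectl is installed and configured for your cluster.

```python
import subprocess

# Each helper returns an argument list rather than a shell string,
# so model-supplied values are never interpreted by a shell.
def logs_cmd(pod, namespace="default", tail_lines=200):
    return ["kubectl", "logs", pod, "-n", namespace, f"--tail={tail_lines}"]

def describe_cmd(pod, namespace="default"):
    return ["kubectl", "describe", "pod", pod, "-n", namespace]

def events_cmd(namespace="default"):
    return ["kubectl", "get", "events", "-n", namespace]

def run(cmd, timeout=30):
    """Run a command without a shell and return its output as a dict."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return {"returncode": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}

# Building a command is safe to try anywhere; run() needs a real cluster.
print(logs_cmd("myapp"))
```

A real `get_logs` would then return `run(logs_cmd(pod, namespace, tail_lines))`, keeping the same function name so the tool schema does not change.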

⚠️ Key Learnings


✅ LLM does NOT execute tools

It only returns a structured request; your code decides whether to run it


✅ Tool results must go back to LLM

Otherwise the model never sees what the tool found, and there is no reasoning loop


✅ JSON must be parsed safely

Never trust LLM outputs blindly
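One way to apply this is to wrap tool execution so that bad model output becomes an error message instead of a crash. The helper below, `safe_execute`, is a hypothetical name, and `get_logs` is the same kind of simulated tool used earlier:

```python
import json

def get_logs(pod):
    return {"pod": pod, "logs": "ERROR: connection refused to DB"}

REGISTRY = {"get_logs": get_logs}

def safe_execute(tool_name, raw_arguments):
    """Run a tool call, returning an error dict instead of raising."""
    # Reject tool names the model made up.
    func = REGISTRY.get(tool_name)
    if func is None:
        return {"error": f"unknown tool: {tool_name}"}
    # The arguments arrive as a string and may not be valid JSON.
    try:
        args = json.loads(raw_arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        return {"error": f"invalid JSON arguments: {exc}"}
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}
    # Missing or unexpected keyword arguments raise TypeError.
    # Note: this also catches a TypeError raised inside the tool itself.
    try:
        return func(**args)
    except TypeError as exc:
        return {"error": f"bad arguments: {exc}"}

print(safe_execute("get_logs", '{"pod": "myapp"}'))
print(safe_execute("delete_cluster", "{}"))
```

Returning the error as the tool result, rather than raising, gives the model a chance to correct its own call on the next turn.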


InfraDecode Takeaway

“Chatbots answer questions.
Agents solve problems.”

InfraDecode

