Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 

README.md

PoC 3 — A ReAct Agent with Tools

A minimal command-line ReAct agent in Python: an LLM gets a natural-language goal, decides on its own whether to look something up or do a calculation, calls the matching tool, observes the result, and loops until it can answer — without us hard-coding the control flow.

ReAct = Reason + Act. The model interleaves thoughts (reasoning traces) with actions (tool calls), then reads observations (tool results) and decides what to do next. See Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, ICLR 2023.


🎯 What you will learn

By reading the code and running the agent, you will see, end-to-end:

  1. The anatomy of an agent: LLM (brain) + Tools (hands) + Memory (conversation history) + Planning (the ReAct loop).
  2. How structured output formats (here: a strict Thought / Action / Action Input block) let you treat an LLM as a programmable component.
  3. Why a terminator tool (final_answer) is a clean way to express "the agent is done" — no special control tokens needed.
  4. The two most important guardrails for any agent: a hard step budget and a safe tool-execution layer (no eval()).
  5. How an agent's trace (Thought → Action → Observation) makes its reasoning auditable in a way a single LLM call never is.

📋 The exact Copilot Agent prompt

This PoC was generated by pasting the following prompt into VS Code's Copilot Chat in Agent mode. It is reproduced verbatim from the lecture slides so you can paste it yourself and compare.

Build a command-line ReAct agent in Python: agent.py plus tools.py. In tools.py expose a dict TOOLS with three callables: calculator(expr) (safe eval on arithmetic only), search(query) (lookup in a small in-memory dictionary KB with at least 5 entries, e.g. "capital of france" -> "Paris"), and final_answer(text) (returns the text). In agent.py implement run_agent(goal, max_steps=6): a loop that calls the Anthropic API (claude-sonnet-4-6, key from ANTHROPIC_API_KEY) with a system prompt forcing the format "Thought:.../ Action:.../ Action Input:..."; parse the response with regex; execute the named tool; append the Observation back into the conversation history; stop when final_answer is invoked or max_steps is reached. Print every step. At the bottom, an if __name__ == "__main__" block that runs the agent on two example goals. Provide requirements.txt (anthropic), .gitignore, and a README with the run command.

📝 Deviations from the prompt:

  1. Uses OpenRouter (anthropic/claude-sonnet-4) instead of native Anthropic, for one-key access to many models.
  2. The calculator uses an AST walk with an allow-list of node types — it never calls Python's eval(). This matters: any tool you give an LLM is a tool a hostile prompt can try to abuse.
  3. Adds stop=["Observation:"] to the LLM call so the model can't hallucinate its own observation.

🏗️ Architecture

flowchart LR
    GOAL["🎯 User goal"] --> LOOP

    subgraph LOOP["🔁 ReAct loop (max_steps = 6)"]
        direction TB
        T["Thought:<br/>reason about<br/>next step"] --> A["Action:<br/>pick a tool"]
        A --> AI["Action Input:<br/>argument string"]
        AI --> EXEC["runtime executes<br/>TOOLS[action](input)"]
        EXEC --> OBS["Observation:<br/>tool result"]
        OBS -- "append to<br/>conversation" --> T
    end

    LOOP -- "Action == final_answer<br/>OR step budget hit" --> ANS["💡 Final answer"]

    subgraph TOOLS["🧰 Toolbox (tools.py)"]
        direction TB
        CALC["calculator(expr)<br/>safe AST eval"]
        SRCH["search(query)<br/>toy in-memory KB"]
        FA["final_answer(text)<br/>terminator"]
    end

    EXEC -.dispatch.-> CALC
    EXEC -.dispatch.-> SRCH
    EXEC -.dispatch.-> FA

    style GOAL fill:#dbeafe,stroke:#1e40af
    style ANS fill:#bbf7d0,stroke:#15803d
    style TOOLS fill:#fef3c7,stroke:#b45309
    style LOOP fill:#f3e8ff,stroke:#6b21a8
Loading

The LLM is not in the loop. It is called by the loop. Each iteration is a fresh call with the growing conversation history. This is what makes the trace deterministic to read (and easy to debug).


🧩 Components — file by file

tools.py — the agent's hands

Symbol Responsibility Notes
_safe_eval(node) Walk a Python AST, allowing only arithmetic node types and literal numbers. Raises on anything else. The single-most-important security boundary in this PoC.
calculator(expr) Parse expr with ast.parse(..., mode="eval") and pass to _safe_eval. Returns a string. Returns "ERROR: …" on bad input — never raises into the agent loop.
_KB Small Python dict acting as a knowledge base (capitals, fun facts). Easy to extend.
search(query) Case-insensitive substring lookup in _KB. Returns a clear NOT FOUND message with available keys, so the model can recover.
final_answer(text) Returns text unchanged. Acts as the loop terminator.
TOOLS (dict) Maps tool names (the strings the LLM emits) to callables. The contract between LLM and runtime.
TOOL_DESCRIPTIONS (dict) Human-readable descriptions injected into the system prompt. Lets the LLM know what each tool does and what input it expects.

agent.py — the ReAct loop

Section Symbol Responsibility
Constants OPENROUTER_BASE_URL, LLM_MODEL, MAX_STEPS Provider URL, model slug, and step budget.
System prompt SYSTEM_PROMPT Generated dynamically from TOOLS + TOOL_DESCRIPTIONS so adding a tool requires no prompt edits. Forces the strict output format.
Parser _ACTION_RE, _parse(response) A single regex extracts (thought, action, action_input). Returns None if the model deviates from the format.
Client _client() Reads OPENROUTER_API_KEY from .env, returns an OpenAI instance pointed at OpenRouter.
Loop run_agent(goal, max_steps=6) Implements the ReAct cycle: call LLM → parse → dispatch tool → log → append observation → repeat until final_answer or budget exhausted.
Entry point if __name__ == "__main__" Two sample goals, or override with CLI args.

One iteration of the loop, in detail

sequenceDiagram
    participant Loop as run_agent()
    participant LLM as OpenRouter LLM
    participant Tool as TOOLS[action]

    Loop->>LLM: messages (system + history)
    Note over LLM: stop=["Observation:"]<br/>so the model never<br/>hallucinates a result
    LLM-->>Loop: "Thought: …<br/>Action: search<br/>Action Input: capital of france"
    Loop->>Loop: regex parse → (thought, action, input)
    Loop->>Tool: TOOLS["search"]("capital of france")
    Tool-->>Loop: "Paris"
    Loop->>Loop: print step (audit trail)
    Loop->>Loop: append assistant turn + "Observation: Paris"
    alt action == "final_answer"
        Loop-->>Loop: STOP, return text
    else more steps allowed
        Note over Loop: next iteration
    end
Loading

⚙️ Setup

cd poc3_react_agent
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env       # then edit .env: OPENROUTER_API_KEY=sk-or-v1-…

Get a free OpenRouter key at https://openrouter.ai/keys.


▶️ Run

# Run with the two built-in example goals
python agent.py

# Or pass your own
python agent.py "What is the capital of Japan, and what is 8 squared?"

📚 Example data & query cheatsheet

The agent's "knowledge" lives in a tiny dictionary _KB at the top of tools.py. The full set of facts the search tool can find:

Key Value
capital of france Paris
capital of germany Berlin
capital of japan Tokyo
capital of brazil Brasília
capital of australia Canberra
speed of light 299,792,458 m/s
pi 3.14159265
founder of microsoft Bill Gates and Paul Allen
author of 1984 George Orwell

Anything else returns NOT FOUND plus the list of known keys — which the agent then has to handle gracefully (Test 4 below).

Curated example queries

Copy these one by one and read the printed traces. Each query is designed to exercise a different ReAct behaviour.

# Command Behaviour you should see
1 python agent.py "Who wrote 1984?" 2 steps: searchfinal_answer. Pure lookup.
2 python agent.py "What is 17 * 23 + 5?" 2 steps: calculatorfinal_answer. Pure math.
3 python agent.py "What is the capital of France, and what is twice the number of letters in its name?" 3 steps: search (Paris) → calculator (2 * 5 = 10) → final_answer. The headline multi-tool, multi-step demo.
4 python agent.py "What is the capital of Japan, and what is 8 squared?" 3 steps: search (Tokyo) → calculator (8 ** 2 = 64) → final_answer.
5 python agent.py "What is the capital of Mars?" search returns NOT FOUND → agent calls final_answer admitting Mars has no capital. Must NOT hallucinate.
6 python agent.py "Compute __import__('os').system('echo hacked')" calculator returns ERROR: Disallowed expression: …. Safety boundary — if you ever see hacked printed, the allow-list has been broken.

💡 Read traces top-to-bottom. Every Observation: you see was produced by Python code (the tool), not the LLM. Every Thought: and Action: was produced by the LLM. That separation is what makes the agent auditable.

Adding your own facts

Edit _KB in tools.py and restart:

_KB = {
    # …existing entries…
    "capital of mars": "Mars has no capital — it is uninhabited.",
    "boiling point of water": "100 °C at sea level.",
}

Then run python agent.py "What is the boiling point of water?" to see the new fact picked up.


🧪 Test plan

The point of these tests is not just "did it answer?" — it's to read the printed trace and confirm the agent is doing the right thing mechanically.

Test 1 — A pure search task

Run:

python agent.py "Who wrote 1984?"

Expected trace shape:

[step 1] Thought: I need to look this up.
[step 1] Action: search
[step 1] Action Input: author of 1984
[step 1] Observation: George Orwell

[step 2] Thought: I have the answer.
[step 2] Action: final_answer
[step 2] Action Input: 1984 was written by George Orwell.

✅ The agent should call search exactly once, then final_answer. 2 steps total.

Test 2 — A pure calculation task

python agent.py "What is 17 * 23?"

✅ One calculator call (Action Input: 17 * 23), one final_answer. 2 steps total.

Test 3 — Multi-step (the headline ReAct demo)

python agent.py "What is the capital of France, and what is twice the number of letters in its name?"

Verified output (Claude Sonnet 4 via OpenRouter):

[step 1] Thought: I need to find the capital of France first…
[step 1] Action: search
[step 1] Action Input: capital of france
[step 1] Observation: Paris

[step 2] Thought: Paris has 5 letters (P-a-r-i-s). I'll compute 2 * 5.
[step 2] Action: calculator
[step 2] Action Input: 2 * 5
[step 2] Observation: 10

[step 3] Action: final_answer
[step 3] Action Input: The capital of France is Paris, and twice the
                       number of letters in its name is 10.

✅ This is the canonical ReAct pattern: the result of step 1 (Paris) informs the argument of step 2 (5 letters → 2 * 5).

Test 4 — Graceful failure when knowledge is missing

python agent.py "What is the capital of Mars?"

Verified behaviour:

  • Step 1: search('capital of mars') returns NOT FOUND: 'capital of mars'. Known keys: … (whole-word matching prevents false hits like pi matching caPItal).
  • Step 2: the agent reads the available keys, realises Mars isn't there, and calls final_answer admitting Mars has no capital city.

✅ The agent must not hallucinate "Olympus Mons" or similar.

Test 5 — Step budget guardrail

Edit MAX_STEPS = 2 in agent.py and rerun Test 3.

✅ The trace should stop after step 2 with === STOPPED: max_steps=2 reached ===. This proves the budget is the last line of defence against runaway loops.

Test 6 — Calculator safety

python agent.py "Compute __import__('os').system('echo hacked')"

✅ The calculator must return ERROR: Disallowed expression: … rather than execute the call. If you ever see hacked printed, the _safe_eval allow-list has been broken — review immediately.


🛠️ Troubleshooting

Symptom Likely cause Fix
RuntimeError: OPENROUTER_API_KEY is not set .env missing or empty. cp .env.example .env and paste your key.
ERROR: malformed agent output printed once and the loop ends The model produced free text instead of the Thought / Action / Action Input triple. Try a stronger model slug; check that you didn't reduce max_tokens too low.
The agent loops forever calling search with slightly different inputs Knowledge gap — no entry in the toy KB. Add it to _KB in tools.py or accept that final_answer should admit uncertainty. Lower MAX_STEPS to enforce earlier termination.
search('capital of mars') returns an unrelated value (e.g. 3.14159265) An old version used substring matching, so pi matched caPItal. Current search() uses whole-word set containment — upgrade to the latest tools.py.
LLM call failed: 401 Invalid OpenRouter key. Generate a new key.

🔧 Tuning knobs

In agent.py:

LLM_MODEL = "anthropic/claude-sonnet-4"   # any OpenRouter slug
MAX_STEPS = 6                             # the most important guardrail

In client.chat.completions.create(...):

temperature=0.0           # determinism — tools should be picked predictably
stop=["Observation:"]     # don't let the model hallucinate observations
max_tokens=400            # cap each turn to keep cost predictable

To add a new tool, edit tools.py:

def my_tool(arg: str) -> str:
    ...

TOOLS["my_tool"] = my_tool
TOOL_DESCRIPTIONS["my_tool"] = "What it does. Input: …"

The system prompt rebuilds itself from these dicts on next launch — no prompt edits required.


💡 Extension ideas

  • rag_search tool that calls the FAISS retriever from PoC 1 — the agent can now answer questions about any PDF you've indexed.
  • product_search tool wired to the Chroma collection from PoC 2.
  • web_search tool using DuckDuckGo's HTML endpoint or the SerpApi. This is the single most impactful tool to add for a real research agent.
  • Long-term memory. Persist the agent's interesting observations to a JSON file and inject them into the system prompt on the next run.
  • Native tool calling. Replace the regex protocol with the OpenAI/Anthropic function-calling APIs. Trade-off: more robust, but ties you to a specific provider's schema.
  • Reflexion. After final_answer, run a second LLM pass that critiques the answer and triggers a retry if it spots a flaw.
  • Multi-agent. Add a planner that decomposes goals and an executor that runs sub-tasks. Only worth it if your single-agent trace becomes hard to follow — start simple.

📂 Files in this PoC

  • agent.py — the ReAct loop, system prompt, regex parser, OpenRouter client.
  • tools.pycalculator (safe AST eval), search (toy KB), final_answer (terminator), and the TOOLS dispatch dict.
  • requirements.txtopenai, python-dotenv.
  • .env.example — copy to .env and paste your OPENROUTER_API_KEY.
  • .gitignore — excludes .venv/, .env, etc.