I've been using Claude Code for a while now, and like any developer who relies on a tool daily, I eventually got curious about what's happening behind the scenes. How does an LLM that can only produce text somehow end up reading my files, running curl commands, and writing new files to disk? The "AI agent" buzzword gets thrown around so much that it started feeling like magic. And magic usually just means I don't understand something yet.
So, taking inspiration from Mihail Eric’s great article “The Emperor Has No Clothes: How to Code Claude Code in 200 Lines of Code”, I decided to build my own mini version of Claude Code from scratch. A small REPL, a handful of tools (read files, write files, list directories, run bash), and an agent loop gluing everything together. No fancy frameworks, just the Anthropic SDK and some TypeScript.
The result is a project I've pushed to GitHub, and in this article, I'll walk you through what I built, the concepts I had to wrap my head around, and some real runs showing the agent in action. If you've ever wondered how Claude Code actually works, or you want to build something similar for your own use case, this is for you.
Here’s a quick demo showing the custom harness at work:
Let's dig in.
Why Bother Building One?
Before we get into the code, it's worth asking: why build this at all when Claude Code already exists and works beautifully?
A few reasons I kept coming back to:
- Understanding through building. Using a tool is one thing; knowing what makes it tick is another. Once I built my own harness, the behaviour of Claude Code (and any other agent) stopped feeling mysterious.
- It's a genuinely small amount of code. The core agent loop fits on a single screen. I expected way more machinery.
- Custom tools are where the real power is. Once you understand the pattern, you can build agents that do things no general-purpose tool would do: hitting your company's internal APIs, manipulating proprietary data formats, automating niche workflows.
By the end of the project, I had a working REPL where I could ask for things like "summarise all the npm packages in this project with their current download stats and save it to a file" and watch the model orchestrate eight parallel shell calls to do it. Not bad at all.
What is an LLM, Really?
Let's start from the beginning. A Large Language Model is a neural network trained to predict the next chunk of text (a "token") given the text that came before. That's it. When you chat with Claude, ChatGPT, or Gemini, you're sending a sequence of messages to such a model and getting back a continuation.
Source: https://platform.openai.com/tokenizer
Three facts became absolutely critical for this project:
- The model is stateless. Every API call is independent. If you want it to remember the previous turn, you must send the previous turn back in the next request. That's what the "history" in this project is: a plain array of messages we prepend to every call. Anthropic provides improvements to this approach, but I wanted to keep it simple.
- The model can only produce text. It cannot, by itself, read a file on your computer or run git status. It only emits tokens.
- Tool use is a structured way for the model to say: "I would like to run read_file with path='src/index.ts'". Your code then actually runs the tool and sends the result back. This is the entire trick behind "AI agents". Here's where the "magic" happens.
That last one was the "aha!" moment for me. The model isn't running code. It's emitting a structured request, you run the code, and then you tell the model what happened. The agent loop is just the scaffolding around that dance.
An Anthropic API request is basically this:
```json
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 4096,
  "system": "You are a coding assistant...", // system prompt
  "tools": [ /* tool definitions */ ],
  "messages": [
    { "role": "user", "content": "list the files in src" },
    { "role": "assistant", "content": [ /* text + tool_use blocks */ ] },
    { "role": "user", "content": [ /* tool_result blocks */ ] },
    ...
  ]
}
```

The model responds with a list of content blocks (text blocks, tool_use blocks) plus a stop_reason. That's what we'll be looping around.
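For reference, a tool-use response from the API looks roughly like this (the ID and wording here are illustrative, not copied from a real run):

```json
{
  "stop_reason": "tool_use",
  "content": [
    { "type": "text", "text": "Let me list that directory." },
    {
      "type": "tool_use",
      "id": "toolu_01...",
      "name": "list_dir",
      "input": { "path": "src" }
    }
  ]
}
```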
This is a good moment to give you a quick reminder that you can find the code for this project on my GitHub.
What is a "Harness"?
A harness is the code around the model that makes it useful for a given task. At minimum it does five things:
- Sends requests to the model's API.
- Parses the response.
- If the model asked to use a tool, actually runs the tool.
- Sends the tool's output back to the model.
- Repeats until the model says it's done.
That's all this project is. Everything else (spinner, markdown rendering, input box) is polish.
The diagram above shows the high-level choreography. The Harness is the yellow box, which is the structure around a program (blue lifeline) plus an LLM (green lifeline). Nothing about this box is Anthropic-specific. Claude Code, our little REPL, and any other "AI agent" you've heard of most likely fit this overall shape (although I recently noticed that Claude Code has some optimisations around programmatic tool calling).
Read it top to bottom:
- The user sends a message.
- The program (harness code) forwards it to the LLM.
- The LLM decides it needs a tool, and returns a "please call this tool" message back to the program.
- The program actually runs the tool on the machine (the model can't, it only emits text).
- The program forwards the tool's output back to the LLM.
- The LLM turns that result into a final answer.
- The program relays that answer to the user.
The loop-back arrow between program and LLM (steps 3 to 5) can repeat many times in one turn. That loop is the heart of the project, and we'll unpack it shortly.
Project Setup
A quick tour of the scaffolding before we get to the interesting parts.
package.json: nothing exotic. I'm using "type": "module" for native ES modules, tsx to run TypeScript directly (no compile step), and a handful of dependencies: @anthropic-ai/sdk (the official Anthropic client), dotenv (loads .env.local), ora (spinner), cli-markdown (renders markdown in the terminal), and string-width (measures the display width of unicode for the input box in the terminal).
.env.local: this file holds secrets. Git must never see it:
ANTHROPIC_API_KEY=sk-ant-...
The .gitignore has .env*.local so it won't be committed. Losing an API key to a public repo is a classic mistake, and I wanted to defend against it by default.
src/config.ts holds the knobs:
```typescript
import dotenv from 'dotenv'

dotenv.config({ path: '.env.local' })

export const MODEL = 'claude-sonnet-4-6'
export const MAX_TOKENS = 4096
export const MAX_WRITE_BYTES = 1_000_000
export const MAX_HISTORY_TURNS = 40
export const BASH_TIMEOUT_MS = 30_000
export const BASH_MAX_BUFFER = 1_000_000
```

A couple of these deserve a note. MAX_TOKENS is the model's hard upper bound on how long a single response can be. I set 4096 tokens (around 3000 words). If it hits this cap mid-answer, the API tells us (stop_reason === 'max_tokens') and we surface that back to the user instead of silently truncating. The other caps exist to bound the blast radius if the model misbehaves, and we'll see each of them in action later.
The System Prompt
The system prompt is where you set the persona, the house rules, and the output format. Mine is short and deliberate:
```typescript
export const systemPrompt = `You are a coding assistant whose goal is to help users solve coding tasks.
When the user asks about a file's contents respond with a short preview (at most the first 10 lines) inside a fenced Markdown code block and summarise the rest in plain language rather than quoting it in full.
Always prefer tool use over guessing. After a tool call, incorporate the result and continue until the task is complete.`
```

Two things in there are doing real work:
- "prefer tool use over guessing": without this, the model might end up making up the contents of files instead of reading them. Saying so explicitly nudges the behaviour.
- "after a tool call, incorporate the result and continue": reminds the model that the agent loop will give it another turn, so it doesn't need to answer prematurely.
I deliberately did not list each tool in the system prompt. The tools field in the API request already documents them (name, description, and JSON schema for inputs). Repeating that in the system prompt just wastes tokens and risks drift.
Tools: How the Model Reaches Out to the Machine
This is where things get interesting. A tool has three parts:
- Definition (as required by Anthropic’s API): metadata the API sends to the model: name, description, and a JSON Schema for its input.
- Handler: a function in your code that actually does the work when the model invokes the tool.
- Result: a string the model will see on the next turn.
Every tool in this project implements the Tool interface from src/types.ts:
```typescript
export interface ToolResult {
  content: string // sent back to the model
  display: string // one-line human summary printed to the terminal
}

export interface Tool {
  definition: ToolDefinition
  handler: (input: Record<string, unknown>) => Promise<ToolResult> | ToolResult
}
```

Splitting content and display is a small detail I'm pretty happy with. content is what the model sees, often a JSON blob so it can parse it easily. display is what the human sees, a practical one-liner so you're not staring at walls of JSON while the agent works (trust me, it’s not fun 😅).
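The definition half is just JSON-Schema metadata. As an illustration, read_file's definition could look roughly like this (the wording is mine; the version in the repo may differ):

```typescript
const readFileDefinition = {
  name: 'read_file',
  description:
    'Read a UTF-8 text file inside the project and return its full contents. ' +
    'Use this instead of guessing what a file contains.',
  input_schema: {
    type: 'object',
    properties: {
      path: { type: 'string', description: 'File path relative to the project root' }
    },
    required: ['path']
  }
}
```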
Sandboxing: The Most Important 10 Lines in the Project, from a Security Perspective
Before writing any tool that touches the filesystem, it's worth thinking about what happens if the model writes path = "../../../etc/passwd". That path resolves outside your project. safeResolve prevents this:
```typescript
import path from 'node:path'

export function safeResolve(rel: string, root = process.cwd()): string {
  if (path.isAbsolute(rel)) throw new Error(`Absolute paths are not allowed: ${rel}`);
  const full = path.resolve(root, rel);
  const rootWithSep = root.endsWith(path.sep) ? root : root + path.sep;
  if (full !== root && !full.startsWith(rootWithSep)) throw new Error(`Path escapes the sandbox: ${rel}`);
  return full;
}
```

We treat process.cwd() (the directory you were in when you ran npm start) as the sandbox root. path.resolve normalises ../ segments, and then we check the result still lives inside the root. If it doesn't, we refuse.
Every filesystem tool runs its input through safeResolve first. Even if the model is jailbroken or ignores the system prompt, it can't touch files outside the project.
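For instance, a traversal attempt from the model is rejected before it ever touches the filesystem:

```typescript
safeResolve('src/index.ts')        // => absolute path inside the project root
safeResolve('../../../etc/passwd') // => throws: resolved path escapes the sandbox
```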
The Four Tools
I kept the toolkit small:
- read_file runs safeResolve, then fs.readFile, and returns the raw UTF-8 contents to the model plus a short display line (file path, bytes, lines) to the terminal.
- write_file adds two guardrails on top of safeResolve: a size cap (MAX_WRITE_BYTES, 1 MB) to stop a runaway model filling your disk, and auto-creation of parent directories so the model can write to paths whose folders don't exist yet.
- list_dir returns a JSON array of { name, type } entries for each item in the directory. JSON makes the result easy for the model to parse.
- bash is the most powerful and most dangerous one:
```typescript
const { stdout, stderr } = await execAsync(cmd, {
  cwd: process.cwd(),
  timeout: BASH_TIMEOUT_MS,   // 30 s
  maxBuffer: BASH_MAX_BUFFER, // 1 MB combined
  shell: '/bin/sh'
})
```

I wrapped Node's child_process.exec with promisify so we can await it. The timeout kills hanging processes, and maxBuffer caps runaway output. The tool returns a JSON blob either way:
```json
{
  "ok": true,
  "exitCode": 0,
  "stdout": "...",
  "stderr": "...",
  "timedOut": false
}
```

One decision worth calling out: non-zero exits become ok: false but we do not throw. Why? Because an exit code of 1 from grep (no match) is a valid, useful signal for the model. If we threw, the model would only see an exception and couldn't use that signal.
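In practice that means wrapping execAsync in a try/catch and translating failures into the same JSON shape. A minimal sketch, using the names from the snippets above (the exact field mapping is an assumption, not copied from the repo):

```typescript
// Inside the bash tool handler
let result
try {
  const { stdout, stderr } = await execAsync(cmd, {
    cwd: process.cwd(),
    timeout: BASH_TIMEOUT_MS,
    maxBuffer: BASH_MAX_BUFFER,
    shell: '/bin/sh'
  })
  result = { ok: true, exitCode: 0, stdout, stderr, timedOut: false }
} catch (err: any) {
  // exec rejects on non-zero exits, timeouts and buffer overflows;
  // turn all of them into data the model can reason about instead of throwing
  result = {
    ok: false,
    exitCode: err.code ?? null,
    stdout: err.stdout ?? '',
    stderr: err.stderr ?? '',
    timedOut: err.killed === true // assumption: exec marks timed-out processes as killed
  }
}
return { content: JSON.stringify(result), display: `bash ${cmd}` }
```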
Finally, the tool registry in src/tools/index.ts wires everything up:
```typescript
export const tools: Tool[] = [readFile, writeFile, listDir, bash]
export const toolMap = new Map(tools.map(t => [t.definition.name, t]))
export const toolDefinitions = tools.map(t => t.definition)
```

toolDefinitions is what we send to the API (the model never sees the handler, only the metadata). toolMap is what the loop uses to find the right handler given a tool_use name.
The Agent Loop: The Heart of the Project
This is the bit I wanted to understand most before I started. How do you turn "the model emitted a tool_use block" into "the tool actually ran and the model got the result"? Let's look at the diagram first, then the code.
The diagram splits into two nested loops:
- Internal loop (the light-grey box): handles one user prompt. While the LLM keeps asking for tools, the program keeps extracting tool_use blocks into a toolUses array, invoking each tool.handler(), and feeding the combined tool_result back in as a user-role message. The loop exits the moment a response comes back with stop_reason: end_turn instead of tool_use.
- External loop (the outer light-yellow dashed box): one iteration per REPL turn. Before the final reply goes back to the user, the program trims the history array so the next turn doesn't carry unbounded context.
Here's pseudocode first, because the real code has a bit of ceremony that can obscure the idea:
append user message to history
loop:
response = model.create(history, tools)
append response to history
if response says "done":
return text
if response says "call tool X with args A":
result = handlers[X](A)
append result to history
continue
Now the actual thing (simplified):
```typescript
export async function runAgent(userMessage, history = []) {
  const messages = [...history, { role: 'user', content: userMessage }];

  while (true) {
    const response = await client.messages.create({
      model: MODEL,
      max_tokens: MAX_TOKENS,
      system: systemPrompt,
      tools: toolDefinitions,
      messages,
    });

    messages.push({ role: 'assistant', content: response.content });

    if (response.stop_reason === 'end_turn' || response.stop_reason === 'max_tokens') {
      // extract the text blocks as the final reply
      const reply = response.content
        .filter(b => b.type === 'text')
        .map(b => b.text)
        .join('\n');
      return { reply, history: trimHistory(messages, MAX_HISTORY_TURNS) };
    }

    if (response.stop_reason !== 'tool_use') {
      throw new Error(`Unexpected stop_reason: ${response.stop_reason}`);
    }

    // one or more tool calls → run them in parallel
    const toolUses = response.content.filter(b => b.type === 'tool_use');
    const toolResults = await Promise.all(toolUses.map(async (block) => {
      const tool = toolMap.get(block.name);
      const result = await tool.handler(block.input);
      return { type: 'tool_result', tool_use_id: block.id, content: result.content };
    }));

    messages.push({ role: 'user', content: toolResults });
  }
}
```

A few concepts buried in there are worth unpacking.
stop_reason
Every response tells us why it stopped generating:
- end_turn: the model thinks it's done. Return the text.
- tool_use: the model emitted one or more tool_use blocks; we have to run them and send the results back.
- max_tokens: the model ran out of room. We surface this to the user.
Tool Results Are a User-Role Message
This one confused me at first. The Anthropic API models the conversation as alternating user/assistant turns. Tool results are considered "user-side" input. The program, acting on behalf of the user, is replying to the model with what the tool did. Each tool_result block carries a tool_use_id that ties it back to the corresponding tool_use from the previous assistant turn.
```json
{
  "role": "user",
  "content": [
    {
      "type": "tool_result",
      "tool_use_id": "toolu_01...",
      "content": "[{name:..., type:...}, ...]"
    }
  ]
}
```

If you forget to include a tool_result for every tool_use_id, the next API call fails with an "orphaned tool_use" error. That constraint matters in the next section.
Parallel Tool Execution
If the model emits three tool_use blocks in one response, we run all three with Promise.all. The API expects the next user message to contain a tool_result for every tool_use_id in the previous assistant turn, in one batch. Running them sequentially would be slower for no benefit. We'll see this pay off big time in Prompt 3 below.
History Trimming
Each tool call adds two messages to the history (assistant with tool_use, user with tool_result). That's tokens. Left unchecked, history grows until you blow through the model's context window.
trimHistory keeps the most recent MAX_HISTORY_TURNS (40) messages, but it can only cut at safe boundaries:
```typescript
function trimHistory(messages, cap) {
  if (messages.length <= cap) return messages
  const minDropIndex = messages.length - cap
  for (let i = minDropIndex; i < messages.length; i++) {
    const m = messages[i]
    if (m.role === 'user' && typeof m.content === 'string') {
      return messages.slice(i)
    }
  }
  return messages
}
```

Remember the orphaned tool_use problem? If we cut in the middle of a tool-use/tool-result pair, the next request fails. So we only cut at a plain-text user message, because those are always safe split points.
The Terminal UI
I won't dwell on this since it's not the interesting part, but the REPL itself is in src/index.ts and the input rendering lives in src/input.ts and src/render.ts. In short:
- src/input.ts uses raw mode (stdin.setRawMode(true)) to handle keystrokes one at a time, with proper support for arrow keys, UTF-16 surrogate pairs (so emojis don't break), and string-width (so emojis take the right number of columns in the highlighted input box).
- src/render.ts is basically a one-liner around cli-markdown that turns the model's markdown output into ANSI sequences for the terminal (**bold** becomes actual bold, code blocks get a highlighted background, etc).
- src/index.ts is a classic REPL loop: read a line, evaluate it (run the agent), print the reply, loop.
That's the polish. None of it is required for a tool-using agent to work.
Three Real Runs
Enough theory. Let's look at what the thing actually does. I fed three prompts into the REPL, progressively more interesting: zero tools, one tool, many tools. During these runs I enabled debug logging (>>> Variable: ...) so every API response and tool batch is visible.
Prompt 1: "hi" (no tool use)
The simplest possible interaction. One API call, no tools invoked, straight back to the prompt.
Here's what happens:
- REPL reads hi from the input box. runAgent('hi', []) starts the spinner.
- Single POST to the Messages API with messages: [{ role: 'user', content: 'hi' }] and the usual system + tools fields.
- Response comes back: stop_reason: "end_turn", a single text block ("Hi there! How can I help you today?..."), and usage around 1,115 input tokens and 32 output tokens.
- Loop sees end_turn, extracts the text, returns. No tool handlers run. renderMarkdown prints the greeting and the REPL loops back.
Takeaway: even a trivial "hi" costs around 1,100 input tokens because the system prompt and the four tool definitions are sent on every request. That overhead is the price of giving the model the option to use tools (but, as I mentioned before, this has been improved by Anthropic).
Prompt 2: "list files in this directory" (one tool)
Two API calls wrapping one list_dir invocation. The canonical one-tool flow.
Turn 1: the model asks to call list_dir.
Response comes back with stop_reason: "tool_use" and a single tool_use block:
```json
{
  "type": "tool_use",
  "id": "toolu_01Ji3bqmtnez61GGyQPdSdfT",
  "name": "list_dir",
  "input": { "path": "." }
}
```

Interestingly, the model emitted only a tool_use block here, with no preceding text narration. Whether it narrates is non-deterministic; here it went straight to the call.
Tool runs: the loop finds list_dir in toolMap, calls its handler with { path: '.' }, and prints [Tool call] list_dir . (12 entries) to the terminal. The tool_result pushed back to the model is a JSON array of those 12 entries.
Turn 2: the model writes the answer.
Second API call, same messages array plus the assistant's tool_use and the user-side tool_result. The model now has the directory listing in context and responds with stop_reason: "end_turn" and a markdown answer with two sections: Directories (.claude, .git, node_modules) and Files (package.json, tsconfig.json, etc), each with a one-line description.
renderMarkdown turns the markdown into ANSI (bold headings, bullets) and prints it.
Takeaway: a tool round-trip costs one extra API call and adds both the tool_use block and the (potentially large) tool_result to history for every subsequent turn. The 12-entry JSON payload is cheap. read_file on a big file would not be.
Prompt 3: "Summarise the packages and save to a file" (parallel tools + write)
This is the fun one. The prompt was:
Write a summary of all the packages used in this project, with their current download stats, and save it to a packages-summary.md file
This exercises almost everything: multi-turn planning, three different tools, parallel tool calls via Promise.all, and write_file with a 1 KB+ markdown payload.
Turn 1: list_dir to orient.
The model emits a text block ("Let me start by exploring the project structure to identify the package manager and dependencies.") followed by tool_use: list_dir { path: '.' }. Stop reason: tool_use. Handler returns the directory listing.
Turn 2: read_file on package.json.
With the listing in context, the model asks for the manifest. The handler reads the file and returns the raw UTF-8. The model now knows the 5 dependencies and 3 devDependencies by name and version.
Turn 3: eight parallel bash calls. Here's the interesting bit.
The response content is a text block ("Now let me fetch the npm download stats for all 8 packages simultaneously.") followed by eight tool_use blocks in a single assistant turn. One bash call per package:
curl -s https://api.npmjs.org/downloads/point/last-week/typescript
curl -s https://api.npmjs.org/downloads/point/last-week/@anthropic-ai/sdk
curl -s https://api.npmjs.org/downloads/point/last-week/cli-markdown
curl -s https://api.npmjs.org/downloads/point/last-week/dotenv
curl -s https://api.npmjs.org/downloads/point/last-week/ora
curl -s https://api.npmjs.org/downloads/point/last-week/string-width
curl -s https://api.npmjs.org/downloads/point/last-week/@types/node
curl -s https://api.npmjs.org/downloads/point/last-week/tsx
The loop handles this exactly as the parallel-execution code path describes. All eight execAsync calls run concurrently via Promise.all.
The terminal prints eight [Tool call] bash ... lines in interleaved order (whichever curl finishes first logs first). The next user message pushed onto messages is a single content array containing eight tool_result blocks, each keyed by its tool_use_id. If even one were missing, the next API call would reject with an orphaned tool_use error, which is why the loop only returns after Promise.all settles.
Turn 4: write_file with the assembled markdown.
Now the model has the package list, each package's weekly download count, and enough training knowledge to describe what each one does. It emits one more tool_use with the full markdown content for packages-summary.md. The handler runs safeResolve, checks the byte count against MAX_WRITE_BYTES, and writes the file.
Turn 5: end_turn with the rendered summary.
Final response: stop_reason: "end_turn", single text block. The model's message confirms the save and then re-renders the key tables inline (Dependencies, Dev Dependencies, Key highlights). cli-markdown turns those into ANSI tables with pink headers, highlighted code spans, and an emoji bullet list.
Takeaways from this run:
- The model decomposed the task into four distinct phases without being told to. The system prompt nudge "prefer tool use over guessing" is doing real work here. Nothing else told it to run curl instead of hallucinating download counts.
- Parallel tool_use is free throughput. Eight serial bash calls at around 300 ms each would have been 2.4 s of added latency; running them in parallel keeps the whole turn under a second of tool time.
- Every intermediate artefact (directory listing, package.json contents, eight JSON blobs, the full markdown body) is now in history. By Turn 5 this conversation is tens of kilobytes of context. trimHistory only kicks in past 40 turns, but the per-request cost has already climbed sharply. A reminder that context is not free.
Here’s a video showing what all of this looks like in real life:
How to Extend This
If you want to build on top of this (and I hope you do), adding a new tool is almost embarrassingly simple:
- Create src/tools/my_tool.ts exporting { definition, handler }.
- Register it in src/tools/index.ts.
- That's it. The model sees it on the next request.
A few things I learned about good tool design along the way:
- Name tools like functions (run_tests, not tests).
- Write a description that states what the tool does, what it returns, and when to use it. The model reads this exactly like a human reads a docstring.
- Validate inputs. The schema is enforced by the API, but bad values inside the schema (e.g. invalid paths) are your handler's responsibility.
- Return structured output (JSON) when the result has multiple fields the model might reason about.
- Fail gracefully. Prefer returning { ok: false, error: ... } over throwing, unless the input was outright malformed.
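To see those guidelines in one place, here's a sketch of what a hypothetical new tool might look like. The tool itself (word_count) and the import paths are purely illustrative; adjust them to your copy of the repo:

```typescript
// src/tools/word_count.ts — hypothetical example, not part of the repo
import fs from 'node:fs/promises'
import { safeResolve } from '../safe-resolve.js' // illustrative path
import type { Tool } from '../types.js'

export const wordCount: Tool = {
  definition: {
    name: 'word_count', // named like a function
    description:
      'Count the lines, words and bytes of a text file inside the project. ' +
      'Returns a JSON object { lines, words, bytes }. Use it when the user asks how big a file is.',
    input_schema: {
      type: 'object',
      properties: {
        path: { type: 'string', description: 'File path relative to the project root' }
      },
      required: ['path']
    }
  },
  handler: async (input) => {
    try {
      const file = safeResolve(String(input.path)) // validate the model-supplied path
      const text = await fs.readFile(file, 'utf8')
      const stats = {
        lines: text.split('\n').length,
        words: text.split(/\s+/).filter(Boolean).length,
        bytes: Buffer.byteLength(text)
      }
      // structured output for the model, one-liner for the human
      return { content: JSON.stringify({ ok: true, ...stats }), display: `word_count ${input.path}` }
    } catch (err) {
      // fail gracefully: return structured data rather than throwing
      return { content: JSON.stringify({ ok: false, error: String(err) }), display: 'word_count failed' }
    }
  }
}
```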
If you want to swap models, edit MODEL in src/config.ts. Smaller cheaper models like Haiku are great for quick iteration; Opus is the smartest (and slowest and priciest).
A natural next step I haven't taken yet is streaming: currently we wait for the full response before rendering. The SDK supports streaming; you'd replace messages.create with the streaming variant and render text as it arrives.
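For what it's worth, I'd expect the SDK's streaming helper to slot in roughly like this (an untested sketch using the SDK's MessageStream events):

```typescript
const stream = client.messages.stream({
  model: MODEL,
  max_tokens: MAX_TOKENS,
  system: systemPrompt,
  tools: toolDefinitions,
  messages,
})

// Print text deltas as they arrive instead of waiting for the full response.
stream.on('text', (delta) => process.stdout.write(delta))

// The fully assembled message (including any tool_use blocks) is still
// available at the end, so the rest of the agent loop stays the same.
const response = await stream.finalMessage()
```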
A Safety Checklist
Before pointing a tool-using agent at real code, make sure:
- Your API key lives in .env.local and .env*.local is in .gitignore.
- Filesystem tools resolve paths through safeResolve (or an equivalent). Never pass model-supplied paths straight to fs.
- Write tools have a size cap.
- Shell tools have a timeout and an output cap.
- You understand that bash has no command allowlist in this project. It can run git, it can also run rm -rf on anything inside the sandbox. Only point it at projects where that's OK, or add an allowlist or confirmation prompt before enabling it on important repos (see the sketch below).
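If you do want a guardrail on bash, a simple first step is an allowlist check in the handler before calling execAsync. A minimal sketch (the allowlist contents are purely illustrative):

```typescript
// Hypothetical guardrail, not in the repo: only allow commands whose first word is allowlisted.
// Note this is naive: ';', '&&' and pipes can still chain other commands, so treat it as a
// speed bump, not a security boundary.
const ALLOWED_COMMANDS = new Set(['ls', 'cat', 'grep', 'git', 'curl', 'npm'])

function assertAllowed(cmd: string): void {
  const first = cmd.trim().split(/\s+/)[0]
  if (!ALLOWED_COMMANDS.has(first)) {
    throw new Error(`Command "${first}" is not on the allowlist`)
  }
}
```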
Closing Thoughts
Building this project changed how I think about AI agents. What used to feel like magic turned out to be a surprisingly short feedback loop: the model emits a structured request, my code runs a function, the model sees the result, and the dance continues until it's done. That's the special sauce: a well-designed API, some careful plumbing, and a lot of thought about what happens when things go wrong.
If you've been using Claude Code (or any other agent) and wondering how it works, I'd genuinely encourage you to build your own tiny version. The whole codebase is on GitHub if you want to fork it and extend it with your own tools.
Once you see the loop from the inside, you start imagining all sorts of agents you could build. Internal company tools, domain-specific assistants, automation for that one annoying workflow you've been putting off for months. The pattern generalises.
And honestly, that's the most rewarding part of this kind of project: the moment something stops feeling like magic and starts feeling like something you could build yourself.
Further reading
- Anthropic Tool Use docs
- Messages API reference
- The source repository. Every file is small on purpose; reading them straight through is the fastest way to cement the concepts above.
Thanks for reading, and see you next time! 👋
Key Takeaways
AI agents are less mysterious once you build one yourself. At the core, they are a loop between the model, your program, and the tools your program exposes.
LLMs do not actually run code or access files by themselves. They emit structured tool requests, and your harness decides how to execute them.
The agent loop is the heart of the whole system. The model asks for a tool, the program runs it, the result goes back to the model, and the process repeats until the task is done.
A useful LLM harness does not require a huge amount of code. With a small REPL, a few tools, and the Anthropic SDK, you can build a working mini Claude Code style agent.
Tool design matters a lot. Clear names, good descriptions, useful schemas, structured outputs, and graceful errors make the model much more effective.
Safety guardrails are essential when tools touch the filesystem or shell. Path sandboxing, file size limits, command timeouts, and output caps help reduce the blast radius when the model does something unexpected.
Parallel tool execution can make agents much more efficient. When the model requests multiple independent tool calls, running them concurrently can reduce latency without complicating the user experience.
Building your own harness opens the door to custom agents. Once you understand the pattern, you can create agents for internal APIs, niche workflows, proprietary data, project automation, or anything a general tool would never support.


