Agent System
An AI agent is an autonomous system that uses a large language model (LLM) as its core reasoning engine to perceive, plan, and act in pursuit of a goal. Unlike traditional software, where control flow is explicitly programmed, an agent delegates decision-making to the model at runtime. The operating system solved analogous problems for CPU processes decades ago, from scheduling and memory management to inter-process communication and resource isolation, and many of the same architectural patterns are now resurfacing in agent design. This post was co-written with Claude Code (Opus 4.6), itself an AI agent built on the architecture described below.
- https://huggingface.co/blog/shivance/illustrated-llm-os
- https://x.com/karpathy/status/1723140519554105733
I
1.1. Inference
From an agent’s perspective, the LLM is a stateless function that receives a sequence of tokens (the prompt) and returns a probability distribution over the next token. Autoregressive decoding repeats this iteratively, sampling one token at a time and appending it to the sequence until a stop condition is met. The quality and diversity of generated output are governed by sampling strategies. Temperature scales the logit distribution (lower values sharpen it toward greedy decoding, higher values flatten it toward uniform sampling), top-k restricts sampling to the $k$ most probable tokens, and top-p (nucleus sampling) dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold $p$. These parameters give the caller fine-grained control over the trade-off between coherence and creativity.
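As a rough illustration, here is a minimal sampling sketch (assuming NumPy and a precomputed logits vector over the vocabulary) that applies temperature scaling, then top-k, then top-p filtering before drawing the next token:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Illustrative sampler: temperature, then top-k, then nucleus (top-p) filtering."""
    logits = logits / max(temperature, 1e-6)          # lower T sharpens, higher T flattens
    probs = np.exp(logits - logits.max())             # softmax, numerically stabilised
    probs /= probs.sum()

    order = np.argsort(probs)[::-1][:top_k]           # keep the k most probable tokens
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1   # smallest prefix with cumulative prob >= p
    order = order[:cutoff]

    filtered = probs[order] / probs[order].sum()      # renormalise over the surviving tokens
    return int(np.random.choice(order, p=filtered))
```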
At inference time, a KV cache stores the key-value pairs from each attention layer so that previously computed tokens are not reprocessed at every step. For a model with $L$ layers, $h$ KV heads of dimension $d_h$, and sequence length $n$, the cache consumes $2 \times L \times h \times d_h \times n \times b$ bytes (where $b$ is the bytes per element, e.g. 2 for FP16). A 70B-parameter model with grouped-query attention (GQA) typically has 8 KV heads ($h{=}8$, $d_h{=}128$), so at 4K context the cache requires roughly 1.3 GB per request in FP16. Without GQA (i.e. $h{=}64$) the same model would need ~10 GB. This splits inference into two distinct phases, prefill (processing the entire input prompt in parallel to populate the cache, compute-bound) and decode (generating tokens one at a time, memory-bandwidth-bound as each step reads the full cache but performs little computation).
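The arithmetic behind those figures is easy to check. In the sketch below, the 80-layer count is an assumption for a typical 70B-class architecture rather than a specific model’s published config:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys and values, stored at every layer for every cached token
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 70B-class config: 80 layers, head_dim 128, FP16 (2 bytes per element)
print(kv_cache_bytes(80, 8, 128, 4_096) / 1e9)     # GQA, 8 KV heads, 4K context -> ~1.3 GB
print(kv_cache_bytes(80, 64, 128, 4_096) / 1e9)    # no GQA, 64 KV heads         -> ~10.7 GB
print(kv_cache_bytes(80, 8, 128, 200_000) / 1e9)   # GQA at 200K context         -> ~65 GB
```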
Prefill can process thousands of tokens in tens of milliseconds, while decode typically runs at 30-100 tokens/s per request depending on model size and hardware. Continuous batching (as in vLLM) allows new requests to enter a running batch as soon as slots free up, maximising GPU utilisation. Speculative decoding further accelerates generation by using a smaller draft model to propose $\gamma$ tokens at once (typically $\gamma = 4$-$8$), which the larger model verifies in a single forward pass, accepting the longest prefix of proposed tokens consistent with its own distribution and resampling from the first rejection.
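A simplified greedy-verification sketch of that loop (the full method uses acceptance-rejection sampling against the target distribution; draft_model and target_model here are hypothetical objects exposing greedy next-token interfaces):

```python
def speculative_step(prompt_tokens, draft_model, target_model, gamma=5):
    """Draft proposes gamma tokens; the target verifies them in one forward pass (greedy variant)."""
    context, proposed = list(prompt_tokens), []
    for _ in range(gamma):                          # cheap draft model proposes a short continuation
        tok = draft_model.greedy_next(context)
        proposed.append(tok)
        context.append(tok)

    # Single pass of the large model: its greedy choice at every proposed position at once
    verified = target_model.greedy_next_batch(prompt_tokens, proposed)

    accepted = []
    for d, t in zip(proposed, verified):            # keep the longest agreeing prefix
        if d != t:
            accepted.append(t)                      # emit the target's token at the first mismatch
            break
        accepted.append(d)
    return accepted                                 # between 1 and gamma tokens per target pass
```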
The context window defines the maximum number of tokens an LLM can process in a single forward pass, encompassing both the input prompt and the generated output. This fixed capacity (e.g. 4K, 128K, 1M tokens depending on the model) is the fundamental memory constraint of any agent built on an LLM. Scaling the earlier 70B GQA example to 200K tokens, the KV cache alone reaches ~65 GB in FP16, making context management a first-class engineering problem. When the conversation history or accumulated tool outputs exceed this limit, the agent must decide what to retain and what to discard, much like an OS evicting pages from physical memory under pressure. Strategies include truncation, summarisation, and retrieval-augmented generation (RAG), where relevant context is fetched from an external store on demand rather than held in the window permanently.
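A naive eviction sketch, assuming a hypothetical count_tokens helper and a message list ordered oldest-first (real agents usually summarise evicted turns rather than dropping them outright):

```python
def fit_to_window(system_prompt, messages, max_tokens, count_tokens):
    """Truncation as page eviction: drop the oldest turns until the history fits the window."""
    budget = max_tokens - count_tokens(system_prompt)   # the system prompt is always resident
    kept = list(messages)
    while kept and sum(count_tokens(m["content"]) for m in kept) > budget:
        kept.pop(0)   # evict the oldest message; a summariser or RAG store could absorb it instead
    return kept
```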
1.2. Tool Use
A bare LLM can only generate text. To interact with the external world, it needs tool use (also called function calling), the ability to emit structured requests that an external runtime executes on its behalf, analogous to a user-space process issuing a system call to the kernel. The model cannot directly access files, databases, or APIs. Instead, it produces a tool call (typically a JSON object specifying the function name and arguments), the host application executes it, and the result is fed back into the model’s context for the next reasoning step. This controlled interface enforces a privilege boundary between the model’s reasoning and the system’s capabilities.
The ReAct (Reason + Act) pattern formalises this loop. The model alternates between reasoning steps (chain-of-thought) and action steps (tool calls), using the results of each action to inform the next reasoning step. Structured output (e.g. JSON mode, constrained decoding) ensures that tool calls conform to a predefined schema, making them parseable and type-safe. System prompts define the agent’s identity, available tools, and behavioural constraints, serving as the “configuration file” that shapes the model’s execution context before any user input is processed.
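A skeletal ReAct loop under assumed interfaces: a hypothetical llm() callable that returns either a final answer or a parsed, schema-conforming tool call, and a small dispatch table standing in for the host runtime:

```python
import os

# Hypothetical llm() contract: given the history, return {"answer": ...} when done,
# otherwise {"tool": name, "args": {...}} parsed from the model's structured output.
TOOLS = {
    "read_file": lambda path: open(path).read(),
    "list_dir": lambda path: "\n".join(os.listdir(path)),
}

def react_loop(llm, task, max_steps=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = llm(history)                             # reason: the model decides the next action
        if "answer" in step:
            return step["answer"]                       # terminal: no further actions needed
        result = TOOLS[step["tool"]](**step["args"])    # act: the runtime executes the call
        history.append({"role": "assistant", "content": str(step)})
        history.append({"role": "user", "content": f"Observation: {result}"})  # observe
    raise RuntimeError("step budget exhausted")
```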
Modern LLMs are specifically trained for agentic capabilities. The typical pipeline begins with supervised fine-tuning (SFT) on curated tool-call traces (prompt, tool call, result, final answer), teaching the model when to invoke tools and how to format arguments. This is followed by RLHF or RLAIF post-training, where a reward model $R(x, y)$ scores entire tool-use trajectories on criteria such as correct tool selection, valid argument generation, and accurate result interpretation. The policy $\pi_\theta$ is optimised via PPO or similar algorithms to maximise \(\mathbb{E}_{y \sim \pi_\theta}[R(x, y)] - \beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\), where the KL penalty against a reference policy $\pi_{\mathrm{ref}}$ prevents the model from drifting too far from its SFT baseline.
Toolformer (Schick et al., 2023) demonstrated that tool use can be learned through self-supervised fine-tuning, by inserting API calls into text and retaining only those that reduce perplexity. Agentic capabilities are evaluated through benchmarks such as SWE-bench (resolving real GitHub issues), BFCL (function calling accuracy via AST matching), AgentBench (8 diverse environments from bash to web browsing), and TAU-bench (end-to-end customer service with policy compliance).
II
2.1. Model Context Protocol
The Model Context Protocol (MCP), introduced by Anthropic in 2024, is an open standard that defines how LLM applications connect to external tools and data sources. Before MCP, every integration between an agent and a tool required bespoke glue code, creating an $m \times n$ problem ($m$ agent frameworks times $n$ tool providers). MCP reduces this to $m + n$ by standardising the interface, much like POSIX standardised the system call interface across Unix variants so that programs could run on any compliant OS without modification.
MCP follows a client-server architecture. An MCP server exposes capabilities negotiated during an explicit initialise handshake, namely tools (executable functions), resources (readable data), and prompts (reusable templates). An MCP client (embedded in the host application, e.g. Claude Code or an IDE extension) discovers available servers, presents their capabilities to the model, and routes tool calls to the appropriate server. Under the hood, all messages use JSON-RPC 2.0 as the wire protocol. Every request carries a jsonrpc, method, id, and optional params field, while responses return either a result or error keyed by the same id. Unlike REST, which is stateless and maps actions to HTTP verbs across many endpoints, JSON-RPC is stateful, bidirectional, and routes everything through a single endpoint.
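For concreteness, a tools/call exchange rendered as Python dicts standing in for the wire JSON (the get_weather tool and its result text are hypothetical; the envelope fields follow JSON-RPC 2.0 as described above):

```python
request = {
    "jsonrpc": "2.0",
    "id": 7,                                # the response is matched to this request by id
    "method": "tools/call",                 # MCP method for invoking a tool on the server
    "params": {"name": "get_weather", "arguments": {"city": "Paris"}},
}

response = {
    "jsonrpc": "2.0",
    "id": 7,                                # same id as the request
    "result": {"content": [{"type": "text", "text": "14°C, light rain"}]},
}
```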
MCP defines two transports. stdio is the simplest, where the client launches the server as a subprocess and exchanges newline-delimited JSON-RPC messages over stdin/stdout, analogous to Unix pipes. Streamable HTTP (which replaced the earlier HTTP+SSE transport) targets remote servers. The client POSTs JSON-RPC messages to a single MCP endpoint, and the server responds with either a JSON body or an SSE (server-sent events) stream for incremental delivery. The server can also push requests and notifications back to the client via SSE, enabling bidirectional communication over HTTP. Session management is handled through an Mcp-Session-Id header, and SSE event IDs support resumability if a connection drops. This separation of concerns allows tool providers to build once and integrate with any MCP-compatible agent.
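A minimal stdio client sketch: the server command (my_mcp_server.py) is hypothetical and the initialize field values are illustrative, but the mechanics (spawn a subprocess, exchange newline-delimited JSON-RPC over its pipes) are what the transport specifies:

```python
import json
import subprocess
from itertools import count

# Hypothetical local server; the client owns the subprocess lifecycle.
proc = subprocess.Popen(["python", "my_mcp_server.py"],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
ids = count(1)

def rpc(method, params):
    msg = {"jsonrpc": "2.0", "id": next(ids), "method": method, "params": params}
    proc.stdin.write(json.dumps(msg) + "\n")     # one JSON-RPC message per line on stdin
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())    # matching response read from stdout

# Capability negotiation comes first; field values here are illustrative.
print(rpc("initialize", {"protocolVersion": "2025-06-18",
                         "capabilities": {},
                         "clientInfo": {"name": "demo-client", "version": "0.1"}}))
```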
For remote servers exposed over Streamable HTTP, MCP specifies OAuth 2.1 for authentication. During the initialise handshake the client discovers the server’s authorization metadata (via RFC 8414 /.well-known/oauth-authorization-server), performs the OAuth flow (including PKCE for public clients), and attaches the resulting bearer token to every subsequent JSON-RPC request. The transport layer authenticates the caller once, and all tool invocations within that session run under the established identity. stdio transport, by contrast, inherits the permissions of the parent process and requires no separate auth, just as a child process forked by a shell inherits its parent’s UID.
2.2. API
The Anthropic API (Messages API) provides the programmatic interface for building applications on top of Claude. A request is an HTTP POST to api.anthropic.com/v1/messages with a JSON body specifying a model identifier, a system prompt, and a list of messages (alternating user and assistant turns). The response contains content blocks, either text blocks for natural language output or tool_use blocks for structured tool call requests (each carrying a tool name, a unique id, and a JSON input conforming to the tool’s JSON Schema). When the model emits a tool_use block, the client executes the function, appends the result as a tool_result block (keyed by the same id), and sends the updated conversation back. This multi-turn loop is the foundation of all agent behaviour.
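A condensed version of that loop with the Anthropic Python SDK; the get_weather tool, its stub implementation, and the model id are illustrative, and error handling is omitted:

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "input_schema": {"type": "object",
                     "properties": {"city": {"type": "string"}},
                     "required": ["city"]},
}]

def get_weather(city):
    return f"Weather in {city}: 14°C, light rain"   # host-side stub the model cannot call directly

messages = [{"role": "user", "content": "Should I take an umbrella in Paris today?"}]
while True:
    resp = client.messages.create(model="claude-sonnet-4-5",   # model id illustrative
                                  max_tokens=1024, tools=tools, messages=messages)
    if resp.stop_reason != "tool_use":
        print(resp.content[0].text)                 # final natural-language answer
        break
    messages.append({"role": "assistant", "content": resp.content})
    results = [{"type": "tool_result", "tool_use_id": block.id,
                "content": get_weather(**block.input)}
               for block in resp.content if block.type == "tool_use"]
    messages.append({"role": "user", "content": results})      # feed results back for the next turn
```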
Streaming delivers tokens incrementally via server-sent events (SSE), reducing time-to-first-token and enabling real-time UIs. Each SSE event carries a typed delta (e.g. content_block_delta for partial text, input_json_delta for partial tool input), allowing clients to render output progressively. The API also supports prompt caching (cache_control on message blocks) to avoid reprocessing static prefixes across turns, and extended thinking, which exposes the model’s chain-of-thought as a separate thinking content block before the final response. Token counting endpoints allow clients to estimate costs and manage context budgets before sending a request.
The Messages Batches API enables asynchronous, high-throughput workloads. A single batch can contain up to 100,000 requests, each a complete Messages API call with its own model, system prompt, and conversation. The batch is submitted as one HTTP POST, processed within 24 hours, and results are retrieved per-request once complete. Pricing is 50% of the standard synchronous rate. This is the agent equivalent of batch processing in an OS, where jobs that do not require interactive latency (e.g. bulk evaluation, dataset annotation, offline planning) are queued and executed when capacity is available, trading responsiveness for throughput and cost efficiency.
III
3.1. Architecture
An agent’s decision process maps to a partially observable Markov decision process (POMDP), formalised as the tuple $(S, A, T, R, \Omega, O, \gamma)$. The state space $S$ is the full environment (codebase, file system, external services), observable only through observations $o \in \Omega$ (tool results, user messages, error logs). At each step the agent selects an action $a \in A$ (generate text, call a tool, request clarification) based on its belief state $b(s)$, a probability distribution over possible states maintained implicitly in the context window. The transition function $T(s' \mid s, a)$ captures how the environment changes after an action, the observation function $O(o \mid s', a)$ determines what the agent sees, and the reward $R(s, a)$ scores outcomes (task completion, correctness, user satisfaction). The agent implicitly optimises the discounted return $\sum_{t=0}^{\infty} \gamma^t r_t$, though in practice the discount factor $\gamma$ is never set explicitly; it emerges from the context window’s finite capacity, which naturally down-weights distant history.
An agent is built from four components, namely a model, knowledge, memory, and a control strategy. The model (LLM or VLM) serves as the reasoning core, analogous to a CPU. It takes in the current state and produces either a final response or an action (tool call). The model itself is stateless; all persistent state must be managed externally. Multimodal models (e.g. Claude with vision, GPT-4o) extend this to images and documents, enabling agents that can interpret screenshots, read PDFs, or navigate GUIs.
Knowledge is what the agent draws on to make decisions. Internal knowledge is baked into the model’s weights during pretraining, covering language, facts, and reasoning patterns, but is frozen at the training cutoff and cannot be updated without retraining. External knowledge is accessed at runtime through tools and data sources such as relational databases (e.g. PostgreSQL, via text-to-SQL), object stores (e.g. S3, for documents and files), web search APIs, and retrieval-augmented generation (RAG), where relevant chunks are fetched from a vector store and injected into the prompt. The split mirrors the distinction between ROM (fixed, fast) and disk (dynamic, slower). Internal knowledge is always available but potentially stale, while external knowledge is current but incurs latency and retrieval noise.
Memory governs what the agent retains across reasoning steps. Short-term memory is the context window itself, holding the conversation history, tool results, and scratchpad, but bounded by the model’s token limit and volatile across sessions. Long-term memory persists beyond a single session through external stores such as vector databases (e.g. Pinecone, Weaviate) for semantic retrieval via approximate nearest-neighbour (ANN) search (e.g. the HNSW algorithm, as implemented in libraries like FAISS), key-value stores for structured facts, or plain files. RAG pipelines implement a form of demand paging, where the agent fetches only what it needs rather than loading all knowledge into the context window, with the choice of chunking strategy, embedding model, and ANN index directly shaping recall quality.
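A toy version of that demand-paging pattern using brute-force cosine similarity (the embed callable is a stand-in for any embedding model; at scale the linear scan is replaced by an ANN index such as HNSW):

```python
import numpy as np

class LongTermMemory:
    """Minimal semantic store: write facts, read back only the most relevant ones."""
    def __init__(self, embed):
        self.embed, self.texts, self.vectors = embed, [], []

    def write(self, text):
        self.texts.append(text)
        self.vectors.append(np.asarray(self.embed(text), dtype=float))

    def read(self, query, k=3):
        q = np.asarray(self.embed(query), dtype=float)
        sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
                for v in self.vectors]
        top = np.argsort(sims)[::-1][:k]     # page in only the k most relevant chunks
        return [self.texts[i] for i in top]
```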
MemGPT (Packer et al., 2023) takes this analogy literally. The LLM itself controls paging decisions via explicit function calls (core_memory_append, archival_memory_search), treating main context as RAM and external storage as disk. Rather than relying on fixed heuristics for context management, the agent decides what to page in and out as part of its normal reasoning loop.
The control strategy determines how the agent sequences its reasoning and actions. The simplest approach is ReAct (Reason + Act), where the model alternates between chain-of-thought reasoning and tool calls in a single loop, deciding the next action based on the latest observation. Planning-based agents go further by decomposing tasks into subtask sequences upfront. Chain-of-thought (CoT) is the basic form; Tree of Thoughts (Yao et al., 2023) explores multiple reasoning branches via BFS or DFS; LLM+P offloads planning to a classical planner via PDDL. The trade-off mirrors CPU scheduling, where greedy execution is fast but suboptimal, while upfront planning incurs latency but yields better outcomes for complex workloads.
Self-reflection enables agents to learn from their own failures within a session. Reflexion (Shinn et al., 2023) stores verbal feedback from failed attempts in an episodic memory buffer, which the agent references in subsequent trials, a form of reinforcement learning through natural language rather than weight updates. This aligns with the Anthropic observation that workflows often outperform agents, and that most tasks are better served by fixed orchestration patterns (prompt chaining, routing, parallelisation, evaluator-optimizer loops) than by fully autonomous agents. True agent autonomy, where the model dynamically directs its own process, is best reserved for open-ended problems where the solution path cannot be predetermined.
3.2. Multi-Agent
A multi-agent system distributes work across multiple LLM instances, each with a specialised role (e.g. coder, reviewer, researcher). An orchestrator manages the lifecycle of these agents, deciding when to spawn, route tasks to, and terminate them. The orchestrator-worker pattern from Anthropic’s taxonomy captures this directly, where a central LLM dynamically breaks down a task, delegates subtasks to worker LLMs, and synthesises their results. An evaluator-optimizer loop can be layered on top, where one agent generates output and another critiques it, iterating until quality thresholds are met. This decomposition offers the same benefits as microservices over monoliths, since each agent can have a focused system prompt, a tailored set of tools, and a smaller context window, reducing confusion and improving reliability at the cost of coordination overhead.
Communication between agents follows patterns familiar from IPC. Message passing (agents exchange structured messages through the orchestrator) is simple and maintains isolation, analogous to pipes or sockets. Shared context (agents read from and write to a common memory store, e.g. a shared file or database) enables higher throughput but requires synchronisation to avoid conflicts, analogous to shared memory. The choice between a single powerful agent and many specialised agents echoes the monolithic vs microkernel debate, with simplicity on one side and modularity on the other.
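A minimal orchestrator-worker sketch under message passing; call_llm is a hypothetical stand-in for a Messages API call with a role-specific system prompt, stubbed here so the structure is runnable:

```python
def call_llm(system, prompt):
    """Hypothetical helper wrapping an LLM call; stubbed for illustration."""
    return f"[{system.split('.')[0]}] response to: {prompt[:40]}"

def orchestrate(task):
    # The orchestrator decomposes the task into independent subtasks.
    subtasks = call_llm("You are a planner. Emit one subtask per line.", task).splitlines()
    # Workers communicate only through the orchestrator (message passing); each sees just its subtask.
    drafts = [call_llm("You are a coder. Solve exactly this subtask.", s) for s in subtasks]
    # Evaluator-optimizer layer: one agent critiques, another merges and revises.
    review = call_llm("You are a reviewer. List errors or gaps.", "\n\n".join(drafts))
    return call_llm("You are an editor. Merge the drafts and address the review.",
                    "\n\n".join(drafts) + "\n\nReview:\n" + review)
```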
Where MCP standardises how an agent connects to tools (agent-to-tool), Google’s Agent-to-Agent protocol (A2A, 2025) standardises how agents connect to each other (agent-to-agent). Each agent publishes an Agent Card, a JSON manifest served at /.well-known/agent-card.json that advertises the agent’s capabilities, supported input/output modalities, and authentication requirements. A client agent discovers a remote agent’s card, sends it a task (a structured message with an initial prompt), and receives results back as artifacts (text, files, structured data). Tasks support streaming via SSE for long-running work and can include multiple conversational turns. A2A is transport-level and does not prescribe how agents reason internally, only how they exchange work, making it complementary to MCP rather than competing with it.
In production, cost optimisation is a first-class concern. Prompt caching reduces cached token costs to ~10% of normal input price, while model routing dispatches simple queries to cheaper models (e.g. Haiku) and complex reasoning to frontier models (e.g. Opus), with a 3-tier routing architecture typically cutting costs by 40-60%. Security is equally critical. Prompt injection (both direct and indirect, where malicious instructions are embedded in tool results or external documents) is ranked the #1 risk by OWASP’s 2025 LLM Top 10. Agents like Claude Code mitigate this through OS-level sandboxing (bubblewrap on Linux, seatbelt on macOS) that isolates filesystem and network access, ensuring that even a successful injection cannot exfiltrate data or modify files outside the working directory.
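An illustrative 3-tier router; the complexity signals, thresholds, and model ids below are assumptions for the sketch, not published guidance:

```python
def route(prompt: str, needs_tools: bool) -> str:
    """Pick the cheapest tier that can plausibly handle the request; escalate on complexity signals."""
    long_input = len(prompt) > 4_000                    # rough proxy for context-heavy requests
    multi_step = needs_tools or any(w in prompt.lower() for w in ("plan", "refactor", "debug"))
    if multi_step and long_input:
        return "claude-opus-4-1"       # frontier tier for complex, tool-heavy reasoning
    if multi_step or long_input:
        return "claude-sonnet-4-5"     # mid tier
    return "claude-haiku-4-5"          # cheap tier for short, simple queries
```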
AI coding tools illustrate these patterns in practice. Claude Code, Copilot, and Cursor use the LLM as a code generation engine, where prompts act as a higher-level language on the abstraction spectrum (assembly, C, Python, English). The generated code is validated by the same toolchain that validates human-written code, namely compilers, type checkers, linters, and test suites. In this framing, the programmer becomes an orchestrator, the LLM becomes a process that produces code, and the existing development infrastructure becomes the kernel that enforces correctness. The operating system solved these coordination problems for CPU processes in the 1970s; the agent ecosystem is re-solving them for LLM processes today.
(C:)
I gathered these words solely for my own purposes, without any intention of breaking the rigour of the subjects.
Well, I prefer eating corn in a spiral.