Learn about AI >

Context Management: Deciding What an AI Gets to Think About

Context management is the active engineering discipline of deciding exactly what information a large language model (LLM) is allowed to see at any given moment. It is the software architecture that controls what goes into that space, what gets compressed, what gets retrieved from external databases, and what gets deleted to make room for new information. It is the shift from simply writing good instructions to orchestrating a dynamic flow of data.

Context management is the active engineering discipline of deciding exactly what information a large language model (LLM) is allowed to see at any given moment. While the context window is the physical limit of how much data a model can hold, context management is the software architecture that controls what goes into that space, what gets compressed, what gets retrieved from external databases, and what gets deleted to make room for new information. It is the shift from simply writing good instructions to orchestrating a dynamic flow of data.

For a long time, the primary skill in artificial intelligence development was prompt engineering. You wrote a clever set of instructions, gave the model a few examples, and hoped for the best. That approach worked perfectly well for simple chatbots and single-turn tasks like summarizing an email or writing a poem. But as developers began building autonomous agents that could use tools, search the web, and run in continuous loops, the static prompt broke down.

When an agent is running a complex workflow, it generates a massive amount of intermediate data. It might pull five documents from a database, run a Python script that outputs a long error log, and then try to summarize a previous conversation. If you simply dump all of that information into the model's context window, you will quickly hit the token limit. Even if you don't hit the limit, the model will suffer from information overload, losing track of its original instructions in a sea of irrelevant text. This realization led to a fundamental shift in how production AI systems are built. The problem was no longer how to talk to the model; the problem was how to manage its attention budget (Anthropic, 2025).

Every token you send to a language model costs money and adds latency to the response. More importantly, every token depletes the model's finite attention capacity. The transformer architecture that powers modern AI requires the model to compare every single token against every other token to understand their relationships. This means the computational cost grows quadratically as the context expands. A study on latency found that processing 15,000 words of context resulted in a seven-fold increase in response time compared to shorter inputs (Redis, 2026).

Context management treats the context window not as a text box, but as a highly constrained computational resource. The goal is to find the absolute smallest set of high-signal tokens that will allow the model to successfully complete its current task.

The Anatomy of the Token Budget

To manage context effectively, you have to understand what is actually competing for space inside the window. In a production AI application, the context is rarely just the user's prompt. It is a complex assembly of different data types, all fighting for a share of the token budget.

The foundation is the system prompt, which contains the core instructions, behavioral guidelines, and output formatting rules. This is the model's baseline reality. In production systems, this is rarely a static paragraph. It is often a dynamically assembled set of rules that changes based on the user's access level, the specific sub-task being executed, and the current state of the application. Next are the tool definitions. If the agent is allowed to search the web, query a database, or execute code, the precise JSON schemas and descriptions of how those tools work must be loaded into the context so the model knows they exist and how to format its requests.

Then comes the retrieved knowledge. This is the data pulled in dynamically via retrieval-augmented generation (RAG), such as company documents, previous support tickets, or API documentation. The challenge here is that retrieved documents are often noisy, containing paragraphs of irrelevant boilerplate alongside the single sentence the model actually needs. Following that is the conversation history, which includes the back-and-forth dialogue between the user and the agent, as well as any intermediate reasoning steps the agent has taken. In a multi-turn interaction, this history grows linearly with every exchange, quickly becoming the largest consumer of the token budget. Finally, there is the user's current query and the empty space required for the model to generate its response—if you fill the context window to 99% capacity with input, the model will only be able to generate a few words before hitting the hard limit and cutting off mid-sentence.

If you add all of this up, it is very easy to exhaust a 100,000-token context window in just a few turns of a complex task. Furthermore, research into multi-agent systems has shown that autonomous workflows can consume up to fifteen times more tokens than a standard chat interaction (Big Data Boutique, 2026). You cannot simply let the context grow unchecked. You have to actively orchestrate it, treating every token as a precious resource that must justify its inclusion.

The Four Strategies of Orchestration

The discipline of context management generally relies on four core strategies to keep the token budget under control while ensuring the model has the information it needs. These strategies dictate how data moves between the model's active memory and external storage.

The Four Core Strategies of Context Management
Strategy Function Primary Use Case
Write Saving information outside the context window for later retrieval. Creating scratchpads, logging intermediate tool outputs, or forming long-term memories.
Select Pulling only the most relevant information into the active context. Retrieval-augmented generation (RAG), dynamic tool selection, and targeted memory recall.
Compress Reducing the token count of existing information while retaining meaning. Summarizing old conversation history or pruning irrelevant sentences from retrieved documents.
Isolate Splitting context across separate agents so each only sees what it needs. Multi-agent architectures where a researcher agent and a writer agent have different context windows.

The "Select" strategy is perhaps the most critical for immediate performance. Instead of loading every possible tool definition into the context window, a routing system can analyze the user's query and only inject the descriptions of the two or three tools that are actually relevant. This dynamic selection approach has been shown to improve model accuracy significantly by reducing the amount of distracting information the model has to parse.

Similarly, the "Isolate" strategy represents a major architectural shift. Instead of forcing one model to hold the instructions for researching, writing, and editing all at once, developers split the task. One agent gets the context needed to search the database. It passes its findings to a second agent, whose context window only contains the writing instructions and the raw data. This prevents the context from becoming bloated with instructions that are not relevant to the immediate sub-task.

The Mechanics of Compression

When information must remain in the context window but the token budget is running low, developers turn to compression techniques. The goal of compression is to reduce the footprint of the data without losing the semantic meaning required for the model to reason effectively.

The most basic approach is the sliding window, where the system simply deletes the oldest messages in the conversation history as new ones arrive. While easy to implement, this is a blunt instrument that often results in the model forgetting important constraints established early in the interaction. If a user says "always reply in French" in turn one, and that message slides out of the window by turn twenty, the model will revert to English.

A more sophisticated approach is dynamic summarization. Instead of deleting old messages, the system maintains a living summary of the conversation. When the context hits a certain threshold—say, 70% of the token budget—a background process takes the oldest chunk of the dialogue, summarizes it, and replaces the raw transcript with the much shorter summary. As the conversation continues, the new messages are appended to the existing summary, creating a rolling, compressed record of the interaction (Kargar, 2024). This preserves the narrative arc of the conversation without paying the token cost of every exact phrasing and pleasantry.

For agentic workflows that generate massive amounts of intermediate data, researchers have developed a technique called observation masking. When an agent runs a tool—like executing a block of code or querying a SQL database—the output can be thousands of lines long. The agent needs to see that output in the immediate moment to decide what to do next. But once it has made its decision and moved on to the next step, that raw output becomes dead weight in the context window. Observation masking hides the raw output from previous turns, replacing it with a placeholder (e.g., [Observation: 45 rows returned, execution successful]), while preserving the agent's reasoning and the actions it took. This drastically reduces token consumption while allowing the agent to remember its thought process and the general outcome of its actions (JetBrains, 2025).

At the cutting edge of compression is loss-aware pruning and semantic compression. Loss-aware pruning uses a smaller, cheaper model to evaluate a long piece of text and score each sentence based on how important it is to the overall meaning (often measured by how much it impacts the model's perplexity). The system then drops the lowest-scoring sentences before feeding the text to the primary LLM. Semantic compression takes this a step further by asking an LLM to rewrite the text into a highly dense, almost shorthand format that preserves the intent but strips out all conversational filler. This allows developers to compress retrieved documents by removing boilerplate and redundant phrasing, preserving the core facts while saving valuable tokens for the actual reasoning task.

Evaluating Context Quality

Because context management is an engineering discipline, it requires rigorous testing and observability. You cannot optimize what you cannot measure. In production systems, developers rely on specific metrics to evaluate whether their context management strategies are actually working, rather than just guessing based on a few manual tests.

The most common framework for this is RAGAS (Retrieval-Augmented Generation Assessment), which breaks context quality down into measurable dimensions. The first is context precision: did the system retrieve and inject only the information that was actually needed, or did it flood the window with irrelevant noise? High precision means the token budget was spent efficiently. The second is context recall: did the system manage to find and include all the necessary facts required to answer the user's query? High recall means the model wasn't starved of information.

Beyond retrieval metrics, engineers must monitor the generation quality itself. Faithfulness measures whether the model's output was derived entirely from the provided context, or if it hallucinated facts from its pre-training data because the context was insufficient. Answer relevancy measures whether the final output actually addressed the user's prompt, which often degrades when the context window becomes too bloated and the model loses track of the original instruction.

Finally, there are the hard infrastructure metrics: latency and cost. By tracking the exact number of tokens sent in every request, teams can identify which workflows are the most expensive and target them for aggressive compression or observation masking. A sudden spike in latency often indicates that a retrieval system is pulling too many documents, forcing the model to grind through a massive context window and slowing down the entire application.

Navigating the Failure Modes

When context management fails, the results are highly predictable. Because the model relies entirely on the context window for its reality, any pollution or overload in that space directly degrades the output. Engineers categorize these breakdowns into specific failure modes that must be actively monitored in production.

Context distraction occurs when the window is filled with too much marginally relevant information. If a retrieval system pulls ten documents when only two were needed, the model has to spend its attention budget sifting through the noise. This often leads to the model ignoring the actual instructions and simply summarizing the retrieved text, or worse, hallucinating connections between unrelated documents. It is the AI equivalent of trying to solve a math problem while someone reads a dictionary out loud in the same room.

Context confusion happens when the window contains too many competing instructions or tool definitions. If an agent is given access to thirty different APIs at once, it will frequently call the wrong one or mix up the required parameters. The model becomes overwhelmed by the options and loses its ability to reliably select the correct path forward. This is why the "Select" strategy is so vital—by dynamically filtering the available tools down to just the three or four that matter for the current step, developers can completely eliminate context confusion.

Context clash is a more subtle failure mode. It occurs when the context window contains contradictory information. For example, the system prompt might instruct the agent to "never offer refunds," but a retrieved support document might say "offer a refund if the product is defective." When faced with a context clash, the model will often freeze, hallucinate a compromise, or simply pick one instruction at random. Resolving this requires strict hierarchical rules within the context management system, ensuring that system-level instructions always override retrieved data.

The most dangerous failure mode is context poisoning. This occurs when incorrect, hallucinated, or maliciously injected information enters the context window and is allowed to persist. Because agents often read their own previous outputs to decide what to do next, a single hallucination early in a workflow can become foundational truth for the rest of the task. If the context is not actively managed and pruned, that poisoned data will compound, leading the agent further and further off track (Weaviate, 2025). In security contexts, this is known as indirect prompt injection, where an attacker hides malicious instructions inside a document they know the agent will retrieve, effectively hijacking the agent's context window from the inside.

The Infrastructure of Attention

Managing all of this requires serious infrastructure. Context management is not a prompt you write; it is a software architecture you build. It requires orchestration layers that can intercept queries, route them to vector databases, rank the results, compress the history, and assemble the final payload in milliseconds.

This is why the conversation in AI development has moved away from the models themselves and toward the systems that surround them. A mediocre model with excellent context management will almost always outperform a state-of-the-art model that is simply fed a massive, uncurated wall of text. The intelligence of an AI application is no longer just in the weights of the neural network; it is in the engineering that decides exactly what that network gets to think about.