Reflection (LLMs): Learning from Verbal Feedback Across Episodes

Reflection in large language models (LLMs) is the capacity of an AI agent to examine its own outputs, identify errors or weaknesses, and use that assessment to produce improved results in subsequent attempts. Rather than relying solely on the initial output generated in a single pass, a reflective agent evaluates its performance against a goal or feedback signal and adjusts its approach.

This process mirrors human learning, where reviewing past mistakes informs better future decision-making.

The traditional approach to using language models involves single-pass generation, where the model produces an answer from start to finish without any opportunity to revise its work. While modern models perform remarkably well under these constraints, complex tasks often require a more iterative approach. Reflection introduces a feedback loop into the agent's workflow, allowing it to pause, critique its own reasoning, and try again. This capability is a core component of agentic design patterns, enabling systems to break out of purely reactive generation and exhibit more methodical, deliberate problem-solving behavior.

The concept of reflection addresses a fundamental limitation in early generative AI systems: the inability to recognize and correct mistakes. When a human writes an essay or solves a complex math problem, they rarely produce a perfect result on the first try. Instead, they write a draft, review it, identify logical gaps or awkward phrasing, and revise. They might even consult external resources to verify facts before finalizing their work. Reflection in LLMs attempts to replicate this cognitive process programmatically. By structuring the interaction so that the model is explicitly prompted to evaluate its own output against a set of criteria, developers can significantly enhance the reliability and accuracy of the final result. This shift from single-pass generation to iterative, reflective loops marks a critical evolution in how we interact with and deploy artificial intelligence.

Furthermore, reflection is not merely a prompting trick; it represents a paradigm shift toward agentic AI. In an agentic workflow, the LLM is not just a passive responder but an active participant in problem-solving. It can break down a complex task, execute the first step, reflect on the outcome, and decide whether to proceed or pivot. This capacity for self-correction is essential for deploying AI in high-stakes environments, such as software engineering, legal analysis, or medical diagnostics, where the cost of an uncorrected hallucination or logical error is unacceptably high. By embedding reflection into the core architecture of these systems, we move closer to AI that can operate autonomously and reliably over extended periods.

The Reflexion Framework

While reflection is a broad concept, it was formalized as a specific architecture in the Reflexion framework (Shinn et al., 2023). Reflexion reinforces language agents not by updating their underlying neural network weights, which would require expensive and time-consuming fine-tuning, but through linguistic feedback. The framework consists of three distinct components working in concert.

The actor generates text and actions based on the current state observations. It takes an action in an environment, receives an observation, and produces a trajectory of steps. The evaluator then scores the outputs produced by the actor. It takes the generated trajectory as input and outputs a reward signal, which can be a scalar value or free-form language, depending on the task. Finally, the self-reflection model generates verbal reinforcement cues to assist the actor in self-improvement. It takes the reward signal, the current trajectory, and its persistent memory to generate specific, actionable feedback.

The critical innovation of Reflexion is its use of an episodic memory buffer. The verbal feedback generated by the self-reflection model is stored in this long-term memory. When the agent attempts the task again in the next episode, it retrieves these verbal lessons and uses them as additional context. This allows the agent to learn from prior failings and optimize its behavior over multiple trials, effectively turning environmental feedback into a semantic gradient signal.
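
The interaction between the three components and the memory buffer can be sketched as a short loop. This is a minimal illustration, not the reference implementation from the paper: the `ReflexionAgent` class, its field names, and the three toy functions are all hypothetical stand-ins for what would be LLM or tool calls in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ReflexionAgent:
    # All three callables are hypothetical stand-ins for LLM or tool calls.
    actor: Callable[[str, List[str]], str]            # (task, reflections) -> trajectory
    evaluator: Callable[[str], Tuple[bool, str]]      # trajectory -> (success, feedback)
    reflector: Callable[[str, str, List[str]], str]   # (trajectory, feedback, memory) -> lesson
    memory: List[str] = field(default_factory=list)   # verbal lessons kept across trials

    def run(self, task: str, max_trials: int = 3) -> Tuple[str, int]:
        """Retry the task, feeding stored reflections back to the actor each trial."""
        trajectory = ""
        for trial in range(1, max_trials + 1):
            trajectory = self.actor(task, self.memory)
            success, feedback = self.evaluator(trajectory)
            if success:
                return trajectory, trial
            # Turn the failure into a verbal lesson and keep it for the next trial.
            self.memory.append(self.reflector(trajectory, feedback, self.memory))
        return trajectory, max_trials


def toy_actor(task: str, reflections: List[str]) -> str:
    # Pretend the actor only handles empty input once a reflection mentions it.
    if any("empty" in r for r in reflections):
        return "handles empty input"
    return "ignores empty input"


def toy_evaluator(trajectory: str) -> Tuple[bool, str]:
    ok = trajectory == "handles empty input"
    return ok, ("" if ok else "IndexError on empty list")


def toy_reflector(trajectory: str, feedback: str, memory: List[str]) -> str:
    return f"Previous attempt failed with '{feedback}'; check for empty input first."


agent = ReflexionAgent(toy_actor, toy_evaluator, toy_reflector)
result, trials_used = agent.run("sum a list")
```

With these toy functions the first trial fails, one reflection is stored, and the second trial succeeds. No weights change between trials; only the text passed to the actor does.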

The actor is typically implemented using a prompting strategy that encourages explicit reasoning, such as Chain of Thought (CoT) or ReAct (Reasoning and Acting). By forcing the actor to articulate its thought process before taking an action, the system creates a richer trace of the decision-making process. This trace is invaluable for the subsequent evaluation and reflection stages, as it provides a clear window into why the actor made a particular choice, rather than just what choice it made.

The evaluator plays a critical role in providing the necessary friction for learning. In many implementations, the evaluator relies on exact match criteria or heuristic functions specific to the task domain. For example, in a coding task, the evaluator might be a suite of unit tests. If the generated code fails a test, the evaluator outputs a binary failure signal along with the specific error message. In more open-ended tasks, the evaluator might be another instance of an LLM prompted to act as a strict grader, assessing the actor's output against a detailed rubric.
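
For the coding case, an evaluator of this kind might look like the sketch below. The function name `evaluate_with_tests` and the `(arguments, expected)` case format are assumptions made for illustration; a real setup would more likely run an existing test suite.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical case format: (positional arguments, expected return value)
Case = Tuple[tuple, Any]


def evaluate_with_tests(candidate: Callable[..., Any], cases: List[Case]) -> Tuple[bool, str]:
    """Return (True, "") if every case passes, else (False, message) for the first failure."""
    for args, expected in cases:
        try:
            actual = candidate(*args)
        except Exception as exc:
            # A crash counts as a failure; the error text becomes the feedback.
            return False, f"{candidate.__name__}{args!r} raised {type(exc).__name__}: {exc}"
        if actual != expected:
            return False, f"{candidate.__name__}{args!r} returned {actual!r}, expected {expected!r}"
    return True, ""


def average(values):
    return sum(values) / len(values)  # deliberately fails on an empty list


cases = [(([2, 4],), 3.0), (([],), 0.0)]
passed, message = evaluate_with_tests(average, cases)
```

Here `passed` is `False` and `message` names the `ZeroDivisionError` raised on the empty list, which is the kind of concrete signal the self-reflection step can work from.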

The self-reflection model is where the actual "learning" occurs. It analyzes the actor's trajectory and the evaluator's feedback to diagnose the root cause of the failure. Crucially, it must translate this diagnosis into a concise, actionable piece of advice. For instance, if the actor failed a coding task because it didn't handle edge cases, the self-reflection model might generate the feedback: "The previous implementation failed because it did not account for empty input arrays. In the next attempt, ensure that an explicit check for empty arrays is included at the beginning of the function."

The episodic memory buffer acts as a persistent repository of these lessons. Unlike the short-term context window, which is cleared at the start of each new task, the episodic memory persists across trials. When the agent faces the same or a similar task again, it queries this buffer to retrieve relevant reflections. This mechanism allows the agent to bypass the expensive process of fine-tuning its internal weights, instead relying on dynamic, in-context learning to adapt its behavior. The result is a system that becomes progressively more capable and efficient as it accumulates experience, much like a human practitioner mastering a new skill.
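
A buffer like this can be quite small. The sketch below is one possible shape, under two assumptions that are not taken from the source: reflections are keyed by an exact task identifier (so "similar task" lookup is not handled), and the buffer is capped per task so the retrieved lessons stay within the context window. The class and method names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class EpisodicMemory:
    """Keeps the most recent reflections per task (hypothetical design)."""

    def __init__(self, max_per_task: int = 3) -> None:
        # deque(maxlen=...) silently drops the oldest reflection when full.
        self._store: Dict[str, Deque[str]] = defaultdict(lambda: deque(maxlen=max_per_task))

    def add(self, task_id: str, reflection: str) -> None:
        self._store[task_id].append(reflection)

    def retrieve(self, task_id: str) -> List[str]:
        # .get avoids creating an empty entry for tasks that were never seen.
        return list(self._store.get(task_id, ()))

    def as_prompt_context(self, task_id: str) -> str:
        """Format stored lessons as text to prepend to the actor's prompt."""
        lessons = self.retrieve(task_id)
        if not lessons:
            return ""
        return "Lessons from earlier attempts:\n" + "\n".join(f"- {item}" for item in lessons)


memory = EpisodicMemory(max_per_task=2)
memory.add("sort", "Handle empty input.")
memory.add("sort", "Do not mutate the argument.")
memory.add("sort", "Keep equal elements in their original order.")
recent = memory.retrieve("sort")  # only the two most recent lessons remain
```

Retrieval by similarity rather than exact key would need an additional component, such as embedding search, which is outside this sketch.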

Reflection vs. Self-Refinement

It is important to distinguish reflection, particularly as implemented in Reflexion, from related techniques like Self-Refinement. While both involve a model critiquing and improving its own output, they operate on different structural levels.

Self-refinement occurs within a single episode or task execution. The model generates an initial draft, critiques it, and revises it in place before delivering the final output to the user. It is an iterative loop contained entirely within one generation cycle.

Reflexion, conversely, operates across multiple episodes. The agent attempts a task, receives feedback, generates a verbal reflection on why it failed, and stores that reflection in memory. The episode ends. When the agent attempts the task again in a new episode, it uses the stored reflection to guide its approach. Reflexion requires persistent memory to carry lessons from one attempt to the next, whereas self-refinement does not.

Comparing Self-Refinement and Reflexion
Feature	Self-Refinement	Reflexion
Scope	Within a single episode	Across multiple episodes
Mechanism	Generate → Critique → Revise	Generate → Evaluate → Reflect → Store Reflection → Retry
Memory Requirement	Short-term (current context only)	Long-term (episodic memory buffer)
Primary Goal	Polish a specific output	Learn a general strategy or correct a persistent error
Feedback Source	Typically intrinsic (self-critique)	Often extrinsic (environment reward)
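
For contrast with the cross-episode pattern, the Self-Refinement column can be sketched as a loop that lives entirely inside one task execution. The function `self_refine` and the toy callables are hypothetical; in practice each callable would be a prompt to the same model.

```python
from typing import Callable, Optional


def self_refine(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],   # returns None when satisfied
    revise: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> str:
    """Generate, critique, and revise within a single episode; nothing is stored afterward."""
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:
            break
        draft = revise(task, draft, feedback)
    return draft


def toy_generate(task: str) -> str:
    return "the quick brown fox"


def toy_critique(task: str, draft: str) -> Optional[str]:
    if not draft[0].isupper():
        return "Start with a capital letter."
    if not draft.endswith("."):
        return "End with a period."
    return None


def toy_revise(task: str, draft: str, feedback: str) -> str:
    if "capital" in feedback:
        return draft[0].upper() + draft[1:]
    return draft + "."


final = self_refine("write a sentence", toy_generate, toy_critique, toy_revise)
```

The structural difference from Reflexion is that `self_refine` returns only the polished draft. The critiques are discarded once the call ends, so a later attempt at the same task starts from scratch.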

The Impact of External Grounding

A significant challenge in reflection is ensuring that the model's self-critique is accurate. If a model lacks the knowledge to generate a correct answer initially, it may also lack the knowledge to identify its own errors during reflection. This limitation has led to the development of frameworks that ground the reflection process in external tools.

The CRITIC framework (Gou et al., 2024) addresses this by allowing LLMs to validate and amend their outputs in a manner similar to human interaction with tools. Starting with an initial output, the agent interacts with appropriate tools—such as a search engine for fact-checking or a code interpreter for debugging—to evaluate specific aspects of the text. It then revises the output based on the concrete feedback obtained during this validation process. This tool-interactive critiquing consistently enhances performance by providing the model with objective, external signals rather than relying solely on its internal representations.

The reliance on intrinsic self-critique—where a model evaluates its own output without external input—has been a subject of significant debate in the AI research community. Studies, such as those by Huang et al. (2023), have demonstrated that LLMs often struggle to self-correct reasoning errors when relying solely on their internal capabilities. If a model hallucinates a fact during generation, it is highly likely to hallucinate the validation of that fact during reflection. This phenomenon, sometimes referred to as the "echo chamber effect," highlights the necessity of external grounding.

The CRITIC framework (Gou et al., 2024) exemplifies how external tools can break this echo chamber. By integrating tools like Python interpreters, web search APIs, and specialized calculators, CRITIC transforms the reflection process from a subjective self-assessment into an objective verification exercise. When the model generates a piece of code, it doesn't just ask itself, "Does this look right?" Instead, it executes the code in a secure sandbox and observes the actual output. If the code throws a syntax error or fails a test case, the model receives deterministic, undeniable feedback.
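
The idea of replacing "Does this look right?" with an actual execution can be illustrated with a few lines of Python. This is a simplified sketch and not the CRITIC implementation: the function `execution_feedback` is hypothetical, and plain `exec` is used only to keep the example short. It is not a secure sandbox, and a real system would run untrusted code in an isolated process or container.

```python
import traceback
from typing import Optional


def execution_feedback(source: str, check: str) -> Optional[str]:
    """Run `source`, then `check`; return None if both succeed, else the error text.

    WARNING: exec() offers no isolation. It stands in for a real sandbox here.
    """
    namespace: dict = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        exec(compile(check, "<check>", "exec"), namespace)
    except Exception as exc:  # includes SyntaxError raised by compile()
        return "".join(traceback.format_exception_only(type(exc), exc)).strip()
    return None


buggy = "def last(items):\n    return items[len(items)]\n"
fixed = "def last(items):\n    return items[-1]\n"
check = "assert last([1, 2, 3]) == 3"

buggy_feedback = execution_feedback(buggy, check)   # error text describing an IndexError
fixed_feedback = execution_feedback(fixed, check)   # None, meaning the check passed
```

The returned error text is the external signal: it can be passed back to the model as the critique, instead of asking the model to guess whether its code is correct.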

This tool-interactive critiquing is particularly powerful in domains requiring strict factual accuracy or logical rigor. In mathematical problem-solving, for instance, an LLM might generate a plausible-sounding but mathematically flawed proof. By offloading the actual computation to an external calculator during the reflection phase, the system can catch arithmetic errors that the LLM alone might miss. Similarly, in fact-checking applications, querying a search engine allows the model to verify claims against up-to-date, external knowledge bases, significantly reducing the incidence of confident hallucinations. The integration of external tools effectively bridges the gap between the LLM's linguistic fluency and the rigorous demands of real-world applications.
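
As a small illustration of offloading computation during reflection, the sketch below checks an arithmetic claim against an independent calculation. The helper names `safe_eval` and `check_claim` are hypothetical, and the evaluator deliberately supports only numbers and the four basic operators.

```python
import ast
import operator
from typing import Optional

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str) -> float:
    """Evaluate a basic arithmetic expression without calling eval()."""

    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


def check_claim(expr: str, claimed: float, tol: float = 1e-9) -> Optional[str]:
    """Return None if the claimed value matches, else a correction message."""
    actual = safe_eval(expr)
    if abs(actual - claimed) <= tol:
        return None
    return f"{expr} evaluates to {actual}, not {claimed}"


wrong = check_claim("17 * 23", 381)   # "17 * 23 evaluates to 391, not 381"
right = check_claim("17 * 23", 391)   # None
```

Extracting the expression and the claimed value from free-form model output is the harder part in practice and is not covered here.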

Performance Gains and Tradeoffs

The implementation of reflection patterns yields substantial improvements across various benchmarks. In the original Reflexion study, agents achieved 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the 80% achieved by GPT-4, the previous state of the art (Shinn et al., 2023). Similarly, in sequential decision-making tasks (AlfWorld), Reflexion agents improved over strong baselines by an absolute 22% in 12 iterative learning steps.

However, these performance gains come with inherent tradeoffs. Reflection requires multiple calls to the language model, which increases both latency and computational cost. For simple, low-stakes queries, the overhead of a reflection loop may not be justified. Furthermore, the effectiveness of reflection is bounded by the model's capacity to accurately evaluate its own work or effectively utilize external tools. If the feedback signal is flawed, the reflection process can lead the model astray, reinforcing incorrect assumptions rather than correcting them.

The evidence is not limited to coding. Reflection has also shown promise in more open-ended reasoning tasks. On the HotPotQA dataset, which requires multi-hop reasoning across multiple documents, Reflexion agents demonstrated a 20% improvement over baseline approaches (Shinn et al., 2023). This suggests that the ability to pause, review retrieved information, and adjust the reasoning path is valuable when an answer depends on combining several sources.

Furthermore, research by Renze (2024) on self-reflection in LLM agents across various multiple-choice question datasets confirmed that self-reflection significantly improves problem-solving performance across a wide range of popular models. The study found that even simple forms of self-reflection, where the model is merely prompted to reconsider its answer, can yield statistically significant gains. This underscores the fundamental utility of the reflection pattern, independent of the specific architectural implementation.

However, the tradeoffs associated with reflection must be carefully managed in production environments. The most immediate cost is latency. A standard single-pass generation might take a few seconds, whereas a full reflection loop (generation, evaluation, reflection, and revision) can take several times longer. In user-facing applications where real-time responsiveness is critical, this added latency can degrade the user experience.

Additionally, the financial cost of API calls scales roughly linearly with the number of LLM calls in the reflection loop. For high-volume applications, the expense of running multiple LLM inferences per user query can quickly become prohibitive. Therefore, system architects must implement intelligent routing mechanisms to determine when reflection is necessary. A simple factual query might be routed to a fast single-pass model, while a complex coding task or legal analysis might trigger a full, multi-agent reflection workflow. Balancing the desire for maximum accuracy with the constraints of latency and cost is one of the primary challenges in deploying reflective AI systems at scale.
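
A routing decision of this kind can be sketched very simply. Everything below is an illustrative assumption: the `Route` type, the call counts, and especially the keyword heuristic, which a production system would likely replace with a trained classifier or a cheap model call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    llm_calls: int  # number of model calls this path makes per query


SINGLE_PASS = Route("single_pass", llm_calls=1)
# One round of generate, evaluate, reflect, revise (assumed; real loops may repeat).
REFLECTIVE = Route("reflective", llm_calls=4)

# Hypothetical keyword hints that a query deserves the slower path.
COMPLEX_HINTS = ("code", "debug", "contract", "prove")


def choose_route(query: str) -> Route:
    text = query.lower()
    return REFLECTIVE if any(hint in text for hint in COMPLEX_HINTS) else SINGLE_PASS


def estimated_cost(route: Route, cost_per_call: float) -> float:
    """Cost grows with the number of calls; per-call cost is treated as constant."""
    return route.llm_calls * cost_per_call


simple = choose_route("What is the capital of France?")
complex_ = choose_route("Debug this Python function for me")
```

Under these assumptions the reflective path costs four times as much per query as the single-pass path, which is why reserving it for queries that need it matters at high volume.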

Reflection in Agentic Workflows

Reflection is increasingly recognized as a foundational design pattern for building effective agentic systems. It can be applied at multiple stages of a workflow: evaluating whether a user's initial request is feasible, checking if a proposed plan aligns with the overall goal, or verifying that a completed sequence of actions actually solved the intended problem.

The reflection pattern is central to how the multi-agent software factory Sgai operates. When a developer agent generates code to fulfill a specific goal, the process does not end with the initial output. A separate reviewer agent evaluates the code, running tests and critiquing the implementation. This multi-agent reflection loop continues until the tests pass and the reviewer is satisfied, ensuring that the final deliverable meets rigorous standards before it is marked complete. By automating the critical feedback step, systems can achieve a level of reliability and sophistication that single-pass generation simply cannot match.
