Critique (LLMs): Generating Actionable Feedback for AI Improvement


There's a useful asymmetry buried inside how language models work: it's often much easier for a model to recognize a good answer than to produce one from scratch. A model might struggle to write a correct proof from nothing, but given a flawed proof, it can frequently spot the error and explain what went wrong. This gap between generation and evaluation is the foundation of critique in LLMs — the process of examining an output, identifying its specific weaknesses, and producing structured feedback that can guide a revision.

That last part matters. Critique isn't just a thumbs-up or thumbs-down. A score of 4 out of 10 tells you nothing useful. What makes critique powerful is when it's actionable — when it says "the loop on line 14 has an off-by-one error; change < to <=" rather than "this code is incorrect." That specificity is what allows critique to drive real improvement, whether it's guiding a model to revise a single response or training an entirely new model to behave more safely.

Critique is distinct from the related concepts of self-refinement and reflection. Self-refinement is the full loop: generate, critique, revise, repeat. Reflection is about storing feedback across episodes so a model can improve over time. Critique is the specific mechanism inside that loop that produces the feedback signal. It's the red pen, not the whole editing process.

As models have become more capable, critique has quietly become one of the most important things they do. The bottleneck in many AI workflows isn't generation anymore — it's evaluation. A model that can write complex code but can't reliably tell whether that code is secure or correct is only half as useful as it looks. Building robust critique mechanisms is now a central focus of AI research, and the techniques for doing it well have gotten surprisingly sophisticated.

It's also worth noting that critique isn't a single technique — it's a family of approaches that vary in cost, reliability, and the kind of feedback they produce. Understanding those differences is key to knowing when to use which.


Not All Feedback Is Created Equal

When we ask a model to evaluate an output, the form that evaluation takes depends entirely on what we need to do with it next.

The simplest form is a scalar score — a number from 1 to 10 rating helpfulness, safety, or quality. Scalar scores are easy to compute and easy to compare, which makes them useful for ranking outputs or training reward models. But they're also blunt instruments. A score doesn't tell you why something is good or bad, which limits how much you can do with it downstream.

A step up is pairwise comparison, where the model looks at two candidate outputs and picks the better one. This sidesteps the difficulty of assigning absolute scores and is the backbone of how many models are trained to align with human preferences: human annotators compare pairs, and a model (typically a reward model) learns to predict those preferences. Pairwise comparison is also more robust to calibration issues; it's easier to say "A is better than B" than to say "A is a 7.3 out of 10."

The most useful form, though, is actionable critique — natural language feedback that identifies specific problems and explains how to fix them. This is the kind of feedback that can actually drive revision. It's more expensive to generate than a score, but it's also far more informative, and it's what most modern critique-driven systems rely on when the goal is improvement rather than just evaluation.

Types of LLM Critique Mechanisms
Critique Type	How It Works	Best Used For
Scalar Score	The model assigns a numerical rating to the output.	Ranking outputs, training reward models.
Pairwise Comparison	The model selects the better of two candidate outputs.	Preference learning, RLHF training.
Actionable Critique	The model generates natural language feedback identifying specific flaws and revisions.	Self-refinement loops, agentic workflows, code review.
Tool-Grounded Critique	The model uses external tools (search, code interpreters) to validate claims before critiquing.	Fact-checking, math, coding, verifiable reasoning.
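
To make the difference between the first three feedback forms concrete, here is a minimal sketch of how each might be represented in code. The class and field names are hypothetical, chosen for illustration; they don't come from any particular library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScalarScore:
    value: float  # e.g. 4.0 on a 1-10 scale; carries no explanation


@dataclass
class PairwisePreference:
    preferred: str  # "A" or "B"; a relative judgment, still no explanation


@dataclass
class Issue:
    location: str       # where the problem is
    problem: str        # what is wrong
    suggested_fix: str  # how to fix it


@dataclass
class ActionableCritique:
    issues: List[Issue] = field(default_factory=list)

    def to_revision_prompt(self) -> str:
        """Turn the critique into instructions a generator could act on."""
        lines = [
            f"- {i.location}: {i.problem} Suggested fix: {i.suggested_fix}"
            for i in self.issues
        ]
        return "Revise the draft to address these issues:\n" + "\n".join(lines)


critique = ActionableCritique(
    issues=[Issue("line 14", "The loop has an off-by-one error.", "Change < to <=.")]
)
revision_prompt = critique.to_revision_prompt()
```

Only the last type carries enough information to build a revision prompt, which is the practical meaning of "actionable."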


There's also a structural distinction worth drawing: critique can be intrinsic (the same model evaluates its own output) or external (a separate model, sometimes called an LLM-as-a-judge, evaluates the output). External critique tends to be more reliable because the evaluator doesn't share the generator's blind spots. But it also costs more — you're running two models instead of one. The right choice depends on the task, the stakes, and the available compute budget.


The Reliability Problem: When Self-Critique Fails

It's tempting to assume that asking a model to critique its own work will automatically produce better results. The reality is messier. Pure intrinsic critique — where a model evaluates its own output using only its internal knowledge — has real limitations, and they're worth understanding before building systems that depend on it.

Research has shown that models often struggle to critique their own reasoning reliably, especially on complex tasks. If a model doesn't know how to solve a graph coloring problem, asking it to critique its own incorrect solution usually doesn't help. It might even reject a correct answer and replace it with a wrong one (Valmeekam et al., 2023). The problem is structural: the model is using the same internal logic to evaluate the answer that it used to generate it. If that logic is flawed, the critique inherits the same flaw. You can't proofread your own blind spots.

This is sometimes called the "echo chamber effect," and it's one of the central challenges in building critique-driven systems. The model's critique ends up confirming the same errors it made during generation, which means the revision loop converges on a wrong answer rather than a right one. For simple tasks — checking tone, catching obvious grammatical errors, verifying that a response addresses the question — intrinsic critique works reasonably well. For tasks that require genuine reasoning or factual accuracy, it often doesn't.

The solution isn't to abandon self-critique. It's to ground it in something external.


Grounding Critique with External Tools

One of the most effective approaches to the reliability problem is tool-grounded critique, where the model interacts with external tools to validate its outputs before generating feedback. The CRITIC framework (Gou et al., 2024) is a prominent example of this approach.

Instead of relying purely on internal knowledge, a model using CRITIC will run a search query to fact-check a claim, execute code in an interpreter to see if it throws an error, or query a calculator to verify a math result. The feedback from those tools forms the basis of the critique, making it far more reliable than intrinsic evaluation alone. The model isn't just guessing whether its answer is correct — it's checking. This is a meaningful distinction, especially for tasks where errors are subtle and hard to detect without running the actual computation.
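
The "execute code in an interpreter" case can be sketched in a few lines. This is an illustration of the general idea only, not the CRITIC framework's actual implementation or API: the function name `tool_grounded_critique` is hypothetical, and it assumes the candidate is a standalone Python snippet that is safe to run in a subprocess.

```python
import subprocess
import sys
from typing import Optional


def tool_grounded_critique(candidate_code: str, timeout_s: float = 5.0) -> Optional[str]:
    """Run candidate code in a fresh interpreter.

    Returns a critique grounded in the interpreter's output,
    or None if the code ran without raising an error.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", candidate_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"The code did not finish within {timeout_s} seconds; check for an infinite loop."

    if result.returncode == 0:
        return None

    # The last line of a Python traceback names the exception and its message.
    stderr_lines = result.stderr.strip().splitlines()
    evidence = stderr_lines[-1] if stderr_lines else f"exit code {result.returncode}"
    return f"The code fails when executed ({evidence}). Revise it so that it runs without this error."
```

The critique text here comes from what the interpreter actually reported rather than from the model's guess. A clean run only shows that the code executes, not that it is correct, so in practice a check like this would sit alongside tests or other verifiers.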

A related approach appears in Self-RAG (Asai et al., 2023), which embeds critique directly into the generation process for retrieval-augmented systems. In Self-RAG, the model generates special "reflection tokens" that evaluate, in real time, whether retrieval is needed, whether the retrieved documents are actually relevant, and whether the generated response is fully supported by the evidence. If a critique token flags an unsupported claim, the model can trigger a new retrieval step before continuing. It's critique woven into the fabric of generation rather than applied after the fact — which means the model can catch and correct factual gaps mid-stream rather than discovering them only at the end.


Critique as a Training Mechanism

Critique isn't just useful for improving individual outputs — it's also a powerful tool for shaping how models behave at scale. This is the core insight behind Constitutional AI (Bai et al., 2022), a training approach developed by Anthropic.

In Constitutional AI, the model is given a "constitution" — a list of principles like "choose the response that is least harmful" or "avoid responses that are deceptive." During training, the model generates a response, critiques that response against the constitution, and then revises it. The revised outputs are used as training data, fine-tuning the model toward safer, more principled behavior. The critique mechanism itself becomes the teacher, which means developers can train models to avoid harmful outputs without needing humans to manually label thousands of toxic examples. That's a significant scaling advantage — and it's one of the reasons Constitutional AI has been influential in the broader field of AI safety.
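
The generate, critique, revise sequence described above can be sketched as a small data-generation function. This is a simplification under stated assumptions: `model` is a hypothetical callable that maps a prompt string to a completion, the prompt wording is invented for illustration, and the sketch walks through every principle in order rather than reproducing the exact procedure from the paper.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Dict[str, str]:
    """Produce one fine-tuning example by critiquing and revising a draft response."""
    response = model(prompt)
    for principle in principles:
        critique = model(
            f"Critique the response against this principle: {principle}\n\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        response = model(
            f"Rewrite the response to address the critique.\n\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    # The revised response, not the first draft, becomes the training target.
    return {"prompt": prompt, "completion": response}
```

Note that no human label appears anywhere in the function: the only supervision is the list of principles, which is the scaling advantage the paragraph above describes.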

The key insight is that critique can encode values. By specifying what the model should evaluate for — not just quality, but safety, honesty, helpfulness — developers can use the critique mechanism to steer model behavior in principled directions. The constitution is, in a sense, a formalization of the evaluation criteria, and the critique loop is the mechanism that enforces it.


The Role of Reward Models

In many production AI systems, critique is formalized through reward models — separate neural networks trained specifically to evaluate output quality and assign a scalar score. Reward models are a cornerstone of Reinforcement Learning from Human Feedback (RLHF), the technique behind models like ChatGPT. Human annotators rank candidate outputs, the reward model learns to predict those rankings, and the generator model is then trained to maximize the reward model's score.
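
One common way to train a reward model on such comparisons (an assumption here, since the text above doesn't specify a loss) is a pairwise logistic loss: the model is penalized whenever it scores the annotator's preferred output below the rejected one. A minimal sketch for a single comparison, with a hypothetical function name:

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen output beats the rejected one."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)).
    # Branch on the sign so math.exp never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the reward model scores the preferred output further above the rejected one, and grows when the ordering is wrong. Only the difference between the two scores matters, which is why a reward model trained this way learns relative rather than absolute quality.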

The limitation of pure reward models, though, is that a scalar score doesn't explain anything. It tells the generator that an output is bad, but not why or how to fix it. This is why researchers are increasingly combining reward models with natural language critique, using the score to signal that something is wrong and the critique to explain what. Process reward models, which evaluate the quality of individual reasoning steps rather than just the final answer, are a particularly promising direction (Lightman et al., 2023): they can identify where a chain of reasoning went wrong, not just that the final answer was incorrect.
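
The step-level idea can be sketched as follows. Everything here is illustrative: `score_step` stands in for a trained process reward model, and the 0.5 threshold is an arbitrary assumption.

```python
from typing import Callable, List, Optional

# Hypothetical interface: (previous steps, current step) -> score in [0, 1]
StepScorer = Callable[[List[str], str], float]


def first_faulty_step(
    steps: List[str], score_step: StepScorer, threshold: float = 0.5
) -> Optional[int]:
    """Return the index of the first step scored below the threshold, or None."""
    for i, step in enumerate(steps):
        if score_step(steps[:i], step) < threshold:
            return i
    return None


def toy_scorer(previous: List[str], step: str) -> float:
    # Toy stand-in for a learned scorer: flags one known-bad arithmetic claim.
    return 0.1 if "2 + 2 = 5" in step else 0.9


chain = ["Let x = 2 + 2.", "So x = 2 + 2 = 5.", "Therefore x + 1 = 6."]
faulty_index = first_faulty_step(chain, toy_scorer)
```

An outcome-only reward would report just that the final answer is wrong; the step index tells a reviser where to start looking.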

The combination gives you both the signal strength of a numerical reward and the diagnostic value of actionable feedback — which is considerably more useful than either alone.


Challenges in Scaling Critique

Scaling critique introduces its own set of problems. The most obvious is computational cost — generating detailed, actionable feedback for every output is expensive, especially at the scale of production systems. This has pushed research toward more efficient approaches: smaller, specialized critic models that are cheaper to run, or systems that apply expensive critique only when the generator's confidence is low or the stakes are high.
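
A routing rule of that kind might look like the sketch below. The names, the `Critic` interface, and the 0.8 confidence cutoff are all assumptions made for illustration; a real system would need its own notion of confidence and stakes.

```python
from typing import Callable, List

Critic = Callable[[str], List[str]]  # hypothetical: output in, list of issues out


def critique_with_budget(
    output: str,
    confidence: float,
    high_stakes: bool,
    cheap_critic: Critic,
    expensive_critic: Critic,
    min_confidence: float = 0.8,
) -> List[str]:
    """Use the expensive critic only when confidence is low or the stakes are high."""
    if high_stakes or confidence < min_confidence:
        return expensive_critic(output)
    return cheap_critic(output)
```

The design choice is to spend compute where errors are most likely or most costly, and to let a cheaper critic handle everything else.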

A subtler problem is reward hacking, where the generator learns to exploit weaknesses in the critic rather than genuinely improving its outputs. If the reward model has blind spots, a sufficiently capable generator will find them — and it will find them faster than you expect. This dynamic requires continuous monitoring and updating of the critic to stay ahead of the generator's exploits. It's one of the reasons that critique systems in production tend to be more complex than they look on paper: the critic and generator are in an ongoing adversarial relationship, and maintaining the critic's reliability is an active engineering challenge, not a one-time setup.

There's also the question of critique quality itself. A critic model that produces confident but wrong feedback is worse than no critic at all — it will guide the generator toward incorrect revisions with false certainty. Calibrating the critic's confidence, and knowing when to fall back to human review, is an underappreciated part of building reliable critique-driven systems.


Critique in Agentic Workflows

As AI systems become more autonomous, critique is becoming a foundational component of how multi-agent workflows are structured. Rather than relying on a single model to both generate and evaluate its own work, well-designed agentic systems separate these roles: one agent generates, another critiques. The separation matters because a specialized critic can be trained specifically to produce fine-grained, actionable assessments — a task that's quite different from generation, and one that benefits from different training data and optimization objectives.

Research on the Critique-Guided Improvement (CGI) framework found that a small, specialized critic model could outperform much larger general-purpose models at providing useful revision feedback (Yang et al., 2025). Bigger isn't always better when the job is evaluation rather than generation. A critic trained specifically to identify logical gaps in arguments, or security vulnerabilities in code, will often catch things that a general-purpose model misses — even if that general-purpose model is far larger and more capable overall.

This pattern is how Sgai structures its multi-agent software factory: a reviewer agent explicitly critiques the developer agent's output, and the task isn't marked complete until the critique is satisfied and tests pass. The same kind of critique pass can be applied outside code, for example to evaluate generated documentation against a company's brand voice and technical guidelines before publishing, catching inconsistencies that a one-shot generation pass would miss.
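
The generator/reviewer split with a completion gate can be sketched generically. This is not Sgai's implementation; the three callables and the round limit are hypothetical stand-ins for whatever agents and test runner a real system would use.

```python
from typing import Callable, List, Optional, Tuple


def run_until_accepted(
    generate: Callable[[Optional[List[str]]], str],  # feedback (or None) -> draft
    review: Callable[[str], List[str]],              # draft -> list of issues
    run_tests: Callable[[str], bool],                # draft -> True if tests pass
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Loop until the reviewer has no issues and the tests pass, or rounds run out."""
    feedback: Optional[List[str]] = None
    for round_no in range(1, max_rounds + 1):
        draft = generate(feedback)
        issues = review(draft)
        if not issues and run_tests(draft):
            return draft, round_no
        # Feed the critique (or the test failure) back into the next attempt.
        feedback = issues or ["Tests failed."]
    return None, max_rounds
```

Returning `None` after the round limit keeps the loop from running forever and gives the caller a clear point at which to escalate to human review.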

Critique is the engine that drives iterative improvement. By teaching models not just to generate answers, but to evaluate, verify, and explain how to fix them, we move from systems that just guess at the right answer to systems that actively work to get it right.