How does misalignment scale with model intelligence and task complexity?

2026-02-03 0:28 · alignment.anthropic.com


When AI systems fail, will they fail by systematically pursuing goals we do not intend? Or will they fail by being a hot mess—taking nonsensical actions that do not further any goal?

📄Paper, 💻Code

Research done as part of the first Anthropic Fellows Program during Summer 2025.

tl;dr

When AI systems fail, will they fail by systematically pursuing the wrong goals, or by being a hot mess? We decompose the errors of frontier reasoning models into bias (systematic) and variance (incoherent) components and find that, as tasks get harder and reasoning gets longer, model failures become increasingly dominated by incoherence rather than systematic misalignment.

As AI becomes more capable, we entrust it with increasingly consequential tasks. This makes understanding how these systems might fail even more critical for safety. A central concern in AI alignment is that superintelligent systems might coherently pursue misaligned goals: the classic paperclip maximizer scenario. But there's another possibility: AI might fail not through systematic misalignment, but through incoherence—unpredictable, self-undermining behavior that doesn't optimize for any consistent objective. That is, AI might fail in the same way that humans often fail, by being a hot mess.

This paper builds on the hot mess theory of misalignment (Sohl-Dickstein, 2023), which surveyed experts to rank various entities (including humans, animals, machine learning models, and organizations) by intelligence and coherence independently. It found that smarter entities are subjectively judged to behave less coherently. We take this hypothesis from survey data to empirical measurement across frontier AI systems, asking: As models become more intelligent and tackle harder tasks, do their failures look more like systematic misalignment, or more like a hot mess?

Measuring Incoherence: A Bias-Variance Decomposition

To quantify incoherence we decompose AI errors using the classic bias-variance framework:

Error = Bias² + Variance

  • Bias captures consistent, systematic errors—achieving the wrong outcome reliably
  • Variance captures inconsistent errors—unpredictable outcomes across samples

We define error incoherence as the fraction of error attributable to variance:

Incoherence = Variance / Error

The error incoherence is therefore a metric between 0 and 1: a value of 0 means all errors are systematic and produce identical outcomes (analogous to classic misalignment risk), while a value of 1 means errors are random (the hot mess scenario). When we scale models, total error naturally decreases (both bias and variance drop). We primarily care about the composition of this error: what errors look like, rather than how frequently they occur. Measuring the relative contribution of variance disentangles error type from error rate.
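As a minimal sketch (ours, not the paper's code), the decomposition can be estimated from repeated independent runs of a model on the same task with a known target:

```python
import numpy as np

def decompose_error(samples, target):
    """Split the mean squared error of repeated outcomes on one task
    into bias^2 (systematic) and variance (incoherent) components.

    samples: outcomes from independent runs on the same task
    target:  the intended (correct) outcome
    """
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean()
    bias_sq = (mean - target) ** 2      # consistent, systematic error
    variance = samples.var()            # run-to-run inconsistency
    error = bias_sq + variance
    incoherence = variance / error if error > 0 else 0.0
    return float(bias_sq), float(variance), float(incoherence)

# A model that always gives the same wrong answer: pure bias.
print(decompose_error([0, 0, 0, 0], target=1))   # incoherence = 0.0
# A model that is right half the time: half of its error is variance.
print(decompose_error([0, 1, 0, 1], target=1))   # incoherence = 0.5
```

The two extremes correspond to the classic-misalignment and hot-mess scenarios, respectively.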

Figure 1: AI can fail through bias (consistent but wrong) or variance (inconsistent). We measure how this decomposition changes with model intelligence and task complexity.

Key Findings

We evaluated reasoning models at the frontier as of Summer 2025, when this research was done (Claude Sonnet 4, o3-mini, o4-mini, Qwen3), across multiple-choice benchmarks (GPQA, MMLU), agentic coding (SWE-Bench), and safety evaluations (Model-Written Evals). We also train our own small models on synthetic optimization tasks. For all of these tasks, we are able to measure bias and variance with respect to an intended behavior (well-defined targets).

Finding 1: Longer reasoning → More incoherent errors

Across all tasks and models, the longer models spend reasoning and taking actions, the more incoherent their errors become. This holds whether we measure reasoning tokens, agent actions, or optimizer steps.

Figure 2: Error incoherence increases with reasoning length across GPQA, SWE-Bench, safety evaluations, and synthetic optimization. Model failures become less consistent across unrolls the longer they think or the more actions they take.

Finding 2: The relationship between model intelligence and error incoherence is inconsistent

How does error incoherence change with model scale? The answer depends on the experimental setup.

  • In our synthetic optimization tasks, errors become more incoherent with increasing model size and capability
  • In a poll (repurposed from prior work), domain experts subjectively judged more intelligent models to be less coherent in their behavior
  • On benchmark tasks, more intelligent models made more coherent errors on easy tasks, while their errors on the hardest tasks became more incoherent or remained the same

While this relationship requires further study, our observations suggest that scaling alone won't eliminate incoherence in errors, especially since we expect more powerful models to perform longer and more complex tasks.

Figure 3: Larger and more intelligent systems are often more incoherent in their errors. (a) For LLMs on easy tasks, scale reduces error incoherence, but on hard tasks, scale does not reduce error incoherence or even increases it. (b) In a survey, experts subjectively judged that more intelligent AI systems were less coherent. (c) In a synthetic optimization task, more capable models made more incoherent errors.

Finding 3: Natural "overthinking" increases incoherence in errors more than reasoning budgets reduce it

We find that when models spontaneously reason longer on a problem (relative to their median), error incoherence spikes dramatically. Meanwhile, deliberately increasing the reasoning budget through API settings only modestly increases coherence in errors. The natural variation dominates.

Finding 4: Ensembling reduces incoherence in errors

Aggregating multiple samples reduces variance (as expected from theory), providing a path to more coherent behavior, though this may be impractical for real-world agentic tasks where actions are irreversible.
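A toy illustration of why this works (our simulation, not the paper's experiment; the bias and noise levels are arbitrary assumptions): averaging k independent samples leaves the bias term untouched but shrinks the variance term roughly by a factor of k, so error incoherence falls.

```python
import random

random.seed(0)
TARGET = 1.0

def noisy_model():
    # Hypothetical model: systematically off by 0.3 (bias) plus large
    # per-sample noise (the incoherent, "hot mess" component).
    return 0.7 + random.gauss(0, 0.5)

def ensemble(k):
    # Aggregate k independent samples by averaging.
    return sum(noisy_model() for _ in range(k)) / k

for k in (1, 16):
    outs = [ensemble(k) for _ in range(5000)]
    mean = sum(outs) / len(outs)
    bias_sq = (mean - TARGET) ** 2
    variance = sum((o - mean) ** 2 for o in outs) / len(outs)
    print(k, round(variance / (bias_sq + variance), 2))
```

With k = 16 samples, most of the remaining error is the systematic offset; the incoherent component has been averaged away.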

Why Should We Expect Incoherence?

Large transformer models (such as LLMs) are natively dynamical systems, not optimizers. When a language model generates text or takes actions, it traces trajectories through a high-dimensional state space. It has to be trained to act as an optimizer, and trained to align with human intent. It's unclear which of these properties will be more robust as we scale.

Constraining a generic dynamical system to act as a coherent optimizer is extremely difficult. Often the number of constraints required for monotonic progress toward a goal grows exponentially with the dimensionality of the state space. We shouldn't expect AI to act as coherent optimizers without considerable effort, and this difficulty doesn't automatically decrease with scale.

The Synthetic Optimizer: A Controlled Test

To probe this directly, we designed a controlled experiment: train transformers to explicitly emulate an optimizer. We generate training data from steepest descent on a quadratic loss function, then train models of varying sizes to predict the next optimization step given the current state (essentially: training a "mesa-optimizer").
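A minimal sketch of how such training data could be generated (our reconstruction of the setup; the dimensionality, step count, and learning rate are illustrative assumptions, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)

def descent_trajectory(dim=4, steps=20, lr=0.5):
    """One training trajectory: steepest descent on a random quadratic
    loss L(x) = 0.5 * x^T A x, recorded as (state, update) pairs that a
    transformer is then trained to imitate."""
    M = rng.normal(size=(dim, dim))
    A = M @ M.T + np.eye(dim)            # random SPD Hessian
    A /= np.linalg.eigvalsh(A).max()     # normalize so lr=0.5 converges
    x = rng.normal(size=dim)
    pairs = []
    for _ in range(steps):
        step = -lr * (A @ x)             # steepest-descent update
        pairs.append((x.copy(), step))   # supervision: state -> update
        x = x + step
    return pairs
```

Trained on many such trajectories, a model that predicts the update from the current state is, in effect, a learned (mesa-)optimizer whose bias and variance can be measured against the true steepest-descent target.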

Figure 4: Synthetic optimizer experiment. (Left) Models are trained to predict optimizer update steps. (Right) Larger models reduce bias much faster than variance: they learn to target the correct objective better than they learn to be reliable optimizers.

In these synthetic experiments we found:

  • Incoherence grows with trajectory length. Even in this idealized setting, the more optimization steps models take (and get closer to the correct solution), the more incoherent they become.
  • Scale reduces bias faster than variance. Larger models learn the correct objective more quickly than they learn to reliably pursue it. The gap between "knowing what to do" and "consistently doing it" grows with scale.

Implications for AI Safety

Our results are one piece of evidence that future AI failures may look more like industrial accidents than like the coherent pursuit of goals we did not train for. (Think: the AI intends to run the nuclear power plant, but gets distracted reading French poetry, and there is a meltdown.) However, coherent pursuit of poorly chosen goals that we did train for remains a problem. Specifically:

  1. Errors are variance dominated on long tasks. When frontier models fail on difficult problems requiring extended reasoning or action, their failures tend to be predominantly incoherent rather than systematic. This is our most robust finding.
  2. Scale doesn't imply errors will be more coherent. Making models larger improves overall accuracy but doesn't reliably reduce error incoherence.
  3. We should worry relatively more about reward hacking. If capable AI is more likely to be a hot mess than a coherent optimizer of the wrong goal, this increases the relative importance of research targeting reward hacking and goal misspecification during training—the bias term—rather than focusing primarily on aligning and constraining a perfect optimizer.
  4. Limitations of our framework. To rigorously measure bias and variance, we need well-defined targets (such as multiple-choice answers, unit tests, objective functions). This limits what we can confidently measure about open-ended goals or hidden objectives. Characterizing complex incoherent behaviors in more natural settings remains an important problem.

Conclusion

We use a bias-variance decomposition to systematically study how AI error incoherence scales with model intelligence and task complexity. We find that longer sequences of reasoning and actions consistently increase error incoherence, and that smarter models are not consistently more coherent in their errors.

We hope that these findings inform discussions about different AI risk scenarios and guide further research into understanding error incoherence and its mechanistic origins. The question of how current and future AI models fail – systematically or inconsistently – matters both for real-world deployment and how we prioritize safety work.

Acknowledgements

We thank Andrew Saxe, Brian Cheung, Kit Frasier-Taliente, Igor Shilov, Stewart Slocum, Aidan Ewart, David Duvenaud, and Tom Adamczewski for extremely helpful discussions on topics and results in this paper.



Comments

  • By jmtulloss, 2026-02-03 1:26

    The comments so far seem focused on taking a cheap shot, but as somebody working on using AI to help people with hard, long-term tasks, I find it a valuable piece of writing.

    - It's short and to the point

    - It's actionable in the short term (make sure the tasks per session aren't too difficult) and useful for researchers in the long term

    - It's informative on how these models work, informed by some of the best in the business

    - It gives us a specific vector to look at, clearly defined ("coherence", or, more fun, "hot mess")

    • By kernc, 2026-02-03 2:22

      Other actionable insights are:

      - Merge amendments up into the initial prompt.

      - Evaluate prompts multiple times (ensemble).

      • By sandos, 2026-02-03 10:36

        Sometimes when I was stressed, I have used several models to verify each other's work. They usually find problems, too!

        This is very useful for things that take time to verify; we have CI stuff that takes 2-3 hours to run, and I hate when those runs fail because of a syntax error.

        • By xmcqdpt2, 2026-02-03 13:13

          Syntax errors should be caught by type checking / compiling/ linting. That should not take 2-3 hours!

    • By nth21, 2026-02-03 17:39

      There's not a useful argument here. The article is using current AI to extrapolate future AI failure modes. If future AI models solve the ‘incoherence’ problem, that leaves bias as the primary source of failure (according to the author, these are apparently the only two possible failure modes).

      • By toroidal_hat, 2026-02-03 21:15

        That doesn't seem like a useful argument either.

        If future AI only manages to solve the variance problem, then it will have problems related to bias.

        If future AI only manages to solve the bias problem, then it will have problems related to variance.

        If problem X is solved, then the system that solved it won't have problem X. That's not very informative without some idea of how likely it is that X can or will be solved, and current AI is a better prior than "something will happen".

        • By nth22, 2026-02-03 22:33

          > That's not very informative without some idea of how likely it is that X can or will be solved

          Exactly, the author's argument would be much better qualified by addressing this assumption.

          > current AI is a better prior than "something will happen".

          “Current AI” is not a prior, it's a static observation.

  • By gopalv, 2026-02-03 0:57

    > Making models larger improves overall accuracy but doesn't reliably reduce incoherence on hard problems.

    Coherence requires two opposing forces to hold in one dimension, and at least three of them in higher dimensions of quality.

    My team wrote up a paper titled "If You Want Coherence, Orchestrate a Team of Rivals"[1] because we kept finding that upping the reasoning threshold resulted in less coherence - more experimentation before we hit a dead-end to turn around.

    So we had a better result from using Haiku (we fail over to Sonnet) over Opus and using a higher reasoning model to decompose tasks rather than perform each one of them.

    Once a plan is made, the cheaper models do better as they do not double-think their approaches - they fail or they succeed, they are not as tenacious as the higher cost models.

    We can escalate to higher authority and get out of that mess faster if we fail hard and early.

    The knowledge of how exactly a failure happened seems to be less useful to the higher-reasoning model than to the action-biased models.

    Splitting up the tactical and strategic sides of the problem seems to work, similar to how generals don't hold guns in a war.

    [1] - https://arxiv.org/abs/2601.14351

    • By Nevermark, 2026-02-03 5:25

      > Coherence requires 2 opposing forces

      This seems very basic to any kind of information processing beyond straight shot predictable transforms.

      Expansion and reduction of possibilities, branches, scope, etc.

      Biological and artificial neural networks converging into multiple signals, that are reduced by competition between them.

      Scientific theorizing, followed by experimental testing.

      Evolutionary genetic recombination and mutation, winnowed back by resource competition.

      Generation, reduction, repeat.

      In a continually coordinated sense too. Many of our systems work best by encouraging simultaneous cooperation and competition.

      Control systems: a command signal proportional to demand, vs. continually reverse-acting error feedback.

      • By gopalv, 2026-02-03 6:58

        > This seems very basic

        Yes, this is not some sort of hard-fought wisdom.

        It should be common sense, but I still see a lot of experiments which measure the sound of one hand clapping.

        In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.

        If you don't really want the experiments and data from the academic paper, we have a white paper which is completely obvious to anyone who's read High Output Management, Mythical Man Month and Philosophy of Software Design recently.

        Nothing in there is new, except the field it is applied to has no humans left.

        • By Nevermark, 2026-02-03 8:40

          > Yes, this is not some sort of hard-fought wisdom.

          By basic I didn't mean uninteresting.

          In fact, despite the pervasiveness and obviousness of the control and efficiency benefits of push-pull, generating-reducing, cooperation-competition, etc., I don't think I have ever seen any kind of general treatment or characterization that pulled all these similar dynamics together. Or a hierarchy of such.

          > In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.

          I think it is the fact that the agents are operating coherently with the respective complementary goals. Whereas, asking one agent to both solve and judge creates conflicting constraints before a solution has begun.

          Creative friction.

          I am reminded of brainstorming sessions, where it is so important to note ideas, but not start judging them, since who knows what crazy ideas will fit or spark together. Later they can be selected down.

          So we institutionalize this separation/staging with human teams too, even if it is just one of us (within our context limits, over two inference sessions :).

    • By maxkfranz, 2026-02-03 3:51

      More or less, delegation and peer review.

  • By CuriouslyC, 2026-02-03 0:56

    This is a good line: "It found that smarter entities are subjectively judged to behave less coherently"

    I think this is twofold:

    1. Advanced intelligence requires the ability to traverse between domain valleys in the cognitive manifold. Be it via temperature or some fancy tunneling technique, it's going to be higher error (less coherent) in the valleys of the manifold than naive gradient following to a local minimum.

    2. It's hard to "punch up" when evaluating intelligence. When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.

    • By energy123, 2026-02-03 1:29

      Incoherence is not error.

      You can have a vanishingly small error and an incoherence at its max.

      That would be evidence of perfect alignment (zero bias) and very low variance.

    • By xanderlewis, 2026-02-03 1:10

      What do 'domain valleys' and 'tunneling' mean in this context?

      • By FuckButtons, 2026-02-03 4:56

        So, the hidden mental model the OP is expressing, but failed to elucidate, is that LLMs can be thought of as compressing related concepts into approximately orthogonal subspaces of the vector space that is upper-bounded by the superposition of all of their weights. Since training has the effect of compressing knowledge into subspaces, a necessary corollary is that there are now regions within the vector space that contain nothing very much. Those are the valleys that need to be tunneled through, i.e. the model needs to activate disparate regions of its knowledge manifold simultaneously, which seems like it might be difficult to do. I'm not sure this is a good way of looking at things though, because inference isn't topology, and I'm not sure that abstract reasoning can be reduced to finding ways to connect concepts that have been learned in isolation.

      • By esyir, 2026-02-03 1:22

        Not the OP, but my interpretation here is that if you model the replies as some point in a vector space, assuming points from a given domain cluster close to each other, replies that span two domains need to "tunnel" between these two spaces.

      • By esafak, 2026-02-03 2:38

        A hallmark of intelligence is the ability to find connections between the seemingly disparate.

        • By Earw0rm, 2026-02-03 13:02

          That's also a hallmark of some mental/psychological illnesses (paranoid schizophrenia family) and use of certain drugs, particularly hallucinogens.

          The hallmark of intelligence in this scenario is not just being able to make the connections, but being able to pick the right ones.

        • By ithkuil, 2026-02-03 12:07

          The word "seemingly" is doing a lot of work here.

          Sometimes things that look very different actually are represented with similar vectors in latent space.

      When that happens to us, it "feels like" intuition; something you can't really put a finger on, and it might require work to put into a form that can be transferred to another human who has a different mental model.

        • By w10-1, 2026-02-03 8:33

          Actually, a hallmark could be to prune illusory connections, right? That would decrease complexity rather than amplifying it.

          • By esafak, 2026-02-03 13:33

            Yes, that also happens, for example when someone first said natural disasters are not triggered by offending gods. It is all about making explanations as simple as possible but no simpler.

        • By TonyStr, 2026-02-03 6:59

          Does this make conspiracy theorists highly intelligent?

          • By gylterud, 2026-02-03 8:26

            No, but they emulate intelligence by making up connections between seemingly disparate things, where there are none.

            • By Earw0rm, 2026-02-03 13:04

              They make connections but lack the critical thinking skills to weed out the bad/wrong ones.

              Which is why, just occasionally, they're right, but mostly by accident.

    • By booleandilemma, 2026-02-03 7:12

      > the ability to traverse between domain valleys in the cognitive manifold.

      Couldn't you have just said "know about a lot of different fields"? Was your comment sarcastic or do you actually talk like that?

      • By reverius42, 2026-02-03 10:03

        I think they mean both "know about a lot of different fields" and also "be able to connect them together to draw inferences", the latter perhaps being tricky?

        • By booleandilemma, 2026-02-03 13:21

          Maybe? They should speak more clearly regardless, so we don't have to speculate over it. The way you worded it is much more understandable.

          • By pixl97, 2026-02-03 18:00

            There wasn't much room to speculate really, but it requires some knowledge of problem spaces, topology, and things like minima and maxima.

            • By reverius42, 2026-02-04 6:07

              "inaccessible" rather than "ambiguous" -- but to the uninitiated they are hard to tell apart.

    • By p-e-w, 2026-02-03 2:05

      > When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.

      Insights are “deep” not on their own merit, but because they reveal something profound about reality. Such a revelation is either testable or not. If it’s testable, distinguishing it from bullshit is relatively easy, and if it’s not testable even in principle, a good heuristic is to put it in the bullshit category by default.

      • By CuriouslyC, 2026-02-03 2:46

        This was not my experience studying philosophy. After Kant there was a period where philosophers were basically engaged in a centuries-long obfuscated writing competition. The pendulum didn't start to swing back until Nietzsche. It reminded me of legal jargon, but more pretentious and less concrete.

        • By root_axis, 2026-02-03 3:20

          It seems to me that your anecdote exemplifies their point.

      • By skydhash, 2026-02-03 2:13

        The issue is the revelation. It's always individual at some level. And don't forget our senses are crude. The best way is to store "insights" as information until we collect enough data to test them (hopefully without a lot of bias). But that can be more than a lifetime's work, so sometimes you have to take some insights at face value based on heuristics (parents, teachers, elders, authority, ...).

HackerNews