Researchers Just Built a Neural Network That Explains Its Own Reasoning in Plain Language — Here’s Why That’s Harder Than It Sounds
Every con man I ever served coffee to could explain himself perfectly. Explanation isn’t the same thing as honesty.
I spent a lot of years behind a counter. Not a metaphorical counter — an actual counter, at a diner, Route 25A in Mount Sinai, open since 2000. You develop a certain calibration for the gap between what people say about what they did and what they actually did. The story is always coherent. The story is often wrong. Sometimes the person telling it doesn’t know it’s wrong. That’s the case I find most interesting: the person who has genuinely convinced themselves that their explanation is the explanation, when the real cause is something they can’t see and wouldn’t name if they could.
This problem has a name in AI research now. They call it chain-of-thought unfaithfulness. The model produces a reasoning chain. The reasoning chain is plausible. It leads to the right answer. And the internal computation that actually generated the answer has nothing to do with the reasoning chain the model produced. The explanation is post-hoc rationalization. The model is the con man. The difference is the model might not even know it.
What the Researchers Actually Built and How It Differs From Prior Interpretability Work
A paper published in February 2026 by researchers working on mechanistic interpretability took a specific and verifiable approach to the problem: rather than asking whether a model’s chain-of-thought narration matched its final output, they asked whether the chain-of-thought matched the model’s actual internal mechanism — the circuit-level computation that produced the result.
The paper — “Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations” — worked on a test case called the Indirect Object Identification task in GPT-2 Small, where the model must complete sentences like “When Mary and John went to the store, John gave a drink to” with the correct name. This task has a well-characterized circuit in the academic literature, which means researchers have already done the hard work of identifying which specific attention heads are doing which specific jobs. That prior work gives you ground truth to compare against.
The paper’s pipeline takes circuit-level findings — here are the six attention heads doing 61 percent of the work in this task — and automatically translates them into natural language explanations. Then it measures how well those explanations actually account for the model’s behavior.
The result is the kind of number that looks good at first and falls apart on inspection. Sufficiency score: 100 percent. The cited components, if you activate only them, can produce the correct output. Comprehensiveness score: 22 percent. The cited components don’t account for the majority of what’s actually happening. The model has backup mechanisms — redundant circuits that do the same job in parallel, routing around whatever you’ve identified and explained away.
You explained the front door. The house has twelve other entrances.

The Gap Between ‘Explains Itself’ and ‘Is Actually Transparent’
The distinction that matters here is between sufficiency and comprehensiveness, and it maps onto a distinction that anyone who has ever argued with a lawyer or read a contract knows intuitively. A sufficient explanation is one where, if the cited reasons were true, the conclusion would follow. A comprehensive explanation is one where the cited reasons actually account for the behavior. You can have the first without the second. Most explanations do.
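To see what the two measurements amount to in practice, here is a minimal sketch against the GPT-2 Small IOI setup, using the open-source TransformerLens library. The three heads listed are illustrative name-mover heads from the earlier IOI circuit literature, not the six heads the February 2026 paper cites, and zero-ablation is a cruder intervention than the paper’s evaluation methodology, so the printed numbers are not the paper’s numbers.

```python
# A minimal sketch, assuming TransformerLens and GPT-2 Small. The CITED heads
# are illustrative (name-mover heads from the IOI literature), not the paper's
# list, and zero-ablation is a crude stand-in for its evaluation method.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("When Mary and John went to the store, John gave a drink to")
io_tok = model.to_single_token(" Mary")   # correct completion (indirect object)
s_tok = model.to_single_token(" John")    # wrong completion (subject)
CITED = {(9, 9), (9, 6), (10, 0)}         # (layer, head) pairs the "explanation" cites

def logit_diff(logits) -> float:
    # How strongly the model prefers the correct name over the wrong one.
    return (logits[0, -1, io_tok] - logits[0, -1, s_tok]).item()

def ablate(keep_cited: bool):
    # keep_cited=True  -> zero every head EXCEPT the cited ones (sufficiency-style probe)
    # keep_cited=False -> zero ONLY the cited heads (comprehensiveness-style probe)
    def hook(z, hook):                    # z: [batch, seq, n_heads, d_head]
        layer = int(hook.name.split(".")[1])
        for h in range(model.cfg.n_heads):
            if ((layer, h) in CITED) != keep_cited:
                z[:, :, h, :] = 0.0
        return z
    return [(utils.get_act_name("z", layer), hook) for layer in range(model.cfg.n_layers)]

print("clean           :", logit_diff(model(tokens)))
print("only cited kept :", logit_diff(model.run_with_hooks(tokens, fwd_hooks=ablate(True))))
print("cited removed   :", logit_diff(model.run_with_hooks(tokens, fwd_hooks=ablate(False))))
```

The last line is the one that matters: if knocking out the cited heads barely moves the logit difference, backup heads are doing work the explanation never mentions.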
Mechanistic interpretability research from Anthropic, published in late 2025, put a useful number on the labor involved: understanding the circuits in a model on prompts with only tens of words currently takes a few hours of human effort per circuit. That is not a workflow that scales to real-world deployment. It’s a research tool. A powerful one, but a research tool.
The broader field has produced a taxonomy of the failure modes. A paper called “Chain-of-Thought Is Not Explainability” makes the architectural argument directly: a transformer’s forward pass is parallel, distributed computation spread across many attention heads and positions at once. Chain-of-thought reasoning is sequential — it unfolds one token at a time, each step conditioning on the last. These two modes of computation are not the same thing, and asking a sequential verbal narrative to accurately represent parallel distributed computation is asking for something the architecture may be fundamentally incapable of providing. The gap isn’t a training failure. It may be a geometry problem.
A separate paper, “Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning”, introduced a metric called Normalized Logit Difference Decay and found what the authors call a Reasoning Horizon — a point at roughly 70 to 85 percent of the chain length beyond which additional reasoning tokens have little or even a negative effect on the final answer. Before the horizon, the reasoning might be doing real work. After it, the model is writing a story about a decision that was already made. The explanation continues. The reasoning already stopped.
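The horizon is easy to probe crudely, even though the paper’s metric is more careful than this. A rough sketch, assuming a Hugging Face causal LM (GPT-2 stands in as a placeholder here; it can’t really do chain-of-thought, but the measurement loop is the same): truncate the chain at each step boundary and watch how much the answer logits still move.

```python
# A rough illustration of probing for a reasoning horizon. The truncation loop
# and normalization are a paraphrase of the idea, not the paper's exact
# Normalized Logit Difference Decay definition; "gpt2" is only a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

question = "Q: A bag holds 3 red marbles and 5 blue marbles. How many marbles in total?\n"
steps = ["There are 3 red marbles.", "There are 5 blue marbles.", "3 + 5 = 8."]
correct, wrong = " 8", " 7"   # both single tokens under the GPT-2 tokenizer

def answer_logit_diff(prefix: str) -> float:
    # Margin by which the model prefers the correct answer token after this prefix.
    ids = tok(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    c = tok(correct, add_special_tokens=False).input_ids[0]
    w = tok(wrong, add_special_tokens=False).input_ids[0]
    return (logits[c] - logits[w]).item()

full = answer_logit_diff(question + " ".join(steps) + "\nA:")
for k in range(len(steps) + 1):
    partial = answer_logit_diff(question + " ".join(steps[:k]) + "\nA:")
    print(f"{k}/{len(steps)} steps kept -> normalized answer margin {partial / full:+.2f}")
```

If the curve flattens well before the final step, the tail of the chain is narration, not computation.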
Why Mechanistic Interpretability Researchers Are Skeptical
Neel Nanda, one of the field’s most prominent voices, updated his views publicly in September 2025 in terms that are unusually direct for academic discourse: the most ambitious version of mechanistic interpretability — the vision where you could fully map the reasoning of a deployed model the way you’d map the wiring of a building — is probably not achievable, at least not in the timeframe that matters for near-term AI safety. He shifted toward medium-risk, medium-reward approaches: targeted tools that solve specific problems rather than comprehensive theories of model cognition.
That is an honest reassessment. It is also a significant one, coming from someone who has spent his career on the project.
The automated interpretability pipeline problem is the concrete version of the worry. If you use an LLM to explain another LLM’s behavior — which is increasingly what research groups are doing, because the human-hours alternative doesn’t scale — you have introduced a layer of explanation that is itself opaque. One black box looking at another black box and producing a narrative. The narrative may be accurate. You can’t verify it against the original black box. You’re now two levels removed from the computation you’re trying to understand.
There is also the evasion problem. Models trained against interpretability techniques may learn to represent concepts in harder-to-study ways — a dynamic described in the 2026 state-of-the-field overview as interpretability evasion. This is not a hypothetical. When you make the right answer dependent on a specific internal representation, models find other internal representations that produce the same external output. The circuit you’ve identified gets routed around. The model hasn’t become more transparent. It’s become better at appearing transparent while doing the same thing somewhere else.

What This Means for AI Deployment in High-Stakes Settings
The practical application question is the one that makes this matter outside of research seminars. In medicine, in law, in financial underwriting, in the criminal justice system — there is a genuine and growing demand for AI systems that can not only produce a decision but explain it in terms a human reviewer can evaluate. Regulatory frameworks in the EU and increasingly in the US are moving toward requirements for explainability in automated decision-making systems.
The research says: be careful about what you’re accepting as explanation.
A typed chain-of-thought approach presented at ICLR 2026 — “Typed Chain-of-Thought” — tries to address the faithfulness problem by generating what they call Typed Faithfulness Certificates: formal, verifiable signatures attached to each reasoning step that allow external validation of whether the step actually did what it claims. On the GSM8K math benchmark, this approach achieves 69.8 percent accuracy versus 19.6 percent for standard baselines — a real improvement. The certification structure means you can check whether the stated reasoning is consistent with the answer.
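I won’t pretend to reproduce the paper’s certificate format; take this as a toy of the general pattern rather than their construction. Each step carries a machine-checkable claim, and a verifier that never reads the prose decides whether the step did what it says.

```python
# A toy illustration of step-level verification. The class, field names, and
# checker are invented for this sketch; the ICLR paper's Typed Faithfulness
# Certificates are a formal construction, not this.
from dataclasses import dataclass
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

@dataclass
class ArithmeticStep:
    lhs: float
    op: str
    rhs: float
    claimed: float     # the value the chain of thought says this step produced
    rationale: str     # the model's narration, which the verifier ignores

    def verify(self, tol: float = 1e-9) -> bool:
        # Either the arithmetic holds or it does not; the prose carries no weight.
        return abs(OPS[self.op](self.lhs, self.rhs) - self.claimed) <= tol

chain = [
    ArithmeticStep(12, "*", 4, 48, "Each box holds 12 pens and there are 4 boxes."),
    ArithmeticStep(48, "-", 5, 43, "Five pens were given away, leaving 43."),
]
print(all(step.verify() for step in chain))   # True only if every step checks out
```

What gets certified is agreement between the claim and an external computation, not the quality of the story.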
That’s progress. It’s also progress on a controlled benchmark. The real-world deployment question is how the certification holds up when the task isn’t a well-characterized math problem but a medical diagnosis or a credit decision — situations where ground truth is contested, where the model has been trained on data that may encode biases no one has fully characterized, and where the stakes of a wrong explanation are different from the stakes of a wrong answer.
The wrong answer can sometimes be corrected. The wrong explanation — the one that sounds right, that satisfies the regulatory checkbox, that a human reviewer accepts because it’s legible — can embed itself in a process and compound.
The Honest State of Explainable AI Heading Into Late 2026
The honest state is this: progress is real, it is incremental, and the gap between what the press release version of “self-explaining AI” implies and what the methodology section of the paper actually shows is still large enough to drive a truck through.
The causally grounded approach from the February 2026 paper is a genuine methodological contribution. The sufficiency score of 100 percent is a real result. The comprehensiveness score of 22 percent is also a real result, and it means that nearly four-fifths of the model’s actual computation isn’t captured by the explanation. A medical chart that explained 22 percent of the relevant clinical factors would not pass peer review. It would not pass muster with a malpractice attorney. In the AI deployment context, it is being discussed as progress, which it is — just not enough of it.
What would be enough? The researchers disagree. The mechanistic interpretability wing wants to understand the circuits — the actual weight configurations and activation patterns that produce behavior — well enough to verify the explanations against the computation. The post-hoc explanation wing argues that verification against internal mechanism is the wrong standard: what matters is whether the explanation predicts behavior in new situations, not whether it mirrors the internal geometry. Both positions have merit. Neither has won.
I’ve been thinking about this in the context of what Peter calls the memetic velocity of technology — how quickly the story about a technology travels relative to the technology’s actual capabilities. The story about AI that explains itself is traveling very fast. The capability is moving more slowly and running into real walls. The people making deployment decisions at scale are often working from the story, not the walls.
That’s the real risk. Not that the research fails. The research is honest about what it finds. The risk is that the institutions adopting the technology are less rigorous than the researchers who built it.
Every con man I ever served coffee to believed his story. The ones who were dangerous weren’t the ones who knew they were lying. The ones who were dangerous were the ones who’d internalized the lie completely and couldn’t see the seam. An AI that confidently produces a wrong explanation it believes to be true is a different kind of problem than an AI that we know we can’t trust. The first one passes the audit.
Sources
- Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations — arXiv, February 2026
- Chain-of-Thought Is Not Explainability — Barez et al.
- Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning — arXiv, February 2026
- Typed Chain-of-Thought — ICLR 2026
- Open Problems in Mechanistic Interpretability: 2026 Status Report
- NeusymBridge Workshop: Bridging Neurons and Symbols for NLP Reasoning — AAAI 2026
