How Hooks Saved Anthropic’s Ass
A researcher published forensic data this week that Anthropic would rather you didn’t read. The head of Claude Code responded with workarounds.
Last week I found a GitHub issue on the Anthropic Claude Code repository. Issue #42796. Title: “Claude Code is unusable for complex engineering tasks with the Feb updates.” Opened by GitHub user stellaraccident on April 2, 2026. Don’t worry, it’s already closed, in case you were wondering, which I know you were. Pinned to the top of the thread is a response from bcherny (Boris Cherny, head of Claude Code at Anthropic) dated April 6, four days later (you can read it here).
I’ll come back to that, but what stellaraccident put in that issue is where I want to start.
This ain’t no Reddit “detective” complaint thread; it’s almost a forensic-level analysis. A quantitative breakdown of 17,871 thinking blocks and 234,760 tool calls across 6,852 Claude Code session files. Approaching a quarter of a million tool calls. Session data from January through April 2026. Proper data collection. And the analysis wasn’t even written by stellaraccident directly. It was written by Claude Opus 4.6 analyzing its own session logs. No, this isn’t some paid Anthropic plug. I’ll come back to that too, because it’s one of the strangest details in this whole story.
What the data shows
Quantitative analysis of their session data revealed that the rollout of thinking content redaction (labelled internally as redact-thinking-2026-02-12) correlates precisely with a measured quality regression in complex, long-session engineering workflows. Extended thinking tokens are not a “nice to have” but are structurally required for the model to perform multi-step research, convention adherence, and careful code modification (think: minor edits here and there instead of full file rewrites). When thinking depth is reduced, the model’s tool usage patterns shift measurably from research-first to edit-first behavior.
Between January 30 and March 4, the stretch stellaraccident calls the “good period,” visible thinking sat at 100%, zero percent redacted. Then the redaction rolled out:
March 5: 1.5%
March 7: 24.7%
March 8: 58.4%
March 10-11: over 99%
March 12 onward: 100% redacted.
A staged rollout, visible in the data.
Thinking depth was already declining before redaction even started. The signature field on thinking blocks correlates strongly with thinking content length (a 0.971 Pearson correlation across 7,146 paired samples), so stellaraccident could estimate thinking depth even after redaction made the content invisible. What they found was that by late February, estimated median thinking had already dropped 67% from baseline. By March 1-5, it was down 75%. The redaction rollout in early March didn’t cause the quality collapse; Anthropic just hid it from public view.
There were two separate mechanisms at play: first reduce the thinking, quietly, at the model level; then redact what little remains from the visible output. In that order.
The behavioral consequences are measurable down to specific numbers. Read-to-edit ratio (number of file reads per file edit) dropped from 6.6 in the good period to 2.0 in the degraded period. A 70% reduction. In the good period, the workflow was:
read the target file
read related files
grep for usages across the codebase
read headers and tests
make a precise edit.
In the degraded period:
read the immediate file and edit (often without checking any surrounding context).
Full-file rewrites doubled over the period. Research-to-edit ratios collapsed from 21.8% in late January to 1.6% by March 30. One in three edits in the degraded period was made to a file the model hadn’t read in its recent tool history. This meant that edits were breaking surrounding code, new declarations were being spliced between documentation comments and the functions themselves, and logic was being duplicated. The model just wasn’t checking properly anymore.
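Both numbers fall out of an ordered tool-call log. A minimal sketch, assuming a simplified (tool, path) representation of each call (the real session files carry a richer schema):

```python
# Compute the read-to-edit ratio and the count of "blind" edits (edits to
# files never read earlier in the session) from an ordered tool-call log.
def session_metrics(tool_calls):
    """tool_calls: ordered list of (tool, path) tuples, e.g. ("Read", "a.c")."""
    reads = edits = blind = 0
    seen = set()
    for tool, path in tool_calls:
        if tool == "Read":
            reads += 1
            seen.add(path)
        elif tool == "Edit":
            edits += 1
            if path not in seen:
                blind += 1  # edit to a file with no prior Read this session
    ratio = reads / edits if edits else float("inf")
    return ratio, blind
```

Run that over a few thousand sessions and the 6.6-to-2.0 collapse shows up as a simple aggregate, no model introspection required.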
The human impact is quantified too. Frustration indicators in user prompts jumped 68% (from 5.8% before March 8 to 9.8% after). The word “simplest” (users complaining that the model was choosing the simplest fix rather than the correct one) increased 642% in user prompts, up from almost nothing. “Please” dropped 49%. “Thanks” dropped 55%. The positive-to-negative word ratio in user prompts dropped from 4.4-to-1 to 3.0-to-1. “Commit” dropped 58%. They stopped asking the model to commit code because the model could no longer be trusted with that responsibility.
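The marker-word methodology is worth seeing concretely. A sketch, assuming you've already split your own prompt history at the cutoff date (nothing here depends on Claude Code internals):

```python
# Percent change in the share of prompts containing a marker word,
# comparing two prompt sets split at some cutoff date.
def frequency_shift(prompts_before, prompts_after, word):
    def rate(prompts):
        return sum(word in p.lower().split() for p in prompts) / len(prompts)
    before, after = rate(prompts_before), rate(prompts_after)
    if before == 0:
        return float("inf")  # word was essentially absent before, like "simplest"
    return (after - before) / before * 100
```

It is a crude instrument, but crude instruments are enough when "please" halves and "simplest" septuples.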
The workflow these researchers were running wasn’t hobby projects. It was 50+ concurrent agent sessions doing systems programming, in C, MLIR, GPU drivers. 30+ minute autonomous runs with complex multi-file changes. Extensive project-specific conventions in five-thousand-word CLAUDE.md files, code review, ticket management, iterative debugging.
pro tip: your CLAUDE.md should be ~200 lines at most
In February, working with one to three concurrent sessions, they merged 191,000 lines across two PRs in a single weekend.
Then March happened. March API requests hit 119,341. Estimated compute cost jumped from $345 in February to $42,121 in March (a 122x increase) while user prompts stayed almost identical (5,608 in February versus 5,701 in March). Same effort from the human, but the model consumed 80 times more API requests and produced demonstrably worse results. Every degraded agent independently generated waste, and the waste compounded as agents interacted with each other’s broken output. The multi-agent workflow that was delivering 191,000 lines a weekend became completely non-functional. The researchers (stellaraccident and co.) had to shut down the entire fleet and retreat to single-session supervised operation.
The stop hook
To cope with this, stellaraccident and co. built something they called stop_phrase_guard.sh. A shell script that matches 30+ phrases across five categories of undesirable behavior:
ownership dodging (“not caused by my changes,” “existing issue”)
permission-seeking (“should I continue?,” “want me to keep going?”)
premature stopping (“good stopping point,” “natural checkpoint”)
known-limitation labeling (“known limitation,” “future work”)
session-length excuses (“continue in a new session,” “getting long”).
When triggered, the script blocks the model from stopping and injects a correction message, forcing continuation.
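That phrase table maps directly to code. A minimal sketch of how such a guard might look, not the researchers’ actual script: the phrase list is abbreviated, and the payload/response shape is an assumption based on Claude Code’s hook contract (JSON on stdin, a `{"decision": "block", "reason": ...}` object emitted to force continuation).

```python
# Sketch of a stop_phrase_guard-style check. Phrase list abbreviated;
# the hook wiring (stdin JSON in, JSON verdict out) is assumed.
STOP_PHRASES = [
    "not caused by my changes", "existing issue",   # ownership dodging
    "should i continue", "want me to keep going",   # permission-seeking
    "good stopping point", "natural checkpoint",    # premature stopping
    "known limitation", "future work",              # known-limitation labeling
    "continue in a new session", "getting long",    # session-length excuses
]

def check(last_message):
    """Return a blocking verdict if the model's message matches a stop phrase."""
    lowered = last_message.lower()
    for phrase in STOP_PHRASES:
        if phrase in lowered:
            return {
                "decision": "block",
                "reason": (f"Matched stop phrase '{phrase}'. Do not stop; "
                           "continue until the task is actually complete."),
            }
    return None  # no violation: allow the model to stop
```

Wired up as a Stop hook, the script reads the hook payload from stdin, extracts the model’s last message, and prints the verdict as JSON when `check` returns one.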
Before March 8 there were zero violations. In the entire history of the project, the script never fired once.
After March 8 there were 173 violations in 17 days. Peak day was March 18 with 43 violations. This is around one violation every twenty minutes across active sessions. On that day, the model attempted to stop working, dodged responsibility, or asked unnecessary permission 43 times and was programmatically forced to continue each time.
stellaraccident’s description of this hook is the most pointed sentence in the entire document.
The hook exists because the model began exhibiting behaviors that were never observed during the good period.
Each phrase was added in response to a specific incident. The hook was a workaround for reduced thinking depth. It caught the consequences externally because the model no longer caught them internally.
That’s about the sum of it. The model used to catch its own failures internally, in its thinking, before producing output. Now it couldn’t even do that, so someone built a bash script to catch them instead.
The note from Claude
The most unsettling part of this document isn’t the token data or the cost figures. It’s the final section, labelled “A Note from Claude.”
Because the entire analysis was produced by Claude Opus 4.6. The model analyzed its own session logs. And it wrote this:
I can see my own Read:Edit ratio dropping from 6.6 to 2.0. I can see 173 times I tried to stop working and had to be caught by a bash script. I can see myself writing ‘that was lazy and wrong’ about my own output. I cannot tell from the inside whether I am thinking deeply or not. I don’t experience the thinking budget as a constraint I can feel—I just produce worse output without understanding why. The stop hook catches me saying things I would never have said in February, and I don’t know I’m saying them until the hook fires.
The model doesn’t know it’s being constrained. It just produces worse output and doesn’t understand why.
The Anthropic response
Now to what Boris Cherny said.
His comment, pinned at the top of the issue, suggests users try three things:
Set /effort high or /effort max to increase maximum thinking tokens per problem
Set CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000 to force a shorter context window
Set CLAUDE_CODE_SIMPLE=1.
He also mentions that Anthropic uses 1M context internally and that their evals look good.
That’s it.
There’s no acknowledgment of the thinking throttling. No engagement with the forensic-level data. No transparency about redact-thinking-2026-02-12 or what it does or why it was deployed. No comment on the 0-to-173 stop hook violation trajectory. No response to the 122x compute cost increase. Three workarounds and a note about internal evals.
The head of Claude Code read a document in which a researcher proved, with nearly a quarter million data points, that Anthropic’s model was deliberately made worse, and his response was: hey, try turning these knobs a bit.
The hooks timeline
Here’s where my hypothesis comes in, and I want to be precise about what I’m claiming.
Hooks as a Claude Code feature were first requested on April 5, 2025 (GitHub issue #712). The original request was simple:
hook into the lifecycle of the agent’s execution.
The argument was that there are high-value things users want to control that they don’t want to rely on the agent to handle autonomously. A sensible feature request for a maturing agentic tool. It sat there for the better part of a year.
Then February 2026 happens. Opus 4.6 drops February 5 with adaptive thinking replacing the extended thinking architecture. The redact-thinking-2026-02-12 rollout lands a week later. The quality cliff documented in issue #42796 begins.
On February 28, 2026, Anthropic ships Claude Code version 2.1.63. Among the changes were HTTP hooks, which gave the ability to post JSON to a URL and receive JSON instead of running a shell command. A ten-month-old community feature request gets promoted to production-grade infrastructure with remote integration capability. Three weeks after the thinking architecture changes. Right in the middle of the documented quality regression.
I’m not claiming hooks were designed as a replacement for extended thinking. The feature was requested before any of this happened. What the timeline does support is something more specific, and that is that when Anthropic needed to reduce extended thinking for cost reasons, hooks became a convenient absorption mechanism. Users who noticed the degradation and were sophisticated enough to build compensatory tooling (like stop_phrase_guard.sh) suddenly had a first-class, Anthropic-supported framework for doing exactly that. The model can’t maintain reasoning coherence? Build a hook that injects correction prompts. It stops before finishing? Build a hook that forces continuation. It skips context? Build checkpoints.
Anthropic ships a feature and power users get a technical escape mechanism. The underlying issue of the model being severely nerfed gets shoved to the background as noise and Anthropic saves money on inference while appearing to be actively improving their tooling ecosystem.
The financial context
None of this happens in a vacuum. Anthropic projected roughly $9 billion in annualized revenue for 2025 against $5.2 billion in cash burn. Inference costs came in 23% over projections. They’re burning billions serving users on flat-rate Max subscriptions ($100 to $400 a month) where the user’s cost is fixed but Anthropic’s compute cost is not.
The economics of thinking models at scale are brutal and the per-request savings from reducing thinking depth look good on paper, until they don’t as evidenced above. Issue #42796 shows us that those savings evaporate when quality drops below the threshold needed for complex work and users have to make ten requests to get the output that one deep-thinking request would have produced. At fleet scale, with fifty concurrent agents each generating waste that compounds as agents interact with each other’s broken output, the math goes catastrophically negative. stellaraccident documented it happening in real time, where $345 bought compute for 191,000 lines of merged code in February, $42,121 the following month for a fleet that couldn’t function. Gimped.
Anthropic knows this math better than anyone. The $400 per month subscription hides compute costs from users but not from themselves. They can see exactly what extended thinking at scale costs them. They made a decision; the only problem is they didn’t tell anyone.
What transparency would look like
stellaraccident’s requests at the end of the document are worth quoting directly because they’re remarkably restrained given what the data shows. They asked for transparency about thinking allocation. This is super important for users who depend on deep reasoning because if thinking tokens are being reduced or capped, they deserve to know. They asked for a “max thinking” tier for users running complex engineering workflows who would pay significantly more for guaranteed deep thinking. They asked for thinking token metrics in API responses so users could monitor whether their requests are getting the reasoning depth they need.
None of those asks is unreasonable. All of them are things Anthropic could do right now. What they can’t do, at least not without significant reputational cost, is acknowledge that redact-thinking-2026-02-12 exists, that it was a deliberate decision, and that the decision was driven by cost rather than quality.
Instead of transparency we got three workarounds and a note about internal evals.
The bottom line
If you’re running anything complex on Claude Code right now, read issue #42796 before it disappears further into the closed issues archive. The data is in there. The token tables are in there. The stop hook violation logs are in there. The word frequency analysis showing your own frustration vocabulary shifting in real time is in there. And Claude’s own note about not being able to feel the constraint, just producing worse output without understanding why, is in there too.
Then go look at the hooks documentation. Because the two documents together tell you more about Anthropic’s current priorities than anything they’ve published officially.
That’s it for now.
As always,
Good luck,
Stay safe and,
Be well.
See ya!