For those using Claude Code, I recommend Learning mode to instruct Claude to walk you through implementing the solution yourself rather than doing it for you. It’s very helpful when diving into a new domain, and helps build lower level intuition.
To enable it, run /config > output styles > Learning
Learning mode has been a huge help for me, it quickly became my favorite way to learn. I ended up created a “Coaching Mode” output style that took some of the learning concepts like stubbing todos for the users and added other intructions that better fit how I learn
A few of us from the Claude Code team will be hanging around if anyone has questions! Very excited for this launch -- dynamic workflows have been a game changer for engineering here at Anthropic. Can't wait to hear what you think.
Hi Boris. Love the velocity of features. Are you planning on adding a secrets manager? Enterprise workflows almost always require an encrypted parameter or calling a secret.
Personally, I am happy paying 1password for my personal secret management. Their security credibility and bona fides are well-established. I'd strongly consider them for a business contract too.
Personally I would just like to be able to read more than 2 lines of an AskUserQuestion on the iOS app. Ever since the feature launched it's truncated the question, so you cannot actually read it.
Using the keyword “Workflow”like “Ultrathink” is problematic?
Ultrathink is uncommon enough that it is unlikely to be used in code or prompt outside its intended purpose.
Workflow is generic keyword and used in so many contexts both inside the codebase and orchestration tooling like say temporal.io or others that name their constructs “workflows”.
Thanks to you and the anthropic team for developing such exciting tools! The blog post seems to position workflows for “breadth”: generating fixes / refactors against large code bases. What about for “depth”: developing specific new features and functionality end-to-end? I’ve struggled to make this work reliably using the current experimental agent teams. Does this replace or augment that functionality?
Yes, it also helps! That's a place where raw model capability is the most helpful, but we do find that some dynamic workflow configurations can be helpful too.
Cool! If you can point to any examples of those types of workflow configurations I’d be super interested. For example, to have a team of agents review a PR and iterate on it until all requirements are met including UX, security and product functionality goals. If they could “converge” to a solution like workflows seems to be designed for that would be amazing.
This is really dissapointing release for such a promising technique. Long walks with fanned vectors can actually be token optimizing vs token burning when combined with self grading each agent along the walk and compared to manual long coding walks to solve first pass problems. But instead this frames it (assumptively) as a tokenmaxxing strategy. There are also many other strartegies that can prove effeciency and wider solution consideration with consensus, but none of this is explained why its an improvement or better than other technqiues.
Its like you guys aren't even aware of the primary problem you are all facing: your token burns aren't paying off anyore against standard coding -- and looking net negative. I have to ask, are you this unaware of your core problem set here?
There are no any examples, proofs, or scenarios that show why there is improvement either in complexity or reliability of the solution or effeciency to the path of the solution. I'm baffled.
How granular is the control over the internal process?
In my experiments I've had some success modeling the work to be done as a DAG of typed artifacts with a combination of code + LLM doing decomposition, transforms, synthesis, and fitness checking to generate the output. It took me a lot of tries to arrive at that formula and it would be cool to have something more general. I also run part of it against local compute because it would be far beyond my budget to do it all on Opus, so something for that would be nice too.
Can you please fix the issue where like 99.99999% of the time Claude tries to launch a subagent on its own accord it gets "Prompt is too long" and tries several more times, then gives up and does it without the subagent. Big waste of time and tokens and not getting almost any subagent advantages. Not kidding that this happens about 100 times a day.
I tried creating a workflow in Claude 1.9255.2 (1dc8f7) 2026-05-27T01:57:20.000Z
and got
API Error: 400 messages.3.content.11: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.
Tried again in
Claude 1.9659.1 (193bcb) 2026-05-28T16:22:15.000Z also but may need a new chat
VSCode has an official client? Given IDE usage is being restricted from Claude Code via the CC SDK tokens going to the Claude API rather than your CC Subscription, i'm unclear which IDEs can actually use claude code now.
Eg is Zed capable of using a Claude Code Subscription?
Oh, yea here's all the proof you need. Even Zed themselves admit you won't be able to use Claude Code via ACP via Subscription: https://zed.dev/blog/terminal-threads
So yea, bcherny didn't reply to me but as far as i can tell - No, Zed nor VSCode will have Claude Code natively in it. The best we can do is embed a Terminal into the editor and run CC in that.
With that said, because bcherny advertised VSCode, i'm going to guess VSCode is going to get special treatment. Really annoying.
to be clear, i'm referring to the recent fact where it appears that they're disabling all Claude Code (Subscription) usage from the SDK. Which ACP would be included on.
As usual though it's not super clear exactly what is allowed or not.
Is there an example of how y'all use Dynamic Workflows internally that you could share with the rest of us here so that we can mimic something similar?
Hey, yep. A few things I personally used dynamic workflows for over the last few weeks:
1. Autonomously landed 20+ optimizations to reduce Claude Code's token usage by ~15%
2. Ported tree-sitter, color-diff, yoga-layout, and a number of other WASM and Rust native modules to TypeScript, improving CPU and memory use by 2-10x in the process
3. Made our CI faster, and repeatedly found and fixed flaky tests (with /loop)
4. Migrated from regex-based bash static analysis to tree-sitter, reducing false positive permission prompts by 45%
5. Reduced Claude Agent SDK startup time by 61%, by repeatedly profiling and optimizing the startup path, putting up a number of PRs in the process
> Ported tree-sitter, color-diff, yoga-layout, and a number of other WASM and Rust native modules to TypeScript, improving CPU and memory use by 2-10x in the process
Curious to learn more on this (unless there’s a write-up in the works). I’m naive on this matter but:
1. is this because it’s higher cost when passing objects back and forth across the JS/native boundary?
2. Does this have anything more specific to do with use of Bun?
3. is the stance for claude code then to keep all the deps in raw TypeScript?
4. How do you folks keep these ported deps up-to-date?
Very cool. What % of the CC team's engineering would you say goes into QoL (as opposed to new feature development)? Obviously some live in a grey area, while others are more clear like making CI faster.
Is there not a reason to instead port claude code to rust? Do you have internal benchmarks that show that claude code is better at typescript than rust?
just wanted to say thank you, just did a 2 days "ai computer use" workshop - think a virtual desktop on hetzner with claude code in yolo mode, a github account, vercel and logged in into a google account and claude had all the credentials and then let a mix of marketing / product manager / sales / customer support let loose. 2k token budget ... and just let them see do magic again and again.
Hi Boris, what is the advantage of using /code-review vs just asking Opus to “code review”?
As a casual user working on hobby projects, I struggle to keep up with the pace of changes and knowing what to use when. My default now is to use Opus for all coding (sonnet is fine but seems dumber) and to prompt it for everything I need. I’ve had great success with this but clearly I’m missing power user functions with the slash commands and such.
The advantage is that /code-review supplies a structured idea of how to review and what that process should look like and then launches independent subagents to approach the issue from multiple angles.
It's analogous to how in the early days you could see benefits by telling the models to "think step by step". /code-review is something like "review angle by angle". "Consider removed behavior" and also "Look at language gotchas" and also "Look at test changes"...etc. Yes these are all somewhat implicitly already part of what "code review" means, but the models perform best with explicitness.
If you want my 2c as a power user: just don't think about it and use /code-review xhigh --fix. This will cover like 98% of what you want out of code review. It's a good skill.
I don't even bother looking at the code until I've run a code review pass on it. Why waste my time with trivial bug fixes? I find the best way to spend time right now is like:
- Defining the issue/ticket, what "success" looks like (if I have a good idea of this), high level approach guidance 50%
- Dispatch agent to work on it 5%
- Occasionally return and nudge agent + send /simplify or /code-review 5%
- Look at the code/session summary, divergences from the plan, ask followup questions 40%
Occasionally yes there is some solution the AI chose that is suboptimal and I would prefer fixed in a different way. Mostly though it's straightforward.
Are you thinking of the /effort level in Claude Code? I would just go with xhigh as a reasonable default. Most important thing in prompting is specifying what "done" and "success" looks like to you. Ask Claude to help you come up with a well formed request and spend most of your time on that, then paste that into a brand new session.
No more like is there a specific slash tool to be using when coding or planning. I guess that’s just Claude code in general but since there’s a specific review tool I was curious about specific coding tools
And why would someone use the various levels? Is a low code review even worth running? And how do I know what level to use in the first place?
This stuff all seems so nebulous to me and I’ve yet to see anything that says use x in y situation. So I default to higher effort levels than I likely need.
Hey Boris, thanks for the great product and for listening!
I find the mix between slash commands that are programmatic harness configuration and control commands (/config, /model, /feedback, /fork, /usage, etc.) and ones that are little more than prompt template insertion (/code-review, /<skill>, etc.) to be a little confusing and unnecessary. A slash command should be one thing, and one thing only: a command for the harness, not the agent.
When I invoke a slash command like /code-review, I should be invoking some additional harness functionality, something above and beyond the agent's sphere of influence - not just pasting some hidden text into the next turn. Otherwise, why wouldn't I just say "Claude, review this code"?
Yet most of these "added value" commands bloating the slash command list, are just shortcuts for copy and paste. I don't want to go to have to learn the syntax of a special /code-review command (which options are positional args, which are --flags, etc.), and I'm much less likely to use or even be aware of a command like this, when I can just ask "Do a balanced code review and fix the issues", or use the GUI to set the effort level to xhigh before asking "Review my code." That way I can also be more specific about exactly what I need, rather than relying on what's in the canned prompt - a prompt which I'll probably never read and vet myself anyway. The value added by the slash command needs to be really high compared to just typing a prompt, for it to justify the friction of discovery and learning the syntax.
So I suppose I'm advocating for a different system. Keep slash commands for meta-level harness control and configuration, and add a new mechanism for canned prompt insertion, one which is tailor made for that purpose rather than overloading the slash command system. Let the user see what's in the canned prompts, and even make adjustments or edits as needed before sending them, one-time or persisted. Provide a GUI in the app with the user's favorite prompts, where the user can add, delete, and edit them, making it easy to invoke and insert them as needed. Or let the agent automatically discover and use them as needed, rather than requiring the user to remember and recall their magic shortcuts and their arguments. That's just one idea.
Skills, plugins, commands, and so on, need to be consolidated not just for code review of course but across the full architecture of how prompt templates are managed.
What clicked for me recently was treating skills as composable. Having meta-skills that call smaller skills in order. The "skill vs command vs subagent" confusion partly dissolves once you let skills call other skills. The meta-skill holds the workflow state, the smaller ones each do one job well.
> # do an expensive and extremely thorough review (reliably catches >99% of bugs, costs $3-20 per review depending on complexity):
/code-review ultra
main suggestion would be to sound a lot less optimistic about that it finds 99% of bugs or that its at all thorough, and instead list that it is time capped, and will only find bugs that you explicitly tell it to look for.
i used my three runs of ultrareview.
the first run with no other prompting found a couple typos in markdown only
the second one i prompted it with several themes of known open bugs in the code, and it found 6 items
and then the third one i ran after doing an actual long audit through gemini to make a much more detailed prompt about issues in the code
and for that one, instead of doing an exhaustive run, it just never started, so no idea if it worked
but the experience had no relation at all with the reliability or thoroughness claims
Hey Boris, some feedback. I like the new /code-review skill but was disappointed you guys removed /simplify because I quite liked the focus on finding code reuse/efficiency opportunities.
I see now in 2.1.152 you added those focus areas back to /code-review, but still bundled with the correctness finding. It would be great to have more fine grained control over the /code-review angles beyond just effort level. Or maybe you would recommend that I just specify that as freeform input after effort level?
Yep, you can add free-form input. Will update /simplify to only check for code quality and not bugs (the way it used to work), that's a good suggestion.
We did both -- we did a number of UI iterations (eg. improving thinking loading states, making it more clear how many tokens are being downloaded, etc.). But we also reduced the default effort level after evals and dogfooding. The latter was not the right decision, so we rolled it back after finding that UX iterations were insufficient (people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this).
Having a "Recovery Mode"/"Safe Boot" flag to disable our configurations (or progressively enable) to see how claude code responds would be nice. Sometimes I get worried some old flag I set is breaking things. Maybe the flag already exists? I tried Claude doctor but it wasn't quite the solution.
For instance:
Is Haiku supposed to hit a warm system-prompt cache in a default Claude code setup?
I had `DISABLE_TELEMETRY=1` in my env and found the haiku requests would not hit a warm-cached system prompt. E.g. on first request just now w/ most recent version (v2.1.118, but happened on others):
w/ telemetry off - input_tokens:10 cache_read:0 cache_write:28897 out:249
w/ telemetry on - input_tokens:10 cache_read:24344 cache_write:7237 out:243
I used to think having so many users was leading to people hitting a lot of edge cases, 3 million users is 3 million different problems. Everyone can't be on the happy path. But then I started hitting weird edge cases and started thinking the permutations might not be under control.
> people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this
UI is UI. It is naive to expect that you build some UI but users will "just magically" find out that they should use it as a terminal in the first place.
It took you a month to revert after multiple complaints. You still blamed users for using the product exactly as you advertised it. And all of your official channels were completely quite for two months, whether it was about new draconian peak hour limits, or about the new defaults, or about exponentially increasing token costs.
People literally started seeing issues immediately as you changed the defaults: https://x.com/levelsio/status/2029307862493618290 And despite a huge amount of reports you still kept it for a whole month.
And then you shipped a completely untested feature with prompt cache misses and literally gaslit users and blamed users for using the product as advertised.
Now untold umber of people have been hit by these changes, so as an apology you reset usage limits three hours before they would reset anyway.
Good job.
Edit. By the way, a very telling sentence from the report:
--- start quote ---
We’ll ensure that a larger share of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features); and we'll make improvements to our Code Review tool that we use internally
--- end quote ---
Translation: no one is using or even testing the product we ship, and we blindly trust Claude Code to review and find bugs for us. Last one isn't even a translation: https://x.com/bcherny/status/2017742750473720121
Off topic, but I'm hoping you'll maybe see this. There's been an issue with the VS code extension that makes it pretty much impossible to use (PreToolUse can't intercept permission requests anymore, using PermissionRequest hooks always open the diff viewer and steals focus):
Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.
The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.
We tried a few different approaches to improve this UX:
1. Educating users on X/social
2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)
3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.
Hope this is helpful. Happy to answer any questions if you have.
I appreciate the reply, but I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.
I feel like that is a choice best left up to users.
i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"
Another way to think about it might be that caching is part of Anthropic's strategy to reduce costs for its users, but they are now trying to be more mindful of their costs (probably partly due to significant recent user growth as well as plans to IPO which demand fiscal prudence).
Perhaps if we were willing to pay more for our subscriptions Anthropic would be able to have longer cache windows but IDK one hour seems like a reasonable amount of time given the context and is a limitation I'm happy to work around (it's not that hard to work around) to pay just $100 or $200 a month for the industry-leading LLM.
Full disclosure: I've recently signed up for ChatGPT Pro as well in addition to my Claude Max sub so not really biased one way or the other. I just want a quality LLM that's affordable.
I might be willing to pay more, maybe a lot more, for a higher subscription than claude max 20x, but the only thing higher is pay per token and i really dont like products that make me have to be that minutely aware of my usage, especially when it has unpredictability to it. I think there's a reason most telecoms went away from per minute or especially per MB charging. Even per GB, as they often now offer X GB, and im ok with that on phone but much less so on computer because of the unpredictability of a software update size.
Kinda like when restaurants make me pay for ketchup or a takeaway box, i get annoyed, just increase the compiled price.
For sure, I agree with that sentiment. It's interesting to consider the psychological component of that, like how "free shipping" is not really free, it's oftentimes just packaged into the price of the product but somehow it feels like we're getting a better deal.
I would not be surprised to see Anthropic, OpenAI etc head in the direction you mention as they mature and all of these datacenters currently undergoing construction come online in the next few years and drive down costs.
That doesn’t make sense to pay more for cache warming. Your session for the most part is already persisted. Why would it be reasonable to pay again to continue where you left off at any time in the future?
I’m coming at this as a complete Claude amateur, but caching for any other service is an optimisation for the company and transparent for the user. I don’t think I’ve ever used a service and thought “oh there’s a cache miss. Gotta be careful”.
I completely agree that it’s infeasible for them to cache for long periods of time, but they need to surface that information in the tools so that we can make informed decisions.
That is because LLM KV caching is not like caches you are used to (see my other comments, but it's 10s of GB per request and involves internal LLM state that must live on or be moved onto a GPU and much of the cost is in moving all that data around). It cannot be made transparent for the user because the bandwidth costs are too large a fraction of unit economics for Anthropic to absorb, so they have to be surfaced to the user in pricing and usage limits. The alternative is a situation where users whose clients use the cache efficiently end up dramatically subsidizing users who use it inefficiently, and I don't think that's a good solution at all. I'd much rather this be surfaced to users as it is with all commercial LLM apis.
Think of it like this: Anthropic has to keep a full virtual machine running just for you. How long should it idle there taking resources when you only pay a static monthly fee and not hourly?
They have a limited number of resources and can’t keep everyone’s VM running forever.
That price at Vultr gets you 1GB of RAM, and 25GB of relatively slow SSD.
The KV cache of your Claude context is:
- Potentially much larger than 25GB. (The KV cache sizes you see people quoting for local models are for smaller models.)
- While it's being used, it's all in RAM.
- Actually it's held in special high-performance GPU RAM, precision-bonded directly to the silicon of ludicrously expensive, state of the art GPUs.
- The KV state memory has to be many thousands of times faster than your 25GB state.
- It's much more expensive per GB than the CPU memory used by a VM. And that in turn is much more expensive than the SSD storage of your 25GB.
- Because Claude is used by far more people (and their agents) than rent VMs, far more people are competing to use that expensive memory at the same time
There is a lot going on to move KV cache state between GPU memory and dedicated, cheaper storage, on demand as different users need different state. But the KV cache data is so large, and used in its entirety when the context is active, that moving it around is expensive too.
Now check out the cost difference in 25GB of computer RAM vs GPU RAM.
And yes, this is also why computer RAM has jumped the shark in costs.
The bandwidth differences in total data transferred per hour aren't even in the same 5 orders of magnitude between your server and the workloads LLMs are doing. And this is why the compute and power markets are totally screwed.
Genuine question: is the cost to keep a persistent warmed cache for sessions idling for hours/days not significant when done for hundreds of thousands of users? Wouldn’t it pose a resource constraint on Anthropic at some point?
No, the cache is a few GB large for most usual context sizes. It depends on model architecture, but if you take Gemma 4 31B at 256K context length, it takes 11.6GB of cache
note: I picked the values from a blog and they may be innacurate, but in pretty much all model the KV cache is very large, it's probably even larger in Claude.
To extend your point: it's not really the storage costs of the size of the cache that's the issue (server-side SSD storage of a few GB isn't expensive), it's the fact that all that data must be moved quickly onto a GPU in a system in which the main constraint is precisely GPU memory bandwidth. That is ultimately the main cost of the cache. If the only cost was keeping a few 10s of GB sitting around on their servers, Anthropic wouldn't need to charge nearly as much as they do for it.
That cost that you're talking about doesn't change based on how long the session is idle. No matter what happens they're storing that state and bring it back at some point, the only difference is how long it's stored out of GPU between requests.
Are you sure about that? They charge $6.25 / MTok for 5m TTL cache writes and $10 / MTok for 1hr TTL writes for Opus. Unless you believe Anthropic is dramatically inflating the price of the 1hr TTL, that implies that there is some meaningful cost for longer caches and the numbers are such that it's not just the cost of SSD storage or something. Obviously the details are secret but if I was to guess, I'd say the 5m cache is stored closer to the GPU or even on a GPU, whereas the 1hr cache is further away and costs more to move onto the GPU. Or some other plausible story - you can invent your own!
Storing on GPU would be the absolute dumbest thing they could do. Locking up the GPU memory for a full hour while waiting for someone else to make a request would result in essentially no GPU memory being available pretty rapidly. This type of caching is available from the cloud providers as well, and it isn't tied to a single session or GPU.
> Storing on GPU would be the absolute dumbest thing they could do
No. It’s not dumb. There will be multiple cache tiers in use, with the fastest and most expensive being on-GPU VRAM with cache-aware routing to specific GPUs and then progressive eviction to CPU ram and perhaps SSD after that. That is how vLLM works as you can see if you look it up, and you can find plenty of information on the multiple tiers approach from inference providers e.g. the new Inference Engineering book by Philip Kiely.
You are likely correct that the 1hr cached data probably mostly doesn’t live on GPU (although it will depend on capacity, they will keep it there as long as they can and then evict with an LRU policy). But I already said that in my last post.
Yesterday I was playing around with Gemma4 26B A4B with a 3 bit quant and sizing it for my 16GB 9070XT:
Total VRAM: 16GB
Model: ~12GB
128k context size: ~3.9GB
At least I'm pretty sure I landed on 128k... might have been 64k. Regardless, you can see the massive weight (ha) of the meager context size (at least compared to frontier models).
Exactly, even in the throes of today's wacky economic tides, storage is still cheap. Write the model state immediately after the N context messages in cache to disk and reload without extra inference on the context tokens themselves. If every customer did this for ~3 conversations per user you still would only need a small fraction of a typical datacenter to house the drives necessary. The bottleneck becomes architecture/topology and the speed of your buses, which are problems that have been contended with for decades now, not inference time on GPUs.
This has nothing to do with the cost of storage. Surprisingly, you are not better informed than Anthropic on the subject of serving AI inference models.
The reason I've been querying the 1 hour is a user's quota resets are often longer than that, as a result I've seen situations where someone builds a large context, then hits their quota limit, waits 2+ hours, their cache is gone, their first message then eats 20%+ of their current session quota and the user doesn't want to compact as they're still trying to get the model into a good understanding of the problem, this seems to be a really painful consequence for users on anything less than a max plan which seems like an unintended consequence of Anthropic's own system design choices?
IE How their quota and caching interact with each other, it doesn't make pro and max a little different, it makes it significantly different by unintentionally penalising pro users
a countdown clock telling you that you should talk to the model again before your streak expires? that's the kind of UX i'd expect from an F2P mobile game or an abandoned shopping cart nag notification
Well sure if you put it that way, they're similar. But it's either you don't see it and you get surprised by increased quota usage, or you do see it and you know what it means. Bonus points if they let you turn it off.
Plenty of room for a middle ground, like a static timestamp per session that shows expiration time, without the distraction of a constantly changing UI element.
But perhaps Claude Code could detect that you're actively working on this stuff (like typing a prompt or accessing the files modified by the session), and send keep-cache-alive pings based on that? Presumably these pings could be pretty cheap, as the kv-cache wouldn't need to be loaded back into VRAM for this. If that would work reliably, cache expiry timeouts could be more aggressive (5 min instead of an hour).
I tried to hack the statusline to show this but when i tried, i don't think the api gave that info. I'd love if they let us have more variables to access in the statusline.
Nit: It doesn’t have to live in GPU memory. The system will use multiple levels of caching and will evict older cached data to CPU RAM or to disk if a request hasn’t recently come in that used that prefix. The problem is, the KV caches are huge (many GB) and so moving them back onto the GPU is expensive: GPU memory bandwidth is the main resource constraint in inference. It’s also slow.
The larger point stands: the cache is expensive. It still saves you money but Anthropic must charge for it.
Edit: there are a lot of comments here where people don't understand LLM prefix caching, aka the KV cache. That's understandable: it is a complex topic and the usual intuitions about caching you might have from e.g. web development don't apply: a single cache blob for a single request is in the 10s of GB at least for a big model, and a lot of the key details turn on the problems of moving it in and out of GPU memory. The contents of the cache is internal model state; it's not your context or prompt or anything like that. Furthermore, this isn't some Anthropic-specific thing; all LLM inference with a stable context prefix will use it because it makes inference faster and cheaper. If you want to read up on this subject, be careful as a lot of blogs will tell you about the KV cache as it is used within inference for an single request (a critical detail concept in how LLMs work) but they will gloss over how the KV cache is persisted between requests, which is what we're all talking about here. I would recommend Philip Kiely's new book Inference Engineering for a detailed discussion of that stuff, including the multiple caching levels.
> I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.
You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
What is being discussed is KV caching [0], which is used across every LLM model to reduce inference compute from O(n^2) to O(n). This is not specific to Claude nor Anthropic.
> How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?
1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.
2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.
> You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.
They are caching internal LLM state, which is in the 10s of GB for each session. It's called a KV cache (because the internal state that is cached are the K and V matrices) and it is fundamental to how LLM inference works; it's not some Anthropic-specific design decision. See my other comment for more detail and a reference.
> 99.99% of users won't even understand the words that are being used.
That's a bad estimate. Claude Code is explicitly a developer shaped tool, we're not talking generically ChatGPT here, so my guess is probably closer to 75% of those users do understand what caching is, with maybe 30% being able to explain prompt caching actually is. Of course, those users that don't understand have access to Claude and can have it explain what caching is to them if they're interested.
I somewhat disagree that this is due diligence. Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.
> Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.
Does mmap(2) educate the developer on how disk I/O works?
At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.
That might be an absurd comparison, but we can fix that.
If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs, then:
You wouldn’t “need” to understand. The prints would complete regardless. But you might want to. Personal preference.
>If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs,
and the system was being run by some of the planet’s brightest people whose famous creation is well known to disseminate complex information succinctly,
>then:
You would expect to be led to understand, like… a 1997 Prius.
“This feature showed the vehicle operation regarding the interplay between gasoline engine, battery pack, and electric motors and could also show a bar-graph of fuel economy results.” https://en.wikipedia.org/wiki/Toyota_Prius_(XW10)
There are open-source and even open-weight models that operate in exactly this way (as it's based off of years of public research), and even if there weren't the way that LLMs generate responses to inputs is superbly documented.
Seems like every month someone writes up a brilliant article on how to build an LLM from scratch or similar that hits the HN page, usually with fancy animated blocks and everything.
It's not at all hard to find documentation on this topic. It could be made more prominent in the U/I but that's true of lots of things, and hammering on "AI 101" topics would clutter the U/I for actual decision points the user may want to take action upon that you can't assume the user already knows about in the way you (should) be able to assume about how LLMs eat up tokens in the first place.
Okay, sure. There's a dollar/intelligence tradeoff. Let me decide to make it, don't silently make Claude dumber because I forgot about a terminal tab for an hour. Just because a project isn't urgent doesn't mean it's not important. If I thought it didn't need intelligence I would use Sonnet or Haiku.
To some extent I'd say it is indeed reasonable. I had observed the effect for a while: if I walked away from a session I noticed that my next prompt would chew up a bunch of context. And that led me to do some digging, at which point I discovered their prompt caching.
So while I'd agree with your sarcasm that expecting users to be experts of the system is a big ask, where I disagree with you is that users should be curious and actively attempting to understand how it works around them. Given that the tooling changes often, this is an endless job.
> users should be curious and actively attempting to understand how it works
Have you ever talked with users?
> this is an endless job
Indeed. If we spend all our time learning what changed with all our tooling when it changes without proper documentation then we spend all our working lives keeping up instead of doing our actual jobs.
There are general users of the average SaaS, and there are claude code users. There's no doubt in my mind that our expectations should be somewhat higher for CC users re: memory. I'm personally not completely convinced that cache eviction should be part of their thought process while using CC, but it's not _that_ much of a stretch.
Personally I've never thought about cache eviction as it pertains to CC. It's just not something that I ever needed to think about. Maybe I'm just not a power user but I just use the product the way I want to and it just works.
This oversells how obfuscated it is. I'm far from a power user, and the opposite of a vibe coder. Yet I noticed the effect on my own just from general usage. If I can do it, anyone can do it.
My point is the opposite. I don't think my observation was smart, and I'm surprised to so many people here, a venue with a lot of people who use this stuff far more than I do, think it wasn't an easy to grok thing.
I’m not. Why would anyone believe marketing speak for any product? One should always assume that at best they’re fluffing their product up and more likely that they’re telling straight up lies
1. False advertisement is a thing, to the point there are laws against it
2. They were caught blatantly lying, and you're literally telling everyone it's the users' fault for not digging into the black box that is Claude Code (and more so Anthropic's servers) and figuring its behavior for themselves. A behavior that suddenly changed on a March day [1] and which previously very few people ever needed to investigate.
I'm not saying this is a great state of affairs. But I'm saying that it's so pervasive in daily life that yes, at least part of the blame lies on users for not taking this into account. As a developer it's important to at least try to understand the tools and libraries on which one relies. Relying on magic black boxes is not a good plan on the user's part, and they need to be defensive about this. Too many developers have been more than happy to hand the keys over to the AI assistants and hope for the best.
Also it wasn't completely undocumented, rather it was hiding in not-quite-plain sight. Which itself is a bit duplicitous, but again something that's far from unique on the part of Anthropic.
I believe if one were to read my post it'd have been clear that I *am* a user.
This *is* "hacker" news after all. I think it's a safe assumption that people sitting here discussing CC are an inquisitive sort who want to understand what's under the hood of their tools and are likely to put in some extra time to figure it out.
We're inquisitive but at the end of the day many of us just want to get our work done. If it's a toy project, sure. Tinker away, dissect away. When my boss is breathing down my neck on why a feature is taking so long? No time for inquiries.
Agreed. systems work the way they work. Its up to the user to determining what those limitations are. I don't like the concept of molding software based on every expectation a user has. Sometimes that expectation is unwarranted. You can see this in game development. Regardless of expressed criticism, sometimes gamers don't know what they want or what they need. A game should be developed by the design goals of the team, not cater to every whim the player base wants. We have seen were that can go.
It is more useful to read posts and threads like this exact thread IMO. We can't know everything, and the currently addressed market for Claude Code is far from people who would even think about caching to begin with.
It seems you haven't done the due diligence on what the parent meant :)
It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.
It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.
You not only skipped the diligence but confused everyone repeating what I said :(
that is what caching is doing. the llm inference state is being reused. (attention vectors is internal artefact in this level of abstraction, effectively at this level of abstraction its a the prompt).
The part of the prompt that has already been inferred no longer needs to be a part of the input, to be replaced by the inference subset. And none of this is tokens.
>It seems you haven't done the due diligence on what part of the API is expensive - constructing a prompt shouldn't be same charge/cost as llm pass.
I think you missed what the parent meant then, and the confusing way you replied seemed to imply that they're not doing inference caching (the opposite of what you wanted to mean).
The parent didn't said that caching is needed to merely avoid reconstructing the prompt as string. He just takes that for granted that it means inference caching, to avoid starting the session totally new. That's how I read "from prompting with the entire context every time" (not the mere string).
So when you answered as if they're wrong, and wrote "constructing a prompt shouldn't be same charge/cost as llm pass", you seemed to imply "constructing a prompt shouldn't be same charge/cost as llm pass [but due to bad implementation or overcharging it is]".
You are right, I was wrong in my understanding there. It stemmed from my own implementation; an inference often wrote extra data such as tool call, so I was using it to preserve relevant information alongwith desired output, to be able to throw away the prompt every time. I realize inference caching is one better way (with its pros and cons).
I said "prompting with the entire context every time," I think it should be clear even to laypersons that the "prompting" cost refers to what the model provider charges you when you send them a prompt.
10s of GBs? ( 1,000,000 context * 1,000 vector size ) ^ 2 = 1,000,000,000,000,000,000… oh wow.. I must be miscalculating
What about only storing the conversation and then recomputing the embeddings in the cache? Does that cost a lot? Doing a lot of matrix multiplication does not cost dollars of compute, especially on specialized hardware, right?
Context length 1e6, vector length 1e3, and 1e2 model layers for 100e9 context size. Costs will go up even more with a richer latent space and more model layers, and the western frontier outfits are reasonably likely to be maximizing both.
If there was an exponential cost, I would expect to see some sort of pricing based on that. I would also expect to see it taking exponentially longer to process a prompt. I don't believe LLMs work like that. The "scary quadratic" referenced in what you linked seems to be pointing out that cache reads increase as your conversation continues?
If I'm running a database keeping track of a conversation, and each time it writes the entire history of the conversation instead of appending a message, are we calling that O(N^2) now?
Yes, that is indeed O(N^2). Which, by the way, is not exponential.
Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.
> Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.
Touché. Still, to a reasonable approximation, caching makes the dominant term linear, or equiv, linearly scales the expensive bits.
> I would also expect to see it taking exponentially longer to process a prompt. I don't believe LLMs work like that.
Try this out using a local LLM. You'll see that as the conversation grows, your prompts take longer to execute. It's not exponential but it's significant. This is in fact how all autoregressive LLMs work.
What we would call O(n^2) in your rewriting message history would be the case where you have an empty database and you need to populate it with a certain message history. The individual operations would take 1, 2, 3, .. n steps, so (1/2)*n^2 in total, so O(n^2).
This is the operation that is basically done for each message in an LLM chat in the logical level: the complete context/history is sent in to be processed. If you wish to process only the additions, you must preserve the processed state on server-side (in KV cache). KV caches can be very large, e.g. tens of gigabytes.
How big this cached data is? Wouldn't it be possible to download it after idling a few minutes "to suspend the session", and upload and restore it when the user starts their next interaction?
I often see a local model QWEN3.5-Coder-Next grow to about 5 GB or so over the course of a session using llamacpp-server. I'd better these trillion parameter models are even worse. Even if you wanted to download it or offload it or offered that as a service, to start back up again, you'd _still_ be paying the token cost because all of that context _is_ the tokens you've just done.
The cache is what makes your journey from 1k prompt to 1million token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.
What they mean when they say 'cached' is that it is loaded into the GPU memory on anthropic servers.
You already have the data on your own machine, and that 'upload and restore' process is exactly what is happening when you restart an idle session. The issue is that it takes time, and it counts as token usage because you have to send the data for the GPU to load, and that data is the 'tokens'.
Wrong on both counts. The kv-cache is likely to be offloaded to RAM or disk. What you have locally is just the log of messages. The kv-cache is the internal LLM state after having processed these messages, and it is a lot bigger.
I shouldn't have said 'loaded into GPU memory', but my point still stands... the cached data is on the anthropic side, which means that caching more locally isn't going to help with that.
> upload and restore it when the user starts their next interaction
The data is the conversation (along with the thinking tokens).
There is no download - you already have it.
The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.
That is doable, but as Boris notes it costs lots of tokens.
A strange view. The trade-off has nothing to do with a specific ideology or notable selfishness. It is an intrinsic limitation of the algorithms, which anybody could reasonably learn about.
Sure, the exact choice on the trade-off, changing that choice, and having a pretty product-breaking bug as a result, are much more opaque. But I was responding to somebody who was surprised there's any trade-off at all. Computers don't give you infinite resources, whether or not they're "servers," "in the cloud," or "AI."
He was surprised because it was not clearly communicated. There's a lot of theory behind a product that you could (or could not) better understand, but in the end, something like price doesn't have much to do with the theoretical and practical behavior of the actual application.
It'd probably be helpful for power users and transparency to actually show how the cache is being used. If you run local models with llamacpp-server, you can watch how the cache slots fill up with every turn; when subagents spawn, you see another process id spin up and it takes up a cache slot; when the model starts slowing down is when the context grows (amd 395+ around 80-90k) and the cache loads are bigger because you've got all that.
So yeah, it doesn't take much to surface to the user that the speed/value of their session is ephemeral because to keep all that cache active is computationally expensive because ...
You're still just running text through a extremely complex process, and adding to that text and to avoid re-calculation of the entire chain, you need the cache.
I too would far rather bear a token cost than have my sessions rot silently beneath my feet. I usually have ~5 running CC sessions, some of which I may leave for a week or two of inactivity at a time.
Yes, me too. This is good to know, but basically it means I can’t rely on old conversations any more. Using a “handoff” file to try and start a new conversation is effectively the same thing as what they did under the hood. So yeah, you can’t rely on old conversations to be as informed when you pick it back up.
Instead of just dropping all the context, the system could also run a compaction (summarizing the entire convo) before dropping it. Better to continue with a summary than to lose everything.
Is there a way to say: I am happy to pay a premium (in tokens or extra usage) to make sure that my resumed 1h+ session has all the old thinking?
I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.
For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.
Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?
I think it’s crazy that they do this, especially without any notice. I would not have renewed my subscription if I knew that they started doing this.
Especially in the analysis part of my work I don‘t care about the actual text output itself most of the time but try to make the model „understand“ the topic.
In the first phase the actual text output itself is worthless it just serves as an indicator that the context was processed correctly and the future actual analysis work can depend on it.
And they‘re… just throwing most the relevant stuff out all out without any notice when I resume my session after a few days?
This is insane, Claude literally became useless to me and I didn’t even know it until now, wasting a lot of my time building up good session context.
There would be nothing lost if they said „If you click yes, we will prune your old thinking making Claude faster and saving you tons of tokens“. Most people would say yes probably so why not ask them… make it an env variable (that is announced not a secretly introduced one to opt out of something new!) or at least write it in a change log if they really don’t want to allow people to use it like before, so there‘d be chance to cancel the subscription in time instead of wasting tons of time on work patterns that not longer work
Pointing at their terms of service will definitely be the instantly summoned defense (as would most modern companies) but the fact that SaaS can so suddenly shift the quality of product being delivered for their subscription without clear notification or explicitly re-enrollment is definitely a legal oversight right now and Italy actually did recently clamp down on Netflix doing this[1]. It's hard to define what user expectations of a continuous product are and how companies may have violated it - and for a long time social constructs kept this pretty in check. As obviously inactive and forgotten about subscriptions have become a more significant revenue source for services that agreement has been eroded, though, and the legal system has yet to catch up.
1. Specifically, this suite was about price increases without clear consideration for both parties - but the same justifications apply to service restrictions without corresponding price decreases.
> Our systems will smartly ignore any reasoning items that aren’t relevant to your functions, and only retain those in context that are relevant. You can pass reasoning items from previous responses either using the previous_response_id parameter, or by manually passing in all the output items from a past response into the input of a new one.
So to defend a litte, its a Cache, it has to go somewhere, its a save state of the model's inner workings at the time of the last message. so if it expires, it has to process the whole thing again. most people don't understand that every message the ENTIRE history of the conversion is processed again and again without that cache. That conversion might of hit several gigs worth of model weights and are you expecting them to keep that around for /all/ of your conversions you have had with it in separate sessions?
No? It's not because it's a cache, it's because they're scared of letting you see the thinking trace. If you got the trace you could just send it back in full when it got evicted from the cache. This is how open weight models work.
The issue is that if they send the full trace back, it will have to be processed from the start if the cache expired, and doing that will cause a huge one-time hit against your token limit if the session has grown large.
So what Boris talked about is stripping things out of the trace that goes back to regenerate the session if the cache expires. Doing this would help avert burning up the token limit, but it is technically a different conversation, so if CC chooses poorly on stripping parts of the context then it would lead to Claude getting all scatter-brained.
They literally can. They could make the API free to use if they wanted. There is no law that states that costs have to equal the cost it takes to process the request.
I’m not familiar with the Claude API but OpenAI has an encrypted thking messages option. You get something that you can send back but it is encrypted. Not available on Anthropic?
No of course it’s unrealistic for them to hold the cache indefinitely and that’s not the point. You are keeping the session data yourself so you can continue even after cache expiry. The point I‘m making is that it made me very angry that without any announcement they changed behavior to strip the old thinking even when you have it in your session file. There is absolutely no reason to not ask the user about if they want this
And it’s part of a larger problem of unannounced changes it‘s just like when they introduced adaptive thinking to 4.6 a few weeks ago without notice.
Also they seem to be completely unaware that some users might only use Claude code because they are used to it not stripping thinking in contrast to codex.
Anyway I‘m happy that they saw it as a valid refund reason
It seems like an opportunity for a hierarchical cache. Instead of just nuking all context on eviction, couldn’t there be an L2 cache with a longer eviction time so task switching for an hour doesn’t require a full session replay?
Living where? If it's in the GPU, then it's still taking up precious space that could be used for serving other sessions. If it's not in the GPU, then it doesn't help.
what matters isn't that it's a cache; what matter is it's cached _in the GPU/NPU_ memory and taking up space from another user's active session; to keep that cache in the GPU is a nonstarter for an oversold product. Even putting into cold storage means they still have to load it at the cost of the compute, generally speaking because it again, takes up space from an oversold product.
> There would be nothing lost if they said „If you click yes, we will prune your old thinking making Claude faster and saving you tons of tokens“. Most people would say yes probably so why not ask them
The irony is that Claude Design does this. I did a big test building a design system, and when I came back to it, it had in the chat window "Do you need all this history for your next block of work? Save 120K tokens and start a new chat. Claude will still be able to use the design system." Or words to that effect.
This is exactly what also confused me. I had the exact same prompt in Claude code as well, and the no option implies you can also keep the whole history. But clicking keep apparently only ever kept the user and assistant messages not the whole actual thinking parts of the conversation
Not at the moment apparently. They remove the thinking messages when you continue after 1 hour. That was the whole idea of that change. So the LLM gets all your messages, its responses etc but not the thinking parts, why it generated that responses. You get a lobotomised session.
OK didn't know that. I also resume fairly old sessions with 100-200k of context, and I sometimes keep them active for a while (but with large breaks in between).
Still on Opus 4.6 with no adaptive thinking, so didn't really notice anything worse in the past weeks, but who knows.
Why cant you just build a project document that outlines that prompt that you want to do? Or have claude save your progress in memory so you can pick it up later? Thats what I do. It seems abhorrent to expect to have a running prompt that left idle for long periods of time just so you can pick up at a moments whim...
This violates the principle of least surprise, with nothing to indicate Claude got lobotomized while it napped when so many use prior sessions as "primed context" (even if people don't know that's what they were doing or know why it works).
The purpose of spending 10 to 50 prompts getting Claude to fill the context for you is it effectively "fine tunes" that session into a place your work product or questions are handled well.
// If this notion of sufficient context as fine tune seems surprising, the research is out there.)
Approaches tried need to deal with both of these:
1) Silent context degradation breaks the Pro-tool contract. I pay compute so I don't pay in my time; if you want to surface the cost, surface it (UI + price tag or choice), don't silently erode quality of outcomes.
2) The workaround (external context files re-primed on return) eats the exact same cache miss, so the "savings" are illusory — you just pushed the cost onto the user's time. If my own time's cheap enough that's the right trade off, I shouldn't be using your machine.
I don't envy you Boris. Getting flak from all sorts of places can't be easy. But thanks for keeping a direct line with us.
I wish Anthropic's leadership would understand that the dev community is such a vital community that they should appreciate a bit more (i.e. not nice sending lawyers afters various devs without asking nicely first, banning accounts without notice, etc etc). Appreciate it's not easy to scale.
OpenAI seems to be doing a much better job when it comes to developer relations, but I would like to see you guys 'win' since Anthropic shows more integrity and has clear ethical red lines they are not willing to cross unlike OpenAI's leadership.
I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session, not only the incremental question and answer.
(In understand under the hood that llms are n^2 by default but it's very counter intuitive - and given how popular cc is becoming outside of nerd circles, probably smaller and smaller fraction of users is aware of it)
I would like to decide on it case by case. Sometimes the session has some really deep insight I want to preserve, sometimes it's discardable.
I got exactly this warning message yesterday, saying that it could use up a significant amount of my token budget if I resumed the conversation without compaction.
Wouldn't it help if the system did compaction before the eviction happens? But the problem is that Claude probably don't want to automatically compact all sessions that have been left idle for one hour (and very likely abandoned already), that would probably introduce even more additional costs.
Maybe the UI could do that for sessions that the user hasn't left yet, when the deadline comes near.
I saw that too, but that's actually even worse on cache - the entire conversation is then a cache miss and needs to be loaded in in order to do the compaction. Then the resulting compacted conversation is also a cache miss.
You ideally want to compact before the conversation is evicted from cache. If you knew you were going to use the conversation again later after cache expiry, you might do this deliberately before leaving a session.
Anthropic could do this automatically before cache expiry, though it would be hard to get right - they'd be wasting a lot of compute compacting conversations that were never going to be resumed anyway.
> I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session
This feature has been live for a few days/weeks now, and with that knowledge I try remember to a least get a process report written when I'm for example close to the quota limit and the context is reasonably large. Or continue with a /compact, but that tends to lead to be having to repeat some things that didn't get included in the summary. Context management is just hard.
Thanks for giving more information. Just as a comment on (1), a lot of people don't use X/social. That's never going to be a sustainable path to "improve this UX" since it's...not part of the UX of the product.
It's a little concerning that it's number 1 in your list.
Then you need to update your documentation and teach claude to read the new documentation because here is what claude code answered:
Question: Hey claude, if we have a conversation, and then i take a break. Does it change the expected output of my next answer, if there are 2 hours between the previous message end the next one?
Answer: No. A 2-hour gap doesn't change my output. I have no internal clock between messages — I only see the conversation content plus the currentDate context injected each turn. The prompt cache may expire (5 min TTL), which affects
cost/latency but not the response itself.
The only things that can change output across a break: new context injected (like updated date), memory files being modified, or files on disk changing.
-- This answer directly contradict your post. It seems like the biggest problem is a total lack of documentation for expected behavior.
A similar thing happens if I ask claude code for the difference between plan mode, and accept edits on.
Then Claude told me the only difference was that with plan mode it would ask for permission before doing edits. But I really don't think this is true. It seems like plan mode does a lot more work, and present it in a total different way. It is not just a "I will ask before applying changes" mode.
This isn't how LLMs work. They aren't self aware like this, they're trained on the general internet. They might have some pointers to documentation for certain cases, but they generally aren't going to have specialized knowledge of themselves embedded within. Claude code has no need to know about its own internal programming, the core loop is just javascript code.
Don't be silly, they don't expect you to ask the Ai questions and get the right answers. Obviously if you want to know what's going on you should look at their first solution - check what advice they have posted on X...
These controversies erupt regularly, and I hope that you will see a common thing with most of them: you make a decision for your users without informing them.
Please fight this hubris. Your users matter. Many of us use your tools for everyday work and do not appreciate having the rug pulled from under them on a regular basis, much less so in an underhanded and undisclosed way.
I don't mind the bugs, these will happen. What I do not appreciate is secretly changing things that are likely to decrease performance.
That is not what I wrote. The phrases "without informing them", "in an underhanded and undisclosed way" and "secretly changing things" were important. I'm all for product evolution, but users should be informed when the product is changed, especially when the change can be for the worse (like dumbing down the model).
I've spent my entire working career dealing with companies that do the opposite. The product still goes stale. Find a better excuse.
You're acquiring users as a recurring revenue source. Consider stability and transparency of implementation details cost of doing business, or hemorrhage users as a result.
While I hate all the gaslighting Anthropic seems to do recently (and the fact that their harness broke the code quality, while they forbid use of third party harnesses), making decisions for users is what UX is.
See also the difference between eg. MacOS (with large M, the older good versions) and waiting for "Year of linux on desktop".
I don't think the issue is making decisions for users, but trying to switch off the soup tap in the all-you-can-eat soup bar. Or, wrong business model setting wrong incentives to both sides.
I leave sessions idle for hours constantly - that's my primary workflow. If resuming a 900k context session eats my rate limit, fine, show me the cost and let me decide whether to /clear or push through. You already show a banner suggesting /clear at high context - just do the same thing here instead of silently lobotomizing the model.
Have the tool maintain a doc, and use either the built-in memory or (I prefer it this way) your own. I've been pretty critical of some other aspects of how Claude Code works but on this one I think they're doing roughly the right thing given how the underlying completion machinery works.
Edit: If you message me I can share some of my toolchain, it's probably similar to what a lot of other people here use but I've done some polishing recently.
The cache is stored on Antropics servers, since its a save state of the LLM's weights at the time of processing. its several gigs in size. Every SINGLE TIME you send a message and its a cache miss you have to reprocess the entire message again eating up tons of tokens in the process
clarification though: the cache that's important to the GPU/NPU is loaded directly in the memory of the cards; it's not saved anywhere else. They could technically create cold storage of the tokens (vectors) and load that, but given how ephemeral all these viber coders are, it's unlikely there's any value in saving those vectors to load in.
So then it comes to what you're talking about, which is processing the entire text chain which is a different kind of cache, and generating the equivelent tokens are what's being costed.
But once you realize the efficiency of the product in extended sessions is cached in the immediate GPU hardware, then it's obvious that the oversold product can't just idle the GPU when sessions idle.
I'm writing this message even though I don't have much to add because it's often the case on HN that criticism is vocal and appreciation is silent and I'd like to balance out the sentiment.
Anthropic has fumbled on many fronts lately but engaging honestly like this is the right thing to do. I trust you'll get back on track.
> Engaging so directly with a highly critical audience is a minefield that you're navigating well.
They spent two months literally gaslighting this "critical audience" that this could not be happening and literally blaming users for using their vibe-coded slop exactly as advertised.
All the while all the official channels refused to acknowledge any problems.
Now the dissatisfaction and subscription cancellations have reached a point where they finally had to do something.
No mention of anything like "hey, we just fixed two big issues, one that lasted over a month." Just casual replies to everybody like nothing is wrong and "oh there's an issue? just let us know we had no idea!"
Don't forget "our investigation concluded you are to blame for using the product exactly as advertised" https://x.com/lydiahallie/status/2039800718371307603 including gems like "Sonnet 4.6 is the better default on Pro. Opus burns roughly twice as fast. Switch at session start"
Very easy to do when you stand to make tens of millions when your employer IPOs. Let's not maybe give too much praise and employ some critical thinking here.
We should encourage minimal dependency on multibillion tech companies like anthropic. They, and similar companies are just milking us… but since their toys are soo shiny, we don’t care
I'm also a Claude Code user from day 1 here, back from when it wasn't included in the Pro/Max subscriptions yet, and I was absolutely not aware of this either. Your explanation makes sense, but I naively was also under the impression that re-using older existing conversations that I had open would just continue the conversation as is and not be a treated as a full cache miss.
My biggest learning here is the 1 hour cache window. I often have multiple Claudes open and it happens frequently that they're idle for 1+ hours.
This cache information should probably get displayed somewhere within Claude Code
But.. that doesn't solve the problem of having no indication in-session when it'll lose the cache. A nudge to /clear does nothing to indicate "or else face significant cost" nor does it indicate "your cache is stale".
Instead of showing actual usage, costs and cache status you spent two months denying the issue even exists, making the product silently worse, and now you're "iterating on this"
> We tried a few different approaches to improve this UX:
1. Educating users on X/social
No. You had random
developers tweet and reply at random times to random users while all of your official channels were completely silent. Including channels for people who are not terminally online on X
There's a cultural divide between SV and the 85% of SMB using M365, for example. When everyone you know uses a thing, I mean, who doesn't?*
There's a reason live service games have splash banners at every login. No matter what you pick as an official e-coms channel, most of your users aren't there!
* To be fair, of all these firms, ANTHROP\C tries the hardest to remember, and deliver like, some people aren't the same. Starting with normals doing normals' jobs.
Resuming sessions after more than 1 hour is a very common workflow that many teams are following. It will be great if this is considered as an expected behaviour and design the UX around it. Perhaps you are not realising the fact that Claude code has replaced the shells people were using (ie now bash is replaced with a Claude code session).
I think thats a bad idea. It seems like expecting to have a prompt open like this, accumulating context puts a load on the back end. Its one of those things that is a bad habit. Like trying to maintain open tabs in a browser as a way to keep your work flow up to date when what you really should be doing is taking notes of your process and working from there.
I have project folders/files and memory stored for each session, when I come back to my projects the context is drawn from the memory files and the status that were saved in my project md files.
Create a better workflow for your self and your teams and do it the right way. Quick expect the prompt to store everything for you.
For the Claude team. If you havent already, I'd recommend you create some best practices for people that don't know any better, otherwise people are going to expect things to be a certain way and its going to cause a lot of friction when people cant do what the expect to be able to do.
Opus 4.7 loves doing complex, long-running tasks like deep research, refactoring code, building complex features, iterating until it hits a performance benchmark.
For very long-running tasks, I will either (a) prompt Claude to verify its work with a background agent when it's done... so Claude can cook without being blocked on me.
The long context window means fewer compactions and longer-running sessions. I've found myself starting new sessions much less frequently with 1 million context.
This just does not match my workflow when I work on low-priority projects, especially personal projects when I do them for fun instead of being paid to do them. With life getting busy, I may only have half an hour each night with Claude to make some progress on it before having to pause and come back the next day. It’s just the nature of doing personal projects as a middle-aged person.
The above workflow basically doesn’t hit the rate limit. So I’d appreciate a way to turn off this feature.
Why does the system work like that? Is the cache local, or on Claude's servers?
Why not store the prompt cache to disk when it goes cold for a certain period of time, and then when a long-lived, cold conversation gets re-initiated, you can re-hydrate the cache from disk. Purge the cached prompts from disk after X days of inactivity, and tell users they cannot resume conversations over X days without burning budget.
The cache is on Antropics server, its like a freeze frame of the LLM inner workings at the time. the LLM can pick up directly from this save state. as you can guess this save state has bits of the underlying model, their secret sauce. so it cannot be saved locally...
Maybe they could let users store an encrypted copy of the cache? Since the users wouldn't have Anthropic's keys, it wouldn't leak any information about the model (beyond perhaps its number of parameters judging by the size).
I'm unsure of the sizes needed for prompt cache, but I suspect its several gigs in size (A percentage of the model weight size), how would the user upload this every time they started a resumed a old idle session, also are they going to save /every/ session you do this with?
A few gigs of disk is not that expensive. Imo they should allocate every paying user (at least) one disk cache slot that doesn't expire after any time. Use it for their most recent long chat (a very short question-answer that could easily be replayed shouldn't evict a long convo).
I don't know how large the cache is, but Gemini guessed that the quantized cache size for Gemini 2.5 Pro / Claude 4 with 1M context size could be 78 gigabytes. ChatGPT guessed even bigger numbers. If someone is able to deliver a more precise estimate, you're welcome to :-).
So it would probably be a quite a long transfer to perform in these cases, probably not very feasible to implement at scale.
Whats lost on this thread is these caches are in very tight supply - they are literally on the GPUs running inference. the GPUs must load all the tokens in the conversation (expensive) and then continuing the conversation can leverage the GPU cache to avoid re-loading the full context up to that point. but obviously GPUs are in super tight supply, so if a thread has been dead for a while, they need to re-use the GPU for other customers.
They could let you nominate an S3 bucket (or Azure/GCP/etc equivalent). Instead of dropping data from the cache, they encrypt it and save it to the bucket; on a cache miss they check the bucket and try to reload from it. You pay for the bucket; you control the expiry time for it; if it costs too much you just turn it off.
Encryption can only ensure the confidentiality of a message from a non-trusted third party but when that non-trusted third party happens to be your own machine hosting Claude Code, then it is pointless. You can always dump the keys (from your memory) that were used to encrypt/decrypt the message and use it to reconstruct the model weights (from the dump of your memory).
jetbalsa said that the cache is on Anthropic's server, so the encryption and decryption would be server-side. You'd never see the encryption key, Anthropic would just give you an encrypted dump of the cache that would otherwise live on its server, and then decrypt with their own key when you replay the copy.
We at UT-Austin have done some academic work to handle the same challenge. Will be curious if serving engines could modified. https://arxiv.org/abs/2412.16434 .
The core idea is we can use user-activity at the client to manage KV cache loading and offloading. Happy to chat more!!
The main issue here is not UX, but rather that you did something which degraded quality without transparency. You should have documented this and also highlighted the change in an announcement. There should never be an undocumented change that reduces quality. There should never be something the user can do (or fail to do) that reduces quality without that being documented. To regain trust, Anthropic should make an announcement committing to documenting/announcing any future intentional quality-reducing changes.
In addition, the following is less important, but as other commenters have stated: walking away from a conversation and coming back to it more than an hour later is very common and it would be nice if there were a way for the user to opt to retain maximum quality (e.g. no dropped thinking) in this case. In the longer term, it would be nice if there were a way for the user to wait a few minutes for a stale session to resume, in exchange for not having a large amount of quota drained (ie have a 'slow mode' invoked upon session resumption that consumes less quota).
This sounds like one of those problems where the solution is not a UX tweak but an architecture change. Perhaps prompt cache should be made long term resumable by storing it to disk before discarding from memory?
I agree.. Maybe parts of the cache contents are business secrets.. But then store a server side encrypted version on the users disk so that it can be resumed without wasting 900k tokens?
But if you have a tiered cache, then waiting several seconds / minutes is still preferable to getting a cache miss. I suspect the larger problem is the amount of tinkering they are doing with the model makes that not viable.
reasonably, if i'm in an interactive session, its going to have breaks for an hour or more.
whats driving the hour cache? shouldnt people be able to have lunch, then come back and continue?
are you expecting claude code users to not attend meetings?
I think product-wise you might need a better story on who uses claude-code, when and why.
Same thing with session logs actually - i know folks who are definitely going to try to write a yearly RnD report and monthly timesheets based on text analysis of their claude code session files, and they're going to be incredibly unhappy when they find out its all been silently deleted
Prioritize outcomes for users using your product. That should lead to improving the viral/visibility aspect of documentation notification, as well as other aspects of documentation. Make this a differentiator of your product. Widespread misperceptions hurt outcomes.
Could you create one location educating advanced users, and:
• Promote, Organize and Maintain it
• Develop a group of users that have early access to "upcoming notifications we're working on"
• Perhaps give a third party specializing in making information visible responsibility for it
• Read comments by users in various places to determine what should be communicated. Just under this comment @dbeardsl begins "I appreciate the reply, but I was never under the impression that ...".
The speed that key users are informed of issues is critical. This is just off the top of my head, a much better plan I'm sure could be created.
I still don't understand it, yes it's a lot of data and presumably they're already shunting it to cpu ram instead of keeping it on precious vram, but they could go further and put it on SSD at which point it's no longer in the hotpath for their inference.
I don't think you can store the cache on client given the thinking is server side and you only get summaries in your client (even those are disabled by default).
If they really need to guard the thinking output, they could encrypt it and store it client side. Later it'd be sent back and decrypted on their server.
But they used to return thinking output directly in the API, and that was _the_ reason I liked Claude over OpenAI's reasoning models.
I assume they are already storing the cache on flash storage instead of keeping it all in VRAM. KV caches are huge - that’s why it’s impractical to transfer to/from the client. It would also allow figuring out a lot about the underlying model, though I guess you could encrypt it.
What would be an interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.
Just to contextualize this... https://lmcache.ai/kv_cache_calculator.html. They only have smaller open models, but for Qwen3-32B with 50k tokens it's coming up with 7.62GB for the KV cache. Imagining a 900k session with, say, Opus, I think it'd be pretty unreasonable to flush that to the client after being idle for an hour.
Yes — encryption is the solution for client side caching.
But even if it’s not — I can’t build a scenario in my head where recalculating it on real GPUs is cheaper/faster than retrieving it from some kind of slower cache tier
Isn't that exactly what people had been accusing Anthropic of doing, silently making Claude dumber on purpose to cut costs? There should be, at minimum, a warning on the UI saying that parts of the context were removed due to inactivity.
The entire reason I keep a long-lived session around is because the context is hard-won — in term of tokens and my time.
Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.
I’m looking back at my past few weeks of work and realizing that these few regressions literally wasted 10s of hours of my time, and hundreds of dollars in extra usage fees. I ran out of my entire weekly quota four days ago, and had to pause the personal project I was working on.
I was running the exact same pipeline I’ve run repeatedly before, on the same models, and yet this time I somehow ate a week’s worth of quota in less than 24h. I spent $400 just to finish the pipeline pass that got stuck halfway through.
I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.
> The entire reason I keep a long-lived session around is because the context is hard-won — in term of tokens and my time.
Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.
how does this help me as a customer? if i have to redo the context from scratch, i will pay both the high token cost again, but also pay my own time to fill it.
the cost of reloading the window didnt go away, it just went up even more
> I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.
I have to imagine this isn't helped by working somewhere where you effectively have infinite tokens and usage of the product that people are paying for, sometimes a lot.
This points to a fairly fundamental mismatch between the realities of running an LLM and the expectations of users. As a user, I _expect_ the cost of resuming X hours/days later to be no different to resuming seconds or minutes later. The fact that there is a difference, means it's now being compensated for in fairly awkward ways -- none of the solutions seem good, just varying degrees of bad.
Is there a more fundamental issue of trying to tie something with such nuanced costs to an interaction model which has decades of prior expectation of every message essentially being free?
> As a user, I _expect_ the cost of resuming X hours/days later to be no different to resuming seconds or minutes later.
As an informed user who understands his tools, I of course expect large uncached conversations to massively eat into my token budget, since that's how all of the big LLM providers work. I also understand these providers are businesses trying to make money and they aren't going to hold every conversation in their caches indefinitely.
I'd hazard a guess that there's a large gulf between proportion of users who know as much as you, and the total number using these tools. The fact that a message can perform wildly differently (in either cost, or behaviour if using one of the mitigations) based on whether I send it at t vs t+1 seems like a major UX issue, especially given t is very likely not exposed in the UI.
I drop sessions very frequently to resume later - that's my main workflow with how slow Claude is. Is there anything I can do to not encounter this cache problem?
Wow so that's why you did #2? The explanation in the CLI is really not clear. I thought it was just a suggestion to compact, no idea it was way more expensive than if I hadn't left it idle for an hour.
You guys really need to communicate that better in the CLI for people not on social
So you made this change completely invisible to the user, without the user being able to choose between the two behaviors, and without even documenting it in the (extremely verbose) changelog [1]? I can't find it, the Docs Assistant can't find it (well, it "I found it!" three times being fed your reply with a non-matching item).
I frequently debug issues while keeping my carefully curated but long context active for days. Losing potentially very important context while in the middle of a debugging session resulting in less optimal answers, is costing me a lot more money than the cache misses would.
In my eyes, Claude Code is mainly a context management tool. I build a foundation of apparent understanding of the problem domain, and then try to work towards a solution in a dialogue. Now you tell me Anthrophic has been silently breaking down that foundation without telling me, wasting potentially hours of my time.
It's a clear reminder that these closed-source harnesses cannot be trusted (now or in the future), and I should find proper alternatives for Claude Code as soon as possible.
I can see how this makes sense as a default behavior for cost conscious users. I would prefer to have the option for my company to pay more to rehydrate the cache than to have there be a model performance difference when having idled for an hour.
"We tried a few different approaches to improve this UX:
1. Educating users on X/social
2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)
3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post."
I see how these interventions help users reduce their token burn rate, but they don't address the need for an enterprise user to maintain quality.
A common workflow for me is kick off a prompt, commute home, eat dinner, follow up on prompt. Frequently 80K tokens or less in the context, frequently > 3 hours. Or when running multiple sessions it's easy to let a session idle for a few hours while I focus on one. Or many meetings might mean idle time for an hour.
Also, for enterprise users, I don't think education on X is a great place. There are people upskilling on this that never intentionally go on X.
First thing that comes to mind would be a weekly tip feed of footguns and underutilized functionality published to an anthropic website. "The Old New Thing" "Guru of the Week" "Abseil tips of the week" all have that format.
> Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)
I feel like I'm missing something here. Why would I revisit an old conversation only to clear it?
To me it sounds like a prompt-cache miss for a big context absolutely needs to be a per-instance warning and confirmation. Or even better a live status indicating what sending a message will cost you in terms of input tokens.
> that would be >900k tokens written to cache all at once
Probably that's why I hit my weekly limits 3-4 days ago, and was scheduled to reset later today. I just checked, and they are already reset.
Not sure if it's already done, shouldn't there be a check somewhere to alert on if an outrageous number of tokens are getting written, then it's not right ?
Appreciate the responses here. However, I feel like these responses are just to show us how much you know about the product and aren't actually helpful.
Instead, why don't you and Anthropic be more open about changes to these tools rather than waiting for users to complain, then investigating things after the fact that you should have investigated in the first place, and then posting on social media about all the cool tech details?
My company is tens of thousands strong. The amount of churn in Claude Code is a major issue and causing real awareness of the lack of stability and lack of customer support Anthropic provides.
And Claude Code is actually becoming a prototypical example of the dangers of vibe coded products and the burdens they place.
We hit limits, and we come back when the limit is lifted. Isn't it obvious sessions are going to stay idle for more than 1 hour when Claude itself is hitting the limits?
I switched to Codex, Claude has gotten to a point where it's just unusable for the regular Joe.
You need to seriously look at your corporate communications and hire some adults to standarise your messaging, comms and signals. The volatility behind your doors is obvious to us and you'd impress us much more if you slowed down, took a moment to think about your customers and sent a consistent message.
You lost huge trust with the A/B sham test. You lost trust with enshittification of the tokenizer on 4.6 to 4.7. Why not just say "hey, due to huge input prices in energy, GPU demand and compute constraints we've had to increase Pro from $20 to $30." You might lose 5% of customers. But the shady A/B thing and dodgy tokenizer increasing burn rate tells everyone inc. enterprise that you don't care about honesty and integrity in your product.
I hope this feedback helps because you still stand to make an awesome product. Just show a little more professionalism.
How big is the cache? Could you just evict the cache into cheap object storage and retrieve it when resuming? When the user starts the conversation back up show a "Resuming conversation... ⭕" spinner.
> The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users.
I dont agree with this being characterized as a "corner case".
Isn't this how most long running work will happen across all serious users?
I am not at my desk babysitting a single CC chat session all day. I have other things to attend to -- and that was the whole point of agentic engineering.
Dont CC users take lunch breaks?
How are all these utterly common scenarios being named as corner cases -- as something that is wildly out of the norm, and UX can be sacrificed for those cases?
Ahh that makes sense. Sometimes it's convenient to re-use an older conversation that has all the context I need. But maybe it's just the last 20% that's relevant.
It would be nice to be able to summarize/cut into a new leaner conversation vs having to coax all the context back into a fresh one. Something like keep the last 100,000 tokens.
I believe /compact achieves something like this? It just takes so long to summarize that it creates friction.
Ever since I heard about this behaviour I've been trying to figure out how to handle long running Claude sessions and so far every approach I've tried is suboptimal
It takes time to create a good context which can then trigger a decent amount of work in my experience, so I've been wondering how much this is a carefully tuned choice that's unlikely to change vs something adjustable
Just curious, is there a consolidated list of all these "education" tips?
Intuitively I understand this due to how context windows work and you're looking to increase cache hits, has Anthropic tried compact/summarise on idle as a configurable option? Seems to have decent tradeoffs + education in a setting.
For idle sessions I would MUCH rather pay the cost in tokens than reduced quality. Frankly, it's shocking to me that you would make that trade-off for users without their knowledge or consent.
Hi, thanks for Claude Code. I was wondering though if you'd considering adding a mode to make text green and characters come down from the top of the screen individually, like in The Matrix?
From a utility perspective using a tiered cache with some much higher latency storage option for up to n hours would be very useful for me to prevent that l1 cache miss.
Sorry but I think this should be left up to the user to decide how it works and how they want to burn their tokens. Also a countdown timer is better than all of these other options you mention.
Hi Boris! Wanted to let you know that I find those ads with you saying "now when you code, you use an agent" obnoxious because of that incorrect statement. I have no interest in slop coding. I find it way more ergonomic and effective to use code to tell a machine precisely what to do than to use English to tell it vaguely. I hate that your ad is misleading so many non-coders, who will actually believe your lie that nobody codes anymore. Probably doesn't help that YouTube was playing it as an interruption in every video I watched. I probably saw it 100 times and was getting to the "throw the remote at the tv" stage XD.
> Since the devs on HN (& the whole world) is buying what looks like nonsense to me - what am I missing?
Input tokens are expensive, since the whole model has to be run for each token. They're cheaper than output tokens because the model doesn't need to run the sampler, so some pipeline parallelism is possible, but on the other hand without caching the input token cost would have to be paid anew for each output token.
Prompt caching fixes that O(N^2) cost, but the cache itself is very heavyweight. It needs one entry per input token per model layer, and each entry is an O(1000)-dimensional vector. That carries a huge memory cost (linear in context length), and when cached that means the context's memory space is no longer ephemeral.
That's why a 'cache write' can carry a cost; it is the cost of both processing the input and committing the backing store for the cache duration.
Boris from the Claude Code team here. We agree, and will be spending the next few weeks increasing our investment in polish, quality, and reliability. Please keep the feedback coming.
For there to be any trust in the above, the tool needs to behave predictably day to day. It shouldn't be possible to open your laptop and find that Claude suddenly has an IQ 50 points lower than yesterday. I'm not sure how you can achieve predictability while keeping inference costs in check and messing with quantization, prompts, etc on the backend.
Maybe a better approach might be to version both the models and the system prompts, but frequently adjust the pricing of a given combination based on token efficiency, to encourage users to switch to cheaper modes on their own. Let users choose how much they pay for given quality of output though.
Sure, I've cancelled my Max 20 subscription because you guys prioritize cutting your costs/increasing token efficiency over model performance.
I use expensive frontier labs to get the absolute best performance, else I'd use an Open Source/Chinese one.
Frontier LLMs still suck a lot, you can't afford planned degradation yet.
My biggest problem with CC as a harness is that I can't trust "Plan" mode. Long running sessions frequently start bypassing plan mode and executing, updating files and stuff, without permission, while still in plan mode. And the only recovery seems to be to quit and reload CC.
Right now my solution is to run CC in tmux and keep a 2nd CC pane with /loop watching the first pane and killing CC if it detects plan mode being bypassed. Burning tokens to work around a bug.
Here's one person's feedback. After the release of 4.7, Claude became unusable for me in two ways: frequent API timeouts when using exactly the same prompts in Claude Code that I had run problem-free many times previously, and absurdly slow interface response in Claude Cowork. I found a solution to the first after a few days (add "CLAUDE_STREAM_IDLE_TIMEOUT_MS": "600000" to settings.json), but as of a few hours ago Cowork--which I had thought was fantastic, by the way--was still unusable despite various attempts to fix it with cache clearing and other hacks I found on the web.
hm. ml people love static evals and such, but have you considered approaches that typically appear in saas? (slow-rollouts, org/user constrained testing pools with staged rollouts, real-world feedback from actual usage data (where privacy policy permits)?
And you didn't invest anything in polish, quality and reliability before... why? Because for any questions people have you reply something like "I have Claude working on this right now" and have no idea what's happening in the code?
A reminder: your vibe-coded slop required peak 68GB of RAM, and you had to hire actual engineers to fix it.
A month prior their vibe-coders was unironically telling the world how their TUI wrapper for their own API is a "tiny game engine" as they were (and still are) struggling to output a couple of hundred of characters on screen: https://x.com/trq212/status/2014051501786931427
Yeah you don't have to convince me. I switched to Codex mid-January in part because of the dubious quality of the tui itself and the unreliability of the model. Briefly switched back through March, and yep, still a mistake.
Once OpenAI added the $100 plan, it was kind of a no-brainer.
if only there were a place with 9.881 feedbacks waiting to be triaged...
and that maybe not by a duplicate-bot that goes wild and just autocloses everything,
just blessing some of the stuff there with a "you´ve been seen" label would go a long way...
Common pattern of checking the claude code issue tracker for a bug: land on issue #12587, auto closed as duplicate of #12043; check #12043, auto closed as duplicated of #11657; check #11657, auto closed as duplicate of #10645; check #10645, never got a response, or closed as not planned, or some other bullshit.
Because then they lose vertical integration and the extra ability it grants to tune settings to reduce costs / token use / response time for subscription users.
Or improve performance and efficiency, if we’re generous and give them the benefit of the doubt.
It makes sense, in a way. It means the subscription deal is something along the lines of fixed / predictable price in exchange for Anthropic controlling usage patterns, scheduling, throttling (quotas consumptions), defaults, and effective workload shape (system prompt, caching) in whatever way best optimises the system for them (or us if, again, we’re feeling generous) / makes the deal sustainable for them.
It may be (but I wouldn’t know) that some of other changes not covered here reduced costs on their side without impacting users, improving the viability of their subscription model. Or maybe even improved things for users.
I’d really appreciate more transparency on this, and not just when things fail.
But I’ve learned my lesson. I’ve been weening off Claude for a few weeks, cancelled my subscription three weeks ago, let it expire yesterday, and moved to both another provider and a third-party open source harness.
Nothing you wrote makes sense. The limits are so Anthropic isn't on a loss. If they can customize Claude using Code, I see no reason why they couldn't do so with other wrappers. Other wrappers can also make use of cache.
If you worry about "degraded" experience, then let people choose. People won't be using other wrappers if they turn out to be bad. People ain't stupid.
By imposing the use of their harness, they control the system prompt:
> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7
They can pick the default reasoning effort:
> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode
They can decide what to keep and what to throw out (beyond simple token caching):
> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6
It literally is all in the post.
I don't worry about anything though. It's not my product. I don't work for Anthropic, so I really couldn't care less about anyone else's degraded (or not) experience.
Evidently, all these things you just dismissed matter, else all the changes I quoted from the original post wouldn’t have affected anyone, or half as many people, or half as much. Anthropic wouldn’t have had any complaints to investigate, the article promoting this entire thread wouldn’t exist, and we wouldn’t be having this very conversation.
Defaults matter. A large share of people never change them (status quo bias, psychological inertia). Having control over them (and usage quotas) means Anthropic can control and fine-tune what this fixed subscription costs them.
And evidently (re, the original article), they tried to do so.
> Defaults matter. A large share of people never change them (status quo bias, psychological inertia). Having control over them (and usage quotas) means Anthropic can control and fine-tune what this fixed subscription costs them.
Allowing third party wrappers doesn't mean Claude Code would cease to exist. The opposite actually, Claude Code would be the default.
People dissatisfied with Code would simply use other wrappers. I call it a win-win. Don't see how Anthropic would be on a lose here, they would still retain the ability to control the defaults.
Except one of the major other wrappers was pi, through OpenClaw. With countless hundreds of thousands of instances running every hour on that heartbeat.
I have no idea what the share of OpenClaw instances running on pi was, or third-party wrappers in general, but it was obviously large enough that Anthropic decided they had to put an end to it.
Conversely, from the latest developments, it would seem they are perfectly fine with people running OpenClaw with Claude models through Claude Code’s programmatic interface using subscriptions.
But in the end, this, my take, your take, is all conjecture. We are both on the outside looking in.
Hey, Boris from the Claude Code team here. People were getting extra cyber warnings when using old versions of Claude Code with Opus 4.7. To fix it, just run claude update to make sure you're on the latest.
Under the hood, what was happening is that older models needed reminders, while 4.7 no longer needs it. When we showed these reminders to 4.7 it tended to over-fixate on them. The fix was to stop adding cyber reminders.
We've been investigating these reports, and a few of the top issues we've found are:
1. Prompt cache misses when using 1M token context window are expensive. Since Claude Code uses a 1 hour prompt cache window for the main agent, if you leave your computer for over an hour then continue a stale session, it's often a full cache miss. To improve this, we have shipped a few UX improvements (eg. to nudge you to /clear before continuing a long stale session), and are investigating defaulting to 400k context instead, with an option to configure your context window to up to 1M if preferred. To experiment with this now, try: CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000 claude.
2. People pulling in a large number of skills, or running many agents or background automations, which sometimes happens when using a large number of plugins. This was the case for a surprisingly large number of users, and we are actively working on (a) improving the UX to make these cases more visible to users and (b) more intelligently truncating, pruning, and scheduling non-main tasks to avoid surprise token usage.
In the process, we ruled out a large number of hypotheses: adaptive thinking, other kinds of harness regressions, model and inference regressions.
We are continuing to investigate and prioritize this. The most actionable thing for people running into this is to run /feedback, and optionally post the feedback ids either here or in the Github issue. That makes it possible for us to debug specific reports.
Boris, you're seeing a ton of anecdotes here and Claude has done something that has affected a bunch of their most fervent users.
Jeff Bezos famously said that if the anecdotes are contradicting the metrics, then the metrics are measuring the wrong things. I suggest you take the anecdotes here seriously and figure out where/why the metrics are wrong.
On the subject of metrics, better user-facing metrics to understand and debug usage patterns would be a great addition. I'd love an easier way to understand the ave cost incurred by a specific skill, for example. (If I'm missing something obvious, let me know.)
But the default 1M context window just rolled out a few weeks ago. If refreshing old sessions on 1M context windows is the problem, it's completely aligned with what Boris is saying.
The quantitative ux research team at Google was created for exactly this problem: a service which became popular before the right metrics existed, meaning metrics need to be derived first, then optimized. We would observe users (irl), read their logs, then generate experiments to improve the behavior as measured by logs, and return to see if the experiment improves irl experiences. There were not many of us and we are around :)
I worked with Boris in the past and in my experience, Boris cares deeply about the customer. I'd vouch that Boris really cares about the issue people are running into.
The idea is that Claude Code is surprisingly buggy and unrefined for something created by the very tool and processes that are supposed to be replacing us as we speak.
Sure they can. The solution is pretty simple and in your own post. Choose either:
* Make the product good to the point code is no longer slop and shit.
* Stop hyping the quality when it isn’t there.
* Do a hybrid approach. Use their own product but actually have competent humans in the loop to make the code good.
This is not hard. Be honest and humble and that criticism goes away. It’s no one’s fault but Anthropic’s that they hype up their product to more than it can do and use it carelessly to build itself. It’s not a no-win scenario if you’re the one causing your own obviously avoidable problems.
If you mean Google website login, that step is needed because the email address is used to determine which identity provider to use. E.g. I have three different accounts that branch off from that same initial login flow.
One is my person "gmail.com" account, and the other two go through enteprise identity providers related to my employment and their G-Suite licenses. So after I put in one of these three email addresses, I get prompted for the appropriate next step. Only one of them involves giving a password to a Google server. The other two are redirects to completely separate login systems operated by my employer.
I mean I get it logically makes sense. But it still seems like a waste of time for a small percentage of use cases.
Maybe a better approach is put in your login have it automatically detect if it requires an identity provider. Gray out the password to signal to the user password is not necessary and automatically redirect.
Less clicking, don't break flow and think of a smoother solution.
HN sometimes talks about pathological customers who will never be happy. Boris is probably the single best rep in the community, possibly ever.
The way your tone and complaints come across reminds me of this. As a paying customer ($5k spend per month in my corporate job), I’d rather anthropic keep doing what they’re doing — innovating and shipping useful stuff at blinding speed — and not index on your feedback. I think the tradeoffs they would cost far outweigh the consequences.
You’re not getting a worthwhile sla on a subscription at this rate. What are you going to get? A few dollars? An sla isn’t useful unless it actually bites for the provider and actually compensates the customer. And it costs money - how much are you willing to spend for this insurance?
Wait, where is there a 'beta' tag to something that they are charging real money for? Why is this software any different than any other software and we should completely give away our rights as a consumer to ensure what we pay for is delivered?
I think the parent is saying that one should be aware that the whole LLM industry is still in an experimental stage and far from mature. What you want isn’t what’s being offered. I agree that there should be higher standards, but what we currently have is an arms race. The consequence is to factor that into the value proposition and maybe not rely too much on it.
SLAs should be standard for any paid service, especially on the enterprise side, but also on the consumer side. Being immature as a company does not excuse a lack of service delivery.
Not every customer, even a paying customer, demands reliability at a particular level. Market segmentation tends to address those situations: pay more, get more.
Users on $200 plan complaining, already at max level of subscription, I don't think a $200 subscription should make you feel like you are getting unfair advantage. Like restricting claude -p to API ... after I paid so much? Moderate use should not do that. I am not running it batch mode on a million inputs.
They can be held to account when they fail to deliver what they promise! But what is promised for delivery is what's in the Terms of Service (i.e. the agreement). Nothing more. If it's not in there, you can't hold them to account for it.
> It's too easy for companies to fail to provide their service as long as they never promise to provide their service.
I don't even know what this means. You can't make anyone work for free, nor dictate the terms of what kind of work someone will do without their consent. I assume you are not pro-slavery.
You didn't merely call out their failure. You said it was "too easy," implying something more, like they owe you something. It's a pretty entitled point of view.
"[W]ant[ing] companies to put some effort into avoiding ... failures" is not the same as "hold[ing] them to account". The former is "this sucks and I don't like it." The latter is "punish them or force them to do what I want!"--i.e., some sort of legal remedy.
What right as a consumer do you have that is pertinent here, other than to have the vendor adhere to the terms of the agreement you have with them?
Anthropic has many customers despite the fact that they have occasional problems. They’re not suing Anthropic because Anthropic isn’t promising in its agreement something they can’t deliver.
I think you’re reading into the agreement something that isn’t there, and that’s the cause of your confusion.
I am not reading into an agreement, I am saying there is no agreement to be found to ensure service delivery and the associated liability that would come for any SLA. Also, where is the Anthorpic SLA for Enterprise?
Does it exist?
Just because people pay for things doesn't mean they know or understand what they are paying for. Nor is there the legal precedence to actually understand where the rub lies or how that impacts business.
> Just because people pay for things doesn't mean they know or understand what they are paying for.
I believe, respectfully, that’s precisely what is happening in this thread because you keep complaining about the absence of an SLA that was never in the agreement, as though it is—or is supposed to be—there, and therefore the existence of some “rights” that would flow from that.
I am sorry you feel this way, but the reality of the situation is there is zero reason to trust anything Anthropic or Boris says. They have no legal liability or obligation to tell the truth, besides brand risk, which to people like you is mitigated for a single person to show up, post, and thats it.
You should work at these companies and understand they have good intentioned employees otherwise they’d rarely pass the cultural interviews plus background checks plus backchanneling. Have a bit more faith in the employees
Maybe... maybe... maybe... none of this builds trust when there is something that does build trust; putting revenue on the line and opening yourself to legal liability. Otherwise everything is empty and meaningless, its just PR, and nothing more.
Then you should offer to pay them for one. I’m sure they’d love to hear from you, and they could probably deliver one to you for the right price. But it will be a high price.
I feel like you aren't really understanding what a Service-level Agreement actually is in practice. It's not a piece of paper with a specific number of nines and an associated price tag. They can be and often are very complicated documents that take multiple rounds of redlining to arrive at something both parties agree to.
If zero data-retention was non-negotiable for the customer, it's totally possible that the negotiations ended there.
I'm not sure what you're trying to accomplish or unearth beyond what's already been said, which certainly suffices for me.
As both an attorney and SRE, I understand what an SLA is. And you can absolutely get an SLA when you buy cloud services from many vendors, including AWS. Some vendors provide it at all price points; others include it at higher service tiers, without complex negotiations needed at all. And, yes, if it’s not on the menu, you may need to negotiate one. But you can’t conclusively say “they don’t offer one” unless you’ve actually gone to the company and asked.
It seems like you could save a lot of time and confusion by talking about the SLA that you pay for from Anthropic instead of establishing your bona fides by posting links to various unrelated companies’ SLA pages.
Like how was your experience negotiating your SLA with Anthropic? What ballpark are you paying for the SLA with Anthropic that you have in place? How many 9s does your Anthropic SLA cover? Obviously you haven’t posted a half dozen times in this thread about how Anthropic by nature of existing offers SLAs without any knowledge of that, so some simple stuff about your SLA with Anthropic would be helpful.
I make no unqualified claims as to whether Anthropic offers an SLA. I never did. But I do know that it's unreasonable to claim they don't when you didn't even take the steps to conclusively determine it for yourself.
As I said: "I’m sure they’d love to hear from you, and they could probably deliver one to you for the right price. But it will be a high price."
Oh, well in that case, if posting URLs counts as proof of… something, there doesn’t appear to be any SLA page anywhere in their sitemap.
https://www.anthropic.com/sitemap.xml
Maybe it is just common for enterprise SaaS businesses to offer SLAs without having a page about it though. Something like that could possibly be unjustifiably burdensome as well because it’s not like they could just type “make a page about how we offer SLAs” and have it magically appear
That’s a good point. Having an SLA page is an indicator that a business offers SLAs, not having an SLA page is also an indicator that they offer SLAs, just secretly. If you think about it all of the people constantly complaining about uptime and saying stuff like “I would pay money for an SLA from Anthropic if I could” probably means that they are killing it with all those secret SLAs.
I mean obviously they have to offer them, because they exist, as otherwise you’d have to believe something crazy like “they don’t currently offer them” for reasons “that they haven’t disclosed”
Again, many companies will do things they don’t ordinarily offer for the right price. I’ve seen it happen myself (on both the buyer and seller side) on many occasions.
It goes to the extent of the company itself! Very few businesses publicize that they’re for sale or put their company’s purchase price on their website. But acquisitions happen all the time.
Anyway, I don’t appreciate your sarcasm coupled with what seems to be willful ignorance about how the world works, so I won’t be participating in this discussion with you anymore.
I don’t get it. If you wanted to convince everybody about a vast universe of secret business and your expertise in it, why would you start with telling people that weren’t able to get an SLA from Anthropic that Anthropic offers SLAs? And then admit that you don’t actually know and then double down?
Like if I wanted to convince people that In’N’Out has a secret menu (they do) I wouldn’t start by saying “They have the ingredients to make onion rings, therefore they sell onion rings” (they do not). They offer burgers with lettuce instead of a bun (“protein style”) though. That’s a fact that you can verify by going there or calling them and asking about it. I didn’t rely on my assumptions based on other fast food restaurants, I relied on my knowledge of the topic!
Edit: It seems like bad faith to admit that you’re using “probably” interchangeably with “I don’t know” and then editing in “for a billion dollars” several posts into a conversation.
I guess enjoy posting about entirely unrelated conversations in other threads though. (otterley’s post about my having previously had a short amicable exchange with dang in a different thread was deleted, but I’ll leave this part up. I think digging through people’s post histories to find unrelated grievances is icky, for lack of a better word, and wildly unhelpful for any type of discussion)
Even with the “for a billion dollars” addition, admitting “I don’t know” and “probably” are interchangeable doesn’t really change anything from a logical standpoint. Nobody argued against you not knowing, so I don’t understand the purpose of the repetition.
> why would you start with telling people that weren’t able to get an SLA
That hasn’t been established. There’s no evidence that they went to Anthropic and tried to negotiate one.
> that Anthropic offers SLAs
I didn’t. I said “they probably will for the right price.” There are two modifiers in that statement. And the price is unspecified. Their first offer could be a billion dollars. Too expensive? Negotiate down.
Boring corporate Ai will surely come, but hey, lets enjoy the wild west while it lasts. I am grateful to see Boris come here to address problems people face. I 100% sure nobody is making him - he has one of the coolest jobs in the world.
So that means we just eject any critical thinking when it comes to companies, especially where they is no liability or obligation for them (Boris or Anthropic) to be honest.
Don’t like Anthropic? Use a competing service. At this point the sheer volume of your commentary is not particularly complimentary to your own critical thinking skills. It’s not your job to correct the internet or to convince randoms of the rightness of your position. Of all the things in the world to be pissed at so insistently, this seems to be a pretty minor one.
So Anthropic is trying to save money on infrastructure, we all get it. However, it's not ok to degrade the performance your users have paid for. Last week the issue was that you reduced the default "effort" level, now the prompt cache is shortened. Several users experience far more restrictive usage limits lately.
There is only so much you can do through "UX improvements" or some smart routing on the backend. Your flagship product is actively getting worse, and if users need to fiddle with hidden settings and keep track of GitHub issues every week they will start voting with their money.
For context, my company gives each developer a decent monthly allowance for Claude and if push comes to shove, we are allowed to fallback to using
AWS Bedrock hosted Anthropic models.
When you pay for a Claude subscription, what exactly were you promised?
> they will start voting with their money.
And go where? Sooner or later the party is going to be over and Claude and its competitors are going to have to start charging enough to actually be profitable when the VC money dries up.
> When you pay for a Claude subscription, what exactly were you promised?
I was promised 5x or 20x the amount of resources that the free tier would offer. I implicitly expected the same quality too, not some watered-down version of the product they allowed me to sample before committing to a subscription.
Sooner or later Anthropic will run out of VC money, yes. That's their problem, not mine. When I took an Uber while it was subsidized by venture capital, the driver did not drop me half way through my destination because they were having cash flow issues.
It’s exhausting enough to deal with services that change around on an annual/semi-annual basis with pricing and expectations.
Now the expectation is that we should tolerate goalposts being shuffled around on a weekly/daily basis with the added requirement of digging into bug tickets because there’s no attempt at transparency? The tech is cool but this is absolutely insane.
If you’re an individual developer paying $100-200/mo for a service that keeps changing, there is a LOT of reason to keep an eye on other products.
I’m not saying that there isn’t a reason to keep an eye on other products. I’m saying that every other product in the space has the same unit economics and will eventually need to charge enough to be profitable - and to continue training and hardware expansion.
Honestly a developer paying $200 a month is a nothingburger and if using their service to the fullest is losing them money.
For context, the company I work for gives each consultant a $2000 a month allowance and I think there are probably around 500-700 people with that allowance. I’m sure everyone doesn’t use it all.
If they have limited hardware resources, where do you think they are going to focus?
Classic VC pump playbook - run it uneconomically until everyone is addicted, then 5x prices once you have enough critical mass. See 2010s "Millennial Lifestyle Subsidy"..
It seems pretty transparent that they are heavily resource constrained, (training run for Claude 5.x, higher usage / growth than anticipated). I don’t disagree that their long play is monopolistic pricing, but what we’re observing seems better explained by the fact they have a very tight compute budget they are trying to optimize over to put as much as they can into next gen experiments / training to make sure they stay competitive over the next 6-months / year.
Why did this become an issue seemingly overnight when 1M context has been available for a while, and I assume prompt caching behavior hasn't changed?
EDIT: prompt caching behavior -did- change! 1hr -> 5min on March 6th. I'm not sure how starting a fresh session fixes it, as it's just rebuilding everything. Why even make this available?
It feels like the rules changed and the attitude from Anth is "aw I'm sorry you didn't know that you're supposed to do that." The whole point of CC is to let it run unattended; why would you build around the behavior of watching it like a hawk to prevent the cache from expiring?
This is not accurate. The main agent typically uses a 1h cache (except for API customers, which can enable 1h but it is not on by default because it costs more). Sub-agents typically use a 5m cache.
What does it mean that sub-agents use a 5 min cache? Is this just for growing contexts submitted by subagent itself? What about fixed sub-agent context prefixes such as tool definitions? Are those 1hr TTL?
It seems that a more flexible cache control mechanism, useful for variable duration sub-agents, would be more like an arena allocator. Let the client tag their API key activity with different "cache group" (arena) identifiers, then provide an API method to let them free each cache group when they are finished with it. Each sub-agent would then use it's own cache group and clear it when the sub-agent exits, rather than just having a fixed 5min or 1hr TTL. The client could provide a default TTL for each cache group to use in case they forget to free.
Context prefixes like tool definitions that will be the same for multiple invocations of the same sub-agent type could then be created (maybe by main agent) with a different cache group, and a longer default TTL.
As of yesterday subagents were often getting the entire session copied to them. Happened to me when 2 turns with Claude spawned a subagent, caused 2 compactions, and burned 15% of my 5-hour limit (Max 5x).
how long they stay around after the cache miss is irrelevant if I am burning all the prior tokens again. also, how much context they have depends entirely on the task and your workflow. I you have a subagent implement a feature and use the compile + test loop to ensure it is implemented correctly before a supervisor agent reviews what was implemented vs asked then yes, subagents do have a lot of context.
I'd say it's next to impossible to have a subagent doing a compile+test loop where at least 1 call doesn't get made to the API over multiple 5-minute stretches to keep the cache warm. In such a case it may just be the same as doing the compile+test manually and then having the agent troubleshoot any issues before iterating.
but how to make claude-code send that when paying by API-key?
or when using a custom ANTHROPIC_BASE_URL? (requests will contain cache_control, but no ttl!)
2.1.108
Added ENABLE_PROMPT_CACHING_1H env var to opt into 1-hour prompt cache TTL on API key, Bedrock, Vertex, and Foundry (ENABLE_PROMPT_CACHING_1H_BEDROCK is deprecated but still honored),
and FORCE_PROMPT_CACHING_5M to force 5-minute TTL
docs are not updated yet, directly from the changelog^
And this is "a bit better" - but seemingly still nowhere close to what subscribers get where main thread, agent, initial and follow-up messages may all get there own ?intelligent? 5min or 1h decision :/
The /clear nudge isn't a solution though. Compacting or clearing just means rebuilding context until Claude is actually productive again. The cost comes either way.
I get that 1M context windows cost more than the flat per-token price reflects, because attention scales with context length, but the answer to that is honest pricing or not offering it. Not annoying UX nudges.
What’s actually indefensible is that Claude is already pushing users to shrink context via, I presume, system prompt. At maybe 25% fill:
“This seems like a good opportunity to wrap it up and continue in a fresh context window.”
“Want to continue in a fresh context window? We got a lot of work done and this next step seems to deserve a fresh start!”
If there’s a cost problem, fix the pricing or the architecture. But please stop the model and UI from badgering users into smaller context windows at every opportunity. That is not a solution, it’s service degradation dressed as a tooltip.
The cost issues they're seeing (at least from what they've stated) are from users, not internally. Basically, it takes either $5 or $6.25 (depending on 5m or 1h ttl) to re-ingest a 1M context length conversation into cache for opus 4.6, that's obviously a very high cost, and users are unhappy with it.
I think 400k as a default seems about right from my experience, but just having the ability to control it would be nice. For the record, even just making a tool call at 1M tokens costs 50 cents (which could be amortized if multiple calls are made in a round), so imo costs are just too high at long context lengths for them to be the default.
For me definitely the worst regression was the system prompt telling claude to analyze file to check if it's malware at every read. That correlates with me seeing also early exhausted quotas and acknowledgments of "not a malware" at almost every step.
It is a horrible error of judgement to insert a complex request for such a basic ability. It is also an error of judgement to make claude make decisions whether it wants to improve the code or not at all.
It is so bad, that i stopped working on my current project and went to try other models. So far qwen is quite promising.
I don't think that's accurate. The malware prompt has been around since Sonnet 3.7. We carefully evaled it for each new model release and found no regression to intelligence, alongside improved scores for cyber risk. That said, we have removed the prompt for Opus 4.6 since it no longer needed it.
I started seeing "not a malware, continuing" in almost every reply since around 2 weeks ago. Maybe you just reintroduced it with some regression? Opus 4.6
I'm happy to provide any other info that can be useful (as long as i'm not sharing any information about the code or tools we use into a public github issue).
1. I've never seen this. Is there a config option to unhide it if it's happening? Is this in Claude Code? Does it have to be set to verbose or something?
2. Can we pay more/do more rigorous KYC to disable it if it's active?
> Since Claude Code uses a 1 hour prompt cache window for the main agent, if you leave your computer for over an hour then continue a stale session, it's often a full cache miss. To improve this, we have shipped a few UX improvements (eg. to nudge you to /clear before continuing a long stale session), and are investigating defaulting to 400k context instead
I don’t understand this. I frequently have long breaks. I never want to clear or even compact because I don’t want to lose the conversations that I’ve had and the context. Clearing etc causes other issues like I have to restate everything at times and it misses things. I do try to update the memory which helps. I wish there was a better solution than a time bound cache
I wanted this as well. Even asked about it at an openai talk. Basically a way to get the KV cache to the client (they can encrypt it if they care about me REing it, make a compressed latent if they don't wanna egress 20GB, whatever, I'm fine with a black box) so that I can load it later and avoid these cache misses.
I think the primary reason they cannot do this is that they change the memory and communication layouts in their serving stack rather aggressively. And naturally keeping the KV cache portable across all such layouts is a very difficult task. So you'd have to version the cache down to a specific deployment, and invalidate it the moment anything even small changes. So giving the user a handle to the cache sort of prevents you from making large changes to memory layout. Which is I suppose not that enticing. Also, client side KV caches are only meaningful in today's 1M contexts. Few y back it wasn't necessary, since just recomputing would be better for everybody.
To be clear, I don't mean they send it along with every request. Rather, they do their current TTL cache, and then when I'm at the end of a session, I request it in one shot and then close the session. And it doesn't have to come to the literal client, they can egress it to a storage service that we pay for, whatever. But ya the compat problem makes it all a non starter.
The KV cache consists of activation vectors for every attention head at every layer of the model for every token, so it gets quite large. ChatGPT also estimates 60-100GB for full token context of an Opus-sized model:
I don't want a nudge. I want a clear RED WARNING with "You've gone away from your computer a bit too long and chatted too much at the coffee machine. You're better off starting a new context!"
I think after the TTL expires the session should be autocompacted and the user should given a choice to continue with compacted version or be hit with the full read cost of continuing with their large but expired context. At the moment users are blind what is going on.
Why is nobody even asking why that should be an issue? No other text editor shits the bed that way. The whole point of the computer is that it patiently waits for my input.
Hey Boris - why is the best way to get support making a Hacker News or X post, and hoping you reply? Why does Anthropic Enterprise Support never respond to inquiries?
I mean if we're building an unrelated wishlist... Can 20x max users get auto mode already? Or can the enterprise plans get something equivalent to 20x max?
Given I'm running two max accounts to get the usage I want, can we get a 25x and 40x tier? :-)
It is not inherently their fault though because usage is controlled both by the user and the harness behavior. So I was asking specifically what about the harness was messed up, can you provide that info?
Not parent but I can guess from watching mostly from the sidelines.
They introduced a 1M context model semi-transparently without realizing the effects it would have, then refused to "make it right' to the customer which is a trait most people expect from a business when they spend money on it, specially in the US, and specially when the money spent is often in the thousands of dollars.
Unless anthropic has some secret sauce, I refuse to believe that their models perform anywhere near the same on >300k context sizes than they do on 100k. People don't realize but even a small drop in success rate becomes very noticeable if you're used to have near 100%, i.e. 99% -> 95% is more noticeable than 55% -> 50%.
I got my first claude sub last month (it expires in 4 days) and I've used it on some bigish projects with opencode, it went from compacting after 5-10 questions to just expanding the context window, I personally notice it deteriorating somewhere between 200-300k tokens and I either just fork a previous context or start a new one after that because at that size even compacting seems to generate subpar summaries. It currently no longer works with opencode so I can't attest to how it well it worked the past week or so.
If the 1M model introduction is at fault for this mass user perception that the models are getting worse, then it's anthropics fault for introducing confusion into the ecosystem. Even if there was zero problems introduced and the 1M model was perfect, if your response when the users complain is to blame it on the user, then don't expect the user will be happy. Nobody wants to hear "you're holding it wrong", but it seems that anthropic is trying to be apple of LLMs in all the wrong ways as well.
Especially since Codex faced the same issue but the team decided to explicitly default to only ~200k context to avoid surprises and degradation for users.
Different users do seem to be encountering problems or not based on their behavior, but for a rapidly-evolving tool with new and unclear footguns, I wouldn't characterize that as user error.
For example, I don't pull in tons of third-party skills, preferring to have a small list of ones I write and update myself, but it's not at all obvious to me that pulling in a big list of third-party skills (like I know a lot of people do with superpowers, gstack, etc...) would cause quota or cache miss issues, and if that's causing problems, I'd call that more of a UX footgun than user error. Same with the 1M context window being a heavily-touted feature that's apparently not something you want to actually take advantage of...
Me and my colleagues faced, over the last ~1 month or so, the same issues.
With a new version of Claude Code pretty much each day, constant changes to their usage rules (2x outside of peak hours, temporarily 2x for a few weeks, ...), hidden usage decisions (past 256k it looks like your usage consumes your limits faster) and model degradation (Opus 4.6 is now worse than Opus 4.5 as many reported), I kind of miss how it can be an user error.
The only user error I see here is still trusting Anthropic to be on the good side tbh.
just like everybody else I and my colleagues at work have seen major regressions in terms of available usage over the past month, seemingly unrelated to caching/resuming. On an enterprise sub doing the same work I personally went from being able to have several sessions running concurrently without hitting limits, to only having one session at a time and hitting my 5h every day twice a day in 3-4 hours tops (and due to the apparent lower intelligence I have been at the terminal watching what opus is doing like a hawk, so it's not a I went for coffee I have to hit the cache). The first day I ever hit my 5h this year was the day everybody reported it (I think it was the Monday you introduced the 2x promotion after hours? not sure, like 3 weeks ago?)
To avoid 1M issues, this week I have also intentionally used the 256k context model, disabled adaptive thinking and did the same "plans in multiple short steps with /clear in-between" to minimize context usage, and yet nothing helps. It just feels ~2x to ~3x less tokens than before, and a lot less smart than in February.
Nowadays every time I complete a plan I spend several sessions afterwards saying things like "we have done plan X, the changes are uncommitted, can you take a look at what we did" and every time it finds things that were missed or outright (bad) shortcuts/deviations from plan despite my settings.json having a clear "if in doubt ask the user, don't just take the easy way out". As a random data point, just today opus halfway through a session told me to make a change to code inside a pod then rollout restart it to use said change, and when called out on it it of course said that I was right and of course that wouldn't work...
It is understandable that given your incredible growth you are between a rock and a hard place and have to tweak limits, compute does not grow on trees, but the consistent "you are holding it wrong" messaging is not helpful. I am wondering if realistically your only option is to move everybody to metered, with clear token usage displayed, and maybe have pro/max 5/max 20 just be a "your first $x of tokens is 50/75% off". Allow folks to tweak the thinking budget, and change the system prompt to remove things like "try the easy solution first" which anecdotally has been introduced in the past while, and allow users to verify on prompt if the prompt would cause the whole context to be sent or if cache is available.
Yes same here. I use CC almost constantly every day for months across personal and work max/team accounts, as well as directly via API on google vertex. I have hardly ever noticed an issue (aside from occasional outages/capacity issues, for which I switch to API billing on Vertex). If anything it works better than ever.
You know that people are not using the same resources? It's like 9 out of 10 computers get borked and you have the 1 that seems okay and you essentially say "My computer works fine, therefore all computers work fine." Come on dude.
> To improve this, we have shipped a few UX improvements (eg. to nudge you to /clear before continuing a long stale session)
Is this really an improvement? Shouldn't this be something you investigate before introducing 1M context?
What is a long stale session?
If that's not how Claude Code is intended to be used it might as well auto quit after a period of time. If not then if it's an acceptable use case users shouldn't change their behavior.
> People pulling in a large number of skills, or running many agents or background automations, which sometimes happens when using a large number of plugins.
If this was an issue there should have been a cap on it before the future was released and only increased once you were sure it is fine? What is "a large number"? Then how do we know what to do?
It feels like "AI" has improved speed but is in fact just cutting corners.
Would it be possible to increase the cache duration if misses are a frequent source of problems?
Maybe using a heartbeat to detect live sessions to cache longer than sessions the user has already closed. And only do it for long sessions where a cache miss would be very expensive.
Even if Anthropic is working in good faith to lower infrastructure costs, developers need more than 5 minutes to notice that CC completed a task, review its changes and ask it to merge. Only developers who do not review code changes can live with such a TTL...
Consider making this value configurable as the ideal TTL value is different for each person. If people are willing to pay more for 30 minutes TTL than 5 minutes, they should be able to.
Claude Code is the most prompt cache-efficient harness, I think. The issue is more that the larger the context window, the higher the cost of a cache miss.
That might be, but the argument was that poor cache utilization was costing Anthropic too much money in other harnesses. If cache is considered in rate limits, it doesn’t matter from a cost perspective, you’ll just hit your rate limits faster in other harnesses that don’t try to cache optimize.
There were two issues with some other 3p harnesses:
1. Poor cache utilization. I put up a few PRs to fix these in OpenClaw, but the problem is their users update to new versions very slowly, so the vast majority of requests continued to use cache inefficiently.
2. Spiky traffic. A number of these harnesses use un-jittered cron, straining services due to weird traffic shape. Same problem -- it's patched, but users upgrade slowly.
We tried to fix these, but in the end, it's not something we can directly influence on users' behalf, and there will likely be more similar issues in the future. If people want to use these they are welcome to, but subscriptions clients need to be more efficient than that.
How much jitter would you prefer, how many seconds / minutes out? I have some morning tasks that run while I'm asleep via claude -p, and it sounds like I'm slightly contributing to your spikes (presumably hourly and on quarter hours).
If you give doll a list of things you want to see from third party harnesses, a compliance checklist it will make sure the one it is building follows it to the letter.
I suspect 1M token context is questionable value because of the secondary effect of burning quota vs getting work done.
I think the model select that let me choose 1M made sense because I could decide if I was working on large documents and compacting more often was more effective.
Claude Code cache is not 1 hour. There is a "Closed as not planned" issue in GitHub that confirms that it has been moved to 5 minutes since March: https://github.com/anthropics/claude-code/issues/46829.
I started seeing the massive degradation exactly on the 23rd of March, hence after a few days I unsubscribed because it was completely unusable, with a ~5h session being depleted in as little as 15-20 mins.
As another data point, I pay for Pro for a personal account, and use no skills, do nothing fancy, use the default settings, and am out of tokens, with one terminal, after an hour. This is typically working on a < 5,000 line code base, sometimes in C, sometimes in Go. Not doing incredibly complicated things.
When a user walks away during the business day but CC is sitting open, you can refresh that cache up to 10x before it costs the same as a full miss. Realistically it would be <8x in a working day.
Long term claude code user here. Is the first time i've had to setup a hook to codex to review claude output.
Is hallucinating like never before
Is missing key concepts/instructions in context like never before
Is writing bad code that will "pass test" much more. Before it use to try be critic and do good code, now it will try to hack test and bypass intructions for a green pass.
There's an issue someone raised showing that prompt caches are only 5 minutes.
The reply seems to be: oh huh, interesting. Maybe that's a good thing since people sometimes one-shot? That doesn't feel like the messaging I want to be reading, and the way it conflicts with the message here that cache is 1 hour is confusing.
Is there any status information or not on whether cache is used? It sure looks like the person analyzing the 5m issue had to work extremely hard to get any kind of data. It feels like the iteration loop of people getting better at this stuff would go much much better if this weren't such a black box, if we had the data to see & understand: is the cache helping?
Hi, thanks for Claude Code. I was wondering though if you'd considering adding a mode to make text green and characters come down from the top of the screen individually, like in The Matrix?
I’ve seen the /clear command prompt and I found the verbiage to be a bit unclear. I think clarifying that the cache has expired and providing an understandable metric on the impact - ie “X% of your 5-hour window” for Pro/Mad users and details on token use for API users. A pop-up that requires explicit acknowledgment might also help, although that could be more of an annoyance to enterprise users.
One pattern I use frequently is using one high level design and implementation agent that I’ll use for multiple sessions and delegate implementation to lower level agents.
In this case it’d be helpful to have one of two options:
1. If Claude CLI could create an auto compaction of the conversation history before cache expiration. For example, if I’m beyond X minutes or Y prompts in a conversation and I’ve been inactive for a threshold it could auto-compact close to the expiration and provide that as an option on resume.
2. If I could configure cache expiration proactively and Anthropic could use S3 or a similar slow load mechanism to offload the cache for a longer period - possibly 24-72h.
I can appreciate that longer KV cache expiration would complicate capacity management and make inference traffic less fungible but I wouldn’t mind waiting seconds to minutes for it to load from a slower store to resume without quota hits.
Pulling all the skills and agents in the world in, when unused are a big hit. I deleted all of mine and added back as needed and there was an improvement.
Running Claude Cowork in the background will hit tokens and it might not be the most efficient use of token use.
Last, but not least, turning off 1M token context by default is helpful.
One thing I didn't see anywhere here, except your mention about pulling in large number of skills, is that the token consumption is significantly higher for users with many agents, skills, and MCPs installed, and many are mere ghosts. The 5m TTL from #46829 compounds the effect: in my case, I found ~20k tokens of ghost context I hadn't intentionally opened. Each idle period after 5m wastes that as a full cache miss.
Boris, would you please confirm on-record: is the current cache TTL for the main agent context 1h or 5m? Issue #46829 was closed as "not planned".
Could we get an option to use Opus with a smaller context window? I noticed that results get much worse way earlier than when you reach 1M tokens, and I would love to have a setting so that I could force a compaction at eg 300k tokens.
The only people who are going to run into issues are superpower users who are running this excessively beyond any reasonable measure.
Most people are going to be quite happy with your service. But at the same time, and this is just a human nature thing people are 10 times more likely we complain about an issue than to compliment something working well.
I don't know how to fix this, but I strongly suspect this isn't really a technical issue. It's more of a customer support one.
> defaulting to 400k context instead, with an option to configure your context window to up to 1M if preferred
This seems really useful!
I'm surprised that "Opus 4.6" (200K) and "Opus 4.6 1M" are the only Opus options in the desktop app, whereas in the CLI/TUI app you don't seem to even get that distinction.
I bet that for a lot of folks something like 400k, 600k or 800k would work as better defaults, based on whatever task they want to work on.
Boris, wasnt this the same thing ~2 weeks ago? Is it the same cache misses as before? What's the expected time till solved? Seems like its taking a while
it seems if context can't be held for over an hour it should warn you a countdown or such; i already enabled the tokens verbosity thing to see what token level i'm at, but i often leave things sitting rather than complete so that i'm tying things up to start something new in the morning rather than starting on a new thing. so like i just resumed a session that was near-complete, and now it's gone and reloaded all that session in? bit i hadn't detached it. i kind of thougth /summary itself had to read the whole token flow, but that the token context was held locally for some reason..
Hello Boris! How do I increase the 1 hour prompt cache window for the main agent? I would love to be able to set that to, say, 4 hours. That gives me enough time to work on something, go teach a class, grab a snack, and come back and pick up where I left off.
Number 2 makes me chuckle honestly. Too many people going down the 10x rabbit holes on youtube. Next up, a framework that 100xs your workflow. You know its good because it comes with 300 agents and 20 mcp servers and 1200 skills
Resizing the context window seems like a very good idea to me. I noticed a decline of productivity when the 1M context window was released and I'd like to bring it back to 200k, because it was totally fine for the things I was working on.
Thank you for your responses, especially on a Sunday. They give us some insights and at least a couple temporary workarounds to use, while the issues are being addressed :) much appreciated
I would argue that KV caching is a net gain for Ant and a well-maintained cache is the biggest thing that can generate induced demand and a thriving third party ecosystem. https://safebots.ai/papers/KV.pdf
shouldn't compaction be interactive with the user as to what context will continue to be the most relevant in the future??? what if the harness allowed for a turn to clarify the user's expected future direction of the conversation and did the consolidation based upon the addition info?
there definitely seems to be a benefit to pruning the context and keeping the signal to noise high wrt what is still to be discussed.
Where can i learn about concepts like prompt cache misses? I don't have a mental model how that interacts with my context of 1M or 400k tokens... I can cargo cult follow instructions of course but help us understand if you can so we can intelligently adapt our behavior. Thanks.
Thanks. Just noting that those docs say the cache duration is 5 min and not 1 hour as stated in sibling comment:
> By default, the cache has a 5-minute lifetime. The cache is refreshed for no additional cost each time the cached content is used.
>
> If you find that 5 minutes is too short, Anthropic also offers a 1-hour cache duration at additional cost.
Apparently Anthropic downgraded cache TTL to 5 min without telling anyone. My biggest issue with the recent issues with Claude Code is the lack transparency, although it looks like even Boris doesn't know about one:
https://news.ycombinator.com/item?id=47736476
Why are you all of a sudden running into so many issues like this? Could it be that all of the Anthropics employees have completely unlimited and unbounded accounts, which means you don't get a feeling of how changes will affect the customers?
I think the suspicion regarding skills and plugins is fair and logical. And it is absolutely the case that some use significantly more tokens.
with that said, on my 5x plan, I could have multiple sessions working and the limit was far away. Around when you introduced the whole more tokens during off-peak hours and fewer tokens during working US hours, Even with a single session, using no plugins at all (I uninstalled OMC) I run into limits very often.
I have not performed any rigorous tests but it feels like I have about 25% of what I used to have or less. This is all without using teams of agents, or ralph loops or anything like that. Just /plan and execute in a single session. I have restored the /clear context before executing plan to try and mitigate things. I will also try the 400k context since, in my experience, the 1M tokens have not made Opus 4.6 noticeably smarter for my small webapp use-case.
Best of luck to you!
ps: whenever you introduce a change, please make it optional AND ask the user about it at first. Don't just yank things suddenly (like the /clear context and apply plan option.) as I spent hours trying to figure out how I broke it before I saw your note and how to re-enable it.
I have a feature request: I build an mcp server, but now it has over 60 tools. Most sessions i really don’t need most of them. I suppose I could make this into several servers. But it would maybe be nice to give the user more power here. Like let me choose the tools that should be loaded or let me build servers that group tools together which can be loaded. Not sure if that makes sense …
There's also CLAUDE_CODE_DISABLE_1M_CONTEXT and I'm really not clear on what the difference is and why to pick one over the other. But I guess one disables models that have 1m and the other keeps those models but sets the limit lower?
It seems just fine to me. This is what Anthropic needs to do if they want to survive. I'm always looking out for someone to integrate an actually good harness to a good model. Once that happens, I'm jumping ship if Anthropic keeps playing these tricks.
It's almost unusable for me now. A simple prompt to merge 3 sub-100-line files with simple node code, on Sonnet 4.6, uses up 20% of my 5 hour quota, on a new/fresh session.
To be fair, my comment was a bit harsher before the update. The way they handle the development, communication and how they treat customers isn't fine. I've seen some angry people post and comment in manners which truly deserved the label hostile.
The whole product with the infrastructure and Claude Code's code appear to be vibe coded.
They appear to take issues seriously mostly when they become posts on hacker news and when articles are published online by major news sites. Customer support is mostly a bot. I don't even know how to reach some actual humans to get support.
I'm sorry if you and others are offended. They've had these issues for several weeks now. I haven't seen any real improvements during this time. I see more features and more bugs.
There have been several releases made over the last few days without any changelogs. The quotas are still as opaque as they've been. This company has some extremely shady business practices.
As an (ex) paying customer, I'm expecting some consistency. I used to be satisfied with the value I got, until the limits changed overnight, and I'd get a ten of my previous usage.
If Anthropic is allowed to alter the deal whenever, then I'd expect to be able to get my money back, pro-rata, no questions asked.
All those apply to OpenAI+Codex too, but they're far more generous with limits than Anthropic, and with granting fresh limits to apologize when they fuck up.
To enable it, run /config > output styles > Learning
reply