bcherny · 2026-06-01T19:29:25 1780342165

For those using Claude Code, I recommend Learning mode to instruct Claude to walk you through implementing the solution yourself rather than doing it for you. It’s very helpful when diving into a new domain, and helps build lower level intuition.

To enable it, run /config > output styles > Learning

FistfulOfHaws · 2026-06-01T23:26:44 1780356404

Learning mode has been a huge help for me; it quickly became my favorite way to learn. I ended up creating a “Coaching Mode” output style that took some of the learning concepts, like stubbing todos for the user, and added other instructions that better fit how I learn.

nbbaier · 2026-06-02T00:47:32 1780361252

This sounds neat - do you have this publicly available?

bigmadshoe · 2026-06-02T14:30:41 1780410641

I think it’s part of official Claude Code.

nbbaier · 2026-06-02T22:45:51 1780440351

I meant your specific tweaks

bcherny · 2026-05-28T17:36:47 1779989807

There are two main differences:

1. Support for 1-2 OOMs more agents, to do more work in parallel

2. A phased, semi-structured approach where work happens in steps
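To make the two differences above concrete, here is a minimal Python sketch of a phased fan-out. It is an illustration only and not the actual workflow runtime: `run_agent`, `run_phase`, `run_workflow`, and `max_parallel` are hypothetical names, and the "agent" is a stub that echoes its task.

```python
import asyncio


async def run_agent(task: str) -> str:
    # Hypothetical stand-in for a real agent call; it only echoes the task.
    await asyncio.sleep(0)
    return f"done:{task}"


async def run_phase(tasks: list[str], max_parallel: int) -> list[str]:
    # Fan out every task in the phase, with at most max_parallel running at once.
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(task: str) -> str:
        async with gate:
            return await run_agent(task)

    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*(guarded(t) for t in tasks))


async def run_workflow(phases: list[list[str]], max_parallel: int = 100) -> list[list[str]]:
    # Phases run one after another; work inside a phase runs in parallel.
    results = []
    for tasks in phases:
        results.append(await run_phase(tasks, max_parallel))
    return results


def run(phases: list[list[str]], max_parallel: int = 100) -> list[list[str]]:
    return asyncio.run(run_workflow(phases, max_parallel))
```

For example, `run([["a", "b"], ["c"]], max_parallel=2)` finishes both tasks of the first phase before starting the second.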

bcherny · 2026-05-28T17:24:14 1779989054

A few of us from the Claude Code team will be hanging around if anyone has questions! Very excited for this launch -- dynamic workflows have been a game changer for engineering here at Anthropic. Can't wait to hear what you think.

_boffin_ · 2026-05-28T20:22:58 1779999778

This isn't related to Dynamic Workflows, but more on the telemetry / observability side of things.

Why'd you guys not want to allow the traceparent in hooks, but allowed the session.id? Any plans on changing that?

nebben64 · 2026-05-28T22:55:33 1780008933

hey Boris, with multi-agent work, e.g. Agent Teams, following other agents is not possible because they work so fast. (my tmux panes are just matrix)

do you think something like a /speed config can be introduced to adjust agent working speed and let people adjust?

hbarka · 2026-05-28T17:50:32 1779990632

Hi Boris. Love the velocity of features. Are you planning on adding a secrets manager? Enterprise workflows almost always require an encrypted parameter or calling a secret.

mrud · 2026-05-29T21:37:17 1780090637

I think they have vaults[0] which provide it. I don't think this is available on the non platform side.

[0] https://platform.claude.com/docs/en/managed-agents/vaults

hbarka · 2026-05-30T00:47:15 1780102035

This is it, thank you. I’m surprised I couldn’t get Claude to surface this capability.

tomjakubowski · 2026-05-28T19:50:51 1779997851

Why should secrets be built in? What's the issue with tool use and something like 1password's or Vault's CLI?

hbarka · 2026-05-28T22:08:34 1780006114

Another piece to pay for.

tomjakubowski · 2026-06-02T15:53:40 1780415620

There are free alternatives: https://openbao.org/

Personally, I am happy paying 1password for my personal secret management. Their security credibility and bona fides are well-established. I'd strongly consider them for a business contract too.

dbbk · 2026-06-01T14:43:36 1780325016

Personally I would just like to be able to read more than 2 lines of an AskUserQuestion on the iOS app. Ever since the feature launched it's truncated the question, so you cannot actually read it.

Does anyone at Anthropic use the iOS app? Ever?

manquer · 2026-05-29T18:10:32 1780078232

Great feature and wonderful launch!

Isn’t using the keyword “Workflow” like “Ultrathink” problematic?

Ultrathink is uncommon enough that it is unlikely to be used in code or prompts outside its intended purpose.

Workflow is a generic keyword, used in many contexts both inside codebases and in orchestration tooling such as temporal.io and others that name their constructs “workflows”.

bryan0 · 2026-05-28T17:43:25 1779990205

Thanks to you and the anthropic team for developing such exciting tools! The blog post seems to position workflows for “breadth”: generating fixes / refactors against large code bases. What about for “depth”: developing specific new features and functionality end-to-end? I’ve struggled to make this work reliably using the current experimental agent teams. Does this replace or augment that functionality?

bcherny · 2026-05-28T17:47:20 1779990440

Yes, it also helps! That's a place where raw model capability is the most helpful, but we do find that some dynamic workflow configurations can be helpful too.

bryan0 · 2026-05-28T18:01:21 1779991281

Cool! If you can point to any examples of those types of workflow configurations I’d be super interested. For example, to have a team of agents review a PR and iterate on it until all requirements are met, including UX, security and product functionality goals. If they could “converge” to a solution, like workflows seem to be designed for, that would be amazing.

tsunamifury · 2026-05-28T17:28:19 1779989299

This is a really disappointing release for such a promising technique. Long walks with fanned vectors can actually be token optimizing vs token burning when combined with self grading each agent along the walk and compared to manual long coding walks to solve first pass problems. But instead this frames it (assumptively) as a tokenmaxxing strategy. There are also many other strategies that can prove efficiency and wider solution consideration with consensus, but nothing here explains why it's an improvement or better than other techniques.

It's like you guys aren't even aware of the primary problem you are all facing: your token burns aren't paying off anymore against standard coding -- and looking net negative. I have to ask, are you this unaware of your core problem set here?

There aren't any examples, proofs, or scenarios that show why there is improvement either in complexity or reliability of the solution or efficiency of the path to the solution. I'm baffled.

rsstack · 2026-05-28T17:29:36 1779989376

Will you document how to (AI-)author and share reusable workflows between team members, to ensure some consistency of quality?

Maybe blasphemy, but will workflows be able to use non-Anthropic LLMs (e.g., delegating some steps to local models, but design and review by Claude)?

bcherny · 2026-05-28T17:32:15 1779989535

Yes, more docs + technical details coming soon.

hedgehog · 2026-05-28T21:37:37 1780004257

How granular is the control over the internal process?

In my experiments I've had some success modeling the work to be done as a DAG of typed artifacts with a combination of code + LLM doing decomposition, transforms, synthesis, and fitness checking to generate the output. It took me a lot of tries to arrive at that formula and it would be cool to have something more general. I also run part of it against local compute because it would be far beyond my budget to do it all on Opus, so something for that would be nice too.
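One possible reading of "a DAG of typed artifacts" with fitness checking, as a minimal Python sketch. This is not the commenter's code: `Step`, `run_dag`, and the example steps are hypothetical, and each `produce` callable stands in for either plain code or an LLM call.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str                        # artifact this step produces
    inputs: tuple[str, ...]          # artifacts it consumes
    produce: Callable[..., object]   # plain code or an LLM call (stubbed here)
    check: Callable[[object], bool]  # fitness check on the produced artifact


def run_dag(steps: list[Step]) -> dict[str, object]:
    by_name = {s.name: s for s in steps}
    # Map each artifact to its prerequisites; static_order yields prerequisites first.
    order = TopologicalSorter({s.name: set(s.inputs) for s in steps}).static_order()
    artifacts: dict[str, object] = {}
    for name in order:
        step = by_name[name]  # KeyError if an input has no producing step
        value = step.produce(*(artifacts[i] for i in step.inputs))
        if not step.check(value):
            raise ValueError(f"artifact {name!r} failed its fitness check")
        artifacts[name] = value
    return artifacts


# Hypothetical three-step pipeline: spec -> parts -> summary.
example_steps = [
    Step("summary", ("parts",), lambda parts: " ".join(parts), lambda v: len(v) > 0),
    Step("parts", ("spec",), lambda spec: spec.split(","), lambda v: len(v) > 0),
    Step("spec", (), lambda: "parse,plan,emit", lambda v: isinstance(v, str)),
]
```

Steps can be declared in any order; the topological sort decides execution order, and a failed check stops the run before downstream steps consume a bad artifact.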

wilg · 2026-05-28T18:00:44 1779991244

Can you please fix the issue where like 99.99999% of the time Claude tries to launch a subagent of its own accord, it gets "Prompt is too long" and tries several more times, then gives up and does it without the subagent? Big waste of time and tokens, and I'm getting almost none of the subagent advantages. Not kidding that this happens about 100 times a day.

wilg · 2026-05-28T17:39:45 1779989985

I tried creating a workflow in Claude 1.9255.2 (1dc8f7) 2026-05-27T01:57:20.000Z

and got

API Error: 400 messages.3.content.11: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.

Tried again in

Claude 1.9659.1 (193bcb) 2026-05-28T16:22:15.000Z also but may need a new chat

bcherny · 2026-05-28T17:53:19 1779990799

Looking

wilg · 2026-05-28T17:56:59 1779991019

Still seeing it in new threads with Claude 1.9659.1 (193bcb) 2026-05-28T16:22:15.000Z

k2xl · 2026-05-28T17:28:02 1779989282

How do you guys plan feature support between the CLI and Claude Desktop?

bcherny · 2026-05-28T17:31:20 1779989480

We generally build features into the Claude Agent SDK, which is shared by CLI, Desktop, VSCode, and cloud.

unshavedyak · 2026-05-28T19:11:25 1779995485

VSCode has an official client? Given IDE usage is being restricted from Claude Code via the CC SDK tokens going to the Claude API rather than your CC Subscription, i'm unclear which IDEs can actually use claude code now.

Eg is Zed capable of using a Claude Code Subscription?

fredoliveira · 2026-05-28T22:01:37 1780005697

> is Zed capable of using a Claude Code Subscription?

Yes. Zed connects to Claude Code via ACP.

unshavedyak · 2026-05-29T23:06:20 1780095980

Oh, yea here's all the proof you need. Even Zed themselves admit you won't be able to use Claude Code via ACP via Subscription: https://zed.dev/blog/terminal-threads

So yea, bcherny didn't reply to me but as far as i can tell - No, neither Zed nor VSCode will have Claude Code natively in it. The best we can do is embed a Terminal into the editor and run CC in that.

With that said, because bcherny advertised VSCode, i'm going to guess VSCode is going to get special treatment. Really annoying.

unshavedyak · 2026-05-28T23:08:37 1780009717

to be clear, i'm referring to the recent fact where it appears that they're disabling all Claude Code (Subscription) usage from the SDK. Which ACP would be included on.

As usual though it's not super clear exactly what is allowed or not.

k2xl · 2026-05-29T15:50:53 1780069853

it's confusing that ultracode is not enabled on the desktop though - or at least it isn't clear how to enable

thallavajhula · 2026-05-28T17:26:19 1779989179

Hi Boris! Thanks for Claude Code.

Is there an example of how y'all use Dynamic Workflows internally that you could share with the rest of us here so that we can mimic something similar?

bcherny · 2026-05-28T17:30:18 1779989418

Hey, yep. A few things I personally used dynamic workflows for over the last few weeks:

1. Autonomously landed 20+ optimizations to reduce Claude Code's token usage by ~15%

2. Ported tree-sitter, color-diff, yoga-layout, and a number of other WASM and Rust native modules to TypeScript, improving CPU and memory use by 2-10x in the process

3. Made our CI faster, and repeatedly found and fixed flaky tests (with /loop)

4. Migrated from regex-based bash static analysis to tree-sitter, reducing false positive permission prompts by 45%

5. Reduced Claude Agent SDK startup time by 61%, by repeatedly profiling and optimizing the startup path, putting up a number of PRs in the process

6. Shipped 69 code simplification PRs, deleting >10k lines of code

sangeeth96 · 2026-05-28T18:15:23 1779992123

> Ported tree-sitter, color-diff, yoga-layout, and a number of other WASM and Rust native modules to TypeScript, improving CPU and memory use by 2-10x in the process

Curious to learn more on this (unless there’s a write-up in the works). I’m naive on this matter but:

1. is this because it’s higher cost when passing objects back and forth across the JS/native boundary? 2. Does this have anything more specific to do with use of Bun? 3. is the stance for claude code then to keep all the deps in raw TypeScript? 4. How do you folks keep these ported deps up-to-date?

guybedo · 2026-05-28T19:47:15 1779997635

this feels more like a PR statement than a description of how you used the tool though

verve_rat · 2026-05-28T21:40:56 1780004456

None of those are helpful examples we could mimic to figure out how to use the tools.

This reads like a CV, not trying to help or educate.

mkw5053 · 2026-05-28T17:58:38 1779991118

Very cool. What % of the CC team's engineering would you say goes into QoL (as opposed to new feature development)? Obviously some live in a grey area, while others are more clear like making CI faster.

rahkiin · 2026-05-28T17:43:00 1779990180

You _reduced_ its _efficiency_? Why do you make CC more inefficient?

isoprophlex · 2026-05-28T18:02:30 1779991350

Maxxing everything is all the rage. Gotta cpumaxx or bossman isnt getting his money's worth

bcherny · 2026-05-28T17:47:51 1779990471

Typo! Edited

theLiminator · 2026-05-28T19:16:50 1779995810

Is there not a reason to instead port claude code to rust? Do you have internal benchmarks that show that claude code is better at typescript than rust?

JimJohn4292 · 2026-05-28T18:02:40 1779991360

Boris, what are your thoughts on WASM as a technology and its practical implications for AI in the future?

vblanco · 2026-05-28T18:27:46 1779992866

I have my own version and the workflow keyword conflicts with it rather heavily. Will there be a way to disable that prompt section/keyword?

bcherny · 2026-05-28T18:56:51 1779994611

Yep! Set disableWorkflows:true in your settings.json
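If you would rather script that change than edit the file by hand, a small Python helper along these lines would work. Only the `disableWorkflows` key comes from the reply above; the function name is hypothetical, and the location of your settings.json is left to you because the thread does not specify it.

```python
import json
from pathlib import Path


def set_disable_workflows(settings_path: Path, disabled: bool = True) -> dict:
    # Load existing settings if the file is present so other keys are preserved.
    settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
    settings["disableWorkflows"] = disabled
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings
```

Call it with the path to your own settings.json, for example `set_disable_workflows(Path("settings.json"))`.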

vblanco · 2026-05-28T19:28:10 1779996490

thank you

stvpwrs · 2026-05-28T17:43:12 1779990192

Will workflows be reusable? I have a big use case of sharable and repeatable workflows for projects. Especially if this comes to Cowork.

bcherny · 2026-05-28T17:47:25 1779990445

andrewmutz · 2026-05-28T17:59:11 1779991151

Any idea how soon dynamic workflows might be available in Cowork?

m0meni · 2026-05-28T17:30:59 1779989459

What language are the workflows in? Curious what you settled on. And are they running in the cloud or locally?

bcherny · 2026-05-28T17:32:41 1779989561

JavaScript, running locally or in the cloud.

gfunk911 · 2026-05-28T21:52:05 1780005125

How much overlap do you feel like dynamic workflows have with RLMs?

firemelt · 2026-05-29T19:05:24 1780081524

Hi Boris, I'm confused: should I keep using /feature-dev or workflows if I want to add a feature to my app?

franze · 2026-05-28T18:10:36 1779991836

just wanted to say thank you, just did a 2 days "ai computer use" workshop - think a virtual desktop on hetzner with claude code in yolo mode, a github account, vercel and logged in into a google account and claude had all the credentials and then let a mix of marketing / product manager / sales / customer support let loose. 2k token budget ... and just let them see do magic again and again.

thx for all that amazing tec and save ai

bcherny · 2026-05-27T16:57:35 1779901055

Totally. You can do that now, and Claude will know to use /code-review.

bcherny · 2026-05-27T16:10:18 1779898218

Hey, Boris from the CC team here. I agree, we're working on consolidating these. Going forward it will just be the built-in /code-review skill.

Here's how to use the skill on the latest version:

/code-review # do a balanced code review. checks for bugs and inconsistencies, poor code quality, duplication, band aids, etc.

/code-review --fix # same as above, but also fix the issues

# choose an explicit effort level (defaults to your current effort level). all of these also accept --fix:

/code-review low

/code-review medium

/code-review high

/code-review xhigh

/code-review max

# do an expensive and extremely thorough review (reliably catches >99% of bugs, costs $3-20 per review depending on complexity):

/code-review ultra

Open to feedback or ideas for how to make these even nicer to use.

bix6 · 2026-05-27T16:25:23 1779899123

Hi Boris, what is the advantage of using /code-review vs just asking Opus to “code review”?

As a casual user working on hobby projects, I struggle to keep up with the pace of changes and knowing what to use when. My default now is to use Opus for all coding (sonnet is fine but seems dumber) and to prompt it for everything I need. I’ve had great success with this but clearly I’m missing power user functions with the slash commands and such.

extr · 2026-05-27T16:30:33 1779899433

The advantage is that /code-review supplies a structured idea of how to review and what that process should look like and then launches independent subagents to approach the issue from multiple angles.

It's analogous to how in the early days you could see benefits by telling the models to "think step by step". /code-review is something like "review angle by angle". "Consider removed behavior" and also "Look at language gotchas" and also "Look at test changes"...etc. Yes these are all somewhat implicitly already part of what "code review" means, but the models perform best with explicitness.

If you want my 2c as a power user: just don't think about it and use /code-review xhigh --fix. This will cover like 98% of what you want out of code review. It's a good skill.

HlessClaudesman · 2026-05-27T17:35:20 1779903320

We've all spent time -fixing someone's bright idea of a -fix. I'm sceptical of the time saving of applying a -fix before I understand the problem(s).

Outsourcing comprehension to a machine is probably gonna cost you more time in the long run.

extr · 2026-05-27T18:37:48 1779907068

I don't even bother looking at the code until I've run a code review pass on it. Why waste my time with trivial bug fixes? I find the best way to spend time right now is like:

- Defining the issue/ticket, what "success" looks like (if I have a good idea of this), high level approach guidance 50%

- Dispatch agent to work on it 5%

- Occasionally return and nudge agent + send /simplify or /code-review 5%

- Look at the code/session summary, divergences from the plan, ask followup questions 40%

Occasionally yes there is some solution the AI chose that is suboptimal and I would prefer fixed in a different way. Mostly though it's straightforward.

bix6 · 2026-05-27T16:45:40 1779900340

Thank you I will try this!

Is there something equivalent when coding in the first place? Eg /code high “prompt”

extr · 2026-05-27T18:39:22 1779907162

Are you thinking of the /effort level in Claude Code? I would just go with xhigh as a reasonable default. Most important thing in prompting is specifying what "done" and "success" looks like to you. Ask Claude to help you come up with a well formed request and spend most of your time on that, then paste that into a brand new session.

bix6 · 2026-05-28T00:24:17 1779927857

No more like is there a specific slash tool to be using when coding or planning. I guess that’s just Claude code in general but since there’s a specific review tool I was curious about specific coding tools

sdevonoes · 2026-05-27T22:53:21 1779922401

It’s simpler to just use “review code”. It’s also way cheaper

pverheggen · 2026-05-27T16:55:54 1779900954

As a general rule, I'd give the Markdown a read for any skills/commands you might find useful, it'll give you a good idea of the specifics it adds.

https://github.com/anthropics/claude-code/blob/main/plugins/...

bcherny · 2026-05-27T16:31:14 1779899474

/code-review has a specific prompt that we've found is a good balance of precision, recall, and cost. You could totally roll your own prompt also.

bix6 · 2026-05-27T16:44:08 1779900248

And why would someone use the various levels? Is a low code review even worth running? And how do I know what level to use in the first place?

This stuff all seems so nebulous to me and I’ve yet to see anything that says use x in y situation. So I default to higher effort levels than I likely need.

mil22 · 2026-05-27T17:08:49 1779901729

Hey Boris, thanks for the great product and for listening!

I find the mix between slash commands that are programmatic harness configuration and control commands (/config, /model, /feedback, /fork, /usage, etc.) and ones that are little more than prompt template insertion (/code-review, /<skill>, etc.) to be a little confusing and unnecessary. A slash command should be one thing, and one thing only: a command for the harness, not the agent.

When I invoke a slash command like /code-review, I should be invoking some additional harness functionality, something above and beyond the agent's sphere of influence - not just pasting some hidden text into the next turn. Otherwise, why wouldn't I just say "Claude, review this code"?

Yet most of these "added value" commands bloating the slash command list, are just shortcuts for copy and paste. I don't want to go to have to learn the syntax of a special /code-review command (which options are positional args, which are --flags, etc.), and I'm much less likely to use or even be aware of a command like this, when I can just ask "Do a balanced code review and fix the issues", or use the GUI to set the effort level to xhigh before asking "Review my code." That way I can also be more specific about exactly what I need, rather than relying on what's in the canned prompt - a prompt which I'll probably never read and vet myself anyway. The value added by the slash command needs to be really high compared to just typing a prompt, for it to justify the friction of discovery and learning the syntax.

So I suppose I'm advocating for a different system. Keep slash commands for meta-level harness control and configuration, and add a new mechanism for canned prompt insertion, one which is tailor made for that purpose rather than overloading the slash command system. Let the user see what's in the canned prompts, and even make adjustments or edits as needed before sending them, one-time or persisted. Provide a GUI in the app with the user's favorite prompts, where the user can add, delete, and edit them, making it easy to invoke and insert them as needed. Or let the agent automatically discover and use them as needed, rather than requiring the user to remember and recall their magic shortcuts and their arguments. That's just one idea.

Skills, plugins, commands, and so on, need to be consolidated not just for code review of course but across the full architecture of how prompt templates are managed.

wonkyfruit · 2026-05-28T15:04:37 1779980677

What clicked for me recently was treating skills as composable. Having meta-skills that call smaller skills in order. The "skill vs command vs subagent" confusion partly dissolves once you let skills call other skills. The meta-skill holds the workflow state, the smaller ones each do one job well.

8note · 2026-05-27T18:02:32 1779904952

> # do an expensive and extremely thorough review (reliably catches >99% of bugs, costs $3-20 per review depending on complexity):

/code-review ultra

main suggestion would be to sound a lot less optimistic about it finding 99% of bugs or being at all thorough, and instead state that it is time capped, and will only find bugs that you explicitly tell it to look for.

i used my three runs of ultrareview.

the first run with no other prompting found a couple typos in markdown only

the second one i prompted it with several themes of known open bugs in the code, and it found 6 items

and then the third one i ran after doing an actual long audit through gemini to make a much more detailed prompt about issues in the code

and for that one, instead of doing an exhaustive run, it just never started, so no idea if it worked

but the experience had no relation at all with the reliability or thoroughness claims

bmitc · 2026-05-27T20:43:51 1779914631

Why doesn't Claude invoke LSPs? It always asks to install them, but then it never uses them, as mentioned in the comment you replied to.

extr · 2026-05-27T16:24:41 1779899081

Hey Boris, some feedback. I like the new /code-review skill but was disappointed you guys removed /simplify because I quite liked the focus on finding code reuse/efficiency opportunities.

I see now in 2.1.152 you added those focus areas back to /code-review, but still bundled with the correctness finding. It would be great to have more fine grained control over the /code-review angles beyond just effort level. Or maybe you would recommend that I just specify that as freeform input after effort level?

bcherny · 2026-05-27T16:57:02 1779901022

Yep, you can add free-form input. Will update /simplify to only check for code quality and not bugs (the way it used to work), that's a good suggestion.

extr · 2026-05-28T18:16:06 1779992166

Damn already there in 154. Thank you man.

arps18 · 2026-05-27T17:08:19 1779901699

Thanks, Boris, for reading and reviewing :)

svieira · 2026-05-27T17:39:48 1779903588

> reliably catches >99% of bugs

In what scope?

bcherny · 2026-04-23T19:05:49 1776971149

Hey, Boris from the team here.

We did both -- we did a number of UI iterations (eg. improving thinking loading states, making it more clear how many tokens are being downloaded, etc.). But we also reduced the default effort level after evals and dogfooding. The latter was not the right decision, so we rolled it back after finding that UX iterations were insufficient (people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this).

big_toast · 2026-04-23T20:53:00 1776977580

Having a "Recovery Mode"/"Safe Boot" flag to disable our configurations (or progressively enable) to see how claude code responds would be nice. Sometimes I get worried some old flag I set is breaking things. Maybe the flag already exists? I tried Claude doctor but it wasn't quite the solution.

For instance:

Is Haiku supposed to hit a warm system-prompt cache in a default Claude code setup?

I had `DISABLE_TELEMETRY=1` in my env and found the haiku requests would not hit a warm-cached system prompt. E.g. on first request just now w/ most recent version (v2.1.118, but happened on others):

w/ telemetry off - input_tokens:10 cache_read:0 cache_write:28897 out:249

w/ telemetry on - input_tokens:10 cache_read:24344 cache_write:7237 out:243

I used to think having so many users was leading to people hitting a lot of edge cases, 3 million users is 3 million different problems. Everyone can't be on the happy path. But then I started hitting weird edge cases and started thinking the permutations might not be under control.
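The two usage lines in this comment can be compared with a quick Python calculation. It assumes the three counters are disjoint, so that the request's total input is their sum; the function name is hypothetical.

```python
def cache_hit_ratio(input_tokens: int, cache_read: int, cache_write: int) -> float:
    # Share of the request's input tokens that were served from the prompt cache.
    total = input_tokens + cache_read + cache_write
    return cache_read / total if total else 0.0


# Numbers copied from the two log lines quoted in the comment.
telemetry_off = cache_hit_ratio(input_tokens=10, cache_read=0, cache_write=28897)
telemetry_on = cache_hit_ratio(input_tokens=10, cache_read=24344, cache_write=7237)

print(f"telemetry off: {telemetry_off:.1%} of input read from cache")
print(f"telemetry on:  {telemetry_on:.1%} of input read from cache")
```

Under that assumption, the first request reads nothing from cache while the second reads roughly three quarters of its input from cache.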

EugeneOZ · 2026-04-23T20:52:23 1776977543

> people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this

UI is UI. It is naive to expect that you build some UI but users will "just magically" find out that they should use it as a terminal in the first place.

taytus · 2026-04-23T23:22:41 1776986561

“after evals and dogfooding” couldn’t have done this before releasing the model? We are paying $200/month to beta test the software for you.

abtinf · 2026-04-23T22:41:36 1776984096

You didn’t anticipate most people stick with defaults?

bcherny · 2026-04-24T05:49:01 1777009741

We anticipated the default would be the best option for most people. We were wrong, so we reverted the default.

troupo · 2026-04-24T20:44:49 1777063489

It took you a month to revert after multiple complaints. You still blamed users for using the product exactly as you advertised it. And all of your official channels were completely quiet for two months, whether it was about new draconian peak hour limits, or about the new defaults, or about exponentially increasing token costs.

People literally started seeing issues immediately as you changed the defaults: https://x.com/levelsio/status/2029307862493618290 And despite a huge amount of reports you still kept it for a whole month.

And then you shipped a completely untested feature with prompt cache misses and literally gaslit users and blamed users for using the product as advertised.

Oh. Remember this https://x.com/bcherny/status/2024152178273989085? "We move fast but test carefully"?

Now an untold number of people have been hit by these changes, so as an apology you reset usage limits three hours before they would reset anyway.

Good job.

Edit. By the way, a very telling sentence from the report:

--- start quote ---

We’ll ensure that a larger share of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features); and we'll make improvements to our Code Review tool that we use internally

--- end quote ---

Translation: no one is using or even testing the product we ship, and we blindly trust Claude Code to review and find bugs for us. Last one isn't even a translation: https://x.com/bcherny/status/2017742750473720121

krade · 2026-04-23T23:22:57 1776986577

Off topic, but I'm hoping you'll maybe see this. There's been an issue with the VS Code extension that makes it pretty much impossible to use (PreToolUse can't intercept permission requests anymore, and using PermissionRequest hooks always opens the diff viewer and steals focus):

https://github.com/anthropics/claude-code/issues/36286 https://github.com/anthropics/claude-code/issues/25018

bcherny · 2026-04-23T19:02:42 1776970962

Hey, Boris from the Claude Code team here.

Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.

The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.
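As a rough model of the behavior described so far, here is a Python sketch. It is a simplification under stated assumptions, not Anthropic's accounting: the function and field names are hypothetical, and the one-hour threshold is taken from the ">1 hour" figure above.

```python
from dataclasses import dataclass

CACHE_TTL_SECONDS = 3600  # the ">1 hour" idle window described above


@dataclass
class TurnCost:
    cache_read: int   # tokens served from the prompt cache
    cache_write: int  # tokens written (or rewritten) to the cache


def turn_cost(prior_tokens: int, new_tokens: int, idle_seconds: float) -> TurnCost:
    if idle_seconds > CACHE_TTL_SECONDS:
        # Full miss: the whole conversation is written to cache again.
        return TurnCost(cache_read=0, cache_write=prior_tokens + new_tokens)
    # Warm cache: everything but the latest message is read from cache.
    return TurnCost(cache_read=prior_tokens, cache_write=new_tokens)
```

With 900k tokens of prior context, `turn_cost(900_000, 500, idle_seconds=60)` writes only the 500 new tokens, while the same call after a two-hour gap writes all 900,500.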

We tried a few different approaches to improve this UX:

1. Educating users on X/social

2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)

3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.
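Approach 3 can be pictured with a short Python sketch. It is illustrative only: the message shape (a dict with a `content` list of typed blocks) and the function name are assumptions, and it does not reproduce the shipped logic or the bug mentioned above.

```python
def elide_thinking_after_idle(
    messages: list[dict], idle_seconds: float, ttl_seconds: float = 3600
) -> list[dict]:
    # Within the idle window the cache is still warm, so leave the context alone.
    if idle_seconds <= ttl_seconds:
        return messages
    # After a long idle, drop thinking blocks so fewer tokens are rewritten to cache.
    return [
        {**m, "content": [b for b in m["content"] if b["type"] != "thinking"]}
        for m in messages
    ]
```

The same filter could target old tool results or old messages instead; which blocks are safe to drop is a product decision the thread does not spell out.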

Hope this is helpful. Happy to answer any questions you have.

dbeardsl · 2026-04-23T19:28:11 1776972491

I appreciate the reply, but I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.

I feel like that is a choice best left up to users.

i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"

giwook · 2026-04-23T21:51:25 1776981085

Another way to think about it might be that caching is part of Anthropic's strategy to reduce costs for its users, but they are now trying to be more mindful of their costs (probably partly due to significant recent user growth as well as plans to IPO which demand fiscal prudence).

Perhaps if we were willing to pay more for our subscriptions Anthropic would be able to have longer cache windows but IDK one hour seems like a reasonable amount of time given the context and is a limitation I'm happy to work around (it's not that hard to work around) to pay just $100 or $200 a month for the industry-leading LLM.

Full disclosure: I've recently signed up for ChatGPT Pro as well in addition to my Claude Max sub so not really biased one way or the other. I just want a quality LLM that's affordable.

jimkleiber · 2026-04-24T00:48:23 1776991703

I might be willing to pay more, maybe a lot more, for a higher subscription than claude max 20x, but the only thing higher is pay per token and i really dont like products that make me have to be that minutely aware of my usage, especially when it has unpredictability to it. I think there's a reason most telecoms went away from per minute or especially per MB charging. Even per GB, as they often now offer X GB, and im ok with that on phone but much less so on computer because of the unpredictability of a software update size.

Kinda like when restaurants make me pay for ketchup or a takeaway box, i get annoyed, just increase the compiled price.

giwook · 2026-04-24T22:43:29 1777070609

For sure, I agree with that sentiment. It's interesting to consider the psychological component of that, like how "free shipping" is not really free, it's oftentimes just packaged into the price of the product but somehow it feels like we're getting a better deal.

I would not be surprised to see Anthropic, OpenAI etc head in the direction you mention as they mature and all of these datacenters currently undergoing construction come online in the next few years and drive down costs.

adam_patarino · 2026-04-24T11:51:31 1777031491

Token anxiety is real mental overhead.

jimkleiber · 2026-04-24T17:38:21 1777052301

That's the phrase i was looking for, thank you.

sharts · 2026-04-23T22:50:14 1776984614

That doesn’t make sense to pay more for cache warming. Your session for the most part is already persisted. Why would it be reasonable to pay again to continue where you left off at any time in the future?

jeremyjh · 2026-04-23T23:22:36 1776986556

Because it significantly increases actual costs for Anthropic.

If they ignored this then all users who don’t do this much would have to subsidize the people who do.

tikkabhuna · 2026-04-24T05:53:38 1777010018

I’m coming at this as a complete Claude amateur, but caching for any other service is an optimisation for the company and transparent for the user. I don’t think I’ve ever used a service and thought “oh there’s a cache miss. Gotta be careful”.

I completely agree that it’s infeasible for them to cache for long periods of time, but they need to surface that information in the tools so that we can make informed decisions.

libraryofbabel · 2026-04-24T08:19:08 1777018748

That is because LLM KV caching is not like caches you are used to (see my other comments, but it's 10s of GB per request and involves internal LLM state that must live on or be moved onto a GPU and much of the cost is in moving all that data around). It cannot be made transparent for the user because the bandwidth costs are too large a fraction of unit economics for Anthropic to absorb, so they have to be surfaced to the user in pricing and usage limits. The alternative is a situation where users whose clients use the cache efficiently end up dramatically subsidizing users who use it inefficiently, and I don't think that's a good solution at all. I'd much rather this be surfaced to users as it is with all commercial LLM apis.

theshrike79 · 2026-04-24T13:25:45 1777037145

Think of it like this: Anthropic has to keep a full virtual machine running just for you. How long should it idle there taking resources when you only pay a static monthly fee and not hourly?

They have a limited number of resources and can’t keep everyone’s VM running forever.

prirun · 2026-04-24T16:15:08 1777047308

I pay $5/mo to Vultr for a VM that runs continuously and maintains 25GB of state.

jlokier · 2026-04-24T17:08:42 1777050522

That price at Vultr gets you 1GB of RAM, and 25GB of relatively slow SSD.

The KV cache of your Claude context is:

- Potentially much larger than 25GB. (The KV cache sizes you see people quoting for local models are for smaller models.)

- While it's being used, it's all in RAM.

- Actually it's held in special high-performance GPU RAM, precision-bonded directly to the silicon of ludicrously expensive, state of the art GPUs.

- The KV state memory has to be many thousands of times faster than your 25GB state.

- It's much more expensive per GB than the CPU memory used by a VM. And that in turn is much more expensive than the SSD storage of your 25GB.

- Because Claude is used by far more people (and their agents) than rent VMs, far more people are competing to use that expensive memory at the same time

There is a lot going on to move KV cache state between GPU memory and dedicated, cheaper storage, on demand as different users need different state. But the KV cache data is so large, and used in its entirety when the context is active, that moving it around is expensive too.

pixl97 · 2026-04-24T18:42:24 1777056144

Now check out the cost difference in 25GB of computer RAM vs GPU RAM.

And yes, this is also why computer RAM has jumped the shark in costs.

The bandwidth differences in total data transferred per hour aren't even in the same 5 orders of magnitude between your server and the workloads LLMs are doing. And this is why the compute and power markets are totally screwed.

PeterStuer · 2026-04-24T16:50:35 1777049435

It does not. It just has a fast way to give you the illusion it "runs continuously" with 25GB of warm memory.

Tbh, I'm not sure paged vram could solve this problem for an (assumed) huge cache miss system such as a major LLM server

danso · 2026-04-24T02:15:03 1776996903

Genuine question: is the cost to keep a persistent warmed cache for sessions idling for hours/days not significant when done for hundreds of thousands of users? Wouldn’t it pose a resource constraint on Anthropic at some point?

tmountain · 2026-04-24T09:23:44 1777022624

Related question, is it at all feasible to store cache locally to offload memory costs and then send it over the wire when needed?

dev_hugepages · 2026-04-24T11:45:16 1777031116

No, the cache is a few GB large for most usual context sizes. It depends on model architecture, but if you take Gemma 4 31B at 256K context length, it takes 11.6GB of cache

note: I picked the values from a blog and they may be inaccurate, but in pretty much all models the KV cache is very large, it's probably even larger in Claude.

libraryofbabel · 2026-04-24T15:02:24 1777042944

To extend your point: it's not really the storage costs of the size of the cache that's the issue (server-side SSD storage of a few GB isn't expensive), it's the fact that all that data must be moved quickly onto a GPU in a system in which the main constraint is precisely GPU memory bandwidth. That is ultimately the main cost of the cache. If the only cost was keeping a few 10s of GB sitting around on their servers, Anthropic wouldn't need to charge nearly as much as they do for it.

tedivm · 2026-04-24T16:23:23 1777047803

That cost that you're talking about doesn't change based on how long the session is idle. No matter what happens they're storing that state and bringing it back at some point, the only difference is how long it's stored out of GPU between requests.

libraryofbabel · 2026-04-24T16:54:31 1777049671

Are you sure about that? They charge $6.25 / MTok for 5m TTL cache writes and $10 / MTok for 1hr TTL writes for Opus. Unless you believe Anthropic is dramatically inflating the price of the 1hr TTL, that implies that there is some meaningful cost for longer caches and the numbers are such that it's not just the cost of SSD storage or something. Obviously the details are secret but if I was to guess, I'd say the 5m cache is stored closer to the GPU or even on a GPU, whereas the 1hr cache is further away and costs more to move onto the GPU. Or some other plausible story - you can invent your own!
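
To put rough numbers on that price gap, here is a minimal sketch that uses only the two per-MTok prices quoted above. The 200k-token context is a made-up figure for illustration, not anything Anthropic publishes about a typical session, and the helper name is hypothetical.

```python
# Prices per million tokens (MTok) for Opus cache writes, as quoted in the comment above.
PRICE_5M_TTL = 6.25
PRICE_1H_TTL = 10.00


def cache_write_cost(tokens: int, price_per_mtok: float) -> float:
    """Dollar cost of writing `tokens` tokens to the prompt cache."""
    return tokens * price_per_mtok / 1_000_000


# Hypothetical context size, chosen only for illustration.
context_tokens = 200_000
print(f"5m TTL write: ${cache_write_cost(context_tokens, PRICE_5M_TTL):.2f}")
print(f"1h TTL write: ${cache_write_cost(context_tokens, PRICE_1H_TTL):.2f}")
```

Whatever the context size, the 1hr write costs 1.6x the 5m write at these prices, which is the gap that needs explaining.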

tedivm · 2026-04-24T21:27:15 1777066035

Storing on GPU would be the absolute dumbest thing they could do. Locking up the GPU memory for a full hour while waiting for someone else to make a request would result in essentially no GPU memory being available pretty rapidly. This type of caching is available from the cloud providers as well, and it isn't tied to a single session or GPU.

libraryofbabel · 2026-04-25T04:28:00 1777091280

> Storing on GPU would be the absolute dumbest thing they could do

No. It’s not dumb. There will be multiple cache tiers in use, with the fastest and most expensive being on-GPU VRAM with cache-aware routing to specific GPUs and then progressive eviction to CPU ram and perhaps SSD after that. That is how vLLM works as you can see if you look it up, and you can find plenty of information on the multiple tiers approach from inference providers e.g. the new Inference Engineering book by Philip Kiely.

You are likely correct that the 1hr cached data probably mostly doesn’t live on GPU (although it will depend on capacity, they will keep it there as long as they can and then evict with an LRU policy). But I already said that in my last post.
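
For anyone who wants to see the tiering idea concretely, here is a toy sketch. It assumes three tiers (GPU, CPU RAM, SSD) with tiny made-up capacities and plain LRU demotion; `TieredCache` and its methods are hypothetical names, and this is not how vLLM or Anthropic actually implement it.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier cache: an entry evicted from a fast tier is demoted to the next one."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: fully evicted
        _, cap, store = self.tiers[level]
        store[key] = value  # newest entry goes to the end
        if len(store) > cap:
            # Demote the least recently used entry to the next (slower) tier.
            old_key, old_value = store.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def _remove(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name, store.pop(key)
        return None, None

    def put(self, key, value):
        self._remove(key)
        self._insert(0, key, value)

    def get(self, key):
        """Return (tier_it_was_found_in, value) and promote it; (None, None) on a miss."""
        name, value = self._remove(key)
        if name is not None:
            self._insert(0, key, value)
        return name, value

    def where(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None


cache = TieredCache([("gpu", 1), ("cpu", 1), ("ssd", 1)])
for session in ["a", "b", "c", "d"]:
    cache.put(session, f"kv-state-{session}")
print({s: cache.where(s) for s in "abcd"})
```

The older a session's last request, the further from the GPU its state ends up, and the oldest one is dropped entirely. A hit in a slow tier means moving the data back up, which is where the real cost is.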

bavell · 2026-04-24T12:19:23 1777033163

Yesterday I was playing around with Gemma4 26B A4B with a 3 bit quant and sizing it for my 16GB 9070XT:

  Total VRAM: 16GB
  Model: ~12GB
  128k context size: ~3.9GB

At least I'm pretty sure I landed on 128k... might have been 64k. Regardless, you can see the massive weight (ha) of the meager context size (at least compared to frontier models).

cadamsdotcom · 2026-04-24T00:43:52 1776991432

Sure, it wouldn’t make sense if they only had one customer to serve :)

uoaei · 2026-04-24T09:44:11 1777023851

Exactly, even in the throes of today's wacky economic tides, storage is still cheap. Write the model state immediately after the N context messages in cache to disk and reload without extra inference on the context tokens themselves. If every customer did this for ~3 conversations per user you still would only need a small fraction of a typical datacenter to house the drives necessary. The bottleneck becomes architecture/topology and the speed of your buses, which are problems that have been contended with for decades now, not inference time on GPUs.

jeremyjh · 2026-04-24T10:30:07 1777026607

This has nothing to do with the cost of storage. Surprisingly, you are not better informed than Anthropic on the subject of serving AI inference models.

A sibling comment explains:

https://news.ycombinator.com/item?id=47886200

uoaei · 2026-04-26T21:21:09 1777238469

They don't cache model state to disk. I am proposing they do.

jeremyjh · 2026-04-26T22:47:30 1777243650

I’m proposing that you should educate yourself on the subject of LLM KV context caching.

PeterStuer · 2026-04-24T16:47:03 1777049223

It may be persisted but it is not live in the inference engine.

Folcon · 2026-04-26T11:18:29 1777202309

The reason I've been querying the 1 hour is that a user's quota reset is often longer than that. As a result, I've seen situations where someone builds a large context, hits their quota limit, and waits 2+ hours; their cache is gone, and their first message then eats 20%+ of their current session quota. The user doesn't want to compact, as they're still trying to get the model into a good understanding of the problem. This is a really painful consequence for users on anything less than a Max plan, and it seems like an unintended consequence of Anthropic's own system design choices?

I.e., the way their quota and caching interact doesn't make Pro and Max a little different; it makes them significantly different by unintentionally penalising Pro users.

JumpCrisscross · 2026-04-23T19:39:23 1776973163

> I was never under the impression that gaps in conversations would increase costs

The UI could indicate this by showing a timer before context is dumped.

vyr · 2026-04-23T23:42:09 1776987729

a countdown clock telling you that you should talk to the model again before your streak expires? that's the kind of UX i'd expect from an F2P mobile game or an abandoned shopping cart nag notification

abustamam · 2026-04-23T23:46:50 1776988010

Well sure if you put it that way, they're similar. But it's either you don't see it and you get surprised by increased quota usage, or you do see it and you know what it means. Bonus points if they let you turn it off.

No need to gamify it. It's just UI.

thinkmassive · 2026-04-24T02:30:03 1776997803

Plenty of room for a middle ground, like a static timestamp per session that shows expiration time, without the distraction of a constantly changing UI element.
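
A static label like that is cheap to compute. Here is a minimal sketch, assuming the one-hour window discussed in this thread and assuming the window restarts on each request; the function names are hypothetical and nothing here is a real Claude Code statusline API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the one-hour cache window discussed in this thread.
CACHE_TTL = timedelta(hours=1)


def cache_expiry(last_request: datetime, ttl: timedelta = CACHE_TTL) -> datetime:
    """Time at which the cache would lapse if no further request is made."""
    return last_request + ttl


def expiry_label(last_request: datetime, ttl: timedelta = CACHE_TTL) -> str:
    """Static text for a status line; it only changes when a new request is sent."""
    return f"cache expires at {cache_expiry(last_request, ttl):%H:%M}"


print(expiry_label(datetime.now(timezone.utc)))
```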

matheusmoreira · 2026-04-24T04:04:03 1777003443

Why not an automated ping message that's cheap for the model to respond to?

cortesoft · 2026-04-24T04:34:17 1777005257

Because the cache is held on Anthropic's side, and they aren't going to hold your context in cache indefinitely.

karsinkk · 2026-04-23T19:43:59 1776973439

Yes!! A UI widget that shows how far along on the prompt cache eviction timelines we are would be great.

vanviegen · 2026-04-24T08:47:01 1777020421

That sounds stressful.

But perhaps Claude Code could detect that you're actively working on this stuff (like typing a prompt or accessing the files modified by the session), and send keep-cache-alive pings based on that? Presumably these pings could be pretty cheap, as the kv-cache wouldn't need to be loaded back into VRAM for this. If that would work reliably, cache expiry timeouts could be more aggressive (5 min instead of an hour).

jimkleiber · 2026-04-24T00:44:15 1776991455

I tried to hack the statusline to show this but when i tried, i don't think the api gave that info. I'd love if they let us have more variables to access in the statusline.

kiratp · 2026-04-24T01:37:56 1776994676

By caching they mean “cached in GPU memory”. That’s a very very scarce resource.

Caching to RAM and disk is a thing but it’s hard to keep performance up with that and it’s early days of that tech being deployed anywhere.

Disclosure: work on AI at Microsoft. Above is just common industry info (see work happening in vLLM for example)

libraryofbabel · 2026-04-24T06:06:21 1777010781

Nit: It doesn’t have to live in GPU memory. The system will use multiple levels of caching and will evict older cached data to CPU RAM or to disk if a request hasn’t recently come in that used that prefix. The problem is, the KV caches are huge (many GB) and so moving them back onto the GPU is expensive: GPU memory bandwidth is the main resource constraint in inference. It’s also slow.

The larger point stands: the cache is expensive. It still saves you money but Anthropic must charge for it.

Edit: there are a lot of comments here where people don't understand LLM prefix caching, aka the KV cache. That's understandable: it is a complex topic and the usual intuitions about caching you might have from e.g. web development don't apply: a single cache blob for a single request is in the 10s of GB at least for a big model, and a lot of the key details turn on the problems of moving it in and out of GPU memory. The contents of the cache are internal model state; it's not your context or prompt or anything like that. Furthermore, this isn't some Anthropic-specific thing; all LLM inference with a stable context prefix will use it because it makes inference faster and cheaper. If you want to read up on this subject, be careful as a lot of blogs will tell you about the KV cache as it is used within inference for a single request (a critical concept in how LLMs work) but they will gloss over how the KV cache is persisted between requests, which is what we're all talking about here. I would recommend Philip Kiely's new book Inference Engineering for a detailed discussion of that stuff, including the multiple caching levels.

computably · 2026-04-23T19:39:18 1776973158

> I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.

You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.

doesnt_know · 2026-04-23T20:27:55 1776976075

How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?

You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.

dlivingston · 2026-04-24T02:53:03 1776999183

What is being discussed is KV caching [0], which is used across every LLM model to reduce inference compute from O(n^2) to O(n). This is not specific to Claude nor Anthropic.

[0]: https://huggingface.co/blog/not-lain/kv-caching

computably · 2026-04-24T03:08:26 1777000106

> How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?

1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.

2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.

> You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.

I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.

tempest_ · 2026-04-24T01:04:30 1776992670

I use CC, and I understand what caching means.

I have no idea how that works with a LLM implementation nor do I actually know what they are caching in this context.

libraryofbabel · 2026-04-24T08:01:29 1777017689

They are caching internal LLM state, which is in the 10s of GB for each session. It's called a KV cache (because the internal state that is cached are the K and V matrices) and it is fundamental to how LLM inference works; it's not some Anthropic-specific design decision. See my other comment for more detail and a reference.

hakanderyal · 2026-04-24T04:06:42 1777003602

CC can explain it clearly, which is how I learned about how the inference stack works.

fragmede · 2026-04-24T06:34:24 1777012464

> 99.99% of users won't even understand the words that are being used.

That's a bad estimate. Claude Code is explicitly a developer-shaped tool; we're not talking generically about ChatGPT here, so my guess is that probably closer to 75% of those users do understand what caching is, with maybe 30% being able to explain what prompt caching actually is. Of course, those users that don't understand have access to Claude and can have it explain what caching is to them if they're interested.

solarkraft · 2026-04-23T20:23:04 1776975784

I somewhat disagree that this is due diligence. Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

mpyne · 2026-04-23T21:58:47 1776981527

> Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

Does mmap(2) educate the developer on how disk I/O works?

At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.

websap · 2026-04-24T00:44:54 1776991494

Does using print() in Python mean I need to understand the Kernel? This is an absurd thought.

Nevermark · 2026-04-24T07:21:22 1777015282

That might be an absurd comparison, but we can fix that.

If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs, then:

You wouldn’t “need” to understand. The prints would complete regardless. But you might want to. Personal preference.

Which is true of this issue too.

Barbing · 2026-04-24T07:36:01 1777016161

>If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs,

and the system was being run by some of the planet’s brightest people whose famous creation is well known to disseminate complex information succinctly,

>then:

You would expect to be led to understand, like… a 1997 Prius.

“This feature showed the vehicle operation regarding the interplay between gasoline engine, battery pack, and electric motors and could also show a bar-graph of fuel economy results.” https://en.wikipedia.org/wiki/Toyota_Prius_(XW10)

zem · 2026-04-23T22:50:00 1776984600

mmap(2) and all its underlying machinery are open source and well documented besides.

mpyne · 2026-04-23T23:04:45 1776985485

There are open-source and even open-weight models that operate in exactly this way (as it's based off of years of public research), and even if there weren't, the way that LLMs generate responses to inputs is superbly documented.

Seems like every month someone writes up a brilliant article on how to build an LLM from scratch or similar that hits the HN page, usually with fancy animated blocks and everything.

It's not at all hard to find documentation on this topic. It could be made more prominent in the U/I but that's true of lots of things, and hammering on "AI 101" topics would clutter the U/I for actual decision points the user may want to take action upon that you can't assume the user already knows about in the way you (should) be able to assume about how LLMs eat up tokens in the first place.

computably · 2026-04-24T03:37:32 1777001852

I would say this is abstracting the behavior.

margalabargala · 2026-04-23T21:00:36 1776978036

Okay, sure. There's a dollar/intelligence tradeoff. Let me decide to make it, don't silently make Claude dumber because I forgot about a terminal tab for an hour. Just because a project isn't urgent doesn't mean it's not important. If I thought it didn't need intelligence I would use Sonnet or Haiku.

pixl97 · 2026-04-24T19:00:24 1777057224

"Gets mad because there is no option"

"Gets mad because when there are options the defaults suck"

"Gets mad because the options start massively increasing costs to aerospace pricing"

margalabargala · 2026-04-24T19:33:50 1777059230

Did you mean to reply to someone else? Or do you misunderstand the issue?

There is no option to avoid auto-dumbing after one hour of idle. I haven't complained about the cost at all, I'm happy to pay it.

So yeah, I'm mad because there's no option. The other two you mentioned don't apply.

someguyiguess · 2026-04-23T20:24:15 1776975855

Yes. It’s perfectly reasonable to expect the user to know the intricacies of the caching strategy of their llm. Totally reasonable expectation.

jghn · 2026-04-23T23:27:16 1776986836

To some extent I'd say it is indeed reasonable. I had observed the effect for a while: if I walked away from a session I noticed that my next prompt would chew up a bunch of context. And that led me to do some digging, at which point I discovered their prompt caching.

So while I'd agree with your sarcasm that expecting users to be experts of the system is a big ask, where I disagree with you is that users should be curious and actively attempting to understand how it works around them. Given that the tooling changes often, this is an endless job.

abustamam · 2026-04-23T23:48:56 1776988136

> users should be curious and actively attempting to understand how it works

Have you ever talked with users?

> this is an endless job

Indeed. If we spend all our time learning what changed with all our tooling when it changes without proper documentation then we spend all our working lives keeping up instead of doing our actual jobs.

Octoth0rpe · 2026-04-24T01:03:45 1776992625

There are general users of the average SaaS, and there are claude code users. There's no doubt in my mind that our expectations should be somewhat higher for CC users re: memory. I'm personally not completely convinced that cache eviction should be part of their thought process while using CC, but it's not _that_ much of a stretch.

abustamam · 2026-04-24T04:07:51 1777003671

Personally I've never thought about cache eviction as it pertains to CC. It's just not something that I ever needed to think about. Maybe I'm just not a power user but I just use the product the way I want to and it just works.

troupo · 2026-04-24T05:36:26 1777008986

Anthropic literally advertises long sessions, 1M context, high reasoning etc.

And then their vibe-coders tell us that we are to blame for using the product exactly as advertised: https://x.com/lydiahallie/status/2039800718371307603 while silently changing how the product works.

Please stop defending hapless innocent corporations.

jghn · 2026-04-24T12:57:55 1777035475

This oversells how obfuscated it is. I'm far from a power user, and the opposite of a vibe coder. Yet I noticed the effect on my own just from general usage. If I can do it, anyone can do it.

troupo · 2026-04-24T17:01:11 1777050071

Here's Anthropic's own Boris Cherny and others telling how great everything is with long sessions and contexts: https://news.ycombinator.com/item?id=47886087

taormina · 2026-04-24T15:05:50 1777043150

Listen, no one cares if you think you’re smart for seeing through the lies of their marketing team. You’re being intentionally obtuse.

jghn · 2026-04-25T00:30:02 1777077002

My point is the opposite. I don't think my observation was smart, and I'm surprised that so many people here, a venue with a lot of people who use this stuff far more than I do, think it wasn't an easy thing to grok.

taormina · 2026-04-25T06:20:05 1777098005

You’re still intentionally missing the point. Everyone knows they are lying. It doesn’t excuse the lies!

jghn · 2026-04-25T12:44:51 1777121091

I’m not. Why would anyone believe marketing speak for any product? One should always assume that at best they’re fluffing their product up and more likely that they’re telling straight up lies

troupo · 2026-04-25T17:54:56 1777139696

1. False advertisement is a thing, to the point there are laws against it

2. They were caught blatantly lying, and you're literally telling everyone it's the users' fault for not digging into the black box that is Claude Code (and more so Anthropic's servers) and figuring its behavior for themselves. A behavior that suddenly changed on a March day [1] and which previously very few people ever needed to investigate.

[1] https://x.com/levelsio/status/2029307862493618290

jghn · 2026-04-26T15:57:02 1777219022

I'm not saying this is a great state of affairs. But I'm saying that it's so pervasive in daily life that yes, at least part of the blame lies on users for not taking this into account. As a developer it's important to at least try to understand the tools and libraries on which one relies. Relying on magic black boxes is not a good plan on the user's part, and they need to be defensive about this. Too many developers have been more than happy to hand the keys over to the AI assistants and hope for the best.

Also it wasn't completely undocumented, rather it was hiding in not-quite-plain sight. Which itself is a bit duplicitous, but again something that's far from unique on the part of Anthropic.

jghn · 2026-04-24T12:59:12 1777035552

> Have you ever talked with users?

I believe if one were to read my post it'd have been clear that I *am* a user.

This *is* "hacker" news after all. I think it's a safe assumption that people sitting here discussing CC are an inquisitive sort who want to understand what's under the hood of their tools and are likely to put in some extra time to figure it out.

abustamam · 2026-04-24T14:51:55 1777042315

We're inquisitive but at the end of the day many of us just want to get our work done. If it's a toy project, sure. Tinker away, dissect away. When my boss is breathing down my neck on why a feature is taking so long? No time for inquiries.

trinsic2 · 2026-04-24T18:12:01 1777054321

Agreed. Systems work the way they work. It's up to the user to determine what those limitations are. I don't like the concept of molding software based on every expectation a user has. Sometimes that expectation is unwarranted. You can see this in game development. Regardless of expressed criticism, sometimes gamers don't know what they want or what they need. A game should be developed according to the design goals of the team, not cater to every whim of the player base. We have seen where that can go.

coldtea · 2026-04-23T21:41:35 1776980495

It's not like they have a powerful all-knowing oracle that can explain it to them at their dispos... oh, wait!

esafak · 2026-04-23T21:50:18 1776981018

They have to know that this could bite them and to ask the question first.

nixpulvis · 2026-04-23T22:14:40 1776982480

I do think having some insight into the current state of the cache and a realistic estimate for prompt token use is something we should demand.

switchbak · 2026-04-24T02:03:27 1776996207

If there was an affordance on the TUI that made this visible and encouraged users to learn more - that would go a long way.

exac · 2026-04-23T22:07:11 1776982031

It is more useful to read posts and threads like this exact thread IMO. We can't know everything, and the currently addressed market for Claude Code is far from people who would even think about caching to begin with.

kang · 2026-04-23T21:17:18 1776979038

It seems you haven't done the due diligence on what part of the API is expensive - constructing a prompt shouldn't be same charge/cost as llm pass.

coldtea · 2026-04-23T21:40:28 1776980428

It seems you haven't done the due diligence on what the parent meant :)

It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.

It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.

kang · 2026-04-23T22:20:17 1776982817

You not only skipped the diligence but confused everyone repeating what I said :(

That is what caching is doing: the LLM inference state is being reused. (Attention vectors are an internal artefact; effectively, at this level of abstraction, it's the prompt.)

The part of the prompt that has already been inferred no longer needs to be a part of the input, to be replaced by the inference subset. And none of this is tokens.

coldtea · 2026-04-25T18:12:06 1777140726

>It seems you haven't done the due diligence on what part of the API is expensive - constructing a prompt shouldn't be same charge/cost as llm pass.

I think you missed what the parent meant then, and the confusing way you replied seemed to imply that they're not doing inference caching (the opposite of what you wanted to mean).

The parent didn't say that caching is needed merely to avoid reconstructing the prompt as a string. He just takes it for granted that it means inference caching, to avoid starting the session totally new. That's how I read "from prompting with the entire context every time" (not the mere string).

So when you answered as if they're wrong, and wrote "constructing a prompt shouldn't be same charge/cost as llm pass", you seemed to imply "constructing a prompt shouldn't be same charge/cost as llm pass [but due to bad implementation or overcharging it is]".

kang · 2026-04-26T03:32:03 1777174323

You are right, I was wrong in my understanding there. It stemmed from my own implementation; an inference often wrote extra data such as tool calls, so I was using it to preserve relevant information along with the desired output, to be able to throw away the prompt every time. I realize inference caching is one better way (with its pros and cons).

computably · 2026-04-24T02:41:45 1776998505

I said "prompting with the entire context every time," I think it should be clear even to laypersons that the "prompting" cost refers to what the model provider charges you when you send them a prompt.

kovek · 2026-04-23T21:05:30 1776978330

What if the cache was backed up to cold storage? Instead of having to recompute everything.

vanviegen · 2026-04-24T08:49:58 1777020598

They probably already do that. But these caches can get pretty big (10s of GBs per session), so that adds up fast, even for cold storage.

kovek · 2026-04-24T18:21:34 1777054894

10s of GBs? ( 1,000,000 context * 1,000 vector size ) ^ 2 = 1,000,000,000,000,000,000… oh wow.. I must be miscalculating

What about only storing the conversation and then recomputing the embeddings in the cache? Does that cost a lot? Doing a lot of matrix multiplication does not cost dollars of compute, especially on specialized hardware, right?

Majromax · 2026-04-24T19:08:22 1777057702

Context length 1e6, vector length 1e3, and 1e2 model layers for 100e9 context size. Costs will go up even more with a richer latent space and more model layers, and the western frontier outfits are reasonably likely to be maximizing both.
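
The same arithmetic as a minimal sketch, using only the round numbers above. The 2 bytes per value is an extra assumption (16-bit storage), and real models use tricks such as grouped-query attention and quantization that change these figures substantially, so treat the output as an order-of-magnitude illustration only.

```python
def kv_cache_values(context_len: int, vector_len: int, layers: int) -> int:
    """Number of cached values: one vector per token per layer (grows linearly in context)."""
    return context_len * vector_len * layers


# Round numbers from the comment above.
values = kv_cache_values(context_len=1_000_000, vector_len=1_000, layers=100)

# Assumption: 16-bit values, i.e. 2 bytes each.
BYTES_PER_VALUE = 2
print(f"{values:.0e} values, roughly {values * BYTES_PER_VALUE / 1e9:.0f} GB")

# The squared estimate upthread treats the cache as quadratic in context; it is not.
squared_estimate = (1_000_000 * 1_000) ** 2
print(f"squared estimate is {squared_estimate // values:,}x too large")
```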

bontaq · 2026-04-23T22:08:49 1776982129

How's that O(N^2)? How's it O(N) with caching? Does a 3 turn conversation cost 3 times as much with no caching, or 9 times as much?

jannyfer · 2026-04-23T23:15:44 1776986144

I’m not sure that it’s O(N) with caching but this illustrates the N^2 part:

https://blog.exe.dev/expensively-quadratic

bontaq · 2026-04-24T04:13:48 1777004028

If there was an exponential cost, I would expect to see some sort of pricing based on that. I would also expect to see it taking exponentially longer to process a prompt. I don't believe LLMs work like that. The "scary quadratic" referenced in what you linked seems to be pointing out that cache reads increase as your conversation continues?

If I'm running a database keeping track of a conversation, and each time it writes the entire history of the conversation instead of appending a message, are we calling that O(N^2) now?

atq2119 · 2026-04-24T05:40:59 1777009259

Yes, that is indeed O(N^2). Which, by the way, is not exponential.

Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.
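
A minimal sketch of where the remaining quadratic term comes from, assuming standard full causal attention (no sliding window or sparse attention): even when every earlier key and value is cached, each new token is still compared against all tokens before it. The function name is hypothetical.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key comparisons for n_tokens under full causal attention.

    Token i is compared against positions 1..i, so the total is
    1 + 2 + ... + n = n * (n + 1) / 2, counted per layer and per head.
    """
    return n_tokens * (n_tokens + 1) // 2


short = attention_pairs(1_000)
long = attention_pairs(2_000)
# Doubling the context roughly quadruples the comparisons.
print(short, long, long / short)
```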

computably · 2026-04-24T09:20:09 1777022409

> Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.

Touché. Still, to a reasonable approximation, caching makes the dominant term linear, or equivalently, linearly scales the expensive bits.

bavell · 2026-04-24T12:42:23 1777034543

> I would also expect to see it taking exponentially longer to process a prompt. I don't believe LLMs work like that.

Try this out using a local LLM. You'll see that as the conversation grows, your prompts take longer to execute. It's not exponential but it's significant. This is in fact how all autoregressive LLMs work.

_flux · 2026-04-24T08:03:48 1777017828

What we would call O(n^2) in your rewriting message history would be the case where you have an empty database and you need to populate it with a certain message history. The individual operations would take 1, 2, 3, .. n steps, so (1/2)*n^2 in total, so O(n^2).

This is basically what happens for each message in an LLM chat at the logical level: the complete context/history is sent in to be processed. If you wish to process only the additions, you must preserve the processed state on the server side (in the KV cache). KV caches can be very large, e.g. tens of gigabytes.
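
The counting argument can be sketched as follows. This is a simplified model, assuming each turn adds a fixed number of tokens and counting only tokens the model has to process from scratch; it ignores cache-read charges and the attention cost inside each pass. The function names are hypothetical.

```python
from typing import Sequence


def tokens_processed_without_cache(turn_lengths: Sequence[int]) -> int:
    """Every turn reprocesses the whole history so far."""
    total = 0
    context = 0
    for new_tokens in turn_lengths:
        context += new_tokens   # history grows by this turn's tokens
        total += context        # and the full history is processed again
    return total


def tokens_processed_with_cache(turn_lengths: Sequence[int]) -> int:
    """Only the newly added tokens are processed each turn."""
    return sum(turn_lengths)


turns = [100, 100, 100]
print(tokens_processed_without_cache(turns))  # 100 + 200 + 300 = 600
print(tokens_processed_with_cache(turns))     # 100 + 100 + 100 = 300
```

Under this model a 3-turn conversation without caching costs 6 times a single turn (1 + 2 + 3), neither 3 nor 9, and the total approaches n^2/2 as the number of turns grows.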

raron · 2026-04-23T20:29:59 1776976199

How big is this cached data? Wouldn't it be possible to download it after idling a few minutes "to suspend the session", and upload and restore it when the user starts their next interaction?

throwdbaaway · 2026-04-23T21:54:16 1776981256

Should be about 10~20 GiB per session. Save/restore is exactly what DeepSeek does using its 3FS distributed filesystem: https://github.com/deepseek-ai/3fs#3-kvcache

With this much cheaper setup backed by disks, they can offer a much better caching experience:

> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.

cyanydeez · 2026-04-23T21:48:15 1776980895

I often see a local model QWEN3.5-Coder-Next grow to about 5 GB or so over the course of a session using llamacpp-server. I'd bet these trillion parameter models are even worse. Even if you wanted to download it or offload it or offered that as a service, to start back up again, you'd _still_ be paying the token cost because all of that context _is_ the tokens you've just done.

The cache is what makes your journey from a 1k prompt to a 1 million token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.

cortesoft · 2026-04-24T04:39:38 1777005578

What they mean when they say 'cached' is that it is loaded into the GPU memory on Anthropic servers.

You already have the data on your own machine, and that 'upload and restore' process is exactly what is happening when you restart an idle session. The issue is that it takes time, and it counts as token usage because you have to send the data for the GPU to load, and that data is the 'tokens'.

vanviegen · 2026-04-24T08:54:25 1777020865

Wrong on both counts. The kv-cache is likely to be offloaded to RAM or disk. What you have locally is just the log of messages. The kv-cache is the internal LLM state after having processed these messages, and it is a lot bigger.

cortesoft · 2026-04-24T16:48:45 1777049325

I shouldn't have said 'loaded into GPU memory', but my point still stands... the cached data is on the Anthropic side, which means that caching more locally isn't going to help with that.

nl · 2026-04-24T03:01:10 1776999670

> upload and restore it when the user starts their next interaction

The data is the conversation (along with the thinking tokens).

There is no download - you already have it.

The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.

That is doable, but as Boris notes it costs lots of tokens.

vanviegen · 2026-04-24T08:57:03 1777021023

You're quite confidently wrong! :-)

The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.

nl · 2026-04-25T00:22:00 1777076520

> The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.

Yes - generated from the data of the conversation.

Read what I said again. I'm explaining how they regenerate the cache by running the conversation through the LLM to reconstruct the KV cache state.

miroljub · 2026-04-23T21:55:14 1776981314

This sounds like a religious cult priest blaming the common people for not understanding the cult leader's wish, which he never clearly stated.

computably · 2026-04-24T09:28:56 1777022936

A strange view. The trade-off has nothing to do with a specific ideology or notable selfishness. It is an intrinsic limitation of the algorithms, which anybody could reasonably learn about.

Sure, the exact choice on the trade-off, changing that choice, and having a pretty product-breaking bug as a result, are much more opaque. But I was responding to somebody who was surprised there's any trade-off at all. Computers don't give you infinite resources, whether or not they're "servers," "in the cloud," or "AI."

miroljub · 2026-04-24T11:39:07 1777030747

He was surprised because it was not clearly communicated. There's a lot of theory behind a product that you could (or could not) better understand, but in the end, something like price doesn't have much to do with the theoretical and practical behavior of the actual application.