At this point Anthropic is a pure marketing and PR company. Super catchy names like Opus, Mythos and Fable trying to get you to think that these software products are actually super-human life changing experiences. Boris Cherny coming to HN “Hi! it’s Boris from the Claude Code team” to get real tech people’s goodwill.
From Opus 4.6 there are no noticeable improvements for me in code generation. It works very well, till 90% completion, if you guide it correctly. And you need a little luck. For serious production code I need to understand what I’m doing so it helps a bit, sometimes.
> catchy names like Opus, Mythos and Fable trying to get you to think that these software products are actually super-human life changing experiences
This is just good business sense. In what scenario would you ever make the names dumb and forgettable?
> Boris Cherny coming to HN “Hi! it’s Boris from the Claude Code team” to get real tech people’s goodwill.
This is good customer support, lol. From what I can tell, it is indeed Boris Cherny responding, not outsourced to AI or other staff. You're really getting a response from Boris. I suppose that is PR, but it's not unjustified PR, it's accurate.
I'm not even a crazy AI fan, but your criticisms are ridiculous here. It reminds me of the quote from Knives Out -- "Your Honor, she endeared herself to him through hard work and good humor."
Your observations are right, but it's pretty insane to consider them a pure PR company lol. They are making more frequent releases, so yes, the release-to-release quality gain is smaller, but we’re still ascending quality and reliability curves the same way we have since GPT-3. You get a GPT-4 to GPT-5 sized leap roughly every 17 or 18 months, I think.
> Super catchy names like Opus, Mythos and Fable trying to get you to think that these software products are actually super-human life changing experiences.
They're originally named after the blends at a nearby coffee shop.
I've noticed nobody at HN knows what "marketing" is or how to do it. It's not just naming things, and being evil and cynical is not the most successful method.
…also frontier models are a superhuman life changing experience. If they aren't, what possibly could be?
Opus 4.7/4.8 often over-engineers on my setups, plus:
- It talks a LOT more like GPT models. You know: wrinkle, shape, gate, coarse, scope, gap, path, production-ready-workflow-of-the-day, and so on -- "that's expected, a consequence of the previous like-driven workflow". If I wanted to get a headache using AI I would have gone with GPT in the first place!
- It outputs text in a way that's much harder to follow. I can't exactly say what it is. Maybe a bit of everything? Bolds are missing, bullet points are gone, paragraphs are bland and too long, and it doesn't feel like a model programming with me, but rather like a somewhat full-of-himself grandpa developer looking down on me. It's very weird to describe this, but it is definitely how I feel.
Granted this can totally be because of the way it reacts to the prompts now. We've got a rather large corpus of skills and "rules and good practices" that Opus 4.6 responded to great, and maybe the new models just get turned into this when fed with them....I don't know.
Either way, with Opus 4.6 being as good as it is, I need Fable to be a significant step up to justify a price increase. If it can get me to babysit Opus a little bit less on some stuff, it might be worth it. Otherwise, I'm very happy with Opus 4.6 and hope they don't deprecate it.
I'd argue that 4.8 is a straight downgrade, for every type of task I've tried. It's been a gamble at this point. If 4.6 stops being available, I'm out.
Reading so many contrary positions about which model is better or worse shows how difficult it is to measure intelligence based on personal experiences. Of course, benchmarks try to make the process as objective as possible, but they often don't correlate with our personal experiences.
The other day 4.6 was fantastic for x task. Today, 4.6 overengineered everything and I had to revert all my changes. When evaluating models, perhaps it makes sense to consider luck as an ingredient before reaching any personal conclusion.
Yes, but there’s a reason we don’t evaluate these models this way and instead do it as carefully and thoughtfully as we can at scale. Human evaluations are important, but they are an absolute minefield of footguns. 4.8 is not a downgrade from 4.6; there is an insane amount of hard data that contradicts that claim.
Again, correct, but it overstates the issue. I can say labs don’t want this. This happened, arguably unintentionally, in Meta's Llama 4 release. It went horribly, heads rolled, several billion dollars were paid for new talent, and the org that built Llama 4 was destroyed.
Evals come from a million places and new evals and robust perturbations of existing evals abound. They test a variety of tasks in a variety of ways. All of them individually are flawed. Taken together the aggregate signal is highly useful as you more or less marginalize over a lot of different things. Not to mention these companies have plenty of proprietary internal measurements, they build benchmarks themselves to probe their models and then also have flywheel traffic and A/B tests.
You are right to call out benchmarks but to dismiss them or not take them seriously is a mistake.
Listen, you can say “but benchmarks, the benchmarks!” all day long, but consumers know when we are being sold a lemon. If it can’t do the most basic of things at least as well as it used to, this is table stakes. Never mind that if you can’t do the basic stuff, how on earth can you be trusted with more?
And you can say “If it can’t do the most basic of things at least as good as it used to, this is table stakes” all day long while people point you to much better evidence to the contrary too, I’d rather be on the other side of that.
Listen. I don’t care about evidence. I care about my lived experience for the product I paid for. I used the new product. It’s actively terrible. To the point of not being usable. We’re all anecdata, but what is “better evidence to the contrary”? The known and game-able benchmarks that they know they need to win at, so they train it to. It’s all he said, she said, which is the only reason we keep having this conversation.
Yeah, but it’s not, right? You or I or the myriad of other institutions inside and outside of academia can probe these models with an evolving landscape of evaluation sets, even those unavailable to the developers. It’s just ignorance to claim benchmarks are somehow useless or all being gamed. You choose your tools in the way you want, but just don’t call it somehow better than a myriad of more carefully constructed setups and scaled evaluations.
Actually anecdata I gather on my job from myself and coworkers is the only benchmark I trust anymore, because it so heavily diverges from the “benchmarks”.
I would encourage you to look into the open evals of some of these benchmarks (find one that actually is open-data, this is itself a good challenge), read the results generated and assess them for yourself.
This is what myself and my coworkers (and many other people in this thread) are doing on a daily basis with real stakes and real tasks – which these benchmarks are all aiming to be a proxy for. There's a real, tangible [cost]benefit to [not] using the highest-ROI models and harnesses.
The people with real incentives and skin in the game are telling you that the data diverges from "the data".
I don't mind if you don't take it seriously, our jobs are more important to us than a benchmark is.
But I wouldn't opt-out of using your own eyes and the eyes of others so easily, especially when there are literally hundreds of billions of dollars in invested capital with an interest in a certain outcome... this is how you end up in "Emperor's New Clothes" situations.
Investigating on your specific use cases, codebases, workflows and tasks is important. There is nothing wrong with this, and in fact it’s more important than benchmarks if you can do it well. But the point is that it is very hard, and it is easy to totally fool yourself and go down a suboptimal path. I understand that people are going to do it regardless; I certainly do. And I have looked at more raw benchmark data than I can really even stomach. I can see annotation data in my dreams now.
Eyes and ears of others are incredibly important. But you still seem to think somehow benchmarks are part of some giant conspiratorial cabal. You have institutions without ANY skin in the game making extremely high quality benchmarks. Consider that in academia there is little else to do outside of partnerships with these companies. But benchmarks you can do completely independently and with university grant level money (it costs maybe $10-100k for a reasonable benchmark in many cases). Not only that, “real tasks” are what many benchmarks measure. You have these companies with extremely good logging and well scaled measurements to really look at what works and what doesn’t.
At this point I have a workflow that is fairly rote. I've yet to use a model newer than 4.6-1M-XHIGH that I trust to earn a higher ROI on that workflow, and not for lack of trying!
I personally don't believe in any sort of cabal (Occam's Razor hasn't let me down yet). Ultimately, I don't really care *why* they're wrong as much as I care *that* they have diverged from my rubber-meets-the-road measures of value.
That is concerning to me, because people are investing 100s of B's of capital based on the putative RoI putatively available to people like ourselves. When the benchmarks support this RoI thesis, but none of the anecdata does... that's really concerning!
Re: academics, I don't think any of the data academics have access to are good proxies for the work real people are doing. And for the data that are good proxies, the model labs certainly have access to the same data, and therefore the benchmark performance against those data is irrelevant.
I am in full support of custom workflow benchmarks, and of choosing the best model for your use case to balance performance and expense. That's just good operating behavior. The problem is the footguns and biases that people are convinced they don't have, even if they understand on an intellectual level that everyone else has them.
> but none of the anecdata does... that's really concerning!
But see this is not really true -- adoption, subjective benchmarks, verifiable benchmarks, task-dependent performance, internal product metrics, living benchmarks, all point in a pretty consistent direction. Anecdata is not the plural of data. An anecdote is like a case study. It's there to motivate the things we already have which is a huge amount of performance measures for a variety of different tasks.
> Re: academics, I don't think any of the data academics have access to are good proxies for the work real people are doing.
But this isn't really true either -- you can get this data from a variety of sources that are licensable or open source, or data that you can commission. You can critique any one methodology for this but a blanket "they are hamstrung" is not really fair or accurate.
> And for the data that are good proxies, the model labs certainly have access to the same data, and therefore the benchmark performance against those data is irrelevant.
But this is also not true -- you can have exclusive license agreements, data you hold close to the heart, or data to measure models that haven't had access to it because that data was created after these models were released.
There are plenty of problems in model measurement but the answer is not to just abandon it to be cavemen with zero respect for rigor and the biases we have to be subject to as human beings.
"Carefully and thoughtfully" is antithetical to the approach to benchmarks these days.
Maybe back when this was a scientific endeavor; not now when enormous, enormous amounts of capital are on the line. Along with an entire cult's chosen eschatology.
You can call it a cult but it’s several thousand skilled workers who know what they’re doing, by and large, most of whom have a PhD and know how science and statistics work. Benchmarks are incredibly hard, and any PR or comms department at any company is going to obviously want to make things as rosy as possible, but beneath this are earnest, expensive efforts to get good quality measurements. The better you can do this the better you can compete. If you want to make a modeling decision you run an ablation, and the quality of that decision is only as good as your measurements.
The cult in this case is TESCREAL, not everyone working on AI. Last I checked not all the "several thousand skilled workers" in AI subscribe to TESCREAL ideology, although it has been a while since I've been to the Bay. Maybe things have changed since my time at Berkeley, and Dario's belief that he will eventually be made immortal by mind uploading is more widespread.
Otherwise we agree that benchmarking is hard, the benchmarks contain hard problems, and that there are many hard working people trying to accurately gauge what is going on. It is getting harder to watch though as all that is on the line taints the overall endeavor.
No, it’s this: evaluating these systems is complex, and there’s a reason why sociology, cognitive psychology, medicine, etc. are all done in careful double-blind conditions with preregistered tests. It’s not that humans are not smart enough; as I said, human evaluations are incredibly important. And yet they are a minefield of biases you have to worry about and correct for.
- evaluations need to be done at the same time to avoid drift in your bias
- you need to worry about your test set: which questions are you asking? How many of them? Are they representative of your work?
- which one did you do first? Raters have a tendency to bias in one direction or another
- you also know the label! You know which model is which! This biases your assessment…
And on and on and on. Careful science exists for a reason.
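To make the list above concrete, here is a minimal sketch of a blinded, order-randomized pairwise comparison in Python. Everything in it is an assumption for illustration: the function names, the trial layout, and the choice of an exact sign test are hypothetical and not taken from any lab's actual evaluation pipeline.

```python
import random
from math import comb


def make_blinded_trials(prompts, outputs_a, outputs_b, seed=0):
    """Pair outputs per prompt, hide the model labels, and shuffle the order."""
    rng = random.Random(seed)
    trials = []
    for prompt, a, b in zip(prompts, outputs_a, outputs_b):
        # Randomize which model is shown first so position bias averages out.
        if rng.random() < 0.5:
            trials.append({"prompt": prompt, "first": a, "second": b, "key": ("A", "B")})
        else:
            trials.append({"prompt": prompt, "first": b, "second": a, "key": ("B", "A")})
    # Shuffle trial order; the rater should only see prompt/first/second.
    rng.shuffle(trials)
    return trials


def unblind(trials, picks):
    """picks[i] is 'first' or 'second'; returns the number of wins per model."""
    wins = {"A": 0, "B": 0}
    for trial, pick in zip(trials, picks):
        model = trial["key"][0] if pick == "first" else trial["key"][1]
        wins[model] += 1
    return wins


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial test of 'no preference' (p = 0.5), ties excluded."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The rater records picks against the shuffled trials without ever seeing the `key` field, and `unblind` is only called once all picks are in. A 5-5 split gives a p-value of 1.0, which is a useful reminder of how little a handful of sessions can tell you either way.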
There is no data that I would trust that contradicts it.
Frankly I don't give a damn about data that could be made up on the spot or appears to be scientific or meaningful while it's not at all clear how it was made (up).
Claude was heavily lobotomised for my work starting sometime in February.
I talked to friends and people I know and trust and many felt the same. (I didn't ask them whether they felt like I did, but what they felt, how happy they were with agentic coding etc.)
I cancelled my subscription in March and talked to said friends who are still on a plan just last week: they are still not happy, but the company pays, so whatever...
That’s ok but at what point is this getting into conspiracy territory? You have just said there is nothing you would believe to the contrary, but then by definition that’s not exactly a very thoughtful or insightful position.
I never said that I am not willing to believe the contrary.
I am not willing to believe the contrary from strangers on the interwebs or PR departments of companies who want to sell me something.
If people I genuinely trust tell me about their experiences, I am willing to try again.
But yes, if it doesn't work for me (for whatever reason, could be that I am holding it wrong), then I can accept that it works for everyone but me and still not use it.
Also "scientific" doesn't mean what it used to mean. When the n is small or it's just anecdotes (I am aware of the irony) blown out of proportion I really can't take the data and conclusions seriously
N isn’t small, science means what it’s always meant, statistics is a thing, and what you’re describing is just putting your trust in a very poor quality benchmark. You said you would not trust any data that contradicts your opinion. Benchmarks are not PR; they are designed by a variety of institutions completely outside the control of frontier labs. Again, congratulations on your conspiracy theory.
Lol. If you're doing anything non trivial that's not a CRUD webapp but e.g. some physics simulation or high performance GPU code any and all models I've tried suck.
They are not just leagues behind what experts would code, they are not even playing the same game.
Which is to be expected, as there isn't so much physics or high performance gpu code available as there is for your typical CRUD API and JS frontend.
I can attest to this, I had a very simple 20-line shader that I asked Claude to do a basic 90-degree rotation on it, and it just completely got it wrong. Frequently adds pointless abstractions / intermediate variables even when I tell it explicitly not to in the system prompt. I can go on and on, these things just don't understand architecture. And why would they? They were trained on text.
There is something remarkable about turning speech into code (don't need to hunch over a keyboard nearly as much these days, can just talk into a mic) and it's good for first drafts / exploring ideas. But it's obvious to anyone that's paying attention we're hitting the top of the S-curve. It's no wonder the IPOs are around the corner. I mean even Dario admitted he doesn't know how they're gonna substantially increase the context window size. That says a lot.
That being said I think the harnesses are only getting better. And maybe we will get multi-modal models that understand architecture eventually. But the growing-the-blob-of-text training method that's being used now appears to be getting diminishing returns
It's getting to a point that it's offputting, and the next step would be to put it into "untrusted" bucket. Opus 4.7 already burned their credibility once, 2 more strikes remain.
Not my impression. I felt 4.7 was a regression, but I am again badly in love with 4.8, with the level of insight it produces in design discussions and how long it can go unattended while producing spec-adhering quality code. There are problems it still can't solve well, at the edges of algorithmics and far from the mainstream, but for lots of stuff it is godlike.
Also, I don't think Boris C. is coming here for PR. He is a tech guy, and this is the best place for tech discussions. Why so cynical? The guy is an engineer.
I don’t even think that Boris is really just one person. He apparently vibe coded Claude Code and is responding on Threads, Twitter, HN and everywhere.
Yeah, the marketing is cringe and it's a bummer that such a cool and powerful technology attracts such an icky group of enthusiasts. Surely, not all are bad, but man there are lots of goobers who are just AI-pilled hypemen who can't STFU about it.
If you truly believe this, you've discovered a superpower over everyone else in the industry.
While everyone else is wasting time and money on the slower, more expensive models, you've found a way to outpace everyone for less money. Everyone else is wrong and you will get rich.
(I don't actually believe the premise is true, I'm just pointing out the logical conclusion to what you're saying so maybe we can reconsider the premise)
> At this point Anthropic is a pure marketing and PR company. Super catchy names like Opus, Mythos and Fable trying to get you to think that these software products are actually super-human
Lol anti-AI bias on HN is crazy. Simply giving your product a quirky name is now being considered manipulative advertising. Is just doing normal PR and marketing something AI companies aren't allowed to do?
When they keep saying “oooh this new model is too big and crazy and totally can’t be released” or “this new model is a 10x game changer totally unlike our previous iterations” it feels sort of like the boy crying wolf. Yes, they’re still pretty clearly improving models, but when you’ve hit diminishing returns / more incremental gains and you’re still saying this, it sounds like pure PR hype from a company that had previously been the “honest good guys” in the room.
Their model did find thousands of security vulnerabilities across the companies they previewed Mythos with via project Glasswing. Is it not sensible, given that emergent level of capability, that they do this gated release structure, as all those vulnerabilities would be exploitable by anyone using a Mythos-level model?
Don't forget the DoD stint that gave them this recent public boost.
Defying standard DoD precedent going back forever, which every other country has some form of too, and championing it like they are some kind of moral freedom fighters.
Like selling the DoD guns and telling them they can only shoot bad guys with those guns, and that you will be the one to decide who counts as a bad guy...
I think this says more about your type of work than anything. For bugfinding/incident response in distributed systems - which often involves extensive use of Datadog/Sentry MCPs and poring over heaps of logs in addition to reading tons of code - 4.8 has been significantly better than 4.6.
You are right; all I noticed was a big-time slowdown. They increased the quota, but I cannot even reach the end of the day with these speeds. .NET coding somehow improved, though.
I don't like PE players like Bending Spoons, but I have used Komoot extensively for years, for cycling (and more recently hiking), and haven't seen any decrease in quality since the acquisition.
Code is a liability. An engineer says no because she/he wants to reduce complexity, not because she/he is subjectively “obsessed” with code quality. The term “quality” is nowadays misunderstood by management. It means the right amount of effort to build the product as fast and at as low a cost as possible, taking into account a team of engineers that can easily add and modify code.
"quality" isn't succinctly definable. Zen and the art of system maintenance quality code is written by an old and wise programmer and any attempt to rigidly codify what it is they did and why is doomed to fail.
And in the agentic world, that liability is both minimized and amplified. Teams that successfully mitigate AI risks will be able to churn out massive amounts of sustainable code.
You are the archetype the OP is talking about because you repeat aphorisms like "code is a liability" that compress some truth too much and forget the larger picture.
Edit: apologies for personal attack. Didn’t mean for it to come across that way
p.s. We've had to ask you this before: https://news.ycombinator.com/item?id=47103856. If you'd please review the guidelines and take the intended spirit of the site to heart, we'd be grateful.
Also what's wrong with "code is a liability"? That's just 100 % true. The idea isn't exactly novel or revealing, but it's also really fundamental. Every line of code is a liability from day one.
The comment you replied to used that as a reminder and as an opening to an actual argument, it wasn't just a knee-jerk reaction.
Production code is an asset, its maintenance and obligations are an expense, its risks might become liabilities, and companies shouldn't run more code than they need for the same reason they shouldn't own a larger vehicle fleet or more spare warehouse capacity than they need.
I don't think most engineers really disagree with this. Saying code is a liability is technically incorrect but pithy shorthand to communicate that it comes with the associated baggage of maintenance, obligation, and risk; these things suck up money the same way a liability does. Tech debt is also not real debt. It's a figure of speech.
A building is an immovable asset. It’s also made of things that wear and tear. Its value is derived from its capability to house and the capability to house something extends beyond four walls and a roof.
The asset has inherent liabilities. A codebase can be reasoned about extremely similarly
Note how the longer sentences are significantly stricter than the shorter ones. You could maybe add another condition, in the sense that the code has to generate more revenue than it costs to maintain. Then I'd start to agree.
Also note that even when a line of code is generating revenue, it never stops being a liability in almost every sense of the word. Testing it still costs money and time, understanding it costs cognitive power, having it in the context of your LLM coding agent costs tokens, and that's assuming it's a good line of code. If it's bad code (badly named, badly placed, a logic chain that works but has hidden flaws), the costs increase and reverberate throughout the codebase (and your AI coding sessions).
That's an oversimplification. Asset vs liability isn't a binary state but a superposition. An asset can carry liabilities.
Your asset might generate $10k a month in revenue, but at the same time may have a high chance of needing a $100k investment in upgrades and repairs to remain productive.
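As a rough illustration of that arithmetic, here is a small Python sketch. The $10k monthly revenue and the $100k repair bill come from the comment above; the 60% repair probability and the 12-month horizon are made-up numbers, since the comment only says "a high chance".

```python
def expected_net(monthly_revenue, months, repair_cost, repair_probability):
    """Revenue over the horizon minus the probability-weighted repair bill."""
    return monthly_revenue * months - repair_probability * repair_cost


def break_even_months(monthly_revenue, repair_cost, repair_probability):
    """Months of revenue needed to cover the expected repair bill."""
    return repair_probability * repair_cost / monthly_revenue


# Hypothetical inputs: 60% chance of the repair within a 12-month horizon.
net = expected_net(10_000, 12, 100_000, 0.6)
months_to_cover = break_even_months(10_000, 100_000, 0.6)
print(f"expected net: ${net:,.0f}; months to cover expected repair: {months_to_cover:.1f}")
```

Under these assumed numbers the asset is still net positive over the year, but half of its revenue is already spoken for by the expected repair, which is the sense in which the same thing is an asset and carries a liability at once.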
Nah man. You got to say "no" a lot. Even in the age of AI. Oftentimes features downright make no sense, the time to implement can span weeks, and it would actively damage the product in the long term. I work in an ecom startup and I've had to say no so many times due to added complexity for little reward.
I think saying no is more important now with AI, as features can be built so quickly now. But there are a lot more costs after the feature has been built. Mostly with AI the code isn’t understood that well, which incurs a cognitive debt. Then there are extra maintenance and documentation costs. And the costs of carrying around features that add no value.
I can imagine that if you’re a startup and want to try new features quickly, it makes sense to say yes more. But the senior mentioned in the article will also be able to understand that.
Are you nitpicking that with the right architecture and safeguards, unaccountable lines of code are perfectly harmless?
Because I think in most regular situations, code-without-adjectives, uncertain commits, or any number of things might be rightfully justified as a literal legal liability for business cases.
I get that you don't like how flat it is, but on a business website, in a world forecasted to be full of black box code, the statement is correct.
Code in a vacuum may not be a personal liability, but it is a professional one in 2026 where there's a gulf between slop and secure code
It's even more complicated: the datacenter and the servers are owned and operated by the government, and the DigiD app itself is owned and operated by government-owned Logius.
From what I have been able to deduce, Solvinity is contracted for some kind of sysadmin services - so basically Kubernetes babysitting?
Please not the schools. We don’t need privacy-invading closed systems with built-in slot machines. We need deterministic open systems where kids’ privacy is protected.
That would be illegal in many jurisdictions. And schools in general take privacy very seriously. Most schools won't sign up for google edu without a solid privacy guarantee.
Google is likely very happy to give up on the privacy violations for a few years of a child's life in exchange for getting that child hooked on Google services so they can freely violate privacy for an entire adult lifetime.
That’s a promise, no technical guarantee. Then there’s Cloud Act and FISA.
> Google is likely very happy to give up on the privacy violations
“likely”, exactly. This can change any time. We’ll just have to trust them. Scrolling through this thread, there seems to be about zero trust in a US ad company whose specialty is feeding off people’s privacy.
We should by now be demanding technical guarantees. Open source, end-to-end encrypted, with e.g. an oversight board checking the company. Companies like Proton are doing this.
The default is very, very heavily weighted in Google's "Chromebook" favour. Getting a school with Windows (or Mac) exclusivity is a 4-leaf clover. Google genuinely have a pretty good product with Google Classroom though, so it's not completely lost. It's just a problem when schoolkids grow up and end up with new Windows/Mac laptops and have no idea how computers work outside of the web browser.
I'd assume this opens up 'Googlebooks' to compete with the GPU/M Series Premium laptops so schools can provide them to teach things like Photoshop, Illustrator, CAD Design, anything that chromebooks couldn't do, right?
The performance of the machine offered at schools seems to get just a little worse every year too... like one of these days they won't have to worry about kids playing Krunker in class because they won't be able to.
It would be so much better for the students' IT proficiency if they were some ordinary Linux computers instead. Preferably with limited central management.
The Chromebooks are probably cheaper than the hardware itself could be, but that's a good demonstration of the issue.
It wouldn’t. The central management of Chromebook is what makes the whole system usable. All you’d be doing is sentencing school IT folks to endless, endless support requests.
Funny. At my son's school in Germany, students may bring any device they want without central administration (just Wifi and web platforms). It works quite well without inundating IT staff with support requests.
(To achieve at least some similarity of systems, you get a partial refund if you buy either iPads or convertible notebooks running Windows. My son's notebook technically runs Windows but he mostly uses plain Debian Linux with Xournal++.)
Sorry, I love Linux, but could you imagine managing a fleet of the cheapest hardware possible and also teaching a bunch of 6th graders how to use Linux? School IT workers are already heroes. I don't like Google, but they're a necessary evil to keep those guys from tearing their hair out every day unless we dedicate significantly more resources to computing in schools.
We managed fine with crappy old Windows XP Thinkpads in elementary school. Modern Linux is far easier, and I'm saying the slight challenge would be educational.
> We need deterministic open systems where kids’ privacy is protected
I don't think we need any computers really. They'll be inundated with computers and technology their whole lives. They'll figure it out. Just keep this tech out of the classroom altogether.
We've had computers in the classroom for over a decade now, and scores and learning have not gone up. It's a failed experiment.
> Why are you opposed to using personal computers for education?
They'll have computers at home. And the evidence seems to point in one direction: the more exposure kids have to devices, the more stunted their development tends to be. Add to that the class division, where rich kids are increasingly raised with strictly-policed device exposure, while poor kids' classrooms are littered with iPads and Chromebooks, and I think we can start making blanket statements.
There's also the point that the rich executives at these companies that make computers for school use send their own children to schools which do not use computers for education.
If computers were that critical to education you'd think those same executives would be loading up their children with all the tech they can afford.
I don’t think we need math really. They’ll be inundated with math and arithmetic their whole lives. They’ll figure it out. Just keep math out of the classroom altogether.