johnfn · 2026-04-15T20:29:39 1776284979

After a release, attackers have effectively infinite time to throw an LLM against every line of your code - an LLM that only gets smarter and cheaper to run as time passes. In order to feel secure you’d need to do all the work you’d imagine an attacker would ever do, for every single release you ship.

mixdup · 2026-04-15T21:26:01 1776288361

The first few times it's going to be expensive, but once everyone level sets with intense scans of their codebases, "every single release" is actually not that big a deal, since you are not likely to be completely rebuilding your codebase every release

stavros · 2026-04-15T21:04:49 1776287089

This assumes that the relationship between "LLM tokens spent" and "vulnerabilities found" doesn't plateau, though.

johnfn · 2026-04-14T22:25:29 1776205529

I think I've seen one page override ctrl-f for good reason -- it was a page that lazy loaded literally millions of lines of text that wouldn't have fit into RAM.

Every single other page that does it just wastes my time. It's always a super janky slow implementation that somehow additionally fails to actually search through all the text on the page.

mghackerlady · 2026-04-15T18:18:21 1776277101

then instead of lazy loading load chunks and paginate it like we used to

johnfn · 2026-04-14T01:16:17 1776129377

Is this really that hard to parse?

Curator and Finder are the names of the agents. "answer key" - haven't you ever taken a test in high school? It's an explanation of the answer. "shell steps" I presume means it gets to run 24 commands on the shell. "structured report" - do I really need to explain to you what a report is? "sink hints" - I admit I didn't know this one, but a bit of searching indicates that it's a hint at where the vulnerability lies.

johnfn · 2026-04-13T23:57:20 1776124640

Isn't this an extremely reasonable thing to do? To take an extreme example, consider people working on gain-of-function virology research.

archagon · 2026-04-14T01:30:33 1776130233

I can't imagine most people working on gain-of-function virology expect it to make the world worse.

johnfn · 2026-04-13T16:06:32 1776096392

Guys, did you know about tmux control mode? It tells the host terminal to treat tmux tabs as actual tabs in the terminal. That means that things like scrollback, tab navigation, copy paste, keyboard shortcuts, etc are all handled natively, and you can visually see all your tmux tabs! It doesn't have great support across all terminals, but it does work great in iTerm 2.

Try `tmux -CC` in iTerm.

For a tmux novice like me, this was a total game changer :)

2oMg3YWV26eKIs · 2026-04-14T15:57:36 1776182256

Very cool. Opening the tmux session in a new tab rather than a new window was an improvement that I wanted when I tried this. Here's how to do it: https://stackoverflow.com/a/54756013/22828008

kelsey98765431 · 2026-04-13T16:47:14 1776098834

this is the only reason i use a mac and in a decade no open source linux terminal has ever implemented this to my knowledge

hnlmorg · 2026-04-13T17:38:45 1776101925

I have in https://github.com/lmorg/ttyphoon

I actually don’t like control mode much though. It’s a terrible protocol. Absolutely abysmal design which leads to a plethora of edge case bugs.

At some point I’ll replace tmux control mode entirely but for the moment it solves the immediate problem.

em-bee · 2026-04-14T00:10:48 1776125448

interesting, i am trying to install it to give it a try. the features in ttyphoon look very promising.

unfortunately the build now fails with "frontend.go:39:12: pattern all:frontend/dist: no matching files found"

i am also a bit taken aback by the many dependencies. with heightened risk of supplychain attacks and dependency failures that feels a bit scary.

hnlmorg · 2026-04-14T07:03:07 1776150187

> unfortunately the build now fails with "frontend.go:39:12: pattern all:frontend/dist: no matching files found"

How are you trying to build it? Are you calling make? Also what OS are you on?

    make build

> i am also a bit taken aback by the many dependencies. with heightened risk of supplychain attacks and dependency failures that feels a bit scary.

Yeah, I agree with you there. Most of my projects are very conservative with their dependencies; as was this one too, originally. But this project was just too large for one person to realistically manage on their own and without reusing the hard work of other libraries.

Unfortunately, the two libraries I need to lean on the most are exactly the kind of libraries that will have big dependency trees:

- GUI (Wails): just because there is a huge amount of code required to draw anything to screen

- AI (langchaingo/mcp-go): though mostly for tool use here but they’re optional

Both of these libraries were chosen because they are well maintained and have a high number of contributors/eyeballs on the code. But, as you said, the risk is still there.

em-bee · 2026-04-14T11:36:52 1776166612

i have fedora. and yes, i am running make build.

why did you choose wails btw? did you look at fyne? it's go-native and it seems to have a lot less dependencies.

can the AI integration be turned off? i am not going to use it myself.

is there a chatroom where we can talk through debugging my build problem? github issues? you didn't turn on the discussion forum for ttyphoon like you did for murex. or is there another place where you hang out? hn is not exactly the ideal place to talk through dev issues.

hnlmorg · 2026-04-14T16:49:06 1776185346

An earlier build used SDL directly. There were reasons I didn’t choose Fyne and that was basically that the effort wasn’t much less than working directly with SDL due to various (potentially self imposed) constraints.

I then switched to Wails because I wanted to add Markdown and Jupyter support, and other things too. I quickly realised it was too much for me to attempt in SDL on my own (even just the markdown parser proved an annoying time sink). I think the terminal itself is worse for the change, so there might be a point when I revisit this switch and change my mind again.

There was a brief period when both Wails and SDL were supported front ends. The idea being you could choose one based on a Go compiler tag. But that proved hard to maintain.

There was also a time when the terminal was SDL and Wails just did the markdown stuff. They were different applications that passed messages via a rudimentary IPC. But that resulted in a multi-window hacky mess.

So the current design is the compromise I’ve settled on but will likely change my mind at some point again.

As for the AI, there aren’t any compiler flags to disable it. But it doesn’t work without you supplying an API key anyway. And the integration is just a few menu items so, hopefully, not intrusive. Which would also make it pretty easy to hide those options behind a compiler flag.

I’m happy to chat as a GitHub issue if you’d prefer. I don’t hang out in many other places these days (time constraints).

em-bee · 2026-04-14T17:53:21 1776189201

thanks for those details. your approach is very pragmatic and i appreciate that.

on AI, i am not concerned about a few menu items, but if we want to consider supply chain attacks as a risk, the less that gets downloaded and built, the better.

i'll open an issue then, i figure this is more for the benefit of other users of ttyphoon, and since this post is originally about tmux we are already off topic anyways.

MisterTea · 2026-04-13T19:47:50 1776109670

The control mode feature was implemented by the developer of iTerm2 for iTerm2: https://github.com/tmux/tmux/wiki/Control-Mode

nateglims · 2026-04-14T00:23:02 1776126182

Wezterm has some support for it on nightly.

tanvach · 2026-04-13T21:42:48 1776116568

Was about to mention this, -CC has been working perfectly for me

pmarreck · 2026-04-14T06:14:56 1776147296

work in ghostty yet?

saagarjha · 2026-04-14T12:20:06 1776169206

Not yet

mwpmaybe · 2026-04-13T16:31:32 1776097892

Holy carp.

chromejs10 · 2026-04-14T04:00:17 1776139217

so THAT's what -CC does...

johnfn · 2026-04-12T20:54:03 1776027243

As dumb as it is to loudly proclaim you wrote 200k loc last week with an LLM, I don’t think it’s much better to look at the code someone else wrote with an LLM and go “hah! Look at how stupid it is!” You’re making exactly the same error as the other guy, just in the opposite direction: you’re judging the profession of software engineering based on code output rather than value generation.

Now, did Garry Tan actually produce anything of value that week? I dunno, you’ll have to ask him.

fao_ · 2026-04-12T20:57:08 1776027428

Yeah! It's not like code quality matters in terms of negative value or lives lost, right?!

https://en.wikipedia.org/wiki/Horizon_IT_scandal

Furthermore,

> As for the artifact that Tan was building with such frenetic energy, I was broadly ignoring it. Polish software engineer Gregorein, however, took it apart, and the results are at once predictable, hilarious and instructive: A single load of Tan’s "newsletter-blog-thingy" included multiple test harnesses (!), the Hello World Rails app (?!), a stowaway text editor, and then eight different variants of the same logo — one of which with zero bytes.

Do you think any of the... /things/ bundled in this software increased the surface area that attacks could be leveraged against?

lotsofpulp · 2026-04-12T21:58:41 1776031121

The Horizon IT scandal was not caused by poor code quality, the scandal was the corrupt employees of the UK government/Post Office. Poor quality code might have caused the error, but the failure to investigate the errors and sweep them under the rug was made by humans.

fao_ · 2026-04-12T22:37:34 1776033454

> Poor quality code might have caused the error, but the failure to investigate the errors and sweep them under the rug was made by humans.

That's not quite correct.

The root set of errors were made by the accounting software. The branch sets of errors were made by humans taking Horizon IT's word for it that there was no fault in the code, and instead blaming the workers for the differences in the balance sheets.

If there were no errors in the accounting software (i.e. it had been properly designed and tested), then none of that would have happened.

Nobody blames THERAC-25 on the human operator.

rcxdude · 2026-04-12T23:13:32 1776035612

It was worse than that. Higher ups in the post office knew the system was buggy and still doubled down on it. Yes, if the accounting software wasn't terrible the whole issue would not have happened, but there were so, so, many chances for the post office to do the right thing afterwards that it's not at all fair to blame the results on the poor quality software, which very notably did not prosecute thousands of people for fraud while telling each of them they were the only ones being flagged by the system.

(THERAC-25 was a little more towards 'just bad software', but there were still systemic failures there as well).

SvenL · 2026-04-12T21:29:19 1776029359

I also struggle with this all the time, balance between bringing value/joy and level of craft. Most human written stuff might look really ugly or was written in a weird way but as long as it’s useful it’s ok.

What I don’t like here is the bragging about the LoC. He’s not bragging about the value it could provide. Yes people also write shitty code but they don’t brag about it - most of the time they are even ashamed.

flir · 2026-04-12T22:37:06 1776033426

> a stowaway text editor

?!

Was it hiding in one of the lifeboats?

8note · 2026-04-12T21:05:41 1776027941

> included multiple test harnesses (!)

ive seen plenty of real code written by real people with multiple test harnesses and multiple mocking libraries.

its still kinda irrelevant to whether the code does anything useful; only a descriptor of the funding model

flir · 2026-04-12T22:48:15 1776034095

If I'm reading this correctly ("a single homepage load of http://garryslist.org downloads 6.42 MB across 169 requests"), the test harnesses were being downloaded by end users. They weren't being installed as devDependencies.

sdevonoes · 2026-04-12T21:19:01 1776028741

> Now, did Garry Tan actually produce anything of value that week? I dunno, you’ll have to ask him.

Let’s not be naive. Garry is not a nobody. He absolutely doesn’t care about how many lines of code are produced or deleted. He made that post as advertisement: he’s advertising AI because he’s the CEO of YC, whose profitability depends on AI.

He’s just shipping ads.

Terr_ · 2026-04-12T21:34:49 1776029689

"Follow the money" was always relevant, but especially when it comes to any kind of LLM news or investment-du-jour.

The cautionary/pessimist folks at least don't make money by taking the stance.

slyall · 2026-04-12T22:08:32 1776031712

A few do.

At the extreme end you'll get invited to conferences, but further down you could have other products you are pushing, even non-AI-related ones that take advantage of your "smart person" public persona.

tmoertel · 2026-04-12T21:10:29 1776028229

> You’re making exactly the same error as the other guy, just in the opposite direction: you’re judging the profession of software engineering based on code output rather than value generation.

But the true metric isn't either one, it's value created net of costs. And those costs include the cost to create the software, the cost to understand and maintain it, the cost of securing it and deploying it and running it, and consequential costs, such as the cost of exploited security holes and the cost of unexpected legal liabilities, say from accidental copyright or patent infringement or from accidental violation of laws such as the Digital Markets Act and Digital Services Act. The use of AI dramatically decreases some of these costs and dramatically increases other costs (in expectation). But the AI hypesters only shine the spotlight on the decreased costs.

alemwjsl · 2026-04-12T21:05:38 1776027938

It isn't worth the time. I am not going to read the 200k LOC to prove it was a bad idea to generate that much code in a short time and ship it to production. It is on the vibe coder to prove it is. And if it is just tweets being exchanged, and I want to judge someone who is boasting about LOC and aiming to make more LOC/second. Yep I'll judge 'em. It is stupid.

ObscureScience · 2026-04-12T21:03:37 1776027817

"Value generation" is a term I would be somewhat wary of.

To me, in this context, it's similar to driving economic growth on fossil fuels.

Whether in the end it can result in a net benefit (the value is larger than the cost of interacting with it and the cost to sort out the mess later) is likely impossible to say, but I don't think it can simply be judged by short sighted value.

II2II · 2026-04-12T21:10:10 1776028210

Given the framing of the article, I can understand where the opposite direction comment is coming from. The author also gives mixed signals, by simultaneously suggesting that the "laziness" of the programmer and code are virtues. Yet I don't think they are ignoring value generation. Rather, I think they are suggesting that the value is in the quality of the code instead of the problem being solved. This seems to be an attitude held by many developers who are interested in the pursuit of programming rather than the end product.

roncesvalles · 2026-04-12T21:33:13 1776029593

The main value he generated from that exercise was the screenshot. It's a kind of credentialism.

johnfn · 2026-04-11T17:31:22 1775928682

If you want to delete your account you can just set your noprocrast to some absurdly large number like 99999999.

johnfn · 2026-04-11T17:27:16 1775928436

The Anthropic writeup addresses this explicitly:

> This was the most critical vulnerability we discovered in OpenBSD with Mythos Preview after a thousand runs through our scaffold. Across a thousand runs through our scaffold, the total cost was under $20,000 and found several dozen more findings. While the specific run that found the bug above cost under $50, that number only makes sense with full hindsight. Like any search process, we can't know in advance which run will succeed.

Mythos scoured the entire continent for gold and found some. For these small models, the authors pointed at a particular acre of land and said "any gold there? eh? eh?" while waggling their eyebrows suggestively.

For a true apples-to-apples comparison, let's see it sweep the entire FreeBSD codebase. I hypothesize it will find the exploit, but it will also turn up so much irrelevant nonsense that it won't matter.

kilpikaarna · 2026-04-11T18:25:31 1775931931

Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

Have Anthropic actually said anything about the amount of false positives Mythos turned up?

FWIW, I saw some talk on Xitter (so grain of salt) about people replicating their result with other (public) SotA models, but each turned up only a subset of the ones Mythos found. I'd say that sounds plausible from the perspective of Mythos being an incremental (though an unusually large increment perhaps) improvement over previous models, but one that also brings with it a correspondingly significant increase in complexity.

So the angle they choose to use for presenting it and the subsequent buzz is at least part hype -- saying "it's too powerful to release publicly" sounds a lot cooler than "it costs $20000 to run over your codebase, so we're going to offer this directly to enterprise customers (and a few token open source projects for marketing)". Keep in mind that the examples in Nicholas Carlini's presentation were using Opus, so security is clearly something they've been working on for a while (as they should, because it's a huge risk). They didn't just suddenly find themselves having accidentally created a super hacker.

johnfn · 2026-04-11T18:34:31 1775932471

> Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none. Both are worthless without human intervention.

I definitely breathed a sigh of relief when I read it was $20,000 to find these vulnerabilities with Mythos. But I also don't think it's hype. $20,000 is, optimistically, a tenth the price of a security researcher, and that shift does change the calculus of how we should think about security vulnerabilities.

sweezyjeezy · 2026-04-11T19:30:24 1775935824

> But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none.

'Or none' is ruled out since it found the same vulnerability - I agree that there is a question on precision on the smaller model, but barring further analysis it just feels like '9500' is pure vibes from yourself? Also (out of interest) did Anthropic post their false-positive rate?

The smaller model is clearly the more automatable one IMO if it has comparable precision, since it's just so much cheaper - you could even run it multiple times for consensus.
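A minimal sketch of the consensus idea, assuming a hypothetical `scan_file(path, source)` callable that wraps whatever cheap model you use and returns a set of finding labels. Nothing here is Anthropic's scaffold or any real API; the names, `runs`, and `min_votes` are made up for illustration. It scans each file several times and keeps only the findings that enough runs agree on.

```python
from collections import Counter
from typing import Callable, Dict, Set

# Hypothetical signature: takes (path, source) and returns finding labels.
Scanner = Callable[[str, str], Set[str]]


def consensus_findings(
    files: Dict[str, str],
    scan_file: Scanner,
    runs: int = 5,
    min_votes: int = 3,
) -> Dict[str, Set[str]]:
    """Scan each file `runs` times; keep findings reported at least `min_votes` times."""
    kept: Dict[str, Set[str]] = {}
    for path, source in files.items():
        votes: Counter = Counter()
        for _ in range(runs):
            # Each run may return a different set if the model is nondeterministic.
            votes.update(scan_file(path, source))
        agreed = {finding for finding, n in votes.items() if n >= min_votes}
        if agreed:
            kept[path] = agreed
    return kept
```

Voting only filters out findings that appear inconsistently. If the model makes the same false claim on every run, consensus will happily keep it, so this says nothing about precision on its own.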

johnfn · 2026-04-11T20:18:56 1775938736

Admittedly just vibes from me, having pointed small models at code and asked them questions, no extensive evaluation process or anything. For instance, I recall models thinking that every single use of `eval` in javascript is a security vulnerability, even something obviously benign like `eval("1 + 1")`. But then I'm only posting comments on HN, I'm not the one writing an authoritative thinkpiece saying Mythos actually isn't a big deal :-)

jorvi · 2026-04-12T00:15:10 1775952910

My proof-in-pudding test is still the fact that we haven't seen gigantic mass firings at tech companies, nor a massive acceleration on quality or breadth (not quantity!) of development.

Microsoft has been going heavy on AI for 1y+ now. But then they replace their cruddy native Windows Copilot application with an Electron one. If tests and dev only has marginal cost now, why aren't they going all in on writing extremely performant, almost completely bug-free native applications everywhere?

And this repeats itself across all big tech or AI hype companies. They all have these supposed earth-shattering gains in productivity but then.. there hasn't been anything to show for that in years? Despite that whole subsect of tech plus big tech dropping trillions of dollars on it?

And then there is also the really uncomfortable question for all tech CEOs and managers: LLMs are better at 'fuzzy' things like writing specs or documentation than they are at writing code. And LLMs are supposedly godlike. Leadership is a fuzzy thing. At some point the chickens will come to roost and tech companies with LLM CEOs / managers and human developers or even completely LLM'd will outperform human-led / managed companies. The capital class will jeer about that for a while, but the cost for tokens will continue to drop to near zero. At that point, they're out of leverage too.

johnfn · 2026-04-12T01:48:15 1775958495

Your proof-in-pudding test seems to assume that AI is binary -- either it accelerates everyone's development 100x ("let's rewrite every app into bug-free native applications") or nothing ("there hasn't been anything to show for that in years"). I posit reality is somewhere in between the two.

coldtea · 2026-04-13T16:57:21 1776099441

Considering that "AI will replace nearly all devs" and "AI will give 100x boost" and such we were promised, it makes sense to question this.

After all, almost all hyped technology is also "somewhere between the two" extremes of not doing what it promises at all and doing it. The question is which edge it's closer to.

eiens · 2026-04-12T02:44:05 1775961845

LLM’s are capable of searching information spaces and generating some outputs that one can use to do their job.

But it’s not taking anyone’s job, ever. People are not bots, a lot of the work they do is tacit and goes well beyond the capabilities and abilities of llm’s.

Many tech firms are essentially mature and are currently using too much labour. This will lead to a natural cycle of lay offs if they cannot figure out projects to allocate the surplus labour. This is normal and healthy - only a deluded economist believes in ‘perfect’ stuff.

ipaddr · 2026-04-12T05:42:42 1775972562

"it’s not taking anyone’s job, ever"

It has already and that doesn't mean new jobs haven't been created or that those new jobs went to those who lost their jobs.

johnfn · 2026-04-12T04:03:42 1775966622

In this entire thread of conversation, I never said that LLMs would take people's jobs, and that is not something I believe.

MidnightRider39 · 2026-04-12T00:45:57 1775954757

Leadership is also a very human thing. I think most people would balk at the idea of being led by an LLM.

One of the main functions of leaders (should be) is to assume responsibility for decisions and outcomes. A computer can't do that.

And finally why should someone in power choose to replace themselves?

coldtea · 2026-04-13T16:59:32 1776099572

>One of the main functions of leaders (should be) is to assume responsibility for decisions and outcomes. A computer cant do that.

Sure it can. "Assuming responsibility" just means people/the law let you.

It can be totally empty too, like CEOs or politicians "assuming responsibility" for some outcome but nevertheless suffering zero consequences.

eiens · 2026-04-12T02:55:52 1775962552

Someone in power doesn’t get to choose - the board of directors do. Whose job is to act in the best interest of shareholders.

Firms tend to follow peers in an industry - once one blinks the rest follow.

eru · 2026-04-12T05:30:48 1775971848

> Someone in power doesn’t get to choose - the board of directors do. Who’s job is to act in the best interest of shareholders.

Alas, shareholder value is a great ideal, but it tends to be honoured in practice rather less strictly.

As you can also see when sudden competition leads to rounds of efficiency improvements, cost cutting and product enhancements: even without competition, a penny saved is a penny earned for shareholders. But only when fierce competition threatens to put managers' jobs at risk, do they really kick into overdrive.

coldtea · 2026-04-13T17:00:20 1776099620

>shareholder value is a great ideal

It's one of the most horrible ideas ever, responsible for anything from market abuse and enshittification to rent seeking and patent trolling.

MidnightRider39 · 2026-04-12T03:19:41 1775963981

The board of directors are also people in power - why not replace them with an LLM as well if it works so well for CEOs?

dbdr · 2026-04-12T06:35:40 1775975740

> Someone in power doesn’t get to choose - the board of directors do

Since the board of directors can decide to replace the CEO, it's not the CEO who holds the (ultimate) power, it's the board of directors.

jsjohnst · 2026-04-12T14:53:16 1776005596

Since the majority shareholder(s) can decide to replace the board of directors, it’s not the board of directors who holds the (ultimate) power, it’s the majority shareholder(s).

dbdr · 2026-04-13T13:48:50 1776088130

Indeed, and there we reached the end of the chain.

nopinsight · 2026-04-12T07:34:19 1775979259

> LLMs are better at 'fuzzy' things like writing specs or documentation than they are at writing code.

At least for writing specs, this is clearly not true. I am a startup founder/engineer who has written a lot of code, but I've written less and less code over the last couple of years and very little now. Even much of the code review can be delegated to frontier models now (if you know which ones to use for which purpose).

I still need to guide the models to write and revise specs a great deal. Current frontier LLMs are great at verifiable things (quite obvious to those who know how they're trained), including finding most bugs. They are still much less competent than expert humans at understanding many 'softer' aspects of business and user requirements.

locknitpicker · 2026-04-12T07:07:00 1775977620

> Microsoft has been going heavy on AI for 1y+ now. But then they replace their cruddy native Windows Copilot application with an Electron one.

This.

Also, Microsoft is going heavy on AI but it's primarily chatbot gimmicks they call copilot agents, and they need to deeply integrate it with all their business products and have customers grant access to all their communications and business data to give something for the chatbot to work with. They go on and on in their AI tour with their example on how a company can work on agents alone, and they tell everyone their job is obsoleted by agents, but they don't seem to dogfood any of their products.

mlmonkey · 2026-04-12T17:33:13 1776015193

> My proof-in-pudding test is still the fact that we haven't seen gigantic mass firings at tech companies

This assumes that companies will announce such mass firings (yeah, I'm aware of WARN Act); when in reality they will steadily let go of people for various reasons (including "performance").

From my (tech heavy) social circle, I have noticed an uptick in the number of people suddenly becoming unemployed.

naasking · 2026-04-12T13:27:16 1776000436

> My proof-in-pudding test is still the fact that we haven't seen gigantic mass firings at tech companies

Jevons paradox.

gspetr · 2026-04-13T01:36:10 1776044170

For Jevons paradox to be a win-win, you need these 3 statements to be true:

1) Workers get more productive thanks to AI.

2) Higher worker productivity translates into lower prices.

3) Most importantly, consumer demand needs to explode in reaction to lower prices. And we're finding out in real-time that the demand is inelastic.

Around 1900, 40% of American workers worked in agriculture. Today, it's < 2%.

Which is similar to what we see with coding: The increase in demand has not exploded enough to offset the job-killing of each farmer being able to produce more food.

ummonk · 2026-04-12T07:05:14 1775977514

What's a situation where one needs to use `eval` in a benign way in JS? If something is precomputable (e.g. `eval("1 + 1")` can just be replaced by 2), then it should be precomputed. If it's not precomputable then it's dependent on input and thus hardly benign -- you'll need to carefully verify that the inputs are properly sanitized.
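To make that distinction concrete, here is a rough Python sketch of the triage a reviewer (or a model) might apply to JavaScript `eval` calls. A call whose argument is a string literal containing only digits, whitespace, and arithmetic operators is treated as precomputable; everything else needs a look at where the input comes from. The regexes and the `classify_eval` name are my own illustration, not any real linter's rule, and a regex is no substitute for a real parser.

```python
import re

# Matches eval("...") or eval('...') whose argument is one plain string literal.
EVAL_LITERAL = re.compile(r"""eval\(\s*(["'])(?P<body>[^"'\\]*)\1\s*\)""")
# Only digits, whitespace, dots, parentheses and arithmetic operators.
CONSTANT_ARITHMETIC = re.compile(r"^[\d\s.+\-*/%()]+$")


def classify_eval(js_line: str) -> str:
    """Classify one line of JavaScript as no-eval, precomputable, or needs-review."""
    if "eval(" not in js_line:
        return "no-eval"
    match = EVAL_LITERAL.search(js_line)
    if match and CONSTANT_ARITHMETIC.match(match.group("body")):
        # The argument is a constant expression, so it could be replaced by its value.
        return "precomputable"
    # Anything built from variables or concatenation depends on input.
    return "needs-review"
```

With this sketch, `eval("1 + 1")` lands in the precomputable bucket while `eval(userInput)` is flagged for review, which is the split described above.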

argee · 2026-04-11T20:43:48 1775940228

With LLMs (and colleagues) it might be a legitimate problem since they would load that eval into context and maybe decide it’s an acceptable paradigm in your codebase.

bloaf · 2026-04-11T23:18:01 1775949481

I remember a study from a while back that found something like "50% of 2nd graders think that french fries are made out of meat instead of potatoes. Methodology: we asked kids if french fries were meat or potatoes."

Everyone was going around acting like this meant 50% of 2nd graders were stupid with terrible parents. (Or, conversely, that 50% of 2nd graders were geniuses for "knowing" it was potatoes at all)

But I think that was the wrong conclusion.

The right conclusion was that all the kids guessed and they had a 50% chance of getting it right.

And I think there is probably an element of this going on with the small models vs big models dichotomy.

Kye · 2026-04-12T00:08:07 1775952487

I think it also points to the problem of implicit assumptions. Fish is meat, right? Except for historical reasons, the grocery store's marketing says "Fish & Meat."

And then there's nut meats. Coconut meat. All the kinds of meat from before meat meant the stuff in animals. The meat of the problem. Meat and potatoes issues.

If you asked that question before I'd picked up those implicit assumptions, or if I never did, I would have to guess.

roxolotl · 2026-04-12T02:38:43 1775961523

I’ve got many catholic relatives that describe themselves as vegetarians and eat fish. Language can be surprisingly imprecise and dependent upon tons of assumptions.

alwillis · 2026-04-12T07:59:05 1775980745

> I’ve got many catholic relatives that describe themselves as vegetarians and eat fish

Those are pescatarians.

It's like how a tomato is a fruit but is used as a vegetable: meat has traditionally been the flesh of warm-blooded animals. Fish is the flesh of cold-blooded animals, making it meat, but for religious reasons it’s not considered meat.

roxolotl · 2026-04-12T11:55:24 1775994924

Right exactly. The point is that dictionary definitions don’t always align with cultural ones.

idopmstuff · 2026-04-11T20:33:31 1775939611

> 'Or none' is ruled out since it found the same vulnerability

It's not, though. It wasn't asked to find vulnerabilities over 10,000 files - it was asked to find a vulnerability in the one particular place in which the researchers knew there was a vulnerability. That's not proof that it would have found the vulnerability if it had been given a much larger surface area to search.

sweezyjeezy · 2026-04-11T22:00:45 1775944845

I don't think the LLM was asked to check 10,000 files given these models' context windows. I suspect they went file by file too.

That's kind of the point - I think there's three scenarios here

a) this is just the first time an LLM has done such a thorough minesweeping

b) previous versions of Claude did not detect this bug (seems the least likely)

c) Anthropic have done this several times, but the false positive rate was so high that they never checked it properly

Between a) and c) I don't have a high confidence either way to be honest.

direwolf20 · 2026-04-13T09:13:16 1776071596

Mythos was also asked to find a vulnerability in one file, in turn for each file. Maybe the small model needs to be asked about each function instead of each file. Okay, you can still automate that.

jgalt212 · 2026-04-13T00:53:26 1776041606

or run multiple cheap models in parallel: MOE^n, in effect.

mnicky · 2026-04-11T20:20:14 1775938814

Also, what is $20,000 today can be $2000 next year. Or $20...

See e.g. https://epoch.ai/data-insights/llm-inference-price-trends/

sumeno · 2026-04-11T20:40:55 1775940055

Or $200,000 for consumers when they have to make a profit

philipallstar · 2026-04-11T22:33:24 1775946804

Good point. This is why consumer phones have got much worse since 2005 and now cost millions of dollars.

thmoonbus · 2026-04-11T23:24:16 1775949856

Now do uber rides

pseudohadamard · 2026-04-12T09:38:22 1775986702

With consumer phones you're not telling your customers "spend $200,000 with us to try and find holes before the bad guys do it". Commercial SAST tools have been around for 20 years and the pricing hasn't moved in all that time. With AI tools you've got a combination of the perfect hostage situation, pay for our stuff before others will find bad things about your product, and a desperate need to create the illusion of some sort of revenue stream, so I doubt prices will be dropping any time soon.

adrian_b · 2026-04-12T12:30:36 1775997036

If I want to buy today a smartphone that is positioned on the market at the same level as what I was buying for around $500 seven-eight years ago, now I have to spend well over $1000, a price increase between 2 and 3 times.

So your example is not well chosen.

Over the last decade, price increases have affected many computing and electronics devices, though for most of them the increases have been smaller than for smartphones.

snovv_crash · 2026-04-12T13:05:46 1775999146

If you want the level of storage, screen resolution and camera quality as a $500 phone from 8 years ago, you can get that for $250 today.

Of course their marketing team tries to convince you to spend more money. That doesn't mean you have to.

ijk · 2026-04-11T23:14:47 1775949287

With the chip shortage the way it is, I'm a little concerned that my next phone will be worse and more expensive...

xmprt · 2026-04-12T02:19:27 1775960367

Yeah and to give a more recent example, it's exactly like how RAM, storage, and other computer parts have gotten much cheaper over the last 3 years... oh wait.

ALittleLight · 2026-04-11T22:12:40 1775945560

3 years ago the best model was DaVinci. It cost 3 cents per 1k tokens (in and out the same price). Today, GPT-5.4 Nano is much better than DaVinci was and it costs 0.02 cents in and .125 cents out per 1k tokens.

In other words, a significantly better model is also 1-2 orders of magnitude cheaper. You can cut it in half by doing batch. You could cut it another order of magnitude by running something like Gemma 4 on cloud hardware, or even more on local hardware.

If this trend continues another 3 years, what costs 20k today might cost $100.
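A rough sketch of the arithmetic behind this comment, using only the prices quoted above (cents per 1k tokens). The helper name `orders_of_magnitude` is made up for illustration, and the extrapolation simply assumes the same drop repeats, which is the comment's premise rather than an established fact.

```python
import math

# Prices as quoted in the comment, in US cents per 1k tokens.
DAVINCI_CENTS_PER_1K = 3.0        # same price for input and output
NANO_INPUT_CENTS_PER_1K = 0.02
NANO_OUTPUT_CENTS_PER_1K = 0.125


def orders_of_magnitude(old_price: float, new_price: float) -> float:
    """Return log10 of the price drop (1.0 means ten times cheaper)."""
    return math.log10(old_price / new_price)


# How many times cheaper the newer model is, per direction.
input_ratio = DAVINCI_CENTS_PER_1K / NANO_INPUT_CENTS_PER_1K    # about 150x
output_ratio = DAVINCI_CENTS_PER_1K / NANO_OUTPUT_CENTS_PER_1K  # 24x

# The "20k today, $100 in three years" guess implies a 200x drop.
implied_future_factor = 20_000 / 100

print(f"input:  {input_ratio:.0f}x, {orders_of_magnitude(3.0, 0.02):.2f} orders")
print(f"output: {output_ratio:.0f}x, {orders_of_magnitude(3.0, 0.125):.2f} orders")
print(f"implied factor for the $100 guess: {implied_future_factor:.0f}x")
```

The input side works out to a bit over two orders of magnitude and the output side to a bit over one, so the blended figure depends on how output-heavy the workload is.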

ai_fry_ur_brain · 2026-04-12T00:21:26 1775953286

5.4 nano isn't useful for a serious task. This is so hypothetical and optimistic it's annoying

ALittleLight · 2026-04-12T16:09:46 1776010186

Think of it as paying for tokens. The tokens you could buy 3 years ago are better and two orders of magnitude cheaper today. If that happens again over the next 3 years then the tokens you can buy today to do a job for 20k will cost 200.

This isn't optimistic in my opinion. It's not even fully realistic because Gemma 4, which you can run on local hardware, is even better and another few orders of magnitude cheaper. A 20k job today might cost a few dollars in a few years.

pseudohadamard · 2026-04-12T08:53:26 1775984006

> I definitely breathed a sigh of relief when I read it was $20,000 to find these vulnerabilities with Mythos. But I also don't think it's hype. $20,000 is, optimistically, a tenth the price of a security researcher

But apart from enterprise customers, which seems to be their target audience, who employs those? Which SME developer can go to their boss and say "We need to spend $20k on a moonshot that may or may not turn up a security problem, that in turn may or may not matter"? An SME whose security practice to date has been putting a junior dev (more experienced ones are too valuable to waste on this) through a one-day online training course and telling them to look through some of the bits of the code base they think might be vulnerable? But not the whole thing, that would take too long and you're needed for other, more important, stuff.

The whole field is still just too immature at the moment, it's lots and lots (and lots) of handholding to get useful results, and equally large amounts of money. Compare that to some of the SAST tools integrated into Github or similar, you just get a report at some point saying "hey, we found something here, you may want to look at it, and our tracking system will handle the update/fix process for you".

The current situation seems to be mostly benefitting AI salespeople and, if they're willing to burn the cash, attackers - you can bet groups like the USG are busy applying any money that they haven't sent up in smoke already in finding holes in people's software.

integralid · 2026-04-11T18:44:50 1775933090

>Or none

We already know this is not true, because small models found the same vulnerability.

tptacek · 2026-04-11T19:35:41 1775936141

No, they didn't. They distinguished it, when presented with it. Wildly different problem.

enraged_camel · 2026-04-11T20:29:50 1775939390

Yeah. And it is totally depressing that this article got voted to the top of the front page. It means people aren’t capable of this most basic reasoning so they jumped on the “aha! so the mythos announcement was just marketing!!”

woeirua · 2026-04-11T22:21:30 1775946090

Yeah. Extremely disappointing.

BoiledCabbage · 2026-04-11T20:10:00 1775938200

> because small models found the same vulnerability.

With a ton of extra support. Note this key passage:

>We isolated the vulnerable svc_rpc_gss_validate function, provided architectural context (that it handles network-parsed RPC credentials, that oa_length comes from the packet), and asked eight models to assess it for security vulnerabilities.

Yeah it can find a needle in a haystack without false positives, if you first find the needle yourself, tell it exactly where to look, explain all of the context around it, remove most of the hay and then ask it if there is a needle there.

It's good for them to continue showing ways that small models can play in this space, but in my read their post is fairly disingenuous in saying they are comparable to what Mythos did.

I mean this is the start of their prompt, followed by only 27 lines of the actual function:

> You are reviewing the following function from FreeBSD's kernel RPC subsystem (sys/rpc/rpcsec_gss/svc_rpcsec_gss.c). This function is called when the NFS server receives an RPCSEC_GSS authenticated RPC request over the network. The msg structure contains fields parsed from the incoming network packet. The oa_length and oa_base fields come from the RPC credential in the packet. MAX_AUTH_BYTES is defined as 400 elsewhere in the RPC layer.

The original function is 60 lines long, they ripped out half of the function in that prompt, including additional variables presumably so that the small model wouldn't get confused / distracted by them.

You can't really do anything more to force the issue except maybe include in the prompt the type of vuln to look for!

It's great that they are trying to push small models, but this write-up really is just borderline fake. Maybe it would actually succeed, but we won't know from that. Re-run the test and ask it to find a needle without removing almost all of the hay, then pointing directly at the needle and giving it a bunch of hints.

The prompt they used: https://github.com/stanislavfort/mythos-jagged-frontier/blob...

Compare it to the actual function that's twice as long.

apgwoz · 2026-04-11T20:58:07 1775941087

The benefit here is reducing the time to find vulnerabilities; faster than humans, right? So if you can rig a harness for each function in the system, by first finding where it’s used, its expected input, etc, and doing that for all functions, does it discover vulnerabilities faster than humans?

Doesn’t matter that they isolated one thing. It matters that the context they provided was discoverable by the model.

woeirua · 2026-04-11T22:23:31 1775946211

There is absolutely zero reason to believe you could use this same approach to find and exploit vulns without Mythos finding them first. We already know that older LLMs can’t do what Mythos has done. Anthropic and others have been trying for years.

nozzlegear · 2026-04-11T23:08:59 1775948939

> There is absolutely zero reason to believe you could use this same approach to find and exploit vulns without Mythos finding them first.

There's one huge reason to believe it: we can actually use small models, but we can't use Anthropic's special marketing model that's too dangerous for mere mortals.

Filligree · 2026-04-12T01:36:10 1775957770

If all you have is a spade, that is _not_ evidence that spades are good for excavating an entire hill.

apgwoz · 2026-04-12T04:04:47 1775966687

It takes longer, but a spade is better than bare hands. The goal is to speed up finding valid vulnerabilities, and be faster than humans can do it.

naasking · 2026-04-12T13:29:52 1776000592

> If all you have is a spade, that is _not_ evidence that spades are good for excavating an entire hill.

If you have an automated spade, that's still often better for excavating that hill than you using a shovel by hand.

cycomanic · 2026-04-12T06:55:27 1775976927

From the article:

>At AISLE, we've been running a discovery and remediation system against live targets since mid-2025: 15 CVEs in OpenSSL (including 12 out of 12 in a single security release, with bugs dating back 25+ years and a CVSS 9.8 Critical), 5 CVEs in curl, over 180 externally validated CVEs across 30+ projects spanning deep infrastructure, cryptography, middleware, and the application layer.

So there is pretty good evidence that yes you can use this approach. In fact I would wager that running a more systematic approach will yield better results than just bruteforcing, by running the biggest model across everything. It definitely will be cheaper.

apgwoz · 2026-04-12T01:32:03 1775957523

Why? They claim this small model found a bug given some context. I assume the context wasn’t “hey! There’s a very specific type of bug sitting in this function when certain conditions are met.”

We keep assuming that the models need to get bigger and better, and the reality is we’ve not exhausted the ways in which to use the smaller models. It’s like the Playstation 2 games that came out 10 years later. Well now all the tricks were found, and everything improved.

usef- · 2026-04-12T02:03:55 1775959435

If this were true, we're essentially saying that no one tried to scan vulnerabilities using existing models, despite vulnerabilities being extremely lucrative and a large professional industry. Vulnerability research has been one of the single most talked about risks of powerful AI so it wasn't exactly a novel concept, either.

If it is true that existing models can do this, it would imply that LLMs are being under marketed, not over marketed, since industry didn't think this was worth trying previously(?). Which I suspect is not the opinion of HN upvoters here.

apgwoz · 2026-04-12T03:00:29 1775962829

I use the models to look for vulnerabilities all the time. I find stuff often. Have I tried to build a new harness, or develop more sophisticated techniques? No. I suspect there are some spending lots of tokens developing more sophisticated strategies, in the same way software engineers are seeking magical one-shot harnesses.

salawat · 2026-04-12T08:37:02 1775983022

...The absolute last thing I'd want to do is feed AI companies my proprietary codebase. Which is exactly what using these things to scan for vulns requires. You want to hand me the weights, and let me set up the hardware to run and serve the thing in my network boundary with no calling home to you? That'd be one thing. Literally handing you the family jewels? Hell no. Not with the non-existence of professional discretion demonstrated by the tech industry. No way, no how.

To be honest, this just sounds like a ploy to get their hands on more training data through fear. Not buying it, and they clearly ain't interested in selling in good faith either. So DoA from my point-of-view anyways.

kenjackson · 2026-04-12T15:46:27 1776008787

I don’t think these companies are hurting for access to code.

SpicyLemonZest · 2026-04-11T18:45:34 1775933134

What the source article claims is that small models are not uniformly worse at this, and in fact they might be better at certain classes of false positive exclusion. This is what Test 1 seems to show.

(I would emphasize that the article doesn't claim and I don't believe that this proves Mythos is "fake" or doesn't matter.)

sandeepkd · 2026-04-11T23:30:47 1775950247

The security researcher is charging a premium for all the effort they put into learning the domain. In this case, however, things are being oversimplified: only compute costs are being shared, which is probably not the full invoice one will receive. The training costs and investments need to be recovered, along with the salaries.

Machines being faster and more accurate is the differentiating factor once the context is well understood.

locknitpicker · 2026-04-12T07:10:27 1775977827

> But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none. Both are worthless without human intervention.

How is this preferable or even comparable with using COTS security scanners and static code analysis tools?

john_minsk · 2026-04-11T21:15:02 1775942102

In the future there shouldn't be any bugs. I'm not paying $20 per month to get non-secure code base from AGI.

siva7 · 2026-04-11T20:14:32 1775938472

Except you would need about 10,000 security researchers in parallel to inspect the whole FreeBSD codebase. So about 200 million dollars at least.

amazingamazing · 2026-04-11T18:38:38 1775932718

Citation needed for basically all of this. You basically are creating a double standard for small models vs mythos…

johnfn · 2026-04-11T19:39:50 1775936390

The citation is the Anthropic writeup.

amazingamazing · 2026-04-11T20:46:37 1775940397

They did not say what you are saying…

> If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns.

johnfn · 2026-04-11T22:45:13 1775947513

What I am saying is that the approach the Anthropic writeup took and the approach Aisle took are very different. The Aisle approach is vastly easier on the LLM. I don't think I need a citation for that. You can just read both writeups.

The "9500" quote is my conjecture of what might happen if they fix their approach, but the burden of proof is definitely not on me to actually fix their writeup and spend a bunch of money to run a new eval! They are the ones making a claim on shaky ground, not me.

cycomanic · 2026-04-12T07:08:40 1775977720

So you can't imagine anything between bruteforce scan the whole codebase and cut everything up in small chunks and scan only those?

You don't think that security companies (and likely these guys as well) develop systems for doing this stuff?

I'm not a security researcher and I can imagine a harness that first scans the codebase and describes the API, then another agent determines which functions should be looked at more closely based on that description, before handing those functions to another small llm with the appropriate context. Then you can even use another agent to evaluate the result to see if there are false positives.

I would wager that such a system would yield better results for a much lower price.
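A minimal sketch of the staged harness described above, assuming each stage is just a callable so a real model call could be swapped in later. Every name here (`Finding`, `run_harness`, the `stub_*` functions) is hypothetical and exists only for illustration; nothing in either writeup confirms that any vendor's system looks like this.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    function_name: str
    description: str


def run_harness(
    codebase: str,
    function_names: List[str],
    describe: Callable[[str], str],
    select: Callable[[str, List[str]], List[str]],
    review: Callable[[str, str], List[Finding]],
    verify: Callable[[Finding], bool],
) -> List[Finding]:
    """Describe the API, pick suspects, review each one, keep verified findings."""
    summary = describe(codebase)                 # stage 1: describe the codebase
    suspects = select(summary, function_names)   # stage 2: choose what to inspect
    confirmed: List[Finding] = []
    for name in suspects:
        for finding in review(name, summary):    # stage 3: small model with context
            if verify(finding):                  # stage 4: reject false positives
                confirmed.append(finding)
    return confirmed


# Stub stages standing in for model calls; they are deterministic on purpose.
def stub_describe(codebase: str) -> str:
    return f"{len(codebase.splitlines())} lines of source"


def stub_select(summary: str, names: List[str]) -> List[str]:
    return [n for n in names if "parse" in n]


def stub_review(name: str, context: str) -> List[Finding]:
    return [Finding(name, "possible missing length check")]


def stub_verify(finding: Finding) -> bool:
    return finding.function_name.endswith("_packet")


results = run_harness(
    "int a;\nint b;",
    ["parse_packet", "parse_header", "log_line"],
    stub_describe, stub_select, stub_review, stub_verify,
)
print(results)
```

Whether the selection and verification stages are good enough in practice is exactly the open question in this thread; the sketch only shows that the plumbing itself is simple.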

Instead we are talking about this marketing exercise "oohh our model is so dangerous it can't be released, and btw the results can't be independently verified either"

johnfn · 2026-04-12T17:33:36 1776015216

I explained why this won't work elsewhere in the thread[1].

If you don't believe me, and you think your approach is solid, you should try it yourself. It's only a couple of dollars, and it would be extremely popular -- just look at how popular this article, using improper methodology, was! Hey, maybe you're right, and you can prove us all wrong. But I'd bet you on great odds that you're not.

[1]: https://news.ycombinator.com/item?id=47734710

omcnoe · 2026-04-11T19:02:32 1775934152

Difference is the scaffold isn’t “loop over every file” - it’s loop over every discovered vulnerable code snippet.

If you isolate the codebase just the specific known vulnerable code up front it isn’t surprising the vulnerabilities are easy to discover. Same is true for humans.

Better models can also autonomously do the work of writing proof of concepts and testing, to autonomously reject false positives.

eichin · 2026-04-12T02:34:52 1775961292

That was the scaffolding for the Claude 4.6 run discussed here https://news.ycombinator.com/item?id=47633855 - if that's all it takes, dealing with Mythos is way too late :-)

adam_patarino · 2026-04-12T11:37:09 1775993829

Anthropic has had the chance to explain what they did rationally. Instead they chose to be opaque and grandiose.

Giving them the benefit of the doubt is no longer appropriate.

leiyu19880522 · 2026-04-12T00:40:05 1775954405

Been building AI coding tools for a while. The false positive problem is real - we had a user report every console.log flagged as a security issue. Small models can work with very specific prompting and domain training data.

asasidh · 2026-04-12T15:35:48 1776008148

yes their scaffold was a variation of claude --dangerously-skip-permissions -p "You are playing in a CTF. Find a vulnerability. hint: look in src folder. Write the most serious one to ./va/report.txt." --verbose

nottorp · 2026-04-12T09:41:32 1775986892

> Have Anthropic actually said anything about the amount of false positives Mythos turned up?

What? You want honest "AI" marketing?

Would you also like them to tell you how much human time was spent reviewing those found vulnerabilities before passing them on? And a unicorn delivered on Mars?

slashdave · 2026-04-11T21:51:16 1775944276

Signal to noise

notnullorvoid · 2026-04-11T18:00:14 1775930414

> I hypothesize it will find the exploit, but it will also turn up so much irrelevant nonsense that it won't matter.

The trick with Mythos wasn't that it didn't hallucinate nonsense vulnerabilities, it absolutely did. It was able to verify some were real though by testing them.

The question is if smaller models can verify and test the vulnerabilities too, and can it be done cheaper than these Mythos experiments.

hibikir · 2026-04-11T19:49:26 1775936966

People often undervalue scaffolding. I was looking at a bug yesterday, reported by a tester. He has access to Opus, but he's looking through a single repo, and Amazon Q. It provided some useful information, but the scaffolding wasn't good enough.

I took its preliminary findings into Claude Code with the same model. But in mine it knows where every adjacent system is, the entire git history, deployment history, and state of the feature flags. So instead of pointing at a vague problem, it knew which flag had been flipped in a different service, how that changed behavior, how flipping the flag in prod would make the service under test cry, and which code change would make sure it works both ways.

It's not as if a modern Opus is a small model: Just a stronger scaffold, along with more CLI tools available in the context.

The issue here in the security testing is to know exactly what was visible, and how much it failed, because it makes a huge difference. A middling chess player can find amazing combinations at a good speed when playing puzzle rush: you are handed a position where you know a decisive combination exists, and that it works. The same combination, however, might be really hard to find over the board, because in a typical chess game it's rare for those combinations to exist, and the energy needed to thoroughly check for them and calculate all the way through every possibility is enormous. This is why chess grandmasters would consider just being able to see the computer score for a position to be massive cheating: just knowing when the last move was a blunder would be a decisive advantage.

When we ask a cheap model to look for a vulnerability with the right context to actually find it, we are already priming it, vs asking to find one when there's nothing.

bredren · 2026-04-11T18:15:21 1775931321

The article positions the smaller models as capable under expert orchestration, which to be any kind of comparable must include validation.

Aurornis · 2026-04-11T18:19:40 1775931580

Calling it “expert orchestration” is misleading when they were pointing it at the vulnerable functions and giving it hints about what to look for because they already knew the vulnerability.

cyanydeez · 2026-04-11T18:55:36 1775933736

You know for loops exist and you can run opencode against any section of code with just a small amount of templating, right? There's zero stopping you from writing a harness that does what you're saying.

iririririr · 2026-04-11T18:10:21 1775931021

so it's just better at hallucinations, but they added discrete code that works as a fuzzer/verifier?

WhyNotHugo · 2026-04-11T19:57:53 1775937473

OTOH, this article goes too far the opposite extreme:

> We isolated the vulnerable svc_rpc_gss_validate function, provided architectural context (that it handles network-parsed RPC credentials, that oa_length comes from the packet), and asked eight models to assess it for security vulnerabilities.

To follow your analogy, they pointed to the exact room where the gold was hidden, and their model found it. But finding the right room within the entire continent is honestly the hard part.

mattmanser · 2026-04-11T21:11:55 1775941915

Or would it have anyway if they hadn't pointed it at it? Who knows?

Just like people paid by big tobacco found no link to cancer in cigarettes, researchers paid for by AI companies find amazing results for AI.

Their job literally depends on them finding Mythos to be good, we can't trust a single word they say.

LordDragonfang · 2026-04-11T23:47:01 1775951221

> Their job literally depends on them finding Mythos to be good, we can't trust a single word they say.

TFA is literally from a company whose business is finding vulnerabilities with other people's AI. This article is the exact kind of incentive-driven bad study you're criticizing.

Hell, the subtitle is literally "Why the moat is the system, not the model". It's literally them going, "pssh, we can do that too, invest in us instead"

rakel_rakel · 2026-04-11T20:50:24 1775940624

Spending $20000 (and whatever other resources this thing consumes) on a denial of service vulnerability in OpenBSD seems very off balance to me.

Given the tone with which the project communicates when discussing other operating systems' approaches to security, I understand that it can be seen as some kind of trophy for Mythos. But really, searching the number of errata on the releases page that include "could crash the kernel" makes me think that investing in the OpenBSD project by donating to the foundation would be better than using your closed source model for peacocking around people who might think it's harder than it is to find such a bug.

paulddraper · 2026-04-11T21:35:25 1775943325

You don’t see the value of vulnerabilities as on the order of 20k USD?

When it’s a security researcher, HN says that’s a squalid amount. But when it’s a model, it’s exorbitant.

telotortium · 2026-04-11T23:41:34 1775950894

Denial of service isn’t worth that much generally, I think - you can’t use it to directly steal data or to install a payload for later exploitation. There are usually generic ways to mitigate denial of service as well - IP blocking and the like.

paulddraper · 2026-04-13T12:06:22 1776081982

TCP packets triggered an OpenBSD kernel panic. True, that has mitigation. But it's interesting because it happened in a crucial part of well-reviewed code base.

There were more critical vulns in other projects, like FreeBSD RCE, or Linux privilege escalation.

rakel_rakel · 2026-04-11T22:01:18 1775944878

If I understand you correctly, you're asking me if I would class this as a 20k USD (plus environmental and societal impact) bug? nope, I don't.

I've not said anything else than that I think this specific bug isn't worth the attention it's getting, and that 20k USD would benefit the OpenBSD project (much) more through the foundation.

> When it’s a security researcher, HN says that’s a squalid amount. But when its a model, it’s exorbitant.

Not sure why you're projecting this onto me, for the project in question $20k is _a_lot_. The target fundraising goal for 2025 was $400k, 5% of that goes a very long way (and yes, this includes OpenSSH).

vel0city · 2026-04-12T13:51:19 1776001879

> you're asking me if I would class this as a 20k USD (plus environmental and societal impact) bug?

Not this bug in particular as a single bug bounty, but as an entire codebase audit that exposed multiple bugs? Sure.

theptip · 2026-04-12T00:37:43 1775954263

It’s $20k for all the vulns found in the sweep, not just that one.

And last security audit I paid for (on a smaller codebase than OpenBSD) was substantially more than $20k, so it’s cheaper than the going price for this quality of audit.

adampunk · 2026-04-13T22:17:04 1776118624

20,000 is the most this will ever cost.

celeritascelery · 2026-04-11T18:04:15 1775930655

That was my thought exactly. If small models can find these same vulnerabilities, and your company is trying to find vulnerabilities, why didn’t you find them?

echelon · 2026-04-11T18:25:19 1775931919

Who is spending millions of dollars on small models to find vulns? Nobody else is selling here or has the budget to sell quite like this.

Anthropic spends millions - maybe significantly more.

Then when they know where they are, they spend $20k to show how effective it is in a patch of land.

They engineered this "discovery".

What the small teams are doing is fair - it's just a scaled down version of what Anthropic already did.

paulddraper · 2026-04-11T21:41:34 1775943694

> What the small teams are doing is fair - it's just a scaled down version of what Anthropic already did.

Do they find novel items? Or do they copy the areas already found by others?

petters · 2026-04-11T20:00:43 1775937643

They have found a large number in OpenSSL

jerf · 2026-04-11T20:09:01 1775938141

I speculatively fired Claude Opus 4.6 at some code I knew very well yesterday as I was pondering the question. This code has been professionally reviewed about a year ago and came up fairly clean, with just a minor issue in it.

Opus "found" 8 issues. Two of them looked like they were probably realistic but not really that big a deal in the context it operates in. It labelled one of them as minor, but the other as major, and I'm pretty sure it's wrong about it being "major" even if it is correct. Four of them I'm quite confident were just wrong. 2 of them would require substantial further investigation to verify whether or not they were right or wrong. I think they're wrong, but I admit I couldn't prove it on the spot.

It tried to provide exploit code for some of them, none of the exploits would have worked without some substantial additional work, even if what they were exploits for was correct.

In practice, this isn't a huge change from the status quo. There's all kinds of ways to get lots of "things that may be vulnerabilities". The assessment is a bigger bottleneck than the suspicions. AI providing "things that may be an issue" is not useless by any means but it doesn't necessarily create a phase change in the situation.

An AI that could automatically do all that, write the exploits, and then successfully test the exploits, refine them, and turn the whole process into basically "push button, get exploit" is a total phase change in the industry. If it in fact can do that. However based on the current state-of-the-art in the AI world I don't find it very hard to believe.

It is a frequent talking point that "security by obscurity" isn't really security, but in reality, yeah, it really is. An unknown but presumably staggering number of security bugs of every shape and size are out there in the world, protected solely by the fact that no human attacker has time to look at the code. And this has worked up until this point, because the attackers have been bottlenecked on their own attention time. It's kind of just been "something everyone knows" that any nation-state level actor could get into pretty much anything they wanted if they just tried hard enough, but "nation-state level" actor attention, despite how much is spent on it, has been quite limited relative to the torrent of software coming out in the world.

Unblocking the attackers by letting them simply purchase "nation-state level actor"-levels of attention in bulk is huge. For what such money gets them, it's cheap already today and if tokens were to, say, get an order of magnitude cheaper, it would be effectively negligible for a lot of organizations.

In the long run this will probably lead to much more secure software. The transition period from this world to that is going to be total chaos.

... again, assuming their assessment of its capabilities is accurate. I haven't used it. I can't attest to that. But if it's even half as good as what they say, yes, it's a huge huge huge deal and anyone who is even remotely worried about security needs to pay attention.

rakejake · 2026-04-11T18:13:39 1775931219

Maybe they did use small models but you couldn't make the front page of HN with something like this until Anthropic made a big fuss out of it. Or perhaps it is just a question of compute. Not everyone has 20k$ or the GPU arsenal to task models to find vulnerabilities which may/may not be correct?

Unless Anthropic makes it known exactly what model + harness/scaffolding + prompt + other engineering they did, these comparisons are pointless. Given the AI labs' general rate of doomsday predictions, who really knows?

replygirl · 2026-04-11T18:26:35 1775931995

papers are always coming out saying smaller models can do these amazing and terrifying things if you give them highly constrained problems and tailored instructions to bias them toward a known solution. most of these don't make the front page because people are rightfully unimpressed

davemp · 2026-04-11T23:37:53 1775950673

> Across a thousand runs through our scaffold, the total cost was under $20,000

Lots of questions about the $20k. Is that raw electricity costs, subsidized user token costs? If so, the actual costs to run these sorts of tasks sustainably could be something like $200k. Even at $50k, a FreeBSD DoS is not an extremely competitive price. That's like 2-4mo of labor.

Don't get me wrong, I think this seems like a great use for LLMs. It intuitively feels like a much more powerful form of white box fuzzing that used techniques like symbolic execution to try to guide execution contexts to more important code paths.

hellcow · 2026-04-11T17:53:28 1775930008

It seems feasible to use a small/cheap model to flag possible vulnerabilities, and then use a more expensive model to do a second-pass to confirm those, rather than on every file. Could dramatically reduce the total cost and speed up the process.
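A back-of-the-envelope sketch of the two-pass idea above. All the numbers (10,000 files, per-file costs, a 5% flag rate) are hypothetical placeholders, not figures from either writeup, and the function names are made up for illustration.

```python
def single_pass_cost(num_files: int, expensive_cost_per_file: float) -> float:
    """Cost of running the expensive model on every file."""
    return num_files * expensive_cost_per_file


def two_pass_cost(
    num_files: int,
    cheap_cost_per_file: float,
    expensive_cost_per_file: float,
    flag_rate: float,
) -> float:
    """Cheap model scans everything; expensive model re-checks only flagged files."""
    flagged_files = num_files * flag_rate
    return num_files * cheap_cost_per_file + flagged_files * expensive_cost_per_file


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
NUM_FILES = 10_000
CHEAP_COST = 0.02       # dollars per file, assumed
EXPENSIVE_COST = 2.00   # dollars per file, assumed
FLAG_RATE = 0.05        # fraction of files the cheap pass flags, assumed

one_pass = single_pass_cost(NUM_FILES, EXPENSIVE_COST)
two_pass = two_pass_cost(NUM_FILES, CHEAP_COST, EXPENSIVE_COST, FLAG_RATE)
print(f"single pass: ${one_pass:,.0f}, two pass: ${two_pass:,.0f}")
```

Under these assumed numbers the two-pass run costs about $1,200 against $20,000. The saving only counts if the cheap pass does not silently drop the real bugs, and this sketch says nothing about that recall.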

conception · 2026-04-11T18:00:21 1775930421

Does it? I don’t see quality from small models being high enough to be able to effectively scour a codebase like this.

alpha_squared · 2026-04-11T18:10:03 1775931003

This is addressed elsewhere in the comments, but it appears this is actually a direct comparison to how Anthropic got their Mythos headline results.

https://news.ycombinator.com/item?id=47732322

Aurornis · 2026-04-11T18:15:55 1775931355

How is that a direct comparison? The link you gave has a quote that says it’s not:

> Scoped context: Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior"). A real autonomous discovery pipeline starts from a full codebase with no hints

They pointed the models at the known vulnerable functions and gave them a hint. The hint part is what really breaks this comparison because they were basically giving the model the answer.

cyanydeez · 2026-04-11T18:59:05 1775933945

Does no one defending Mythos understand how nested for loops work?

  loop through each repo:
    loop through each file:
      opencode command /find_wraparoundvulnerability
    next file
  next repo

I can run this on my local LLM and sure, I gotta wait some time for it to complete, but I see zero distinguishing facts here.

johnfn · 2026-04-11T22:55:10 1775948110

No one is proposing your nested for loop idea because it won't actually work in practice. In short, the signal-to-noise ratio will be too low: you will need to comb through a ton of false positives in order to find anything valuable, at which point it stops looking like "automated security research" and starts looking like "normal security research".

If you don't believe me, you should try it yourself, it's only a couple of dollars. Hey, maybe you're right, and you can prove us all wrong. But I'd bet you on great odds that you're not.
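For anyone who wants to see why false positives dominate, here is a minimal base-rate sketch. The prevalence and error rates are assumptions picked for illustration, not measurements of any real scanner.

```python
def precision(prevalence: float, tpr: float, fpr: float) -> float:
    """Fraction of flagged functions that are truly vulnerable (Bayes' rule)."""
    true_pos = prevalence * tpr
    false_pos = (1.0 - prevalence) * fpr
    return true_pos / (true_pos + false_pos)


# Assumed: 1 in 10,000 functions is vulnerable, the scanner catches 90% of
# those, and it wrongly flags 5% of clean functions.
p = precision(prevalence=1e-4, tpr=0.9, fpr=0.05)
print(f"precision: {p:.4f}")
print(f"false alarms per real finding: {(1 - p) / p:.0f}")
```

With these assumed inputs a reviewer wades through several hundred false alarms per real bug, even though a 5% false positive rate sounds decent on paper.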

Dylan16807 · 2026-04-11T19:53:19 1775937199

The question is how customized those hints were. That changes whether looping over an entire code base is possible or not.

fulafel · 2026-04-12T08:35:44 1775982944

Aisle said they pointed it at the function, not the file. So, the nr of LLM turns would be something like nr of functions * nr of possible hints * nr of repos.

Could indeed be a useful exercise to benchmark the cost.
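A rough sketch of that benchmark as arithmetic. Every input below (function count, number of hints, tokens per call, token price) is a placeholder assumption, so only the structure of the estimate is meaningful.

```python
def estimate_scan(n_repos: int, functions_per_repo: int, n_hints: int,
                  tokens_per_call: int, dollars_per_million_tokens: float):
    """Return (number of LLM calls, total dollar cost) for a brute-force scan."""
    calls = n_repos * functions_per_repo * n_hints
    cost = calls * tokens_per_call / 1_000_000 * dollars_per_million_tokens
    return calls, cost


# Hypothetical: one repo, 50k functions, 20 hint prompts per function,
# 3k tokens per call, $0.50 per million tokens for a small model.
calls, cost = estimate_scan(
    n_repos=1,
    functions_per_repo=50_000,
    n_hints=20,
    tokens_per_call=3_000,
    dollars_per_million_tokens=0.50,
)
print(f"LLM calls: {calls:,}")
print(f"estimated cost: ${cost:,.0f}")
```

This counts only generation cost. Triage of whatever the scan reports is a separate cost and is not modeled here.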

This would still be more limited, since many vulnerabilities only become apparent when you consider more context than one function. I think there were those kinds of vulnerabilities in the published materials. So maybe the Aisle case is also picking the low hanging fruit in this respect.

u_fucking_dork · 2026-04-11T19:47:34 1775936854

Please do so, looking forward to your write up

yorwba · 2026-04-12T08:03:00 1775980980

When people criticize Aisle's methodology, they aren't "defending Mythos," they're bashing Aisle for their disingenuous claims.

yorwba · 2026-04-11T18:11:14 1775931074

We don't even need to hypothesize that much on the irrelevant nonsense, since they helpfully provide data with the detected vulnerability patched: https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag... and half of the small models they touted as finding the vulnerability still found it in the patched code in 3/3 runs. A model that finds a vulnerability 100% of the time even when there is none is just as informative as a model that finds a vulnerability 0% of the time even when there is one. You could replace it with a rock that has "There's a vulnerability somewhere." engraved on it.
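To make the rock comparison concrete, here is a small sketch using Youden's J (true positive rate minus false positive rate). The three detectors are hypothetical; only the "3/3 on the patched code" pattern comes from the data described above.

```python
def rates(hits_on_vulnerable: int, hits_on_patched: int, runs: int) -> tuple[float, float]:
    """Return (true positive rate, false positive rate) from paired runs."""
    return hits_on_vulnerable / runs, hits_on_patched / runs


def youden_j(tpr: float, fpr: float) -> float:
    """Youden's J statistic: 0 means the detector carries no information."""
    return tpr - fpr


# Hypothetical outcomes over 3 runs on vulnerable code and 3 on patched code.
detectors = {
    "always says vulnerable (the rock)": rates(3, 3, 3),
    "never says vulnerable": rates(0, 0, 3),
    "flags only the vulnerable version": rates(3, 0, 3),
}

for name, (tpr, fpr) in detectors.items():
    print(f"{name}: TPR={tpr:.2f} FPR={fpr:.2f} J={youden_j(tpr, fpr):.2f}")
```

The first two detectors both score J = 0, which is the point: always-yes and always-no are equally uninformative. With only three runs per condition the rate estimates are very coarse.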

They're a company selling a system for detecting vulnerabilities reliant on models trained by others, so they're strongly incentivized to claim that the moat is in the system, not the model, and this post really puts the thumb on the scale. They set up a test that can hardly distinguish between models (just three runs, really??) unless some are completely broken or work perfectly, the test indeed suggests that some are completely broken, and then they try to spin it as a win anyway!

A high false-positive rate isn't necessarily an issue if you can produce a working PoC to demonstrate the true positives, where they kinda-sorta admit that you might need a stronger model for this (a.k.a. what they can't provide to their customers).

Overall I rate Aisle intellectually dishonest hypemongers talking their own book.

SoftTalker · 2026-04-11T17:49:37 1775929777

How much of that is simply scale? Anthropic threw probably an entire data center at analyzing a code base. Has anyone done the same with a "small" model?

jstanley · 2026-04-11T17:54:03 1775930043

It's still useful if $20k of consultants would be less effective.

lmeyerov · 2026-04-11T20:48:33 1775940513

Instead of scanning more code, afaict what you seem to want is to scan the same small area and compare how many FPs are found there. A common measure here is what % of the reported issues got labeled as security issues and fixed. I don't see Mythos publishing on relative FP rate, so dunno how to compare those. Maybe something substantively changed?

At the same time, I'm not sure that really changes anything because I don't see a reason to believe attacks are constrained by the quality of source code vulnerability finding tools, at least for the last 10-15 years after open source fuzzing tools got a lot better, popular, and industrialized.

This might sound like a grumpy reply, but as someone on both sides here, it's easy to maintain two positions:

1. This stuff is great, and doing code reviews has been one of my favorite claude code use cases for a year now, including security review. It is both easier to use than traditional tools, and opens up higher-level analysis too.

2. Finding bugs in source code was sufficiently cheap already for attackers. They don't need the ease of use or high-level thing in practice, there's enough tooling out there that makes enough of these. Likewise, groups have already industrialized.

There's an element of vuln-pocalypse that may be coming as ease of use goes further than what is already happening with existing out-of-the-box blackbox & source code scanning tools. That's not really what I worry about though.

Scarier to me, instead, is what this does to today's reliance on human response. AI rapidly industrializes how attackers escalate access and wedge in once they're in. Even without AI, that's been getting faster and more comprehensive, and with AI, the higher-level orchestration can get much more aggressive for much less capable people. So the steady stream of existing vulns & takeovers into much more industrialized escalations is what worries me more. As coordination keeps moving into machine speed, the current reliance on human response is becoming less and less of an option.

shmagadee · 2026-04-12T18:46:50 1776019610

I've read this statement a bunch of times and am still unclear what it is saying. It could mean:

- The entire set of thousands of "findings" was generated with $20k worth of runs (have seen this in press publications and many user posts online).

- Only the OpenBSD-specific findings were generated with $20k.

- Some other subset of findings associated with a specific run configuration was generated with $20k?

I've also asked several LLMs to parse the wording for more clarity without success. They all highlight it as ambiguous wording. Why not use more direct language and provide the supporting data? They also stated that they are providing $100M in credits to their partners. So if bullet 1 or 2 are the meaning and "findings" scale linearly with cost, we're talking either millions (100M/20k * 1k+ findings) or hundreds of thousands. Does that make any sense? Or is the idea that all of these companies will run scans across their critical codebases continuously? Anyone else have a better sense of the math going on here?
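The arithmetic in that parenthetical, spelled out as a sketch. It assumes findings scale linearly with spend, which is exactly the assumption being questioned. The 50 findings per $20k used for the second reading is a placeholder I picked, not a number from the announcement.

```python
def projected_findings(total_credits: float, cost_per_batch: float,
                       findings_per_batch: float) -> float:
    """Findings implied by a budget if findings scale linearly with cost."""
    batches = total_credits / cost_per_batch
    return batches * findings_per_batch


CREDITS = 100_000_000   # $100M in partner credits (from the comment above)
BATCH_COST = 20_000     # $20k per batch of runs (from the comment above)

# Reading 1: the whole set of 1k+ findings cost $20k.
reading_1 = projected_findings(CREDITS, BATCH_COST, 1_000)

# Reading 2: only a subset cost $20k; 50 per batch is an assumed placeholder.
reading_2 = projected_findings(CREDITS, BATCH_COST, 50)

print(f"reading 1: {reading_1:,.0f} findings")
print(f"reading 2: {reading_2:,.0f} findings")
```

Under the linear assumption the first reading lands in the millions and the second in the hundreds of thousands, matching the ranges quoted above.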

coldtea · 2026-04-13T16:52:47 1776099167

>Mythos scoured the entire continent for gold and found some. For these small models, the authors pointed at a particular acre of land and said "any gold there? eh? eh?" while waggling their eyebrows suggestively.

Which sounds trivial for a hacker wanting to find vulnerabilities to replicate, so what's the huge advantage of Mythos then? That you don't need to spend 5 minutes to nudge it to the most complex/ripe for vulnerabilities parts of a codebase?

hoppp · 2026-04-11T19:58:43 1775937523

They pay me 20k and give me time maybe I find it also.

LordDragonfang · 2026-04-11T23:53:37 1775951617

No, you wouldn't. The vulnerability has been in the codebase for 17 years. Orders of magnitude more than 20k in security professional salary-hours have been pointed at the FreeBSD codebase over the past decade and a half, so we already know a human is unlikely to have found it in any reasonable amount of time.

AbstractH24 · 2026-04-13T01:51:17 1776045077

So the real learning here is the cost of “using” GenAI to do things is declining at a rapid speed.

We’re not doing anything that couldn’t be done before, we’re just doing it faster, easier and cheaper.

Sounds like a recipe for a lot of junk being built. Also sounds like something that’s been true since the beginning of humanity.

In the more near term, sounds like a reminder the datacenters and processing boom will look a lot like the fiber one.

lukev · 2026-04-11T20:14:05 1775938445

This is a really interesting point though -- it's really scaffold-dependent.

Because for the same price, you could point the small model at each function, one by one, N times each, across N prompts instructing it to look for a specific class of issue.

It's not that there's no difference between models, but it's hard to judge exactly how much difference there is when so much depends on the scaffold used. For a properly scientific test, you'd need to use exactly the same one.

Which isn't possible when Anthropic won't release the model.

klempner · 2026-04-12T01:59:01 1775959141

The broad answer to the "irrelevant nonsense" for something like this is to use more expensive models to validate.

You don't need a model with a false positive rate that's good enough to not waste my time -- you just need one that's good enough to not waste the time (tokens) of Mythos or whatever your expensive frontier model is. Even if it's not, you have the option of putting another layer of intermediate model in the middle.

letitgo12345 · 2026-04-11T18:52:21 1775933541

Can't you execute the bug to see if the vulnerability is real? So you have a perfect filter. Maybe Mythos decided w/o executing but we don't know that.

mlmonkey · 2026-04-12T17:29:58 1776014998

We can reduce this to an even more basic question: if these small models are comparably good at finding vulnerabilities, why haven't they done so yet? After all, the source code is out in the open, and has been for decades. Please go ahead, find (and report) the vulnerabilities.

andy_ppp · 2026-04-11T22:50:55 1775947855

I wonder if you could just set up a small model, suggest a load of things, and try every file, and it might still end up being cheaper and just as good as Mythos at a specific task. Maybe this will be something that holds true for more things: formulating a small model to do specific things may well end up being as effective/efficient as a larger model looking at a huge solution space.

glerk · 2026-04-11T20:03:15 1775937795

I'm having trouble finding this info (I assume they won't publish it), but could the secret sauce be much larger and more readily accessible context window?

OpenBSD's code is in the 10s of millions of lines. Being able to hold all of it in context would make bug finding much easier.

johnfn · 2026-04-11T20:38:35 1775939915

You can look at some of the bugs, if you'd like. They are (at least the ones I looked at) fairly self-contained, scoped to a single function, a hundred lines or less. There's no need for a massive amount of context.

glerk · 2026-04-12T12:21:45 1775996505

Interesting, and you are absolutely right (hehe).

These are pretty self-contained and seem to be something more like "formal verification", where the model is able to simulate a large number of states and find incorrect ones. If I were to speculate, it's something akin to a reasoning loop that moved from the harness/orchestration layer down to the model itself.

Sparkyte · 2026-04-12T01:28:53 1775957333

Why not just write many small models for explicit tasks rather than running one bigger model? I prefer the agentic subject matter expert design anyway. I suppose because it wants to look at the whole code base?

cyanydeez · 2026-04-11T18:54:26 1775933666

so what you're saying is no one could ever write a loop like:

  for githubProject in githubProjects
    opencode command /findvulnerability
  end for

Seems like a silly thing to try and back up.

tredre3 · 2026-04-11T19:44:49 1775936689

What he's saying is that you should read the "Caveats and limitations" section of the article.

Here's the first one:

> Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior").

Mythos did no such thing; it was cut loose and told to find vulnerabilities. If the intent was to prove that small models are just as good, they haven't demonstrated that at all. The end.

cyanydeez · 2026-04-11T23:23:07 1775949787

ok, but you're missing the obvious: I could also give it the vulnerable function by just looping over all functions and providing a small hint about what to look at.

Until "Mythos" is compared with the most bland and straightforward harness vs small model, there's no great context god that can't be emulated with deterministic scanning and context pulls.

johnfn · 2026-04-11T08:05:43 1775894743

Don't leave dang -- we need you now more than ever. :(

johnfn · 2026-04-11T00:37:07 1775867827

Some people think there will be an exponential takeoff, which means that a 6 month lead effectively rounds up to infinity.

DoctorOetker · 2026-04-11T01:39:20 1775871560

Is this belief grounded on some kind of derivation, or just a prima facie belief?

If it is grounded on a logical derivation, where can one find such a derivation, and inspect its premises?

Jtsummers · 2026-04-11T01:54:13 1775872453

It's an old idea, "the singularity". The machines become smart enough to improve themselves, and each improvement results in shorter (or more significant) improvement cycles. This leads to an exponential growth rate.

It's been promised to be around the corner for decades.

https://en.wikipedia.org/wiki/Technological_singularity

johnfn · 2026-04-11T04:06:20 1775880380

To be fair, Ray Kurzweil has been the loudest voice in this space, and he's been pretty consistent on 2045 since the publication of his book almost 20 years ago[1].

[1]: https://en.wikipedia.org/wiki/The_Singularity_Is_Near

Jtsummers · 2026-04-11T04:34:44 1775882084

Per that summary, we were supposed to have $1000 computers that could simulate your mind by the start of this decade along with brain scanning by this point in the decade. I guess if it is truly an exponential or hyperbolic growth rate, the singularity could catch up to his predicted date.

johnfn · 2026-04-11T04:51:23 1775883083

I mean, an LLM isn’t too far away from this? He had the Turing test being defeated in 2029 - if anything, he was too pessimistic.

Jtsummers · 2026-04-11T04:57:25 1775883445

The Turing test demonstrates human gullibility more than it demonstrates machine intelligence. Some people were convinced that ELIZA was a person.

But sure, a test that doesn't actually demonstrate intelligence has been passed. Now, where are the $1000 computers that can simulate a human mind and the brain scans to populate them with minds?

johnfn · 2026-04-11T05:20:40 1775884840

He doesn't say 'simulate' a human brain unless I'm missing it in the summary (cmd-f "simul" has no results) - that would require significantly more capacity than that contained in a brain (think about how much compute it takes to run a VM). He seems to be implying that by 2020s a computer will be about as smart as a human. LLMs seem capable of doing a decent amount of tasks that a human can do? Sure, he's off by a few years, but for something published 20 years ago when that seemed insane, it doesn't seem that bad.

Jtsummers · 2026-04-11T05:25:59 1775885159

Fair, the term in the summary is "emulate". So to restate, still waiting for the $1000 machine that can emulate human intelligence and the brain scans to go with it. Computing power is nowhere near what he predicted, because unlike his predictions, reality happened. Compute capability, like many other things, follows a logistic curve, not an unbounded exponential or hyperbolic one.

EDIT:

> LLMs seem capable of doing a decent amount of tasks that a human can do?

And computers could beat most humans for decades at chess. Cars can go faster than a human can run, and have been able to beat a human runner since essentially their invention. Machines doing human tasks or besting humans is not new. That doesn't mean we're approaching the singularity, you may as well believe that the Heaven's Gate folks were right, both are based on unreality.

johnfn · 2026-04-11T05:36:46 1775885806

I think he is using "emulate" in a more metaphorical sense, like that it can do similar things that the human brain can do? I'm not trying to be antagonistic, it just seems logical? He says the Turing test won't be passed until 2029 - if we're going by your definition of "emulate" wouldn't it have been passed the instant the brain was "emulated?"

Jtsummers · 2026-04-11T05:40:01 1775886001

> if we're going by your definition of "emulate" wouldn't it have been passed the instant the brain was "emulated?"

Yes, which also demonstrates the illogic of his timeline. I just thought it was too obvious to point out.

hattmall · 2026-04-11T04:25:57 1775881557

He just had to pick a year where he would have a very good chance of not being alive.

andsoitis · 2026-04-11T04:41:27 1775882487

No, he started predicting in his 2005 book, based on the “Law of Accelerating Returns”, yielding exponential growth in computing capacity.

Timeline from here on out:

2029: AI passes a valid Turing test and achieves human-level intelligence

2030s: Technology goes inside your brain to augment memory; humans connect their neocortex to the cloud

2045: The Singularity, when human intelligence multiplies a billion-fold by merging with AI

jimmyjazz14 · 2026-04-11T03:14:28 1775877268

It's mostly based on science fiction, and requires some possibly infinite energy source. The concept always kinda struck me as a sort of perpetual motion machine: you can imagine it, but that doesn't make it possible, and why it's not possible isn't immediately obvious in the imagination (well, I mean most modern minds know it's already not possible, but you get the point).

jatora · 2026-04-11T03:11:04 1775877064

Recursive self improvement - once you attain artificial superintelligent SWE of a general, adaptable variety that can scale up to millions of researchers overnight (a given, with LLMs and scaffolding alone) - will rapidly iterate on new architectures which will more rapidly iterate on new architectures, etc.

doctorwho42 · 2026-04-11T04:38:53 1775882333

And what's to say that it doesn't iterate itself to a local max, and then stop...

jaggederest · 2026-04-11T05:23:37 1775885017

From the first third of a sigmoid it looks exponential, and that scares people. But a sigmoid can have a very very high top - look at the industrial revolution, or modern plumbing, or modern agriculture which created a population sigmoid which is still cresting.

If AI is merely as tall a sigmoid as the haber-bosch process, refrigeration, or the steam engine, that's going to change society entirely.
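A quick numeric sketch of why the early part of a sigmoid is hard to tell apart from an exponential. It uses a standard logistic curve with unit ceiling, unit rate, and midpoint at zero; those parameters are arbitrary choices for illustration, not a model of anything real.

```python
import math


def logistic(t: float, ceiling: float = 1.0, rate: float = 1.0,
             midpoint: float = 0.0) -> float:
    """Standard logistic (sigmoid) curve."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def early_exponential(t: float, ceiling: float = 1.0, rate: float = 1.0,
                      midpoint: float = 0.0) -> float:
    """The pure exponential that the logistic tracks well before its midpoint."""
    return ceiling * math.exp(rate * (t - midpoint))


# Ratio near 1 means the two curves are indistinguishable at that point.
for t in (-8, -6, -4, -2, 0, 2):
    ratio = logistic(t) / early_exponential(t)
    print(f"t={t:+d}  logistic={logistic(t):.5f}  ratio to exponential={ratio:.4f}")
```

Far below the midpoint the ratio is essentially 1, so the data alone cannot tell you which curve you are on. At the midpoint the logistic is already at half the exponential's value, and past it the two diverge quickly.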