Mythos was clear it was one agent per chunk. But this positive confirming result...

kennywinker · 2026-04-11T17:46:49 1775929609

In TFA they talk a fair bit about how different models perform wrt false positives:

“The results show something close to inverse scaling: small, cheap models outperform large frontier ones.”

mofeien · 2026-04-12T08:28:25 1775982505

These results were based on "a trivial snippet from the OWASP benchmark". In the section "caveats and limitations" they state that sonnet 4.6 and opus 4.6 now pass.

And they decided to base the false positive examination on a single snippet of a publicly known benchmark question (that small models are known to be heavily fine tuned for) instead of the real use case of finding actual vulnerabilities across an entire codebase by using a for loop and checking the false positive rate there.

This is disingenuous at best, or even misleading by omission if the second approach _was_ done but not mentioned because it just confirmed that the false positive rate of small models is enormous. Given how all seven small models identified the FreeBSD Bug when pointed to it, and how how 6/7 small models still identified the "bug" even after the patch was applied, that second outcome seems likely...