Mythos was clear that it was one agent per chunk. But these positive, confirming results do not actually disprove anything about Mythos, because they are only one side of the discriminator challenge - you got positives, but we do not know your false positive rate or your false negative rate.
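To make the "one side of the discriminator" point concrete, here is a minimal sketch. The chunk labels and the `always_flag` detector are made up for illustration and are not taken from either write-up. A detector that flags every chunk finds every real bug, and that still tells you nothing about whether it can discriminate:

```python
from dataclasses import dataclass


@dataclass
class Rates:
    true_positive_rate: float
    false_positive_rate: float
    false_negative_rate: float


def score(flags, labels):
    """flags[i]: detector said 'vulnerable'; labels[i]: chunk really is vulnerable."""
    pairs = list(zip(flags, labels))
    tp = sum(f and l for f, l in pairs)
    fp = sum(f and not l for f, l in pairs)
    fn = sum((not f) and l for f, l in pairs)
    tn = sum((not f) and (not l) for f, l in pairs)
    return Rates(tp / (tp + fn), fp / (fp + tn), fn / (tp + fn))


# Hypothetical ground truth: 2 vulnerable chunks out of 10 (assumption).
labels = [True, False, False, False, True, False, False, False, False, False]

# A useless detector that flags every single chunk.
always_flag = [True] * len(labels)

always_flag_rates = score(always_flag, labels)
print(always_flag_rates)
```

Here the useless detector has a true positive rate of 1.0 and a false positive rate of 1.0, so "it found the bug" on its own cannot separate it from a good one.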
These results were based on "a trivial snippet from the OWASP benchmark". In the section "caveats and limitations" they state that Sonnet 4.6 and Opus 4.6 now pass.
And they decided to base the false positive examination on a single snippet from a publicly known benchmark question (one that small models are known to be heavily fine-tuned for), instead of the real use case: finding actual vulnerabilities across an entire codebase by using a for loop, and checking the false positive rate there.
This is disingenuous at best, or even misleading by omission if the second approach _was_ done but not mentioned because it just confirmed that the false positive rate of small models is enormous. Given how all seven small models identified the FreeBSD bug when pointed to it, and how 6/7 small models still identified the "bug" even after the patch was applied, that second outcome seems likely...
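A rough back-of-the-envelope sketch of why that matters at codebase scale. The 7/7 and 6/7 figures are the ones quoted above. Treating them as per-chunk rates is a big assumption, since they come from a single snippet, and the 1% prevalence is invented purely for illustration:

```python
from fractions import Fraction


def precision(tpr, fpr, prevalence):
    """Share of flagged chunks that are truly vulnerable (Bayes' rule)."""
    true_hits = tpr * prevalence
    false_hits = fpr * (1 - prevalence)
    return true_hits / (true_hits + false_hits)


tpr = Fraction(7, 7)  # quoted above: 7 of 7 flagged the unpatched code
fpr = Fraction(6, 7)  # quoted above: 6 of 7 still flagged the patched code
prevalence = Fraction(1, 100)  # ASSUMPTION: 1 in 100 chunks is really vulnerable

p = precision(tpr, fpr, prevalence)
print(p, float(p))
```

Under those assumptions only 7 in 601 flags (about 1.2%) would point at a real vulnerability. The exact number means little, but it shows why the false positive rate over a whole codebase is the figure that needs to be reported.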