
What was the main focus when training this model? Besides the Elo score, the models (31B / 26B-A4) look like they're underperforming on some of the typical benchmarks by a wide margin. Do you believe there's an issue with the tests, or are the results misleading (e.g., comparative models benchmaxxing)?

Thank you for the release.
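
For context on the Elo score mentioned above: it presumably refers to an arena-style leaderboard rating derived from pairwise human preference votes rather than a fixed test set, which is part of why it can diverge from the usual benchmarks. A rough Python sketch of the classic online Elo update (simplified; real leaderboards typically fit a Bradley-Terry model over all votes, and the starting rating and K-factor here are just illustrative choices):

    # Sketch of arena-style Elo ratings from pairwise preference votes.
    # The K-factor and starting ratings are illustrative assumptions,
    # not any leaderboard's actual implementation.

    def expected_score(r_a: float, r_b: float) -> float:
        """Probability that model A is preferred over model B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
        """Apply one Elo update after a single head-to-head vote."""
        e_a = expected_score(r_a, r_b)
        s_a = 1.0 if a_won else 0.0
        return r_a + k * (s_a - e_a), r_b - k * (s_a - e_a)

    # Example: both models start at 1000; the first wins three votes in a row.
    ratings = {"model_a": 1000.0, "model_b": 1000.0}
    for _ in range(3):
        ratings["model_a"], ratings["model_b"] = update(
            ratings["model_a"], ratings["model_b"], a_won=True
        )
    print(ratings)  # model_a drifts up, model_b drifts down

The rating only reflects which model humans preferred head to head, not accuracy on any particular benchmark.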



Benchmarks are a pox on LLMs.

You can use this model for about five seconds and realize its reasoning is in a league well above any Qwen model, yet people still assume benchmarks that are openly being used as training data are relevant.


They really are. Benchmaxxing is real… but the Qwen 3.5 series of models is still very impressive. I’m looking forward to trying out Gemma.


You definitely have to try each model on your own use case personally; many models can be trained to perform better on these tests, but that improvement may not transfer to your workload.
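
As a concrete illustration of testing on your own use case, here is a minimal harness sketch. `run_model` is a placeholder for whatever client you actually use (llama.cpp, Ollama, a hosted API), and the example cases and substring-match scoring are assumptions you would swap for your own prompts and judging rule:

    # Minimal personal eval harness sketch; everything here is a placeholder
    # to be replaced with your own model client, prompts, and scoring.
    from typing import Callable

    def run_eval(run_model: Callable[[str], str],
                 cases: list[tuple[str, str]]) -> float:
        """Return the fraction of cases whose expected substring appears in the output."""
        hits = 0
        for prompt, expected in cases:
            output = run_model(prompt)
            if expected.lower() in output.lower():
                hits += 1
        return hits / len(cases)

    # Example cases drawn from your own workload, not a public benchmark.
    cases = [
        ("Extract the invoice total from: 'Total due: $1,240.50'", "1,240.50"),
        ("Translate to French: 'good morning'", "bonjour"),
    ]
    # score = run_eval(my_model_client, cases)

Even a few dozen cases drawn from your real workload will usually tell you more than a leaderboard delta.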



