Yep, it's wild how little emphasis there is on control and replicability in these posts.
These models are already useful for a myriad of use cases. It's really not that important whether a model can 1-shot a particular problem or draw a cuter pelican on a bike. Past a certain degree of quality, process and reliability matter far more for anything other than complete hands-off usage, which isn't something you're really going to do in business.
The fact that my tool may be gone tomorrow (and this has actually happened before), with no guarantee of a proper substitute... that's a lot more of a concern than an extra point on some benchmark.
what is the logic behind the idea that anything that can be homegrown should be legal? it makes enforcement harder, but it doesn't make any potential damages any better or worse
it's easy to just look at the upside when something doesn't hurt you and you simply gain an extra choice, but knowing that it can and does wreck the lives of many, it feels painful for me to vote either for or against
typically those dense models are too slow on Strix Halo to be practical, expect 5-7 tps
you can get an idea by looking at other dense benchmarks here: https://strixhalo.zurkowski.net/experiments - i'd expect this model to be tested here soon, i don't think i will personally bother
similar numbers here - slightly higher PP, and slightly better peak speed and retention with q8_0 KV cache quants too. llama-bench results below, cba to format for hn:
https://pastebin.com/raw/zgJeqRbv
GTR 9 Pro, "performance" profile in BIOS, GTT instead of GART, Fedora 44
If I did a proper benchmark, I think the numbers would match what you got. Minimax M2.7 is also surprisingly not that slow, and in some ways faster, as it seems to get things right with less thinking output (around 140 t/s prefill and 23 t/s generation).
The problem with M2.7 is that it's full GQA, meaning quadratic attention. It does start fast, but by 64k tokens deep, the version I'm running (Unsloth's UD IQ2_XXS) drops 95% in pp512, from 261.3 t/s at 0 context depth to 13.1 t/s. The q8_0 KV cache does help, still hitting 57.4 t/s at 64k depth vs 258.3 t/s at 0 depth. TG retention is better, but still approaches single digits by 64k depth even with the q8_0 KV cache.
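For anyone who wants to sanity-check those percentages, here's a tiny sketch (my own helper, not part of llama-bench) that computes retention and drop from the throughput figures quoted above:

```python
# Hypothetical helper for the llama-bench numbers quoted above.
# retention = fraction of 0-depth throughput kept at a given context depth.

def retention(t_at_zero_depth, t_at_depth):
    return t_at_depth / t_at_zero_depth

# pp512, default KV cache: 261.3 t/s at depth 0 -> 13.1 t/s at 64k
drop = 1 - retention(261.3, 13.1)
print(f"pp512 drop at 64k: {drop:.0%}")        # ~95%

# pp512, q8_0 KV cache: 258.3 t/s at depth 0 -> 57.4 t/s at 64k
kept = retention(258.3, 57.4)
print(f"q8_0 retention at 64k: {kept:.0%}")    # ~22%
```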
That said, it was my favorite model when I valued output quality above all else, at least up until the new Qwen 3.6 27B, which I'm currently playing with.
I suspect I will like Qwen 3.6 122B A10B a LOT, maybe even better than M2.7.
I'm not sure what you're referring to by "that", but I think you're right that it's $10B to not purchase or $60B to purchase: effectively posting $10B for an option with a $50B strike price.
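The arithmetic behind that framing, as a quick sketch (figures from the thread; the option structure is my reading, not stated in any deal terms):

```python
# The $10B is paid either way, like an option premium; exercising
# (going through with the purchase) costs the $50B strike on top.

premium = 10  # $B, sunk cost whether or not you purchase
strike = 50   # $B, additional cost to complete the purchase

cost_if_walk_away = premium           # $10B total
cost_if_purchase = premium + strike   # $60B total
print(cost_if_walk_away, cost_if_purchase)  # 10 60
```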