Just ran llama-bench at home with the similarly priced AMD AI PRO R9700 32GB. The Phoronix numbers look extremely low? Probably I misunderstand their test bench. Anyway, here are some numbers. Maybe someone with access to a B70 can post a comparison.
"I've no idea why one would use gpt-oss-20b at Q8" - would you mind expanding on this comment?
In that particular model family, the choices are 20B and 120B, so a higher quant of the 20B fits in VRAM, while with the 120B you'd be settling for a lower quant. Is it that 20B MXFP4 is comparable in performance, so there's no need for Q8?
Or is the insight simply that there are better models available now and the emphasis is on gpt-oss-20b, not Q8?
The parameters in the original gpt-oss-20B model are "post-trained with MXFP4 quantization", so there just isn't much to gain by quantizing to Q8. If you look inside the Q8 model, most of the parameters are MXFP4 anyway.
Though, looking inside my "gpt-oss 20B MXFP4 MoE" model, it also looks to be quantized the same way as the Q8, so that was probably an overstatement on my part.
Still, the Q8 is 12.1 GB and the FP16 is 13.8 GB, not the ~1:2 ratio you might expect: both files are dominated by the same MXFP4 data, so only the tensors that aren't already MXFP4 actually shrink.
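You can check this yourself: the gguf Python package ships a gguf-dump tool that prints each tensor's quantization type. A quick sketch, assuming your installed gguf version is new enough to know the MXFP4 type:
pip install gguf
gguf-dump gpt-oss-20b-Q8_0.gguf | grep -c MXFP4
The second command just counts how many tensor lines are tagged MXFP4; drop the grep to see the full per-tensor listing.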
More RAM running slower is still true today: on AM5 you probably cannot enable EXPO with four RAM slots filled versus two. The gap is not that extreme, though.
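On Linux you can verify whether EXPO actually applied by checking the configured speed per DIMM with dmidecode (needs root; the exact field name varies between dmidecode versions):
sudo dmidecode --type memory | grep -i 'configured memory speed'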
In Tokyo many bicycle lanes are pretty useless for this reason. Cars park every 20m, making them absolutely inaccessible. Then there is the bicycle lane between Asakusa and Ueno, which is separated from the street but built like some sort of obstacle course. There are some good ones too, though. Pretty random.
You don't need it, but speaking as someone who has been there: for me, making a 3D engine is a lot of fun! But then I never finish the actual game. So if you actually want to ship a game, I recommend using an engine. Personally I prefer Unreal.
For 2D, yeah, making the engine yourself is fast and easy. You can go without a big engine.
122b would be awesome. It is the largest size you can kinda run on a beefy consumer PC. I did wonder why Gemma stopped at the 30b class; it is already very strong there. A 122b might have been too close to being really useful.
Not OP, but I ran a 122b successfully with normal RAM offloading. You don't need all that much VRAM, which is the super expensive part. I used 96GB RAM + a 16GB VRAM GPU. It's not very fast in that setup, maybe 15 tokens per second. Still, you can give it a task and come back later when it's done. (Disclaimer: I built that PC before stuff got expensive.)
Any good gaming PC can run the 35b-a3 model: llama.cpp with RAM offloading. A high-end gaming PC can run it at higher speeds.
For your 122b, you need a lot of memory, which is expensive now. And it will be much slower, since you'd be using mostly system RAM.
Seconding this. You can get A3B/A4B models running at 10+ tok/s on a modern 6-8GB GPU with 32k context if you optimize things well. The cheapest way to run this model at larger contexts is probably a 12GB RTX 3060.
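If anyone wants a starting point for that kind of setup: the usual llama.cpp trick is to pin the MoE expert tensors to system RAM and keep everything else on the GPU. A rough sketch (model.gguf is a placeholder, and flag spellings shift between llama.cpp versions):
llama-server -m model.gguf -c 32768 -ngl 999 -ot "exps=CPU"
The -ot/--override-tensor pattern is matched against tensor names, so "exps=CPU" keeps the large expert weights in RAM while attention and the KV cache stay in VRAM.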
Here in Japan they started forwarding me to the app page when ordering, so you are forced to use the app when on a mobile browser. Even though the website could do it perfectly fine in the past.
I do not go often, but when I do, I prefer to sit down, order from the page, and have them bring it to my seat. I don't like the kiosk.
Tried to use the same model as the article:
llama-bench -m gpt-oss-20b-Q8_0.gguf -ngl 999 -p 2048 -n 128
AMD R9700: pp2048 = 3867 t/s, tg128 = 175 t/s
And a bigger model, because testing a tiny model with a 32GB card feels like a waste:
llama-bench -m Qwen3.6-27B-UD-Q6_K_XL.gguf -ngl 999 -p 2048 -n 128
AMD R9700: pp2048 = 917 t/s, tg128 = 22 t/s
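If anyone wants to reproduce both runs in one go: llama-bench accepts comma-separated lists for most arguments, as far as I remember, so this should be equivalent:
llama-bench -m gpt-oss-20b-Q8_0.gguf,Qwen3.6-27B-UD-Q6_K_XL.gguf -ngl 999 -p 2048 -n 128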