
Just ran llama-bench at home with the similarly priced AMD AI PRO R9700 32GB. The Phoronix numbers look extremely low? Probably I'm misunderstanding their test setup. Anyway, here are some numbers. Maybe someone with access to a B70 can post a comparison.

Tried to use the same model as the article:

llama-bench -m gpt-oss-20b-Q8_0.gguf -ngl 999 -p 2048 -n 128

AMD R9700 pp2048=3867 tg128=175

And a bigger model, because testing a tiny model with a 32GB card feels like a waste:

llama-bench -m Qwen3.6-27B-UD-Q6_K_XL.gguf -ngl 999 -p 2048 -n 128

AMD R9700 pp2048=917 tg128=22


As of b8966, it is still not great.

  | model                 |      size |  params | backend | ngl |   test |            t/s |
  | --------------------- | --------: | ------: | ------- | --: | -----: | -------------: |
  | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | SYCL    | 999 | pp2048 |  851.81 ± 6.50 |
  | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | SYCL    | 999 |  tg128 |   42.05 ± 1.99 |
  | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan  | 999 | pp2048 | 2022.28 ± 4.82 |
  | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan  | 999 |  tg128 |  114.15 ± 0.23 |
  | qwen35 27B Q6_K       | 23.87 GiB | 26.90 B | SYCL    | 999 | pp2048 |  299.93 ± 0.40 |
  | qwen35 27B Q6_K       | 23.87 GiB | 26.90 B | SYCL    | 999 |  tg128 |   14.58 ± 0.06 |
  | qwen35 27B Q6_K       | 23.87 GiB | 26.90 B | Vulkan  | 999 | pp2048 |  581.99 ± 0.86 |
  | qwen35 27B Q6_K       | 23.87 GiB | 26.90 B | Vulkan  | 999 |  tg128 |   10.64 ± 0.12 |
Edit: I've no idea why one would use gpt-oss-20b at Q8, but the result is basically the same:

  | model                 |      size |  params | backend | ngl |   test |            t/s |
  | --------------------- | --------: | ------: | ------- | --: | -----: | -------------: |
  | gpt-oss 20B Q8_0      | 11.27 GiB | 20.91 B | SYCL    | 999 | pp2048 |  854.16 ± 6.06 |
  | gpt-oss 20B Q8_0      | 11.27 GiB | 20.91 B | SYCL    | 999 |  tg128 |   44.02 ± 0.05 |
  | gpt-oss 20B Q8_0      | 11.27 GiB | 20.91 B | Vulkan  | 999 | pp2048 | 2022.24 ± 6.97 |
  | gpt-oss 20B Q8_0      | 11.27 GiB | 20.91 B | Vulkan  | 999 |  tg128 |  114.02 ± 0.13 |
Hopefully, support for the B70 will continue to improve. In retrospect, I probably should have bought an R9700 instead...
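
In case anyone wants to reproduce the SYCL vs Vulkan comparison: those are two separate builds of llama.cpp, roughly like this (a sketch; the exact oneAPI paths and flags depend on your setup):

  # Vulkan build
  cmake -B build-vulkan -DGGML_VULKAN=ON
  cmake --build build-vulkan -j
  # SYCL build (source the oneAPI environment first)
  source /opt/intel/oneapi/setvars.sh
  cmake -B build-sycl -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
  cmake --build build-sycl -j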

"I've no idea why one would use gpt-oss-20b at Q8" - would you mind expanding on this comment?

In that particular model family, the choices are 20B and 120B, so a higher quant of the 20B fits in VRAM, while you'd be settling for a lower quant of the 120B. Is it that 20B MXFP4 is comparable in performance, so there's no need for Q8?

Or is the insight simply that there are better models available now and the emphasis is on gpt-oss-20b, not Q8?


The parameters in the original gpt-oss-20B model are "post-trained with MXFP4 quantization", so there just isn't much to gain by quantizing to Q8. If you look inside the Q8 model, most of the parameters are MXFP4 anyway.

Though, looking inside my "gpt-oss 20B MXFP4 MoE" model, it appears to be quantized the same way as the Q8, so that was probably an overstatement on my part.

Still, the Q8 is 12.1 GB and the FP16 is 13.8 GB. Not the ~1:2 ratio you might expect.
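
If you want to check this yourself, the gguf Python package from llama.cpp's gguf-py ships a dump script; something like this should show the per-tensor types (the exact output format may differ between versions):

  pip install gguf
  gguf-dump gpt-oss-20b-Q8_0.gguf | grep -i mxfp4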


At this speed, people end up paying more for electricity than for API calls. (California electricity prices)
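
Back-of-envelope for the electricity side, where all three inputs are assumptions (card power draw, a California-ish rate, and roughly the tg speed above); compare the result against whatever your API charges per million output tokens:

  awk 'BEGIN { watts = 300; usd_per_kwh = 0.30; tok_per_s = 40;
               kwh_per_mtok = watts / 1000 * (1e6 / tok_per_s / 3600);
               printf "~$%.2f of electricity per 1M generated tokens\n", kwh_per_mtok * usd_per_kwh }'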

For reference, in case it's interesting to someone: a 5090 on Windows 11 with CUDA 13.1

  | model                 |       size |   params | backend  | ngl |   test |              t/s |
  | --------------------- | ---------: |--------: | -------- | --: |------: |----------------: |
  | gpt-oss 20B MXFP4 MoE |  11.27 GiB |  20.91 B | CUDA     | 999 | pp2048 | 10179.12 ± 52.86 |
  | gpt-oss 20B MXFP4 MoE |  11.27 GiB |  20.91 B | CUDA     | 999 |  tg128 |    326.82 ± 7.82 |
  | qwen35 27B Q6_K       |  23.87 GiB |  26.90 B | CUDA     | 999 | pp2048 |   3129.92 ± 5.12 |
  | qwen35 27B Q6_K       |  23.87 GiB |  26.90 B | CUDA     | 999 |  tg128 |     53.45 ± 0.15 |
  
  build: 9d34231bb (8929)

  gpt-oss-20b-MXFP4.gguf
  Qwen3.6-27B-UD-Q6_K_XL.gguf
Using MXFP4 of GPT-OSS because it was trained quantization-aware for that quantization type, and it's native to the 50xx series.

You can get 120 TPS (144 peak) with Qwen3.6-27B on an RTX PRO 6000 with AutoRound when MTP is enabled. It runs faster than Sonnet API calls.

A 5090 gets maybe 100 TPS with MTP.


The build they use is from February, over two months old: https://github.com/ggml-org/llama.cpp/releases/tag/b8121

Which might not sound like much, but two months in LLM time is a long time, especially regarding support for new hardware like the R9700.


More RAM running slower is still true today. With AM5 you probably cannot enable EXPO with four RAM slots filled vs. two. The gap is not that extreme, though.

https://www.corsair.com/us/en/explorer/diy-builder/memory/2-...


In Tokyo many bicycle lanes are pretty useless for this reason. Cars park every 20m, making them absolutely inaccessible. Then there is the bicycle lane between Asakusa and Ueno, which is separated from the street but laid out like some sort of obstacle course. There are some good ones too, though. Pretty random.

While that would be cool, something like the 7400 series is already pretty close to scratching that itch. And a lot less dangerous.

I would love to buy one, and I know a bunch of others who would too. Wonder if shipping to Japan is in the works?

I selected Japan at the top right of the main page, and it told me no-go.

Been waiting for years. I wonder what the holdup is.


Yeah, waiting for that too. They should expand the regions instead of releasing new models. Availability is really limited.

You don't need it, but as someone who has been there: for me, making a 3D engine is a lot of fun! But then I never finish the actual game. So if you actually want to ship a game, I recommend using an engine. Personally I prefer Unreal.

For 2D, yeah, making the engine yourself is fast and easy. You can go without a big engine.


122B would be awesome. It is the largest size you can kinda run on a beefy consumer PC. I wondered about Gemma stopping in the 30B category; it is already very strong. 122B might have been too close to being really useful.

Not OP, but I ran a 122B successfully with normal RAM offloading. You don't need all that much VRAM, which is super expensive. I used 96GB RAM + a 16GB VRAM GPU. But it's not very fast in that setup, maybe 15 tokens per second. Still, you can give it a task and come back later and it's done. (Disclaimer: I built that PC before stuff got expensive.)

Any good gaming PC can run the 35B-A3 model with llama.cpp and RAM offloading. A high-end gaming PC can run it at higher speeds. For your 122B, you need a lot of memory, which is expensive now. And it will be much slower, as you need to use mostly system RAM.

Seconding this. You can get A3B/A4B models to run at 10+ tok/sec on a modern 6/8GB GPU with 32k context if you optimize things well. The cheapest way to run this model at larger contexts is probably a 12GB RTX 3060.
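
As a rough starting point with llama.cpp (the model file name is a placeholder, and --n-cpu-moe is a newer flag that keeps that many layers of MoE expert tensors in system RAM while the rest stays on the GPU):

  llama-server -m some-moe-model-Q4_K_M.gguf -ngl 999 --n-cpu-moe 24 -c 32768

Tune the --n-cpu-moe count up or down until the model fits in your VRAM.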

Here in Japan they started forwarding me to the app page when ordering. So if you're on a mobile browser, you are forced to use the app. Even though the website could do it perfectly fine in the past.

I do not go often, but when I do I prefer to sit down, order on the page, and have them bring it to my seat. I don't like the kiosk.

