Readerium · 2026-05-23T04:18:54 1779509934

For MSFT: Just download DeepSeek locally and use it.

Or train your own power efficient stack.

Readerium · 2026-04-29T05:47:55 1777441675

AI workloads are all about memory size and bandwidth, not compute.

Readerium · 2026-04-29T05:46:22 1777441582

LLMs are memory-bandwidth bound, not compute bound.

AntiUSAbah · 2026-04-29T08:15:08 1777450508

LLMs are bound by both; which factor dominates depends on the hardware.

joshjob42 · 2026-04-29T18:05:45 1777485945

Technically true, but if we're talking about local models, overwhelmingly you're gonna be bandwidth bound. You need about 2 flops per active parameter per token. An M5 chip has what, 150-200 GB/s of bandwidth? But it can easily do something like 16 TFLOPS of fp16, so you're talking like 100 flops per byte of bandwidth. Which is just to say that in a batch=1 scenario, i.e. one user, you're only gonna use a few % of the GPU while you've totally saturated your memory bandwidth. For all practical purposes at the consumer level, take your memory bandwidth, divide by the size of the model, and that gives you the max tok/s throughput you're gonna get.

Even a 5090 has something like 50-60 flops per byte of bandwidth; you just can't saturate the compute without running large batches. (At least at decode; prefill is obviously more compute bound.)
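To make that back-of-envelope arithmetic concrete, here is a rough Python sketch. The function names are made up for illustration, and the inputs (16 TFLOPS, 160 GB/s, an 8 GB model) are placeholder numbers in the range quoted above, not measured specs for any chip.

```python
def max_decode_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on batch=1 decode speed: every generated token has to
    stream the (active) weights from memory once."""
    return bandwidth_gb_s / model_size_gb


def machine_flops_per_byte(peak_tflops: float, bandwidth_gb_s: float) -> float:
    """How many flops the chip can execute per byte it can read from memory."""
    return (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)


def batch1_compute_utilization(peak_tflops: float, bandwidth_gb_s: float,
                               bytes_per_param: float = 2.0,
                               flops_per_param: float = 2.0) -> float:
    """Fraction of peak compute used while memory bandwidth is saturated.

    Assumes ~2 flops per active parameter per token (as in the comment above)
    and that each parameter is read once per token.
    """
    workload_flops_per_byte = flops_per_param / bytes_per_param
    return min(1.0, workload_flops_per_byte
               / machine_flops_per_byte(peak_tflops, bandwidth_gb_s))


if __name__ == "__main__":
    # Placeholder numbers, not real hardware specs.
    print(machine_flops_per_byte(16, 160))             # flops available per byte
    print(max_decode_tokens_per_second(160, 8))        # tok/s ceiling for an 8 GB model
    print(batch1_compute_utilization(16, 160))         # fp16 weights
    print(batch1_compute_utilization(16, 160, 0.5))    # ~4-bit weights
```

With fp16 weights the workload needs about 1 flop per byte against roughly 100 available, so compute sits around 1% busy; quantizing to about 4 bits moves that to a few percent, which is where the "few % of the GPU" figure comes from.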

ondra · 2026-04-29T06:38:55 1777444735

This is incorrect: prompt processing is compute bound.

icelancer · 2026-04-29T07:40:13 1777448413

This is only true for some parts of the time cost function.

Readerium · 2026-04-24T22:35:46 1777070146

Add an iPad mini-esque screen on the trackpad of a MacBook.

The trackpad of a 16-inch MacBook Pro is humongous anyway.

Add a touchscreen display to the trackpad, and give it iPadOS.

Readerium · 2026-04-17T01:26:16 1776389176

That is true. GGUF does not support every architecture.

For the most recent example, as of April 16, 2026 (today):

TurboQuant still hasn't been added to GGUF.

Readerium · 2026-04-17T01:10:27 1776388227

Perhaps increasing repetition_penalty might be helpful.
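For anyone unsure what that knob does, here is a minimal pure-Python sketch. It assumes repetition_penalty means the common CTRL-style penalty (the scheme Hugging Face transformers uses for its parameter of the same name); the runtime being discussed may implement it differently. The function name and the toy logits are made up for illustration.

```python
def apply_repetition_penalty(logits: list[float],
                             generated_ids: list[int],
                             penalty: float) -> list[float]:
    """Make already-generated tokens less likely before sampling.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so with penalty > 1 a repeated token's score always
    moves down. penalty == 1.0 leaves the logits unchanged.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty
        else:
            adjusted[token_id] *= penalty
    return adjusted


if __name__ == "__main__":
    # Toy vocabulary of 3 tokens; tokens 0 and 1 were already generated.
    print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], 1.25))
```

Higher values suppress loops harder, but push it too far and the model starts avoiding words it legitimately needs to repeat.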

Readerium · 2026-04-14T21:48:03 1776203283

In coding they are worse.

Chinese models (GLM, MiniMax) are better.

nine_k · 2026-04-14T22:00:14 1776204014

Anyway, there are a few models that are freely distributable and that can reasonably run on consumer-grade local hardware.

It changes a number of things. Not all tasks require very high intelligence, but a lot of data may be sensitive enough to avoid sharing it with a third party.

Readerium · 2026-03-28T03:17:48 1774667868

Can someone explain whether the 3D V-Cache dies are stacked on top of each other or placed side by side?

If they are stacked, then why not a 9800X3D2?

zdw · 2026-03-28T03:19:53 1774667993

The 99xx chips have two CPU dies, and one cache die is on each CPU die.

modeswitch · 2026-03-28T03:51:53 1774669913

The 3D V-Cache sits underneath only one of the CCDs. See https://en.wikipedia.org/wiki/Ryzen#Ryzen_9000.

anonymars · 2026-03-28T04:32:30 1774672350

That's what's different about this one. "Enter the Ryzen 9 9950X3D2 Dual Edition, a mouthful of a chip that includes 64MB of 3D V-Cache on both processor dies, without the hybrid arrangement that has defined the other chips up until now."

Tostino · 2026-03-28T04:09:49 1774670989

Did you forget which thread we are on?

modeswitch · 2026-03-29T17:52:46 1774806766

Oh heh, I thought they were asking about the X3D. My bad ><.

Readerium · 2026-03-13T20:45:17 1773434717

Qwen 3.5 4B is the goat then