Hacker News | Readerium's comments

Loll

For MSFT: Just download DeepSeek locally and use it.

Or train your own power efficient stack.


AI workloads are all about memory size and bandwidth, not compute.


LLMs are memory-bandwidth bound, not compute bound.


LLMs are bound by both; which factor dominates depends on the hardware.


Technically true, but if we're talking about local models, overwhelmingly you're gonna be bandwidth bound. You need about 2 flops per active parameter per token. An M5 chip has what, 150-200GB/s of bandwidth? But it can easily do something like 16 TFLOPS of fp16, so you're talking like 100 flops per byte of bandwidth. Which is just to say that in a batch=1 scenario, i.e. one user, you're only gonna use a few % of the GPU while you've totally saturated your memory bandwidth. For all practical purposes at the consumer level, take your memory bandwidth, divide by the size of the model, and that gives you the max tok/s throughput you're gonna get.

Even a 5090 has something like 50-60 flops per byte of bandwidth; you just can't saturate the compute without running large batches. (At least at inference time. Prefill is obviously more compute bound.)
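
To make the arithmetic above concrete, here's a rough sketch. The function names are made up for illustration, the hardware numbers are the ballpark figures from this comment (not measured specs), and the 8 GB model size is just an assumed example.

```python
def max_decode_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on batch=1 decode speed: each token reads all active weights once."""
    return bandwidth_gb_s / model_size_gb


def flops_per_byte(peak_tflops: float, bandwidth_gb_s: float) -> float:
    """Hardware ratio: FLOPs the chip can do per byte of memory traffic."""
    return (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)


def compute_utilization(bytes_per_param: float, peak_tflops: float,
                        bandwidth_gb_s: float, batch: int = 1) -> float:
    """Fraction of peak compute in use once memory bandwidth is saturated.

    Assumes ~2 FLOPs per active parameter per token, and that weights are
    read once per step and shared across the whole batch.
    """
    needed = 2.0 * batch / bytes_per_param  # FLOPs needed per byte of weights read
    return min(1.0, needed / flops_per_byte(peak_tflops, bandwidth_gb_s))


if __name__ == "__main__":
    # Ballpark numbers from the comment: ~150 GB/s, ~16 TFLOPS fp16.
    print(flops_per_byte(16, 150))              # roughly 100 flops per byte
    print(max_decode_tok_per_s(150, 8))         # assumed 8 GB of weights
    print(compute_utilization(0.5, 16, 150))    # 4-bit weights, batch=1
    print(compute_utilization(0.5, 16, 150, batch=32))
```

With these assumed numbers, batch=1 sits at a few percent of peak compute, and it takes a batch in the tens before compute becomes the limit.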


This is incorrect, prompt processing is compute bound.


This is only true for some parts of the time cost function.


Add an iPad mini-esque screen to the trackpad of a MacBook.

The trackpad of a 16-inch MacBook Pro is humongous anyway.

Add a touchscreen display to the trackpad, and give it iPadOS.


That is true. GGUF does not support just any architecture.

For the most recent example, as of April 16, 2026 (today):

TurboQuant still hasn't been added to GGUF.


Perhaps increasing repetition_penalty might be helpful.
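
For anyone wondering what that knob does: a minimal sketch, assuming the common CTRL-style scheme where logits of already-generated tokens are pushed down. `apply_repetition_penalty` is a hypothetical helper written for illustration, not any particular runtime's API.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits with previously generated tokens made less likely.

    Positive logits are divided by the penalty and negative ones multiplied,
    so in both cases the token's score moves down when penalty > 1.
    """
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out


if __name__ == "__main__":
    # Tokens 0 and 1 were already generated; token 2 is untouched.
    print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=2.0))
```

A penalty of 1.0 leaves the logits unchanged; larger values suppress repeats harder, at the risk of hurting text that legitimately needs to repeat tokens.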


In coding they are worse.

Chinese models (GLM, MiniMax) are better.


Anyway, there are a few models that are freely distributable and that can reasonably run on consumer-grade local hardware.

It changes a number of things. Not all tasks require very high intelligence, but a lot of data may be sensitive enough to avoid sharing it with a third party.


Can someone explain whether the 3D V-Cache dies are stacked on top of each other or placed side by side?

If they are stacked, then why not a 9800X3D2?


The 99xx chips have two CPU dies, and one cache die is on each CPU die.


The 3D V-Cache sits underneath only one of the CCDs. See https://en.wikipedia.org/wiki/Ryzen#Ryzen_9000.


That's what's different about this one. "Enter the Ryzen 9 9950X3D2 Dual Edition, a mouthful of a chip that includes 64MB of 3D V-Cache on both processor dies, without the hybrid arrangement that has defined the other chips up until now."


Did you forget which thread we are on?


Oh heh, I thought they were asking about the X3D. My bad ><.


Qwen 3.5 4B is the goat then

