I mean that mostly in the sense that there is huge variance in idiomatic code. So your optimized C/Rust code might be 100-1000x faster than two idiomatic ways of writing the same code.
I played around with local LLMs on my M4 Max 64GB this weekend and this is exactly what I found. I put Opus 4.7 "head to head" with Qwen 3.6 and a few other local models on the same task. The 35B did not perform well IME: it needed a lot of handholding, and even then the final result did not work until a few more tweaks, while Claude one-shot the task. The 27B was much better and also one-shot the task, but took about 55 min as opposed to about 15 min for Claude. The 27B is probably something I could happily run for many use cases if I had faster hardware... the main problem seems to be that at larger context sizes, prompt processing (prefill) can take several minutes.
This matches my experience too. The little a3b model is quite capable for its size class, as is the 27B model, but it’s still an order of magnitude less effective than Claude on the “effectiveness / time” curve
That may be true for OpenAI, less so for Anthropic, which has much better margins. Both companies' CEOs have said as much publicly.
No doubt Google currently has the better business. But the same argument could have been made about Instagram or WhatsApp before Facebook (now Meta) acquired them.
For 1T at Q4, each generated token requires reading ~500GB of memory. So you'll need something like ~10TB/s of memory bandwidth for 20 t/s. That is 8x5090 territory for speed and 16x5090 territory for size. HBM4 will bring us close to something really feasible in a home lab, but it will cost a fortune for early adopters.
Speculative decoding/DFlash will help with it, but YMMV.
Edit:
Missed that this is an A32B MoE, which drastically reduces the amount of reads needed. Seems 20 t/s should be doable with ~1TB/s of memory bandwidth (like a 3090).
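For anyone who wants to redo the napkin math, here is a rough sketch. It assumes decoding is purely memory-bandwidth bound and that every active weight is read exactly once per generated token; it ignores KV cache reads, activations, and any overhead, so treat the numbers as a lower bound. The function name and parameters are mine, purely for illustration.

```python
def required_bandwidth_gb_s(active_params_billions: float,
                            bits_per_weight: float,
                            tokens_per_second: float) -> float:
    """Approximate memory bandwidth (GB/s) needed for a target decode speed.

    Assumption: each generated token reads every active weight once,
    and nothing else (no KV cache, no activations, no overhead).
    """
    # Billions of parameters * bytes per parameter = GB read per token.
    gb_per_token = active_params_billions * bits_per_weight / 8
    return gb_per_token * tokens_per_second


# Dense 1T model at 4-bit: 500 GB per token, so 20 t/s needs ~10 TB/s.
dense_gb_s = required_bandwidth_gb_s(1000, 4, 20)

# MoE with 32B active parameters at 4-bit: 16 GB per token, ~320 GB/s for 20 t/s.
moe_gb_s = required_bandwidth_gb_s(32, 4, 20)

print(f"dense 1T @ Q4:  {dense_gb_s:.0f} GB/s")
print(f"A32B MoE @ Q4: {moe_gb_s:.0f} GB/s")
```

Note this only covers bandwidth. The full ~500GB of weights still has to sit somewhere, so the size problem does not go away with MoE.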
While they do make this argument, realistically anyone sending their prompt/data to an external server should assume there will be some level of retention.
In particular, anyone using Darkbloom with commercial intent should really only send non-sensitive data (no tokens, customer data, ...). I'd say only classification tasks, image generation, etc.
My motivation was quite different, and I'd like to encourage more people to consider the same.
Oftentimes narcissistic, power-grabbing (and often technically incompetent) engineers become managers, as was the case on a previous team I worked on, and it was quite damaging to the whole team.
I realized that I can either be the one managing and try to do good, or be at the mercy of another manager; I chose the first.
This is what taught me to sublimate my own ego. Overcoming the wickedness of others with patient, meditative calm can be an incredible experience. It just takes longer than a business day to play out. You've gotta think across much grander time scales. 3 steps ahead, at minimum, at all times. Burn these people out of your team. Take charge and stay focused on the customer. It often takes non-technical people a little bit longer to lock onto complex problems and downstream consequences. It's taken me nearly 2 years to deal with one bad hire. All I can fantasize about is being in a position to never hire that kind of person again. The destruction some people can cause in a business is unthinkable to those who haven't seen it yet. I didn't believe these people existed until it was way too late.
I still prefer to solve technology problems, but I see a bigger and more important mission out there. Keeping the team happy and aligned on the customer is much more rewarding overall. I'd rather have 5% dev time in paradise than 95% dev time in hell.
Absolutely disagree here: something that is considered good practice is very interesting to compare against!