tbh ~1-3% PPL hit from Q4_K_M stopped being the bottleneck a while ago. the bottleneck is the 48 hours of guessing llama.cpp flags and chat template bugs before the ecosystem catches up. you are doing unpaid QA.
Just wait a week for model bugs to be worked out. This is well-known advice and a common practice within r/localllama. The flags are not hard at all if you're using llama.cpp regularly. If you're new to the ecosystem, that's closer to a one-time effort with irregular updates than it is to something you have to re-learn for every model.
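(for the curious, the "flag guessing" usually comes down to a handful of knobs; a rough sketch below using the llama-cpp-python bindings as a stand-in for the llama-server CLI, with made-up model paths and values, not a recommendation)

```python
# minimal sketch of the usual knobs: quant file, context size, GPU offload, chat template.
# the chat_format line is exactly the part that tends to break on brand-new models.
from llama_cpp import Llama

llm = Llama(
    model_path="some-model-Q4_K_M.gguf",  # placeholder path to a Q4_K_M quant
    n_ctx=8192,                            # context window
    n_gpu_layers=-1,                       # offload all layers to GPU if it fits
    chat_format="chatml",                  # wrong/missing template = garbage output
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```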
Yeah, at my last job there was a single outdated external wiki server left sitting in DO for exactly those kinds of reasons, while everything that was current or internal had already moved (if not twice). If it hadn't become such a security risk it would never have been migrated.
The problem is that a lot of this glue is proprietary by design at the various cloud services. I realize there are open source and alternative abstractions for a lot of the same services, but there’s still quite a bit of glue if you’re on AWS, for example, and looking to move to bare metal.
But maybe I’m just thinking in terms of agents’ current capabilities; fast forward a couple of years and even removing these abstractions or migrating off them could be very low friction.
But you can run most of the glue on your own dedicated instances.
I run k8s on a bunch of dedicated servers that are super cheap, and I have all the bells and whistles - just tell your coding agent to do it. You can literally spec out the kind of setup you would never build by hand, and it works brilliantly.
Postgres running on dedicated hardware, replicated and with WAL backups - easy, just tell codebuff (my harness of choice) to do it. Then any number of firewalls, load balancers, bastion servers, etc. If you can imagine it, codebuff will implement it.
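(once it's up, checking that the replication and WAL archiving actually work is scriptable too; a minimal sketch with psycopg, assuming a superuser DSN and placeholder hostnames)

```python
# sanity-check the setup described above: streaming replicas + WAL archiving health
import psycopg

with psycopg.connect("host=primary.internal dbname=postgres user=postgres") as conn:
    with conn.cursor() as cur:
        # replicas currently streaming from the primary
        cur.execute("SELECT client_addr, state, sync_state FROM pg_stat_replication;")
        for addr, state, sync_state in cur.fetchall():
            print(f"replica {addr}: {state} ({sync_state})")

        # WAL archiving counters (archive_command failures show up as failed_count)
        cur.execute("SELECT archived_count, failed_count, last_archived_wal FROM pg_stat_archiver;")
        archived, failed, last_wal = cur.fetchone()
        print(f"WAL archiving: {archived} archived, {failed} failed, last={last_wal}")
```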
IMO it doesn't flatten design into one thing. it splits it. cheap obvious work at scale, and a way smaller premium tier for real authorship. the middle is what actually gets crushed.
caveman stops being a style tool and starts being self-defense. once the prompt comes in up to 1.35x fatter, they've basically moved visibility and control entirely into their black box.
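(back-of-the-envelope with made-up numbers, just to show why that margin matters at a provider's scale)

```python
# hypothetical figures: what a 1.35x fatter prompt does to input-token spend
price_per_mtok = 3.00         # assumed $ per 1M input tokens
prompt_tokens = 20_000        # assumed prompt size before inflation
requests_per_day = 1_000_000  # assumed traffic

base = prompt_tokens / 1e6 * price_per_mtok * requests_per_day
fat = base * 1.35
print(f"daily input spend: ${base:,.0f} -> ${fat:,.0f} (+${fat - base:,.0f})")
```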
I keep saying that even if there's no malfeasance today, the incentive structure being set up (the model itself determines token use, and token use determines the provider's revenue) will absolutely overcome any safeguards or good intentions given long enough.
This might be true, but right now everybody is like "please let me spend more by making you think longer." The datacenter incentives from Anthropic this month are "please don't melt our GPUs anymore" though.
splitting the codebase and leaving 'cal.diy' for hobbyists is pretty much the classic open-core path. the community phase is over and they need to protect their enterprise revenue.
blaming AI scanners is just really convenient PR cover for a normal license change.
yeah, you forget the desktop app isn't actually local. claude code feels local right up until the api starts coughing up 500s. same thing, just in a terminal instead of a window.