in nyc any car-based transportation is often slower than the subway, but everyone's so narrow-minded they just think about their own life. if you're old in nyc, cabs/ubers/waymo are a big deal. without them you're stuck walking to a bus stop or subway, and that gets hard in your 70s and 80s.
There are a few benchmarks out there where all existing models have abysmal scores. So it's not actually a problem if Anthropic's older models are bad, especially if the jump to the newest model is huge and the competition is also way below it.
yes? the future for any verifiable task is that the model attempts to verify the initial state and a goal, then decomposes its task into ever smaller verifiable subtasks, with /memory being the persistence between runs, and then /dreaming on the results of those memory files + run data to introduce new ideas.
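a rough sketch of that loop, to make it concrete. everything here is hypothetical: `Task`, `run`, and the JSON memory file are names made up for illustration, not how /memory or any lab's agent actually works. it also assumes a task counts as verified if all of its subtasks verify, which is my assumption.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List


@dataclass
class Task:
    name: str
    # Hypothetical: one attempt at the task, returning the verifier's verdict.
    attempt: Callable[[], bool]
    # Smaller verifiable pieces to fall back on if the direct attempt fails.
    subtasks: List["Task"] = field(default_factory=list)


def load_memory(path: Path) -> dict:
    """Read results persisted by earlier runs (empty if there are none)."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_memory(path: Path, memory: dict) -> None:
    """Persist results so the next run can skip already-verified work."""
    path.write_text(json.dumps(memory, indent=2))


def run(task: Task, memory: dict) -> bool:
    """Try the task; if verification fails, decompose into its subtasks."""
    if memory.get(task.name) == "verified":
        return True  # verified in a previous run, nothing to redo
    ok = task.attempt()
    if not ok and task.subtasks:
        # Assumption: the parent is done once every subtask verifies.
        ok = all(run(sub, memory) for sub in task.subtasks)
    memory[task.name] = "verified" if ok else "failed"
    return ok
```

the "dreaming" step would sit outside this: something that reads the memory file plus run logs and proposes new subtasks. that part is left out because the comment doesn't say how it would work.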
i think that's the path to async agi these labs are imagining. The only limits are the sensor data you have on the world or your system, how long you're willing to wait, and how much you're willing to spend to parallelize it.
maybe once you start building out these verified workflows you can feed that back into training, and the model starts to get a feel for the world to the point that it can intuit things, since it has these sub-paths built.
my personal agi test is: can a model trained on video of someone knocking on a door and then opening it encounter a microwave for the first time and open it when the food's done, without knocking?
i used to use opus for everything, but that's not an option once you move to a multi-agent system unless you're working on, like, high-end research. I could easily spend 3k a day if i was using opus as just a normal dev.
As we build a better and better harness and better feedback/verifiers, we're switching more to 3.5 flash. I think chinese models would work too, but we can't use those atm.
Generally there's a coordinator running opus and an ever-growing set of skills and subagents that take actions using weaker models and output feedback to the coordinator opus.
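a minimal sketch of that shape, not our actual setup. the model calls are stub callables (`ModelFn`), the class and field names are invented for illustration, and the "retry on the strong model when the verifier rejects" rule is an assumption i'm adding to show where feedback goes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a model call: prompt in, text out.
ModelFn = Callable[[str], str]
Verifier = Callable[[str], bool]


@dataclass
class Feedback:
    skill: str
    model: str      # which tier produced the output: "weak" or "strong"
    output: str
    accepted: bool  # the verifier's verdict on the output


class Coordinator:
    """Routes each skill to a cheap model first and collects feedback."""

    def __init__(self, strong: ModelFn, weak: ModelFn,
                 verifiers: Dict[str, Verifier]) -> None:
        self.strong = strong
        self.weak = weak
        self.verifiers = verifiers
        self.log: List[Feedback] = []

    def dispatch(self, skill: str, prompt: str) -> Feedback:
        verify = self.verifiers[skill]
        output = self.weak(prompt)
        fb = Feedback(skill, "weak", output, verify(output))
        self.log.append(fb)
        if not fb.accepted:
            # Assumption: escalate to the strong model on rejection.
            output = self.strong(prompt)
            fb = Feedback(skill, "strong", output, verify(output))
            self.log.append(fb)
        return fb
```

the log is the interesting part: how often a skill falls through to the strong tier tells you whether the harness or the verifier for that skill needs work.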
I'm pretty convinced at this point we're past the level of intelligence needed for most tasks most devs do, and that the level needed will trend down as we build better harnesses for our own codebases.
this is the finance team doing a fantastic job. keep in mind they're raising this cash right before 3 major ipos in their sector, which people will need to raise money for and which will fight against them in the narrative.
If i was google's cfo and was trading at a premium to my peers before that, i'd want to raise the cash now. Look at MSFT: they're trading at 25 forward p/e and were buying back shares at 40. If they have to issue equity over the next few years, the spread between the performance of the 2 cfos could be 40-50b on that alone.
Just as a google shareholder: this company bought back shares hand over fist at a low p/e for a few years, issued 100-year debt at low rates, and is selling equity when it's at a premium to its peers, right before 2-3 major ipos of competitors put selling pressure on the stock for a while.
I don't know who's going to win the llm battle, but google's finance team has been doing their job fantastically.
flash 3.5 is the best price/performance model for what i'm doing. I had been using opus for everything, but once we started running many agents at once, and then eventually agents managing sub agents, frontier was no longer an option.
we started testing models on cost/performance for our skills and agents, and flash 3.5 wins in most things.
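the comparison can be as simple as this sketch: for each skill, run the same eval set on each model, then take the cheapest model per successful run that still clears a pass-rate bar. `EvalResult` and `pick_model` are hypothetical names, and the threshold rule is an assumption, not a description of our exact process.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalResult:
    model: str
    passed: int      # eval cases the verifier accepted
    total: int       # eval cases run
    cost_usd: float  # total spend across all `total` runs

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total

    @property
    def cost_per_success(self) -> float:
        # A model that never passes is infinitely expensive per success.
        return self.cost_usd / self.passed if self.passed else float("inf")


def pick_model(results: List[EvalResult],
               min_pass_rate: float) -> Optional[EvalResult]:
    """Cheapest model per success among those meeting the pass-rate bar."""
    eligible = [r for r in results if r.pass_rate >= min_pass_rate]
    return min(eligible, key=lambda r: r.cost_per_success, default=None)
```

cost per success rather than cost per run matters here, because a cheap model that fails half the time pays for its retries.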
As people develop harnesses for their codebases, i think the intelligence required comes down a lot.