I've been doing a lot of experimentation with "hands-off coding", where a test suite the agents cannot see determines the success of the task. Essentially, it's a Ralph loop with an external specification that determines when the task is done. The rule is simple: no test that was previously passing is allowed to fail in subsequent turns. I achieve this by spawning an agent in a worktree, having it do some work, and then, when it's done, running the suite and merging the code into trunk.
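The merge gate itself reduces to a small comparison between the suite's results before and after the agent's work. A minimal sketch, assuming results are collected as test-name-to-pass maps (the function name and data shape are my own illustration, not part of any particular harness):

```python
def regression_gate(prev_results: dict[str, bool], new_results: dict[str, bool]) -> bool:
    """Allow a merge only if no previously-passing test now fails.

    Tests that were already failing before the agent's changes are
    ignored, so the agent can't be blamed for pre-existing breakage,
    but it also can't hide new breakage behind that excuse.
    """
    regressions = [
        name
        for name, passed in prev_results.items()
        if passed and not new_results.get(name, False)
    ]
    return not regressions


prev = {"test_a": True, "test_b": False}  # test_b was already failing

# No previously-passing test broke, so the merge is allowed:
print(regression_gate(prev, {"test_a": True, "test_b": False}))   # True

# test_a regressed, so the merge is blocked (even though test_b was fixed):
print(regression_gate(prev, {"test_a": False, "test_b": True}))   # False
```

The key design choice is comparing against a snapshot of the prior state rather than requiring a fully green suite, which is exactly the comparison the agent skips when it declares a failure "pre-existing".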
I see this kind of misalignment in all agents, open and closed weights.
I've found these forms to be the most common: "this test was already failing before my changes," or "this test is flaky due to running the test suite on multiple threads." Sometimes the agent's CoT claims the test was bad, or that the requirements were not necessary.
Even more interesting is a different class of misalignment. When the constraints are very heavy (usually towards the end of the entire task), I've observed agents intentionally trying to subvert the external validation mechanisms. For example, an agent will navigate out of the worktree and commit its changes directly to trunk. Its CoT usually indicates that the agent "is aware" that it's doing a bad thing, and is usually accompanied by something like, "I know this will break the build, but I've been working on this task for too long. I'll just check in what I have now and create a ticket to fix the build."
I ended up having to spawn the agents in a jail to prevent that behavior entirely.
Are you using any tools specifically for controlling this behavior that you can recommend? I want to tear my hair out every time Claude cleanly one-shots weeks of work to 99% accuracy, one or a couple of tests fail, and it calmly resolves the matter by declaring the failures "pre-existing" or "flaky". It can usually fix them if I then explicitly tell it to stash the changes and compare against the test results from the prior state, but this happens constantly.
What if you could boil a pot of water with an F-16 jet engine?
The harness discussion is relevant because it might be possible to achieve the same results at 1/20th the cost. If that's the case, these trillion-dollar companies are worth less than is currently understood.
It's a lot easier to research harness optimizations without having to raise a billion dollars.
I'm personally very interested to know the answer. There are a lot of resources being expended (and a lot of big bets being placed) on running and training these frontier models.
I met someone who received a degree in "Happiness" from a reputable university after taking a hodgepodge of courses. There are also many interdisciplinary degrees that make sense at the graduate level but not the undergraduate level.
Right. But it's not my favorite nerd snipe interpretation that allows me to post low effort comments on hackernews about the headline instead of engaging in a meaningful discussion about the article.
I can see why people fall into the trap of calling for an equitable torment nexus: it is both cynical (it supposes everyone in power is corrupt and everyone at the top would oppose an equitable torment nexus) and also naive/optimistic (it supposes that we have any hope to actually impose an equitable torment nexus).
But I think the latter factor wins out, so we should just oppose obviously bad things in a non-clever fashion.
I don't see it as cynical. I'm just accepting the obvious reality.
I have no power to stop what's happening. I might as well make the best of it for myself and my family, and hope it becomes so bad that people who actually do have the power to stop it do something about it. Maybe it'll rise to the level that enough individual citizens will call out for change, but I continue to be amazed at what people will put up with in the name of convenience, continuation of their lifestyle, and, as it relates specifically to surveillance capitalism, shiny digital doodads and baubles that bring them temporary joy.
With capital being speech in the US, and since I'm not a billionaire, I have very little influence.
I have optimism and hope for people doing good things locally, but absolutely no hope large-scale problems will ever be fixed. I feel like the US political system experienced some phase change in the last 50 years, has "solidified", and is now completely unable to do anything meaningful at scale. The New Deal couldn't happen today. The interstate highway system couldn't happen today. The Affordable Care Act started off as a watered-down, weakened version of what it could have been (because anything more radical would never have passed), and the private interests have had 20 years to chip away at it, sculpting it into a driver of revenue. Heck, we can't even build mass public transit at the level of cities.
Private capital, meanwhile, soldiers on accomplishing its goals in spite of (or because of) our political gridlock.
The fact that you couldn't identify it as sarcasm/satire is indicative of not having an accurate understanding of your opponents' position. If you want to defeat your opponents, understand their calculus.