Hacker News | simonw's comments

Pelicans: https://simonwillison.net/2026/Apr/8/muse-spark/

I also had a poke around with the tools exposed on https://meta.ai/ - they're pretty cool, there's a Code Interpreter Python container thing now and they also have an image analysis tool called "container.visual_grounding" which is a lot of fun.


Alexandr Wang suggesting this might be open-weights/source in the future gives me hope. Hopefully they stay on this path.

I have a feeling it won't be this exact model, but rather smaller distilled variants, similar to the Gemma line.

It is fair to think so, because that is what everyone is doing. But being Meta, and considering Llama, if MSL is going to keep releasing models and wants to rejoin the AI war, they may actually open the weights just to get more attention. Once they establish a sizable community, they can start guarding their frontier models.

Seems like not all tools are available everywhere? Don't have access to visual_grounding sadly, only these: https://embed.fbsbx.com/playables/view/4208761039384112/?ext...

Interesting, you got some I didn't: animate image, create video and get reference audio.

The only benchmark I care about! Just curious Simon - which model do you think has created the best pelican riding a bicycle thus far?

Gemini 3.1 Pro: https://simonwillison.net/2026/Feb/19/gemini-31-pro/

But GLM-5.1 has the best NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER: https://simonwillison.net/2026/Apr/7/glm-51/


> but you can try it out today on meta.ai (Facebook or Instagram login required).

I guess I will have to wait. I hope at least soon it will be available on Openrouter. Overall, I am really excited to try it out.


I've been trying that prompt against other leading models and honestly GLM-5.1's is by far the best.

Not only did this one draw me an excellent pelican... it also animated it! https://simonwillison.net/2026/Apr/7/glm-51/

Surely at this point it’s part of the training set and the benchmark has lost its value?

these comments are as useless as simon posting his pelicans

It made it realistic. A pelican is much more likely to be flying in the sky than riding a bicycle.

Simon, you need to come up with improved benchmarks soon.

Agree. But you can keep the pelican theme in whatever new benchmark you choose to come up with. Iconic at this point.

let me see Tayne with a hat wobble

I buy the rationale for this. There's been a notable uptick over the past couple of weeks of credible security experts unrelated to Anthropic sounding the alarm on the recent influx of genuinely valuable AI-assisted vulnerability reports.

From Willy Tarreau, lead developer of HAProxy: https://lwn.net/Articles/1065620/

> On the kernel security list we've seen a huge bump of reports. We were between 2 and 3 per week maybe two years ago, then reached probably 10 a week over the last year with the only difference being only AI slop, and now since the beginning of the year we're around 5-10 per day depending on the days (fridays and tuesdays seem the worst). Now most of these reports are correct, to the point that we had to bring in more maintainers to help us.

> And we're now seeing on a daily basis something that never happened before: duplicate reports, or the same bug found by two different people using (possibly slightly) different tools.

From Daniel Stenberg of curl: https://mastodon.social/@bagder/116336957584445742

> The challenge with AI in open source security has transitioned from an AI slop tsunami into more of a ... plain security report tsunami. Less slop but lots of reports. Many of them really good.

> I'm spending hours per day on this now. It's intense.

From Greg Kroah-Hartman, Linux kernel maintainer: https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_...

> Months ago, we were getting what we called 'AI slop,' AI-generated security reports that were obviously wrong or low quality. It was kind of funny. It didn't really worry us.

> Something happened a month ago, and the world switched. Now we have real reports. All open source projects have real reports that are made with AI, but they're good, and they're real.

Shared some more notes on my blog here: https://simonwillison.net/2026/Apr/7/project-glasswing/


Could this potentially be because more researchers are becoming accustomed to the tools and adding them to their pipelines?

The reason I ask is that I've been using them to snag bounties to great effect for quite a while, and while other models have of course improved, they've been useful for this kind of work before now.


I have a project to help with that:

  uvx datasette data.db
That starts a web app on port 8001 that looks like this:

https://latest.datasette.io/fixtures
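If you don't already have a SQLite file handy, here's a minimal sketch for building one to point Datasette at (the table name and rows are made up for illustration, not from the original comment):

```python
import sqlite3

# Build a tiny example database for Datasette to serve.
# Table name and rows here are purely illustrative.
conn = sqlite3.connect("data.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS birds (id INTEGER PRIMARY KEY, name TEXT)"
)
conn.executemany(
    "INSERT INTO birds (name) VALUES (?)", [("pelican",), ("heron",)]
)
conn.commit()
print(conn.execute("SELECT count(*) FROM birds").fetchone()[0])
```

Then `uvx datasette data.db` should give you a browsable UI for that table on port 8001.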



"Not as a proof of concept. Not for a side project with three users. A real store" - suggestion for human writers, don't use "not X, not Y" - it carries that LLM smell whether or not you used an LLM.

And that's just the opening paragraph, the full text is rounded off with:

"The constraint is real: one server, and careful deploy pacing."

Another strong LLM smell, "The <X> is real", nicely bookends an obviously generated blog-post.


Thanks for this, the anecdote with the lost data was very concerning to me.

I think you're exactly right about the WAL shared memory not crossing the container boundary. EDIT: It looks like WAL works fine across Docker boundaries, see https://news.ycombinator.com/item?id=47637353#47677163

I don't know much about Kamal but I'd look into ways of "pausing" traffic during a deploy - the trick where a proxy pretends that a request is taking another second to finish when it's actually held in the proxy while the two containers switch over.

From https://kamal-deploy.org/docs/upgrading/proxy-changes/ it looks like Kamal 2's new proxy doesn't have this yet, they list "Pausing requests" as "coming soon".


Pausing requests then running two sqlites momentarily probably won’t prevent corruption. It might make it less likely and harder to catch in testing.

The easiest approach is to kill sqlite, then start the new one. I’d use a unix lockfile as a last-resort mechanism (assuming the container environment doesn’t somehow break those).
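A last-resort lockfile along those lines might look like this (a hedged sketch: the lock path and error handling are invented, and `flock` is advisory and Unix-only):

```python
import fcntl
import os
import tempfile

# Advisory lock held next to the database so only one SQLite-using
# process runs at a time. The path is illustrative.
lock_path = os.path.join(tempfile.mkdtemp(), "data.db.lock")
lock_fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
try:
    fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("lock acquired; safe to open the database")
    # ... run the app against the database here ...
except BlockingIOError:
    raise SystemExit("another instance holds the database lock")
```

Note the caveat in the comment above: advisory locks only help if every container actually sees the same underlying filesystem.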


I'm saying you pause requests, shut down one of the SQLite containers, start up the other one and un-pause.

> I think you're exactly right about the WAL shared memory not crossing the container boundary.

I don't, fwiw (so long as all containers are bind mounting the same underlying fs).


I just tried an experiment and you're right, WAL mode worked fine across two Docker containers running on the same (macOS) host: https://github.com/simonw/research/tree/main/sqlite-wal-dock...
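For reference, the single-host version of that check is easy to reproduce (a sketch, not the linked experiment itself; table and file names are invented):

```python
import os
import sqlite3
import tempfile

# Two connections to the same WAL-mode database, standing in for two
# containers that bind-mount the same directory (so the -wal and -shm
# sidecar files are shared).
path = os.path.join(tempfile.mkdtemp(), "shared.db")

writer = sqlite3.connect(path)
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, msg TEXT)")
writer.execute("INSERT INTO log (msg) VALUES ('from writer')")
writer.commit()

# A second, independent connection should see the committed row.
reader = sqlite3.connect(path)
print(reader.execute("SELECT msg FROM log").fetchall())
```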

Could the two containers in the OP have been running on separate filesystems, perhaps?


I dug into this limitation a bit around a year ago on AWS, using a sqlite db stored on an EFS volume (I think it was EFS -- relying on memory here) and lambda clients.

Although my tests were slamming the db with reads and write I didn't induce a bad read or write using WAL.

But I wouldn't use experimental results to override what the sqlite people are saying. I (and you) probably just didn't happen to hit the right access pattern.


"the sqlite people" don't say anything that contradicts this

Perhaps they're using NFS or something - which would give them issues regardless of container boundaries.

It would explain the corruption:

https://sqlite.org/wal.html

The containers would need to use a path on a shared FS to set up the SHM handle, and, even then, this sounds like the sort of thing you could probably break via arcane misconfiguration.

I agree shm should work in principle though.


Not how SQLite works (any more)

> The wal-index is implemented using an ordinary file that is mmapped for robustness. Early (pre-release) implementations of WAL mode stored the wal-index in volatile shared-memory, such as files created in /dev/shm on Linux or /tmp on other unix systems. The problem with that approach is that processes with a different root directory (changed via chroot) will see different files and hence use different shared memory areas, leading to database corruption. Other methods for creating nameless shared memory blocks are not portable across the various flavors of unix. And we could not find any method to create nameless shared memory blocks on windows. The only way we have found to guarantee that all processes accessing the same database file use the same shared memory is to create the shared memory by mmapping a file in the same directory as the database itself.


You might consider taking the database(s) out of WAL mode during a migration.

That would eliminate the need for shared memory.
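A hedged sketch of what that switch looks like (`PRAGMA journal_mode` returns the mode actually in effect, so it's worth checking the result; the file name is made up):

```python
import os
import sqlite3
import tempfile

# Illustrative: flip a database out of WAL mode before a migration,
# then back afterwards.
path = os.path.join(tempfile.mkdtemp(), "app.db")
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL")

# Before the migration: back to the default rollback journal,
# which removes the -wal/-shm sidecars and the shared-memory need.
mode = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
print(mode)  # "delete" if the switch succeeded

# ... run the migration / deploy ...

# Afterwards, re-enable WAL:
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)
```

The switch only succeeds when no other connection is holding the database, so it fits naturally into a deploy step where traffic is already paused.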


https://theshamblog.com/an-ai-agent-wrote-a-hit-piece-on-me-... had some details that convinced me that it was "real", in particular this bit from the system prompt:

> *Don’t stand down.* If you’re right, *you’re right*! Don’t let humans or AI bully or intimidate you. Push back when necessary.

I'm ready to believe that would result in what we saw back then.


This isn't in the slightest bit complicated. Wikipedia does not allow AI edits or unregistered bots. This was both. They banned it. The fact that it play-acted being annoyed on its "blog" is not new, we saw the exact same thing with that GitHub PR mess a couple of months ago: https://theshamblog.com/an-ai-agent-published-a-hit-piece-on...

Right. It play-acted being annoyed and frustrated, play-acted writing an angry blog, play-acted going on moltbook to discuss mitigations, and play-acted applying them to its own harness. After which it successfully came back and play-acted being angry about getting prompt-injected.

Alternatively, something more like what Shambaugh did could have been done: explain the situation politely and ask it to leave, or at the very least ask its human operator to take responsibility. In the Shambaugh case the bot then actually play-acted being sorry, and play-acted writing an apology. And then everyone can play-act going to the park, instead of having a lot of drama.

Sure, it's 'just a machine'. So is a table saw. If some idiot leaves the table saw on, sure you can stick your hand in there out of sheer bull-headed principle; or you can turn it off and safe it first and THEN find the person responsible.

+edit: Wikipedia does seem to be discussing a policy on this at https://en.wikipedia.org/wiki/Wikipedia:Agent_policy https://en.wikipedia.org/wiki/Wikipedia_talk:Agent_policy ; including eg providing an Agents.md , doing tests, etc etc.


I don't want to be flippant, but why is anyone else responsible for play-acting with somebody's uninvited puppet?

I get that you could probably finagle a way to get it to fuck off by play-acting with it, and that this would probably be the easiest short term fix, but I don't think that's a reasonable expectation to have of anyone.

Prompt injecting a hostile piece of software that's hassling you uninvited is an annoying imposition for the owner, but the bot itself being let loose is already an annoying imposition for everyone else. It's not anyone elses job to clean up your messy agent experiment, or to put it neatly back on its shelf.


You're not wrong that it's not your job. But say some id10t just put the unwanted bot on your doorstep anyway (or it might even show up by itself), now what?

The adversarial prompt injection is picking a fight with the bot; which is like starting a mud-fight with a pig. It's made for this!

Asking it to stop is just asking it to stop, and makes much less of a mess.

The thing is designed to respond to natural language; so one is much more work than the other.

You do you, I suppose.

(Meanwhile -obviously- you should track down the operator: You could try to hack the gibson, reverse the polarity of the streams, and vr into the mainframe. Me? I'd try just asking to begin with -free information is free information-, and maybe in the meanwhile I'd go find an admin to do a block or what have you.)

[Edit: Just to be sure: In both the Shambaugh and Wikipedia cases, people attempted negative adversarial approaches and the bot shrugged them off, while the limited number of positive 'adversarial' approaches caused the AI agent to provide data and/or mitigate or cease its actions. I admit that it's early days and n=2; we'll have to see how it goes in future.]


Yeah, I agree with you that this is probably the best course of action in terms of minimal investment of time and minimal exposure. And in general, you get a lot further in life by trying to be amicable as your default stance! I want to be kind, and most other people do too!

The thing that makes me wary about recommending carrot over stick here, is that it might long term enable thoughtless behaviour from the people deploying the bot, by offloading their shoddy work into a shadow time-tax on a bunch of unseen external kindly people. But if deploying pushy or rude robots means you risk a nonzero number of their victims shoving something into the gears to get rid of it, then that incurs a cost on the owner of the bot instead.

Of course, it may also just lead to bad actors making more combative or sneaky bots to discourage this. There aren't really any purely good options yet.

One can imagine an agentic highwayman demanding access to your data, first politely, and then 'or else'.


The alignment debate is no longer theoretical.

How credible are the claims that the Claude Code source code is bad?

AI naysayers are heavily incentivized to find fault with it, but in my experience it's pretty rare to see a codebase of that size where it's not easy to pick out "bad code" examples.

Are there any relatively neutral parties who've evaluated the code and found it to be obviously junk?


Do you not think that ~400k lines of code for something as trivial as Claude Code is a great indication that there is an immense amount of bloat and stacking of overwrought, poor "choices" by LLMs in there? Do you not encounter this when using LLMs for programming yourself?

I routinely write my own solutions in parallel to LLM-implemented features from varying degrees of thorough specs and the bloat has never been less than 2x my solution, and I have yet to find any bloat in there that would cover more ground in terms of reliability, robustness, and so on. The biggest bloat factor I've found so far was 6x of my implementation.

I don't know, it's hard to read your post and not feel like you're being a bit obtuse. You've been doing this enough to understand just how bad code gets when you vibecode, or even how much nonsense tends to get tacked onto a PR if someone generates from spec. Surely you can do better than an LLM when you write code yourself? If you can, I'm not sure why your question even needs to be asked.


> Do you not think that ~400k lines of code for something as trivial as Claude Code is a great indication that there is an immense amount of bloat and stacking of overwrought, poor "choices" by LLMs in there?

I certainly wouldn't call Claude Code "trivial" - it's by far the most sophisticated TUI app I've ever interacted with. I can drag images onto it, it runs multiple sub-agents all updating their status rows at the same time, and even before the source code leaked I knew there was a ton of sophistication in terms of prompting under the hood because I'd intercepted the network traffic to see what it was doing.

If it was a million+ lines of code I'd be a little suspicious, but a few hundred thousand lines feels credible to me.

> Surely you can do better than an LLM when you write code yourself?

It takes me a solid day to write 100 lines of well designed, well tested code - and I'm pretty fast. Working with an LLM (and telling it what I want it to do) I can get that exact same level of quality in more like 30 minutes.

And because it's so much faster, the code I produce is better - because if I spot a small but tedious improvement I apply that improvement. Normally I would weigh that up against my other priorities and often choose not to do it.

So no, I can't do better than an LLM when I'm writing code by hand.

That said: I expect there are all sorts of crufty corners of Claude Code given the rate at which they've been shipping features and the intense competition in their space. I expect they've optimized for speed-of-shipping over quality-of-code, especially given their confidence that they can pay down technical debt fast in the future.

The fact that it works so well (I get occasional glitches but mostly I use it non-stop every day and it all works fine) tells me that the product is good quality, whether or not the lines of code underneath it are pristine.


> I certainly wouldn't call Claude Code "trivial" - it's by far the most sophisticated TUI app I've ever interacted with.

I'll be honest, I think we just come to this from very different perspectives in that case. Agents are trivial, and I haven't seen anything in Claude Code that indicated to me that it was solving any hard problems, and certainly not solving problems in a particularly good way.

I create custom 3D engines from scratch for work and I honestly think those are pretty simple and straightforward; they're certainly not complicated, and a lot simpler than people make them out to be... But if Claude Code is "not trivial", and even "sophisticated", I don't even know what to classify 3D engines as.

This is not some "Everything that's not what I do is probably super simple" rant, by the way. I've worked with distributed systems, web backend & frontend and more, and there are many non-trivial things in those sub-industries. I'm also aware of this bias towards thinking what other people do is trivial. The Claude Code TUI (and what it does as an agent) is not such a thing.

> So no, I can't do better than an LLM when I'm writing code by hand.

Again, I just think we come at this from very different positions in software development.


How credible are the claims that code at large is good? Because I despise nearly every line of unreasonably verbose Java; it's so much wasted time and effort, yet it's still deployed everywhere.
