In-process sandboxed llm-generated code execution. Considerably lighter weight, faster to boot (assuming pre-compilation) than Docker or spinning up micro-VMs.
Yes, nothing to write home about. It's all relative of course, what stack, what goal, what approach on which models perform best, but for regular day-to-day coding, I do not find it usable given alternatives.
Kimi, Mimimax and GLM models provide far more robust coding assistance at sometimes no cost (financed via data sharing) or for very cheap. Output quality, tool calling reliability and task adherence tend to be far more reliable across all three over Mercury 2, so if you consider the time to get usable code including reviews, manual fixes, different prompting attempts, etc. end-to-end you'll be faster.
Only "coding" task I have found Mercury 2 to have a place for code generation is a browser desktop with simple generated applets. Think artefacts/canvas output but via a search field if the applet has been generated previously.
With other models, I need to hide the load behind a splash screen, but with Mercury 2 it is so fast that it can feel frictionless. The demo at this point is limited by the fact that venturing beyond a simple calculator or todo list, the output becomes unpredictable and I struggle to get Mercury 2 to rely on pre-made components, etc. to ensure consistent appearance and a11y.
Despite the benchmarks, cost and speed figure suggesting something different, I have had the best overall results with Haiku 4.5, simply because GPT-5.4-nano is still unwilling to play nice with my approach to UI components. I am currently experimenting with some routing, using different models for different complexity, then using loading spinners only for certain models, but even if that works reliably, any model that I cannot force to rely on UI components in a consistent manner isn't gonna work, so for the time being it'd just route between less expensive and more expensive Anthropic models.
Coding wise, one more exception can be in-line suggestions, though I have no way to fairly compare that because the tab models I know about (like Cursors) are not available via API, but Mercury 2 seems to perform solidly there, at least in Zed for a TS code base.
Basically, whether code or anything else, unless your task is truly latency dependent, I believe there are better options out there. If it is, Mercury 2 can enable some amazing things.
This is cool stuff, have you considered submitting any of these exploits to https://hackmyclaw.com/? Email being the only allowed injection vector might be tricky though.
I did (not extensively) tried hackmyclaw but no success. The challenge is a complete black box and the user intent (e.g., "summarize my emails") is not known - this is critical for the prompt injection payload. I also suspect that batch processing of "malicious" emails (every 3 hours) adds a bias to the model behaviour (a lot of potential and detected prompt injection payloads are injected in context). That's why I always start my experiments with a fresh context. Moreover, "hacking" the VPS is not allowed.
Imho the author shall disclose more info about the setup (version, user intent, exact config) to make it more realistic. I read people saying "OpenClaw is secure against prompt injection" because nobody was able to solve the challenge - it's not.
You could maybe rearchitect the backend to be provider-agnostic and sell this as a native client for users of Discord, Matrix, Zulip, Stoat, etc. That way you can capture a larger market and you're fine even if Discord kicks you out for ToS violations. Although, I suspect there's a bunch of complexity in papering over each platform especially when you factor in voice/streaming.
> I think maybe there's a mid-ground with buy forever, 1 year updates, so people get the product they paid for, and if they want updates or support the development they can re-buy, however I'm yet to hear opinions on this model.
As far as desktop software is concerned, I think this a commonly accepted approach. Sublime Text is probably the most notable example.
That's the exciting thing about macOS development, right? There's always some funny riddles because Apple doesn't care to document stuff for 3rd parties (or at all).
Sure, but if long-term throughput is a real limitation there's plenty of ways to address that while still not needing to keep anywhere close to all model weights in RAM (which is still the conventional approach with MoE). So the gain of a smaller memory footprint is quite real.
- Where do you source real time traffic data, ferry schedules, etc? Google APIs get you part of the way there but you'd need to crawl public transit sites for the rest.
- How do you keep track of what went into the fridge, what was consumed/thrown away?
- How do you track real world events like buying a physical pass?
That isn't secure is the issue, the more things you have it hooked up to the more havoc it can cause. The environment being locked down doesn't help when you're giving it access to potentially destructive actions. And once you remove those actions, you've neutered it.
The openclaw security model is the equivalent of running as root - i.e. full access. If that is insecure the inverse of it is running without any access as default and adding the things that you need.
The unsolved security challenge is how to give one of these agents access to private data while also enabling other features that could potentially leak data to an attacker (see the lethal trifecta.)
That's the product people want - they want to use a Claw with the ability to execute arbitrary code and also give it access to their private data.
reply