Hacker News | AdamConwayIE's comments

Had a very similar experience recently.

Built a basic authentication handler for this test just so it wouldn't be in the training data of either model. It had deliberately planted bugs. One was a hardcoded secret, another was a wrap-on-0xFFFFFFFF bug as a result of a malloc(length+1).
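For anyone unfamiliar with that class of bug, here is a minimal sketch of the wrap, assuming the length is held in a 32-bit unsigned integer. The function names are hypothetical and this is not the actual handler code; `ctypes.c_uint32` just stands in for C's unsigned arithmetic.

```python
import ctypes

UINT32_MAX = 0xFFFFFFFF


def alloc_size_unchecked(length: int) -> int:
    """Mimic the size passed to malloc(length + 1) with a 32-bit unsigned length."""
    # c_uint32 does no overflow checking, so the sum wraps modulo 2**32.
    return ctypes.c_uint32(length + 1).value


def alloc_size_checked(length: int) -> int:
    """Reject any length that would wrap once the extra byte is added."""
    if not 0 <= length < UINT32_MAX:
        raise ValueError("length + 1 would overflow a 32-bit size")
    return length + 1


print(alloc_size_unchecked(UINT32_MAX))  # prints 0
```

With a requested size of 0, the allocation succeeds but is far too small for the data that gets copied in afterwards, so the real risk is a heap overflow rather than memory exhaustion.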

Qwen 3.6 found both, alongside two other issues I hadn't even considered, and the location of the magic value. GPT-5.4, though, missed the malloc issue (flagging memory exhaustion as the only risk), it missed a separate timing bug (it explicitly said the function was safe), and it hallucinated the location of the magic value. Qwen correctly identified the integer overflow. GPT-5.4 did not.
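To illustrate what a timing bug in an authentication handler often looks like, here is a sketch that assumes an early-exit comparison of a secret. That is an assumption for illustration only, and `leaky_equal` and `constant_time_equal` are hypothetical names, not code from the test.

```python
import hmac


def leaky_equal(supplied: bytes, expected: bytes) -> bool:
    """Early-exit comparison: run time depends on how many leading bytes match."""
    if len(supplied) != len(expected):
        return False
    for a, b in zip(supplied, expected):
        if a != b:
            return False  # returns sooner the earlier the mismatch occurs
    return True


def constant_time_equal(supplied: bytes, expected: bytes) -> bool:
    """Comparison whose run time does not depend on where the inputs differ."""
    return hmac.compare_digest(supplied, expected)
```

Both functions return the same answers; the difference is only in how much the run time reveals about the expected value.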

I then compared basic research between them using SearXNG for web search. For example, the current status of MTP in llama.cpp. Qwen 3.6 27B found the current PR, but flagged a related issue that shows the current implementation can be slower than just using a draft model right now. GPT-5.5 Thinking found the same PR, but didn't flag the downsides.

In a similar comparison, I asked both models how I should get started with ESPHome as a total beginner. ChatGPT suggested an ESP32-S3 and a BME280, which is... just not a good idea. It also talked about the ESP32-P4 not having Wi-Fi, and installing with HA or Docker. Meanwhile, Qwen 3.6 27B said regular ESP32, DHT22, and mentioned HA, Docker, and pip as installation methods. While GPT was good, it was just throwing out jargon in response to a prompt that explicitly asked for beginner-level advice.

It kind of blew my mind that in all three of these, Qwen landed it better.


Hey, article author here!

I've been writing for nearly a decade, and I can assure you, all of this is human written. I've long been writing about the Linux kernel where it's been relevant to my coverage, and there are articles under my name talking about low-level technical aspects in drivers and kernels from as far back as 2017.

I get that it's hard to know what to trust out there given that Dead Internet Theory is beginning to feel like a reality, but comments like this can be quite upsetting after spending days researching and writing an article like this. I totally get criticism of the article itself, and I'm fine with that, but it feels as if people are too quick to jump on the "must be written by AI" bandwagon. I receive it, my colleagues receive it, and for the people who I know put so much effort into their work, it can be upsetting to them as well.

As was mentioned in another thread, there were actually a couple of typos in this article when it went live. I cleaned those up once they were pointed out, but AI doesn't make typos. I get it to an extent; hostility and accusations of all kinds have been levied at writers for the years and years I've been in this industry writing long-form content and analysis. But with the proliferation of AI, that hostility has really ramped up over the last couple of years.


Apologies if my post hurt your feelings, and I appreciate you taking the time to respond. The writing style in the piece I quoted looked very AI driven to me, which is why I said what I said.


There aren't really any of the typical benchmark suites targeting Codex 5.3 because it's still not in the API.

SWE bench for example creates a predictions file and evaluates the results in the harness. Without Codex 5.3 being in the API, it can't.


You don't have to wonder whether or not it returns value to the tax payer. The Irish government already monitored the pilot program for two years, publishing all of the details and findings. [1]

"The headline finding from this social CBA is that for every €1 of public money invested in the pilot, society received €1.39 in return"

This came about as a mixture of greater economic activity from participants, cultural impacts that saw public-facing artist activities increase, and improvements to wellbeing of participants that reduced their requirement for psychological interventions by the state. The state also predicts that the further roll-out of this program will benefit consumers with lower prices for artistic works, as there will be more supply overall.

The scheme has been quite popular here in Ireland. Given the history of Ireland when it comes to art (both in the sense of spoken and written word, and in other mediums), it makes sense to introduce a scheme like this to safeguard and uplift those who produce art.

[1] https://www.gov.ie/en/department-of-culture-communications-a...


Thanks for linking the CBA. I hadn't seen that before.

> "The headline finding from this social CBA is that for every €1 of public money invested in the pilot, society received €1.39 in return"

Okay, so if you read the CBA, the net fiscal cost of the pilot was:

* Gross pilot cost (2021–2025): ~€114 MM

* Tax revenue: ~€36 MM

* Social protection savings: ~€6.5 MM

* Net fiscal cost: ~€72 MM

So for every €1 of public money invested in the pilot, society received 37¢ in fiscal return. So it's an unambiguous fiscal cost, a net loss.
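The arithmetic behind those figures, as a quick sketch using only the rounded numbers listed above (the variable names are mine, not the CBA's):

```python
# All amounts in millions of euro, rounded as quoted above.
gross_cost = 114.0
tax_revenue = 36.0
social_protection_savings = 6.5

fiscal_return = tax_revenue + social_protection_savings
net_fiscal_cost = gross_cost - fiscal_return
return_per_euro = fiscal_return / gross_cost

print(f"Fiscal return:   {fiscal_return:.1f} MM")    # prints 42.5 MM
print(f"Net fiscal cost: {net_fiscal_cost:.1f} MM")  # prints 71.5 MM
print(f"Return per euro: {return_per_euro:.2f}")     # prints 0.37
```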

Of the "Total monetised benefits", €80 MM of the benefit was in "wellbeing gains", as measured by the WELLBY test, which is calculated based on a single survey question:

> "Overall, how satisfied are you with your life nowadays, where 0 is 'not at all satisfied' and 10 is 'completely satisfied'?"

The €80 MM in "wellbeing gains", which is the sole decider of whether this pilot was a net positive or a net negative to society, exists because, on average, the 2,000 pilot scheme participants reported a very approximate 0.7–1.1 point increase in score when asked the above question during the pilot, compared to before the pilot. Each 1 point is deemed to be worth €15,340.
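As a rough sanity check on how those inputs could add up to a figure of that size, here is a sketch under two assumptions that are mine and not taken from the CBA: that the €15,340 applies per participant per year (which is how a WELLBY is usually defined), and that the gain is counted over a three-year window.

```python
PARTICIPANTS = 2_000
VALUE_PER_POINT_EUR = 15_340
ASSUMED_YEARS = 3  # assumption: not a figure quoted from the CBA


def wellbeing_value_eur(avg_point_gain: float, years: int = ASSUMED_YEARS) -> float:
    """Monetised wellbeing gain, assuming one point per person per year = one WELLBY."""
    return PARTICIPANTS * avg_point_gain * VALUE_PER_POINT_EUR * years


low = wellbeing_value_eur(0.7)
high = wellbeing_value_eur(1.1)
print(f"Low estimate:  {low / 1e6:.1f} MM")
print(f"High estimate: {high / 1e6:.1f} MM")
```

Under those assumptions the range works out to roughly €64 MM to €101 MM, which brackets the ~€80 MM quoted.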

That's it. There's no economic return - it's a proven economic cost. There's no proven social benefit. No demonstrated effect on art prices or availability.

The pilot was successful - if you consider it to have been - solely because the artists who received payments as part of the pilot had an improvement in WELLBY satisfaction score when they were asked via survey. If you remove this factor, the pilot was an abject failure.


Nicely set out. I completely agree with you. I'm also pretty certain - and I say this both as a lover of the arts and as a taxpayer - that I will see no benefit whatsoever in my life, or to society in general, from the works produced under the aegis of this programme.

You know what would have been a worthwhile use of that €114 MM? Improving the pay and conditions of our naval personnel. That way, the nation might now be able to put more than one patrol boat out to sea at a time.


Every gun that is made, every warship launched, every rocket fired signifies, in the final sense, a theft from those who hunger and are not fed, those who are cold and are not clothed.


In this case isn't it more that: Every sculpture that is made, every picture drawn, every bed left unmade, in the final sense, a theft from those who hunger and are not fed, those who are cold and are not clothed.

From where I'm sitting, this is theft. It's forced wealth redistribution from people who are potentially already struggling to people who choose to slum it as artists. It's not even means tested; this really will result in money transferring from those on the edge of poverty to rich art school kids.

There are currently 16,000 homeless or at-risk people in Ireland, including 5,000 children [0]. I can think of at least one better use for that money.

[0] https://www.irishtimes.com/ireland/social-affairs/2025/11/28...


Those who are cold won't find their situation improved if an undetected Russian submarine sabotages the country's natural gas interconnectors.


Can you imagine the net WELLBY increase if the DF were paid a living wage?


I think you two are using different definitions of society.

In this comment society seems to mean "the government, and its tax revenue profit/loss statement".

In the previous comment society seems to be construed more broadly and encompass both non-economic activity and economic activity outside the collection and disbursement of tax funds.


> In this comment society seems to mean "the government, and its tax revenue profit/loss statement"

No, that's not correct. I specifically separated the pure economic impact from the society impact, but the only societal impact used to quantify the success of the pilot scheme is that the people paid a basic income by the scheme had higher life satisfaction as measured by a single survey question.

That is the basis used by Government to claim that it's a social benefit.

Personally, I support the arts and I think that culture, health, housing accessibility, safety, fitness, happiness, and companionship are all better measures of a society than GDP or other fiscal metrics.

Right now, we have health, housing, and social crises desperate for resources - resources that are allocated exclusively through Euro budgets. This pilot scheme has not demonstrated any cultural or social impact at all, only the aforementioned increase in recipient satisfaction.

Meanwhile people in dire situations face multi-year waits for operations, or dying of a treatable stroke/MI due to a lack of ambulances, or death by suicide as the mental health services are overwhelmed.

Is the WELLBY score of these artists more important than the WELLBY score of parents awaiting their kid's operation for the second or third spring? Or burying their children? Or raising them in hotel rooms?

Ireland is only economically successful. We are failing our citizenry abysmally outside of fiscal terms, and basic income for artists should not be allocated while hundreds of more pressing needs are left unmet.


Article author here: you'd be surprised! XDA these days has quite a bit of mainstream outreach, and this article has been getting shared on some socials. Even saw it getting passed around on LinkedIn.


Oh ok I didn't realise. I just know it as a forum where custom rom developers hang out.


There's even an interview with Steve Furber, who co-designed it, where he talks about it. https://www.youtube.com/watch?v=1jOJl8gRPyQ&t=508s


People always forget that back when OpenAI accused DeepSeek of distillation, o1's reasoning process was locked down, with only short sentences shared with the user as it "thought." There was a paper published in November 2024 from Shanghai Jiao Tong University that outlined how one would distill information from o1[1], and it even says that they used "tens of thousands" of o1 distilled chains. Given that the primary evidence given for distillation, according to Bloomberg[2], was that a lot of data was sent from OpenAI developer accounts in China in late 2024, it's not impossible that this (and other projects like it) could also have been the cause of that.

The thing is, given the other advances that were outlined in the DeepSeek R1 paper, it's not as if DeepSeek needed to coast on OpenAI's work. The use of GRPO RL, not to mention the training time and resources that were required, is still incredibly impressive, no matter the source of the data. There's a lot that DeepSeek R1 can be credited with in the LLM space today, and it really did signify a number of breakthroughs all at once. Even their identification of naturally emergent CoT through RL was incredibly impressive, and led to it becoming commonplace across LLMs these days.[3]

It's clear that there are many talented researchers on their team (their approach to MoE with its expert segmentation and expert isolation is quite interesting), so it would seem strange that with all of that talent, they'd resort to distillation for knowledge gathering. I'm not saying that it didn't happen, it absolutely could have, but a lot of the accusations that came from OpenAI/Microsoft at the time seemed more like panic given the stock market's reaction rather than genuine accusations with evidence behind them... especially given we've not heard anything since then.

[1] https://github.com/GAIR-NLP/O1-Journey

[2] https://www.bloomberg.com/news/articles/2025-01-29/microsoft...

[3] https://github.com/hkust-nlp/simpleRL-reason


Does it? The Pixel 6a's primary sensor has really been showing its age for a while now, and often struggles to contrast really dark spots with really bright spots. We talked a lot about that in our comparison of the Pixel 5 to the iPhone 13 Pro. It's part of why Google needed to upgrade the Pixel 6 Pro camera, and I wouldn't really say the 6a is "top-tier" anymore. It's still really good, but there are a ton of phones that do way better nowadays.


Hey there! Author here.

This isn't a criticism of the Pixel 6a, per se. This article is not meant to be a critique of the 6a, but rather, is using it as a tool to illustrate an overall greater point. The reason you don't understand the criticism of the 6a is because it's not supposed to be criticism.

The phone costs a lot, and a lot more than most other devices that are in a similar boat of "mid-range". It has a lot of bells and whistles that normal consumers won't necessarily care for, because it doesn't matter how great the camera is when a lot of people are just using their phones for the likes of Instagram, Twitter, and Facebook. A Nothing Phone, Nord 2T etc will get 80% of the way there in the camera department after the photo is on social media, and most people won't care for the difference.

However, the primary argument of the article is not to critique the Pixel 6a. Far from it. The problem is that the reason it's considered good value is the US carrier market. This article is primarily taking aim at the US carrier market and not the Pixel 6a. It's just a tool being used to illustrate a point.


I think one point you might have missed in your analysis is the form factor. I'm in Europe and I'm seriously considering getting a Pixel 6a because it's the smallest one amongst the phones in this price range. It's not small, but it's also not a comically large slab that doesn't fit any pocket as some of the other "good enough" phones out there. That, plus 5 years of security updates and a stock Android experience are enough to justify the price difference which is not even that big anyway - 469 Euros vs 400 Euros for Nord 2T, same price for Nothing Phone etc

