spuz · 2026-06-19T07:37:42 1781854662

The article describes their decision-making process:

> As rescue divers searched for the boy's body, we deliberated whether to attempt resuscitation and likelihood of meaningful neurologic recovery of a child submerged for at least 90 minutes. We reviewed literature for guidance2-4,6 and drew from institutional experience with a 2-year-old submerged in ice water for 40 minutes who received 101 minutes of CPR.3 The toddler recovered with no sequelae. For our current patient, the decision was made to resuscitate and rewarm the boy because of his young age and protective effects of ice water submersion. We reasoned that if meaningful neurologic function were not observed after rewarming, end-organ preservation on ECMO may allow family goodbyes and organ harvest for transplantation to give other sick children the gift of life.9 This important point should be considered by providers faced with the difficult decision to attempt resuscitation of a patient with asystolic hypothermia >90 minutes.

spuz · 2026-06-06T15:17:09 1780759029

As well as measuring how many questions each model was able to answer correctly, I think it's equally important to measure how many questions each model answered incorrectly. After all, if you consider using them as a tool, you will need to have confidence that any answer they give is correct.

If you look at Table 3 you can see the difference in performance between, for example, GPT 5.5 and Opus 4.7 across the 20 x 100 runs:

- GPT 5.5: 1389/2000 questions answered, of which 1043 were correct (75%)

- Opus: 1306/2000 questions answered, of which 294 were correct (22%)

So while you can claim that Opus solved 40% of the problems, it still had a failure rate of 78% on the questions it answered. That means if you chose this model to answer your homework question, there is a good chance you would fail.
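For anyone who wants to check the arithmetic, here is a minimal sketch using only the counts quoted above. The function and key names are my own, and treating 20 x 100 as 2000 attempts per model is my reading of the table, not something taken from the paper's code.

```python
# Counts quoted above from Table 3 (assumed: 20 runs x 100 questions = 2000 attempts).
RESULTS = {
    "GPT 5.5": {"attempts": 2000, "answered": 1389, "correct": 1043},
    "Opus": {"attempts": 2000, "answered": 1306, "correct": 294},
}


def summarize(attempts, answered, correct):
    """Return the share of answers that were correct, its complement,
    and the share of all attempts that ended in a correct answer."""
    precision = correct / answered
    return {
        "precision": precision,
        "failure_rate": 1 - precision,
        "correct_per_attempt": correct / attempts,
    }


for model, counts in RESULTS.items():
    s = summarize(**counts)
    print(
        f"{model}: {s['precision']:.1%} of answers correct, "
        f"{s['failure_rate']:.1%} of answers wrong"
    )
```

The point of separating `precision` from `correct_per_attempt` is that a model can look fine on one and poor on the other, which is the distinction I'm trying to draw.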

Perhaps a more useful benchmark for future models is measuring how many of these types of questions they can answer in one shot. I.e. how confident can you be when using them for real world tasks.

christianstump · 2026-06-06T15:44:40 1780760680

You are 100% correct with your assessment of the situation. But I do not agree with either of your conclusions:

1. These questions cannot and must not be compared as being similar to homework questions. These are different leagues and possibly even different sports.

2. The "more useful benchmark" that you suggest is already present in the data as we ran every model exactly once in Stage 1.

spuz · 2026-06-06T20:40:32 1780778432

Ah you are right. I think I started reading the results of Stage 2 thinking it was Stage 1.

spuz · 2026-06-04T18:55:37 1780599337

I normally love Lego's interpretation of various real architectural works but I don't believe there is enough detail here to really capture the unique style of Gaudi's design.

genocidicbunny · 2026-06-04T19:34:11 1780601651

They're doing the best they can given the budget and size constraints. The set has to simultaneously be interesting and not tedious to build, cost a somewhat reasonable amount, not be so huge that no one can display or even reasonably build it at home, and still replicate closely enough what they're trying to model.

Could they make a bigger version of this set that more closely resembles the real thing? More than likely, yeah they can; look at the displays they have at Legoland. But would that more detailed version be accessible for even the well off AFOL? Most likely not. It'd be too big, too expensive, and too unwieldy, and would probably still fail to capture some of the details of the real thing.

pvillano · 2026-06-04T19:15:20 1780600520

12,060 pieces

spuz · 2026-05-29T08:50:24 1780044624

What does client assertion mean here? I don't see any mention in the GitHub issue.

fhars · 2026-05-29T08:55:56 1780044956

It means that the request to the API contains cryptographic proof that it was generated by a legitimate, reviewed app running on an unmodified and non-rooted mobile device controlled by Apple or Google.

Retr0id · 2026-05-29T09:27:25 1780046845

fwiw this is a correct definition of Remote Attestation, matching what is mentioned in the github thread, but Client Assertion is something mostly unrelated (an OAuth implementation detail)

spuz · 2026-05-20T10:06:00 1779271560

What are some of the millions of legitimate use cases that are harmed by having metadata added to generated images? It's funny you mention Photoshop because that software also adds metadata to jpeg images that it creates. Is the difference here that the SynthID is hidden and can't be removed?

spuz · 2026-05-20T09:57:03 1779271023

There's a world where all the big platforms automatically flag AI images thanks to SynthID and other techniques. The more ubiquitous it is the easier it will be for them to make that a reality.

spuz · 2026-05-14T06:21:04 1778739664

When I had exams in the 90s we'd have to hand phones in at the start. If the phone was seen during the exam, your test would be forfeited. If the rules were that strict then, I can't imagine how they could be less strict now given how much more powerful phones are today.

traderj0e · 2026-05-14T17:13:01 1778778781

Most colleges today will consider phone use during a test to be cheating. The penalty depends on the school, but at least a 0 on the test is likely. I've had some professors say upfront that they'll go further and wreck cheaters as much as they possibly can.

They probably don't ask for phones upfront, but I don't see how that'd do anything. The only way to make this stricter would be to search students like they do for big standardized tests.

spuz · 2026-05-14T06:09:15 1778738955

What is this honour council I've seen mentioned in a few comments? I thought Princeton was unique in having an honour system as opposed to strict academic integrity rules.

porknubbins · 2026-05-15T05:38:06 1778823486

Many US universities and some private schools had honor councils made up of students and faculty that would hear cheating cases.

spuz · 2026-05-03T23:06:39 1777849599

Yeah this is a problem even without technology. I believe the UK does not use the letter O in standard registration numbers so it cannot be confused with 0.

spuz · 2026-04-27T11:56:46 1777291006

The only problem with this idea I can foresee is that the application and therefore the screenshots can change but the documentation does not. For example, if the documentation says press "Options > Customize" but the application is updated so this becomes "Preferences > Advanced" then the screenshot will show the new text but the documentation will still show the old labels. This would be very confusing as it would be hard to correlate what is being shown on the screenshot with the text. If the user saw the old screenshot they could more easily identify that they were looking at out-of-date documentation.

Having said that, having a process to automatically grab screenshots is going to make it significantly easier for a developer to update the docs so the motivation to keep the text up to date is going to be much higher.

zffr · 2026-04-27T16:14:10 1777306450

As a next step, it could be cool to write unit tests against these screenshots that look for words like the ones you mentioned. That way, if a screenshot is updated and a test breaks, you will know what documentation to update.
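A rough sketch of what such a test might look like, assuming some OCR or accessibility step has already turned the screenshot into a list of visible words. That step is not shown, and the word lists below are made up for illustration. `menu_labels` and `missing_labels` are hypothetical helper names, not part of any existing tool.

```python
import re
import unittest


def menu_labels(doc_text):
    """Collect labels from quoted menu paths such as "Options > Customize"."""
    labels = []
    for path in re.findall(r'"([^"]+)"', doc_text):
        if ">" in path:
            labels.extend(part.strip() for part in path.split(">"))
    return labels


def missing_labels(doc_text, screenshot_words):
    """Return documented labels that are absent from the screenshot's text."""
    seen = {word.lower() for word in screenshot_words}
    return [label for label in menu_labels(doc_text) if label.lower() not in seen]


class DocsMatchScreenshot(unittest.TestCase):
    DOC = 'To change the layout, press "Options > Customize".'

    def test_labels_present(self):
        # Hypothetical OCR output for a screenshot that matches the docs.
        words = ["File", "Edit", "Options", "Customize"]
        self.assertEqual(missing_labels(self.DOC, words), [])

    def test_renamed_menu_detected(self):
        # Hypothetical OCR output after the menu was renamed in the app.
        words = ["File", "Edit", "Preferences", "Advanced"]
        self.assertEqual(
            missing_labels(self.DOC, words), ["Options", "Customize"]
        )
```

Saved as a test module, this can be run with `python -m unittest`. In a real setup, the non-empty result in the second case is what would fail the build and point at the documentation paragraph that needs updating.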