Hacker News | spuz's comments

As well as measuring how many questions each model was able to answer correctly, I think it's equally important to measure how many questions each model answered incorrectly. After all, if you consider using them as a tool, you will need to have confidence that any answer they give is correct.

If you look at Table 3, you can see the difference in performance between, for example, GPT 5.5 and Opus 4.7 over the 20x100 runs (2000 questions in total):

- GPT 5.5: 1389/2000 questions answered, of which 1043 were correct (75%)

- Opus: 1306/2000 questions answered, of which 294 were correct (22%)

So while you can claim that Opus solved 40% of the problems, it still had a failure rate of 78% on the questions it answered. That means if you chose this model to answer your homework question, there is a good chance you would fail.
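To make the arithmetic explicit, here is a minimal sketch that recomputes the rates from the two Table 3 rows quoted above. The function and field names are mine, not the paper's, and the total of 2000 attempts per model comes from the 20x100 runs.

```python
# Figures as quoted above from Table 3 (2000 attempts per model).
TOTAL = 2000
results = {
    "GPT 5.5": {"answered": 1389, "correct": 1043},
    "Opus": {"answered": 1306, "correct": 294},
}


def summarize(answered, correct, total=TOTAL):
    """Return the basic rates for one model (names are illustrative)."""
    return {
        "attempt_rate": answered / total,                # how often it gave an answer
        "precision": correct / answered,                 # correct, given it answered
        "error_rate": (answered - correct) / answered,   # wrong, given it answered
        "overall_correct": correct / total,              # correct out of all attempts
    }


for model, row in results.items():
    s = summarize(**row)
    print(
        f"{model}: answered {s['attempt_rate']:.1%}, "
        f"correct when answered {s['precision']:.1%}, "
        f"wrong when answered {s['error_rate']:.1%}"
    )
```

The error rate here is conditional on the model giving an answer at all, which is the number that matters if you plan to trust whatever it returns.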

Perhaps a more useful benchmark for future models would be to measure how many of these types of questions they can answer in one shot, i.e. how confident you can be when using them for real-world tasks.


You are 100% correct with your assessment of the situation. But I do not agree with either of your conclusions:

1. These questions cannot and must not be compared as being similar to homework questions. These are different leagues and possibly even different sports.

2. The "more useful benchmark" that you suggest is already present in the data as we ran every model exactly once in Stage 1.


Ah you are right. I think I started reading the results of Stage 2 thinking it was Stage 1.

I normally love Lego's interpretation of various real architectural works but I don't believe there is enough detail here to really capture the unique style of Gaudi's design.

They're doing the best they can given the budget and size constraints. The set has to simultaneously be interesting and not tedious to build, cost a somewhat reasonable amount, not be so huge that no one can display or even reasonably build it at home, and still replicate closely enough what they're trying to model.

Could they make a bigger version of this set that more closely resembles the real thing? More than likely, yeah; look at the displays they have at Legoland. But would that more detailed version be accessible even for the well-off AFOL? Most likely not. It'd be too big, too expensive, and too unwieldy, and it would probably still fail to capture some of the details of the real thing.


12,060 pieces

What does client assertion mean here? I don't see any mention in the GitHub issue.


It means that the request to the API contains cryptographic proof that it was generated by a legitimate, reviewed app running on an unmodified and non-rooted mobile device controlled by Apple or Google.


fwiw this is a correct definition of Remote Attestation, matching what is mentioned in the github thread, but Client Assertion is something mostly unrelated (an OAuth implementation detail)
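For comparison, here is a rough sketch of what an OAuth client assertion looks like on the wire, in the style of RFC 7523: the client signs a short-lived JWT and sends it to the token endpoint in place of a client secret. Everything concrete below (client ID, endpoint URL, secret) is made up for illustration, and HS256 with a shared secret is used only so the example runs on the standard library; real deployments often sign with an asymmetric key instead.

```python
import base64
import hashlib
import hmac
import json
import time
import uuid
from urllib.parse import urlencode


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_client_assertion(client_id, token_endpoint, secret, now=None):
    """Build an HS256-signed JWT identifying the client (hypothetical helper)."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": client_id,          # issuer and subject are both the client itself
        "sub": client_id,
        "aud": token_endpoint,     # the authorization server being addressed
        "iat": now,
        "exp": now + 60,           # short-lived
        "jti": str(uuid.uuid4()),  # unique ID so the server can reject replays
    }
    signing_input = (
        b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(claims).encode())
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


# Hypothetical values, for illustration only.
assertion = make_client_assertion(
    "example-client", "https://auth.example.com/token", b"demo-secret"
)
body = urlencode({
    "grant_type": "client_credentials",
    "client_assertion_type": "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
    "client_assertion": assertion,
})
print(body)
```

Note that this only proves which registered client is calling the token endpoint. It says nothing about whether the device or app binary has been tampered with, which is what remote attestation is about.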


What are some of the millions of legitimate use cases that are harmed by having metadata added to generated images? It's funny you mention Photoshop because that software also adds metadata to jpeg images that it creates. Is the difference here that the SynthID is hidden and can't be removed?


There's a world where all the big platforms automatically flag AI images thanks to SynthID and other techniques. The more ubiquitous it is, the easier it will be for them to make that a reality.


When I had exams in the 90s we'd have to hand phones in at the start. If the phone was seen during the exam, your test would be forfeited. If the rules were that strict then, I can't imagine how they could be less strict now given how much more powerful phones are today.


Most colleges today will consider phone use during a test to be cheating. What happens next depends on the school, but at least a 0 on the test is likely. I've had some professors say upfront that they'll go further and wreck cheaters as much as they possibly can.

They probably don't ask for phones upfront, but I don't see how that'd do anything. The only way to make this stricter would be to search students like they do for big standardized tests.


What is this honour council I've heard about in a few comments? I thought Princeton was unique in having an honour system as opposed to strict academic integrity rules.


Many US universities and some private schools had honor councils made up of students and faculty that would hear cheating cases.


Yeah this is a problem even without technology. I believe the UK does not use the letter O in standard registration numbers so it cannot be confused with 0.


The only problem with this idea I can foresee is that the application, and therefore the screenshots, can change while the documentation does not. For example, if the documentation says press "Options > Customize" but the application is updated so this becomes "Preferences > Advanced", then the screenshot will show the new text but the documentation will still show the old labels. This would be very confusing, as it would be hard to correlate what is shown in the screenshot with the text. If the user saw the old screenshot, they could more easily tell that they were looking at out-of-date documentation.

Having said that, having a process to automatically grab screenshots is going to make it significantly easier for a developer to update the docs, so the motivation to keep the text up to date is going to be much higher.


As a next step, it could be cool to write unit tests against these screenshots that look for words like the ones you mentioned. That way, if a screenshot is updated and a test breaks, you will know what documentation to update.
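A minimal sketch of what such a test might look like, assuming you already have the text of each screenshot from somewhere (OCR, or the UI automation tool that captured it); that extraction step is not shown. The helper names and the convention that UI labels appear in double quotes in the docs are my assumptions.

```python
import re


def labels_in_docs(doc_text):
    """Collect UI labels from quoted menu paths such as "Options > Customize"."""
    labels = []
    for quoted in re.findall(r'"([^"]+)"', doc_text):
        labels.extend(part.strip() for part in quoted.split(">"))
    return labels


def missing_labels(doc_text, screenshot_text):
    """Return labels the docs mention that do not appear in the screenshot text."""
    haystack = screenshot_text.lower()
    return [label for label in labels_in_docs(doc_text) if label.lower() not in haystack]


doc = 'Press "Options > Customize" to change the layout.'

# Text assumed to have been extracted from the old and new screenshots.
old_screenshot = "File Edit Options\nCustomize Reset"
new_screenshot = "File Edit Preferences\nAdvanced Reset"

# The docs match the old UI, so nothing is missing.
assert missing_labels(doc, old_screenshot) == []

# After the UI change, the test tells you which labels in the docs are stale.
print(missing_labels(doc, new_screenshot))  # ['Options', 'Customize']
```

A plain substring match like this can pass by accident when a label is a common word, so a real version would probably want to scope the check to the screenshot that sits next to each paragraph.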


"F" usually means somebody did something wrong and you are paying respect to their memory. You don't say it as a form of congratulations.


> did something wrong

Nah it's for those who sacrificed their own life, those who succumbed to the call of duty (or to the imperium of perfection) and put their teammates first.

