It might feel good, but spiking your blood sugar isn't healthy for you, and the crashes afterwards will get worse over the years. Improving metabolic health might be a better long-term solution; have you explored how endurance or high-intensity exercise affects your focus?
The nature of the content is an important variable to control for in future work, but the primary negative impact appears to be via the devastating effect on human attention.
From the paper: "repeated exposure to highly stimulating, fast-paced content may contribute to habituation, in which users become desensitized to slower, more effortful cognitive tasks such as reading, problem solving, or deep learning. This process may gradually reduce cognitive endurance and weaken the brain’s ability to sustain attention on a single task... potentially reinforcing impulsive engagement patterns and encouraging habitual seeking of instant gratification".
Is all short form video "highly stimulating" and/or "fast-paced" though? I can see the argument for the format being inherently stimulating/fast-paced, but I think that it still comes down more to the content than the format.
The pace is the format. Even if you're just watching turtles for 30 seconds, the loop and the switch to next video are fast-paced context switching, which is stimulating. I suspect it has similar mental effects to constant interruptions, like a bad day at work where Slack and email prevent you from getting into flow state/real work.
The format also encourages maximum aggressive video editing where the short video is further chopped up with cuts and zooms etc, techniques designed to tickle your brain and keep you engaged, more stimulation.
Look at what Twitter et al. did to long-form reading. Short video is the same.
> The pace is the format. Even if you're just watching turtles for 30 seconds, the loop and the switch to next video are fast-paced context switching, which is stimulating.
I've been over-indulging in context switching long before short-form videos ever showed up. The internet itself is all about context switching. But the UX around short-form videos definitely encourages doomscrolling, similar to how microtransaction games encourage neverending grinds.
We definitely need better habits as a collective, but I think a list of "do's" is just as important as a whack-a-mole list of "don'ts".
Yep, the internet as a whole is the real culprit. We love instant gratification and short feedback loops, and the internet provides.
I feel like things will likely get worse before it gets better, but I have long-term hopes that eventually we'll see some cultural change that promotes doing vs consuming.
> Even if you're just watching turtles for 30 seconds, the loop and the switch to next video are fast-paced context switching, which is stimulating.
I'll agree that it's stimulating... I guess the question is then: how stimulating it is, vs how stimulating the content itself is? As the initial comment said, we need more data on the specific types of content.
It's an interesting question. Personally, I feel like it's a combination of factors.
On the content side, I think the content editing can have more of an impact than the subject itself. For example, I can watch something like a fast-paced action movie with a reasonable amount of camera tricks for a couple hours without any noticeable strain, but 30 mins of a modern cooking show can be exhausting just because the average time between camera cuts and zooms is only a few seconds. The latter jams so much stimulation into a small window that baking a cake is on par with a car chase.
On the format side, regardless of content, the loop and video switch gives me similar vibes to the editing tricks, but ofc the short video probably also contains similar editing, so it's a double whammy, and likely spread across different subjects as you scroll every minute or so. Bonus points if the content itself is stimulating.
If the modern cooking show I described is cocaine, doom scrolling shorts is crack cocaine. Harder, faster, more addictive.
Aye, the fast-paced editing is extremely jarring. Another variable that makes these discussions so difficult to reach a conclusion! :( We need to consider stuff like this - content and subject matter, not just format - when it comes to figuring out what is harmful about this stuff, not just say "short form videos are bad!"
Very similar to social media. What is it about social media that's harmful? Is it the connecting with other humans - which seems to me to define social media? Or is it the algorithms? The infinite scrolling? Something else?
(I'm not denying we're facing very serious issues that are certainly being exacerbated if not entirely caused by popular uses of online platforms; I want to solve those issues. I just want to solve them in a productive and non-reductive manner. Taking correlations and running with them is not that, and will not only not solve the problems, but will lead to massive privacy and security issues (see: ID verification))
I don't have any specific knowledge about Waymo's stack, but I can confidently say Waymo's reaction time is likely poorer than an attentive human. By the time sensor data makes it through the perception stack, prediction/planning stack, and back to the controls stack, you're likely looking at >500ms. Waymos have the advantage of consistency though (they never text and drive).
> but I can confidently say [...] you're likely looking at >500ms
That sounds outrageous if true. It's very strange to acknowledge you don't actually have any specific knowledge about this thing before making a grand claim, and not just to make it "confidently" but also to label it as such.
They've been publishing some stuff around latency (https://waymo.com/search?q=latency), but I'm not finding any concrete numbers. Still, I'd be very surprised if it was higher than the reaction time for a human, which seems to be around 400-600ms typically.
Human reaction time is very difficult to average meaningfully. It ranges anywhere from a few hundred milliseconds on the low end to multiple seconds. The low end of that range consists of snap reactions by alert drivers, and the high end is common with distracted driving.
400-500ms is a fairly normal baseline for AV systems in my experience.
> MIT researchers have found an answer in a new study that shows humans need about 390 to 600 milliseconds to detect and react to road hazards, given only a single glance at the road — with younger drivers detecting hazards nearly twice as fast as older drivers.
But it'll be highly variable not just between individuals but state of mind, attentiveness and a whole lot of other things.
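To put those latency figures in perspective, here is a rough back-of-the-envelope sketch of how far a vehicle travels during the reaction delay alone, before any braking starts. The 50 km/h speed and the 2.0 s stand-in for "multiple seconds" are assumptions picked for illustration, and `reaction_distance_m` is a hypothetical helper, not anything from Waymo or the MIT study.

```python
# Rough sketch: distance covered during a reaction delay, before braking
# begins. The speed below is an assumed example value, not a thread figure.

def reaction_distance_m(speed_kmh: float, latency_s: float) -> float:
    """Distance in metres covered at constant speed during latency_s seconds."""
    if speed_kmh < 0 or latency_s < 0:
        raise ValueError("speed and latency must be non-negative")
    return speed_kmh / 3.6 * latency_s  # km/h -> m/s, then times seconds


if __name__ == "__main__":
    speed_kmh = 50.0  # assumed urban speed
    # 0.39-0.6 s and 0.5 s are quoted in the thread; 2.0 s is an assumed
    # stand-in for the "multiple seconds" distracted-driver case.
    for latency_s in (0.39, 0.5, 0.6, 2.0):
        distance = reaction_distance_m(speed_kmh, latency_s)
        print(f"{latency_s:.2f} s -> {distance:.1f} m travelled before reacting")
```

Under these assumptions, the gap between 400 ms and 600 ms is a few metres, while the gap between an alert and a distracted driver is tens of metres, which is roughly the consistency argument made above.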
Even if we assume this to be true, Waymos have the advantage of more sensors and fewer blind spots.
Unlike humans they can also sense what's behind the car or other spots not directly visible to a human.
They can also measure distance very precisely due to lidars (and perhaps radars too?)
A human reacts to the red brake lights when a car brakes; without them, it takes way more time to realize from stereo vision alone that a car ahead is getting closer to you.
And I am pretty sure that when the car detects certain obstacles fast approaching at certain distances, or if a car ahead of you stops suddenly or a deer jumps out or w/e, it brakes directly. It doesn't need neural network processing for that; those are probably low-level failsafes that are very fast to compute and definitely faster than a human could react.
Beyond the questions about human braking, this seems worse than the dedicated AEB systems many vehicles are using now. Do they really use the full stack for this case instead of a faster collision avoidance path? I remember some of their people talking about concurrency back in the DARPA Grand Challenge days and it seems like this would be a high priority for anyone working on a system like this.
Humans can provide a simple, pre-planned reaction to an expected event (e.g. "click when the reaction test shows a signal") within typically 250-300ms, but 500ms from vision to physically executed action for an unexpected event seems pretty optimistic for a human driver.
Waymo "sees" further - including behind cars - and has persistent 360-degree awareness, whereas humans have to settle for time-division of the fovea and are limited to line-of-sight from the driver's seat. Humans only have an advantage if the event is visible from the cabin and they were already looking at it (i.e. it's in front of them); for every other scenario, Waymo has better perception + reaction times. "They just came out of nowhere" happens less for Waymo vehicles with their current sensor suite.
It's actually a really interesting topic to think about. Depending on the situation, there might be some indecision in a human driver that slows the process down. Whereas the Waymo probably has a decisive answer to whatever problem is facing it.
I don't really know the answers for sure here, but there's probably a gray area where humans struggle more than the Waymo.
The von Neumann architecture is not ideal for all use cases; ML training and inference is hugely memory bound and a ton of energy is spent moving network weights around for just a few OPs. Our own squishy neural networks can be viewed as a form of in-memory computing: synapses both store network properties and execute the computation (there's no need to read out synapse weights for calculation elsewhere).
It's still very niche but could offer enormous power savings for ML inference.
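For a rough sense of why inference ends up memory bound, here is a minimal roofline-style sketch that counts FLOPs per byte moved for a dense layer. It assumes every weight, input and output value crosses the memory bus exactly once, and the accelerator numbers (100 TFLOP/s, 1 TB/s) are made-up round figures rather than any real chip; both function names are hypothetical.

```python
def arithmetic_intensity(rows: int, cols: int, batch: int = 1,
                         bytes_per_value: int = 2) -> float:
    """FLOPs per byte for a (rows x cols) weight matrix applied to `batch` vectors.

    Assumes each weight, input and output value is moved exactly once.
    """
    flops = 2 * rows * cols * batch  # one multiply and one add per weight per input
    values_moved = rows * cols + batch * cols + batch * rows  # weights + inputs + outputs
    return flops / (values_moved * bytes_per_value)


def is_memory_bound(intensity: float, peak_flops: float,
                    bandwidth_bytes: float) -> bool:
    """True if the hardware can compute faster than memory can feed it."""
    return intensity < peak_flops / bandwidth_bytes


if __name__ == "__main__":
    peak_flops = 100e12      # assumed: 100 TFLOP/s
    bandwidth_bytes = 1e12   # assumed: 1 TB/s
    for batch in (1, 256):
        ai = arithmetic_intensity(4096, 4096, batch=batch)
        bound = is_memory_bound(ai, peak_flops, bandwidth_bytes)
        print(f"batch={batch}: {ai:.2f} FLOP/byte, memory bound: {bound}")
```

With a single input vector the layer does about one FLOP per byte of fp16 weights fetched, far below what such hardware would need to stay busy. That is the "moving weights around for just a few OPs" problem, and it is the traffic in-memory computing would avoid.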
Sooner or later we'll get an NRAM (neural RAM) as an extension: basically a neuromorphic lattice that can be wired at a very low level, perhaps even the photonic level, and then the whole AI thing trains/lives in it.
There is another CPU that was recently featured which again has a lattice, sort of like an FPGA but very fast, where different modules are loaded with tasks and each marble pumps data to another, while the orchestrator decides how and what goes into each of these.
I keep thinking of a DRAM with a row of MAC units and registers along the row outputs. A vector is then an entire DRAM row. Access takes longer than the math, so slower/smaller multi-cycle circuits could be used. This would probably require OS-level allocation of vectors in DRAM, and management of the accumulator vector (it really should be a row, but we need a huge register to avoid extra reads and writes). The DRAM will also need some kind of command interface.
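A toy software model of that idea might look like the sketch below. Everything in it is hypothetical (the `RowMacDram` class, its command names, the choice to store one matrix column per row); it only illustrates the "one MAC per column plus an accumulator register row" shape, not any real device or command set.

```python
from typing import List


class RowMacDram:
    """Toy model: each DRAM row holds one vector, with one MAC unit per column."""

    def __init__(self, num_rows: int, row_width: int) -> None:
        self.rows: List[List[float]] = [[0.0] * row_width for _ in range(num_rows)]
        self.acc: List[float] = [0.0] * row_width  # accumulator register row
        self.activations = 0  # row activations, the slow part in real DRAM

    def write_row(self, addr: int, values: List[float]) -> None:
        if len(values) != len(self.acc):
            raise ValueError("a vector must fill exactly one row")
        self.rows[addr] = list(values)

    def clear_acc(self) -> None:
        self.acc = [0.0] * len(self.acc)

    def mac_row(self, addr: int, scalar: float) -> None:
        """Activate one row and do acc[i] += scalar * row[i] in every column."""
        self.activations += 1
        self.acc = [a + scalar * v for a, v in zip(self.acc, self.rows[addr])]

    def read_acc(self) -> List[float]:
        return list(self.acc)


def matvec(dram: RowMacDram, columns: List[List[float]], x: List[float]) -> List[float]:
    """Compute W @ x as the sum over j of x[j] * W[:, j], one column per DRAM row."""
    for j, column in enumerate(columns):
        dram.write_row(j, column)
    dram.clear_acc()
    for j, xj in enumerate(x):
        dram.mac_row(j, xj)
    return dram.read_acc()
```

For W = [[1, 2], [3, 4]] stored as the columns [1, 3] and [2, 4], `matvec(RowMacDram(2, 2), [[1.0, 3.0], [2.0, 4.0]], [5.0, 6.0])` accumulates 5*[1, 3] + 6*[2, 4] with one row activation per input element, and the weights never leave the array.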
That is a very wild shape indeed. I wonder if there is some analytical way to derive these or if some type of search algorithm will remain the best way.
Given the elegance of the wave equation, I like to imagine there are solutions with some sort of symmetry and structure. We are unfortunately missing the tools and knowledge to find these solutions today!
Good list. I'm also keeping an eye on Tri Alpha Energy and First Light Fusion. TAE recently announced [1] initiating a field-reversed configuration with no plasma injectors, only neutral beam injection, which is a pretty big deal in simplifying the design.
Thank you for the excellent writeup of some extremely interesting work! Do you have any opinions on whether binary networks and/or differentiable circuits will play a large role in the future of AI? I've long had this hunch that we'll look back on current dense vector representations as an inferior way of encoding information.
Well, I'm not an expert. I think that this research direction is very cool. I think that, at the limit, for some (but not all!) applications, we'll be training over the raw instructions available to the hardware, or perhaps even the hardware itself. Maybe something as in this short story[0]:
> A descendant of AutoML-Zero, “HQU” starts with raw GPU primitives like matrix multiplication, and it directly outputs binary blobs. These blobs are then executed in a wide family of simulated games, each randomized, and the HQU outer loop evolved to increase reward.
I also think that different applications will require different architectures and tools, much like how you don't write systems software in Lua, nor script game mods with Zsh. It's fun to speculate, but who knows.