The distributed nature of git lets me stay independent of any central instance (you may decide that the master copy lives on GitHub, but with the advent of mesh VPNs like the ones ZeroTier and Tailscale offer, you could sidestep it entirely and push/pull from your colleagues directly). It also lets me dictate who gets access.
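To make the point concrete, here's a minimal sketch of two peers exchanging commits with no central server at all. Two local directories stand in for machines that would be reachable over a mesh VPN; all names and paths are illustrative:

```shell
set -e
rm -rf /tmp/p2p-demo && mkdir -p /tmp/p2p-demo && cd /tmp/p2p-demo

# "alice" creates a repo and makes a commit.
git init -q -b main alice && cd alice
git -c user.email=a@example -c user.name=alice \
    commit -q --allow-empty -m "alice: first commit"
cd ..

# "bob" clones directly from alice — over a VPN this would be an
# ssh:// or git:// URL pointing at her machine, not at GitHub.
git clone -q alice bob
cd bob
git -c user.email=b@example -c user.name=bob \
    commit -q --allow-empty -m "bob: a change"

# alice fetches bob's work straight back from him.
cd ../alice
git fetch -q ../bob main:from-bob
git log --oneline from-bob   # both commits, no central instance involved
```

Over Tailscale or ZeroTier the only change is the remote URL (e.g. `ssh://bobs-laptop.tailnet/...`), since every peer's machine is directly addressable.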
What the article describes, though, is possibly the worst way a machine can access a git repository: scraping the web UI instead of cloning it and adding all the commits to its training set. I feel like they simply don't give a shit. They got such a huge capital injection that they can afford not to give a shit even about their own cost efficiency, and so they go in with scorched-earth tactics. After all, even their own LLMs can produce a naive scraper that wreaks havoc on internet infrastructure, and they just let it loose. Got mine, fuck you all the way!
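For comparison, this is all it takes to get the entire history in one network round-trip instead of crawling thousands of HTML pages. A minimal sketch with a locally created stand-in repo (paths and commit messages are illustrative):

```shell
set -e
rm -rf /tmp/clone-demo && mkdir -p /tmp/clone-demo && cd /tmp/clone-demo

# Stand-in for the upstream repo a scraper would be hammering.
git init -q -b main upstream && cd upstream
for i in 1 2 3; do
  git -c user.email=a@example -c user.name=a \
      commit -q --allow-empty -m "commit $i"
done
cd ..

# One clone fetches every commit ever made, in git's compact pack
# format — no web UI, no per-page requests.
git clone -q --mirror upstream mirror.git

# The full history is now local and free to iterate over.
git -C mirror.git log --all --format='%H %s'
```

A `--mirror` clone also picks up all branches and tags, so there is genuinely nothing left on the server worth scraping page by page.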
But then they will release some DeepSeek R(xyz), and yay, all the Hacker News commenters who were roasting them for these methods will be applauding them for a new version of an "open source" stochastic parrot. Yay indeed.