Sounds like nothing particularly unusual or alarming. Researchers found a few thousand potentially illegal images referenced by the dataset and told LAION about it, and LAION pulled the database down temporarily while checking and removing them. A few thousand images out of five billion, roughly one in a million, is not significant.
There’s also the persistent misunderstanding of what the LAION database is, which is perpetuated even by the paper itself (making me suspicious of the researchers’ motivations, since they surely know better). The paper says: “We find that having possession of a LAION‐5B dataset populated even in late 2023 implies the possession of thousands of illegal images—not including all of the intimate imagery published and gathered non‐consensually, the legality of which is more variable by jurisdiction.” But the LAION-5B dataset doesn’t actually contain any pictures at all. It’s purely a list of URLs pointing at images that are on the Internet, each with text describing the image. Possessing the dataset doesn’t put you in possession of any of those images.
Edit: Yeah, down at the bottom of the article I see the researcher state that in his opinion LAION-5B shouldn’t even exist, and use inaccurate, emotionally charged language about how AI training data is “stolen.” So there’s the motivation I was suspicious of.
deleted
“Taking” is doing a lot of work there, and it’s fundamentally the heart of the issue.
deleted
“Copyright violation” is probably the wording you’re looking for. Copyright violation is not taking or theft or stealing or any of those other words - it’s copyright violation.
Whether training an AI on a copyrighted work without the copyright holder’s permission is a violation of copyright is debatable. But it most definitely is not stealing or theft. Theft is covered by completely different laws.
deleted