Posts: 0 | Comments: 169 | Joined: 3 yr. ago

  • No, this is mostly incorrect, sorry. The commercial aspect of the reproduction is not relevant to whether it is an infringement--it is simply a factor in damages and in a Fair Use defense (an affirmative defense that presupposes infringement).

    What you are getting at when it applies to this particular type of AI is effectively whether it would be a fair use, presupposing there is copying amounting to copyright infringement. And what I am saying is that, ignoring certain stupid behavior like torrenting a shit ton of text to keep a local store of training data, there is no copying happening as a matter of necessity. There may be copying as a matter of stupidity, but it isn't necessary to the way the technology works.

    Now, I know, you're raging and swearing right now because you think that downloading the data into cache constitutes an unlawful copying--but it presumably does not if it is accessed like any other content on the internet. Because intent is not a part of what makes that a lawful or unlawful copying and once a lawful distribution is made, principles of exhaustion begin to kick in and we start getting into really nuanced areas of IP law that I don't feel like delving into with my thumbs, but ultimately the point is that it isn't "basic copyright law." But if intent is determinative of whether there is copying in the first place, how does that jibe with an actor not making copies for themselves but rather accessing retained data in a third party's cache after they grab the data for noncommercial purposes? Also, how does that make sense if the model is being trained for purely research purposes? And then perhaps that model is leveraged commercially after development? Your analysis, assuming it's correct arguendo, leaves far too many outstanding substantive issues to be the ruling approach.

    EDIT: also, if you download images from deviantart with the purpose of using them to make shirts or some other commercial endeavor, that has no bearing on whether the download was infringing. Presumably, you downloaded via the tools provided by DA. The infringement happens when you reproduce the images for the commercial purpose (though any redistribution is actually infringing).

  • Yes, inadvertent copying is still copying, but it would be copying in the output and is not evidence of copying happening in the creation of the model. That was why I used the music example, because it is rather probative of where there could be grounds for copyright infringement related to these model architectures. This may not seem an important distinction, but it has significant consequences on who is ultimately liable and how.

  • I get that that's how it feels given how it's being reported, but the reality is that due to the way this sort of ML works, what Internet Archive does and what an arbitrary GPT does are completely different, with the former being an explicit and straightforward copy relying on a Fair Use defense and the latter being the industrialized version of intensive note taking into a notebook full of such notes while reading a book. That the outputs of such models are totally devoid of IP protections actually makes a pretty big difference imo in their usefulness to the entities we're most concerned about, but that certainly doesn't address the economic dilemma of putting an entire sector of labor at risk in narrow areas.

  • You are misunderstanding what I'm getting at and unfortunately no, this isn't just straightforwardly copyright law whatsoever. The training content does not need to be copied. It isn't saved in a database somewhere as part of the training (downloading pirated texts is a whole other issue completely removed from the inherent processes of training a model); relationships are extracted from the material, however it is presented. So the copyright extends to the right of displaying the material in the first place. If your initial display/access to the training content is non-infringing, the mere extraction of relationships between components is not itself making a copy nor is it making a derivative work in any way we haven't historically considered it. Effectively, it's the difference between looking at material and making intensive notes of how different parts of the material relate to each other and looking at material and reproducing as much of it as possible for your own records.

  • I have no personal interest in the matter, tbh. But I want people to actually understand what they're advocating for and what the downstream effects would inevitably be. Model training is not inherently infringing activity under current IP law. It just isn't. Neither the law, legislative or judicial, nor the actual engineering and operations of these current models support at all a finding of infringement. Effectively, this means that new legislation needs to be made to handle the issue. Most are effectively advocating for an entirely new IP right in the form of a "right to learn from" which further assetizes ideas and intangibles such that we get further shuffled into endstage capitalism, which most advocates are also presumably against.

  • Training data IS a massive industry already. You don't see it because you probably don't work in a field directly dealing with it. I work in medtech and millions and millions of dollars are spent acquiring training data every year. Should some new unique IP right be found on using otherwise legally rendered data to train AI, it is almost certainly going to be contracted away to hosting platforms via totally sound ToS and then further monetized such that only large and well-funded corporate entities can utilize it.

  • No, it isn't storing that information in that sequence. What is happening is that it is overly encoding those particular sequential relationships along some arbitrary but tightly mapped semantic concepts represented by dimensions in a massive vector space. It is storing copies of the information in the way that inadvertent copying of music might be based on "memorized" music listened to by the infringing artist in the past.

  • ML techniques have been very useful in compression, yes, but it's sort of nuts to characterize a data structure that encodes only relationships between values (sometimes overly so for certain regions of its latent space/embedding space/semantics space/whatever you want to call it right now), rather than value sequences themselves, as storing contiguous copyright-protected works, or as storing particularized creative works in a particularly identifiable manner.

  • On the other hand, it's hard to have a serious discussion with people who insist that building an LLM or diffusion model amounts to copying pieces of material into an obfuscated database. And then getting the typical reply of "that isn't the point!" after an explanation is attempted, without any elaboration, strongly implies to me that some people just want to be pissy and don't want to hear how they may have been manipulated into taking a pro-corporate, hyper-capitalist position on something.

  • Fair use is a (difficult) defense to being sued for infringing copyright. The copyright holder doesn't get to say what is and isn't fair use-- that's a multifactor, largely qualitative assessment performed in the courtroom.

  • The creation, curation, and maintenance of training data is a big industry in and of itself that has been around for years. Likewise, feature engineering is an entire sub-discipline of data science and engineering unto itself. I think you might be making the mistake of assuming that ChatGPT = AI.

  • The issue with bitching about "NATO expansionism" is that at the end of the day it's still an alliance that countries ask to be members of due to concerns about being invaded or attacked.

  • They'll support Israel because the way we do it is another of the many infinite money hacks our MIC has created and entrenched deep into our politics. I really dislike this American trend to racialize the conflict. In virtually every way, Palestinians and Israelis, and Jews in general for that matter, are pretty indistinguishable but for small differences in otherwise very similar ethnic and religious cultures. Jewish culture, even across the diaspora, has been pretty clearly a Levantine originating one forever (not to mention the far deeper similarities between Judaism as a religion and Islam than between either and Christianity).

  • I'm not sure how it could be beside the point, though it may not be entirely dispositive. I take ownership to be a question of who has a controlling and exclusionary right to something--in this case that's copyright. Copyright allows you to license these things and extract money for their use. If there is no copyright, there is no secure monetization (something companies using AI generated materials absolutely keep high in mind). The question was "who would own it" and I think it's pretty clear cut who would own it. No one.

  • The outputs would be considered no one's outputs as no copyright is afforded to AI-generated content.

  • This is absolutely wrong about how something like SD generates outputs. Relationships between atomic parts of an image are encoded into the model from across all training inputs. There is no copying and pasting. Now whether you think extracting these relationships from images you can otherwise access constitutes some sort of theft is one thing, but characterizing generative models as copying and pasting scraped image pieces is just utterly incorrect.

  • But they would end up running ads next to them more often. There are a lot of shitty industry groups. This is like the most banal, inoffensive one to get shitty about.

  • It's just woke. /s

  • Nah, man, you made an error in your parenting. It's not a big deal so long as you recognize it, but at this point there is pretty substantial evidence that such discipline techniques are generally more harmful than not.

    And that's ok, because honestly parenting is fucking hard. I definitely get rougher and less patient with my kid when I'm stressed, but it's a behavior I recognize I need to change and actively work on because it is objectively, unquestionably, bad parenting. This is a long way of saying that while, yea, family dynamics vary, there are many ways of parenting that are just very clearly bad or good, and recognizing the bad, even in ourselves, is something that is necessary for being a complete parent.

  • I'll second this experience. Pricing aside (and even then, because of their new recycling policy, I was able to replace an old Galaxy nearly the size of a tablet with a new Flip-- that has VERY surprisingly become my favorite phone I've ever owned-- for like a hundred bucks), I've never had complaints about my Samsung phone and wearables that weren't general to all smartphones. And the easy integration between my watch, phone, and earbuds, all Samsung, is really great.