ether+nick

@evan That’s not enough code for copyright enforcement. People have been finding identical code in the output - you just need something “rare”. It’s similar for subjects with little text in the corpus - I’ve been seeing listings that *can only have one source* (retro datasheets by AMD, in my case).

But maybe that's wrong; I don't know. Maybe if I wrote a Person.setName() method that was in the training set, and the LLM generated an identical Person.setName() code snippet for someone else, I could claim that the code is a copyright violation, even if there were thousands of other identical and independent Person.setName() methods in the training set.

@cwebber @richardfontana @bkuhn @ossguy

I think the worst-case scenario is that the inserted code matches exactly one snippet in the training data.

So you could try to go for zero matches, by using coding conventions so idiosyncratic and discouraged that nobody else has code like yours.

Or you could try to go for lots of matches, by using bog-standard coding conventions and software patterns.

@cwebber @richardfontana @bkuhn @ossguy
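
To make the zero / one / many distinction concrete, here is a rough sketch of how one might count matches for a snippet. Everything in it is an assumption for illustration: the function names (`normalize`, `count_matches`, `bucket`) are made up, "match" is taken to mean "identical after stripping `//` comments and collapsing whitespace", and the corpus is a toy list of strings. It says nothing about how a court or a real provenance tool would decide.

```python
import re


def normalize(snippet: str) -> str:
    """Drop // line comments and collapse whitespace (assumed notion of 'same code')."""
    no_comments = re.sub(r"//[^\n]*", "", snippet)
    return re.sub(r"\s+", " ", no_comments).strip()


def count_matches(snippet: str, corpus: list[str]) -> int:
    """Count corpus entries that are identical to the snippet after normalization."""
    target = normalize(snippet)
    return sum(1 for doc in corpus if normalize(doc) == target)


def bucket(matches: int) -> str:
    """Map a match count onto the three scenarios discussed in the thread."""
    if matches == 0:
        return "zero matches"
    if matches == 1:
        return "exactly one source"
    return "many independent sources"


# Hypothetical toy corpus: three formatting variants of a setter, one rare line.
corpus = [
    "public void setName(String name) { this.name = name; }",
    "public void setName(String name) {\n    this.name = name;\n}",
    "public void setName(String name) { this.name = name; } // setter",
    "int amd_reg = 0x2901;",
]

setter = "public void setName(String name) { this.name = name; }"
rare = "int amd_reg = 0x2901;"
unseen = "void zz_set_nm(char *n);"

for snippet in (setter, rare, unseen):
    print(bucket(count_matches(snippet, corpus)))
```

The setter lands in the "many" bucket, the rare line in the "exactly one" bucket, and the unseen line in "zero", which is the spread the two strategies above are aiming for.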

@evan @richardfontana @bkuhn @ossguy Yeah! I actually already said elsewhere in the thread that I don't think we need to worry about using these tools for such scenarios from a *licensing* perspective, only when the genAI output is explicitly checked into the codebase