
If I were going to productize this, I'd do AF passes on a huge training dataset like The Stack and generate some kind of fingerprint for each program. (Estimated cost: billions!)

https://huggingface.co/datasets/bigcode/the-stack

Then, I'd have a tool to let you fingerprint your own code and compare it against the big database -- maybe give you a list of high-similarity codebases.
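
To make the lookup step concrete, here's a rough sketch. Big caveat: the fingerprint here is just a set of hashed token shingles scored with Jaccard similarity, a stand-in I picked for illustration, not the fingerprint the passes above would produce. The names `fingerprint`, `jaccard`, and `most_similar` are made up, and the "database" is a plain in-memory dict.

```python
from __future__ import annotations

import hashlib
import re


def fingerprint(source: str, k: int = 5) -> set[int]:
    """Hash every run of k consecutive tokens (a "shingle") into a set.

    Assumption: a crude lexical fingerprint, used only as a placeholder.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    count = max(len(tokens) - k + 1, 1)
    shingles = (" ".join(tokens[i:i + k]) for i in range(count))
    return {
        int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")
        for s in shingles
    }


def jaccard(a: set[int], b: set[int]) -> float:
    """Share of shingles the two fingerprints have in common."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def most_similar(
    query: set[int], database: dict[str, set[int]], top_n: int = 5
) -> list[tuple[str, float]]:
    """Return the top_n (name, score) pairs, highest similarity first."""
    scores = [(name, jaccard(query, fp)) for name, fp in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]
```

You'd call `most_similar(fingerprint(my_code), database)` and get back the list of high-similarity codebases. At the scale of The Stack, a linear scan like this wouldn't fly; some kind of approximate index would have to replace the dict.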

And you could re-run the comparison each time you push to Git -- maybe only comparing what changed.
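
The "only what changed" part could look something like this. It's a sketch under my own assumptions: `changed_files` is a hypothetical helper, and it tracks changes with a cache of content hashes rather than asking Git. In a real hook you could get the same list of paths from `git diff --name-only`.

```python
import hashlib


def changed_files(files: dict[str, str], seen: dict[str, str]) -> list[str]:
    """Return paths whose content differs from the last recorded hash.

    files maps path -> current source text; seen maps path -> last hash
    and is updated in place so the next push only sees newer changes.
    """
    changed = []
    for path, text in files.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if seen.get(path) != digest:
            changed.append(path)
            seen[path] = digest
    return changed
```

Only the paths this returns would need to be re-fingerprinted and compared against the big database on each push.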

@bkuhn @richardfontana @cwebber @ossguy

⁂