Story Detail of id 47354045 | Liveview Hacker News

iepathos 6 hours ago | on: Malus – Clean Room as a Service

This is essentially 'License Laundering as a Service.' The 'Firewall' they describe is an illusion because the contamination happens at the training phase, not the inference phase. You can't claim independent creation when your 'independent developer' (the commercial LLM) already has the original implementation's patterns and edge cases baked into its weights.

In order to really do this, they would need to train LLMs from scratch that had no exposure whatsoever to open source code which they may be asked to reproduce. Those models in turn would be terrible at coding given how much of the training corpus is open source code.

john_strinlai 6 hours ago | parent | next

>The 'Firewall' they describe is an illusion because [...]

it is an illusion because this is a satire site.

gwern 6 hours ago | parent | next

The solution here seems to be to impose some constraint or requirement which means that literal copying is impossible (remember, copyright governs copies, it doesn't govern ideas or algorithms - that would be 'patents', which essentially no open source software has) or where any 'copying' from vaguely remembered pretraining code is on such an abstract indirect level that it is 'transformative' and thus safe.

For example, the Anthropic Rust C compiler could hardly have copied GCC or any of the many C compilers it surely trained on, because then it wouldn't have spat out reasonably idiomatic and natural-looking Rust in a differently organized codebase.

Good news for Rust and Lean, I guess, as it seems like everyone these days is looking for an excuse to rewrite everything into those for either speed or safety or both.

pron 6 hours ago | root | parent

> copyright governs copies, it doesn't govern ideas or algorithms

The second part is true. The first is a little trickier. Copyright applies to a work fixed in some medium (text in this case) rather than to the idea expressed, but the protections extend well beyond literal copies. For example, in fiction, the narrative arc and "arrangement" are also protected, as are adaptations and translations.

If you were to try to write The Catcher in the Rye in Italian completely from memory (however well you remember it), I believe the result would still be covered by the original's copyright even if not a single sentence were copied verbatim.

neilv 6 hours ago | parent | next

I think this site is either satire, or serious but with a certain kind of humor in which both they and the reader know they're lying (but it's in everyone's interest to play along).

They do say this:

> Is this legal? / our clean room process is based on well-established legal precedent. The robots performing reconstruction have provably never accessed the original source code. We maintain detailed audit logs that definitely exist and are available upon request to courts in select jurisdictions.

Unless they're rejecting almost all of the open source packages submitted by customers, because those packages are in the training set of the foundation model they use, this is really the opposite of a clean room.

ActivePattern 6 hours ago | parent

[flagged]
