It's God's gift to them when it lets them bypass ads and dl copyrighted material. But it's Satan's curse on humanity when the Zuck does it to train his llm and dl copyrighted material.
Zuck serves ads/spywares to other users, he deserves to taste his own medicines, not me.
Where's the contradiction?
Web crawling for training is when you ingest content on a mass scale, usually indiscriminately, usually with a dumb crawler for scale's sake, for the purposes of training an LLM. You don't really care whether one particular website is in the dataset (unless it's the size of Reddit), you just want a large, diverse, high-quality data mix.
Web crawling for inference is when a user asks a targeted question, you do a web search, and fetch exactly those resources that are likely to be relevant to that search. Nothing ends up in the training data, it's just context enrichment.
People have a much larger issue with crawling for training than for inference (though I personally think both are equally ok).