i think we should distinguish between pretraining data and polishing/alignment data. what you are describing is most likely the latter (and it's probably mixed into pretraining as well). but if you can't get a large mass of tokens from scraping, you're going to be screwed