Deduplication is not trivial. Each scrape is stored in a WARC archive, so to dedupe across scrapes you would have to unpack several large files, remove the duplicates, and then pack everything back up again. I believe the contents are at least compressed within each snapshot, though.
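To make the unpack/dedupe/repack step concrete, here is a rough sketch of the dedupe part only, assuming the records have already been read out of each archive. It does not parse real WARC files; the `Record` tuple and `dedupe_records` are hypothetical names for illustration, and duplicates are detected by hashing the payload, which is just one possible approach.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

# Hypothetical stand-in for a parsed WARC record: (url, payload bytes).
Record = Tuple[str, bytes]


def dedupe_records(archives: Iterable[Iterable[Record]]) -> Iterator[Record]:
    """Yield each distinct payload once, keeping the first occurrence.

    Assumption: two records are duplicates when their payload bytes
    are identical, regardless of URL or capture time.
    """
    seen = set()
    for archive in archives:
        for url, payload in archive:
            digest = hashlib.sha256(payload).digest()
            if digest in seen:
                continue  # already emitted from an earlier record
            seen.add(digest)
            yield url, payload


# Two made-up snapshots that share one identical page.
snapshot_a = [
    ("http://example.com/", b"<html>hello</html>"),
    ("http://example.com/a", b"page a"),
]
snapshot_b = [
    ("http://example.com/", b"<html>hello</html>"),
    ("http://example.com/b", b"page b"),
]

unique = list(dedupe_records([snapshot_a, snapshot_b]))
```

Even in this toy form, the set of seen digests has to span every archive at once, which is part of why doing this over several large files is expensive: each one has to be read in full before anything can be repacked.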