> Ask yourself, why would a scraper ddos?
No need to ask, I can tell you exactly: because they have no regard for anything but their own profit.
Let me give you an example from this mom-and-pop shop known as Anthropic.
You see, they have this thing called ClaudeBot, and at least initially it scraped by iterating through IPs.
Now you have these things called shared hosting servers, typically running 1000-10000 domains of actual low-volume websites on 1-50 or so IPs.
Guess what happens when it's your network's turn to bend over? The whole hosting company's infrastructure goes down, as each server has hundreds of ClaudeBot requests crawling hundreds of vhosts at the same time.
This went on for months. It's the reason they are banned in WAFs across half the hosting industry.
So how would you avoid this specific situation as a web crawler that tries to be well behaved? You strictly adhere to robots.txt as specified by each domain. The problem is not with any of the sites but with the density (1000-10000 domains) at which the hoster packed them.

Even if the crawler enforced a 1-second-between-pages governor per domain when robots.txt specifies no rate, which to be fair is very reasonable, that packing could still produce high server load: at 1 request per second per domain, 1000 domains sharing one IP means up to 1000 requests per second landing on the same box.
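One way to close that gap is to key the rate governor on the resolved server IP rather than the hostname, so vhosts packed onto one box share a single budget instead of each getting their own. Here's a minimal sketch using Python's stdlib `urllib.robotparser`; the `PoliteThrottle` class, the 1-second default floor, and the allow-on-unreadable-robots.txt fallback are my own illustrative choices, not anything ClaudeBot is documented to do.

```python
import socket
import time
import urllib.robotparser
from urllib.parse import urlparse

class PoliteThrottle:
    """Per-target rate governor for a crawler.

    Delays are keyed on the resolved server IP rather than the hostname,
    so 1000 vhosts packed onto one shared-hosting IP share one request
    budget instead of getting 1000 independent ones.
    """

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay  # floor even if robots.txt is silent
        self.robots = {}    # hostname -> RobotFileParser
        self.last_hit = {}  # resolved IP -> monotonic time of last request

    def _parser_for(self, host):
        if host not in self.robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"http://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                # unreadable robots.txt: treat as allow-all in this sketch
                rp.allow_all = True
            self.robots[host] = rp
        return self.robots[host]

    def wait_before(self, url, user_agent="ExampleBot"):
        """Block until it is polite to fetch url; return False if disallowed."""
        host = urlparse(url).hostname
        rp = self._parser_for(host)
        if not rp.can_fetch(user_agent, url):
            return False
        # honor Crawl-delay when stated, else fall back to the default floor
        delay = rp.crawl_delay(user_agent) or self.default_delay
        try:
            key = socket.gethostbyname(host)  # collapse vhosts to server IP
        except socket.gaierror:
            key = host  # resolution failed: fall back to per-host keying
        elapsed = time.monotonic() - self.last_hit.get(key, 0.0)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_hit[key] = time.monotonic()
        return True
```

The point of the IP keying is exactly the shared-hosting case above: a per-domain governor sees 1000 "different" sites, but the server sees one flood.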