no? it takes 10 seconds to check:
> The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed".
You don't need any scraping countermeasures for crawlers like those.
So what’s the user agent for their bot? They don’t seem to specify a default in the docs, and it looks like it’s user-configurable. So it’s yet another opt-out bot that your web server needs special matching behaviour to block
Isn't this covered here? https://developers.cloudflare.com/browser-rendering/referenc...
No, hence all their examples using User-Agent: *
>So yet another opt out bot which you need your web server to match on special behaviour to block
Given that malicious bots are allegedly spoofing real user agents, "another user agent you have to add to your list" seems like the least of your problems.
It is Cloudflare who made the claim that they are well behaved, unlike those other bots, and that their behaviour can be controlled by robots.txt.
If I need to treat cloudflare bots the same as malicious bots, that undermines their claim.
Not 'allegedly' - it's just a fact. Even if you're not malicious, spoofing is sometimes necessary, because a server may serve different sites to different browsers and check user agents to decide which experience to deliver. So even for legitimate purposes you need to use at least the prefix of the user agent the server expects.
Like they explain in the docs, their crawler will respect the robots.txt-disallowed user agents - right after the section that explains how to change your user agent.
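Which is the whole problem: robots.txt rules bind to a user-agent name, so a disallow rule only bites if you know the name to write. A minimal sketch with Python's stdlib parser (note "ExampleBot" is a placeholder agent name I made up, not the crawler's actual, configurable user agent):

```python
# Demonstrates that robots.txt Disallow rules only apply to the
# user-agent name they are written against. If the bot's agent string
# is unknown or changed, the rule never matches.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The named bot is blocked from /private/ ...
print(parser.can_fetch("ExampleBot", "https://example.com/private/page"))  # False
# ... but a bot with any other name falls through to the permissive wildcard.
print(parser.can_fetch("RenamedBot", "https://example.com/private/page"))  # True
```

If the agent string is configurable and the default isn't documented, the second case is what site owners actually get.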