Doesn't work for pages protected by cloudflare in my experience. What a shame, they could've produced the problem and sold the solution.
That’s what they are doing. This is a textbook protection racket.

“Buy Cloudflare bot protection, otherwise it would be a shame if your site got scraped and ddos’d.”

Who is doing the scraping and ddosing? Cloudflare.

In this case, sure... that said, I've worked on a few sites where more than half the traffic was bots because the content was useful for other sites (classic car classifieds/sales site). Just over half the page requests were actually search query results, which meant a lot of optimization steps in practice: implementing a "search" database (mongodb and elastic were pretty new at the time), denormalizing a lot of the data structures from the "enterprise" SQL structures for search and display for not-logged-in users, heavier caching, donut caching, etc.

It was an interesting and sometimes fun part of my career. Working on a site/application that isn't necessarily a tech site, and that I have a personal interest in was pretty great... some of the pace for sales/commercial features less so, with sales making deals requiring deep integrations on impossible timelines. You learn a lot when a self-hosted site is being kicked while it's down... The cloud migration to get a better use of flexible resources, etc.

You can trivially block Cloudflare crawl via robots.txt. You don't need to buy Cloudflare's bot protection -- this is not a malicious bot.

https://x.com/CloudflareDev/status/2031745285517455615

(Disclosure: I work for Cloudflare but not on this product. I get pretty tired of the conspiracy theories TBH.)
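
For readers who want to see how the robots.txt opt-out mentioned above would behave, here is a minimal sketch using Python's standard `urllib.robotparser`. The user-agent token `ExampleCrawler` and the `is_allowed` helper are placeholders made up for illustration; the real token for any given crawler has to come from that crawler's own documentation.

```python
from urllib.robotparser import RobotFileParser

# "ExampleCrawler" is a hypothetical user-agent token used only for this sketch.
ROBOTS_TXT = """\
User-agent: ExampleCrawler
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleCrawler", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/page"))
```

This only checks what a compliant crawler would conclude from the file. robots.txt is advisory, so it does nothing against a crawler that chooses to ignore it.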

That's too funny. If true, really looking forward to the Cloudflare response here. I'm unsure how you would spin that in a way that didn't seem self-serving.
It's very clearly disclosed in the linked docs already, it says that Cloudflare Bot Protection will block it same as all other bots, unless you choose to allow it as an exception. If they didn't do it that way, people would accuse them of either bypassing their own product (possibly anticompetitive) or just having a low quality one.
So it doesn't take any action to work around other bot protections? Feels like that would be on the list of features an AI company wanting to scrape would ask for.
I imagine that would cause a backlash from the website owners trusting cloudflare to keep their content 'safe'
Wait. What?

Is this just a way to strong-arm non-cloudflarians into adopting their platform if you don't want your site crawled? It does sound like they are selling the solution to avoid their own content crawler.

As long as it gets past Azure's bot protection ...
Came here to write this. I am getting much better results from Firecrawl (not affiliated with them, just a happy customer).
As someone who helps keep a site online with a lot of content, I have mixed feelings on Firecrawl.

On one hand, their bots seem much more well behaved than others.

However, running a crawler fleet which is deceptive and evasive in its identification and doesn't honor REP is no way to build a business.

I'd love for you to kick the tires on https://grubcrawler.dev
fuck firecrawl. they copied my idea by showing interest in my product and then copied it, used their YC money to give it all out for free. fuck nick in particular. I'm still salty over this
"they copied my idea by showing interest in my product and then copied it". What exactly is revolutionary about Firecrawl or your product? Scraping APIs have been around for over a decade.
I was the first to return markdown and use reader mode stuff to strip irrelevant stuff. There's copying, and there's talking to the founder and sounding interested so your team can copy what I did in the background. One is fair game, the other is a dick head move.
Not sure about the first claim. But yes, talking to the founder, sharing details and having it stolen is not a good look. Sorry that happened to you.
I think that is a neat idea and it sucks this happened, but how long before somebody simply saw that feature and replicated it? I'm curious, had you considered a deeper moat than that?

This is especially relevant given AI is making this kind of thing easy at an industrial scale. I think we should all be looking for alternative moats.

Tell more. Crawling is not a new idea. How did they abuse you?
Please tell me you are joking