Hacker News
The idea of exposing a structured crawl endpoint feels like a natural evolution of robots.txt and sitemaps.

If more sites provided explicit machine-readable entry points for crawlers, indexing could become a lot less wasteful. Right now crawlers spend a lot of effort rediscovering the same structure over and over.

It also raises interesting questions about whether sites will eventually provide different views for humans vs. automated agents in a more formalized way.

I expect that if we still used REST, indexing would be even less wasteful.

I've found myself falling pretty hard on the side of making APIs work for humans and expecting LLM providers to optimize around that. I don't need an MCP for a CLI tool, for example, I just need a good man page or `--help` documentation.

I know that in practice this is no longer the case, if it ever was.

But semantic HTML is exactly that explicit machine-readable entrypoint. I am firmly entrenched in the opinion that HTML, and the DOM, is only for machines to read; it just happens to also be somewhat understandable to some humans. Take an average webpage and look at all the characters (bytes) in there: often two thirds will never be shown to humans.

Point being: we don't need to invent something new. We just need to realize we already have it and use it correctly. Other than requiring a better understanding of web tech, this has no downsides. The low-hanging fruit is the frameworks out there, which should really do a better job of leveraging semantics in their output.
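To make the point concrete, here is a minimal sketch of what "HTML as the machine-readable entrypoint" means in practice: with semantic elements, a generic parser can recover a page's structure without any site-specific scraping logic. The sample markup and `Outline` class are hypothetical, using only Python's stdlib parser.

```python
from html.parser import HTMLParser

# Hypothetical page using semantic elements: a crawler can map the
# document's structure without any site-specific knowledge.
PAGE = """
<article>
  <h1>Cuban chains</h1>
  <nav><a href="/guides/">Guides</a></nav>
  <main><p>Everything you need to know.</p></main>
</article>
"""

class Outline(HTMLParser):
    """Collect the semantic landmarks of a page, in document order."""
    SEMANTIC = {"article", "nav", "main", "header", "footer", "section", "h1", "h2"}

    def __init__(self):
        super().__init__()
        self.landmarks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SEMANTIC:
            self.landmarks.append(tag)

parser = Outline()
parser.feed(PAGE)
print(parser.landmarks)  # → ['article', 'h1', 'nav', 'main']
```

The same parser works on any page that uses these elements honestly, which is the whole argument: the vocabulary already exists, it just has to be used.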

The only ones benefitting from 'wasteful' crawling are the anti-bot solution vendors. Everyone else is incentivized to crawl as efficiently as possible.

Makes you think, right?

I yearn for the days when a single-kilobyte GET was enough. Now it's endless waste: spawning entire browsers larger than operating systems, with mitigations, hacks, and proxies. Requesting access directly from webmasters is met only with silence. All of my once-simple hobbyist programs are now bloated beyond belief and less reliable than ever.

> It also raises interesting questions about whether sites will eventually provide different views for humans vs. automated agents in a more formalized way.

This raises a further interesting question: would it exacerbate supply chain injection attacks? Show the innocuous page to the human, a different one to the bot.

Apart from the obvious problem of presenting something different to crawlers and humans: isn't this already covered by sitemaps and sitemap index files, which are machine-readable XML?
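For reference, a sitemap is simple enough that a crawler can consume it with nothing but stdlib XML parsing. The sitemap below is a hypothetical example in the standard sitemaps.org schema; the `lastmod` field is what lets a crawler skip pages that haven't changed since the last visit.

```python
import xml.etree.ElementTree as ET

# A hypothetical minimal sitemap in the sitemaps.org 0.9 schema.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP)
# Each <url> entry yields a location and, optionally, a last-modified
# date; lastmod is None when the site doesn't provide it.
urls = [(u.findtext("sm:loc", namespaces=NS),
         u.findtext("sm:lastmod", namespaces=NS))
        for u in root.findall("sm:url", NS)]
print(urls)
# → [('https://example.com/', '2024-01-15'), ('https://example.com/about', None)]
```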
I just use a query param to toggle to markdown/text when ?llm=true is set on a route. Easy pattern, and it's opt-in.
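That opt-in pattern fits in a few lines. This is a framework-free sketch, assuming a hypothetical route and bodies; a real app would hang the same check on its router or middleware.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical content for one route, in both representations.
HTML_BODY = "<article><h1>Cuban chains</h1><p>The guide.</p></article>"
MD_BODY = "# Cuban chains\n\nThe guide."

def render(url: str) -> tuple[str, str]:
    """Return (content_type, body) for a request URL.

    Opt-in: ?llm=true switches the same route to a markdown view;
    everything else gets the normal HTML.
    """
    params = parse_qs(urlparse(url).query)
    if params.get("llm") == ["true"]:
        return ("text/markdown", MD_BODY)
    return ("text/html", HTML_BODY)

print(render("/guide/cuban-chains?llm=true")[0])  # → text/markdown
print(render("/guide/cuban-chains")[0])           # → text/html
```

Because the toggle lives in the URL, agents can discover and cache both views without any user-agent sniffing.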
They already do...

A lot of known crawlers will get a crawler-optimized version of the page

Do they? AFAIK Google forbids that, and they’ll occasionally test that you aren’t doing it.
I haven't checked in a while, but I know for a fact that Amazon does, or at least did.
With Google covering only 3%, I wonder how many people still care, and whether they should. Funny: I own and know sites that are by far the best resource on their topic, but according to Google they shouldn't have so many links. It's as if I asked you for a page about Cuban chains and you said you don't have one because it had too many links. Or your greengrocer suddenly doesn't stock apples because his supplier now offers more than five different kinds, so he will never buy there again.