I was wrong about robots.txt

KarlHeinzSchwuke@feddit.org · 3 months ago

I was wrong about robots.txt

General_Effort@lemmy.world · 3 months ago

What did he think a crawler is? Why was he surprised that not allowing companies to use his data led to them not using his data? Looks like he has another surprise coming when he notices that search engines no longer index his blog.

INeedMana@piefed.zip · 3 months ago

Huh. So in this case, the file actually is respected. Refreshing

ell1e@leminal.space · edit-2 3 months ago

Often it is respected, but the resulting problem is that platforms bundle their regular crawling with questionable AI scraping, which blackmails websites into feeding AI.

For example, Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI. I imagine LinkedinBot, given it’s Microsoft, will feed some other AI of theirs as well on top of the previews.

Until regulation steps in to require AI bots to separately ask for crawling permission, or to actually get a proper license for reuse of the contents, this situation isn’t going to improve.
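
To make the mechanics concrete: robots.txt rules are matched per user-agent token, so a site can only treat search and AI crawling differently if the operator sends a separate token for each. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleAIBot` and `SomeOtherBot` names, and the example.com URLs are all made up for illustration, and this only shows what a well-behaved crawler would conclude, not what any real crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow a search crawler, block a made-up AI crawler,
# and keep /private/ off limits for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text directly instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "https://example.com/blog/post"

# A polite crawler asks this question before every fetch.
for agent in ("Googlebot", "ExampleAIBot", "SomeOtherBot"):
    print(agent, parser.can_fetch(agent, page))
```

If one token is used for both search indexing and AI training, there is no rule in this file that can allow one and refuse the other, which is the bundling problem described above.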

General_Effort@lemmy.world · 3 months ago

Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI.

False.

cecilkorik@lemmy.ca · 3 months ago

Absolutely true. They’ll buy the data they want from some shitty crawler running from some data broker in some far-flung and lawless part of the world, hallucinate the actual source, and pretend they had no idea their “data partner” wasn’t respecting robots.txt if they have to, which they won’t ever have to do because it’s literally impossible to detect and prove and realistically unenforceable.

This is a company that removed its company motto of “Don’t be evil” because it found it too “limiting”. Don’t be naive.

ell1e@leminal.space · edit-2 3 months ago

See here: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ If you have a source that says it’s false, I’d be curious.

thedruid@lemmy.world · 3 months ago

So. If I can add something here for everyone’s benefit

No search engine really obeys robots.txt

Their publicly acknowledged crawlers do, but they have other crawlers that aren’t publicly known and that ignore the file.

Google knows every inch of your site, allowed or not.

See, just because a search engine says it doesn’t know, doesn’t mean it hasn’t crawled. It just doesn’t display the results, based on your settings.

ell1e@leminal.space · 3 months ago

And allowing the public crawler might also have it feed their AI: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/