The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Is it possible to deter AI scraping by providing an overly large robots.txt?

Posted by jpluimers on 2024/11/18

An idea: [Wayback/Archive] Jeroen Wiert Pluimers: “@ruurd @mcc … Maybe place useful content below 500 KiB and serve a file at least 1 GiB size?…” – Mastodon

@ruurd @mcc probably not, although Google Search limits them to 500 KiB.

developers.google.com/search/d

“Google currently enforces a robots.txt file size limit of 500 kibibytes (KiB). Content which is after the maximum file size is ignored. You can reduce the size of the robots.txt file by consolidating rules that would result in an oversized robots.txt file. For example, place excluded material in a separate directory.”

Maybe place useful content below 500 KiB and serve a file at least 1 GiB size?
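A minimal sketch of that idea in Python, untested against any real scraper: put the real rules at the top, well inside the first 500 KiB, then stream `#` comment lines until the file reaches the size you want. The function name `padded_robots_txt` and its parameters are made up for illustration. Whether a given AI scraper actually chokes on the padding, or just stops reading early like Google does, is an open question.

```python
from typing import Iterator

GOOGLE_LIMIT = 500 * 1024  # 500 KiB, the limit from the Google quote above


def padded_robots_txt(rules: str, total_size: int,
                      chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the real rules first, then '#' comment padding up to total_size bytes.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    head = rules.encode("utf-8")
    if not head.endswith(b"\n"):
        head += b"\n"
    if len(head) > GOOGLE_LIMIT:
        raise ValueError("the useful rules must fit in the first 500 KiB")
    if total_size < len(head):
        raise ValueError("total_size is smaller than the rules themselves")
    yield head

    # Every padding line starts with '#', so a parser treats it as a comment.
    line = b"#" + b" " * 78 + b"\n"  # 80 bytes per line
    block = line * max(1, chunk_size // len(line))
    remaining = total_size - len(head)
    while remaining > 0:
        piece = block[:remaining]  # the last piece may end in a partial comment line
        yield piece
        remaining -= len(piece)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    # Use a small size here; 1 GiB would be total_size=1024 ** 3.
    size = sum(len(chunk) for chunk in padded_robots_txt(rules, 1_000_000))
    print(size)
```

Because this is a generator, a web server could stream the chunks instead of keeping 1 GiB on disk or in memory. Keep in mind that you pay for the outgoing bandwidth too.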

It was in response to these earlier toots (which quote some very interesting links on when cookies are (dis)allowed; TL;DR: it depends on local regulations):

  1. [Wayback/Archive] mcc: “Just learned about “ads.txt”… will Google penalize my site if I give it an ads.txt but put nothing inside except the EICAR antivirus test string, do you think …” – Mastodon
  2. [Wayback/Archive] mcc: “I am constantly seeking ways to fit the EICAR antivirus test string into inappropriate places. Fit it in a QR code, print and frame it and hang it behind me during Twitch streams. Embed it in an NFC chip in my wrist. Build an impromptu transmitter and beam it into space. Swallow a smart pill which CRISPRs me to insert it into my DNA at the moment of my death, to confound attempts to identify my corpse…” – Mastodon
  3. [Wayback/Archive] mcc: “See also this previous discussion, and the helpful but buzzkilly response from Royce Williams https://mastodon.social/@mcc/112418994…” – Mastodon
    1. [Wayback/Archive] mcc: “Pondering configuring a web server to set the EICAR antivirus test string as a cookie on all page loads and never bother reading it back” – Mastodon
    2. [Wayback/Archive] mcc: “This would serve no purpose except to increase the ambient chaos level of the universe” – Mastodon
    3. [Wayback/Archive] mcc: “I wonder if GDPR technically requires you to ask consent to set a cookie if you don’t use it for tracking or otherwise read it after setting it” – Mastodon

      GDPR – Wikipedia

  4. [Wayback/Archive] mcc: “A thing that I realized a bit after this is nothing is stopping you from putting ASCII art of robots dancing into your robots.txt…” – Mastodon

    A thing that I realized a bit after this is nothing is stopping you from putting ASCII art of robots dancing into your robots.txt

    As long as the first character of each line is a # it will even be parsed properly by scrapers

    [Wayback/Archive] e765f9311444337e.png (384×274)
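The claim about `#` lines is easy to check with the `urllib.robotparser` module from the Python standard library. This is only a sketch: it shows how that one parser behaves, other scrapers may parse differently, and the ASCII robots, the `ExampleBot` user agent and the `example.com` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Dancing robots as comment lines, followed by one real rule.
ROBOTS_TXT = """\
#    [o_o]     [o_o]
#   /|___|\\   <|___|>
#     d b       d b
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The comment lines are skipped; only the Disallow rule has any effect.
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/page.html")
print(allowed_public, allowed_private)
```

With this parser the public page is allowed and the `/private/` page is not, so the art does not get in the way of the rules.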

Obligatory videos:

  1. [Wayback/Archive] Kraftwerk – The Robots HQ Audio – YouTube
  2. [Wayback/Archive] The Humans Are Dead – Full version – YouTube

Oh, and next to ads.txt and robots.txt, there is of course also security.txt – Wikipedia.

--jeroen

