Is it possible to deter AI scraping by providing an overly large robots.txt?
Posted by jpluimers on 2024/11/18
An idea: [Wayback/Archive] Jeroen Wiert Pluimers: “@ruurd @mcc … Maybe place useful content below 500 KiB and serve a file at least 1 GiB size?…” – Mastodon
@ruurd @mcc probably not, although Google Search limits them to 500 KiB.
https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#file-format
“Google currently enforces a robots.txt file size limit of 500 kibibytes (KiB). Content which is after the maximum file size is ignored. You can reduce the size of the robots.txt file by consolidating rules that would result in an oversized robots.txt file. For example, place excluded material in a separate directory.”
Maybe place useful content below 500 KiB and serve a file at least 1 GiB size?
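A minimal sketch of that idea, assuming the padding consists of comment lines (which robots.txt parsers skip) and that a crawler honouring the 500 KiB limit only looks at the first part of the file. The function name `build_robots_txt`, the rules, the `ExampleBot` user agent and the 1 MiB demo size are all made up for illustration. Whether any scraper would actually be deterred is untested here.

```python
import urllib.robotparser

GOOGLE_LIMIT = 500 * 1024  # 500 KiB, the limit quoted from Google's documentation


def build_robots_txt(rules: str, total_size: int) -> bytes:
    """Return robots.txt bytes: the real rules first, then comment-only padding."""
    head = rules.encode("utf-8")
    if not head.endswith(b"\n"):
        head += b"\n"
    if len(head) > GOOGLE_LIMIT:
        raise ValueError("rules alone exceed the 500 KiB limit")
    pad_line = b"# " + b"x" * 77 + b"\n"  # 80 bytes per padding line
    missing = max(0, total_size - len(head))
    lines_needed = -(-missing // len(pad_line))  # ceiling division
    return head + pad_line * lines_needed


# Hypothetical rules, only for this demonstration.
RULES = "User-agent: *\nDisallow: /private/\n"

# Demo size of 1 MiB; the toot suggests at least 1 GiB, which you would
# stream from disk or a generator rather than build in memory.
body = build_robots_txt(RULES, total_size=1024 * 1024)

# Simulate a crawler that ignores everything after the first 500 KiB.
parser = urllib.robotparser.RobotFileParser()
parser.parse(body[:GOOGLE_LIMIT].decode("utf-8", errors="ignore").splitlines())
print(len(body), parser.can_fetch("ExampleBot", "https://example.com/private/page"))
```

Because the rules sit at the very top, a well-behaved crawler that truncates the file still sees them. Only a client that insists on downloading everything pays for the padding.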
It was in response to these earlier toots (with quotes of some very interesting links on when cookies are (dis)allowed – TL;DR: it depends on local regulations):
- [Wayback/Archive] mcc: “Just learned about “ads.txt”… will Google penalize my site if I give it an ads.txt but put nothing inside except the EICAR antivirus test string, do you think …” – Mastodon
- [Wayback/Archive] mcc: “I am constantly seeking ways to fit the EICAR antivirus test string into inappropriate places. Fit it in a QR code, print and frame it and hang it behind me during Twitch streams. Embed it in an NFC chip in my wrist. Build an impromptu transmitter and beam it into space. Swallow a smart pill which CRISPRs me to insert it into my DNA at the moment of my death, to confound attempts to identify my corpse…” – Mastodon
- [Wayback/Archive] mcc: “See also this previous discussion, and the helpful but buzzkilly response from Royce Williams
https://mastodon.social/@mcc/112418994…” – Mastodon
- [Wayback/Archive] mcc: “Pondering configuring a web server to set the EICAR antivirus test string as a cookie on all page loads and never bother reading it back” – Mastodon
- [Wayback/Archive] mcc: “This would serve no purpose except to increase the ambient chaos level of the universe” – Mastodon
- [Wayback/Archive] mcc: “I wonder if GDPR technically requires you to ask consent to set a cookie if you don’t use it for tracking or otherwise read it after setting it” – Mastodon
- [Wayback/Archive] Hugo Mills: “@mcc @mcc Don’t think so. If there’s no PII or tracking information, then I think it’s outside the scope of GDPR.” – Mastodon 🐘
- [Wayback/Archive] Yuri Schimke: “@darkling @mcc yep, cookies fo…” – Android Dev Social
@darkling @mcc yep, cookies for providing the requested service, or for transmitting data are ok.
https://www.cookieyes.com/blog/cookie-law/
It’s basically against tracking.
- [Wayback/Archive] Mekki: “@mcc This was debated in Canada …” – Mastodon
@mcc This was debated in Canada waaaaay back when PIPEDA was enacted and the then Privacy Commissioner decided that cookies in this context WERE a problem, much to the disappointment of privacy advocates who wanted a focus on actual issues rather than generic “all cookies are bad” fears. I feel like it might have been redressed more recently, but less sure. Looks like latest guidance is from 2019: https://www.priv.gc.ca/en/privacy-topics/t
[Wayback/Archive] Cookies – Office of the Privacy Commissioner of Canada
- [Wayback/Archive] Note | ぷにすきー
@mcc@mastodon.social alright so the requirement to obtain prior consent for cookies is, as far as I know, not set forward by the GDPR, but by the ePrivacy EU policy, so that’s where you wannya look
https://en.wikipedia.org/wiki/EPrivacy_Directive
Read the text here https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:02002L0058-20091219
The relevant part is article 5(3) which states
the storing of information, or the gaining of access to information already stored, in the terminal equipment of a subscriber or user is only allowed on condition that the subscriber or user concerned has given his or her consent
So setting a cookie (yes that’s what they mean by “the storing of information”) requires consent, even if you don’t read it.
Now you could argue though that if you set the same cookie to every visitor, you aren’t storing information are you? What information is there in a cookie that is the same for every visitor?
The same paragraph then sets forward exemptions for technically necessary cookies as follows.
This shall not prevent any technical storage or access for the sole purpose of carrying out the transmission of a communication over an electronic communications network, or as strictly necessary in order for the provider of an information society service explicitly requested by the subscriber or user to provide the service.
if your service exists for the sole purpose of setting a cookie, then isn’t setting the cookie required for your service
(Probably not, this only makes sense if you know the service was requested by the user)
– Anita
- [Wayback/Archive] Note | ぷにすきー
@mcc@mastodon.social but now hear me out, so some anyalytics cookies are actually allowed without consent. You’re gonnya call me crazy, but it’s an unresolved issue. Some regulatory agencies have recognised that some amounts of anyalytics is required to provide services, and therefore, cookies for audience measurement could fall under that.
So if you used your cookie to, say, measure unique visitors vs. total requests, that might be fair game. (Because you’d assume a request without the cookie is a new visitor)
The CNIL, France’s online privacy regulatory agency, has come up with some guidelines which may serve as a basis. https://www.cnil.fr/en/sheet-ndeg16-use-analytics-your-websites-and-applications
Under their guidelines, you’d need to give the user a way to opt out though. Maybe a button to clear the cookie.
Some anyalytics software has even been certified by the CNIL to be used without a consent banner. They are listed here, but the page is in French. https://www.cnil.fr/fr/cookies-et-autres-traceurs/regles/cookies-solutions-pour-les-outils-de-mesure-daudience (free software solution Matomo is on the list by the way!!!)
Now apparently the Spanish data agency disagrees and believes that anyalytics cookies systematically require consent. So that’s an ongoing source of mystery regarding EU privacy law.
- [Wayback/Archive] Note | ぷにすきー
@mcc@mastodon.social to be clear, not ALL anyalytical cookies are allowed without consent, but maybe some bare minimal ones are
- [Wayback/Archive] Royce Williams: “@mcc Ha! Chaos cookies! (neu…” – Infosec Exchange
Ha! Chaos cookies!
(neurodivergent “answer joke as if serious” mode activated …)
Kidding aside, forcing the EICAR string into a detectable position in a cookie is trickier than it may seem.
First, per the EICAR spec, the ‘magic’ string has to be at the beginning of the file to be considered “detection-worthy”. So it would have to be the first ‘name’ in a cookie’s name-value pairs.
Second, EICAR contains characters not allowed in the ‘name’ field, so most browsers would probably reject it:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#cookie-namecookie-value
Third, even if you did manage to create a cookie with a ‘Name’ field containing EICAR … many browsers store cookies in a sqlite or other file, such that it would never be the beginning of that file on disk.
(I do think it’s worth trying, though – it could flush out some interesting coverage gaps!)
[Wayback/Archive] Set-Cookie: <cookie-name>=<cookie-value> – HTTP | MDN
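Royce's second point can be checked with a few lines of Python. This is only a sketch: it assumes the cookie-name rule from RFC 6265 (an HTTP "token", so no control characters, spaces or separator characters) and the 68-byte test string as published by EICAR. The helper name `invalid_cookie_name_chars` is made up for illustration, and real browsers may be more lenient or stricter than the RFC.

```python
# RFC 6265 defines a cookie name as an HTTP "token": visible ASCII
# excluding these separator characters (plus space and control characters).
SEPARATORS = set('()<>@,;:\\"/[]?={} \t')

# The 68-byte EICAR test string, as published by EICAR.
EICAR = r"X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*"


def invalid_cookie_name_chars(name: str) -> list[str]:
    """Return the distinct characters in name that a cookie name may not contain."""
    bad = []
    for ch in name:
        outside_visible_ascii = ord(ch) < 0x21 or ord(ch) > 0x7E
        if (ch in SEPARATORS or outside_visible_ascii) and ch not in bad:
            bad.append(ch)
    return bad


print(invalid_cookie_name_chars(EICAR))
```

The list is not empty, which matches the toot: the string cannot be used unmodified as a cookie name under the RFC grammar.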
- [Wayback/Archive] mcc: “A thing that I realized a bit after this is nothing is stopping you from putting ASCII art of robots dancing into your robots.txt…” – Mastodon
A thing that I realized a bit after this is nothing is stopping you from putting ASCII art of robots dancing into your robots.txt
As long as the first character of each line is a # it will even be parsed properly by scrapers
- [Wayback/Archive] Ari [APz] Sovijärvi: “@mcc I was bored once, so so here’s one of mine:…” – Mastodon Games

ASCII art of Crow and Tom Servo MST3K-characters. Crow says: “Say, Tom, you think this is a valid robots.txt” Tom Servo says: “It sure is. It has robots, right, Crow?” Below the robots it says: “What did you expect?”
- [Wayback/Archive] MM0KHR: “@mcc “HUMANS ARE DEAD” “BINARY SOLO” …” – Mastodon.Radio
@mcc “HUMANS ARE DEAD”
“BINARY SOLO!”
(my instant earworm must be different than some I guess)
[Wayback/Archive] The Humans Are Dead – Full version – YouTube
- [Wayback/Archive] Glenn Fleishman: “@mcc I redirect /wp-admin/ to the FBI cybercrime page. …” – TWiT.social
- [Wayback/Archive] Jernej Simončič: “@glennf @mcc Mine redirects to an internal URL that serves a large HTML file at ~9 bytes/second (makes me wonder what’d happen if I posted the link here – I know that some IRC bots that tried to get the page title timed out of the servers they were on).…” – Infosec Exchange
- [Wayback/Archive] EndlessMason: “@jernej__s You could, hypothet…” – Hachyderm.io
@jernej__s
You could, hypothetically, send a very large gzipped document that contains only <. That takes very few bytes to send since it compresses real nice… Until the bot has to un-gzip it to parse the document and discovers 20Tb of <’s
You could maybe try sneaking a billion laughs attack in there too, since maybe the bot will unzip 30Tb of xhtml and then do xml on it too
- [Wayback/Archive] ruurd@mastodon.social: “@mcc @wiert hmmmm would a 10MB robots.txt kill a scraper?…” – Mastodon
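To get a feel for the gzip idea from EndlessMason's toot, here is a small sketch using only Python's standard library. It compresses a harmless 10 MiB run of `<` (not terabytes) to show how little has to be sent, and it shows how a cautious client could cap decompression. The names `compressed_run` and `bounded_gunzip` are made up for illustration, and whether any particular bot decompresses without such a cap is an assumption, not something tested here.

```python
import gzip
import zlib


def compressed_run(char: bytes, count: int) -> bytes:
    """Gzip a run of one repeated byte; highly repetitive input shrinks a lot."""
    return gzip.compress(char * count)


def bounded_gunzip(data: bytes, limit: int) -> bytes:
    """Decompress gzip data, refusing to produce more than limit bytes."""
    # Adding 16 to the window size tells zlib to expect gzip framing.
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    # Ask for at most one byte more than the limit, so oversized output
    # is detected without ever inflating the whole stream.
    out = decomp.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError("decompressed size exceeds limit")
    return out


ORIGINAL_SIZE = 10 * 1024 * 1024  # 10 MiB demo size
payload = compressed_run(b"<", ORIGINAL_SIZE)
print(len(payload), "compressed bytes for", ORIGINAL_SIZE, "bytes of '<'")
```

A scraper that reads the response with something like `bounded_gunzip` gives up early, so the trick would only hurt clients that inflate whatever they are given.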
Obligatory videos:
- [Wayback/Archive] Kraftwerk – The Robots HQ Audio – YouTube
- [Wayback/Archive] The Humans Are Dead – Full version – YouTube
Oh, and next to ads.txt and robots.txt, there is of course also security.txt – Wikipedia.
--jeroen