Do not block User-agent ia_archiver because you think it is Alexa, as it makes it harder for the Internet Archive WayBack Machine to stay up-to-date
Posted by jpluimers on 2020/12/18
I observed some sites blocking the User-agent: ia_archiver in their robots.txt, thinking they would just block Alexa Internet: [Archive.is1/Archive.is2] Crawlers – Alexa Support (which oddly refuses to let the details be archived in the Internet Archive).
It will indeed block Alexa, but it also makes it harder for the Internet Archive WayBack Machine to stay up to date.
This sounds complicated, and it is. The Internet Archive originally wrote ia_archiver, but it was run by Alexa Internet, and it still feeds a lot of content to the Internet Archive. By now, the Internet Archive also runs its own bot, called Heritrix, which uses User-agent: archive.org_bot.
What makes it complicated is that most (by now maybe just “a lot of”) content is still donated to the Internet Archive by Alexa Internet.
So if you block ia_archiver, the Internet Archive misses out on that data.
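If the goal is to keep archiving working while still restricting other crawlers, a robots.txt can leave both archive bots untouched. A minimal sketch (the /private/ path is purely illustrative):

```
# Explicitly allow both archive crawlers: ia_archiver (run by Alexa,
# feeds the Internet Archive) and archive.org_bot (Heritrix, the
# Internet Archive's own crawler). An empty Disallow means "allow all".
User-agent: ia_archiver
Disallow:

User-agent: archive.org_bot
Disallow:

# Rules for all other crawlers go under the wildcard record:
User-agent: *
Disallow: /private/
```

Note that per-user-agent records override the wildcard record, so the two archive bots ignore the `User-agent: *` rules entirely.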
Further reading:
- [WayBack] Heritrix – Home Page
- [WayBack] Heritrix – Heritrix – IA Webteam Confluence
- [WayBack] Home · internetarchive/heritrix3 Wiki · GitHub and [WayBack] Heritrix 3 Documentation — Heritrix API Documentation
- Heritrix – Wikipedia
- [WayBack] Alexa’s Web and Site Audit Crawlers – Alexa Support
- [WayBack] ia_archiver user-agent / bot
- [WayBack] archive.org_bot user-agent / bot
- [WayBack] User Agent: archive.org_bot : Free Data : Free Download, Borrow and Streaming : Internet Archive
- [WayBack] Robots.txt meant for search engines don’t work well for web archives | Internet Archive Blogs
and:
- [WayBack] Robots.txt Disallow: 20 Years of Mistakes To Avoid | Hacker News (that thread is worth every minute reading)
- [WayBack] ROBOTS.TXT DISALLOW: 20 Years of Mistakes To Avoid | beu | blog
Example: [Archive.is] https://assarbad.net/robots.txt
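You can check what a given robots.txt means for each crawler with Python’s standard urllib.robotparser. The robots.txt content below is a hypothetical example that blocks ia_archiver, not the actual file from assarbad.net:

```python
from urllib import robotparser

# Hypothetical robots.txt that blocks ia_archiver only (assumed content):
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# ia_archiver (Alexa, which donates crawls to the Internet Archive)
# is shut out entirely:
print(rp.can_fetch("ia_archiver", "https://example.com/page"))      # False

# archive.org_bot (Heritrix) matches no record, so it falls back to
# the default and is still allowed:
print(rp.can_fetch("archive.org_bot", "https://example.com/page"))  # True
```

So even with such a block in place, Heritrix can still crawl the site; what is lost is the content that would otherwise have reached the Internet Archive via Alexa.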
–jeroen