The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


Do not block User-agent ia_archiver because you think it is Alexa, as it makes it harder for the Internet Archive WayBack Machine to stay up-to-date

Posted by jpluimers on 2020/12/18

I observed that some sites block User-agent: ia_archiver in their robots.txt, thinking they would just block Alexa Internet: [Archive.is1/Archive.is2] Crawlers – Alexa Support (which oddly refuses to let the details be archived in the Internet Archive).

It will indeed block Alexa, but it also makes it harder for the Internet Archive WayBack Machine to stay up to date.

This sounds complicated, and it is. The Internet Archive originally wrote ia_archiver, but it was run by Alexa Internet, and it still feeds a lot of the Internet Archive. By now, the Internet Archive also runs its own bot, called Heritrix, which uses User-agent: archive.org_bot.

What makes it complicated is that most (maybe by now that is down to “a lot of”) content is still donated to the Internet Archive by Alexa Internet.

So if you block ia_archiver, then the Internet Archive still misses data.
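To see which crawler a given rule actually hits, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content and the example.com URL are made up for illustration, and this parser only approximates how a crawler might interpret the file; it says nothing about how ia_archiver or Heritrix are really implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only ia_archiver (illustrative content).
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def allowed_agents(robots_txt, agents, url="https://example.com/page.html"):
    """Return a dict mapping each user agent to whether it may fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# The two user agents named in this post.
result = allowed_agents(ROBOTS_TXT, ["ia_archiver", "archive.org_bot"])
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these assumptions, the rule blocks ia_archiver while archive.org_bot is still allowed, so whatever content reaches the Internet Archive through ia_archiver is cut off even though the Archive's own bot can still crawl the site.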


