The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 2,481 other followers

GitHub – pastpages/savepagenow: A simple Python wrapper for archive.org’s “Save Page Now” capturing service

Posted by jpluimers on 2021/03/11

This makes it way easier to save WayBack content:

[WayBack] GitHub – pastpages/savepagenow: A simple Python wrapper for archive.org’s “Save Page Now” capturing service

A poor-mans alternative is the below bash script from [WayBack] Saving of public Google+ content at the Internet Archive’s Wayback Machine by the Archive Team has begun : plexodus:

For Linux, MacOS / OSX, BSD, and other Unix-like operating systems (including Android with Termux, or Windows, with a Unix/Linux environment), the following script (I’ve saved this as archive-url) will archive the requested URL:

#!/bin/bash
# archive-url
# Archive selected URL at the Internet Archive

curl -s -I -H "Accept: application/json" "https://web.archive.org/save/${1}" |
grep '^x-cache-key:' | sed "s,https,&://,; s,\(${1}\).*$,\1,"

Save that to your execution path (I’ve chosen ~/bin, you might use /usr/local/bin or another location on your $PATH, and invoke as, say (again referring to the G+MM homepage):

$ archive-url https://plus.google.com/communities/112164273001338979772

If you have a list of URLs in a file (or pipelined from command output), you can request all of them to be archived in a simple bash loop. I’m using xargs here to run ten simultaneous requests from the file gplus-urllist:

cat gplus_urllist | while read url do xargs -I{} -P 10 archive-url {}; done

I’ve run this on over 10,000 URLs over a modest residential broadband connection in a hair over two hours.

Note that such requests trigger an archive by the Internet Archive from one of its archiving nodes, you’re not sending the page to the Archive yourself. In particular, archival from regions defaulting to another language may result in the Google+ site content (but not post or comments) being in a different language. I’ve frequently seen my pages turning up in Japanese, for instance.

–jeroen

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

 
%d bloggers like this: