Archive for the ‘InternetArchive’ Category

Overview of Client Libraries · Internet Archive

Posted by jpluimers on 2021/09/14

Besides manual upload at [Archive.is] Upload to Internet Archive, there are also automated ways of uploading content.

One day I need this to archive pages or sites into the WayBack machine: [WayBack] Overview of Client Libraries · Internet Archive (most of which is Python based):

Posted in Bookmarklet, Development, Internet, InternetArchive, Power User, Python, Scripting, Software Development, WayBack machine, Web Browsers

GitHub – jjjake/internetarchive: A Python and Command-Line Interface to Archive.org

Posted by jpluimers on 2021/06/16

On my list of things to play with: [WayBack] GitHub – jjjake/internetarchive: A Python and Command-Line Interface to Archive.org.

Via:

[WayBack] The Internet Archive Python Library — Internet Archive item APIs 1.8.5 documentation
[WayBack] Command-Line Interface — Internet Archive item APIs 1.8.5 documentation
[WayBack] Quickstart — Internet Archive item APIs 1.8.5 documentation, including:
Configuring

Certain functionality of the internetarchive Python library requires your archive.org credentials. Your IA-S3 keys are required for uploading, searching, and modifying metadata, and your archive.org logged-in cookies are required for downloading access-restricted content and viewing your task history. To automatically create a config file with your archive.org credentials, you can use the ia command-line tool:
```
$ ia configure
Enter your archive.org credentials below to configure 'ia'.

Email address: user@example.com
Password:

Config saved to: /home/user/.config/ia.ini
```
Your config file will be saved to $HOME/.config/ia.ini, or $HOME/.ia if you do not have a .config directory in $HOME. Alternatively, you can specify your own path to save the config to via ia --config-file '~/.ia-custom-config' configure.

If you have a netrc file with your archive.org credentials in it, you can simply run ia configure --netrc. Note that Python’s netrc library does not currently support passphrases or passwords with spaces in them, so these are not currently supported here either.
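The location rule quoted above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not code from the internetarchive library; the function name default_ia_config_path is made up for this example.

```python
from pathlib import Path


def default_ia_config_path(home: Path) -> Path:
    """Return the config path described in the Quickstart text.

    Hypothetical helper: it mirrors the documented rule
    ($HOME/.config/ia.ini, or $HOME/.ia when there is no .config
    directory) and does not call the internetarchive library itself.
    """
    config_dir = home / ".config"
    if config_dir.is_dir():
        return config_dir / "ia.ini"
    return home / ".ia"


if __name__ == "__main__":
    # Show where the config would be expected for the current user.
    print(default_ia_config_path(Path.home()))
```

Passing the home directory as a parameter keeps the sketch easy to try against a scratch directory instead of your real $HOME.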

–jeroen

Posted in Development, Internet, InternetArchive, Power User, Python, Scripting, Software Development, WayBack machine

Check if this still happens: some Twitter content in the WayBack machine gets a slash in the URL removed during rendering on Chrome

Posted by jpluimers on 2021/06/11

From my research list; check if this still happens: [WayBack] Saving Twitter content in the WayBack archive: the fully loaded page has a wrong trailing URL (missing the second slash before the authority) · GitHub

Visited https://twitter.com/MarkGraham

Saved it using https://web.archive.org/save/https://twitter.com/MarkGraham

Waited for the save to complete and the page to fully load and got https://web.archive.org/web/20190607081047/https:/twitter.com/MarkGraham

Observed that the trailing part, https:/twitter.com/MarkGraham, is no longer a valid URL: it is missing the second slash before the authority (see https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Generic_syntax)

This might be a Twitter.com thing:

visit https://web.archive.org/web/*/https://twitter.com/MarkGraham

observe the latest saved URL is correct https://web.archive.org/web/20190607081047/https://twitter.com/MarkGraham

click it, then end up at the wrong URL: https://web.archive.org/web/20190607081047/https:/twitter.com/MarkGraham
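A small Python sketch can detect and repair the mangled form shown in the steps above. It only rewrites the URL string; it does not explain or prevent the mangling, and the helper names is_mangled and repair_wayback_url are invented for this example.

```python
import re

# Matches a Wayback URL whose embedded original URL has only one slash
# after the scheme, e.g. ".../web/20190607081047/https:/twitter.com/...".
_MANGLED = re.compile(
    r"^(https?://web\.archive\.org/web/[^/]+/https?:)/(?!/)"
)


def is_mangled(url: str) -> bool:
    """Return True when the embedded URL is missing its second slash."""
    return _MANGLED.match(url) is not None


def repair_wayback_url(url: str) -> str:
    """Restore the missing slash; leave well-formed URLs unchanged."""
    return _MANGLED.sub(r"\1//", url, count=1)


if __name__ == "__main__":
    bad = "https://web.archive.org/web/20190607081047/https:/twitter.com/MarkGraham"
    print(is_mangled(bad))
    print(repair_wayback_url(bad))
```

Running the repair on an already correct URL returns it unchanged, so it is safe to apply to every copied link.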

Notes:

I have only tested this with my Chrome configurations on various machines (both regular and anonymous tabs) over at least a year; I need to figure out what happens when using different browsers.

It does not always happen.

Via: [WayBack] Jeroen Pluimers on Twitter: “I understand that the sites themselves pay a big role in this. That’s why I have the mangling of URLs that sometimes happens on my research list. I made this quick summary: …”

–jeroen

Posted in Internet, InternetArchive, Power User, SocialMedia, Twitter, WayBack machine

Contact for when WayBack internet archival fails to grab content

Posted by jpluimers on 2021/06/07

For my link archive, some tweets. [WayBack] Mark Graham is the person to contact in case archiving a link in the WayBack machine fails.

These are the steps for my link archival:

check if it saves and renders with the WayBack machine, if so, copy the saved URL and the original URL
check if it saves and renders with archive.is, if so, copy the saved URL and the original URL
if neither saved, then use the original URL and link text, but note it was unsavable; otherwise prepend the original URL and link text with [WayBack] or [Archive.is] containing the saved URL
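The formatting part of these steps could be sketched as below. Several things here are assumptions for the sake of illustration: that the output is HTML, that a WayBack copy is preferred when both archives succeeded, and the "(unsavable)" wording. The function name format_archived_link is hypothetical too.

```python
from html import escape
from typing import Optional


def format_archived_link(
    original_url: str,
    link_text: str,
    wayback_url: Optional[str] = None,
    archive_is_url: Optional[str] = None,
) -> str:
    """Build the link markup for one archived (or unsavable) URL."""
    original = f'<a href="{escape(original_url)}">{escape(link_text)}</a>'
    if wayback_url:
        # Assumption: prefer the WayBack copy, since it is checked first.
        return f'[<a href="{escape(wayback_url)}">WayBack</a>] {original}'
    if archive_is_url:
        return f'[<a href="{escape(archive_is_url)}">Archive.is</a>] {original}'
    # Neither archive could save the page: keep the original and say so.
    return f"{original} (unsavable)"


if __name__ == "__main__":
    print(format_archived_link(
        "https://example.com/a",
        "Example",
        wayback_url="https://web.archive.org/web/2021/https://example.com/a",
    ))
```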

Reporting history gist: https://gist.github.com/jpluimers/6115b3cd6dab568ebd1c10ebddfaf140

–jeroen

Posted in Internet, InternetArchive, Power User, WayBack machine

Running ArchiveTeam Warrior version 3.2 on ESXi

Posted by jpluimers on 2021/05/05

A while ago I wrote about Helping the WayBack ArchiveTeam team: running their Warrior virtual appliance on ESXi.

Since that post was scheduled before my cancer treatment started and got posted while I was still recovering from it, I missed that version 3.2 of the [Wayback] ArchiveTeam Warrior appliance appeared in the [Wayback] Releases · ArchiveTeam/Ubuntu-Warrior at [Wayback] Release v3.2 · ArchiveTeam/Ubuntu-Warrior. You can download it from these places:

These two sites have not yet been updated, so they contain the older versions:

The source code now has been moved three times:

Posted in *nix, *nix-tools, ArchiveTeamWarrior, Cloud, Containers, diff, Docker, ESXi5, ESXi5.1, ESXi5.5, ESXi6, ESXi6.5, ESXi6.7, ESXi7, Infrastructure, Internet, InternetArchive, Kubernetes (k8n), KVM Kernel-based Virtual Machine, patch, Power User, VirtualBox, Virtualization, VMware, VMware ESXi, VMware Workstation, WayBack machine

Interactive tool to record a web-site, then archive it as WARC: GitHub – webrecorder/webrecorder-user-guide: webrecorder user guide and glossary

Posted by jpluimers on 2021/03/26

Learned about this in the fall of G+: [WayBack] GitHub – webrecorder/webrecorder-user-guide: webrecorder user guide and glossary

Back then, fully automated tools were easier.

So it is on my list of things to try one day for smaller projects.

[WayBack] Webrecorder | Homepage: Create high-fidelity, interactive web archives of any web site you browse.
[WayBack] GitHub – webrecorder/webrecorder: Web Archiving For All!

From the glossary:

Quick Start

Enter a URL in the box in the center of the screen labeled ‘URL to capture’.

Press the ‘start’ button (look down and to the right of the box where you entered the URL).

Interact with the web page that loads so Webrecorder can capture the content displayed on this page. To collect audio or video from a page be sure to press ‘play’ so the file will load into the browser.

Continue to visit and browse the pages you would like to capture. Each page you view will be included in your capture session. Note: you will be capturing the contents of each page you visit but will not automatically obtain pages that are linked to on the pages you collect (hyperlinks).

To end your capture session hover over the ‘Capture’ button in the upper left corner of the screen so it changes to read ‘Stop’ then click that button.

Your capture will then be browsable. Note: the capture will not ‘replay’ like a linear recording but instead be an interactive copy of the pages you have collected.

If you are a logged-in user, this session will be saved to your account automatically. If you are not logged in to an account, you can sign up for an account or log in to your existing account to save the collection after you create it. If you do not log in you can still download your collection for a limited time (approximately 90 minutes from when you stop your recording session).

–jeroen

Posted in Internet, InternetArchive, Power User

Helping the WayBack ArchiveTeam team: running their Warrior virtual appliance on ESXi

Posted by jpluimers on 2021/03/19

The [WayBack] Archiveteam helps the WayBack machine with feeding new content.

You can help that team by running one or more “warrior” virtual machine instances. The VM is distributed as a virtual appliance in an ova file according to the Open Virtualization Format.

That format sounds more generic than it actually is, so the (at the time of writing) archiveteam-warrior-v3-20171013.ova file at [WayBack] Index of /downloads/warrior3/ was created for VirtualBox.

This means that running it on VMware ESXi or VMware vSphere takes a few steps: patching it, then uploading it to your VMware host.

Since I might want to run the appliance in multiple places or as multiple instances, I wanted a ready-to-go solution, so I created a git repository with both the patch instructions and the update at [WayBack] wiert.me / public / ova / archiveteam-warrior-v3-20171013.ESXi · GitLab.

Posted in ArchiveTeamWarrior, Cloud, Containers, Docker, Infrastructure, Internet, InternetArchive, Kubernetes (k8n), Power User, WayBack machine

Archiving Google Product Forums URLs

Posted by jpluimers on 2020/11/13

Archiving Google Product Forum URLs is a pain in the butt for a couple of reasons:

By default the URLs in your browser present themselves like productforums.google.com/forum/#!topic/chrome/rXamwvFGWXs, with a piece of JavaScript wizardry to show the actual content after the browser page initially displays
The search engine shows them like productforums.google.com/d/topic/chrome/rXamwvFGWXs, which The WayBack Machine on the Internet Archive cannot archive with web.archive.org/save/https://productforums.google.com/d/topic/chrome/rXamwvFGWXs because it mis-interprets [WayBack] productforums.google.com/robots.txt
Archive.is can save them with a URL like archive.is/?run=1&url=https://productforums.google.com/d/topic/chrome/rXamwvFGWXs

So the trick for saving is:

Get from the /forum/#!topic/ based URL to the /d/topic/ based one
Put it after the archive.is/?run=1&url=, then save
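The two steps of the trick can be sketched in Python. The function names are made up for this example, and the https:// scheme in front of archive.is is an assumption, since the URL above is shown without one.

```python
from urllib.parse import urlsplit

# Assumption: archive.is is reachable over https; the post shows no scheme.
ARCHIVE_IS_PREFIX = "https://archive.is/?run=1&url="


def forum_url_to_topic_url(url: str) -> str:
    """Turn a /forum/#!topic/... URL into its /d/topic/... form."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!topic/"):
        raise ValueError(f"not a /forum/#!topic/ URL: {url}")
    # Drop the leading "!" so the fragment becomes "topic/<forum>/<id>".
    topic_path = parts.fragment[1:]
    return f"{parts.scheme}://{parts.netloc}/d/{topic_path}"


def archive_is_save_url(forum_url: str) -> str:
    """Build the archive.is URL that saves the /d/topic/ form."""
    return ARCHIVE_IS_PREFIX + forum_url_to_topic_url(forum_url)


if __name__ == "__main__":
    print(archive_is_save_url(
        "https://productforums.google.com/forum/#!topic/chrome/rXamwvFGWXs"
    ))
```

The target URL is appended without percent-encoding, matching the form shown in the list above.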

–jeroen

Posted in Conference Topics, Conferences, Event, Internet, InternetArchive, Power User, WayBack machine

WayBack machine now rate limits your requests and blocks if you go over it

Posted by jpluimers on 2019/10/19

Got this a while ago while saving a bunch of links for my blog; unfortunately, nobody at the email address responded to my request for more information:

Too Many Requests

We are limiting the number of URLs you can submit to be Archived to the Wayback Machine, using the Save Page Now features, to no more than 15 per minute.

If you submit more than that we will block Save Page Now requests from your IP number for one day.

Please feel free to write to us at info@archive.org if you have questions about this. Please include your IP address and any URLs in the email so we can provide you with better service.

I wish there was a queue service that makes you wait longer, but does eventually fulfill the request.
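Until such a service exists, a client-side throttle can at least keep your own submissions within the quoted limit. Below is a minimal sliding-window sketch in Python; the class name SaveThrottle is invented, the 15-per-minute figure comes from the message above, and whether staying at that rate really avoids the block is an assumption.

```python
from collections import deque


class SaveThrottle:
    """Sliding-window limiter for Save Page Now submissions (sketch)."""

    def __init__(self, max_requests: int = 15, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._sent = deque()  # timestamps of recent submissions

    def delay_before_next(self, now: float) -> float:
        """Return how many seconds to wait before submitting at `now`."""
        # Forget submissions that have left the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        # Window is full: wait until the oldest submission expires.
        return self._sent[0] + self.window_seconds - now

    def record(self, now: float) -> None:
        """Remember that a submission was made at `now`."""
        self._sent.append(now)
```

In a real loop you would call time.monotonic() for now, sleep for the returned delay, submit the URL, and then call record(). Taking the clock as a parameter keeps the sketch easy to check without waiting.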

–jeroen

Posted in Internet, InternetArchive, Power User, WayBack machine

When archiving in the WayBack machine returns error 400: clear your cookies

Posted by jpluimers on 2019/08/16

When archiving pages in the WayBack machine, it still managed to set truckloads of cookies, despite Privacy Badger having been set to “save no cookies”.

So I used the Chrome settings in chrome://settings/content/cookies to disable cookies and now everything is fine.

–jeroen

Posted in Chrome, Google, Internet, InternetArchive, Power User, Privacy, WayBack machine

The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests
