Helping the WayBack ArchiveTeam team: running their Warrior virtual appliance on ESXi
Posted by jpluimers on 2021/03/19
The [WayBack] ArchiveTeam helps the WayBack Machine by feeding it new content.
You can help that team by running one or more “warrior” virtual machine instances. The VM is distributed as a virtual appliance in an ova file according to the Open Virtualization Format.
That format sounds more generic than it actually is, so the (at the time of writing) archiveteam-warrior-v3-20171013.ova
file at [WayBack] Index of /downloads/warrior3/ was created for VirtualBox.
This means that running it on VMware ESXi or VMware vSphere takes a few steps: patching it, then uploading it to your VMware host.
Since I might want to run the appliance in multiple places, or run multiple instances, I wanted a ready-to-go solution, so I created a git repository with both the patch instructions and the patched appliance at [WayBack] wiert.me / public / ova / archiveteam-warrior-v3-20171013.ESXi · GitLab.
Both of them are here:
- [WayBack] esxi-patch-steps.md · master · wiert.me / public / ova / archiveteam-warrior-v3-20171013.ESXi · GitLab
- [WayBack] archiveteam-warrior-v3-20171013.ESXi · master · wiert.me / public / ova / archiveteam-warrior-v3-20171013.ESXi · GitLab
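The patch itself is small. Roughly, the steps in the linked instructions amount to the sketch below. Note the assumptions: the `virtualbox-2.2` → `vmx-07` substitution and the exact file names inside the OVA are what this particular appliance version uses, so double-check them against the actual `.ovf` before repacking.

```shell
# Sketch of patching a VirtualBox-targeted OVA for ESXi -- verify the
# VirtualSystemType string and file names against your own .ovf first.
# An OVA is just a tar archive: an .ovf descriptor, an .mf manifest with
# SHA1 checksums, and one or more .vmdk disks.
patch_ova_for_esxi() {
  base="$1"   # e.g. archiveteam-warrior-v3-20171013

  # 1. unpack the OVA
  tar -xf "$base.ova"

  # 2. retarget the virtual hardware from VirtualBox to ESXi
  sed -i 's/virtualbox-2.2/vmx-07/' "$base.ovf"

  # 3. the manifest lists a SHA1 per file; refresh the entry for the edited .ovf
  sum=$(sha1sum "$base.ovf" | cut -d' ' -f1)
  sed -i "s|^SHA1($base.ovf).*|SHA1($base.ovf)= $sum|" "$base.mf"

  # 4. repack; the .ovf descriptor must be the first file in the archive
  tar -cf "$base.ESXi.ova" "$base.ovf" "$base.mf" "$base-disk1.vmdk"
}
```

The resulting `.ESXi.ova` is what the repository above ships ready-made.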
Thanks to these for both the patch and for getting me to write instructions:
- [WayBack] www.archiveteam.org/index.php?title=Deploy_OVA_to_VMware_ESXi
- [WayBack] edvoncken.net/2014/08/archiveteam-warrior-on-esxi
- [WayBack] gist.github.com/Kipari/c32192de52c2a86d56ee0d67472d4dc6
I raised the memory of each instance to 2 gibibytes (from the default 400 mebibytes) and the CPU to 3.6 GHz (from the default single core). This was more than enough: they stayed below 60% of that maximum.
That way each instance could run at the maximum allowed settings in the warrior web interface:
- 6 concurrent items
- 3 rsync threads
Together they fluctuated at about 50 mebibit/second of combined downstream and upstream throughput: about half of one of my home fiber connections.
More on the warrior
The download above is a virtual appliance that deploys as an Alpine Linux instance with an embedded docker container.
You can scale it up either by running multiple virtual appliances (the easiest way for just a few) or many orchestrated docker instances (if you already have a docker infrastructure running: grab the [Archive.is] archiveteam/warrior-dockerfile – Docker Hub).
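For the docker route, a small loop can start several warriors side by side. The container names and the 8001+n localhost port scheme below are my own choices, not anything official:

```shell
# Launch N warrior containers, each with its own name and its own
# localhost port for the web interface (8001, 8002, ...).
start_warriors() {
  count="$1"
  i=1
  while [ "$i" -le "$count" ]; do
    docker run --detach \
      --name "warrior$i" \
      --publish "127.0.0.1:$((8000 + i)):8001" \
      --restart always \
      archiveteam/warrior-dockerfile
    i=$((i + 1))
  done
}

# start_warriors 3   # warrior1..warrior3 on ports 8001..8003
```

Tearing them down again is `docker rm -f warrior1 warrior2 warrior3`.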
More on that, and how the warriors fit in the archiving scheme:
- [WayBack] Deathwatch – Archiveteam: The Deathwatch or Watchlist is a central indicator of websites and networks that are shutting down and serves as an indicator of what happened to particular sites that shut down quickly.
- [WayBack] Alive… OR ARE THEY – Archiveteam
- [WayBack] ArchiveTeam Warrior – Archiveteam
- [WayBack] Tracker – Archiveteam
- [WayBack] Dev – Archiveteam
- [WayBack] Dev/Infrastructure – Archiveteam
- [WayBack] Dev/Source Code – Archiveteam
- [WayBack] Dev/Warrior – Archiveteam
The virtual machine is self-updating. It does the following:
- Start the virtual machine
- Linux boots
- boot.sh downloads and launches /root/startup.sh
- startup.sh prepares and runs a docker container with the warrior runner
- Point your web browser to http://localhost:8001 and go.
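The chain of boot.sh and startup.sh could be sketched roughly like this. To be clear: these are not the appliance's actual scripts; the download URL, paths and function names are placeholders illustrating the sequence described above.

```shell
# Rough sketch of the warrior's self-update chain -- NOT the real scripts.
# The URL and destination path are placeholders.

# boot.sh: fetch the current startup script, then run it
boot() {
  url="$1"    # placeholder for wherever the warrior publishes startup.sh
  dest="$2"   # the appliance uses /root/startup.sh
  wget -q -O "$dest" "$url"
  sh "$dest"
}

# startup.sh: pull the warrior image and run it, exposing the web interface
startup() {
  docker pull archiveteam/warrior-dockerfile
  docker run --publish 8001:8001 --restart always archiveteam/warrior-dockerfile
}
```

Because the startup script is fetched on every boot, rebooting the VM is enough to pick up appliance-side updates.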
- [WayBack] Dev/New Project – Archiveteam
- [WayBack] Dev/Seesaw – Archiveteam
- [WayBack] Dev/Tracker – Archiveteam
- [WayBack] Dev/Staging – Archiveteam (Setting up Rsync and Megawarc Factory)
- [WayBack] Dev/Project Management – Archiveteam
Docker
If you have Docker installed, the following command will help preserve…:
docker run --publish 127.0.0.1:8001:8001 --restart always archiveteam/warrior-dockerfile
https://twitter.com/SteveMcLaugh/status/1112433277359529984 – Edward Morbius – Google+
Via:
- [WayBack] If you have Docker installed, the following command will help preserve…
- [WayBack] Steve McLaughlin🌾💾 on Twitter : “If you have Docker installed, the following command will help preserve Google Plus:
docker run --publish 127.0.0.1:8001:8001 --restart always archiveteam/warrior-dockerfile
“, mentioning:
- [WayBack] Jason Scott on Twitter: “The @archiveteave mirroring of Google Plus has now surpassed 1.2 PETABYTES (1,200 Terabytes). We need to finish it off before April 2 when they close everything down. Please watch the fun at https://t.co/LSAIEkMEbv and PLEASE RUN YOUR OWN WARRIOR FOR US. We need it badly.… “
–jeroen
PS
I wrote this when Google+ (G+) was a few days from being shut down, but there are always sites shutting down that need archiving.
The ArchiveTeam has a site listing warrior projects, but since the team is better at archiving and tool development than at writing wiki content, the easiest option is to have your warrior instance run the “ArchiveTeam’s Choice” project: it will work on the most pressing downloads, and continue with the next project after one finishes.
On the Google+ and the G+ archiving project
- [WayBack] Archive Team: Google Plus (or Minus) : Free Web : Free Download, Borrow and Streaming : Internet Archive
- [WayBack] Saving of public Google+ content at the Internet Archive’s Wayback Machine by the Archive Team has begun : plexodus
There are tools to assist with rebuilding websites based on Wayback Machine archives. Whether or not these will support Google+ user, Page, Collection, or Community accounts is not presently clear, though we’ll try to provide information as it becomes available.
Tools:
- [WayBack] Google+ tracker – #googleminus – Donate at https://archive.org/donate/ for hosting the archives Dashboard
- [WayBack] Google+ – Archiveteam
- [WayBack] PLExodus: The Beginning is Near – Google+
- [WayBack] PlexodusWiki
- [WayBack] Google+ Mass Migration community on G+ — Helping coordinate the exodus : googleplus
- Google+ – Wikipedia
- [WayBack] Where to get Google+ community, profile or page URL?
- [WayBack] “Google just started mass banning/limiting Archive Team downloads” | Hacker News
- [WayBack] Archive Team Tracker Charts – Grafana: project googleplus
- [WayBack] Thread by @jpluimers: “seeing all kinds of 302 in the @at_warrior from my @xs4all IP address like “Item users:smap/44717/oc: Step 4 of 8 10=302 plus.google.com/112 […]”
Some stats:
The more interesting project tracker, showing updates in real time, is: http://tracker.archiveteam.org/googleplus/
Note that this shows only 1/50th of the total project at a time. “Items” are sitemap subsets of 100 profiles; 50 batches of 1,000 sitemaps, each sitemap holding about 680 items, will be processed over the course of this archival. The tracker shows the status of the current batch only. Total profiles archived are 50 batches * 1,000 sitemaps/batch * 680 items/sitemap * 100 profiles/item = 3.4 billion profiles, or the total number of Google+ profiles (as of March, 2017). There will be 34 million items, total, in the overall process.
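The batch arithmetic quoted above checks out:

```python
# Verify the Google+ batch arithmetic: batches x sitemaps x items x profiles.
batches = 50
sitemaps_per_batch = 1_000
items_per_sitemap = 680   # "about 680 or so"
profiles_per_item = 100

total_items = batches * sitemaps_per_batch * items_per_sitemap
total_profiles = total_items * profiles_per_item

print(total_items)     # -> 34000000 (34 million items overall)
print(total_profiles)  # -> 3400000000 (3.4 billion profiles)
```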
Some Twitter tidbits on the last day of archival:
- [WayBack] Jason Scott on Twitter: “Google just started mass banning/limiting Archive Team downloads. If your thing doesn’t end up on the Internet Archive, talk to Google.”
- [WayBack] JRWR – Forrest F. auf Twitter: “@textfiles We found some work arounds to keep saving google plus, Warriors are coming back online!… “
- Around 2019-04-01T12:00Z (that’s noon in UTC on April first, for all the AM/PM lovers), the party seemed over:
- [WayBack] … I guess the party is over … – Jeroen Wiert Pluimers – Google+
- [WayBack] Jeroen Pluimers on Twitter: “seeing all kinds of 302 in the @at_warrior from my @xs4all IP address like “Item users:smap/44717/oc: Step 4 of 8 10=302 https://plus.google.com/112752132495892643511 …” So I guess the party is over? CC @jrwr @textfiles @SteveMcLaugh @archiveteam…”
- [WayBack] Jeroen Pluimers on Twitter: “Grafana confirms. Bye-bye G+. So long and thanks for all the fish. “ (larger grafana screenshot below)
- [WayBack] Jeroen Pluimers on Twitter: “Found the grafana dashboard. Party seems indeed over slightly before 2019-04-01T12:00Z. Bye-bye G+. It was fun while it lasted.”
- But then half an hour later it recovered, only to die again 7 minutes later on my IP address. It seems I got blacklisted:
- A bit later, there was a small and even shorter resurrection:
- Later that day and on 2019-04-02, every now and then, my IP address was unblocked:
- The ArchiveTeam then announced they were getting very close to grabbing everything: [WayBack] Archive Team status: 97%, estimated completion 8.5 hours. (Via IRC.) Context: https://www.archiveteam.org/index.php?title=Google – Edward Morbius – Google+
Some of my old G+ starting places
- [WayBack] Jeroen Wiert Pluimers – Google+
- [WayBack] Blub/humor – Google+
- [WayBack] 1984 and (IT) (in)security – Google+
- [WayBack] Opinion – Google+
On VMware and uploading OVA/OVF files
Quoted in part because the VMware documentation site hates to be archived
[empty WayBack/empty Archive.is] OVF and OVA Limitations for the VMware Host Client
To deploy a large OVA file, VMware recommends to first extract the OVA on your system by running the command tar -xvf <file.ova>. Then you can provide the deployment wizard with the OVF and VMDKs as separate files.
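That advice in shell form, as a small helper (the file names are whatever your OVA happens to contain):

```shell
# VMware's workaround for large OVA files: unpack the OVA (it is a plain
# tar archive), then feed the resulting .ovf descriptor and .vmdk disks
# to the Host Client deployment wizard as separate files.
extract_ova() {
  tar -xvf "$1"
  ls ./*.ovf ./*.vmdk   # the files to hand to the deployment wizard
}

# extract_ova archiveteam-warrior-v3-20171013.ESXi.ova
```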