Helping the WayBack ArchiveTeam team: running their Warrior virtual appliance on ESXi
Posted by jpluimers on 2021/03/19
The [WayBack] ArchiveTeam helps the WayBack Machine by feeding it new content.
You can help that team by running one or more “warrior” virtual machine instances. The VM is distributed as a virtual appliance in an ova file according to the Open Virtualization Format.
That format sounds more generic than it actually is, so the (at the time of writing) archiveteam-warrior-v3-20171013.ova
file at [WayBack] Index of /downloads/warrior3/ was created for VirtualBox.
This means that running it on VMware ESXi or VMware vSphere takes a few steps: patching it, then uploading it to your VMware host.
Since I might want to run the appliance in multiple places, or run multiple instances, I wanted a ready-to-go solution, so I created a git repository with both the patch instructions and the patched appliance at [WayBack] wiert.me / public / ova / archiveteam-warrior-v3-20171013.ESXi · GitLab.
Both of them are here:
- [WayBack] esxi-patch-steps.md · master · wiert.me / public / ova / archiveteam-warrior-v3-20171013.ESXi · GitLab
- [WayBack] archiveteam-warrior-v3-20171013.ESXi · master · wiert.me / public / ova / archiveteam-warrior-v3-20171013.ESXi · GitLab
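The patch itself is small. Roughly, the steps in the linked instructions amount to the sketch below. Note the assumptions: the `virtualbox-2.2` → `vmx-07` substitution and the exact file names inside the OVA are what this particular appliance version uses, so double-check them against the actual `.ovf` before repacking.

```shell
# Sketch of patching a VirtualBox-targeted OVA for ESXi -- verify the
# VirtualSystemType string and file names against your own .ovf first.
# An OVA is just a tar archive: an .ovf descriptor, an .mf manifest with
# SHA1 checksums, and one or more .vmdk disks.
patch_ova_for_esxi() {
  base="$1"   # e.g. archiveteam-warrior-v3-20171013

  # 1. unpack the OVA
  tar -xf "$base.ova"

  # 2. retarget the virtual hardware from VirtualBox to ESXi
  sed -i 's/virtualbox-2.2/vmx-07/' "$base.ovf"

  # 3. the manifest lists a SHA1 per file; refresh the entry for the edited .ovf
  sum=$(sha1sum "$base.ovf" | cut -d' ' -f1)
  sed -i "s|^SHA1($base.ovf).*|SHA1($base.ovf)= $sum|" "$base.mf"

  # 4. repack; the .ovf descriptor must be the first file in the archive
  tar -cf "$base.ESXi.ova" "$base.ovf" "$base.mf" "$base-disk1.vmdk"
}
```

The resulting `.ESXi.ova` is what the repository above ships ready-made.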
Thanks to these for both the patch and for getting me to write instructions:
- [WayBack] www.archiveteam.org/index.php?title=Deploy_OVA_to_VMware_ESXi
- [WayBack] edvoncken.net/2014/08/archiveteam-warrior-on-esxi
- [WayBack] gist.github.com/Kipari/c32192de52c2a86d56ee0d67472d4dc6
I raised the memory of each instance to 2 gibibytes (from the default 400 mebibytes) and the CPU to 3.6 GHz (from the default single core). This was more than enough: they stayed below 60% of that maximum.
That way each instance could run at the maximum allowed settings in the warrior web interface:
- 6 concurrent items
- 3 rsync threads
Together they fluctuated at about 50 mebibit/second of combined downstream and upstream throughput: about half of one of my home fiber connections.
More on the warrior
The download above is a virtual appliance that deploys as an Alpine Linux instance with an embedded docker container.
You can scale it up either by running multiple virtual appliances (the easiest way for just a few) or many orchestrated docker instances (if you already have a docker infrastructure running: grab the [Archive.is] archiveteam/warrior-dockerfile – Docker Hub).
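For the docker route, a small loop can start several warriors side by side. The container names and the 8001+n localhost port scheme below are my own choices, not anything official:

```shell
# Launch N warrior containers, each with its own name and its own
# localhost port for the web interface (8001, 8002, ...).
start_warriors() {
  count="$1"
  i=1
  while [ "$i" -le "$count" ]; do
    docker run --detach \
      --name "warrior$i" \
      --publish "127.0.0.1:$((8000 + i)):8001" \
      --restart always \
      archiveteam/warrior-dockerfile
    i=$((i + 1))
  done
}

# start_warriors 3   # warrior1..warrior3 on ports 8001..8003
```

Tearing them down again is `docker rm -f warrior1 warrior2 warrior3`.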
More on that, and how the warriors fit in the archiving scheme:
- [WayBack] Deathwatch – Archiveteam: The Deathwatch or Watchlist is a central indicator of websites and networks that are shutting down and serves as an indicator of what happened to particular sites that shut down quickly.
- [WayBack] Alive… OR ARE THEY – Archiveteam
- [WayBack] ArchiveTeam Warrior – Archiveteam
- [WayBack] Tracker – Archiveteam
- [WayBack] Dev – Archiveteam
- [WayBack] Dev/Infrastructure – Archiveteam
- [WayBack] Dev/Source Code – Archiveteam
- [WayBack] Dev/Warrior – Archiveteam
The virtual machine is self-updating. It does the following:
- Start the virtual machine
- Linux boots
- boot.sh downloads and launches /root/startup.sh
- startup.sh prepares and runs a docker container with the warrior runner
- Point your web browser to http://localhost:8001 and go.
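The chain of boot.sh and startup.sh could be sketched roughly like this. To be clear: these are not the appliance's actual scripts; the download URL, paths and function names are placeholders illustrating the sequence described above.

```shell
# Rough sketch of the warrior's self-update chain -- NOT the real scripts.
# The URL and destination path are placeholders.

# boot.sh: fetch the current startup script, then run it
boot() {
  url="$1"    # placeholder for wherever the warrior publishes startup.sh
  dest="$2"   # the appliance uses /root/startup.sh
  wget -q -O "$dest" "$url"
  sh "$dest"
}

# startup.sh: pull the warrior image and run it, exposing the web interface
startup() {
  docker pull archiveteam/warrior-dockerfile
  docker run --publish 8001:8001 --restart always archiveteam/warrior-dockerfile
}
```

Because the startup script is fetched on every boot, rebooting the VM is enough to pick up appliance-side updates.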
- [WayBack] Dev/New Project – Archiveteam
- [WayBack] Dev/Seesaw – Archiveteam
- [WayBack] Dev/Tracker – Archiveteam
- [WayBack] Dev/Staging – Archiveteam (Setting up Rsync and Megawarc Factory)
- [WayBack] Dev/Project Management – Archiveteam
Docker
If you have Docker installed, the following command will help preserve…:
docker run --publish 127.0.0.1:8001:8001 --restart always archiveteam/warrior-dockerfile
https://twitter.com/SteveMcLaugh/status/1112433277359529984 – Edward Morbius – Google+
Via:
- [WayBack] If you have Docker installed, the following command will help preserve…
- [WayBack] Steve McLaughlin🌾💾 on Twitter : “If you have Docker installed, the following command will help preserve Google Plus:
docker run --publish 127.0.0.1:8001:8001 --restart always archiveteam/warrior-dockerfile
“, mentioning:
- [WayBack] Jason Scott on Twitter: “The @archiveteave mirroring of Google Plus has now surpassed 1.2 PETABYTES (1,200 Terabytes). We need to finish it off before April 2 when they close everything down. Please watch the fun at https://t.co/LSAIEkMEbv and PLEASE RUN YOUR OWN WARRIOR FOR US. We need it badly.… “
–jeroen
PS
I wrote this when Google+ (G+) was a few days from being shut down, but there are always sites shutting down that need archiving.
The ArchiveTeam has a site listing warrior projects, but since the team is better at archiving and tool development than at writing wiki content, the easiest option is to have your warrior instance run the “ArchiveTeam’s Choice” project: it will work on the most pressing downloads, and continue with the next project after one finishes.
On the Google+ and the G+ archiving project
- [WayBack] Archive Team: Google Plus (or Minus) : Free Web : Free Download, Borrow and Streaming : Internet Archive
- [WayBack] Saving of public Google+ content at the Internet Archive’s Wayback Machine by the Archive Team has begun : plexodus
There are tools to assist with rebuilding websites based on Wayback Machine archives. Whether or not these will support Google+ user, Page, Collection, or Community accounts is not presently clear, though we’ll try to provide information as it becomes available.
Tools:
- [WayBack] Google+ tracker – #googleminus – Donate at https://archive.org/donate/ for hosting the archives Dashboard
- [WayBack] Google+ – Archiveteam
- [WayBack] PLExodus: The Beginning is Near – Google+
- [WayBack] PlexodusWiki
- [WayBack] Google+ Mass Migration community on G+ — Helping coordinate the exodus : googleplus
- Google+ – Wikipedia
- [WayBack] Where to get Google+ community, profile or page URL?
- [WayBack] “Google just started mass banning/limiting Archive Team downloads” | Hacker News
- [WayBack] Archive Team Tracker Charts – Grafana: project googleplus
- [WayBack] Thread by @jpluimers: “seeing all kinds of 302 in the @at_warrior from my @xs4all IP address like “Item users:smap/44717/oc: Step 4 of 8 10=302 plus.google.com/112 […]”
Some stats:
The more interesting project tracker, showing updates in real time, is: http://tracker.archiveteam.org/googleplus/
Note that this shows only 1/50th of the total project at a time. “Items” are sitemap subsets of 100 profiles; 50 batches of 1,000 sitemaps, each sitemap holding about 680 items, will be processed over the course of this archival. The tracker shows the status of the current batch only. Total profiles archived are 50 batches * 1,000 sitemaps/batch * 680 items/sitemap * 100 profiles/item = 3.4 billion profiles, or the total number of Google+ profiles (as of March, 2017). There will be 34 million items, total, in the overall process.
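The batch arithmetic quoted above checks out:

```python
# Verify the Google+ batch arithmetic: batches x sitemaps x items x profiles.
batches = 50
sitemaps_per_batch = 1_000
items_per_sitemap = 680   # "about 680 or so"
profiles_per_item = 100

total_items = batches * sitemaps_per_batch * items_per_sitemap
total_profiles = total_items * profiles_per_item

print(total_items)     # -> 34000000 (34 million items overall)
print(total_profiles)  # -> 3400000000 (3.4 billion profiles)
```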
Some Twitter tidbits on the last day of archival:
- [WayBack] Jason Scott on Twitter: “Google just started mass banning/limiting Archive Team downloads. If your thing doesn’t end up on the Internet Archive, talk to Google.”
- [WayBack] JRWR – Forrest F. auf Twitter: “@textfiles We found some work arounds to keep saving google plus, Warriors are coming back online!… “
- Around 2019-04-01T12:00Z (that’s noon in UTC on April first, for all the AM/PM lovers), the party seemed over:
- [WayBack] … I guess the party is over … – Jeroen Wiert Pluimers – Google+
- [WayBack] Jeroen Pluimers on Twitter: “seeing all kinds of 302 in the @at_warrior from my @xs4all IP address like “Item users:smap/44717/oc: Step 4 of 8 10=302 https://plus.google.com/112752132495892643511 …” So I guess the party is over? CC @jrwr @textfiles @SteveMcLaugh @archiveteam…”
- [WayBack] Jeroen Pluimers on Twitter: “Grafana confirms. Bye-bye G+. So long and thanks for all the fish. “ (larger grafana screenshot below)
- [WayBack] Jeroen Pluimers on Twitter: “Found the grafana dashboard. Party seems indeed over slightly before 2019-04-01T12:00Z. Bye-bye G+. It was fun while it lasted.”
- But then half an hour later it recovered, only to die again 7 minutes later on my IP address. It seems I got blacklisted:
- A bit later, there was a small and even shorter resurrection:
- Later that day and on 2019-04-02, every now and then, my IP address was unblocked:
- The ArchiveTeam then announced they were getting very close to grabbing everything: [WayBack] Archive Team status: 97%, estimated completion 8.5 hours. (Via IRC.) Context: https://www.archiveteam.org/index.php?title=Google – Edward Morbius – Google+
Some of my old G+ starting places
- [WayBack] Jeroen Wiert Pluimers – Google+
- [WayBack] Blub/humor – Google+
- [WayBack] 1984 and (IT) (in)security – Google+
- [WayBack] Opinion – Google+
On VMware and uploading OVA/OVF files
Quoted in part because the VMware documentation site hates to be archived
[empty WayBack/empty Archive.is] OVF and OVA Limitations for the VMware Host Client
To deploy a large OVA file, VMware recommends to first extract the OVA on your system by running the command tar -xvf <file.ova>. Then you can provide the deployment wizard with the OVF and VMDKs as separate files.
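That advice in shell form, as a small helper (the file names are whatever your OVA happens to contain):

```shell
# VMware's workaround for large OVA files: unpack the OVA (it is a plain
# tar archive), then feed the resulting .ovf descriptor and .vmdk disks
# to the Host Client deployment wizard as separate files.
extract_ova() {
  tar -xvf "$1"
  ls ./*.ovf ./*.vmdk   # the files to hand to the deployment wizard
}

# extract_ova archiveteam-warrior-v3-20171013.ESXi.ova
```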