The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


Archive for the ‘InternetArchive’ Category

Interactive tool to record a web-site, then archive it as WARC: GitHub – webrecorder/webrecorder-user-guide: webrecorder user guide and glossary

Posted by jpluimers on 2021/03/26

Learned about this in the fall of G+: [WayBack] GitHub – webrecorder/webrecorder-user-guide: webrecorder user guide and glossary

Back then, fully automated tools were easier.

So it is on my list of things to try one day for smaller projects.

From the glossary:

Quick Start

  1. Enter a URL in the box in the center of the screen labeled ‘URL to capture’.
  2. Press the ‘start’ button (look down and to the right of the box where you entered the URL).
  3. Interact with the web page that loads so Webrecorder can capture the content displayed on this page. To collect audio or video from a page be sure to press ‘play’ so the file will load into the browser.
  4. Continue to visit and browse the pages you would like to capture. Each page you view will be included in your capture session. Note: you will be capturing the contents of each page you visit but will not automatically obtain pages that are linked to on the pages you collect (hyperlinks).
  5. To end your capture session hover over the ‘Capture’ button in the upper left corner of the screen so it changes to read ‘Stop’ then click that button.
  6. Your capture will then be browsable. Note: the capture will not ‘replay’ like a linear recording but instead be an interactive copy of the pages you have collected.
  7. If you are a logged-in user, this session will be saved to your account automatically. If you are not logged in an account, you can sign up for an account or log in to your existing account to save the collection after you create it. If you do not log in you can still download your collection for a limited time (approximately 90 minutes from when you stop your recording session).

–jeroen

Posted in Internet, InternetArchive, Power User | Leave a Comment »

Helping the WayBack ArchiveTeam team: running their Warrior virtual appliance on ESXi

Posted by jpluimers on 2021/03/19

The [WayBack] Archiveteam helps the WayBack machine by feeding it new content.

You can help that team by running one or more “warrior” virtual machine instances. The VM is distributed as a virtual appliance in an ova file according to the Open Virtualization Format.

That format sounds more generic than it actually is: the (at the time of writing) archiveteam-warrior-v3-20171013.ova file at [WayBack] Index of /downloads/warrior3/ was created for VirtualBox.

This means that running it on VMware ESXi or VMware vSphere takes a few steps: first patching it, then uploading it to your VMware host.

Since I might want to run the appliance in multiple places or as multiple instances, I wanted a ready-to-go solution. So I created a git repository with both the patch instructions and the update at [WayBack] wiert.me / public / ova / archiveteam-warrior-v3-20171013.ESXi · GitLab.

Read the rest of this entry »

Posted in ArchiveTeamWarrior, Cloud, Containers, Docker, Infrastructure, Internet, InternetArchive, Kubernetes (k8n), Power User, WayBack machine | Leave a Comment »

Archiving Google Product Forums URLs

Posted by jpluimers on 2020/11/13

Archiving Google Product Forum URLs is a pain in the butt for a couple of reasons:

So the trick for saving is:

  1. Get from the /forum/#!topic/ based URL to the /d/topic/ based one
  2. Put it after the archive.is/?run=1&url=, then save
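The two steps above can be sketched in Python. This is only an illustration: the function names, the `https://` scheme in front of archive.is, the percent-encoding of the target URL, and the example topic URL are all assumptions that are not in the steps themselves.

```python
from urllib.parse import quote

# Prefix from step 2; the https:// scheme is an assumption.
ARCHIVE_PREFIX = "https://archive.is/?run=1&url="


def to_static_topic_url(url):
    """Step 1: turn a /forum/#!topic/ based URL into the /d/topic/ based one."""
    marker = "/forum/#!topic/"
    if marker not in url:
        raise ValueError("not a /forum/#!topic/ based URL: " + url)
    return url.replace(marker, "/d/topic/", 1)


def archive_is_save_url(url):
    """Step 2: put the converted URL after archive.is/?run=1&url=."""
    # Percent-encoding the target is an assumption; it keeps the target URL
    # from being mistaken for extra query parameters.
    return ARCHIVE_PREFIX + quote(to_static_topic_url(url), safe="")


if __name__ == "__main__":
    # Hypothetical topic URL, used only to show the transformation.
    print(archive_is_save_url(
        "https://productforums.google.com/forum/#!topic/chrome/abc123"))
```

Opening the printed URL in a browser is then the "save" part of step 2.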

--jeroen

Posted in Conference Topics, Conferences, Event, Internet, InternetArchive, Power User, WayBack machine | Leave a Comment »

WayBack machine now rate limits your requests and blocks if you go over it

Posted by jpluimers on 2019/10/19

Got this a while ago while saving a bunch of links for my blog; unfortunately, nobody at the email address responded to my request for information.

Too Many Requests

We are limiting the number of URLs you can submit to be Archived to the Wayback Machine, using the Save Page Now features, to no more than 15 per minute.

If you submit more than that we will block Save Page Now requests from your IP number for one day.

Please feel free to write to us at info@archive.org if you have questions about this. Please include your IP address and any URLs in the email so we can provide you with better service.

I wish there were a queue service that makes you wait longer, but does fulfill the request.
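Until such a queue exists, a client-side stand-in could be a small throttle that spaces out submissions. The Python sketch below only handles the pacing, not the actual Save Page Now request. The class name `SaveThrottle`, the default of 12 per minute (a safety margin below the quoted limit of 15) and the injectable `clock` and `sleep` parameters are assumptions made for illustration; none of this is something archive.org provides.

```python
import time


class SaveThrottle:
    """Space out calls so that no more than max_per_minute happen per minute."""

    def __init__(self, max_per_minute=12, clock=time.monotonic, sleep=time.sleep):
        # Evenly spaced calls: 12 per minute means one call every 5 seconds.
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = SaveThrottle()
    for url in ["https://example.org/a", "https://example.org/b"]:
        throttle.wait()
        print("would submit", url)  # placeholder for the real save request
```

Calling `wait()` before each submission makes a batch of links take longer, but keeps the batch under the limit instead of getting the IP address blocked for a day.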

–jeroen

Posted in Internet, InternetArchive, Power User, WayBack machine | Leave a Comment »

When archiving in the WayBack machine returns error 400: clear your cookies

Posted by jpluimers on 2019/08/16

When archiving pages in the WayBack machine, it still managed to set truckloads of cookies, despite Privacy Badger being set to “save no cookies”.

So I used the Chrome settings in chrome://settings/content/cookies to disable cookies and now everything is fine.

–jeroen

Read the rest of this entry »

Posted in Chrome, Google, Internet, InternetArchive, Power User, Privacy, WayBack machine | Leave a Comment »

When saving on the WayBack machine at web.archive.org/save terminates the connection

Posted by jpluimers on 2019/05/27

When you get the response “web.archive.org unexpectedly closed the connection” without even returning an HTTP code, but:

  • it works in anonymous mode
  • it works with all extensions turned off

then there are likely too many cookies for archive.org and/or web.archive.org: in my case, I had 90 cookies.

Cleaning these cookies out resolved the problem (I used [WayBack] Awesome Cookie Manager for this).

Edit 20231230: Awesome Cookie Manager source repository at [Wayback/Archive] Phatsuo/awesome-cookie-manager: Awesome Cookie Manager.

--jeroen

Posted in Chrome, Google, Internet, InternetArchive, Power User, WayBack machine | Leave a Comment »

When +Google Nederland maps only fills none or part of the map tiles…

Posted by jpluimers on 2019/04/18

I still have to do this every few weeks on all my desktop machines: [WayBack] When +Google Nederland maps only fills none or part of the map tiles… – Jeroen Wiert Pluimers – Google+

When +Google Nederland maps only fills none or part of the map tiles at https://maps.google.nl, but https://maps.google.com works fine, then remove any gsScrollPos cookies from www.google.nl.

I need to do this every couple of days to keep maps.google.nl working.

Later I also found it can happen for YouTube, then did more digging for gsScrollPos and found a better workaround: [WayBack] Awesome Cookie Manager, where you can just delete the gsScrollPos cookies from all sites in one go.

Even later I found out that this can be one of the causes for the WayBack machine giving an error 400 when archiving. A more common reason, however, is that many archived web pages try to create cookies in the web.archive.org subdomain, resulting in the same problem.

The cause seems to be the Great Suspender plugin which should be fixed by now, but might not automatically update to the latest version. See:

Pending a new Great Suspender release, below is a quick way to manually remove the gsScrollPos cookies if you are into SQL scripting for sqlite. It basically comes down to executing the statement below while Chrome is closed:

delete from cookies where name like 'gsScrollPos-%';
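For those who would rather script it, here is a minimal Python sketch that runs the same statement through the standard `sqlite3` module. The function name and the `db_path` parameter are assumptions: you have to point it at the cookie database of your own Chrome profile, whose location differs per operating system and profile, and Chrome still needs to be closed.

```python
import sqlite3


def delete_gs_scroll_pos_cookies(db_path):
    """Delete all gsScrollPos-* cookies; return how many rows were removed."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on an exception
            cursor = conn.execute(
                "DELETE FROM cookies WHERE name LIKE 'gsScrollPos-%'")
            return cursor.rowcount
    finally:
        conn.close()


if __name__ == "__main__":
    import sys
    # Usage: pass the path to the Chrome cookie database as the only argument.
    print(delete_gs_scroll_pos_cookies(sys.argv[1]), "cookies removed")
```

Making a copy of the database file before running this is a cheap safety net.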

Edit 20231230: Awesome Cookie Manager source repository at [Wayback/Archive] Phatsuo/awesome-cookie-manager: Awesome Cookie Manager.

--jeroen

Posted in Chrome, Google, GoogleMaps, Internet, InternetArchive, Power User, WayBack machine | Leave a Comment »

GitHub – ArchiveTeam/googleplus-grab: Archiving Google+.

Posted by jpluimers on 2019/03/18

Soon this will be a thing of the past, but for just a few more days, you can help: Archiving Google+.

Either run this project: [WayBack] GitHub – ArchiveTeam/googleplus-grab: Archiving Google+.

Or even better: run the appliance, and help the WayBack machine with any archiving projects set up by the virtual appliance: the [WayBack] ArchiveTeam Warrior – Archiveteam.

See some of their other pages for more background information:

You can donate to both the Archive Team and the Internet Archive:

How is G+ archiving doing?

The tracker is well under way: [WayBack] Googleplus tracker Dashboard. History: archive.is 1; archive.is 2

Read the rest of this entry »

Posted in ArchiveTeamWarrior, Development, G+: GooglePlus, Google, Internet, InternetArchive, Power User, Python, Scripting, SocialMedia, Software Development, WayBack machine | Leave a Comment »

The [WayBack] and [Archive.is] links in my blog and G+ stream

Posted by jpluimers on 2018/12/13

Answering a good question on [WayBack] Jeroen Wiert Pluimers – Google+:

Read the rest of this entry »

Posted in Blogging, Bookmarklet, InternetArchive, Power User, SocialMedia, WayBack machine, Web Browsers | Leave a Comment »

Aside from the Wayback Machine, what are other options for getting screenshots of websites from the past? – Quora

Posted by jpluimers on 2018/10/15

I’ve used these myself:

Many more are listed in, for instance, these links:

IIPC OpenWayBack:

–jeroen

Posted in Internet, InternetArchive, Power User, WayBack machine | Leave a Comment »