The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Bookmarklets for Archive.is and the WayBack Machine to go to the original page

Posted by jpluimers on 2023/06/07

Quite often, when browsing an archived page on Archive.is or the WayBack Machine, I want to check the current status of the original page.

So I wrote a few Bookmarklets.

Archive.is

Default field

Any Archive.is page has a Saved from field: an input HTML element whose name attribute has the value q and whose value property contains the URL of the original page. Passing that URL to open() navigates to the original page.

So my goto Bookmarklet is this one:

javascript:open(document.getElementsByName("q")[0]?.value)

It uses [0]?. because there is no getElementByName (singular); there is only [Wayback/Archive] Document.getElementsByName() – Web APIs | MDN, which returns a list, as name values need not be unique whereas id values have to be.

Other Archive.is fields

The above works on all Archive.is page types:

  1. search pages like https://archive.is/https://example.org
  2. actual archived pages like https://archive.is/LkpeZ and https://archive.ph/2022.01.22-165646/https://example.org/
    • these only have a Saved from field.
  3. redirected archived pages like https://archive.ph/UEQeg and https://archive.ph/2013.01.03-111457/http://www.iana.org/domains/example/
    • these both have Saved from and Redirected from fields.
  4. complex pages like https://archive.ph/5iVVH and https://archive.ph/2015.11.14-044109/http://www.example.org/
    • those have even more fields: in addition to fields Saved from and Redirected from, the fields Via and Original also are added.

To get the values of these additional fields, we need to figure out a way to access them.

All these additional fields lack name or id attributes, but we do have a selection criterion: they are always readonly.

First option for alternate fields: CSS selectors

The additional fields are also always in the same order, so we can use [Wayback/Archive] Element.querySelector() – Web APIs | MDN for the first one, or [Wayback/Archive] Document.querySelectorAll() – Web APIs | MDN for any of them from first to last (both use [Wayback/Archive] CSS selectors – CSS: Cascading Style Sheets | MDN) to access them:

  1. The Redirected from field:
    javascript:open(document.querySelector('input[readonly]')?.value)
    

    or

    javascript:open(document.querySelectorAll('input[readonly]')[0]?.value)
  2. The Via field:
    javascript:open(document.querySelectorAll('input[readonly]')[1]?.value)
  3. The Original field:
    javascript:open(document.querySelectorAll('input[readonly]')[2]?.value)

Note I tried document.querySelector('input[readonly]:nth-of-type(2)') to access the Via field (and :nth-of-type(3) for the Original field), but both failed (because each input has a div as parent). The failure is explained by [Wayback/Archive] James Donnelly answering [Wayback/Archive] html – Matching the first/nth element of a certain type in the entire document – Stack Overflow (thanks [Wayback/Archive] user3289092 for asking):

With CSS alone this unfortunately isn’t possible. The documentation for the :first-of-type pseudo-class [Wayback/Archive] states:

The :first-of-type pseudo-class represents an element that is the first sibling of its type in the list of children of its parent element.

This means that :first-of-type is applied to the first element of its type relative to its parent and not the document’s root (or the body element, in this case).

Combining CSS selectors with the Document Object Model (DOM) means you have to know both the HTML elements and how they map to the DOM equivalents. Links like these are of big help:

Notes:

  • The four bookmarklets for the additional fields depend on the order of elements in the page; the bookmarklet for the Saved from field depends on the naming of the input element.
  • All bookmarklets will open an undefined page (usually about:blank) if the underlying element is not found. This is the result of how the ?. operator works (see [Wayback/Archive] Optional chaining (?.) – JavaScript | MDN) and could be worked around with the ?? operator (see [Wayback/Archive] Nullish coalescing operator (??) – JavaScript | MDN).
  • The first bookmarklet fails for https://archive.is/example.org (it opens the same page in a new tab). I will need to figure out a way to prepend a protocol to the URI when there is none (as document.querySelectorAll('input[readonly]')[0]?.value now returns example.org, which has no protocol).

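
The ?? workaround mentioned in the notes could look roughly like the sketch below. It is written as a plain function so it can be tried outside a bookmarklet; the name readonlyFieldValue, the doc parameter and the fallback argument are illustrative assumptions and not part of the bookmarklets above.

```javascript
// Hypothetical helper: return the value of the readonly input at `index`,
// or `fallback` when there is no such input on the page.
function readonlyFieldValue(doc, index, fallback) {
  // ?. yields undefined when the element is missing;
  // ?? then substitutes the fallback (only for null or undefined).
  return doc.querySelectorAll('input[readonly]')[index]?.value ?? fallback;
}

// Inlined in a bookmarklet this could become, for the Via field
// (falling back to the current page is just one possible choice):
// javascript:open(document.querySelectorAll('input[readonly]')[1]?.value??location.href)
```

With a fallback in place, open() is never called with undefined, so a missing field no longer results in an about:blank page.
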
Second option for alternate fields: XPath queries

This is a cool solution, but given that XPath is totally different from CSS selectors, I am going to cover it in a future blog post (likely a few months from now, as the queue has filled up quite a bit since I started writing the current post).

Hopefully the page title will stay the same, but here is the link already: XPath based bookmarklets for Archive.is: more JavaScript fiddling!

WayBack Machine search page

javascript:open(document.querySelectorAll('input[type=text]')[0]?.value)

This is for pages like https://web.archive.org/web/*/nu.nl. Such a page has an input HTML element without an id or name attribute, but with a type of text, so it is queryable with document.querySelectorAll('input[type=text]').

WayBack Machine archived page

This was hard, as WayBack Machine archived pages like https://web.archive.org/web/20220122112002/https://example.org/ “hide” their input elements in a closed Shadow DOM to keep them separate from the regular DOM, likely to prevent side effects from these elements on the archived page and vice versa.

Whereas an open Shadow DOM can be retrieved from a web page, a closed Shadow DOM cannot. So I did the next best thing: search/replace on the current URL, removing the leading archive part (https://web.archive.org/web/20220122112002/) from https://web.archive.org/web/20220122112002/https://example.org/, which leaves https://example.org/. Since both the WayBack Machine URL and the original URL can use either http or https, this is the bookmarklet:

javascript:open(location.href.replace(/^http[s]?:\/\/web\.archive\.org\/web\/\d{14}\/http/,'http'))

That looks a bit similar to the solution in JavaScript bookmarklet to replace part of the WayBack machine URL.

Note that, like the Archive.is bookmarklets, the above one fails for https://web.archive.org/web/*/example.org/. Here too I need to think about how to prepend a missing protocol.
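
To make the behaviour of that regular expression easier to follow, here is the same replacement wrapped in a small function that can be tried in a browser console. The function name originalFromWayback is made up for this sketch; the regular expression itself is the one from the bookmarklet above.

```javascript
// Same regular expression as the bookmarklet, wrapped in a function
// (the name originalFromWayback is hypothetical).
function originalFromWayback(href) {
  // Strips "http(s)://web.archive.org/web/<14 digit timestamp>/" when it is
  // followed by a URL starting with http; anything else is returned unchanged.
  return href.replace(/^http[s]?:\/\/web\.archive\.org\/web\/\d{14}\/http/, 'http');
}

console.log(originalFromWayback('https://web.archive.org/web/20220122112002/https://example.org/'));
// → https://example.org/

console.log(originalFromWayback('https://web.archive.org/web/*/example.org/'));
// → https://web.archive.org/web/*/example.org/ (no 14 digit timestamp, so nothing is replaced)
```

The second call shows why the search page case fails: the pattern does not match, so the bookmarklet opens the unchanged WayBack Machine URL.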

Fixing the protocol issue

Besides http:// and https://, I also need to consider ftp://, ftps:// and sftp:// as ftp://ftp.adobe.com/pub/adobe/acrobat/win/11.x/11.0.23/misc/ is archived in the WayBack Machine as seen by https://web.archive.org/web/*/ftp://ftp.adobe.com/pub/adobe/acrobat/win/11.x/11.0.23/misc/.

These links might help me fix the protocol issue eventually:

Basically the original -> replaced string testcases would be these:

  • ftp://example.org -> ftp://example.org
  • ftps://example.org -> ftps://example.org
  • sftp://example.org -> sftp://example.org
  • http://example.org -> http://example.org
  • https://example.org -> https://example.org
  • example.org -> https://example.org
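
One possible way to satisfy these test cases is sketched below. It is only an assumption about how the fix could look, not the eventual solution: the function name ensureProtocol is hypothetical, and it treats any string starting with a scheme followed by :// as already having a protocol.

```javascript
// Hypothetical helper: prepend https:// when the string has no protocol.
function ensureProtocol(url) {
  // A scheme starts with a letter, followed by letters, digits, "+", "." or "-",
  // and is followed here by "://".
  const hasProtocol = /^[a-z][a-z0-9+.-]*:\/\//i.test(url);
  return hasProtocol ? url : 'https://' + url;
}

// The test cases from the list above, as [original, expected] pairs.
const cases = [
  ['ftp://example.org', 'ftp://example.org'],
  ['ftps://example.org', 'ftps://example.org'],
  ['sftp://example.org', 'sftp://example.org'],
  ['http://example.org', 'http://example.org'],
  ['https://example.org', 'https://example.org'],
  ['example.org', 'https://example.org'],
];

for (const [original, expected] of cases) {
  console.log(original, '->', ensureProtocol(original), ensureProtocol(original) === expected);
}
```

Matching on a generic scheme pattern instead of listing http, https, ftp, ftps and sftp explicitly keeps the check short, at the cost of also accepting schemes that the archives may never contain.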

Helpful links

The links below helped me tremendously in writing this blog post.

–jeroen
