Bookmarklets for Archive.is and the WayBack Machine to go to the original page

Posted by jpluimers on 2023/06/07

Quite often, when browsing an archived page on Archive.is or the WayBack Machine, I want to check the current status of the original page.

So I wrote a few Bookmarklets.

Archive.is

Default field

Any Archive.is page has a Saved from field, which is an input HTML element with a name attribute of q and a value property containing the URL. The bookmarklet below navigates to that URL by passing it to open().

So my goto Bookmarklet is this one:

javascript:open(document.getElementsByName("q")[0]?.value)

It uses [0]?. because there is no getElementByName (singular); there is only [Wayback/Archive] Document.getElementsByName() – Web APIs | MDN, since name values need not be unique, whereas id values have to be.

Other Archive.is fields

The above works on all Archive.is page types:

search pages like https://archive.is/https://example.org
actual archived pages like https://archive.is/LkpeZ and https://archive.ph/2022.01.22-165646/https://example.org/
- these only have a Saved from field.
redirected archived pages like https://archive.ph/UEQeg and https://archive.ph/2013.01.03-111457/http://www.iana.org/domains/example/
- these both have Saved from and Redirected from fields.
complex pages like https://archive.ph/5iVVH and https://archive.ph/2015.11.14-044109/http://www.example.org/
- those have even more fields: in addition to fields Saved from and Redirected from, the fields Via and Original also are added.

To get the values of these additional fields, we need a way to access them.

All these additional fields lack name or id attributes, but we might have a selection criterion as they are always readonly:

[Wayback/Archive] <input>: The Input (Form Input) element; readonly attribute – HTML: HyperText Markup Language | MDN
- [Wayback/Archive] HTML attribute: readonly – HTML: HyperText Markup Language | MDN

First option for alternate fields: `CSS` selectors

The additional fields are also always in the same order, so we can use [Wayback/Archive] Element.querySelector() – Web APIs | MDN for the first field, or [Wayback/Archive] Document.querySelectorAll() – Web APIs | MDN for any field from first to last (both use [Wayback/Archive] CSS selectors – CSS: Cascading Style Sheets | MDN) to access them:

The Redirected from field:

javascript:open(document.querySelector('input[readonly]')?.value)

javascript:open(document.querySelectorAll('input[readonly]')[0]?.value)

The Via field:

javascript:open(document.querySelectorAll('input[readonly]')[1]?.value)

The Original field:

javascript:open(document.querySelectorAll('input[readonly]')[2]?.value)

Note I tried document.querySelector('input[readonly]:nth-of-type(2)') to access the Via field (and :nth-of-type(3) for the Original field), but both failed (because each input has a div as parent). The failure is explained by [Wayback/Archive] James Donnelly answering [Wayback/Archive] html – Matching the first/nth element of a certain type in the entire document – Stack Overflow (thanks [Wayback/Archive] user3289092 for asking):

With CSS alone this unfortunately isn’t possible. The documentation for the :first-of-type pseudo-class [Wayback/Archive] states:

The :first-of-type pseudo-class represents an element that is the first sibling of its type in the list of children of its parent element.

This means that :first-of-type is applied to the first element of its type relative to its parent and not the document’s root (or the body element, in this case).
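The difference can be sketched as follows. This is a minimal example, not part of the original bookmarklets: the helper name nthReadonlyValue is hypothetical, and it takes the document as a parameter so it can be tried outside a browser. Indexing the result of querySelectorAll counts matches in document order, whereas :nth-of-type counts siblings under one parent.

```javascript
// Hypothetical helper: value of the n-th (zero-based) readonly input,
// counted over the whole document rather than per parent element.
function nthReadonlyValue(doc, n) {
  // querySelectorAll returns all matches in document order, so it does not
  // matter that each input sits alone inside its own div.
  // ?. yields undefined when there is no n-th match.
  return doc.querySelectorAll("input[readonly]")[n]?.value;
}
```

In a browser this would be called as nthReadonlyValue(document, 1) for the Via field, assuming the field order described above.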

Combining CSS selectors with the Document Object Model (DOM) means you have to know both the HTML elements and how they map to the DOM equivalents. Links like these are of big help:

Notes:

The four bookmarklets for the additional fields depend on the order of elements in the page; the bookmarklet for the Saved from field depends on the naming of the input element.
All bookmarklets will open an undefined page (usually about:blank) if the underlying element is not found. This is the result of how the ?. operator works (see [Wayback/Archive] Optional chaining (?.) – JavaScript | MDN) and could be worked around with the ?? operator (see [Wayback/Archive] Nullish coalescing operator (??) – JavaScript | MDN).
The first bookmarklet fails for https://archive.is/example.org (it just opens the same page in a new tab). I will need to figure out a way to prepend a protocol to the URI when there is none (as document.querySelectorAll('input[readonly]')[0]?.value now returns example.org which has no protocol).
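The workaround from the second note could look like this. It is only a sketch under assumptions: the helper names savedFromUrl and openSavedFrom are hypothetical, and both the document and the open function are passed in as parameters so the logic can be tried without a browser.

```javascript
// Hypothetical helper: the "Saved from" URL, or null when the field is absent.
function savedFromUrl(doc) {
  // ?. yields undefined when no element named "q" exists;
  // ?? turns that undefined into an explicit null.
  return doc.getElementsByName("q")[0]?.value ?? null;
}

// Hypothetical helper: only open a new tab when a URL was actually found,
// instead of opening an empty page.
function openSavedFrom(doc, openFn) {
  const url = savedFromUrl(doc);
  if (url !== null) {
    openFn(url);
  }
  return url;
}
```

In a browser, the call would be openSavedFrom(document, open).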

Second option for alternate fields: `XPath` queries

This is a cool solution, but given that XPath is totally different from CSS selectors, I am going to cover it in a future blog post (likely a few months from now, as the queue has filled up quite a bit since I started writing the current post).

Hopefully the page title will stay the same, but here is the link already: XPath based bookmarklets for Archive.is: more JavaScript fiddling!

WayBack Machine search page

javascript:open(document.querySelectorAll('input[type=text]')[0]?.value)

This is for pages like https://web.archive.org/web/*/nu.nl. Such a page has an input HTML element without an id or name attribute, but with a type of text, so it can be queried with document.querySelectorAll('input[type=text]').

WayBack Machine archived page

This was hard, as WayBack Machine archived pages like https://web.archive.org/web/20220122112002/https://example.org/ “hide” their input elements in a closed Shadow DOM to keep them separate from the regular DOM, likely to prevent side effects between these elements and the archived page.

Whereas an open Shadow DOM can be accessed from scripts in the web page, a closed Shadow DOM cannot. So I did the next best thing: search/replace the current URL to strip the WayBack Machine prefix, which for the example above is https://web.archive.org/web/20220122112002/ (leaving https://example.org/). Since both the WayBack Machine URL and the archived URL can use either http or https, this is the bookmarklet:

javascript:open(location.href.replace(/^http[s]?:\/\/web\.archive\.org\/web\/\d{14}\/http/,'http'))
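To see what the regular expression does without opening a browser, the same replacement can be applied to a plain string. This is a small sketch; the function name originalFromWayback is hypothetical and only wraps the expression from the bookmarklet above.

```javascript
// Hypothetical wrapper around the bookmarklet's regular expression.
function originalFromWayback(href) {
  // Strips "http(s)://web.archive.org/web/<14-digit timestamp>/" when the
  // archived URL itself starts with "http"; other strings are returned unchanged.
  return href.replace(/^http[s]?:\/\/web\.archive\.org\/web\/\d{14}\/http/, "http");
}

console.log(originalFromWayback("https://web.archive.org/web/20220122112002/https://example.org/"));
// prints "https://example.org/"
```

A URL without a 14-digit timestamp, such as the search page https://web.archive.org/web/*/example.org/, does not match and comes back unchanged.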

That looks a bit similar to the solution in JavaScript bookmarklet to replace part of the WayBack machine URL.

Note that, like the Archive.is bookmarklets, the above one fails for https://web.archive.org/web/*/example.org/. Here too I need to think about how to prepend a missing protocol.

Fixing the protocol issue

Besides http:// and https://, I also need to consider ftp://, ftps:// and sftp:// as ftp://ftp.adobe.com/pub/adobe/acrobat/win/11.x/11.0.23/misc/ is archived in the WayBack Machine as seen by https://web.archive.org/web/*/ftp://ftp.adobe.com/pub/adobe/acrobat/win/11.x/11.0.23/misc/.

These links might eventually help me fix the protocol issue:

[Wayback/Archive] String.prototype.replace() – JavaScript | MDN
Negative lookahead:
- [Wayback/Archive] regex – Javascript regular expression to add protocol to url string – Stack Overflow
- [Wayback/Archive] Regex Tutorial – Lookahead and Lookbehind Zero-Length Assertions
- [Wayback/Archive] “negative lookahead” add missing protocol to uri – Google Search
- [Wayback/Archive] php – Add http:// prefix to URL when missing – Stack Overflow (thanks [Wayback/Archive] DiegoP., [Wayback/Archive] Evan Kennedy and [Wayback/Archive] dhaupin)
  
  Your regex is smart that it preserves https:// however it doesn’t work with relative urls such as //www.example.com. It also appends http:// onto ftp:// or other protocols. Try this instead: preg_replace('/^\/\/|^(?!https?:)(?!ftps?:)/', 'http://', $src) (note: you can remove the check for ftp if you don’t need it)
- [Wayback/Archive] .net – How can I apply a negative lookahead to a whole capture group? – Stack Overflow
- [Wayback/Archive] Url Regex with mandatory Protocol – Regex Tester/Debugger
  - [Wayback/Archive] Top 15 Commonly Used Regex – Digital Fortress

Basically, the original -> replaced string test cases would be these:

ftp://example.org -> ftp://example.org
ftps://example.org -> ftps://example.org
sftp://example.org -> sftp://example.org
http://example.org -> http://example.org
https://example.org -> https://example.org
example.org -> https://example.org
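One possible way to satisfy these test cases is sketched below. It is an assumption on my part, not a finished solution: the function name withProtocol and the default of https:// are hypothetical, and instead of listing protocols one by one, the negative lookahead accepts any scheme followed by ://.

```javascript
// Hypothetical helper: prepend a protocol only when the URI has none.
function withProtocol(uri, fallback = "https://") {
  // The pattern matches the empty string at the start of the input, but only
  // when the input does NOT begin with "<scheme>://" (negative lookahead).
  // Replacing that empty match inserts the fallback protocol in front.
  return uri.replace(/^(?![a-z][a-z0-9+.-]*:\/\/)/i, fallback);
}

console.log(withProtocol("example.org"));       // prints "https://example.org"
console.log(withProtocol("ftp://example.org")); // prints "ftp://example.org"
```

Protocol-relative URLs such as //www.example.com, mentioned in the quoted answer above, are not handled specially by this sketch.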

Helpful links

The below links tremendously helped me figure out writing the above blog post.

–jeroen

This entry was posted on 2023/06/07 at 12:00 and is filed under Bookmarklet, Development, JavaScript/ECMAScript, Power User, Scripting, Software Development, Web Browsers.

The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests
