The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


XPath based bookmarklets for Archive.is: more JavaScript fiddling!

Posted by jpluimers on 2023/09/20

As I promised a few months back in Bookmarklets for Archive.is and the WayBack Machine to go to the original page, here is moar JavaScript fiddling: this time with XPath based bookmarklets to navigate from Archive.is pages to the Saved From, Redirected from, Via and Original pages.

An alternative would be using XPath, as the additional fields are always structured in a table like the HTML below (taking complex pages like https://archive.ph/5iVVH and https://archive.ph/2015.11.14-044109/http://www.example.org/ as examples).

I got triggered into using XPath by this answer from [Wayback/Archive] gdyrrahitis at [Wayback/Archive] Javascript .querySelector find by innerTEXT – Stack Overflow (thanks [Wayback/Archive] passwd for asking):

OP’s question is about plain JavaScript and not jQuery. Although there are plenty of answers and I like @Pawan Nogariya answer, please check this alternative out.

You can use XPATH in JavaScript. More info on the MDN article here.

The document.evaluate() method evaluates an XPATH query/expression. So you can pass XPATH expressions there, traverse into the HTML document and locate the desired element.

In XPATH you can select an element, by the text node like the following, which gets the div that has the following text node.

//div[text()="Hello World"]

To get an element that contains some text use the following:

//div[contains(., 'Hello')]

The contains() method in XPATH takes a node as first parameter and the text to search for as second parameter.

Check this plunk here, this is an example use of XPATH in JavaScript

Here is a code snippet:

var headings = document.evaluate("//h1[contains(., 'Hello')]", document, null, XPathResult.ANY_TYPE, null );
var thisHeading = headings.iterateNext();

console.log(thisHeading); // Prints the html element in console
console.log(thisHeading.textContent); // prints the text content in console

thisHeading.innerHTML += "<br />Modified contents";

As you can see, I can grab the HTML element and modify it as I like.

The same question also has answers showing you can do the search in a mix of CSS selectors, pure (sometimes functional) JavaScript coding and some even regular expressions. These exactly show the reason why I opted for XPath: there the whole query is done in one language and is not spread over two.

More on non-XPath based solutions

Some examples (note that anything selecting all div elements will usually be slow, as the outer div will likely contain various inner div elements; see also my /html/body remark on XPath below):

  • Answer by [Wayback/Archive] Pawan Nogariya

    Since you have asked it in javascript so you can have something like this

    function contains(selector, text) {
      var elements = document.querySelectorAll(selector);
      return Array.prototype.filter.call(elements, function(element){
        return RegExp(text).test(element.textContent);
      });
    }
    

    And then call it like this

    contains('div', 'sometext'); // find "div" that contain "sometext"
    contains('div', /^sometext/); // find "div" that start with "sometext"
    contains('div', /sometext$/i); // find "div" that end with "sometext", case-insensitive

    Note that similar contains functions are present in various other answers, for instance [Wayback/Archive] Native javascript equivalent of jQuery :contains() selector – Stack Overflow (answered by [Wayback/Archive] elclanrs , asked by [Wayback/Archive] coulbourne) which has an interesting remark by [Wayback/Archive] avalanche1:

    This is incorrect because it also includes results for all child nodes. I.e. if child node of element will contain text – element will be included into contains result; which is wrong.

  • by anonymous:

    You could use this pretty simple solution:

    Array.from(document.querySelectorAll('div'))
      .find(el => el.textContent === 'SomeText, text continues.');
    
  • by [Wayback/Archive] Andrew Willems:

    This solution does the following:

    • Uses the ES6 spread operator to convert the NodeList of all divs to an array.
    • Provides output if the div contains the query string, not just if it exactly equals the query string (which happens for some of the other answers). e.g. It should provide output not just for ‘SomeText’ but also for ‘SomeText, text continues’.
    • Outputs the entire div contents, not just the query string. e.g. For ‘SomeText, text continues’ it should output that whole string, not just ‘SomeText’.
    • Allows for multiple divs to contain the string, not just a single div.
    [...document.querySelectorAll('div')]      // get all the divs in an array
      .map(div => div.innerHTML)               // get their contents
      .filter(txt => txt.includes('SomeText')) // keep only those containing the query
      .forEach(txt => console.log(txt));       // output the entire contents of those
    <div>SomeText, text continues.</div>
    <div>Not in this div.</div>
    <div>Here is more SomeText.</div>
  • and a similar one by [Wayback/Archive] Redu:

    You best see if you have a parent element of the div you are querying. If so get the parent element and perform an element.querySelectorAll("div"). Once you get the nodeList apply a filter on it over the innerText property. Assume that a parent element of the div that we are querying has an id of container. You can normally access container directly from the id but let’s do it the proper way.

    var conty = document.getElementById("container"),
         divs = conty.querySelectorAll("div"),
        myDiv = [...divs].filter(e => e.innerText == "SomeText");
    

The last one – despite not my preference – did teach me a few things I did not have much experience with:

I found another archived page having these 4 fields as well: [Wayback/Archive] Get Windows Terminal – Microsoft Store.

The common aspect with the previous page is that both come from WayBack Machine links:

[Wayback/Archive] Jeroen Wiert Pluimers on Twitter: “Cool, some @archiveis archived pages have not 1, not 2, but 4 fields!”


Back to XPath and the HTML I based the XPath expressions on:

<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td rowspan="6">
<div>archive.today</div>
<div>webpage capture</div>
</td>
<td>Saved from</td>
<td><form action="https://archive.fo/search/" method="get">
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td><input name="q" type="text" value="http://web.archive.org/web/20151114044109/http://www.example.org/" /><input name="t" type="hidden" value="1447631011200" /><input name="id" type="hidden" value="5iVVH" /></td>
<td><input tabindex="-1" type="submit" value="search" /></td>
</tr>
</tbody>
</table>
<textarea name=""></textarea></form></td>
<td rowspan="2"><time datetime="2015-11-15T23:43:31Z">15 Nov 2015 23:43:31 UTC</time></td>
</tr>
<tr>
<td>Redirected from</td>
<td>
<div><input readonly="readonly" type="text" value="http://web.archive.org/web/2/example.org" /></div>
</td>
</tr>
<tr>
<td>Via</td>
<td>
<div><input readonly="readonly" type="text" value="http://www.example.org/" /></div>
</td>
<td><time datetime="2015-11-14T04:41:09Z">14 Nov 2015 04:41:09 UTC</time></td>
</tr>
<tr>
<td>Original</td>
<td>
<div><input readonly="readonly" type="text" value="http://example.org/" /></div>
</td>
<td><time datetime="2015-11-14T04:41:09Z">14 Nov 2015 04:41:09 UTC</time></td>
</tr>
</tbody>
</table>

On the one hand, XPath requires knowing yet another language. On the other hand: it is very well supported in web browsers, see:

  • The table in Comparison of web browsers: JavaScript support – Wikipedia, and the remarks right above it:

    Information about what JavaScript technologies the browsers support. Note that although XPath is used by XSLT, it is only considered here if it can be accessed using JavaScript. External links lead to information about support in future versions of the browsers or extensions that provide such functionality, e.g., Babel.

  • [Wayback/Archive] Introduction to using XPath in JavaScript – XPath | MDN
  • [Wayback/Archive] Document.evaluate() – Web APIs | MDN

    Returns an XPathResult based on an XPath expression and other given parameters.

    Note that the returned XPathResult will never be null.

  • [Wayback/Archive] Document.evaluate(): values of resultType – Web APIs | MDN
    These are supported values for the resultType parameter of the evaluate method, listed as result type (value): description.
    • ANY_TYPE (0): Whatever type naturally results from the given expression.
    • NUMBER_TYPE (1): A result set containing a single number. Useful, for example, in an XPath expression using the count() function.
    • STRING_TYPE (2): A result set containing a single string.
    • BOOLEAN_TYPE (3): A result set containing a single boolean value. Useful, for example, in an XPath expression using the not() function.
    • UNORDERED_NODE_ITERATOR_TYPE (4): A result set containing all the nodes matching the expression. The nodes in the result set are not necessarily in the same order they appear in the document.
    • ORDERED_NODE_ITERATOR_TYPE (5): A result set containing all the nodes matching the expression. The nodes in the result set are in the same order they appear in the document.
    • UNORDERED_NODE_SNAPSHOT_TYPE (6): A result set containing snapshots of all the nodes matching the expression. The nodes in the result set are not necessarily in the same order they appear in the document.
    • ORDERED_NODE_SNAPSHOT_TYPE (7): A result set containing snapshots of all the nodes matching the expression. The nodes in the result set are in the same order they appear in the document.
    • ANY_UNORDERED_NODE_TYPE (8): A result set containing any single node that matches the expression. The node is not necessarily the first node in the document that matches the expression.
    • FIRST_ORDERED_NODE_TYPE (9): A result set containing the first node in the document that matches the expression.
    Results of NODE_ITERATOR types contain references to nodes in the document. Modifying a node will invalidate the iterator. After modifying a node, attempting to iterate through the results will result in an error.
    Results of NODE_SNAPSHOT types are snapshots, which are essentially lists of matched nodes. You can make changes to the document by altering snapshot nodes. Modifying the document doesn’t invalidate the snapshot; however, if the document is changed, the snapshot may not correspond to the current state of the document, since nodes may have moved, been changed, added, or removed.
  • [Wayback/Archive] XPathResult – Web APIs | MDN
    The XPathResult interface represents the results generated by evaluating an XPath expression within the context of a given node.
    Since XPath expressions can result in a variety of result types, this interface makes it possible to determine and handle the type and value of the result.
  • [Wayback/Archive] XPathResult.singleNodeValue – Web APIs | MDN
  • [Wayback/Archive] XPathResult.iterateNext() – Web APIs | MDN
  • [Wayback/Archive] XPathResult.snapshotLength – Web APIs | MDN
  • [Wayback/Archive] XPathResult.snapshotItem() – Web APIs | MDN
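The result types above decide which XPathResult members you are allowed to read: iterateNext() for the iterator types, snapshotLength and snapshotItem() for the snapshot types, and singleNodeValue for the two single node types. A minimal sketch of a helper that hides that difference follows. The names ResultType and xpathResultToArray are made up for this example, and the numeric values are copied from the MDN table quoted above so that the snippet also runs outside a browser; inside a browser you would use the XPathResult constants instead.

```javascript
// Numeric values copied from the MDN resultType table quoted above (assumption:
// they match the XPathResult.* constants of your browser, as MDN documents).
const ResultType = Object.freeze({
  UNORDERED_NODE_ITERATOR_TYPE: 4,
  ORDERED_NODE_ITERATOR_TYPE: 5,
  UNORDERED_NODE_SNAPSHOT_TYPE: 6,
  ORDERED_NODE_SNAPSHOT_TYPE: 7,
  ANY_UNORDERED_NODE_TYPE: 8,
  FIRST_ORDERED_NODE_TYPE: 9,
});

// Hypothetical helper: turn any node-returning XPathResult into a plain array.
function xpathResultToArray(result) {
  switch (result.resultType) {
    case ResultType.UNORDERED_NODE_ITERATOR_TYPE:
    case ResultType.ORDERED_NODE_ITERATOR_TYPE: {
      const nodes = [];
      // iterateNext() returns null once the iterator is exhausted.
      for (let node = result.iterateNext(); node; node = result.iterateNext()) {
        nodes.push(node);
      }
      return nodes;
    }
    case ResultType.UNORDERED_NODE_SNAPSHOT_TYPE:
    case ResultType.ORDERED_NODE_SNAPSHOT_TYPE:
      // Snapshots are indexed: 0 .. snapshotLength - 1.
      return Array.from({ length: result.snapshotLength }, (_, i) => result.snapshotItem(i));
    case ResultType.ANY_UNORDERED_NODE_TYPE:
    case ResultType.FIRST_ORDERED_NODE_TYPE:
      // singleNodeValue is null when nothing matched.
      return result.singleNodeValue ? [result.singleNodeValue] : [];
    default:
      throw new TypeError(`Not a node result type: ${result.resultType}`);
  }
}
```

In a browser, a call could look like xpathResultToArray(document.evaluate(xpath, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null)). Copying the nodes into an array right away also sidesteps the iterator invalidation mentioned above.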

The drawback of XPath is similar to CSS selectors: yet another language to master. So for XPath, I wrote two helper functions separating the JavaScript framework bits and the XPath bits:

function getElementByXPath(xpath, parent) { // use when expecting zero or one result; default value for parent is document
  const result = document.evaluate(xpath, parent || document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
  return result.singleNodeValue; // the matching element, or null when nothing matched
}

function getArchiveIsInputValue(tdText) {
  const input = getElementByXPath(`/html/body//td[text()="${tdText}"]/following-sibling::td/div/input`);
  return input?.value; // undefined when the field is not on the page
}

First the getElementByXPath method. It calls document.evaluate to execute the xpath expression and get an XPathResult into the result variable. When the parent parameter is null or undefined, document is passed instead via the || trick. That trick relies on the non-Boolean usage of the [Wayback/Archive] Logical OR (||) – JavaScript | MDN:

The logical OR (||) operator (logical disjunction) for a set of operands is true if and only if one or more of its operands is true. It is typically used with boolean (logical) values. When it is, it returns a Boolean value. However, the || operator actually returns the value of one of the specified operands, so if this operator is used with non-Boolean values, it will return a non-Boolean value.

The following code shows examples of the || (logical OR) operator.
o1  = true  || true       // t || t returns true
o2  = false || true       // f || t returns true
o3  = true  || false      // t || f returns true
o4  = false || (3 == 4)   // f || f returns false
o5  = 'Cat' || 'Dog'      // t || t returns "Cat"
o6  = false || 'Cat'      // f || t returns "Cat"
o7  = 'Cat' || false      // t || f returns "Cat"
o8  = ''    || false      // f || f returns false
o9  = false || ''         // f || f returns ""
o10 = false || varObject // f || object returns varObject

The above use of || shows you can use it as a kind of default operator as shown in [Wayback/Archive] 3 Ways to Set Default Value in JavaScript | SamanthaMing.com:

My go-to has always been the ternary operator for assigning a value to a variable conditionally. But ever since I discovered that || can be used as a selector operator, I’ve been using that more. I find my code so much easier to read 👍

Yes, it takes some time to wrap your head around it. But once you grasp the concept, it’s super handy. Now I don’t think less code makes your code better. But in this instance, I prefer the || operator 🤩

let isHappyHour = '🍺';

// Logical Operator
isHappyHour = isHappyHour || '🍵'; // '🍺'

// Ternary
isHappyHour = isHappyHour ? isHappyHour : '🍵'; // '🍺'

// If/Else
if (isHappyHour) {
  isHappyHour = isHappyHour;
} else {
  isHappyHour = '🍵';
}

console.log(isHappyHour); // '🍺'
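One caveat with || as a default operator: it falls back for every falsy value, not only for null and undefined. For the parent parameter that makes no difference, since a DOM node is always truthy, but for numbers or strings it can. A minimal sketch comparing it with the nullish coalescing operator ?? (the two function names are made up for illustration):

```javascript
// Hypothetical helpers, only here to compare the two fallback styles.
function withOr(value, fallback) {
  return value || fallback; // falls back on any falsy value: 0, '', NaN, false, null, undefined
}

function withNullish(value, fallback) {
  return value ?? fallback; // falls back only on null or undefined
}

console.log(withOr(undefined, 'default'));      // 'default'
console.log(withNullish(undefined, 'default')); // 'default'
console.log(withOr(0, 42));                     // 42
console.log(withNullish(0, 42));                // 0
console.log(withOr('', 'default'));             // 'default'
console.log(withNullish('', 'default'));        // ''
```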

Passing XPathResult.FIRST_ORDERED_NODE_TYPE ensures that the result will be the first matching node in document order, which is the order you see in the HTML source code. Since XPathResult.FIRST_ORDERED_NODE_TYPE is passed, result will have a valid singleNodeValue property, which is returned. It contains the actual HTML element we are after, or null if the HTML element cannot be found.

Then to the second bit: the XPath expression in the getArchiveIsInputValue method. It passes a template literal `/html/body//td[text()="${tdText}"]/following-sibling::td/div/input` which has the /html/body bit to speed up XPath query processing considerably (otherwise the whole document tree needs to be walked for td nodes, now only the ones in the <body> of the <html>). It also depends on the value of the tdText parameter. These are valid values for the tdText parameter and the resulting XPath expression that is being assembled:

  • Redirected from field:
    /html/body//td[text()="Redirected from"]/following-sibling::td/div/input
  • Via field:
    /html/body//td[text()="Via"]/following-sibling::td/div/input
  • Original field:
    /html/body//td[text()="Original"]/following-sibling::td/div/input

This will first match <td> elements inside the /html/body path of the document that have the text() function (see below) match the passed value (either "Redirected from", "Via" or "Original").

From that match, it follows /following-sibling::td to get the <td> elements that come after it in the same row, and inside those it follows /div/input to get the <input> element we want. Since we passed XPathResult.FIRST_ORDERED_NODE_TYPE, only the first such <input> is returned.
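To make those steps concrete, here is a rough sketch that walks a simplified model of the table in plain JavaScript. This is not how the browser evaluates XPath: the rows array and the findInputValue function are made up for illustration, and only the cells the expression looks at are modelled (values taken from the HTML above).

```javascript
// Simplified model of the Archive.is header table: each row is an array of cells.
const rows = [
  [{ text: 'Saved from' }, { form: { input: { value: 'http://web.archive.org/web/20151114044109/http://www.example.org/' } } }],
  [{ text: 'Redirected from' }, { div: { input: { value: 'http://web.archive.org/web/2/example.org' } } }],
  [{ text: 'Via' }, { div: { input: { value: 'http://www.example.org/' } } }, { time: '2015-11-14T04:41:09Z' }],
  [{ text: 'Original' }, { div: { input: { value: 'http://example.org/' } } }, { time: '2015-11-14T04:41:09Z' }],
];

function findInputValue(tableRows, tdText) {
  for (const cells of tableRows) {
    // td[text()="..."]: a cell whose text is exactly tdText
    const index = cells.findIndex(cell => cell.text === tdText);
    if (index === -1) continue;
    // following-sibling::td: every cell after it in the same row
    const followingSiblings = cells.slice(index + 1);
    // /div/input: keep only siblings that hold a div with an input directly inside
    const match = followingSiblings.find(cell => cell.div?.input);
    // FIRST_ORDERED_NODE_TYPE: stop at the first match in document order
    if (match) return match.div.input.value;
  }
  return undefined; // mirrors input?.value when nothing matched
}

console.log(findInputValue(rows, 'Via'));        // 'http://www.example.org/'
console.log(findInputValue(rows, 'Original'));   // 'http://example.org/'
console.log(findInputValue(rows, 'Saved from')); // undefined
```

The Saved from row yields nothing because its <input> sits inside a <form> and a nested table rather than directly inside a <div>, so the /div/input step does not match it.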

These helped me understand following-sibling:::

You see that XPath is already fairly complex. Luckily, this is XPath version 1.0, as virtually no web browser supports any higher XPath version; see what [Wayback/Archive] Andriy Ivaneyko answered in [Wayback/Archive] xml – What browsers support Xpath 2.0? – Stack Overflow (question by [Wayback/Archive] User Default – Stack Overflow):

Majority of the browsers do not support XPATH 2.0, please see Comparison of layout engines to get more information.
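One XPath 1.0 limitation that matters for the template literal in getArchiveIsInputValue: string literals have no escape sequence, so a value containing a double quote would break the expression. The field names used here contain no quotes, so the bookmarklets are fine as they are. Below is a hedged sketch of a more general approach using the XPath concat() function; the names toXPathLiteral and buildArchiveIsXPath are made up for this example.

```javascript
// Hypothetical helper: wrap arbitrary text in a valid XPath 1.0 string literal.
function toXPathLiteral(text) {
  if (!text.includes('"')) return `"${text}"`;
  if (!text.includes("'")) return `'${text}'`;
  // Both quote kinds present: split on " and glue the pieces back with concat().
  const parts = text.split('"').map(part => `"${part}"`);
  return `concat(${parts.join(`, '"', `)})`;
}

// Hypothetical helper: same expression as in getArchiveIsInputValue, with safe quoting.
function buildArchiveIsXPath(tdText) {
  return `/html/body//td[text()=${toXPathLiteral(tdText)}]/following-sibling::td/div/input`;
}

console.log(buildArchiveIsXPath('Via'));
// /html/body//td[text()="Via"]/following-sibling::td/div/input
console.log(toXPathLiteral(`a"b'c`));
// concat("a", '"', "b'c")
```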

On the text() function:

  • [Wayback/Archive] XML Path Language (XPath): location paths also has a few bits on following-sibling:::
    • following-sibling::chapter[position()=1] selects the next chapter sibling of the context node
    • the following-sibling axis contains all the following siblings of the context node; if the context node is an attribute node or namespace node, the following-sibling axis is empty

You see that the XPath selection language is big. It might still be bigger than CSS selectors. Even on MDN there is much information:

I have the luck of having prior knowledge from doing XPath related work with XML in the past. Many are not so lucky, hence the documentation references I made here.

MDN is not even the standard, which is at [Wayback/Archive] XML Path Language (XPath). That is incomplete too, hence a truckload of books, videos and web-sites cover it. The MDN links above reference a few good ones. Enjoy them!

Be prepared for oddities like the one explained by [Wayback/Archive] Mathias Müller in [Wayback/Archive] selenium – XPath: difference between dot and text() – Stack Overflow (question by [Wayback/Archive] Andersson):

the meaning of the two predicates (everything between [ and ]) is different. [text()="Ask Question"] actually means: return true if any of the text nodes of an element contains exactly the text “Ask Question”. On the other hand, [.="Ask Question"] means: return true if the string value of an element is identical to “Ask Question”.
In the XPath model, text inside XML elements can be partitioned into a number of text nodes if other elements interfere with the text,
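The difference between the two predicates can be mimicked with a tiny node model in plain JavaScript. This is only a sketch of the semantics described in the quote, not an XPath engine: a text node is modelled as a string, an element as an object with a children array, and all names are made up for illustration.

```javascript
// String value of a node: for an element, all descendant text concatenated.
const stringValue = node =>
  typeof node === 'string' ? node : node.children.map(stringValue).join('');

// [.="..."]: compare the string value of the whole element.
const matchesDot = (element, text) => stringValue(element) === text;

// [text()="..."]: true if any direct text-node child equals the text.
const matchesText = (element, text) =>
  element.children.some(child => typeof child === 'string' && child === text);

// Models <a>Ask <b>Question</b></a>: the text is split over two text nodes.
const split = { children: ['Ask ', { children: ['Question'] }] };
// Models <a>Ask Question</a>: one single text node.
const plain = { children: ['Ask Question'] };

console.log(matchesDot(split, 'Ask Question'));  // true
console.log(matchesText(split, 'Ask Question')); // false
console.log(matchesDot(plain, 'Ask Question'));  // true
console.log(matchesText(plain, 'Ask Question')); // true
```

For the Archive.is table this distinction does not bite, as each label cell holds just one text node.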

In the same question, [Wayback/Archive] ggorlen answers [Wayback/Archive] this:

Although many browsers have $x(xPath) as a console built-in, here’s an aggregation of the useful-but-hardcoded snippets from [Wayback/Archive] Introduction to using XPath in JavaScript ready for use in scripts:

Snapshot

This gives a one-off snapshot of the xpath result set. Data may be stale after DOM mutations.

const $x = xp => {
  const snapshot = document.evaluate(
    xp, document, null, 
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
  );
  return [...Array(snapshot.snapshotLength)]
    .map((_, i) => snapshot.snapshotItem(i))
  ;
};

console.log($x('//h2[contains(., "foo")]'));
<h2>foo</h2>
<h2>foobar</h2>
<h2>bar</h2>

First ordered node

 

const $xOne = xp => 
  document.evaluate(
    xp, document, null,
    XPathResult.FIRST_ORDERED_NODE_TYPE, null
  ).singleNodeValue
;

console.log($xOne('//h2[contains(., "foo")]'));
<h2>foo</h2>
<h2>foobar</h2>
<h2>bar</h2>

Iterator

Note however, that if the document is mutated (the document tree is modified) between iterations that will invalidate the iteration and the invalidIteratorState property of XPathResult is set to true, and a NS_ERROR_DOM_INVALID_STATE_ERR exception is thrown.

 

function *$xIter(xp) {
  const iter = document.evaluate(
    xp, document, null, 
    XPathResult.ORDERED_NODE_ITERATOR_TYPE, null
  );

  for (;;) {
    const node = iter.iterateNext();
    
    if (!node) {
      break;
    }
    
    yield node;
  }
}

// dump to array
console.log([...$xIter('//h2[contains(., "foo")]')]);

// return next item from generator
const xpGen = $xIter('//h2[text()="foo"]');
console.log(xpGen.next().value);
<h2>foo</h2>
<h2>foobar</h2>
<h2>bar</h2>

Back to the code (:

Based on the above two helper methods, I made these three bookmarklets:

  • Redirected from field:
    javascript:{
    function getElementByXPath(xpath, parent) { /* use when expecting zero or one result; default value for parent is document */
      const result = document.evaluate(xpath, parent || document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
      return result.singleNodeValue;
    }
    
    function getArchiveIsInputValue(tdText) {
      const input = getElementByXPath(`/html/body//td[text()="${tdText}"]/following-sibling::td/div/input`);
      return input?.value;
    }
    
    open(getArchiveIsInputValue("Redirected from"))
    }
  • Via field:
    javascript:{
    function getElementByXPath(xpath, parent) { /* use when expecting zero or one result; default value for parent is document */
      const result = document.evaluate(xpath, parent || document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
      return result.singleNodeValue;
    }
    
    function getArchiveIsInputValue(tdText) {
      const input = getElementByXPath(`/html/body//td[text()="${tdText}"]/following-sibling::td/div/input`);
      return input?.value;
    }
    
    open(getArchiveIsInputValue("Via"))
    }
  • Original field:
    javascript:{
    function getElementByXPath(xpath, parent) { /* use when expecting zero or one result; default value for parent is document */
      const result = document.evaluate(xpath, parent || document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
      return result.singleNodeValue;
    }
    
    function getArchiveIsInputValue(tdText) {
      const input = getElementByXPath(`/html/body//td[text()="${tdText}"]/following-sibling::td/div/input`);
      return input?.value;
    }
    
    open(getArchiveIsInputValue("Original"))
    }
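The three bookmarklets differ only in the field name, so they could also be generated. The following is a sketch under assumptions, not the way I built the ones above: makeArchiveIsBookmarklet is a made-up name, the generated code skips the helper functions and inlines the document.evaluate call, and it only calls open when the field was actually found.

```javascript
// Hypothetical generator: builds one bookmarklet URL per Archive.is field name.
function makeArchiveIsBookmarklet(tdText) {
  const xpath = `/html/body//td[text()="${tdText}"]/following-sibling::td/div/input`;
  // JSON.stringify turns the expression into a safely quoted JavaScript string literal.
  const source = [
    '(function(){',
    `var r=document.evaluate(${JSON.stringify(xpath)},document,null,XPathResult.FIRST_ORDERED_NODE_TYPE,null);`,
    'var i=r.singleNodeValue;',
    'if(i&&i.value){open(i.value);}', // do nothing when the field is missing
    '})()',
  ].join('');
  // Percent-encode so the code survives being stored as a single-line URL.
  return 'javascript:' + encodeURIComponent(source);
}

for (const field of ['Redirected from', 'Via', 'Original']) {
  console.log(field + ': ' + makeArchiveIsBookmarklet(field));
}
```

Each printed line can be pasted as the URL of a new bookmark. Because the generated source has no line comments and no line breaks, it does not depend on how the browser treats newlines in a bookmark URL.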

This was a lot more work than the CSS selector based code!

Here too, the mapping between HTML elements and the DOM interface is crucial. A few more links here:

JSFiddle and gist!

The project to fiddle around with the base functions: [Wayback/Archive] https://jsfiddle.net/h0495krm/

I also saved the base functions in [Wayback/Archive] Browser JavaScript getElement(s)ByXPath functions.

Moar references

If your HTML is XHTML (i.e. XML that is at least well-formed), then [Wayback/Archive] XPath online real-time tester, evaluator and generator for XML & HTML is cool (contrary to the title: if your HTML is not well-formed XML, then it only works on the bits that are well-formed).

Just a few questions are tagged as [Wayback/Archive] 'document.evaluate' Questions – Stack Overflow.

These are in the gist but not in the above post:

–jeroen



// based on:
// – https://codereview.stackexchange.com/questions/167571/evaluating-an-xpath-with-document-evaluate-to-get-an-array-of-nodes
// – https://stackoverflow.com/questions/10596417/is-there-a-way-to-get-element-by-xpath-using-javascript-in-selenium-webdriver
// – https://developer.mozilla.org/en-US/docs/Web/XPath/Introduction_to_using_XPath_in_JavaScript
// – https://developer.mozilla.org/en-US/docs/Web/API/Document/evaluate#result_types
// – https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Logical_OR
function getElementsByXPath(xpath, parent) { // use when expecting zero or more results; default value for parent is document
  const result = document.evaluate(xpath, parent || document, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);
  const nodes = [];
  let node;
  while ((node = result.iterateNext())) // iterateNext() returns null when there are no more nodes
    nodes.push(node);
  return nodes;
}

function getElementByXPath(xpath, parent) { // use when expecting zero or one result; default value for parent is document
  const result = document.evaluate(xpath, parent || document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
  return result.singleNodeValue;
}
