The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


aha (Ansi HTML Adapter) with clickable URIs

Posted by jpluimers on 2018/10/02

aha is great for generating HTML from ANSI text (i.e. the coloured output on a Linux console).

But it doesn’t generate clickable URIs (it can’t yet do so by itself, as it only looks one character ahead).

The thread at https://github.com/theZiz/aha/issues/20 suggested a case-insensitive regex through sed but the exact suggestion failed for a few reasons I will explain below.

First the bash function (requires both aha and perl):


#!/usr/bin/env bash
# based on https://github.com/theZiz/aha/issues/20#event-797466520
aha-with-expanded-http-https-urls()
{
aha | perl -C -Mutf8 -pe 's,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),$1<a href="$2">$2</a>$4,gi'
}

The above script is a gist as WordPress regularly fucks up text that remotely resembles html.

The drawbacks of the original solution (sed replacement before running aha):

  1. aha would replace the generated < and > characters in the anchor element with &lt; and &gt; so the regular expression would not work
  2. after moving aha in front of sed I found out that on Mac OS X, the I option is not supported: you will get a bad flag in substitute command: 'I' when executing sed 's,\(https\?://[^ ]*\),<a href="\1">\1</a>,gI'
  3. after an initial port of the regular expression replacement to perl I found out it replaced too much (as it now operated on aha generated html) which made even perl -C -Mutf8 -pe 's,([^"])((https?|s?ftp|ftps?|file)://[^\s]*),$1<a href="$2">$2</a>,gi' fail

To cut a long story short, here is a bash function that works and you can pipe Ansi output through:

aha-with-expanded-http-https-urls()
{
  aha | perl -C -Mutf8 -pe 's,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),$1<a href="$2">$2</a>$4,gi'
}

It doesn’t do RFC-compliant URI validation by regex, as that’s way too convoluted. If anyone wants that, adapt it according to the answers at http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url

The biggest problem was to ensure it would skip the &quot; terminating a URI at the end of the line. This can occur in the testssl.sh output upon a 302 redirect. So the solution is somewhat tailored to testssl.sh output piped through aha.

A lot of digging finally resulted in this expression at https://regex101.com/r/zF3zQ2/2. Note that the site drops the , used as search separator, but that’s OK: you can use the drop-down to choose another delimiter, or paste this full expression and it will happily use the , separator:

s,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),$1<a href="$2">$2</a>$4,gi

Getting there, one of the things I tried was negative lookahead, but that failed. I tried following, for instance, the example at http://stackoverflow.com/questions/11028336/regex-to-match-a-pattern-and-exclude-list-of-string

So in the above solution, I went for a non-greedy .*? expression followed by matching either whitespace or the &quot; followed by whitespace.

These are the separator, search and modifier parts of the above expression:

,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),gi

Note the 2nd capturing group cannot do without the 3rd in order to match multiple protocols.

This is how it’s assembled:

  • 1st Capturing group ([^"])
    • [^"] match a single character not present in the list below
      • " a single character in the list " literally (case insensitive)
  • 2nd Capturing group ((https?|s?ftp|ftps?|file)://.*?)
    • 3rd Capturing group (https?|s?ftp|ftps?|file)
      • 1st Alternative: https?
        • http matches the characters http literally (case insensitive)
        • s? matches the character s literally (case insensitive)
          • Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
      • 2nd Alternative: s?ftp
        • s? matches the character s literally (case insensitive)
          • Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
        • ftp matches the characters ftp literally (case insensitive)
      • 3rd Alternative: ftps?
        • ftp matches the characters ftp literally (case insensitive)
        • s? matches the character s literally (case insensitive)
          • Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
      • 4th Alternative: file
        • file matches the characters file literally (case insensitive)
    • :// matches the characters :// literally
    • .*? matches any character (except newline)
      • Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
  • 4th Capturing group ([\s]|\&quot;\s)
    • 1st Alternative: [\s]
      • [\s] match a single character present in the list below
        • \s match any white space character [\r\n\t\f ]
    • 2nd Alternative: \&quot;\s
      • \& matches the character & literally
      • quot; matches the characters quot; literally (case insensitive)
      • \s match any white space character [\r\n\t\f ]
  • g modifier: global. All matches (don’t return on first match)
  • i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])

For replacement it’s important to ensure all unique capturing groups end up in the output. Which means you can skip $3 (as it’s part of $2) but have to include the others.

Which gets me to the replacement part of the expression:

$1<a href="$2">$2</a>$4

Test input:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- This file was created with the aha Ansi HTML Adapter. http://ziz.delphigl.com/tool_aha.php -->
<html xmlns="http://www.w3.org/1999/xhtml">
    testssl.sh       2.7dev from https://testssl.sh/dev/
<span style="font-weight:bold;"> OCSP URI                     </span>http://clients1.google.com/ocsp
<span style="font-weight:bold;"> HTTP Status Code           </span>  302 Found, redirecting to &quot;https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg&quot;

Test output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- This file was created with the aha Ansi HTML Adapter. <a href="http://ziz.delphigl.com/tool_aha.php">http://ziz.delphigl.com/tool_aha.php</a> -->
<html xmlns="http://www.w3.org/1999/xhtml">
    testssl.sh       2.7dev from <a href="https://testssl.sh/dev/">https://testssl.sh/dev/</a>
<span style="font-weight:bold;"> OCSP URI                     </span><a href="http://clients1.google.com/ocsp">http://clients1.google.com/ocsp</a>
<span style="font-weight:bold;"> HTTP Status Code           </span>  302 Found, redirecting to &quot;<a href="https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg">https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg</a>&quot;

Test matches:

MATCH 1
1.  [168-169]   ` `
2.  [169-205]   `http://ziz.delphigl.com/tool_aha.php`
3.  [169-173]   `http`
4.  [205-206]   ` `
MATCH 2
1.  [286-287]   ` `
2.  [287-310]   `https://testssl.sh/dev/`
3.  [287-292]   `https`
4.  [310-311]   `
`
MATCH 3
1.  [379-380]   `>`
2.  [380-411]   `http://clients1.google.com/ocsp`
3.  [380-384]   `http`
4.  [411-412]   `
`
MATCH 4
1.  [512-513]   `;`
2.  [513-575]   `https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg`
3.  [513-518]   `https`
4.  [575-582]   `&quot;
`

Enjoy!

–jeroen
