The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Archive for the ‘RegEx’ Category

Kristian Köhntopp on Twitter: “Basically, show me a Python regex with \d and without ASCII flag, and I can show you a bug, often exploitable.… “

Posted by jpluimers on 2022/12/14

An interesting thought: [Archive] Kristian Köhntopp on Twitter: “Basically, show me a Python regex with \d and without ASCII flag, and I can show you a bug, often exploitable.… “

Basically, input parsing is still badly underestimated in most systems and is a constant source of peculiarities, and therefore of bugs. Or, phrased differently: [Archive] Kristian Köhntopp on Twitter: “In many cases an uncaught exception, and hence a component crash.… “

Kris also states [Archive] Kristian Köhntopp on Twitter: “Again, Python is not alone in this. Perl, when “use utf8;” is active (which it should) also does this, so every single fucking Regex needs a ‘/a‘ at the end. Nobody ever asked \d to match tengwar or klingon numeric symbols.… “.

The point is in the last few words: Arabic numerals are so widespread around the world that the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 should be the de facto \d pattern, but in Python they aren’t, as per [Wayback/Archive] re — Regular expression operations — Python 3.10.0 documentation: \d (emphasis mine):

Read the rest of this entry »
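The \d behaviour described in this excerpt is easy to reproduce. Below is a minimal sketch, assuming Python 3 and str (not bytes) patterns; the sample string and the variable names are made up for illustration:

```python
import re

# Hypothetical sample: ASCII "42", Arabic-Indic "42", fullwidth "42".
# Escapes are used so the source stays readable in any editor.
text = "42 \u0664\u0662 \uff14\uff12"

# Default for str patterns: \d matches any Unicode decimal digit.
unicode_digits = re.findall(r"\d+", text)

# With the ASCII flag, \d only matches the ten digits 0-9.
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)

print(len(unicode_digits))  # 3
print(ascii_digits)         # ['42']
```

Passing re.ASCII (or the inline (?a) flag), or spelling the class out as [0-9], restricts the match to the ten ASCII digits.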

Posted in Software Development, Development, RegEx, Scripting, Perl, Python

RegEx character classes in “Searching | Notepad++ User Manual”

Posted by jpluimers on 2022/02/03

I needed to search for IBAN numbers in documents and used this regular expression: [a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2} which supports the usual optional whitespace like in NL12 INGB 0345 6789 01.
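The same expression can be tried outside Notepad++. The sketch below uses Python's re module, which is a different engine than the one in Notepad++, so treat it as an illustration only; the sample sentence and the names IBAN_PATTERN and sample are assumptions. Note that the pattern only covers the 18-character layout of the example above and does not verify the check digits.

```python
import re

# The expression from the post, unchanged.
IBAN_PATTERN = re.compile(
    r"[a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2}"
)

# Hypothetical document text with a spaced and an unspaced IBAN.
sample = "Pay to NL12 INGB 0345 6789 01 or NL12INGB0345678901 today."

matches = IBAN_PATTERN.findall(sample)
print(matches)  # ['NL12 INGB 0345 6789 01', 'NL12INGB0345678901']
```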

The expression is based on a nice table of the RegEx character classes that Notepad++ supports, at [Wayback] Searching | Notepad++ User Manual:

Character Classes
  • [set] ⇒ This indicates a set of characters, for example, [abc] means any of the literal characters a, b or c. You can also use ranges by putting a hyphen between characters, for example [a-z] for any character from a to z. You can use a collating sequence in character ranges, like in [[.ch.]-[.ll.]] (these are collating sequences in Spanish).
  • [^set] ⇒ The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character. Care should be taken with a complement list, as regular expressions are always multi-line, and hence [^ABC]* will match until the first A, B or C (or a, b or c if match case is off), including any newline characters. To confine the search to a single line, include the newline characters in the exception list, e.g. [^ABC\r\n].

Please note that the complement of a character set is often many more characters than you expect: (?-s)[^x]+ will match 1 or more instances of any non-x character, including newlines: the (?-s) search modifier turns off “dot matches newlines”, but the [^x] is not a dot ., so that class is still allowed to match newlines.

  • [[:name:]] or [[:☒:]] ⇒ The whole character class named name. For many, there is also a single-letter “short” class name, ☒. Please note: the [:name:] and [:☒:] must be inside a character class [...] to have their special meaning.
    short | full name | description | equivalent character class
          | alnum     | letters and digits |
          | alpha     | letters |
    h     | blank     | spacing which is not a line terminator | [\t\x20\xA0]
          | cntrl     | control characters | [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]
    d     | digit     | digits |
          | graph     | graphical character, so essentially any character except for control chars, \0x7F, \x80 |
    l     | lower     | lowercase letters |
          | print     | printable characters | [\s[:graph:]]
          | punct     | punctuation characters | [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_{
    s     | space     | whitespace (word or line separator) | [\t\n\x0B\f\r\x20\x85\xA0\x{2028}\x{2029}]
    u     | upper     | uppercase letters |
          | unicode   | any character with code point above 255 | [\x{0100}-\x{FFFF}]
    w     | word      | word characters | [_\d\l\u]
          | xdigit    | hexadecimal digits | [0-9A-Fa-f]

    Note that letters include any unicode letters (ASCII letters, accented letters, and letters from a variety of other writing systems); digits include ASCII numeric digits, and anything else in Unicode that’s classified as a digit (like superscript numbers ¹²³…).

    Note that those character class names may be written in upper or lower case without changing the results. So [[:alnum:]] is the same as [[:ALNUM:]] or the mixed-case [[:AlNuM:]].

    As stated earlier, the [:name:] and [:☒:] (note the single brackets) must be a part of a surrounding character class. However, you may combine them inside one character class, such as [_[:d:]x[:upper:]=], which is a character class that would match any digit, any uppercase, the lowercase x, and the literal _ and = characters. These named classes won’t always appear with the double brackets, but they will always be inside of a character class.

    If the [:name:] or [:☒:] are accidentally not contained inside a surrounding character class, they will lose their special meaning. For example, [:upper:] is the character class matching :, u, p, e, and r; whereas [[:upper:]] is similar to [A-Z] (plus other unicode uppercase letters).

  • [^[:name:]] or [^[:☒:]] ⇒ The complement of character class named name or ☒ (matching anything not in that named class). This uses the same long names, short names, and rules as mentioned in the previous description.
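The warning about complement sets matching newlines is not specific to Notepad++. The sketch below shows the same effect with Python's re module, which is a different engine and does not support the [[:name:]] classes; the sample text and variable names are made up for illustration.

```python
import re

# Hypothetical two-line input with an "x" on the second line.
text = "ab\ncdxef"

# [^x] is a negated class, not a dot, so it happily crosses the newline.
across_lines = re.findall(r"[^x]+", text)

# Adding \r and \n to the exception list confines each match to one line.
per_line = re.findall(r"[^x\r\n]+", text)

print(across_lines)  # ['ab\ncd', 'ef']
print(per_line)      # ['ab', 'cd', 'ef']
```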

–jeroen

Posted in Development, Notepad++, Power User, RegEx, Software Development, Text Editors

windows – Is there any sed like utility for cmd.exe? – Stack Overflow

Posted by jpluimers on 2021/07/19

[WayBack] windows – Is there any sed like utility for cmd.exe? – Stack Overflow

TL;DR: many people suggest using PowerShell, but GNU sed is also available through Chocolatey.

The chocolatey part:

The PowerShell part: read the other answers from the above question.

–jeroen

Posted in *nix, *nix-tools, CommandLine, Power User, PowerShell, RegEx, sed, Windows

CloudFlare knows how to do public postmortems on outages

Posted by jpluimers on 2021/07/16

Everyone can learn from an outage. CloudFlare shows how to write a public postmortem properly, for instance the one on the RegEx-going-wild downtime of 2 years ago.

So it’s time to link to that one again: [WayBack] Details of the Cloudflare outage on July 2, 2019

More like these at [WayBack] Post Mortem – The Cloudflare Blog.

More on evaluating regular expressions in linear time:

Via [WayBack] Details of the Cloudflare outage on July 2, 2019 | Hacker News

–jeroen

Posted in Algorithms, Development, Power User, RegEx, Software Development

Regex for a file name without an extension – Stack Overflow

Posted by jpluimers on 2021/06/30

For me this unaccepted answer from [WayBack] Regex for a file name without an extension – Stack Overflow by [WayBack] Bohemian worked best:

Assuming the extensions are up to 4 chars in length (so filenames like mr.smith aren’t considered as having an extension, but mr.smith.doc and mr.smith.html are considered as having extensions):

^.*[^.]{5}$

No need to capture a group, as the whole expression is what you want – ie group 0.

Depending on the maximum extension length, increase the 5: use 7 for extensions of up to 6 characters, for instance (it’s always N+1 when you want to rule out extensions of up to N characters).
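A quick way to check the expression is to run it over a few names. This is a minimal sketch using Python's re module; the list of file names and the identifiers NO_EXTENSION and names are assumptions for illustration. One caveat shows up in the last sample: a name shorter than five characters never matches, even when it has no extension.

```python
import re

# The expression from the answer: the last five characters contain no dot.
NO_EXTENSION = re.compile(r"^.*[^.]{5}$")

# Hypothetical sample file names.
names = ["mr.smith", "mr.smith.doc", "mr.smith.html", "README", "abc"]

without_extension = [name for name in names if NO_EXTENSION.match(name)]
print(without_extension)  # ['mr.smith', 'README']
```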

–jeroen

Posted in Development, RegEx, Software Development

 