Posted by jpluimers on 2022/12/14
An interesting thought: [Archive] Kristian Köhntopp on Twitter: “Basically, show me a Python regex with \d and without ASCII flag, and I can show you a bug, often exploitable.…”
Basically, input parsing is still very much underrated by most systems and a constant source of peculiarities and therefore bugs, or phrased differently: [Archive] Kristian Köhntopp on Twitter: “In many cases an uncaught exception, and hence a component crash.…”
Kris also states: [Archive] Kristian Köhntopp on Twitter: “Again, Python is not alone in this. Perl, when “use utf8;” is active (which it should) also does this, so every single fucking Regex needs a ‘/a’ at the end. Nobody ever asked \d to match tengwar or klingon numeric symbols.…”
The point is in the last few words: Arabic numerals are so widespread over the world that the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 they represent should be the de facto \d pattern, but they aren’t in Python, as per the \d entry in [Wayback/Archive] re — Regular expression operations — Python 3.10.0 documentation.
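A minimal Python sketch of the behaviour discussed above, not taken from the original thread: it compares \d with and without the re.ASCII flag on a few digit strings. The sample strings and the names `samples`, `classify` and `results` are assumptions chosen for illustration.

```python
import re

# Sample inputs (illustrative): ASCII, Arabic-Indic, Devanagari and fullwidth digits.
samples = ["42", "\u0664\u0662", "\u096a\u0968", "\uff14\uff12"]

unicode_digits = re.compile(r"\d+")          # default for str patterns: any Unicode decimal digit
ascii_digits = re.compile(r"\d+", re.ASCII)  # with the ASCII flag: only 0-9


def classify(text):
    """Return (matches without ASCII flag, matches with ASCII flag)."""
    return (
        unicode_digits.fullmatch(text) is not None,
        ascii_digits.fullmatch(text) is not None,
    )


results = {text: classify(text) for text in samples}

for text, (loose, strict) in results.items():
    print(f"{text!r}: \\d={loose}  \\d with re.ASCII={strict}")
```

Only the first sample passes both checks; the other three pass the default \d but fail once re.ASCII is set, which is the gap a later parsing step can trip over.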
Posted in Software Development, Development, RegEx, Scripting, Perl, Python
Posted by jpluimers on 2022/02/03
I needed to search for IBAN numbers in documents and used this regular expression: [a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2} which supports the usual optional whitespace like in NL12 INGB 0345 6789 01.
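A short sketch of how that expression could be used from Python, assuming the documents are already available as plain text. The sample sentence and the names `IBAN_PATTERN`, `text` and `found` are made up for illustration.

```python
import re

# The expression from the post, unchanged.
IBAN_PATTERN = re.compile(
    r"[a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2}"
)

# Illustrative input: one IBAN written with spaces, one without.
text = "Please transfer to NL12 INGB 0345 6789 01 or NL12INGB0345678901 today."

found = IBAN_PATTERN.findall(text)
print(found)
```

The pattern only checks the 18-character shape of the example above; it does not verify the check digits, and IBANs from countries that use a different length would need a different pattern.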
The regular expression is based on the nice list and table of RegEx character classes that Notepad++ supports, documented at [Wayback] Searching | Notepad++ User Manual:
Character Classes
[set] ⇒ This indicates a set of characters, for example, [abc] means any of the literal characters a, b or c. You can also use ranges by doing a hyphen between characters, for example [a-z] for any character from a to z. You can use a collating sequence in character ranges, like in [[.ch.]-[.ll.]] (these are collating sequence in Spanish).
[^set] ⇒ The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character. Care should be taken with a complement list, as regular expressions are always multi-line, and hence [^ABC]* will match until the first A, B or C (or a, b or c if match case is off), including any newline characters. To confine the search to a single line, include the newline characters in the exception list, e.g. [^ABC\r\n].
Please note that the complement of a character set is often many more characters than you expect: (?-s)[^x]+ will match 1 or more instances of any non-x character, including newlines: the (?-s) search modifier turns off “dot matches newlines”, but the [^x] is not a dot ., so that class is still allowed to match newlines.
[[:name:]] or [[:☒:]] ⇒ The whole character class named name. For many, there is also a single-letter “short” class name, ☒. Please note: the [:name:] and [:☒:] must be inside a character class [...] to have their special meaning.
short | full name | description | equivalent character class
------|-----------|-------------|----------------------------
 | alnum | letters and digits |
 | alpha | letters |
h | blank | spacing which is not a line terminator | [\t\x20\xA0]
 | cntrl | control characters | [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]
d | digit | digits |
 | graph | graphical character, so essentially any character except for control chars, \0x7F , \x80 |
l | lower | lowercase letters |
 | print | printable characters | [\s[:graph:]]
 | punct | punctuation characters | [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_ {
s | space | whitespace (word or line separator) | [\t\n\x0B\f\r\x20\x85\xA0\x{2028}\x{2029}]
u | upper | uppercase letters |
 | unicode | any character with code point above 255 | [\x{0100}-\x{FFFF}]
w | word | word characters | [_\d\l\u]
 | xdigit | hexadecimal digits | [0-9A-Fa-f]
Note that letters include any unicode letters (ASCII letters, accented letters, and letters from a variety of other writing systems); digits include ASCII numeric digits, and anything else in Unicode that’s classified as a digit (like superscript numbers ¹²³…).
Note that those character class names may be written in upper or lower case without changing the results. So [[:alnum:]] is the same as [[:ALNUM:]] or the mixed-case [[:AlNuM:]].
As stated earlier, the [:name:] and [:☒:] (note the single brackets) must be a part of a surrounding character class. However, you may combine them inside one character class, such as [_[:d:]x[:upper:]=], which is a character class that would match any digit, any uppercase, the lowercase x, and the literal _ and = characters. These named classes won’t always appear with the double brackets, but they will always be inside of a character class.
If the [:name:] or [:☒:] are accidentally not contained inside a surrounding character class, they will lose their special meaning. For example, [:upper:] is the character class matching :, u, p, e, and r; whereas [[:upper:]] is similar to [A-Z] (plus other unicode uppercase letters).
[^[:name:]] or [^[:☒:]] ⇒ The complement of character class named name or ☒ (matching anything not in that named class). This uses the same long names, short names, and rules as mentioned in the previous description.
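The warning above about complemented sets matching newlines can be tried out quickly. The sketch below uses Python's re module rather than the Boost engine inside Notepad++, on the assumption that this particular behaviour is the same in both; the sample text and variable names are made up for illustration.

```python
import re

# Illustrative three-line input; the only x is on the last line.
text = "abc\ndef\nxyz"

# Without DOTALL the dot stops at the first newline...
dot_run = re.match(r".+", text).group()

# ...but a complemented set is not a dot, so it runs across line breaks.
complement_run = re.match(r"[^x]+", text).group()

# Adding the newline characters to the set confines the match to one line.
single_line_run = re.match(r"[^x\r\n]+", text).group()

print(repr(dot_run), repr(complement_run), repr(single_line_run))
```

Note that Python's re does not support the POSIX-style [[:name:]] classes from the table, so those cannot be tested this way.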
–jeroen
Posted in Development, Notepad++, Power User, RegEx, Software Development, Text Editors
Posted by jpluimers on 2021/07/19
[WayBack] windows – Is there any sed like utility for cmd.exe? – Stack Overflow
TL;DR: many people suggest using PowerShell, but GNU sed is also available through Chocolatey.
The Chocolatey part:
The PowerShell part: read the other answers from the above question.
–jeroen
Posted in *nix, *nix-tools, CommandLine, Power User, PowerShell, RegEx, sed, Windows
Posted by jpluimers on 2021/07/16
Everyone can learn from an outage. Cloudflare shows how to do it right, for instance after the RegEx-gone-wild downtime two years ago.
So it’s time to link to that one again: [WayBack] Details of the Cloudflare outage on July 2, 2019
More like these at [WayBack] Post Mortem – The Cloudflare Blog.
More on evaluating regular expressions in linear time:
- [WayBack] Regular Expression Search Algorithm KEN THOMPSON Bell Telephone Laboratories, Inc., Murray Hill, New Jersey
- [WayBack] Programming Techniques: Regular expression search algorithm
A method for locating specific character strings embedded in character text is described and an implementation of this method in the form of a compiler is discussed. The compiler accepts a regular expression as source language and produces an IBM 7094 program as object language. The object program then accepts the text to be searched as input and produces a signal every time an embedded string in the text matches the given regular expression. Examples, problems, and solutions are also presented.
Programming Techniques: Regular expression search algorithm
Author: Ken Thompson, Bell Telephone Labs, Inc., Murray Hill
Published in: Communications of the ACM, Volume 11, Issue 6, June 1968, pages 419-422. ACM, New York, NY, USA. doi: 10.1145/363347.363387
- Thompson’s construction – Wikipedia
is a method of transforming a regular expression into an equivalent nondeterministic finite automaton (NFA)
The algorithm works recursively by splitting an expression into its constituent subexpressions, from which the NFA will be constructed using a set of rules. More precisely, from a regular expression E, the obtained automaton A with the transition function δ respects the following properties:
- A has exactly one initial state q0, which is not accessible from any other state. That is, for any state q and any letter a, δ(q, a) does not contain q0.
- A has exactly one final state qf, which is not co-accessible from any other state. That is, for any letter a, δ(qf, a) = ∅.
- Let c be the number of concatenations of the regular expression E and let s be the number of symbols apart from parentheses, that is, |, *, a and ε. Then, the number of states of A is 2s − c (linear in the size of E).
- The number of transitions leaving any state is at most two.
- Since an NFA of m states and at most e transitions from each state can match a string of length n in time O(emn), a Thompson NFA can do pattern matching in linear time, assuming a fixed-size alphabet.
- [WayBack] A Regular Expression Matcher Code by Rob Pike Exegesis by Brian Kernighan
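The linear-time idea from the references above can be sketched in a few lines of Python. This is a simplified illustration, not Thompson's original compiler and not Rob Pike's matcher: it builds an NFA with at most two outgoing edges per state and then tracks the set of active states, so each input character is handled once. It assumes a tiny syntax (literals, `.`, `*`, `|`, parentheses), does not validate malformed patterns, and matches the whole text; the names `State`, `compile_nfa`, `closure` and `matches` are made up here.

```python
class State:
    def __init__(self, char=None):
        self.char = char  # character to consume ("." = any); None = epsilon or final state
        self.out = []     # successor states, at most two


def compile_nfa(pattern):
    """Return (start, final) of an NFA for the pattern."""
    pos = 0

    def parse_alt():
        nonlocal pos
        start, end = parse_concat()
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            start2, end2 = parse_concat()
            split, join = State(), State()
            split.out = [start, start2]  # try either branch
            end.out.append(join)
            end2.out.append(join)
            start, end = split, join
        return start, end

    def parse_concat():
        start = end = State()  # empty fragment to chain onto
        while pos < len(pattern) and pattern[pos] not in "|)":
            start2, end2 = parse_star()
            end.out.append(start2)
            end = end2
        return start, end

    def parse_star():
        nonlocal pos
        start, end = parse_atom()
        while pos < len(pattern) and pattern[pos] == "*":
            pos += 1
            split, join = State(), State()
            split.out = [start, join]  # enter the loop or skip it
            end.out = [start, join]    # repeat the loop or leave it
            start, end = split, join
        return start, end

    def parse_atom():
        nonlocal pos
        ch = pattern[pos]
        pos += 1
        if ch == "(":
            start, end = parse_alt()
            pos += 1  # skip the closing ")"
            return start, end
        start, end = State(ch), State()
        start.out = [end]
        return start, end

    return parse_alt()


def closure(states):
    """All states reachable from the given ones via epsilon edges only."""
    stack, seen = list(states), set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        if state.char is None:
            stack.extend(state.out)
    return seen


def matches(pattern, text):
    """True if the whole text matches; work per character is bounded by the NFA size."""
    start, final = compile_nfa(pattern)
    current = closure([start])
    for ch in text:
        current = closure(
            [s.out[0] for s in current if s.char is not None and s.char in (ch, ".")]
        )
    return final in current


print(matches("a(b|c)*d", "abcbd"))
print(matches(".*.*=.*", "x" * 2000))  # no "=" in the input, still answered in one pass
```

The second call uses a pattern with several adjacent `.*` parts, the shape that is expensive for backtracking engines on non-matching input; here the set of active states can never grow beyond the number of NFA states, so the input is scanned once.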
Via [WayBack] Details of the Cloudflare outage on July 2, 2019 | Hacker News
–jeroen
Posted in Algorithms, Development, Power User, RegEx, Software Development
Posted by jpluimers on 2021/06/30
For me this unaccepted answer from [WayBack] Regex for a file name without an extension – Stack Overflow by [WayBack] Bohemian worked best:
Assuming the extensions are up to 4 chars in length (so filenames like mr.smith aren’t considered as having an extension, but mr.smith.doc and mr.smith.html are considered as having extensions):
^.*[^.]{5}$
No need to capture a group, as the whole expression is what you want – ie group 0.
Depending on the extension length, increase 5 to, say, 7 for 6-character extensions (it’s always N+1 when you want to match extensions of N characters).
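A small Python sketch of the N+1 rule above. The helper name `has_no_extension` and its `max_ext_len` parameter are made up for illustration; the pattern itself is the one from the answer with the repeat count filled in.

```python
import re


def has_no_extension(name, max_ext_len=4):
    """True if the last max_ext_len + 1 characters of name contain no dot."""
    pattern = re.compile(r"^.*[^.]{%d}$" % (max_ext_len + 1))
    return pattern.search(name) is not None


for name in ["mr.smith", "mr.smith.doc", "mr.smith.html", "abc"]:
    print(name, has_no_extension(name))

# Raising the limit to 6 characters turns the count into 7.
print(has_no_extension("archive.backup", max_ext_len=6))
```

One limitation worth knowing: a name shorter than N+1 characters, such as abc, can never match the pattern, so it is not reported as extension-less even though it has no dot.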
–jeroen
Posted in Development, RegEx, Software Development