The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


Kristian Köhntopp on Twitter: “Basically, show me a Python regex with \d and without ASCII flag, and I can show you a bug, often exploitable.… “

Posted by jpluimers on 2022/12/14

An interesting thought: [Archive] Kristian Köhntopp on Twitter: “Basically, show me a Python regex with \d and without ASCII flag, and I can show you a bug, often exploitable.… “

Basically, input parsing is still very much underrated by most systems and a constant source of peculiarities and therefore bugs, or phrased differently: [Archive] Kristian Köhntopp on Twitter: “In many cases an uncaught exception, and hence a component crash.… “

Kris also states [Archive] Kristian Köhntopp on Twitter: “Again, Python is not alone in this. Perl, when “use utf8;” is active (which it should) also does this, so every single fucking Regex needs a ‘/a‘ at the end. Nobody ever asked \d to match tengwar or klingon numeric symbols.… “.

The point is in the last few words: Arabic numerals are so widespread around the world that the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 they represent should be the de facto \d pattern, but they aren’t in Python, as per [Wayback/Archive] re — Regular expression operations — Python 3.10.0 documentation: \d (emphasis mine):

\d

For Unicode (str) patterns:
Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters. If the ASCII flag is used only [0-9] is matched.
For 8-bit (bytes) patterns:
Matches any decimal digit; this is equivalent to [0-9].
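
The documented difference can be sketched in a few lines. This is only an illustration, assuming Python 3 and an arbitrarily chosen sample string of Arabic-Indic digits:

```python
import re

# ARABIC-INDIC DIGITs one, two, three (U+0661..U+0663), all in category Nd.
arabic_indic = "\u0661\u0662\u0663"

unicode_digits = re.compile(r"\d+")          # default for str patterns: any Nd character
ascii_digits = re.compile(r"\d+", re.ASCII)  # restricted to [0-9]

matches_unicode = unicode_digits.fullmatch(arabic_indic) is not None
matches_ascii = ascii_digits.fullmatch(arabic_indic) is not None
matches_plain = ascii_digits.fullmatch("123") is not None

# The inline flag (?a) has the same effect as passing re.ASCII.
matches_inline = re.fullmatch(r"(?a)\d+", arabic_indic) is not None

# A bytes pattern only ever matches [0-9].
matches_bytes = re.fullmatch(rb"\d+", arabic_indic.encode("utf-8")) is not None

print(matches_unicode, matches_ascii, matches_plain, matches_inline, matches_bytes)
```

Only the first and third checks succeed: without the ASCII flag, the str pattern happily accepts digits that are not 0-9.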

Indeed, [Nd] is A LOT more than 10 digits: 660 according to Unicode character property – Wikipedia: General Category; [Nd]:

Value | Category (major, minor) | Basic type | Character assigned | Count (as of 14.0) | Remarks
N, Number
Nd    | Number, decimal digit   | Graphic    | Character          | 660                | All these, and only these, have Numeric Type = De

And the same count plus all the characters at [Wayback/Archive] Unicode Characters in the ‘Number, Decimal Digit’ Category.
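
If you want to check the count yourself, a rough sketch using the standard unicodedata module follows. Note the assumption: the number you get depends on the Unicode version bundled with your Python build, so it may differ from the 660 quoted for Unicode 14.0.

```python
import sys
import unicodedata

# Collect every code point whose general category is Nd (Number, decimal digit).
nd_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Nd"
]

print(f"Unicode {unicodedata.unidata_version}: {len(nd_chars)} Nd characters")
print("of which not in 0-9:", len(nd_chars) - 10)
```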

Look at the 650 digits other than 0-9: aren’t they amazing, and bound to cause some errors in your code?
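
Here is a minimal sketch of how such an error can surface. The helper to_wire and the "ASCII-only protocol" scenario are hypothetical, made up for illustration: input passes a \d check, then a later step that assumes ASCII blows up.

```python
import re

PORT_LOOSE = re.compile(r"\d{1,5}")             # any Unicode Nd digit passes
PORT_STRICT = re.compile(r"\d{1,5}", re.ASCII)  # only [0-9] passes


def to_wire(text, pattern):
    """Hypothetical helper: validate text, then encode it for an ASCII-only protocol."""
    if not pattern.fullmatch(text):
        return None              # rejected by validation
    return text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input


suspicious = "\u0668\u0660"  # ARABIC-INDIC DIGIT EIGHT, ARABIC-INDIC DIGIT ZERO

strict_result = to_wire(suspicious, PORT_STRICT)  # cleanly rejected

try:
    to_wire(suspicious, PORT_LOOSE)
    loose_outcome = "encoded"
except UnicodeEncodeError:
    loose_outcome = "crashed after validation"

# int() also accepts Nd digits, so different layers can disagree about the input.
as_int = int(suspicious)

print(strict_result, loose_outcome, as_int)
```

With the ASCII flag the input is rejected at the door; without it, the "validated" value travels on until something downstream chokes on it.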

That’s why Kristian Köhntopp tweeted this:

Obligatory Perl documentation:

  • [Wayback/Archive] utf8 – Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code – Perldoc Browser
    The use utf8 pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope. The no utf8 pragma tells Perl to switch back to treating the source text as literal bytes in the current lexical scope. (On EBCDIC platforms, technically it is allowing UTF-EBCDIC, and not UTF-8, but this distinction is academic, so in this document the term UTF-8 is used to mean both).
    Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.
  • [Wayback/Archive] perlre – Perl regular expressions – Perldoc Browser: /a (and /aa):

    /a (and /aa)

    This modifier stands for ASCII-restrict (or ASCII-safe). This modifier may be doubled-up to increase its effect.
    When it appears singly, it causes the sequences \d, \s, \w, and the Posix character classes to match only in the ASCII range. They thus revert to their pre-5.6, pre-Unicode meanings. Under /a, \d always means precisely the digits "0" to "9"; \s means the five characters [ \f\n\r\t], and starting in Perl v5.18, the vertical tab; \w means the 63 characters [A-Za-z0-9_]; and likewise, all the Posix classes such as [[:print:]] match only the appropriate ASCII-range characters.
    This modifier is useful for people who only incidentally use Unicode, and who do not wish to be burdened with its complexities and security concerns.
    With /a, one can write \d with confidence that it will only match ASCII characters, and should the need arise to match beyond ASCII, you can instead use \p{Digit} (or \p{Word} for \w). There are similar \p{...} constructs that can match beyond ASCII both white space (see “Whitespace” in perlrecharclass), and Posix classes (see “POSIX Character Classes” in perlrecharclass). Thus, this modifier doesn’t mean you can’t use Unicode, it means that to get Unicode matching you must explicitly use a construct (\p{}, \P{}) that signals Unicode.
    As you would expect, this modifier causes, for example, \D to mean the same thing as [^0-9]; in fact, all non-ASCII characters match \D, \S, and \W. \b still means to match at the boundary between \w and \W, using the /a definitions of them (similarly for \B).
    Otherwise, /a behaves like the /u modifier, in that case-insensitive matching uses Unicode rules; for example, “k” will match the Unicode \N{KELVIN SIGN} under /i matching, and code points in the Latin1 range, above ASCII will have Unicode rules when it comes to case-insensitive matching.
    To forbid ASCII/non-ASCII matches (like “k” with \N{KELVIN SIGN}), specify the "a" twice, for example /aai or /aia. (The first occurrence of "a" restricts the \d, etc., and the second occurrence adds the /i restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn’t really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.
    To summarize, this modifier provides protection for applications that don’t wish to be exposed to all of Unicode. Specifying it twice gives added protection.
    This modifier may be specified to be the default by use re '/a' or use re '/aa'. If you do so, you may actually have occasion to use the /u modifier explicitly if there are a few regular expressions where you do want full Unicode rules (but even here, it’s best if everything were under feature "unicode_strings", along with the use re '/aa').
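
For comparison on the Python side (this is not part of the Perl documentation quoted above): Python has no separate /a and /aa levels, but its re.ASCII flag covers similar ground for \w, \s and case-insensitive matching. A small sketch, with sample strings picked purely for illustration:

```python
import re

kelvin = "\N{KELVIN SIGN}"  # U+212A, which lowercases to an ASCII "k"
word = "na\u00efve"         # contains LATIN SMALL LETTER I WITH DIAERESIS
nbsp = "\u00a0"             # NO-BREAK SPACE

# \w and \s: Unicode-aware by default, ASCII-only with re.ASCII.
word_unicode = re.fullmatch(r"\w+", word) is not None
word_ascii = re.fullmatch(r"\w+", word, re.ASCII) is not None
space_unicode = re.fullmatch(r"\s", nbsp) is not None
space_ascii = re.fullmatch(r"\s", nbsp, re.ASCII) is not None

# Case-insensitive matching: "k" matches KELVIN SIGN unless re.ASCII is set.
kelvin_unicode = re.fullmatch("k", kelvin, re.IGNORECASE) is not None
kelvin_ascii = re.fullmatch("k", kelvin, re.IGNORECASE | re.ASCII) is not None

print(word_unicode, word_ascii, space_unicode, space_ascii, kelvin_unicode, kelvin_ascii)
```

In each pair, the first check matches and the second (with re.ASCII) does not.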

–jeroen
