Kristian Köhntopp on Twitter: “Basically, show me a Python regex with \d and without ASCII flag, and I can show you a bug, often exploitable.… “


Posted by jpluimers on 2022/12/14

An interesting thought: [Archive] Kristian Köhntopp on Twitter: “Basically, show me a Python regex with \d and without ASCII flag, and I can show you a bug, often exploitable.… “

Input parsing is still badly underrated in most systems, which makes it a constant source of peculiarities and therefore bugs. Or, phrased differently: [Archive] Kristian Köhntopp on Twitter: “In many cases an uncaught exception, and hence a component crash.… “
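To make the "uncaught exception" case concrete, here is a minimal sketch of how it might play out. The `PORT_RE` validator, the `build_request_line` helper and the ASCII-only wire format are hypothetical names I made up for illustration; they are not from the tweets.

```python
import re

# Hypothetical validator: meant to accept only ASCII digits, but without
# re.ASCII the \d class also accepts other Unicode decimal digits.
PORT_RE = re.compile(r"\d+")


def build_request_line(port: str) -> bytes:
    """Validate `port`, then serialise it for an (assumed) ASCII-only protocol."""
    if not PORT_RE.fullmatch(port):
        raise ValueError(f"not a number: {port!r}")
    # The author of this code assumes validation guarantees plain ASCII here.
    return b"PORT " + port.encode("ascii")


print(build_request_line("8080"))

try:
    # Four fullwidth digits (U+FF18 U+FF10 U+FF18 U+FF10): they pass \d+ ...
    build_request_line("\uff18\uff10\uff18\uff10")
except UnicodeEncodeError as exc:
    # ... and only fail later, in code that never expected to fail.
    print("validation passed, encoding failed:", exc.reason)
```

In a real component nobody wraps the `encode` call in a `try`, because the regex was supposed to make that failure impossible.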

Kris also states [Archive] Kristian Köhntopp on Twitter: “Again, Python is not alone in this. Perl, when “use utf8;” is active (which it should) also does this, so every single fucking Regex needs a ‘/a‘ at the end. Nobody ever asked \d to match tengwar or klingon numeric symbols.… “.

The point is in the last few words: Arabic numerals are so widespread around the world that the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 should be the de facto \d pattern. In Python they aren’t, as per [Wayback/Archive] re — Regular expression operations — Python 3.10.0 documentation: \d (emphasis mine):

\d

For Unicode (str) patterns:

Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters. If the ASCII flag is used only [0-9] is matched.

For 8-bit (bytes) patterns:

Matches any decimal digit; this is equivalent to [0-9].
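A small sketch of what that documentation means in practice. The sample strings are my own picks (the year 2022 written in four different digit sets) and are only there for illustration.

```python
import re

# The same number, 2022, in four sets of decimal digits (all category Nd).
samples = {
    "ASCII": "2022",
    "Arabic-Indic": "\u0662\u0660\u0662\u0662",
    "Devanagari": "\u0968\u0966\u0968\u0968",
    "Fullwidth": "\uff12\uff10\uff12\uff12",
}

unicode_digits = re.compile(r"\d+")          # default behaviour for str patterns
ascii_digits = re.compile(r"\d+", re.ASCII)  # same pattern, ASCII-only classes

for name, text in samples.items():
    matches_default = bool(unicode_digits.fullmatch(text))
    matches_ascii = bool(ascii_digits.fullmatch(text))
    print(f"{name:13} default={matches_default} re.ASCII={matches_ascii}")

# int() accepts Unicode decimal digits as well, so the value sails through.
print(int(samples["Devanagari"]))  # 2022
```

All four samples match the default pattern; only the first one matches when the ASCII flag is set.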

Indeed, [Nd] is A LOT more than 10 digits, 660 according to Unicode character property – Wikipedia: General Category; [Nd]

Value	Category (major, minor)	Basic type	Character assigned	Count (as of 14.0)	Remarks
N, Number
Nd	Number, decimal digit	Graphic	Character	660	All these, and only these, have Numeric Type = De

The same count, together with a list of all the characters, is at [Wayback/Archive] Unicode Characters in the ‘Number, Decimal Digit’ Category.

Look at those 650 digits other than 0-9: aren’t they amazing, and bound to cause some errors in your code?
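You can check the count on your own machine. This is a rough sketch using the standard `unicodedata` module; the number it prints depends on the Unicode version bundled with your Python, so it may differ from the 660 quoted for Unicode 14.0.

```python
import re
import sys
import unicodedata

# Collect every code point whose general category is Nd (Number, decimal digit).
nd_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Nd"
]
non_ascii = [ch for ch in nd_chars if not ch.isascii()]

print("Unicode version:", unicodedata.unidata_version)
print("Nd characters:", len(nd_chars))
print("of which non-ASCII:", len(non_ascii))

# Every one of them is accepted by a default (non-ASCII-flag) \d.
digit = re.compile(r"\d")
print("all match \\d:", all(digit.fullmatch(ch) for ch in nd_chars))
```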

That’s why Kristian Köhntopp tweeted this:

[Archive] Kris on Twitter: “@koehntopp @ainmosni Next up: Discovering what \d really compares, then discovering the ASCII flag, then questioning why this is NOT the default.” / Twitter
[Archive] Kristian Köhntopp on Twitter: “… “ pointing to
- [Wayback/Archive] re — Regular expression operations — Python 3.10.0 documentation: re.A and re.ASCII
  
  re.A
  re.ASCII
  
  Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
  
  Note that for backward compatibility, the re.U flag still exists (as well as its synonym re.UNICODE and its embedded counterpart (?u)), but these are redundant in Python 3 since matches are Unicode by default for strings (and Unicode matching isn’t allowed for bytes).
[Archive] Kristian Köhntopp on Twitter: “Again, Python is not alone in this. Perl, when “use utf8;” is active (which it should) also does this, so every single fucking Regex needs a ‘/a‘ at the end. Nobody ever asked \d to match tengwar or klingon numeric symbols.… “
- [Archive] Daniël Franke 🏳️‍🌈 (@ainmosni@social.tchncs.de) on Twitter: “@isotopp @koehntopp It gets even more fun when you do need that behaviour for some character classes, but not others.” / Twitter
  - [Archive] darix on Twitter: “@ainmosni @isotopp @koehntopp also fun is \A\z vs ^$” / Twitter
[Archive] Kristian Köhntopp on Twitter: “Basically, show me a Python regex with \d and without ASCII flag, and I can show you a bug, often exploitable.… “
- [Archive] Daniël Franke 🏳️‍🌈 on Twitter: “Yeah… the sad part is that often \w with it is commonly also a bug, at least if you deal with non English names/text. That said, matching too much is usually worse than matching too little. Maybe someone already did this, but it would be nice to have separate classes for utf.… “
- [Archive] Kristian Köhntopp on Twitter: “In many cases an uncaught exception, and hence a component crash.… “
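For reference, a short sketch of the ways to get ASCII-only matching that the re.A / re.ASCII documentation quoted in the thread describes, plus the \w side effect Daniël mentions. The test strings are my own.

```python
import re

fullwidth = "\uff14\uff12"  # "42" written with fullwidth digits

# Default: Unicode matching, so the fullwidth digits are accepted.
assert re.fullmatch(r"\d+", fullwidth)

# Three equivalent ways to restrict \d to [0-9] for str patterns.
assert re.fullmatch(r"\d+", fullwidth, re.ASCII) is None
assert re.fullmatch(r"\d+", fullwidth, re.A) is None
assert re.fullmatch(r"(?a)\d+", fullwidth) is None

# Spelling out the range works too and needs no flag at all.
assert re.fullmatch(r"[0-9]+", fullwidth) is None

# Bytes patterns are always ASCII-only; the flag is ignored for them.
assert re.fullmatch(rb"\d+", b"42")

# The flag also narrows \w, which can be a bug of its own with
# non-English names.
assert re.fullmatch(r"\w+", "Köhntopp")
assert re.fullmatch(r"(?a)\w+", "Köhntopp") is None

print("all checks passed")
```

Because the flag applies to the whole pattern, writing `[0-9]` for digits while leaving `\w` in Unicode mode is one way to mix the two behaviours.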

Obligatory Perl documentation:

[Wayback/Archive] utf8 – Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code – Perldoc Browser

The use utf8 pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope. The no utf8 pragma tells Perl to switch back to treating the source text as literal bytes in the current lexical scope. (On EBCDIC platforms, technically it is allowing UTF-EBCDIC, and not UTF-8, but this distinction is academic, so in this document the term UTF-8 is used to mean both).

Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

[Wayback/Archive] perlre – Perl regular expressions – Perldoc Browser: /a (and /aa):

/a (and /aa)

This modifier stands for ASCII-restrict (or ASCII-safe). This modifier may be doubled-up to increase its effect.

When it appears singly, it causes the sequences \d, \s, \w, and the Posix character classes to match only in the ASCII range. They thus revert to their pre-5.6, pre-Unicode meanings. Under /a, \d always means precisely the digits "0" to "9"; \s means the five characters [ \f\n\r\t], and starting in Perl v5.18, the vertical tab; \w means the 63 characters [A-Za-z0-9_]; and likewise, all the Posix classes such as [[:print:]] match only the appropriate ASCII-range characters.

This modifier is useful for people who only incidentally use Unicode, and who do not wish to be burdened with its complexities and security concerns.

With /a, one can write \d with confidence that it will only match ASCII characters, and should the need arise to match beyond ASCII, you can instead use \p{Digit} (or \p{Word} for \w). There are similar \p{...} constructs that can match beyond ASCII both white space (see “Whitespace” in perlrecharclass), and Posix classes (see “POSIX Character Classes” in perlrecharclass). Thus, this modifier doesn’t mean you can’t use Unicode, it means that to get Unicode matching you must explicitly use a construct (\p{}, \P{}) that signals Unicode.

As you would expect, this modifier causes, for example, \D to mean the same thing as [^0-9]; in fact, all non-ASCII characters match \D, \S, and \W. \b still means to match at the boundary between \w and \W, using the /a definitions of them (similarly for \B).

Otherwise, /a behaves like the /u modifier, in that case-insensitive matching uses Unicode rules; for example, “k” will match the Unicode \N{KELVIN SIGN} under /i matching, and code points in the Latin1 range, above ASCII will have Unicode rules when it comes to case-insensitive matching.

To forbid ASCII/non-ASCII matches (like “k” with \N{KELVIN SIGN}), specify the "a" twice, for example /aai or /aia. (The first occurrence of "a" restricts the \d, etc., and the second occurrence adds the /i restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn’t really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.

To summarize, this modifier provides protection for applications that don’t wish to be exposed to all of Unicode. Specifying it twice gives added protection.

This modifier may be specified to be the default by use re '/a' or use re '/aa'. If you do so, you may actually have occasion to use the /u modifier explicitly if there are a few regular expressions where you do want full Unicode rules (but even here, it’s best if everything were under feature "unicode_strings", along with the use re '/aa').

–jeroen

This entry was posted on 2022/12/14 at 12:00 and is filed under Development, Perl, Python, RegEx, Scripting, Software Development.

The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests
