RegEx character classes in “Searching | Notepad++ User Manual”
Posted by jpluimers on 2022/02/03
I needed to search for IBAN numbers in documents and used this regular expression: [a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2} which allows the usual optional single space between groups, as in NL12 INGB 0345 6789 01.
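As a quick way to try the expression outside Notepad++, here is a minimal sketch in Python. It assumes Python's re module, which accepts this particular pattern unchanged; the names IBAN_PATTERN and find_ibans are made up for illustration. The pattern only checks the shape of the text (18 characters plus optional spaces, as in the Dutch example), not the IBAN checksum.

```python
import re

# Same pattern as in the post; it uses no Notepad++-specific syntax.
IBAN_PATTERN = re.compile(
    r"[a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2}"
)

def find_ibans(text):
    """Return every substring of text that has the IBAN shape above."""
    return IBAN_PATTERN.findall(text)

sample = "Pay to NL12 INGB 0345 6789 01 or NL12INGB0345678901 today."
print(find_ibans(sample))
# ['NL12 INGB 0345 6789 01', 'NL12INGB0345678901']
```

Both the spaced and the unspaced notation are found, because every space in the pattern is followed by ? and is therefore optional.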
The expression is based on the nice table of Notepad++ RegEx character classes at [Wayback] Searching | Notepad++ User Manual:
Character Classes
[set] ⇒ This indicates a set of characters, for example, [abc] means any of the literal characters a, b or c. You can also use ranges by putting a hyphen between characters, for example [a-z] for any character from a to z. You can use a collating sequence in character ranges, like in [[.ch.]-[.ll.]] (these are collating sequences in Spanish).
[^set] ⇒ The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character. Care should be taken with a complement list, as regular expressions are always multi-line, and hence [^ABC]* will match until the first A, B or C (or a, b or c if match case is off), including any newline characters. To confine the search to a single line, include the newline characters in the exception list, e.g. [^ABC\r\n].
Please note that the complement of a character set is often many more characters than you expect: (?-s)[^x]+ will match 1 or more instances of any non-x character, including newlines: the (?-s) search modifier turns off “dot matches newlines”, but the [^x] is not a dot, so that class is still allowed to match newlines.
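The newline behaviour of complement classes is easy to check. The sketch below uses Python's re module as a stand-in, on the assumption that it behaves like the Notepad++ engine in this one respect; the (?-s) modifier itself is left out, because the dot in Python already stops at newlines unless DOTALL is set.

```python
import re

text = "hello\nworld A"

# The dot stops at the newline because DOTALL is not set.
dot_match = re.match(r".*", text).group()

# A complement class has no such restriction and runs across the newline.
greedy_match = re.match(r"[^ABC]*", text).group()

# Adding \r and \n to the exception list confines the match to one line.
line_match = re.match(r"[^ABC\r\n]*", text).group()

print(repr(dot_match))     # 'hello'
print(repr(greedy_match))  # 'hello\nworld '
print(repr(line_match))    # 'hello'
```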
[[:name:]] or [[:☒:]] ⇒ The whole character class named name. For many, there is also a single-letter “short” class name, ☒. Please note: the [:name:] and [:☒:] must be inside a character class [...] to have their special meaning.

| short | full name | description | equivalent character class |
|---|---|---|---|
| | alnum | letters and digits | |
| | alpha | letters | |
| h | blank | spacing which is not a line terminator | [\t\x20\xA0] |
| | cntrl | control characters | [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D] |
| d | digit | digits | |
| | graph | graphical character, so essentially any character except for control chars, \0x7F, \x80 | |
| l | lower | lowercase letters | |
| | print | printable characters | [\s[:graph:]] |
| | punct | punctuation characters | [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_{… |
| s | space | whitespace (word or line separator) | [\t\n\x0B\f\r\x20\x85\xA0\x{2028}\x{2029}] |
| u | upper | uppercase letters | |
| | unicode | any character with code point above 255 | [\x{0100}-\x{FFFF}] |
| w | word | word characters | [_\d\l\u] |
| | xdigit | hexadecimal digits | [0-9A-Fa-f] |

Note that letters include any unicode letters (ASCII letters, accented letters, and letters from a variety of other writing systems); digits include ASCII numeric digits, and anything else in Unicode that’s classified as a digit (like superscript numbers ¹²³…).
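The explicit sets in the last column are handy in engines that lack the named classes. Python's re module, for instance, does not support the [[:name:]] notation, so a rough sketch there has to spell the sets out. The constant names below are made up for illustration, and the \x{2028} and \x{2029} escapes are rewritten as \u2028 and \u2029 because Python uses a different escape syntax.

```python
import re

# Explicit sets copied from the "equivalent character class" column.
H_BLANK = re.compile(r"[\t\x20\xA0]")   # blank / h
XDIGIT = re.compile(r"[0-9A-Fa-f]")     # xdigit
# space / s, with \x{2028} and \x{2029} written the Python way
SPACE = re.compile(r"[\t\n\x0B\f\r\x20\x85\xA0\u2028\u2029]")

blanks = H_BLANK.findall("a\tb c\nd")
spaces = SPACE.findall("a\tb c\nd")
hex_digits = XDIGIT.findall("0xBEEF, ok")

print(blanks)      # ['\t', ' ']
print(spaces)      # ['\t', ' ', '\n']
print(hex_digits)  # ['0', 'B', 'E', 'E', 'F']
```

The newline shows up only in the space result, which matches the table: blank is spacing that is not a line terminator.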
Note that those character class names may be written in upper or lower case without changing the results. So [[:alnum:]] is the same as [[:ALNUM:]] or the mixed-case [[:AlNuM:]].
As stated earlier, the [:name:] and [:☒:] (note the single brackets) must be a part of a surrounding character class. However, you may combine them inside one character class, such as [_[:d:]x[:upper:]=], which is a character class that would match any digit, any uppercase, the lowercase x, and the literal _ and = characters. These named classes won’t always appear with the double brackets, but they will always be inside of a character class.
If the [:name:] or [:☒:] are accidentally not contained inside a surrounding character class, they will lose their special meaning. For example, [:upper:] is the character class matching :, u, p, e, and r; whereas [[:upper:]] is similar to [A-Z] (plus other unicode uppercase letters).
[^[:name:]] or [^[:☒:]] ⇒ The complement of character class named name or ☒ (matching anything not in that named class). This uses the same long names, short names, and rules as mentioned in the previous description.
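The single-bracket pitfall can be reproduced in any engine, since [:upper:] without the outer brackets is just an ordinary set. Here is a minimal sketch in Python. Python's re module has no named classes at all, so the ASCII ranges below are only approximate stand-ins for [[:upper:]], [^[:upper:]] and [_[:d:]x[:upper:]=]; they miss the non-ASCII letters and digits that Notepad++ would also match.

```python
import re

sample = "Super: PURE"

# Without the outer brackets this is the plain set ':', 'u', 'p', 'e', 'r'.
plain_set = re.findall(r"[:upper:]", sample)

# ASCII-only stand-in for [[:upper:]].
ascii_upper = re.findall(r"[A-Z]", sample)

# ASCII-only stand-in for the complement [^[:upper:]].
not_upper = re.findall(r"[^A-Z]", "Ab C")

# ASCII-only stand-in for the combined class [_[:d:]x[:upper:]=].
combined = re.findall(r"[_0-9xA-Z=]", "a_1x=Bz")

print(plain_set)    # ['u', 'p', 'e', 'r', ':']
print(ascii_upper)  # ['S', 'P', 'U', 'R', 'E']
print(not_upper)    # ['b', ' ']
print(combined)     # ['_', '1', 'x', '=', 'B']
```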
–jeroen





