RegEx character classes in “Searching | Notepad++ User Manual”
Posted by jpluimers on 2022/02/03
I needed to search for IBAN numbers in documents and used this regular expression: [a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2}
which supports the usual optional whitespace like in NL12 INGB 0345 6789 01.
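To try the pattern outside Notepad++, here is a minimal sketch using Python's `re` module. Python is an assumption here (Notepad++ uses the Boost regex engine), and the helper name `find_ibans` is made up for this example. The fixed group lengths add up to 18 characters, so the pattern targets IBANs shaped like the Dutch example above rather than every country's format.

```python
import re

# The pattern from the post: 2 letters, 2 digits, then groups of 4, 4, 4 and 2
# characters, each optionally preceded by a single space.
IBAN_PATTERN = re.compile(
    r"[a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2}"
)


def find_ibans(text):
    """Return every substring of text that matches IBAN_PATTERN (hypothetical helper)."""
    return IBAN_PATTERN.findall(text)


sample = "Pay to NL12 INGB 0345 6789 01 or to NL12INGB0345678901 today."
print(find_ibans(sample))
# → ['NL12 INGB 0345 6789 01', 'NL12INGB0345678901']
```

Both the spaced and the unspaced form are found, because every space in the pattern is followed by `?`.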
It is based on the nice list and table of RegEx character classes that Notepad++ supports, found at [Wayback] Searching | Notepad++ User Manual:
Character Classes
[set] ⇒ This indicates a set of characters, for example, [abc] means any of the literal characters a, b or c. You can also use ranges by putting a hyphen between characters, for example [a-z] for any character from a to z. You can use a collating sequence in character ranges, like in [[.ch.]-[.ll.]] (these are collating sequences in Spanish).
[^set] ⇒ The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character. Care should be taken with a complement list, as regular expressions are always multi-line, and hence [^ABC]* will match until the first A, B or C (or a, b or c if match case is off), including any newline characters. To confine the search to a single line, include the newline characters in the exception list, e.g. [^ABC\r\n].
Please note that the complement of a character set is often many more characters than you expect: (?-s)[^x]+ will match 1 or more instances of any non-x character, including newlines: the (?-s) search modifier turns off “dot matches newlines”, but the [^x] is not a dot, so that class is still allowed to match newlines.
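The newline behaviour of complement sets is easy to check. The sketch below uses Python's `re` module as an assumed stand-in for the Notepad++ engine. Python's default, where the dot does not match newlines, plays the role of `(?-s)`, and `re.IGNORECASE` plays the role of "match case is off". The variable names are made up for this example.

```python
import re

text = "first line\nsecond line with A\nthird"

# [^ABC]* runs across the newline until it meets the first A, B or C.
across_lines = re.match(r"[^ABC]*", text).group()

# Adding \r and \n to the excluded set confines the match to the first line.
one_line = re.match(r"[^ABC\r\n]*", text).group()

# Case-insensitive: the match now also stops at a, b or c (the c in "second").
no_case = re.match(r"[^ABC]*", text, re.IGNORECASE).group()

# The dot stops at the newline, but the complement set [^x] does not.
dot_run = re.match(r".+", "ab\ncd").group()
class_run = re.match(r"[^x]+", "ab\ncd").group()

print(repr(across_lines))  # → 'first line\nsecond line with '
print(repr(one_line))      # → 'first line'
print(repr(no_case))       # → 'first line\nse'
print(repr(dot_run))       # → 'ab'
print(repr(class_run))     # → 'ab\ncd'
```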
[[:name:]] or [[:☒:]] ⇒ The whole character class named name. For many, there is also a single-letter “short” class name, ☒. Please note: the [:name:] and [:☒:] must be inside a character class [...] to have their special meaning.
short  full name  description                               equivalent character class
-      alnum      letters and digits
-      alpha      letters
h      blank      spacing which is not a line terminator    [\t\x20\xA0]
-      cntrl      control characters                        [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]
d      digit      digits
-      graph      graphical character, so essentially any character except for control chars, \x7F, \x80
l      lower      lowercase letters
-      print      printable characters                      [\s[:graph:]]
-      punct      punctuation characters                    [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
s      space      whitespace (word or line separator)       [\t\n\x0B\f\r\x20\x85\xA0\x{2028}\x{2029}]
u      upper      uppercase letters
-      unicode    any character with code point above 255   [\x{0100}-\x{FFFF}]
w      word       word characters                           [_\d\l\u]
-      xdigit     hexadecimal digits                        [0-9A-Fa-f]
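The “equivalent character class” column is handy in engines that lack the named classes. Python's `re` module, used here purely as an assumed stand-in, does not understand `[[:xdigit:]]`, but equivalents from the table can be pasted in almost directly. The dictionary name `EQUIVALENTS` is made up for this sketch, and only three rows are shown.

```python
import re

# A few rows from the table, keyed by full class name (hypothetical mapping).
# Python spells the \x{2028} and \x{2029} escapes as \u2028 and \u2029.
EQUIVALENTS = {
    "blank": r"[\t\x20\xA0]",
    "space": r"[\t\n\x0B\f\r\x20\x85\xA0\u2028\u2029]",
    "xdigit": r"[0-9A-Fa-f]",
}

# xdigit: pull the hexadecimal run out of a colour code.
hex_runs = re.findall(EQUIVALENTS["xdigit"] + "+", "#FF00aa;")
print(hex_runs)  # → ['FF00aa']

# blank: tabs, spaces and non-breaking spaces, but not line terminators,
# so the newline between b and c is left alone.
pieces = re.split(EQUIVALENTS["blank"] + "+", "a \t\xa0b\nc")
print(pieces)  # → ['a', 'b\nc']

# space: the same run plus line terminators and the Unicode separators.
is_all_space = re.fullmatch(EQUIVALENTS["space"] + "+", " \t\r\n\u2028") is not None
print(is_all_space)  # → True
```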
Note that letters include any unicode letters (ASCII letters, accented letters, and letters from a variety of other writing systems); digits include ASCII numeric digits, and anything else in Unicode that’s classified as a digit (like superscript numbers ¹²³…).
Note that those character class names may be written in upper or lower case without changing the results. So [[:alnum:]] is the same as [[:ALNUM:]] or the mixed-case [[:AlNuM:]].
As stated earlier, the [:name:] and [:☒:] (note the single brackets) must be a part of a surrounding character class. However, you may combine them inside one character class, such as [_[:d:]x[:upper:]=], which is a character class that would match any digit, any uppercase, the lowercase x, and the literal _ and = characters. These named classes won’t always appear with the double brackets, but they will always be inside of a character class.
If the [:name:] or [:☒:] are accidentally not contained inside a surrounding character class, they will lose their special meaning. For example, [:upper:] is the character class matching :, u, p, e, and r; whereas [[:upper:]] is similar to [A-Z] (plus other unicode uppercase letters).
[^[:name:]] or [^[:☒:]] ⇒ The complement of character class named name or ☒ (matching anything not in that named class). This uses the same long names, short names, and rules as mentioned in the previous description.
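The single-bracket pitfall described above can be reproduced in Python's `re` module, again an assumed stand-in for Notepad++. Python has no named classes at all, so `[A-Z]` and `\d` serve below as ASCII-leaning approximations of `[:upper:]` and `[:d:]`. The variable names are made up for this sketch.

```python
import re

sample = "Upper: CASE"

# Without the outer brackets this is just the set of characters : u p e r.
plain_set = re.findall(r"[:upper:]", sample)
print(plain_set)  # → ['p', 'p', 'e', 'r', ':']

# What [[:upper:]] is "similar to" according to the text: the range A-Z.
ascii_upper = re.findall(r"[A-Z]", sample)
print(ascii_upper)  # → ['U', 'C', 'A', 'S', 'E']

# In the spirit of [^[:upper:]]: everything that is not in A-Z.
not_upper = re.findall(r"[^A-Z]", sample)
print(not_upper)  # → ['p', 'p', 'e', 'r', ':', ' ']

# Approximation of the combined class [_[:d:]x[:upper:]=].
combined = re.findall(r"[_\dxA-Z=]", "a_1x=Bz")
print(combined)  # → ['_', '1', 'x', '=', 'B']
```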
–jeroen