The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

RegEx character classes in “Searching | Notepad++ User Manual”

Posted by jpluimers on 2022/02/03

I needed to search for IBANs in documents and used this regular expression: [a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2}. It allows the usual optional spaces between the groups, as in NL12 INGB 0345 6789 01.
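
As a quick sanity check, the expression can also be tried outside Notepad++. The sketch below uses Python's `re` module, which is an assumption on my side (the search itself was done in Notepad++); the `find_ibans` helper and the sample text are made up for illustration. Note that the pattern only covers the 18-character layout shown above and does not validate the check digits.

```python
import re

# The pattern from the post, unchanged.
IBAN_PATTERN = re.compile(
    r"[a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2}"
)


def find_ibans(text):
    """Return every substring of text that matches the pattern (hypothetical helper)."""
    return IBAN_PATTERN.findall(text)


# Made-up sample text: one IBAN with spaces, one without.
sample = "Pay to NL12 INGB 0345 6789 01 or to NL12INGB0345678901 today."
print(find_ibans(sample))
```

Both the spaced and the unspaced notation should be found, because each ` ?` makes a single space optional.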

It is based on the nice list and table of supported RegEx character classes at [Wayback] Searching | Notepad++ User Manual:

Character Classes
  • [set] ⇒ This indicates a set of characters, for example, [abc] means any of the literal characters a, b or c. You can also use ranges by doing a hyphen between characters, for example [a-z] for any character from a to z. You can use a collating sequence in character ranges, like in [[.ch.]-[.ll.]] (these are collating sequence in Spanish).
  • [^set] ⇒ The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character. Care should be taken with a complement list, as regular expressions are always multi-line, and hence [^ABC]* will match until the first A, B or C (or a, b or c if match case is off), including any newline characters. To confine the search to a single line, include the newline characters in the exception list, e.g. [^ABC\r\n].

Please note that the complement of a character set is often many more characters than you expect: (?-s)[^x]+ will match 1 or more instances of any non-x character, including newlines: the (?-s) search modifier turns off “dot matches newlines”, but the [^x] is not a dot ., so that class is still allowed to match newlines.

  • [[:name:]] or [[:☒:]] ⇒ The whole character class named name. For many, there is also a single-letter “short” class name, ☒. Please note: the [:name:] and [:☒:] must be inside a character class [...] to have their special meaning.
    short  full name  description                              equivalent character class
           alnum      letters and digits
           alpha      letters
    h      blank      spacing which is not a line terminator   [\t\x20\xA0]
           cntrl      control characters                       [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]
    d      digit      digits
           graph      graphical character, so essentially any character except for control chars, \0x7F, \x80
    l      lower      lowercase letters
           print      printable characters                     [\s[:graph:]]
           punct      punctuation characters                   [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_{
    s      space      whitespace (word or line separator)      [\t\n\x0B\f\r\x20\x85\xA0\x{2028}\x{2029}]
    u      upper      uppercase letters
           unicode    any character with code point above 255  [\x{0100}-\x{FFFF}]
    w      word       word characters                          [_\d\l\u]
           xdigit     hexadecimal digits                       [0-9A-Fa-f]

    Note that letters include any unicode letters (ASCII letters, accented letters, and letters from a variety of other writing systems); digits include ASCII numeric digits, and anything else in Unicode that’s classified as a digit (like superscript numbers ¹²³…).

    Note that those character class names may be written in upper or lower case without changing the results. So [[:alnum:]] is the same as [[:ALNUM:]] or the mixed-case [[:AlNuM:]].

    As stated earlier, the [:name:] and [:☒:] (note the single brackets) must be a part of a surrounding character class. However, you may combine them inside one character class, such as [_[:d:]x[:upper:]=], which is a character class that would match any digit, any uppercase, the lowercase x, and the literal _ and = characters. These named classes won’t always appear with the double brackets, but they will always be inside of a character class.

    If the [:name:] or [:☒:] are accidentally not contained inside a surrounding character class, they will lose their special meaning. For example, [:upper:] is the character class matching :, u, p, e, and r; whereas [[:upper:]] is similar to [A-Z] (plus other unicode uppercase letters).

  • [^[:name:]] or [^[:☒:]] ⇒ The complement of character class named name or ☒ (matching anything not in that named class). This uses the same long names, short names, and rules as mentioned in the previous description.
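
The newline pitfall with complemented classes described above is not specific to Notepad++. As a rough illustration, here is the same behaviour in Python's `re` module. This is an assumption-laden sketch: Notepad++ uses a different regex engine, so only this one point is being compared, and the sample text is made up. By default the Python dot stops at a line feed, while `[^x]+` runs straight across it unless the newline characters are added to the exception list.

```python
import re

# Made-up two-line sample; the "x" sits on the second line.
text = "abc\ndef x ghi"

# By default the dot does not match "\n", so this stops at the line end.
dot_match = re.match(r".+", text).group()

# A complemented class is not a dot: it consumes the newline as well
# and only stops right before the first "x".
negated_match = re.match(r"[^x]+", text).group()

# Adding \r and \n to the exception list confines the match to one line.
single_line_match = re.match(r"[^x\r\n]+", text).group()

print(repr(dot_match))
print(repr(negated_match))
print(repr(single_line_match))
```

The second match spans both lines; the third one ends at the first line break, which is the behaviour most people actually want from a complemented class.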

–jeroen
