The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 1,854 other subscribers

The regexp for an emoticon ?

Posted by jpluimers on 2024/08/08

I responded to [Wayback/Archive] jilles.com on Twitter: “@0xD4ni @Twitter What is the regexp for an emoticon ?” with [Wayback/Archive] Jeroen Wiert Pluimers on Twitter: “@jilles_com @0xD4ni @Twitter \p{So}+ See …”.

I got the answer from [Wayback/Archive] java – What is the regex to extract all the emojis from a string? – Stack Overflow (thanks [Wayback/Archive] vishalaksh, and [Wayback/Archive] Desgard_Duan) which refers to the quoted section below.

Note that correctly matching highly depends on the versions of the libraries you use: there have been lots of releases of Unicode versions over the last years (since 2014 roughly every 12 months) each usually adding more Emoji.

In addition, many Emoji are not single Unicode codepoints: often they are code points (with or without any of the variation selectors) stacked on top of each other with zero-width joiners like I described in Kris on Twitter: “Company chat: »Right, we need more languages with Emoji as variable type indicators and pointer symbols.«….

I tried fiddling on [Wayback/Archive] regex101: build, test, and debug regex and could not always getting it to work as I hoped for, but also could not figure out how recent their libraries are.

[Wayback/Archive] Regex Tutorial – Unicode Characters and Properties

In addition to complications, Unicode also brings new possibilities. One is that each Unicode character belongs to a certain category. You can match a single character belonging to the “letter” category with \p{L}. You can match a single character not belonging to that category with \P{L}.

Again, “character” really means “Unicode code point”. \p{L} matches a single code point in the category “letter”. If your input string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category “letter”, while U+0300 is in the category “mark”.

You should now understand why \P{M}\p{M}*+ is the equivalent of \X\P{M} matches a code point that is not a combining mark, while \p{M}*+ matches zero or more code points that are combining marks. To match a letter including any diacritics, use \p{L}\p{M}*+. This last regex will always match à, regardless of how it is encoded. The possessive quantifier makes sure that backtracking doesn’t cause \P{M}\p{M}*+ to match a non-mark without the combining marks that follow it, which \X would never do.

PCRE, PHP, and .NET are case sensitive when it checks the part between curly braces of a \p token. \p{Zs} will match any kind of space character, while \p{zs} will throw an error. All other regex engines described in this tutorial will match the space in both cases, ignoring the case of the category between the curly braces. Still, I recommend you make a habit of using the same uppercase and lowercase combination as I did in the list of properties below. This will make your regular expressions work with all Unicode regex engines.

In addition to the standard notation, \p{L}, Java, Perl, PCRE, the JGsoft engine, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties. \pLl is not the equivalent of \p{Ll}. It is the equivalent of \p{L}l which matches Al or àl or any Unicode letter followed by a literal l.

Perl, XRegExp, and the JGsoft engine also support the longhand \p{Letter}. You can find a complete list of all Unicode properties below. You may omit the underscores or use hyphens or spaces instead.
  • \p{L} or \p{Letter}: any kind of letter from any language.
    • \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
    • \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
    • \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
    • \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
    • \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
    • \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
  • \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
    • \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
    • \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
    • \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
  • \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
    • \p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
    • \p{Zl} or \p{Line_Separator}: line separator character U+2028.
    • \p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
  • \p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
    • \p{Sm} or \p{Math_Symbol}: any mathematical symbol.
    • \p{Sc} or \p{Currency_Symbol}: any currency sign.
    • \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
    • \p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
  • \p{N} or \p{Number}: any kind of numeric character in any script.
    • \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
    • \p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
    • \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
  • \p{P} or \p{Punctuation}: any kind of punctuation character.
    • \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
    • \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
    • \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
    • \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
    • \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
    • \p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
    • \p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
  • \p{C} or \p{Other}: invisible control characters and unused code points.
    • \p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
    • \p{Cf} or \p{Format}: invisible formatting indicator.
    • \p{Co} or \p{Private_Use}: any code point reserved for private use.
    • \p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
    • \p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.

–jeroen

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.