Note that matching correctly depends heavily on the versions of the libraries you use: there have been many Unicode releases over the last years (since 2014, roughly one every 12 months), each usually adding more emoji.
Every once in a while, b0rk (Julia Evans, of [Wayback/Archive] wizard zines fame) asks interesting questions like the one below that result in a lot of cool links.
I needed to search for IBAN numbers in documents and used this regular expression: [a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2} which supports the usual optional whitespace like in NL12 INGB 0345 6789 01.
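As a minimal sketch of how that expression could be used for searching, here it is in Python. The helper name find_ibans and the sample sentence are made up for illustration. Note that this fixed layout fits the Dutch example; IBANs from other countries have other lengths, and no checksum validation takes place.

```python
import re

# The expression from the text, compiled unchanged.
IBAN_RE = re.compile(
    r"[a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2}"
)


def find_ibans(text):
    """Return every substring of text that matches the expression."""
    return IBAN_RE.findall(text)


# Hypothetical sample text: one IBAN with spaces, one without.
sample = "Pay to NL12 INGB 0345 6789 01 or to NL12INGB0345678901 today."
found = find_ibans(sample)
print(found)
```

Both spellings are found because every space in the expression is optional.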
[set] ⇒ This indicates a set of characters, for example, [abc] means any of the literal characters a, b or c. You can also use ranges by putting a hyphen between characters, for example [a-z] for any character from a to z. You can use a collating sequence in character ranges, like in [[.ch.]-[.ll.]] (these are collating sequences in Spanish).
[^set] ⇒ The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character. Care should be taken with a complement list, as regular expressions are always multi-line, and hence [^ABC]* will match until the first A, B or C (or a, b or c if match case is off), including any newline characters. To confine the search to a single line, include the newline characters in the exception list, e.g. [^ABC\r\n].
Please note that the complement of a character set is often many more characters than you expect: (?-s)[^x]+ will match 1 or more instances of any non-x character, including newlines: the (?-s) search modifier turns off “dot matches newlines”, but the [^x] is not a dot ., so that class is still allowed to match newlines.
[[:name:]] or [[:☒:]] ⇒ The whole character class named name. For many, there is also a single-letter “short” class name, ☒. Please note: the [:name:] and [:☒:] must be inside a character class [...] to have their special meaning.
| short | full name | description | equivalent character class |
|-------|-----------|-------------|----------------------------|
|       | alnum     | letters and digits | |
|       | alpha     | letters | |
| h     | blank     | spacing which is not a line terminator | [\t\x20\xA0] |
|       | cntrl     | control characters | [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D] |
| d     | digit     | digits | |
|       | graph     | graphical character, so essentially any character except for control chars, \0x7F, \x80 | |
| l     | lower     | lowercase letters | |
|       | print     | printable characters | [\s[:graph:]] |
|       | punct     | punctuation characters | [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_{ |
| s     | space     | whitespace (word or line separator) | [\t\n\x0B\f\r\x20\x85\xA0\x{2028}\x{2029}] |
| u     | upper     | uppercase letters | |
|       | unicode   | any character with code point above 255 | [\x{0100}-\x{FFFF}] |
| w     | word      | word characters | [_\d\l\u] |
|       | xdigit    | hexadecimal digits | [0-9A-Fa-f] |
Note that letters include any Unicode letters (ASCII letters, accented letters, and letters from a variety of other writing systems); digits include ASCII numeric digits, and anything else in Unicode that’s classified as a digit (like superscript numbers ¹²³…).
Note that those character class names may be written in upper or lower case without changing the results. So [[:alnum:]] is the same as [[:ALNUM:]] or the mixed-case [[:AlNuM:]].
As stated earlier, the [:name:] and [:☒:] (note the single brackets) must be a part of a surrounding character class. However, you may combine them inside one character class, such as [_[:d:]x[:upper:]=], which is a character class that would match any digit, any uppercase, the lowercase x, and the literal _ and = characters. These named classes won’t always appear with the double brackets, but they will always be inside of a character class.
If the [:name:] or [:☒:] are accidentally not contained inside a surrounding character class, they will lose their special meaning. For example, [:upper:] is the character class matching :, u, p, e, and r; whereas [[:upper:]] is similar to [A-Z] (plus other Unicode uppercase letters).
[^[:name:]] or [^[:☒:]] ⇒ The complement of character class named name or ☒ (matching anything not in that named class). This uses the same long names, short names, and rules as mentioned in the previous description.
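Two of the pitfalls above can be reproduced in other regular expression engines too. The following is a small sketch in Python, chosen here only for illustration. Python's re module does not support the POSIX named classes, so only the single-bracket reading of [:upper:] is shown. The sample strings are made up.

```python
import re

text = "xy\nzA tail"

# A negated set also matches line breaks, so this match spans two lines.
across_lines = re.match(r"[^ABC]*", text).group()

# Adding \r and \n to the exception list confines the match to one line.
single_line = re.match(r"[^ABC\r\n]*", text).group()

# With single brackets this is a plain set of the characters : u p e r.
single_bracket = re.findall(r"[:upper:]", "Upper: CASE")

print(repr(across_lines), repr(single_line), single_bracket)
```

The first match stops only at the A on the second line, while the second one stops at the line break.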
A method for locating specific character strings embedded in character text is described and an implementation of this method in the form of a compiler is discussed. The compiler accepts a regular expression as source language and produces an IBM 7094 program as object language. The object program then accepts the text to be searched as input and produces a signal every time an embedded string in the text matches the given regular expression. Examples, problems, and solutions are also presented.
The algorithm works recursively by splitting an expression into its constituent subexpressions, from which the NFA will be constructed using a set of rules.[3] More precisely, from a regular expression E, the obtained automaton A with the transition function δ respects the following properties:
A has exactly one initial state q0, which is not accessible from any other state. That is, for any state q and any letter a, δ(q, a) does not contain q0.
A has exactly one final state qf, which is not co-accessible from any other state. That is, for any letter a, δ(qf, a) = ∅.
Let c be the number of concatenations in the regular expression E and let s be the number of symbols apart from parentheses, that is, |, *, a and ε. Then the number of states of A is 2s − c (linear in the size of E).
The number of transitions leaving any state is at most two.
Since an NFA of m states and at most e transitions from each state can match a string of length n in time O(emn), a Thompson NFA can do pattern matching in linear time, assuming a fixed-size alphabet.
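The properties above can be checked on a small example. The following is a minimal sketch in Python, not the 1968 compiler: it takes a hand-written syntax tree instead of parsing an expression, and the names NFA, build, closure and matches are made up for this illustration. It builds the automaton for (a|b)*c, which has 5 symbols and 1 concatenation, so 2·5 − 1 = 9 states are expected.

```python
from itertools import count

EPS = None  # label for ε-transitions


class NFA:
    def __init__(self):
        self._ids = count()
        self.trans = {}  # state -> list of (label, target state)

    def new_state(self):
        state = next(self._ids)
        self.trans[state] = []
        return state

    def add(self, src, label, dst):
        self.trans[src].append((label, dst))


def build(nfa, node, start=None):
    """Build the fragment for an AST node; return (initial, final) states."""
    q = nfa.new_state() if start is None else start
    kind = node[0]
    if kind == "cat":
        # Concatenation: the final state of the left part is reused as the
        # initial state of the right part, so one state is saved.
        _, mid = build(nfa, node[1], q)
        _, f = build(nfa, node[2], mid)
        return q, f
    if kind == "sym":  # a single letter
        f = nfa.new_state()
        nfa.add(q, node[1], f)
        return q, f
    if kind == "eps":  # the empty expression ε
        f = nfa.new_state()
        nfa.add(q, EPS, f)
        return q, f
    if kind == "alt":  # union s|t
        s1, f1 = build(nfa, node[1])
        s2, f2 = build(nfa, node[2])
        f = nfa.new_state()
        for src, dst in ((q, s1), (q, s2), (f1, f), (f2, f)):
            nfa.add(src, EPS, dst)
        return q, f
    if kind == "star":  # Kleene star s*
        s1, f1 = build(nfa, node[1])
        f = nfa.new_state()
        for src, dst in ((q, s1), (q, f), (f1, s1), (f1, f)):
            nfa.add(src, EPS, dst)
        return q, f
    raise ValueError(f"unknown node kind: {kind!r}")


def closure(nfa, states):
    """Return all states reachable from `states` via ε-transitions only."""
    seen, stack = set(states), list(states)
    while stack:
        for label, dst in nfa.trans[stack.pop()]:
            if label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def matches(nfa, start, final, text):
    """Simulate the NFA on text by tracking the set of active states."""
    current = closure(nfa, {start})
    for ch in text:
        moved = {d for s in current for lab, d in nfa.trans[s] if lab == ch}
        current = closure(nfa, moved)
    return final in current


# (a|b)*c as a hand-written AST (a parser is out of scope for this sketch).
ast = ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "c"))
nfa = NFA()
start, final = build(nfa, ast)
print(len(nfa.trans), matches(nfa, start, final, "abbac"))
```

The simulation keeps a set of active states per input character, which is where the O(emn) bound comes from: at most m active states, each with at most e outgoing transitions, for each of the n characters.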
Assuming the extensions are up to 4 chars in length (so filenames like mr.smith aren’t considered as having an extension, but mr.smith.doc and mr.smith.html are considered as having extensions):
^.*[^.]{5}$
No need to capture a group, as the whole expression is what you want, i.e. group 0.
Depending on the extension length, increase the 5: for example, use 7 for extensions of up to 6 characters (it is always N+1 when you want to cover extensions of N characters).
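To see what the expression accepts, here is a short sketch in Python using the three file names from the text. The constant name NO_SHORT_EXTENSION is made up for this illustration. The expression succeeds when the last five characters contain no dot, so it selects the name that is not considered to have an extension.

```python
import re

# The expression from the text: the last five characters contain no dot.
NO_SHORT_EXTENSION = re.compile(r"^.*[^.]{5}$")

names = ["mr.smith", "mr.smith.doc", "mr.smith.html"]
matched = [name for name in names if NO_SHORT_EXTENSION.match(name)]
print(matched)
```

One side effect to be aware of: a name shorter than five characters can never match, whether it has an extension or not.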
I think the easiest way to list all VMs is the vim-cmd vmsvc/getallvms command, but it has a big downside: the output is a mess.
The reason is that the output:
has a lot of columns (Vmid, Name, Datastore, File, Guest OS, Version, Annotation),
more than 500 characters per line (eat that 1080p monitor!),
and potentially more than one line per VM as the Annotation is a free-text field that can have newlines.
Example output on one of my machines:
5 PPB Local_Virtual Machine_v4.0 [EVO860_500GB] VM/PPB-Local_Virtual-Machine_v4.0/PPB Local_Virtual Machine_v4.0.vmx centos64Guest vmx-11 PowerPanel Business software(Local) provides the service which communicates
with the UPS through USB or Serial cable and relays the UPS state to each Remote on other computers
via a network.
It also monitors and logs the UPS status. The computer which has been installed the Local provides
graceful,
unattended shutdown in the event of the power outage to protect the hosted computer.
As an alternative, you could use esxcli vm process list, but that gives IDs that are way harder to remember:
Version looks like vmx-# where # is an unsigned integer
Annotation is multi-line free-form text, so it can potentially contain lines that start like a Vmid, but the chance that such a line looks exactly like a non-annotated one is very low
So let’s find a grep or sed filter to get just the lines without annotation continuations. In general I try to avoid regular expressions, as they are hard to both write and read, but with Busybox there is not much choice.
I chose sed, just in case I wanted to do some manipulation in addition to matching.
This means far less escaping than basic regular expressions, capture groups are supported as well as character classes (so [[:digit:]] is more readable than [0-9]), and the + is supported to match once or more (so [0-9]+ means one or more digits, as does [[:digit:]]+, but [d]+ or \d+ don’t). Unfortunately named capture groups are not supported, so documenting parts of the regular expression like (?<Vmid>^[[:digit:]]+) is not possible: it will give you an error [Wayback] Invalid preceding regular expression.
But first a few of the sed commandline options and their order:
vim-cmd vmsvc/getallvms | sed -n -E -e '/(^[[:digit:]]+)/p'
-n outputs only matching lines that have a p print command.
-E allows extended regular expressions (you can also use -r for that)
-e adds a sed script (in this case one containing an extended regular expression)
'/(^[[:digit:]]+)/p' is the extended regular expression embedded in quotes
/ at the start indicates that sed should match the regular expression on each line it parses
/p at the end indicates the matching line should be printed
Parentheses ( and ) surround a capture group
^[[:digit:]]+ matches 1 or more digits at the start of the line
The grep command is indeed much shorter, but does not allow post-editing:
I came up with the below sed regular expression to select lines that:
start with a Vmid unsigned integer
have a [Datastore] before the File
have a Guest OS identifier after File
have a Version matching vmx-# after File, where # is an unsigned integer
optionally have an Annotation after Version
vim-cmd vmsvc/getallvms | sed -n -E -e "/^([[:digit:]]+)(\s+)((\S.+\S)?)(\s+)(\[\S+\])(\s+)(.+\.vmx)(\s+)(\S+)(\s+)(vmx-[[:digit:]]+)(\s*?)((\S.+)?)$/p"
A longer expression that I used to fiddle around with is at regex101.com/r/A7MfKu and contains named capture groups. I had to nest a few groups and use the ? non-greedy (or lazy) operator a few times to ensure the fields would not include the spaces between the columns.
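For readers who want to experiment with named groups outside of sed, the following is a rough sketch in Python. It is not the regex101 expression mentioned above, but a simplified spelling of the same idea: the separator groups are dropped, \d stands in for [[:digit:]], and all group names are my own choice. The sample lines are the first three lines of the example output shown earlier, with the column padding collapsed to single spaces.

```python
import re

# Simplified Python spelling of the sed expression, with named groups.
VM_ROW = re.compile(
    r"^(?P<vmid>\d+)\s+"
    r"(?P<name>\S.+\S)?\s+"
    r"(?P<datastore>\[\S+\])\s+"
    r"(?P<file>.+\.vmx)\s+"
    r"(?P<guest_os>\S+)\s+"
    r"(?P<version>vmx-\d+)\s*"
    r"(?P<annotation>\S.*)?$"
)

lines = [
    "5 PPB Local_Virtual Machine_v4.0 [EVO860_500GB] "
    "VM/PPB-Local_Virtual-Machine_v4.0/PPB Local_Virtual Machine_v4.0.vmx "
    "centos64Guest vmx-11 PowerPanel Business software(Local) provides "
    "the service which communicates",
    "with the UPS through USB or Serial cable and relays the UPS state "
    "to each Remote on other computers",
    "via a network.",
]

# Keep only the lines that look like a VM row; continuation lines of a
# multi-line Annotation do not match and are skipped.
rows = [m.groupdict() for m in map(VM_ROW.match, lines) if m]
for row in rows:
    print(row["vmid"], row["name"], row["datastore"], row["version"])
```

The greedy name and file groups rely on there being exactly one [Datastore] and one .vmx path on the line, so an annotation that happens to contain such text could still confuse it.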
Output from “vim-cmd vmsvc/getallvms” is really challenging to process. Our normal approaches such as awk column indexes, character index, and regular expression are all error prone here. The character index of each column varies depending on maximum field length of, for example, VM name. And the presence of spaces in VM names throws off processing as awk columns. And VM name could contain almost any character, foiling regex’s.
Printing capture groups
The cool thing is that it is straightforward to modify the expression to print any of the capture groups in the order you wish: you convert the match expression (/match/p) into a replacement expression (s/match/replace/p) and print the required capture groups in the replace part. A short example is at [Wayback] regex – How to output only captured groups with sed? – Stack Overflow.
There is one gotcha though: Busybox sed only allows single-digit capture group numbers, and we have far more than 9 capture groups. A reference to capture group 10 fails: it prints the output of capture group 1 followed by a literal 0. Likewise, a reference to group 12 prints group 1 followed by a literal 2.
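The idea of turning a match into a replacement that prints selected groups is not specific to sed. As a point of comparison, here is a minimal sketch in Python, where the \g<N> syntax refers to a group by number without the single-digit ambiguity. The twelve-field sample text is made up for illustration and has nothing to do with the vim-cmd output.

```python
import re

# Twelve single-letter fields separated by spaces, one capture group each.
pattern = re.compile(" ".join([r"(\w)"] * 12))
text = "a b c d e f g h i j k l"

# \g<N> names the group unambiguously, also for two-digit numbers.
reordered = pattern.sub(r"\g<12> \g<10> \g<1>", text)
print(reordered)
```

This prints the twelfth, tenth and first field, in that order.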
I really dislike using regular expressions, mainly because every time I bump into code using them either:
I cannot decipher them any more
They are used for things they are not suited for (like parsing JSON or XML: please don’t!)
For more background on when NOT to use regular expressions, remember that they describe a regular grammar, and can only be implemented by a finite state machine (a state machine that can be in exactly one state out of a finite set of states).
As soon as you need to parse something that needs multiple states at once, or the number of states becomes infinite,
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
Matching Control
-e PATTERN, --regexp=PATTERN
Use PATTERN as the pattern. This can be used to specify multiple search patterns, or to protect a pattern beginning with a hyphen (-). (-e is specified by POSIX.)
(…)
grep understands two different versions of regular expression syntax: “basic” and “extended.” In GNU grep, there is no difference in available functionality using either syntax. In other implementations, basic regular expressions are less powerful. The following description applies to extended regular expressions; differences for basic regular expressions are summarized afterwards.
In the beginning I didn’t read further, so I didn’t recognize the subtle differences:
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
I always used egrep and needlessly parens, because I learned from examples. Now I learned something new. :)