The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 2,815 other followers

Archive for the ‘RegEx’ Category

RegEx character classes in “Searching | Notepad++ User Manual”

Posted by jpluimers on 2022/02/03

I needed to search for IBAN numbers in documents and used this regular expression: [a-zA-Z]{2}[0-9]{2} ?[a-zA-Z0-9]{4} ?[0-9]{4} ?[0-9]{4} ?[0-9]{2} which supports the usual optional whitespace like in NL12 INGB 0345 6789 01.

It is based on a nice list with table of Notepad++ RegEx character classes supported at [Wayback] Searching | Notepad++ User Manual:

Character Classes
  • [set] ⇒ This indicates a set of characters, for example, [abc] means any of the literal characters ab or c. You can also use ranges by doing a hyphen between characters, for example [a-z] for any character from a to z. You can use a collating sequence in character ranges, like in [[.ch.]-[.ll.]] (these are collating sequence in Spanish).
  • [^set] ⇒ The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character. Care should be taken with a complement list, as regular expressions are always multi-line, and hence [^ABC]* will match until the first AB or C (or ab or c if match case is off), including any newline characters. To confine the search to a single line, include the newline characters in the exception list, e.g. [^ABC\r\n].

Please note that the complement of a character set is often many more characters than you expect: (?-s)[^x]+ will match 1 or more instances of any non-x character, including newlines: the (?-s) search modifier turns off “dot matches newlines”, but the [^x] is not a dot ., so that class is still allowed to match newlines.

  • [[:name:]] or [[:☒:]] ⇒ The whole character class named name. For many, there is also a single-letter “short” class name, ☒. Please note: the [:name:] and [:☒:] must be inside a character class [...] to have their special meaning.
    short full name description equivalent character class
    alnum letters and digits
    alpha letters
    h blank spacing which is not a line terminator [\t\x20\xA0]
    cntrl control characters [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]
    d digit digits
    graph graphical character, so essentially any character except for control chars, \0x7F\x80
    l lower lowercase letters
    print printable characters [\s[:graph:]]
    punct punctuation characters [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_{
    s space whitespace (word or line separator) [\t\n\x0B\f\r\x20\x85\xA0\x{2028}\x{2029}]
    u upper uppercase letters
    unicode any character with code point above 255 [\x{0100}-\x{FFFF}]
    w word word characters [_\d\l\u]
    xdigit hexadecimal digits [0-9A-Fa-f]

    Note that letters include any unicode letters (ASCII letters, accented letters, and letters from a variety of other writing systems); digits include ASCII numeric digits, and anything else in Unicode that’s classified as a digit (like superscript numbers ¹²³…).

    Note that those character class names may be written in upper or lower case without changing the results. So [[:alnum:]] is the same as [[:ALNUM:]] or the mixed-case [[:AlNuM:]].

    As stated earlier, the [:name:] and [:☒:] (note the single brackets) must be a part of a surrounding character class. However, you may combine them inside one character class, such as [_[:d:]x[:upper:]=], which is a character class that would match any digit, any uppercase, the lowercase x, and the literal _ and = characters. These named classes won’t always appear with the double brackets, but they will always be inside of a character class.

    If the [:name:] or [:☒:] are accidentally not contained inside a surrounding character class, they will lose their special meaning. For example, [:upper:] is the character class matching :upe, and r; whereas [[:upper:]] is similar to [A-Z] (plus other unicode uppercase letters)

  • [^[:name:]] or [^[:☒:]] ⇒ The complement of character class named name or ☒ (matching anything not in that named class). This uses the same long names, short names, and rules as mentioned in the previous description.


Posted in Development, Notepad++, Power User, RegEx, Software Development, Text Editors | Leave a Comment »

windows – Is there any sed like utility for cmd.exe? – Stack Overflow

Posted by jpluimers on 2021/07/19

[WayBack] windows – Is there any sed like utility for cmd.exe? – Stack Overflow

TL;DR: many people suggest to use PowerShell, but there is GNU sed in Chocolatey

The chocolatey part:

The PowerShell part: read the other answers from the above question.


Posted in *nix, *nix-tools, CommandLine, Power User, PowerShell, RegEx, sed, Windows | Leave a Comment »

CloudFlare knows how to do public postmortems on outages

Posted by jpluimers on 2021/07/16

Everyone can learn from an outage. CloudFlare shows how to do it right, for instance on the RegEx-going-wild downtime 2 years ago.

So it’s time to link to that one again: [WayBack] Details of the Cloudflare outage on July 2, 2019

More like these at [WayBack] Post Mortem – The Cloudflare Blog.

More on evaluating regular expressions in linear time:

Via [WayBack] Details of the Cloudflare outage on July 2, 2019 | Hacker News


Posted in Algorithms, Development, Power User, RegEx, Software Development | Leave a Comment »

Regex for a file name without an extension – Stack Overflow

Posted by jpluimers on 2021/06/30

For me this unaccepted answer from [WayBack] Regex for a file name without an extension – Stack Overflow by [WayBack] Bohemian worked best:

Assuming the extensions are up to 4 chars in length (so filenames like mr.smith aren’t considered as having an extension, but mr.smith.doc and mr.smith.html are considered as having extensions):


No need to capture a group, as the whole expression is what you want – ie group 0.

Depending on the extension length, increase 5 to like 7 for 6 character extensions (it’s always N+1 when you want to match extensions of N characters).


Posted in Development, RegEx, Software Development | Leave a Comment »

VMware ESXi console: viewing all VMs, suspending and waking them up: part 1

Posted by jpluimers on 2021/04/22

I think the easiest way to list all VMs is the vim-cmd vmsvc/getallvms command, but it has a big downside: the output is a mess.

The reason is that the output:

  • has a lot of columns (Vmid, Name, Datastore, File, Guest OS, Version, Annotation),
  • more than 500 characters per line (eat that 1080p monitor!),
  • and potentially more than one line per VM as the Annotation is a free-text field that can have newlines.

Example output on one of my machines:

Vmid Name File Guest OS Version Annotation
10 X9SRI-3F-W10P-EN-MEDIA [EVO860_500GB] VM/X9SRI-3F-W10P-EN-MEDIA/X9SRI-3F-W10P-EN-MEDIA.vmx windows9_64Guest vmx-14
5 PPB Local_Virtual Machine_v4.0 [EVO860_500GB] VM/PPB-Local_Virtual-Machine_v4.0/PPB Local_Virtual Machine_v4.0.vmx centos64Guest vmx-11 PowerPanel Business software(Local) provides the service which communicates
with the UPS through USB or Serial cable and relays the UPS state to each Remote on other computers
via a network.
It also monitors and logs the UPS status. The computer which has been installed the Local provides
unattended shutdown in the event of the power outage to protect the hosted computer.

As an alternative, you could use esxcli vm process list, but that gives IDs that are way harder to remember:

PPB Local_Virtual Machine_v4.0
World ID: 2099719
Process ID: 0
VMX Cartel ID: 2099713
UUID: 56 4d 74 f8 c8 22 41 27-a3 88 49 df 8b dc d6 63
Display Name: PPB Local_Virtual Machine_v4.0
Config File: /vmfs/volumes/5d35e7d8-e8df636f-46b9-0025907d9d5c/VM/PPB-Local_Virtual-Machine_v4.0/PPB Local_Virtual Machine_v4.0.vmx
World ID: 2099728
Process ID: 0
VMX Cartel ID: 2099717
UUID: 56 4d 51 ac f6 cf e4 0b-b6 86 2f 53 a2 8a 4b ea
Display Name: X9SRI-3F-W10P-EN-MEDIA
Config File: /vmfs/volumes/5d35e7d8-e8df636f-46b9-0025907d9d5c/VM/X9SRI-3F-W10P-EN-MEDIA/X9SRI-3F-W10P-EN-MEDIA.vmx

I got both of the above commands from [Wayback] VMware Knowledge Base: Performing common virtual machine-related tasks with command-line utilities (2012964).

Back to the columns that vim-cmd vmsvc/getallvms returns:

  • Vmid is an unsigned integer
  • Name can have spaces
  • Datastore has square brackets [ and ] around it
  • File can contain spaces
  • Guest OS is an identifier without spaces (it is a value from [Wayback] the vSphere API VcVirtualMachineGuestOsIdentifier
  • Version looks like vmx-# where # is an unsigned integer
  • Annotation is multi-line free-form so potentially can have lines starting like being Vmid, but the chance that a line looks exactly like a non-annotated one is very low

So let’s find a grep or  sed filter to get just the lines without annotation continuations. Though in general I try to avoid regular expressions as they are hard to both write and read, but with Busybox there is no much choice.

I choose sed, just in case I wanted to do some manipulation in addition to matching.

Busybox sed

Though the source code [Wayback] sed.c\editors – busybox – BusyBox: The Swiss Army Knife of Embedded Linux indicates sed.c - very minimalist version of sed, the implementation actually is reasonably feature rich, just not feature complete. That’s OK given the aim of Busybox to be small.

Luckily, deep in the busybox sed code, it indicates that extended regular expressions are supported (support is in [Wayback] /uClibc/plain/libc/misc/regex/regcomp.c (look for regcomp, do not get confused by xregcomp on call sites as that is [Wayback] just a tiny wrapper to call regcomp).

The support has become better over time, like [Wayback] gnu – sed Command on BusyBox expects different syntax? – Super User shows.

This means far less escaping than basic regular expressions, capture groups are supported as well as character classes (so [[:digit:]] is more readable than [0-9]), and the + is supported to match once or more (so [0-9]+ means one or more digits, as does [[:digit:]]+, but [d]+ or \d+ don’t ). Unfortunately named capture groups are not supported (so documenting parts of the regular expression like (?<Vmid>^[[:digit:]]+) is not possible, it will give you an error [Wayback] Invalid preceding regular expression).

But first a few of the sed commandline options and their order:

vim-cmd vmsvc/getallvms | sed -n -E -e '/(^[[:digit:]]+)/p'
  1. -n outputs only matching lines that have a p print command.
  2. -E allows extended regular expressions (you can also use -r for that)
  3. -e adds a (in this case extended) regular expression
  4. '/(^[[:digit:]]+)/p' is the extended regular expression embedded in quotes
    1. / at the start indicates that sed should match the regular expression on each line it parses
    2. /p at the end indicates the matching line should be printed
    3. Parentheses ( and ) surround a capture group
    4. ^[[:digit:]]+ matches 1 or more digits at the start of the line

The grep command is indeed much shorter, but does not allow post-editing:

vim-cmd vmsvc/getallvms | grep -E '(^[[:digit:]]+)'

Building a sed filter

I came up with the below sed regular expression to filter out lines:

  1. starting with a Vmid unsigned integer
  2. having a [Datastore] before the File
  3. have a Guest OS identifier after File
  4. have a Version matching vmx-# after File where # is an unsigned integer
  5. optionally has an Annotation after Version
vim-cmd vmsvc/getallvms | sed -n -E -e  "/^([[:digit:]]+)(\s+)((\S.+\S)?)(\s+)(\[\S+\])(\s+)(.+\.vmx)(\s+)(\S+)(\s+)(vmx-[[:digit:]]

A longer expression that I used to fiddle around with is at and contains named capture groups. I had to nest a few groups and use the ? non-greedy (or lazy) operator a few times to ensure the fields would not include the spaces between the columns.

Others use different expressions as for instance explained in [Wayback] Get all VMs with “vmware-vim-cmd vmsvc/getallvms” – VMware Technology Network VMTN:

Output from “vim-cmd vmsvc/getallvms” is really challenging to process. Our normal approaches such as awk column indexes, character index, and regular expression are all error prone here. The character index of each column varies depending on maximum field length of, for example, VM name. And the presence of spaces in VM names throws off processing as awk columns. And VM name could contain almost any character, foiling regex’s.

Printing capture groups

The cool thing is that it is straightforward to modify the expression to print any of the capture groups in the order you wish: you convert the match expression (/match/p) into a replacement expression (s/match/replace/p) and print the required capture groups in the replace part. A short example is at [Wayback] regex – How to output only captured groups with sed? – Stack Overflow.

There is one gotcha though: Busybox sed only allows single-digit capture group numbers, and we have far more than 9 capture groups. This fails and prints 0 after the output of capture group 1 instead of printing capture group 10, similar for 2 after group 1 instead of printing group 12:

vim-cmd vmsvc/getallvms | sed -n -E -e  "s/^([[:digit:]]+)(\s+)((\S.+\S)?)(\s+)(\[\S+\])(\s+)(.+\.vmx)(\s+)(\S+)(\s+)(vmx-[[:digit:]]+)(\s*?)((\S.+)?)$/Vmid:\1 Guest:\10 Version:\12 Name:\3 Datastore:\7 File:\8/p"

So we need to cut down on capture groups first by removing all capture groups around the \s white-space matching:

vim-cmd vmsvc/getallvms | sed -n -E -e  "/^([[:digit:]]+)\s+((\S.+\S)?)\s+(\[\S+\])\s+(.+\.vmx)\s+(\S+)\s+(vmx-[[:digit:]]+)\s*?((\S.+)?)$/p"

Then we get this to print some of the capture groups:

vim-cmd vmsvc/getallvms | sed -n -E -e "s/^([[:digit:]]+)\s+((\S.+\S)?)\s+(\[\S+\])\s+(.+\.vmx)\s+(\S+)\s+(vmx-[[:digit:]]+)\s*?((\S.+)?)$/Vmid:\1 Guest:\6 Version:\7 Name:\3 Datastore:\4 File:\5 Annotation:\8/p"

With this output:

Vmid:10 Guest:windows9_64Guest Version:vmx-14 Name:X9SRI-3F-W10P-EN-MEDIA Datastore:[EVO860_500GB] File:VM/X9SRI-3F-W10P-EN-MEDIA/X9SRI-3F-W10P-EN-MEDIA.vmx Annotation:
Vmid:5 Guest:centos64Guest Version:vmx-11 Name:PPB Local_Virtual Machine_v4.0 Datastore:[EVO860_500GB] File:VM/PPB-Local_Virtual-Machine_v4.0/PPB Local_Virtual Machine_v4.0.vmx Annotation:PowerPanel Business software(Local) provides the service which communicates

Figuring out power state for each VM

This will be in the next installment, as by now this already has become a big blog-post (:


Posted in *nix, *nix-tools, ash/dash, ash/dash development, Development, ESXi6, ESXi6.5, ESXi6.7, ESXi7, Power User, RegEx, Scripting, Software Development, Virtualization, VMware, VMware ESXi | Leave a Comment »

%d bloggers like this: