Michael Kaplan Obituary – Berkowitz-Kumin-Bookatz | Cleveland Heights OH (and a whole bunch of info in zero width Unicode stuff)
Posted by jpluimers on 2018/01/02
I totally missed the passing of Michael Scott Kaplan some 2 years ago, so a belated R.I.P. is in order.
Obituary for Michael Kaplan. Michael Scott Kaplan, 45, passed away Wednesday, October 21, 2015, in Redmond, WA, after a brave battle with MS for 25 years. He was a lead software developer for Microsoft.
Source: [WayBack] Michael Kaplan Obituary – Berkowitz-Kumin-Bookatz | Cleveland Heights OH
Michael was the leading source on i18n, L10N, Unicode, sorting, normalisation and other things having to do with languages, representations and writing.
Besides that, he was a really nice guy, and I enjoyed his MSDN materials.
Other people enjoyed them too, so I’m glad his writings have been archived: [first archive.is, second archive.is, WayBack] Sorting it All Out: Archives
Here are some additional links:
- https://web.archive.org/web/*/https://blogs.msdn.microsoft.com/michkap//*
- [WayBack] Sorting out Internationalization with Michael Kaplan on the Hanselminutes Technology Podcast: Fresh Air for Developers: Michael Kaplan is a Developer in the Windows International group and the author of the popular ‘Sorting It Out’ blog that is dedicated to all things ‘-ization.’ That means Globalization, Internationalization, and Localization. This show is brought to you by the CYRILLIC CAPITAL LETTER A.
- Facebook: Michael S. Kaplan
- [WayBack] Michael S. Kaplan (@michkap). Blood type AB+ geek who loves i18n etc. I worked for MS & have MS & I had an iBot. Got game tho ain’t playin much anymore… Eto Akta Gamat!. Seattle, WA, USA
- [WayBack] Sorting the rest all Out
- [WayBack] RIP Michael J Kaplan (of Sorting It All Out blog) | The VSubhash.com Blog
- [WayBack] Excellent blog about Windows and Unicode – The Old New Thing
- [WayBack] Michael Kaplan leaves Microsoft
More on miloush.net:
- [WayBack] miloush.net Feed Services
- [Archive.is 1, Archive.is 2] http://miloush.net/
- [WayBack] Keyboard Layout Info – Keyboard Layout Info
- [Archive.is] Emoji List
- [Archive.is] Skype Emoticons List: list of public and hidden emoticons in Skype.
I got there while researching U+200C and U+200D:
- [WayBack] Every character has a story #19: U+200c and U+200d (ZERO WIDTH [NON] JOINER)
- [WayBack] Break it up, you two!: The zero width non-joiner – The Old New Thing
- [WayBack] Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)
The relevant Unicode code points in that research:
- [WayBack] Unicode Character ‘SPACE’ (U+0020)
- [WayBack] Unicode Character ‘NO-BREAK SPACE’ (U+00A0)
- U+0300..U+036F: Combining Diacritical Marks – Wikipedia
- [WayBack] Unicode Character ‘OGHAM SPACE MARK’ (U+1680)
- U+1AB0..U+1AFx: Combining Diacritical Marks Extended – Wikipedia
- U+1DC0..U+1DFx: Combining Diacritical Marks Supplement – Wikipedia
- U+2000..U+206x: [WayBack] Unicode Characters in the General Punctuation Block / General Punctuation – Wikipedia
- [WayBack] Unicode Character ‘EN QUAD’ (U+2000)
- [WayBack] Unicode Character ‘EM QUAD’ (U+2001)
- [WayBack] Unicode Character ‘EN SPACE’ (U+2002)
- [WayBack] Unicode Character ‘EM SPACE’ (U+2003)
- [WayBack] Unicode Character ‘THREE-PER-EM SPACE’ (U+2004)
- [WayBack] Unicode Character ‘FOUR-PER-EM SPACE’ (U+2005)
- [WayBack] Unicode Character ‘SIX-PER-EM SPACE’ (U+2006)
- [WayBack] Unicode Character ‘FIGURE SPACE’ (U+2007)
- [WayBack] Unicode Character ‘PUNCTUATION SPACE’ (U+2008)
- [WayBack] Unicode Character ‘THIN SPACE’ (U+2009)
- [WayBack] Unicode Character ‘HAIR SPACE’ (U+200A)
- [WayBack] Unicode Character ‘ZERO WIDTH SPACE’ (U+200B)
- [WayBack] Unicode Character ‘ZERO WIDTH NON-JOINER’ (U+200C)
- [WayBack] Unicode Character ‘ZERO WIDTH JOINER’ (U+200D)
- [WayBack] Unicode Character ‘LEFT-TO-RIGHT MARK’ (U+200E)
- [WayBack] Unicode Character ‘RIGHT-TO-LEFT MARK’ (U+200F)
- [WayBack] Unicode Character ‘NARROW NO-BREAK SPACE’ (U+202F)
- [WayBack] Unicode Character ‘MEDIUM MATHEMATICAL SPACE’ (U+205F)
- [WayBack] Unicode Character ‘WORD JOINER’ (U+2060)
- U+20D0..U+20Fx: Combining Diacritical Marks for Symbols – Wikipedia
- U+3000..U+303x: [WayBack] Unicode Characters in the CJK Symbols and Punctuation Block
- U+FE0x: [WayBack] Unicode Characters in the Variation Selectors Block / Variation Selectors (Unicode block) – Wikipedia
- [WayBack] Unicode Character ‘VARIATION SELECTOR-1’ (U+FE00)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-2’ (U+FE01)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-3’ (U+FE02)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-4’ (U+FE03)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-5’ (U+FE04)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-6’ (U+FE05)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-7’ (U+FE06)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-8’ (U+FE07)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-9’ (U+FE08)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-10’ (U+FE09)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-11’ (U+FE0A)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-12’ (U+FE0B)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-13’ (U+FE0C)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-14’ (U+FE0D)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-15’ (U+FE0E)
- [WayBack] Unicode Character ‘VARIATION SELECTOR-16’ (U+FE0F)
- U+FE2x: Combining Half Marks – Wikipedia
- U+FE70..U+FEFx: [WayBack] Unicode Characters in the Arabic Presentation Forms-B Block
- U+E0100..U+E01Ex: [WayBack] Unicode Characters in the Variation Selectors Supplement Block / Variation Selectors Supplement – Wikipedia
- [WayBack] Unicode Characters in the ‘Separator, Space’ Category / Whitespace character – Wikipedia
- [WayBack] Unicode Characters in the ‘Mark, Nonspacing’ Category
- [WayBack] Unicode Characters in the ‘Other, Format’ Category
- [WayBack] Network.IDN.blacklist chars – MozillaZine Knowledge Base
- [WayBack] Unicode Blocks / Unicode block – Wikipedia
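To give an idea of how the three categories in that list (‘Separator, Space’, ‘Mark, Nonspacing’ and ‘Other, Format’) can be used to spot these code points, here is a minimal Python sketch. It is not taken from any of the linked pages; the names `SUSPECT_CATEGORIES` and `find_suspects` are made up for illustration, and it only relies on the standard `unicodedata` module.

```python
import unicodedata

# Hypothetical helper names; the categories are the short codes for
# 'Separator, Space' (Zs), 'Mark, Nonspacing' (Mn) and 'Other, Format' (Cf).
SUSPECT_CATEGORIES = {"Zs", "Mn", "Cf"}

def find_suspects(text):
    """Return (index, 'U+XXXX', name) for each suspect character in text."""
    hits = []
    for index, char in enumerate(text):
        if char == " ":
            continue  # a plain U+0020 space is expected in normal text
        if unicodedata.category(char) in SUSPECT_CATEGORIES:
            name = unicodedata.name(char, "<unnamed>")
            hits.append((index, f"U+{ord(char):04X}", name))
    return hits

sample = "We're\u200b not the same text"
for hit in find_suspects(sample):
    print(hit)  # (5, 'U+200B', 'ZERO WIDTH SPACE')
```

This only reports; it does not remove anything. The ‘Mark, Nonspacing’ category also contains perfectly legitimate combining accents, so a hit is a reason to look closer, not proof of tampering.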
Related:
- [WayBack] javascript – Remove non-ascii character in string – Stack Overflow
- [WayBack] Unicode and JavaScript
- [WayBack] Fingerprinting with Zero-Width Characters… Kristian Köhntopp – Google+
- Fingerprinting with Zero-Width Characters (does not archive)
- Text Fingerprinting Update (does not archive)
From the G+ thread, a few nice comments:
- Quork Q’Tar:
So copying and pasting into Notepad++ and viewing the text in several character encodings (or, when no special characters are needed (which is almost always), converting it straight to ASCII and only then copy-pasting it into the target document) should uncover just about everything of this kind, apart from word substitution (which is anything but new as a method)?
- Jürgen Christoffel:
+Quork Q’Tar no, not copy/paste, but print it out and scan it back in with OCR. A bitmap scan is not enough: it could still contain recognizable glyphs (the Cyrillic “a” or similar).
- Tobias Migge:
Sample text copied into Notepad++, then Plugins->MIME Tools->Quoted Printable Encode:
  - We’re=E2=80=8B not the=E2=80=8B same text, even though we look the same.
  - We’re not the same=E2=80=8B text, even though we look the same.
- Steve S:
+Quork Q’Tar You can paste it into regular Notepad and save as ANSI instead of UTF-8. That strips it out: I tested it just now.
- Jeroen Wiert Pluimers:
+Steve S though that kills many other useful characters; which ones depends on your particular ANSI encoding.
- Jeroen Wiert Pluimers:
It should not be too hard to write a JavaScript web page that, without a round trip, strips a lot of this. It can even be run from localhost.
- Steve S:
+Jeroen Wiert Pluimers Yes, that’s true. Really, the right answer is to feed it through a program to canonicalize the text. This includes fixing “typos”, making all of the words either American or British, and so on. Not a trivial task.
(A few years ago, I had to write a small subset of this as part of a program that de-duped email threads, so I’m a bit familiar with the issues.)
- Jeroen Wiert Pluimers:
+Steve S that sounds like an interesting project to base such a thing on. Any chance to publicise that source? If so: what language?
- Jürgen Christoffel:
+Jeroen Wiert Pluimers once upon a time, there was something called the “writer’s workbench” for BSD 4.x (or was it AT&T’s?) This might be / have been a good place to start. Don’t remember if it ever was available in source, though.
- Quork Q’Tar:
In other words, “few years” doesn’t mean two or three here =D
- Steve S:
+Jeroen Wiert Pluimers It was done for hire, so I don’t have any of the code, and wouldn’t own it if I somehow had it. But the basic idea is very simple: for my purposes, only alphanumerics mattered. For “weird” characters, what matters is filtering out the gratuitous punctuation and canonicalizing representations.
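The quoted-printable check and the stripping idea from the thread can be roughly sketched in Python. This is my own illustration under a few assumptions, not code from anyone in the thread: `reveal` and `strip_invisible` are hypothetical names, the stripping rule (drop everything in the ‘Other, Format’ category, turn every ‘Separator, Space’ character into a plain space) is just one possible choice, and the save-as-ANSI route is approximated by a lossy encode to ASCII.

```python
import quopri
import unicodedata

def reveal(text):
    """Quoted-printable view of the UTF-8 bytes, so hidden bytes show up as =XX."""
    return quopri.encodestring(text.encode("utf-8")).decode("ascii")

def strip_invisible(text):
    """Assumed rule: drop 'Other, Format' (Cf), map any 'Separator, Space' (Zs) to U+0020."""
    cleaned = []
    for char in text:
        category = unicodedata.category(char)
        if category == "Cf":
            continue  # zero width space and joiners, direction marks, word joiner, U+FEFF
        cleaned.append(" " if category == "Zs" else char)
    return "".join(cleaned)

first = "We're\u200b not the\u200b same text"
second = "We're not the same\u200b text"

print(reveal(first))                                        # shows =E2=80=8B twice
print(strip_invisible(first) == strip_invisible(second))    # True

# Lossy encode as a stand-in for "save as ANSI": the marker goes, but so does the umlaut.
name = "Kristian K\u00f6hntopp\u200b"
print(name.encode("ascii", errors="ignore").decode("ascii"))  # Kristian Khntopp
print(strip_invisible(name))                                   # Kristian Köhntopp
```

Two caveats. Dropping U+200C and U+200D wholesale damages text in scripts that need them (and emoji sequences that use the zero width joiner), so this is only reasonable for text where you do not expect them. Variation selectors are in the ‘Mark, Nonspacing’ category and pass through this sketch untouched, and it does nothing against look-alike glyphs or word substitution.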
–jeroen
via:
- [WayBack] Fingerprinting with Zero-Width Characters… Kristian Köhntopp – Google+
- [WayBack] Zach Aysan on Twitter: I just came across Variation Selectors in unicode. Just wow. I have to investigate how various platforms handle them, but if they’re as effective as I’m guessing, it puts my Zero-Width Character trick to shame. Thanks @FakeUnicode
- [WayBack] Zach Aysan on Twitter: Thanks for all the great feedback on my article on Zero-Width character fingerprinting. I have a quick update with some interesting quotes including @JSwiftTWS from The @weeklystandard. https://t.co/m9dMXeySRt
- [WayBack] Zach Aysan on Twitter: Journalists watch out—you may be unintentionally revealing sources. https://t.co/eKUnVfsEYc