The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Michael Kaplan Obituary – Berkowitz-Kumin-Bookatz | Cleveland Heights OH (and a whole bunch of info in zero width Unicode stuff)

Posted by jpluimers on 2018/01/02

I totally missed the passing of Michael Scott Kaplan some 2 years ago, so a belated R.I.P. is in order.

Obituary for Michael Kaplan: Michael Scott Kaplan, 45, passed away Wednesday, October 21, 2015, in Redmond, WA, after a brave battle with MS for 25 years. He was a lead software developer for Microsoft.

Source: [WayBack] Michael Kaplan Obituary – Berkowitz-Kumin-Bookatz | Cleveland Heights OH

Michael was the leading source on i18n, L10N, Unicode, sorting, normalisation and other things having to do with languages, representations and writing.

Besides that, he was a really nice guy whose MSDN materials I enjoyed.

Other people enjoy them too, so I’m glad his writings have been archived: [first archive.is, second archive.is, WayBack] Sorting it All Out: Archives

Here are some additional links:

More on miloush.net:

I got there while researching U+200C and U+200D:

The relevant Unicode code points in that research:

Related:

From the G+ thread, a few nice comments:

  • Quork Q’Tar:
    So copying and pasting into Notepad++ and viewing the text in several character encodings (or, when no special characters are needed, which is almost always, converting it straight to ASCII before pasting it into the target document) should expose pretty much everything of this kind, except for word substitution (which is anything but a new method)?
  • Jürgen Christoffel:
    +Quork Q’Tar no, not copy/paste, but printing it out and scanning it back in with OCR. A bitmap scan is not enough, as it could still contain recognizable glyphs (the Cyrillic “a” or similar).
  • Tobias Migge:
    Copied the sample text into Notepad++, then Plugins->MIME Tools->Quoted Printable Encode:

    • We’re=E2=80=8B not the=E2=80=8B same text, even though we look the same.
    • We’re not the same=E2=80=8B text, even though we look the same.
  • Steve S:
    +Quork Q’Tar You can paste it into regular Notepad and save as ANSI instead of UTF-8. That strips it out: I tested it just now.
  • Jeroen Wiert Pluimers:
    +Steve S though that kills many other useful characters which depends on your particular ANSI encoding.
  • Jeroen Wiert Pluimers:
    It should not be too hard to write a JavaScript web page that – without a round trip – strips a lot of this. It could even be run from localhost.
  • Steve S:
    +Jeroen Wiert Pluimers Yes, that’s true. Really, the right answer is to feed it through a program to canonicalize the text. This includes fixing “typos”, making all of the words either American or British, and so on. Not a trivial task.
    (A few years ago, I had to write a small subset of this as part of a program that de-duped email threads, so I’m a bit familiar with the issues.)
  • Jeroen Wiert Pluimers:
    +Steve S that sounds like an interesting project to base such a thing on. Any chance of publishing that source? If so: what language?
  • Jürgen Christoffel:
    +Jeroen Wiert Pluimers once upon a time, there was something called the “writer’s workbench” for BSD 4.x (or was it AT&T’s?). This might be / have been a good place to start. Don’t remember if it ever was available in source, though.
  • Quork Q’Tar:
    In other words, “few years” doesn’t mean two or three here =D
  • Steve S:
    +Jeroen Wiert Pluimers It was done for hire, so I don’t have any of the code, and wouldn’t own it if I somehow had it. But the basic idea is very simple: for my purposes, only alphanumerics mattered. For “weird” characters, what matters is filtering out the gratuitous punctuation and canonicalizing representations.
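
The ideas from the thread above (making the hidden characters visible with quoted-printable, stripping them, and reducing text to a canonical form before comparing) can be sketched in a few lines. This is only a minimal sketch in Python rather than the JavaScript page mentioned in the thread, and it is not the code any of the commenters used. The names `INVISIBLE`, `reveal`, `strip_invisible` and `canonical_key` are made up for illustration, and the list of code points is an assumption that is far from complete.

```python
import quopri
import unicodedata

# Assumption: an illustrative, non-exhaustive set of invisible code points.
INVISIBLE = {
    "\u200b",  # ZERO WIDTH SPACE (the =E2=80=8B bytes in the sample above)
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE / byte order mark
}


def reveal(text: str) -> str:
    """Quoted-printable view of the UTF-8 bytes, so hidden characters show up."""
    return quopri.encodestring(text.encode("utf-8")).decode("ascii")


def strip_invisible(text: str) -> str:
    """Drop the listed invisible code points and leave everything else alone."""
    return "".join(ch for ch in text if ch not in INVISIBLE)


def canonical_key(text: str) -> str:
    """Comparison key: NFKC-normalize, case-fold, keep only alphanumerics."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return "".join(ch for ch in folded if ch.isalnum())


if __name__ == "__main__":
    a = "We're\u200b not the\u200b same text, even though we look the same."
    b = "We're not the same\u200b text, even though we look the same."
    print(reveal(a))
    print(a == b, strip_invisible(a) == strip_invisible(b))
```

Two caveats: U+200C and U+200D carry meaning in some scripts and in emoji sequences, so stripping them blindly is lossy, and none of this catches look-alike glyphs such as the Cyrillic “a” or the word substitution mentioned in the thread.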

–jeroen

via:
