Posted by jpluimers on 2016/10/04
A while ago (in fact more than a year), I posted Encoding is hard… to G+ with the picture below.
ftfy (fixes text for you) fixes it, but:
How did the single quote become “â€™”?
Actually, because of a common “beautification” applied by many Office suites (Microsoft and Open alike), the single quote was a special one: the Unicode character ‘RIGHT SINGLE QUOTATION MARK’ (U+2019), which in UTF-8 is encoded as 0xE2 0x80 0x99.
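A minimal sketch of how those three bytes can turn into “â€™”. It assumes the text was stored as UTF-8 and then read back as Windows-1252; the excerpt itself does not name the wrong encoding, and the function names are only illustrative.

```python
# Assumption: the garbling step is "UTF-8 bytes decoded as Windows-1252".
RIGHT_SINGLE_QUOTE = "\u2019"  # RIGHT SINGLE QUOTATION MARK


def garble(text: str) -> str:
    """Encode as UTF-8, then (wrongly) decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Undo garble(): re-encode as Windows-1252, then decode as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


utf8_bytes = RIGHT_SINGLE_QUOTE.encode("utf-8")
print(utf8_bytes.hex(" "))   # e2 80 99
print(garble("it’s"))        # itâ€™s
print(repair(garble("it’s")) == "it’s")
```

Note that Windows-1252 leaves a few byte values undefined, so this simple round trip can raise an error on other characters; that is part of why a tool such as ftfy is more convenient than doing it by hand.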
Posted in Development, Encoding, ISO-8859, ISO8859, Software Development, Unicode, UTF-8, UTF8, Windows-1252 | Leave a Comment »
Posted by jpluimers on 2016/09/06
Simple if you know it:
pip install ftfy
That installs it as a command, which is a lot easier than using it from GitHub at https://github.com/LuminosoInsight/python-ftfy
It knows how to solve the encoding issues in ÃƒÆ’Ã‚Æ’ÃƒÂ¢Ã‚â‚¬Ã‚Å¡ÃƒÆ’Ã‚â€šÃƒâ€šÃ‚Â the future of publishing at W3C.
It didn’t solve my non-Unicode encoding issue: “v3/43/4r” -> “v¾¾r” -> “vóór”.
That was caused by the infamous Western Latin character set confusion, in this case an ISO-8859–/Windows- versus CP850/CP858 encoding issue (so no Unicode involved at all, nor CP437, as it doesn’t have ¾).
So I put in a suggestion for ftfy to support finding the above.
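The “v¾¾r” versus “vóór” confusion can be reproduced with Python's built-in codecs. This is only a sketch: it assumes the bytes were written as ISO-8859-1 (Latin-1) and read as CP850, and the function names are made up for illustration.

```python
# Assumption: the original bytes are Latin-1 and were misread as CP850.
# Byte 0xF3 is "ó" in Latin-1 but "¾" in CP850.


def misread_as_cp850(text: str) -> str:
    """Encode as Latin-1, then (wrongly) decode the bytes as CP850."""
    return text.encode("latin-1").decode("cp850")


def repair_from_cp850(text: str) -> str:
    """Undo the misreading: re-encode as CP850, then decode as Latin-1."""
    return text.encode("cp850").decode("latin-1")


print(misread_as_cp850("vóór"))    # v¾¾r
print(repair_from_cp850("v¾¾r"))   # vóór
```

No Unicode encoding such as UTF-8 appears anywhere in this round trip, which matches the point that this is purely a single-byte code page mix-up.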
Posted in Development, Encoding, Software Development, Unicode, UTF-8, UTF8 | 4 Comments »
Posted by jpluimers on 2016/08/17
After yesterday’s post on Testing and static methods don’t go well together, I read around on Source (kunststube [WayBack]) a bit more and found these very nice articles on encoding, Unicode and text:
Related to those, some other nice reads:
Posted in Ansi, ASCII, CP437/OEM 437/PC-8, Development, EBCDIC, Encoding, ISO-8859, ISO8859, Shift JIS, Software Development, Unicode, UTF-16, UTF-8, UTF16, UTF8, Windows-1252 | Leave a Comment »
Posted by jpluimers on 2016/08/05
Unicode is about glyphs that are used in writing. Have you ever seen the emoji on the right being written like this?
This has been bothering me for a while and gets worse over time.
According to: Microsoft just changed its toy gun emoji to a real pistol:
Looks like Microsoft and Apple may not be on the same page about firearm emojis after all. Right after Apple changed its gun emoji to a water pistol in iOS 10, Microsoft replaced its toy pistol emoji with an actual revolver.
While Apple and Microsoft have gone back to edit their symbols, Google continues to use a pistol in Android keyboards and doesn’t appear to have plans to change this. None of the companies in question have adjusted their knife, sword, bomb, poison and coffin emojis, so… ¯\_(ツ)_/¯
When vendors start prescribing what emojis must look like (influenced by all sorts of emotions) without allowing the user to choose how they look (via a font; that’s what fonts are for!), it invalidates the whole Unicode principle:
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems.
These emoji aren’t text and should be gone from the Unicode standard before they can do more harm.
Will the next step be that vendors define their own colours for certain characters in fonts? For Windows, Times New Roman A becomes red, B green, C yellow, but in Courier New we’ll permute these colours, and all operating systems and versions will make different random colour choices.
Posted in Development, Encoding, Opinions, Software Development, Unicode | Leave a Comment »