Archive for the ‘Encoding’ Category
Posted by jpluimers on 2017/05/08
From ext3 – How to tell the language encoding of a filename on Linux? – Server Fault [WayBack] I learned a few things:
- filename encoding on Linux is undetermined – the file system treats a filename as an opaque array of bytes and does not record which character encoding produced it
- FTP and SFTP suffer from this as well (SFTP is based on SSH which now prefers UTF-8 [WayBack])
A good default is UTF-8, but it’s never guaranteed.
Two tools can help to determine the encoding of a filename:
- convmv [WayBack] converts filenames from one encoding to another
- chardet (Python) The Universal Character Encoding Detector
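As a minimal sketch of the "byte array" point, Python can list a directory as raw bytes and check whether each name happens to be valid UTF-8. The helper names `describe_filename` and `list_filenames` are made up for this illustration, and the Latin-1 fallback is only a guess, not a detection method like chardet.

```python
import os


def describe_filename(raw: bytes) -> str:
    """Report whether the raw bytes of a filename are valid UTF-8."""
    try:
        return f"{raw!r}: valid UTF-8 -> {raw.decode('utf-8')}"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails,
        # but it is only a guess at what the creator of the name meant.
        return f"{raw!r}: not UTF-8, Latin-1 guess -> {raw.decode('latin-1')}"


def list_filenames(directory: bytes = b".") -> list:
    # Passing a bytes path makes os.listdir return undecoded bytes names.
    return [describe_filename(name) for name in os.listdir(directory)]


if __name__ == "__main__":
    for line in list_filenames():
        print(line)
```

Valid UTF-8 is a strong hint but no proof: plain ASCII names, for instance, are valid in almost every encoding.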
–jeroen
Posted in *nix, *nix-tools, Development, Encoding, Power User, Software Development, UTF-8, UTF8 | Leave a Comment »
Posted by jpluimers on 2017/04/20
You think you know Unicode? Think again, then read [Wayback] Dark corners of Unicode / fuzzy notepad.
On basics, sorting, comparison, decomposition, composition, width, whitespace, encoding, emoji, interesting code planes and dark corners. Lots of dark corners.
The examples are in Python, but hold for almost any programming language.
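To give a taste of the composition and comparison topics, here is a small sketch of my own (not taken from the linked article) using Python's standard `unicodedata` module. The helper name `same_text` is hypothetical.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(composed == decomposed)           # False: different code points
    print(same_text(composed, decomposed))  # True: same text after NFC
```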
–jeroen
via: Kristian Köhntopp
Posted in Conference Topics, Conferences, Development, Encoding, Event, Software Development, Unicode | Leave a Comment »
Posted by jpluimers on 2017/02/15
Posted in .NET, C#, Delphi, Delphi 10 Seattle, Delphi 10.1 Berlin (BigBen), Delphi 2010, Delphi XE, Delphi XE2, Delphi XE3, Delphi XE4, Delphi XE5, Delphi XE6, Delphi XE7, Delphi XE8, Development, Encoding, FreePascal, Pascal, Software Development | Leave a Comment »
Posted by jpluimers on 2016/11/22
A while ago, I needed to get the various date, time and week values from WMIC to environment variables with pre-padded zeros. I thought: easy job, just write a batch file.
Tough luck: I couldn’t get the values to expand properly. In the end, the cause was WMIC emitting UTF-16 while the command interpreter does not expect double-byte character sets, which messed up my original batch file.
| What I wanted | What I got |
| wmic_Day=21 | Day=21 |
| wmic_DayOfWeek=04 | wmic_DayOfWeek=4 |
| wmic_Hour=15 | wmic_Hour=15 |
| wmic_Milliseconds=00 | wmic_Milliseconds= |
| wmic_Minute=02 | wmic_Minute=4 |
| wmic_Month=05 | wmic_Month=5 |
| wmic_Quarter=02 | wmic_Quarter=2 |
| wmic_Second=22 | wmic_Second=22 |
| wmic_WeekInMonth=04 | wmic_WeekInMonth=4 |
| wmic_Year=2015 | wmic_Year=2015 |
WMIC uses this encoding because the Wide versions of Windows API calls use UTF-16 (sometimes called UCS-2 as that is where UTF-16 evolved from).
As Windows uses little-endian encoding by default, the low byte of each UTF-16 code unit comes first, followed by the high byte (which is zero for ASCII characters). Those embedded zero bytes mess up the command interpreter.
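The byte layout is easy to inspect in Python. This sketch only illustrates UTF-16 little-endian encoding of an ASCII string; the sample text `Day=21` is borrowed from the table, and nothing here reproduces actual WMIC output.

```python
text = "Day=21"

# Encode as UTF-16 little-endian without a byte order mark.
utf16_le = text.encode("utf-16-le")

# Every ASCII character becomes two bytes: the character, then a zero byte.
low_bytes = utf16_le[0::2]
high_bytes = utf16_le[1::2]

if __name__ == "__main__":
    print(utf16_le)    # b'D\x00a\x00y\x00=\x002\x001\x00'
    print(low_bytes)   # b'Day=21'
    print(high_bytes)  # b'\x00\x00\x00\x00\x00\x00'
```

A tool that reads this as a single-byte encoding sees a zero byte after every character.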
Luckily rojo was of great help solving this.
His solution is centered around set /A, which:
- handles integer numbers and calls them “numeric” (hinting floating point, but those are truncated to integer; one of the tricks rojo uses)
- uses these prefixes (be careful with this, as 08 and 09 are not valid octal numbers):
- 0 for octal
- 0x for hexadecimal
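To make the prefix pitfall concrete, here is a Python approximation of those rules. It is not cmd.exe and it is not rojo's solution: `parse_like_set_a`, `strip_leading_zeros` and `pad` are hypothetical helpers that only mimic the prefix handling described above for non-negative literals.

```python
def parse_like_set_a(text: str) -> int:
    """Interpret a numeric literal using the prefix rules listed above."""
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)
    if len(lowered) > 1 and lowered.startswith("0"):
        # A leading zero means octal, so "08" and "09" raise ValueError.
        return int(lowered[1:], 8)
    return int(lowered, 10)


def strip_leading_zeros(text: str) -> int:
    """Force a decimal reading of a zero-padded value such as "08"."""
    return int(text.lstrip("0") or "0", 10)


def pad(value: int, width: int = 2) -> str:
    """Pre-pad a number with zeros, e.g. 5 -> "05"."""
    return str(value).zfill(width)


if __name__ == "__main__":
    print(parse_like_set_a("021"))    # 17 (octal)
    print(parse_like_set_a("0x11"))   # 17 (hexadecimal)
    print(strip_leading_zeros("08"))  # 8
    print(pad(5))                     # 05
```

The design point is that a zero-padded month or day such as 08 must be stripped to decimal before doing arithmetic, and padded again afterwards.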
Enjoy and shiver with the online help extract:
Read the rest of this entry »
Posted in Algorithms, Batch-Files, Development, Encoding, Floating point handling, Scripting, Software Development, UCS-2, UTF-16, UTF16 | Leave a Comment »
Posted by jpluimers on 2016/10/04
A while ago (in fact more than a year), I posted Encoding is hard… to G+ with the below picture.
[Wayback] ftfy (“fixes text for you”, a parody on “fixed that for you”) [Wayback] fixes it, but:
How did the single quote become “â€™”?
Actually, because of a common “beautification” in many Office suites (Microsoft and Open alike), the single quote was a special one: a Unicode Character ‘RIGHT SINGLE QUOTATION MARK’ (U+2019), which in UTF-8 is encoded as 0xE2 0x80 0x99. Read those three bytes as Windows-1252 and each one turns into a separate character.
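The round trip can be traced in a few lines of Python. This is a minimal sketch that assumes the wrong decoder was Windows-1252 (`cp1252` in Python), in line with the categories this post is filed under.

```python
quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# Correct encoding step: U+2019 becomes three bytes in UTF-8.
utf8_bytes = quote.encode("utf-8")

# Wrong decoding step: each byte is read as one Windows-1252 character.
mojibake = utf8_bytes.decode("cp1252")

# Reversing the wrong step recovers the original character.
repaired = mojibake.encode("cp1252").decode("utf-8")

if __name__ == "__main__":
    print(utf8_bytes)         # b'\xe2\x80\x99'
    print(len(mojibake))      # 3
    print(repaired == quote)  # True
```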
Read the rest of this entry »
Posted in Development, Encoding, ftfy, ISO-8859, ISO8859, Mojibake, Software Development, Unicode, UTF-8, UTF8, Windows-1252 | Leave a Comment »
Posted by jpluimers on 2016/09/06
Simple if you know it:
pip install ftfy
That installs it as a command, which is a lot easier than using it from GitHub at [Wayback] https://github.com/LuminosoInsight/python-ftfy
It knows how to solve the encoding issues in [Archive.is] the future of publishing at W3C explaining about WTF-8 and Unicode history.
It didn’t solve my non-Unicode encoding issue: [Wayback] “v3/43/4r” -> “v¾¾r” -> “vóór”.
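One plausible reading of that chain, sketched in Python: the byte for “ó” in Windows-1252 (0xF3) displays as “¾” in code page 850, and “¾” was then spelled out as “3/4”. The choice of cp1252 and cp850 is my assumption; the excerpt above does not name the encodings involved.

```python
original = "v\u00f3\u00f3r"  # "vóór"

# Assumed wrong step: Windows-1252 bytes shown with code page 850.
garbled = original.encode("cp1252").decode("cp850")

# Assumed second step: the fraction character spelled out in ASCII.
ascii_form = garbled.replace("\u00be", "3/4")

# Undo both steps in reverse order.
repaired = ascii_form.replace("3/4", "\u00be").encode("cp850").decode("cp1252")

if __name__ == "__main__":
    print(ascii_form)            # v3/43/4r
    print(repaired == original)  # True
```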
Read the rest of this entry »
Posted in Development, Encoding, ftfy, Mojibake, Software Development, Unicode, UTF-8, UTF8 | 4 Comments »
Posted by jpluimers on 2016/08/17
After yesterday’s post on Testing and static methods don’t go well together, I read around on Source (kunststube [WayBack]) a bit more and found these very nice articles on encoding, Unicode and text:
Related to those, some other nice reads:
–jeroen
Posted in Ansi, ASCII, CP437/OEM 437/PC-8, Development, EBCDIC, Encoding, ISO-8859, ISO8859, Shift JIS, Software Development, Unicode, UTF-16, UTF-8, UTF16, UTF8, Windows-1252 | Leave a Comment »
Posted by jpluimers on 2016/08/05
Unicode is about Glyphs that are used in writing. Have you ever seen the emoji on the right being written like this?
This has been bothering me a while and gets worse over time.
According to: Microsoft just changed its toy gun emoji to a real pistol:
Looks like Microsoft and Apple may not be on the same page about firearm emojis afterall. Right after Apple changed its gun emoji to a water pistol in iOS 10, Microsoft replaced its toy pistol emoji with an actual revolver.
…
While Apple and Microsoft have gone back to edit their symbols, Google continues to use a pistol in Android keyboards and doesn’t appear to have plans to change this. None of the companies in question have adjusted their knife, sword, bomb, poison and coffin emojis, so… ¯\_(ツ)_/¯
When vendors start prescribing how emojis must look (influenced by all sorts of emotions) without allowing the user to choose how they look (via a font – that’s what fonts are for!), it invalidates the whole Unicode principle:
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems.
These emoji aren’t text and should be gone from the Unicode standard before they can do more harm.
Will the next step be that vendors define their own colours for certain characters in fonts? For Windows Times New Roman A becomes red, B green, C yellow, but in Courier New we’ll permute these colours, and all operating systems and versions will make different random colour choices.
–jeroen
via:
Posted in Development, Encoding, Opinions, Software Development, Unicode | Leave a Comment »
Posted by jpluimers on 2015/10/13
So I won’t forget:
Even though this does not work on most USA T-Shirt sites, it works on this Dutch one: T-Shirt Ontwerpen – t-shirt zelf ontwerpen | Spreadshirt.
–jeroen
PS:
Read the rest of this entry »
Posted in ASCII, Development, Encoding, Software Development, Unicode | Leave a Comment »