The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


Delphi, decoding files to strings and finding line endings: some links, some history on Windows NT and UTF/UCS encodings

Posted by jpluimers on 2019/12/31

A while back, David Heffernan started a few G+ threads on decoding big files into strings split at line endings.

Code comparison:

Python:

with open(filename, 'r', encoding='utf-16-le') as f:
  for line in f:
    pass

Delphi:

for Line in TLineReader.FromFile(filename, TEncoding.Unicode) do
  ;

This spurred some nice observations and unfounded statements on which encodings should be used, so I posted a bit of history that is included below.

Some tips and observations from the links:

  • Good old text files are not “good” with Unicode support, neither are TextFile Device Drivers; nobody has written a driver supporting a wide range of encodings as of yet.
  • Good old text files are slow as well, even with a larger buffer set through SetTextBuf
  • When using the TStreamReader, the decoding takes much more time than the actual reading, which means that [WayBack] Faster FileStream with TBufferedFileStream • DelphiABall does not help much
  • TStringList.LoadFromFile, though fast, is a memory allocation hog and has limits on string size
  • Delphi RTL code is not what it used to be: the RTL code from before Delphi went Unicode is of far better quality than the RTL code in Delphi 2009 and up
  • Supporting various encodings is important
  • EBCDIC days: three kinds of spaces, two kinds of hyphens, multiple codepages
  • Strings are just that: strings. It’s about the encoding from/to the file that needs to be optimal.
  • When processing large files, caching only makes sense when the file fits in memory. Otherwise caching just adds overhead.
  • On Windows, if you read a big text file from start to end, open the file in “sequential read” mode by passing the FILE_FLAG_SEQUENTIAL_SCAN flag. This does not switch caching off: it lets the cache manager read ahead more aggressively and reuse cache pages that lie behind the current file position, as explained at [WayBack] How do FILE_FLAG_SEQUENTIAL_SCAN and FILE_FLAG_RANDOM_ACCESS affect how the operating system treats my file? – The Old New Thing
  • Python string reading depends on the way you read files (ASCII or Unicode); see [WayBack] unicode – Python codecs line ending – Stack Overflow
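
To make the "decoding versus reading" and line-ending points above a bit more concrete, here is a minimal Python sketch that reads a file in binary chunks, decodes them incrementally and splits on line endings itself. The name iter_lines and the chunk size are my own assumptions for illustration; this is neither what TLineReader does nor what Python's own text mode does internally, and it only handles LF and CRLF line endings.

```python
import codecs


def iter_lines(path, encoding="utf-16-le", chunk_size=64 * 1024):
    """Yield lines (without their line endings) from a file, decoding chunk by chunk.

    Hypothetical helper for illustration; only LF and CRLF are treated as line endings.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""  # decoded text that has not been terminated by a line feed yet
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # The incremental decoder holds back a trailing partial code unit
            # (or half a surrogate pair) until the next chunk arrives.
            pending += decoder.decode(chunk)
            pieces = pending.split("\n")
            pending = pieces.pop()  # the last piece may be an incomplete line
            for piece in pieces:
                yield piece[:-1] if piece.endswith("\r") else piece
    # Flush the decoder; this raises UnicodeDecodeError on a truncated file.
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending[:-1] if pending.endswith("\r") else pending
```

Usage would be along the lines of `for line in iter_lines(filename): ...`. Timing the decoder.decode calls separately from the f.read calls is one way to check on your own data whether decoding or reading dominates.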

Though TLineReader is not part of the RTL, I think it is from [WayBack] For-in Enumeration – ADUG.

Encodings in use

It doesn’t help that on the Windows Console, various encodings are used:

Good reading here is [WayBack] c++ – What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types? – Stack Overflow

Encoding history

+A. Bouchez I’m with +David Heffernan here:

At its release in 1993, Windows NT was very early in supporting Unicode. Development of Windows NT started in 1990, when the choice fell on UCS-2 with 2 bytes per character; the standard of that time only had a non-required annex on UTF-1.

UTF-8, which later replaced UTF-1, did not even exist at that time. Even UCS-2 was still young: it was designed in 1989. UTF-8 was outlined in late 1992 and became a standard in 1993.

Some references:

–jeroen

Delphi has three different implementations of the function ContainsPreamble in these files:
…\source\internet\Web.HTTPProd.pas
…\source\vcl\Vcl.ComCtrls.pas
…\source\rtl\sys\System.SysUtils.pas


function ContainsPreamble(const Buffer, Signature: array of Byte): Boolean;
var
  I: Integer;
begin
  Result := True;
  if Length(Buffer) >= Length(Signature) then
  begin
    for I := 1 to Length(Signature) do
      if Buffer[I - 1] <> Signature[I - 1] then
      begin
        Result := False;
        Break;
      end;
  end
  else
    Result := False;
end;

function ContainsPreamble(Stream: TStream; Signature: TBytes): Boolean;
var
  Buffer: TBytes;
  I, LBufLen, LSignatureLen, LPosition: Integer;
begin
  Result := True;
  LSignatureLen := Length(Signature);
  LPosition := Stream.Position;
  try
    SetLength(Buffer, LSignatureLen);
{$IFDEF CLR}
    LBufLen := Stream.Read(Buffer, LSignatureLen);
{$ELSE}
    LBufLen := Stream.Read(Buffer[0], LSignatureLen);
{$ENDIF}
  finally
    Stream.Position := LPosition;
  end;
  if LBufLen = LSignatureLen then
  begin
    for I := 1 to LSignatureLen do
      if Buffer[I - 1] <> Signature[I - 1] then
      begin
        Result := False;
        Break;
      end;
  end
  else
    Result := False;
end;

function ContainsPreamble(AStream: TStream; Encoding: TEncoding; var ASignatureSize: Integer): Boolean;
var
  I: Integer;
  Signature: TBytes;
  Bytes: TBytes;
begin
  Result := False;
  Signature := Encoding.GetPreamble;
  if Signature <> nil then
  begin
    if AStream.Size >= Length(Signature) then
    begin
      SetLength(Bytes, Length(Signature));
      AStream.Read(Bytes[0], Length(Bytes));
      Result := True;
      ASignatureSize := Length(Signature);
      for I := 0 to Length(Signature)-1 do
        if Bytes[I] <> Signature[I] then
        begin
          ASignatureSize := 0;
          Result := False;
          Break;
        end;
    end;
  end
end;
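
For comparison with the Python side of the original thread, the same preamble (BOM) check can be sketched in a few lines of Python. This is only an illustration under my own assumptions: contains_preamble, detect_preamble and the PREAMBLES table are hypothetical names, not part of any library. Only the codecs.BOM_* constants come from the standard library.

```python
import codecs

# Preambles (byte order marks) to look for. The UTF-32 entries must come before
# the UTF-16 ones, because the UTF-32-LE BOM starts with the UTF-16-LE BOM.
PREAMBLES = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def contains_preamble(buffer: bytes, signature: bytes) -> bool:
    """Return True when buffer starts with signature (hypothetical helper)."""
    return buffer.startswith(signature)


def detect_preamble(stream):
    """Peek at a seekable binary stream and return (encoding, preamble_size).

    Returns (None, 0) when no known preamble is found. The stream position
    is restored afterwards, like the TStream/TBytes overload above does.
    """
    position = stream.tell()
    try:
        head = stream.read(4)  # the longest preamble in PREAMBLES is 4 bytes
    finally:
        stream.seek(position)
    for signature, encoding in PREAMBLES:
        if contains_preamble(head, signature):
            return encoding, len(signature)
    return None, 0
```

Note the ordering in the table: a check that tries the UTF-16-LE signature first would misreport a UTF-32-LE file, which is something a signature-by-signature check like ContainsPreamble leaves to the caller.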
