Question about RichEditBox handling of Unicode text files

Post by kcarmody » Fri Nov 07, 2014 11:14 am

RichEditBox has two methods that handle files, RtfLoadFile and RtfSaveFile. These methods call RichEditBox_StreamIn and RichEditBox_StreamOut in c_richeditbox.c.

These two functions handle RTF and ANSI text files OK, but have some problems with Unicode text files. They seem to ignore the byte order marks (BOM) that are usually necessary for software to recognize text files as Unicode text files.

RichEditBox_StreamIn removes the BOM from a UTF-8 text file (nDataFormat = 1), but it does not remove the BOM from a UTF-16 text file (nDataFormat = 3). This behavior is actually implicit in the Windows code that the function calls.
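To make the BOM check concrete, here is a minimal sketch in C of how a loader could look at the first bytes of a file buffer and work out how many bytes to skip. DetectBom and BomKind are hypothetical names for illustration; they are not part of c_richeditbox.c, and the sketch deliberately ignores UTF-32.

```c
#include <assert.h>
#include <stddef.h>

/* Encodings that a byte order mark can identify (UTF-32 is ignored here). */
typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE } BomKind;

/* Hypothetical helper: inspects the start of a buffer and reports which
   BOM, if any, is present. *bomLen receives the number of bytes to skip. */
static BomKind DetectBom(const unsigned char *buf, size_t len, size_t *bomLen)
{
    assert(buf != NULL && bomLen != NULL);
    *bomLen = 0;

    /* UTF-8 BOM: EF BB BF */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bomLen = 3;
        return BOM_UTF8;
    }
    /* UTF-16 little-endian BOM: FF FE */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bomLen = 2;
        return BOM_UTF16LE;
    }
    /* UTF-16 big-endian BOM: FE FF */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bomLen = 2;
        return BOM_UTF16BE;
    }
    return BOM_NONE;
}
```

A caller would pass buf + bomLen and len - bomLen on to whatever feeds the control, so the BOM never reaches the text.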

RichEditBox_StreamOut does not add a BOM to either UTF-8 (nDataFormat = 1 or 2) or UTF-16 (nDataFormat = 3) output. Some software can recognize unmarked UTF-8, but I have never seen software that recognizes unmarked UTF-16.
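On the writing side, the fix amounts to emitting the mark before the first byte of text. The following is only a sketch: WriteBom and TextEncoding are hypothetical names, and it assumes the UTF-16 output is little-endian, which is what I would expect on Windows.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical encoding selector; it is not tied to the nDataFormat values. */
typedef enum { TEXT_UTF8, TEXT_UTF16LE } TextEncoding;

/* Writes the BOM for enc at the current position of fp.
   fp should be opened in binary mode ("wb") so the bytes are not translated.
   Returns 0 on success, -1 if the write was short. */
static int WriteBom(FILE *fp, TextEncoding enc)
{
    static const unsigned char utf8Bom[]    = { 0xEF, 0xBB, 0xBF };
    static const unsigned char utf16leBom[] = { 0xFF, 0xFE };

    const unsigned char *bom = (enc == TEXT_UTF8) ? utf8Bom : utf16leBom;
    size_t len = (enc == TEXT_UTF8) ? sizeof utf8Bom : sizeof utf16leBom;

    assert(fp != NULL);
    return fwrite(bom, 1, len, fp) == len ? 0 : -1;
}
```

The idea is to call this once, right after opening the output file and before any streamed text is written to it.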

I think that Windows acts this way because the EM_STREAMIN and EM_STREAMOUT messages are designed for "data streams", which may be internal buffers as well as file contents. Windows seems to assume that the developer will take care of BOMs if the data stream is going to or from a file.
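If the developer is expected to handle the BOM, one place to do it is the read callback itself. The sketch below uses portable types and is only loosely modeled on the shape of a stream-in callback (a cookie, a buffer, a requested byte count, and a returned byte count, with 0 meaning success). ReadSkippingBom and StreamCookie are hypothetical names, and the code only handles the UTF-16 little-endian mark.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-stream state handed to the callback. */
typedef struct {
    FILE *fp;       /* file opened in binary mode ("rb") */
    int   started;  /* 0 until the first call has checked for a BOM */
} StreamCookie;

/* Fills buf with up to cb bytes from the file, storing the count in *pcb.
   On the first call it silently drops a leading FF FE (UTF-16LE BOM).
   Returns 0 on success, nonzero on a read or seek error. */
static unsigned long ReadSkippingBom(void *cookie, unsigned char *buf,
                                     long cb, long *pcb)
{
    StreamCookie *sc = (StreamCookie *)cookie;
    assert(sc != NULL && sc->fp != NULL && buf != NULL && pcb != NULL && cb >= 0);

    if (!sc->started) {
        unsigned char head[2];
        size_t got = fread(head, 1, sizeof head, sc->fp);
        sc->started = 1;
        if (!(got == 2 && head[0] == 0xFF && head[1] == 0xFE)) {
            /* No BOM: step back over whatever was read so no text is lost. */
            if (fseek(sc->fp, -(long)got, SEEK_CUR) != 0) {
                *pcb = 0;
                return 1UL;
            }
        }
    }

    *pcb = (long)fread(buf, 1, (size_t)cb, sc->fp);
    return ferror(sc->fp) ? 1UL : 0UL;
}
```

A real callback for the control would need the Windows types and calling convention; the point here is only that the skip can happen on the first call, before any data is handed over.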

All software that handles Unicode text files recognizes marked text files, so there is never any harm in putting a BOM in, while plenty of harm can come from leaving it out.

Both of these functions include a case (nDataFormat = 5) for UTF-8 RTF, but this is useless, because RTF encodes non-ASCII Unicode characters as plain-text RTF control words (such as \uN). So you never see a UTF-8 RTF file, and if you did, nothing would open it.
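To illustrate why an RTF file stays plain ASCII, here is a small sketch in C of how one UTF-16 code unit can be written as a \uN control word. N is a signed 16-bit decimal, so values above 32767 come out negative, and the trailing '?' is the fallback character for readers that do not understand \u. RtfAppendUnit is a hypothetical helper name, not something from HMG.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Appends the RTF escape for one UTF-16 code unit to the NUL-terminated
   string in dst, whose total capacity is cap bytes.
   Returns 0 on success, -1 if the escape does not fit (dst may then hold
   a truncated escape and should be discarded). */
static int RtfAppendUnit(char *dst, size_t cap, unsigned int unit)
{
    assert(dst != NULL && unit <= 0xFFFFu);

    size_t used = strlen(dst);
    if (used >= cap)
        return -1;

    /* RTF wants a signed 16-bit value: 0x8000..0xFFFF become negative. */
    int n = (unit > 32767u) ? (int)unit - 65536 : (int)unit;

    int w = snprintf(dst + used, cap - used, "\\u%d?", n);
    return (w < 0 || (size_t)w >= cap - used) ? -1 : 0;
}
```

For example, U+00E9 becomes \u233? and the surrogate 0xD83D becomes \u-10179?. A character outside the Basic Multilingual Plane is written as its two surrogates, one escape each.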

I came across the BOM problem when I was enhancing the Rich Edit Demo, viewtopic.php?f=9&t=4030. It was important to me to be able to read and write text files, so I added some workarounds to the demo to fix the behavior of RichEditBox_StreamIn/Out. This was a quick fix using Memoread and Memowrite, but it would be better to use fread and fwrite, either in Harbour or in C.

I could add such fixes into h_controlmisc.prg (definition of RtfLoadFile and RtfSaveFile methods) or into c_richeditbox.c (definition of RichEditBox_StreamIn/Out), but that would change the behavior of these methods and functions.

The question is, should these methods and functions be changed so that they handle BOMs? It might break existing code if we do. But I suspect that no one is using this code now, as it does not handle BOMs properly.

Kevin

Post by bpd2000 » Fri Nov 07, 2014 12:14 pm

Thank you, Mr. Kevin, for more info on Unicode text files
BPD
Convert Dream into Reality through HMG

Post by esgici » Fri Nov 07, 2014 12:20 pm

bpd2000 wrote: Thank you, Mr. Kevin, for more info on Unicode text files
+1
Viva INTERNATIONAL HMG :D

Post by Javier Tovar » Fri Nov 07, 2014 4:48 pm

bpd2000 wrote: Thank you, Mr. Kevin, for more info on Unicode text files
+1

I think the coffee is good over there! :)

Regards
