Harbour's big Unicode bug 😲

Post by **kcarmody** » Sat Nov 07, 2020 7:13 pm

══════ The problem ══════
I recently discovered that none of Harbour's Unicode functions handle Unicode's supplementary plane (SP) characters (code points 0x10000 and above); they handle only Basic Multilingual Plane (BMP) characters (code points 0xFFFF and below). For example,
HB_UTF8CHR(0x0100) returns Ā = U+0100 LATIN CAPITAL LETTER A WITH MACRON, but HB_UTF8CHR(0x1F604) returns U+F604 (a private use character) instead of 😄 = U+1F604 SMILING FACE WITH OPEN MOUTH AND SMILING EYES.
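To make the symptom concrete, here is a minimal C sketch of a UTF-8 encoder that covers the full range up to U+10FFFF, next to a variant that keeps only the low 16 bits of the code point. The function names are hypothetical, and the 16-bit truncation is only my assumption about what happens inside Harbour; it is consistent with the observed result (0x1F604 becoming 0xF604), but I have not traced the Harbour source.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
   Writes up to 4 bytes to out and returns the byte count,
   or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode( uint32_t cp, unsigned char *out )
{
   if( cp < 0x80 )
   {
      out[0] = ( unsigned char ) cp;
      return 1;
   }
   if( cp < 0x800 )
   {
      out[0] = ( unsigned char ) ( 0xC0 | ( cp >> 6 ) );
      out[1] = ( unsigned char ) ( 0x80 | ( cp & 0x3F ) );
      return 2;
   }
   if( cp < 0x10000 )
   {
      if( cp >= 0xD800 && cp <= 0xDFFF )
         return 0;   /* surrogates are not characters */
      out[0] = ( unsigned char ) ( 0xE0 | ( cp >> 12 ) );
      out[1] = ( unsigned char ) ( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
      out[2] = ( unsigned char ) ( 0x80 | ( cp & 0x3F ) );
      return 3;
   }
   if( cp <= 0x10FFFF )
   {
      /* supplementary planes: four bytes */
      out[0] = ( unsigned char ) ( 0xF0 | ( cp >> 18 ) );
      out[1] = ( unsigned char ) ( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
      out[2] = ( unsigned char ) ( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
      out[3] = ( unsigned char ) ( 0x80 | ( cp & 0x3F ) );
      return 4;
   }
   return 0;
}

/* Hypothetical model of the symptom: the code point is cut
   down to 16 bits before it is encoded. */
size_t utf8_encode_low16( uint32_t cp, unsigned char *out )
{
   return utf8_encode( cp & 0xFFFF, out );
}
```

With this sketch, utf8_encode( 0x1F604, buf ) writes the four bytes F0 9F 98 84, while utf8_encode_low16( 0x1F604, buf ) writes the three bytes EF 98 84, which is U+F604.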

Harbour's "UTF-8" functions actually support a variation of UTF-8 called CESU-8. Very little other software supports CESU-8. Similarly, Harbour's "UTF-16" functions actually support an obsolete format called UCS-2.
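For readers who have not met CESU-8 before: it represents an SP character by first splitting it into a UTF-16 surrogate pair, then encoding each surrogate as its own three-byte sequence, giving six bytes where UTF-8 uses four. The following C sketch shows that arithmetic. The function names are mine, chosen for illustration, and are not part of Harbour or HMG.

```c
#include <stddef.h>
#include <stdint.h>

/* Split a supplementary code point (U+10000..U+10FFFF)
   into a UTF-16 surrogate pair. */
void to_surrogates( uint32_t cp, uint16_t *high, uint16_t *low )
{
   uint32_t v = cp - 0x10000;                      /* 20-bit value */

   *high = ( uint16_t ) ( 0xD800 + ( v >> 10 ) );  /* top 10 bits */
   *low  = ( uint16_t ) ( 0xDC00 + ( v & 0x3FF ) );/* low 10 bits */
}

/* Encode one 16-bit unit in the range 0x0800..0xFFFF
   as a three-byte sequence. */
static void encode3( uint16_t u, unsigned char *out )
{
   out[0] = ( unsigned char ) ( 0xE0 | ( u >> 12 ) );
   out[1] = ( unsigned char ) ( 0x80 | ( ( u >> 6 ) & 0x3F ) );
   out[2] = ( unsigned char ) ( 0x80 | ( u & 0x3F ) );
}

/* CESU-8 form of a supplementary code point: always 6 bytes. */
size_t cesu8_encode_sp( uint32_t cp, unsigned char *out )
{
   uint16_t high, low;

   to_surrogates( cp, &high, &low );
   encode3( high, out );
   encode3( low, out + 3 );
   return 6;
}
```

For U+1F604 the surrogate pair is D83D DE04, so the CESU-8 form is ED A0 BD ED B8 84, whereas the UTF-8 form is F0 9F 98 84. A UCS-2 consumer sees the two surrogates as two separate 16-bit units rather than as one character.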

Unicode planes - https://en.wikipedia.org/wiki/Plane_(Unicode)
Basic Multilingual Plane (BMP) - https://en.wikipedia.org/wiki/Plane_(Un ... gual_Plane
Supplementary Multilingual Plane (SMP) - https://en.wikipedia.org/wiki/Plane_(Un ... gual_Plane
UTF-8 - https://en.wikipedia.org/wiki/UTF-8
CESU-8 - https://en.wikipedia.org/wiki/CESU-8
UTF-16 - https://en.wikipedia.org/wiki/UTF-16
Surrogate pairs - https://en.wikipedia.org/wiki/UTF-16#Co ... o_U+10FFFF
UCS-2 - https://en.wikipedia.org/wiki/Universal ... racter_Set

══════ The effect of this bug on HMG ══════
HMG uses Harbour to translate between UTF-8, which HMG uses internally, and UTF-16, which Windows uses. This bug affects all string processing in HMG where the string is sent to or received from Windows, such as when we assign a value to a label or retrieve a value from a textbox. These translations actually convert between CESU-8 and UCS-2, not UTF-8 and UTF-16.

══════ Example SP characters ══════
Until a few years ago, SP characters were rarely used. This may explain why many programmers still think that Unicode is a 16-bit encoding, i.e. that it consists only of the BMP. But now, SP characters include many popular emojis and pictograms. Here are a few examples.

A few Unicode equivalents to the phpBB Smilies:

😄 = U+1F604 SMILING FACE WITH OPEN MOUTH AND SMILING EYES
😞 = U+1F61E DISAPPOINTED FACE
😲 = U+1F632 ASTONISHED FACE

A few Unicode pictograms:

🌄 = U+1F304 SUNRISE OVER MOUNTAINS
🎵 = U+1F3B5 MUSICAL NOTE
🏆 = U+1F3C6 TROPHY

══════ Reporting the bugs ══════
A month ago, I reported these bugs on the Harbour developers forum ( https://groups.google.com/g/harbour-devel/c/HWgaMNa7T-Y ). So far, no one has replied. From the lack of activity on that forum, it appears that Harbour is no longer in active development. So we should not anticipate that these bugs will ever be fixed - they are now features! We must find a way to work with them.

══════ Proposed new functions ══════
To work around the SP bugs in Harbour, I have modified several HMG functions, and added several more, in a proposed set of changes to HMG ( http://hmgforum.com/viewtopic.php?f=8&t=6654 ). This proposal includes revisions to the HMG manual that describe these functions and the underlying issues in the HMG UNICODE section. Currently, this proposal is a pending pull request on GitHub ( https://github.com/HMG-Official/HMG/pull/6 ).

One new function is HMG_UTF8CHR(), which can be used in place of the buggy Harbour function HB_UTF8CHR(). For example, HMG_UTF8CHR(0x1F60A) returns U+1F60A in UTF-8 form, whereas HB_UTF8CHR(0x1F60A) returns U+F60A.

Two other new functions are HMG_CESU8() and HMG_UNCESU8(), which convert between CESU-8 and UTF-8. For instance, if you have a UTF-8 variable that contains SP characters and you want to assign it to a label, you can do this with

Code:

Win1.Label1.VALUE := HMG_CESU8( cValue )

And if you think a textbox might contain SP characters and you want to retrieve it to a UTF-8 variable, you can do this with

Code:

cValue := HMG_UNCESU8( Win1.Textbox1.VALUE )

Both of these functions are shown in a new demo in SAMPLES\UNICODE\CESU8\Demo.prg.
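If you want to see what such a conversion involves, here is a minimal C sketch of a UTF-8 to CESU-8 converter and its inverse. This is not the code of HMG_CESU8() or HMG_UNCESU8(), which is in the pull request; the function names are hypothetical, and the sketch assumes the input is already well formed. Only four-byte UTF-8 sequences (and, in the other direction, six-byte surrogate sequences) are changed; everything else is copied through as is.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert well-formed UTF-8 to CESU-8.
   dst must have room for n + n / 2 bytes. Returns bytes written. */
size_t utf8_to_cesu8( const unsigned char *src, size_t n, unsigned char *dst )
{
   size_t i = 0, o = 0;

   while( i < n )
   {
      if( ( src[i] & 0xF8 ) == 0xF0 && i + 3 < n )
      {
         /* four-byte sequence: decode, then split into surrogates */
         uint32_t cp = ( ( uint32_t ) ( src[i] & 0x07 ) << 18 ) |
                       ( ( uint32_t ) ( src[i + 1] & 0x3F ) << 12 ) |
                       ( ( uint32_t ) ( src[i + 2] & 0x3F ) << 6 ) |
                       ( uint32_t ) ( src[i + 3] & 0x3F );
         uint32_t v  = cp - 0x10000;
         uint32_t hi = 0xD800 + ( v >> 10 );
         uint32_t lo = 0xDC00 + ( v & 0x3FF );

         /* the lead byte of an encoded surrogate is always 0xED */
         dst[o++] = 0xED;
         dst[o++] = ( unsigned char ) ( 0x80 | ( ( hi >> 6 ) & 0x3F ) );
         dst[o++] = ( unsigned char ) ( 0x80 | ( hi & 0x3F ) );
         dst[o++] = 0xED;
         dst[o++] = ( unsigned char ) ( 0x80 | ( ( lo >> 6 ) & 0x3F ) );
         dst[o++] = ( unsigned char ) ( 0x80 | ( lo & 0x3F ) );
         i += 4;
      }
      else
         dst[o++] = src[i++];   /* BMP bytes are identical in both forms */
   }
   return o;
}

/* Convert CESU-8 to UTF-8. dst needs room for n bytes. Returns bytes written. */
size_t cesu8_to_utf8( const unsigned char *src, size_t n, unsigned char *dst )
{
   size_t i = 0, o = 0;

   while( i < n )
   {
      /* high surrogate is ED A0..AF xx, low surrogate is ED B0..BF xx */
      if( i + 5 < n && src[i] == 0xED && ( src[i + 1] & 0xF0 ) == 0xA0 &&
          src[i + 3] == 0xED && ( src[i + 4] & 0xF0 ) == 0xB0 )
      {
         uint32_t hi = 0xD000 | ( ( uint32_t ) ( src[i + 1] & 0x3F ) << 6 ) |
                       ( uint32_t ) ( src[i + 2] & 0x3F );
         uint32_t lo = 0xD000 | ( ( uint32_t ) ( src[i + 4] & 0x3F ) << 6 ) |
                       ( uint32_t ) ( src[i + 5] & 0x3F );
         uint32_t cp = 0x10000 + ( ( hi - 0xD800 ) << 10 ) + ( lo - 0xDC00 );

         dst[o++] = ( unsigned char ) ( 0xF0 | ( cp >> 18 ) );
         dst[o++] = ( unsigned char ) ( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
         dst[o++] = ( unsigned char ) ( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
         dst[o++] = ( unsigned char ) ( 0x80 | ( cp & 0x3F ) );
         i += 6;
      }
      else
         dst[o++] = src[i++];
   }
   return o;
}
```

Because BMP characters are byte-for-byte the same in UTF-8 and CESU-8, a string with no SP characters passes through both functions unchanged, so it is harmless to apply the conversion to every value.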

══════ Unsuccessful attempt to fix HMG to avoid using Harbour ══════
Using the code below, I attempted to fix HMG so that, when converting strings between HMG and Windows, it avoids Harbour and uses Windows functions. This would eliminate the need to use HMG_CESU8() and HMG_UNCESU8() when setting and retrieving values to and from Windows.

This code gave executables that worked fine on Windows 7, but on Windows 10, it gave incompatible executables. When you double click on such an executable in Windows 10, nothing happens - no error message and no other action. You have to right click on the executable and choose an option that runs it in a compatibility mode.

Clearly, this is too high a price to pay for the ability to transparently handle SP characters, so I abandoned this approach. But here is the code in case you want to experiment with it.

In SOURCE\c_UNICODE.c, add two functions:

Code:

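/* Convert a NUL-terminated UTF-8 string to UTF-16 with the Windows API.
   The caller must release the result with hb_xfree(). */
WCHAR * HMG_UTF8toWinStrU16( const CHAR * srcA )
{
   INT length;
   WCHAR *dstW;

   /* First call: required size in WCHARs, including the terminator */
   length = MultiByteToWideChar( CP_UTF8, 0, srcA, -1, NULL, 0 );
   dstW = ( WCHAR * ) hb_xgrab( length * sizeof( WCHAR ) );
   /* Second call: perform the conversion */
   MultiByteToWideChar( CP_UTF8, 0, srcA, -1, dstW, length );

   return dstW;
}

/* Convert a NUL-terminated UTF-16 string to UTF-8 with the Windows API.
   The caller must release the result with hb_xfree(). */
CHAR * HMG_WinStrU16toUTF8( const WCHAR * srcW )
{
   INT length;
   CHAR *dstA;

   /* First call: required size in bytes, including the terminator */
   length = WideCharToMultiByte( CP_UTF8, 0, srcW, -1, NULL, 0, NULL, NULL );
   dstA = ( CHAR * ) hb_xgrab( length );
   /* Second call: perform the conversion */
   WideCharToMultiByte( CP_UTF8, 0, srcW, -1, dstA, length, NULL, NULL );

   return dstA;
}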

In INCLUDE\HMG_UNICODE.h, make the following changes:

Code:

// #define HMG_CHAR_TO_WCHAR(c)     ((c != NULL) ? hb_osStrU16Encode(c) : NULL)  // return WCHAR
#define HMG_CHAR_TO_WCHAR(c)     ((c != NULL) ? HMG_UTF8toWinStrU16(c) : NULL)  // return WCHAR

// #define HMG_WCHAR_TO_CHAR(c)      hb_osStrU16Decode(c)                       // return CHAR
#define HMG_WCHAR_TO_CHAR(c)      HMG_WinStrU16toUTF8(c)                       // return CHAR

#include <windef.h>
WCHAR * HMG_UTF8toWinStrU16( const CHAR * srcA );
CHAR * HMG_WinStrU16toUTF8( const WCHAR * srcW );

Post by **serge_girard** » Sun Nov 08, 2020 12:56 pm

Thanks Kevin !

Serge
