UTF-8 constants

HMG Unicode versions 3.1.x related

kcarmody
Posts: 152
Joined: Tue Oct 07, 2014 11:13 am

UTF-8 constants

Post by kcarmody »

For some time, I was looking for a way to represent a Unicode character in UTF-8 format as a constant in terms of its code point. What I mean is something like the way, in C and some other languages, you can represent, for example, U+0905 (अ, Devanagari letter A) as "\u0905". In a C wide string literal on Windows, this gives a wide char type which is stored internally as UTF-16 LE. In Harbour, the UTF-16 LE form of this character can be expressed as e"\x05\x09". But in UTF-8, this becomes e"\xE0\xA4\x85", and the connection to 0x0905 is almost impossible to see.
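
To make the two byte forms concrete, here is a small sketch that prints each one as hex and spells out where the UTF-8 bytes come from. It assumes a plain console Harbour program and the standard functions hb_StrToHex() and hb_UTF8Chr(); the variable names are only illustrative.

```harbour
PROCEDURE Main()

   LOCAL cUtf16 := e"\x05\x09"        // U+0905 as UTF-16 LE: low byte first
   LOCAL cUtf8  := e"\xE0\xA4\x85"    // U+0905 as UTF-8: three bytes

   // UTF-8 three-byte layout: 1110xxxx 10xxxxxx 10xxxxxx
   // 0x0905 = 0000 1001 0000 0101 splits into 0000 | 100100 | 000101
   //   0xE0 + 0x00 = 0xE0,  0x80 + 0x24 = 0xA4,  0x80 + 0x05 = 0x85

   ? hb_StrToHex( cUtf16, " " )       // hex dump of the UTF-16 LE bytes
   ? hb_StrToHex( cUtf8, " " )        // hex dump of the UTF-8 bytes
   ? cUtf8 == hb_UTF8Chr( 0x0905 )    // literal compared with the runtime conversion

   RETURN
```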

Harbour has at least two ways to represent an ANSI character constant in terms of its code point, e.g. CHR(0xA0) and e"\xA0" for no-break space. So I thought it should also have a way to do this for UTF-8. But after looking through the Harbour changelog and other documentation, I did not find any solution.

Of course, it is possible to simply put the character in quotes. This works well for some characters. But for others, like no-break space, this does not work well at all, since this character looks just like a regular space (U+0020). For Asian characters, another problem is that they can be difficult to read at point sizes usually used for Western characters.

Harbour does have a function HB_UTF8CHR() that converts a numeric code point to its UTF-8 representation. But this is executed only at runtime. So HB_UTF8CHR() of a constant integer is not considered a constant string from Harbour's point of view. It cannot be used in places where a constant is required, such as in an initialization expression for a STATIC variable, or in a CASE statement of a SWITCH block. It is also inefficient to do a conversion at runtime that could instead be done at compile time.
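
A minimal sketch of this restriction, assuming a plain console Harbour program. The first STATIC is accepted because CHR() of a constant counts as a constant, while the commented-out line is the kind of initializer the compiler rejects because HB_UTF8CHR() is an ordinary runtime call. The variable names are made up for illustration.

```harbour
STATIC s_cNbsp := Chr( 0xA0 )              // accepted: treated as a constant
// STATIC s_cDevA := hb_UTF8Chr( 0x0905 )  // rejected: not a constant initializer

PROCEDURE Main()

   // Fine here: a LOCAL initializer is evaluated at run time
   LOCAL cDevA := hb_UTF8Chr( 0x0905 )

   // Show the raw bytes of both strings
   ? hb_StrToHex( s_cNbsp ), hb_StrToHex( cDevA )

   RETURN
```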

Fortunately, I discovered that a few functions are evaluated at compile time if they have constant arguments, and the result is therefore also considered a constant. For instance, CHR(99) is considered a constant, because it is evaluated at compile time, not runtime. I did some testing and discovered that the following operators and functions are in this category:

Code: Select all

+ // numeric and string
-  *  /  % 
^ // including negative and fractional exponents
0x   $  ==   !=   <  <=  >  >=  .T.  .Y.  .F.  .N.  !  .NOT.  .AND.  .OR.  {}  {=>}  {||}  E""
ASC()  AT()  CHR()  EMPTY()  HB_BITAND()  HB_BITNOT()  HB_BITOR()  HB_BITRESET()  HB_BITSET()  HB_BITSHIFT()  HB_BITTEST()  HB_BITXOR()  IF()  INT()  LEN()  LOWER()  MAX()  MIN()  UPPER()
The following are a few functions that are not evaluated at compile time:

Code: Select all

ABS()  ALLTRIM()  EVAL()  EXP()  HB_UTF8ASC()  HB_UTF8AT()  HB_UTF8CHR()  HB_UTF8LEFT()  HB_UTF8LEN()  HB_UTF8RAT()  HB_UTF8RIGHT()  HB_UTF8SUBSTR()  ISALPHA()  ISDIGIT()  ISLOWER()  ISUPPER()  LEFT()  LOG()  LTRIM()  MOD()  PADC()  PADL()  PADR()  RAT()  REPLICATE()  RIGHT()  ROUND()  RTRIM()  SPACE()  SQRT()  STR()  STRTRAN()  STRZERO()  STUFF()  SUBSTR()  TRANSFORM()  TYPE()  VAL()  VALTYPE()
So ultimately, my solution was to express the conversion from a code point to a UTF-8 string using only operators and functions from the first group, and to use #translate to map this to a pseudofunction:

Code: Select all

#translate U(<c>) => ;
  IF(<c> \< 0x80   , CHR(    <c>                  ), ;
  IF(<c> \< 0x0800 , CHR(INT(<c> / 0x40)    + 0xC0) + CHR(    <c>           % 0x40 + 0x80), ;
  IF(<c> \< 0x10000, CHR(INT(<c> / 0x1000)  + 0xE0) + CHR(INT(<c> / 0x40)   % 0x40 + 0x80) + CHR(    <c>         % 0x40 + 0x80), ;
                     CHR(INT(<c> / 0x40000) + 0xF0) + CHR(INT(<c> / 0x1000) % 0x40 + 0x80) + CHR(INT(<c> / 0x40) % 0x40 + 0x80) + CHR(    <c>         % 0x40 + 0x80))))
I now use this in my programs to express U+hhhh in Harbour as U(0xhhhh), e.g. U+0905 as U(0x0905).
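
As a quick sanity check of the arithmetic, the sketch below compares U() with hand-written byte strings for one code point of each UTF-8 length. The expected byte sequences are worked out by hand from the UTF-8 rules, and the sample code points U+0041 and U+1F600 are picked here only for illustration. It assumes a plain console Harbour program; each comparison should come out true if the macro reproduces the standard encoding.

```harbour
#translate U(<c>) => ;
  IF(<c> \< 0x80   , CHR(    <c>                  ), ;
  IF(<c> \< 0x0800 , CHR(INT(<c> / 0x40)    + 0xC0) + CHR(    <c>           % 0x40 + 0x80), ;
  IF(<c> \< 0x10000, CHR(INT(<c> / 0x1000)  + 0xE0) + CHR(INT(<c> / 0x40)   % 0x40 + 0x80) + CHR(    <c>         % 0x40 + 0x80), ;
                     CHR(INT(<c> / 0x40000) + 0xF0) + CHR(INT(<c> / 0x1000) % 0x40 + 0x80) + CHR(INT(<c> / 0x40) % 0x40 + 0x80) + CHR(    <c>         % 0x40 + 0x80))))

PROCEDURE Main()

   ? U(0x41)    == "A"                    // 1 byte:  U+0041
   ? U(0xA0)    == e"\xC2\xA0"            // 2 bytes: U+00A0 no-break space
   ? U(0x0905)  == e"\xE0\xA4\x85"        // 3 bytes: U+0905 Devanagari letter A
   ? U(0x1F600) == e"\xF0\x9F\x98\x80"    // 4 bytes: U+1F600

   RETURN
```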

Kevin
srvet_claudio
Posts: 2193
Joined: Thu Feb 25, 2010 8:43 pm
Location: Uruguay

Re: UTF-8 constants

Post by srvet_claudio »

Hi Kevin,
in HMG you've got:
# Gets Unicode text value

- HB_UCODE ( cUnicodeCharacter ) --> Return nCode
- HB_UCHAR ( nCode ) --> Return cUnicodeCharacter

- HMG_GetUnicodeValue ( cUnicodeText ) --> Return array { nCode1, nCode2, ..., nCodeN }
- HMG_GetUnicodeCharacter ( { nCode1, nCode2, ..., nCodeN } ) --> Return cUnicodeText



# UTF8 functions

- HMG_IsUTF8 ( cString ) --> lBoolean
- HMG_IsUTF8WithBOM ( cString ) --> lBoolean
- HMG_UTF8RemoveBOM ( cString ) --> cString
- HMG_UTF8InsertBOM ( cString ) --> cString

- HMG_UNICODE_TO_ANSI ( cTextUNICODE ) --> cTextANSI
- HMG_ANSI_TO_UNICODE ( cTextANSI ) --> cTextUNICODE
Best regards.
Dr. Claudio Soto
(from Uruguay)
http://srvet.blogspot.com

Re: UTF-8 constants

Post by kcarmody »

Thank you, Claudio, but this is not what I need. None of these functions are evaluated by the compiler when they have constant arguments. So for example, Harbour considers HB_UCHAR(0x0905) to be an expression and not a constant, even though 0x0905 is a constant and HB_UCHAR(0x0905) always returns the same value. Since Harbour does not consider HB_UCHAR(0x0905) to be a constant, it will not allow it to be used in a SWITCH .. CASE statement. Try it and you'll see.

But U(0x0905), where U() is defined in my first message, is a constant and can be used in a SWITCH .. CASE statement.

Kevin

Re: UTF-8 constants

Post by srvet_claudio »

I'm sorry, Kevin, but I do not understand what you mean:

Code: Select all

#translate U(<c>) => ;
  IF(<c> \< 0x80   , CHR(    <c>                  ), ;
  IF(<c> \< 0x0800 , CHR(INT(<c> / 0x40)    + 0xC0) + CHR(    <c>           % 0x40 + 0x80), ;
  IF(<c> \< 0x10000, CHR(INT(<c> / 0x1000)  + 0xE0) + CHR(INT(<c> / 0x40)   % 0x40 + 0x80) + CHR(    <c>         % 0x40 + 0x80), ;
                     CHR(INT(<c> / 0x40000) + 0xF0) + CHR(INT(<c> / 0x1000) % 0x40 + 0x80) + CHR(INT(<c> / 0x40) % 0x40 + 0x80) + CHR(    <c>         % 0x40 + 0x80))))

#define CONST 0x0905
#define UChar HB_UCHAR(CONST)

MsgDebug ( U(CONST),  HB_UCHAR (CONST), HMG_GetUnicodeCharacter ( { CONST } ) )
MsgDebug ( NIL,   HB_UCODE (UChar), HMG_GetUnicodeValue (UChar) )


DO CASE
   CASE U(CONST) == HB_UCHAR (CONST)
        MsgInfo ("OK")
ENDCASE
Best regards.
Dr. Claudio Soto
(from Uruguay)
http://srvet.blogspot.com

Re: UTF-8 constants

Post by kcarmody »

Sorry if I was not clear. What I mean is that

Code: Select all

DO CASE
CASE cUtfChar == HB_UCHAR(0x0905)
...
END
will compile and work OK, but

Code: Select all

SWITCH cUtfChar
CASE HB_UCHAR(0x0905)
...
END
will not compile. It will not compile because Harbour requires the value following CASE to be a constant. The value of a constant is known at compile time, whereas the value of an expression is not known until run time. Harbour recognizes "0x0905" as a constant, but it considers "HB_UCHAR(0x0905)" to be an expression, not a constant, because it does not compute the value of HB_UCHAR(0x0905) during compilation, but only at run time. The code must be something like this:

Code: Select all

SWITCH cUtfChar
CASE "अ"
...
END
or

Code: Select all

SWITCH cUtfChar
CASE U(0x0905)
...
END
I prefer the last one because it is more readable.
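
Put together as a complete program, this could look like the following sketch. It assumes a plain console Harbour program; Describe() and the strings it returns are hypothetical and serve only to show U() in CASE position.

```harbour
#translate U(<c>) => ;
  IF(<c> \< 0x80   , CHR(    <c>                  ), ;
  IF(<c> \< 0x0800 , CHR(INT(<c> / 0x40)    + 0xC0) + CHR(    <c>           % 0x40 + 0x80), ;
  IF(<c> \< 0x10000, CHR(INT(<c> / 0x1000)  + 0xE0) + CHR(INT(<c> / 0x40)   % 0x40 + 0x80) + CHR(    <c>         % 0x40 + 0x80), ;
                     CHR(INT(<c> / 0x40000) + 0xF0) + CHR(INT(<c> / 0x1000) % 0x40 + 0x80) + CHR(INT(<c> / 0x40) % 0x40 + 0x80) + CHR(    <c>         % 0x40 + 0x80))))

PROCEDURE Main()

   ? Describe( e"\xE0\xA4\x85" )   // the UTF-8 bytes of U+0905
   ? Describe( "x" )               // anything else

   RETURN

// Hypothetical helper: names a few characters by code point
FUNCTION Describe( cUtfChar )

   LOCAL cName

   SWITCH cUtfChar
   CASE U(0x0905)                  // a constant after compile-time evaluation
      cName := "Devanagari letter A"
      EXIT
   CASE U(0xA0)
      cName := "no-break space"
      EXIT
   OTHERWISE
      cName := "other"
   ENDSWITCH

   RETURN cName
```

Each CASE needs its own EXIT here, since SWITCH falls through to the next CASE otherwise.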
mol
Posts: 3718
Joined: Thu Sep 11, 2008 5:31 am
Location: Myszków, Poland

Re: UTF-8 constants

Post by mol »

Fine solution for more readable source code.
Thanks, Kevin!

Re: UTF-8 constants

Post by srvet_claudio »

OK Kevin, now I understand, thanks.
Best regards.
Dr. Claudio Soto
(from Uruguay)
http://srvet.blogspot.com