Further modifications to HMG_IsUTF8

HMG Unicode versions 3.1.x related

Moderator: Rathinagiri

kcarmody
Posts: 152
Joined: Tue Oct 07, 2014 11:13 am

Further modifications to HMG_IsUTF8

Post by kcarmody »

Last month, I posted a set of RichEditBox suggestions to this forum at http://hmgforum.com/viewtopic.php?f=43&t=4471&start=38. Since then, I have discovered some bugs in one of the functions I submitted changes for, HMG_IsUTF8() in /SOURCE/h_UNICODE_String.prg. I've now made the following changes to the September version of HMG_IsUTF8():
  • It now rejects overlong UTF-8 byte sequences, except for the frequently used two-byte overlong sequence for the null character (0xC0 0x80).
  • It now rejects surrogate code points (U+D800 through U+DFFF) encoded as UTF-8 sequences.
  • In some cases, it was returning incorrect values for incomplete UTF-8 sequences (the cPartial argument). This has now been corrected.
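
To make the first two changes concrete, here is a small sketch of the kinds of byte sequences involved. It assumes a console build linked with the HMG_IsUTF8() posted in this thread; DemoIsUTF8 is a hypothetical name used only for illustration, and the results in the comments are what the checks in the posted code should produce, not output from a test run.

```harbour
// Sketch only: assumes the HMG_IsUTF8() posted in this thread is linked in.
// DemoIsUTF8 is a hypothetical name, not part of HMG.
PROCEDURE DemoIsUTF8()

   // well-formed 3-byte sequence for U+20AC (euro sign)
   ? HMG_IsUTF8( hb_BChar( 0xE2 ) + hb_BChar( 0x82 ) + hb_BChar( 0xAC ) ) // expected .T.

   // 2-byte overlong null character, deliberately still accepted
   ? HMG_IsUTF8( hb_BChar( 0xC0 ) + hb_BChar( 0x80 ) )                    // expected .T.

   // 2-byte overlong encoding of "/" (U+002F)
   ? HMG_IsUTF8( hb_BChar( 0xC0 ) + hb_BChar( 0xAF ) )                    // expected .F.

   // 3-byte overlong encoding of "/" (U+002F)
   ? HMG_IsUTF8( hb_BChar( 0xE0 ) + hb_BChar( 0x80 ) + hb_BChar( 0xAF ) ) // expected .F.

   // surrogate U+D800 encoded as a 3-byte sequence
   ? HMG_IsUTF8( hb_BChar( 0xED ) + hb_BChar( 0xA0 ) + hb_BChar( 0x80 ) ) // expected .F.

   // 4-byte sequence for U+110000, beyond the end of Unicode
   ? HMG_IsUTF8( hb_BChar( 0xF4 ) + hb_BChar( 0x90 ) + hb_BChar( 0x80 ) + hb_BChar( 0x80 ) ) // expected .F.

RETURN
```

hb_BChar() is used instead of CHR() so that each argument produces exactly one byte regardless of the active code page.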
The changes I proposed in September remain in this version. I added three optional arguments:
  • lAllowASCII allows the input string to be all ASCII
  • lAllowPartial allows the input string to end with an incomplete UTF-8 sequence
  • cPartial (passed by reference) is set to the incomplete UTF-8 sequence at the end of the input string, or the empty string
The last two arguments are useful when the string is an input file buffer.
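
As a rough sketch of the file buffer case, the loop below reads a file in fixed-size chunks and carries the incomplete sequence from one chunk into the next. FileIsUTF8 and the 4096-byte chunk size are my own choices for illustration and are not part of HMG; FOpen(), FRead(), FClose(), hb_BLeft() and hb_BLen() are standard Harbour functions.

```harbour
#include "fileio.ch"

// Hypothetical helper, not part of HMG: returns .T. if the whole file
// passes HMG_IsUTF8() when read in fixed-size chunks.
FUNCTION FileIsUTF8( cFileName )

   LOCAL nHandle  := FOpen( cFileName, FO_READ )
   LOCAL cBuffer  := Space( 4096 )   // FRead() needs a preallocated buffer
   LOCAL cPartial := ''
   LOCAL lUTF8    := .T.
   LOCAL nRead, cChunk

   IF nHandle == F_ERROR
      RETURN .F.
   ENDIF

   DO WHILE lUTF8 .AND. ( nRead := FRead( nHandle, @cBuffer, 4096 ) ) > 0
      // prepend the incomplete sequence left over from the previous chunk
      cChunk := cPartial + hb_BLeft( cBuffer, nRead )
      // lAllowASCII is .T. because a single chunk may happen to be all ASCII
      lUTF8 := HMG_IsUTF8( cChunk, .T., .T., @cPartial )
   ENDDO

   FClose( nHandle )

   // a sequence that is still incomplete at end of file is invalid
RETURN lUTF8 .AND. hb_BLen( cPartial ) == 0
```

Because lAllowASCII is .T. here, this helper also returns .T. for a file that is entirely ASCII or empty; a caller who needs to tell those cases apart would have to track that separately.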

http://kevincarmody.com/hmg/SOURCE/h_UNICODE_String.prg - lines 175-300

// Following function modified by Kevin Carmody, October 2015

FUNCTION HMG_IsUTF8( cString, lAllowASCII, lAllowPartial, cPartial )

/* 
  Modeled after HB_STRISUTF8 in \src\rtl\strutf8.c in Harbour source and
     is_utf8() posted at http://stackoverflow.com/questions/1031645/how-to-detect-utf-8-in-plain-c
  HB_STRISUTF8 has several bugs:
     1. It does not accept a pure ASCII string.
     2. It does not accept the empty string.
     3. It accepts code points outside of Unicode range.
     4. It accepts overlong UTF-8 sequences.
     5. It accepts surrogate characters.
  This function returns .F. if cString contains any invalid UTF-8.  It also
     accepts the 2-byte overlong sequence for the null character.
  If the optional argument lAllowASCII is .T., cString may be all ASCII.
     Otherwise cString must contain one or more non-ASCII chars.
  If the optional argument lAllowPartial is .T., cString may end with an 
     unfinished UTF-8 byte sequence, which is passed back through cPartial, 
     which is otherwise set to the empty string. This is useful when cString 
     is a file buffer.
*/

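LOCAL lASCII  := .T. // .T. until a non-ASCII byte is found
LOCAL lCheck  := .F. // .T. when the next continuation byte needs a range check
LOCAL lUTF8   := .T. // return value
LOCAL nCBytes := 0   // continuation bytes still expected in the current sequence
LOCAL nRBytes := 0   // bytes read so far in the current sequence
LOCAL cChar, nChar, nLead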

   IF lAllowASCII == NIL
      lAllowASCII := .F.
   ENDIF
   IF lAllowPartial == NIL
      lAllowPartial := .F.
   ENDIF

   BEGIN SEQUENCE

      FOR EACH cChar IN cString

         nChar := HB_BCODE( cChar )

         IF nCBytes > 0 // check continuation bytes

            IF nChar < 0x80 .OR. nChar > 0xBF // disallow invalid continuation byte
               BREAK
            ENDIF
            IF lCheck // check first continuation byte for partially valid lead byte
               SWITCH nLead
               CASE 0xC0 // disallow 2-byte overlongs except overlong null character
                  IF nChar != 0x80
                     BREAK
                  ENDIF
                  EXIT
               CASE 0xE0 // disallow 3-byte overlongs
                  IF nChar < 0xA0
                     BREAK
                  ENDIF
                  EXIT
               CASE 0xED // disallow surrogates
                  IF nChar > 0x9F
                     BREAK
                  ENDIF
                  EXIT
               CASE 0xF0 // disallow 4-byte overlongs
                  IF nChar < 0x90
                     BREAK
                  ENDIF
                  EXIT
               CASE 0xF4 // disallow 4-byte sequences beyond end of Unicode
                  IF nChar > 0x8F
                     BREAK
                  ENDIF
                  EXIT
               ENDSWITCH
               lCheck := .F.
            ENDIF
            nCBytes --
            nRBytes ++

         ELSEIF nChar >= 0x80 // check lead byte

            lASCII := .F.
            nLead := nChar
            IF nLead < 0xC0 .OR. nLead == 0xC1 .OR. nLead > 0xF4 // disallow invalid lead bytes
               BREAK
            ENDIF
            lCheck := ( nLead == 0xC0 .OR. nLead == 0xE0 .OR. nLead == 0xED .OR. ;
              nLead == 0xF0 .OR. nLead == 0xF4 ) // partially valid lead bytes

            DO CASE // compute number of continuation bytes
            CASE nLead <= 0xDF
              nCBytes := 1
            CASE nLead <= 0xEF
              nCBytes := 2
            OTHERWISE
              nCBytes := 3
            ENDCASE
            nRBytes := 1

         ENDIF

      NEXT

   RECOVER

      lUTF8 := .F.

   END SEQUENCE

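   IF lUTF8 .AND. nCBytes > 0 // string ended in the middle of a sequence
      IF lAllowPartial
         cPartial := RIGHT( cString, nRBytes )
      ELSE
         lUTF8 := .F.
      ENDIF
   ELSE // string is invalid, or it ended on a complete sequence
      IF lAllowPartial
         cPartial := ''
      ENDIF
   ENDIF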

   IF ! lAllowASCII .AND. lASCII
      lUTF8 := .F.
   ENDIF

RETURN lUTF8

I've also updated the zip of files in my proposed patch at http://kevincarmody.com/hmg/HmgChangeProposal.zip, and relinked the Rich Edit demo at http://kevincarmody.com/hmg/SAMPLES/Con ... x/demo.exe, since it uses a RichEditBox method that calls HMG_IsUTF8().

bpd2000
Posts: 1207
Joined: Sat Sep 10, 2011 4:07 am
Location: India

Re: Further modifications to HMG_IsUTF8

Post by bpd2000 »

Thank you for your contribution
BPD
Convert Dream into Reality through HMG
serge_girard
Posts: 3158
Joined: Sun Nov 25, 2012 2:44 pm
DBs Used: 1 MySQL - MariaDB
2 DBF
Location: Belgium

Re: Further modifications to HMG_IsUTF8

Post by serge_girard »

Thanks Kevin!

Serge
There's nothing you can do that can't be done...
kcarmody
Posts: 152
Joined: Tue Oct 07, 2014 11:13 am

Re: Further modifications to HMG_IsUTF8

Post by kcarmody »

Thank you, bpd2000 and Serge, for your kind remarks, but I made a big mistake by not checking for a new version before submitting a proposal. I just noticed that version 3.4.2 was released two weeks ago, while the change I propose in this thread is for 3.4.1 patch 6. Almost none of the changes that I proposed for 3.4.1 have been put into 3.4.2, so I will redevelop my proposal for 3.4.2.
Steed
Posts: 427
Joined: Sat Dec 12, 2009 3:40 pm

Re: Further modifications to HMG_IsUTF8

Post by Steed »

thks
danielmaximiliano
Posts: 2607
Joined: Fri Apr 09, 2010 4:53 pm
Location: Argentina

Re: Further modifications to HMG_IsUTF8

Post by danielmaximiliano »

Thanks Kevin!
*´¨)
¸.·´¸.·*´¨) ¸.·*¨)
(¸.·´. (¸.·` *
.·`. Harbour/HMG : It's magic !
(¸.·``··*

Saludos / Regards
DaNiElMaXiMiLiAnO