
Problem reading Unicode file

Posted: Sat Jul 19, 2014 7:00 pm
by Clip2Mania
Can anyone suggest how to read the attached Unicode file?
I can open it in Windows Notepad without any problems.
I'm using HMG 3.3.1, 32 bits, Unicode.
I tried MemoRead(), HB_MemoRead() and the FOpen()/FReadStr() combination,
in both the ANSI and Unicode versions of HMG, and apparently none of them can read it. :cry:
Suggestions?
Thx,
Erik

Re: Problem reading Unicode file

Posted: Sun Jul 20, 2014 4:14 am
by bpd2000
Refer to the attached demo.
You have to save the file using UTF-8 encoding.
Also refer to:
viewtopic.php?f=7&t=3689&p=34140&hilit= ... F+8#p34140

Re: Problem reading Unicode file

Posted: Sun Jul 20, 2014 6:58 am
by Clip2Mania
bpd2000 wrote:You have to save file using Encoding UTF-8
Yes, I saw the demo and read that post previously. That is exactly the issue: I cannot save the file in UTF-8, because it comes from an external program (EAC). I have a lot of these files and need to read them, so manually opening and saving each file is way too much work for my customer. Furthermore, I want to spare him the complexity :!:

In the meantime, I found a command-line conversion tool on the web (http://www.autohotkey.com/board/topic/9 ... icode-cmd/) that can do this. I use the "execute file" command to convert each file first. It's not really beautiful, but it kinda works... :geek:

Re: Problem reading Unicode file

Posted: Tue Jul 22, 2014 1:57 am
by srvet_claudio
Clip2Mania wrote:Anyone can suggest how to read unicode file attached?
I can open it in Windows Notepad without any problems.
I'm using HMG 3.3.1, 32 bits, Unicode.
Tried using memoread(), HB_Memoread() and FOpen(), FReadStr() combination,
both in ANSI & UNICODE versions of HMG, and apparently cannot open it. :cry:
Suggestions?
Thx,
Erik
Hi Erik,
the problem is that your file is in Unicode UTF-16LE (the Unicode of Windows) and HMG works with UTF-8,
see this code:

Code: Select all



#include "hmg.ch"

FUNCTION Main()
   cText := HMG_UTF16LE_TO_UTF8 ("test_unicodeUTF16LE.txt")
   MsgInfo (cText)
RETURN NIL



#pragma BEGINDUMP

#define UNICODE

#include "HMG_UNICODE.h"
#include <windows.h>
#include "hbapi.h"

HB_FUNC ( HMG_UTF16LE_TO_UTF8 )
{ 
   TCHAR *FileName = (TCHAR *) HMG_parc (1);
   
   HANDLE    hFile;
   DWORD     nFileSize;
   DWORD     nReadByte;

   hFile = CreateFile (FileName, GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
   if (hFile == INVALID_HANDLE_VALUE)
       return;
          
   nFileSize = GetFileSize (hFile, NULL);
   if (nFileSize == INVALID_FILE_SIZE)
   {   CloseHandle (hFile); 
       return;
   }

   TCHAR cBuffer [ nFileSize ];

   ReadFile (hFile, cBuffer, nFileSize, &nReadByte, NULL);
   
   CloseHandle (hFile);

   HMG_retc (cBuffer);
}

#pragma ENDDUMP
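
The conversion described in this post (UTF-16LE in, UTF-8 out) can also be written without the Windows API. The following is only a minimal portable C sketch, not HMG's own implementation: `put_utf8` and `utf16le_to_utf8` are hypothetical names introduced here, and the caller is assumed to supply an output buffer of at least (len / 2) * 3 + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8; returns the number of bytes written. */
static size_t put_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/*
 * Hypothetical helper: convert `len` bytes of UTF-16LE to NUL-terminated UTF-8.
 * `out` must hold at least (len / 2) * 3 + 1 bytes.
 * Returns the number of UTF-8 bytes written, not counting the terminator.
 */
size_t utf16le_to_utf8(const unsigned char *in, size_t len, char *out)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);

    /* Skip the byte order mark (FF FE) if the file starts with one. */
    if (len >= 2 && in[0] == 0xFF && in[1] == 0xFE)
        i = 2;

    while (i + 1 < len) {
        /* Little-endian: low byte first, then high byte. */
        uint32_t cp = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
        i += 2;

        /* A high surrogate followed by a low surrogate encodes one code point. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len) {
            uint32_t lo = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }

        /* Replace any unpaired surrogate with U+FFFD. */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            cp = 0xFFFD;

        n += put_utf8(cp, (unsigned char *)out + n);
    }

    out[n] = '\0';   /* always terminate, so no stray bytes follow the text */
    return n;
}
```

A caller would read the raw file bytes, pass them with their exact length, and hand the resulting UTF-8 string back to the Harbour side. A trailing odd byte, if any, is ignored.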


Re: Problem reading Unicode file

Posted: Tue Jul 22, 2014 2:37 am
by esgici
Simply another approach:

Code: Select all

/*
  Convert a UTF-16 (Windows "Unicode") string to single-byte text
  by removing the zero bytes and skipping the 2-byte BOM.
  CAUTION : Only correct for characters in the ASCII range !
*/

#include <hmg.ch>

PROCEDURE Main
   MsgBox( UniBE2UT8( HB_MEMOREAD( "test_unicode.txt" ) ) )
RETURN

FUNCTION UniBE2UT8( cBigEndianStr )          // Remove zero bytes, then skip the 2-byte BOM
RETURN ( SUBSTR( STRTRAN( cBigEndianStr, CHR(0), '' ), 3 ) )

Re: Problem reading Unicode file

Posted: Tue Jul 22, 2014 7:57 am
by Clip2Mania
Fantastic, thanks gentlemen for the effort! :)
There is a problem with both pieces of code, however:
Mr. esgici's code does not read to the end of the file but stops somewhere :(
Dr. Claudio's code reads too much :) (see the "garbage" characters at the end of the file)
It's not beautiful in MsgBox, but I can filter that out in my code. ;)
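
One plausible cause of the trailing garbage (an assumption, not something confirmed in this thread): the C buffer in the HMG_UTF16LE_TO_UTF8 code above is filled with exactly the file's bytes and never zero-terminated, so a later step that expects a NUL-terminated wide string can run past the end of the data. Below is a minimal portable C sketch of reading a file with a terminator appended. `read_file_terminated` is a hypothetical helper, not part of HMG or Harbour.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read a whole file into a heap buffer and append two
 * zero bytes, so the data can safely be treated as a NUL-terminated
 * UTF-16 string. Returns NULL on failure; the caller frees the buffer.
 * If `size` is not NULL it receives the byte count without the terminator.
 */
unsigned char *read_file_terminated(const char *path, size_t *size)
{
    FILE *fp;
    long len;
    unsigned char *buf;
    size_t got;

    assert(path != NULL);

    fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Determine the file length by seeking to the end. */
    if (fseek(fp, 0, SEEK_END) != 0 || (len = ftell(fp)) < 0 ||
        fseek(fp, 0, SEEK_SET) != 0) {
        fclose(fp);
        return NULL;
    }

    /* Two extra bytes leave room for one 16-bit terminator. */
    buf = (unsigned char *)malloc((size_t)len + 2);
    if (buf == NULL) {
        fclose(fp);
        return NULL;
    }

    got = fread(buf, 1, (size_t)len, fp);
    fclose(fp);

    /* Terminate after the bytes actually read, not after the expected size. */
    buf[got] = 0;
    buf[got + 1] = 0;

    if (size != NULL)
        *size = got;
    return buf;
}
```

The same idea applies to the Windows version: allocate the file size plus room for one terminating wide character, and write the terminator at the position given by the byte count that ReadFile reports.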

Re: Problem reading Unicode file

Posted: Tue Jul 22, 2014 10:48 am
by esgici
Clip2Mania wrote:...
Mr. esgici's code does not read to the end of the file but stops somewhere :(
...
On my side there is no such truncation problem with my method, nor the extra characters mentioned above with Claudio's method :(
[Attachment: UpperExtraCharactersInClaudio'sMethod.PNG]
And physically there are no such extra characters (letters or otherwise) in your file :?

If you ran this test on another file, please send it to me.

Regards

Re: Problem reading Unicode file

Posted: Tue Jul 22, 2014 11:11 am
by Clip2Mania
Mr esgici,
the trouble is in the accents/special characters (it always is :( )
I tried adding 'SET CODEPAGE TO UNICODE' at the beginning of the program, but that does not change anything.

Re: Problem reading Unicode file

Posted: Tue Jul 22, 2014 11:17 am
by Clip2Mania
It is true, I added the éèçàôù characters to the file, because they are very common. In the example above,
if you leave them out, you will see that they are not correctly translated further in the file.
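
A likely explanation for the accent problem, offered as a sketch rather than a confirmed diagnosis: in UTF-16LE, é (U+00E9) is stored as the bytes E9 00. Removing the zero byte leaves a lone E9, which is the Latin-1 code for é but not a valid UTF-8 sequence (UTF-8 needs C3 A9). The portable C below illustrates this. `strip_zero_bytes` and `is_valid_utf8` are hypothetical names, and the validity check is deliberately simplified to lead-byte and continuation-byte structure only.

```c
#include <assert.h>
#include <stddef.h>

/* Copy `in` to `out` without its zero bytes; returns the number of bytes kept. */
size_t strip_zero_bytes(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i, n = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        if (in[i] != 0)
            out[n++] = in[i];
    }
    return n;
}

/*
 * Simplified structural UTF-8 check: every lead byte must be followed by
 * the right number of continuation bytes (10xxxxxx). It does not reject
 * every overlong or surrogate encoding. Returns 1 if the structure is valid.
 */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0, k, extra;

    assert(s != NULL);
    while (i < len) {
        unsigned char c = s[i];

        if (c < 0x80)
            extra = 0;                       /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF)
            extra = 1;                       /* 2-byte sequence */
        else if (c >= 0xE0 && c <= 0xEF)
            extra = 2;                       /* 3-byte sequence */
        else if (c >= 0xF0 && c <= 0xF4)
            extra = 3;                       /* 4-byte sequence */
        else
            return 0;                        /* stray continuation or invalid lead */

        if (extra > len - i - 1)
            return 0;                        /* sequence cut off at end of data */
        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
        }
        i += extra + 1;
    }
    return 1;
}
```

Stripping the zero bytes from "Aé" in UTF-16LE (41 00 E9 00) gives 41 E9, which this check rejects, while the proper UTF-8 form 41 C3 A9 passes. This is why the zero-stripping approach appears to work on plain ASCII text and fails once accented characters appear.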

Re: Problem reading Unicode file

Posted: Tue Jul 22, 2014 12:25 pm
by esgici
You are right, my conversion method is not suitable for your needs :(