Read text from pdf - any idea?

Posted: Mon Dec 11, 2023 2:37 pm
by mol
Do you have any idea, or maybe a sample, of how to read data from tables in a PDF file?

Re: Read text from pdf - any idea?

Posted: Tue Dec 12, 2023 4:20 am
by Rathinagiri
https://github.com/michaelrsweet/pdfio

This looks very promising. Can anyone build a library or DLL from it?

Re: Read text from pdf - any idea?

Posted: Tue Dec 12, 2023 4:26 am
by Rathinagiri

Re: Read text from pdf - any idea?

Posted: Tue Dec 12, 2023 8:11 am
by mol
C is not my language :lol:
I really don't even know how to start compiling the zlib library from this text.

Update:
I'm trying to compile pdf.cpp from these sites.
First, I downloaded the zlib library.
I compiled it successfully with MinGW.
But when I try to compile pdf.cpp, I get an error:

Code:

pdf.cpp:205:22: error: '_TCHAR' has not been declared
  205 | int _tmain(int argc, _TCHAR* argv[])
      |                      ^~~~~~
pdf.cpp: In function 'int _tmain(int, int**)':
pdf.cpp:233:74: warning: ISO C++ forbids converting a string constant to 'char*' [-Wwrite-strings]
  233 |                         size_t streamstart = FindStringInBuffer (buffer, "stream", filelen);
      |                                                                          ^~~~~~~~
pdf.cpp:234:74: warning: ISO C++ forbids converting a string constant to 'char*' [-Wwrite-strings]
  234 |                         size_t streamend   = FindStringInBuffer (buffer, "endstream", filelen);
      |                                                                          ^~~~~~~~~~~
I think it's the end of my knowledge :D
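
The messages above point at two separate things. `_tmain` and `_TCHAR` are Microsoft generic-text names that come from `<tchar.h>`, so the error suggests that header is not included; adding `#include <tchar.h>` may be enough with MinGW, or the entry point can be rewritten as a standard `int main(int argc, char* argv[])`. The warnings come from passing a string literal to a parameter declared as `char*`. Below is a minimal sketch of a warning-free helper. The real `FindStringInBuffer` in pdf.cpp is not shown in this thread, so this version, its parameter names, and its "not found" return value are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the FindStringInBuffer helper in pdf.cpp.
// Declaring the parameters as `const char*` lets callers pass string
// literals such as "stream" without the -Wwrite-strings warning.
// Returns the offset of the first match, or `buflen` if nothing is found
// (assumption: the original helper may signal "not found" differently).
size_t FindStringInBuffer(const char* buffer, const char* search, size_t buflen)
{
    const size_t searchlen = std::strlen(search);
    if (searchlen == 0 || searchlen > buflen)
        return buflen;

    for (size_t i = 0; i + searchlen <= buflen; ++i)
    {
        // memcmp instead of strstr: PDF data may contain NUL bytes,
        // so the buffer cannot be treated as a C string.
        if (std::memcmp(buffer + i, search, searchlen) == 0)
            return i;
    }
    return buflen;
}

// In pdf.cpp itself, the entry point
//     int _tmain(int argc, _TCHAR* argv[])
// would become the standard
//     int main(int argc, char* argv[])
```

With this signature, a call like `FindStringInBuffer(buffer, "stream", filelen)` compiles cleanly as long as `buffer` is a `char*` or `const char*`.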

Re: Read text from pdf - any idea?

Posted: Tue Dec 12, 2023 8:27 am
by serge_girard
Very interesting!

I would love to have this...!

Re: Read text from pdf - any idea?

Posted: Tue Dec 12, 2023 9:28 am
by mol
Maybe someone can help port this piece of code to Harbour?

Code:

// Now use zlib to inflate the stream data.
z_stream zstrm;
ZeroMemory(&zstrm, sizeof(zstrm));

zstrm.avail_in  = streamend - streamstart + 1;   // compressed bytes available
zstrm.avail_out = outsize;                       // capacity of the output buffer
zstrm.next_in   = (Bytef*)(buffer + streamstart);
zstrm.next_out  = (Bytef*)output;

int rsti = inflateInit(&zstrm);
if (rsti == Z_OK)
{
	int rst2 = inflate(&zstrm, Z_FINISH);
	if (rst2 >= 0)
	{
		// OK, got something; extract the text.
		size_t totout = zstrm.total_out;
		ProcessOutput(fileo, output, totout);
	}
	inflateEnd(&zstrm);   // release zlib's internal state
}
delete[] output; output = 0;

// Skip past this stream and continue with the rest of the file.
buffer += streamend + 7;
filelen = filelen - (streamend + 7);

Re: Read text from pdf - any idea?

Posted: Tue Dec 12, 2023 2:36 pm
by Rathinagiri
Wow! At least you managed to get this far!

I think C gurus like Grigory and edk are available. Let us ask them.

Re: Read text from pdf - any idea?

Posted: Tue Dec 12, 2023 9:29 pm
by mol
I compiled the sample, but I'm getting garbage instead of text from the PDF:

Code:

!"#$"%&'$%()*&+,-./01&23"456$)3$75"&89:;<&:8=88:&>5?4(@7A@3"0)BCD&E:;&FGG&HIJ&H8;<&/2KD&J:J8I8L::H2/'&M"%6&NBO46$<;F&GIFI&GG:8&GIII&IIJI&J8GF&:L;F

>5?4(@7A@3"

P$)Q47)&3R4("3$)%$"DI8CI8C8I8L

S"("&4T*5)U"VRDI8CI8C8I8L

S"("&3R4("3$)%$"D

WT*5)U"37"D

/"#R37"D!"#$"%&'$%()*&+,-./01>B)"%&A@X4)&S3@*"6&Y%%"/2KD&J:J8I8L::H

23"456$)3$75"&89:;:8=88:&>5?4(@7A@3"

W$6@*46$)Z@&G[&:8=LII&PR456\3

/2KD&FHH&GJL&IL&8I!"6(X*"&]Y0&&8;:98I8L&@*RZ$%"^%"&T@U4("3$)&5"_\3$)%$"&1`&8L898I8L&5&U%$"&GJCIGC8I8L

I have no idea how to continue this work...

I think it's possible to write it in pure Harbour, but I don't know how to decompress a text variable in memory, what the compression method is, etc.
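
A possible reading of the output above, offered as an assumption rather than a diagnosis: the decompression step itself seems to have worked, because `inflate` returned structured data with repeated "words" and separators. The symbols also appear to be handed out in order of first use (`!`, `"`, `#`, `$`, ...), which is the pattern that embedded subset fonts with their own character codes tend to produce. If that is the case, each code has to be translated through the font's mapping table (in a PDF, usually the font's `/ToUnicode` CMap) before it becomes readable. The sketch below shows only that last translation step. `DecodeWithMap` and the table contents are hypothetical, and parsing the CMap out of the PDF is not covered.

```cpp
#include <map>
#include <string>

// Translate font-specific character codes into readable text using a
// code -> text table. Assumption: in a real PDF this table would be
// built from the font's /ToUnicode CMap; here the caller supplies it.
// Values are strings so that one code can map to a multi-byte UTF-8
// character.
std::string DecodeWithMap(const std::string& raw,
                          const std::map<unsigned char, std::string>& toUnicode)
{
    std::string out;
    for (unsigned char code : raw)
    {
        auto it = toUnicode.find(code);
        // Unknown codes stay visible as '?' instead of being dropped.
        out += (it != toUnicode.end()) ? it->second : "?";
    }
    return out;
}
```

For example, with a made-up table that maps `!` to `P`, `"` to `D` and `#` to `F`, the input `!"#` decodes to `PDF`. The real table for the file above is unknown and would have to be read from the document.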

Re: Read text from pdf - any idea?

Posted: Tue Dec 12, 2023 10:18 pm
by hansmarc
Hi Mol,

This tool (see the link below) may help you.
It is a command-line tool. I know, not the best solution.
In 2022 I ran some tests with a lot of supplier invoice PDF files, for a project to automatically import
invoices into our management software.
The results were not bad at all, but sadly the project does not have the highest priority and so was never finished.

https://www.xpdfreader.com/pdftotext-man.html

Regards
Hans
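
Since pdftotext is a separate executable, one way to use it from a program is to build the command line and run it, then read the resulting text file. Here is a minimal sketch, assuming `pdftotext` is on the PATH and that the paths contain no double quotes. The helper names are made up for illustration. `-layout` asks the tool to keep the physical layout of the page, which tends to keep table columns aligned; the linked man page lists the other modes.

```cpp
#include <cstdlib>
#include <string>

// Wrap a path in double quotes so that spaces survive the shell.
// Assumption: the path itself contains no double quotes.
std::string Quote(const std::string& path)
{
    return "\"" + path + "\"";
}

// Build the pdftotext command line for one input and one output file.
std::string BuildPdfToTextCommand(const std::string& pdfPath,
                                  const std::string& txtPath)
{
    return "pdftotext -layout " + Quote(pdfPath) + " " + Quote(txtPath);
}

// Run the tool and return whatever std::system reports
// (0 usually means success).
int RunPdfToText(const std::string& pdfPath, const std::string& txtPath)
{
    return std::system(BuildPdfToTextCommand(pdfPath, txtPath).c_str());
}
```

Calling `RunPdfToText("invoice.pdf", "invoice.txt")` would then leave the extracted text in `invoice.txt`, ready to be read and split into columns. The same command string could just as well be launched from Harbour.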

Re: Read text from pdf - any idea?

Posted: Wed Dec 13, 2023 3:08 am
by Rathinagiri
Please share your code, Mol. Let others try.