Test xpdf with php
#21
Thanks Uwe,

I just installed RC5 and it went through like a charm; no more error messages.

Also, the error I reported at the beginning has disappeared and no longer shows up in the httpd log.

Is there a way to test pdftotext out of the box? I uploaded some documents, but their words are still not added to the index. I ran the conversion from the console using the commands from the default settings, and it went through.

Also, I can just type pdftotext in the console and it works. Is there a debug option that shows me exactly what runs in the background? Do I need to add both tools explicitly to the path?

Thanks
Daniel
#22
Hi Uwe,

Found a script to test:

PHP Code:
<?php
system("pdftotext 1.pdf outfile.txt");
?>

I had to modify it on my system; it now looks like this and works:

PHP Code:
<?php
system("/opt/bin/pdftotext 1.pdf outfile.txt");
?>

I tried to add the path to LetoDMS but still no change in behavior.

What is the %s in the command?

I tried the following in my test file, but I don't know how to set this up properly in LetoDMS (this worked too and printed the output on the screen):

PHP Code:
system ("/opt/bin/pdftotext -nopgbrk 1.pdf - | sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'"); 
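To see what the two sed expressions in that pipeline do, they can be run on a single line of text without pdftotext. This is only a sketch; the sample line is made up and stands in for real pdftotext output.

```shell
#!/bin/sh
# Sample line standing in for pdftotext output (made up for illustration).
text='page 7 of x 2012. report'

# First expression: drop single-character tokens surrounded by spaces.
# Second expression: delete every remaining digit and dot.
filtered=$(printf '%s\n' "$text" |
    sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g')

# Print what would be handed on for indexing.
printf '%s\n' "$filtered"
```

Only the words "page", "of" and "report" survive, so single characters and numbers never reach the index.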

Regards
Daniel
#23
(11-28-2012, 12:57 AM)Daniel Wrote: What is the %s in the command?

I tried the following in my test file, but I don't know how to set this up properly in LetoDMS (this worked too and printed the output on the screen):

PHP Code:
system ("/opt/bin/pdftotext -nopgbrk 1.pdf - | sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'"); 

The %s is replaced by LetoDMS with the path of the file to index. The command is supposed to write its output to stdout.
The simplest command would be

'cat %s'

which just outputs the content of the file to stdout.
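As a rough illustration of that substitution, here is a minimal shell sketch. The way the placeholder is filled in below (printf plus single quotes around the path) is an assumption made for the example, not necessarily how LetoDMS does it, and the sample file is created only for the demonstration.

```shell
#!/bin/sh
# Sketch only: this expansion is an assumption about how a '%s'
# template could be filled in, not LetoDMS's actual code.

# Create a small sample file to stand in for the document to index.
sample=$(mktemp)
printf 'hello world\n' > "$sample"

# Converter command template as it would appear in the settings.
template='cat %s'

# Replace %s with the single-quoted file path, then run the result.
cmd=$(printf "$template" "'$sample'")
out=$(sh -c "$cmd")

# Whatever the command writes to stdout is the text that gets indexed.
printf '%s\n' "$out"

rm -f "$sample"
```

Swapping the template for a pdftotext or catdoc command line follows the same pattern, as long as the command writes the extracted text to stdout.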

Uwe
#24
Hi Uwe,

Sounds a little bit like rocket science to me :-)

I tried various changes, including just using your suggestion. The fulltext indexer runs through, but the documents are not indexed. Beforehand I tested them on the command line to make sure their text can be extracted for a fulltext index; they are OK.

Sometimes (but unfortunately not always) the httpd error log shows these lines:

sh: catdoc: not found
sh: pdftotext: not found

One for each document, depending on their types.

Is there a debug option in LetoDMS that lets me see what the fulltext indexer does and what it accesses during the process? The logs on my system are very sparse, and messages like the ones above are not really helpful.

Also, as tested above, PHP can access both tools, but some small detail may be wrong and I want to find it.

Daniel
#25
(11-29-2012, 12:49 AM)Daniel Wrote: Hi Uwe,

Sounds a little bit like rocket science to me :-)

I tried various changes, including just using your suggestion. The fulltext indexer runs through, but the documents are not indexed. Beforehand I tested them on the command line to make sure their text can be extracted for a fulltext index; they are OK.

Sometimes (but unfortunately not always) the httpd error log shows these lines:

sh: catdoc: not found
sh: pdftotext: not found

One for each document, depending on their types.

Is there a debug option in LetoDMS that lets me see what the fulltext indexer does and what it accesses during the process? The logs on my system are very sparse, and messages like the ones above are not really helpful.

Also, as tested above, PHP can access both tools, but some small detail may be wrong and I want to find it.

Indexing in general works by calling a program which turns the content of the document into plain text (basically a list of words). Such a program can be configured for each document MIME type. A '%s' in the command is replaced by the document's filename. LetoDMS takes the output of that program, removes stop words and indexes the remaining words. This is quite a common approach on Unix. The crucial part in your case seems to be the path to programs like catdoc and pdftotext. They are called by the web server, so you have to call them with their full path or make sure their directory is in the $PATH variable.
This has nothing to do with PHP; it is the shell environment.
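The "not found" situation can be reproduced in isolation. The following is a minimal sketch, not LetoDMS code: a stand-in script named faketool (hypothetical) in a temporary directory plays the role of /opt/bin/pdftotext, and the PATH values are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: a stand-in tool in a private directory plays the role of
# /opt/bin/pdftotext; "faketool" and the directory are hypothetical.
tooldir=$(mktemp -d)
printf '#!/bin/sh\necho converted\n' > "$tooldir/faketool"
chmod +x "$tooldir/faketool"

# A shell with a minimal PATH, similar to what a web server may pass on.
if env PATH=/usr/bin:/bin sh -c 'command -v faketool' >/dev/null 2>&1; then
    restricted=found
else
    restricted=missing
fi

# The same lookup once the tool's directory is on the PATH.
if env PATH="$tooldir:/usr/bin:/bin" sh -c 'command -v faketool' >/dev/null 2>&1; then
    extended=found
else
    extended=missing
fi

# Calling the tool by its full path works regardless of PATH.
fullpath_out=$(env PATH=/usr/bin:/bin "$tooldir/faketool")

echo "restricted PATH: $restricted"
echo "extended PATH:   $extended"
echo "full path call:  $fullpath_out"

rm -rf "$tooldir"
```

The PATH that matters is the one in the environment of the web server process, which can differ from the PATH of an interactive login shell.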

The only way to debug this is to dig into the code and place some echo statements
in the right places. But I doubt this is needed in your case.

Uwe
#26
Uwe,

You are just fantastic!

I looked at $PATH; the directory /opt/bin was in there.

I searched the internet for related information and found that PHP may look in /usr/bin for the application; if it is not there, the call fails.

I checked my /usr/bin and found nothing there, so I created a symlink:

PHP Code:
ln -s /opt/bin/pdftotext /usr/bin/pdftotext

and

PHP Code:
ln -s /opt/bin/catdoc /usr/bin/catdoc

I ran the indexer; instead of 17 terms I now see 1747 terms.

Thanks
Daniel
#27
(11-29-2012, 10:01 PM)Daniel Wrote: I looked at $PATH; the directory /opt/bin was in there.

I searched the internet for related information and found that PHP may look in /usr/bin for the application; if it is not there, the call fails.

I checked my /usr/bin and found nothing there, so I created a symlink:

Great. So I was wrong with my assumption that $PATH is relevant. Thanks for posting your solution.

Uwe