LetoDMS Community Forum
Indexing cyrillic docs


Indexing cyrillic docs - sarulezzz - 08-31-2012

Hi,

Doesn't LetoDMS index words with Cyrillic letters? I tried to index documents containing both English and Russian words, but only the English words showed up in "Fulltext index info". I had to change the Lucene/IndexedDocument.php file to get full-text indexing of Cyrillic text. Here are my changes:
Code:
--- IndexedDocument.php.orig    2012-08-30 16:24:40.199586601 +0300
+++ IndexedDocument.php    2012-08-30 17:43:55.233553208 +0300
@@ -39,6 +39,7 @@
        if($convcmd) {
            $_convcmd = $convcmd;
        }
+        Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive() );
        $version = $document->getLatestContent();
        $this->addField(Zend_Search_Lucene_Field::Keyword('document_id', $document->getID()));
        $this->addField(Zend_Search_Lucene_Field::Keyword('mimetype', $version->getMimeType()));

I only inserted one statement, and now full-text search works with Russian words. I'm not sure whether this change is correct, though. What do you think of it?


RE: Indexing cyrillic docs - steinm - 09-01-2012

(08-31-2012, 02:36 PM)sarulezzz Wrote: [...] I only inserted one statement, and now full-text search works with Russian words. I'm not sure whether this change is correct, though. What do you think of it?

Looks reasonable. Zend_Search_Lucene probably uses latin1 by default.

Uwe
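
For anyone wondering why a single line makes such a difference, here is a minimal, self-contained sketch in plain PHP (no Zend Framework needed). It shows how a tokenizer that only accepts ASCII letters silently drops Cyrillic words, while a Unicode-aware one keeps them. The functions tokenizeAsciiOnly() and tokenizeUtf8() are hypothetical stand-ins written for this illustration; they are not the real Zend_Search_Lucene analyzers, whose internals may differ, and the sketch leaves out case folding.

```php
<?php
// Hypothetical stand-ins for illustration only; NOT the real Zend analyzers.

// Keeps only runs of ASCII letters. The bytes of UTF-8 encoded Cyrillic
// characters never fall into a-z/A-Z, so those words produce no tokens.
function tokenizeAsciiOnly(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return $matches[0];
}

// Keeps runs of any Unicode letter. The /u modifier makes PCRE treat the
// input as UTF-8 (assumes PCRE was built with Unicode support).
function tokenizeUtf8(string $text): array
{
    preg_match_all('/\p{L}+/u', $text, $matches);
    return $matches[0];
}

$text = 'Invoice Счёт 2012';

print_r(tokenizeAsciiOnly($text)); // only the English word survives
print_r(tokenizeUtf8($text));      // English and Russian words survive
```

The patch in the first post achieves the same effect inside LetoDMS by making the UTF-8 analyzer the default before the document fields are added. As far as I know, the Zend UTF-8 analyzers likewise depend on PCRE being compiled with Unicode support, so that is worth checking if the change has no effect on a given server.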