Website Downloads Documentation Knowledgebase Wiki Issue tracker Commercial support

Full Text Indexer

Full text indexing in Daisy happens automatically when document variants are updated, so you do not need to worry about updating the index yourself. Technically, the full text indexer has a durable subscription on the JMS events generated by the repository, and it are these events which trigger the index updating.

Technology

Daisy uses Jakarta Lucene as full-text indexer.

Included content

Only document variants which have a live version are included in the full text index. Thus retired document variants or document variants having only draft versions are not included. It is the content of the live version which is indexed, thus full text search operations always search on the live content.

For each document variant, the included content consists of the document name, the value of string fields, and text extracted from the parts. For the parts, text extraction will be performed on the data if the mime type is one of the following:

Mime type

Comment

text/plain

text/xml

e.g. the "Daisy HTML" parts

application/xhtml+xml

XHTML documents

application/pdf

PDF files

application/vnd.sun.xml.writer

OpenOffice Writer files

application/msword
application/vnd.ms-word

Microsoft Word files

application/mspowerpoint
application/vnd.ms-powerpoint

Microsoft Powerpoint files

application/msexcel
application/vnd.ms-excel

Microsoft Excel files

Support for other formats can be added by implementing a simple interface. Ask on the Daisy Mailing List if you need more information about this.

Index management

Optimizing the index

If you have a lot of documents in the repository, you can speed up fulltext searches by optimizing the index.

This is done as follows.

First go to the JMX console as explained in the JMX console documentation.

Follow the link titled: Daisy:name=FullTextIndexer

Look for the operation named optimizeIndex and invoke it (by pressing the Invoke button).

Afterwards, choose "Return to MBean view". In the IndexerStatus field, you will see an indication that the optimizing of the index is in progress. If you have a very small index, the optimizing might go so fast that it is already finished by the time you get back to that page. On larger indexes, the optimize procedure can take quite a bit of time.

Rebuilding the fulltext index

Rebuilding the fulltext index can be useful in a variety of situations:

  • new or updated text extractors are available, and so you want to reindex the documents
  • something went seriously wrong or you restored a backup and want to be sure the index is complete
  • sometimes installing new Daisy versions requires rebuilding the index

The index can be rebuild for all documents or a selection of the documents.

If you want to completely rebuild the index, you might fist want to delete all the index files, which can be found in:

<daisydata dir>/indexstore

It is harmless to delete these, as the index can be rebuild at any time. Better don't delete them while the repository server is running though.

To trigger the rebuilding, go to the JMX console as explained in the JMX console documentation.

Follow the link titled: Daisy:name=FullTextIndexUpdater

In case you want to rebuild the complete index, invoke the operation named reIndexAllDocuments.

In case you want to rebuild the index only for some documents, you can use the operation reIndexDocuments. As parameter, you need to enter a query to select the documents to reindex. For example, to re-index all documents containing PDFs, you can use:

select id where HasPartWithMimeType('application/pdf')

What you put in the select-clause of the query doesn't matter.

After invoking the reindex operation, choose "Return to MBean view". Look at the attribute ReindexStatus. This will show the progress of the reindexing (refresh the page to see its value being updated). Or more correctly, of scheduling the reindexing. It is important that this ends completely before the repository server is stopped, otherwise the reindexing will not happen completely.

If you have a large repository, the ReindexStatus might show a long time the message "Querying the repository to retrieve the list of documents to re-index". This is because after just starting the repository, the documents still need to be loaded into the cache.

Note that the reindexing here only pushes reindex-jobs to the work queue of the fulltext indexer, the reindexing doesn't happen immediately.

To follow up the status of the actual indexing, go again to the start page of the JMX console, by choosing the "Server view" tab.

Over there, follow this link: org.apache.activemq:BrokerName=DaisyJMS,Type=Queue,Destination=fullTextIndexerJobs

Look for the attribute named QueueSize. This indicates the amount of jobs waiting for the fulltext indexer to process. Each time you refresh this page, you will see this number go lower (or higher of new jobs are being added faster than they are processed).

If you have a large index, it could be beneficial to optimize it after the reindexing finished, as explained above.

Comments (0)
Advertisement

Daisy hosting, installation, support. Workshops and turnkey Daisy CMS projects. Get Daisy from its creators.

outerthought.org

Downloads provided by

SourceForge.net Logo

Open source stats