Using external parsers
======================

Since version 2.1, indexer can use external parsers to index different
file types (mime types).

A parser is any executable program that converts one of the mime types
to text/plain or text/html. For example, if you have PostScript files,
you can use the ps2ascii parser (filter), which reads a PostScript file
from stdin and writes ASCII to stdout.

indexer assumes that a parser writes its output to stdout. If yours does
not, you have to write a small shell script that puts the results on
stdout. Please feel free to contribute your scripts and parser
configurations to devel@search.udm.net.

Many parsers cannot read from stdin and require a file instead. In this
case indexer creates a temporary file in /tmp and removes it when the
parser exits. Use the $1 macro in the parser command line to substitute
the file name. For example, the command line for the "catdoc" MS Word to
ASCII converter may look like this:

    /usr/bin/catdoc -a $1

Some parsers produce output in a charset different from that of the
input. Specify the charset to make indexer convert the parser's output
to the proper charset.

The parser command line is optional. Without it, the Mime line only
changes the charset or the mime type. For example, to change the mime
type text/tab-separated-values to text/plain:

    # Note - we do not use parser command line
    Mime text/tab-separated-values text/plain


How to set up parsers
=====================

1. Configure web server
-----------------------

Configure your web server to send the appropriate "Content-Type" header.
For Apache, have a look at the mime.types file; most mime types are
already defined there.

2. Edit indexer.conf
--------------------

Uncomment or add lines with parser definitions. Lines have the following
format:

    # Parser definition format
    Mime <from_mime> <to_mime>[;charset] ["command line [$1]"]
          |           |         |          |             |
          |           |         |          |             `- temporary file name
          |           |         |          `- full UNIX command line
          |           |         `- parser's output character set
          |           `- output mime type: text/plain or text/html
          `- source mime type

For example, the following line defines a parser for man pages:

    # I use deroff for parsing man pages ( *.man )
    Mime application/x-troff-man text/plain "deroff"

One more example:

    # I like catdoc, but sometimes it produces garbage.
    Mime application/msword text/plain;cp1251 "catdoc -a $1"
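
The text above notes that a parser which does not write to stdout needs
a small shell script around it. The following is only a minimal sketch
of such a wrapper, not a script shipped with indexer. It assumes a
hypothetical converter that is called as "converter input output" and
writes its result to the output file. The function name to_stdout is
made up for illustration, and mktemp is assumed to be available on your
system.

```shell
#!/bin/sh
# Sketch of a wrapper for a converter that writes to a file instead of
# stdout. Assumption: the converter is invoked as
#     converter <input-file> <output-file>
# to_stdout is a hypothetical helper name, not part of indexer.

to_stdout() {
    converter=$1
    input=$2

    # Scratch file for the converter's result.
    output=$(mktemp) || return 1

    if "$converter" "$input" "$output"; then
        # Copy the result to stdout, where indexer reads it.
        cat "$output"
        status=0
    else
        status=1
    fi

    # Clean up the scratch file whether or not the converter succeeded.
    rm -f "$output"
    return $status
}

# When run as a script: first argument is the converter, second is the
# file name that indexer substitutes for $1.
if [ "$#" -ge 2 ]; then
    to_stdout "$1" "$2"
fi
```

If the sketch is saved as an executable script, a parser line could then
refer to it. The mime type and both paths below are placeholders:

    # Hypothetical names, adjust to your converter
    Mime application/x-example text/plain "/usr/local/bin/to_stdout.sh /usr/local/bin/example2txt $1"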