Frequently Asked Questions

ht://Dig Copyright © 1995-2000 The ht://Dig Group
Please see the file COPYING for license information.

This FAQ is compiled by the ht://Dig developers and the most recent version is available at <http://www.htdig.org/FAQ.html>. Questions (and answers!) are greatly appreciated. Please send questions and/or answers to the ht://Dig user mailing list at: <htdig@htdig.org>.

Questions

Answers

1. General

1.1. Can I search the internet with ht://Dig?

No, ht://Dig is a system for indexing and searching a small set of sites or an intranet. It is not meant to replace any of the many internet-wide search engines.

1.2. Can I index the internet with ht://Dig?

No, as above, ht://Dig is not meant as an internet-wide search engine. While there is theoretically nothing to stop you from indexing as much as you wish, practical considerations (e.g. time, disk space, memory, etc.) will limit this.

1.3. What's the difference between htdig and ht://Dig?

The complete ht://Dig package consists of several programs, one of which is called "htdig." This program performs the "digging" or indexing of the web pages. Of course, an index doesn't do you much good without programs to sort it, search through it, etc., so "ht://Dig" refers to the package as a whole and "htdig" to the indexing program alone.

1.4. I sent mail to Andrew or Geoff or Gilles, but I never got a response!

Andrew no longer does much work on ht://Dig. He has started a company called Contigo Software and is quite busy with that. To contact any of the current developers, send mail to <htdig3-dev@htdig.org>.

Geoff and Gilles are currently the maintainers of ht://Dig, but they are both volunteers. So while they do read all the e-mail they receive, they may not respond immediately. Questions about ht://Dig in general should be pointed to the <htdig@htdig.org> mailing list.

1.5. I sent a question to the mailing list but I never got a response!

Development of ht://Dig is done by volunteers. Since we all have other jobs, it may take a while before someone gets back to you.

1.6. I have a great idea/patch for ht://Dig!

Great! Development of ht://Dig continues through suggestions and improvements from users. If you have an idea (or even better, a patch), please send it to the ht://Dig mailing list so others can use it. For suggestions on how to submit patches, please check the Guidelines for Patch Submissions. If you'd like to make a feature request, you can do so through the ht://Dig bug database, either from <www.htdig.org> or by sending mail to <bugs@htdig.org>.

1.7. Is ht://Dig Y2K compliant?

ht://Dig should be Y2K compliant since it never stores dates as two-digit years. Under ht://Dig's license (GPL), there is no warranty whatsoever, to the extent permitted by law. If you would like an iron-clad, legally-binding guarantee, feel free to check the source code itself. Versions prior to 3.1.2 did have a problem with the parsing of the Last-Modified header returned by the HTTP server, which caused incorrect dates to be stored for documents modified after February 28, 2000 (yes, it didn't recognize 2000 as a leap year). Versions prior to 3.1.5 didn't correctly handle servers that return two-digit years in the Last-Modified header, for years after 99. These problems are fixed in the current release. If you discover something else, please let us know!
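
For readers curious about the leap-year slip, the Gregorian rule can be sketched in a few lines of shell. This only illustrates the rule; it is not code from ht://Dig, and is_leap is a hypothetical helper name.

```shell
# Gregorian leap-year rule: divisible by 4, except century years,
# unless the century year is also divisible by 400 (as 2000 is).
# is_leap is a hypothetical helper, not part of ht://Dig.
is_leap() {
    if [ $(( $1 % 400 )) -eq 0 ]; then return 0; fi
    if [ $(( $1 % 100 )) -eq 0 ]; then return 1; fi
    [ $(( $1 % 4 )) -eq 0 ]
}

for y in 1900 1996 2000; do
    if is_leap "$y"; then
        echo "$y is a leap year"
    else
        echo "$y is not a leap year"
    fi
done
```

Code that stops at the "century years are not leap years" step gets 2000 wrong, which matches the symptom described above.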

1.8. I think I found a bug. What should I do?

Well, there are probably bugs out there. You have two options for bug-reporting. You can either mail the ht://Dig mailing list at <htdig@htdig.org> or, better yet, report it to the bug database, which ensures it won't become lost amongst all of the other mail on the list. To do this, either follow the link from <www.htdig.org> or send mail to <bugs@htdig.org>. Please try to include as much information as possible, including the version of ht://Dig, the OS, and anything else that might be helpful. Often, running the programs with one "-v" or more (e.g. "-vvv") gives useful debugging information. If you are unsure whether the problem is a bug or a configuration problem, you should discuss the problem on htdig@htdig.org (after carefully reading the FAQ and searching the mail archives, of course) to sort out what it is. The mailing list has a wider audience, so you're more likely to get help with configuration problems there than by reporting them to the bug database.

1.9. Does ht://Dig support phrase or near matching?

Phrase searching has been added for the 3.2 release, which is currently in the beta phase (3.2.0b2 as of this writing).

1.10. What are the practical and/or theoretical limits of ht://Dig?

The code itself doesn't put any real limit on the number of pages. There are several sites in the hundreds of thousands of pages. As for practical limits, it depends a lot on how many pages you plan on indexing. Some operating systems limit files to 2 GB in size, which can become a problem with a large database. There are also slightly different limits to each of the programs. Right now htmerge performs a sort on the words indexed. Most sort programs use a fair amount of RAM and temporary disk space as they assemble the sorted list. The htdig program stores a fair amount of information about the URLs it visits, in part to only index a page once. This takes a fair amount of RAM. With cheap RAM, it never hurts to throw more memory at indexing larger sites. In a pinch, swap will work, but it obviously really slows things down.

1.11. Do any ISPs offer ht://Dig as part of their web hosting services?

Yes. A list of such ISPs is available at http://www.htdig.org/isp.html.

2. Getting ht://Dig

2.1. What's the latest version of ht://Dig?

The latest version is 3.1.5 as of this writing. A beta version of the 3.2 code, 3.2.0b2, is also available for those who wish to test it. You can find out about the latest version by reading the release notes. Note that if you're running any version older than 3.1.5 on a public web site, you should upgrade immediately, as older versions have a rather serious security hole, explained in detail in this advisory, which was sent to the BugTraq mailing list.

2.2. Are there binary distributions of ht://Dig?

We're trying to get consistent binary distributions for popular platforms. Contributed binary releases will go in http://www.htdig.org/files/binaries/ and contributions may be placed in ftp://ftp.htdig.org/incoming/.
Anyone who would like to make consistent binary distributions of ht://Dig should at least sign up for the htdig3-announce mailing list.

2.3. Are there mirror sites for ht://Dig?

Not at the moment. Currently, there is only the main server at <www.htdig.org>. If you'd be willing to mirror the site, please contact <htdig3-dev@htdig.org>.

2.4. Is ht://Dig available by ftp?

Yes. You can find the current versions and several older versions at <ftp.htdig.org>.

2.5. Are patches around to upgrade between versions?

Most versions are also distributed as a patch to the previous version's source code. The most recent exception to this was version 3.1.0b1. Since this version switched from the GDBM database to DB2, the new database package needed to be shipped with the distribution. This made the potential patch almost as large as the regular distribution. Update patches resumed with version 3.1.0b2.

3. Compiling

3.1. When I compile ht://Dig I get an error about libht.a

This usually indicates that libstdc++ is either not installed or installed incorrectly. To get libstdc++ or any other GNU tool, check ftp://ftp.gnu.org/pub/gnu/

3.2. I get an error about -lg

This is due to a bug in the Makefile.config.in of version 3.1.0b1. Remove all flags "-ggdb" in Makefile.config.in. Then type "./config.status" to rebuild the Makefiles and recompile. This bug is fixed in version 3.1.0b2.

3.3. I'm compiling on Digital Unix and I get messages about "unresolved" and "db_open."

Answer contributed by George Adams <learningapache@my-dejanews.com>

What you're seeing are problems related to the Berkeley DB library. htdig needs a fairly modern version of db, which is why it ships with one that works. (See that -L../db-2.4.14/dist line? That's where htdig's db library is.)
The solution is to modify the c++ command so it explicitly references the correct libdb.a. You can do this by replacing the "-ldb" directive in the c++ command with "../db-2.4.14/dist/libdb.a". This problem has been resolved as of version 3.1.0.

3.4. I'm compiling on FreeBSD and I get lots of messages about '___error' being unresolved.

Answer contributed by Laura Wingerd <laura@perforce.com>
I got a clean build of htdig-3.1.2 on FreeBSD 2.2.8 by taking -D_THREAD_SAFE out of CPPFLAGS, and setting LIBS to null, in db/dist/configure.

3.5. I'm compiling on HP/UX and I get a complaint about "Large Files not supported."

The db/ package, included with ht://Dig, seems to be unable to complete on HP/UX 10.20 in particular. After running the top-level configure script, cd into db/dist and type:

./configure --disable-bigfile

Then continue with the normal compilation.

3.6. I'm compiling on Solaris and when I run the programs I get complaints about not finding libstdc++.

Answer contributed by Adam Rice <adam@newsquest.co.uk>

The problem is that the Solaris loader can't find the library. The best thing to do is set the LD_RUN_PATH variable during compile to the directory where libstdc++.so.2.8.1.1 lives. This tells the linker to search that directory at runtime.
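
A minimal sketch of that approach, assuming a Bourne-style shell and that the library lives in /usr/local/lib (an assumption; substitute the directory on your system):

```shell
# Directory that holds libstdc++.so on this machine. /usr/local/lib is an
# assumed location, not something ht://Dig dictates.
LD_RUN_PATH=/usr/local/lib
export LD_RUN_PATH

# With the variable exported, rebuild from the ht://Dig source directory in
# the same shell so the path is recorded when the programs are linked:
#   make clean
#   make
```

The variable only matters at link time, so it has to be set in the shell that runs the build, not in the web server's environment.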

4. Configuration

4.1. How come I can't index my site?

There are a variety of reasons ht://Dig won't index a site. To get to the bottom of things, it's advisable to turn on some debugging output from the htdig program. When running from the command-line, try "-vvv" in addition to any other flags. This will add debugging output, including the responses from the server.
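
As a rough sketch, the debugging output can be captured to a file so it can be read at leisure or quoted on the mailing list. The configuration file path and log location here are assumptions, and the -c option is used to name the configuration file explicitly.

```shell
# Assumed locations; adjust both to your installation.
CONF=${CONF:-/usr/local/etc/htdig.conf}
LOG=${LOG:-/tmp/htdig-debug.log}

if command -v htdig >/dev/null 2>&1; then
    # -vvv raises the verbosity; -c names the configuration file to use.
    htdig -vvv -c "$CONF" > "$LOG" 2>&1
    # The end of the log usually shows the last server responses and errors.
    tail -n 40 "$LOG"
else
    echo "htdig is not in PATH; run it by its full path instead" >&2
fi
```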

4.2. How can I change the output format of htsearch?

Answer contributed by: Malka Cymbalista <vumalki@ultra1.weizmann.ac.il>

You can change the output format of htsearch by creating different header, footer and result files that specify how you want the output to look. You then create a configuration file that specifies which files to use. In the html document that links to the search, you specify which configuration file to use.

So the configuration file would have the lines:

search_results_header: ${common_dir}/ccheader.html
search_results_footer: ${common_dir}/ccfooter.html
template_map:  Long long builtin-long \
               Short short builtin-short \
               Default default ${common_dir}/ccresult.html
template_name: Default

You would also put into the configuration file any other lines from the default configuration file that apply to htsearch.

The files ${common_dir}/ccheader.html and ${common_dir}/ccfooter.html and ${common_dir}/ccresult.html would be tailored to give the output in the desired format.

Assuming your configuration file is called cc.conf, the html file that links to the search has to set the config parameter equal to cc. The following line would do it:
<input type=hidden name=config value="cc">

4.3. How do I index pages that start with '~'?

ht://Dig should index pages starting with '~' as if it was another web browser. If you are having problems with this, check your server log files to see what file the server is attempting to return.

4.4. Can I use multiple databases?

Yes, though you may find it easier to have one larger database and use restrict or exclude fields on searches. To use multiple databases, you will need a config file for each database. Then each file will set the "database_base" option to change the name of the databases.
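
A sketch of such a setup, written as a shell script that creates two small config files. All paths, URLs and file names are placeholders, and a real config file would also carry the other attributes you normally set.

```shell
# Write one config file per database. CONF_DIR defaults to a scratch
# directory for this sketch; every path and URL below is a placeholder.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

cat > "$CONF_DIR/docs.conf" <<'EOF'
database_dir:   /var/htdig
database_base:  ${database_dir}/docs
start_url:      http://www.example.com/docs/
EOF

cat > "$CONF_DIR/support.conf" <<'EOF'
database_dir:   /var/htdig
database_base:  ${database_dir}/support
start_url:      http://www.example.com/support/
EOF

echo "wrote $CONF_DIR/docs.conf and $CONF_DIR/support.conf"
# Each database is then built through its own config file, e.g.
#   htdig -c "$CONF_DIR/docs.conf" && htmerge -c "$CONF_DIR/docs.conf"
```

On the search side, the form's config parameter selects which configuration file (and so which database) htsearch uses, as in question 4.2.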

4.5. OK, I can use multiple databases. Can I merge them into one?

As of version 3.1.0, you can do this with the -m option to htmerge.

4.6. Wow, ht://Dig eats up a lot of disk space. How can I cut down?

There are several ways to cut down on disk space. One is not to use the "-a" option, which creates work copies of the databases. Naturally this essentially doubles the disk usage. If you don't need to index and search at the same time, you can ignore this flag. Changing configuration variables can also help cut down on disk usage. Decreasing max_head_length and max_meta_description_length will cut down on the size of the excerpts stored (in fact, if you don't have use_meta_description set, you can set max_meta_description_length to 0!). Other techniques include removing the db.wordlist file and adding more words to the bad_words file.
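
For illustration only, the two excerpt-related attributes might be set as follows. The value 200 is an arbitrary example, and the sketch appends to a scratch file by default rather than touching a real configuration file.

```shell
# CONF defaults to a scratch file; point it at your real config file only
# once you are happy with the values.
CONF=${CONF:-$(mktemp)}

cat >> "$CONF" <<'EOF'
# Store shorter excerpts (200 is an arbitrary example value).
max_head_length:              200
# Only sensible if use_meta_description is not set.
max_meta_description_length:  0
EOF

cat "$CONF"
```

Changes like these take effect for documents indexed after the change, so plan on re-running htdig.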

4.7. Can I use SSI or other CGIs in my htsearch results?

Not really. Apache will not parse CGI output for SSI statements (see the Apache FAQ). Thus, the htsearch CGI does not understand SSI markup and cannot include other CGIs. However, it is possible to do it the other way round: you can have the htsearch results included in your dynamic page.

The easiest approach is using SSI with the help of the script_name configuration file attribute. See the contrib/scriptname directory for a small example using SSI.

For CGI and PHP, you need a "wrapper" script to do that. For Perl script examples, see the files in contrib/ewswrap. The PHP guide (see contributed guides) not only describes a wrapper script for PHP, but also offers a step-by-step tutorial on the basics of ht://Dig and is well worth reading. For other alternatives, see question 4.11.

4.8. How do I index Word or PostScript documents?

This must be done with an external parser or converter. A sample of such a parser is the contrib/parse_doc.pl Perl script. It will parse Word, PostScript and PDF documents, when used with the appropriate document to text converters. It uses catdoc to parse Word documents, and ps2ascii to parse PostScript files. The comments in the Perl script indicate where you can obtain these converters.

As of htdig version 3.1.4, you can use an external converter, such as the contrib/conv_doc.pl Perl script, instead of an external parser. This script is simpler to write and maintain than a full external parser, as it just converts input documents to text/plain or text/html, and passes that back to htdig to be parsed. Parsing is more consistent across document types as a result.

The most recent versions of parse_doc.pl and conv_doc.pl are available on our web site.
See below for an example of parse_doc.pl, or see the comments in conv_doc.pl for an example of its usage.

4.9. How do I index PDF files without acroread?

This too can be done with an external parser or converter, in combination with the pdftotext program that is part of the xpdf 0.90 package. A sample of such a parser is the contrib/parse_doc.pl Perl script. It uses pdftotext to parse PDF documents, then processes the text into external parser records. The most recent version of parse_doc.pl is available on our web site.

For example, you could put this in your configuration file:

external_parsers: application/msword /usr/local/bin/parse_doc.pl \
                  application/postscript /usr/local/bin/parse_doc.pl \
                  application/pdf /usr/local/bin/parse_doc.pl

You would also need to configure the script to indicate where all of the document to text converters are installed.

As of htdig version 3.1.4, you can use an external converter, such as the contrib/conv_doc.pl Perl script, also available on our web site, instead of an external parser. This script is simpler, and offers more consistent parsing, because the final work is done by htdig's internal parsers. See the comments inside this script for an example of its usage.

Whether you use this external parser or converter, or acroread with the pdf_parser attribute, be sure to set the max_doc_size attribute to a value larger than the size of your largest PDF file if you want to index PDF files successfully. PDF documents cannot be parsed if they are truncated.

This also raises the questions of why two different methods of indexing PDFs are supported, and which method is preferred. The built-in PDF support, which uses acroread to convert the PDF to PostScript, was the first method which was provided. It had a few problems with it: acroread is not open source, it is not supported on all systems on which ht://Dig can run, and for some PDFs, the PostScript that acroread generated was very difficult to parse into indexable text. Also, the built-in PDF support expected PDF documents to use the same character encoding as is defined in your current locale, which isn't always the case. The external parser, which uses pdftotext, was developed to overcome these problems. xpdf 0.90 is open source, and its pdftotext utility works very well as an indexing tool. It also converts various PDF encodings to the Latin 1 set. It is the opinion of the developers that this is the preferred method. However, some users still prefer to stick with acroread, as it works well for them, and is a little easier to set up if you've already installed Acrobat.

Also, pdftotext still has some difficulty handling text in landscape orientation, even with its new -raw option in 0.90, so if you need to index such text in PDFs, you may still get better results with acroread.

5. Troubleshooting

5.1. I can't seem to index more than X documents in a directory.

This usually has to do with the default document size limit. If you set "max_doc_size" in your config file to something large enough to read in the directory index (try 100000 for 100K), this should fix the problem. Of course, this will require more memory to read the larger file.

5.2. I can't index PDF files.

As above, this usually has to do with the default document size. What happens is ht://Dig will read in part of a PDF file and try to index it. This usually fails. Try setting "max_doc_size" in your config file to a larger value than the size of your largest PDF file.
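
If the PDF files live on a disk you can reach, a small script can find the largest one. This is a sketch under that assumption; largest_pdf_bytes is a hypothetical helper, DOCROOT is an assumed document root, and the 10% headroom is an arbitrary choice.

```shell
# Print the size in bytes of the largest *.pdf file under a directory
# (0 if there are none). largest_pdf_bytes is a hypothetical helper name.
largest_pdf_bytes() {
    find "$1" -type f -name '*.pdf' -exec wc -c {} \; |
        awk '$1 > max { max = $1 } END { print max + 0 }'
}

# DOCROOT is an assumed document root; point it at your own tree.
DOCROOT=${DOCROOT:-/var/www/html}
if [ -d "$DOCROOT" ]; then
    bytes=$(largest_pdf_bytes "$DOCROOT")
    # Add some headroom before using the value in the config file.
    echo "max_doc_size: $((bytes + bytes / 10))"
fi
```

The match on '*.pdf' is case-sensitive, so files named with an upper-case extension would need a second pattern.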

Another common problem is that htdig can't find the acroread program, which it uses to convert PDF files to PostScript. The solution is to obtain and install Adobe Acrobat Reader 3.0, if it's available for your system. You may also need to set the pdf_parser attribute to the correct location and options for acroread.

There is a bug in Adobe Acrobat Reader version 4, in its handling of the -pairs option, which causes a segmentation violation when using it with htdig 3.1.2 or earlier. There is a workaround for this as of version 3.1.3 - you must remove the -pairs option from your pdf_parser definition, if it's there. However, acroread version 4 is still very unstable (on Linux, anyway) so it is not recommended as a PDF parser. An alternative is to use an external parser with the xpdf 0.90 package installed on your system, as described in question 4.9 above.

5.3. When I run "rundig," I get a message about "DATABASE_DIR" not being found.

This is due to a bug in the Makefile.in file in version 3.1.0b1. The easiest fix is to edit the rundig file and change the line "TMPDIR=@DATABASE_DIR@" to set TMPDIR to a directory with a large amount of temporary disk space for htmerge. This bug is fixed in version 3.1.0b2.

5.4. When I run htmerge, it stops with an "out of diskspace" message.

This means that htmerge has run out of temporary disk space for sorting. Either in your "rundig" script (if you run htmerge through that) or before you run htmerge, set the variable TMPDIR to a temp directory with lots of space.
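
For example, in a Bourne-style shell (the directory is only an assumed example; pick a volume that really has the space):

```shell
# A volume with plenty of free space; /var/tmp is an assumed example.
TMPDIR=/var/tmp
export TMPDIR

# Show how much room is available there before starting the merge.
if [ -d "$TMPDIR" ]; then
    df -k "$TMPDIR"
fi

# Then run htmerge (or rundig) from this same shell, for example:
#   htmerge -c /path/to/htdig.conf
```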

5.5. I have problems running rundig from cron under Linux.

This problem commonly occurs on Red Hat Linux 5.0 and 5.1, because of a bug in vixie-cron. It causes htmerge to fail with a "Word sort failed" error. It's fixed in Red Hat 5.2. You can install vixie-cron-3.0.1-26.{arch}.rpm from a 5.2 distribution to fix the problem on 5.0 or 5.1. A quick fix for the problem is to change the first line of rundig to "#!/bin/ash" which will run the script through the ash shell, but this doesn't solve the underlying problem.

5.6. When I run htmerge, it stops with an "Unexpected file type" message.

Often this is because the databases are corrupt. Try removing them and rebuilding. If this doesn't work, some have found that the solution for question 3.2 works for this as well. This should be fixed in version 3.1.0b2.

5.7. When I run htsearch, I get lots of Internal Server Errors (#500).

If you are running under Solaris, see 3.6.
See also question 5.13.

5.8. I'm having problems with indexing words with accented characters.

Most of the time, this is caused by either not setting or incorrectly setting the locale attribute. The default locale for most systems is the "portable" locale, which strips everything down to standard ASCII. Most systems expect something like locale: en_US or locale: fr_FR. Locale files are often found in /usr/share/locale or the $LANGUAGE environment variable. See also question 4.10.

5.9. When I run htmerge, it stops with a "Word sort failed" message.

There are three common causes of this. First of all, the sort program may be running out of temporary file space. Fix this by freeing up some space where sort puts its temporary files, or change the setting of the TMPDIR environment variable to a directory on a volume with more space. A second common problem is on systems with a BSD version of the sort program (such as FreeBSD or NetBSD). This program uses the -T option as a record separator rather than an alternate temporary directory. On these systems, you must remove the TMPDIR environment variable from rundig, or change the code in htmerge/words.cc not to use the -T option. A third cause is the cron program on Red Hat Linux 5.0 or 5.1. (See question 5.5 above.)

5.10. When htsearch has a lot of matches, it runs extremely slowly.

When you run htsearch with no customization, on a large database, and it gets a lot of hits, it tends to take a long time to process those hits. Some users with large databases have reported much higher performance, for searches that yield lots of hits, by setting the backlink_factor attribute in htdig.conf to 0, and sorting by score. The scores calculated this way aren't quite as good, but htsearch can process hits much faster when it doesn't need to look up the db.docdb record for each hit, just to get the backlink count, date or title, either for scoring or for sorting. This affects versions 3.1.0b3 and up. In version 3.2, currently under development, the databases will be structured differently, so it should perform searches more quickly.

5.11. When I run htsearch, it gives me a count of matches, but doesn't list the matching documents.

This most commonly happens when you run htsearch while the database is currently being rebuilt or updated by htdig. If htdig and htmerge have run to completion, and the problem still occurs, this is usually an indication of a corrupted database. If it's finding matches, it's because it found the matching words in db.words.db. However, it isn't finding the document records themselves in db.docdb, which would suggest that either db.docdb, or db.docs.index (which maps document IDs used in db.words.db to URLs used to look up records in db.docdb), is incomplete or messed up. You'll likely need to rebuild your database from scratch if it's corrupted. Older versions of ht://Dig were susceptible to database corruption of this sort. Versions 3.1.2 and later are much more stable.

5.12. I can't seem to index documents with names like left_index.html with htdig.

There is a bug in the implementation of the remove_default_doc attribute in htdig versions 3.1.0, 3.1.1 and 3.1.2, which causes it to match more than it should. The default value for this attribute is "index.html", so any URL in which the filename ends with this string (rather than matches it entirely) will have the filename stripped off. This is fixed in version 3.1.3.
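
The difference between the two behaviors can be illustrated with shell pattern matching. This models the behavior described above and is not the actual htdig code; both function names are hypothetical.

```shell
DEFAULT_DOC=index.html

# Behavior described for 3.1.0 through 3.1.2: any filename that merely
# ENDS with the default name has the filename stripped.
strip_suffix_match() {
    case $1 in
        *"$DEFAULT_DOC") printf '%s\n' "${1%/*}/" ;;
        *)               printf '%s\n' "$1" ;;
    esac
}

# Intended behavior: strip only when the whole filename equals the default.
strip_exact_match() {
    case $1 in
        */"$DEFAULT_DOC") printf '%s\n' "${1%/*}/" ;;
        *)                printf '%s\n' "$1" ;;
    esac
}

strip_suffix_match http://www.example.com/docs/left_index.html
strip_exact_match  http://www.example.com/docs/left_index.html
strip_exact_match  http://www.example.com/docs/index.html
```

The first call loses the filename, which is why left_index.html never gets indexed under its own URL; the second leaves it alone.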

5.13. I get Premature End of Script Headers errors when running htsearch.

This happens when htsearch dies before putting out a "Content-Type" header. If you are running Apache under Solaris, first try the solution described in question 5.7. If that doesn't work, or you're running on another system, try running "htsearch -vvv" directly from the command line to see where and why it's failing. It should prompt you for the search words, as well as the format.
See also questions 5.7 and 5.14.

5.14. I get Segmentation faults when running htdig, htsearch or htfuzzy.

Despite a great deal of debugging of these programs, we haven't been able to completely eliminate all such problems on all platforms. If you're running htsearch or htfuzzy on a BSDI system, a common cause of core dumps is a conflict between the GNU regex code bundled in htdig 3.1.2 and later, and the BSD C or C++ library. The solution is to use the BSD library's own regex code instead, as summarized by Joe Jah:

make clean
Remove references to regex.o from htlib/Makefile.
Remove htlib/regex.h.
Remove references to htlib/regex.h in htfuzzy/Makefile, which will be there if you have previously done a "make depend".
make

This solution may work on some other platforms as well (we haven't heard one way or the other), but will definitely not work on some platforms. For instance, on libc5-based Linux systems, the bundled regex code works fine by default, but using libc5's regex code causes core dumps.

Users of Cobalt Raq or Qube servers have complained of segmentation faults in htdig. Apparently this is due to problems in their C++ libraries, which are fixed in their experimental compiler and libraries. The following commands should install the packages you need:

rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/binutils-2.8.1-3C1.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-1.0.2-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-c++-1.0.2-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-g77-1.0.2-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-objc-1.0.2-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-2.8.0-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-devel-2.8.0-9.mips.rpm
rpm -Uvh --force ftp://ftp.cobaltnet.com/pub/products/current/RPMS/gcc-2.7.2-C2.mips.rpm

You may have to remove the libg++ package, if you have it installed, before installing libstdc++, because of conflicts between these packages. Be sure to do a "make clean" before a "make", to remove any object files compiled with the old compiler and headers.

For other causes of segmentation faults, or in other programs, getting a stack backtrace after the fault can be useful in narrowing down the problem. E.g.: try "gdb /path/to/htsearch /path/to/core", then enter the command "bt". You can also try running the program directly under the debugger, rather than attempting a post-mortem analysis of the core dump. Options to the program can be given on gdb's "run" command, and after the program is suspended on fault, you can use the "bt" command. This may give you enough information to find and fix the problem yourself, or at least it may help others on the htdig mailing list to point out what to do next.

5.15. Why does htdig 3.1.3 mangle URL parameters that contain bare "&" characters?

This is a known bug in 3.1.3, and is fixed with this patch. You can apply the patch by entering into the main source directory for htdig-3.1.3, and using the command "patch -p1 < /path/to/htdig-3.1.3-urlparmbug.patch". This is also fixed as of version 3.1.4.

5.16. When I run htmerge, it stops with an "Unable to open word list file '.../db.wordlist'" message.

The most common cause of this error is that htdig did not manage to index any documents, and so it did not create a word list. You should repeat the htdig or rundig command with the -vvv option to see where and why it is failing. See question 4.1.

5.17. When using Netscape, htsearch always returns the "No match" page.

Check your search form. Chances are there is a hidden input field with no value defined. For example, one user had
<input type=hidden name=restrict>
in his search form, instead of
<input type=hidden name=restrict value="">
The problem is that Netscape sets the missing value to a default of " " (two spaces), rather than an empty string. For the restrict parameter, this is a problem, because htsearch won't likely find any URLs with two spaces in them. Other input parameters may similarly pose a problem.
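
A quick way to spot such fields is to search the form for hidden inputs that lack a value attribute. This is a line-based heuristic that assumes each input tag sits on a single line; find_valueless_hidden is a hypothetical helper name.

```shell
# List hidden <input> lines that carry no value attribute, with line numbers.
# Heuristic only: a tag split across lines, or two tags on one line, can
# fool it. find_valueless_hidden is a hypothetical helper name.
find_valueless_hidden() {
    grep -inE '<input[^>]*type="?hidden' "$1" | grep -vi 'value='
}

# Usage, with search.html standing in for your own form file:
#   find_valueless_hidden search.html
```

Any line it prints is a candidate for adding value="" to the tag.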

5.18. Why doesn't htdig follow links to other pages in JavaScript code?

There probably isn't any indexing tool in existence that follows JavaScript links, because such tools don't know how to initiate JavaScript events. Realistically, it would take a full JavaScript parser in order to be able to figure out all the possible URLs that the code could generate, something that's way beyond the means of any search engine. You have a few options:

Add "backup" links using plain HTML <a href=...> tags to all the pages that could be accessed through JavaScript,
Add <link> tags to point to all these pages (requires htdig 3.1.3 or greater, but then everyone should be running 3.1.5 anyway),
Compose a list of all the unreachable documents, or write a program to do so, and feed that list as part of htdig's start_url attribute.
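
For the last option, a small script can build the list from the files on disk. This sketch lists every HTML file under a document root rather than only the unreachable ones, which should be harmless since htdig only indexes a page once. list_urls is a hypothetical helper, and the paths in the usage comment are placeholders.

```shell
# Print one URL per line for every *.html file under a document root.
# Arguments: DOCROOT BASE_URL (base URL without a trailing slash).
# list_urls is a hypothetical helper name.
list_urls() {
    ( cd "$1" && find . -type f -name '*.html' | sed "s|^\./|$2/|" | sort )
}

# Usage, with placeholder paths:
#   list_urls /var/www/html http://www.example.com > /var/htdig/url_list
```

The resulting file can then be supplied to start_url. We believe the configuration file syntax accepts a file name in backquotes for this purpose, but check the attribute documentation for your version before relying on it.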

5.19. When I run htsearch from the web server, it returns a bunch of binary data.

Your server is returning the contents of the htsearch binary. Common causes of this are:

no execute permission on the htsearch binary,
the binary won't run on this system (it may be compiled for the wrong system type), or
the web server doesn't recognize the file as a CGI (for Apache, you must have a ScriptAlias directive for the program or the directory in which it's installed, or define a cgi-script handler for some suffix, e.g. .cgi, and add that suffix to the program file name).

By default, Apache is usually configured with one cgi-bin directory as ScriptAlias, so all your CGI programs must go in there, or have a .cgi suffix on them. Your configuration may differ, however.
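
The first cause in the list above is easy to check from the shell. This sketch tests the execute permission for the user running the script, which may differ from the user the web server runs as; check_cgi is a hypothetical helper and the path in the usage comment is a placeholder.

```shell
# Report whether a CGI program exists and is executable by the current user.
# check_cgi is a hypothetical helper name.
check_cgi() {
    if [ ! -f "$1" ]; then
        echo "missing: $1"
        return 1
    fi
    if [ ! -x "$1" ]; then
        echo "not executable: $1"
        return 1
    fi
    echo "executable: $1"
}

# Usage, with a placeholder path:
#   check_cgi /path/to/cgi-bin/htsearch
```

If the file is executable and the problem persists, running it by hand from the command line shows whether the binary runs at all on this system.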

Last modified: $Date: 2000/04/11 03:17:59 $