ht://Dig Copyright © 1995-2000 The ht://Dig Group
Please see the file COPYING for
license information.
accents_db: | ${database_base}.uml.db |
add_anchors_to_excerpt: | no |
<SELECT NAME="search_algorithm"> |
allow_in_form: | search_algorithm search_results_header |
allow_numbers: | true |
allow_virtual_hosts: | false |
any_keywords: | yes |
authorization: | myusername:mypassword |
backlink_factor: | 501.1 |
bad_extensions: | .foo .bar .bad |
bad_querystr: | forum=private section=topsecret&passwd=required |
contrib/examples
directory.
bad_word_list: | ${common_dir}/badwords.txt |
The default value of this attribute is determined at compile time.
bin_dir: | /usr/local/bin |
build_select_lists: |
MATCH_LIST matchesperpage matches_per_page_list \ 1 1 1 matches_per_page "Previous Amount" \ RESTRICT_LIST restrict restrict_names 2 1 2 restrict "" \ FORMAT_LIST format template_map 3 2 1 template_name "" |
case_sensitive: | false |
collection_names: | htdig_docs htdig_bugs |
common_dir: | /tmp |
common_url_parts: |
http://www.htdig.org/ml/ \ .html \ http://dev.htdig.org/ \ http://www.htdig.org/ |
compression_level: | 6 |
The default value of this attribute is determined at compile time.
config_dir: | /var/htdig/conf |
sort -u
to get a unique list.
create_image_list: | yes |
sort -u
to get a unique list.
create_url_list: | yes |
database_base: | ${database_dir}/sales |
The default value of this attribute is determined at compile time.
database_dir: | /var/htdig |
date_factor: | 0.35 |
date_format: | %Y-%m-%d |
description_factor: | 350 |
doc_db: | ${database_base}documents.db |
doc_excerpt: | ${database_base}excerpts.db |
doc_index: | documents.index.db |
doc_list: | /tmp/documents.text |
end_ellipses: | ... |
end_highlight: | </font> |
endings_affix_file: | /var/htdig/affix_rules |
endings_dictionary: | /var/htdig/dictionary |
endings_root2word_db: | /var/htdig/r2w.db |
endings_word2root_db: | /var/htdig/w2r.bm |
excerpt_length: | 500 |
excerpt_show_top: | yes |
exclude_urls: | students.html cgi-bin |
The parser program takes four command-line
parameters, not counting any parameters already
given in the command string:
infile content-type URL configuration-file
Parameter | Description | Example |
---|---|---|
infile | A temporary file with the contents to be parsed. | /var/tmp/htdext.14242 |
content-type | The MIME-type of the contents. | text/html |
URL | The URL of the contents. | http://www.htdig.org/attrs.html |
configuration-file | The configuration-file in effect. | /etc/htdig/htdig.conf |
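As an illustration of the parameter list above, here is a minimal Python sketch of how a parser script might pick up its four parameters. The names `ParserArgs` and `parse_parser_args` are hypothetical. The sketch also assumes that htdig appends the four parameters after any options already present in the command string, so it reads the last four arguments.

```python
from typing import NamedTuple, Sequence


class ParserArgs(NamedTuple):
    """The four parameters htdig passes to an external parser (hypothetical type)."""
    infile: str        # temporary file holding the contents to parse
    content_type: str  # MIME type of the contents
    url: str           # URL of the contents
    config_file: str   # configuration file in effect


def parse_parser_args(argv: Sequence[str]) -> ParserArgs:
    """Extract the four trailing parameters from an argv-style list."""
    # argv[0] is the program name, so at least five entries are required.
    if len(argv) < 5:
        raise ValueError("expected: infile content-type URL configuration-file")
    # Assumption: options from the command string come first, so take the last four.
    return ParserArgs(*argv[-4:])
```

A real script would call `parse_parser_args(sys.argv)` and then open `infile` for reading.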
The external parser writes information for htdig
to its standard output. Unless it is an external
converter, which outputs a document of a different
content-type, its output must follow the format
described here.
The output consists of records, each terminated
with a newline. Each record is a series of
tab-separated fields, which must be non-empty unless
expressly allowed to be empty. The first field is a
single character that specifies the record type. The
remaining fields are determined by the record type.
Record type | Fields | Description |
---|---|---|
w | word | A word that was found in the document. |
 | location | A number indicating the normalized location of the word within the document. The number has to fall in the range 0-1000, where 0 means the top of the document. |
 | heading level | A heading level that is used to compute the weight of the word depending on its context in the document itself. The level is in the range 0-10. |
u | document URL | A hyperlink to another document that is referenced by the current document. It must be complete and non-relative, using the URL parameter to resolve any relative references found in the document. |
 | hyperlink description | For HTML documents, this would be the text between the <a href...> and </a> tags. |
t | title | The title of the document. |
h | head | The top of the document itself. This is used to build the excerpt. This should only contain normal ASCII text. |
a | anchor | The label that identifies an anchor that can be used as a target in a URL. This really only makes sense for HTML documents. |
i | image URL | A URL that points at an image that is part of the document. |
m | http-equiv | The HTTP-EQUIV attribute of a META tag. May be empty. |
 | name | The NAME attribute of this META tag. May be empty. |
 | contents | The CONTENT attribute of this META tag. May be empty. |
external_parsers: |
text/html /usr/local/bin/htmlparser \ application/pdf /usr/local/bin/parse_doc.pl \ application/msword->text/plain "/usr/local/bin/mswordtotxt -w" \ application/x-gunzip->user-defined /usr/local/bin/ungzipper |
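The record format described above can be sketched in Python as follows. This is a minimal illustration for a plain-text document, not the parser shipped with ht://Dig. The function name `parser_records` is hypothetical, the word pattern is a simplification, and heading level 0 is assumed to denote ordinary body text.

```python
import re
from typing import Iterator


def parser_records(text: str, title: str) -> Iterator[str]:
    """Yield tab-separated records, one per output line, for plain text."""
    # Collapse whitespace so the title cannot contain a tab or newline.
    yield "t\t" + " ".join(title.split())
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        # Normalized location: 0 is the top of the document, below 1000 at the end.
        location = match.start() * 1000 // max(len(text), 1)
        # Assumption: heading level 0 marks ordinary body text.
        yield f"w\t{match.group()}\t{location}\t0"
    # The head record feeds the excerpt; keep it to ASCII on a single line.
    head = " ".join(text.split())[:100]
    head = head.encode("ascii", "ignore").decode("ascii")
    if head:
        yield "h\t" + head
```

A parser script would print each yielded record on its own line to standard output, for example with `print("\n".join(parser_records(text, title)))`.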
Parameter | Description | Example |
---|---|---|
protocol | The URL scheme to be used. | https |
URL | The URL to be retrieved. | https://www.htdig.org:8008/attrs.html |
configuration-file | The configuration-file in effect. | /etc/htdig/htdig.conf |
The external protocol script writes information for htdig to its standard output. The output must follow the form described here: a header, followed by a blank line, followed by the contents of the document. Each record in the header is terminated with a newline. Each record is a series of tab-separated fields, which must be non-empty unless expressly allowed to be empty. The first field is a single character that specifies the record type. The remaining fields are determined by the record type.
Record type | Fields | Description |
---|---|---|
s | status code | An HTTP-style status code, e.g. 200, 404. |
r | reason | A text string describing the status code, e.g. "Redirect" or "Not Found". |
m | modification time | The modification time of this document. While the code is fairly flexible about the time/date formats it accepts, it is recommended to use something standard, like RFC1123: Sun, 06 Nov 1994 08:49:37 GMT, or ISO-8601: 1994-11-06 08:49:37 GMT. |
t | content-type | A valid MIME type for the document, like text/html or text/plain. |
l | content-length | The length of the document on the server, which may not necessarily be the length of the buffer returned. |
u | url | The URL of the document, or in the case of a redirect, the URL that should be indexed as a result of the redirect. |
external_protocols: |
https /usr/local/bin/handler.pl \ ftp /usr/local/bin/ftp-handler.pl |
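The header-plus-body output described above can be sketched in Python as follows. This is only an illustration of the output layout, not a working protocol handler: the function name `protocol_response` and its parameters are hypothetical, and fetching the document is left out. The modification time is written in RFC1123 form using the standard library's `email.utils.formatdate`.

```python
from email.utils import formatdate


def protocol_response(url: str, body: str, content_type: str = "text/html",
                      mtime: float = 0.0, status: int = 200,
                      reason: str = "OK") -> str:
    """Build the text an external protocol script writes to standard output."""
    header = [
        f"s\t{status}",                           # HTTP-style status code
        f"r\t{reason}",                           # text describing the status
        f"m\t{formatdate(mtime, usegmt=True)}",   # RFC1123 modification time
        f"t\t{content_type}",                     # MIME type of the document
        f"l\t{len(body.encode('utf-8'))}",        # length in bytes (UTF-8 assumed)
        f"u\t{url}",                              # URL to index
    ]
    # Header records, then a blank line, then the document contents.
    return "\n".join(header) + "\n\n" + body
```

A handler script would write the returned string to standard output after retrieving the document by whatever means the protocol requires.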
extra_word_characters: | _ |
head_before_get: | true |
heading_factor: | 20 |
htnotify_sender: | bigboss@yourcompany.com |
http_proxy: | http://proxy.bigbucks.com:3128 |
http_proxy_exclude: | http://intranet.foo.com/ |
sort -u
on the file to
eliminate duplicates from the file.
image_list: | allimages |
The default value of this attribute is determined at compile time.
image_url_prefix: | /images/htdig |
include: | ${config_dir}/htdig.conf |
iso_8601: | true |
keywords_factor: | 12 |
<META name="somename" content="somevalue">
keywords_meta_tag_names: | keywords description |
limit_normalized: | http://www.mydomain.com |
http://.
limit_urls_to: | .sdsu.edu kpbs [.*\.html] |
local_default_doc: | default.html default.htm index.html index.htm |
local_urls: | http://www.foo.com/=/usr/www/htdocs/ |
local_urls_only: | true |
local_user_urls: | http://www.my.org/=/home/,/www/ |
locale: | en_US |
logging: | true |
maintainer: | ben.dover@uptight.com |
match_method: | boolean |
matches_per_page: | 999 |
max_connection_requests: | 100 |
max_description_length: | 40 |
max_descriptions: | 15 |
max_doc_size: | 5000000 |
max_head_length: | 50000 |
max_hop_count: | 4 |
max_keywords: | 10 |
max_meta_description_length: | 1000 |
max_prefix_matches: | 100 |
max_retries: | 6 |
max_stars: | 6 |
maximum_pages: | 20 |
maximum_word_length: | 15 |
meta_description_factor: | 20 |
metaphone_db: | ${database_base}.mp.db |
method_names: | or Or and And |
mime_types: | /etc/mime.types |
minimum_prefix_length: | 2 |
minimum_speling_length: | 3 |
minimum_word_length: | 2 |
next_page_text: | <img src="/htdig/buttonr.gif"> |
no_excerpt_show_top: | yes |
no_excerpt_text: |
no_next_page_text: |
no_page_list_header: | <hr noshade size=2>All results on this page.<br> |
no_page_number_text: |
<strong>1</strong> <strong>2</strong> \ <strong>3</strong> <strong>4</strong> \ <strong>5</strong> <strong>6</strong> \ <strong>7</strong> <strong>8</strong> \ <strong>9</strong> <strong>10</strong> |
no_prev_page_text: |
no_title_text: | "No Title Found" |
noindex_end: | </SCRIPT> |
noindex_start: | <SCRIPT |
HTML text to display when no matches were found.
The file should contain a complete HTML document.
nothing_found_file: | /www/searching/nothing.html |
nph: | true |
page_list_header: |
page_number_separator: | "</td> <td>" |
page_number_text: |
<em>1</em> <em>2</em> \ <em>3</em> <em>4</em> \ <em>5</em> <em>6</em> \ <em>7</em> <em>8</em> \ <em>9</em> <em>10</em> |
persistent_connections: | false |
plural_suffix: | en |
prefix_match_character: | ing |
prev_page_text: | <img src="/htdig/buttonl.gif"> |
regex_max_words: | 10 |
remove_bad_urls: | true |
remove_default_doc: | default.html default.htm index.html index.htm |
remove_unretrieved_urls: | true |
robotstxt_name: | myhtdig |
contrib/scriptname
directory for a small example. Note that this
attribute also affects the value of the CGI variable
used in htsearch templates.
script_name: | /search/results.shtml |
search_algorithm: | exact:1 soundex:0.3 |
search_results_footer: | /usr/local/etc/ht/end-stuff.html |
search_results_header: | /usr/local/etc/ht/start-stuff.html |
search_results_order: | /docs/|faq.html * /maillist/ /testresults/ |
search_results_wrapper: | ${common_dir}/wrapper.html |
server_aliases: |
foo.mydomain.com:80=www.mydomain.com:80 \ bar.mydomain.com:80=www.mydomain.com:80 |
server_max_docs: | 50 |
server_wait_time: | 20 |
sort: | revtime |
sort_names: |
score 'Best Match' time Newest title A-Z \ revscore 'Worst Match' revtime Oldest revtitle Z-A |
soundex_db: | ${database_base}.snd.db |
star_blank: | http://www.somewhere.org/icons/noelephant.gif |
star_image: | http://www.somewhere.org/icons/elephant.gif |
star_patterns: |
http://www.sdsu.edu /sdsu.gif \ http://www.ucsd.edu /ucsd.gif |
start_ellipses: | ... |
start_highlight: | <font color="#FF0000"> |
start_url: | http://www.somewhere.org/alldata/index.html |
substring_max_words: | 100 |
synonym_db: | ${database_base}.syn.db |
synonym_dictionary: | /usr/dict/synonyms |
syntax_error_file: | ${common_dir}/synerror.html |
tcp_max_retries: | 6 |
No example provided |
template_map: |
Short short ${common_dir}/short.html \ Normal normal builtin-long \ Detailed detail ${common_dir}/detail.html |
template_name: | long |
template_patterns: |
http://www.sdsu.edu ${common_dir}/sdsu.html \ http://www.ucsd.edu ${common_dir}/ucsd.html |
text_factor: | 0 |
timeout: | 42 |
title_factor: | 12 |
url_list: | /tmp/urls |
url_log: | /tmp/htdig.progress |
url_part_aliases: |
http://search.example.com/~htdig *site \ http://www.htdig.org/this/ *1 \ .html *2 |
url_part_aliases: |
http://www.htdig.org/ *site \ http://www.htdig.org/that/ *1 \ .htm *2 |
url_seed_score: |
/mailinglist/ *.5-1e6 /docs/|/news/ *1.5 /testresults/ "*.7 -200" /faq-area/ *2+10000 |
use_doc_date: | true |
use_meta_description: | true |
use_star_image: | no |
user_agent: | htdig-digger |
valid_extensions: | .html .htm .shtml |
Andrew's
the digger will see this as
Andrews.
valid_punctuation: | -' |
version: | 3.2.0 |
word_db: | ${database_base}.allwords.db |
word_dump: | /tmp/words.txt |
wordlist_compress: | true |
wordlist_page_size: | 8192 |
wordlist_cache_size: | 40000000 |
wordlist_compress_debug: | 2 |
No example provided |
No example provided |
No example provided |
wordlist_monitor: | true |
wordlist_monitor_period: | .1 |
wordlist_monitor_fields: | put/s nwalks/s |
wordlist_monitor_output: | file:/home/bosc/trash/wlmonout |