Path and URL Options

-auth

Syntax: -auth path_and_filename

Specifies an authorization file to support authentication for secure paths.


Note

There must be a corresponding "Authfile=" entry in the Information Server configuration file, inetsrch.ini, so that documents can be accessed for viewing. Both -auth and Authfile= must point to the same file.
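
For example, a crawl of a secured area might look like the following sketch, assuming the spider executable is named vspider and the collection is named with a -collection option (both assumptions; neither is documented in this section, and the authorization file path is hypothetical):

vspider -collection coll1 -start http://server.example.com/secure/ -auth c:\verity\data\authfile.dat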


-cgiok

Type: Web crawling only.

Allows indexing of URLs that contain the ? symbol, which typically indicates that the URL leads to a CGI script or similar processing program.

The document returned by the Web server is indexed and parsed for links, which are followed and in turn indexed and parsed. However, if the Web server does not return a page, perhaps because the URL is missing parameters required to produce one, there is nothing to index or parse.

Example

A URL without parameters is:

http://server.com/cgi-bin/program?

If you include parameters in the URL to be indexed, as specified with the -start option, then those parameters are processed and any resulting pages are indexed and parsed.
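
A full invocation might look like this sketch, assuming the spider executable is named vspider and a -collection option names the target collection (both assumptions; the query parameters are hypothetical):

vspider -collection coll1 -cgiok -start "http://server.com/cgi-bin/program?dept=sales&region=west"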

By default, URLs with ? symbols are skipped.

-domain

Syntax: -domain name_1 [name_n] ...

Type: Web crawling only.

Limits indexing to the specified domain(s). You must use only complete text strings for domains. You may not use wildcard expressions. URLs not in the specified domain(s) will not be downloaded or parsed.

You may list multiple domains by separating each one with a single space.
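
For example, the following sketch (the vspider executable name and -collection option are assumptions) limits the crawl to two domains:

vspider -collection coll1 -start http://www.spider.com/ -domain spider.com verity.com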


Note

You must have the appropriate Verity Spider licensing capability to use this option.


-followdup

Specifies that Verity Spider follows links within duplicate documents, although only the first instance of any duplicate document is indexed.

You may find this option useful if you use the same home page on multiple sites. By default, only the first instance of the document is indexed and subsequent instances are skipped. If the different sites have different secondary documents, -followdup allows Verity Spider to reach them for indexing while still indexing the common home page only once.
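
A sketch for two sites that share a home page (the vspider executable name, the -collection option, and the use of multiple starting points are assumptions):

vspider -collection coll1 -followdup -start http://sales.verity.com/ http://marketing.verity.com/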

-followsymlink

Type: File system only.

Specifies that Verity Spider follows symbolic links when indexing UNIX file systems.
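
A file system sketch (the vspider executable name, the -collection option, the directory path, and the use of -start for a file system starting point are all assumptions):

vspider -collection coll1 -start /usr/local/docs -followsymlink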

-host

Syntax: -host name_1 [name_n] ...

Type: Web crawling only.

Limits indexing to the specified host or hosts. You must use only complete text strings for hosts. You may not use wildcard expressions.

You may list multiple hosts by separating each one with a single space. URLs not on the specified host(s) will not be downloaded or parsed.
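
For example (the vspider executable name and -collection option are assumptions):

vspider -collection coll1 -start http://marketing.verity.com/ -host marketing.verity.com sales.verity.com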

-https

Type: Web crawling only.

Allows the indexing of SSL-enabled Web sites.


Note

You must have the Verity SSL Option Pack installed to use -https. The Verity SSL Option Pack is a Verity Spider add-on available separately from a Verity salesperson.
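
With the SSL Option Pack installed, an invocation might look like this sketch (the vspider executable name, -collection option, and site URL are assumptions):

vspider -collection coll1 -https -start https://secure.example.com/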


-jumps

Syntax: -jumps num_jumps

Type: Web crawling only.

Specifies the maximum number of levels deep an indexing job can go from the starting URL. Specify a number between 0 and 254.

The default value is unlimited. If a collection contains far more documents than you expect, consider experimenting with this option, in conjunction with the Content options, to pare down the collection.
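
For example, to crawl no more than three levels from the starting URL (the vspider executable name and -collection option are assumptions):

vspider -collection coll1 -start http://www.spider.com/ -jumps 3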

-nodocrobo

Specifies that ROBOT META tag directives are ignored.

In HTML 3.0 and earlier, robot directives could only be given as the file robots.txt under the root directory of a Web site. In HTML 4.0, every document can have robot directives embedded in the META field. Use this option to ignore them. This option should, of course, be used with discretion.
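
A sketch (the vspider executable name and -collection option are assumptions):

vspider -collection coll1 -start http://www.spider.com/ -nodocrobo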

See also -norobo and http://www.w3c.org/TR/REC-html40/html40.txt.

-nofollow

Syntax: -nofollow "exp"

Type: Web crawling only.

Specifies that Verity Spider does not follow any URLs that match the expression exp. If you do not specify an exp value for -nofollow, Verity Spider assumes a value of "*", meaning that no documents are followed.

You can use wildcard expressions, where the asterisk ( * ) matches text strings and the question mark ( ? ) matches single characters. Always enclose exp values in double quotation marks to ensure they are interpreted correctly.

If you use backslashes, you must double them so they are properly escaped. For example:

C:\\test\\docs\\path

To use regular expressions, also specify the -regexp option.

Previous versions of the Verity Spider did not allow the use of an expression. This meant that for each starting point URL, only the first document would be indexed. With the addition of the expression functionality, you can now selectively skip URLs even within documents.
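
For example, this sketch (the vspider executable name and -collection option are assumptions) follows links everywhere except into a hypothetical /archive/ branch:

vspider -collection coll1 -start http://www.spider.com/ -nofollow "*/archive/*"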

See also -regexp

-norobo

Type: Web crawling only.

Specifies that any robots.txt files encountered are ignored. The robots.txt file is used on many Web sites to specify what parts of the site indexers should avoid. The default is to honor any robots.txt files.

If you are re-indexing a site and robots.txt has changed, the Verity Spider will delete documents that have been newly disallowed by robots.txt.

This option should, of course, be used with discretion and extreme care, especially in conjunction with -cgiok.
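
A sketch (the vspider executable name and -collection option are assumptions):

vspider -collection coll1 -start http://www.spider.com/ -norobo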

See also -nodocrobo and http://info.webcrawler.com/mak/projects/robots/norobots.html.

-pathlen

Syntax: -pathlen num_pathsegments

Limits indexing to the specified number of path segments in the URL or file system path. The path length is determined as follows:

The host name and drive letter are not included. For example, neither www.spider.com:80/ nor C:\ would be included in determining the path length.

All elements following the host name are included.

The actual filename, if present, is included. For example, /world.html would be included in determining the path length.

Any directory paths between the host and the actual filename are included.

Example

For the following URL, the path length would be 4:

http://www.spider.com:80/comics/fun/funny/world.html
                         <-1--> <2> <-3-> <---4---->

For the following file system path, the path length would be 3:

C:\files\docs\datasheets
   <-1-> <-2> <---3---->

The default value is 100 path segments.
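
For example, to index nothing deeper than four path segments (the vspider executable name and -collection option are assumptions):

vspider -collection coll1 -start http://www.spider.com/ -pathlen 4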

-refreshtime

Syntax: -refreshtime timeunits

Specifies that documents which have been indexed within the specified timeunits period are not refreshed.

The syntax for timeunits is:

n day n hour n min n sec

Where n is a positive integer. Note that there must be spaces between each number and its time unit. Because only the first three letters of each time unit are parsed, you can use either the singular or plural form.

If you specify:

-refreshtime 1 day 6 hours

Only those documents that were last indexed more than 30 hours ago are refreshed.
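
The full invocation might look like this sketch (the vspider executable name and -collection option are assumptions; -refresh is required, as noted below):

vspider -collection coll1 -start http://www.spider.com/ -refresh -refreshtime 1 day 6 hours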


Note

This option is valid only with the -refresh option. When you use vsdb -recreate, the last indexed date is cleared.


-reparse

Type: Web crawling only.

Forces parsing of all HTML documents already in the collection. You must specify a starting point with the -start option when you use -reparse.

You can use -reparse when you want to include paths and documents that were previously skipped due to exclusion or inclusion criteria. Remember to change the criteria; otherwise there will be little for the Verity Spider to do. This is easy to overlook when you are using -cmdfile.
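
A sketch (the vspider executable name and -collection option are assumptions):

vspider -collection coll1 -start http://www.spider.com/ -reparse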

-unlimited

Specifies that no limits are placed on the Verity Spider when neither -host nor -domain is specified. By default, crawling is limited to the host of the first starting point listed.
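
A sketch (the vspider executable name and -collection option are assumptions):

vspider -collection coll1 -start http://www.spider.com/ -unlimited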

-virtualhost

Syntax: -virtualhost name_1 [name_n] ...

Specifies that DNS lookups are avoided for the hosts listed. You must use only complete text strings for hosts. You may not use wildcard expressions. This option allows you to index by alias, such as when multiple Web servers are running on the same host.

Normally, when Verity Spider resolves host names, it uses DNS lookups to convert the names to canonical names, of which there can be only one per machine. This allows duplicate documents to be detected, so that search results are not diluted. With multiple aliased hosts, however, documents can be referred to by more than one alias and yet remain distinct, because the alias names differ.

Example

You may have both marketing.verity.com and sales.verity.com running on the same host. Each alias has a different document root, although document names such as index.htm may occur for both. With -virtualhost, both server aliases can be indexed as distinct sites. Without -virtualhost, they would both be resolved to the same host name and only the first document encountered from any duplicate pair would be indexed.
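
A sketch for that scenario (the vspider executable name and -collection option are assumptions):

vspider -collection coll1 -start http://marketing.verity.com/ http://sales.verity.com/ -virtualhost marketing.verity.com sales.verity.com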

Warning! If you are using Netscape Enterprise Server, and you have specified only the host name as a virtual host, then Verity Spider will not be able to index the virtual host site. This is because the Verity Spider always adds the domain name to the document key.


