Processing Options

-abspath

Type: File system only

Generates absolute paths for files. Use this option when the document locations are not going to change, but the collection might be moved around.

When you index a Web server's contents through the file system, you should use -prefixmap with -abspath to map the absolute filepaths to URLs.

See also -prefixmap.

-detectdupfile

Type: File system only

Enables checksum-based detection of duplicates when indexing file systems.

By default, a document checksum is not computed on indexed files. By using -detectdupfile, a checksum is computed based on the CRC-32 algorithm. The checksum combined with the document size is used to determine if the document is a duplicate.
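The checksum-plus-size comparison described above can be illustrated with a short Python sketch. This is not Verity Spider's implementation; the function names and sample paths are hypothetical, and the only details taken from the text are the CRC-32 checksum and the document size.

```python
import zlib


def fingerprint(data: bytes) -> tuple[int, int]:
    """Return (CRC-32 checksum, size in bytes) for a document's content."""
    return zlib.crc32(data) & 0xFFFFFFFF, len(data)


def find_duplicates(docs: dict[str, bytes]) -> dict[str, str]:
    """Map each duplicate path to the first path seen with the same fingerprint."""
    seen: dict[tuple[int, int], str] = {}
    duplicates: dict[str, str] = {}
    for path, data in docs.items():
        key = fingerprint(data)
        if key in seen:
            duplicates[path] = seen[key]
        else:
            seen[key] = path
    return duplicates


# Hypothetical sample documents, keyed by file path.
docs = {
    "/docs/a.txt": b"hello world",
    "/docs/b.txt": b"hello world",    # same bytes as a.txt
    "/docs/c.txt": b"hello there!",   # different size, so never a duplicate
}
print(find_duplicates(docs))
```

Pairing the size with the checksum lowers the chance that two different documents are treated as duplicates, because both values would have to coincide.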

-indexers

Syntax: -indexers num_indexers

Specifies the maximum number of indexing threads to run on a collection.

The default value is 2. Note that increasing the value for -indexers requires additional CPU and memory resources.

See also -maxindmem.

-license

Syntax: -license path_and_filename

Specifies the license file to use. By default, ind.lic is used, from:

verity/prdname/platform/admin/

Where verity/prdname is the user-definable portion of the installation directory, and platform represents the platform directory.

-maxindmem

Syntax: -maxindmem kilobytes

Specifies the maximum amount of memory, in kilobytes, used by each indexing thread. The number of threads is specified with -indexers.

By default, each indexing thread uses as much memory as is available from the system.
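As a rough planning aid, the combined ceiling for all indexing threads is the product of the two option values. The Python sketch below assumes only what is stated above (a per-thread limit in kilobytes and a thread count); the function name and the sample values are hypothetical, and the result does not account for memory used by the rest of the process.

```python
def indexing_memory_ceiling_kb(num_indexers: int, maxindmem_kb: int) -> int:
    """Upper bound, in kilobytes, on memory used by all indexing threads combined."""
    if num_indexers < 1 or maxindmem_kb < 1:
        raise ValueError("both values must be positive")
    return num_indexers * maxindmem_kb


# Hypothetical values: -indexers 2 -maxindmem 65536
print(indexing_memory_ceiling_kb(2, 65536))
```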

-maxnumdoc

Syntax: -maxnumdoc num_docs

Specifies the maximum number of documents to be downloaded or submitted for indexing. The value for num_docs does not necessarily correspond exactly to the number of documents indexed. The following factors affect the actual number:

Whether the value of num_docs falls within a block of documents dictated by -submitsize. If it does, the entire block of documents must be processed.

Whether the retrieved documents are actually indexed. Documents that are invalid or corrupt are not indexed.
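The first factor can be estimated with simple arithmetic. The Python sketch below assumes that documents are submitted in consecutive blocks of exactly -submitsize documents and that the block containing num_docs is completed; this is an interpretation of the description above, not documented behavior. The function name is hypothetical, and the default of 128 is the -submitsize default given in this section.

```python
def max_docs_processed(num_docs: int, submit_size: int = 128) -> int:
    """Upper bound on documents processed, assuming whole blocks are completed."""
    if num_docs < 1 or submit_size < 1:
        raise ValueError("both values must be positive")
    # Round num_docs up to the next multiple of submit_size (integer ceiling).
    blocks = -(-num_docs // submit_size)
    return blocks * submit_size


print(max_docs_processed(1000))   # falls inside the eighth block of 128
print(max_docs_processed(256))    # already on a block boundary
```

The number actually indexed can still be lower than this bound, because invalid or corrupt documents are skipped.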

-mimemap

Syntax: -mimemap path_and_filename

Specifies a control file (simple ASCII text) that maps file extensions to MIME-types. This allows you to make custom associations and override defaults.

The format for the control file is:

#file_ext_no_dot                      mime-type

abc                       application/word
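To check a control file before using it, a small parser such as the following Python sketch may help. It assumes that lines beginning with # are comments (as the header line in the format above suggests) and that fields are separated by whitespace; the function name is hypothetical and the parser is not part of Verity Spider.

```python
def parse_mimemap(text: str) -> dict[str, str]:
    """Parse 'extension mime-type' lines into a dictionary."""
    mapping: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and lines assumed to be comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"expected 'extension mime-type', got: {line!r}")
        extension, mime_type = parts
        mapping[extension] = mime_type
    return mapping


sample = """\
#file_ext_no_dot                      mime-type
abc                       application/word
"""
print(parse_mimemap(sample))
```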

-nocache

Type: Web crawling only

Used with -noindex or -nosubmit, this option disables the caching of files during Web site indexing. This has the effect of decreasing the demands on your disk space.

Normally, Verity Spider downloads URLs, writes them to a bulk insert file, and downloads the documents themselves. When indexing occurs, once -submitsize has been reached, the cached files are indexed and then deleted. If you use -noindex, the bulk insert file is submitted but not processed by Verity Spider, so the documents are not deleted until another indexing process takes over. This is usually mkvdk or collsvc, or you can subsequently run Verity Spider again with the -processbif option.

By using -nocache in conjunction with -noindex or -nosubmit, you avoid storing files locally at all. Files are downloaded only when indexing actually occurs.

See also -noindex.

-nodupdetect

Type: Web crawling only

Disables checksum-based detection of duplicates when indexing Web sites. URL-based duplicate detection is still performed.

By default, a document checksum is computed based on the CRC-32 algorithm. The checksum combined with the document size is used to determine if the document is a duplicate.

See also -followdup.

-noindex

Specifies that the Verity Spider gathers document locations without indexing them. The document locations are stored in a bulk insert file (BIF), which is then submitted to the collection. This option is typically used in conjunction with a separate indexing process, such as mkvdk or collection servicers (collsvc). The BIF will be processed by the next indexing process run for the collection, whether it is the Verity Spider, mkvdk or collection servicers (collsvc).

Do not try to start both the Verity Spider and another process at the same time. You must allow Verity Spider enough time to generate enough work for the secondary indexing process to act upon. If you are using mkvdk, you can run it in persistent mode to ensure it will act upon work generated by Verity Spider.


Note

When you execute an indexing job for a collection and you use -noindex, the persistent store for the collection is not updated.


See also -nocache and -nosubmit.

For more information on mkvdk, see Chapter 9, "Managing Verity Collections with the mkvdk Utility".

-nosubmit

Specifies that the Verity Spider gathers document locations without indexing them. The document locations are stored in a bulk insert file (BIF), which is not submitted to the collection. This option is typically used in conjunction with a separate indexing process, such as mkvdk or collection servicers (collsvc). You can also use Verity Spider again with the -processbif option. Note that with an indexing process other than Verity Spider, you must specify the name and path for the BIF because the collection has no record of it.

-persist

Syntax: -persist num_seconds

Enables the Verity Spider to run in persistent mode, checking for updates every num_seconds seconds until it is stopped.

While the Verity Spider is running in persistent mode, there is no optimization. Once the Verity Spider is taken out of persistent mode, you will need to perform optimization on the collection. For more information about using mkvdk, see Chapter 9, "Managing Verity Collections with the mkvdk Utility".


Note

You should not run more than one Verity Spider process in persistent mode. Because the Verity Spider is a resource-intensive process, you should run it in persistent mode only with an interval of less than one day. For time intervals greater than twelve hours, you should use some form of scheduling, such as cron jobs on UNIX or the AT command on Windows NT Server.


-preferred

Syntax: -preferred exp_1 [exp_n] ...

Type: Web crawling only

Specifies a list of hosts or domains which are to be preferred when retrieving documents for viewing. You can use wildcard expressions, where the asterisk ( * ) is for text strings and the question mark ( ? ) is for single characters. To use regular expressions, also specify the -regexp option. Use this option when you leave duplicate detection enabled and do not specify -nodupdetect.

When indexing, you may encounter a non-preferred host first. In that case, documents are parsed and followed and stored as candidates. When duplicates are encountered on another server, which is preferred, the duplicate documents from the non-preferred server are skipped. When documents are requested for viewing, they will be retrieved from the preferred server.

On Windows NT, you should include double quotes around the argument to protect the special characters such as (*). On UNIX, you should use single quotes. Note that this is only required when you run the indexing job from a command line. Quotes are not necessary within a command file (-cmdfile).

See also -regexp.

-prefixmap

Syntax: -prefixmap path_and_filename

Type: File system only

Specifies a control file (simple ASCII text) that maps file system paths to Web aliases.

In conjunction with -abspath, this option is typically used to create a URL field that is the Web equivalent of a file system path. File system indexing is faster than Web crawling over the network. If you use -prefixmap to replace the file system path with the Web URL, relative hyperlinks in the HTML pages are kept intact when viewed through Information Server.

The format for the control file is:

src_field src_prefix dest_field dest_prefix

If you use backslashes, you must double them so they are properly escaped. For example:

C:\\test\\docs\\path

For example, to map the filepath prefix /usr/pub to http://web/~verity (so that files under /usr/pub/docs appear under http://web/~verity/docs), use the following:

vdkvgwkey /usr/pub URL http://web/~verity
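The effect of such a control line can be sketched in Python as a plain prefix substitution. This is an illustration of the four-field format only, not Verity Spider's implementation: the function names and the sample document path are hypothetical, and the sketch does not handle doubled backslashes.

```python
def parse_prefixmap(text: str) -> list[tuple[str, str, str, str]]:
    """Parse 'src_field src_prefix dest_field dest_prefix' lines."""
    rules = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if len(parts) != 4:
            raise ValueError(f"expected four fields, got: {line!r}")
        rules.append((parts[0], parts[1], parts[2], parts[3]))
    return rules


def apply_prefixmap(rules, fields: dict[str, str]) -> dict[str, str]:
    """Return a copy of fields with destination fields filled in by matching rules."""
    result = dict(fields)
    for src_field, src_prefix, dest_field, dest_prefix in rules:
        value = fields.get(src_field)
        if value is not None and value.startswith(src_prefix):
            # Replace the source prefix with the destination prefix.
            result[dest_field] = dest_prefix + value[len(src_prefix):]
    return result


rules = parse_prefixmap("vdkvgwkey /usr/pub URL http://web/~verity")
doc = {"vdkvgwkey": "/usr/pub/docs/index.html"}   # hypothetical document key
print(apply_prefixmap(rules, doc)["URL"])
```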

See also -abspath.

-processbif

Syntax: -processbif 'command_string !*'

Specifies a command string in which you can call a program or script that operates on BIFs generated by Verity Spider.

Because the command string uses special characters to represent the bulk insert file (BIF), you must run Verity Spider with a command file, using the -cmdfile option.

For example, if you want to use a script called fix_bif to add customized information to BIF files, use the following command:

vspider -cmdfile filename

Where filename is the text-only command file which contains the following (among any other necessary options):

-processbif 'fix_bif !*'

Note that your command file will include other options as well.

-regexp

Specifies the use of regular expressions rather than the default wildcard expressions for the following options: -exclude, -indexclude, -include, -indinclude, -skip, -indskip, -preferred, and -nofollow.

Wildcard expressions allow the use of the asterisk ( * ) for text strings, and the question mark ( ? ) for single characters.

This wildcard expression...    Will apply to these text strings...

a*t                            although, attitude, audit
file?.htm                      files.htm, file1.htm, filer.htm
name?.*                        names.txt, name.doc, named.blank, names.ext

Regular expressions allow for more powerful and flexible means for matching alphanumeric strings. For example, to match "ab11" or "ab34" but not "abcd" or "ab11cd," you could use the following regular expression:

^ab[0-9][0-9]$
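The behavior of this expression can be checked with any regular expression engine. The sketch below uses Python's re module as a stand-in; the dialect used by Verity Spider may differ in other respects, so treat this only as a check of the anchors and character classes in this one pattern.

```python
import re

# ^ and $ anchor the match to the whole string; [0-9] matches one digit.
pattern = re.compile(r"^ab[0-9][0-9]$")

candidates = ["ab11", "ab34", "abcd", "ab11cd"]
results = [bool(pattern.search(text)) for text in candidates]

for text, matched in zip(candidates, results):
    print(text, "matches" if matched else "does not match")
```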

The full extent to which regular expressions can be employed is beyond the scope of this description. For more information on regular expressions, refer to a book devoted to the subject.

-submitsize

Syntax: -submitsize num_documents

Specifies the number of documents submitted for indexing at one time. The default value is 128. The upper limit is 64,000.


Note

Although larger values mean more efficient processing by the indexer, smaller values will allow more parallelism on multi-CPU systems. Furthermore, in the event of a halt during indexing, a smaller value means fewer documents will be lost.


If a halt occurs during indexing, the chunk of documents specified by -submitsize is lost because there is no transactional rollback for indexing and the documents are no longer in the queue for indexing. Remember that when you re-run the indexing task, Verity Spider can only continue with URLs and documents which are enqueued.

-temp

Syntax: -temp path

Specifies the directory for temporary files (disk cache). By default, the temp directory is contained within the job directory (optionally specified with the -jobpath option).

If you do not specify a value for this option, Verity Spider will create a /spider/temp directory within the collection. For multiple-collection tasks, the first collection specified will be used.


Note

Make sure the location you specify contains enough disk space to handle the documents that are downloaded and held before indexing. The documents are deleted from the hard disk after they are indexed.


See also -jobpath, for specifying the location of all indexing job directories and files, one of which is the temp directory.


