Generates absolute paths for files. Use this option when the document locations are not going to change, but the collection might be moved around.
When you index a Web server's contents through the file system, you should use -prefixmap with -abspath to map the absolute filepaths to URLs.
See also -prefixmap.
Enables checksum-based detection of duplicates when indexing file systems.
By default, a document checksum is not computed on indexed files. By using -detectdupfile, a checksum is computed based on the CRC-32 algorithm. The checksum combined with the document size is used to determine if the document is a duplicate.
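As an illustration only, the idea of pairing a CRC-32 checksum with the document size can be sketched in Python. This is not Verity Spider's implementation; the function names `fingerprint` and `find_duplicates` are hypothetical, and the sketch assumes the checksum is computed over the full document content.

```python
import zlib


def fingerprint(data: bytes) -> tuple[int, int]:
    """Return (CRC-32 checksum, size in bytes) for a document's content."""
    return (zlib.crc32(data) & 0xFFFFFFFF, len(data))


def find_duplicates(docs: dict[str, bytes]) -> dict[str, str]:
    """Map each duplicate name to the first name seen with the same fingerprint."""
    seen: dict[tuple[int, int], str] = {}
    duplicates: dict[str, str] = {}
    for name, data in docs.items():
        key = fingerprint(data)
        if key in seen:
            # Same checksum and same size: treat as a duplicate.
            duplicates[name] = seen[key]
        else:
            seen[key] = name
    return duplicates


if __name__ == "__main__":
    sample = {"a.txt": b"hello", "b.txt": b"hello", "c.txt": b"world"}
    print(find_duplicates(sample))
```

Combining the size with the checksum reduces the chance that two different documents are mistaken for duplicates because of a CRC-32 collision alone.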
Syntax: -indexers num_indexers
Specifies the maximum number of indexing threads to run on a collection.
The default value is 2. Note that increasing the value for -indexers requires additional CPU and memory resources.
See also -maxindmem.
Syntax: -license path_and_filename
Specifies the license file to use. By default, ind.lic is used, from:
verity/prdname/platform/admin/
Where verity/prdname is the user-definable portion of the installation directory, and platform represents the platform directory.
Specifies the maximum amount of memory, in kilobytes, used by each indexing thread. The number of threads is specified with -indexers.
By default, each indexing thread uses as much memory as is available from the system.
Specifies the maximum number of documents to be downloaded or submitted for indexing. The value for num_docs does not necessarily correspond exactly to the number of documents indexed. The following factors affect the actual number:
Whether the value of num_docs falls within a block of documents dictated by -submitsize. If it does, the entire block of documents must be processed.
Whether retrieved documents are skipped rather than indexed because they are invalid or corrupt.
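The first factor can be illustrated with a little arithmetic. The sketch below is one reading of the text, not a documented formula: it assumes that whole blocks of -submitsize documents are always completed, so the count processed is num_docs rounded up to the next block boundary. The function name `max_docs_processed` is hypothetical, and the default of 128 is the -submitsize default stated in this reference.

```python
def max_docs_processed(num_docs: int, submit_size: int = 128) -> int:
    """Round num_docs up to the next multiple of submit_size.

    Assumption: a block that has been started is always processed in full.
    """
    if num_docs <= 0 or submit_size <= 0:
        raise ValueError("num_docs and submit_size must be positive")
    blocks = -(-num_docs // submit_size)  # ceiling division with integers
    return blocks * submit_size


if __name__ == "__main__":
    for requested in (1, 128, 300):
        print(requested, max_docs_processed(requested))
```

Under this assumption, a limit of 300 with the default block size would allow up to 384 documents to be processed. The number actually indexed can still be lower if some documents are invalid or corrupt.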
Syntax: -mimemap path_and_filename
Specifies a control file (simple ASCII text) that maps file extensions to MIME-types. This allows you to make custom associations and override defaults.
The format for the control file is:
#file_ext_no_dot mime-type
abc application/word
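To make the control file format concrete, here is a minimal Python sketch of how such a file could be read into an extension-to-MIME-type table. It is not Verity Spider's parser. The function name `parse_mimemap` is hypothetical, and the sketch assumes that lines starting with `#` are comments (as the format line above suggests) and that extensions are case-insensitive.

```python
def parse_mimemap(text: str) -> dict[str, str]:
    """Parse 'extension mime-type' lines into a dictionary."""
    mapping: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip malformed lines that lack a MIME type
        extension, mime_type = parts
        mapping[extension.lower()] = mime_type.strip()
    return mapping


if __name__ == "__main__":
    control_file = "#file_ext_no_dot mime-type\nabc application/word\n"
    print(parse_mimemap(control_file))
```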
Used with -noindex or -nosubmit, this option disables the caching of files during Web site indexing. This has the effect of decreasing the demands on your disk space.
Normally, Verity Spider downloads URLs and then writes them to a bulk insert file and downloads the documents themselves. When indexing occurs, once -submitsize has been reached, the cached files are indexed and then deleted. If you use -noindex, the bulk insert file is submitted but not processed by Verity Spider, so the documents are not deleted until another indexing process takes over. This will usually be mkvdk or collsvc, or you can subsequently use Verity Spider again with the -processbif option.
By using -nocache in conjunction with -noindex or -nosubmit, you avoid storing files locally at all. Files are downloaded only when indexing actually occurs.
See also -noindex.
Disables checksum-based detection of duplicates when indexing Web sites. URL-based duplicate detection is still performed.
By default, a document checksum is computed based on the CRC-32 algorithm. The checksum combined with the document size is used to determine if the document is a duplicate.
See also -followdup.
Specifies that the Verity Spider gathers document locations without indexing them. The document locations are stored in a bulk insert file (BIF), which is then submitted to the collection. This option is typically used in conjunction with a separate indexing process, such as mkvdk or collection servicers (collsvc). The BIF will be processed by the next indexing process run for the collection, whether it is the Verity Spider, mkvdk, or collection servicers (collsvc).
Do not try to start both the Verity Spider and another process at the same time. You must allow Verity Spider enough time to generate enough work for the secondary indexing process to act upon. If you are using mkvdk, you can run it in persistent mode to ensure it will act upon work generated by Verity Spider.
Note When you execute an indexing job for a collection and you use |
See also -nocache and -nosubmit.
For more information on mkvdk, see Chapter 9, "Managing Verity Collections with the mkvdk Utility".
Specifies that the Verity Spider gathers document locations without indexing them. The document locations are stored in a bulk insert file (BIF), which is not submitted to the collection. This option is typically used in conjunction with a separate indexing process, such as mkvdk or collection servicers (collsvc). You can also use Verity Spider again with the -processbif option. Note that with an indexing process other than Verity Spider, you must specify the name and path for the BIF because the collection has no record of it.
Enables the Verity Spider to run in persistent mode, checking for updates every num_seconds seconds until it is stopped.
While the Verity Spider is running in persistent mode, there is no optimization. Once the Verity Spider is taken out of persistent mode, you will need to perform optimization on the collection. For more information about using mkvdk, see Chapter 9, "Managing Verity Collections with the mkvdk Utility".
Note You should not run more than one Verity Spider process in persistent mode. As the Verity Spider is a resource-intensive process, you should only run it in persistent mode with an interval of less than one day. For time intervals greater than twelve hours, you should use some form of scheduling, such as cron jobs on UNIX or the AT command on Windows NT Server.
Syntax: -preferred exp_1 [exp_n] ...
Specifies a list of hosts or domains which are to be preferred when retrieving documents for viewing. You can use wildcard expressions, where the asterisk ( * ) is for text strings and the question mark ( ? ) is for single characters. To use regular expressions, also specify the -regexp option. Use this option when you leave duplicate detection enabled and do not specify -nodupdetect.
When indexing, you may encounter a non-preferred host first. In that case, documents are parsed and followed and stored as candidates. When duplicates are encountered on another server, which is preferred, the duplicate documents from the non-preferred server are skipped. When documents are requested for viewing, they will be retrieved from the preferred server.
On Windows NT, you should include double quotes around the argument to protect special characters such as the asterisk ( * ). On UNIX, you should use single quotes. Note that this is only required when you run the indexing job from a command line. Quotes are not necessary within a command file (-cmdfile).
Syntax: -prefixmap path_and_filename
Specifies a control file (simple ASCII text) that maps file system paths to Web aliases.
In conjunction with -abspath, this option is typically used to create a URL field that is the Web equivalent of a file system path. File system indexing is faster than Web crawling over the network. If you use -prefixmap to replace the file system path with the Web URL, relative hyperlinks in the HTML pages are kept intact when viewed through Information Server.
The format for the control file is:
src_field src_prefix dest_field dest_prefix
If you use backslashes, you must double them so they are properly escaped. For example:
C:\\test\\docs\\path
For example, to map the filepath /usr/pub/docs to http://web/~verity, use the following:
vdkvgwkey /usr/pub URL http://web/~verity
See also -abspath.
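The effect of a prefix map rule can be sketched in Python. This is an illustrative reading of the control file format, not Verity Spider's code: it assumes a rule replaces src_prefix at the start of the src_field value with dest_prefix and stores the result in dest_field. The function names and the sample path /usr/pub/index.html are hypothetical, and the doubled-backslash escaping described above is not handled.

```python
def parse_rule(line: str) -> tuple[str, str, str, str]:
    """Split a 'src_field src_prefix dest_field dest_prefix' line into its parts."""
    parts = line.split()
    if len(parts) != 4:
        raise ValueError("expected four whitespace-separated values")
    src_field, src_prefix, dest_field, dest_prefix = parts
    return (src_field, src_prefix, dest_field, dest_prefix)


def apply_rule(fields: dict[str, str], rule: tuple[str, str, str, str]) -> dict[str, str]:
    """Return a copy of fields with dest_field set when src_field has the prefix."""
    src_field, src_prefix, dest_field, dest_prefix = rule
    result = dict(fields)
    value = fields.get(src_field, "")
    if value.startswith(src_prefix):
        # Swap the file system prefix for the Web prefix; keep the remainder.
        result[dest_field] = dest_prefix + value[len(src_prefix):]
    return result


if __name__ == "__main__":
    rule = parse_rule("vdkvgwkey /usr/pub URL http://web/~verity")
    document = {"vdkvgwkey": "/usr/pub/index.html"}  # hypothetical document key
    print(apply_rule(document, rule))
```

Because only the prefix is replaced, the remainder of each path is carried over unchanged, which is consistent with relative hyperlinks continuing to resolve once the documents are addressed by URL.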
Syntax: -processbif 'command_string !*'
Due to the use of special characters, which represent the bulk insert file (BIF), you must run Verity Spider with a command file using the -cmdfile option.
Specifies a command string in which you can call a program or script which operates on BIFs generated by Verity Spider.
For example, if you want to use a script called fix_bif to add customized information to BIF files, use the following command:
vspider -cmdfile filename
Where filename is the text-only command file which contains the following (among any other necessary options):
-processbif 'fix_bif !*'
Note that your command file will include other options as well.
Specifies the use of regular expressions rather than the default wildcard expressions for the following options: -exclude, -indexclude, -include, -indinclude, -skip, -indskip, -preferred, and -nofollow.
Wildcard expressions allow the use of the asterisk ( * ) for text strings, and the question mark ( ? ) for single characters.
This wildcard expression... | Will apply to these text strings... |
---|---|
a*t | although, attitude, audit |
file?.htm | files.htm, file1.htm, filer.htm |
name?.* | names.txt, name.doc, named.blank, names.ext |
Regular expressions allow for more powerful and flexible means for matching alphanumeric strings. For example, to match "ab11" or "ab34" but not "abcd" or "ab11cd," you could use the following regular expression:
^ab[0-9][0-9]$
The full extent to which regular expressions can be employed is beyond the scope of this description. For more information on regular expressions, refer to a book devoted to the subject.
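The example expression above can be tried out with Python's `re` module. This is only a sketch: Python's regular expression dialect may differ in detail from the one Verity Spider uses, and the helper name `matches` is hypothetical.

```python
import re

# The expression from the text: "ab" followed by exactly two digits, nothing else.
PATTERN = re.compile(r"^ab[0-9][0-9]$")


def matches(text: str) -> bool:
    """Return True when the whole string fits the pattern."""
    return PATTERN.search(text) is not None


if __name__ == "__main__":
    for candidate in ("ab11", "ab34", "abcd", "ab11cd"):
        print(candidate, matches(candidate))
```

The anchors `^` and `$` are what rule out "ab11cd": without them, the expression would also match any string that merely contains "ab" followed by two digits.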
Syntax: -submitsize num_documents
Specifies the number of documents submitted for indexing at one time. The default value is 128. The upper limit is 64,000.
Note Although larger values mean more efficient processing by the indexer, smaller values will allow more parallelism on multi-CPU systems. Furthermore, in the event of a halt during indexing, a smaller value means fewer documents will be lost.
If a halt occurs during indexing, the chunk of documents specified by -submitsize is lost because there is no transactional rollback for indexing and the documents are no longer in the queue for indexing. Remember that when you re-run the indexing task, Verity Spider can only continue with URLs and documents which are enqueued.
Specifies the directory for temporary files (disk cache). By default, the temp directory is contained within the job directory (optionally specified with the -jobpath option).
If you do not specify a value for this option, Verity Spider will create a /spider/temp directory within the collection. For multiple-collection tasks, the first collection specified will be used.
Note Make sure the location you specify contains enough disk space to handle the documents which are downloaded and held before indexing. The documents are deleted from the hard disk after they are indexed.
See also -jobpath, for specifying the location of all indexing job directories and files, one of which is the temp directory.