Verity Spider Syntax

The following section shows the syntax for several basic types of Verity Spider indexing tasks.

Overview

Before you create an indexing task for a new collection, you should make copies of the relevant default style files to ensure that you have a set of template style files in a known, stable state.

Keep in mind that running multiple simultaneous Verity Spider jobs on the Information Server host may cause performance problems for searches. This does not mean you should never run indexing jobs when users may be searching, because your collections are available for searching even while indexing jobs are running. With an eye toward optimizing performance, you should try staggering your indexing jobs to avoid overloading your server.

The Verity Spider command

At its most basic level, a Verity Spider command consists of the following:

vspider -initialize -collection coll [options]

Here, -initialize is either -start or -refresh (use -refresh when starting points have changed), -collection is required and gives Verity Spider a target collection, and [options] can be almost any combination of the options described later in this chapter.

For example:

c:\cfusion\bin\vspider -common c:\cfusion\verity\common -collection c:\new -start http://localhost -indinclude *

Note that there are of course dependencies for other options, depending on the nature of the indexing task. Some examples are:

Note that if you do not run the Verity Spider executable from its default installation directory, you must include that directory in your path. This is because the Verity Spider executable depends on other files to run properly.

The default location for the Verity Spider executable is as follows:

verity/prdname/platform/admin

Where verity/prdname is the user-definable portion of the installation directory, and platform will vary depending on your operating system.
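As a minimal sketch of the path requirement above, the following Python helper builds an environment in which a given directory comes first on the path, so that a child process can find the executable and the files it depends on. The function name and the directory passed to it are hypothetical; substitute the actual installation directory for your platform.

```python
import os


def env_with_spider_dir(spider_dir, base_env=None):
    """Return a copy of the environment with spider_dir first on PATH.

    spider_dir is a placeholder for the directory that holds the
    Verity Spider executable; it is not taken from the product itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("PATH", "")
    # os.pathsep is ';' on Windows and ':' on UNIX-like systems.
    env["PATH"] = spider_dir + (os.pathsep + current if current else "")
    return env
```

The returned mapping can be passed as the env argument of subprocess.run when launching an indexing job from a script.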

Using a command file

If you want simpler reuse and archiving of your indexing commands, you should take advantage of the abstraction offered by the -cmdfile option. By using an ASCII text file to store a task's options, you also avoid the pitfall of using special characters in an option's parameter value.

For example, the -processbif option requires the use of "!*" and therefore any task using that option must also use the -cmdfile option.
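The idea of storing a task's options in an ASCII text file can be sketched in Python as follows. The layout used here, one option and its values per line, is an assumption made for illustration, as are the helper name and the file name newcoll.txt; check the -cmdfile description for the exact format your version expects.

```python
import os
import tempfile


def write_command_file(path, options):
    """Write each option and its values on its own line (assumed layout)."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        for name, values in options:
            handle.write(" ".join([name, *values]) + "\n")


# Options taken from the example command earlier in this section.
# The '&' in the starting point is left unescaped because the value
# is read from the file rather than parsed by the operating system.
options = [
    ("-collection", ["c:\\new"]),
    ("-start", ["http://localhost/time&/"]),
    ("-indinclude", ["*"]),
]

cmdfile = os.path.join(tempfile.mkdtemp(), "newcoll.txt")
write_command_file(cmdfile, options)

# The command line itself then only needs to name the command file.
command = ["vspider", "-cmdfile", cmdfile]
```

Keeping the options in a file like this also makes it easy to archive the exact task definition alongside the collection.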

Command-line option reference

The following sections describe the Verity Spider V3.7 options. Note that option names are case-sensitive.

-start

A starting point for an indexing job. You can specify multiple instances, or use multiple values in a single instance.

When you execute an indexing job from the command line and you do not use a command file (with -cmdfile), you must URL-escape any special characters in the starting point. To URL-escape a special character, replace it with a percent sign followed by the character's two-digit hexadecimal ASCII code. For example, you would use /time%26/ instead of /time&/, because the ASCII code for & is hexadecimal 26. This allows the operating system to properly process the command string.
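The escaping rule can be illustrated with Python's standard urllib.parse.quote function. This is only a sketch: the helper name is hypothetical, and it escapes every character other than letters, digits, and the characters _ . - ~ : /, which may be more than your operating system strictly requires.

```python
from urllib.parse import quote


def escape_start_point(url):
    """Percent-encode special characters in a starting point.

    ':' and '/' are kept as-is so that the scheme and the path
    separators survive. A '%' already present in the input would be
    escaped again, so pass in the unescaped form only.
    """
    return quote(url, safe=":/")


print(escape_start_point("http://localhost/time&/"))
# prints: http://localhost/time%26/
```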

In the event an indexing task halts, you can re-run the task as-is. The persistent store for the specified collection is read and only those candidate URLs that are in the queue but not yet processed are parsed. Candidate URLs correspond to URLs of the following status as reported by vsdb:

cand, used, inse, upda, dele, fail

For this repository type... | The starting point is...
Web | The URL or URLs from which the Verity Spider is to begin indexing. Use other options such as -jumps to control how far from the starting point Verity Spider goes.
File system | The starting directory or directories in which the Verity Spider will start indexing. All subdirectories beneath the starting point will be indexed unless you use -pathlen, or any of the inclusion or exclusion criteria.


Note

By using -start with -refresh, you provide a starting point for Verity Spider and are therefore not required to specify at least one of -host, -domain, -nofollow, or -unlimited.


-refresh

Used for updating a collection, specifies that Verity Spider process only those documents which qualify as follows:

When you re-run an existing indexing job, Verity Spider will automatically refresh the collection. If you add or remove any of the starting points, however, you must manually specify -refresh in order to refresh existing documents.


Note

You can also use -start to provide a starting point for Verity Spider. If you do not use -start, then you should use at least one of -host, -domain, or -nofollow. For further control, also see -refreshtime. If you do not use any constraint criteria, Verity Spider will operate without limits and will likely index far more than you intended.



