Content Options

-casesen

Makes processing case-sensitive: Verity Spider treats keys that differ only in case as separate keys. Use only for indexing UNIX servers.

-exclude

Syntax: -exclude exp_1 [exp_n] ...

Files, paths and URLs matching the specified expression(s) will not be followed. If you use backslashes, you must double them so they are properly escaped. For example:

C:\\test\\docs\\path
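If you build option values in a script, the doubling rule can be applied mechanically. The following is a minimal Python sketch; the helper name double_backslashes is hypothetical and is not part of Verity Spider. It only performs the textual doubling described above.

```python
def double_backslashes(path: str) -> str:
    """Return the path with every backslash doubled."""
    return path.replace("\\", "\\\\")


# A raw string keeps the single backslashes of the original Windows path.
escaped = double_backslashes(r"C:\test\docs\path")
print(escaped)  # C:\\test\\docs\\path
```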

You can use wildcard expressions, where the asterisk ( * ) is for text strings and the question mark ( ? ) is for single characters. For example:

'/my_doc*/year199?'
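To get a feel for which paths such an expression selects, you can approximate it with the fnmatch module from the Python standard library. This is only an approximation and an assumption on our part: fnmatch also gives square brackets a special meaning and lets the asterisk span path separators, and Verity Spider's own matcher may behave differently. The candidate paths are invented for illustration.

```python
from fnmatch import fnmatchcase

pattern = "/my_doc*/year199?"

candidates = [
    "/my_docs/year1998",    # '*' matches "s", '?' matches "8"
    "/my_doc/year1990",     # '*' matches the empty string
    "/my_docs/year199",     # no character left for '?'
    "/my_docs/year19985",   # '?' stands for exactly one character
]

# fnmatchcase compares case-sensitively on every platform.
results = {path: fnmatchcase(path, pattern) for path in candidates}
for path, matched in results.items():
    print(f"{path}: {matched}")
```

The first two paths match; the last two do not.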

On Windows NT, you should include double quotes around the argument to protect the special characters such as (*). On UNIX, you should use single quotes. Note that this is only required when you run the indexing job from a command line. Quotes are not necessary within a command file (-cmdfile).

To use regular expressions, also specify the -regexp option.

To specify a file, path or URL which you want followed but not indexed, use -indexclude. For document types, use -mimeexclude instead. For example, specify -mimeexclude application/pdf rather than -exclude *.pdf.


Note

When specifying a URL, you must use full, absolute paths using the same format as appears in the HTML hyperlink. If the link is relative, you must change it to absolute to use it with -exclude.


See also -regexp.

-include

Syntax: -include exp_1 [exp_n] ...

Only those files, paths and URLs which match the specified expression or expressions will be followed. If you use backslashes, you must double them so they are properly escaped. For example:

C:\\test\\docs\\path

You can use wildcard expressions, where the asterisk ( * ) is for text strings and the question mark ( ? ) is for single characters. For example:

'/my_doc*/year199?'

On Windows NT, you should include double quotes around the argument to protect the special characters such as (*). On UNIX, you should use single quotes. Note that this is only required when you run the indexing job from a command line. Quotes are not necessary within a command file (-cmdfile).

To use regular expressions, also specify the -regexp option.

Keep in mind that if your starting points do not match the specified -include expressions, nothing will be indexed, because the -include option prevents Verity Spider from even following anything which does not match. You may want to use -indinclude instead: it allows Verity Spider to follow anything while indexing only that which matches the specified expressions.

For document types, use -mimeinclude instead. For example, specify -mimeinclude text/html rather than -include *.htm.


Note

When specifying a URL, you must use full, absolute paths using the same format as appears in the HTML hyperlink. If the link is relative, you must change it to absolute to use it with -include.


See also -regexp.

-indexclude

Syntax: -indexclude exp_1 [exp_n] ...

Specifies that the files and paths in URLs which match the expressions are not indexed. They are, however, still followed. If you use backslashes, you must double them so they are properly escaped. For example:

C:\\test\\docs\\path

You can use wildcard expressions, where the asterisk ( * ) is for text strings and the question mark ( ? ) is for single characters. For example:

'/my_doc*/year199?'

On Windows NT, you should include double quotes around the argument to protect the special characters such as (*). On UNIX, you should use single quotes. Note that this is only required when you run the indexing job from a command line. Quotes are not necessary within a command file (-cmdfile).

To use regular expressions, also specify the -regexp option.

You would use this option to gather some documents, such as HTML tables of contents, to gain access to other documents for indexing.

Where the -exclude option prevents Verity Spider from even following anything which matches the specified expressions, -indexclude allows Verity Spider to follow anything and leaves out of the index only that which matches the specified expressions.
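The difference between the two options can be summarized as a pair of decisions per document: is it followed, and is it indexed? The Python sketch below is a simplified model and not Verity Spider's implementation. The function names, the www.example.com URLs, and the use of fnmatch for wildcard matching are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def matches_any(url, patterns):
    """True if the URL matches at least one wildcard expression."""
    return any(fnmatchcase(url, p) for p in patterns)


def decide(url, exclude=(), indexclude=()):
    """Return (follow, index) for one URL under this simplified model."""
    if matches_any(url, exclude):
        return (False, False)   # not followed, so its links are never seen
    if matches_any(url, indexclude):
        return (True, False)    # parsed for links, but not indexed
    return (True, True)


# Hypothetical documents: a table of contents and a chapter it links to.
toc = "http://www.example.com/docs/toc.html"
chapter = "http://www.example.com/docs/chapter1.html"

print(decide(toc, exclude=["*/toc.html"]))         # (False, False)
print(decide(toc, indexclude=["*/toc.html"]))      # (True, False)
print(decide(chapter, indexclude=["*/toc.html"]))  # (True, True)
```

In this model, only -indexclude lets the spider reach the chapter through the table of contents.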

For document types, use -indmimeexclude instead.


Note

When specifying a URL, you must use full, absolute paths using the same format as appears in the HTML hyperlink. If the link is relative, you must change it to absolute to use it with -indexclude.


See also -regexp.

-indinclude

Syntax: -indinclude exp_1 [exp_n] ...

Specifies that only those files and paths in URLs which match the expressions are indexed. Documents that do not match are still followed. If you use backslashes, you must double them so they are properly escaped. For example:

C:\\test\\docs\\path

You can use wildcard expressions, where the asterisk ( * ) is for text strings and the question mark ( ? ) is for single characters. For example:

'/my_doc*/year199?'

On Windows NT, you should include double quotes around the argument to protect the special characters such as (*). On UNIX, you should use single quotes. Note that this is only required when you run the indexing job from a command line. Quotes are not necessary within a command file (-cmdfile).

To use regular expressions, also specify the -regexp option.

Where the -include option prevents Verity Spider from even following anything which does not match the specified expressions, -indinclude allows Verity Spider to follow anything while only indexing that which matches the specified expressions.

Example

If you want to index all documents that include "search" in the URL at http://web.verity.com, you cannot use:

vspider -collection collname -start http://web.verity.com -include '*search*'

This is because the starting point does not match the -include criteria. Instead, use -indinclude to follow all documents (unless, of course, you have specified any of the exclude options) and index only those documents that match your criteria. Simply replace -include with -indinclude in the above example.
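The effect can be reproduced with a toy crawl. The Python sketch below is a simplified model under stated assumptions, not Verity Spider's implementation: the link graph under http://web.verity.com is invented, the crawl function is hypothetical, and fnmatch stands in for the spider's wildcard matching.

```python
from fnmatch import fnmatchcase

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "http://web.verity.com": [
        "http://web.verity.com/search/intro.html",
        "http://web.verity.com/products.html",
    ],
    "http://web.verity.com/search/intro.html": [],
    "http://web.verity.com/products.html": [
        "http://web.verity.com/products/search_tips.html",
    ],
    "http://web.verity.com/products/search_tips.html": [],
}


def crawl(start, include=None, indinclude=None):
    """Return the sorted list of URLs this simplified model would index."""
    indexed, seen, queue = [], set(), [start]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        # -include: anything that does not match is not even followed.
        if include and not fnmatchcase(url, include):
            continue
        # -indinclude: everything is followed, only matches are indexed.
        if indinclude is None or fnmatchcase(url, indinclude):
            indexed.append(url)
        queue.extend(LINKS.get(url, []))
    return sorted(indexed)


with_include = crawl("http://web.verity.com", include="*search*")
with_indinclude = crawl("http://web.verity.com", indinclude="*search*")
print(with_include)      # []
print(with_indinclude)   # the two URLs that contain "search"
```

With -include, the crawl stops at the starting point. With -indinclude, the starting point and products.html are followed but not indexed, and both "search" documents are indexed.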


Note

When specifying a URL, you must use full, absolute paths using the same format as appears in the HTML hyperlink. If the link is relative, you must change it to absolute to use it with -indinclude.


See also -regexp.

-indmimeexclude

Syntax: -indmimeexclude mime_1 [mime_n] ...

Specifies that documents whose MIME types match the expressions are followed but not indexed.

On Windows NT, you should include double quotes around the argument to protect the special characters such as (*). On UNIX, you should use single quotes. Note that this is only required when you run the indexing job from a command line. Quotes are not necessary within a command file (-cmdfile).

Use this option to gather some documents, such as HTML tables of contents, to gain access to other documents for indexing. The -mimeexclude option, on the other hand, prevents specified documents from being followed at all. For the mime variable, you can include the asterisk ( * ) wildcard for text strings. For example:

'text/*'

You cannot use the question mark ( ? ) wildcard, and the -regexp option does not allow you to use regular expressions.
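A matcher that honors only the asterisk can be sketched in a few lines of Python. This is an illustration of the rule above, not Verity Spider's code. The function name mime_matches is hypothetical, and the case-sensitive comparison is an assumption.

```python
import re


def mime_matches(mime_type: str, pattern: str) -> bool:
    """Match a MIME type against a pattern in which only '*' is a wildcard."""
    # Escape each literal piece so that '?', '.', '+' and so on stay literal,
    # then join the pieces with '.*' where the asterisks were.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.fullmatch(regex, mime_type) is not None


print(mime_matches("text/html", "text/*"))        # True
print(mime_matches("application/pdf", "text/*"))  # False
print(mime_matches("text/xml", "text/?ml"))       # False: '?' is literal here
```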

-indmimeinclude

Syntax: -indmimeinclude mime_1 [mime_n] ...

Specifies that only those documents whose MIME types match the expressions are indexed. Documents of other MIME types are still followed.

With the -mimeinclude option, by contrast, nothing is indexed if the starting URL does not match, because the starting URL itself is not followed. For the mime variable, you can include the asterisk ( * ) wildcard for text strings. For example:

'text/*'

On Windows NT, you should include double quotes around the argument to protect the special character (*). On UNIX, you should use single quotes. Note that this is only required when you run the indexing job from a command line. Quotes are not necessary within a command file (-cmdfile).

You cannot use the question mark ( ? ) wildcard, and the -regexp option does not allow you to use regular expressions.

Example

If you want to index all Word documents at http://web.verity.com, you cannot use:

vspider -collection collname -style style_dir -start http://web.verity.com -mimeinclude 'application/msword'

This is because the starting point does not match the -mimeinclude criteria. Now, you can use -indmimeinclude to follow all documents (unless, of course, you have specified any of the exclude options) and index only those documents that match your criteria. Simply replace -mimeinclude with -indmimeinclude in the above example.
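The four MIME options can be compared side by side as a small decision table. The Python sketch below is a simplified model built from the option descriptions in this section, not Verity Spider's implementation. The function name mime_decision is hypothetical, and it assumes the starting point is served as text/html.

```python
import re


def mime_decision(mime_type, option, pattern):
    """Return (follow, index) for one document under a single MIME option."""
    # Only '*' is treated as a wildcard; everything else is literal.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    hit = re.fullmatch(regex, mime_type) is not None
    table = {
        "-mimeinclude":    (hit, hit),          # non-matches are not followed
        "-mimeexclude":    (not hit, not hit),  # matches are not followed
        "-indmimeinclude": (True, hit),         # follow all, index matches
        "-indmimeexclude": (True, not hit),     # follow all, do not index matches
    }
    return table[option]


start_page = "text/html"   # assumed MIME type of the starting point
word = "application/msword"

print(mime_decision(start_page, "-mimeinclude", word))     # (False, False)
print(mime_decision(start_page, "-indmimeinclude", word))  # (True, False)
print(mime_decision(word, "-indmimeinclude", word))        # (True, True)
```

The first line of output corresponds to the failing command above: the starting point is never followed, so no Word document is ever reached.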

-indskip

Syntax: -indskip HTML_tag "exp"

Type: Web crawling only.

Specifies that Verity Spider is to follow and parse the links in, but not index, any HTML document which contains the text of exp within the given HTML_tag. For multiple HTML_tag and exp combinations, use multiple instances of the -indskip option.

You can use wildcard expressions, where the asterisk ( * ) is for text strings and the question mark ( ? ) is for single characters. For example:

'/my_doc*/year199?'

On Windows NT, you should include double quotes around the argument to protect the special characters such as (*). On UNIX, you should use single quotes. Note that this is only required when you run the indexing job from a command line. Quotes are not necessary within a command file (-cmdfile).

If you use backslashes, you must double them so they are properly escaped. For example:

C:\\test\\docs\\path

To use regular expressions, also specify the -regexp option.

Example

To skip all HTML documents which contain the word "personnel" in the Title element, while still parsing those documents for links to other documents, use the following:

-indskip title "personnel"

Example

To avoid indexing directory listing pages, while still parsing them and following their document and path links (except for the link up to the parent directory), use one of the following, depending on the Web server being indexed:

For Netscape Web servers, use the following:

-indskip title "*Index of*"

-nofollow "*parent directory*"

For Microsoft Internet Information Server, use the following:

-indskip a "*to parent directory*"

-nofollow "*parent directory*"
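The tag test behind these examples can be sketched in Python with the standard html.parser module. This is a rough model and not Verity Spider's implementation. The names TagTextCollector and tag_matches are hypothetical, the sample page is invented, and two behaviors are assumptions: an expression without wildcards is treated as a substring test, and matching is case-sensitive.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one HTML tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def tag_matches(html, tag, exp):
    """True if any occurrence of `tag` holds text matching `exp`."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    # Assumption: an expression without wildcards is a substring test.
    if "*" not in exp and "?" not in exp:
        return any(exp in text for text in collector.texts)
    return any(fnmatchcase(text.strip(), exp) for text in collector.texts)


# Hypothetical directory listing page.
listing = (
    "<html><head><title>Index of /docs</title></head>"
    "<body><a href='../'>Parent Directory</a></body></html>"
)

print(tag_matches(listing, "title", "*Index of*"))  # True
print(tag_matches(listing, "title", "personnel"))   # False
```

A document for which tag_matches returns True would be parsed for links but left out of the index under -indskip.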

-maxdocsize

Syntax: -maxdocsize integer

Specifies the maximum size, in kilobytes, for documents to be indexed. Any document larger than the value specified by -maxdocsize is ignored.

The default is to index documents of any size.

-metafile

Syntax: -metafile path_and_filename

Type: Web crawling only.

Allows you to use a text file to map custom meta tags to valid HTTP header fields. If you use backslashes, you must double them so they are properly escaped. For example: C:\\test\\docs\\path.

This means you are able to use your own meta tag, in the document, to replace what is returned by the Web server, or to insert it if nothing is returned. Currently, the only header fields of real value are "Last-Modified" and "Content-Length." Note, however, that future enhancements could allow for much greater variety.

The syntax for entries in the text file is:

name Last-Modified y|n

or

name Content-Length y|n

Where y|n is an override flag which can be either yes or no.

Example

A mapping file for -metafile might include:

Doc_Last_Touched Last-Modified n

Doc_Size Content-Length y

If you use the y override flag, the value for the custom meta tag overrides the value for the valid field, even if both values are present and differ. This can be useful when the valid field value is always sent, but you want to specify your own value with a custom meta tag.

If you use the n override flag, then the value for the custom meta tag will be used only if there is no value for the valid field returned by the server. If a value for the valid field exists, then that is given precedence.

Warning! If you have several entries mapping to the same valid field, only the last entry will take effect.
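The override rules can be expressed as a short Python sketch. It is a model of the behavior described above, not Verity Spider's code. The function names parse_metafile and resolve are hypothetical, the header and meta tag values are invented, and only the single-letter flags y and n from the example are handled.

```python
def parse_metafile(text):
    """Map each valid header field to (custom_meta_name, override)."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, field, flag = parts
        # A later entry for the same field replaces an earlier one.
        mapping[field] = (name, flag.lower() == "y")
    return mapping


def resolve(field, mapping, server_headers, meta_tags):
    """Pick the value used for `field` under the y|n override rules."""
    server_value = server_headers.get(field)
    if field not in mapping:
        return server_value
    name, override = mapping[field]
    custom_value = meta_tags.get(name)
    if custom_value is None:
        return server_value
    if override or server_value is None:
        return custom_value
    return server_value


mapping = parse_metafile(
    "Doc_Last_Touched Last-Modified n\n"
    "Doc_Size Content-Length y\n"
)
# Invented sample values for one document.
server = {"Last-Modified": "Mon, 01 Jan 2001 00:00:00 GMT", "Content-Length": "2048"}
meta = {"Doc_Last_Touched": "Tue, 02 Jan 2001 00:00:00 GMT", "Doc_Size": "1024"}

print(resolve("Last-Modified", mapping, server, meta))   # server value (flag n)
print(resolve("Content-Length", mapping, server, meta))  # 1024 (flag y)
print(resolve("Last-Modified", mapping, {}, meta))       # meta tag value
```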

-mimeexclude

Syntax: -mimeexclude mime_1 [mime_n] ...

Specifies MIME types which are neither followed nor indexed.

On Windows NT, you should include double quotes around the argument to protect the special characters such as (*). On UNIX, you should use single quotes. Note that this is only required when you run the indexing job from a command line. Quotes are not necessary within a command file (-cmdfile).

The default is to include all MIME types. For the mime variable, you can include the asterisk ( * ) wildcard for text strings. For example:

'text/*'

You cannot use the question mark ( ? ) wildcard, and the -regexp option does not allow you to use regular expressions.

Use -indmimeexclude to allow the Verity Spider to follow documents, without indexing them, to gain access to other desirable document types.

-mimeinclude

Syntax: -mimeinclude mime_1 [mime_n] ...

Specifies the MIME types to be followed and indexed. Documents of other MIME types are neither followed nor indexed.

On Windows NT, you should include double quotes around the argument to protect the special characters such as (*). On UNIX, you should use single quotes. Note that this is only required when you run the indexing job from a command line. Quotes are not necessary within a command file (-cmdfile).

The default is to include all MIME types. For the mime variable, you can include the asterisk ( * ) wildcard for text strings. For example:

'text/*'

You cannot use the question mark ( ? ) wildcard, and the -regexp option does not allow you to use regular expressions.

-mindocsize

Syntax: -mindocsize integer

Specifies the minimum size, in kilobytes, for documents to be indexed. Any document smaller than the value specified by -mindocsize is ignored.

The default is to index documents of any size.
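Taken together, -mindocsize and -maxdocsize define a size window. The following Python sketch shows the check under two assumptions: a kilobyte is taken as 1024 bytes, and a document whose size equals a limit exactly is still indexed. The function name within_size_limits is hypothetical.

```python
def within_size_limits(size_bytes, mindocsize=None, maxdocsize=None):
    """True if a document of size_bytes would pass both size options."""
    size_kb = size_bytes / 1024  # assumption: 1 kilobyte = 1024 bytes
    if mindocsize is not None and size_kb < mindocsize:
        return False  # smaller than -mindocsize: ignored
    if maxdocsize is not None and size_kb > maxdocsize:
        return False  # larger than -maxdocsize: ignored
    return True


print(within_size_limits(500, mindocsize=1))            # False
print(within_size_limits(2048, 1, 100))                 # True
print(within_size_limits(200 * 1024, maxdocsize=100))   # False
print(within_size_limits(0))                            # True: no limits set
```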

-skip

Syntax: -skip HTML_tag "exp"

Type: Web crawling only.

Specifies Verity Spider is to not index any HTML document which contains the text of exp within the given HTML_tag. For multiple HTML_tag and exp combinations, use multiple instances of the -skip option.

You can use wildcard expressions, where the asterisk ( * ) is for text strings and the question mark ( ? ) is for single characters. For example:

'/my_doc*/year199?'

On Windows NT, you should include double quotes around the argument to protect the special characters such as (*). On UNIX, you should use single quotes. Note that this is only required when you run the indexing job from a command line. Quotes are not necessary within a command file (-cmdfile).

If you use backslashes, you must double them so they are properly escaped. For example:

C:\\test\\docs\\path

To use regular expressions, also specify the -regexp option.

Example 1

To skip all HTML documents which contain the word "personnel" in the Title element, use the following:

-skip title "personnel"

Example 2

To skip HTML documents based on both the word "personnel" in the Title element and the phrase "internal use" in any paragraph element, use one instance of -skip for each HTML_tag and exp combination:

-skip title "personnel"

-skip p "*internal use*"

See also -regexp.


