Overview

The Verity Spider enables you to index Web-based and file system documents throughout the enterprise. Verity Spider works in conjunction with the Verity KeyView document filtering technology, so that more than two hundred of the most popular application document formats can be indexed, including Office 2000, WordPerfect, ASCII text, HTML, SGML, XML, and PDF (Adobe Acrobat) documents.

Supports Web standards

Verity Spider supports the key Web standards used by Internet and intranet sites today. Standard HREF links and frame pointers are recognized, so navigation through them is supported. Redirected pages are followed so that the real underlying document is indexed. Verity Spider adheres to the robots exclusion standard specified in robots.txt files, so administrators can maintain friendly visits to remote Web sites. The HTTP Basic Authentication mechanism is supported, so password-protected sites can be indexed.
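
The robots exclusion standard itself is simple: a site publishes a /robots.txt file, and a polite crawler checks each candidate URL against it before fetching. As a minimal illustration of the standard (not of Verity Spider's internal implementation), Python's standard library performs the same check; the user-agent string and URLs below are placeholders:

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt file.
    rp = RobotFileParser("http://www.example.com/robots.txt")
    rp.read()

    # Check a candidate URL before fetching it; "ExampleSpider" is a
    # placeholder user-agent, not Verity Spider's actual one.
    if rp.can_fetch("ExampleSpider", "http://www.example.com/private/report.html"):
        print("allowed by robots.txt")
    else:
        print("excluded by robots.txt")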

Unlike other Web crawlers, Verity Spider does not need to maintain complete local copies of remote documents. When documents are viewed through Verity Information Server, they are read from their native location, with optional highlighting.

Restart capability

When an indexing job fails, or Verity Spider cannot index a significant number or type of URLs for some reason, you can now restart the indexing job to update the collection. Only those URLs that were not successfully indexed previously will be processed.

State maintenance through a persistent store

Verity Spider V3.7 stores the state of gathered and indexed URLs in a persistent store, allowing it to track progress so that halted indexing jobs can be restarted gracefully and efficiently.

Previous versions of Verity Spider held state information only in memory, which meant that any stoppage of spidering resulted in lost work. It also meant that larger target sites required significantly more memory for spidering. The information in the persistent store can also be used to report statistics such as the number of indexed pages, visited pages, rejected pages, and broken links.
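
The idea can be sketched as follows. This is a conceptual illustration using SQLite, not Verity's actual on-disk format, and the status values simply mirror the statistics listed above:

    import sqlite3

    # Hypothetical persistent store for URL state.
    db = sqlite3.connect("spider_state.db")
    db.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, status TEXT)")

    def mark(url, status):
        # status is one of: 'indexed', 'visited', 'rejected', 'broken'
        db.execute("INSERT OR REPLACE INTO urls VALUES (?, ?)", (url, status))
        db.commit()

    def needs_work(url):
        # On restart, only URLs that were never successfully indexed
        # are processed again.
        row = db.execute("SELECT status FROM urls WHERE url = ?", (url,)).fetchone()
        return row is None or row[0] != "indexed"

    def report():
        # The same store yields progress statistics during or after a job.
        for status, count in db.execute(
                "SELECT status, COUNT(*) FROM urls GROUP BY status"):
            print(status, count)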

Performance

Thanks to low memory requirements, flow control, multithreading, and efficient Domain Name System (DNS) lookups, spidering performance is greatly improved over previous versions.

Flow control

When indexing Web sites, Verity Spider distributes requests to Web servers in a round-robin manner, fetching one URL from each Web server in turn. With flow control, a faster Web site may therefore finish before a slower one; in either case, Verity Spider keeps indexing every Web server at the best rate it can sustain.

Verity Spider V3.7 adjusts the number of connections per server depending on the download bandwidth. When the download bandwidth from a Web server falls below a certain value, Verity Spider will automatically scale back the number of connections to that Web server. There will always be at least one connection to a Web server. When the download bandwidth increases to an acceptable level, Verity Spider reallocates connections (per the value of the -connections option, which is 4 by default). You can turn off flow control with the -noflowctrl option.
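
As a rough sketch of this scaling rule (the bandwidth threshold below is a made-up number; the actual thresholds Verity uses are not documented here):

    MAX_CONNECTIONS = 4        # mirrors the -connections default
    MIN_BANDWIDTH = 8 * 1024   # hypothetical threshold, in bytes per second

    def adjust_connections(current, measured_bandwidth):
        """Scale the number of connections to one Web server up or down
        based on the download bandwidth measured from it."""
        if measured_bandwidth < MIN_BANDWIDTH:
            # Slow server: scale back, but always keep at least one connection.
            return max(1, current - 1)
        # Bandwidth is acceptable again: reallocate up to the configured limit.
        return min(MAX_CONNECTIONS, current + 1)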

Multithreading

Since version 3.1, Verity Spider has separated the gathering and indexing jobs into multiple threads for concurrency. Verity Spider V3.7 can create concurrent connections to Web servers for fetching documents, and can run concurrent indexing threads for maximum utilization. This translates to an overall improvement in throughput. In previous releases, work was done in a round-robin manner, so that at any given time only one job was running. Verity Spider still attends to the Web sites within an indexing job in a round-robin manner.
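
In outline, gathering and indexing form a producer/consumer pipeline. The sketch below shows the general pattern with Python threads and shared queues; fetch_document and index_document are placeholder stubs, not Verity APIs, and the thread counts are arbitrary:

    import queue
    import threading

    def fetch_document(url):
        # Placeholder: a real gatherer performs an HTTP fetch here.
        return "contents of " + url

    def index_document(doc):
        # Placeholder: a real indexer adds the document to the collection.
        print("indexed:", doc)

    urls = queue.Queue()       # URLs waiting to be fetched
    documents = queue.Queue()  # fetched documents waiting to be indexed

    def gatherer():
        while True:
            url = urls.get()
            documents.put(fetch_document(url))
            urls.task_done()

    def indexer():
        while True:
            index_document(documents.get())
            documents.task_done()

    # Several gatherer and indexer threads run at once, rather than
    # one job at a time.
    for _ in range(4):
        threading.Thread(target=gatherer, daemon=True).start()
    for _ in range(2):
        threading.Thread(target=indexer, daemon=True).start()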

Efficient DNS lookups

Verity Spider V3.7 significantly reduces DNS lookups, which greatly improves spidering throughput. If spidering is limited by domain or host, no DNS lookups are made on hosts that fall outside that range. Previously, DNS lookups were made on all candidate URLs.
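
The optimization amounts to filtering candidate hosts before resolving them, and resolving each remaining host at most once. A sketch, where allowed_domains stands in for the job's domain or host restrictions:

    import socket
    from functools import lru_cache

    # Stand-in for an indexing job limited to these domains.
    allowed_domains = (".example.com",)

    @lru_cache(maxsize=None)
    def resolve(host):
        # Each distinct in-range host is resolved at most once.
        return socket.gethostbyname(host)

    def maybe_resolve(host):
        # Hosts outside the allowed range are rejected with no DNS lookup at all.
        if not host.endswith(allowed_domains):
            return None
        return resolve(host)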

Proxy handling efficiency

The -noproxy option, which reduces proxy checking for certain hosts, and the -proxyauth option, which authenticates on proxy servers, allow for much greater flexibility when dealing with indexing jobs that involve proxy servers and firewalls. NOTE: Information Server V3.7 does not support retrieving documents for viewing through secure proxy servers. Do not use -proxyauth for indexing documents which are to be viewed through Information Server V3.7.
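
Conceptually, the effect of a -noproxy host list is a per-host routing decision made before each request. A sketch of that decision (the host names are placeholders, and this is not Verity's implementation):

    # Hosts to be fetched directly, bypassing the proxy entirely
    # (the conceptual effect of a -noproxy host list).
    no_proxy_hosts = {"intranet.example.com", "localhost"}

    def route_request(host):
        if host in no_proxy_hosts:
            return "direct"     # no proxy check for this host
        return "via proxy"      # proxied requests may need -proxyauth credentials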


