Perl Cookbook


20. Web Automation

Contents:
Introduction
Fetching a URL from a Perl Script
Automating Form Submission
Extracting URLs
Converting ASCII to HTML
Converting HTML to ASCII
Extracting or Removing HTML Tags
Finding Stale Links
Finding Fresh Links
Creating HTML Templates
Mirroring Web Pages
Creating a Robot
Parsing a Web Server Log File
Processing Server Logs
Program: htmlsub
Program: hrefsub

The web, then, or the pattern, a web at once sensuous and logical, an elegant and pregnant texture: that is style, that is the foundation of the art of literature.

- Robert Louis Stevenson, On Some Technical Elements of Style in Literature (1885)

20.0. Introduction

Chapter 19, CGI Programming, concentrated on responding to browser requests and producing documents using CGI. This one approaches the Web from the other side: instead of responding to a browser, you pretend to be one, generating requests and processing returned documents. We make extensive use of modules to simplify this process, because the intricate network protocols and document formats are tricky to get right. By letting existing modules handle the hard parts, you can concentrate on the interesting part - your own program.

The relevant modules can all be found under the following URL:

http://www.perl.com/CPAN/modules/by-category/15_World_Wide_Web_HTML_HTTP_CGI/

There are modules for computing credit card checksums, interacting with Netscape or Apache server APIs, processing image maps, validating HTML, and manipulating MIME. The largest and most important modules for this chapter, though, are found in the libwww-perl suite of modules, referred to collectively as LWP. Here are just a few of the modules included in LWP:

Module Name          Purpose
-------------------  ------------------------------------------------
LWP::UserAgent       WWW user agent class
LWP::RobotUA         Develop robot applications
LWP::Protocol        Interface to various protocol schemes
LWP::Authen::Basic   Handle 401 and 407 responses
LWP::MediaTypes      MIME types configuration (text/html, etc.)
LWP::Debug           Debug logging module
LWP::Simple          Simple procedural interface for common functions
HTTP::Headers        MIME/RFC822 style headers
HTTP::Message        HTTP style message
HTTP::Request        HTTP request
HTTP::Response       HTTP response
HTTP::Daemon         A HTTP server class
HTTP::Status         HTTP status code (200 OK etc)
HTTP::Date           Date parsing module for HTTP date formats
HTTP::Negotiate      HTTP content negotiation calculation
WWW::RobotRules      Parse robots.txt files
File::Listing        Parse directory listings

The HTTP:: and LWP:: modules let you request documents from a server. The LWP::Simple module, in particular, offers a very basic way to fetch a document. LWP::Simple, however, lacks the ability to access individual components of the HTTP response. To access these, use HTTP::Request, HTTP::Response, and LWP::UserAgent. We show both sets of modules in Recipes 20.1, 20.2, and 20.10.
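To make the difference concrete, here is a minimal sketch (not code from the recipes) that contrasts the two interfaces. The helper name describe_response is our own invention for illustration, and the script touches the network only when you pass a URL on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use LWP::UserAgent;
use HTTP::Request;

# describe_response is a hypothetical helper (not part of LWP): it collects
# the pieces of an HTTP::Response that LWP::Simple's get() does not expose.
sub describe_response {
    my ($response) = @_;
    return {
        code    => $response->code,                        # e.g. 200
        message => $response->message,                     # e.g. "OK"
        type    => scalar $response->header('Content-Type') // 'unknown',
        length  => length($response->content),
        ok      => $response->is_success ? 1 : 0,
    };
}

# Only fetch anything if a URL was supplied as the first argument.
if (@ARGV) {
    my $url = $ARGV[0];

    # LWP::Simple: you get the document body, or undef on failure.
    my $body = get($url);
    print defined $body ? "LWP::Simple fetched the document\n"
                        : "LWP::Simple failed (no further detail)\n";

    # LWP::UserAgent: you get a full HTTP::Response object back.
    my $ua       = LWP::UserAgent->new(timeout => 10);
    my $response = $ua->request(HTTP::Request->new(GET => $url));
    my $info     = describe_response($response);
    printf "%s %s (%s, %d bytes)\n", @{$info}{qw(code message type length)};
}
```

When a fetch fails, get simply returns undef, whereas the response object still carries the status code and message that explain the failure.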

Closely allied with LWP, but not distributed in the LWP bundle, are the HTML:: modules. These let you parse HTML. They provide the basis for Recipes 20.3, 20.4, 20.5, 20.6, and 20.7, and the programs htmlsub and hrefsub.
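As a small taste of what the HTML:: modules offer, the following sketch uses HTML::LinkExtor to pull the href values out of a string of HTML. The function name extract_links and the sample markup are assumptions made for illustration; the recipes themselves may take a different approach.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# extract_links is a hypothetical helper: it returns the href of every
# <a> tag in the given HTML, in document order. No base URL is supplied,
# so relative links are returned exactly as written.
sub extract_links {
    my ($html) = @_;
    my $parser = HTML::LinkExtor->new;   # no callback: links accumulate
    $parser->parse($html);
    $parser->eof;

    my @urls;
    for my $link ($parser->links) {
        my ($tag, %attrs) = @$link;      # each link is [tag, attr => url, ...]
        push @urls, $attrs{href} if $tag eq 'a' && defined $attrs{href};
    }
    return @urls;
}

# Sample markup, invented for this example.
my $html = '<p><a href="intro.html">Intro</a> <img src="logo.png"> '
         . '<a href="http://www.perl.com/">Perl</a></p>';
print "$_\n" for extract_links($html);
```

Letting a real parser find the tags avoids the many corner cases that trip up a hand-written regular expression, such as attributes in unexpected order or unusual quoting.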

Recipe 20.12 gives a regular expression to decode the fields in your web server's log files and shows how to interpret the fields. We use this regular expression and the Logfile::Apache module in Recipe 20.13 to show two ways of summarizing the data in web server log files.
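For orientation, here is a rough sketch of that kind of pattern, assuming your server writes the Common Log Format. The regular expression, the function name parse_log_line, and the sample line are our own illustration, not the ones developed in Recipe 20.12.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_log_line is a hypothetical helper: it splits one Common Log Format
# line into named fields, returning a hash reference, or nothing if the
# line does not match.
sub parse_log_line {
    my ($line) = @_;
    my ($client, $ident, $user, $date, $time, $tz,
        $method, $url, $protocol, $status, $bytes) = $line =~ m{
            ^(\S+)\s+(\S+)\s+(\S+)\s+                  # client, identd user, auth user
            \[([^:\]]+):(\d+:\d+:\d+)\s+([^\]]+)\]\s+  # date, time, and time zone
            "(\S+)\s+(.*?)\s+(\S+)"\s+                 # method, url, protocol
            (\d{3})\s+(\S+)                            # status, bytes sent
        }x or return;

    return {
        client => $client, ident    => $ident,    user   => $user,
        date   => $date,   time     => $time,     tz     => $tz,
        method => $method, url      => $url,      protocol => $protocol,
        status => $status,
        bytes  => ($bytes eq '-' ? 0 : $bytes),   # "-" means no body was sent
    };
}

# Sample line, invented for this example.
my $sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
           . '"GET /index.html HTTP/1.0" 200 2326';
if (my $hit = parse_log_line($sample)) {
    print "$hit->{client} requested $hit->{url} (status $hit->{status})\n";
}
```

Logs written in other formats, such as ones with referrer and user-agent fields appended, would need the pattern extended accordingly.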

