
Web Robots

It was only a matter of time. Because computers collect information much better than people do, the next logical evolution of the World Wide Web (WWW) had to be the robot. A robot can quickly catalog and store information about a WWW link and move on to the next, mainly because it does not display the information or load the images. It simply retrieves pages, catalogs their contents, and moves on. This article covers the history of the web robot, along with some insight into how to create one.

by Jerry Ablan

A web robot, at its core, is simply a non-displaying Web browser. By virtue of being essentially a browser, it knows how to speak HTTP to servers and how to dissect URLs. This is quite helpful on a jaunt through Cyberspace.

Web robots can be written in nearly any language, though most are written in lower-level languages (like C and C++). However, according to the "List of Known Robots" Web page (URL below), quite a few are written in Perl. Some other languages that have been used to write robots are Tcl/Tk and Python.

Generally, if a language has network connectivity and parsing abilities, it is a good candidate for robot creation. With programming ease, however, comes much overhead. As your robot becomes more complex, you may outgrow your chosen language.

How They Work

Web robots can be simple or complex--depending on the intentions of the author and what the robot is designed to do. Most robots are simple wanderers, surfing links only to discover resources in webspace. Other, more complex, robots construct elaborate keyword databases of links and present them to their users. WebCrawler is an excellent example of a complex robot. Robots can be used for other purposes such as link validation and site-mirroring.

To exemplify the workings of a Web robot, I describe in this article a fictitious robot called "WebTreader" (WT). WT is a simple robot whose only purpose in life is link discovery. It traverses web sites and keeps a database of the links pointed to by these pages. It is nothing more than a program that retrieves, parses, and stores URLs in a database.

WebTreader's Database

The database stores information from its encounters on the Web. Uses for a resource discovery database like this are many. It can be used as the basis of a search utility or to generate random links.

Currently, the database consists of a single table. The table holds information about the places the WT robot has been and where it needs to go. Every site it visits is recorded in this table, which can be kept in a flat text file or even in a relational database such as Oracle.

Each record in the table contains information about a single URL. The table should store the following information:

  1. The complete URL

  2. The type of document the URL points to

  3. The last date and time it was visited by the robot

Future enhancements to WT could include keyword storage. It would be nice if WT compiled a set of keywords to associate with each web site. These keywords could then be used to search the database.
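To make this concrete, here is a minimal sketch of such a table using Python and SQLite (Python being one of the languages mentioned earlier). The table and column names are hypothetical choices for illustration, not part of any real WT:

# Minimal sketch of WebTreader's single-table database (hypothetical schema).
import sqlite3

conn = sqlite3.connect("webtreader.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS urls (
        url          TEXT PRIMARY KEY,   -- the complete URL
        doc_type     TEXT,               -- type of document the URL points to
        last_visited TEXT,               -- date/time of the robot's last visit
        visited      INTEGER DEFAULT 0   -- 0 = still to do, 1 = already cataloged
    )
""")
conn.commit()

The extra visited flag supports the "mark as unvisited, come back later" behavior described in the next section; a keyword column could be added later in the same way.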

Web Robot Operations

Our fake robot, WT, is quite simple. You pass it a URL on the command line and it discovers all possible links from there on out into the Web. The discovered URLs will form a tree of sorts. This may be hard to visualize, so an example is in order.

Imagine that WebTreader is making a database of the files stored on your hard disk. Think of the starting URL as the root directory on your hard drive. Each file and directory name it finds is stored in the database.

Each directory found is marked as not yet visited. The program then searches the database for the first unvisited directory, searches it, and repeats the process with the next unvisited directory.

After all directories have been searched, the files within them cataloged, and all avenues for discovering files on your hard drive exhausted, the program ends.

When WebTreader runs in webspace, instead of your hard disk directory, it generates a database of web pages. A web page that contains links is much like a directory on your hard drive. It is marked as unvisited and will be cataloged later.

The bulk of the robot program is in the URL processing routine. Here it retrieves the document pointed to by a URL, strips out all the links within the document, stores the original URL in the database as having been visited, and finally retrieves the next unvisited URL from the database.
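Sketched in Python against the hypothetical table above, that cycle might look like the following. This is not the Listing 1 skeleton; error handling, politeness delays, and the /robots.txt checks discussed later are omitted for brevity:

# Rough sketch of WT's retrieve/parse/store cycle (illustrative only).
import datetime
import re
import sqlite3
import urllib.request
from urllib.parse import urljoin

LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)

def fetch_document(url):
    req = urllib.request.Request(url, headers={"User-Agent": "WebTreader/1.0"})
    with urllib.request.urlopen(req) as resp:
        doc_type = resp.headers.get("Content-Type", "")
        return doc_type, resp.read().decode("latin-1", "replace")

def extract_links(body, base):
    # resolve each href relative to the page it appeared on
    return [urljoin(base, href) for href in LINK_RE.findall(body)]

def run(conn):
    while True:
        row = conn.execute("SELECT url FROM urls WHERE visited = 0 LIMIT 1").fetchone()
        if row is None:
            break                                   # nothing left to visit
        url = row[0]
        doc_type, body = fetch_document(url)        # retrieve the document
        for link in extract_links(body, url):       # strip out its links
            conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (link,))
        conn.execute("UPDATE urls SET visited = 1, doc_type = ?, last_visited = ? WHERE url = ?",
                     (doc_type, datetime.datetime.now().isoformat(), url))
        conn.commit()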

As you can see, the life cycle is short but sweet. The robot does nothing but retrieve, parse, and store. After running for only a few minutes, your robot can catalog quite a bit of data!

Web Robot Identification

When most software clients, or "agents," connect to a Web server, they leave a digital signature that is passed to your web server via the HTTP request header field User-Agent. This field allows the client to pass additional information about itself and the request to the web server.

So if you think you've been visited by a robot, or just want to know, check your log files. If your web server supports the User-Agent logging facility (NCSA and Apache do), it should be easy! Below is a sample of the User-Agent log from an NCSA server:

[11/Dec/1995:11:00:28] SPRY_Mosaic/v7.36 \
(Windows 16-bit) SPRY_package/v4.00
[11/Dec/1995:11:00:28] Mozilla/1.1N (X11; I; SunOS 5.3 sun4c)
[11/Dec/1995:11:00:29] Mozilla/1.22 (Windows; I; 16bit)
[11/Dec/1995:11:00:30] Mozilla/1.0N (Windows)
[11/Dec/1995:11:00:31] NCSA Mosaic for the \
X Window System/2.4  libwww/2.12 modified
[11/Dec/1995:11:00:32] MacWeb/1.00ALPHA3  libwww/2.17

Another way to identify robots is by their request pattern. If you find many files being requested in a very short amount of time, it is a good bet the requests came from a robot. Lastly, check for repeated requests for a file called "/robots.txt". This file is the centerpiece of a robot exclusion standard proposed by Martijn Koster (m.koster@webcrawler.com) in 1994.
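If your server does not log User-Agent, a short script over an ordinary access log can still flag likely robots. The sketch below assumes an NCSA/Apache-style common log format, where the client address is the first field and the request line is quoted; the log file name is an assumption:

# Count requests for /robots.txt per client address.
from collections import Counter

suspects = Counter()
with open("access_log") as log:
    for line in log:
        if '"GET /robots.txt' in line:
            suspects[line.split()[0]] += 1

for host, hits in suspects.most_common():
    print(host, hits)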

In addition to the User-Agent field, there is a From field. This field usually contains the e-mail address of the person controlling the client software. Most robots place the e-mail address of the creator in this space. This information can be stored in your web server log files.

Web Robot Repelling

If you want to keep robots away from your site, or from a part of your site, you can use the "/robots.txt" file, part of a standard for robot exclusion proposed by Martijn Koster. The format of this file is simple. The assumption is that all robots are allowed except for those that appear in this special file.

Here is a sample "/robots.txt" file:

# Example /robots.txt file
User-agent: Fish-Search-Robot
Disallow:
User-agent: MOMspider/1.00 libwww-perl/0.40
Disallow: /
User-agent: *
Disallow: /tmp
Disallow: /logs

The User-Agent line identifies the robots to act upon, and the Disallow field immediately following it specifies the directories from which the robot is restricted. Multiple Disallows may be specified.

The first example disallows nothing for the "Fish-Search-Robot". The second example disallows all access for the "MOMspider/1.00 libwww-perl/0.40" robot.

The last example is a special case. The "*" is not a file wildcard in this specification, but instead identifies all remaining User-Agents. So, the last example disallows access to the /tmp and /logs directories for all remaining User-Agents.
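For robot authors, the easiest way to honor these rules is to lean on an existing parser rather than writing one. As a sketch, Python's standard urllib.robotparser module answers "may I fetch this?" questions; the site URL and robot name below are illustrative:

# Check a URL against a site's exclusion file before fetching it.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                # fetch and parse /robots.txt

if rp.can_fetch("WebTreader/1.0", "http://www.example.com/tmp/scratch.html"):
    print("allowed")
else:
    print("excluded by /robots.txt")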

The "/robots.txt" file is just one way to keep out unwanted robots. However, the robot software must support the protocol. Some web servers, like Apache and NCSA, allow you to configure different actions based on the User-Agent field received from the client. Check your web server documentation for more information about this feature.

Web Robot Creation

If you are like me, all this talk of robots makes you want to write one. An incomplete skeleton of WT is included in Listing 1. A robot is fairly simple to code, and once the base is completed, it is simple to build upon. There are just a few things to keep in mind before and during the creation of your robot.

Web Robot Creation Guidelines

Martijn Koster of WebCrawler maintains a list of guidelines for web robot authors. It was created by Martijn and has been contributed to by Jonathon Fletcher, Lee McLoughlin, and others. The following list is a summary of that information:

Identify Your Web Wanderer.
HTTP supports a User-Agent field to identify a WWW browser. As your robot is a kind of WWW browser, use this field to name your robot (e.g., "WebTreader/1.0"). This will allow server maintainers to set your robot apart from human users using interactive browsers. It is also recommended that you run it from a machine registered in the DNS, which makes the robot easier to recognize and indicates to people where it is coming from.

Identify Yourself.
HTTP supports a From field to identify the user who runs the WWW browser. Use this to advertise your e-mail address, e.g., "j.smith@somewhere.edu". This will allow server maintainers to contact you in case of problems so that you can start a dialogue on better terms than if you were hard to track down.
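As a sketch of both fields in practice, here is how a Python robot might set them on each request; the header values are the fictitious WebTreader's:

# Identify the robot and its operator on every HTTP request.
import urllib.request

req = urllib.request.Request(
    "http://www.example.com/",
    headers={
        "User-Agent": "WebTreader/1.0",       # who the robot is
        "From": "j.smith@somewhere.edu",      # who runs it
    })
with urllib.request.urlopen(req) as resp:
    body = resp.read()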

Announce It To The Public.
Post a message to comp.infosystems.www.providers before running your robots. If people know in advance, they can watch and prepare for your robot's visit.

Announce It To The Target.
If you are only targeting a single site, or a selection, contact the owners and inform them.

Be Informative.
Server maintainers often wonder why their server is hit. If you use the HTTP Referer field, you can tell them. This costs no effort on your part and may be informative.

Be There.
Don't set your Web Wanderer going and then go on holiday. If in your absence it does things that upset people, you--the only one who can fix it--won't be available. It is best to remain logged in to the machine that is running your robot so people can use "finger" and "talk" to contact you.

Suspend the robot when you're not there for a number of days (like over the weekend); run it only when you are present. Yes, it may be better for the performance of the machine if you run it overnight, but that implies you don't think about the performance overhead on other people's machines. Yes, it will take longer for the robot to run, but this is more an indication that robots are not the way to do things anyway than an argument for running it continually--after all, what's the rush?

Notify Your Authorities.
It is advisable to tell your system administrator or network provider what you are planning to do. You will be asking a lot of the services they offer, and if something goes wrong, they would rather hear about it from you first than from outside complaints.

Test Locally.
Don't run repeated tests on remote servers. Run a number of servers locally and use them first to test your robot. When going off-site for the first time, stay close to home initially (e.g., start with a page from a local server). After doing a small run, analyze your performance and your results, and estimate how they will scale with thousands of documents. It may soon be obvious you don't have the resources to continue with your present method.

Robots consume a lot of resources. To minimize the impact, keep the following in mind:

Walk, Don't Run.
Make sure your robot runs slowly: although robots can handle hundreds of documents per minute, this puts a large strain on a server and is guaranteed to infuriate the server maintainer. Instead, put a sleep in, or if you're clever, rotate queries among different servers in a round-robin fashion. Retrieving one document per minute is much better than one per second. One every five minutes is better still.
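A minimal way to enforce this is to remember when each host was last contacted and sleep until a per-host delay has passed. The sketch below uses the one-document-per-minute figure suggested above:

# Simple per-host politeness delay.
import time
from urllib.parse import urlparse

DELAY = 60.0          # seconds between requests to the same host
last_hit = {}         # host -> time of the most recent request

def wait_politely(url):
    host = urlparse(url).netloc
    pause = DELAY - (time.time() - last_hit.get(host, 0.0))
    if pause > 0:
        time.sleep(pause)
    last_hit[host] = time.time()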

Use HEAD Where Possible.
If your application can use the HTTP HEAD facility for its purposes, it will create less overhead than full GETs.
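For example, a robot that only needs a document's type or modification date could issue a HEAD request instead of a GET. A sketch using Python's http.client, with an illustrative host and path:

# Ask only for the response headers, not the document body.
import http.client

conn = http.client.HTTPConnection("www.example.com")
conn.request("HEAD", "/index.html", headers={"User-Agent": "WebTreader/1.0"})
resp = conn.getresponse()
print(resp.status, resp.getheader("Content-Type"), resp.getheader("Last-Modified"))
conn.close()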

Ask For What You Want.
HTTP has an Accept field in which a browser (or your robot) can specify the kinds of data it can handle. Use it. If you only analyze text, specify so. This will allow clever servers to not bother sending you data your robot can't handle and will have to throw away anyway. Also, make use of URL suffixes if they're available. You can build in some logic yourself: if a link refers to a ".ps", ".zip", ".Z", ".gif", etc., and you only handle text, then don't ask for the file. Although file extensions are not the modern way to do things (Accept is), there is an enormous installed base out there that uses them to specify file type (especially FTP sites). Also look out for gateways (e.g., URLs starting with "finger"), news gateways, WAIS gateways, etc. And think about other protocols (news:, telnet:, etc.). Don't forget the sub-page references (<A HREF="#abstract">)--don't retrieve the same page more than once. It's imperative to make a list of places not to visit before you start.
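A sketch of this kind of pre-filtering, with an illustrative suffix and scheme list, might look like this:

# Skip URLs the robot cannot use before ever requesting them, and strip
# sub-page references so the same page is not queued twice.
from urllib.parse import urldefrag, urlparse

ACCEPT = "text/html, text/plain"                  # sent in the Accept header
SKIP_SUFFIXES = (".ps", ".zip", ".Z", ".gif")     # non-text types we ignore
SKIP_SCHEMES = {"news", "telnet", "mailto", "gopher", "wais"}

def worth_fetching(url):
    url, _fragment = urldefrag(url)   # "#abstract" etc. names the same page
    parts = urlparse(url)
    if parts.scheme in SKIP_SCHEMES:
        return None
    if parts.path.endswith(SKIP_SUFFIXES):
        return None
    return url                        # normalized URL, safe to queue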

Check URLs.
Don't assume the HTML documents you are going to get back are sensible. When scanning for URLs be wary of things like <A HREF="http://somehost.somedom/doc">. Many sites don't put the trailing "/" on URLs for directories--a naive strategy of concatenating the names of sub-URLs can result in bad names.
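Python's urljoin already implements the standard resolution rules, so a robot can lean on it instead of concatenating names by hand; the URLs below are illustrative:

# Resolve relative links against the page they came from.
from urllib.parse import urljoin

base = "http://somehost.somedom/doc"          # note: no trailing "/"
print(urljoin(base, "chapter1.html"))         # http://somehost.somedom/chapter1.html
print(urljoin(base + "/", "chapter1.html"))   # http://somehost.somedom/doc/chapter1.html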

Check The Results.
Check what comes back. If a server refuses a number of documents in a row, check what it is saying. It may be the server refuses your retrieval of these things because you're using a robot.

Don't Loop or Repeat.
Remember all the places you have visited so you can check that you're not looping. Check to see if the different addresses you have are not in fact the same machine (e.g., "web.nexor.co.uk" is the same machine as "hercules.nexor.co.uk"--both names resolve to 128.243.219.1) so you don't visit the same site again. This is imperative.
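One way to sketch that check is to resolve each hostname to an address and compare addresses rather than names (the hostnames cited above may of course resolve differently today):

# Treat two hostnames as the same site when they resolve to the same address.
import socket

def same_host(a, b):
    try:
        return socket.gethostbyname(a) == socket.gethostbyname(b)
    except socket.gaierror:
        return False               # unresolvable; assume they differ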

Run At Opportune Times.
On some systems, there are preferred times of access when the machine is only lightly loaded. If you plan to do many automatic requests from one particular site, check with its administrator(s) regarding the preferred time of access.

Don't Run It Often.
People differ on the acceptable frequency of visits, but I'd say once every two months is probably too often. Also, when you re-run it, make use of your previous data: you know which URLs to avoid. Make a list of volatile links (like the "what's new" page and the meta-index). Use this to get pointers to other documents and concentrate on new links--you will get a high initial yield this way, and if you stop your robot for some reason at least it will have been time well spent.

Don't Try Queries.
Some WWW documents are searchable (ISINDEX) or contain forms. Don't follow these. The Fish Search does this, for example, and may result in a search for "cars" being sent to databases with computer science PhDs, people in the X.500 directory, or botanical data. Not sensible.

Stay With It.
It is vital you know what your robot is doing and that it remains under control.

Log.
Make sure it provides ample logging, and it wouldn't hurt to keep certain statistics, such as the number of successes/failures, the hosts accessed recently, the average size of recent files, etc., and keep an eye on them. This ties in with the "Don't Loop or Repeat" concept--you need to log where you have been to prevent looping. Again, estimate the required disk space; you may find you can't cope.
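A sketch of such bookkeeping, with an arbitrary log file name and an arbitrary choice of counters:

# Ample logging plus a few running statistics.
import logging
from collections import Counter

logging.basicConfig(filename="webtreader.log", level=logging.INFO)
stats = Counter()

def record(url, ok, size):
    stats["success" if ok else "failure"] += 1
    stats["bytes"] += size
    logging.info("fetched %s ok=%s size=%d", url, ok, size)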

Be Interactive.
Arrange for methods to guide your robot. Commands that suspend or cancel the robot, or make it skip the current host can be very useful. Provide frequent checkpoints for your robot. This way you don't lose everything if it crashes.

Be Prepared.
Your robot will visit hundreds of sites. It will probably upset a number of people. Be prepared to respond quickly to their enquiries and tell them what you're doing.

Be Understanding.
If your robot upsets someone, instruct it not to visit their site, or only the home page. Don't lecture them about why your cause is worth the increased server load, because they probably aren't interested in the least. If you encounter barriers that people put up to stop your access, don't try to go around them.

Okay, so you are using the resources of a lot of people to do this. Give something back:

Keep Results.
This may sound obvious, but think about what you are going to do with the retrieved documents. Try to keep as much info as you can store. Often analysis will require information that doesn't seem useful at first.

Raw Result.
Make your raw results available via FTP, the Web or some other way. That way other people can use the information, and they won't need to run their own robots.

Polished Result.
You are running a robot for a reason--probably to create a database or gather statistics. If you make these results available on the Web, people are more likely to think it worthwhile. And you might get in touch with people whose interests are similar.

Report Errors.
Your robot might come across dangling links. You might as well publish them on the Web somewhere (after checking that they really are invalid URLs). If you are convinced they are in error (as opposed to restricted), notify the administrator of the server.


Jerry Ablan is president of NetGeeks, Inc., a Chicago-area web site consulting and design firm. He is also the co-author of The Web Site Administrator's Survival Guide (published by Sams.net, ISBN 1-57521-018-5).