It was only a matter of time. Because computers collect information much better than people do, the next logical evolution of the World Wide Web (WWW) had to be the robot. A robot can quickly catalog and store information about a WWW link and move on to the next. This is mainly because robots do not display the information, nor do they load the images. They simply retrieve pages, catalog their contents, and move on. This article traces the history of the web robot and offers some insight into how to create one.
by Jerry Ablan
A web robot, at its core, is simply a non-displaying Web browser. By virtue of being essentially a browser, it knows how to speak HTTP to servers and how to dissect URLs. This is quite helpful on a jaunt through Cyberspace.
Web robots can be written in nearly any language, though most are written in lower-level languages (like C and C++). However, according to the "List of Known Robots" Web page (URL below), quite a few are written in Perl. Some other languages that have been used to write robots are Tcl/Tk and Python.
Generally, if a language has network connectivity and parsing abilities, it is a good candidate for robot creation. With programming ease, however, comes much overhead. As your robot becomes more complex, you may outgrow your chosen language.
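Python, for instance, has both. Here is a minimal sketch of the two core skills described above--dissecting a URL and retrieving a document without displaying it--using only Python's standard library (the URL is a placeholder):

import urllib.parse
import urllib.request

url = "http://www.example.com/index.html"   # a placeholder starting point

# Dissect the URL into its component parts, as any browser must.
parts = urllib.parse.urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# Speak HTTP to the server and retrieve the raw document --
# no display, no image loading, just the text of the page.
with urllib.request.urlopen(url) as response:
    html = response.read().decode("latin-1")
print("retrieved", len(html), "characters")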
Web robots can be simple or complex--depending on the intentions of the author and what the robot is designed to do. Most robots are simple wanderers, surfing links only to discover resources in webspace. Other, more complex, robots construct elaborate keyword databases of links and present them to their users. WebCrawler is an excellent example of a complex robot. Robots can be used for other purposes such as link validation and site-mirroring.
To exemplify the workings of a Web robot, I describe in this article a fictitious robot called "WebTreader" (WT). WT is a simple robot whose only purpose in life is link discovery. It traverses web sites and keeps a database of the links pointed to by these pages. It is nothing more than a program that retrieves, parses, and stores URLs in a database.
The database stores information from its encounters on the Web. Uses for a resource discovery database like this are many. It can be used as the basis of a search utility or to generate random links.
Currently, the database consists of a single table. The table holds information about the places the WT robot has been and where it needs to go. Every site it visits is recorded in this table, which can be kept in a flat text file or even in a relational database such as Oracle.
Each record in the table contains information about a single URL. At minimum, the table should store the URL itself and a flag indicating whether that URL has been visited.
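As a sketch, that single table might look like this in SQLite (the table and column names here are my own invention; the article does not prescribe a schema):

import sqlite3

# A minimal version of WT's single table: the URL itself plus a
# visited flag.
conn = sqlite3.connect("webtreader.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS urls (
        url     TEXT PRIMARY KEY,   -- the URL of the page
        visited INTEGER DEFAULT 0   -- 0 = not yet visited, 1 = visited
    )
""")
conn.commit()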
Future enhancements to WT could include keyword storage. It would be nice if WT compiled a set of keywords that it can associate with a web site. These keywords could then be used to search the database.
Our fake robot, WT, is quite simple. You pass it a URL on the command line and it discovers all possible links from there on out into the Web. The discovered URLs will form a tree of sorts. This may be hard to visualize, so an example is in order.
Imagine that WebTreader is making a database of the files stored on your hard disk. Think of the starting URL as the root directory on your hard drive. Each file and directory name it finds is stored in the database.
All directories found are marked, or tagged, as not yet visited. The program then pulls the first unvisited directory from the database and searches it, then the next unvisited directory, and so on.
After all directories have been searched, the files within them cataloged, and every avenue for discovering files on your hard drive exhausted, the program ends.
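The analogy translates almost line for line into code. Here is a sketch, starting from the current directory rather than the disk root:

import os

to_visit = ["."]   # the "starting URL": here, the current directory
catalog = []       # the database of every name discovered

while to_visit:
    directory = to_visit.pop(0)          # first unvisited directory
    try:
        names = os.listdir(directory)
    except OSError:
        continue                         # skip unreadable directories
    for name in names:
        path = os.path.join(directory, name)
        catalog.append(path)             # record every file and directory
        if os.path.isdir(path):
            to_visit.append(path)        # tag directories as unvisited

# When no unvisited directories remain, every avenue is exhausted.
print(len(catalog), "entries cataloged")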
When WebTreader runs in webspace, instead of your hard disk directory, it generates a database of web pages. A web page that contains links is much like a directory on your hard drive. It is marked as unvisited and will be cataloged later.
The bulk of the robot program is in the URL processing routine. Here it retrieves the document pointed to by a URL, strips out all the links within the document, stores the original URL in the database as having been visited, and finally retrieves the next unvisited URL from the database.
As you can see, the life cycle is short but sweet. The robot does nothing but parse, retrieve, and store. After running for only a few minutes, your robot can catalog quite a bit of data!
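Here is one way that processing routine might look in Python, reusing the hypothetical urls table sketched earlier (the function and class names are my own, not from Listing 1):

import sqlite3
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag in a document."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def process_url(conn, url):
    # Retrieve the document pointed to by the URL.
    try:
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("latin-1")
    except (OSError, ValueError):
        html = ""                 # an unreachable page yields no links
    # Strip out all the links within the document.
    extractor = LinkExtractor()
    extractor.feed(html)
    for link in extractor.links:
        absolute = urllib.parse.urljoin(url, link)
        # Newly discovered URLs enter the table tagged as unvisited.
        conn.execute("INSERT OR IGNORE INTO urls VALUES (?, 0)", (absolute,))
    # Store the original URL in the database as having been visited.
    conn.execute("UPDATE urls SET visited = 1 WHERE url = ?", (url,))
    conn.commit()

def next_unvisited(conn):
    # Retrieve the next unvisited URL, or None when the robot is done.
    row = conn.execute("SELECT url FROM urls WHERE visited = 0").fetchone()
    return row[0] if row else None

Seeding the table with the command-line URL and calling process_url on each next_unvisited result until it returns None is all the driver loop needs.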
When most software clients, or "agents," connect to a Web server, they leave a digital signature, passed along in the HTTP request header field User-Agent. This field allows the client to send additional information about itself and the request to the server.
So if you think you've been visited by a robot, or just want to know, check your log files. If your web server supports the User-Agent logging facility (NCSA and Apache do), it should be easy! Below is a sample of the User-Agent log from an NCSA server:
[11/Dec/1995:11:00:28] SPRY_Mosaic/v7.36 (Windows 16-bit) SPRY_package/v4.00
[11/Dec/1995:11:00:28] Mozilla/1.1N (X11; I; SunOS 5.3 sun4c)
[11/Dec/1995:11:00:29] Mozilla/1.22 (Windows; I; 16bit)
[11/Dec/1995:11:00:30] Mozilla/1.0N (Windows)
[11/Dec/1995:11:00:31] NCSA Mosaic for the X Window System/2.4 libwww/2.12 modified
[11/Dec/1995:11:00:32] MacWeb/1.00ALPHA3 libwww/2.17
Another way to identify robots is in their request pattern. If you find many files being requested in a very small amount of time, it is a good bet they are requests from a robot. Lastly, check for repeated requests for a file called "/robots.txt". This file is the centerpiece of a robot exclusion standard proposed by Martijn Koster (email@example.com) in 1994.
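As a sketch, assuming the NCSA common log format and a log file named access_log (both assumptions; adjust for your server), counting "/robots.txt" requests per client host looks like this:

from collections import Counter

robot_suspects = Counter()
with open("access_log") as log:
    for line in log:
        fields = line.split()
        # In the common log format, field 6 is the requested path.
        if len(fields) > 6 and fields[6] == "/robots.txt":
            robot_suspects[fields[0]] += 1   # field 0 is the client host

for host, count in robot_suspects.most_common():
    print(host, count)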
In addition to the User-Agent field, there is a From field. This field usually contains the e-mail address of the person controlling the client software. Most robots place the e-mail address of the creator in this space. This information can be stored in your web server log files.
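From the robot's side, filling in both fields takes only a few lines. A sketch (the robot name and address are placeholders):

import urllib.request

request = urllib.request.Request(
    "http://www.example.com/",
    headers={
        "User-Agent": "WebTreader/1.0",          # the robot's signature
        "From": "webmaster@your-site.example",   # who is responsible for it
    },
)
with urllib.request.urlopen(request) as response:
    page = response.read()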
If you want to keep robots away from your site, or from a part of your site, you can use the "/robots.txt" file, part of a standard for robot exclusion proposed by Martijn Koster. The format of this file is simple. The assumption is that all robots are allowed except for those that appear in this special file.
Here is a sample "/robots.txt" file:
# Example /robots.txt file
User-agent: Fish-Search-Robot
Disallow:

User-agent: MOMspider/1.00 libwww-perl/0.40
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
The User-Agent line identifies the robots to act upon, and the Disallow field immediately following it specifies the directories from which the robot is restricted. Multiple Disallows may be specified.
The first example disallows nothing for the "Fish-Search-Robot". The second example disallows all access for the "MOMspider/1.00 libwww-perl/0.40" robot.
The last example is a special case. The "*" is not a file wildcard in this specification, but instead identifies all remaining User-Agents. So, the last example disallows access to the /tmp and /logs directories for all remaining User-Agents.
The "/robots.txt" file is just one way to keep out unwanted robots. However, the robot software must support the protocol. Some web servers, like Apache and NCSA, allow you to configure different actions based on the User-Agent field received from the client. Check your web server documentation for more information about this feature.
If you are like me, all this talk of robots makes you want to write one. An incomplete skeleton of WT is included in Listing 1. A robot is fairly simple to code, and once the base is completed, it is simple to build upon. There are just a few things to keep in mind before and during the creation of your robot.
Martijn Koster of WebCrawler maintains a list of guidelines for web robot authors, created by Martijn with contributions from Jonathon Fletcher, Lee McLoughlin, and others. The following is a summary of that information:
Suspend the robot when you're not around for a number of days--like over the weekend--and run it only when you are present. Yes, it may be better for the performance of your machine to run it overnight, but that implies you aren't thinking about the performance overhead on other people's machines. Yes, it will take longer for the robot to run, but this is more an indication that robots are not the way to do things anyway than an argument for running it continually--after all, what's the rush?
Robots consume a lot of resources. To minimize the impact, keep the following in mind: spread your retrievals out rather than hammering a single server with rapid-fire requests, retrieve only the kinds of documents your robot can actually process, and avoid requesting the same document twice. A mandatory pause between retrievals, as sketched below, is a simple courtesy.
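Here is one way to enforce such a pause (the one-minute gap is my own placeholder, not a published figure):

import time

last_request = 0.0

def polite_pause(minimum_gap=60.0):
    """Sleep as needed so successive retrievals are at least
    minimum_gap seconds apart; call this before each request."""
    global last_request
    wait = minimum_gap - (time.time() - last_request)
    if wait > 0:
        time.sleep(wait)
    last_request = time.time()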
Okay, so you are using the resources of a lot of people to do this. Give something back: make the results of your robot's travels available to the Web community, and report any dead links or server problems you discover to the webmasters who can fix them.