
checklinks - Maintaining Web Integrity



If keeping your HREFs straight is driving you crazy, consider this script to help keep things under control.

by Michael Alan Dorman


The Problem
The World Wide Web (WWW) has been steadily increasing in popularity and size for the last two years. Web page design has become much more sophisticated, and people are taking much greater advantage of the features that the HyperText Markup Language (HTML) makes available.
Unfortunately, the feature that HTML is best known for--the ability to link documents in a non-hierarchical fashion--is the one that often causes the majority of headaches for WWW administrators and implementors.
If you make a typo while editing an HTML document and a link is incorrect, you suddenly have a mailbox full of messages asking, "Why didn't the link from X to Y work?"
It is especially problematic if your organization has given responsibility for different parts of your Web site to different parties. In this scenario, a document, or even whole hierarchies of documents, can be moved or renamed and the person doing so will inevitably forget to mention it--and once again, your mailbox begins to fill as the broken links are found by your users.

The Solution

Our solution for the University of Miami School of Medicine was to run a program to check the integrity of our links on a regular basis--at least once a day, if not more. Unfortunately, we couldn't find a program that quite met our requirements (though a few came close):
  • Able to check both on- and off-site links
  • Able to be run from a cron script
  • Fast and small

So I wrote one during my off-hours. The result is checklinks, available at http://lot49.med.miami.edu/~mdorman/checklinks.html. It is written for Perl 5, though it might work with Perl 4. It can check both on- and off-site links. Its output reflects the fact that it's running under cron--it only reports errors (it can be configured to be more verbose if desired).
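For instance, a crontab entry along these lines (the installation path is hypothetical) runs the check early every morning, and cron mails whatever output there is--that is, only the errors--to the owner of the crontab:

# Hypothetical crontab entry: check links at 4:15 every morning
15 4 * * * /usr/local/bin/checklinks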

Anatomy of checklinks

Basic Structure

Conceptually, checklinks is a ridiculously simple program.
You need only tell it the root of your document hierarchy, the corresponding URL and where in the document hierarchy to start. It then opens the document you specified, parses all the URLs from the document and does the same thing to each of those documents in turn--if the URL points off-site, it checks only whether the document is still accessible. If it tries to open a document and gets an error, it reports it.
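The rest of this article walks through that logic in detail. As an overview, here is a condensed sketch of the recursion--not the actual source; canonicalize and URL_to_file are hypothetical stand-ins for inline code described below:

# A condensed sketch of the recursion, not the actual checklinks source;
# canonicalize() and URL_to_file() stand in for inline code shown later
sub check_URL {
    local ( $current_URL, $href, $parse_contents ) = @_;
    # Turn the HREF into an absolute URL, then into a file name
    local ( $new_URL ) = canonicalize ( $current_URL, $href );
    local ( $new_file ) = URL_to_file ( $new_URL );
    # Off-site links just get probed for accessibility
    return check_nonlocal ( $new_URL, $current_URL )
        if $new_file =~ m|^http://|i;
    # A missing local file is a broken link
    if ( ! -e $new_file ) {
        $bad_URLs{$new_URL} = $current_URL;
        return 1;
    }
    # Otherwise parse the document and check each link it contains
    if ( $parse_contents ) {
        foreach ( extract_links ( $new_file ) ) {
            check_URL ( $new_URL, $_, 1 );
        }
    }
    return 0;
}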

Configuration

The program must configure itself before anything else can be done. This is done in the configuration section at the beginning of the program, shown below. The listing is taken from checklinks version 1.6; other versions may differ slightly:
# URL of first directory to check
# (corresponds to $check_dir)
local ( $check_URL ) = "http://lot49.med.miami.edu/";
# Server's base URL (corresponds to
# $server_dir)
local ( $server_URL ) = "http://lot49.med.miami.edu/";
# Host directory of $server_URL
local ( $server_dir ) = "/var/lib/httpd/documents/";
# List of possible directory indices
# in preferential order
local ( @index_files ) = ( "index.html" );
# List of directories to skip
local ( @exclude_dirs ) = (
    "/var/lib/httpd/documents/catalog/alpha",
    "/var/lib/httpd/documents/catalog/subject" );
# Set this to 0 if you don't want remote
# processing by default
local ( $check_remote ) = 1;

The first variable, $check_URL, sets the URL from which you typically want checklinks to start checking links. It's usually the root of your server, but it could start several levels down.
$server_URL and $server_dir must point to the root URL of your server and the directory to which it corresponds, respectively. If you don't set these correctly, checklinks cannot work correctly--it uses these to match URLs to files so it can read them and check them.
@index_files is an array of names for documents that might satisfy a reference ending in a /. They are checked in the order you specify them. Make sure this lookup order matches the one your httpd server uses, or you might get unexpected results.
@exclude_dirs is an array that specifies directory trees that should be ignored. Nothing below a directory specified this way will be checked, so use with caution.
Finally, $check_remote determines whether you will check the presence of remote URLs by default.
All the values are overridable on the command-line. Just run checklinks with the --help switch to see how to set them.

The Main Routine

The main routine is called check_URL. It is initially called from near the start of the program, like so:
# This is the meat of the program
check_URL ( $server_URL, $check_URL, 1 );

Both $server_URL and $check_URL are configuration variables from above. This routine is meant to be called recursively. It takes the canonical URL for the current document, along with the HREF that's being checked. If you pass it a third parameter, it will also parse the document the HREF resolves to and check the links it finds within.
After a little preparation, it copies $check_URL into the local variable $new_URL for manipulation purposes, and then gets to work.
The first step is for checklinks to turn a given URL into a "canonical" one, which is to say, it turns it into an URL that includes all necessary parts in a non-relative form.
As part of this process, it discards non-http: URLs by looking at the beginning of the URL:
# We don't do mailtos, gopher, ftp
# or telnet
if ( $new_URL =~ m|mailto:|i ||
     $new_URL =~ m|gopher:|i ||
     $new_URL =~ m|ftp:|i ||
     $new_URL =~ m|telnet:|i ) {
    $skipped_URLs{$new_URL} = $current_URL;
    # Return non-error status
    return 0;
}
It also doesn't check URLs that refer to links within the current document:
# We also don't do inter-document
# checks right now
elsif ( $new_URL =~ m|^#|i ) {
    $skipped_URLs{$new_URL} = $current_URL;
    # Return our non-error status
    return 0;
}

If it's an absolute reference (one that begins with a slash), all we need to do is add the server's base URL to the existing string:
# Look for a server based reference
elsif ( $new_URL =~ s|^/||i ) {
    # Just add the server's base URL
    $new_URL = $server_URL . $new_URL;
}

At this point, we have to be looking at a relative reference, so we have to figure out what the absolute reference would be by working with the URL of the current document and the URL in the link. After figuring out the directory of the current URL (rather than the filename), we can try to get rid of '..' references in the new URL:
   # While there are backward relative
   # reference marks (..)
   while ( $new_URL =~ m|\.\./|i ) {
      # Get rid of the last section of the
      # current URL (using .+ ensures that
      # _something_ will be removed)
      $working_URL =~ s|(.*/)(.+)$|$1|;
      # Get rid of the .. section
      # of the new URL
      $new_URL =~ s|(^\.\./)(.*)$|$2|;
   }
   # Tack the two URLs together
   $new_URL = $working_URL . $new_URL;
}

We don't at present worry about intra-document references, so we just get rid of them:
# Get rid of any subsection 
# (#WHATEVER) refs
$new_URL =~ s/(.*)(#.*)/$1/g;

Then we have to try to move from the URL to a filename. We try to do this by substituting the server's root directory for the server's root URL in our URL string (which has been copied to $new_file by this time):
# Do a string substitution
$new_file =~ s/$server_URL/$server_dir/i;

If we still have an http:// reference at the beginning, it wasn't a local URL, so we have to use our remote URL checking code:
# Check to see if there's still
# an http:// ref (means it's not our server)
if ( $new_file =~ m|http://|i ) {
    # If we're supposed to check
    # off-server links
    if ( $check_remote ) {
        # Check the link
        return ( check_nonlocal ( $new_URL, $current_URL ) );
    }
    # If we're not supposed to check
    else {
        $skipped_URLs{$new_URL} = $current_URL;
        return 0;
    }
}

If we did successfully substitute a directory for the server part of the URL, we can then check the appropriate file. If the URL ended in /, we need to figure out what file to look at, using the @index_files configuration parameter.
# If the filename ends with
# / (this must be last, because we must
# have all else resolved before we
# try substituting the various index
# file names)
if ( $new_file =~ m|/$| ) {
    # Be prepared to try each possible
    # index file
    INDEXES: foreach ( @index_files ) {
        # Add each suffix in turn
        $test_file = $new_file . $_;
        # If the file exists
        if ( -e $test_file ) {
            # Assign appropriately
            $new_file = $test_file;
            # Exit
            last INDEXES;
        }
    }
}

Next we have to make sure we're supposed to be checking stuff in this section of the tree. To do this, we see if any of the exclude directories in the @exclude_dirs configuration variable can be found at the beginning of the path.
# Look at each exclude dir
foreach $exclude ( @exclude_dirs ) {
    # If that file is under that directory
    if ( $new_file =~ m/^$exclude/ ) {
        # Remember we skipped this one
        $skipped_URLs{$new_URL} = $current_URL;
        return 0;
    }
}

Assuming the file whose name we so painstakingly constructed does in fact exist, and assuming we are supposed to check the links inside the file, we make sure that file isn't already being checked (for fear of an infinite recursion loop), then parse the file and call ourselves once again:
# As long as we're not already in the
# midst of checking that URL
if ( ! $being_checked{$new_file} ) {
    # List it as being checked
    $being_checked{$new_file}++;
    # Make sure it's an HTML file
    if ( $new_file =~ /\.html$/ ||
         $new_file =~ /\.htm$/ ) {
        # Get all the URLs from the page
        @hrefs = extract_links ( $new_file );
        # For each URL
        foreach ( @hrefs ) {
            # Check the URL
            check_URL ( $new_URL, $_, 1 );
        }
    }
}
You'll notice that we've stored status information in several associative arrays--that's how checklinks keeps track of everything it has seen.
As a result, once all the links in a document have been checked, we just return to our calling routine--we don't have to worry about returning specific values to indicate status, because that information has already been recorded as it was found.
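The reporting pass isn't shown in this walk-through, but given those hashes it can be as simple as a loop at the end of the run (a sketch, not the actual source):

# A sketch of an end-of-run report built from the status hashes;
# under cron, only broken links produce any output at all
foreach $URL ( sort keys %bad_URLs ) {
    print "Broken link: $URL (referenced from $bad_URLs{$URL})\n";
}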

The URL Extractor

The URL extraction routine, extract_links, is fairly small and easy to understand--a testament to Perl's simple-but-powerful string-handling routines.
It first reads the entire file into a single string:
# Open the file
open ( INPUT, $filename );
# Grab the entire file
while ( <INPUT> ) {
   # Append
   $file .= $_;
}

It then removes all line-feeds (to make the extraction of URLs simpler):
# Change all \n to ws
$file =~ s/\n/ /gi;

And then extracts all the URLs in a single line of code:
# Return our array of hrefs
return $file =~ m/<\s*a\s+href\s*=\s*"(\S+)"\s*>/gis;

Let's face it--extract_links is at best an idiot routine; it doesn't know the first thing about HTML. For instance, it will happily extract a URL you have in a <PRE></PRE> block, even though it shouldn't.
However, the upside is simplicity--and I suspect it will work for 99% of all the cases out there.

Checking Remote URLs

Checking the remote sites is fairly easy, thanks to Perl's powerful (if somewhat cryptic) sockets access.
After separating the site, port and filename information, we do some socket initialization (like finding the IP address for a named host) and make the connection:
  # Try to connect to the host
  if ( connect( S, $hostpack ) ) {
     # Set our socket to be unbuffered
     select ( S );
     $| = 1;
     # Go back to our default output
     select ( STDOUT );
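The initialization that produces $hostpack isn't shown above; a rough sketch of that step, using the functions from Perl 5's standard Socket module ($host and $port are assumed to have been parsed from the URL), might be:

use Socket;

# Find the IP address for the named host and build the packed
# address structure used by connect()
$ip = inet_aton ( $host );
$hostpack = sockaddr_in ( $port, $ip );
# Create a TCP socket to connect with
socket ( S, PF_INET, SOCK_STREAM, getprotobyname ( 'tcp' ) );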

We then request the header information about the document, just by printing to our socket:
     # Just request the head of the document
     print ( S "HEAD $file HTTP/1.0\r\n\r\n" );

Receiving the information from the socket is just as easy--you read it just as you would a file:
     # Get the first line of response
     $received_line = <S>;

We split our response (if we got one) into its constituent parts:
      # Parse it
      if ( defined ( $received_line ) &&
           $received_line =~ m|^HTTP/1\.0\s*(\d+)\s*(.*)$| ) {

We then look at the first digit of the result code (the value in $leader) to figure out whether our request was successful, and record the appropriate status information:
        # If it's success
        if ( $leader eq 2 ) {
           $good_URLs{$URL} = $caller;
        }
        # If it's redirection
        elsif ( $leader eq 3 ) {
           $bad_URLs{$URL} = $caller;
        }
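The listing doesn't show where $leader comes from; presumably it is the first digit of the three-digit status code captured by the regular expression, something like:

# Hypothetical: take the first digit of the captured status code
$leader = substr ( $1, 0, 1 );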

And that's it. There's a little more error-checking and housekeeping, and I run Perl with the -w switch, so there's the occasional bit of extra care to quiet the warnings, but other than that, you've just seen the whole program.

The To-Do List

Support for User Directories

We don't use user directories on our web server at this time (other than as the occasional testing grounds), so checklinks hasn't been programmed to cope with them. It shouldn't be hard to do. checklinks would have to be told what the user directories are--Perl can figure out the home directory for a given userid, so it would only have to know the subdirectory name, such as html or public_html. The URL-decoding would have to have some logic added to check for ~ and handle it appropriately.
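As a rough sketch (assuming a public_html subdirectory; $path is a hypothetical variable holding the path portion of the URL), the translation might look like:

# Hypothetical: map /~user/rest-of-path onto the user's public_html
# directory ($path is assumed to hold the path portion of the URL)
if ( $path =~ m|^/?~([^/]+)/?(.*)$| ) {
    # getpwnam returns the home directory as its eighth element
    $home_dir = ( getpwnam ( $1 ) )[7];
    $new_file = "$home_dir/public_html/$2";
}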

A Report of Files That Aren't Referenced

Almost as useful as knowing about broken links is knowing about unmade links--that information you created and put in place but never made a link to. Using Perl 5's File::Find module, it shouldn't be hard to cross-reference all the files in the web server's directories with ones that are actually referred to somewhere or other.
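A rough sketch of that cross-reference, assuming the %being_checked hash (keyed by file name during a normal run) is still available, might be:

use File::Find;

# Walk the whole document tree and flag any HTML file that was
# never reached while following links (a sketch, not real code)
find ( sub {
    $name = $File::Find::name;
    if ( $name =~ /\.html?$/ && ! $being_checked{$name} ) {
        print "Never referenced: $name\n";
    }
}, $server_dir );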

A Total Re-Write of the Internal Mechanisms

From a design standpoint, the source code would probably get cleaned up immensely if we accessed all documents using HTTP, instead of requiring such knowledge of the particular web server's setup. (What can I say--when I first started this project, I didn't intend to check off-site links, so it seemed natural to just work at a file level.) As a result, we wouldn't have to special-case user directories or off-site URLs. Most of the URL-to-filename translation code would go away. Moving responsibility onto that code would also make it easier to add support for other types of links. Maybe when I re-write it in C++.
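For instance, the libwww-perl (LWP) package already provides an HTTP-level HEAD check; a minimal sketch of checking any URL, local or remote, the same way might be:

use LWP::Simple qw(head);

# In scalar context, head() is true only if the server answered
# the HEAD request with a success code (a sketch, not the author's code)
if ( ! head ( $URL ) ) {
    $bad_URLs{$URL} = $caller;
}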

Mike Dorman is Head of Systems for the Louis Calder Memorial Library at the University of Miami School of Medicine (http://www.med.miami.edu). As such, he gets paid for writing Perl and HTML, occasionally working on the Debian GNU/Linux Distribution and surfing the Web, all the while seeing that the computer problems get resolved. He can be reached at mdorman@lot49.med.miami.edu.