Implementing A Simple, Automated, Text-Only View

Present a fast-loading, convenient, text-only version of your site.

by Eric Hammond

When HTML was originally designed, inline images were not part of the plan. The addition of inline images by early browsers was likely one of the critical moves which caused the World Wide Web to explode in popular usage. However, there are situations where inline images can hinder a site's ease of use. The first and most obvious is when a visitor uses a text-based browser such as the popular lynx Unix client. Another is when a user is connected through a slow modem and is in a hurry.

ALT Attributes Of The IMG Tag

If, as a web site maintainer, you are trying to be accessible to the entire range of clients (including the text-based browsers), you can generally achieve this by adding ALT text attributes to every IMG tag. Text-based web clients display the alternate text in place of the image. If you leave out the ALT attribute, the client may simply display a placeholder which can clutter up the screen without giving the reader any useful information. For example, a poorly designed screen might look like the following with lynx:

	[image]
	[image] [image] [image]
	Thanks for visiting my [image] site!
	[image] [image]

Adding appropriate ALT attributes can turn this into something very readable:

	My Home Page
	[About me] [Favorite links] [Resume]
	Thanks for visiting my fantastic site!
	[My picture] [Send me feedback]

Text-Only Views

Now, let's turn to the problem of users with low bandwidth connections. Many graphical browsers offer the option of turning off the autoloading of inline images, which speeds things up considerably, but some of these browsers leave the screen unreadable and sometimes even unusable if they don't display the ALT text. This is most serious when your site uses images to display textual information such as page titles or buttons and menu options for the user's selection.

One option which can be seen on many of the major web sites is to provide a "text-only" view to be used by folks who want to bypass the download time of the inline images. Some of these sites provide only a text-only version of the home page which often contains an image map, but if you truly want to accommodate users on slow lines, you may wish to extend the view much further.

Start thinking about manually creating an entire duplicate text-only view of a site and you will realize that with any significant content volume this could be a nightmare to maintain, especially if you use CGI scripts or other advanced page creation mechanisms and you want text-only views of these results as well.

At SDRC we wanted to provide users of our Web site with a text-only view which encompassed the whole site: currently about 800 HTML pages as well as pages generated by CGI scripts, including search results and form submittals. We had neither the people, the time, nor the inclination to manually maintain a duplicate text-only hierarchy, so our only option was to find a way to generate this view automatically from our existing pages, which are filled with inline images (see Figures 1 and 2).

As it turns out, we found a method which did not require any modification to our existing pages, worked transparently for the user, and was so straightforward to implement that it was written in less time than this article. There are a number of cautions, limitations, and special situations which are not handled by this method, but before we get to them, let's look at how it works.

Creating A Text-Only View Using A CGI Filter

The SDRC text-only view is implemented as a Perl CGI script (called nph-text-filter) which is mapped to the path /text in the server's configuration files. This means that when a web client requests any URL which starts with /text, the server runs the CGI script supplying it with the rest of the path text through the $PATH_INFO environment variable (see Figure 3). Our CGI script makes a secondary HTTP connection back to the same server requesting the real HTML page, replaces inline images with the ALT text, and returns this text-only page to the client. Figure 4 gives a representation of the steps involved.

Figure 4
    ______                ______              _______________
   |      | ---(1)--->   |      | ---(2)---> |               |
   |client|              |server| <--(3)---- |nph-text-filter|
   |      |              |______| ---(4)---> |               |
   |      |                                  |               |
   |______| <-------------(5)--------------- |_______________|

  1. Original HTTP request from Web browser to Web site for "/text/path..."
  2. Server runs CGI script providing it with $PATH_INFO of "/path..."
  3. CGI script connects back to server and sends secondary HTTP request for "/path..."
  4. Server responds with full, original HTML page (including IMG tags).
  5. CGI script responds to Web browser with a text-only version of the HTML page.

We are using the Apache httpd server on SunOS 4, but the same technique will work using NCSA httpd, Netscape Communications/Commerce server, and likely other servers as well. Note I said technique, not code. If you plan to migrate this code to another platform you should have some familiarity with Perl programming, CGI scripts, and your platform.

To see the nph-text-filter in action, drop by the SDRC web site at http://www.sdrc.com/. Select the "Text-Only" hyperlink towards the bottom of the page and browse around. Select [Graphics] at the bottom of any text-only page to return to the graphics-rich version of that page for comparison.

Drawbacks Of This Method

The nph-text-filter CGI script was intended to provide a particular service at a particular site. It will not be suitable for all sites. For one thing, this method adds extra processing to each request for a text-only page. This additional CPU usage may not be acceptable on a site with extremely high request volume.

Since the CGI script makes a secondary request back to the server for the page to filter, each text-only page request really generates two entries in the access log files. If you are generating page statistics for your site, you will likely want to exclude all requests for documents beginning with "/text".
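As a rough illustration of that exclusion, the following Perl sketch drops requests for the text-only view from an access log before any statistics are gathered. It assumes log lines in the Common Log Format; the sample lines, host names, and the is_text_view_request helper are invented for this example and are not part of nph-text-filter.

```perl
use strict;
use warnings;

# True if a Common Log Format line records a request for the text-only
# view. The request is the quoted field: "METHOD /path PROTOCOL".
sub is_text_view_request {
    my ( $line, $prefix ) = @_;
    return 0 unless $line =~ m#"[A-Z]+\s+(\S+)#;
    my $url = $1;
    # Match "/text" itself or anything below "/text/", but not "/textbooks".
    return $url =~ m#^\Q$prefix\E(?:/|\z)# ? 1 : 0;
}

# Made-up sample log lines.
my @log = (
    'pc1.example.com - - [10/Oct/1996:13:55:36 -0400] "GET /text/products/ HTTP/1.0" 200 2326',
    'www.example.com - - [10/Oct/1996:13:55:36 -0400] "GET /products/ HTTP/1.0" 200 4120',
    'pc2.example.com - - [10/Oct/1996:13:56:02 -0400] "GET /textbooks.html HTTP/1.0" 200 1187',
);

# Print only the lines that are not requests for the text-only view.
print "$_\n" for grep { !is_text_view_request( $_, "/text" ) } @log;
```

Anchoring the prefix test at a slash or the end of the URL keeps an unrelated document such as /textbooks.html in the statistics.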

One of the biggest problems we've had with this technique at SDRC is the fact that this method does not provide support for user authentication. The server does not provide CGI scripts with information like username/password, so the nph-text-filter script cannot pass this back to the server on the secondary request. There may be some way around this, but for now, folks at our site must put up with the graphics in the few areas which require authentication.

If you use extensive image mapping on your site, you will have to add additional filtering to the program. Since the information required to handle an image map request is known only to the server, there is no obvious textual equivalent for an image map. You might consider adding hyperlink information in an HTML comment near the imagemap and using this to generate textual links in the CGI script.
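One way that comment idea might look is sketched below. The maplink comment format and the expand_maplinks helper are hypothetical conventions made up for this example; the script in Listing 1 does not implement them.

```perl
use strict;
use warnings;

# Hypothetical convention: next to each image map, the page author adds
# comments of the form
#   <!-- maplink href="/products/" text="Products" -->
# This function turns each such comment into an ordinary text hyperlink.
sub expand_maplinks {
    my ($html) = @_;
    $html =~ s{<!--\s*maplink\s+href="([^"]*)"\s+text="([^"]*)"\s*-->}
              {[<A HREF="$1">$2</A>]}gi;
    return $html;
}

# Made-up sample page fragment.
my $page = <<'EOM';
<A HREF="/cgi-bin/imagemap/nav"><IMG SRC="nav.gif" ALT="" ISMAP></A>
<!-- maplink href="/products/" text="Products" -->
<!-- maplink href="/support/" text="Support" -->
EOM

print expand_maplinks($page);
```

Graphical browsers ignore the comments, so the graphics-rich pages are unaffected. If a step like this ran before the hyperlink rewriting, the generated links would pick up the /text prefix along with every other link on the page.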

Those are the main imperfections. Enough of that.

The Good News

The single best thing about this CGI script filtering technique is its simplicity. The nph-text-filter script itself has under 70 lines of code. This includes the socket code which sets up the Unix sockets for the HTTP request back to the server.

Since it is written in Perl, it is also very easy to modify the script to do your own site-specific filtering. For example, at SDRC we identify the [Text-Only] hyperlink on the home page using HTML comments and we filter this out so that once you are in the text-only view, you don't get the choice again (it can be done, by the way, but I wouldn't recommend it).

The nph-text-filter script also adds a hyperlink to the bottom of each text-only page which allows the user to jump back into the graphics-rich view. Our hyperlink reads [Graphics], but you can change this to another string, or remove it altogether.

Since this technique uses a CGI script to do the filtering and since all URLs starting with /text are mapped to the CGI script, it can handle any HTML pages your server can generate. These include normal static HTML files, server-parsed HTML, output from CGI scripts, server-built directory indexes, and even server error pages!

One enhancement recently made to the script is the ability to support server URL redirection. It does this by performing the URL filtering on the Location: field in the HTTP response header. Without this feature, redirections from the text-only view would end up in the graphics-rich view.

Some nice things nph-text-filter does not do: It does not modify hyperlinks to other sites (as they are unlikely to have a /text hierarchy). It also does not attempt to filter any resource types other than those with a Content-Type of "text/html". Items like stand-alone images, plain text, and binary downloads are passed through without modification.

Lastly, this same basic technique can be used to do many things besides text-only views. For example, I used a modification of this script to see what our site would look like with a white background. I was able to browse around the whole site to get the feel of it before applying the change to all the real HTML files. You could also use this method to provide a view of your site with tables turned into pre-formatted text for primitive browsers. How about a view with automatic translation into another language? Okay, maybe not, but you get the idea...
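For instance, a white-background view could be produced with a substitution along these lines. This is only a sketch of the idea, not the modification actually used at SDRC; the whiten_body helper is a name chosen for the example.

```perl
use strict;
use warnings;

# Force a white background in the BODY tag: drop any existing BACKGROUND
# or BGCOLOR attribute, then add BGCOLOR="#FFFFFF". Other attributes
# are kept as they are.
sub whiten_body {
    my ($html) = @_;
    if ( $html =~ m{<\s*body\b([^>]*)>}i ) {
        # Remember where the tag starts and ends, and its attributes.
        my ( $from, $to, $attrs ) = ( $-[0], $+[0], $1 );
        $attrs =~ s/\s+(?:background|bgcolor)\s*=\s*(?:"[^"]*"|[^\s>]+)//gi;
        substr( $html, $from, $to - $from ) = qq{<BODY BGCOLOR="#FFFFFF"$attrs>};
    }
    return $html;
}

print whiten_body(qq{<BODY BACKGROUND="/img/tile.gif" TEXT="#000000">\n});
```

In a filter script, a function like this would take the place of the image-replacement step, with the rest of the script left unchanged.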

Hints For Effective Usage

In order for the nph-text-filter to detect and modify URLs within the current site, all absolute hyperlinks should start with a slash. For example, do not use:

href="http://www.mysite.com/dir/sub/file.html"

Instead use:

href="/dir/sub/file.html"

All web browsers will understand the implied http://www.mysite.com in this URL since they retrieved the document from your site. Of course, you should use the host name when sending the user to another site. Relative URLs, such as href="file.html" or href="../subdir/file.html", need not be changed.
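If your existing pages already contain full URLs for your own site, a one-time cleanup could look something like this sketch. The make_site_relative helper is hypothetical, and the pattern only handles HREF values enclosed in double quotes.

```perl
use strict;
use warnings;

# Strip "http://host" (and an optional port) from HREF values that point
# at our own site, leaving a URL that starts with a slash.
sub make_site_relative {
    my ( $html, $host ) = @_;
    $html =~ s#(href\s*=\s*")http://\Q$host\E(?::\d+)?(/[^"]*")#$1$2#gi;
    return $html;
}

print make_site_relative(
    qq{<A HREF="http://www.mysite.com/dir/sub/file.html">here</A>\n},
    "www.mysite.com",
);
```

Links to other hosts do not match the pattern and pass through untouched.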

Place a reasonable ALT attribute in every IMG tag. The value of the ALT attribute should not indicate what the user is missing, but rather it should make sense as text in context on the page. The user will not know that the text is being displayed as a replacement for an image. Images which are purely decorative should have empty strings (ALT="").
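A quick way to find images that still need attention is to list the IMG tags with no ALT attribute at all. The imgs_missing_alt helper below is a hypothetical sketch; it assumes that no attribute value contains a ">" character.

```perl
use strict;
use warnings;

# Return every IMG tag in $html that has no ALT attribute at all.
# ALT="" counts as present, since that is the recommended value for
# purely decorative images.
sub imgs_missing_alt {
    my ($html) = @_;
    my @missing;
    while ( $html =~ m#(<\s*img\b[^>]*>)#gis ) {
        my $tag = $1;
        push @missing, $tag unless $tag =~ m#\balt\s*=#i;
    }
    return @missing;
}

# Made-up sample page fragment.
my $page = <<'EOM';
<H1><IMG SRC="title.gif" ALT="My Home Page"></H1>
<IMG SRC="rule.gif" ALT="">
<IMG SRC="photo.gif">
EOM

print "Missing ALT: $_\n" for imgs_missing_alt($page);
```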

Surround IMG tags with the text markup (such as heading or emphasis tags) which reflects how you wish the replacement text to be displayed. For example:

<H2><IMG SRC="sdrc-logo.gif" ALT="SDRC"></H2>

This does not affect the way that the image is displayed in the graphics-rich view, and it allows you to provide very nice emphasis on the correct textual elements in the text-only view.

Installing The CGI Script

Here are the rough steps for getting the nph-text-filter and trying it out on your site. If you are not running Apache httpd on SunOS 4, you may have to do additional adaptation.

Step One
Download the nph-text-filter script from either of these URLs:

* http://www.sdrc.com/go/text-only
* http://www.com/pub/websmith/ws22s1

This page may also have further information added after the publication of this article.

Step Two
Review the script and edit anything you like. However, other than platform-specific items and a couple of preferences, there really isn't anything site-specific which is needed for its basic operation.

Step Three
Install the nph-text-filter script in your server's CGI directory, for instance /usr/local/apache/cgi-bin/nph-text-filter.

Step Four
Edit your httpd server's srm.conf configuration file and add the following directive:

ScriptAlias /text/ /usr/local/apache/cgi-bin/nph-text-filter/

where the second argument is, of course, the real location of the CGI script.

Step Five
Restart your httpd server so that it picks up the new configuration directive.

Step Six
Try it out by accessing http://www.mysite.com/text/.

Understanding The CGI Script

The complete code of nph-text-filter is provided in Listing 1. In this section we will review some of the more important aspects of the script. If you don't care about understanding the details of how the script works and you don't need to port the code to another platform, you may wish to skip this section.

The first line of code simply extracts the program name from the name of the CGI script filename. This is only used for error messages in your server's error log.

($prog = $0) =~ s#.*/##;

You may feel free to change the file name of the CGI script to whatever you like, though it shouldn't really matter as the user never sees this name. In any case your CGI script name must match the second argument to the ScriptAlias directive in the srm.conf configuration file (see above).

The next section has two preference configurations which you may modify for your site.

$TEXT = "/text";
$ADDGRAPHICS = "[Graphics]";

The $TEXT preference indicates the URL prefix used to invoke the text-only view on your site. It must match the first argument to the ScriptAlias directive in the srm.conf configuration file. The $ADDGRAPHICS string is the text of the hyperlink shown at the bottom of each text-only page. If the user selects this hyperlink, he is taken back to the graphics-rich view of your site.

Unfortunately, we never generated socket.ph at our site when we installed Perl, so I hard-coded the next two values.

$AF_INET = 2;
$SOCK_STREAM = 1;

If you are porting nph-text-filter to some platform other than SunOS 4, you may have to modify these. See your system's socket.h include file (usually found as sys/socket.h) for possible hints.
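If your Perl installation is version 5 and includes the standard Socket module, a sketch like the following avoids the hard-coded numbers by asking the module for the platform's own values. The tcp_socket_constants helper is a name made up for this example.

```perl
use strict;
use warnings;
use Socket qw(AF_INET SOCK_STREAM);

# Return the address family and socket type for a TCP connection,
# as defined on the platform this Perl was built for.
sub tcp_socket_constants {
    return ( AF_INET, SOCK_STREAM );
}

my ( $AF_INET, $SOCK_STREAM ) = tcp_socket_constants();
print "AF_INET=$AF_INET SOCK_STREAM=$SOCK_STREAM\n";
```

The printed numbers vary between systems, which is exactly why hard-coding them makes a script harder to port.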

When the httpd server invokes CGI scripts it passes a dozen or so valuable items of information in the form of environment variables.

$host           = $ENV{'SERVER_NAME'};
$port           = $ENV{'SERVER_PORT'};
$method         = $ENV{'REQUEST_METHOD'};
$path           = $ENV{'PATH_INFO'};
$protocol       = $ENV{'SERVER_PROTOCOL'};
$accept         = $ENV{'HTTP_ACCEPT'};
$agent          = $ENV{'HTTP_USER_AGENT'};
$referer        = $ENV{'HTTP_REFERER'};
$content_length = $ENV{'CONTENT_LENGTH'};

These environment variables allow us to find out our current server's host name and port, which is critical for knowing where to send the secondary HTTP request for the real page. The request method (GET or POST), the path, and the protocol provide us with the HTTP request we should send to the server once we're connected. The HTTP_ACCEPT, HTTP_USER_AGENT, and HTTP_REFERER values just allow us to pass through some info from the user's client on the secondary request. The content length indicates if there is any POSTed data which we need to pass through on the secondary request. This would be critical if the user were submitting an HTML form.

I've written quite a bit of client/server software, but whenever it comes to the low-level socket code, I always steal from somebody else. The whole socket initialization section in this CGI script was adapted from Larry Wall's "client" script. (Larry is the author of Perl and at a recent USENIX conference was proclaimed a "god" of the systems administration community, which, in his humble manner, he tried to turn down. Great guy.)

Okay, now that we've waved our hands and the socket is set up, we can send the request to the server with simple print statements. We would be much better off here if we had direct access to the literal request text from the client, but we do the best we can with what we have.

print S <<"EOM";
$method $path $protocol
Accept: $accept
User-Agent: $agent via text-filter
EOM
print S "Content-length: $content_length\n" if defined $content_length;
print S "Referer: $referer\n" if defined $referer;
print S "\n";

The "$method $path $protocol" will expand into something like "GET /site/welcome HTTP/1.0" based on the environment variables the server passed the CGI script.
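To see the expansion without a live server, the same request text can be assembled by a small function and printed. In this sketch the build_request helper and the sample values (including the Lynx/2.4 agent string) are assumptions for illustration; the real script reads these values from %ENV and prints them to the socket.

```perl
use strict;
use warnings;

# Build the secondary request text from CGI-style environment values.
# The Content-length and Referer lines are only added when present.
sub build_request {
    my (%env) = @_;
    my $req = "$env{REQUEST_METHOD} $env{PATH_INFO} $env{SERVER_PROTOCOL}\n"
            . "Accept: $env{HTTP_ACCEPT}\n"
            . "User-Agent: $env{HTTP_USER_AGENT} via text-filter\n";
    $req .= "Content-length: $env{CONTENT_LENGTH}\n" if defined $env{CONTENT_LENGTH};
    $req .= "Referer: $env{HTTP_REFERER}\n"          if defined $env{HTTP_REFERER};
    return $req . "\n";    # a blank line ends the request header
}

# Made-up sample values standing in for what the server would supply.
print build_request(
    REQUEST_METHOD  => "GET",
    PATH_INFO       => "/site/welcome",
    SERVER_PROTOCOL => "HTTP/1.0",
    HTTP_ACCEPT     => "text/html",
    HTTP_USER_AGENT => "Lynx/2.4",
);
```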

If the request from the client was the result of a POST (form submission) the server provided this data on our CGI script's standard input. We need to send this straight through to the server on our secondary request.

if ( defined $content_length ) {
	$buf = "";
	read(STDIN, $buf, $content_length);
	print S $buf;
}

Now the server has our complete request and we can start reading its response. First we pick up the header fields as defined by the HTTP/1.0 protocol. This is a set of "Name: value" lines and ends at the first blank line.

$header = "";
while ( <S> ) {
	$header .= $_;
	last if m#^\s*$#;
}
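The same loop can be tried out on a canned response using an in-memory filehandle, which Perl 5.8 and later support. The read_header helper and the sample response are invented for this sketch.

```perl
use strict;
use warnings;

# Read lines from $fh up to and including the first blank line,
# and return them as one string. The handle is left positioned
# at the start of the response body.
sub read_header {
    my ($fh) = @_;
    my $header = "";
    while ( my $line = <$fh> ) {
        $header .= $line;
        last if $line =~ m#^\s*$#;
    }
    return $header;
}

# Made-up sample response, read through an in-memory filehandle.
my $response = "HTTP/1.0 200 OK\r\nContent-type: text/html\r\n\r\n<HTML>...</HTML>\n";
open my $fh, '<', \$response or die "cannot open in-memory file: $!";

my $header = read_header($fh);
my $body   = do { local $/; <$fh> };    # slurp whatever is left

print "Header is ", length($header), " bytes, body is ", length($body), " bytes\n";
```

Because \s matches a carriage return as well as a newline, the blank-line test works whether the server ends its lines with CR LF or with a bare LF.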

One of the possibilities in the HTTP response header is a redirection to another URL using the Location header. If the response has one of these and it looks like it points to a place on our current server (a possible area for mistakes), we add the /text prefix so that our user will stay inside our text-only view after the browser gets redirected.

$header =~ s#(\nLocation:\s*http://$host(:$port)?/)#$1text/#is;
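Here is that substitution wrapped in a function so it can be exercised on a sample header. This sketch differs slightly from the script: it quotes the host name with \Q...\E so that the dots match literally, and it takes the /text prefix as a parameter. The rewrite_location name and the sample header are assumptions.

```perl
use strict;
use warnings;

# Add the text-only prefix to a Location header that points back at
# our own server. Redirections to other hosts are left alone.
sub rewrite_location {
    my ( $header, $host, $port, $text ) = @_;
    $header =~ s#(\nLocation:\s*http://\Q$host\E(?::$port)?)/#$1$text/#i;
    return $header;
}

# Made-up sample redirection header.
my $header = "HTTP/1.0 302 Found\nLocation: http://www.mysite.com/new/page.html\n\n";
print rewrite_location( $header, "www.mysite.com", 80, "/text" );
```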

We then send this whole header (including the blank line) straight back to the client.

print $header;

Now we check the header to see if this resource is an HTML document which requires further filtering. If it is not HTML, then we send the rest of the response from the server straight through to the client and we get out.

if ( $header !~ m#^Content-type:\s*text/html#im ) {
	while ( <S> ) { print;}
	exit(0);
}
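The test itself is easy to check in isolation. In this sketch the is_html_response helper and the sample headers are made up for the example; the pattern is the one used above.

```perl
use strict;
use warnings;

# True if the response header declares an HTML document. The /m flag
# lets ^ match at the start of any header line, and /i ignores case.
sub is_html_response {
    my ($header) = @_;
    return $header =~ m#^Content-type:\s*text/html#im ? 1 : 0;
}

for my $type ( "text/html", "image/gif", "text/plain" ) {
    my $header = "HTTP/1.0 200 OK\nContent-Type: $type\n\n";
    printf "%-10s %s\n", $type, is_html_response($header) ? "filter" : "pass through";
}
```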

Otherwise, we have to read in this document and perform some filtering on it before sending it back to the client. The next two lines read the entire HTML file into a string in memory. Since the items we need to filter (IMG tags and hyperlinks) may cross line boundaries, this makes our pattern substitution much easier.

$/ = undef;
$content = <S>;

Now for the real filtering itself, the reason this exists in the first place. You'll see that it is almost the shortest section of the script.

For each inline image, we replace the whole IMG tag with the string contents of the ALT attribute in that tag.

while ( $content =~ s#<\s*img.*?alt\s*=\s*"(.*?)".*?>#$1#is ) {}
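To watch the substitution at work, it can be applied to a small page fragment. The replace_images wrapper and the sample HTML in this sketch are invented for illustration; the pattern is the one from the script.

```perl
use strict;
use warnings;

# Replace each IMG tag with the contents of its ALT attribute.
# Caveat: an IMG tag with no ALT attribute is not removed, and because
# .*? can cross tag boundaries, such a tag can cause the match to run
# on into the ALT attribute of a later IMG tag.
sub replace_images {
    my ($content) = @_;
    1 while $content =~ s#<\s*img.*?alt\s*=\s*"(.*?)".*?>#$1#is;
    return $content;
}

# Made-up sample: one meaningful image and one decorative image whose
# tag is split across two lines.
my $page = qq{<H2><IMG SRC="sdrc-logo.gif" ALT="SDRC"></H2>\n}
         . qq{<IMG SRC="rule.gif"\n     ALT="">\n};

print replace_images($page);
```

The logo becomes the heading text SDRC, and the decorative rule disappears entirely because its ALT text is empty.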

If there is any background image, we remove that.

$content =~ s#background\s*=\s*".*?"##is;

We prepend /text to each hyperlink URL and FORM ACTION URL which starts with "/".

$content =~ s#(href\s*=\s*")(/.*?)(")#$1$TEXT$2$3#igs;
$content =~ s#(action\s*=\s*")(/.*?)(")#$1$TEXT$2$3#igs;

We don't care about URLs which point to other sites, and we can ignore URLs which are relative since they will stay within the text-only view automatically.
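The three cases can be seen side by side in a short test. The add_text_prefix wrapper and the sample links in this sketch are assumptions made for the example.

```perl
use strict;
use warnings;

# Prepend the text-only prefix to every HREF and ACTION value that
# starts with a slash. Relative URLs and URLs naming a host are
# not matched by the patterns and so stay as they are.
sub add_text_prefix {
    my ( $content, $text ) = @_;
    $content =~ s#(href\s*=\s*")(/.*?)(")#$1$text$2$3#igs;
    $content =~ s#(action\s*=\s*")(/.*?)(")#$1$text$2$3#igs;
    return $content;
}

# Made-up sample page fragment.
my $page = <<'EOM';
<A HREF="/products/">Products</A>
<A HREF="news.html">News</A>
<A HREF="http://www.example.com/">Elsewhere</A>
<FORM ACTION="/cgi-bin/search" METHOD="POST">
EOM

print add_text_prefix( $page, "/text" );
```

Only the first link and the form action gain the /text prefix.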

We then insert the hyperlink which leads back to the graphics-rich view, just before the </BODY> tag if it exists, otherwise at the end of the page.

unless ( $content =~ s#(<\s*/\s*body\s*>)#$graphlink$1#i ) {
	# No </BODY> tag found: append the link at the end of the page.
	$content .= $graphlink;
}
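The definition of $graphlink is not shown in this walkthrough, so the following sketch makes up a plausible one in order to demonstrate the insertion. The add_graphics_link helper, the markup around the link, and the sample pages are all assumptions; note that the replacement puts $1 back so that the </BODY> tag itself is kept.

```perl
use strict;
use warnings;

# Insert a link back to the graphics-rich page just before </BODY>,
# or at the end of the page if there is no </BODY> tag.
# $path is the original URL without the text-only prefix.
sub add_graphics_link {
    my ( $content, $path, $label ) = @_;
    my $graphlink = qq{<P><A HREF="$path">$label</A></P>\n};    # hypothetical markup
    unless ( $content =~ s#(<\s*/\s*body\s*>)#$graphlink$1#i ) {
        $content .= $graphlink;
    }
    return $content;
}

print add_graphics_link( "<BODY>\nWelcome\n</BODY>\n", "/site/welcome", "[Graphics]" );
print add_graphics_link( "Welcome\n",                  "/site/welcome", "[Graphics]" );
```

Because this step runs after the hyperlink rewriting, the link keeps its original path and leads out of the text-only view.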

The page is ready and the entire result of this filtering is passed back to the user's browser.

print $content;

Conclusion

The flexibility and power of Perl, combined with the ability to place a CGI script in a strategic location between the web server and web client, allow us to perform convenient filtering on the pages of an entire site.

One of the many benefits is the ability to present a parallel text-only view of a site, but the possible future applications are limited only by your imagination (and available CPU).


Eric Hammond is a webmaster and toolsmith at Structural Dynamics Research Corporation (SDRC) in Milford, Ohio. He has been involved with World Wide Web technology for 10 years -- What?... Oh... since the Spring of 1993, and the Internet since 1988. He can be reached by e-mail at eric.hammond@sdrc.com.