
Jax: Scanner Generator Examples


2.1 Examples

Jax is a simple Java lex compiler that generates seven-bit ASCII scanners (it can be recompiled to handle eight-bit input). The generated scanner is standalone, in the sense that it needs no runtime support classes from the jax system.

This section walks through a couple of toy examples to show how you can generate scanners with jax. If you are familiar with flex or lex, you can just skim through the examples to see the syntax, and jump to the reference.

2.1.1 Word count

Here is how to generate a scanner that counts the number of words in a file. Assume that a word is any sequence of characters that is not a blank, newline or tab.
# Count the number of words in a file
#
# Header section
%{

import java.io.*;

public class wc
{
  int word_count = 0;

  public static void main(String argv[]) throws IOException
  {
    wc myLexer = new wc();

    myLexer.init(System.in);
    myLexer.jax_next_token();
    System.out.println(myLexer.word_count + " words");
  }

%}

# A word is assumed to be any sequence of characters
# that is not a blank, tab or newline.


/[^\_\n\t]+/                   # The regular expression to match a word

  %{ word_count++; %}   # And its associated action.
  ;                     # don't forget the trailing semicolon

# Trailing section
%{

}

%}
A jax specification file has three parts: the header, the regular expressions to be matched, and the trailer. The header and trailer are enclosed within %{ .. %} and are reproduced in the output file. Any action associated with a regular expression is specified the same way. Jax processes this file and generates a Java file with a method jax_next_token(), which starts the matching process. Before calling it, you must prime the lexer by calling the init() method with an InputStream.
Each regular expression is like an entry for a switch statement
Whenever a regular expression is matched in the input, the generated scanner runs the action associated with it, exactly as if a case entry in a switch statement had been reached. The action for the regular expression becomes the body of the case entry. In the example, the action updates a variable to keep track of the word count. "Falling off" the end of an action, as this example does, causes the lexer to continue matching from where it left off, so the switch statement is actually embedded inside a while loop.
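
The switch-inside-a-while-loop shape can be sketched by hand as follows. This is only an illustration and not the code that jax emits: the class name, the rule constants, matchNextRule() and nextToken() are all hypothetical, the sketch reads from a String rather than an InputStream to stay short, and the RULE_SKIP case is an assumption added so that the loop has something to do with characters that no rule matches.

```java
/** Hand-written illustration only; this is NOT the code jax generates. */
final class SwitchLoopSketch {
    // Hypothetical rule numbers; RULE_WORD stands in for /[^\_\n\t]+/.
    static final int RULE_EOF = 0;
    static final int RULE_WORD = 1;
    static final int RULE_SKIP = 2;   // assumption: unmatched characters are skipped

    private String input = "";
    private int pos = 0;
    int word_count = 0;

    /** Primes the sketch with the text to scan. */
    void init(String text) {
        input = text;
        pos = 0;
    }

    private static boolean isSeparator(char c) {
        return c == ' ' || c == '\t' || c == '\n';
    }

    /** Consumes the next match and reports which rule it belongs to. */
    private int matchNextRule() {
        if (pos >= input.length()) {
            return RULE_EOF;
        }
        if (isSeparator(input.charAt(pos))) {
            pos++;
            return RULE_SKIP;
        }
        while (pos < input.length() && !isSeparator(input.charAt(pos))) {
            pos++;
        }
        return RULE_WORD;
    }

    /** The matching loop: each action is the body of a case entry. */
    void nextToken() {
        while (true) {
            switch (matchNextRule()) {
                case RULE_WORD:
                    word_count++;   // the action from the specification
                    break;          // falling off the action: go round the loop again
                case RULE_SKIP:
                    break;
                default:            // RULE_EOF: nothing left to match
                    return;
            }
        }
    }

    public static void main(String[] args) {
        SwitchLoopSketch lexer = new SwitchLoopSketch();
        lexer.init("one two\tthree\nfour");
        lexer.nextToken();
        System.out.println(lexer.word_count + " words"); // prints "4 words"
    }
}
```

An action that wanted to stop the scan early would use a return statement instead of falling off the end, which is why a single call is enough to count every word in the example.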

To generate the scanner, first run jax on the file, and then compile the generated file. The same example is provided in the distribution. If you are in the root of the distribution, this is how it might work.

% java sbktech.tools.jax.driver -lexFile wc.java examples/wc.lex
% javac wc.java
% java wc < wc.java
677 words
% wc wc.java
     255     677    6723 wc.java
% 
Let's take a look at the regular expression itself. Unlike flex (and like perl), regular expressions are written between slash (/) characters. Backslashes are used to escape any special characters. White space in patterns is not significant, so /abc/ is the same as /a b   c/. To represent a blank, use \_. There are only a few more surprises in the syntax, which is otherwise rather like lex, except that jax does not provide the ^, $ or a/b operators for context-sensitive matching.
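
To make the two rules about white space and \_ concrete, here is a small sketch that rewrites the body of a jax pattern (the text between the slashes) into a java.util.regex pattern. The helper toJavaRegex() is hypothetical and not part of jax, and it assumes that every other escape and operator in the pattern means the same thing in java.util.regex, which this page does not confirm.

```java
import java.util.regex.Pattern;

/** Illustration only: a hypothetical translator, not part of jax. */
final class JaxPatternSketch {

    /** Drops unescaped white space and turns \_ into an escaped blank. */
    static String toJavaRegex(String jaxBody) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < jaxBody.length(); i++) {
            char c = jaxBody.charAt(i);
            if (c == '\\' && i + 1 < jaxBody.length()) {
                char next = jaxBody.charAt(++i);
                if (next == '_') {
                    out.append("\\x20");            // \_ stands for a blank
                } else {
                    out.append('\\').append(next);  // assumed to mean the same in Java
                }
            } else if (!Character.isWhitespace(c)) {
                out.append(c);                      // unescaped white space is dropped
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // /a b   c/ and /abc/ come out identical.
        System.out.println(toJavaRegex("a b   c"));          // prints "abc"

        // The word pattern from the word count example.
        Pattern word = Pattern.compile(toJavaRegex("[^\\_\\n\\t]+"));
        System.out.println(word.matcher("hello").matches());       // prints "true"
        System.out.println(word.matcher("hello world").matches()); // prints "false"
    }
}
```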

2.1.2 HTML comment extractor

This is another toy example, which extracts comments out of an HTML file. Scanning for comments is typically hard to get exactly right in handwritten scanners, and many browsers still do not remove comments correctly. Check your browser by viewing this page, which contains correct comments that many browsers scan incorrectly. These examples were taken from Mark Meyer's html bugs page, which contains other common html processing errors.

The definition from the official comment specification says:

A comment declaration consists of `<!' followed by zero or more comments followed by `>'. Each comment starts with `--' and includes all text up to and including the next occurrence of `--'. In a comment declaration, white space is allowed after each comment, but not before the first comment.
Here is one way to handle this through a jax specification.
/ <!
    ( -- ( [^\-] | -[^\-] )* -- [\_\r\n\t]* )*
  > /
  %{ htmlComments.append(jax_text()); %} ;


# Match the rest quickly
/[^<]+/
  %{ htmlContents.append(jax_text()); %} ;

/</
  %{ htmlContents.append('<'); %} ;
Only the interesting parts of the specification are shown. Here is the complete program, which is also present in the distribution. If you are in the root of the distribution, here is how you might compile and run it.
% java sbktech.tools.jax.driver -lexFile htmlsplit.java examples/htmlsplit.lex
% javac htmlsplit.java
% java htmlsplit examples/browbust.html
Html
====

<html><head><title>Examples of comment processing bugs</title></head><body>
[...]
In correct browsers, this will be the last sentence on this page.<p>
<p>
<p>
<p>
<p>
<p>
<p>
<p>
</body></html>

Comments
========

<!--
Your browser doesn't handle comments that cross a line boundary-->
[...]
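
The comment pattern from the specification above can also be tried out without jax. The sketch below is an assumption-laden translation into java.util.regex: insignificant white space is removed and \_ is written as a blank. The class name and the split() helper are hypothetical and do not come from the jax distribution, and java.util.regex matches by backtracking rather than the way a generated scanner does, so treat it as a way to experiment with the pattern rather than as a replacement for htmlsplit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: a java.util.regex stand-in, not the jax-generated scanner. */
final class HtmlSplitSketch {
    // Translation of: / <! ( -- ( [^\-] | -[^\-] )* -- [\_\r\n\t]* )* > /
    static final Pattern COMMENT =
        Pattern.compile("<!(--([^\\-]|-[^\\-])*--[ \\r\\n\\t]*)*>");

    /** Returns a two-element array: { contents, comments }. */
    static String[] split(String html) {
        StringBuilder htmlContents = new StringBuilder();
        StringBuilder htmlComments = new StringBuilder();
        Matcher m = COMMENT.matcher(html);
        int last = 0;
        while (m.find()) {
            htmlContents.append(html, last, m.start()); // text before the comment
            htmlComments.append(m.group());             // the comment declaration itself
            last = m.end();
        }
        htmlContents.append(html, last, html.length()); // whatever follows the last comment
        return new String[] { htmlContents.toString(), htmlComments.toString() };
    }

    public static void main(String[] args) {
        String page = "<p>a</p><!-- one --><b>x</b>"
                    + "<!-- two\nlines -- -- second -->tail";
        String[] parts = split(page);
        System.out.println("Html\n====\n" + parts[0]);
        System.out.println("Comments\n========\n" + parts[1]);
    }
}
```

The second comment in the sample input crosses a line boundary and holds two comments inside one declaration, which are the cases the quoted definition allows.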


KB Sriram
Comments, bug reports: kbs@sbktech.org

Revised: Wed Jul 24 08:01:13 1996
URL: http://www.sbktech.org/jax-ex1.html