|Tools| <Reference | |Up to Jax: Scanner Generator|

Jax: Scanner Generator Internals


2.4 Internals

This part is a quick note on some of the internals and hacks I've used, so people can tell me more fun things to put into the code. Also, I've added a small section on how fast the scanner runs, but thats a tricky game -- benchmarks are skewed by providing sample inputs that show up the package in a good light. I have followed this grand tradition, so just take 'em with a pinch of salt, and I'll add a worst performance section if you mail me sample code ;-)

2.4.1 Generating code

Jax uses jell to parse the regular expressions. Nothing fancy, it just builds a syntax tree out of each regular expression and adds a few bookkeeping notes for each rule. The DFA for the patterns is directly constructed from the syntax tree using the method outlined in the dragon book (Aho Sethi Ullman) followed by a state minimization pass.

I've "compressed" the rows of the state table by directly generating code for rows that have only a few different transitions. The trick of generating code directly was one I stole from the neat *re2c package. I've used it mostly because it was a quick way to reduce the size of most scanners generated by jax without sacrificing any time in other methods that involve doing multiple table lookups.

This is what it really means. Many scanner generators create a little loop that spins away on a state machine, with some variant of

  while (state != ACCEPTING_STATE) state = next(getchar(), state);
After table compression, the next() function is usually implemented as a couple of table lookups in arrays.

In jax's variant of this loop, it has a giant switch statement for all the states, and where it seems useful, the computation of the next function is implemented by a set of if statements rather than an array lookup. For instance, with a regular expression that looks like

/ba(na)+/
The generated code for the state transitions (lightly edited for readability) looks like
switch(jax_cur_state) {
  case 0: 
    if (jax_next_char == 'n') { jax_cur_state = 4;}
    else  jax_cur_state = -1;
    break;
  case 1: 
    if (jax_next_char == 'b') { jax_cur_state = 2;}
    else jax_cur_state = -1;
    break;
  case 2: 
    if (jax_next_char == 'a') { jax_cur_state = 3;}
    else jax_cur_state = -1;
    break;
  case 3: 
    if (jax_next_char == 'n') { jax_cur_state = 4;}
    else jax_cur_state = -1;
    break;
  case 4: 
    if (jax_next_char == 'a') { jax_cur_state = 0;}
    else jax_cur_state = -1;
    break;
  default: 
    jax_cur_state = -1;
    break;
}
The state machine for this example begins in state 1 incidentally. This method avoids generating state tables for rows that have only a few different transitions. In addition, in the special case a state maps to itself most of the time, the switch statement is bypassed, and a while loop is generated. For instance, jax's regular expression to remove single line comments starting with a #
/#.*/
maps into the following code (lightly edited)
case 4: 
boolean jax_bool_4 = false;
while ((jax_next_char != 0) && (jax_next_char != '\n'))
  {
    jax_bool_4 = true;
    jax_next_char = jax_advance_char();
  }
In "typical" scanners, this allows things like strings and comments to be quickly slurped up.

The second hack is in storing the state transition tables for a row as a string instead of a static array, and then calling the getChars() on the static string at runtime to convert it into an array. This lowers both the size of the class and the time taken to initialize the array. The java compiler creates static arrays by actually generating code to initialize this at runtime, while strings are stored as is in the class file. The getChars() method uses a fast native coded copy operation (System.arraycopy) to perform the copy.

2.4.2 Performance

Jax seems to perform (on a sparc10 running SunOS 5.4) about 30-40 times slower than even egrep, so I'll just stick to comparing it with how much worse it is to just read input a byte at a time from a BufferedInputStream

Jax's performance is largely affected by the length of the pattern it is matching. Jax performs better when it is matching large sequences than shorter ones. The reason is that jax has to save some amount of context each time it performs an action, and also as jax accumulates larger and larger data in its expanding input buffers, it gets larger blocks of data each time, which seems to improve the overall input reading speed. To a smaller degree, matching "simple" patterns like strings and keywords kicks in some additional optimizations.

I've compared the performance of jax on three types of specifications, and compared it with just reading one byte at a time from a BufferedInputStream. All inputs are from the /usr/dict/words file, which is about a 206K ascii file.

All four programs used are in the distribution, under the bnchmark directory. The system I used is a sparc 10 running SunOS 5.4 under lightly loaded conditions. All programs were compiled with javac -O and the output is the time returned for the third successive run of each program.

__________________________________________________________________|lurch
|
|____________________|slurp
|
|______________________________________|jaxTokenizer
|
|_________________________________________|buffer
lurch.lex
This is a jax file corresponding to /.|\n/ which matches every character and executes (an empty) user action.
8.50u 0.57s 0:09.28 97.7%
slurp.lex
This is an equally useless lexer, but it is a fast useless lexer, saving those precious moments waiting for it to do nothing. It corresponds to /(.|\n)*/ which reads in the entire file into its buffer before returning.
2.37u 0.57s 0:03.13 93.9%
jaxTokenizer.lex
This is a lexer that identifies words /[a-zA-Z0-9]+/
4.83u 0.58s 0:05.52 98.0%
buffer.java
This is a test that simply reads one byte at a time from a BufferedInputStream.
5.16u 0.64s 0:05.98 96.9%
Ok. So where's the fudging? First, method calls are fairly expensive in Java. The reading one byte from a BufferedInputStream behaves better if you read in large byte array chunks and then scan the array one at time. Second is that none of the tests includes the cost of doing a jax_text(), which takes some extra time to convert the matched characters into a String. Third, all the patterns are simple enough, so it enables a couple of extra optimizations to kick in.
|Tools| <Reference | |Up to Jax: Scanner Generator|

KB Sriram
Comments, bug reports: kbs@sbktech.org

Revised: Sun May 26 09:41:10 1996
URL: http://www.sbktech.org/jax-int.html