
Jax: Scanner Generator Using other features


2.2 Using other features

This section presents more hooks jax provides to make common scanning tasks easier.

If you've used lex or flex before, the short story is that you get multiple start states, line or character counting, and a case-insensitive option, but nothing else. You can skim through the examples to see the syntax, and then jump to the reference.

2.2.1 TokenInputStream

Very often, the scanner is only the first stage of a larger program. In the previous examples, the scanner did all the work in the program. It would be nice to have the scanner behave just like an InputStream, except that it returns a stream of tokens instead of a stream of bytes.

Let's assume there is a token interface, and that simpletoken, stringtoken and inttoken all implement it. simpletoken is used to represent (say) operators of various kinds, stringtoken to hold strings, and inttoken to represent integers. The goal is to return a stream of these tokens as they are matched in the input stream.
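The token classes themselves are not listed in this section, so here is a minimal sketch of what they might look like. Only the names token, simpletoken, stringtoken, inttoken, type() and token.EOF come from the examples; the constant values, the PLUS, MINUS, STRING and INT constants, the value() accessors and the exact toString() formats are assumptions, chosen to match the sample output shown later.

```java
// Hypothetical token types. The real definitions are not shown in this
// tutorial; everything beyond the names used in the examples is assumed.
interface token {
    int EOF = 0, PLUS = 1, MINUS = 2, STRING = 3, INT = 4;  // values assumed
    int type();
}

// Operators and the end-of-file marker: just a type code and a line number.
class simpletoken implements token {
    private final int type;
    private final int line;
    simpletoken(int type, int line) { this.type = type; this.line = line; }
    public int type() { return type; }
    @Override public String toString() {
        String text;
        switch (type) {
            case PLUS:  text = "+"; break;
            case MINUS: text = "-"; break;
            case EOF:   text = "<EOF>"; break;
            default:    text = "<unknown>"; break;
        }
        return "Line " + line + ": " + text;
    }
}

// A string literal, stored without its surrounding quotes.
class stringtoken implements token {
    private final String value;
    private final int line;
    stringtoken(String value, int line) { this.value = value; this.line = line; }
    public int type() { return STRING; }
    public String value() { return value; }
    @Override public String toString() {
        return "Line " + line + ": String ``" + value + "''";
    }
}

// An integer, built from the matched digits.
class inttoken implements token {
    private final int value;
    private final int line;
    inttoken(String digits, int line) {
        this.value = Integer.parseInt(digits);
        this.line = line;
    }
    public int type() { return INT; }
    public int value() { return value; }
    @Override public String toString() {
        return "Line " + line + ": number (" + value + ")";
    }
}
```

With these definitions, each token carries the line number it was created with, and printing a token gives a one-line description that starts with that line number.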

Let's also pretend that strings use the \ character to escape the next character. Here is a first pass at scanning for such tokens.

%{
import java.io.*;

public class TokenInputStream
{
  private token cur_token;

  public TokenInputStream (InputStream inp) throws IOException
  { init(inp); }

  public token nextToken() throws IOException
  {
    if (jax_next_token() != -1) return cur_token;
    return new simpletoken(token.EOF, jax_cur_line);
  }

  /* Test the lexer */
  public static void main(String argv[]) throws IOException
  {
    TokenInputStream inp = new TokenInputStream(new FileInputStream(argv[0]));
    token t;
    while ((t = inp.nextToken()).type() != token.EOF)
      { System.out.println(t); }
  }

%}

%line

/" ( [^"\\] | \\(.|\n) )* "/
   %{
      String s = jax_text();
      cur_token = new stringtoken(s.substring(1, s.length()-1), jax_cur_line);
      return 1;
   %};

/[0-9]+/
   %{ cur_token = new inttoken(jax_text(), jax_cur_line); return 1; %} ;

...
Only some of the interesting parts are shown; see the complete file if you want all the details.

Notice that the actions now return 1 instead of breaking. This causes the method jax_next_token() to return with the value returned by the action, and yes, you can only return an integer from an action. A new method, jax_text(), is used to obtain the text matched by the regular expression. jax_next_token() automatically returns -1 when the end of file is reached, so this value can be used to detect when the input is exhausted.

The TokenInputStream class stores the matched token in the member variable cur_token. Each action just sets this variable and returns.
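To see this calling convention in isolation, here is a hand-written stand-in that follows the same pattern: the scanning method returns only an int, and the real result travels through a variable. The class FakeScanner and its canned, whitespace-separated input are hypothetical; only the names jax_next_token and cur_token and the -1 end-of-file convention are taken from the example above.

```java
// Hand-written stand-in for a generated scanner (hypothetical). It shows
// only the "return an int, pass the result through a variable" convention.
class FakeScanner {
    private final String[] words;   // canned input, split on whitespace
    private int pos = 0;
    private String cur_token;       // set by an "action" before it returns

    FakeScanner(String input) {
        String trimmed = input.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Mimics jax_next_token(): an int from the action on a match,
    // -1 once the input is exhausted.
    private int jax_next_token() {
        if (pos >= words.length) return -1;
        cur_token = words[pos++];   // the action stores its result ...
        return 1;                   // ... and hands an int back to the caller
    }

    // Mirrors nextToken() above: turn -1 into an explicit end-of-file marker.
    String nextToken() {
        if (jax_next_token() != -1) return cur_token;
        return "<EOF>";
    }
}
```

Calling nextToken() repeatedly on new FakeScanner("12 + 30") yields the three words in order and then the end-of-file marker on every further call, which is the same loop structure the main method above relies on.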

Adding a %line directive inserts code to keep track of the current line number, which you can access through the class global variable jax_cur_line.
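The code jax generates for %line is not shown here, but the idea can be sketched in plain Java: count the newlines in each piece of matched text as it is consumed. The class LineCounter and its consume method are hypothetical and are not part of jax. The sketch assumes the counter is updated before the action runs, which is consistent with the sample run in this section, where a string spanning two lines is reported on the line where it ends.

```java
// Hypothetical illustration of line counting; not the code jax generates.
class LineCounter {
    private int curLine = 1;   // lines are numbered from 1

    // Call once per matched piece of text, in input order.
    // Returns the line number on which that text ends.
    int consume(String lexeme) {
        for (int i = 0; i < lexeme.length(); i++) {
            if (lexeme.charAt(i) == '\n') curLine++;
        }
        return curLine;
    }
}
```

A token that contains no newline is reported on the current line, while a token that contains an embedded newline is reported one line further down.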

Create and compile the scanner as usual, then run it on some examples:

% java sbktech.tools.jax.driver -lexFile TokenInputStream.java examples/tok.lex
% javac TokenInputStream.java
% cat examples/tok.inp
"Hello World"
"\"Hello World\", she said"
2323
+ -
"Strings with a + or a - should be ok too"
"A string defined over \
several lines"
% java TokenInputStream examples/tok.inp 
Line 1: String ``Hello World''
Line 2: String ``\"Hello World\", she said''
Line 3: number (2323)
Line 4: +
Line 4: -
Line 5: String ``Strings with a + or a - should be ok too''
Line 7: String ``A string defined over \
several lines''
% 
Whoops. While it scanned the tokens correctly, there are several unsatisfactory things about the scanner, most notably that it doesn't actually remove the \ characters from the strings. Handling this properly requires more than just regular expressions, which is the topic of the next section.

2.2.2 Using context

The easiest way to deal with removing the escape character from the strings is to put jax into a different context once you start to see a string. In this context, jax uses a different set of regular expressions to scan the input.
/"/
   %{
      sbuf = new StringBuffer();
      jax_switch_state(MATCH_STRING);
      break;
   %};

/[0-9]+/
   %{ cur_token = new inttoken(jax_text(), jax_cur_line); return 1; %} ;

...

%state MATCH_STRING

# Anything other than a ", a \ or a newline can be matched quickly

/[^"\\\n]+/
  %{ sbuf.append(jax_text()); break; %};

# may as well add common conventions too
/\\n/ %{ sbuf.append('\n'); break; %};
/\\t/ %{ sbuf.append('\t'); break; %};

# on seeing an escaped real newline (a \ followed by a newline),
# just ignore it. This is also the usual C convention.
/\\\n/ ;

# anything else just becomes itself
/\\./ %{ sbuf.append(jax_text().charAt(1)); break; %};

# Add some warnings about seeing unescaped newlines
/\n/
 %{
    System.err.println("Line " + jax_cur_line + ": unterminated string");
    jax_switch_state(INITIAL);
    cur_token = new stringtoken(new String(sbuf), jax_cur_line);
    return 1;
  %};

# Switch back to default on seeing the terminating "
/"/ 
 %{
    jax_switch_state(INITIAL);
    cur_token = new stringtoken(new String(sbuf), jax_cur_line);
    return 1;
 %};
New contexts are introduced with the %state keyword, and states can be switched from within the code using the method jax_switch_state(). Unlike lex, using the %state keyword alters the context for all the regular expressions following it.

To go back to the default context, use the state named INITIAL. You can also specify multiple contexts on the same %state directive, and the following regular expressions will be associated with all those contexts.
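To see what the two contexts accomplish, the same string rules can be written out by hand as a small state machine in plain Java. This is only an illustration of the idea, not the code jax generates; the class StringStateDemo, the enum State and the method firstString are hypothetical names, and the sketch extracts just the first string instead of producing a token stream.

```java
// Hand-coded illustration of the INITIAL / MATCH_STRING contexts
// (hypothetical; jax generates its own, table-driven code).
class StringStateDemo {
    enum State { INITIAL, MATCH_STRING }

    // Returns the contents of the first string literal in the input with
    // escapes processed, or null if no string literal is found or the
    // input ends before the string is closed.
    static String firstString(String input) {
        State state = State.INITIAL;
        StringBuilder sbuf = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (state == State.INITIAL) {
                if (c == '"') {              // opening quote: switch context
                    sbuf.setLength(0);
                    state = State.MATCH_STRING;
                }
                continue;                    // other input is skipped here
            }
            // From here on we are in the MATCH_STRING context.
            if (c == '"') {                  // closing quote ends the string
                return sbuf.toString();
            } else if (c == '\\' && i + 1 < input.length()) {
                char next = input.charAt(++i);
                if (next == 'n') sbuf.append('\n');        // like /\\n/
                else if (next == 't') sbuf.append('\t');   // like /\\t/
                else if (next != '\n') sbuf.append(next);  // like /\\./
                // an escaped real newline is dropped, like /\\\n/
            } else if (c == '\n') {          // unescaped newline: give up
                return sbuf.toString();      // and return what was collected
            } else {
                sbuf.append(c);              // ordinary character
            }
        }
        return null;
    }
}
```

The point of the comparison is that each branch of the inner if chain corresponds to one rule in the MATCH_STRING context, and the state variable plays the role of jax_switch_state().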

Compiling and running the new lexer on the same input now gives the string contents with the escapes processed.

% java sbktech.tools.jax.driver -lexFile TokenInputStream.java examples/tok1.lex
% javac TokenInputStream.java
% java TokenInputStream examples/tok.inp
Line 1: String ``Hello World''
Line 2: String ``"Hello World", she said''
Line 3: number (2323)
Line 4: +
Line 4: -
Line 5: String ``Strings with a + or a - should be ok too''
Line 7: String ``A string defined over several lines''
% 


KB Sriram
Comments, bug reports: kbs@sbktech.org

Revised: Tue Jul 23 21:31:34 1996
URL: http://www.sbktech.org/jax-ex2.html