StreamTokenizer class

 

 

  StreamTokenizer breaks a character stream, such as a file, into tokens in a manner similar to the way StringTokenizer breaks up a String.

 

  To create a StreamTokenizer on a character file, you must supply a Reader, such as a FileReader.  For example:

 

FileReader fr = new FileReader( "yourcharfile" ); 

StreamTokenizer st = new StreamTokenizer( fr );

or  

st = new StreamTokenizer(new BufferedReader(new FileReader("yourfile")));

 

  To create a StreamTokenizer on a byte file, you must wrap the byte stream in an InputStreamReader to change the bytes into chars.  For example:

 

Reader r = new BufferedReader( new InputStreamReader( new FileInputStream( "yourbytefile" ) ) );

then  

StreamTokenizer st = new StreamTokenizer( r );
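The one-argument InputStreamReader constructor decodes bytes with the platform default charset. The sketch below assumes the bytes are UTF-8 and says so explicitly. It reads from an in-memory ByteArrayInputStream instead of a file so that it runs on its own; the class name ByteStreamDemo and the helper firstWord are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.nio.charset.StandardCharsets;

class ByteStreamDemo {

    // Hypothetical helper: decodes the bytes as UTF-8 (an assumption about the data)
    // and returns the first word token, or null if there is none.
    static String firstWord(byte[] bytes) throws IOException {
        InputStream in = new ByteArrayInputStream(bytes);  // stands in for a FileInputStream
        try (Reader r = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            StreamTokenizer st = new StreamTokenizer(r);
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    return st.sval;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "42 hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstWord(data));
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by new FileInputStream( "yourbytefile" ).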

 

  When nextToken( ) reads a token, it places the token's type in the field ttype and also returns it.  If the token is a word, its text is placed in the field sval.  If it is a number, its value is placed in nval.   nval is always a double.
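The fields can be watched in action with a small sketch like the one below. It tokenizes a String through a StringReader so that no file is needed; the class name TokenFieldsDemo, the helper describe and the "word:" / "number:" / "char:" labels are invented for this illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class TokenFieldsDemo {

    // Builds one description per token from ttype, sval and nval.
    static List<String> describe(String text) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                out.add("word:" + st.sval);          // word text is in sval
            } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                out.add("number:" + st.nval);        // nval is always a double
            } else {
                out.add("char:" + (char) st.ttype);  // anything else: ttype holds the character
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        for (String line : describe("width 42 height 3.5 ;")) {
            System.out.println(line);
        }
    }
}
```

Note that 42 comes back as the double 42.0, since nval has no integer form.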

 

  You cannot recognize line endings without asking for them, as StreamTokenizer treats them as whitespace by default.  The .eolIsSignificant( true ) method makes the parser recognize line endings and return the  StreamTokenizer.TT_EOL value in ttype, so they can be tested for.
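As a rough illustration, the sketch below counts TT_EOL tokens in the same text with the flag off and on. The class name EolDemo and the helper countEol are hypothetical, and the input uses plain "\n" line endings.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

class EolDemo {

    // Counts how many TT_EOL tokens the tokenizer reports for the given text.
    static int countEol(String text, boolean significant) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(significant);
        int eols = 0;
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_EOL) {
                eols++;
            }
        }
        return eols;
    }

    public static void main(String[] args) throws IOException {
        String text = "one two\nthree\n";
        System.out.println("default: " + countEol(text, false));
        System.out.println("significant: " + countEol(text, true));
    }
}
```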

 

  You do not need to supply StreamTokenizer method parameter values as int literals, even though the parameters are declared as int.  Passing a char works through widening primitive conversion, and char literals are much more readable.

 

Using the st example above:

 

st.wordChars( 44, 46 ); sets commas (int 44), dashes (int 45) and periods (int 46) to be treated as pieces of regular words.

 

st.wordChars( ',', '.' );  does the same thing.

 

Note that supplying the characters out of their numeric order, as in st.wordChars( '.', ',' );  will compile successfully but silently does nothing at runtime, because a range that starts at the higher character code and ends at the lower one is empty.
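The three calls can be compared side by side with a sketch like this one. The class name WordCharsDemo, the helper words and the sample text are assumptions made for the illustration. Only the comma changes behavior inside the sample words here.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class WordCharsDemo {

    // Marks the range low..high as word characters, then collects the word tokens.
    static List<String> words(String text, int low, int high) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.wordChars(low, high);
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                out.add(st.sval);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String text = "red,green blue";
        System.out.println(words(text, 44, 46).size());    // range given as int codes
        System.out.println(words(text, ',', '.').size());  // same range given as chars
        System.out.println(words(text, '.', ',').size());  // reversed range: syntax unchanged
    }
}
```

With the range in the right order the comma joins "red,green" into a single word; with the reversed range the comma stays an ordinary character and splits it.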

 

  You cannot define token delimiter characters in StreamTokenizer's constructor, as you can with StringTokenizer.  Instead, StreamTokenizer offers various methods to control the possible attributes of an incoming character.  The table shows the attributes and their control methods:

 

Situation:   Do not treat all End-of-Line characters as tokens, just as whitespace
Use Method:  eolIsSignificant( false )  (the default)

Situation:   Define a range of characters to be considered "part of a word"
Use Method:  wordChars( int lowch, int highch )

Situation:   Define a range of chars to be ignorable whitespace characters
Use Method:  whitespaceChars( int lowch, int highch )

Situation:   Define a character as ordinary, to be returned in ttype if encountered
Use Method:  ordinaryChar( int ch )

Situation:   Define a range of characters as ordinary
Use Method:  ordinaryChars( int lowch, int highch )

Situation:   Reset all characters to ordinary, thus returning each one in ttype
Use Method:  resetSyntax( )

Situation:   Treat positive, decimal (.) and negative (-) numbers as numbers
Use Method:  parseNumbers( )  (the default)

Situation:   Define a char whose matching pairs will bracket a String quote body, with the char to be returned in ttype and the quote body in sval
Use Method:  quoteChar( int ch )
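Several of these methods can be combined on one tokenizer. The sketch below is one possible setup for a made-up "key=value;" format: the semicolon is turned into whitespace, the equals sign is declared ordinary, and the single quote brackets quoted text. The class name SyntaxTableDemo, the helper describe and the input format are assumptions for illustration only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class SyntaxTableDemo {

    static List<String> describe(String text) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.whitespaceChars(';', ';');  // a one-character range: ';' now separates tokens silently
        st.ordinaryChar('=');          // '=' is returned as itself in ttype
        st.quoteChar('\'');            // text between single quotes is returned in sval
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                out.add("word:" + st.sval);
            } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                out.add("number:" + st.nval);
            } else if (st.ttype == '\'') {
                out.add("quoted:" + st.sval);  // ttype holds the quote char itself
            } else {
                out.add("char:" + (char) st.ttype);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        for (String line : describe("name='Mary Ann';age=30")) {
            System.out.println(line);
        }
    }
}
```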

 

 

  StreamTokenizer has several methods for controlling the parsing of program files written in C, C++, and Java.

 

Situation:   Define a char beginning remainder-of-line comments to be ignored
Use Method:  commentChar( int ch )

Situation:   Ignore all line contents after occurrences of //
Use Method:  slashSlashComments( true )

Situation:   Ignore all file contents between /* ... */
Use Method:  slashStarComments( true )
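A minimal sketch of the three comment methods working together follows. The '#' comment character, the class name CommentDemo and the sample text are chosen only for this illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class CommentDemo {

    // Returns the word tokens that remain once all three comment styles are skipped.
    static List<String> words(String text) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.slashSlashComments(true);  // skip from // to the end of the line
        st.slashStarComments(true);   // skip everything between /* and */
        st.commentChar('#');          // skip from # to the end of the line
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                out.add(st.sval);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String text = "alpha // skip one\n"
                    + "beta /* skip\ntwo */ gamma # skip three\n"
                    + "delta";
        System.out.println(words(text));
    }
}
```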

 

 

  This snippet uses if statements to parse any text file whose lines contain just words, numbers, commas and periods, counting each kind of token.  It also uses the lineno( ) method to report the line number reached at the end of the file:

 

import java.io.*;

public class ParseWithIf {
    public static void main(String[] args) {
        int w = 0;      // words
        int n = 0;      // numbers
        int c = 0;      // commas
        int p = 0;      // periods
        int lines = 0;  // line number reached when parsing stopped
        // try-with-resources closes the reader even if parsing fails
        try (Reader r = new BufferedReader(new FileReader("yourfile"))) {
            StreamTokenizer st = new StreamTokenizer(r);
            st.eolIsSignificant(true);   // tells it to recognize line breaks
            st.ordinaryChars(',', '.');  // set comma and dot to be returned in ttype (the range also covers '-')
            while ( true ) {
                st.nextToken( );
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    w++;
                    continue;
                }
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    n++;
                    continue;
                }
                if (st.ttype == '.') {
                    p++;
                    continue;
                }
                if (st.ttype == ',') {
                    c++;
                    continue;
                }
                if (st.ttype == StreamTokenizer.TT_EOL) {
                    continue;
                }
                if (st.ttype != StreamTokenizer.TT_EOF) {
                    System.out.println("Bad file format");
                }
                break;  // stop at end of file or at the first unexpected token
            }
            lines = st.lineno( );
        } catch (IOException ex) {
            // report the problem instead of silently swallowing it
            System.err.println("Could not read file: " + ex.getMessage());
        }
        System.out.println( w + " words, " + n + " numbers, " + c + " commas, " + p + " periods" );
        System.out.println( lines + " Lines in file" );
    }
}

 

  You can also switch against the various StreamTokenizer result fields.  This snippet repeats the above example but with switch instead of if statements:

 

import java.io.*;

public class ParseWithSwitch {
    public static void main(String[] args) {
        int w = 0;  // words
        int n = 0;  // numbers
        int c = 0;  // commas
        int p = 0;  // periods
        // try-with-resources closes the reader even if parsing fails
        try (Reader r = new FileReader("yourfile")) {
            StreamTokenizer st = new StreamTokenizer(r);
            st.eolIsSignificant(true);   // tells it to recognize line breaks
            st.ordinaryChars(',', '.');  // set comma and dot to be returned in ttype
            int token;
            // nextToken( ) returns the same value it stores in ttype
            while ((token = st.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (token) {
                case StreamTokenizer.TT_NUMBER:
                    n++;
                    break;
                case StreamTokenizer.TT_WORD:
                    w++;
                    break;
                case ',':
                    c++;
                    break;
                case '.':
                    p++;
                    break;
                case StreamTokenizer.TT_EOL:
                    break;
                default:  // any other character is ignored
                    break;
                }
            }
        } catch (IOException ex) {
            // report the problem instead of silently swallowing it
            System.err.println("Could not read file: " + ex.getMessage());
        }
        System.out.println(w + " words, " + n + " numbers, " + c + " commas, " + p + " periods");
    }
}

 

  The order of calls matters when you mix ordinaryChar( .. ) or ordinaryChars( .. ) with parseNumbers( ).  Periods ( . ) and dashes ( - ) are considered part of numbers, and every call to parseNumbers( ) marks them, along with the digits, as numeric again.  If you set them to be ordinary characters and then call parseNumbers( ) explicitly, number parsing will take precedence and the periods and dashes will not be returned in ttype.  For example:

st.ordinaryChars( '-', '.' );         then

st.parseNumbers( );                causes those chars not to be returned.

However, stating st.ordinaryChars( '-', '.' );  alone, relying on the number parsing default without a following explicit st.parseNumbers( );  will return those two ordinary characters in ttype.
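The effect of the call order on the dash can be observed with a sketch such as the following. The class name NumberOrderDemo, the helper tokens and its boolean flag are hypothetical names introduced for the illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class NumberOrderDemo {

    // Makes '-' and '.' ordinary, optionally calls parseNumbers( ) afterwards,
    // and lists the resulting tokens.
    static List<String> tokens(String text, boolean callParseNumbersAfter) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.ordinaryChars('-', '.');
        if (callParseNumbersAfter) {
            st.parseNumbers();  // marks digits, '.' and '-' as numeric again
        }
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_NUMBER) {
                out.add("number:" + st.nval);
            } else if (st.ttype == StreamTokenizer.TT_WORD) {
                out.add("word:" + st.sval);
            } else {
                out.add("char:" + (char) st.ttype);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokens("-7 -8", false));  // dash comes back as its own token
        System.out.println(tokens("-7 -8", true));   // dash is absorbed into the number
    }
}
```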

 

  You can use StreamTokenizer to parse a simple String character stream by supplying the StreamTokenizer constructor with a StringReader containing the desired characters.  For example:

StreamTokenizer st = new StreamTokenizer( new StringReader( "Mary had a little lamb." ) );
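A short sketch of walking such a String follows. It also shows that, with the default syntax, the trailing period stays attached to the last word unless the period is declared ordinary. The class name StringTokensDemo, the helper words and its flag are assumptions made for the illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class StringTokensDemo {

    // Collects the word tokens of a String, optionally treating '.' as an ordinary character.
    static List<String> words(String text, boolean detachPeriod) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        if (detachPeriod) {
            st.ordinaryChar('.');  // the period then ends the word and is returned on its own
        }
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                out.add(st.sval);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String text = "Mary had a little lamb.";
        System.out.println(words(text, false));
        System.out.println(words(text, true));
    }
}
```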

 

  This snippet finds, sorts and displays all the unique alphabetic words in a text file.  It first calls resetSyntax( ) to eliminate unwanted default parts of words, like periods, then defines only the letters as word characters and reduces the words to lower case with lowerCaseMode( ).  The pushBack( ) method is used to read the same token twice.  A TreeSet provides both the automatic uniqueness and the ordering of results:

 

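import java.io.*;
import java.util.*;

// UniqueWords is only a wrapper class name so that the snippet compiles on its own
public class UniqueWords {
    public static void main(String[] args) {
        Set<String> words = new TreeSet<>();  // sorted, no duplicates
        try (Reader fr = new BufferedReader(new FileReader("yourfile"))) {
            StreamTokenizer st = new StreamTokenizer(fr);
            st.resetSyntax( );          // make every character ordinary
            st.wordChars('a', 'z');     // then allow only letters in words
            st.wordChars('A', 'Z');
            st.lowerCaseMode( true );   // word tokens are returned in lower case
            while (st.nextToken( ) != StreamTokenizer.TT_EOF) {
                st.pushBack( );         // the next call returns this same token again
                if (st.nextToken( ) == StreamTokenizer.TT_WORD) {
                    words.add( st.sval );
                }
            }
        } catch (IOException e) {
            // report the problem instead of silently swallowing it
            System.err.println("Could not read file: " + e.getMessage());
        }
        for (String word : words) {
            System.out.println( word );
        }
    }
}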