STokenizer Class

Note: Please see the description of state machines.

 

Purpose:

The String Tokenizer (STokenizer) returns a single token (via the extraction operator) from a string according to the rules set by its internal state machine. For now, the state machine is hard-coded inside the STokenizer class, but it should not be difficult to let the calling entity set or change it.

The extraction operator of the STokenizer object is called repeatedly to grab the next token in the string.

When no more tokens can be found in the string, more() returns false and done() returns true.
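The sketch below illustrates one possible reading of how more() and done() interact with the extraction operator. It is not the STokenizer itself: MiniTokenizer, char_class, and tokenize are hypothetical names, and the class groups characters by type instead of walking a state table, so it does not recognize doubles such as 3.14. The assumption shown is that done() only becomes true once an extraction is attempted past the end of the string, so the loop still processes the last token.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>
using namespace std;

// Hypothetical, simplified stand-in for STokenizer (no state table).
class MiniTokenizer
{
public:
    explicit MiniTokenizer(const char str[])
        : _buffer(str), _pos(0), _done(false) {}

    bool done() const { return _done; }     //true: there are no more tokens
    bool more() const { return !_done; }    //true: there are more tokens

    friend MiniTokenizer& operator>>(MiniTokenizer& s, string& token)
    {
        token.clear();
        if (s._pos >= s._buffer.size()) {
            //assumption: the failed extraction is what flips done()
            s._done = true;
            return s;
        }
        size_t start = s._pos;
        int cls = MiniTokenizer::char_class(s._buffer[s._pos]);
        //class 3 ("other") always yields a one-character token
        do {
            ++s._pos;
        } while (cls != 3 && s._pos < s._buffer.size() &&
                 MiniTokenizer::char_class(s._buffer[s._pos]) == cls);
        token = s._buffer.substr(start, s._pos - start);
        return s;
    }

private:
    //0: letter, 1: digit, 2: whitespace, 3: anything else
    static int char_class(char ch)
    {
        unsigned char c = static_cast<unsigned char>(ch);
        if (isalpha(c)) return 0;
        if (isdigit(c)) return 1;
        if (isspace(c)) return 2;
        return 3;
    }

    string _buffer;     //input string
    size_t _pos;        //current position in the string
    bool _done;         //set once an extraction finds nothing left
};

//extract first, then test more(), then process and extract again
vector<string> tokenize(const char str[])
{
    MiniTokenizer stk(str);
    vector<string> tokens;
    string t;
    stk >> t;
    while (stk.more()) {
        tokens.push_back(t);
        stk >> t;
    }
    return tokens;
}
```

With this ordering, calling tokenize on "pi 3." collects four pieces, including the trailing period, before more() turns false.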

Token Class:

The Token class is a package that the STokenizer uses to send the strings and types of tokens extracted from the input buffer to the calling entity. When a calling function calls the extraction operator of the STokenizer, a Token object is returned. The Token class provides simple ways to interact with and report about these token strings.

 

class Token
{
public:
    Token();
    Token(string str, int type);
    friend ostream& operator <<(ostream& outs, const Token& t);
    int type() const;
    string type_string() const;
    string token_str() const;
private:
    string _token;
    int _type;
};

Token Types:

You must use the following constants in your project. Place them in the constants.h file under includes/tokenizer:

const int MAX_COLUMNS = 256;
const int MAX_ROWS = 100;
const int MAX_BUFFER = 200;

const char ALFA[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
const char DIGITS[] = "0123456789";
const char OPERATORS[] = "><=!+-%&|";
const char SPACES[] = {' ', '\t', '\n', '\0'};
const char PUNC[] = "?.,:;'`~!";
const int START_DOUBLE = 0;
const int START_SPACES = 4;
const int START_ALPHA = 6;
const int START_OPERATOR = 20;
const int START_PUNC = 10;

//token types:
const int TOKEN_NUMBER = 1;
const int TOKEN_ALPHA = 2;
const int TOKEN_SPACE = 3;
const int TOKEN_OPERATOR = 4;
const int TOKEN_PUNC = 5;

const int TOKEN_UNKNOWN = -1;
const int TOKEN_END = -2;
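To show how these constants might be used, here is a minimal sketch of filling in the rows of the table that recognize doubles, starting at START_DOUBLE. The helper names (init_table, mark_success, mark_fail, mark_cells, mark_cell, make_double_machine) are hypothetical. The sketch assumes that column 0 of each row holds 1 for an accepting state and 0 otherwise, and that -1 in any other cell means "no transition". Refer to the description of state machines for the conventions you are actually expected to follow.

```cpp
#include <cassert>

const int MAX_COLUMNS = 256;
const int MAX_ROWS = 100;
const char DIGITS[] = "0123456789";
const int START_DOUBLE = 0;

//fill every cell with -1 (assumed to mean "no transition")
void init_table(int table[][MAX_COLUMNS])
{
    for (int r = 0; r < MAX_ROWS; ++r)
        for (int c = 0; c < MAX_COLUMNS; ++c)
            table[r][c] = -1;
}

//assumption: column 0 stores the success flag of each state
void mark_success(int table[][MAX_COLUMNS], int state) { table[state][0] = 1; }
void mark_fail(int table[][MAX_COLUMNS], int state) { table[state][0] = 0; }

//from state `row`, every character in `columns` leads to `state`
void mark_cells(int row, int table[][MAX_COLUMNS],
                const char columns[], int state)
{
    assert(row >= 0 && row < MAX_ROWS);
    for (int i = 0; columns[i] != '\0'; ++i)
        table[row][static_cast<unsigned char>(columns[i])] = state;
}

//from state `row`, the single character `column` leads to `state`
void mark_cell(int row, int table[][MAX_COLUMNS], char column, int state)
{
    assert(row >= 0 && row < MAX_ROWS);
    table[row][static_cast<unsigned char>(column)] = state;
}

//states: +0 start, +1 integer part, +2 just saw '.', +3 fraction part
void make_double_machine(int table[][MAX_COLUMNS])
{
    mark_fail(table, START_DOUBLE + 0);
    mark_success(table, START_DOUBLE + 1);
    mark_fail(table, START_DOUBLE + 2);
    mark_success(table, START_DOUBLE + 3);

    mark_cells(START_DOUBLE + 0, table, DIGITS, START_DOUBLE + 1);
    mark_cells(START_DOUBLE + 1, table, DIGITS, START_DOUBLE + 1);
    mark_cell(START_DOUBLE + 1, table, '.', START_DOUBLE + 2);
    mark_cells(START_DOUBLE + 2, table, DIGITS, START_DOUBLE + 3);
    mark_cells(START_DOUBLE + 3, table, DIGITS, START_DOUBLE + 3);
}
```

Under these assumptions the double machine occupies rows 0 through 3, which leaves row 4 (START_SPACES) free for the next machine.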

 

STokenizer Class:

class STokenizer
{
public:
    STokenizer();
    STokenizer(char str[]);
    bool done();            //true: there are no more tokens
    bool more();            //true: there are more tokens
    //

    //---------------
    //extract one token (very similar to the way cin >> works)
    friend STokenizer& operator >> (STokenizer& s, Token& t);

    //set a new string as the input string
    void set_string(char str[]);

private:
    //create table for all the tokens we will recognize
    //                      (e.g. doubles, words, etc.)
    void make_table(int _table[][MAX_COLUMNS]);

    //extract the longest string that matches
    //     one of the acceptable token types
    bool get_token(int& start_state, string& token);
    //---------------------------------
    char _buffer[MAX_BUFFER];       //input string
    int _pos;                       //current position in the string
    static int _table[MAX_ROWS][MAX_COLUMNS];
};
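The "longest string that matches" idea behind get_token can be sketched as a free function. This is an illustration under stated assumptions, not the member function declared above: the signature here is different (the buffer, position, and table are passed in explicitly), make_double_table is a hypothetical helper that builds a tiny machine for doubles only, column 0 of each row is assumed to hold the success flag, and -1 is assumed to mean "no transition".

```cpp
#include <string>
using namespace std;

const int MAX_COLUMNS = 256;
const int MAX_ROWS = 100;

//walk the table from start_state, remember the last accepting position,
//and back up to it when the machine gets stuck or the input ends
bool get_token(const char buffer[], int& pos, int table[][MAX_COLUMNS],
               int start_state, string& token)
{
    int state = start_state;
    int last_success = -1;      //index just past the last accepted character
    int i = pos;
    while (buffer[i] != '\0') {
        unsigned char c = static_cast<unsigned char>(buffer[i]);
        int next = table[state][c];
        if (next == -1)         //no transition on this character
            break;
        state = next;
        ++i;
        if (table[state][0] == 1)
            last_success = i;
    }
    if (last_success == -1)     //never reached an accepting state
        return false;
    token.assign(buffer + pos, last_success - pos);
    pos = last_success;         //resume right after the token
    return true;
}

//hypothetical helper: rows 0..3 recognize digits with an optional fraction
void make_double_table(int table[][MAX_COLUMNS])
{
    for (int r = 0; r < MAX_ROWS; ++r)
        for (int c = 0; c < MAX_COLUMNS; ++c)
            table[r][c] = -1;
    table[0][0] = 0;    //start
    table[1][0] = 1;    //integer part (accepting)
    table[2][0] = 0;    //just saw '.'
    table[3][0] = 1;    //fraction part (accepting)
    for (char d = '0'; d <= '9'; ++d) {
        table[0][d] = 1;
        table[1][d] = 1;
        table[2][d] = 3;
        table[3][d] = 3;
    }
    table[1]['.'] = 2;
}
```

On the input "3.14." this sketch extracts "3.14" and stops before the final period. On "3." it backs up and extracts only "3", because the state reached after the period is not accepting.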

In stokenizer.cpp, you must define the _table static member variable, since the declaration inside the class does not allocate storage for it:

int STokenizer::_table[MAX_ROWS][MAX_COLUMNS];
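The same declaration and definition pattern is shown below on a small stand-alone class. Demo, ROWS, COLS, lookup, and set are hypothetical names used only for illustration. The point is that the line inside the class only declares the static array, while the line outside the class defines it.

```cpp
const int ROWS = 2;     //small hypothetical sizes for the demo
const int COLS = 3;

class Demo
{
public:
    static int lookup(int r, int c) { return _table[r][c]; }
    static void set(int r, int c, int value) { _table[r][c] = value; }
private:
    static int _table[ROWS][COLS];      //declaration only: no storage yet
};

//definition: belongs in exactly one .cpp file;
//static storage is zero-initialized before any other use
int Demo::_table[ROWS][COLS];
```

Without the out-of-class definition, a program that uses _table would typically fail at link time with an undefined reference.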

Testing:

 

 

    char s[] = "it was the night of october 17th. pi was still 3.14.";
    STokenizer stk(s);
    Token t;
    //The all too familiar golden while loop:
    stk>>t;
    while(stk.more()){
        //process token here...
        cout<<setw(10)<<t.type_string()<<setw(10)<<t<<endl;


        t = Token();
        stk>>t;
    }

Output:

     ALPHA         |it|
     SPACE         | |
     ALPHA         |was|
     SPACE         | |
     ALPHA         |the|
     SPACE         | |
     ALPHA         |night|
     SPACE         | |
     ALPHA         |of|
     SPACE         | |
     ALPHA         |october|
     SPACE         | |
    NUMBER         |17|
     ALPHA         |th|
   UNKNOWN         |.|
     SPACE         | |
     ALPHA         |pi|
     SPACE         | |
     ALPHA         |was|
     SPACE         | |
     ALPHA         |still|
     SPACE         | |
    NUMBER         |3.14|
   UNKNOWN         |.|