PxdScript language tutorial series.

Chapter 2 - Scanning / Lexical analysis

by Kasper Fauerby

Introduction

Hi again!

Welcome to the second chapter in the PxdScript tutorial series. This chapter deals with the first module in a compiler - the scanner. If you haven't read the first chapter yet I suggest you start there and read the chapters in order.

We'll make a soft start and begin with one of the easier tasks in writing a compiler - the scanning.


The Scanner

So, what does a scanner do? The scanner's job is to perform a lexical analysis of the source code, transforming it from a plain text file into a list of "tokens" defined by the programming language. In its basic form the text file containing the source code is nothing more than a stream of characters; the text file itself contains no information about which characters belong to which tokens. An assignment, for example, could look like this in the source code:

a = 5

but what we need is really a sequence of three tokens :


tIDENTIFIER tASSIGNMENT tINTCONST

Well, the above is almost true except for a tiny detail. It is customary not to introduce a special token name for tokens that are always exactly one character long. Thus the assignment above is really converted into the following three tokens:


tIDENTIFIER '=' tINTCONST

but that's just a minor detail one should keep in mind.

It is also the scanner's job to make sure that every token in the source code is actually part of the programming language. The scanner should complain if I tried to write something like:


a #= 5

because the token '#=' would not be recognized as a valid token in PxdScript.

At first sight it might seem like an easy task to write a scanner. One just reads one token at a time from the file and checks whether it's a valid one, right? And if it is, it should be easy to classify it as a certain token type? Well, the problem is that it's not so easy to find out where a token starts and ends in a text file. If one could always assume that tokens were separated by spaces it could be done, but consider the following source fragment:


while(b>5)

Even though the fragment consists of only one "word" in the text file it is clear that it is a sequence of many tokens. A correct scanning of this single "word" in the text file should result in a list of 6 tokens:


tWHILE '(' tIDENTIFIER '>' tINTCONST ')'

So we abandon the idea of just reading the source file one word at a time and accept the fact that tokens can begin and end in the middle of "words". But how do we figure out when a token starts and ends then? Let's look at the example from above again. If we only read the first 4 letters of the "word" then we get an identifier ("whil"), but as soon as we read the fifth letter that identifier suddenly becomes a totally different kind of token - namely the reserved word "while", which is represented by its own token tWHILE.

Ok, so we have a problem - what do we do? We don't have time for this - if we're ever going to finish our compiler we cannot spend weeks trying to write a scanner which does everything right. Luckily it turns out that someone has already solved the problem for us!
Now is the time to introduce the first of two tools we'll use in our compiler creation - namely "Flex"!

Flex

Well, obviously we are not the first to stumble across this problem. There exist thousands of different programming languages, and each of them has faced the very same problem we face now. At some point someone got tired enough of this to sit down and write a tool which can automatically generate a scanner module, written in C, for a given programming language. Flex is exactly such a tool. Of course it needs some sort of description of our language - it needs to know which tokens we want to allow and which we want to reject. We tell it through a configuration file - a lex file - written in a certain syntax, which I'll describe below.

The main part of the lex file consists of a set of rules describing the tokens in our language. These rules are written as "regular expressions" and our tokens must match at least one of these regular expressions to be accepted by our language.

A regular expression can be used to describe a whole class of strings. Rather than writing a detailed description of what regular expressions are, I'll briefly introduce them through a number of examples covering most of the features we need. Regular expressions are not that exciting to read about, but they are very useful when writing a scanner.

The most basic regular expression we need is a character class, written like [ab], which matches exactly one character: in this case either an 'a' or a 'b'. Similarly [0-9] matches any single digit. Ok, then there are certain characters which can be applied to the expressions above. The first such character is the '*' character. In the form above each regular expression matched exactly one character. If one writes a '*' after an expression it means "zero or more times", so:

[ab]*[c]

means a string consisting of any number of 'a' and 'b' characters which then ends with a 'c'. "aaaac" matches the expression above and so does "ababac", but "aca" does not.

Another character you can write after an expression is '+', which means "one or more times". And finally you can write '?' after an expression to indicate either zero or one occurrence of the expression (and not more than one!). For example, "c" matches the regular expression above ([ab]*[c]) because '*' means zero or more occurrences, but it would not match the regular expression [ab]+[c] because '+' means one or more. The other examples ("aaaac" and "ababac") would also match the new expression. Try to think of some strings that match the regular expression [ab]?[c] now...

Enough about regular expressions - let's take a look at how we use them in a lex file!


The lex file

In this section we'll first create a tiny lex file to play with just to see how Flex works. Then I'll describe the lex file used for PxdScript.

The format of a lex file is fairly weird. It must start with a section wedged in between two special tokens: '%{' and '%}'. Between these you can write C code to include header files, declare variables you are going to use in the lex file, etc. We'll include the string.h library, so we start our toy example like this:

%{
#include <string.h>
%}

Following that we write:

%%

telling Flex that now we'll list the regular expressions it should match. After each regular expression you can write a block of C code which will be executed when Flex reads a token matching the regular expression from the file. In this small example we just want to see how a file can be turned into a list of tokens so we simply print a line to the prompt using "printf":


[ \t]+                                               /* ignore spaces and tabs*/;
"+"                                                  {printf("plus\n");}
"-"                                                  {printf("minus\n");}
"*"                                                  {printf("mul\n");}
"/"                                                  {printf("div\n");}
"%"                                                  {printf("mod\n");}
"("                                                  {printf("left paren\n");}
")"                                                  {printf("right paren\n");}

0|([1-9][0-9]*)                                      {printf("intconst(%i)\n", atoi(yytext));}

[A-Za-z_][A-Za-z0-9_]*                               {printf("identifier(%s)\n", yytext);}

.                                                     /* ignore other chars*/;

There is a built-in variable called 'yytext' that we can use to access the string which actually matched the regular expression. Notice how we use it to print the values of the identifiers and integer constants. Finally we finish the lex file by writing another


%%

to tell Flex that there are no more regular expressions. You can find this tiny lex file here or create your own from the above.

Let's take a closer look at the two more complex regular expressions in the list of rules above. First there is the expression for integer constants:

0|([1-9][0-9]*)

In English this rule reads: "Either the string is just the single character '0', or it starts with one of the digits 1 to 9 followed by any number of digits 0 to 9". At first you might think this is a very complicated way of representing a sequence of digits. Why not just use the expression '[0-9]+', that is, "one or more digits from 0 to 9"? Well, such an expression would also allow numbers of the form '007', which is cool for a James Bond movie but not practical for a programming language. We don't want any prefixed zeros, and if you look closely at the regular expression I used in the toy lex file you'll see that it can form any integer, but it does not accept numbers with prefixed zeros.

The rule for identifiers has a similar argument for why it's not just '[A-Za-z0-9_]+'. We want to avoid a certain kind of string as identifiers - which strings are those??

Flex works from the philosophy that it's better to "eat" as much from the source file as it possibly can at each step. Thus Flex will always use the regular expression with the longest match when it scans the source file for the next token. This is to avoid having "while" recognized as two identifiers "whi" and "le", for example. If more than one regular expression matches the next token with equal (longest) length, then the rule listed first in the lex file is used. So it is safe to use the "dot" expression above to "eat" illegal characters: because it is listed last, it will only be used when every other expression fails. If Flex did not have this rule about the order of the expressions we would have a problem, because two expressions would often tie for the best possible match - for example when the next token in the source file was a '+', which both the "+" and the "." expression would match.

Now we want to test the lex file to see if it does what we want it to do and to do that we'll have to compile the scanner into a program. First we must run 'Flex' on the lex file to automatically generate the C code for the scanner. To do that open your bash (assuming you use Cygwin or a real Linux/Unix system) and type:

flex numbers.l

This will compile the lex file into the C file lex.yy.c. To compile this file you use gcc by typing


gcc -c lex.yy.c
gcc lex.yy.o -o numbers.exe

Fairly painful huh? I know, so I've also provided you with a Makefile to compile this little example (including running flex) so just type make and hit the enter key.
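In case you'd rather write the Makefile yourself, a minimal sketch covering the same three steps might look like this (the file names are assumed from this example; whether you need an extra -lfl link flag depends on whether your lex file supplies its own main and yywrap functions):

```make
# Minimal sketch of a Makefile for the toy scanner. If the linker
# complains about a missing main or yywrap, add -lfl to the link line.
numbers: lex.yy.o
	gcc lex.yy.o -o numbers

lex.yy.o: lex.yy.c
	gcc -c lex.yy.c

lex.yy.c: numbers.l
	flex numbers.l

clean:
	rm -f lex.yy.c lex.yy.o numbers
```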

To test the scanner you can use the small test file I also provided with the zip package.

WARNING: the numbers.exe does NOT read the filename from its arguments - the test file must be piped to the scanner by typing


numbers < numbertest.txt

Now you should see in the bash a list of the tokens the scanner found in the test file.

Advanced lex features

Ok, before you can understand the lex file for PxdScript (lang.l) I'll have to tell you a little about some of the more advanced features of Flex. The full listing of lang.l can be found in the section below this one.

The first advanced feature shows right after the


%{

%}

block in lang.l and is the single line

%x commentsc

This line declares that we have grouped the regular expressions into two groups, each with a totally different set of rules! We always start in what is called the INITIAL start condition, which is where the rules in the tiny example above lived.

The other start condition is called "commentsc", and the idea is that Flex only uses rules from one group at a time. When we are scanning a comment in the source code we want to use the special set of rules defined in the <commentsc>{ } block. This is to avoid recognizing the out-commented text as tokens. As you can see we never return any tokens from within the comment start condition - we just "eat" them from the source code while still keeping track of line numbers etc. One switches between start conditions using the BEGIN(name) keyword. Also notice how easy it turns out to be to implement nested comments - that is, multi-line comments inside other multi-line comments. That's something I've always wanted in C++.

We could easily define many more start conditions if we needed them but for most languages two will suffice.

Secondly, don't worry about where all the tokens used in the lex file have been defined. Flex works very tightly together with Bison (which we will cover in the next chapter), and the tokens and some of the variables used come from Bison. Actually it's Bison that calls Flex asking for the next token in the program when it needs one, but we'll look at that in the next chapter. So for now you won't be able to compile lang.l into a scanner like we did with the small toy example above!

Finally, notice how I have simply written the reserved keywords as they are, and Flex recognizes them as regular expressions (because that's what they are ;)

Ok, so now you should be able to understand the lex file used for scanning PxdScript programs, so let's have a look at it:

Lang.l

/* pxdscript flex-file
 *
 * History:
 */

%{
#include "lang.tab.h"
#include <string.h>
#include "memory.h"
#include "error.h"

extern int lineno;

int commentDepth=0;
%}


/* exclusive startconditions */
%x commentsc

%%

    /* comment start-condition for recognizing multi-line comments */
<commentsc>{
\n                                   lineno++;
"*/"                                 { commentDepth--; if (commentDepth == 0) BEGIN(INITIAL); }
"/*"                                 commentDepth++;
"//"[^\n]*                           /* single line comments take precedence over multilines */;
<<EOF>>                              yyerror("Unclosed comment.");
[^*/\n]*                             /* ignore anything that's not a '*' or '/' to increase speed */;
.                                    /* ignore all other characters (the text actually out-commented) */;
}


    /* INITIAL start-condition for ordinary pxdscript-code */

[ \t]+                               /* ignore */;

"/*"                                 { commentDepth++; BEGIN(commentsc);}

\n                                   lineno++;
"//"[^\n]*                           /* ignore (singleline comment)*/;

bool                                 return tBOOL;
const                                return tCONST;
else                                 return tELSE;
for                                  return tFOR;
if                                   return tIF;
int                                  return tINT;
return                               return tRETURN;
program                              return tPROGRAM;
string                               return tSTRING;
void                                 return tVOID;
while                                return tWHILE;

trigger_on_init                      return tTRIGGERONINIT;
trigger_on_collide                   return tTRIGGERONCOLLIDE;
trigger_on_click                     return tTRIGGERONCLICK;
trigger_on_pickup                    return tTRIGGERONPICKUP;

setint                               return tSETINT;

TYPE_MDL                             return tENTITYMDL;

TYPE_PARTICLE                        return tENTITYPARTICLE;
TYPE_PLY                             return tENTITYPLAYER;
TYPE_LIGHT                           return tENTITYLIGHT;
TYPE_CAMERA                          return tENTITYCAMERA;

sleep                                return tSLEEP;
"&&"                                 return tAND;
"=="                                 return tEQUALS;
">="                                 return tGEQUALS;
"<="                                 return tLEQUALS;
"!="                                 return tNEQUALS;
"||"                                 return tOR;
"="                                  return '=';
">"                                  return '>';
"<"                                  return '<';
"!"                                  return '!';
"+"                                  return '+';
"-"                                  return '-';
"*"                                  return '*';
"/"                                  return '/';
"%"                                  return '%';
"{"                                  return '{';
"}"                                  return '}';
";"                                  return ';';
"("                                  return '(';
")"                                  return ')';
","                                  return ',';

0|([1-9][0-9]*)                      { yylval.intconst = atoi(yytext); 
                                     return tINTCONST; }

true                                 { yylval.boolconst = 1;
                                     return tBOOLCONST; }

false                                { yylval.boolconst = 0; 
                                     return tBOOLCONST; }

\"[^\"\n]*\"                         { yytext[strlen(yytext)-1] = '\0'; /* remove "'s from yytext */
                                     yylval.stringconst = strdup(yytext+1);
                                     return tSTRINGCONST; }

[A-Za-z_][A-Za-z0-9_]*               { yylval.identifier = strdup(yytext);
                                     return tIDENTIFIER; }

.                                    yyerror("Invalid character '%s'.", yytext);

%%

Status



So what is our total status? Well, remember that I mentioned that the front-end's job was to reject illegal programs - so which illegal programs can we reject now?
Not too many - only those that are illegal because they contain invalid tokens. So in its current state our "compiler" will reject programs such as:

Program start trigger_on_init {
int b@5;
}

but will happily accept programs like this:

program int foobar = 5 < 8;

Programs like the one above are the kind of illegal programs we'll reject in the next chapter.

IMPORTANT WARNING: There is one thing I'll point out to save you from a few sleepless nights - never begin a comment in a lex file at the first character of a line! Notice how I start the comments a few characters into the line. Try moving the comments all the way to the left and run flex on the file (add a comment in numbers.l). Well, maybe Flex is a nice tool but it sure also has some flaws ;)


Checklist


 


Final words



Well, that was the second chapter of this series. So far it's progressing fairly well I think. The next chapter will be a bitch - both for me to write and for you to read. Now you have been warned ;)

The good thing is that after chapter 3 the chapters will be shorter and after each completed chapter you will be able to see the progress that your compiler has made since the last chapter which is quite fun ;)

As always - send me comments on what I do well and what I do badly.


Keep on codin'
  Kasper Fauerby


www.peroxide.dk