PxdScript language tutorial series.

Chapter 5 - Symbol Checking:

Introduction

Hi again guys. Today's topic is symbol checking - the second of our analyzing modules in the PxdScript compiler. In this phase we will check that all symbols (that is programs, functions and variables) are actually defined and visible at the point in the code where they are used. To do this we will build up a so-called symbol table containing all declarations with pointers to the position in the code where the symbol is declared. We will then use this as a look-up table whenever a symbol is used in the code to check that the symbol has been declared. Symbol checking is the only module which requires two recursions through the AST but more on this later. As most of these analyzing modules the important code is fairly small and easy to understand but there is a lot of unimportant "wrapping" code because of the recursion through the AST which takes up a lot of code lines. Download the source code and look through it and you will have no problem understanding it - there is about 100 important lines of C-code max and it should be commented and easy to spot.

The symbol table:

The symbol table is represented as a hash-table. I will not explain the hash-table data structure here but if you are uncertain what a hash table is you can look it up on the net elsewhere. There is a slight twist though - a symbol table is actually a hierarchical hash-table structure. Each symbol table has a pointer to a parent symbol table which is used to model the concept of different scope levels in the source code - but more on this later. The hash-table nodes are the SYMBOL structure defined in tree.h and looks like this:

typedef struct SYMBOL {
    char *name;
    SymbolKind kind;
    union {
        struct FUNCTION *functionS;
        struct PROGRAM *programS;
        struct DECL *declS;
    } val;
    struct SYMBOL *next;
} SYMBOL;

It contains the name of the symbol - ie. 'i' or 'myfunction' etc. Then it contains the kind of the symbol - a function, a program or a variable - along with pointers into the AST where the symbol is defined. And finally, as all hash-table nodes does, it contains a next-pointer to the next symbol with the same hash-code.

The hash-table is then defined like this:

#define HashSize 317

typedef struct SymbolTable {
SYMBOL *table[HashSize];
struct SymbolTable *parent;
} SymbolTable;

The 'HashSize' is not chosen totally randomly - of course it's a prime number but in addition experiments has shown that under the hash-code function we will use this size gives the most equal distribution of symbol names over the hash-table entries. The importance of this might be limited but hey - why not use the results other people made for us ;)
So what is the hash-code function then? Easy enough - it is simply calculated based on the characters making up the symbol name like this:

int Hash(char *str){
   unsigned int hash = 0;
   while (*str) hash = (hash << 1) + *str++;
   return hash % HashSize;
}

The scope rules:

Ok, so now we know how to represent the symbol table - but what exactly is it's purpose? What information must it contain?
Well, what we need is to know at every single possible position in the program, which symbols can we see? If we refer to a variable, invoke a function etc. then we need to know if such a name has been declared, and if it has been declared multiple times then we must decide if this is legal and also which of the definitions do we actually refer to?
This might sound like we need a lot of memory to store all those symbol tables - remember, we need one at every single possible point in the program! Well, luckily we can cut down a bit on the amount of symbol tables and only store one for each scope-change in the program. We all know how scope rules normally works - even though you might not have thought about it before. A scope change can only appear in a few different places in a program: We have a top-level scope which includes the entire source file. Since we do not have functions local to a program this is the only scope in which programs and functions can appear. This leaves us with the variable declarations. As in C each program or function introduces its own scope - it is perfectly legal to have a variable called 'i' in more than one function for example! We can also explicitly introduce a new scope by using the compound statement structure - i.e.. encapsulate a bunch of code lines with { }.

We also sometimes use the compound statement when writing a for-loop, if, if-then-else or while-loop so each of these constructs might introduce a new scope for their body. It is for example legal to write:

int i = 5;
if (i == 5) {
string i = "another i";
MyStringFunction(i);
}
i = 8;

that it, when inside the body of the 'if' statement we understand the variable 'i' to be a string, while it is an integer when outside. This is because the string variable 'i' is only a temporary variable, and once the program flow leaves the body of the if-sentence we can no longer see or manipulate the string variable declared within. This had been equally true even if we hadn't introduced a name-clash.
So, how do we figure out which variable is being referred to? The rule is to always use the variable declared closest to the place where we refer to it. If the variable has not been declared locally we are allowed to go up a scope level and look for it in that symbol table. We are not allowed to go down into another sub-scope of course. This is just like the rules you know from C/C++.
This also means that the following program snippet actually is legal:

int i = 5;
if (i==5) {
i = 8; // Here we refer to the 'i' in the parent scope.
string i = "another i";
MyStringFunction(i); // Here we refer to the locally declared string
variable 'i'.
}

The above is legal because at the point where we assign 8 to 'i' the string variable has not been declared, so we go up a scope and find the integer variable 'i' in the parent scope - get it?

In C/C++ there are also strict rules about which functions are visible at a certain point in the program. C/C++ requires that when you call a function it must be listed somewhere above in the source code - otherwise the C compiler will complain. This is a problem when 2 functions wishes to call each other - i.e.. the following code snippet is not legal in C/C++:

int a(int value) {
if (value > 0)
return b(value-1);
return 0;
}

int b(int value) {
if (value > 0)
return a(value-2);
return 1;
}

To solve the problem it is possible to declare a so-called prototype in C/C++ - that is, a promise to the compiler about that a certain function will be available for the compiler when it has finished compiling. This is fairly old-fashioned and in most modern programming languages mutual recursion like above is legal without prototypes and as we want PxdScript to be a modern language we will make it legal too (well, maybe also because it's a nice thing ;)

Two-pass checking:

So far everything has been possible with a single run through the AST. Because we want this mutual recursion described above this is no longer possible. Of course C/C++ had a reason for introducing the prototype language construct. With prototypes and the other standard scope rules we never needs to know anything about the source code appearing after a symbol is referred. If the symbol must be defined above in the source then we know that if it is legal to refer to the symbol, then it is already in the symbol table, right?
This is no longer the case because now functions can call other functions which will only appear later in the source. The problem is easily solved though simply by doing symbol checking in two passes through the AST. In the first pass we gather information about which symbols are declared and in which scope. Because we do not want to be able to refer to variables being declared later in the source we do all the symbol checking for variables in the first run - when referred to, a variable must be found in the symbol tables as they are at the point in the program where we refer to the variable.
In the second pass we know all symbols has been entered into the symbol tables and now we can check function invocations with mutual recursion.

Manipulating the symbol table:

Ok, now you should have a pretty good idea about what is going to happen - as usual we will recurse on the AST and whenever we find a declaration we check with the symbol table if it has already been defined in the current scope level - if it has we report an error, if not we put the new symbol into the symbol table. There are 3 functions we use to manipulate the symbol table:

SYMBOL *putSymbol(SymbolTable *t, char *name, SymbolKind kind) {
int i;
SYMBOL *s;
i = Hash(name); // calculate hash-code for symbol name
for (s = t->table[i]; s; s = s->next) {
    if (strcmp(s->name,name)==0) return s; //check if already defined
}
s = NEW(SYMBOL);
s->name = name;
s->kind = kind;
s->next = t->table[i];
t->table[i] = s; // Create new symbol and insert into the symbol table
return s;
}

SYMBOL *getSymbol(SymbolTable *t, char *name) {
int i;
SYMBOL *s;
i = Hash(name);
for (s = t->table[i]; s; s = s->next) {
    if (strcmp(s->name,name)==0) return s;
}
if (t->parent==NULL) return NULL;
return getSymbol(t->parent,name);
}

int defSymbol(SymbolTable *t, char *name) {
int i;
SYMBOL *s;
i = Hash(name);
for (s = t->table[i]; s; s = s->next) {
    if (strcmp(s->name,name)==0) return 1;
}
return 0;
}

The code should be easy to understand but notice that while 'getSymbol' uses the scope-rules and tries to find the symbol in the parent scope (the parent symbol table) then the Boolean function 'defSymbol' will only search the actual symbol table it is given. Thus 'defSymbol' should be used to check if a symbol is already defined in the same scope while 'getSymbol' is used when we actually wish to retrieve the symbol if we can.

The two passes:

void sym1PassFUNCTION(FUNCTION *f, SymbolTable *sym) {
struct SYMBOL *symbol;

if (defSymbol(sym, f->name)) {
    reportError(f->lineno, "Duplicate declaration of function '%s'", f->name);
} else {
symbol = putSymbol(sym, f->name, functionSym);
    symbol->val.functionS = f;

    f->sym = scopeSymbolTable(sym);
    if (f->formals != NULL)
      sym1PassDECL(f->formals, f->sym);
    if (f->stms != NULL)
      sym1PassSTM(f->stms, f->sym);
}
}

So first we check if another symbol with the same name already exist in the same scope. If it does, we report an error.

If not, we insert the symbol name into the symbol table (and from the parameter we tells the function that we are inserting a function - check back on the SYMBOL structure to see why). We also set the pointer in the SYMBOL node to the point in the AST where the declaration appears.

As I mentioned a function introduces its own scope, so we create a child symbol table to the current one by calling scopeSymbolTable (check the source in symbol.c) and recurse on the function body with that new child symbol table. Also notice that at this point we save a reference to the symbol table in the function node in the AST (f->sym = scopeSymbolTable(sym)).
There are a few other nodes in the AST where we save a pointer to the symbol table valid at that particular point. Look for these places in tree.h - I have marked them with the comment /* Symbol: */

Program nodes are handled just like functions and variable declarations aren't to complicated either. Check through the source for the details.
The second pass does nothing except at the invocation nodes in the AST where it checks that the function is visible (after our modern scope rules). This is also left as an exercise for you to understand by reading the source.

Checklist

Final words

That was that! We have now symbol checked our script program and build up the symbol table which we are going to need in the next module - namely the type checking module which will be a bit harder than this one. Play around with some examples to make sure that all this actually works and stay tuned for chapter 6 soon. Keep on codin'
Kasper Fauerby

www.peroxide.dk