PxdScript language tutorial series.

Chapter 6 - Type Checking:

Introduction

Hi again guys.

It has been a while since chapter 5 - sorry about that but I had other things to take care of. Yes, believe it or not, I actually spent some time on university work and even played a game or two instead of just making them. Anyway, the temporary craze has passed and I'm back to normal, typing away on chapter 6 of the PxdScript tutorial series - the last chapter before we have the front-end of the compiler done!
Yes, that's right - after today our compiler will be done analyzing the source code and all which will remain is to get it out in a compiled state to a binary file. That's the job of the backend but more on that in the next chapters.

Type Checking

Before we start looking at the source code let's talk a bit about what exactly type checking is. As I mentioned above the type checking phase is the last phase in the compilers front-end and its purpose is to check that the types of the different parts of an expression match (for example reject expressions like "foo" - 5) and that variables are used correctly with the right types. Any programming language has a bunch of types it operates with. When we designed PxdScript we introduced the following types :

void : the 'empty' type.
int : the type consisting of whole numbers.
bool : the type consisting of truth values true and false.
string : the type representing a series of characters (actually a series of char's - a type we chose not to include)

When you make a variable declaration in your source code like for example

int x;

you are actually defining an invariant promising that at any time the variable x will only contain values of the type int.

A program is type correct if all such invariants are valid.

Unfortunately this is an undecideable problem - in other words, if this is our only restriction then we cannot program a phase to check if a program is valid or not. If that were the case then this series would be over and as you can see I have planned 7 more chapters which should be a good indication that we'll do something else. To see why this is undecideable one must realize that the validness of the invariant depends on program flow - for example the following piece of code :

int x;
if (false)
x = "foobar";

actually is a type correct program. At any time in program execution the variable will only contain values of type int. Now in this simple case it is easy to see that x will never be assigned the string. One could even add a rule to the weeding phase to weed out all occurrences of if(false) statements. However it is not hard to construct more complex examples where it is not so easy for a computer-program to see that the code is indeed valid even if it seems obvious to humans. For example :

int x = 3;
int y = 7;
bool b = (x == y);
if (b)
x = "foobar";

I'll bet that no-matter how many and how good rules you make to check these things then I can always come up with a valid program which your compiler will reject!

So, instead of going for a type correct program we decide to simply define a bunch of type rules and then we say that a program is statically type correct if it satisfies all these rules. We will always end up with a type-checker which accepts too few programs! It will always reject certain valid programs. The way to improve a type system is to find a special case which the type-checker rejects even though it should accept it and then write some new code to make it accept that case - without breaking the code so it will accept illegal programs too! If you want your 5 minutes of fame you could download the source to the GNU compiler gcc and do this :)

Type Rules

So now we have to define a bunch of type rules for PxdScript. I have decided to go for a fairly strict type system where each component in an expression must match each other perfectly. For example, you can only add int to int, assign string to a string variable etc. Well, actually this is a truth with some modifications. Remember that I added the cast operator when I designed the language. This means that you can explicitly cast a variable to certain other types. Of course we need to put up type rules about which variables can be cast to which etc. but in the case of PxdScript where we have so few types that it should be fairly easy.

However, one would quickly get tired of explicitly having to cast variables even for the most trivial cases. One would like the language to behave like for example C where it is legal to assign an int to a float variable etc. This is called implicit casting and it simply means that in certain cases the compiler will insert casts for you when it can be done without loss of data. For example the following would be legal (if we had floats in PxdScript) :

int x = 3;
float y = x;

because the integer value could be cast to a float without any loss of data. On the other hand the opposite direction :

float x = 3.4;
int y = x;

would not be accepted because even though a float could be cast to an int we should not do so automatically because chances are that it was a mistake and the programmer actually meant for y to be a float variable too. In this case the programmer would have to explicitly cast the variable like this :

float x = 3.4;
int y = (int)x;

The beauty of having implicit casts is that it allows us to define strict type rules like I mentioned above. The compiler will actually insert a cast-node into the AST to make any implicit cast explicit and thus the types of the components in the expressions will always match perfectly. The problem is of course that we have to code this functionality but we wont let this hold us back.

Notice that while explicit casts are something you can define generally for the types then implicit casts are something you define for each type of operation. There are quite a few places in PxdScript where we apply implicit casting and one of the places is the addition rule. Of course we allow strings to be added to strings but we'll also allow adding ints to strings like this:

int nr = 5;
string s = "this is nr "+nr;

This is a good example of an implicit cast which only holds for the addition operation. It would not make sense to allow the same for subtraction!!

Ok, so which explicit casts do we allow then?

Source type:	Legal casts:
void	(none)
int	string, int
bool	string, bool
string	string, int, bool

One could argue that one should be able to cast between bool and int too, but then you are in the lucky position of having the source code for the type checker so you can add this yourself ;)

Besides all this casting stuff the type rules are as you would expect them to be when I told you they were strict. Look through the code to see the exact rules.

The Code

It should come as no surprise that we do the type checking by making a recursive run through the AST, just as we have seen it in the other phases. Most of the code is pretty self-explaining but there are a lot of functions which are not part of the recursion but rather tools we need in the type checking. I'll step through them here:

The first is initTypes() which simply initializes a few variables we use to compare with when we want to check if a certain type is int or bool etc.
One thing to notice is the variable undefinedTYPE which I declare here. As I'm sure you know from your work with C++ (or even worse: JAVA) a small error can result in a huge list of cryptic errors from the compiler and quickly one learns to start with the topmost error, correct that one and then try to compile again. Often a lot of the other errors has disappeared too then. This is because they weren't really errors but rather results of the first error!

Now that you know a bit more about compilers you might be able to see why this happens. Imagine there is a type error deep inside an expression. In the AST this error will then be deep down some branch and as the type checking is recursive it will also happen deep down a recursion. When the error is detected the compiler marks that particular node in the AST as not type correct and returns from the recursion. The parent to the node probably uses the node and expects it to be of a certain type. Because of the error though the node has not been marked as having the expected type and therefore the parent will also fail its type check and report another error and so on.

The idea behind this undefinedTYPE is that when an error occurs the node is marked as having this type and the parent will check for this and not report an error if the child has been marked with this type. The result is that PxdScript will behave nicely and not come up with those cryptic error messages we know from other languages.

The next function worth noticing is

TYPE *greaterType(TYPE *l, TYPE *r)
{
    /* Pass on undefinedTYPE: */
    if (l == undefinedTYPE || r == undefinedTYPE)
        return undefinedTYPE;

    /*bool*/
    if(l->kind == boolK && r->kind == boolK)
        return boolTYPE;
       /*int*/
    if(l->kind == intK && r->kind == intK)
        return intTYPE;
     /*string*/
    if(l->kind == stringK && ( r->kind == stringK || r->kind == intK || r->kind == boolK))
        return stringTYPE;

    /* l is not greater than or equal to r */
    return undefinedTYPE;
}

This function takes two types and returns the 'greatest' of them after a certain ordering.

The ordering is fairly intuitive and represents an ordering of how complex the different types are compared to each other. The function only defines half the ordering - i.e.. it is based on the 'left' parameter and then looks at the right. This is because when we need it we simply call it twice, the second time with its arguments switched. Look at the source below for more details about using this function.

Next comes:

TYPE *greatestCommonType(TYPE *t1, TYPE *t2) {
    TYPE *type;

    if((type = greaterType(t1,t2)) != undefinedTYPE)
        return type;
    else if((type = greaterType(t2,t1)) != undefinedTYPE)
        return type;

    return undefinedTYPE;
}

which does exactly what the name implies - it finds the greatest common type that the two types are both compatible with. This is used when deciding which type one should give an expression and is also used to find out about the implicit casts. For example it is used in the equals node in the AST where you could (theoretically) write:

bool b = true;
if (b == "true") { .. }

In this, one parameter would be of type bool and the other of type string. greatestCommonType would figure out that the greatest type the two have in common is string and the compiler would insert an implicit cast to cast b to a string. The final code would then look like:

bool b="true;
if ((string)b == "true") { .. }

Next follows the function:

bool assignable(TYPE *left, TYPE *right) {
    /* undefinedTYPE always yields success: */
    if (left == undefinedTYPE || right == undefinedTYPE)
        return true;

    if (greaterType(left, right) != undefinedTYPE)
        return true;
    else
        return false;
}

which decides if the right type can be assigned to the left type. This is possible exactly if the left type is 'greater' than the right.

And finally, the last 'utility' function is the function which implements the casting rules from the table above:

bool validCast(TYPE *source, TYPE *dest) {

    /* rules for cast int -> ?? */
    if (source->kind == intK)
        if (dest->kind == stringK || dest->kind == intK)
            return true;

    /* rules for cast string -> ?? */
    if (source->kind == stringK)
        if (dest->kind == intK || dest->kind == stringK || dest->kind == boolK)
            return true;

    /* rules for cast bool -> ?? */
    if (source->kind == boolK)
        if (dest->kind == stringK || dest->kind == boolK)
            return true;

    return false;
}

Ok, now for the recursion through the AST. As usual we start at the topmost level and work our way down the tree. The first interesting point in this recursion comes in the function which type-checks variable declarations:

void typeDECL(DECL *d)
{
switch(d->kind){
case formalK:
    break;
case variableK:
    // recurse on initialization
       if(d->val.variableD.initialization != NULL){
        typeEXP(d->val.variableD.initialization);
     // check
if we can assign initialization to the variable
   if (!assignable(d->type, d->val.variableD.initialization->type))
        reportError(d->lineno, "Cannot assign %s value to %s lvalue",
                    typeToString(d->val.variableD.initialization->type),
                    typeToString(d->type));

   // Notice we check if the undefinedTYPE flag is set and only touch the node if it isn't
    if (d->type != undefinedTYPE && d->val.variableD.initialization->type != undefinedTYPE) {
        /* make implicit casts explicit */
        if (d->val.variableD.initialization->type->kind != d->type->kind){
            d->val.variableD.initialization = makeEXPcast(d->type, d->val.variableD.initialization);
            d->val.variableD.initialization->type = d->type;
        }
   }
}
break;
case simplevarK:
....
break;
}

if(d->next != NULL)
    typeDECL(d->next);
}

Look through the code above and note how we apply both type-checking and implicit cast here.

The last bit of code I'll list here is the code for type-checking a call to a function:

void typeEXP(EXP *e)
{
    struct TYPE *type;
    int argumentNo;
    struct DECL *currentFormal;
    struct EXP **currentArgument;

    if (e == NULL)
        return;

    switch (e->kind) {
    .
      .
     case invokeK:
          // First type check all the arguments. Remember these can be
          // non-trivial and consist of complex expressions and even other
         // function calls.
        if (e->val.invokeE.arguments != NULL)
            typeEXP(e->val.invokeE.arguments);
        // Check which kind of symbol the invoked name refers to
        if (e->val.invokeE.symbol->kind == functionSym) {
            /* It is a function */

            /* Check argument types against formal types: */
            argumentNo=1;
            currentFormal = e->val.invokeE.symbol->val.functionS->formals;
            currentArgument = &(e->val.invokeE.arguments);

            while (currentFormal != NULL && (*currentArgument) != NULL) {
                /* Test if current argument is assignable to current formal */
                if(!assignable(currentFormal->type, (*currentArgument)->type)) {
                    reportError(e->lineno,
                                "Argument %d cannot be assigned to formal %d in function '%s'",
                                argumentNo,
                                argumentNo,
                                functionToSignatureString(e->val.invokeE.symbol->val.functionS));

                }

                /* make implicit casts explicit */
                type = greaterType(currentFormal->type, (*currentArgument)->type);
                if (type == undefinedTYPE) // if no greater type then we know there has been an error and bails
                    break;

                if ((*currentArgument)->type->kind != type->kind){
                    *currentArgument = makeEXPcast(currentFormal->type, *currentArgument);
                    (*currentArgument)->type = currentFormal->type;
                }
                       /* Next: */
                argumentNo++;
                currentFormal = currentFormal->next;
                currentArgument = &((*currentArgument)->next);
            }

            /* Report argument count errors: */
            if (currentFormal != NULL && (*currentArgument) == NULL)
                reportError(e->lineno, "Too few arguments to function '%s'",
                            functionToSignatureString(e->val.invokeE.symbol->val.functionS));
            else if (currentFormal == NULL && (*currentArgument) != NULL)
                reportError(e->lineno, "Too many arguments to function '%s'",
                            functionToSignatureString(e->val.invokeE.symbol->val.functionS));
        } else {
            /* It is a program */
            reportError(e->lineno, "Programs cannot be called");
        }

        /* Set expression's type to the returntype of the function: */
        e->type = symbolType(e->val.invokeE.symbol);
    break;
     .
     .
     .
   }// end switch
    if(e->next != NULL)
    typeEXP(e->next); }

Read through the above until you get the hang of it and then move on to the source you downloaded for this chapter and look through the rest. As with many other phases in a compiler there is a lot of code, but it's not that hard when you have an idea about what is going on in it.

Checklist

Final words

Well, that was the last phase in our front-end. Our compiler should now be able to reject all illegal programs as well as provide the user with good error messages when errors occurs. It is possible that there are errors in the code for this chapter. Type-checking is one of the mode complex phases and there are a lot of special cases to take into account when one starts playing around with this undefinedTYPE stuff etc. So if you find (and maybe even fix) an error in the source please email me so that I can add the changes to the 'notes' page.

We are now about halfway through our compiler. My teacher once made a funny comment about writing compilers. He said "Writing compilers is like skiing. At first you walk uphill (the front-end) for hours which is fairly tedious but then you reach the top and it's all getting fun from here (the back-end)". So lets take a breather while sitting on the top, looking down at what we have accomplished so far but fasten your seatbelt because next time we'll start the trip downhill ;)
Kasper Fauerby

www.peroxide.dk