PxdScript language tutorial series.

Chapter 10 - Assembling The Code:

by Kasper Fauerby


Hi again and welcome to chapter 10 of the PxdScript tutorial series. Today we'll assemble the output file emitted.a that we learned to emit in chapter 9 into binary byte code. Please scan over chapter 7 and 8 again to make sure you have all the stuff I said there about the VM fresh in memory because you'll need to have an understanding of how an interpreter reads and executes code to be able to follow this chapter.

So what's the problem?

Before we begin it'll be useful to think a bit about what we are actually trying to do and make a list of the potential problems we encounter to reach that goal. In chapter 9 we emitted the fully compiled program as byte code in symbolic form - some sort of assembly code. Now we want to assemble this file into a (single) stream of byte codes and save this to a binary file which will be our final output of the compiler/assembler. We then wants to save this byte code to a binary file along with any information about the functions and programs that our VM might need. Let's focus on the byte code stream a bit. Listed below is a small sample of the symbolic assembly code our compiler emits (it doesn't do anything in itself - it's just a random extract of some code but it contains all the things we are interested in studying).

goto stop_3
ifeq stop_1
ldc_string "a string nr "
ldc_int 100
iload 2
astore 1
call myfunc
iload 2

To convert this piece of code to binary byte code we have to make some decisions about what to do with the arguments of the byte codes. It's easy enough to convert the instructions like for example goto, iload or aadd into binary byte codes - we just make an enumeration of them saying that for example goto is represented by the byte '0', iload by '1' etc. But it's not necessarily clear what we should do with the arguments. One of the problems with the arguments is that they fall into different categories. Some represent constants like the arguments to ldc_int and some represents addresses / labels like the arguments to branching instructions like goto, ifeq etc. In some cases it's hard to tell what category a certain argument falls under - for example what about the argument to call? Is that an address (the address of the function) or is it a constant (representing the string name of the function called)? And what about the arguments to the instructions that loads or saves variables? Their arguments can be thought of as addresses into the execution stack (as explained in chapter 7) or they can be thought of as the integer constant of the variable address. We'll treat them in a third way actually as explained below.

To keep things simple we'll stick to having three types of arguments - a 32bit address into the code segment, a constant and a 16bit integer stored directly in the code stream. The 16bit integer argument is used with the iload and similar instructions where the arguments points out an address in the execution stack where the instruction should load or store the top of the stack. By storing this as a 16bit value in the code we've limited our functions and programs to have approx. 65000 local variables each - this seems reasonable to me :) We chose to consider the argument to the call instruction as a constant string, containing the name of the function to be called. 

This leaves us with two sub-problems to solve: what to do about the labels and arguments to branching instructions and what to do about constant arguments. I'll treat these two sub-problems in separate sections below. After we've solved these problems I'll show you how to design a binary file format to store our binary data as well as the extra information we need for our VM. Last we'll talk a bit about the source code for the assembler but not in too great detail because I've commented the source heavily to make it easy to understand.

The constant pool

Lets treat the easiest of the two argument problems first - what to do about the constant arguments. The obvious idea is to simply store them in the code segment as a series of bytes - no one can stop us from storing things that takes up more than one byte in the code segment and it's always clear what kind of argument an instruction needs. If we encounter a ldc_int instruction we know we should load a 32bit integer constant from the code, if we encounter a ldc_string we'll load a zero terminated string etc.
But it seems as a waste of space if for example a long string is referenced as a constant many times in the code. Each time we would store the entire string in the code. Also it seems a bit slow - each time we wanted to load the string from the code we would have to allocate memory for it, find out where it ends etc. All in all it seems like a messy solution - so we'll think of something else!
Instead we'll create a so-called constant pool which is just a list of all the different constants in the code. Every time an instruction refer to a constant we'll store a 32bit index into the constant pool in the code segment right after the instruction byte code. Here is an example:

ldc_string "a string nr "
ldc_int 100 ldc_string "a string nr "

When we assemble this piece of code we'll first encounter the constant string "a string nr ". We'll store this in the constant pool at index 0. Later we find the integer constant '100' and store this in the constant pool at index 1. Finally we encounter the string again, but as it's already in the constant pool we don't add it again. Now lets say the instruction ldc_string has byte code 33 and the ldc_int instruction has byte code 26 - then the final byte code for these three lines becomes (remember we store indices into the constant pool as 32bit - or 4 bytes):

33 0 0 0 0 26 0 0 0 1 33 0 0 0 0

If we knew that we wouldn't need more than approx. 65000 constants in our entire script collection we could chose to save the indices as 16bit values and save 2 bytes per reference to a constant. As it is now we only reduce the code size when loading constant strings (and only if they are longer than 3 characters) - the integer constants wouldn't take up more space in the code segment than the index - but to keep things simple and to avoid any limitations I've chosen to represent constant indices as 32bit values. In a script language on a PC we normally isn't that concerned about code size anyway. The important thing is that we know have a unified way of storing references to constants and this can easily be expanded to support other data types than integers and strings.

Labels & branching instructions

OK - handling the constants was easy enough and the problem with the instructions with addresses as arguments is only slightly harder. So lets see what we can come up with.

What we want to do is this: each time we encounter a reference to a label we'll store the actual 32bit address into the code segment of the instruction immediately after the label referenced. And each time we encounter a label we'll have to write down it's address so that we know what to write when we encounter a reference to it. This is easy enough with backward jumps in the code like for example:

iload 1
// do something here.
goto start_0

It's not hard to imagine places where this will happen - for example there must always be such a case in loops (while loops and for loops). It's easy because when we reach the reference to the label start_0 we've already written down the address of that label - namely the address of the iload byte code, the byte code immediately after the label. But there are also places in the code where we have jumps forward in the code (to skip a part of the code in a conditional like 'if' for example, or to terminate a loop). An example could be this little piece of code:

if_icmplt true_2
goto stop_3

Here we have a problem because when we reach the reference to the label true_2, for example, we haven't defined that label yet and thus we cannot know the address of it. This is why assembling cannot be done in a single pass over the code. We could either do two full passes and in the first pass we simply record the address of all labels - in the second we would then be able to convert all the references to a label into its actual address because we would always know the address from the first pass. Or we could do a single pass over the code and then simply make a list of "holes" in the byte code that are "TODO". That is, when we encounter a reference to a label that we haven't seen yet we simply reserve 4 bytes in the code segment and add a pair of (label, code address) to a list where label is the label referenced and code address is the address to the unfilled hole in the code segment we just reserved space for. When we've finished assembling all the code we must have encountered all labels referred to and stored their addresses. We can then simply run through the list and fill out the holes with the correct addresses. This is how I've chosen to do it in this assembler. I've called the pair (label, code address) a PLUG and to fill out the holes I call the function fillPlugs.

Now there is another choice to make about which address to store for a given label. One could store the address as the offset from the start of the byte code for the function or program in which the label appears. This would give a kind of local addressing scheme and one could move the function or program around in the code segment and the addresses for the labels would still be correct because the local offset would remain the same regardless of where the function is placed in the code segment. To convert a local address to an absolute (or global) address one would simply add the address of the first byte code in the function to the addresses stored in the code. This is similar to what a linker would do - link together pieces of code, each with compiled with addresses stored with locally to that piece of code. Since I see no need to move around the functions or programs in the code segment in our script language I've decided to store global addresses in the code segment - so we store all arguments to branching instructions as a 32bit absolute address into the code segment.

This brings up a small problem when trying to fill the "holes" described above. We'd like to fill all the holes at once for all the programs and functions - for example when the assembler has finished assembling all the functions and programs in the input text file, just before saving the code to file. The problem is that labels only have unique names for the function in which they appear. Seen globally the label start_0, for example, will probably appear many times - one time for each function or program. We need to store each appearance as different labels with different addresses. The solution is simply to prefix the label names with the name of the function or program in which they appear when we add them to the list of unfilled holes. As the names of the functions and programs must be unique in a script collection the combined string 'functioname + labelname' must be unique globally. See the code for more details on this little twist.

As for how the code for the different functions and programs is stored in the code segment: they are simply placed right after each other. We then store the address of the first and last byte code for a function or program in a header structure so that the VM will know where to find it.

Designing the file format

We'll also need to decide on some file format for our final output of the compiler. We need to store four different kinds of things in it: the code segment, the constant pool, function information and program information. I've made a simple format where things are stored in chunks where each chunk has an ID associated with it. See the file pxdScriptHeaders.h for more information about these chunk IDs and other structures and defines we use in the file format.

The chunks themselves can be stored in any order but in our case the output file will always be ordered like this: first comes a number of function and program chunks (as they appear in the input text file). Then follows the constant pool chunk and last the code segment chunk.

A function and program chunk is stored in the obvious way from looking at their header structures but lets look at the function header and discuss the fields:

typedef struct FUNCTIONHEADER {
   char *name; /* zero-terminated */
   unsigned char inputs; /* number of inputs */
   unsigned char outputs; /* returns (0 or 1) */
   unsigned char localLimit;
   unsigned short int stackLimit;
   unsigned int startCode;
   unsigned int endCode;

This is the information we'll store about a function for the VM. The fields are pretty obvious - the name is of course the name of the function, then comes the number inputs and outputs for this function. The VM will need this to be able to handle calls to the function. The field localLimit is one of the things we calculated earlier in our compiler (in the chapter on resource calculations) and it tells the VM how much space on the execution stack the function will need. This is useful so the VM can decide if it can run the function without getting a stack overflow at call time. It then doesn't need to check for stack overflows during the execution of the actual byte code which speeds up the execution of the byte codes that manipulates the stack. A program has a similar header structure that is saved to the file.

The constant pool is simply saved by first storing the number of entries in it followed by pairs of (type, data) where type is stored as a byte enumerating the different types and the data is stored differently depending on the type. Strings are always saved in a zero-terminated way (opposed to storing the length, then the string data). There really isn't anything fancy going on here.

The byte code is simply saved as a 32bit code length followed by a dump of the byte code array.

The code

I've actually explained most of what's going on in the code in the sections above but lets take a brief guided tour through the different files and functions.

The file 'opcodes.h' contains an enumeration of the different instructions of the PxdScript language. The instructions will be saved as byte code after these numbers and the VM will include this header file too, to know which opcodes it should be able to handle.

The header file 'pxdScriptHeaders.h' contains the definitions needed to save/load the binary file as well as the headers that describes the functions and programs in the file. The VM will also need this file to be able to load the assembled file and to know where to find the programs and functions called.

The pair of files 'asmutils.c' and 'asmutils.h' contains some helper functions that are totally uninteresting and has nothing to do with the assembling process as such - but we need them in the assembler.

Now lets browse the main assembler file 'assembler.c'. The first interesting thing we encounter is the struct 'LABELADDRESS' which is used to solve the problem of filling the holes in the byte code of unknown addresses. As described above it's a pair (label, code address). Then follows a few straight forward functions which manipulates some linked lists storing labels and managing the plugs.

A bit further down the file you'll encounter the function 'setLabelPosition(char *name)'. This function is called whenever the assembler reads a token that is a label rather than code. It'll simply create a new label, add it to the (global) list of known labels and record its address. Remember the address of a label is the address of the byte code immediately after the label - or the address of where we'll put the next byte code if we handle the label when we find it. The global variable 'codeposition' points out exactly this address. Also notice how we expand the label name to be unique by prefixing it with the name of the function or program currently being assembled, as discussed above.

Scroll down the file a bit further and you'll encounter the next interesting function 'void asmCODE(char *token)'. This function takes as argument a string containing an instruction read from the source text file. It then simply maps this string to the corresponding byte code. If the instruction has any arguments they are read here! For example look at where the function handles the 'goto' instruction. It knows that after a goto instruction follows the name of a label to jump to. The function reads the argument and as it's an address it registers this as a location in the code segment which should have the actual address "plugged" in later (as discussed above in great detail). A bit further down it handles the  ldc_int and similar instructions. It loads the argument, registers this as a constant and saves the 32bit constant pool index in the code stream. Finally, as we've discussed, whenever it meets a load/store instruction it'll read in the argument and store this as a 16bit integer value directly in the code segment immediately after the instruction byte code. The function is long and boring - but quite straight forward.

After the asmCODE function follows a few functions to dump the data to the output file. These should need no further explanation.

Now you've reached the main assembling function 'void AssembleFile(char *fn)'. This function first creates the output file and opens the file with the symbolic byte code (in our compiler this is always 'emitted.a'), does a little initialization and then starts to read top-level tokens from the input file. In our case this will always be a 'function' token or a 'program' token. Each case is handled by its own sub function. When all functions and programs has been assembled it'll fill out the plugs (by now we must have seen all the labels in the file) and then it'll emit the constant pool and code segment. The function and program headers are emitted by their sub functions when they are encountered.

The functions for assembling functions and programs are very similar and quite well commented so I won't say much about them. Their overall flow is: read information about the function/program (name, arguments, trigger type, stack limit etc.) then read and assemble the code instructions until the end of the function/program is reached. Then dump the function/program header to the output file.
And that's all there is to it - we have now assembled our program and our compiler is finished! <insert cheering sound here>


Final words

Wow - it's almost hard to believe isn't it? The compiler is really, truly finished now! Well, as finished as *I'm* going to make it for you that is! Now it's your turn to start extending it with features you need in your particular game or application. The compiler is pretty powerful as it is with regards to code generation, which syntax it allows and generates correct code for etc. - but it lacks some means of communicating with the "outside world", so to speak... with your game engine. Without these links it'll be able to beautifully run loops, juggle around with conditional expressions and calculate complex expressions on the stack. It'll concatenate strings, call functions, implement recursive algorithms - but you'll never know because there is no way of outputting information from the language - unless you create one! This is not entirely true - I've provided you with a few "general" means of communicating with a surrounding game engine - namely the setint language construct - and I'll use that in the next and final chapter in this series where we create a VM to show you that this thing actually works. But the setint language construct is not the best / most flexible way of communicating with the game engine and by now you know enough to do better on your own (see chapter 1 again for some inspiration).

Well, see you next time for some VM creation. I suspect that chapter is going to be fairly long even if we try and keep things simple but we'll see.. somehow I'll manage to get it done soon I hope :) Until then - go improve your compiler! (Actually, as you already know a bit about VMs you could also try and see if you could write one yourself, or at least a partially working one, before I write the next chapter...)

Keep on codin'

  Kasper Fauerby