Nora Sandler

Book Update

2023-10-17T14:00:00+00:00

I’ve got a couple of updates about my upcoming book Writing a C Compiler, which I first announced in a blog post last year.

I’ll start with the bad news: we’ve had to push the release date back until mid-2024. But I also have good news, which is that the entire book is now available in early access to anyone who’s preordered through the No Starch Press website. I’ve also made the book’s companion test suite and reference implementation available on GitHub.

If you preordered this book last year, I realize you’ve been waiting a long time for it! I’m excited to make this early access version available, so you can start working on your compiler before the official release date. As with any early access book, you might still run into typos, layout problems, and the like, which will get fixed before the book is released. The test suite and reference implementation are also still works in progress. Between now and the book’s release date, I’ll be adding more test cases, especially for the last three chapters and the extra credit features; I’m also planning to make readability improvements to the reference implementation. Even though these codebases aren’t quite complete, they’re close enough to use while you work on the project.

Feedback!

If you have questions or corrections about the early access chapters, please email me. You can also report errors through No Starch’s Early Access comment form¹. And if you run into any issues with the test suite or reference implementation, please file a bug in that repo’s issue tracker. You can file bugs against the test suite here, and against the reference implementation here.

¹ No need to report typos/formatting issues/etc.; those will get fixed when the book goes through copyediting and proofreading.↩

Writing a C Compiler is a book!

2022-03-29T16:00:00+00:00

Update here.

I have some very exciting news to share: the “Writing a C Compiler” series is now a book!

Writing a C Compiler: Build a Real Programming Language from Scratch is coming out from No Starch Press in late 2023. You can preorder at the link to get early access to the first few chapters.

In the last post in the series, I said that I was going to take a six-month break to figure out how to finish the compiler. Instead, I took a three-year break, reworked the backend, implemented the rest of the features I wanted to add (well, most of them), and wrote a book. If you were already following the series, you can jump to this section to learn what’s changed. Otherwise, read on for an elevator pitch!

What’s the deal with this book?

Writing a C Compiler is a hands-on guide to, well, writing your own C compiler. It takes the same basic approach as the series of blog posts I published here a few years ago. You start out by compiling the tiniest possible C program to x64 assembly, then add a new feature in each chapter. This book is all about compiling a real, widely used programming language into real assembly code, with all the low-level details and ugly edge cases that entails.

At the same time, I wanted to write this book for a broad audience, not just people who already know assembly code or have the C standard memorized. So I’ve tried to lay out the whole process, ugly edge cases included, in a way that’s accessible, easy to follow, and maybe even fun. The implementation code in the book is all pseudocode, so you can implement your compiler in whatever language you want!

Here’s a non-exhaustive look at what you’ll learn:

Part I introduces the basics, like expressions, variables, control flow statements, and function calls.
Part II adds more types, including floating-point numbers, arrays and pointers, and structs.
Part III covers a few classic optimizations, like constant folding, dead code elimination, and register allocation.

I didn’t include every feature in the C standard, but I wanted the end result to feel complete. I’ve also tried to cover the fundamentals that you’ll need to know if you want to keep building out new features on your own.

What if I’ve already done the series?

When I started working on the book, I thought that I’d just be building on the existing series. But the implementation in the book quickly diverged from what I’d originally posted. The most obvious problem is that the original design produced 32-bit x86 assembly, which was quickly becoming obsolete even when I first started the project back in 2017.

The other problem was that I needed a new intermediate representation. Converting the AST directly to assembly worked well for the first few chapters, but got more and more unwieldy as the project went on. I knew that things would only get worse as I started to add new types, and optimizations were going to be really difficult. The new implementation converts the program to three-address code before it generates assembly.

The upshot is that I won’t be continuing the series on this blog. The good news, of course, is that you can finish your compiler by working through the book, which covers a lot more ground! The bad news is that you won’t be able to skip straight to Part II; you’ll have to bring your backend in line with the implementation described in Part I first. Hopefully, the payoff of finishing your compiler will be well worth the extra work!

Update 3/1/2023

An earlier version of this blog post said the book would be coming out in January 2023. Unfortunately, we’ve had to push back this release date until later this year. If you’ve preordered the book, thanks so much for your patience; the wonderful folks at No Starch Press and I are working hard to make this the best book possible!

C Compiler, Part 10: Global Variables

2019-02-18T17:00:00+00:00

This is the tenth post in a series. Read part 1 here.

We’re back! I said I was going to do a non-compiler post next, but that turned out to be a lie. Instead, we’re going to implement global variables. This isn’t too complicated, but it lets us learn about some new sections of object files and program memory.

As always, tests are here.

Note for macOS Users: since the last post, Apple started phasing out support for 32-bit programs on macOS. What that means for us is that if you’re using the default C compiler on macOS Mojave, you’ll get an error if you try to compile for a 32-bit backend¹:

$ gcc -m32 example.c
ld: warning: The i386 architecture is deprecated for macOS (remove from the Xcode build setting: ARCHS)
ld: warning: ignoring file /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd, missing required architecture i386 in file /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd
ld: dynamic main executables must link with libSystem.dylib for architecture i386
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ld: warning: The i386 architecture is deprecated for macOS (remove from the Xcode build setting: ARCHS)

But never fear! The Homebrew version of GCC works just fine, although it still emits a warning:

$ gcc-8 -m32 static.c
ld: warning: The i386 architecture is deprecated for macOS (remove from the Xcode build setting: ARCHS)

I’m pretty sure there’s a way to get the default compiler to build 32-bit programs as well, but I don’t know what it is.

When you run a 32-bit program (like the ones produced by your compiler), you might also get a warning that it isn’t optimized for your computer. This is also due to Apple’s efforts to phase out 32-bit programs, but you don’t need to do anything about it.

The bigger issue, of course, is that the next version of macOS won’t run 32-bit programs at all. I plan to update all my posts before that happens to cover 64-bit compilation too. And yes, I do regret targeting a 32-bit architecture to begin with, thank you for asking. Luckily, apart from calling conventions, all the differences so far are pretty minor.

With that out of the way, let’s move on to…

Part 10: Global Variables

We can already handle local variables declared inside functions. Now we’ll add support for global variables, which any function can access.

int foo;

int fun1() {
    foo = 3;
    return 0;
}

int fun2() {
    return foo;
}

int main() {
    fun1();
    return fun2();
}

Note that global variables can be shadowed by local variables of the same name:

int foo = 3;

int main() {
    int foo = 4; //shadows global 'foo'
    return foo; // returns 4
}

Global variables are similar to functions in that they can be declared many times, but defined (i.e. initialized) only once:

int foo; // declaration

int main() {
    return foo; // returns 3
}

int foo = 3; // definition

And, like functions, global variables must be declared (but not necessarily defined) before they’re used:

int main() {
    return foo; // ERROR: not declared!
}

int foo;

Declaring a function and a global variable with the same name is an error:

int foo() {
    return 3;
}

int foo = 4; // ERROR

Unlike local variables, global variables don’t need to be explicitly initialized. If a local variable isn’t initialized, its value is undefined, but if a global variable isn’t initialized its value is 0.

int main() {
    int foo;
    return foo; // This could be literally anything
}

int foo;

int main() {
    return foo; // This will definitely be 0
}

Note that we’re using the terms “declaration” and “definition” the same way we did for functions. This is a global variable declaration²:

int foo;

This is both a declaration and a definition:

int foo = 1;

The static and extern keywords would add some extra complications, but we won’t support those yet.

Now let’s move on to…

Lexing

No new tokens this week, so we don’t have to touch the lexer.

Parsing

Previously, a program was a list of function declarations. Now it’s a list of top-level declarations, each of which is either a function declaration or a variable declaration.

So our top-level AST definitions now look like this:

toplevel_item = Function(function_declaration)
              | Variable(declaration)
toplevel = Program(toplevel_item list)              

And we need a corresponding change to the top-level grammar rule:

<program> ::= { <function> | <declaration> }
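Assuming the <declaration> rule from earlier posts still starts with int <id>, both alternatives begin the same way, so the parser has to look one token further to decide which one it’s dealing with. Here’s a minimal sketch of that lookahead in C. The token and item types (token_kind, item_kind, classify_toplevel) are hypothetical names made up for this example; your own lexer’s tokens will look different.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds; a real lexer has many more. */
typedef enum {
    TOK_INT, TOK_ID, TOK_LPAREN, TOK_ASSIGN, TOK_SEMICOLON, TOK_OTHER
} token_kind;

typedef enum { ITEM_FUNCTION, ITEM_VARIABLE, ITEM_ERROR } item_kind;

/* Decide which top-level item starts at tokens[0] by looking at the
 * token that follows `int <id>`:
 *   "("        -> function declaration or definition
 *   "=" or ";" -> variable declaration, with or without an initializer */
item_kind classify_toplevel(const token_kind *tokens, size_t count) {
    assert(tokens != NULL || count == 0);
    if (count < 3 || tokens[0] != TOK_INT || tokens[1] != TOK_ID)
        return ITEM_ERROR;
    switch (tokens[2]) {
    case TOK_LPAREN:
        return ITEM_FUNCTION;
    case TOK_ASSIGN:
    case TOK_SEMICOLON:
        return ITEM_VARIABLE;
    default:
        return ITEM_ERROR;
    }
}
```

Once you know which kind of item you have, you can hand off to the function parser or the declaration parser you already wrote.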

☑ Task:

Update the parsing pass to support global variables. The parsing stage should now succeed on all valid examples in stages 1-10.

Code Generation

Global variables need to live somewhere in memory. They can’t live on the stack, because they need to be accessible from every stack frame. Instead, they live in a different chunk of memory, the data section. We’ve already seen what a running program’s stack looks like; now let’s step back and see how all of its memory is laid out³:

The x86 instructions we’ve been dealing with so far all live in the text section. Our global variables will live in the data section, which we can further subdivide into initialized and uninitialized data—the uninitialized data section is usually called BSS⁴.

So far we’ve only generated assembly for the text section, which contains actual program instructions; let’s see what the assembly to describe a variable in the data section looks like:

    .globl _my_var # make this symbol visible to the linker
    .data          # what's next describes the data section
    .align 2       # this data should be aligned on 4-byte intervals
_my_var:
    .long 1337     # allocate a long integer with value 1337

A couple things to note here:

The .data directive tells the assembler we’re in the data section. We’ll also need a .text directive to indicate when we switch back to the text section.
A label like _my_var labels a memory address. The assembler and linker don’t care whether that address refers to an instruction in the text section or a variable in the data section; they’re going to treat it the same way.
On macOS, .align n means “align the next thing to a multiple of 2ⁿ bytes”. So .align 2 means we’re using a 4-byte alignment. On Linux, .align n means “align the next thing to a multiple of n bytes”, so you’d want .align 4 to get the same result.
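To make the platform difference concrete, here’s a rough sketch in C of a helper that writes out the definition of an initialized global. The names (target_os, emit_global_definition) are invented for this example. It also assumes that Linux symbols don’t get the leading underscore that macOS symbols do, and it emits the .text directive right away so later function definitions land in the right section.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { TARGET_MACOS, TARGET_LINUX } target_os;

/* Write the assembly for an initialized 4-byte global into buf.
 * Returns whatever snprintf returns (the length it wanted to write). */
int emit_global_definition(char *buf, size_t size, target_os os,
                           const char *name, int value) {
    assert(buf != NULL && name != NULL && strlen(name) > 0);
    /* macOS: .align takes a power of two, symbols get a leading "_".
     * Linux: .align takes a byte count; no underscore (an assumption
     * of this sketch). */
    const char *prefix = (os == TARGET_MACOS) ? "_" : "";
    int align = (os == TARGET_MACOS) ? 2 : 4;
    return snprintf(buf, size,
                    "    .globl %s%s\n"
                    "    .data\n"
                    "    .align %d\n"
                    "%s%s:\n"
                    "    .long %d\n"
                    "    .text\n",
                    prefix, name, align, prefix, name, value);
}
```

Calling it with TARGET_MACOS, "my_var", and 1337 produces the same directives as the listing above, followed by .text.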

Once you’ve allocated a variable, you can refer to its label directly in assembly:

    movl %eax, _my_var # move the value in %eax to the memory address of _my_var

So the basic gist here is:

When you encounter a declaration for a global variable, add it to the variable map. The variable map entry will be its label instead of a stack index:
```
 var_map = var_map.put("my_var", "_my_var")
```
Note that this new variable map entry must be visible when we generate later top-level items; this isn’t true of entries we add while processing function definitions.
When you encounter a definition for a global variable (that is, a declaration with an initializer), emit assembly to allocate it in the data section. Then emit a .text directive before you go back to generating function definitions.
When you encounter a reference to a variable, handle it the same way you did before. If its entry in the variable map is a label instead of a stack index, of course, you should use it directly instead of as an offset from %ebp. If it doesn’t have an entry, that’s an error.
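If your variable map used to store plain stack offsets, one way to handle both cases is to make each entry a small tagged value. The sketch below is in C; var_location and format_operand are hypothetical names, not anything from my implementation, and it only shows how the operand text differs between the two kinds of entry.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A variable map entry is either a stack offset (locals and
 * parameters) or a label (globals). */
typedef enum { LOC_STACK, LOC_LABEL } loc_kind;

typedef struct {
    loc_kind kind;
    int stack_offset;   /* used when kind == LOC_STACK, e.g. -4 */
    const char *label;  /* used when kind == LOC_LABEL, e.g. "_my_var" */
} var_location;

/* Format the assembly operand for a reference to this variable.
 * Returns whatever snprintf returns. */
int format_operand(char *buf, size_t size, var_location loc) {
    assert(buf != NULL);
    if (loc.kind == LOC_LABEL) {
        assert(loc.label != NULL && strlen(loc.label) > 0);
        return snprintf(buf, size, "%s", loc.label);  /* use the label directly */
    }
    return snprintf(buf, size, "%d(%%ebp)", loc.stack_offset);  /* offset from %ebp */
}
```

The rest of code generation doesn’t need to care which kind of variable it’s looking at; it just drops the formatted operand into a movl.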

But there are a few wrinkles.

Uninitialized Variables

If, by the end of the program, we have any variables left that have been declared but not defined, we need to declare them in a special section for uninitialized data. On Linux, all uninitialized data lives in the BSS section, which also includes any variables initialized to 0. On macOS it’s a little more complicated: uninitialized static variables go in BSS, and uninitialized global variables go in the common section, which indicates to the linker that they may be initialized in a different object file. We don’t support static variables yet, so on macOS we don’t need to store anything in BSS. Of course, we also don’t have any tests with multiple source files, so if you just use BSS instead of common, effectively making all global variables static, the tests will still pass.

The data section consists of the actual values of our data; we can load it directly into memory and use it as-is. The BSS and common sections, on the other hand, don’t contain all of our uninitialized values, because they would just be big blocks of zeros. Storing a big block of zeros on disk would be a waste of space. Instead, we just store the size of BSS and common in our binary, and allocate that much memory for them when we load the program. So keeping initialized and uninitialized variables separate is just a trick to reduce the size of binaries.

On macOS, we can allocate space in the common section using the .comm directive:

    .text
    .comm _my_var,4,2 # allocate 4 bytes for symbol _my_var, with 4-byte alignment

Allocating space in BSS, on the other hand, looks almost exactly the same as allocating a non-zero variable, but we’ll use .zero 4 to allocate 4 bytes of zeros instead of .long n to allocate a long integer with value n:

    .globl _my_var # make this symbol visible to the linker
    .bss           # what's next describes the BSS section
    .align 4       # this data should be aligned on 4-byte intervals (Linux align directive)
_my_var:
    .zero 4        # allocate 4 bytes of zeros

Note that in assembly, unlike in C, it’s perfectly fine to reference a label like _my_var before that label is defined. That’s why we can wait until the end of the program to allocate any uninitialized variables.
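Here’s a sketch of that end-of-program step in C, using the macOS .comm approach. It assumes you’ve kept a list of every global you saw, with a flag recording whether it ever got an initializer; global_var and emit_uninitialized are names made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;  /* source-level name, without the underscore */
    int defined;       /* nonzero if we saw a declaration with an initializer */
} global_var;

/* Write a .comm directive for every global that was declared but never
 * defined. Returns how many directives were written; stops early if
 * buf runs out of space. */
size_t emit_uninitialized(char *buf, size_t size,
                          const global_var *vars, size_t count) {
    assert(buf != NULL && size > 0);
    size_t used = 0;
    size_t emitted = 0;
    buf[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        if (vars[i].defined)
            continue;  /* already allocated in the data section */
        int n = snprintf(buf + used, size - used,
                         "    .comm _%s,4,2\n", vars[i].name);
        if (n < 0 || (size_t)n >= size - used)
            break;     /* out of space */
        used += (size_t)n;
        emitted++;
    }
    return emitted;
}
```

On Linux you’d emit the .bss version shown above instead of .comm.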

Non-Constant Initializers

Global variables are loaded into memory before the program starts, which means we can’t execute any instructions to calculate their initial values. Therefore their initializers need to be constants. For example, this isn’t valid:

int foo = 5;
int bar = foo + 1; // NOT A CONSTANT!
int main() {
    return bar;
}

Most compilers permit global variables to be initialized with constant expressions, like:

int foo = 2 + 3 * 5;

This requires you to compute 2 + 3 * 5 at compile time. You can support this if you want, but you don’t have to; the test suite doesn’t check for it.
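If you do want to support it, a small recursive evaluator over the expression AST is enough. This is a simplified sketch in C with a made-up AST (expr, eval_constant); it only knows about constants, variables, addition, and multiplication, and it ignores integer overflow, which a real implementation would need to think about.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EXPR_CONST, EXPR_VAR, EXPR_ADD, EXPR_MUL } expr_kind;

typedef struct expr {
    expr_kind kind;
    int value;                 /* EXPR_CONST only */
    const struct expr *left;   /* binary operators only */
    const struct expr *right;
} expr;

/* If e is a constant expression, store its value in *out and return 1.
 * Return 0 if it refers to a variable anywhere (not a constant). */
int eval_constant(const expr *e, int *out) {
    assert(e != NULL && out != NULL);
    int l, r;
    switch (e->kind) {
    case EXPR_CONST:
        *out = e->value;
        return 1;
    case EXPR_VAR:
        return 0;
    case EXPR_ADD:
    case EXPR_MUL:
        if (!eval_constant(e->left, &l) || !eval_constant(e->right, &r))
            return 0;
        *out = (e->kind == EXPR_ADD) ? l + r : l * r;  /* overflow not handled */
        return 1;
    }
    return 0;
}
```

The same function doubles as the validation check: if it returns 0 for a global’s initializer, the initializer isn’t constant and you should reject the program.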

Validation

To recap, here’s what we need to validate:

Variables, including global variables, are declared before they are used.
No global variable is defined more than once.
No global variable is initialized with a non-constant value.
No symbol is declared as both a function and a variable.

It’s easy to validate the first bullet point during code generation; we’re doing that for local variables anyway. The remaining points can be validated either during code generation, or in a separate validation pass. I’d recommend handling them wherever you validate function definitions and calls.
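For the second and fourth points, a single table of top-level names is enough. The following C sketch is one possible shape for it, not the way I implemented it; symbol_table, record_toplevel, and the result codes are all hypothetical names, and the fixed-size array is just to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 64

typedef enum { SYM_FUNCTION, SYM_VARIABLE } sym_kind;
typedef enum {
    CHECK_OK, CHECK_KIND_CONFLICT, CHECK_REDEFINED, CHECK_TABLE_FULL
} check_result;

typedef struct {
    const char *name;
    sym_kind kind;
    int defined;  /* nonzero once we've seen a definition */
} symbol;

typedef struct {
    symbol entries[MAX_SYMBOLS];
    size_t count;
} symbol_table;

/* Record one top-level declaration or definition, and report whether it
 * conflicts with anything we've already seen under the same name. */
check_result record_toplevel(symbol_table *t, const char *name,
                             sym_kind kind, int is_definition) {
    assert(t != NULL && name != NULL);
    for (size_t i = 0; i < t->count; i++) {
        symbol *s = &t->entries[i];
        if (strcmp(s->name, name) != 0)
            continue;
        if (s->kind != kind)
            return CHECK_KIND_CONFLICT;   /* function vs. variable */
        if (s->defined && is_definition)
            return CHECK_REDEFINED;       /* second definition */
        if (is_definition)
            s->defined = 1;
        return CHECK_OK;                  /* repeated declaration is fine */
    }
    if (t->count == MAX_SYMBOLS)
        return CHECK_TABLE_FULL;
    t->entries[t->count++] = (symbol){ name, kind, is_definition != 0 };
    return CHECK_OK;
}
```

Any variable entries still marked undefined at the end of the program are the ones that need space in BSS or the common section.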

☑ Task:

Update the code generation pass (and your validation pass, if you have one) to fail with an error for all invalid stage 10 examples, and succeed on all valid stage 10 examples.

PIE 🥧

If you compile a program with global variables using a real compiler, the assembly will look quite different from what we described above. You may also notice, if you’re on macOS, that the linker will warn you about the assembly your compiler produces:

$ ./my_compiler global.c
ld: warning: The i386 architecture is deprecated for macOS (remove from the Xcode build setting: ARCHS)
ld: warning: PIE disabled. Absolute addressing (perhaps -mdynamic-no-pic) not allowed in code signed PIE, but used in _main from /var/folders/9t/p20tf0zs4ql425tdktwnfjkm0000gn/T//cczcZcyQ.o. To fix this warning, don't compile with -mdynamic-no-pic or link with -Wl,-no_pie

PIE stands for “position-independent executable”, which means an executable consisting entirely of position-independent code. This section briefly explains what position-independent code is and why you might need it, but doesn’t explain how to implement it. Feel free to skip it if you’re not interested.

Position-independent code is code that can run no matter where it’s loaded in memory, because it never refers to absolute memory addresses. The code our compiler produces is not position-independent, because it has instructions like:

    movl $3, _my_var

In order for this instruction to run, the linker needs to replace _my_var with an absolute memory address. This works if we know the absolute address of the data and BSS sections in advance.

Position-independent code, on the other hand, never refers to the address of symbols like _my_var directly; instead, those addresses are calculated relative to the current instruction pointer. In case I didn’t have enough of a reason to regret targeting a 32-bit architecture, position-independent assembly is much simpler with a 64-bit instruction set:

    movl $3, _my_var(%rip) # use _my_var as offset from instruction pointer

To get the same result with a 32-bit architecture you need something like this:

    call    ___x86.get_pc_thunk.ax
L1$pb:
    leal    _my_var-L1$pb(%eax), %eax
    movl    $3, (%eax)

I won’t walk through exactly what this code is doing; if you’re curious, this article gives a good overview of position-independent code for x86.

There are two reasons you might want to generate position-independent code:

You’re compiling a shared library. Maybe this is a really widely used library, like libc. Maybe all or most processes on a system will want a copy of this library. It seems like a waste to have a separate copy for every process, eating up all your RAM. Instead, we can load the library into physical memory just once, then map it into the virtual memory of every process that needs it. But we can’t guarantee a library the same starting address in every process that loads it. So sharing one library between several processes only works if the library works no matter what memory address it’s at—which is to say, it needs to be position-independent. However, we’re compiling an executable, not a library, so this doesn’t apply to us.
You have address space layout randomization (ASLR) enabled. ASLR is a security feature that makes some memory corruption attacks harder to carry out. Many of these attacks involve forcing program execution to jump to the instructions an attacker would like to execute. With ASLR enabled, memory segments are loaded at random locations⁵, which makes it harder for attackers to figure out what address to jump to. Code needs to be position independent in order to run correctly when loaded to a random memory address. Since Apple really wants all macOS applications to support ASLR⁶, the linker will try to build a position-independent executable by default, and complain if it can’t.

The fact that your compiler can’t generate position-independent executables is just one of many, many reasons you shouldn’t use it to build real software. I don’t have that much faith in these blog posts, and neither should you!

If you want to learn more about ASLR, I found these slides helpful. Of course, there’s also Wikipedia.

Up Next

So far, I’ve been implementing a compiler and writing posts as I go. This system worked really well for a while, but now it’s starting to work less well; I realized that some decisions I made in earlier stages made this stage harder to complete, so I had to go back and change them. I think I’m likely to run into more problems like that in later posts. So I’m going to take a break, finish building the compiler (whatever I decide “finished” means), and then come back and write the rest of this series. I probably won’t post another update for six months. So basically…I’m going to keep posting at about the same rate I have been.

When I come back, I’ll have a plan for what to cover in the rest of the series. See you then!

If you have any questions, corrections, or other feedback, you can email me or open an issue.

¹ The compiler that ships with the Xcode Command Line Tools—the one that was giving me this error—is actually not GCC. It’s Clang, another open-source compiler that’s developed mostly by Apple. Xcode installs Clang at /usr/bin/gcc, no doubt for very sound and legitimate reasons, although I don’t know what they are. ↩

² The standard actually considers this a tentative definition (section 6.9.2):

A declaration of an identifier for an object that has file scope without an initializer, and without a storage-class specifier or with the storage-class specifier static, constitutes a tentative definition.

Basically, if we can’t find a real definition anywhere else in the file, we can treat a declaration like a definition with an initial value of 0. We’re still going to call it a declaration, though. ↩

³ Typical computer data memory arrangement by Majenko is licensed under CC BY-SA 4.0.

This diagram is an oversimplification; it doesn’t show every memory segment we might find in a running program. Also, sometimes memory segments are laid out in a different order—we’ll talk about that later. The point is that we have a dedicated chunk of memory for global variables.↩

⁴ BSS stands for “Block Started by Symbol,” which is a relic of an assembler written in the 1950s(!). You can read more here if you want to go down a bit of a Wikipedia rabbit hole.↩

⁵ Exactly which memory segments are randomized, and how random their base addresses actually are, varies between systems. ↩

⁶ Source. ↩

C Compiler, Part 9: Functions

2018-06-27T20:00:00+00:00

This is the ninth post in a series. Read part 1 here.

In this post we’re adding function calls! This is a particularly exciting post because we get to talk about calling conventions and stack frames and some weird corners of the C11 standard. Plus, by the end of this post we’ll be able to compile “Hello, World!” 🎉

As usual, accompanying tests are here.

Part 9: Functions

Of course, our compiler can already handle function definitions, because we can already define main. But in this post, we’ll add support for function calls:

int three() {
    return 3;
}

int main() {
    return three();
}

We’ll also add support for function parameters:

int sum(int a, int b) {
    return a + b;
}

int main() {
    return sum(1, 1);
}

And for forward declarations:

int sum(int a, int b);

int main() {
    return sum(1, 1);
}

int sum(int a, int b) {
    return a + b;
}

Terminology

A function declaration specifies a function’s name, return type, and optionally its parameter list:
```
  int foo();
```
A function prototype is a special type of function declaration that includes parameter type information:
```
  int foo(int a);
```
Function prototypes are the only function declarations we’ll support, even in places where the C11 standard allows non-prototype declarations.
A function definition is a declaration plus a function body:
```
  int foo(int a) {
      return a + 1;
  }
```
Note that you can declare a function as many times as you like, but you can only define it once¹. Also note that whenever we say “all function declarations,” that includes function declarations that are part of function definitions.
A forward declaration is a function declaration without a function body. It tells the compiler you’re going to define the function later, possibly in a different file, and lets you use a function before it’s defined.
```
  int foo(int a);
```
You can also declare a function that has already been defined. This is legal but technically not a forward declaration…I guess it’s a backwards declaration? It would also be pretty pointless:
```
  int foo() {
      return 4;
  }

  int foo();
```
A function’s arguments are the values passed to a function call. A function’s parameters are the variables defined in the function declaration. In this code snippet, a is a parameter and 3 is an argument:
```
  int foo(int a) {
      return a + 1;
  }

  int main() {
      return foo(3);
  }
```

Limitations

For now, we’ll only support functions with return type int and parameters with type int.
We won’t support function declarations with missing parameters or type information; in other words, we’ll require all function declarations to be function prototypes, whether or not they’re part of function definitions.
We’ll interpret an empty parameter list (e.g. in the declaration int foo()) to mean that the function has no parameters. This deviates from the C11 standard; according to the standard, int foo(void) is a function prototype indicating foo has no parameters, and int foo() is a declaration where the parameters aren’t specified (i.e. not a function prototype).
We won’t support function definitions using identifier-list form, which looks like this:
```
  int foo(a)
  int a;
  {
      return a * 2;
  }
```
We’ll require parameter names in function declarations. For example, we won’t support this:
```
  int foo(int, int);
```
We won’t support storage class specifiers (e.g. extern, static), type qualifiers (e.g. const, _Atomic), function specifiers (inline, _Noreturn) or alignment specifiers (_Alignas).

Lexing

Nothing fancy here; we just need to add commas to separate the function arguments. Here’s the full list of tokens so far:

{
}
(
)
;
int
return
Identifier [a-zA-Z]\w*
Integer literal [0-9]+
-
~
!
+
*
/
&&
||
==
!=
<
<=
>
>=
=
if
else
:
?
for
while
do
break
continue
,

☑ Task:

Add support for commas to the lexer.

Parsing

We’ll deal with function definitions first, then function calls.

Function Definitions

In our old definition, a function just had a name and a body:

function_declaration = Function(string, block_item list) //string is the function name

Now we need to add a list of parameters. We also need to support declarations that don’t include a function body. I defined a single function_declaration AST rule, with an optional function body, to represent both declarations and definitions:

function_declaration = Function(string, // function name
                                string list, // parameters
                                block_item list option) // body

But you could also have different rules for function declarations and definitions if you wanted.

Note that we don’t include the function’s return type or parameter types, because right now int is the only type. We’ll need to expand this definition when we add other types.

We also need to update the grammar. Here was the old <function> grammar rule:

<function> ::= "int" <id> "(" ")" "{" { <block-item> } "}"

And here’s the new one. Note that the function declaration ends with either a function body (if it’s a definition) or a semicolon (if it’s not).

<function> ::= "int" <id> "(" [ "int" <id> { "," "int" <id> } ] ")" ( "{" { <block-item> } "}" | ";" )
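The parameter list is the only really new piece of this rule. Here’s a sketch in C of how you might parse it with a hand-written recursive descent parser. The token kinds and the parse_param_list name are invented for this example, and it only counts parameters instead of building AST nodes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds; a real lexer has many more. */
typedef enum {
    TOK_INT, TOK_ID, TOK_LPAREN, TOK_RPAREN, TOK_COMMA, TOK_OTHER
} token_kind;

/* Parse  "(" [ "int" <id> { "," "int" <id> } ] ")"  starting at
 * tokens[*pos]. On success, advance *pos past the ")" and return the
 * number of parameters. On a syntax error, return -1 and leave *pos
 * unchanged. */
int parse_param_list(const token_kind *tokens, size_t count, size_t *pos) {
    assert(tokens != NULL && pos != NULL);
    size_t i = *pos;
    int params = 0;
    if (i >= count || tokens[i] != TOK_LPAREN)
        return -1;
    i++;
    if (i < count && tokens[i] == TOK_INT) {
        for (;;) {
            /* every parameter must be "int" followed by a name */
            if (i + 1 >= count || tokens[i] != TOK_INT || tokens[i + 1] != TOK_ID)
                return -1;
            i += 2;
            params++;
            if (i < count && tokens[i] == TOK_COMMA) {
                i++;       /* a comma means another parameter must follow */
                continue;
            }
            break;
        }
    }
    if (i >= count || tokens[i] != TOK_RPAREN)
        return -1;
    *pos = i + 1;
    return params;
}
```

After this returns, the next token tells you whether you’re looking at a definition ("{") or a declaration (";").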

Function Calls

A function call is an expression that looks like this:

foo(arg1, arg2)

It has an ID (the function name) and a list of arguments. Its arguments can be arbitrary expressions:

foo(arg1 + 2, bar())

So we can update the AST definition for expressions like this:

exp = ...
    | FunCall(string, exp list) // string is the function name
    ...

We also need to update the grammar. Function calls have the highest possible precedence level, right up there with postfix unary operators. So we’ll add them to the <factor> rule in the grammar:

<factor> ::= <function-call> | "(" <exp> ")" | <unary_op> <factor> | <int> | <id>
<function-call> ::= <id> "(" [ <exp> { "," <exp> } ] ")"

Top Level

In our old definition, a program consisted of a single function definition. Now it needs to permit multiple function declarations:

program = Program(function_declaration list)

<program> ::= { <function> }

☑ Task:

Update parsing to succeed on all valid stage 1-9 examples. You may or may not want to handle invalid examples here: see the next section on validation.

Validation

We need to validate that the function declarations and calls in our program are legal. You can either handle these checks during code generation, or add a new validation pass between parsing and code generation. Edited to add: I previously recommended performing validation during the parsing stage. This turns out to be a bad idea, because this will become increasingly cumbersome as we need to validate more things in future posts.

Your compiler must fail if:

The program includes two definitions of the same function name.

  int foo(){
      return 3;
  }

  int foo(int a){
      return a + 1;
  }

Two declarations of a function have different numbers of parameters. Different parameter names are okay, though.

This is illegal²:

  int foo(int a, int b);

  int foo(int a){
      return a + 1;
  }

But this is okay:

  int foo(int a);

  int foo(int b){
      return b + 1;
  }

A function is called with the wrong number of arguments, e.g.

  int foo(int a){
      return a + 1;
  }

  int main() {
      return foo(3, 4);
  }

Optionally, you may want to fail if a function is called before it’s declared. Note that it’s totally legal to call a function that has been declared but not defined. It’s also legal to declare a function and never define it; however, linking will fail if the function isn’t defined in some other library the linker can find³.

So this is illegal:
```
  int main() {
      return putchar(65);
  }

  int foo(){
      return 3;
  }
```
But this is legal:
```
  int putchar(int c);

  int main() {
      putchar(65);
  }
```
This last point is optional because neither GCC nor clang enforces it — they both warn but don’t fail on the illegal example above. Calling a function before it’s declared is called “implicit function declaration” and it was legal before C99, so I guess enforcing this rule would have broken a lot of older code. The test suite doesn’t include any implicit function declarations, so you can handle it however you like and you can still pass all the tests.

☑ Task:

Update your compiler to fail on invalid stage 1-9 examples. You can handle this during code generation, or a new stage between parsing and code generation. Bonus points for useful error messages.

To handle this, you’ll probably want to traverse the tree and maintain a map to track the number of parameters each function takes, and whether that function has been defined yet.
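Here’s a rough sketch of such a map in C. Everything in it (fun_table, check_declaration, check_call, the result codes) is a hypothetical name for this example, and it uses a small fixed-size array where you’d probably use a real hash map. It treats calls to undeclared functions as errors, which is the optional check described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FUNCTIONS 64

typedef struct {
    const char *name;
    int param_count;
    int defined;  /* nonzero once we've seen a body */
} fun_info;

typedef struct {
    fun_info entries[MAX_FUNCTIONS];
    size_t count;
} fun_table;

typedef enum {
    FUN_OK, FUN_REDEFINED, FUN_PARAM_MISMATCH,
    FUN_ARG_MISMATCH, FUN_UNDECLARED, FUN_TABLE_FULL
} fun_check;

fun_info *find_fun(fun_table *t, const char *name) {
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->entries[i].name, name) == 0)
            return &t->entries[i];
    return NULL;
}

/* Call this for every function declaration, with or without a body. */
fun_check check_declaration(fun_table *t, const char *name,
                            int param_count, int has_body) {
    assert(t != NULL && name != NULL);
    fun_info *f = find_fun(t, name);
    if (f == NULL) {
        if (t->count == MAX_FUNCTIONS)
            return FUN_TABLE_FULL;
        t->entries[t->count++] = (fun_info){ name, param_count, has_body != 0 };
        return FUN_OK;
    }
    if (f->defined && has_body)
        return FUN_REDEFINED;
    if (f->param_count != param_count)
        return FUN_PARAM_MISMATCH;
    if (has_body)
        f->defined = 1;
    return FUN_OK;
}

/* Call this for every function call expression. */
fun_check check_call(fun_table *t, const char *name, int arg_count) {
    assert(t != NULL && name != NULL);
    fun_info *f = find_fun(t, name);
    if (f == NULL)
        return FUN_UNDECLARED;
    return (f->param_count == arg_count) ? FUN_OK : FUN_ARG_MISMATCH;
}
```

If you’d rather allow implicit function declarations, like GCC and clang do, you can treat FUN_UNDECLARED as a warning instead of an error.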

Code Generation

Once again, we’ll handle function definitions first, then function calls. But before we do any of that, let’s discuss…

Calling Conventions

In most of the examples above, we defined a function and then called it in the same file. But we also want to call functions from shared libraries; we particularly want to call the standard library, so we can access I/O functions, so we can write “Hello, World”. When you use a shared library, you generally don’t recompile it yourself; you link to a precompiled binary. We definitely don’t want to recompile the whole standard library! That means we need to generate machine code that can interact with object files built by other compilers. In earlier posts, I’ve often said “this isn’t how a real compiler would do this thing, but it works.” In this post, we have to do things the same way as everyone else or we can’t use prebuilt libraries.

In other words, we need to follow the appropriate calling convention. A calling convention answers questions like:

How are arguments passed to the callee? Are they passed in registers or on the stack?
Is the caller or callee responsible for removing arguments from the stack after the callee has executed?
How are return values passed back to the caller?
Which registers are caller-saved and which are callee-saved⁴?

C programs on 32-bit OS X, Linux, and other Unix-like systems use the cdecl calling convention⁵, which means:

Arguments are passed on the stack. They’re pushed on the stack from right to left (so the first function argument is at the lowest address).
The caller cleans the arguments from the stack.
Return values are passed in the EAX register. (The full answer is more complicated, but this is good enough as long as we can only return integers.)
The EAX, ECX, and EDX registers are caller-saved, and all others are callee-saved. We’ll see in the next section that the callee has to restore EBP and ESP before it returns, and restores EIP with the ret instruction. Normally it would also need to restore ESI, EDI, and EBX, but we don’t actually use these registers. And we already push values from EAX, ECX, and EDX onto the stack right away if we’re going to need them later. So basically, we don’t have to worry about saving and restoring registers at all.

There are two important differences between OS X and Linux:

Stack alignment. On OS X, the stack needs to be 16-byte aligned at the beginning of a function call (i.e. when the call instruction is issued)⁶. This isn’t required on Linux, but GCC still keeps the stack 16-byte aligned⁷.
Name decoration. On OS X, function names in assembly are prepended with an underscore (e.g. main becomes _main). On systems that use the ELF file format (Linux and most other *nix systems), there’s no underscore. This isn’t part of the calling convention per se but it is important.

We’ll need to be really comfortable with all this to implement it ourselves, so let’s look at…

cdecl Function Calls in Excruciating Detail

foo(1, 2, 3);

What, exactly, happens when your computer executes this line of code? We touched on this in part 5, but now we’ll dig into it a lot more. We won’t worry about keeping the stack 16-byte aligned for now.

We’ll say that foo is being called from another function, bar. The line of C above will get turned into this assembly:

push $3
push $2
push $1
call _foo
add $0xc, %esp

First, let’s look at the state of the world before we start calling foo⁸:

One chunk of memory contains the stack frame, which we’re already familiar with. The EBP and ESP registers point to the bottom and top of the stack frame, respectively, so the processor can figure out where the stack is.

Another chunk of memory, which we haven’t talked about yet, contains the CPU instructions being executed. The EIP register contains the memory address of the current instruction. To advance to the next instruction, the CPU just increments EIP⁹. The call instruction, and all the jump instructions we’ve already encountered, work by manipulating EIP. In these diagrams I’ll show EIP pointing to the instruction we’re about to execute.

When bar wants to call foo, the first step is putting the function arguments on the stack where foo can find them¹⁰. They’re pushed onto the stack in reverse order¹¹:

push $3
push $2
push $1

Which means the world now looks like this:

Next bar issues the call instruction, which does two things:

Push the address of the instruction after call (the “return address”) onto the stack.
Jump to _foo (by moving the address of _foo into EIP).

Now the world looks like this:

Okay, we’re officially in foo now. Next step is the function prologue to set up a new stack frame:

push %ebp
mov %esp, %ebp

Now we can execute the body of foo. We can access its parameters because they’re at a predictable location on the stack relative to EBP: %ebp + 0x8, %ebp + 0xc, and %ebp + 0x10, respectively.

Once we’ve done some things in foo, and placed a return value in EAX, it’s time to return to bar. Except for that return value, we want everything on the stack to be exactly the same as it was before the call. The first step is to run the function epilogue to restore the old stack frame:

mov %ebp, %esp ; deallocate any local variables on the stack
pop %ebp        ; restore old EBP

The stack now looks exactly the same as it did right after the call instruction, before the function prologue. That means the return address is on top of the stack again.

Then we execute the ret instruction, which pops the top value off the stack and jumps to it unconditionally (i.e. copies it into EIP).

Now we just have to remove the function arguments from the stack, and we’re done. No need to pop them off one by one; we can just adjust the value of ESP.

add $0xc, %esp

Now the stack has been restored to exactly the way it was before the call, and we can proceed with the rest of bar.
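
If you'd like to check those offsets without reading a debugger dump, here's a small Python simulation of the same sequence. It's a sketch, not real machine behavior: the starting addresses are made up, the "return address" is just a placeholder string, and `simulate_call` is a hypothetical helper, not part of the compiler.

```python
def simulate_call(args, esp=0x1000, ebp=0x1010):
    """Model a cdecl call on a toy stack and return what the callee sees.

    The stack is a dict from address to value; the default esp/ebp values
    are arbitrary made-up addresses.
    """
    stack = {}

    def push(value):
        nonlocal esp
        esp -= 4            # the stack grows toward lower addresses
        stack[esp] = value

    def pop():
        nonlocal esp
        value = stack[esp]
        esp += 4
        return value

    start_esp, start_ebp = esp, ebp

    # Caller (bar): push arguments right to left, then `call _foo`.
    for arg in reversed(args):
        push(arg)
    push("return address")  # placeholder for what `call` pushes

    # Callee (foo) prologue: push %ebp; mov %esp, %ebp
    push(ebp)
    ebp = esp

    # Parameters sit at EBP + 8, EBP + 12, EBP + 16, ...
    params = [stack[ebp + 8 + 4 * i] for i in range(len(args))]

    # Callee epilogue: mov %ebp, %esp; pop %ebp; ret
    esp = ebp
    ebp = pop()
    pop()                   # ret pops the return address

    # Caller cleans up the arguments: add $(4 * n), %esp
    esp += 4 * len(args)
    assert (esp, ebp) == (start_esp, start_ebp)
    return params

params = simulate_call([1, 2, 3])
```

Here `params` ends up as `[1, 2, 3]`: the first argument is the one at EBP + 8, and both registers are back where they started once the caller has cleaned up.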

And now we’re finally ready to implement the code-generation stage of the compiler!

Function Definitions

As with main, we want to make each function global (so it can be called from other files) and label it:

    .globl _fun
_fun:

Make sure to include the leading underscore before the function name if you’re on OS X, and not otherwise.

We already know how to generate the function prologue and epilogue, because they’re exactly the same as in main. We just need to add all the function parameters to var_map and current_scope. As we saw above, the first parameter will be at EBP + 8, and each subsequent parameter will be four bytes higher than the last:

param_offset = 8 // first parameter is at EBP + 8
for each function parameter:
    var_map.put(parameter, param_offset)
    current_scope.add(parameter)
    param_offset += 4

Then parameters get handled like any other variable in the function body.

Function Prototypes

We don’t generate any assembly for function prototypes that aren’t part of definitions.

Function Calls

As we saw above, the caller needs to:

Put the arguments on the stack, in reverse order¹²:

 for each arg in reversed(function_call.arguments):
     generate_exp(arg) // puts arg in eax
     emit 'pushl %eax'

Issue the call instruction.

     emit 'call _{}'.format(function_name)

Remove the arguments from the stack after the callee returns.

     bytes_to_remove = 4 * number of function arguments
     emit 'addl ${}, %esp'.format(bytes_to_remove)
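
Putting the three steps together, a sketch in Python might look like the following. The function `emit_call` and its arguments are assumptions made for this example: each argument is represented by a list of already-generated assembly lines that leave its value in EAX, and the `underscore` flag stands in for the OS X vs. Linux naming difference. Stack alignment is ignored here.

```python
def emit_call(function_name, arg_code, underscore=True):
    """Return the assembly lines for one function call.

    arg_code is a list with one entry per argument, in source order; each
    entry is a list of assembly lines that leaves the argument in EAX.
    """
    lines = []
    # 1. Evaluate and push the arguments in reverse order.
    for code in reversed(arg_code):
        lines.extend(code)
        lines.append("pushl %eax")
    # 2. Issue the call (leading underscore on OS X only).
    prefix = "_" if underscore else ""
    lines.append(f"call {prefix}{function_name}")
    # 3. Remove the arguments: 4 bytes per argument.
    lines.append(f"addl ${4 * len(arg_code)}, %esp")
    return lines

asm = emit_call("foo", [["movl $1, %eax"], ["movl $2, %eax"], ["movl $3, %eax"]])
print("\n".join(asm))
```

For `foo(1, 2, 3)` this emits the three pushes in the order 3, 2, 1, then `call _foo`, then `addl $12, %esp`, which matches the hand-written assembly from the walkthrough above.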

Stack Alignment

On OS X, the stack needs to be 16-byte aligned when the call instruction is issued. A normal C compiler would know exactly how much padding to add to maintain that alignment. But because we push intermediate results of expressions onto the stack, and function calls can occur within larger expressions, we have no idea where the stack pointer is when we encounter a function call. My solution was to emit assembly just before each function call that calculates how much padding is needed, subtracts from ESP accordingly, and then pushes the result of the padding calculation onto the stack, all before putting the function arguments on the stack. After the function returns, the caller first removes the arguments, then pops off the result of the padding calculation, and finally adds that value to ESP to restore it to its original state.

Here’s the assembly to do that:

    movl %esp, %eax
    subl $n, %eax    ; n = (4*(arg_count + 1)), # of bytes allocated for arguments + padding value itself
                     ; eax now contains the value ESP will have when call instruction is executed
    xorl %edx, %edx  ; zero out EDX, which will contain remainder of division
    movl $0x10, %ecx ; 0x10 = 16
    idivl %ecx       ; calculate eax / 16. EDX contains remainder, i.e. # of bytes to subtract from ESP
    subl %edx, %esp  ; pad ESP
    pushl %edx       ; push padding result onto stack; we'll need it to deallocate padding later
    ; ...push arguments, call function, remove arguments...
    popl %edx        ; pop padding result
    addl %edx, %esp  ; remove padding

This solution is kind of hideous, so let me know if you come up with a better one.
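
To see why the arithmetic works, here's the same calculation modeled in Python. This is only a sanity check on the math, with made-up function names (`esp_at_call`, `restore_esp`) and ESP treated as a plain integer; it isn't something the compiler itself needs.

```python
def esp_at_call(esp, arg_count):
    """Return (ESP when `call` executes, padding) for the sequence above."""
    n = 4 * (arg_count + 1)   # bytes for the arguments plus the saved padding value
    eax = esp - n             # where ESP would end up with no padding
    padding = eax % 16        # the remainder that idivl leaves in EDX
    esp -= padding            # subl %edx, %esp
    esp -= 4                  # pushl %edx
    esp -= 4 * arg_count      # push the arguments
    return esp, padding

def restore_esp(esp, arg_count, padding):
    """Undo everything after the callee returns."""
    esp += 4 * arg_count      # remove the arguments
    esp += 4                  # popl %edx
    esp += padding            # addl %edx, %esp
    return esp

# Whatever 4-byte-aligned value ESP starts at, it is 16-byte aligned at the call.
for start in range(0x1000, 0x1040, 4):
    for arg_count in range(6):
        at_call, padding = esp_at_call(start, arg_count)
        assert at_call % 16 == 0
        assert restore_esp(at_call, arg_count, padding) == start
```

The key observation is that ESP at the call equals `(esp - n) - ((esp - n) % 16)`, which is a multiple of 16 by construction.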

Top Level

Obviously, you need to generate assembly for every function definition, not just one.

☑ Task:

Update your compiler to handle all stage 9 examples. Make sure it produces the right return code and, for the “hello world” test case, the right output to stdout.

Fibonacci & Hello, World!

Now we can calculate Fibonacci numbers:

int fib(int n) {
    if (n == 0 || n == 1) {
        return n;
    } else {
        return fib(n - 1) + fib(n - 2);
    }
}

int main() {
    int n = 10;
    return fib(n);
}

We can also make calls to the standard library! Since we only know about ints, we can only call standard library functions where the parameters are all ints and the return value is also an int. Lucky for us, putchar is just such a function. For example, since the ASCII value of ‘A’ is 65, we could print ‘A’ to standard out like this:

int putchar(int c);

int main() {
    putchar(65);
}

And we can print out ‘Hello, World!’ like this:

int putchar(int c);

int main() {
    putchar(72);
    putchar(101);
    putchar(108);
    putchar(108);
    putchar(111);
    putchar(44);
    putchar(32);
    putchar(87);
    putchar(111);
    putchar(114);
    putchar(108);
    putchar(100);
    putchar(33);
    putchar(10);
}

Up Next

My next post or two won’t be about compilers. After that I’ll get back to this series, but I haven’t decided what to implement next. Maybe pointers? We’ll see!

Update: just kidding, the next post is about compilers after all, and covers global variables.

If you have any questions, corrections, or other feedback, you can email me or open an issue.

¹ Technically, you can redefine a function in the same program but not in the same translation unit. A translation unit is a source file plus everything that gets pulled in during preprocessing from #include directives. (Source: C11 standard, section 5.1.1.1)

So it’s legal to redefine a function from a linked library. But linking happens after the compiler runs, so for our purposes the rule is that each function can only be defined once.↩

² However, this is legal according to C11:

int foo();

int foo(int a){
    return a + 1;
}

That’s because int foo(); doesn’t mean “declare a function foo with no parameters”; it means “declare a function foo, but we don’t know anything about its parameters.” But our compiler diverges from the standard in this respect; it assumes that int foo(); means “declare foo with no parameters,” so it will fail here. ↩

³ What the linker does and where it looks for function definitions is way beyond the scope of this blog post; if you want to learn more you might like the Beginner’s Guide to Linkers or this series on linkers. ↩

⁴ If a register is caller-saved, that means the callee is allowed to overwrite it. So if the caller wants to access the value in that register after the callee returns, it needs to push that value onto the stack, then pop it back into the register after the function call has completed.

If a register is callee-saved, the caller can assume that the register will be unchanged after the function call finishes. So if the callee wants to use that register, it has to save the register’s contents to the stack and restore those contents before returning control to the caller. ↩

⁵ Windows is a lot more complicated; sometimes it uses cdecl, sometimes it uses different calling conventions. A lot of Linux/OS X documentation doesn’t even call it cdecl, presumably because it’s the only calling convention in *nix-world. ↩

⁶ Source: OS X ABI Function Call Guide. It’s not 100% clear why OS X imposes this requirement but it probably has something to do with making SSE instructions run faster. ↩

⁷ See the GCC documentation on -mpreferred-stack-boundary. ↩

⁸ Note that these are not valid memory addresses; at least on Linux, the lowest memory address in use is 0x08048000. (See here and here). I think this is also true on OS X but I haven’t checked. ↩

⁹ It’s actually a little more complicated than this; instructions are variable-width, so you can’t increment EIP by the same amount for every instruction. ↩

¹⁰ Actually, the first step is pushing some caller-saved registers onto the stack. But, like I mentioned earlier, the janky way we’re managing registers means we can ignore this. ↩

¹¹ Pushing arguments onto the stack in reverse order makes it easier to handle functions with a variable number of arguments; the callee knows the location of the first argument even if it doesn’t know how many arguments there are. ↩

¹² This means we’ll also evaluate the arguments in reverse order. This is valid; function arguments may be evaluated in any order. (Source: C11 standard section 6.5.2.2, paragraph 10.) ↩

C Compiler, Part 8: Loops

2018-04-10T19:00:00+00:00

This is the eighth post in a series. Read part 1 here.

In this post we’re going to add loops! Now we’ll finally be able to compile FizzBuzz…except we won’t, because we can’t call printf yet. Still, it’s progress!

If you’ve been following along, note that there was a mistake in the last post. Make sure you read the “Deallocating Variables” section and update your compiler to pass the new stage 7 tests before you start on stage 8.

As usual, accompanying tests are here.

Part 8: Loops

In this post we’re implementing what the C11 standard calls iteration statements; if you want to refer to the standard itself, they’re in section 6.8.5. There are a few different iteration statements:

`for` loops

First, some terminology. I’m going to call the three parts of a for loop header the initial clause, controlling expression, and post-expression, as in:

for (int i = 0; // initial clause
     i < 10;    // controlling expression
     i = i + 1  // post-expression
     ) {
        // do something
}

for loops come in two flavors: one where the initial clause is a variable declaration, and one where it’s just an expression.

Flavor #1:

for (int i = 0; i < 10; i = i + 1) {
    // do something
}

Flavor #2:

int i;
for (i = 0; i < 10; i = i + 1) {
    //do something
}

One interesting thing about for loops is that any of the expressions in the loop header can be empty:

for (;;) {
    //do something
}

But if the controlling expression is empty, the compiler needs to replace it with a constant nonzero expression¹. So the example above is equivalent to:

for (;1;) {
    //do something
}

`while` and `do` Loops

There’s not a whole lot to say about these.

while (i < 10) {
    i  = i + 1;
}

do {
    i = i + 1;
} while (i < 10); // <- the semicolon is required!

`break` and `continue`

break and continue aren’t loops, but they always appear inside loops, so it makes sense to add them now². The C11 standard calls them “jump statements” and defines them in section 6.8.6.

A break statement inside a loop causes execution to jump to the end of the loop:

while (1) {
    break; // go to end of loop
}
// break statement will jump here

A continue statement causes execution to jump to the end of the loop body – immediately before the post expression in a for loop.

for (int i = 0; i < 10; i = i + 1) {
    if (i % 2)
        continue;
    // do something

    //continue statement will jump here
}

In the example above, the loop will execute ten times, but only “do something” for even values of i.

Null statements

Sort of like you can have null expressions in a for loop, you can also have null statements³:

int a = 0;
; // does nothing
return a;

Null statements don’t really have anything to do with loops, but they share a common feature with the expressions in a for loop: they’re both defined in terms of optional expressions in the standard. Since we need to support optional expressions in for loops, it’s pretty easy to add support for null statements too.

As usual, we’ll update the lexing, parsing, and code generation passes, in order.

Lexing

We’re adding five (!) keywords in this post: for, do, while, break, and continue. Here’s all our tokens so far:

{
}
(
)
;
int
return
Identifier [a-zA-Z]\w*
Integer literal [0-9]+
-
~
!
+
*
/
&&
||
==
!=
<
<=
>
>=
=
if
else
:
?
for
while
do
break
continue

☑ Task:

You know the drill here.

Parsing

We’re adding six kinds of statements: do loops, while loops, the two different kinds of for loop, break and continue. We’re also changing the Exp statement; its argument is now optional, so we can use it to represent null statements. Now we can construct a null statement in the AST like this:

null_exp = Exp(None)

The initial expression and post-expression in a for loop are also optional.

Here’s the updated definition of statements in the AST (Exp now takes an optional argument, and everything from For onward is new):

statement = Return(exp) 
          | Exp(exp option)
          | Conditional(exp, statement, statement option) // exp is controlling condition
                                                          // first statement is 'if' block
                                                          // second statement is optional 'else' block
          | Compound(block_item list)
          | For(exp option, exp, exp option, statement) // initial expression, condition, post-expression, body
          | ForDecl(declaration, exp, exp option, statement) // initial declaration, condition, post-expression, body
          | While(exp, statement) // condition, body
          | Do(statement, exp) // body, condition
          | Break
          | Continue

Note that our AST lets break and continue statements appear outside of loops, even though that’s illegal; we’ll catch that error during code generation, not parsing.

The trickiest part of the grammar here is dealing with optional expressions. I dealt with this by defining an <exp-option> symbol:

<exp-option> ::= <exp> | ""

Once we’ve added that, updating the grammar for statements is pretty easy:

<statement> ::= "return" <exp> ";"
              | <exp-option> ";"
              | "if" "(" <exp> ")" <statement> [ "else" <statement> ]
              | "{" { <block-item> } "}"
              | "for" "(" <exp-option> ";" <exp-option> ";" <exp-option> ")" <statement>
              | "for" "(" <declaration> <exp-option> ";" <exp-option> ")" <statement>
              | "while" "(" <exp> ")" <statement>
              | "do" <statement> "while" "(" <exp> ")" ";"
              | "break" ";"
              | "continue" ";"

If you’re wondering why there’s a semicolon after the initial <exp-option> in the first for rule, but not after the initial <declaration> in the second one, it’s because the rule for <declaration> also includes a semicolon.

Parsing <exp-option> isn’t entirely straightforward, because the empty string is not actually a token. I dealt with this by looking ahead to see if the next token was a close paren (after a post-expression) or a semicolon (after a null statement, initial expression, or controlling condition). If it was, the expression was empty; if not, not. I think this approach violates some formalisms about context-free grammars and LL parsers: in order to parse an <exp-option> symbol, you may have to look at a token that comes after that symbol. This isn’t actually a problem, but if it bothers you, you can refactor the grammar to avoid it:

<exp-option-semicolon> ::= <exp> ";" | ";"
<exp-option-close-paren> ::= <exp> ")" | ")"
<statement> ::= ...
                | <exp-option-semicolon> // null statement
                | "for" "(" <declaration> <exp-option-semicolon> <exp-option-close-paren> <statement>
                ...

Note that there’s a discrepancy here between the grammar and the AST definition; the grammar allows controlling expressions in for loops to be empty, but the AST doesn’t. That’s because, as I mentioned earlier, an empty controlling expression needs to be replaced with a nonzero constant. So our approach to parsing controlling expressions in for loops will look something like this:

match parse_optional_exp(controlling_expression) with
| Some e -> e
| None -> Const(1) // construct a constant nonzero expression

You could do this during the code generation stage instead of the parsing stage, if you wanted.
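
As a rough illustration of the lookahead approach, here's a sketch in Python. Everything about its shape is an assumption for the example: tokens are plain strings in a list, the parser returns a `(result, new_position)` pair, and `parse_int` is a toy stand-in for a real `parse_exp` that only understands integer literals.

```python
def parse_int(tokens, pos):
    """Toy stand-in for a real expression parser: one integer literal."""
    return ("Const", int(tokens[pos])), pos + 1

def parse_optional_exp(tokens, pos, parse_exp=parse_int):
    """Parse <exp-option> by peeking at the next token.

    If the next token is the terminator (";" or ")"), the expression is
    empty; the terminator itself is left for the caller to consume.
    """
    if tokens[pos] in (";", ")"):
        return None, pos
    return parse_exp(tokens, pos)

def parse_controlling_exp(tokens, pos, parse_exp=parse_int):
    """Like parse_optional_exp, but an empty condition becomes Const(1)."""
    exp, pos = parse_optional_exp(tokens, pos, parse_exp)
    if exp is None:
        exp = ("Const", 1)  # constant nonzero expression
    return exp, pos

# The header of `for (;;)` after the opening paren: every part is empty.
tokens = [";", ";", ")"]
init, pos = parse_optional_exp(tokens, 0)
cond, pos = parse_controlling_exp(tokens, pos + 1)  # +1 skips the ";"
post, pos = parse_optional_exp(tokens, pos + 1)
```

After this runs, `init` and `post` are `None` and `cond` is `("Const", 1)`, so the AST never has to represent a missing controlling expression.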

☑ Task:

Update parsing to succeed on all valid stage 1-8 examples, and fail on all invalid stage 8 examples whose names start with syntax_err.

Code Generation

Null Statements

Don’t emit any assembly for null statements. Easy!

`while` loops

Given a while loop like this:

while (expression)
    statement

we can describe its control flow like this:

1. Evaluate expression.
2. If it’s false, jump to step 5.
3. Execute statement.
4. Jump to step 1.
5. Finish.

I won’t show you the exact assembly you need to generate here; by now you know enough to figure it out yourself. The main thing is labeling steps 1 and 5, so when we need a jump instruction we have somewhere to jump to. It’s worth noting that the loop body is a new scope, and you need to reset your current_scope set accordingly.

`do` Loops

These are basically the same as while loops; just evaluate the expression after the statement.

`for` loops

Given a for loop like this:

for (init; condition; post-expression)
    statement

we can break it down in the same way as while loops above:

1. Evaluate init.
2. Evaluate condition.
3. If it’s false, jump to step 7.
4. Execute statement.
5. Execute post-expression.
6. Jump to step 2.
7. Finish.

The init and post-expression might be empty, in which case we just don’t emit any assembly for steps 1 and 5. Note that a for loop, including the header, is a block with its own scope, and the body of the for loop is also a block. That means you can have code like this:

int i = 100; // scope 1
for (int i = 0; i < 10; i = i + 1) { // scope 2 - variable i shadows previous i
    int i; //scope 3 - this variable i shadows BOTH previous i's
}

The main gotcha here is that you need to pop the variable declared in init off the stack when you exit the block, just like you needed to handle deallocating other variables in the last post.

`break` and `continue`

We can implement each of these with a single jmp instruction – the trick is just figuring out where to jump to. A break statement “terminates execution of the smallest enclosing switch or iteration statement,” so we want to jump to the point right after the loop⁴. We already have an “end of loop” label, which we jump to when the controlling condition is false; we just need to pass that label around along with the variable map, stack index and current scope.

We also need to pass another label for continue to refer to. continue “causes a jump to the loop-continuation portion of the smallest enclosing iteration statement; that is, to the end of the loop body”⁵ – that’s step 4 in the while loop or step 5 in the for loop above.

Unlike the stack index, variable map and so forth, the break and continue labels can be null, if you’re not inside a loop. Hitting a break or continue statement when these labels are null should, of course, cause an error.

At this point, I was passing enough arguments around that I defined a Context type and wrapped it all up in that. You may want to do something similar, but you don’t have to.
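
For what it's worth, here's one way such a context could look, sketched in Python and trimmed down to just the two labels. Treat all of it as hypothetical: the `Context` fields, the tuple-shaped statements, the label names, and the assumption that the condition's assembly leaves its result in EAX are choices made for this example, and only while loops are handled.

```python
import itertools
from dataclasses import dataclass, replace
from typing import Optional

_label_ids = itertools.count()

def fresh_label(prefix):
    """Return a unique label name (the naming scheme is made up)."""
    return f"_{prefix}_{next(_label_ids)}"

@dataclass(frozen=True)
class Context:
    # A real context would also carry the variable map, stack index and scope.
    break_label: Optional[str] = None
    continue_label: Optional[str] = None

def gen_statement(stmt, ctx):
    """Return assembly lines for a tuple-shaped statement."""
    kind = stmt[0]
    if kind == "Break":
        if ctx.break_label is None:
            raise SyntaxError("break statement outside of a loop")
        return [f"jmp {ctx.break_label}"]
    if kind == "Continue":
        if ctx.continue_label is None:
            raise SyntaxError("continue statement outside of a loop")
        return [f"jmp {ctx.continue_label}"]
    if kind == "While":
        _, cond_code, body = stmt  # cond_code leaves the condition in EAX
        start = fresh_label("while_start")
        end = fresh_label("while_end")
        # In a while loop, continuing just means re-testing the condition.
        inner = replace(ctx, break_label=end, continue_label=start)
        return ([f"{start}:"] + cond_code
                + ["cmpl $0, %eax", f"je {end}"]
                + gen_statement(body, inner)
                + [f"jmp {start}", f"{end}:"])
    raise NotImplementedError(kind)

# while (1) break;
loop_asm = gen_statement(("While", ["movl $1, %eax"], ("Break",)), Context())
```

Because `Context` is frozen, entering a loop creates a new context instead of mutating the old one, so the enclosing loop's labels come back automatically when the inner loop is done.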

Up Next

In the next post we’re going to implement a pretty fundamental concept: function calls. I don’t know about you but I am VERY EXCITED for function calls. See you then!

If you have any questions, corrections, or other feedback, you can email me or open an issue.

¹ See section 6.8.5.3 of the C11 standard.↩

² break can also appear in switch statements, but we haven’t added those yet.↩

³ C11 standard, section 6.8.3. ↩

⁴ C11 standard, section 6.8.6.3.↩

⁵ C11 standard, section 6.8.6.2.↩

Writing a C Compiler, Part 7

2018-03-14T23:00:00+00:00

Update 4/9

There was a pretty big mistake in the original post - I forgot to deallocate local variables! I’ve added the “Deallocating Variables” section, and added the example from that section to the test suite.

This is the seventh post in a series. Read part 1 here.

In this post we’re adding support for compound statements, which are a little weird because they don’t do very much. We’ll generate almost no new assembly in this post, but we’ll be able to compile new and exciting programs at the end of it. How is this possible? Let’s find out!

As usual, accompanying tests are here.

Part 7: Compound Statements

A compound statement is just a list of statements and declarations wrapped in curly braces. They’re normally used as substatements of if, while, and other control structures, like this¹:

if (flag) {
    //this is a compound statement!
    int a = 1;
}

but they can also be free-standing, like this:

int main() {
    int a;
    {
        //this is also a compound statement!
        a = 4;
    }
}

You can have deeply nested compound statements:

int main() {
    //compound statement #1 (function bodies are compound statements!)
    int a = 1;
    {
        //compound statement #2
        a = 2;
        {
            //compound statement #3
            a = 3;
            if (a) {
                //compound statement #4
                a = 4;
            }
        }
    }
}

Like I mentioned in the last post, a compound statement is one type of block, and I’m going to use the terms synonymously for the rest of this post. C uses lexical scoping; a variable’s scope is dictated by the block where it’s defined. (By “scope”, I mean where in the program you’re allowed to refer to it.) More precisely, a variable’s scope starts at its definition, and ends when you exit the block where it’s defined². Up until this point in the series, function bodies were the only blocks around, so a variable could be used at any point in main after it was defined. Now it’s more complicated. I’m going to talk a bit about how scoping works in C; if you’re already familiar with this, you can skip ahead to the next section.

If a variable is defined in an inner scope, it can’t be accessed in an outer scope:

// here is the outer scope
{
    // here is the inner scope
    int foo = 2;
}

// now we're back in the outer scope
foo = 3; // ERROR - foo isn't defined in this scope!

However, code in an inner scope can access variables in an outer scope:

int a = 2;
{
    a = 4; // this is okay
}
return a; // returns 4 - changes made inside the inner scope are reflected here

You can’t have two variables with the same name in the same scope:

int foo = 0;
int foo = 1; //This will throw a compiler error

But you can have two variables with the same name in different scopes. Once the variable in the inner scope is declared, it will shadow the variable from the outer scope; the outer variable will be inaccessible until the inner variable goes out of scope.

int foo = 0;
{
    int foo; // this is a TOTALLY DIFFERENT foo, unrelated to foo from earlier
    foo = 2; // this refers to the inner foo; outer foo is inaccessible
}
return foo; //this will return 0 - it refers to the original foo, which is unchanged

The key idea here is that the inner and outer foo variables are two totally unrelated variables that just happen to have the same name. When we’re in the inner block, the outer variable foo still exists, but we have no way to refer to it, because foo now refers to the inner variable.

Note, however, that outer foo is accessible in the inner block before the point where it’s shadowed:

int foo = 0;
{
    foo = 3; //changes outer foo
    int foo = 4; //defines inner foo, shadowing outer foo
}
return foo; //returns 3

Lexing

Compound statements don’t require any new tokens, so we don’t need to touch the lexing pass this week.

Parsing

Here’s the current definition of statements in our AST:

statement = Return(exp) 
          | Exp(exp)
          | Conditional(exp, statement, statement option) //exp is controlling condition
                                                          //first statement is 'if' block
                                                          //second statement is optional 'else' block

We just need to add a Compound statement to this definition. Also recall that we added a block_item construct to the AST in our last post:

block_item = Statement(statement) | Declaration(declaration)

A compound statement is just a list of statements and declarations, so our new definition of statements will look like this:

statement = Return(exp) 
          | Exp(exp)
          | Conditional(exp, statement, statement option) //exp is controlling condition
                                                          //first statement is 'if' block
                                                          //second statement is optional 'else' block
          | Compound(block_item list)

We’ll parse conditional expressions and conditional statements totally differently. Statements are easier, so let’s handle those first.

Now let’s update our grammar. The rule for blocks is extremely simple:

"{" { <block-item> } "}"

Note that "{" "}" are literal curly braces, and { } indicates repetition. This is hard to read! But it just means we have an arbitrary number of block items wrapped in braces – if you refer back to the grammar for <function> you can see that we define function bodies exactly the same way.

Putting it all together, our updated grammar looks like this:

<statement> ::= "return" <exp> ";"
              | <exp> ";"
              | "if" "(" <exp> ")" <statement> [ "else" <statement> ]
              | "{" { <block-item> } "}"

☑ Task:

Update the parsing pass to handle blocks. It should successfully parse all valid examples in stage 1-7. As in part 5, some invalid examples should fail during parsing and some should fail during code generation. At this point, your parsing pass should throw an appropriate error for all invalid stage 7 examples whose names start with syntax_err.

Code Generation

As we saw earlier, it’s possible to have two different variables, in two different scopes, stored at two different locations on the stack, with the same name. Here’s an example:

int foo = 3;
{
  int foo = 4;
}

So, whenever the program refers to variable foo, our generated code needs to access the correct foo on the stack – or raise an error if foo has gone out of scope. The code generation step this week is all about managing the variable map so we always look up the right foo.

The trick here is that every block has a separate copy of the variable map. That way, defining (or redefining) a variable in an inner scope won’t interfere with an outer scope. And if you’re using an immutable map (which you should be), every block will necessarily get its own variable map, so this approach is surprisingly easy.

Let’s look at some pseudocode. After Part 5, your code to generate a function body probably looked something like this:

def generate_function_body(body):
  // initialize variable map and stack index
  var_map = Map()
  stack_index = -4

  //process statements one at a time
  for statement in body:
    var_map, stack_index = generate_statement(statement, var_map, stack_index) 

Note that generate_statement has to return a new var_map. Every declaration updates the variable map (or, more precisely, creates a new variable map), and in part 5 generate_statement also handled declarations. Whenever we process a declaration, we need to return the latest, greatest variable map so future statements can reference the variable we just declared.

But in the last post, we separated statements from declarations in our AST, so you might have changed the last line to:

    var_map, stack_index = generate_statement_or_declaration(statement, var_map, stack_index) 

At this point, a declaration will create a new variable map, but a statement won’t. Whatever happens in a statement – including a compound statement, which may itself contain declarations – has no impact on the variable map for the enclosing scope. Once you understand that point, handling nested scopes is easy:

def generate_function_body(body):
  // initialize variable map and stack index
  var_map = Map()
  stack_index = -4

  //process statements one at a time
  for block_item in body:
    if block_item is a declaration:
        //update the variable map
        var_map, stack_index = generate_declaration(block_item, var_map, stack_index)
    else:
        //don't update the variable map
        generate_statement(block_item, var_map, stack_index)

Of course you’ll need to generalize generate_function_body into generate_block; the one difference between generating a function body and any other block is that you need to initialize your empty variable map and stack index at the start of the function body.

Now let’s walk through a small example to see how this maintains the right variable maps for different scopes:

int main(){
    // 1) function body
    {   // 2) block
        int a = 2; // 3) variable declaration
        a = 3; // 4) variable reference
    }
    return a; // 5) return statement
}

1. We’ll process the function body with generate_block. Right now we’ve got an empty variable map.
2. We call generate_block recursively to process the inner block. The variable map is still empty.
3. This is a declaration, so we add a to the variable map (technically, we create a copy of the variable map that contains a, because all these maps are immutable).
4. We look up a’s location on the stack in the variable map from step 3.
5. Back in the outer scope, var_map refers to the original, empty variable map. Since a isn’t defined in this map, this will throw an error, as it should.
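
The walkthrough above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the tuple-based mini-AST and the NameError are made up for the example, and a plain dict that we copy instead of mutating stands in for an immutable map.

```python
# Hypothetical mini-AST: a block is a list of items, and each item is one of
#   ("declare", name), ("ref", name), or ("block", [items...]).

def generate_block(block, var_map, stack_index):
    """Walk a block. Changes to var_map stay local because nothing is returned."""
    for item in block:
        kind = item[0]
        if kind == "declare":
            # Build a new dict rather than mutating the caller's map.
            var_map = {**var_map, item[1]: stack_index}
            stack_index -= 4
        elif kind == "block":
            # The inner block can read our map but cannot change it.
            generate_block(item[1], var_map, stack_index)
        elif kind == "ref":
            if item[1] not in var_map:
                raise NameError(f"undeclared variable: {item[1]}")


# Mirrors the example: { int a = 2; a = 3; } return a;
body = [("block", [("declare", "a"), ("ref", "a")]), ("ref", "a")]
try:
    generate_block(body, {}, -4)
    outcome = "compiled"
except NameError as err:
    outcome = str(err)
print(outcome)  # undeclared variable: a
```

The reference inside the inner block succeeds, and the one after the block fails, because the outer call never sees the map that contains a.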

The code for handling declarations also needs to be changed. The pseudocode for processing declarations from part 5 included this line:

if var_map.contains("a"):
  fail() //shouldn't declare a var twice

This is now incorrect; it’s legal to declare two variables with the same name, as long as the declarations aren’t in the same scope. To solve this, we need a way to distinguish between variables defined in the current scope, and variables defined in an outer scope. My solution was to maintain a set of variables that are defined in the current scope, which means generate_block now looks something like this:

def generate_block(block, var_map, stack_index):

  current_scope = Set()

  //process statements one at a time
  for block_item in block:
    if block_item is a declaration:
        //update the variable map
        var_map, stack_index, current_scope = generate_declaration(block_item, var_map, stack_index, current_scope)
    else:
        //don't update the variable map
        generate_statement(block_item, var_map, stack_index)

Finally, we check current_scope, rather than var_map, for duplicate variable declarations, and add the variable to both structures on success:

if current_scope.contains("a"):
  fail() //shouldn't declare a var twice in the same scope
else:
  //emit assembly, update stack_index and var_map as before...
  new_scope = current_scope.add("a")
  return (var_map, stack_index, new_scope)

This solution feels hacky, but I haven’t come up with a better one.

Now, if a is redefined in an inner scope, it just overwrites the old a in the variable map, so this scope and any inner ones will use the correct stack location, corresponding to the innermost definition of a. This won’t affect the outer scope at all, because the outer scope is still using the original, unmodified variable map.
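
Here is one way the duplicate check and shadowing could look in Python. It is a sketch, not the reference implementation: the function name declare and the ValueError are assumptions, a frozenset stands in for the immutable set, and no assembly is emitted.

```python
def declare(name, var_map, stack_index, current_scope):
    """Return updated (var_map, stack_index, current_scope) for one declaration."""
    if name in current_scope:
        raise ValueError(f"{name} declared twice in the same scope")
    new_map = {**var_map, name: stack_index}   # shadows any outer binding of name
    new_scope = current_scope | {name}         # union builds a new frozenset
    return new_map, stack_index - 4, new_scope


# Outer scope declares a.
outer_map, outer_index, outer_scope = declare("a", {}, -4, frozenset())

# Entering an inner block: same variable map, fresh (empty) scope set,
# so redeclaring a is allowed and maps to a new stack slot.
inner_map, inner_index, inner_scope = declare("a", outer_map, outer_index, frozenset())

print(outer_map)  # {'a': -4}
print(inner_map)  # {'a': -8}
```

Calling declare("a", outer_map, outer_index, outer_scope) a second time would raise, because a is already in the outer block's own scope set.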

Deallocating Variables

We’ve carefully managed our variable map to prevent a block from interfering with any variable declarations in its enclosing scope. But there’s one side effect we couldn’t avoid: allocating a variable changes the stack pointer. This is a problem, because the stack pointer and our stack_index variable will get out of sync. Consider the following example:

int main() {
  {
    int i = 0;
  }
  int j = 1;
  return j;
}

At first, the variable map is empty and stack_index is -4, because the first empty spot on the stack is four bytes below EBP.

When we process the block in this example with generate_block, we’ll push i onto the stack:

    movl $0, %eax
    push %eax

Now ESP is at EBP - 4, and stack_index is -8.

After we exit the block, we forget that we allocated i. That means i is no longer in our variable map, and we’re still working with our original stack index of -4; remember that generate_block doesn’t return a stack index. We should forget i, because it’s out of scope.

The problem is, i is still there, because ESP is still pointing at it.

So when we push j, it will be just below i, at EBP - 8:

  movl $1, %eax
  push %eax

But because the stack index was -4, we’ll add a mapping from j to -4 in our variable map. Any future references to j (like in the return statement) will incorrectly use the stack location of i instead.

We could solve this by having generate_block return a stack index, but it’s probably better to just pop variables off the stack when we’re done with them, right at the end of generate_block. Conveniently, the size of current_scope tells us how many variables we need to pop.

def generate_block(block, var_map, stack_index):

  current_scope = Set()
  ...as before...

  bytes_to_deallocate = 4 * current_scope.size()
  emit "    addl ${}, %esp".format(bytes_to_deallocate)
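
As a concrete version of that last step, here is a small Python helper. The name block_epilogue is hypothetical, and skipping the instruction for blocks with no declarations is an optional choice of mine, not something the approach requires.

```python
def block_epilogue(current_scope):
    """Return the instruction that pops this block's locals, or None if it has none."""
    bytes_to_deallocate = 4 * len(current_scope)  # every variable is a 4-byte int
    if bytes_to_deallocate == 0:
        return None  # nothing was pushed in this block
    return f"    addl ${bytes_to_deallocate}, %esp"


print(block_epilogue({"i"}))            #     addl $4, %esp
print(block_epilogue({"a", "b", "c"}))  #     addl $12, %esp
```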

☑ Task:

Update the code-generation pass to correctly handle compound statements. It should succeed on all valid examples and fail on all invalid examples for stages 1-7.

Up Next

In the next post, we’ll add for, do, and while loops. See you then!

If you have any questions, corrections, or other feedback, you can email me or open an issue.

¹ I’ll use comments to clarify the code snippets throughout this post, even though we haven’t added support for comments yet. ↩

² Global variables work a bit differently but we haven’t added those yet. ↩

Writing a C Compiler, Part 6

2018-02-25T20:00:00+00:00

This is the sixth post in a series. Read part 1 here.

Hi, this blog isn’t dead! It was just, uh, resting. I’ve been swamped with non-blog things for the past few weeks but I’m back on track now, probably, I hope.

Today we’ll implement conditional statements and expressions. As usual, accompanying tests are here.

Part 6: Conditionals

In this post we’ll add support for two types of conditional constructs:

Conditional statements, a.k.a. if statements
Ternary conditional expressions, which have the form a ? b : c. I’ll sometimes just call these “conditional expressions”.

If Statements

An if statement consists of a condition, a substatement that executes if the condition is true, and maybe another substatement that executes if the condition is false. Either of these substatements can be a single statement, like this:

if (flag)
  return 0;

or a compound statement, like this:

if (flag) {
  int a = 1;
  return a*2;
}

Adding support for compound statements is a distinct task that we’re not going to handle in this post. So for now, we’ll only support the first of the examples above, and not the second.

We say a condition is false if it evaluates to zero, and true otherwise, just like when we implemented boolean operators in earlier posts.

Else If

Note that C doesn’t have an explicit else if construct. If an if keyword immediately follows an else keyword, the whole if statement gets parsed as the else branch. In other words, the following code snippets are equivalent:

if (flag)
    return 0;
else if (other_flag)
    return 1;
else
    return 2;

if (flag)
    return 0;
else {
    if (other_flag)
        return 1;
    else
        return 2;
}

Conditional Expressions

These expressions take the following form:

a ? b : c

If a is true, the expression will evaluate to b; otherwise it will evaluate to c.

Note that we should only execute the expression we actually need. For example, in the following code snippet:

0 ? foo() : bar()

the function foo should never be called. You might be tempted to call both foo and bar, then discard the result from foo, but that would be wrong; foo could print to the console, make a network call, or dereference a null pointer and crash the program. Obviously this point is also true of if statements – we should execute the if branch or the else branch but definitely not both.
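
To see why evaluating both branches is wrong, here is a small Python model (not C, and not compiler code). The helper names are made up for the illustration; the "eager" version receives both results already computed, while the "lazy" version receives callables and invokes only one.

```python
calls = []

def foo():
    calls.append("foo")  # stands in for a visible side effect
    return 1

def bar():
    calls.append("bar")
    return 2

def eager_conditional(cond, a, b):
    # Wrong model: both a and b were evaluated before we got here.
    return a if cond else b

def lazy_conditional(cond, a_thunk, b_thunk):
    # Right model: only the branch we need is ever called.
    return a_thunk() if cond else b_thunk()


eager_result = eager_conditional(0, foo(), bar())
eager_calls = list(calls)

calls.clear()
lazy_result = lazy_conditional(0, foo, bar)
lazy_calls = list(calls)

print(eager_calls)  # ['foo', 'bar']
print(lazy_calls)   # ['bar']
```

Both versions return the same value, but only the lazy one avoids foo's side effect, which is the behavior C requires.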

Conditional expressions and if statements might seem very similar, but it’s important to remember that statements and expressions are used in totally different ways. For example, an expression has a value, but a statement doesn’t. So this is legal:

int a = flag ? 2 : 3;

but this isn’t¹:

//this is bogus
int a = if (flag)
            2;
        else
            3;

On the other hand, a statement can contain other statements, but an expression can’t contain statements. For example, you can nest a return statement inside an if statement:

if (flag)
    return 0;

but you can’t have a return statement inside a conditional expression:

//this is also bogus
flag ? return 1 : return 2;

Lexing

We need to define a few more tokens: if and else keywords for if statements, plus : and ? operators for conditional expressions. Here’s the full list of tokens, with new tokens in bold at the bottom:

Open brace {
Close brace }
Open parenthesis (
Close parenthesis )
Semicolon ;
Int keyword int
Return keyword return
Identifier [a-zA-Z]\w*
Integer literal [0-9]+
Minus -
Bitwise complement ~
Logical negation !
Addition +
Multiplication *
Division /
AND &&
OR ||
Equal ==
Not Equal !=
Less than <
Less than or equal <=
Greater than >
Greater than or equal >=
Assignment =
If keyword if
Else keyword else
Colon :
Question mark ?
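
If your lexer is regex-based, the new tokens might slot in like this. This is a minimal sketch using Python's re module; the token names and the grouping of operators into a single OP kind are assumptions for brevity, not how the tests require you to structure things. The \b after each keyword keeps an identifier like iffy from being lexed as the keyword if.

```python
import re

# Keywords come before the identifier pattern so they win when both match.
TOKEN_SPEC = [
    ("INT_KW", r"int\b"),
    ("RETURN_KW", r"return\b"),
    ("IF_KW", r"if\b"),
    ("ELSE_KW", r"else\b"),
    ("IDENT", r"[a-zA-Z]\w*"),
    ("INT_LIT", r"[0-9]+"),
    # Two-character operators first, then single characters (including ? and :).
    ("OP", r"&&|\|\||==|!=|<=|>=|[-~!+*/<>=?:]"),
    ("PUNCT", r"[{}();]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def lex(source):
    """Return a list of (kind, text) pairs, or raise SyntaxError on bad input."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(lex("a ? b : c"))
# [('IDENT', 'a'), ('OP', '?'), ('IDENT', 'b'), ('OP', ':'), ('IDENT', 'c')]
```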

☑ Task:

Update the lex function to handle the new tokens. It should work for all stage 1-6 examples in the test suite, including the invalid ones.

Parsing

We’ll parse conditional expressions and if statements totally differently. Let’s handle if statements first.

If Statements

So far, we’ve defined three types of statements in our AST: return statements, expressions, and variable declarations. Right now the definition looks like this:

statement = Return(exp) 
          | Declare(string, exp option) //string is variable name
                                        //exp is optional initializer
          | Exp(exp)

We need to add if statements to the AST. An if statement has three parts: an expression (the controlling condition), an if branch, and an optional else branch. Here’s our updated AST definition for statements:

statement = Return(exp) 
          | Declare(string, exp option) //string is variable name
                                        //exp is optional initializer
          | Exp(exp)
          | Conditional(exp, statement, statement option) //exp is controlling condition
                                                          //first statement is 'if' branch
                                                          //second statement is optional 'else' branch

Now let’s update our grammar. The rule for if statements consists of:

The if keyword
An expression wrapped in parentheses (the condition)
A statement (executed if the condition is true)
Optionally, the else keyword, followed by another statement (executed if the condition is false)

"if" "(" <exp> ")" <statement> [ "else" <statement> ]

So the updated grammar for statements looks like this:

<statement> ::= "return" <exp> ";"
              | <exp> ";"
              | "int" <id> [ "=" <exp> ] ";"
              | "if" "(" <exp> ")" <statement> [ "else" <statement> ]

Our definition of statements is recursive! But it’s not left-recursive, so it’s not a problem.

But we have another problem. We defined variable declarations as a type of statement, but declarations in C aren’t statements. For example, this code snippet isn’t valid:

//this will throw a compiler error!
if (flag)
  int i = 0;

When we added variable declarations in the last post, it didn’t matter whether or not we defined them as statements; we could parse the same subset of C and generate the same assembly either way. Now that we’re dealing with more complex structures like if statements, that simplification impacts what we can and can’t parse, so we need to fix it.

So we need to move Declare out of the statement type and into its own type. But this introduces a new problem: we’ve defined a function body as a list of statements, but if declarations aren’t statements, then you can’t have declarations in a function body. To fix this, we’ll need to tweak how we define functions in our AST. Let’s introduce some terminology:

A block item is a statement or declaration.
A block or compound statement is a list of block items wrapped in curly braces².

Function bodies are just a special case of blocks; they contain a list of declarations and statements. To represent them, we’ll introduce a new block_item type that can hold either a statement or a declaration. This will also come in handy when we add support for blocks in general in the next post. With those changes, the relevant parts of our AST will look like this:

statement = Return(exp)                                         
          | Exp(exp)
          | Conditional(exp, statement, statement option) //exp is controlling condition
                                                          //first statement is 'if' block
                                                          //second statement is optional 'else' block

declaration = Declare(string, exp option) //string is variable name 
                                          //exp is optional initializer

block_item = Statement(statement) | Declaration(declaration)

function_declaration = Function(string, block_item list) //string is the function name                                                                                      

And here’s the updated grammar:

<statement> ::= "return" <exp> ";"
              | <exp> ";"
              | "if" "(" <exp> ")" <statement> [ "else" <statement> ]
<declaration> ::= "int" <id> [ "=" <exp> ] ";"
<block-item> ::= <statement> | <declaration>
<function> ::= "int" <id> "(" ")" "{" { <block-item> } "}"

Now that we have our AST and grammar, you should be able to update your compiler to parse conditional statements. You may want to do that before we move on to conditional expressions.
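
As a rough guide, a recursive-descent parser for these rules could look like the Python sketch below. Several things here are assumptions for the sake of a short example: tokens are plain strings, AST nodes are tuples, and parse_exp is a stand-in that accepts a single identifier or integer instead of a full expression.

```python
KEYWORDS = {"int", "return", "if", "else"}

def peek(tokens):
    return tokens[0] if tokens else None

def expect(tokens, value):
    """Consume the next token, which must equal value."""
    tok = tokens.pop(0) if tokens else None
    if tok != value:
        raise SyntaxError(f"expected {value!r}, got {tok!r}")

def parse_exp(tokens):
    # Stand-in for the real expression parser: one identifier or integer literal.
    tok = tokens.pop(0) if tokens else None
    if tok is not None and tok.isdigit():
        return ("Constant", int(tok))
    if tok is not None and tok.isidentifier() and tok not in KEYWORDS:
        return ("Var", tok)
    raise SyntaxError(f"expected an expression, got {tok!r}")

def parse_statement(tokens):
    if peek(tokens) == "return":
        tokens.pop(0)
        exp = parse_exp(tokens)
        expect(tokens, ";")
        return ("Return", exp)
    if peek(tokens) == "if":
        tokens.pop(0)
        expect(tokens, "(")
        cond = parse_exp(tokens)
        expect(tokens, ")")
        if_branch = parse_statement(tokens)   # a statement, never a declaration
        else_branch = None
        if peek(tokens) == "else":
            tokens.pop(0)
            else_branch = parse_statement(tokens)
        return ("Conditional", cond, if_branch, else_branch)
    exp = parse_exp(tokens)
    expect(tokens, ";")
    return ("Exp", exp)

def parse_block_item(tokens):
    # Declarations are only recognized here, at block-item level.
    if peek(tokens) == "int":
        tokens.pop(0)
        name = tokens.pop(0) if tokens else None
        if name is None or not name.isidentifier() or name in KEYWORDS:
            raise SyntaxError(f"expected a variable name, got {name!r}")
        init = None
        if peek(tokens) == "=":
            tokens.pop(0)
            init = parse_exp(tokens)
        expect(tokens, ";")
        return ("Declaration", ("Declare", name, init))
    return ("Statement", parse_statement(tokens))


tokens = "if ( flag ) return 0 ; else if ( other ) return 1 ; else return 2 ;".split()
tree = parse_statement(tokens)
```

Note that there is no special case for else if: the second if is simply parsed as the statement in the else branch, so tree is a Conditional nested inside another Conditional. Because parse_statement never looks for "int", input like if (flag) int i = 0; is rejected.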

☑ Task:

Update the parsing pass to handle conditional statements. It should successfully parse all valid stage 6 examples in write_a_c_compiler/stage_6/valid/statement, and throw an error for all invalid stage 6 examples in write_a_c_compiler/stage_6/invalid/statement.

Conditional Expressions

Now let’s add ternary conditional expressions. Here’s how we’ve defined our AST for expressions so far:

exp = Assign(string, exp)
    | Var(string) //string is variable name
    | BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)

It’s straightforward to add a CondExp form:

exp = Assign(string, exp)
    | Var(string) //string is variable name
    | BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)
    | CondExp(exp, exp, exp) //the three expressions are the condition, 'if' expression and 'else' expression, respectively

We also need to update the grammar rules for expressions, which currently look like this:

<exp> ::= <id> "=" <exp> | <logical-or-exp>
<logical-or-exp> ::= <logical-and-exp> { "||" <logical-and-exp> } 
...more rules...

The conditional operator has higher precedence than assignment (=) but lower precedence than logical OR (||), and it’s right-associative. We can take its grammar rule straight from section 6.5.15 of the C11 standard:

<conditional-exp> ::= <logical-or-exp> "?" <exp> ":" <conditional-exp>

Let’s think about why it’s defined this way. I’ll refer to the three sub-expressions as e1, e2, and e3, such that a conditional expression has the form e1 ? e2 : e3. Expression e1 has to be a <logical-or-exp> because it can’t be an assignment expression or a conditional expression. It can’t be an assignment expression because assignment has lower precedence than the conditional operator. In other words:

a = 1 ? 2 : 3;

must be parsed as:

a = (1 ? 2 : 3);

In our current grammar this is specified unambiguously, but if we instead defined a conditional expression as:

<conditional-exp> ::= <exp> "?" <exp> ":" <conditional-exp>

then it would be ambiguous; the statement above could also be parsed as:

(a = 1) ? 2 : 3;

Note that (a = 1) ? 2 : 3; is a valid statement, but you need the parentheses in order to parse it that way.

So that’s why e1 can’t be an assignment expression. It can’t be a conditional expression because ? is right-associative. In other words:

flag1 ? 4 : flag2 ? 6 : 7

must be parsed as

flag1 ? 4 : (flag2 ? 6 : 7)

If we had defined a conditional expression as:

<conditional-exp> ::= <conditional-exp> "?" <exp> ":" <conditional-exp>

then the example above could also be parsed as:

(flag1 ? 4 : flag2) ? 6 : 7

and the grammar would be ambiguous.

Expression e2 in our ternary conditional can take any form; safely fenced in by ? and :, it can’t introduce any grammatical ambiguity. You can think of implicit parentheses wrapping everything between ? and :.

Expression e3 can be another ternary conditional, as in the example a > b ? 4 : flag ? 6 : 7. But it can’t be an assignment expression – why not? Let’s look at the following example:

flag ? a = 1 : a = 0

If we try to compile this with gcc, we’ll get something like the following error message:

error: expression is not assignable
    flag ? a = 1 : a = 0;
    ~~~~~~~~~~~~~~~~ ^

In other words, gcc tried to parse the expression like this:

(flag ? a = 1 : a) = 0

This obviously doesn’t work because the expression on the left isn’t a variable³. You might wonder why we can’t use the following grammar rule:

<conditional-exp> ::= <logical-or-exp> "?" <exp> ":" <exp>

Then gcc could just parse it like this:

flag ? a = 1 : (a = 0)

That grammar rule would work fine; in fact, that’s how conditional expressions are defined in C++⁴. I don’t know why it’s different in C, but if you know I’d like to hear from you.

We also need a way to specify expressions that aren’t conditionals, so we’ll make the ‘conditional’ part of this grammar rule optional⁵:

<conditional-exp> ::= <logical-or-exp> [ "?" <exp> ":" <conditional-exp> ]

Anyway, we now know the correct grammar. Here are all the new and updated grammar rules concerning expressions:

<exp> ::= <id> "=" <exp> | <conditional-exp>
<conditional-exp> ::= <logical-or-exp> [ "?" <exp> ":" <conditional-exp> ]
<logical-or-exp> ::= <logical-and-exp> { "||" <logical-and-exp> } 
...
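
These two rules translate fairly directly into recursive-descent code. The Python sketch below is illustrative only: tokens are plain strings, AST nodes are tuples, and parse_logical_or is a stand-in that accepts a single identifier or integer literal rather than the full operator hierarchy.

```python
def parse_logical_or(tokens):
    # Stand-in for <logical-or-exp>: one identifier or integer literal.
    tok = tokens.pop(0)
    return ("Constant", int(tok)) if tok.isdigit() else ("Var", tok)

def parse_conditional(tokens):
    # <conditional-exp> ::= <logical-or-exp> [ "?" <exp> ":" <conditional-exp> ]
    e1 = parse_logical_or(tokens)
    if tokens and tokens[0] == "?":
        tokens.pop(0)
        e2 = parse_exp(tokens)            # anything goes between ? and :
        if not tokens or tokens.pop(0) != ":":
            raise SyntaxError("expected ':' in conditional expression")
        e3 = parse_conditional(tokens)    # recursing here gives right-associativity
        return ("CondExp", e1, e2, e3)
    return e1

def parse_exp(tokens):
    # <exp> ::= <id> "=" <exp> | <conditional-exp>
    if len(tokens) >= 2 and tokens[0].isidentifier() and tokens[1] == "=":
        name = tokens.pop(0)
        tokens.pop(0)
        return ("Assign", name, parse_exp(tokens))
    return parse_conditional(tokens)


nested = parse_exp("flag1 ? 4 : flag2 ? 6 : 7".split())
assigned = parse_exp("a = 1 ? 2 : 3".split())

leftover = "flag ? a = 1 : a = 0".split()
partial = parse_exp(leftover)
print(leftover)  # ['=', '0']
```

In the last case the parser consumes flag ? a = 1 : a as a complete conditional expression and stops, leaving = 0 unconsumed. A real parser would then report an error when it expects a semicolon, which matches the behavior described above.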

☑ Task:

Update the parsing pass to handle ternary conditional expressions. At this point, it should successfully parse all valid stage 6 examples, and throw an error for all invalid examples.

Put It All Together

For the sake of completeness, here’s our full AST definition and grammar, with new and changed parts bolded:

AST:

program = Program(function_declaration)

function_declaration = Function(string, block_item list) //string is the function name

block_item = Statement(statement) | Declaration(declaration)

declaration = Declare(string, exp option) //string is variable name 
                                          //exp is optional initializer

statement = Return(exp) 
          | Exp(exp)
          | Conditional(exp, statement, statement option) //exp is controlling condition
                                                          //first statement is 'if' block
                                                          //second statement is optional 'else' block
                                                          
exp = Assign(string, exp)
    | Var(string) //string is variable name
    | BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)
    | CondExp(exp, exp, exp) //the three expressions are the condition, 'if' expression and 'else' expression, respectively

Grammar:

<program> ::= <function>
<function> ::= "int" <id> "(" ")" "{" { <block-item> } "}"
<block-item> ::= <statement> | <declaration>
<declaration> ::= "int" <id> [ "=" <exp> ] ";"
<statement> ::= "return" <exp> ";"
              | <exp> ";"
              | "if" "(" <exp> ")" <statement> [ "else" <statement> ]


<exp> ::= <id> "=" <exp> | <conditional-exp>
<conditional-exp> ::= <logical-or-exp> [ "?" <exp> ":" <conditional-exp> ]
<logical-or-exp> ::= <logical-and-exp> { "||" <logical-and-exp> }
<logical-and-exp> ::= <equality-exp> { "&&" <equality-exp> }
<equality-exp> ::= <relational-exp> { ("!=" | "==") <relational-exp> }
<relational-exp> ::= <additive-exp> { ("<" | ">" | "<=" | ">=") <additive-exp> }
<additive-exp> ::= <term> { ("+" | "-") <term> }
<term> ::= <factor> { ("*" | "/") <factor> }
<factor> ::= "(" <exp> ")" | <unary_op> <factor> | <int> | <id>
<unary_op> ::= "!" | "~" | "-"

Code Generation

To generate the assembly for if statements and conditional expressions, we’re going to need conditional and unconditional jumps, which we introduced in part 4. We can generate assembly for the conditional expression e1 ? e2 : e3 as follows:

    <CODE FOR e1 GOES HERE>
    cmpl $0, %eax
    je   _e3                  ; if e1 == 0, e1 is false so execute e3
    <CODE FOR e2 GOES HERE>  ; we're still here so e1 must be true. execute e2.
    jmp  _post_conditional    ; jump over e3
_e3:
    <CODE FOR e3 GOES HERE>  ; we jumped here because e1 was false. execute e3.
_post_conditional:            ; we need this label to jump over e3

The assembly for if statements is quite similar, although it’s slightly complicated by the optional else clause. I’ll let you figure it out yourself.

As in the assembly for && and || we saw earlier, labels have to be unique.
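
One simple way to keep labels unique is a global counter. The Python sketch below shows that idea for conditional expressions only; the function names are hypothetical, and it assumes the code for each sub-expression has already been generated as a list of assembly lines.

```python
import itertools

_label_ids = itertools.count()

def unique_label(prefix):
    """Return a label like _e3_0, _e3_2, ... that is never handed out twice."""
    return f"_{prefix}_{next(_label_ids)}"

def generate_cond_exp(e1_code, e2_code, e3_code):
    """Stitch already-generated code for e1, e2 and e3 into e1 ? e2 : e3."""
    else_label = unique_label("e3")
    end_label = unique_label("post_conditional")
    return (
        e1_code
        + ["    cmpl $0, %eax", f"    je   {else_label}"]
        + e2_code
        + [f"    jmp  {end_label}", f"{else_label}:"]
        + e3_code
        + [f"{end_label}:"]
    )


first = generate_cond_exp(["    movl $1, %eax"], ["    movl $2, %eax"], ["    movl $3, %eax"])
second = generate_cond_exp(["    movl $0, %eax"], ["    movl $4, %eax"], ["    movl $5, %eax"])
print("\n".join(first))
```

Because every call draws fresh numbers from the counter, nested or repeated conditionals in the same program never share a label.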

☑ Task:

Update the code-generation pass to correctly handle ternary conditional expressions and if statements. It should succeed on all valid examples and fail on all invalid examples for stages 1-6.

Up Next

In the next post, we’ll add compound statements, so brace yourself (pun intended) for an exciting discussion of lexical scope! I hope that will be two weeks from now and not two months. See you then!

If you have any questions, corrections, or other feedback, you can email me or open an issue.

¹ But the if construct in many functional languages is an expression, and works just like C’s ternary conditionals. This is valid OCaml, for instance:

let a = if b then 1 else 2

↩

² The terms “block” and “compound statement” aren’t 100% synonymous; compound statements are a subset of blocks. But the terms are similar enough that it’s fine to treat them as synonyms for now. ↩

³ Actually, any “modifiable lvalue” is allowed on the left side of an assignment expression, not just variables. A dereferenced pointer (*p) or an array element (a[i]) is a modifiable lvalue, for example. Expressions like &x, ++x, and x++ aren’t lvalues in C, and neither are conditional expressions. ↩

⁴ See this Stack Overflow answer and the C++11 standard. ↩

⁵ Thanks to Stephen Bastians for pointing out a mistake in this grammar rule in an earlier version of this post.↩

Writing a C Compiler, Part 5

2018-01-08T20:00:00+00:00

This is the fifth post in a series. Read part 1 here.

We’ve spent the last two weeks adding binary primitives, and I don’t know about you, but I’m starting to get kind of bored with it. This week, we’ll do something completely different and add support for local variables. We’ll finally be able to compile functions longer than one line! Hooray!

As always, accompanying tests are here.

Week 5: Local Variables

We’re adding variables this week! Programming without variables is hard, so this is very exciting. To keep things simple, we’re going to support variables in a very restricted way for now:

We only support local variables, which are declared in main. No global variables.
We only support variables of type int.
We don’t support type modifiers like short, long or unsigned, storage-class specifiers like static, or type qualifiers like const. Just plain old int.
You can only declare one variable per statement. We won’t support statements like int a, b;

There are three things you can do with a variable:

Declare it (int a;)
- When you declare it, you can also optionally initialize it (int a = 2;)
Assign to it (a = 3;)
Reference it in an expression (a + 2)

We’ll need to add support for these three things. We’ll also add support for functions containing more than one statement.

Lexing

The only new token this week is the assignment operator, =. Here’s our list of tokens, with the newest addition in bold at the bottom:

Open brace {
Close brace }
Open parenthesis (
Close parenthesis )
Semicolon ;
Int keyword int
Return keyword return
Identifier [a-zA-Z]\w*
Integer literal [0-9]+
Minus -
Bitwise complement ~
Logical negation !
Addition +
Multiplication *
Division /
AND &&
OR ||
Equal ==
Not Equal !=
Less than <
Less than or equal <=
Greater than >
Greater than or equal >=
Assignment =

☑ Task:

Update the lex function to handle the = token. It should work for all stage 1-5 examples in the test suite, including the invalid ones.

Parsing

We need to make a lot of changes to our AST this week. Let’s look at a sample program we’d like to handle:

int main() {
    int a = 1;
    a = a + 1;
    return a;
}

In this program, main contains three statements:

A variable declaration (int a = 1;)
A variable assignment (a = a + 1;)
A return statement (return a;)

We need to update the definition of function_declaration in the AST so a function can contain a list of statements, not just a single statement:

function_declaration = Function(string, statement list) //string is function name

Right now, the only statements we’ve defined are return statements. That’s not right either. Let’s add some more:

statement = Return(exp) 
          | Declare(string, exp option) //string is variable name
                                        //exp is optional initializer
          | Exp(exp)

We’ve added Declare for variable declarations. We can use an option type (Maybe in Haskell) to represent that we may or may not have an initializer.

The AST for int a; might look like this:

decl = Declare("a", None) //None because we don't initialize it

And the AST for int a = 3; might look like this:

init_exp = Constant(3)
decl = Declare("a", Some(init_exp))

Note that we don’t store the variable’s type anywhere in our AST; we don’t need to, because it can only have type int. We’ll need to start tracking type information once we have multiple types.

We’ve also added a standalone Exp statement, which means we can now write programs like this:

int main() {
    2 + 2;
    return 0;
}

This is valid C; if you compile it with gcc, it will issue a warning but it won’t fail.

However, 2+2; isn’t a very useful statement. The real reason to add an Exp statement is so we can write statements like this:

a = 2;

Variable assignment is just an expression! That’s why this statement is valid:

a = 2 * (b = 2);

In the code snippet above, the expression b = 2 has the value 2, and the side effect of updating b to have that value. This would be evaluated as:

a = 2 * (b = 2)
a = 2 * 2 //also b is 2 now
a = 4

Now we need to update exp in our AST definition to handle assignment operators. My first thought was to just add = as another binary operator – after all, a = b looks kind of like a + b. But that’s totally wrong: the two operands of a binary operator can be arbitrary expressions, but the left side of an assignment operator can’t. A statement like 2 = 2 doesn’t make any sense, because you can’t assign a new value to 2.

Instead, we’ll just define assignment as a new type of expression:

exp = Assign(string, exp) //string is variable, exp is value to assign
    | BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)

Now we can write the AST for the statement a = 2; like this:

assign_exp = Assign("a", Constant(2))
assign_statement = Exp(assign_exp)

Now we can define variables and update their values, but that’s not super helpful unless we can actually reference them. Let’s add variable reference as another type of expression:

exp = Assign(string, exp)
    | Var(string) //string is variable name
    | BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)

Now we can write the AST for return a; like this:

return_exp = Var("a")
return_statement = Return(return_exp)

If we put it all together, here’s our new AST, with changes bolded:

program = Program(function_declaration)
function_declaration = Function(string, statement list) //string is the function name

statement = Return(exp) 
          | Declare(string, exp option) //string is variable name
                                        //exp is optional initializer
          | Exp(exp) 

exp = Assign(string, exp) 
    | Var(string) //string is variable name 
    | BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)
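
If you’re not working in a language with algebraic data types, one possible encoding of this AST uses Python dataclasses. This is just a sketch: the field names are my own, and the Expression and Statement aliases exist only to name the unions (the alias is called Expression so it doesn’t clash with the Exp statement).

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class Constant:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class UnOp:
    op: str
    operand: "Expression"

@dataclass
class BinOp:
    op: str
    left: "Expression"
    right: "Expression"

@dataclass
class Assign:
    name: str             # variable being assigned to
    value: "Expression"   # value to assign

Expression = Union[Assign, Var, BinOp, UnOp, Constant]

@dataclass
class Return:
    exp: Expression

@dataclass
class Declare:
    name: str
    init: Optional[Expression]  # None when there is no initializer

@dataclass
class Exp:
    exp: Expression

Statement = Union[Return, Declare, Exp]

@dataclass
class Function:
    name: str
    body: List[Statement]

@dataclass
class Program:
    function: Function


# int main() { int a = 1; a = a + 1; return a; }
program = Program(Function("main", [
    Declare("a", Constant(1)),
    Exp(Assign("a", BinOp("+", Var("a"), Constant(1)))),
    Return(Var("a")),
]))
```

Dataclasses compare by value, so two ASTs built from the same source are equal, which is handy when testing a parser.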

We also need to update our grammar. First, we need to update <function> to allow multiple statements.

Old definition:

<function> ::= "int" <id> "(" ")" "{" <statement> "}"

New definition:

<function> ::= "int" <id> "(" ")" "{" { <statement> } "}"

Thanks to the interspersed {/}, indicating repetition, and "{"/"}", indicating literal curly braces, this is almost completely unreadable. But it just means a function can have more than one statement now.

We need to handle multiple types of statement. We already have return statements:

"return" <exp> ";"

And standalone expressions are super easy:

<exp> ";"

A variable declaration needs a type specifier (int) followed by a name, optionally followed by an initializer. We use [] here to indicate something is optional:

"int" <id> [ "=" <exp> ] ";"

Let’s put it all together to get our new definition of <statement>:

<statement> ::= "return" <exp> ";"
              | <exp> ";"
              | "int" <id> [ "=" <exp> ] ";"

Finally, we need to update <exp>. Assignment is our lowest-precedence operator, so it becomes our top level <exp> expression. Also note that, unlike most of our other operators, it’s right-associative, which makes it a bit simpler to express.

<exp> ::= <id> "=" <exp> | <logical-or-exp>
<logical-or-exp> ::= <logical-and-exp> { "||" <logical-and-exp> }

The grammar for all our binary operations (<logical-and-exp> on down to <term>) is unchanged. We just need to change <factor> so we can refer to variables as well as constants:

<factor> ::= "(" <exp> ")" | <unary_op> <factor> | <int> | <id>

When you put it all together, here’s our new grammar, with changes bolded:

<program> ::= <function>
<function> ::= "int" <id> "(" ")" "{" { <statement> } "}"
<statement> ::= "return" <exp> ";"
              | <exp> ";"
              | "int" <id> [ "=" <exp> ] ";"
<exp> ::= <id> "=" <exp> | <logical-or-exp>
<logical-or-exp> ::= <logical-and-exp> { "||" <logical-and-exp> } 
<logical-and-exp> ::= <equality-exp> { "&&" <equality-exp> }
<equality-exp> ::= <relational-exp> { ("!=" | "==") <relational-exp> }
<relational-exp> ::= <additive-exp> { ("<" | ">" | "<=" | ">=") <additive-exp> }
<additive-exp> ::= <term> { ("+" | "-") <term> }
<term> ::= <factor> { ("*" | "/") <factor> }
<factor> ::= "(" <exp> ")" | <unary_op> <factor> | <int> | <id>
<unary_op> ::= "!" | "~" | "-"

☑ Task:

Update your expression-parsing code to handle variable declaration, assignment, and references. It should successfully parse all valid stage 1-5 examples in the test suite. The invalid examples are a little different this week. Some of them should fail during parsing; others can be parsed successfully but should cause errors during code generation (e.g. because they reference variables that haven’t been declared.) I decided to deal with this in the laziest way possible; the names of the invalid examples that should fail during parsing all start with syntax_err.

Code Generation

We need to save local variables somewhere, so we’ll save them on the stack. We also need to remember exactly where on the stack each variable was saved, so we can refer to it later. To track this information, we’ll create a map from variable names to locations.

But how are we supposed to know a variable’s location at compile time? Absolute memory addresses aren’t determined until runtime. We could store the variable’s offset from ESP, except that the value of ESP changes whenever we push something onto the stack. The solution is to store the variable’s offset from a different register, EBP. To understand why this will work, we need to know a little bit about stack frames.

Stack Frames

Whenever we call a function, we allocate a chunk of memory for it on top of the stack – this memory is called the stack frame. The stack frame holds function arguments, the address to jump to after the function returns, and of course local variables. We already know that ESP points to the top of the stack, which is also the top of the current stack frame¹. The EBP (or base pointer) register points to the bottom of the current stack frame. Without EBP, we wouldn’t know where one stack frame ends and the next begins, and we wouldn’t be able to find important values like a function’s return address.

When a function (let’s call it f) returns, its caller needs to be able to pick up where it left off. That means its stack frame, and the values in ESP and EBP, all need to be exactly the same as they were before f was called. The first thing f needs to do is set up a new stack frame for itself, using the following instructions:

    push %ebp       ; save old value of EBP
    movl %esp, %ebp ; current top of stack is bottom of new stack frame

These instructions are called the function prologue. Immediately before f returns, it executes the function epilogue to remove this stack frame, leaving everything just as it was before the function prologue:

    movl %ebp, %esp ; restore ESP; now it points to old EBP
    pop %ebp        ; restore old EBP; now ESP is where it was before prologue
    ret

Up to this point, we could get away with not having a function prologue or epilogue, but now we need to add them. Adding them helps us in two ways:

We can store variable locations as offsets from EBP. We know there’s nothing above EBP (because we set up an empty stack frame in the function prologue), and we know that EBP won’t change until the function epilogue.
We can safely push local variables onto the stack without changing the caller’s stack frame².

You should generate the function prologue at the start of the function definition, right after the function’s label. You should generate the function epilogue as part of the return statement, right before ret.
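
As a minimal sketch of where the prologue and epilogue end up, here’s one way to wire them into code generation. It assumes the generator builds a list of assembly lines; the helper names and the _main label are just for illustration.

```python
PROLOGUE = ["    push %ebp", "    movl %esp, %ebp"]
EPILOGUE = ["    movl %ebp, %esp", "    pop %ebp", "    ret"]

def generate_function(label, body_lines):
    # The prologue goes right after the function's label.
    return [f"{label}:"] + PROLOGUE + body_lines

def generate_return(exp_lines):
    # The epilogue (which ends in ret) follows the code that leaves
    # the return value in EAX.
    return exp_lines + EPILOGUE

asm = generate_function("_main", generate_return(["    movl $2, %eax"]))
```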

Besides our variable map, we need to keep track of a stack index, which tells us the offset of the next available spot on the stack, relative to EBP. The next available spot is always the four-byte stack slot right after ESP, at ESP - 4. Right after the function prologue, EBP and ESP are the same. That means the stack index will also be -4. Whenever we push a variable onto the stack, we’ll decrement the stack index by 4³.

Now let’s look at how we can handle declaring, assigning, and referring to variables.

Variable Declaration

When you encounter a variable declaration, just save the variable onto the stack and add it to the variable map⁴. Note that it’s illegal to declare a variable twice in the same local scope⁵, as in the following code snippet:

int a;
int a;

So your program should fail if the variable is already in the variable map. Here’s how you might generate assembly for the statement int a = expression:

  if var_map.contains("a"):
    fail() //shouldn't declare a var twice
  generate_exp(expression)      // generate assembly to calculate expression and move it to eax
  emit "    pushl %eax" // save initial value of "a" onto the stack
  var_map = var_map.put("a", stack_index) // record location of a in the variable map
  stack_index = stack_index - 4 // stack location of next address will be 4 bytes lower

A few points here:

If a variable isn’t initialized, you can just initialize it to 0. Or whatever you want, really.
The variable map exists during code generation, not at runtime.
You should definitely use an immutable data structure for your variable map. In the next post we’ll add if statements, and then we’ll have nested scopes; a variable declared inside an if block isn’t accessible outside it. If you have to worry about code from an inner scope messing with the variable map in an outer scope, you will not be a happy camper.
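
To make the last point concrete, here’s a small sketch of an “immutable” update. Python dicts are mutable, so this example just treats them as immutable by convention: the hypothetical put helper always returns a new map instead of modifying the one it was given.

```python
def put(var_map, name, offset):
    # Build a new map; the caller's map is left untouched.
    return {**var_map, name: offset}

outer = put({}, "a", -4)
inner = put(outer, "b", -8)  # e.g. a declaration inside a nested block
```

After this runs, inner knows about both a and b, but outer still only contains a, so code generated for the outer scope can’t accidentally see b.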

Variable Assignment

We can look up a variable’s location in memory in our map; to assign it a new value, just move that value to the right memory location. Here’s how to handle a = expression:

  generate_exp(expression) // generate assembly to calculate expression and move it to eax 
  var_offset = var_map.find("a") //if "a" isn't in the map, fail b/c it hasn't been declared yet
  emit "    movl %eax, {}(%ebp)".format(var_offset) //using python-style string formatting here

Note that the value of expression is still in EAX, so this assignment expression has the correct value.

Variable Reference

To refer to a variable in an expression, just copy it from the stack to EAX:

  var_offset = var_map.find("a") //find location of variable "a" on the stack
                                 //should fail if it hasn't been declared yet
  emit "    movl {}(%ebp), %eax".format(var_offset) //retrieve value of variable
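
Putting declaration, assignment, and reference together, a runnable version of the pseudocode above might look something like this. It’s only a sketch: the tuple-based AST, the function names, and the choice to return lists of assembly lines are all assumptions made for the example.

```python
def generate_exp(exp, var_map):
    kind = exp[0]
    if kind == "Const":
        return [f"    movl ${exp[1]}, %eax"]
    if kind == "Var":
        if exp[1] not in var_map:
            raise NameError(f"undeclared variable {exp[1]}")
        return [f"    movl {var_map[exp[1]]}(%ebp), %eax"]
    if kind == "Assign":
        _, name, value = exp
        if name not in var_map:
            raise NameError(f"undeclared variable {name}")
        # The value is still in EAX afterwards, so the assignment
        # expression itself evaluates to the assigned value.
        return generate_exp(value, var_map) + [
            f"    movl %eax, {var_map[name]}(%ebp)"
        ]
    raise ValueError(f"unknown expression kind {kind}")

def generate_declaration(name, init, var_map, stack_index):
    if name in var_map:
        raise NameError(f"{name} declared twice")
    if init is None:
        init = ("Const", 0)  # uninitialized variables default to 0 here
    code = generate_exp(init, var_map) + ["    pushl %eax"]
    # Return a new map and the next free stack slot (4 bytes lower).
    return code, {**var_map, name: stack_index}, stack_index - 4

decl_code, var_map, stack_index = generate_declaration("a", ("Const", 2), {}, -4)
assign_code = generate_exp(("Assign", "a", ("Const", 3)), var_map)
ref_code = generate_exp(("Var", "a"), var_map)
```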

Missing Return Statements

Now that we support multiple types of statements, we can successfully parse programs with no return statement at all:

int main() {
  int a = 2;
}

What’s the expected behavior here? According to section 5.1.2.2.3 of the C11 standard:

If the return type of the main function is a type compatible with int, a return from the initial call to the main function is equivalent to calling the exit function with the value returned by the main function as its argument; reaching the } that terminates the main function returns a value of 0.

So, main needs to return 0 if it’s missing a return statement. Right now main is our only function, so that’s the only case we need to handle.

Eventually, we’ll need to deal with this problem in functions other than main. Here’s what section 6.9.1 of the standard says about missing return statements in general:

If the } that terminates a function is reached, and the value of the function call is used by the caller, the behavior is undefined.

So this program has undefined behavior:

int foo() {
  1 + 1;
}

int main() {
  return foo();
}

You could technically handle this however you want – fail, continue silently, issue a HALT AND CATCH FIRE instruction.

This program, on the other hand, is perfectly valid, because the value returned from foo() is never used:

int foo() {
  1 + 1;
}

int main() {
  foo();
  return 0;
}

Honestly, the specification here seems really dumb to me. If I write a non-void function without a return statement, that is WRONG and I want the compiler to save me from myself, even if I haven’t technically used it in an illegal way yet. I can’t think of any situation where we’d want this behavior; if you can, please let me know.

However, that’s the spec, so our functions have to return successfully even when they’re missing a return statement. That means you need to issue the function epilogue and ret instruction even if the return statement is missing. It’s probably easiest to handle main and all other functions uniformly, so you can just return 0 from any function without a return statement.
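
One possible way to handle this, sketched under the assumption that statements are tagged tuples and that the code for each return statement already includes the epilogue: after generating the function body, check whether the last statement was a return, and if not, append code to return 0. Checking only the final statement is enough at this stage because we don’t have any control flow yet.

```python
EPILOGUE = ["    movl %ebp, %esp", "    pop %ebp", "    ret"]

def finish_function(statements, body_lines):
    # statements: list of tagged tuples such as ("Return", exp)
    if statements and statements[-1][0] == "Return":
        return body_lines  # the return statement already emitted the epilogue
    # Fell off the end of the function: return 0.
    return body_lines + ["    movl $0, %eax"] + EPILOGUE

no_return = finish_function(
    [("Declare", "a", ("Const", 2))],
    ["    movl $2, %eax", "    pushl %eax"],
)
```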

☑ Task:

Update your code-generation pass to:

Generate function prologues and epilogues.
Generate correct code for variable declarations, assignments, and references.
Make main return 0 even if the return statement is missing.

Your code should succeed on all valid examples and fail on all invalid examples for stages 1-5.

Bonus features

At this point, there are a handful of other features you can implement pretty easily:

Compound Assignment Operators

+=
-=
/=
*=
%=
<<=
>>=
&=
|=
^=

Comma Operators

e1, e2. The result is the value of e2; the value of e1 is ignored.

Increment/Decrement Operators

Prefix and postfix ++
Prefix and postfix --

This week’s tests don’t cover these, so it’s up to you whether to implement them or skip them.

Up Next

I’m going to switch to one blog post every two weeks. In the next post, we’ll add if statements and conditional operators (a ? b : c). See you then!

Update 1/12

Corrected the “Missing Return Statements” section, which previously said that the behavior of main is undefined when it’s missing a return statement. Also updated the test suite accordingly.
Clarified that declaring a variable multiple times is sometimes legal at file scope.

Thanks to Olivier Gay for pointing out both those things.

If you have any questions, corrections, or other feedback, you can email me or open an issue.

¹ Keep in mind that the stack grows down towards lower addresses; we decrement ESP whenever we push things onto the stack, and ESP will always hold a lower value than EBP. So the top of the stack is really…on the bottom ¯_(ツ)_/¯ ↩

² Even though main is the only function, it still has a caller: it’s called by the setup routine, crt0. ↩

³ We don’t really need to keep track of the stack index, since we can just derive it from the size of the variable map. However, the stack index will come in handy once we add types other than int, since at that point our variables won’t all be the same size. If you don’t want to keep track of it for now, that’s fine with me. ↩

⁴ This is not at all how real compilers work; they usually allocate space for local variables all at once in the function prologue, or just store them in registers. Our way is less effort, though. ↩

⁵ It’s sometimes legal to declare a variable multiple times at file scope, per section 6.9.2 of the C11 specification. ↩

Writing a C Compiler, Part 4

2017-12-28T23:30:00+00:00

This is the fourth post in a series. Read part 1 here.

This week we’re adding some boolean operators (&&, ||) and a whole bunch of relational operators (<, ==, etc.). Since we already know how to handle binary operators, this week will be pretty straightforward. As always, you can find the accompanying tests here.

The test suite is slightly weird this week; the three tests whose names start with skip_on_failure_ use local variables, which we haven’t implemented yet. I’ve included them because otherwise the test suite can’t validate that short-circuiting works correctly. When you run the test suite, they should show up as NOT_IMPLEMENTED rather than FAIL in the results, and they shouldn’t count toward the total number of failures. Once you’ve implemented local variables, these tests should pass.

Week 4: Even More Binary Operators

We’re adding eight new operators this week:

Logical AND &&
Logical OR ||
Equal to ==
Not equal to !=
Less than <
Less than or equal to <=
Greater than >
Greater than or equal to >=

As usual, we’ll update our lexing, parsing, and code generation passes to support these operations.

Lexing

Each new operator corresponds to a new token. Here’s the full list of tokens we need to support, with old tokens at the top and new tokens in bold at the bottom:

Open brace {
Close brace }
Open parenthesis (
Close parenthesis )
Semicolon ;
Int keyword int
Return keyword return
Identifier [a-zA-Z]\w*
Integer literal [0-9]+
Minus -
Bitwise complement ~
Logical negation !
Addition +
Multiplication *
Division /
AND &&
OR ||
Equal ==
Not Equal !=
Less than <
Less than or equal <=
Greater than >
Greater than or equal >=

☑ Task:

Update the lex function to handle the new tokens. It should work for all valid and invalid stage 1-4 examples in the test suite, except the skip_on_failure_ ones.

Parsing

Last week, we found that we needed one production rule in our grammar for each operator precedence level. This week we have a lot more precedence levels, which means our grammar will grow a lot. However, our parsing strategy hasn’t changed at all; we’ll handle our new production rules exactly the same way as the old rules for exp and term. Honestly, this is going to be pretty tedious, but I hope it will help solidify all the stuff about parsing from last week.

Here are all our binary operators, from highest to lowest precedence¹:

Multiplication & division (*, /)
Addition & subtraction (+, -)
Relational less than/greater than/less than or equal/greater than or equal (<, >, <=, >=)
Relational equal/not equal (==, !=)
Logical AND (&&)
Logical OR (||)

We handled the first two bullet points last week; the last four are new. We’ll add a production rule for each of the last four bullet points. The new grammar is below, with changed/added rules bolded.

<program> ::= <function>
<function> ::= "int" <id> "(" ")" "{" <statement> "}"
<statement> ::= "return" <exp> ";"
<exp> ::= <logical-and-exp> { "||" <logical-and-exp> }
<logical-and-exp> ::= <equality-exp> { "&&" <equality-exp> }
<equality-exp> ::= <relational-exp> { ("!=" | "==") <relational-exp> }
<relational-exp> ::= <additive-exp> { ("<" | ">" | "<=" | ">=") <additive-exp> }
<additive-exp> ::= <term> { ("+" | "-") <term> }
<term> ::= <factor> { ("*" | "/") <factor> }
<factor> ::= "(" <exp> ")" | <unary_op> <factor> | <int>
<unary_op> ::= "!" | "~" | "-"

<additive-exp> is the same as <exp> from last week. We had to rename it because <exp> now refers to logical OR expressions, which now have lowest precedence.

Last week you wrote parse_exp and parse_term; now you’ll need parse_relational_exp, parse_equality_exp, etc. Other than handling different operators, these functions will all be identical.
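
Since these functions differ only in their operators and in which function parses their operands, you could generate them from a single helper instead of writing each one by hand. Here’s a sketch of that idea; make_level_parser is a made-up name, tokens are assumed to be plain strings, and parse_factor is stubbed to accept integer literals only so the example runs on its own.

```python
def make_level_parser(operators, parse_operand):
    # Build a parser for one precedence level:
    #   <level> ::= <operand> { <operator> <operand> }
    def parse_level(toks):
        left = parse_operand(toks)
        while toks and toks[0] in operators:
            op = toks.pop(0)
            right = parse_operand(toks)
            left = ("BinOp", op, left, right)  # left-associative
        return left
    return parse_level

def parse_factor(toks):
    # Stand-in for the real parse_factor: integer literals only.
    return ("Const", int(toks.pop(0)))

parse_term = make_level_parser({"*", "/"}, parse_factor)
parse_additive_exp = make_level_parser({"+", "-"}, parse_term)
parse_relational_exp = make_level_parser({"<", ">", "<=", ">="}, parse_additive_exp)
parse_equality_exp = make_level_parser({"!=", "=="}, parse_relational_exp)
parse_logical_and_exp = make_level_parser({"&&"}, parse_equality_exp)
parse_exp = make_level_parser({"||"}, parse_logical_and_exp)

tree = parse_exp(["1", "<", "2", "&&", "3", "==", "3"])
```

Here 1 < 2 && 3 == 3 comes out as (1 < 2) && (3 == 3), because the relational and equality levels sit below logical AND in the chain of parsers.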

And for the sake of completeness, here’s our AST definition:

program = Program(function_declaration)
function_declaration = Function(string, statement) //string is the function name
statement = Return(exp)
exp = BinOp(binary_operator, exp, exp) 
    | UnOp(unary_operator, exp) 
    | Constant(int)

This is identical to last week, except we’ve added more possible values of binary_operator.

☑ Task:

Update your expression-parsing code to handle this week’s new binary operators. It should successfully parse all valid stage 1-4 examples in the test suite (except the skip_on_failure ones), and fail on all invalid stage 1-4 examples. The test suite doesn’t directly verify that your program generates the correct AST, so you’ll need to manually inspect the AST for each example to make sure it’s right.

Code Generation

Our general approach to code generation for binary operations is the same as last week:

Calculate e1
Push it onto the stack
Calculate e2
Pop e1 from the stack back into a register
Perform the operation on e1 and e2.

All the new stuff will be in step 5.

Relational Operators

Let’s handle the relational operators first. Like the logical NOT operator (!) in week 2, these return 1 for true results and 0 for false results. These operators are almost identical to ! except that they compare two expressions to each other, instead of comparing an expression to zero.

Here’s the assembly we generated for ! in week 2:

    <CODE FOR exp GOES HERE>
    cmpl   $0, %eax    ;set ZF on if exp == 0, set it off otherwise
    movl   $0, %eax    ;zero out EAX (doesn't change FLAGS)
    sete   %al         ;set AL register (the lower byte of EAX) to 1 iff ZF is 1

We can modify this slightly to implement ==:

    <CODE FOR e1 GOES HERE>
    push   %eax          ; save value of e1 on the stack
    <CODE FOR e2 GOES HERE>
    pop    %ecx          ; pop e1 from the stack into ecx - e2 is already in eax
    cmpl   %eax, %ecx    ;set ZF on if e1 == e2, set it off otherwise
    movl   $0, %eax      ;zero out EAX (doesn't change FLAGS)
    sete   %al           ;set AL register (the lower byte of EAX) to 1 iff ZF is on

The sete instruction is just one of a whole slew of conditional set instructions. There’s also setne (set if not equal), setge (set if greater than or equal), and so on. To implement <, >, and the other relational operators, we can generate exactly the same assembly as we used for == above, just replacing sete with the appropriate conditional set instruction. Easy!

In week 2, we talked about testing for equality with the zero flag (ZF). But we can’t use ZF to determine which operand is larger. For that, we need the sign flag (SF), which is set if the result of an operation is negative, like so:

    movl $0, %eax ;zero out EAX
    movl $2, %ecx ;ECX = 2
    cmpl $3, %ecx ;compute 2 - 3, set flags
    setl %al      ;set AL if 2 < 3, i.e. if 2 - 3 is negative
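
Because only the conditional set instruction changes from one relational operator to the next, a lookup table covers all six. This is a sketch, assuming the code generator builds lists of assembly lines and represents operators as strings; the function and table names are invented for the example.

```python
SET_INSTRUCTIONS = {
    "==": "sete",
    "!=": "setne",
    "<": "setl",
    "<=": "setle",
    ">": "setg",
    ">=": "setge",
}

def generate_comparison(op, e1_lines, e2_lines):
    return (
        e1_lines
        + ["    push %eax"]          # save e1 on the stack
        + e2_lines                   # e2 ends up in EAX
        + [
            "    pop %ecx",          # e1 is now in ECX
            "    cmpl %eax, %ecx",   # compute e1 - e2 and set flags
            "    movl $0, %eax",     # zero out EAX without changing flags
            f"    {SET_INSTRUCTIONS[op]} %al",
        ]
    )

asm = generate_comparison("<", ["    movl $2, %eax"], ["    movl $3, %eax"])
```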

Now let’s talk about && and ||. I’ll use & and | to indicate bitwise AND and OR, respectively.

Short-Circuit Evaluation

The C11 standard guarantees that evaluation of && and || will short-circuit: if we know the result after evaluating the first clause, we don’t evaluate the second clause². For example, consider the following line of code:

return 0 && foo();

Because the first clause is false, we don’t need to know the return value of foo, so we won’t call foo at all. Whether foo is called won’t change the return value on this line, but it could perform I/O, update global variables, or have other important side effects. So making sure that && and || short-circuit isn’t just a performance optimization; it’s required for some programs to execute correctly.

Logical OR

To guarantee that logical OR short-circuits, we’ll need to jump over clause 2 when clause 1 is true. We’ll follow these steps to calculate e1 || e2:

Calculate e1
If the result is 0, jump to step 4.
Set EAX to 1 and jump to the end.
Calculate e2.
If the result is 0, set EAX to 0. Otherwise set EAX to 1.

Step 2 will require a new type of instruction: the conditional jump. Conditional jumps are similar to the conditional set instructions, like sete and setne, that we’ve already used. The only difference is that instead of setting a byte to 1, they jump to a specific point in the assembly code, which we specify with a label. Here’s an example of je, the “jump if equal” instruction, in action:

    cmpl $0, %eax ; set ZF if EAX == 0
    je _there    ; if ZF is set, go to _there
    movl $1, %eax
    ret
_there:
    movl $2, %eax
    ret

If EAX is 0 at the start of this code snippet, it will return 2; otherwise it will return 1. Let’s look at exactly what instructions will execute in each case.

First consider the case where EAX is 0 at the start:

cmpl $0, %eax Because EAX is 0, this will set the zero flag (ZF) to true.
je _there Because ZF is true, it will jump.
movl $2, %eax This executes next because it’s the first instruction after _there. It sets EAX to 2.
ret The return value will be 2.

Now consider the case where EAX isn’t zero:

cmpl $0, %eax Because EAX isn’t 0, this will set ZF to false.
je _there Because ZF is false, it will not jump, so this instruction is a no-op.
movl $1, %eax Since we didn’t jump, control passes to the next instruction as usual. It sets EAX to 1.
ret The return value will be 1.

We’ll also need the jmp instruction, which performs an unconditional jump. Here’s an example of jmp in action:

    movl $0, %eax ; zero out EAX
    jmp _there    ; go to _there label
    movl $5, %eax ; this will never execute, we always jump over it
_there:
    ret           ; will always return zero

Now that we’re familiar with jmp and je, here’s the assembly for e1 || e2:

    <CODE FOR e1 GOES HERE>
    cmpl $0, %eax            ; check if e1 is true
    je _clause2              ; e1 is 0, so we need to evaluate clause 2
    movl $1, %eax            ; we didn't jump, so e1 is true and therefore result is 1
    jmp _end
_clause2:
    <CODE FOR e2 GOES HERE>
    cmpl $0, %eax            ; check if e2 is true
    movl $0, %eax            ; zero out EAX without changing ZF
    setne %al                ; set AL register (the low byte of EAX) to 1 iff e2 != 0
_end:

Note that labels have to be unique. This means you can’t actually use _clause2 or _end as labels, because you’ll have duplicate labels if your program includes more than one logical OR. You should probably write a utility function to generate unique labels. It doesn’t have to be fancy; the label generator in nqcc just includes an incrementing counter in every label.

The _end label here may look odd, since it doesn’t appear to label anything. Actually, it labels whatever comes right after this expression; it just gives us a target to jump over _clause2.

Logical AND

Logical AND is almost identical to logical OR, except that we short-circuit if e1 is 0. In that case we don’t need to move anything into EAX, since 0 is already the result we want. We use the jne (jump if not equal) instruction to jump to clause 2 when e1 is nonzero. Here’s the assembly:

    <CODE FOR e1 GOES HERE>
    cmpl $0, %eax            ; check if e1 is true
    jne _clause2             ; e1 isn't 0, so we need to evaluate clause 2
    jmp _end
_clause2:
    <CODE FOR e2 GOES HERE>
    cmpl $0, %eax            ; check if e2 is true
    movl $0, %eax            ; zero out EAX without changing ZF
    setne %al                ; set AL register (the low byte of EAX) to 1 iff e2 != 0
_end:

As with logical OR, we need to make sure the labels are unique.
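
Here’s a sketch of how both operators could share one code-generation helper that also takes care of unique labels. It follows the same incrementing-counter idea mentioned above, but the function name and label format are made up for this example and aren’t taken from nqcc.

```python
import itertools

_label_ids = itertools.count()

def generate_logical(op, e1_lines, e2_lines):
    n = next(_label_ids)  # a fresh number keeps every pair of labels unique
    clause2, end = f"_clause2_{n}", f"_end_{n}"
    if op == "||":
        # e1 is 0: evaluate clause 2. Otherwise the result is 1.
        short_circuit = [f"    je {clause2}", "    movl $1, %eax", f"    jmp {end}"]
    else:  # "&&"
        # e1 is nonzero: evaluate clause 2. Otherwise EAX is already 0.
        short_circuit = [f"    jne {clause2}", f"    jmp {end}"]
    return (
        e1_lines
        + ["    cmpl $0, %eax"]
        + short_circuit
        + [f"{clause2}:"]
        + e2_lines
        + ["    cmpl $0, %eax", "    movl $0, %eax", "    setne %al", f"{end}:"]
    )

or_asm = generate_logical("||", ["    movl $0, %eax"], ["    movl $5, %eax"])
and_asm = generate_logical("&&", ["    movl $1, %eax"], ["    movl $0, %eax"])
```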

☑ Task:

Update your code-generation pass to emit correct code for &&, ||, ==, !=, <, <=, >, and >=. It should succeed on all valid examples (except the skip_on_failure_ ones) and fail on all invalid examples for stages 1-4.

Other Binary Operators

We still haven’t implemented all the binary operators! We can’t implement assignment operators yet (like += and -=), because we don’t have support for local variables. But there are other operators you should be able to implement on your own now:

Modulo %
Bitwise AND &
Bitwise OR |
Bitwise XOR ^
Bitwise shift left <<
Bitwise shift right >>

This week’s tests don’t cover these, so it’s up to you whether to implement them or skip them.

Up Next

Next week we’ll add local variables! That means we’ll finally be able to write programs that aren’t just return statements. See you then!

Update 2/2/2019

Updated code generation for logical AND and OR to short-circuit correctly.

If you have any questions, corrections, or other feedback, you can email me or open an issue.

¹ You can find a complete C operator precedence table here.↩

² See section 6.5.13, paragraph 4, for logical AND and 6.5.14, paragraph 4 for logical OR.↩

Writing a C Compiler, Part 3

2017-12-15T20:30:00+00:00

This is the third post in a series. Read part 1 here.

This week we’ll add binary operations to support basic arithmetic. We’ll figure out how to correctly handle operator precedence and associativity. You can find the accompanying tests here.

Week 3: Binary Operators

This week we’re adding several binary operations (operators that take two values):

Addition +
Subtraction -
Multiplication *
Division /

As usual, we’ll update each stage of the compiler to support these operations.

Lexing

Each of the operators above will require a new token, except for subtraction – we already have a - token. It gets tokenized the same way whether it’s a subtraction or negation operator; we’ll figure out how to interpret it during the parsing stage. Arithmetic expressions can also contain parentheses, but we already have tokens for those too, so we don’t need to change our lexer at all to handle them.

Here’s the full list of tokens we need to support. Tokens from previous weeks are at the top, new tokens are bolded at the bottom:

Open brace {
Close brace }
Open parenthesis (
Close parenthesis )
Semicolon ;
Int keyword int
Return keyword return
Identifier [a-zA-Z]\w*
Integer literal [0-9]+
Minus -
Bitwise complement ~
Logical negation !
Addition +
Multiplication *
Division /

☑ Task:

Update the lex function to handle the new tokens. It should work for all stage 1, 2, and 3 examples in the test suite, including the invalid ones.

Parsing

This week we’ll need to add another expression type to our AST: binary operations. Here’s the latest set of definitions for our AST nodes; only the definition of exp has changed:

program = Program(function_declaration)
function_declaration = Function(string, statement) //string is the function name
statement = Return(exp)
exp = BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)

Note that we now distinguish between binary and unary operators in our AST definition. For example, - as in negation and - as in subtraction would be different types/classes/whatever in our AST. Here’s how we might construct the AST for 2 - (-3):

two = Const(2)
three = Const(3)
neg_three = UnOp(NEG, three)
exp = BinOp(MINUS, two, neg_three)

We also need to change the definition of <exp> in our grammar. The most obvious definition is something like this:

<exp> ::= <exp> <binary_op> <exp> | <unary_op> <exp> | "(" <exp> ")" | <int>

But there are several related problems with this grammar:

It doesn’t handle operator precedence. Consider the expression 2 + 3 * 4. Using the grammar above, you can construct two possible parse trees:

Tree #1

Tree #2

Using the first parse tree, this expression evaluates to (2 + 3) * 4 = 20. Using the second, it’s 2 + (3 * 4) = 14. According to the C standard and mathematical convention, * has higher precedence than +, so the second parse tree is correct. Our grammar has to encode this precedence somehow.

This is a problem with our unary operations too – according to this grammar, ~2 + 3 could be parsed as ~(2 + 3), which is of course wrong.

It doesn’t handle associativity. Operations at the same precedence level should be evaluated left-to-right¹. For example 1 - 2 - 3 should be parsed as (1 - 2) - 3. But, according to the grammar above, parsing it as 1 - (2 - 3) is also valid.
It’s left-recursive. In the grammar above, one of the production rules for <exp> is:
```
 <exp> <binary_op> <exp>
```
In this production rule, the left-most (i.e. first) symbol is also <exp> — that’s what left-recursive means. Left-recursive grammars aren’t incorrect, but recursive descent (RD) parsers can’t handle them². We’ll talk about why this is a problem later in this post.

Let’s start by tackling problem #1, precedence³. We’ll handle unary operators first – they always have higher precedence than binary operators. A unary operator should only be applied to a whole expression if:

the expression is a single integer (e.g. ~4)
the expression is wrapped in parentheses (e.g. ~(1+1)), or
the expression is itself a unary operation (e.g. ~!8, -~(2+2)).

To express this, we’re going to need another symbol in our grammar to refer to “an expression a unary operator can be applied to”. We’ll call it a factor. We’ll rewrite our grammar like this:

<exp> ::= <exp> <binary_op> <exp> | <factor>
<factor> ::= "(" <exp> ")" | <unary_op> <factor> | <int>

We’ve now created two levels of precedence: one for binary operations, one for unary operations. We’re also handling parentheses correctly – putting an expression inside parentheses forces higher precedence.

We can make a similar change to force * and / to be higher precedence than + and -. We added a <factor> symbol before, representing the operands of unary operations. Now we’ll add a <term> symbol, representing the operands of multiplication and division⁴.

<exp> ::= <exp> ("+" | "-") <exp> | <term>
<term> ::= <term> ("*" | "/") <term> | <factor>
<factor> ::= "(" <exp> ")" | <unary_op> <factor> | <int>

This grammar encodes operator precedence correctly. There’s now only one possible parse tree for 2 + 3 * 4:

That’s problem #1 solved. Problem #2 was that this grammar didn’t handle associativity. If you’re not using an RD parser, you generally use left recursive production rules for left-associative operations, and right recursive rules for right-associative operations. In that case, we could rewrite the rule for <exp> like so:

<exp> ::= <exp> ("+" | "-") <term> | <term>

This would force addition and subtraction to be left-associative; you can’t parse 1 - 2 - 3 as 1 - (2 - 3) because 2 - 3 isn’t a term.

But we are using an RD parser, so we can’t handle this left recursive rule. To understand why this won’t work, let’s try writing a function to parse expressions according to this rule.

def parse_expression(tokens):
    //determine which of two production rules applies:
    //  * <exp> ("+" | "-") <term>
    //  * <term>
    if is_term(tokens): //how do we figure this out???
        return parse_term(tokens)
    else:
        //recursively call parse_expression to handle it
        e1 = parse_expression(tokens) //recurse forever ☠️

To figure out which production rule to use, we can’t just look at the first token or two – we need to know if there’s a + or - operation anywhere in this expression. And if we determine that this expression is a sum or difference, we’re going to call parse_expression recursively forever. In order to not do that, we’d need to find the last <term> at the end of the expression, parse and remove that, then go back and parse the rest. Both of these problems (figuring out which production rule to use, and parsing the last term at the end) require us to look ahead an arbitrary number of tokens until we hit the end of the expression. You might be able to get this approach to work – I’m not sure if there are existing parsing algorithms similar to this – but it will be complicated, and it definitely won’t be a recursive descent parser. So we’re not going to do that.

If, on the other hand, we just switched around <term> and <exp> to avoid left recursion, we’d have this rule:

<exp> ::= <term> ("+" | "-") <exp> | <term>

This is easy to parse but wrong – it’s right-associative. Using this grammar, you would have to parse 1 - 2 - 3 as 1 - (2 - 3).

So our options seem to be an unparseable left recursive grammar, or an incorrect right recursive grammar. Luckily, there’s another solution. We’ll introduce repetition into our grammar, so we can define an expression as a term, possibly plus or minus a term, possibly plus or minus another term…and so on forever. In EBNF notation, wrapping something in curly braces ({}) means it can be repeated zero or more times. Here’s the final grammar we’ll be using for expressions this week:

<exp> ::= <term> { ("+" | "-") <term> }
<term> ::= <factor> { ("*" | "/") <factor> }
<factor> ::= "(" <exp> ")" | <unary_op> <factor> | <int>

This grammar handles precedence correctly, and it’s neither left-recursive nor right-associative. However, it’s not really left-associative either. Before, an <exp> was a binary operation with two terms – now it has an arbitrary number of terms. If you have a bunch of operations at the same precedence level (like in 1 - 2 - 3), this grammar doesn’t provide any way to group them into sub-expressions.

That’s okay, though! Our grammar doesn’t need to correspond perfectly with our AST. We can still build our AST in a left-associative way. We’ll parse the first term, and then, if there are any more terms, we process them in a loop, constructing a new BinOp node at each iteration. Here’s the pseudocode for this:

def parse_expression(toks):
    term = parse_term(toks) //pops off some tokens
    next = toks.peek() //check the next token, but don't pop it off the list yet
    while next == PLUS or next == MINUS: //there's another term!
        op = convert_to_op(toks.next())
        next_term = parse_term(toks) //pops off some more tokens
        term = BinOp(op, term, next_term)
        next = toks.peek()

    return term

We can use exactly the same approach in parse_term.
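
To check that the loop really builds a left-associative tree, here’s a runnable version of the pseudocode above plus a tiny evaluator. It’s a sketch with some simplifying assumptions: tokens are plain strings, AST nodes are tuples, and parse_term is stubbed to accept integer literals only.

```python
def parse_term(toks):
    # Stand-in for the real parse_term: integer literals only.
    return ("Const", int(toks.pop(0)))

def parse_expression(toks):
    term = parse_term(toks)
    while toks and toks[0] in ("+", "-"):  # there's another term!
        op = toks.pop(0)
        next_term = parse_term(toks)
        # Everything parsed so far becomes the left operand.
        term = ("BinOp", op, term, next_term)
    return term

def evaluate(node):
    if node[0] == "Const":
        return node[1]
    _, op, left, right = node
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if op == "+" else lhs - rhs

tree = parse_expression(["1", "-", "2", "-", "3"])
result = evaluate(tree)
```

The tree groups as (1 - 2) - 3, so result is -4; a right-associative parse would have given 1 - (2 - 3) = 2.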

parse_factor is straightforward, since it doesn’t have to handle associativity. We’ll look at the first token to identify which production rule to use; pop off constants and make sure they have the value we expect; and call other functions to handle non-terminal symbols.

def parse_factor(toks):
    next = toks.next()
    if next == OPEN_PAREN:
        //<factor> ::= "(" <exp> ")"
        exp = parse_expression(toks) //parse expression inside parens
        if toks.next() != CLOSE_PAREN: //make sure parens are balanced
            fail()
        return exp
    else if is_unop(next):
        //<factor> ::= <unary_op> <factor>
        op = convert_to_op(next)
        factor = parse_factor(toks)
        return UnOp(op, factor)
    else if next.type == "INT":
        //<factor> ::= <int>
        return Const(convert_to_int(next))
    else:
        fail()
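
To make the pseudocode above concrete, here’s a minimal runnable sketch in Python. It’s one way to do it, not the reference implementation: the toy lexer (tokenize), the Tokens helper with its peek/next interface, and the dataclass AST nodes are all assumptions made up for this example, and tokens are plain strings rather than typed token objects. The unary operators it accepts (-, ~, !) are assumed to be the ones from the previous stage.

```python
import re
from dataclasses import dataclass

@dataclass
class Const:
    value: int

@dataclass
class UnOp:
    op: str
    operand: object

@dataclass
class BinOp:
    op: str
    left: object
    right: object

class Tokens:
    """Hypothetical token stream with the peek/next interface the pseudocode assumes."""
    def __init__(self, toks):
        self.toks = list(toks)
        self.pos = 0

    def peek(self):
        # Look at the next token without consuming it (None at end of input).
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def next(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

def tokenize(src):
    # Toy lexer: integers plus single-character operators and parens.
    toks = re.findall(r"\d+|[-+*/()~!]", src)
    if "".join(toks) != "".join(src.split()):
        raise SyntaxError("unrecognized character in input")
    return Tokens(toks)

def parse_expression(toks):
    # <exp> ::= <term> { ("+" | "-") <term> }
    term = parse_term(toks)
    while toks.peek() in ("+", "-"):
        op = toks.next()
        # The tree built so far becomes the LEFT child: left-associative.
        term = BinOp(op, term, parse_term(toks))
    return term

def parse_term(toks):
    # <term> ::= <factor> { ("*" | "/") <factor> }
    factor = parse_factor(toks)
    while toks.peek() in ("*", "/"):
        op = toks.next()
        factor = BinOp(op, factor, parse_factor(toks))
    return factor

def parse_factor(toks):
    # <factor> ::= "(" <exp> ")" | <unary_op> <factor> | <int>
    tok = toks.next()
    if tok == "(":
        exp = parse_expression(toks)
        if toks.next() != ")":  # make sure parens are balanced
            raise SyntaxError("expected ')'")
        return exp
    if tok in ("-", "~", "!"):
        return UnOp(tok, parse_factor(toks))
    if tok.isdigit():
        return Const(int(tok))
    raise SyntaxError(f"unexpected token {tok!r}")
```

With this sketch, parse_expression(tokenize("1 - 2 - 3")) builds a subtraction whose left child is the subtraction 1 - 2 and whose right child is 3, which is the (1 - 2) - 3 grouping we want. Because each loop iteration wraps the tree built so far as the left operand, the repetition in the grammar still produces a left-associative AST.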

For completeness, here’s this week’s full grammar, including the new stuff above (expressions, terms, factors) and stuff that hasn’t changed since last week (functions, statements, etc.):

<program> ::= <function>
<function> ::= "int" <id> "(" ")" "{" <statement> "}"
<statement> ::= "return" <exp> ";"
<exp> ::= <term> { ("+" | "-") <term> }
<term> ::= <factor> { ("*" | "/") <factor> }
<factor> ::= "(" <exp> ")" | <unary_op> <factor> | <int>

☑ Task:

Update your expression-parsing code to handle addition, subtraction, multiplication, and division. It should successfully parse all valid stage 1, 2, and 3 examples in the test suite, and fail on all invalid stage 1, 2, and 3 examples. Manually inspect the AST for each example to make sure it handles associativity and operator precedence correctly. If you haven’t written a pretty printer yet, you’ll probably need to do that now.
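
If you need a starting point for the pretty printer, here’s a small sketch in Python. The node classes (Const, UnOp, BinOp) and the pretty function are hypothetical names chosen for this example; your AST will probably look different, but the idea of printing one node per line and indenting children carries over.

```python
from dataclasses import dataclass

@dataclass
class Const:
    value: int

@dataclass
class UnOp:
    op: str
    operand: object

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def pretty(node, depth=0):
    """Return an indented, one-node-per-line rendering of an expression AST."""
    pad = "  " * depth
    if isinstance(node, Const):
        return f"{pad}Const({node.value})"
    if isinstance(node, UnOp):
        return f"{pad}UnOp({node.op})\n" + pretty(node.operand, depth + 1)
    if isinstance(node, BinOp):
        return (f"{pad}BinOp({node.op})\n"
                + pretty(node.left, depth + 1) + "\n"
                + pretty(node.right, depth + 1))
    raise TypeError(f"not an AST node: {node!r}")

# (1 - 2) - 3, the left-associative reading of "1 - 2 - 3"
tree = BinOp("-", BinOp("-", Const(1), Const(2)), Const(3))
print(pretty(tree))
# BinOp(-)
#   BinOp(-)
#     Const(1)
#     Const(2)
#   Const(3)
```

In output like this, associativity is easy to check by eye: for 1 - 2 - 3, the nested BinOp should appear as the first (left) child of the outer one.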

Code Generation

There’s a new challenge in the code generation stage this week. To handle a binary expression, like e1 + e2, our generated assembly needs to:

Calculate e1 and save it somewhere.
Calculate e2.
Add e1 to e2, and store the result in EAX.

So, we need somewhere to save the first operand. Saving it in a register would be complicated; the second operand can itself contain subexpressions, so it might also need to save intermediate results in a register, potentially overwriting e1⁵. Instead, we’ll save the first operand on the stack.

Let’s talk about the stack briefly. Every process on a computer has some memory. This memory is divided into several segments, one of which is the call stack, or just the stack. The address of the top of the stack is stored in the ESP register, aka the stack pointer. Like with most stacks, you can push things onto the top, or pop things off the top; x86 includes push and pop instructions to do just that. One confusing thing about the stack is that it grows towards lower memory addresses – when you push something onto the stack, you decrement ESP. The processor relies on ESP to figure out where the top of the stack is. So, pushl val does the following⁶:

Writes val to the next empty spot on the stack (i.e. ESP - 4)
Decrements ESP by 4, so it contains the memory address of val.

Along the same lines, popl dest does the following:

Reads the value from the top of the stack (i.e. the value at the memory address in ESP).
Copies that value into dest, which is a register or other memory location.
Increments ESP by 4, so it points to the value just below val.
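
The push and pop steps above can be modeled with a few lines of Python. This is only a toy: the Stack32 class, the dictionary standing in for memory, and the starting address 0x1000 are all made up for illustration. The model decrements ESP and then writes, which has the same net effect as the two steps listed for pushl.

```python
class Stack32:
    """Toy model of a 32-bit stack: memory is a dict from address to value."""
    def __init__(self, esp=0x1000):
        self.esp = esp  # address of the current top of the stack
        self.mem = {}

    def pushl(self, val):
        self.esp -= 4              # the stack grows toward lower addresses
        self.mem[self.esp] = val   # val now lives at the address in ESP

    def popl(self):
        val = self.mem[self.esp]   # read the value at the top of the stack
        self.esp += 4              # the top is now the previous entry
        return val

s = Stack32()
s.pushl(10)
s.pushl(20)
print(hex(s.esp))  # 0xff8
print(s.popl())    # 20
print(s.popl())    # 10
print(hex(s.esp))  # 0x1000
```

Two pushes move ESP down by 8 bytes, and values come back off in the reverse of the order they went on.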

But right now, all you really need to know is that there’s a stack, push puts things on it and pop takes things off it. So, here’s our assembly for e1 + e2:

    <CODE FOR e1 GOES HERE>
    push %eax ; save value of e1 on the stack
    <CODE FOR e2 GOES HERE>
    pop %ecx ; pop e1 from the stack into ecx
    addl %ecx, %eax ; add e1 to e2, save results in eax

You can handle e1 * e2 exactly the same way, using imul instead of addl. Subtraction is a bit more complicated because order of operands matters; subl src, dst computes dst - src, and saves the result in dst. You’ll need to make sure e1 is in dst and e2 is in src – and, of course, that the result ends up in EAX. Division is even trickier: idivl dst treats EDX and EAX as a single, 64-bit register and calculates [EDX:EAX] / dst. It then stores the quotient in EAX and the remainder in EDX. To make it work, you’ll need to first move e1 into EAX, and then sign-extend it into EDX using the cdq instruction before issuing the idivl instruction⁷. You can check the Intel Software Developer’s Manual for more details on any of these instructions.
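
Here’s a sketch of how a code generator might emit these sequences, written in Python. Treat it as one possible arrangement rather than the only correct one: the gen_exp function and the node classes are hypothetical, it handles only constants and the four binary operators (no unary operators), and using ECX as the scratch register simply follows the addition example above.

```python
from dataclasses import dataclass

@dataclass
class Const:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def gen_exp(node):
    """Return a list of AT&T-syntax instructions that leave the result in EAX."""
    if isinstance(node, Const):
        return [f"movl ${node.value}, %eax"]
    code = gen_exp(node.left)      # e1 ends up in eax
    code.append("push %eax")       # save e1 on the stack
    code += gen_exp(node.right)    # e2 ends up in eax
    if node.op == "+":
        code += ["pop %ecx", "addl %ecx, %eax"]
    elif node.op == "*":
        code += ["pop %ecx", "imul %ecx, %eax"]
    elif node.op == "-":
        # subl src, dst computes dst - src, so e1 must be in eax and e2 in ecx
        code += ["movl %eax, %ecx", "pop %eax", "subl %ecx, %eax"]
    elif node.op == "/":
        # idivl divides edx:eax by its operand: e1 goes in eax,
        # cdq sign-extends it into edx, and e2 waits in ecx
        code += ["movl %eax, %ecx", "pop %eax", "cdq", "idivl %ecx"]
    else:
        raise ValueError(f"unknown operator {node.op!r}")
    return code

print("\n".join(gen_exp(BinOp("-", Const(5), Const(3)))))
# movl $5, %eax
# push %eax
# movl $3, %eax
# movl %eax, %ecx
# pop %eax
# subl %ecx, %eax
```

Since every sub-expression pushes exactly once and pops exactly once, nested expressions like (1 + 2) * (3 - 4) leave the stack balanced, and e1 can’t be clobbered while e2 is being calculated.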

☑ Task:

Update your code-generation pass to emit correct code for addition, subtraction, division, and multiplication. It should succeed on all valid examples and fail on all invalid examples for stages 1, 2, and 3.

Using GDB

If you run into trouble during code generation, you may want to step through it with GDB or LLDB. I’m not going to cover how to use GDB here (I list some tutorials under “Further Reading”), but here are a few tips specifically for stepping through assembly without a symbol table:

You can use nexti and stepi to step through one assembly instruction at a time.
In GDB, but not LLDB, layout asm displays the assembly as you step through it, and layout regs displays the registers.
You can set breakpoints at functions (e.g. b main) even when your binary doesn’t have debug symbols.

Also, running GDB on OS X is a pain in the butt. You can find instructions about how to get it working here, or you can use LLDB instead. I kind of hate LLDB but maybe I’m just not used to it.

Up Next

Next week, I’ll be on vacation. The week after that, we’ll add more binary operators to support comparisons and boolean logic. See you then!

If you have any questions, corrections, or other feedback, you can email me or open an issue.
