Writing a C Compiler, Part 4
This is the fourth post in a series. Read part 1 here.
This week we’re adding some boolean operators (&&
, ||
) and a whole bunch of relational operators (<
, ==
, etc.). Since we already know how to handle binary operators, this week will be pretty straightforward. As always, you can find the accompanying tests here.
The test suite is slightly weird this week; the three tests whose names start with skip_on_failure_
use local variables, which we haven’t implemented yet. I’ve included them because otherwise the test suite can’t validate that short-circuiting works correctly. When you run the test suite, they should show up as NOT_IMPLEMENTED
rather than FAIL
in the results, and they shouldn’t count toward the total number of failures. Once you’ve implemented local variables, these tests should pass.
Week 4: Even More Binary Operators
We’re adding eight new operators this week:
- Logical AND
&&
- Logical OR
||
- Equal to
==
- Not equal to
!=
- Less than
<
- Less than or equal to
<=
- Greater than
>
- Greater than or equal to
>=
As usual, we’ll update our lexing, parsing, and code generation passes to support these operations.
Lexing
Each new operator corresponds to a new token. Here’s the full list of tokens we need to support, with old tokens at the top and new tokens in bold at the bottom:
- Open brace
{
- Close brace
}
- Open parenthesis
(
- Close parenthesis
)
- Semicolon
;
- Int keyword
int
- Return keyword
return
- Identifier
[a-zA-Z]\w*
- Integer literal
[0-9]+
- Minus
-
- Bitwise complement
~
- Logical negation
!
- Addition
+
- Multiplication
*
- Division
/
- AND
&&
- OR
||
- Equal
==
- Not Equal
!=
- Less than
<
- Less than or equal
<=
- Greater than
>
- Greater than or equal
>=
☑ Task:
Update the lex function to handle the new tokens. It should work for all valid and invalid stage 1-4 examples in the test suite, except the skip_on_failure_
ones.
Parsing
Last week, we found that we needed one production rule in our grammar for each operator precedence level. This week we have a lot more precedence levels, which means our grammar will grow a lot. However, our parsing strategy hasn’t changed at all; we’ll handle our new production rules exactly the same way as the old rules for exp
and term
. Honestly, this is going to be pretty tedious, but I hope it will help solidify all the stuff about parsing from last week.
Here are our all binary operators, from highest to lowest precedence1:
- Multiplication & division (
*
,/
) - Addition & subtraction (
+
,-
) - Relational less than/greater than/less than or equal/greater than or equal (
<
,>
,<=
,>=
) - Relational equal/not equal (
==
,!=
) - Logical AND (
&&
) - Logical OR (
||
)
We handled the first two bullet points last week; the last four are new. We’ll add a production rule for each of the last four bullet points. The new grammar is below, with changed/added rules bolded.
<program> ::= <function> <function> ::= "int" <id> "(" ")" "{" <statement> "}" <statement> ::= "return" <exp> ";" <exp> ::= <logical-and-exp> { "||" <logical-and-exp> } <logical-and-exp> ::= <equality-exp> { "&&" <equality-exp> } <equality-exp> ::= <relational-exp> { ("!=" | "==") <relational-exp> } <relational-exp> ::= <additive-exp> { ("<" | ">" | "<=" | ">=") <additive-exp> } <additive-exp> ::= <term> { ("+" | "-") <term> } <term> ::= <factor> { ("*" | "/") <factor> } <factor> ::= "(" <exp> ")" | <unary_op> <factor> | <int> <unary_op> ::= "!" | "~" | "-"
<additive-exp>
is the same as <exp>
from last week. We had to rename it because <exp>
now refers to logical OR expressions, which now have lowest precedence.
Last week you wrote parse_exp
and parse_term
; now you’ll need parse_relational_exp
, parse_equality_exp
, etc. Other than handling different operators, these functions will all be identical.
And for the sake of completeness, here’s our AST definition:
program = Program(function_declaration)
function_declaration = Function(string, statement) //string is the function name
statement = Return(exp)
exp = BinOp(binary_operator, exp, exp)
| UnOp(unary_operator, exp)
| Constant(int)
This is identical to last week, except we’ve added more possible values of binary_operator
.
☑ Task:
Update your expression-parsing code to handle this week’s new binary operators. It should successfully parse all valid stage 1-4 examples in the test suite (except the skip_on_failure
ones), and fail on all invalid stage 1-4 examples. The test suite doesn’t directly verify that your program generates the correct AST, so you’ll need to manually inspect the AST for each example to make sure it’s right.
Code Generation
Our general approach to code generation for binary operations is the same as last week:
- Calculate
e1
- Push it onto the stack
- Calculate
e2
- Pop
e1
from the stack back into a register - Perform the operation on
e1
ande2
.
All the new stuff will be in step 5.
Relational Operators
Let’s handle the relational operators first. Like the logical NOT operator (!
) in week 2, these return 1 for true results and 0 for false results. These operators are almost identical to !
except that they compare two expressions to each other, instead of comparing an expression to zero.
Here’s the assembly we generated for !
in week 2:
<CODE FOR exp GOES HERE>
cmpl $0, %eax ;set ZF on if exp == 0, set it off otherwise
movl $0, %eax ;zero out EAX (doesn't change FLAGS)
sete %al ;set AL register (the lower byte of EAX) to 1 iff ZF is 1
We can modify this slightly to implement ==
:
<CODE FOR e1 GOES HERE>
push %eax ; save value of e1 on the stack
<CODE FOR e2 GOES HERE>
pop %ecx ; pop e1 from the stack into ecx - e2 is already in eax
cmpl %eax, %ecx ;set ZF on if e1 == e2, set it off otherwise
movl $0, %eax ;zero out EAX (doesn't change FLAGS)
sete %al ;set AL register (the lower byte of EAX) to 1 iff ZF is on
The sete
instruction is just one of a whole slew of conditional set instructions. There’s also setne
(set if not equal), setge
(set if greater than or equal), and so on. To implement <
, >
, and the other relational operators, we can generate exactly the same assembly as we used for ==
above, just replacing sete
with the appropriate conditional set instruction. Easy!
In week 2, we talked about testing for equality with the zero flag (ZF). But we can’t use ZF to determine which operand is larger. For that, we need the sign flag (SF), which is set if the result of an operation is negative, like so:
movl $0, %eax ;zero out EAX
movl $2, %ecx ;ECX = 2
cmpl $3, %ecx ;compute 2 - 3, set flags
setl %al ;set AL if 2 < 3, i.e. if 2 - 3 is negative
Now let’s talk about &&
and ||
. I’ll use &
and |
to indicate bitwise AND and OR, respectively.
Short-Circuit Evaluation
The C11 standard guarantees that evaluation of &&
and ||
will short-circuit: if we know the result after evaluating the first clause, we don’t evaluate the second clause2. For example, consider the following line of code:
return 0 && foo();
Because the first clause is false, we don’t need to know the return value of foo
, so we won’t call foo
at all. Whether foo
is called won’t change the return value on this line, but it could perform I/O, update global variables, or have other important side effects. So making sure that &&
and ||
short-circuit isn’t just a performance optimization; it’s required for some programs to execute correctly.
Logical OR
To guarantee that logical OR short-circuits, we’ll need to jump over clause 2 when clause 1 is true.
We’ll follow these steps to calculate e1 || e2
:
- Calculate
e1
- If the result is 0, jump to the step 4.
- Set EAX to 1 and jump to the end.
- Calculate
e2
. - If the result is 0, set EAX to 0. Otherwise set EAX to 1.
Step 2 will require a new type of instruction called conditional jumps. These are similar to the conditional set instructions, like sete
and setne
, that we’ve already used. The only difference is that instead of setting a byte to 1, they jump to a specific point in the assembly code, which we specify with a label. Here’s an example of je
, the “jump if equal” instruction, in action:
cmpl $0, %eax ; set ZF if EAX == 0
je _there ; if ZF is set, go to _there
movl $1, %eax
ret
_there:
movl $2, %eax
ret
If EAX is 0 at the start of this code snippet, it will return 2; otherwise it will return 1. Let’s look at exactly what instructions will execute in each case.
First consider the case where EAX is 0 at the start:
cmpl $0, %eax
Because EAX is 0, this will set the zero flag (ZF) to true.je _there
Because ZF is true, it will jump.movl $2, %eax
This executes next because it’s the first instruction after_there
. It sets EAX to 2.ret
The return value will be 2.
Now consider the case where EAX isn’t zero:
cmpl $0, %eax
Because EAX isn’t 0, this will set ZF to false.je _there
Because ZF is false, it will not jump, so this instruction is a no-op.movl $1, %eax
Since we didn’t jump, control passes to the next instruction as usual. It sets EAX to 1.ret
The return value will be 1.
We’ll also need the jmp
instruction, which performs an unconditional jump. Here’s an example of jmp
in action:
movl $0, %eax ; zero out EAX
jmp _there ; go to _there label
movl $5 %eax ; this will never execute, we always jump over it
_there:
ret ; will always return zero
Now that we’re familiar with jmp
and je
, here’s the assembly for e1 || e2
:
<CODE FOR e1 GOES HERE>
cmpl $0, %eax ; check if e1 is true
je _clause2 ; e1 is 0, so we need to evaluate clause 2
movl $1, %eax ; we didn't jump, so e1 is true and therefore result is 1
jmp _end
_clause2:
<CODE FOR e2 GOES HERE>
cmpl $0, %eax ; check if e2 is true
movl $0, %eax ; zero out EAX without changing ZF
setne %al ; set AL register (the low byte of EAX) to 1 iff e2 != 0
_end:
Note that labels have to be unique. This means you can’t actually use _clause2
or _end
as labels, because you’ll have duplicate labels if your program includes more than one logical OR. You should probably write a utility function to generate unique labels. It doesn’t have to be fancy; the label generator in nqcc just includes an incrementing counter in every label.
The _end
label here may look odd, since it doesn’t appear to label anything. Actually, it labels whatever comes right after this expression; it just gives us a target to jump over _clause2
.
Logical AND
Almost identical to logical OR, except we short-circuit if e1
is 0. We use the jne
(jump if not equal) instruction. In that case we don’t need to move anything into EAX, since 0 is the result we want. Here’s the assembly:
<CODE FOR e1 GOES HERE>
cmpl $0, %eax ; check if e1 is true
jne _clause2 ; e1 isn't 0, so we need to evaluate clause 2
jmp _end
_clause2:
<CODE FOR e2 GOES HERE>
cmpl $0, %eax ; check if e2 is true
movl $0, %eax ; zero out EAX without changing ZF
setne %al ; set AL register (the low byte of EAX) to 1 iff e2 != 0
_end:
As with logical OR, we need to make sure the labels are unique.
☑ Task:
Update your code-generation pass to emit correct code for &&
, ||
, ==
, !=
, <
, <=
, >
, and >=
. It should succeed on all valid examples (except the skip_on_failure_
ones) and fail on all invalid examples for stages 1-4.
Other Binary Operators
We still haven’t implemented all the binary operators! We can’t implement assignment operators yet (like +=
and -=
), because we don’t have support for local variables. But there are other operators you should be able to implement on your own now:
- Modulo
%
- Bitwise AND
&
- Bitwise OR
|
- Bitwise XOR
^
- Bitwise shift left
<<
- Bitwise shift right
>>
This week’s tests don’t cover these, so it’s up to you whether to implement them or skip them.
Up Next
Next week we’ll add local variables! That means we’ll finally be able to write programs that aren’t just return statements. See you then!
Update 2/2/2019
- Updated code generation for logical AND and OR to short-circuit correctly.
If you have any questions, corrections, or other feedback, you can email me or open an issue.
2 See section 6.5.13, paragraph 4, for logical AND and 6.5.14, paragraph 4 for logical OR.↩