<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://norasandler.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://norasandler.com/" rel="alternate" type="text/html" /><updated>2026-06-02T22:53:22+00:00</updated><id>https://norasandler.com/feed.xml</id><title type="html">Nora Sandler</title><subtitle>it&apos;s a blog</subtitle><author><name>Nora Sandler</name></author><entry><title type="html">And Now for Something Completely Different</title><link href="https://norasandler.com/2026/06/02/And-Now-for-Something-Completely-Different.html" rel="alternate" type="text/html" title="And Now for Something Completely Different" /><published>2026-06-02T14:00:00+00:00</published><updated>2026-06-02T14:00:00+00:00</updated><id>https://norasandler.com/2026/06/02/And-Now-for-Something-Completely-Different</id><content type="html" xml:base="https://norasandler.com/2026/06/02/And-Now-for-Something-Completely-Different.html"><![CDATA[<p>My partner and I wrote an article about our quest to complete the Seattle Independent Bookstore Day Passport Challenge entirely on transit. <a href="https://www.theurbanist.org/33-bookstores-10-days-no-car-an-urbanists-guide-to-independent-bookstore-day/">You can read it at the Urbanist</a>. It has absolutely nothing to do with programming!
Read on for some photos we took that didn’t make it into the article:</p>

<figure>
    <img src="/assets/bookstore_day_2026/madison_book_shelves.jpeg" alt="Three shelves filled with colorful books, covers facing out. One shelf has a 'New Releases' label made of Scrabble tiles. More shelves in the background. Some titles: Upward Bound, Last Night in Brooklyn, Dear Monica Lewinsky" />
    <figcaption>The shelves at <a href="https://www.madisonbks.com/">Madison Books</a></figcaption>
</figure>

<div class="img-wrapper">
    <div style="flex: 1.33;">
    <figure>
        <img src="/assets/bookstore_day_2026/brian_reading_2_line.jpeg" alt="Close up of a bearded man with long hair and glasses reading in front of a bus window, with highway and water in the background." />
        <figcaption>Brian reads on the bus across Lake Washington</figcaption>
    </figure>
    </div>
    <div style="flex: 1.33;">
    <figure>
        <img src="/assets/bookstore_day_2026/street_sign_cherry_blossoms.jpeg" alt="A traffic light with a sign that says 'Right lane must turn right,' framed by cherry blossoms, seen from below at an angle" />
        <figcaption>Cherry blossoms in Kirkland</figcaption>
    </figure>
    </div>
</div>

<div class="img-wrapper">
    <div style="flex: 1.33;">
    <figure>
        <img src="/assets/bookstore_day_2026/brian_page_2.jpeg" alt="A man carrying a dog in a backpack browses in a bookstore" />
        <figcaption>Brian browsing the shelves at <a href="https://www.page2books.com/">Page 2 Books</a> in Burien</figcaption>
    </figure>
    </div>
    <div style="flex: 1.33;">
    <figure>
        <img src="/assets/bookstore_day_2026/away_with_words_tub.jpeg" alt="A bathtub filled with bath bombs and a big rubber duck, with bubble decorations draped over it, next to a dark wood shelf full of lotions and shower steamers" />
        <figcaption>Bath products on display at <a href="https://awaywithwordsbookshop.com/">Away With Words</a> in Poulsbo</figcaption>
    </figure>
    </div>
</div>

<div class="img-wrapper">
    <div>
    <figure>
        <img src="/assets/bookstore_day_2026/staff_picks_phinney.jpeg" alt="A bookshelf full of colorful books, including Memoirs of Hadrian, The City &amp; The City, and Thomas and Beulah. Taped to the bottom of the shelf is a pen-and-ink drawing of a man dressed like a movie theater attendant with a dazed expression standing behind a popcorn machine. In the drawing, a sign in front of the man says 'Staff Picks'. Next to the drawing, the shelf is labeled 'Recommended Reads'." />
        <figcaption>Staff picks at <a href="https://www.phinneybooks.com/">Phinney Books</a></figcaption>
    </figure>
    </div>
</div>]]></content><author><name>Nora Sandler</name></author><summary type="html"><![CDATA[My partner and I wrote an article about our quest to complete the Seattle Independent Bookstore Day Passport Challenge entirely on transit. You can read it at the Urbanist. It has absolutely nothing to do with programming! Read on for some photos we took that didn’t make it into the article:]]></summary></entry><entry><title type="html">Writing a C Compiler Is Here!</title><link href="https://norasandler.com/2024/08/20/The-Book-Is-Here.html" rel="alternate" type="text/html" title="Writing a C Compiler Is Here!" /><published>2024-08-20T14:00:00+00:00</published><updated>2024-08-20T14:00:00+00:00</updated><id>https://norasandler.com/2024/08/20/The-Book-Is-Here</id><content type="html" xml:base="https://norasandler.com/2024/08/20/The-Book-Is-Here.html"><![CDATA[<p><img src="/assets/arlo_with_book.jpeg" alt="My dog sitting on the couch next to a stack of copies of Writing a C Compiler, which he is looking at skeptically." /></p>

<p>It’s finally here! <em>Writing a C Compiler</em> goes on sale today on <a href="https://bookshop.org/p/books/writing-a-c-compiler-build-a-real-programming-language-from-scratch-nora-sandler/18414210">Bookshop.org</a>, <a href="https://www.barnesandnoble.com/w/writing-a-c-compiler-nora-sandler/1141287012">Barnes &amp; Noble</a>, <a href="https://www.amazon.com/Writing-Compiler-Programming-Language-Scratch/dp/1718500424/">Amazon</a>, and wherever books are sold<sup id="anchor1"><a href="#fn1">1</a></sup>. (It’s been available <a href="https://nostarch.com/writing-c-compiler">directly from No Starch Press</a> for a few weeks now!) You can find links to the companion code, errata, and other resources on the book’s <a href="/book">web page</a>.</p>

<p>I’m incredibly grateful to everyone who preordered a copy, and especially to the readers who took the time to email me and share their feedback. One of the best parts of writing a project-based book like this is hearing how people are making it their own: readers have been writing compilers in everything from Rust to Scheme to C, and a few brave souls are even targeting totally different instruction sets. I can’t wait to see what else this book’s readers build.</p>

<div class="footnote">
  <p><sup id="fn1">1</sup>
If you see it for sale in a physical bookstore, <a href="mailto:nora@norasandler.com">send me a photo</a>!</p>
</div>]]></content><author><name>Nora Sandler</name></author><category term="compiler-tutorial" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Book Update</title><link href="https://norasandler.com/2023/10/17/Book-update.html" rel="alternate" type="text/html" title="Book Update" /><published>2023-10-17T14:00:00+00:00</published><updated>2023-10-17T14:00:00+00:00</updated><id>https://norasandler.com/2023/10/17/Book-update</id><content type="html" xml:base="https://norasandler.com/2023/10/17/Book-update.html"><![CDATA[<p>I’ve got a couple of updates about my upcoming book <a href="https://nostarch.com/writing-c-compiler"><em>Writing a C Compiler</em></a>, which I first announced in a <a href="/2022/03/29/Write-a-C-Compiler-the-Book.html">blog post last year</a>.</p>

<p>I’ll start with the bad news: we’ve had to push the release date back until mid-2024. But I also have good news, which is that <strong>the entire book is now available in early access</strong> to anyone who’s preordered through the <a href="https://nostarch.com/writing-c-compiler">No Starch Press website</a>. I’ve also made the book’s companion <a href="https://github.com/nlsandler/writing-a-c-compiler-tests/tree/complete-test-suite">test suite</a> and <a href="https://github.com/nlsandler/nqcc2">reference implementation</a> available on GitHub.</p>

<p>If you preordered this book last year, I realize you’ve been waiting a long time for it! I’m excited to make this early access version available, so you can start working on your compiler before the official release date. As with any early access book, you might still run into typos, layout problems, and the like, which will get fixed before the book is released. The test suite and reference implementation are also still works in progress. Between now and the book’s release date, I’ll be adding more test cases, especially for the last three chapters and the extra credit features; I’m also planning to make readability improvements to the reference implementation. Even though these codebases aren’t quite complete, they’re close enough to use while you work on the project.</p>

<h1 id="feedback">Feedback!</h1>

<p>If you have questions or corrections about the early access chapters, please <a href="mailto:nora@norasandler.com">email me</a>. You can also report errors through No Starch’s <a href="https://docs.google.com/forms/d/e/1FAIpQLSfjCqdOzGOdoe7m1Rgqfo-dqvz85Gqe8758jwUD9mpFYiSjGA/viewform?fbzx=-3092278227089906900">Early Access comment form</a><sup id="anchor1"><a href="#fn1">1</a></sup>. And if you run into any issues with the test suite or reference implementation, please file a bug in that repo’s issue tracker. You can file bugs against the test suite <a href="https://github.com/nlsandler/writing-a-c-compiler-tests/issues">here</a>, and against the reference implementation <a href="https://github.com/nlsandler/nqcc2/issues">here</a>.</p>

<div class="footnote">
  <p><sup id="fn1">1</sup>
No need to report typos/formatting issues/etc.; those will get fixed when the book goes through copyediting and proofreading.<a href="#anchor1">↩</a></p>
</div>]]></content><author><name>Nora Sandler</name></author><category term="compiler-tutorial" /><summary type="html"><![CDATA[I’ve got a couple of updates about my upcoming book Writing a C Compiler, which I first announced in a blog post last year.]]></summary></entry><entry><title type="html">Writing a C Compiler is a book!</title><link href="https://norasandler.com/2022/03/29/Write-a-C-Compiler-the-Book.html" rel="alternate" type="text/html" title="Writing a C Compiler is a book!" /><published>2022-03-29T16:00:00+00:00</published><updated>2022-03-29T16:00:00+00:00</updated><id>https://norasandler.com/2022/03/29/Write-a-C-Compiler-the-Book</id><content type="html" xml:base="https://norasandler.com/2022/03/29/Write-a-C-Compiler-the-Book.html"><![CDATA[<p><strong>Update <a href="/2023/10/17/Book-update.html">here</a>.</strong></p>

<p>I have some very exciting news to share: the “Writing a C Compiler” series is now a book!</p>

<p><a href="https://nostarch.com/writing-c-compiler"><em>Writing a C Compiler: Build a Real Programming Language from Scratch</em></a> is coming out from No Starch Press in late 2023. You can preorder at the link to get early access to the first few chapters.</p>

<p>In the <a href="/2019/02/18/Write-a-Compiler-10.html">last post in the series</a>, I said that I was going to take a six-month break to figure out how to finish the compiler. Instead, I took a three-year break, reworked the backend, implemented the rest of the features I wanted to add (well, most of them), and wrote a book. If you were already following the series, you can jump to <a href="#what-if-ive-already-done-the-series">this section</a> to learn what’s changed. Otherwise, read on for an elevator pitch!</p>

<h1 id="whats-the-deal-with-this-book">What’s the deal with this book?</h1>

<p><em>Writing a C Compiler</em> is a hands-on guide to, well, writing your own C compiler. It takes the same basic approach as the <a href="/2017/11/29/Write-a-Compiler.html">series</a> of blog posts I published here a few years ago. You start out by compiling the tiniest possible C program to x64 assembly, then add a new feature in each chapter. This book is all about compiling a real, widely used programming language into real assembly code, with all the low-level details and ugly edge cases that entails.</p>

<p>At the same time, I wanted to write this book for a broad audience, not just people who already know assembly code or have the C standard memorized. So I’ve tried to lay out the whole process (ugly edge cases included) in a way that’s accessible, easy to follow, and maybe even fun. The implementation code in the book is all pseudocode, so you can implement your compiler in whatever language you want!</p>

<p>Here’s a non-exhaustive look at what you’ll learn:
<a id="outline"></a></p>
<ul>
  <li><strong>Part I</strong> introduces the basics, like expressions, variables, control flow statements, and function calls.</li>
  <li><strong>Part II</strong> adds more types, including floating-point numbers, arrays and pointers, and structs.</li>
  <li><strong>Part III</strong> covers a few classic optimizations, like constant folding, dead code elimination, and register allocation.</li>
</ul>

<p>I didn’t include every feature in the C standard, but I wanted the end result to <em>feel</em> complete. I’ve also tried to cover the fundamentals that you’ll need to know if you want to keep building out new features on your own.</p>

<h1 id="what-if-ive-already-done-the-series">What if I’ve already done the series?</h1>

<p>When I started working on the book, I thought that I’d just be building on the existing series. But the implementation in the book quickly diverged from what I’d originally posted. The most obvious problem is that the original design produced 32-bit x86 assembly, which was quickly becoming obsolete even when I first started the project back in 2017.</p>

<p>The other problem was that I needed a new intermediate representation. Converting the AST directly to assembly worked well for the first few chapters, but got more and more unwieldy as the project went on. I knew that things would only get worse as I started to add new types, and optimizations were going to be really difficult. The new implementation converts the program to <a href="https://en.wikipedia.org/wiki/Three-address_code">three-address code</a> before it generates assembly.</p>

<p>The upshot is that I won’t be continuing the series on this blog. The good news, of course, is that you can finish your compiler by working through the book, which covers <a href="#outline">a lot more ground</a>! The bad news is that you won’t be able to skip straight to Part II; you’ll have to bring your backend in line with the implementation described in Part I first. Hopefully, the payoff of finishing your compiler will be well worth the extra work!</p>

<h2 id="update-312023">Update 3/1/2023</h2>
<p>An earlier version of this blog post said the book would be coming out in January 2023. Unfortunately, we’ve had to push back this release date until later this year. If you’ve preordered the book, thanks so much for your patience; the wonderful folks at No Starch Press and I are working hard to make this the best book possible!</p>]]></content><author><name>Nora Sandler</name></author><category term="compiler-tutorial" /><summary type="html"><![CDATA[Update here.]]></summary></entry><entry><title type="html">C Compiler, Part 10: Global Variables</title><link href="https://norasandler.com/2019/02/18/Write-a-Compiler-10.html" rel="alternate" type="text/html" title="C Compiler, Part 10: Global Variables" /><published>2019-02-18T17:00:00+00:00</published><updated>2019-02-18T17:00:00+00:00</updated><id>https://norasandler.com/2019/02/18/Write-a-Compiler-10</id><content type="html" xml:base="https://norasandler.com/2019/02/18/Write-a-Compiler-10.html"><![CDATA[<p><em>This is the tenth post in a series. Read part 1 <a href="/2017/11/29/Write-a-Compiler.html">here</a>.</em></p>

<p>We’re back! I said I was going to do a non-compiler post next, but that turned out to be a lie. Instead, we’re going to implement global variables. This isn’t too complicated, but it lets us learn about some new sections of object files and program memory.</p>

<p>As always, tests are <a href="https://github.com/nlsandler/write_a_c_compiler">here</a>.</p>

<p><strong>Note for macOS Users:</strong> since the last post, Apple started phasing out support for 32-bit programs on macOS. What that means for us is that if you’re using the default C compiler on macOS Mojave, you’ll get an error if you try to compile for a 32-bit backend<sup id="anchor1"><a href="#fn1">1</a></sup>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>gcc <span class="nt">-m32</span> example.c
ld: warning: The i386 architecture is deprecated <span class="k">for </span>macOS <span class="o">(</span>remove from the Xcode build setting: ARCHS<span class="o">)</span>
ld: warning: ignoring file /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd, missing required architecture i386 <span class="k">in </span>file /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd
ld: dynamic main executables must <span class="nb">link </span>with libSystem.dylib <span class="k">for </span>architecture i386
clang: error: linker <span class="nb">command </span>failed with <span class="nb">exit </span>code 1 <span class="o">(</span>use <span class="nt">-v</span> to see invocation<span class="o">)</span>
ld: warning: The i386 architecture is deprecated <span class="k">for </span>macOS <span class="o">(</span>remove from the Xcode build setting: ARCHS<span class="o">)</span>
</code></pre></div></div>

<p>But never fear! The Homebrew version of GCC works just fine, although it still emits a warning:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>gcc-8 <span class="nt">-m32</span> static.c
ld: warning: The i386 architecture is deprecated <span class="k">for </span>macOS <span class="o">(</span>remove from the Xcode build setting: ARCHS<span class="o">)</span>
</code></pre></div></div>

<p>I’m pretty sure there’s a way to get the default compiler to build 32-bit programs as well, but I don’t know what it is.</p>

<p>When you run a 32-bit program (like the ones produced by <em>your</em> compiler), you might also get a warning that it isn’t optimized for your computer. This is also due to Apple’s efforts to phase out 32-bit programs, but you don’t need to do anything about it.</p>

<p>The bigger issue, of course, is that the next version of macOS won’t run 32-bit programs at all. I plan to update all my posts before that happens to cover 64-bit compilation too. And yes, I do regret targeting a 32-bit architecture to begin with, thank you for asking. Luckily, apart from calling conventions, all the differences so far are pretty minor.</p>

<p>With that out of the way, let’s move on to…</p>

<h1 id="part-10-global-variables">Part 10: Global Variables</h1>

<p>We can already handle local variables declared inside functions. Now we’ll add support for global variables, which any function can access.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span><span class="p">;</span>

<span class="kt">int</span> <span class="nf">fun1</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">foo</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">fun2</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">foo</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">fun1</span><span class="p">();</span>
    <span class="k">return</span> <span class="n">fun2</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note that global variables can be shadowed by local variables of the same name:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span> <span class="c1">// shadows global 'foo'</span>
    <span class="k">return</span> <span class="n">foo</span><span class="p">;</span> <span class="c1">// returns 4</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Global variables are similar to functions in that they can be declared many times, but defined (i.e. initialized) only once:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span><span class="p">;</span> <span class="c1">// declaration</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">foo</span><span class="p">;</span> <span class="c1">// returns 3</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span> <span class="c1">// definition</span>
</code></pre></div></div>

<p>And, like functions, global variables must be declared (but not necessarily defined) before they’re used:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">foo</span><span class="p">;</span> <span class="c1">// ERROR: not declared!</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="n">foo</span><span class="p">;</span>
</code></pre></div></div>

<p>Declaring a function and a global variable with the same name is an error:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">foo</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="mi">3</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span> <span class="c1">// ERROR</span>
</code></pre></div></div>

<p>Unlike local variables, global variables don’t need to be explicitly initialized. If a local variable isn’t initialized, its value is undefined, but if a global variable isn’t initialized its value is 0.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">foo</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">foo</span><span class="p">;</span> <span class="c1">// This could be literally anything</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span><span class="p">;</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">foo</span><span class="p">;</span> <span class="c1">// This will definitely be 0</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note that we’re using the terms “declaration” and “definition” the same way we did for functions. This is a global variable declaration<sup id="anchor2"><a href="#fn2">2</a></sup>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span><span class="p">;</span>
</code></pre></div></div>

<p>This is both a declaration and a definition:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">static</code> and <code class="language-plaintext highlighter-rouge">extern</code> keywords would add some extra complications, but we won’t support those yet.</p>

<p>Now let’s move on to…</p>

<h2 id="lexing">Lexing</h2>
<p>No new tokens this week, so we don’t have to touch the lexer.</p>

<h2 id="parsing">Parsing</h2>
<p>Previously, a program was a list of function declarations. Now it’s a list of top-level declarations, each of which is either a function declaration or a variable declaration.</p>

<p>So our top-level AST definitions now look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>toplevel_item = Function(function_declaration)
              | Variable(declaration)
toplevel = Program(toplevel_item list)
</code></pre></div></div>

<p>And we need a corresponding change to the top-level grammar rule:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;program&gt; ::= { &lt;function&gt; | &lt;declaration&gt; }
</code></pre></div></div>

<h4 id="-task">☑ Task:</h4>
<p>Update the parsing pass to support global variables. The parsing stage should now succeed on all valid examples in stages 1-10.</p>

<h2 id="code-generation">Code Generation</h2>
<p>Global variables need to live somewhere in memory. They can’t live on the stack, because they need to be accessible from every stack frame. Instead, they live in a different chunk of memory, the data section. We’ve already seen what a running program’s stack looks like; now let’s step back and see how all of its memory is laid out<sup id="anchor3"><a href="#fn3">3</a></sup>:</p>

<p><img class="small" style="width: 20%;" alt="Diagram of program memory layout. The stack starts at a high address and grows down into free space. The heap starts at a lower address and grows up into the same region of free space. Below the heap, from top to bottom, are Initialized Data, Uninitialized Data (BSS) and Text." src="/assets/program_memory_layout.png" /></p>

<p>The x86 instructions we’ve been dealing with so far all live in the text section. Our global variables will live in the data section, which we can further subdivide into initialized and uninitialized data—the uninitialized data section is usually called BSS<sup id="anchor4"><a href="#fn4">4</a></sup>.</p>

<p>So far we’ve only generated assembly for the text section, which contains actual program instructions; let’s see what the assembly to describe a variable in the data section looks like:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">.globl</span> <span class="nv">_my_var</span> <span class="c1">; make this symbol visible to the linker</span>
    <span class="nf">.data</span>          <span class="c1">; what's next describes the data section</span>
    <span class="nf">.align</span> <span class="mi">2</span>       <span class="c1">; this data should be aligned on 4-byte intervals</span>
<span class="nl">_my_var:</span>
    <span class="nf">.long</span> <span class="mi">1337</span>     <span class="c1">; allocate a long integer with value 1337</span>
</code></pre></div></div>

<p>A couple things to note here:</p>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">.data</code> directive tells the assembler we’re in the data section. We’ll also need a <code class="language-plaintext highlighter-rouge">.text</code> directive to indicate when we switch back to the text section.</li>
  <li>A label like <code class="language-plaintext highlighter-rouge">_my_var</code> labels a memory address. The assembler and linker don’t care whether that address refers to an instruction in the text section or a variable in the data section; they’re going to treat it the same way.</li>
  <li>On macOS, <code class="language-plaintext highlighter-rouge">.align n</code> means “align the next thing to a multiple of 2<sup>n</sup> bytes”. So <code class="language-plaintext highlighter-rouge">.align 2</code> means we’re using a 4-byte alignment. On Linux, <code class="language-plaintext highlighter-rouge">.align n</code> means “align the next thing to a multiple of n bytes”, so you’d want <code class="language-plaintext highlighter-rouge">.align 4</code> to get the same result.</li>
</ul>
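
<p>If you want to hide that platform difference behind one function, a minimal sketch in C might look like the one below. The name <code class="language-plaintext highlighter-rouge">align_operand</code> and the <code class="language-plaintext highlighter-rouge">is_macos</code> flag are made up for illustration, and the sketch assumes the alignment you pass in is a positive power of two:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Hypothetical helper: returns the number to write after .align
 * for a given byte alignment (assumed to be a positive power of two). */
int align_operand(int bytes, int is_macos) {
    if (!is_macos) {
        return bytes;        /* Linux: the operand is the byte count itself */
    }
    int log2 = 0;
    while (bytes &gt; 1) {      /* macOS: the operand is log2 of the byte count */
        bytes = bytes / 2;
        log2 = log2 + 1;
    }
    return log2;
}
</code></pre></div></div>

<p>With this sketch, <code class="language-plaintext highlighter-rouge">align_operand(4, 1)</code> gives 2 and <code class="language-plaintext highlighter-rouge">align_operand(4, 0)</code> gives 4, matching the two directives described above.</p>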

<p>Once you’ve allocated a variable, you can refer to its label directly in assembly:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">movl</span> <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="nv">_my_var</span> <span class="c1">; move the value in %eax to the memory address of _my_var</span>
</code></pre></div></div>

<p>So the basic gist here is:</p>

<ol>
  <li>
    <p>When you encounter a <em>declaration</em> for a global variable, add it to the variable map. The variable map entry will be its label instead of a stack index:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> var_map = var_map.put("my_var", "_my_var")
</code></pre></div>    </div>

    <p>Note that this new variable map entry must be visible when we generate later top-level items; this isn’t true of entries we add while processing function definitions.</p>
  </li>
  <li>When you encounter a <em>definition</em> for a global variable, with an initializer, emit assembly to allocate it in the data section. Then emit a <code class="language-plaintext highlighter-rouge">.text</code> directive before you go back to generating function definitions.</li>
  <li>When you encounter a <em>reference</em> to a variable, handle it the same way you did before. If its entry in the variable map is a label instead of a stack index, of course, you should use it directly instead of as an offset from <code class="language-plaintext highlighter-rouge">%ebp</code>. If it doesn’t have an entry, that’s an error.</li>
</ol>
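
<p>One way to picture step 3 is a variable map entry that holds either a label or a stack offset. This is only a sketch in C, not the implementation from this series; the <code class="language-plaintext highlighter-rouge">var_entry</code> struct and the <code class="language-plaintext highlighter-rouge">format_operand</code> function are hypothetical names:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;

/* Hypothetical variable map entry: either a label or a stack offset. */
struct var_entry {
    int is_global;       /* 1 if the variable lives in the data or BSS section */
    int stack_offset;    /* offset from %ebp, used when is_global is 0 */
    const char *label;   /* e.g. "_my_var", used when is_global is 1 */
};

/* Write the assembly operand for a variable reference into buf. */
void format_operand(const struct var_entry *entry, char *buf, size_t size) {
    if (entry-&gt;is_global) {
        /* globals are referenced directly by label */
        snprintf(buf, size, "%s", entry-&gt;label);
    } else {
        /* locals are referenced as an offset from %ebp */
        snprintf(buf, size, "%d(%%ebp)", entry-&gt;stack_offset);
    }
}
</code></pre></div></div>

<p>A global entry with label <code class="language-plaintext highlighter-rouge">_my_var</code> produces the operand <code class="language-plaintext highlighter-rouge">_my_var</code>, while a local at offset -4 produces <code class="language-plaintext highlighter-rouge">-4(%ebp)</code>.</p>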

<p>But there are a few wrinkles.</p>

<h3 id="uninitialized-variables">Uninitialized Variables</h3>
<p>If, by the end of the program, we have any variables left that have been declared but not defined, we need to declare them in a special section for uninitialized data. On Linux, all uninitialized data lives in the BSS section, which also includes any variables initialized to 0. On macOS it’s a little more complicated: uninitialized static variables go in BSS, and uninitialized global variables go in the common section, which indicates to the linker that they may be initialized in a different object file. We don’t support static variables yet, so on macOS we don’t need to store anything in BSS. Of course, we also don’t have any tests with multiple source files, so if you just use BSS instead of common, effectively making all global variables static, the tests will still pass.</p>

<p>The data section consists of the actual values of our data; we can load it directly into memory and use it as-is. The BSS and common sections, on the other hand, don’t contain all of our uninitialized values, because they would just be big blocks of zeros. Storing a big block of zeros on disk would be a waste of space. Instead, we just store the size of BSS and common in our binary, and allocate that much memory for them when we load the program. So keeping initialized and uninitialized variables separate is just a trick to reduce the size of binaries.</p>

<p>On macOS, we can allocate space in the common section using the <code class="language-plaintext highlighter-rouge">.comm</code> directive:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">.text</span>
    <span class="nf">.comm</span> <span class="nv">_my_var</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">2</span> <span class="c1">; allocate 4 bytes for symbol _my_var, with 4-byte alignment</span>
</code></pre></div></div>

<p>Allocating space in BSS, on the other hand, looks almost exactly the same as allocating a non-zero variable, but we’ll use <code class="language-plaintext highlighter-rouge">.zero 4</code> to allocate 4 bytes of zeros instead of <code class="language-plaintext highlighter-rouge">.long n</code> to allocate a long integer with value <code class="language-plaintext highlighter-rouge">n</code>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">.globl</span> <span class="nv">_my_var</span> <span class="c1">; make this symbol visible to the linker</span>
    <span class="nf">.bss</span>           <span class="c1">; what's next describes the BSS section    </span>
    <span class="nf">.align</span> <span class="mi">4</span>       <span class="c1">; this data should be aligned on 4-byte intervals (Linux align directive)</span>
<span class="nl">_my_var:</span>
    <span class="nf">.zero</span> <span class="mi">4</span>        <span class="c1">; allocate 4 bytes of zeros</span>
</code></pre></div></div>

<p>Note that in assembly, unlike in C, it’s perfectly fine to reference a label like <code class="language-plaintext highlighter-rouge">_my_var</code> before that label is defined. That’s why we can wait until the end of the program to allocate any uninitialized variables.</p>
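<p>As a rough sketch of that end-of-program step, assuming you keep a list of globals with a flag saying whether each one was ever defined: the <code class="language-plaintext highlighter-rouge">global_var</code> struct and <code class="language-plaintext highlighter-rouge">emit_uninitialized</code> function below are hypothetical, and the directives just mirror the two examples above, with every variable assumed to be a 4-byte <code class="language-plaintext highlighter-rouge">int</code>.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one global variable seen during compilation. */
struct global_var {
    const char *name;  /* identifier without the leading underscore */
    int defined;       /* nonzero if an initializer was already emitted */
};

/* Append directives for every declared-but-undefined global to buf, which
 * must already hold a NUL-terminated string. If use_common is nonzero, use
 * the .comm form; otherwise use the .bss form.
 * Returns the number of variables emitted, or -1 if buf is too small. */
int emit_uninitialized(char *buf, size_t cap,
                       const struct global_var *vars, size_t n,
                       int use_common)
{
    assert(buf != NULL && (vars != NULL || n == 0));
    size_t len = strlen(buf);
    int emitted = 0;

    for (size_t i = 0; i < n; i++) {
        int w;
        if (vars[i].defined)
            continue;  /* already allocated in the data section */
        if (use_common)
            w = snprintf(buf + len, cap - len,
                         "    .comm _%s,4,2\n", vars[i].name);
        else
            w = snprintf(buf + len, cap - len,
                         "    .globl _%s\n    .bss\n    .align 4\n"
                         "_%s:\n    .zero 4\n",
                         vars[i].name, vars[i].name);
        if (w < 0 || (size_t)w >= cap - len)
            return -1;  /* output was truncated */
        len += (size_t)w;
        emitted++;
    }
    return emitted;
}
```

<p>Because labels can be referenced before they’re defined, it’s fine to call something like this once, after all the functions have been generated.</p>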

<h3 id="non-constant-initializers">Non-Constant Initializers</h3>
<p>Global variables are loaded into memory before the program starts, which means we can’t execute any instructions to calculate their initial values. Therefore their initializers need to be constants. For example, this isn’t valid:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">5</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">bar</span> <span class="o">=</span> <span class="n">foo</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// NOT A CONSTANT!</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">bar</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Most compilers permit global variables to be initialized with constant expressions, like:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">+</span> <span class="mi">3</span> <span class="o">*</span> <span class="mi">5</span><span class="p">;</span>
</code></pre></div></div>

<p>This requires you to compute <code class="language-plaintext highlighter-rouge">2 + 3 * 5</code> at compile time. You can support this if you want, but you don’t have to; the test suite doesn’t check for it.</p>
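<p>If you do want to support constant expressions, one possible approach is a small recursive evaluator over the AST. This is only a sketch: the <code class="language-plaintext highlighter-rouge">exp</code> struct below is a hypothetical, cut-down AST with just enough node types for the example, and it ignores integer overflow.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical expression AST, limited to what this example needs. */
enum exp_kind { EXP_CONST, EXP_VAR, EXP_ADD, EXP_SUB, EXP_MUL };

struct exp {
    enum exp_kind kind;
    int value;               /* used by EXP_CONST */
    const struct exp *lhs;   /* used by the binary operators */
    const struct exp *rhs;
};

/* Try to evaluate e at compile time. Returns 1 and stores the result in
 * *out if e is built only from integer constants; returns 0 otherwise,
 * for example if e refers to a variable. Overflow is not checked. */
int eval_constant(const struct exp *e, int *out)
{
    int l, r;
    assert(e != NULL && out != NULL);
    switch (e->kind) {
    case EXP_CONST:
        *out = e->value;
        return 1;
    case EXP_ADD:
    case EXP_SUB:
    case EXP_MUL:
        if (!eval_constant(e->lhs, &l) || !eval_constant(e->rhs, &r))
            return 0;
        if (e->kind == EXP_ADD)
            *out = l + r;
        else if (e->kind == EXP_SUB)
            *out = l - r;
        else
            *out = l * r;
        return 1;
    default:
        return 0;  /* EXP_VAR, or anything else, is not a constant */
    }
}
```

<p>A nice side effect is that the same function doubles as the “is this initializer constant?” check: if it returns 0 for a global’s initializer, that’s a compile error.</p>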

<h3 id="validation">Validation</h3>

<p>To recap, here’s what we need to validate:</p>
<ul>
  <li>Variables, including global variables, are declared before they are used.</li>
  <li>No global variable is defined more than once.</li>
  <li>No global variable is initialized with a non-constant value.</li>
  <li>No symbol is declared as both a function and a variable.</li>
</ul>

<p>It’s easy to validate the first bullet point during code generation; we’re doing that for local variables anyway. The remaining points can be validated either during code generation, or in a separate validation pass. I’d recommend handling them wherever you validate function definitions and calls.</p>
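<p>To make those checks concrete, here’s a small sketch of a symbol table that rejects the invalid cases. Everything in it (the <code class="language-plaintext highlighter-rouge">symtab</code> type, the fixed-size array, the function names) is an assumption for illustration, not a description of any particular implementation. The non-constant initializer check isn’t covered here.</p>

```c
#include <assert.h>
#include <string.h>

enum sym_kind { SYM_FUN, SYM_VAR };

struct symbol {
    const char *name;
    enum sym_kind kind;
    int defined;          /* nonzero once a definition has been seen */
};

#define MAX_SYMBOLS 64

struct symtab {
    struct symbol syms[MAX_SYMBOLS];
    size_t count;
};

static struct symbol *lookup(struct symtab *t, const char *name)
{
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->syms[i].name, name) == 0)
            return &t->syms[i];
    return NULL;
}

/* Record a top-level declaration or definition.
 * Returns 0 if it is legal, -1 if it should be a compile error. */
int record_symbol(struct symtab *t, const char *name,
                  enum sym_kind kind, int is_definition)
{
    assert(t != NULL && name != NULL);
    struct symbol *s = lookup(t, name);
    if (s == NULL) {
        if (t->count == MAX_SYMBOLS)
            return -1;                 /* table full */
        s = &t->syms[t->count++];
        s->name = name;
        s->kind = kind;
        s->defined = is_definition;
        return 0;
    }
    if (s->kind != kind)
        return -1;                     /* both a function and a variable */
    if (is_definition) {
        if (s->defined)
            return -1;                 /* defined more than once */
        s->defined = 1;
    }
    return 0;
}

/* A use is legal only if the name was already declared as a variable. */
int check_var_use(struct symtab *t, const char *name)
{
    struct symbol *s = lookup(t, name);
    return (s != NULL && s->kind == SYM_VAR) ? 0 : -1;
}
```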

<h4 id="-task-1">☑ Task:</h4>
<p>Update the code generation pass (and your validation pass, if you have one) to fail with an error for all invalid stage 10 examples, and succeed on all valid stage 10 examples.</p>

<h2 id="pie-">PIE 🥧</h2>

<p>If you compile a program with global variables using a real compiler, the assembly will look quite different from what we described above. You may also notice, if you’re on macOS, that the linker will warn you about the assembly your compiler produces:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./my_compiler global.c
ld: warning: The i386 architecture is deprecated <span class="k">for </span>macOS <span class="o">(</span>remove from the Xcode build setting: ARCHS<span class="o">)</span>
ld: warning: PIE disabled. Absolute addressing <span class="o">(</span>perhaps <span class="nt">-mdynamic-no-pic</span><span class="o">)</span> not allowed <span class="k">in </span>code signed PIE, but used <span class="k">in </span>_main from /var/folders/9t/p20tf0zs4ql425tdktwnfjkm0000gn/T//cczcZcyQ.o. To fix this warning, don<span class="s1">'t compile with -mdynamic-no-pic or link with -Wl,-no_pie
</span></code></pre></div></div>

<p>PIE stands for “position-independent executable”, which means an executable consisting entirely of position-independent code. This section briefly explains what position-independent code is and why you might need it, but doesn’t explain how to implement it. Feel free to skip it if you’re not interested.</p>

<p>Position-independent code is code that can run no matter where it’s loaded in memory, because it never refers to absolute memory addresses. The code our compiler produces is not position-independent, because it has instructions like:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">movl</span> <span class="kc">$</span><span class="mi">3</span><span class="p">,</span> <span class="nv">_my_var</span>
</code></pre></div></div>

<p>In order for this instruction to run, the linker needs to replace <code class="language-plaintext highlighter-rouge">_my_var</code> with an absolute memory address. This only works if we know the absolute address of the data and BSS sections in advance.</p>

<p>Position-independent code, on the other hand, never refers to the address of symbols like <code class="language-plaintext highlighter-rouge">_my_var</code> directly; instead, those addresses are calculated relative to the current instruction pointer. In case I didn’t have enough of a reason to regret targeting a 32-bit architecture, position-independent assembly is much simpler with a 64-bit instruction set:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">movl</span> <span class="kc">$</span><span class="mi">3</span><span class="p">,</span> <span class="nv">_my_var</span><span class="p">(</span><span class="o">%</span><span class="nv">rip</span><span class="p">)</span> <span class="c1">; use _my_var as offset from instruction pointer</span>
</code></pre></div></div>

<p>To get the same result with a 32-bit architecture you need something like this:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">call</span>    <span class="nv">___x86.get_pc_thunk.ax</span>
<span class="nl">L1$pb:</span>
    <span class="nf">leal</span>    <span class="nv">_my_var</span><span class="o">-</span><span class="nv">L1$pb</span><span class="p">(</span><span class="o">%</span><span class="nb">eax</span><span class="p">),</span> <span class="o">%</span><span class="nb">eax</span>
    <span class="nf">movl</span>    <span class="p">(</span><span class="o">%</span><span class="nb">eax</span><span class="p">),</span> <span class="o">%</span><span class="nb">eax</span>
</code></pre></div></div>
<p>I won’t walk through exactly what this code is doing; if you’re curious, <a href="https://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries/">this article</a> gives a good overview of position-independent code for x86.</p>

<p>There are two reasons you might want to generate position-independent code:</p>

<ol>
  <li>
    <p>You’re compiling a shared library. Maybe this is a really widely used library, like libc. Maybe all or most processes on a system will want a copy of this library. It seems like a waste to have a separate copy for every process, eating up all your RAM. Instead, we can load the library into physical memory just once, then map it into the virtual memory of every process that needs it. But we can’t guarantee a library the same starting address in every process that loads it. So sharing one library between several processes only works if the library works no matter what memory address it’s at—which is to say, it needs to be position-independent. However, we’re compiling an executable, not a library, so this doesn’t apply to us.</p>
  </li>
  <li>
    <p>You have address space layout randomization (ASLR) enabled. ASLR is a security feature that makes some memory corruption attacks harder to carry out. Many of these attacks involve forcing program execution to jump to the instructions an attacker would like to execute. With ASLR enabled, memory segments are loaded at random locations<sup id="anchor5"><a href="#fn5">5</a></sup>, which makes it harder for attackers to figure out what address to jump to. Code needs to be position independent in order to run correctly when loaded to a random memory address. Since Apple really wants all macOS applications to support ASLR<sup id="anchor6"><a href="#fn6">6</a></sup>, the linker will try to build a position-independent executable by default, and complain if it can’t.</p>
  </li>
</ol>

<p>The fact that your compiler can’t generate position-independent executables is just one of many, many reasons you shouldn’t use it to build real software. I don’t have that much faith in these blog posts, and neither should you!</p>

<p>If you want to learn more about ASLR, I found <a href="http://security.cs.rpi.edu/courses/binexp-spring2015/lectures/15/09_lecture.pdf">these slides</a> helpful. Of course, there’s also <a href="https://en.wikipedia.org/wiki/Address_space_layout_randomization">Wikipedia</a>.</p>

<h2 id="up-next">Up Next</h2>

<p>So far, I’ve been implementing a compiler and writing posts as I go. This system worked really well for a while, but now it’s starting to work less well; I realized that some decisions I made in earlier stages made this stage harder to complete, so I had to go back and change them. I think I’m likely to run into more problems like that in later posts. So I’m going to take a break, finish building the compiler (whatever I decide “finished” means), and then come back and write the rest of this series. I probably won’t post another update for six months. So basically…I’m going to keep posting at about the same rate I have been.</p>

<p>When I come back, I’ll have a plan for what to cover in the rest of the series. See you then!</p>

<p><em>If you have any questions, corrections, or other feedback, you can <a href="mailto:nora@norasandler.com">email me</a> or <a href="https://github.com/nlsandler/write_a_c_compiler/issues">open an issue</a>.</em></p>

<div class="footnote">
  <p><sup id="fn1">1</sup>
The compiler that ships with the Xcode Command Line Tools (the one that was giving me this error) is actually <em>not</em> GCC. It’s <a href="https://en.wikipedia.org/wiki/Clang">Clang</a>, another open-source compiler from the LLVM project, which Apple contributes to heavily. Xcode installs Clang at <code class="language-plaintext highlighter-rouge">/usr/bin/gcc</code>, no doubt for very sound and legitimate reasons, although I don’t know what they are.
<a href="#anchor1">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn2">2</sup>
The standard actually considers this a <em>tentative definition</em> (section 6.9.2):</p>

  <blockquote>
    <p>A declaration of an identifier for an object that has file scope without an initializer, and without a
storage-class specifier or with the storage-class specifier static, constitutes a tentative definition.</p>
  </blockquote>

  <p>Basically, if we can’t find a real definition anywhere else in the file, we can treat a declaration like a definition with an initial value of 0. We’re still going to call it a declaration, though.
<a href="#anchor2">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn3">3</sup>
<a href="https://commons.wikimedia.org/wiki/File:Typical_computer_data_memory_arrangement.png">Typical computer data memory arrangement</a> by Majenko is licensed under <a href="https://creativecommons.org/licenses/by-sa/4.0/deed.en">CC BY-SA 4.0</a>.</p>

  <p>This diagram is an oversimplification; it doesn’t show every memory segment we might find in a running program. Also, sometimes memory segments are laid out in a different order—we’ll talk about that later. The point is that we have a dedicated chunk of memory for global variables.<a href="#anchor3">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn4">4</sup>
BSS stands for “Block Started by Symbol,” which is a relic of an assembler written in the 1950s(!). You can read more <a href="https://en.wikipedia.org/wiki/.bss#Origin">here</a> if you want to go down a bit of a Wikipedia rabbit hole.<a href="#anchor4">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn5">5</sup>
Exactly which memory segments are randomized, and how random their base addresses actually are, varies between systems. <a href="#anchor5">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn6">6</sup>
<a href="https://developer.apple.com/library/archive/qa/qa1788/_index.html">Source</a>. <a href="#anchor6">↩</a></p>
</div>]]></content><author><name>Nora Sandler</name></author><category term="compiler-tutorial" /><summary type="html"><![CDATA[This is the tenth post in a series. Read part 1 here.]]></summary></entry><entry><title type="html">C Compiler, Part 9: Functions</title><link href="https://norasandler.com/2018/06/27/Write-a-Compiler-9.html" rel="alternate" type="text/html" title="C Compiler, Part 9: Functions" /><published>2018-06-27T20:00:00+00:00</published><updated>2018-06-27T20:00:00+00:00</updated><id>https://norasandler.com/2018/06/27/Write-a-Compiler-9</id><content type="html" xml:base="https://norasandler.com/2018/06/27/Write-a-Compiler-9.html"><![CDATA[<p><em>This is the ninth post in a series. Read part 1 <a href="/2017/11/29/Write-a-Compiler.html">here</a>.</em></p>

<p>In this post we’re adding function calls! This is a particularly exciting post because we get to talk about calling conventions and stack frames and some weird corners of the C11 standard. Plus, by the end of this post we’ll be able to compile “Hello, World!” 🎉</p>

<p>As usual, accompanying tests are <a href="https://github.com/nlsandler/write_a_c_compiler">here</a>.</p>

<h1 id="part-9-functions">Part 9: Functions</h1>

<p>Of course, our compiler can already handle function definitions, because we can already define <code class="language-plaintext highlighter-rouge">main</code>.
But in this post, we’ll add support for function calls:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">three</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="mi">3</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">three</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We’ll also add support for function parameters:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">sum</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">sum</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>And for forward declarations:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">sum</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">);</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">sum</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">sum</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="terminology">Terminology</h3>

<ul>
  <li>
    <p>A function <strong>declaration</strong> specifies a function’s name, return type, and optionally its parameter list:</p>

    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">foo</span><span class="p">();</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>A function <strong>prototype</strong> is a special type of function declaration that includes parameter type information:</p>

    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">);</span>
</code></pre></div>    </div>
    <p>Function prototypes are the only function declarations we’ll support, even in places where the C11 standard allows non-prototype declarations.</p>
  </li>
  <li>
    <p>A function <strong>definition</strong> is a declaration plus a function body:</p>

    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>
</code></pre></div>    </div>

    <p>Note that you can declare a function as many times as you like, but you can only define it once<sup id="anchor1"><a href="#fn1">1</a></sup>. Also note that whenever we say “all function declarations,” that includes function declarations that are part of function definitions.</p>
  </li>
  <li>
    <p>A <strong>forward declaration</strong> is a function declaration without a function body. It tells the compiler you’re going to define the function later, possibly in a different file, and lets you use a function before it’s defined.</p>

    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">);</span>
</code></pre></div>    </div>

    <p>You can also declare a function that has already been defined. This is legal but technically not a forward declaration…I guess it’s a backwards declaration? It would also be pretty pointless:</p>

    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">foo</span><span class="p">()</span> <span class="p">{</span>
      <span class="k">return</span> <span class="mi">4</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="kt">int</span> <span class="nf">foo</span><span class="p">();</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>A function’s <strong>arguments</strong> are the values passed to a function call. A function’s <strong>parameters</strong> are the variables defined in the function declaration. In this code snippet, <code class="language-plaintext highlighter-rouge">a</code> is a parameter and <code class="language-plaintext highlighter-rouge">3</code> is an argument:</p>

    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
      <span class="k">return</span> <span class="n">foo</span><span class="p">(</span><span class="mi">3</span><span class="p">);</span>
  <span class="p">}</span>
</code></pre></div>    </div>
  </li>
</ul>

<h3 id="limitations">Limitations</h3>

<ul>
  <li>
    <p>For now, we’ll only support functions with return type <code class="language-plaintext highlighter-rouge">int</code> and parameters with type <code class="language-plaintext highlighter-rouge">int</code>.</p>
  </li>
  <li>
    <p>We won’t support function declarations with missing parameters or type information; in other words, we’ll require all function declarations to be function prototypes, whether or not they’re part of function definitions.</p>
  </li>
  <li>
    <p>We’ll interpret an empty parameter list (e.g. in the declaration <code class="language-plaintext highlighter-rouge">int foo()</code>) to mean that the function has no parameters. This deviates from the C11 standard; according to the standard, <code class="language-plaintext highlighter-rouge">int foo(void)</code> is a function prototype indicating <code class="language-plaintext highlighter-rouge">foo</code> has no parameters, and <code class="language-plaintext highlighter-rouge">int foo()</code> is a declaration where the parameters aren’t specified (i.e. not a function prototype).</p>
  </li>
  <li>
    <p>We won’t support function definitions using identifier-list form, which looks like this:</p>

    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="n">foo</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
  <span class="kt">int</span> <span class="n">a</span><span class="p">;</span>
  <span class="p">{</span>
      <span class="k">return</span> <span class="n">a</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
  <span class="p">}</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>We’ll require parameter names in function declarations. For example, we won’t support this:</p>

    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>We won’t support storage class specifiers (e.g. <code class="language-plaintext highlighter-rouge">extern</code>, <code class="language-plaintext highlighter-rouge">static</code>), type qualifiers (e.g. <code class="language-plaintext highlighter-rouge">const</code>, <code class="language-plaintext highlighter-rouge">_Atomic</code>), function specifiers (<code class="language-plaintext highlighter-rouge">inline</code>, <code class="language-plaintext highlighter-rouge">_Noreturn</code>) or alignment specifiers (<code class="language-plaintext highlighter-rouge">_Alignas</code>).</p>
  </li>
</ul>

<h2 id="lexing">Lexing</h2>

<p>Nothing fancy here; we just need to add commas to separate the function arguments. Here’s the full list of tokens so far:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">{</code></li>
  <li><code class="language-plaintext highlighter-rouge">}</code></li>
  <li><code class="language-plaintext highlighter-rouge">(</code></li>
  <li><code class="language-plaintext highlighter-rouge">)</code></li>
  <li><code class="language-plaintext highlighter-rouge">;</code></li>
  <li><code class="language-plaintext highlighter-rouge">int</code></li>
  <li><code class="language-plaintext highlighter-rouge">return</code></li>
  <li>Identifier <code class="language-plaintext highlighter-rouge">[a-zA-Z]\w*</code></li>
  <li>Integer literal <code class="language-plaintext highlighter-rouge">[0-9]+</code></li>
  <li><code class="language-plaintext highlighter-rouge">-</code></li>
  <li><code class="language-plaintext highlighter-rouge">~</code></li>
  <li><code class="language-plaintext highlighter-rouge">!</code></li>
  <li><code class="language-plaintext highlighter-rouge">+</code></li>
  <li><code class="language-plaintext highlighter-rouge">*</code></li>
  <li><code class="language-plaintext highlighter-rouge">/</code></li>
  <li><code class="language-plaintext highlighter-rouge">&amp;&amp;</code></li>
  <li><code class="language-plaintext highlighter-rouge">||</code></li>
  <li><code class="language-plaintext highlighter-rouge">==</code></li>
  <li><code class="language-plaintext highlighter-rouge">!=</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;=</code></li>
  <li><code class="language-plaintext highlighter-rouge">&gt;</code></li>
  <li><code class="language-plaintext highlighter-rouge">&gt;=</code></li>
  <li><code class="language-plaintext highlighter-rouge">=</code></li>
  <li><code class="language-plaintext highlighter-rouge">if</code></li>
  <li><code class="language-plaintext highlighter-rouge">else</code></li>
  <li><code class="language-plaintext highlighter-rouge">:</code></li>
  <li><code class="language-plaintext highlighter-rouge">?</code></li>
  <li><code class="language-plaintext highlighter-rouge">for</code></li>
  <li><code class="language-plaintext highlighter-rouge">while</code></li>
  <li><code class="language-plaintext highlighter-rouge">do</code></li>
  <li><code class="language-plaintext highlighter-rouge">break</code></li>
  <li><code class="language-plaintext highlighter-rouge">continue</code></li>
  <li><strong><code class="language-plaintext highlighter-rouge">,</code></strong></li>
</ul>

<h4 id="-task">☑ Task:</h4>
<p>Add support for commas to the lexer.</p>
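<p>In a hand-written lexer this is usually a one-line change. Here’s a stripped-down sketch that only knows about parentheses, commas, identifiers and integer literals; the token names and the <code class="language-plaintext highlighter-rouge">next_token</code> function are hypothetical, and a real lexer for this series would handle every token in the list above.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum tok_kind {
    TOK_LPAREN, TOK_RPAREN, TOK_COMMA, TOK_ID, TOK_INT, TOK_EOF, TOK_ERROR
};

struct token {
    enum tok_kind kind;
    const char *start;   /* first character of the token */
    size_t len;          /* number of characters in the token */
};

/* Read one token starting at *src and advance *src past it. */
struct token next_token(const char **src)
{
    assert(src != NULL && *src != NULL);
    const char *p = *src;
    struct token t;

    while (isspace((unsigned char)*p))
        p++;
    t.start = p;
    t.len = 1;
    if (*p == '\0') {
        t.kind = TOK_EOF;
        t.len = 0;
    } else if (*p == '(') {
        t.kind = TOK_LPAREN;
        p++;
    } else if (*p == ')') {
        t.kind = TOK_RPAREN;
        p++;
    } else if (*p == ',') {          /* the new token for this stage */
        t.kind = TOK_COMMA;
        p++;
    } else if (isalpha((unsigned char)*p)) {
        /* Identifier: [a-zA-Z]\w* */
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        t.kind = TOK_ID;
        t.len = (size_t)(p - t.start);
    } else if (isdigit((unsigned char)*p)) {
        /* Integer literal: [0-9]+ */
        while (isdigit((unsigned char)*p))
            p++;
        t.kind = TOK_INT;
        t.len = (size_t)(p - t.start);
    } else {
        t.kind = TOK_ERROR;
        p++;
    }
    *src = p;
    return t;
}
```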

<h2 id="parsing">Parsing</h2>

<p>We’ll deal with function definitions first, then function calls.</p>

<h3 id="function-definitions">Function Definitions</h3>

<p>In our old definition, a function just had a name and a body:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function_declaration = Function(string, block_item list) //string is the function name
</code></pre></div></div>

<p>Now we need to add a list of parameters. We also need to support declarations that don’t include a function body. I defined a single <code class="language-plaintext highlighter-rouge">function_declaration</code> AST rule, with an optional function body, to represent both declarations and definitions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function_declaration = Function(string, // function name
                                string list, // parameters
                                block_item list option) // body
</code></pre></div></div>

<p>But you could also have different rules for function declarations and definitions if you wanted.</p>

<p>Note that we don’t include the function’s return type or parameter types, because right now <code class="language-plaintext highlighter-rouge">int</code> is the only type. We’ll need to expand this definition when we add other types.</p>

<p>We also need to update the grammar. Here was the old <code class="language-plaintext highlighter-rouge">&lt;function&gt;</code> grammar rule:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;function&gt; ::= "int" &lt;id&gt; "(" ")" "{" { &lt;block-item&gt; } "}"
</code></pre></div></div>

<p>And here’s the new one. Note that the function declaration ends with either a function body (if it’s a definition) or a semicolon (if it’s not).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;function&gt; ::= "int" &lt;id&gt; "(" [ "int" &lt;id&gt; { "," "int" &lt;id&gt; } ] ")" ( "{" { &lt;block-item&gt; } "}" | ";" )
</code></pre></div></div>
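<p>Here’s a sketch of how the new <code class="language-plaintext highlighter-rouge">&lt;function&gt;</code> rule might translate into a recursive-descent parser. To keep it short, it assumes the tokens arrive as a NULL-terminated array of strings, and it skips over the function body by matching braces instead of parsing block items. The <code class="language-plaintext highlighter-rouge">fun_decl</code> struct and helper names are hypothetical.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_PARAMS 16

struct fun_decl {
    const char *name;
    const char *params[MAX_PARAMS];
    size_t nparams;
    int has_body;        /* 1 for a definition, 0 for a declaration */
};

/* Simplification: anything starting with a letter, other than "int",
 * counts as an identifier. Other keywords are not filtered out here. */
static int is_id(const char *tok)
{
    return tok != NULL && isalpha((unsigned char)tok[0])
        && strcmp(tok, "int") != 0;
}

/* Consume toks[*pos] if it equals want; report whether it did. */
static int accept(const char **toks, size_t *pos, const char *want)
{
    if (toks[*pos] == NULL || strcmp(toks[*pos], want) != 0)
        return 0;
    (*pos)++;
    return 1;
}

/* "int" <id> */
static int parse_param(const char **toks, size_t *pos, struct fun_decl *f)
{
    if (!accept(toks, pos, "int") || !is_id(toks[*pos])
        || f->nparams == MAX_PARAMS)
        return 0;
    f->params[f->nparams++] = toks[(*pos)++];
    return 1;
}

/* Parse one <function>. Returns 1 on success, 0 on a syntax error. */
int parse_function(const char **toks, size_t *pos, struct fun_decl *f)
{
    assert(toks != NULL && pos != NULL && f != NULL);
    f->nparams = 0;
    f->has_body = 0;
    if (!accept(toks, pos, "int") || !is_id(toks[*pos]))
        return 0;
    f->name = toks[(*pos)++];
    if (!accept(toks, pos, "("))
        return 0;
    if (!accept(toks, pos, ")")) {       /* non-empty parameter list */
        do {
            if (!parse_param(toks, pos, f))
                return 0;
        } while (accept(toks, pos, ","));
        if (!accept(toks, pos, ")"))
            return 0;
    }
    if (accept(toks, pos, ";"))
        return 1;                        /* declaration without a body */
    if (!accept(toks, pos, "{"))
        return 0;
    /* Placeholder for { <block-item> }: skip to the matching brace. */
    int depth = 1;
    while (depth > 0) {
        if (toks[*pos] == NULL)
            return 0;                    /* unterminated body */
        if (strcmp(toks[*pos], "{") == 0)
            depth++;
        else if (strcmp(toks[*pos], "}") == 0)
            depth--;
        (*pos)++;
    }
    f->has_body = 1;
    return 1;
}
```

<p>Note that a declaration like <code class="language-plaintext highlighter-rouge">int foo(int, int);</code> fails in <code class="language-plaintext highlighter-rouge">parse_param</code>, which matches the limitation that parameter names are required.</p>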

<h3 id="function-calls">Function Calls</h3>

<p>A function call is an expression that looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">foo</span><span class="p">(</span><span class="n">arg1</span><span class="p">,</span> <span class="n">arg2</span><span class="p">)</span>
</code></pre></div></div>

<p>It has an ID (the function name) and a list of arguments. Its arguments can be arbitrary expressions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">foo</span><span class="p">(</span><span class="n">arg1</span> <span class="o">+</span> <span class="mi">2</span><span class="p">,</span> <span class="n">bar</span><span class="p">())</span>
</code></pre></div></div>

<p>So we can update the AST definition for expressions like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exp = ...
    | FunCall(string, exp list) // string is the function name
    ...
</code></pre></div></div>

<p>We also need to update the grammar. Function calls have the highest possible precedence level, right up there with postfix unary operators.
So we’ll add them to the <code class="language-plaintext highlighter-rouge">&lt;factor&gt;</code> rule in the grammar:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;factor&gt; ::= &lt;function-call&gt; | "(" &lt;exp&gt; ")" | &lt;unary_op&gt; &lt;factor&gt; | &lt;int&gt; | &lt;id&gt;
&lt;function-call&gt; ::= &lt;id&gt; "(" [ &lt;exp&gt; { "," &lt;exp&gt; } ] ")"
</code></pre></div></div>
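<p>One practical wrinkle: <code class="language-plaintext highlighter-rouge">&lt;function-call&gt;</code> and <code class="language-plaintext highlighter-rouge">&lt;id&gt;</code> both start with an identifier, so the parser has to look at the next token to tell them apart. Here’s a sketch of that, written as a recognizer over a NULL-terminated array of token strings. It’s deliberately incomplete: unary operators are left out, a chain of <code class="language-plaintext highlighter-rouge">+</code> stands in for the full <code class="language-plaintext highlighter-rouge">&lt;exp&gt;</code> rule, and all the names are hypothetical.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

struct parser {
    const char **toks;   /* NULL-terminated token array */
    size_t pos;
    int calls;           /* number of function calls recognized */
};

static int parse_exp(struct parser *p);

/* Consume the current token if it equals want; report whether it did. */
static int accept(struct parser *p, const char *want)
{
    const char *tok = p->toks[p->pos];
    if (tok == NULL || strcmp(tok, want) != 0)
        return 0;
    p->pos++;
    return 1;
}

/* <factor> ::= <function-call> | "(" <exp> ")" | <int> | <id> */
static int parse_factor(struct parser *p)
{
    const char *tok = p->toks[p->pos];
    if (tok == NULL)
        return 0;
    if (accept(p, "("))
        return parse_exp(p) && accept(p, ")");
    if (isdigit((unsigned char)tok[0])) {
        p->pos++;                        /* <int> */
        return 1;
    }
    if (!isalpha((unsigned char)tok[0]))
        return 0;
    p->pos++;                            /* consume the identifier */
    if (!accept(p, "("))
        return 1;                        /* no "(" follows: plain <id> */
    p->calls++;                          /* "(" follows: <function-call> */
    if (accept(p, ")"))
        return 1;                        /* empty argument list */
    do {
        if (!parse_exp(p))
            return 0;
    } while (accept(p, ","));
    return accept(p, ")");
}

/* Stand-in for the full <exp> rule: factors joined by "+". */
static int parse_exp(struct parser *p)
{
    if (!parse_factor(p))
        return 0;
    while (accept(p, "+"))
        if (!parse_factor(p))
            return 0;
    return 1;
}

/* Returns 1 if the whole token array is a single valid expression,
 * and stores the number of function calls it contains in *calls. */
int parse_expression(const char **toks, int *calls)
{
    assert(toks != NULL && calls != NULL);
    struct parser p = { toks, 0, 0 };
    int ok = parse_exp(&p) && p.toks[p.pos] == NULL;
    *calls = p.calls;
    return ok;
}
```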

<h3 id="top-level">Top Level</h3>

<p>In our old definition, a program consisted of a single function definition. Now it needs to permit multiple function declarations:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program = Program(function_declaration list)
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;program&gt; ::= { &lt;function&gt; }
</code></pre></div></div>
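<p>Here’s a rough Python sketch of how the top-level loop and the “prototype or definition” choice could fit together. It assumes tokens are plain strings and, to stay short, it doesn’t parse function bodies at all: it just collects the tokens between the matching braces. The node shape <code class="language-plaintext highlighter-rouge">("FunDecl", name, params, body)</code> and all the function names are hypothetical.</p>

```python
# Hypothetical sketch: bodies are kept as raw token lists instead of being parsed.

def peek(tokens, pos):
    """Return the token at pos, or None if we've run out of input."""
    return tokens[pos] if pos in range(len(tokens)) else None

def expect(tokens, pos, wanted):
    """Raise a syntax error unless the token at pos is the one we want."""
    if peek(tokens, pos) != wanted:
        raise SyntaxError("expected {!r} at token {}".format(wanted, pos))

def parse_function(tokens, pos):
    """Parse one declaration or definition; return (node, next_pos)."""
    expect(tokens, pos, "int")
    name = peek(tokens, pos + 1)
    expect(tokens, pos + 2, "(")
    pos += 3
    params = []
    while peek(tokens, pos) != ")":
        if params:                      # every parameter after the first needs a comma
            expect(tokens, pos, ",")
            pos += 1
        expect(tokens, pos, "int")
        params.append(peek(tokens, pos + 1))
        pos += 2
    pos += 1                            # skip ")"
    if peek(tokens, pos) == ";":        # prototype: no body
        return ("FunDecl", name, params, None), pos + 1
    expect(tokens, pos, "{")
    start = pos + 1
    pos, depth = start, 1
    while depth != 0:                   # find the brace that closes the body
        tok = peek(tokens, pos)
        if tok is None:
            raise SyntaxError("unterminated body of " + str(name))
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
        pos += 1
    return ("FunDecl", name, params, tokens[start:pos - 1]), pos

def parse_program(tokens):
    """A program is zero or more function declarations, one after another."""
    functions, pos = [], 0
    while pos != len(tokens):
        function, pos = parse_function(tokens, pos)
        functions.append(function)
    return ("Program", functions)
```

<p>A real parser would call its block-item parser where this sketch scans for the closing brace, and would also check that each name is a valid identifier.</p>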

<h4 id="-task-1">☑ Task:</h4>
<p>Update parsing to succeed on all valid stage 1-9 examples. You may or may not want to handle invalid examples here: see the next section on validation.</p>

<h2 id="validation">Validation</h2>

<p>We need to validate that the function declarations and calls in our program are legal. You can either handle these checks during code generation, or add a new validation pass between parsing and code generation. <strong>Edited to add:</strong> I previously recommended performing validation during the parsing stage. This turns out to be a bad idea, because this will become increasingly cumbersome as we need to validate more things in future posts.</p>

<p>Your compiler must fail if:</p>

<ul>
  <li>
    <p>The program includes two definitions of the same function name.</p>

    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(){</span>
      <span class="k">return</span> <span class="mi">3</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">){</span>
      <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>Two declarations of a function have different numbers of parameters. Different parameter names are okay, though.</p>

    <p>This is illegal<sup id="anchor2"><a href="#fn2">2</a></sup>:</p>

    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">);</span>

  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">){</span>
      <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>
</code></pre></div>    </div>

    <p>But this is okay:</p>

    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">);</span>

  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">b</span><span class="p">){</span>
      <span class="k">return</span> <span class="n">b</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>A function is called with the wrong number of arguments, e.g.</p>

    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">){</span>
      <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
      <span class="k">return</span> <span class="n">foo</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">);</span>
  <span class="p">}</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>Optionally, you may want to fail if a function is called before it’s declared. Note that it’s totally legal to call a function that has been declared but not defined. It’s also legal to declare a function and <em>never</em> define it; however, linking will fail if the function isn’t defined in some other library the linker can find<sup id="anchor3"><a href="#fn3">3</a></sup>.</p>

    <p>So this is illegal:</p>
    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
      <span class="k">return</span> <span class="n">putchar</span><span class="p">(</span><span class="mi">65</span><span class="p">);</span>
  <span class="p">}</span>

  <span class="kt">int</span> <span class="nf">foo</span><span class="p">(){</span>
      <span class="k">return</span> <span class="mi">3</span><span class="p">;</span>
  <span class="p">}</span>
</code></pre></div>    </div>

    <p>But this is legal:</p>
    <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">int</span> <span class="nf">putchar</span><span class="p">(</span><span class="kt">int</span> <span class="n">c</span><span class="p">);</span>

  <span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
      <span class="n">putchar</span><span class="p">(</span><span class="mi">65</span><span class="p">);</span>
  <span class="p">}</span>
</code></pre></div>    </div>

    <p>This last point is optional because neither GCC nor clang enforces it — they both warn but don’t fail on the illegal example above. Calling a function before it’s declared is called “implicit function declaration” and it was legal before C99, so I guess enforcing this rule would have broken a lot of older code. The test suite doesn’t include any implicit function declarations, so you can handle it however you like and you can still pass all the tests.</p>
  </li>
</ul>

<h4 id="-task-2">☑ Task:</h4>
<p>Update your compiler to fail on invalid stage 1-9 examples. You can handle this during code generation, or in a new pass between parsing and code generation. Bonus points for useful error messages.</p>

<p>To handle this, you’ll probably want to traverse the tree and maintain a map to track the number of arguments to each function, and whether that function has been defined yet.</p>
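<p>One possible shape for that traversal, sketched in Python. Everything here is an assumption made for illustration: the program is a list of <code class="language-plaintext highlighter-rouge">(name, params, body)</code> triples, a prototype has <code class="language-plaintext highlighter-rouge">None</code> as its body, calls are <code class="language-plaintext highlighter-rouge">("FunCall", name, args)</code> tuples nested somewhere inside the body, and <code class="language-plaintext highlighter-rouge">ValidationError</code>, <code class="language-plaintext highlighter-rouge">find_calls</code> and <code class="language-plaintext highlighter-rouge">validate</code> are made-up names. The <code class="language-plaintext highlighter-rouge">require_declaration</code> flag switches the optional “called before it’s declared” check on or off.</p>

```python
class ValidationError(Exception):
    """Raised when a declaration or call breaks one of the rules above."""

def find_calls(node):
    """Yield (name, arg_count) for every FunCall nested anywhere in node."""
    if isinstance(node, tuple) and node[:1] == ("FunCall",):
        yield node[1], len(node[2])
    if isinstance(node, (tuple, list)):
        for child in node:
            yield from find_calls(child)

def validate(functions, require_declaration=True):
    """Check a program given as a list of (name, params, body) triples."""
    seen = {}  # function name: [parameter count, has a definition been seen?]
    for name, params, body in functions:
        entry = seen.setdefault(name, [len(params), False])
        if entry[0] != len(params):
            raise ValidationError("conflicting parameter counts for " + name)
        if body is not None:
            if entry[1]:
                raise ValidationError("redefinition of " + name)
            entry[1] = True
        # The function is recorded before its body is checked, so recursion works.
        for callee, arg_count in find_calls(body):
            if callee not in seen:
                if require_declaration:
                    raise ValidationError(callee + " called before declaration")
                continue  # implicit declaration: nothing to compare against
            if seen[callee][0] != arg_count:
                raise ValidationError("wrong number of arguments to " + callee)
```

<p>Processing declarations in source order is what makes the “before it’s declared” check fall out naturally: a callee is only in the map if some earlier declaration put it there.</p>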

<h2 id="code-generation">Code Generation</h2>

<p>Once again, we’ll handle function definitions first, then function calls. But before we do any of that, let’s discuss…</p>

<h3 id="calling-conventions">Calling Conventions</h3>

<p>In most of the examples above, we defined a function and then called it in the same file. But we also want to call functions from shared libraries; in particular, we want to call the standard library so we can use its I/O functions to write “Hello, World”. When you use a shared library, you generally don’t recompile it yourself; you link to a precompiled binary. We definitely don’t want to recompile the whole standard library! That means we need to generate machine code that can interact with object files built by other compilers. In earlier posts, I’ve often said “this isn’t how a real compiler would do this thing, but it works.” In this post, we <em>have</em> to do things the same way as everyone else or we can’t use prebuilt libraries.</p>

<p>In other words, we need to follow the appropriate <em>calling convention</em>. A calling convention answers questions like:</p>

<ul>
  <li>How are arguments passed to the callee? Are they passed in registers or on the stack?</li>
  <li>Is the caller or callee responsible for removing arguments from the stack after the callee has executed?</li>
  <li>How are return values passed back to the caller?</li>
  <li>Which registers are caller-saved and which are callee-saved<sup id="anchor4"><a href="#fn4">4</a></sup>?</li>
</ul>

<p>C programs on 32-bit OS X, Linux, and other Unix-like systems use the <code class="language-plaintext highlighter-rouge">cdecl</code> calling convention<sup id="anchor5"><a href="#fn5">5</a></sup>, which means:</p>

<ul>
  <li>Arguments are passed on the stack. They’re pushed on the stack from right to left (so the first function argument is at the lowest address).</li>
  <li>The caller cleans the arguments from the stack.</li>
  <li>Return values are passed in the <code class="language-plaintext highlighter-rouge">EAX</code> register. (The full answer is more complicated, but this is good enough as long as we can only return integers.)</li>
  <li>The <code class="language-plaintext highlighter-rouge">EAX</code>, <code class="language-plaintext highlighter-rouge">ECX</code>, and <code class="language-plaintext highlighter-rouge">EDX</code> registers are caller-saved, and all others are callee-saved. We’ll see in the next section that the callee has to restore <code class="language-plaintext highlighter-rouge">EBP</code> and <code class="language-plaintext highlighter-rouge">ESP</code> before it returns, and restores <code class="language-plaintext highlighter-rouge">EIP</code> with the <code class="language-plaintext highlighter-rouge">ret</code> instruction. Normally it would also need to restore <code class="language-plaintext highlighter-rouge">ESI</code>, <code class="language-plaintext highlighter-rouge">EDI</code>, and <code class="language-plaintext highlighter-rouge">EBX</code>, but we don’t actually use these registers. And we already push values from <code class="language-plaintext highlighter-rouge">EAX</code>, <code class="language-plaintext highlighter-rouge">ECX</code>, and <code class="language-plaintext highlighter-rouge">EDX</code> onto the stack right away if we’re going to need them later. So basically, we don’t have to worry about saving and restoring registers at all.</li>
</ul>

<p>There are two important differences between OS X and Linux:</p>

<ul>
  <li>Stack alignment. On OS X, the stack needs to be 16-byte aligned at the beginning of a function call (i.e. when the <code class="language-plaintext highlighter-rouge">call</code> instruction is issued)<sup id="anchor6"><a href="#fn6">6</a></sup>. This isn’t required on Linux, but GCC still keeps the stack 16-byte aligned<sup id="anchor7"><a href="#fn7">7</a></sup>.</li>
  <li>Name decoration. On OS X, function names in assembly are prefixed with an underscore (e.g. <code class="language-plaintext highlighter-rouge">main</code> becomes <code class="language-plaintext highlighter-rouge">_main</code>). On systems that use the ELF file format (Linux and most other *nix systems), there’s no underscore. This isn’t part of the calling convention per se, but it is important.</li>
</ul>
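<p>If your compiler is meant to run on both platforms, the name decoration rule is easy to isolate in one helper. Here’s a small Python sketch; <code class="language-plaintext highlighter-rouge">asm_symbol</code> is a made-up name, and the sketch assumes the compiler targets the platform it’s running on.</p>

```python
import sys

def asm_symbol(name, platform=sys.platform):
    """Return the assembly-level label for a C function name."""
    # Python reports OS X as "darwin". Every other platform is treated as ELF
    # here, which is an assumption of this sketch.
    if platform == "darwin":
        return "_" + name
    return name
```

<p>With this in place, the rest of code generation can call <code class="language-plaintext highlighter-rouge">asm_symbol("main")</code> wherever it emits a label or a <code class="language-plaintext highlighter-rouge">call</code> target, instead of hard-coding the underscore.</p>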

<p>We’ll need to be really comfortable with all this to implement it ourselves, so let’s look at…</p>

<h3 id="cdecl-function-calls-in-excruciating-detail">cdecl Function Calls in Excruciating Detail</h3>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">foo</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">);</span>
</code></pre></div></div>

<p>What, exactly, happens when your computer executes this line of code? We touched on this in <a href="/2018/01/08/Write-a-Compiler-5.html">part 5</a>, but now we’ll dig into it a lot more. We won’t worry about keeping the stack 16-byte aligned for now.</p>

<p>We’ll say that <code class="language-plaintext highlighter-rouge">foo</code> is being called from another function, <code class="language-plaintext highlighter-rouge">bar</code>. The line of C above will get turned into this assembly:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">push</span> <span class="kc">$</span><span class="mi">3</span>
<span class="nf">push</span> <span class="kc">$</span><span class="mi">2</span>
<span class="nf">push</span> <span class="kc">$</span><span class="mi">1</span>
<span class="nf">call</span> <span class="nv">_foo</span>
<span class="nf">add</span> <span class="kc">$</span><span class="mh">0xc</span><span class="p">,</span> <span class="o">%</span><span class="nb">esp</span>
</code></pre></div></div>

<p>First, let’s look at the state of the world before we start calling <code class="language-plaintext highlighter-rouge">foo</code><sup id="anchor8"><a href="#fn8">8</a></sup>:</p>

<p><img src="/assets/before_function_call.svg" alt="EBP points at the base of bar's stack frame at 0x14. ESP is 4 bytes below it at 0x10. EIP points at &quot;pushl $3&quot;." /></p>

<p>One chunk of memory contains the stack frame, which we’re already familiar with. The <code class="language-plaintext highlighter-rouge">EBP</code> and <code class="language-plaintext highlighter-rouge">ESP</code> registers point to the bottom and top of the stack frame, respectively, so the processor can figure out where the stack is.</p>

<p>Another chunk of memory, which we haven’t talked about yet, contains the CPU instructions being executed. The <code class="language-plaintext highlighter-rouge">EIP</code> register contains the memory address of the current instruction. To advance to the next instruction, the CPU just increments <code class="language-plaintext highlighter-rouge">EIP</code><sup id="anchor9"><a href="#fn9">9</a></sup>. The <code class="language-plaintext highlighter-rouge">call</code> instruction, and all the jump instructions we’ve already encountered, work by manipulating EIP. In these diagrams I’ll show <code class="language-plaintext highlighter-rouge">EIP</code> pointing to the instruction we’re about to execute.</p>

<p>When <code class="language-plaintext highlighter-rouge">bar</code> wants to call <code class="language-plaintext highlighter-rouge">foo</code>, the first step is putting the function arguments on the stack where <code class="language-plaintext highlighter-rouge">foo</code> can find them<sup id="anchor10"><a href="#fn10">10</a></sup>. They’re pushed onto the stack in reverse order<sup id="anchor11"><a href="#fn11">11</a></sup>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">push</span> <span class="kc">$</span><span class="mi">3</span>
<span class="nf">push</span> <span class="kc">$</span><span class="mi">2</span>
<span class="nf">push</span> <span class="kc">$</span><span class="mi">1</span>
</code></pre></div></div>

<p>Which means the world now looks like this:</p>

<p><img src="/assets/before_function_call_args_pushed.svg" alt="Values 3, 2, and 1 have been pushed onto the stack, in that order. ESP points to memory address 0x20, which holds value 1. EIP points at &quot;call _foo&quot;. EBP is unchanged." /></p>

<p>Next <code class="language-plaintext highlighter-rouge">bar</code> issues the <code class="language-plaintext highlighter-rouge">call</code> instruction, which does two things:</p>

<ol>
  <li>Push the address of the instruction <em>after</em> <code class="language-plaintext highlighter-rouge">call</code> (the “return address”) onto the stack.</li>
  <li>Jump to <code class="language-plaintext highlighter-rouge">_foo</code> (by moving the address of <code class="language-plaintext highlighter-rouge">_foo</code> into <code class="language-plaintext highlighter-rouge">EIP</code>).</li>
</ol>

<p>Now the world looks like this:</p>

<p><img src="/assets/after_call.svg" alt="ESP points to 0x24, which holds the return address: address of the instruction just after &quot;call _foo&quot;. EIP points to the first instruction in _foo. EBP is unchanged." /></p>

<p>Okay, we’re officially in <code class="language-plaintext highlighter-rouge">foo</code> now. Next step is the function prologue to set up a new stack frame:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">push</span> <span class="o">%</span><span class="nb">ebp</span>
<span class="nf">mov</span> <span class="o">%</span><span class="nb">esp</span><span class="p">,</span> <span class="o">%</span><span class="nb">ebp</span>
</code></pre></div></div>

<p><img src="/assets/after_function_prologue.svg" alt="ESP and EBP both point at 0x28, which holds the previous value of EBP (0x10)." /></p>

<p>Now we can execute the body of <code class="language-plaintext highlighter-rouge">foo</code>. We can access its parameters because they’re at a predictable location on the stack relative to <code class="language-plaintext highlighter-rouge">EBP</code>: <code class="language-plaintext highlighter-rouge">%ebp + 0x8</code>, <code class="language-plaintext highlighter-rouge">%ebp + 0xc</code>, and <code class="language-plaintext highlighter-rouge">%ebp + 0x10</code>, respectively.</p>

<p>Once we’ve done some things in <code class="language-plaintext highlighter-rouge">foo</code>, and placed a return value in <code class="language-plaintext highlighter-rouge">EAX</code>, it’s time to return to <code class="language-plaintext highlighter-rouge">bar</code>. Except for that return value, we want everything on the stack to be exactly the same as it was before the call. The first step is to run the function epilogue to restore the old stack frame:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span> <span class="o">%</span><span class="nb">ebp</span><span class="p">,</span> <span class="o">%</span><span class="nb">esp</span> <span class="c1">; deallocate any local variables on the stack</span>
<span class="nf">pop</span> <span class="o">%</span><span class="nb">ebp</span>        <span class="c1">; restore old EBP</span>
</code></pre></div></div>

<p>The stack now looks exactly the same as it did right after the <code class="language-plaintext highlighter-rouge">call</code> instruction, before the function prologue. That means the return address is on top of the stack again.</p>

<p>Then we execute the <code class="language-plaintext highlighter-rouge">ret</code> instruction, which pops the top value off the stack and jumps to it unconditionally (i.e. copies it into <code class="language-plaintext highlighter-rouge">EIP</code>).</p>

<p><img src="/assets/after_ret.svg" alt="ESP points at 0x20, which holds function argument 1. EIP points to the address of the instruction right after &quot;call _foo&quot;." /></p>

<p>Now we just have to remove the function arguments from the stack, and we’re done. No need to pop them off one by one; we can just adjust the value of <code class="language-plaintext highlighter-rouge">ESP</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">add</span> <span class="kc">$</span><span class="mh">0xc</span><span class="p">,</span> <span class="o">%</span><span class="nb">esp</span>
</code></pre></div></div>

<p>Now the stack has been restored to exactly the way it was before the call, and we can proceed with the rest of <code class="language-plaintext highlighter-rouge">bar</code>.</p>
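<p>If you want to convince yourself that all the pushes and pops really do balance, the whole sequence can be replayed on a toy machine. This Python sketch models only the stack, <code class="language-plaintext highlighter-rouge">ESP</code> and <code class="language-plaintext highlighter-rouge">EBP</code>; the starting addresses, the <code class="language-plaintext highlighter-rouge">RETURN_ADDRESS</code> value, and the body of <code class="language-plaintext highlighter-rouge">foo</code> are all made up, and in this model the stack grows toward lower addresses.</p>

```python
RETURN_ADDRESS = 0xCAFE  # made-up address of the instruction after "call _foo"

class Machine:
    """A toy model of the stack plus the ESP and EBP registers."""

    def __init__(self, esp, ebp):
        self.memory = {}  # address: 4-byte value stored there
        self.esp = esp
        self.ebp = ebp

    def push(self, value):
        self.esp -= 4  # the stack grows toward lower addresses
        self.memory[self.esp] = value

    def pop(self):
        value = self.memory[self.esp]
        self.esp += 4
        return value

def call_foo(m):
    """Replay foo(1, 2, 3) on machine m and return (eax, eip) afterwards."""
    # Caller (bar): push the arguments right to left, then "call _foo".
    for arg in (3, 2, 1):
        m.push(arg)
    m.push(RETURN_ADDRESS)
    # Callee (foo): function prologue.
    m.push(m.ebp)
    m.ebp = m.esp
    # Callee body: the parameters sit at fixed offsets from EBP.
    a, b, c = (m.memory[m.ebp + offset] for offset in (0x8, 0xC, 0x10))
    eax = a * 100 + b * 10 + c  # a return value that depends on argument order
    # Callee: function epilogue, then "ret".
    m.esp = m.ebp
    m.ebp = m.pop()
    eip = m.pop()
    # Caller: remove the three arguments.
    m.esp += 0xC
    return eax, eip
```

<p>Starting from <code class="language-plaintext highlighter-rouge">Machine(esp=0x1000, ebp=0x1004)</code>, <code class="language-plaintext highlighter-rouge">call_foo</code> hands back the made-up return address as the new <code class="language-plaintext highlighter-rouge">EIP</code>, and both registers end up with the values they started with.</p>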

<p>And now we’re finally ready to implement the code-generation stage of the compiler!</p>

<h3 id="function-definitions-1">Function Definitions</h3>

<p>As with <code class="language-plaintext highlighter-rouge">main</code>, we want to make each function global (so it can be called from other files) and label it:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">.globl</span> <span class="nv">_fun</span>
<span class="nl">_fun:</span>
</code></pre></div></div>

<p>Make sure to include the leading underscore before the function name if you’re on OS X, and not otherwise.</p>

<p>We already know how to generate the function prologue and epilogue, because that’s also exactly the same as <code class="language-plaintext highlighter-rouge">main</code>. We just need to add all the function parameters to <code class="language-plaintext highlighter-rouge">var_map</code> and <code class="language-plaintext highlighter-rouge">current_scope</code>. As we saw above, the first parameter will be at <code class="language-plaintext highlighter-rouge">ebp + 8</code>, and each subsequent parameter will be four bytes higher than the last:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>param_offset = 8 // first parameter is at EBP + 8
for each function parameter:
    var_map.put(parameter, param_offset)
    current_scope.add(parameter)
    param_offset += 4
</code></pre></div></div>

<p>Then parameters get handled like any other variable in the function body.</p>
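<p>For reference, here’s roughly what that loop might look like as runnable Python. It assumes <code class="language-plaintext highlighter-rouge">var_map</code> is a dictionary from names to <code class="language-plaintext highlighter-rouge">EBP</code> offsets and <code class="language-plaintext highlighter-rouge">current_scope</code> is a set of names; <code class="language-plaintext highlighter-rouge">bind_parameters</code> and <code class="language-plaintext highlighter-rouge">operand</code> are hypothetical helper names.</p>

```python
def bind_parameters(params):
    """Return (var_map, current_scope) for a function's parameter names."""
    var_map = {}
    current_scope = set()
    # EBP + 0 holds the caller's saved EBP and EBP + 4 holds the return
    # address, so the first parameter lives at EBP + 8.
    param_offset = 8
    for param in params:
        var_map[param] = param_offset
        current_scope.add(param)
        param_offset += 4
    return var_map, current_scope

def operand(var_map, name):
    """Format a variable reference as an AT&T-syntax memory operand."""
    return "{}(%ebp)".format(var_map[name])
```

<p>Local variables will still get negative offsets, so a parameter and a local variable can never collide in the map.</p>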

<h3 id="function-prototypes">Function Prototypes</h3>

<p>We don’t generate any assembly for function prototypes that aren’t part of definitions.</p>

<h3 id="function-calls-1">Function Calls</h3>

<p>As we saw above, the caller needs to:</p>

<ol>
  <li>
    <p>Put the arguments on the stack, in reverse order<sup id="anchor12"><a href="#fn12">12</a></sup>:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for each arg in reversed(function_call.arguments):
     generate_exp(arg) // puts arg in eax
     emit 'pushl %eax'
</code></pre></div>    </div>
  </li>
  <li>
    <p>Issue the <code class="language-plaintext highlighter-rouge">call</code> instruction.</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     emit 'call _{}'.format(function_name)
</code></pre></div>    </div>
  </li>
  <li>
    <p>Remove the arguments from the stack after the callee returns.</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     bytes_to_remove = 4 * number of function arguments
     emit 'addl ${}, %esp'.format(bytes_to_remove)
</code></pre></div>    </div>
  </li>
</ol>
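<p>Put together, the three steps might look something like this in Python. This is a sketch that ignores stack alignment: <code class="language-plaintext highlighter-rouge">generate_call</code> returns a list of assembly lines rather than emitting them, and <code class="language-plaintext highlighter-rouge">generate_const</code> is a stand-in for <code class="language-plaintext highlighter-rouge">generate_exp</code> that only understands integer constants. Both names, and the <code class="language-plaintext highlighter-rouge">underscore</code> parameter, are assumptions made for this example.</p>

```python
def generate_const(value):
    """Stand-in for generate_exp: handles integer constants only."""
    return ["movl ${}, %eax".format(value)]

def generate_call(function_name, arguments, generate_exp=generate_const, underscore=True):
    """Return the assembly lines for one function call expression."""
    lines = []
    # 1. Put the arguments on the stack, in reverse order.
    for arg in reversed(arguments):
        lines.extend(generate_exp(arg))  # leaves the argument's value in EAX
        lines.append("pushl %eax")
    # 2. Issue the call instruction (OS X wants the underscore, ELF doesn't).
    prefix = "_" if underscore else ""
    lines.append("call {}{}".format(prefix, function_name))
    # 3. Remove the arguments from the stack after the callee returns.
    bytes_to_remove = 4 * len(arguments)
    lines.append("addl ${}, %esp".format(bytes_to_remove))
    return lines
```

<p>For <code class="language-plaintext highlighter-rouge">foo(1, 2, 3)</code> this produces the same shape as the walkthrough above, except that each argument takes a detour through <code class="language-plaintext highlighter-rouge">EAX</code> before it’s pushed.</p>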

<h4 id="stack-alignment">Stack Alignment</h4>
<p>On OS X, the stack needs to be 16-byte aligned when the call instruction is issued. A normal C compiler would know exactly how much padding to add to maintain that alignment. But because we push intermediate results of expressions onto the stack, and function calls can occur within larger expressions, we have no idea where the stack pointer is when we encounter a function call. My solution was to emit assembly just before each function call that calculates how much padding is needed, subtracts from ESP accordingly, and then pushes the result of the padding calculation onto the stack, all before putting the function arguments on the stack. After the function returns, the caller first removes the arguments, then pops off the result of the padding calculation, and finally adds that value to ESP to restore it to its original state.</p>

<p>Here’s the assembly to do that:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">movl</span> <span class="o">%</span><span class="nb">esp</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>
    <span class="nf">subl</span> <span class="kc">$</span><span class="nv">n</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>    <span class="c1">; n = (4*(arg_count + 1)), # of bytes allocated for arguments + padding value itself</span>
                     <span class="c1">; eax now contains the value ESP will have when call instruction is executed</span>
    <span class="nf">xorl</span> <span class="o">%</span><span class="nb">edx</span><span class="p">,</span> <span class="o">%</span><span class="nb">edx</span>  <span class="c1">; zero out EDX, which will contain remainder of division</span>
    <span class="nf">movl</span> <span class="kc">$</span><span class="mh">0x10</span><span class="p">,</span> <span class="o">%</span><span class="nb">ecx</span> <span class="c1">; 0x10 = 16</span>
    <span class="nf">idivl</span> <span class="o">%</span><span class="nb">ecx</span>       <span class="c1">; calculate eax / 16. EDX contains remainder, i.e. # of bytes to subtract from ESP </span>
    <span class="nf">subl</span> <span class="o">%</span><span class="nb">edx</span><span class="p">,</span> <span class="o">%</span><span class="nb">esp</span>  <span class="c1">; pad ESP</span>
    <span class="nf">pushl</span> <span class="o">%</span><span class="nb">edx</span>       <span class="c1">; push padding result onto stack; we'll need it to deallocate padding later</span>
    <span class="c1">; ...push arguments, call function, remove arguments...</span>
    <span class="nf">popl</span> <span class="o">%</span><span class="nb">edx</span>        <span class="c1">; pop padding result</span>
    <span class="nf">addl</span> <span class="o">%</span><span class="nb">edx</span><span class="p">,</span> <span class="o">%</span><span class="nb">esp</span>  <span class="c1">; remove padding</span>
</code></pre></div></div>

<p>This solution is kind of hideous, so let me know if you come up with a better one.</p>
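<p>The arithmetic is easier to check outside of assembly. This Python sketch mirrors the calculation: it works out the padding for a given <code class="language-plaintext highlighter-rouge">ESP</code> and argument count, then replays the pushes to see where <code class="language-plaintext highlighter-rouge">ESP</code> lands when <code class="language-plaintext highlighter-rouge">call</code> is issued. The function names are made up, and the sketch assumes <code class="language-plaintext highlighter-rouge">ESP</code> is a non-negative multiple of 4.</p>

```python
def padding_for_call(esp, arg_count):
    """Bytes of padding to subtract from ESP before pushing anything."""
    n = 4 * (arg_count + 1)  # the arguments plus the saved padding value
    return (esp - n) % 16    # the remainder that ends up in EDX

def esp_at_call(esp, arg_count):
    """Replay the pushes and return ESP at the moment of the call."""
    padding = padding_for_call(esp, arg_count)
    esp -= padding           # subl %edx, %esp
    esp -= 4                 # pushl %edx
    esp -= 4 * arg_count     # one push per argument
    return esp

def esp_after_cleanup(esp, arg_count):
    """Undo everything in reverse; the result should be the original ESP."""
    padding = padding_for_call(esp, arg_count)
    at_call = esp_at_call(esp, arg_count)
    at_call += 4 * arg_count  # remove the arguments
    at_call += 4              # popl %edx
    return at_call + padding  # addl %edx, %esp
```

<p>For example, with <code class="language-plaintext highlighter-rouge">ESP</code> at 0x1008 and three arguments, <code class="language-plaintext highlighter-rouge">n</code> is 16, the padding is 8, and <code class="language-plaintext highlighter-rouge">ESP</code> is 0xFF0 when the call happens, which is a multiple of 16.</p>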

<h3 id="top-level-1">Top Level</h3>

<p>Obviously, you need to generate assembly for every function definition, not just one.</p>

<h4 id="-task-3">☑ Task:</h4>
<p>Update your compiler to handle all stage 9 examples. Make sure it produces the right return code <em>and</em>, for the “hello world” test case, the right output to stdout.</p>

<h2 id="fibonacci--hello-world">Fibonacci &amp; Hello, World!</h2>

<p>Now we can calculate Fibonacci numbers:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">fib</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">n</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">n</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">fib</span><span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">2</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">10</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We can also make calls to the standard library! Since we only know about ints, we can only call standard library functions where the parameters are all ints and the return value is also an int. Lucky for us, <code class="language-plaintext highlighter-rouge">putchar</code> is just such a function. For example, since the ASCII value of ‘A’ is 65, we could print ‘A’ to standard out like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">65</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And we can print out ‘Hello, World!’ like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">putchar</span><span class="p">(</span><span class="kt">int</span> <span class="n">c</span><span class="p">);</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">72</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">101</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">108</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">108</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">111</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">44</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">32</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">87</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">111</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">114</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">108</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">100</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">33</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="mi">10</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
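<p>Looking up ASCII codes by hand gets old fast. If you want more test programs like this one, a few lines of Python can write them for you. This is just a convenience sketch; <code class="language-plaintext highlighter-rouge">putchar_program</code> is a made-up helper, and it assumes the text is plain ASCII.</p>

```python
def putchar_program(text):
    """Return C source that prints text one character at a time."""
    lines = ["int putchar(int c);", "", "int main() {"]
    for character in text:
        lines.append("    putchar({});".format(ord(character)))
    lines.append("}")
    return "\n".join(lines)
```

<p>Calling <code class="language-plaintext highlighter-rouge">putchar_program("Hello, World!\n")</code> produces a program with the same fourteen <code class="language-plaintext highlighter-rouge">putchar</code> calls as the one above.</p>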

<h2 id="up-next">Up Next</h2>

<p>My next post or two won’t be about compilers. After that I’ll get back to this series, but I haven’t decided what to implement next. Maybe pointers? We’ll see!</p>

<p><strong>Update:</strong> just kidding, the <a href="/2019/02/18/Write-a-Compiler-10.html">next post</a> is about compilers after all, and covers global variables.</p>

<p><em>If you have any questions, corrections, or other feedback, you can <a href="mailto:nora@norasandler.com">email me</a> or <a href="https://github.com/nlsandler/write_a_c_compiler/issues">open an issue</a>.</em></p>

<div class="footnote">
  <p><sup id="fn1">1</sup>
Technically, you can redefine a function in the same <em>program</em> but not in the same <em>translation unit</em>. A translation unit is a source file plus everything that gets pulled in during preprocessing from <code class="language-plaintext highlighter-rouge">#include</code> directives. (Source: <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf">C11 standard</a>, section 5.1.1.1)</p>

  <p>So it’s legal to redefine a function from a linked library. But linking happens after the compiler runs, so for our purposes the rule is that each function can only be defined once.<a href="#anchor1">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn2">2</sup>
However, this is legal according to C11:</p>

  <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">foo</span><span class="p">();</span>

<span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">){</span>
    <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div>  </div>

  <p>That’s because <code class="language-plaintext highlighter-rouge">int foo();</code> doesn’t mean “declare a function foo with no parameters”; it means “declare a function foo, but we don’t know anything about its parameters.” But our compiler diverges from the standard in this respect; it assumes that <code class="language-plaintext highlighter-rouge">int foo();</code> means “declare foo with no parameters,” so it will fail here.
<a href="#anchor2">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn3">3</sup>
What the linker does and where it looks for function definitions is way beyond the scope of this blog post; if you want to learn more you might like the <a href="http://www.lurklurk.org/linkers/linkers.html">Beginner’s Guide to Linkers</a> or <a href="https://lwn.net/Articles/276782/">this series on linkers</a>.
<a href="#anchor3">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn4">4</sup>
If a register is caller-saved, that means the callee is allowed to overwrite it. So if the caller wants to access the value in that register after the callee returns, it needs to push that value onto the stack, then pop it back into the register after the function call has completed.</p>

  <p>If a register is callee-saved, the caller can assume that the register will be unchanged after the function call finishes. So if the callee wants to use that register, it has to save the register’s contents to the stack and restore those contents before returning control to the caller.
<a href="#anchor4">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn5">5</sup>
Windows is a lot more complicated; sometimes it uses <code class="language-plaintext highlighter-rouge">cdecl</code>, sometimes it uses different calling conventions. A lot of Linux/OS X documentation doesn’t even call it <code class="language-plaintext highlighter-rouge">cdecl</code>, presumably because it’s the only calling convention in *nix-world.
<a href="#anchor5">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn6">6</sup>
Source: <a href="https://developer.apple.com/library/archive/documentation/DeveloperTools/Conceptual/LowLevelABI/130-IA-32_Function_Calling_Conventions/IA32.html">OS X ABI Function Call Guide</a>. It’s not 100% clear why OS X imposes this requirement but it probably has something to do with <a href="https://stackoverflow.com/questions/612443/why-does-the-mac-abi-require-16-byte-stack-alignment-for-x86-32">making SSE instructions run faster</a>.
<a href="#anchor6">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn7">7</sup>
See the <a href="https://gcc.gnu.org/onlinedocs/gcc-2.95.2/gcc_2.html#SEC31">GCC documentation</a> on <code class="language-plaintext highlighter-rouge">-mpreferred-stack-boundary</code>.
<a href="#anchor7">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn8">8</sup>
Note that these are not valid memory addresses; at least on Linux, the lowest memory address in use is 0x08048000. (See <a href="https://stackoverflow.com/questions/7187981/whats-the-memory-before-0x08048000-used-for-in-32-bit-machine">here</a> and <a href="https://stackoverflow.com/questions/12488010/why-the-entry-point-address-in-my-executable-is-0x8048330-0x330-being-offset-of">here</a>). I think this is also true on OS X but I haven’t checked.
<a href="#anchor8">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn9">9</sup>
It’s actually a little more complicated than this; instructions are variable-width, so you can’t increment EIP by the same amount for every instruction.
<a href="#anchor9">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn10">10</sup>
Actually, the first step is pushing some caller-saved registers onto the stack. But, like I mentioned earlier, the janky way we’re managing registers means we can ignore this.
<a href="#anchor10">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn11">11</sup>
Pushing arguments onto the stack in reverse order makes it easier to handle functions with a variable number of arguments; the callee knows the location of the first argument even if it doesn’t know how many arguments there are.
<a href="#anchor11">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn12">12</sup>
This means we’ll also <em>evaluate</em> the arguments in reverse order. This is valid; function arguments may be evaluated in any order. (Source: <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf">C11 standard</a> section 6.5.2.2, paragraph 10.)
<a href="#anchor12">↩</a></p>
</div>]]></content><author><name>Nora Sandler</name></author><category term="compiler-tutorial" /><summary type="html"><![CDATA[This is the ninth post in a series. Read part 1 here.]]></summary></entry><entry><title type="html">C Compiler, Part 8: Loops</title><link href="https://norasandler.com/2018/04/10/Write-a-Compiler-8.html" rel="alternate" type="text/html" title="C Compiler, Part 8: Loops" /><published>2018-04-10T19:00:00+00:00</published><updated>2018-04-10T19:00:00+00:00</updated><id>https://norasandler.com/2018/04/10/Write-a-Compiler-8</id><content type="html" xml:base="https://norasandler.com/2018/04/10/Write-a-Compiler-8.html"><![CDATA[<p><em>This is the eighth post in a series. Read part 1 <a href="/2017/11/29/Write-a-Compiler.html">here</a>.</em></p>

<p>In this post we’re going to add loops! Now we’ll finally be able to compile FizzBuzz…except we won’t, because we can’t call printf yet. Still, it’s progress!</p>

<p>If you’ve been following along, note that there was a mistake in <a href="/2018/03/14/Write-a-Compiler-7.html">the last post</a>. Make sure you read the “Deallocating Variables” section and update your compiler to pass the new stage 7 tests before you start on stage 8.</p>

<p>As usual, accompanying tests are <a href="https://github.com/nlsandler/write_a_c_compiler">here</a>.</p>

<h1 id="part-8-loops">Part 8: Loops</h1>

<p>In this post we’re implementing what the <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf">C11 standard</a> calls iteration statements; if you want to refer to the standard itself, they’re in section 6.8.5. There are a few different iteration statements:</p>

<h3 id="for-loops"><code class="language-plaintext highlighter-rouge">for</code> loops</h3>

<p>First, some terminology. I’m going to call the three parts of a <code class="language-plaintext highlighter-rouge">for</code> loop header the <em>initial clause</em>, <em>controlling expression</em>, and <em>post-expression</em>, as in:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// initial clause</span>
     <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span>    <span class="c1">// controlling expression</span>
     <span class="n">i</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span>  <span class="c1">// post-expression</span>
     <span class="p">)</span> <span class="p">{</span>
        <span class="c1">// do something</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">for</code> loops come in two flavors: one where the initial clause is a variable declaration, and one where it’s just an expression.</p>

<p>Flavor #1:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// do something</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Flavor #2:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">//do something</span>
<span class="p">}</span>
</code></pre></div></div>

<p>One interesting thing about <code class="language-plaintext highlighter-rouge">for</code> loops is that any of the expressions in the loop header can be empty:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="c1">//do something</span>
<span class="p">}</span>
</code></pre></div></div>

<p>But if the controlling expression is empty, the compiler needs to replace it with a constant nonzero expression<sup id="anchor1"><a href="#fn1">1</a></sup>.
So the example above is equivalent to:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;</span><span class="mi">1</span><span class="p">;)</span> <span class="p">{</span>
    <span class="c1">//do something</span>
<span class="p">}</span>
</code></pre></div></div>
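
<p>To see that equivalence concretely, here’s a small sketch. The function names <code class="language-plaintext highlighter-rouge">count_with_empty_header</code> and <code class="language-plaintext highlighter-rouge">count_with_constant_condition</code> are made up for this example. Neither loop ever stops on its own, so both rely on the <code class="language-plaintext highlighter-rouge">return</code> inside the body, and both should produce the same result for any non-negative <code class="language-plaintext highlighter-rouge">limit</code>:</p>

```c
// Empty controlling expression: treated as a nonzero constant.
int count_with_empty_header(int limit) {
    int count = 0;
    for (;;) {
        if (count == limit)
            return count;
        count = count + 1;
    }
}

// The same loop with the nonzero constant written out.
int count_with_constant_condition(int limit) {
    int count = 0;
    for (; 1;) {
        if (count == limit)
            return count;
        count = count + 1;
    }
}
```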

<h3 id="while-and-do-loops"><code class="language-plaintext highlighter-rouge">while</code> and <code class="language-plaintext highlighter-rouge">do</code> Loops</h3>

<p>There’s not a whole lot to say about these.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">i</span>  <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">do</span> <span class="p">{</span>
    <span class="n">i</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">);</span> <span class="c1">// &lt;- the semicolon is required!</span>
</code></pre></div></div>
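
<p>The one difference worth keeping in mind is that a <code class="language-plaintext highlighter-rouge">do</code> loop checks its condition <em>after</em> the body, so the body always runs at least once. Here’s a quick sketch of that; the helper names <code class="language-plaintext highlighter-rouge">while_iterations</code> and <code class="language-plaintext highlighter-rouge">do_iterations</code> are just for illustration:</p>

```c
// Counts how many times the body of a while loop runs.
int while_iterations(int i) {
    int iterations = 0;
    while (i < 10) {
        i = i + 1;
        iterations = iterations + 1;
    }
    return iterations;
}

// Same loop as a do loop: the condition is checked after the body.
int do_iterations(int i) {
    int iterations = 0;
    do {
        i = i + 1;
        iterations = iterations + 1;
    } while (i < 10);
    return iterations;
}
```

<p>Starting from 10, the <code class="language-plaintext highlighter-rouge">while</code> version never enters its body, but the <code class="language-plaintext highlighter-rouge">do</code> version runs it once.</p>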

<h3 id="break-and-continue"><code class="language-plaintext highlighter-rouge">break</code> and <code class="language-plaintext highlighter-rouge">continue</code></h3>

<p><code class="language-plaintext highlighter-rouge">break</code> and <code class="language-plaintext highlighter-rouge">continue</code> aren’t loops, but they always appear inside loops, so it makes sense to add them now<sup id="anchor2"><a href="#fn2">2</a></sup>. The C11 standard calls them “jump statements” and defines them in section 6.8.6.</p>

<p>A <code class="language-plaintext highlighter-rouge">break</code> statement inside a loop causes execution to jump to the end of the loop:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">break</span><span class="p">;</span> <span class="c1">// go to end of loop</span>
<span class="p">}</span>
<span class="c1">// break statement will jump here</span>
</code></pre></div></div>

<p>A <code class="language-plaintext highlighter-rouge">continue</code> statement causes execution to jump to the end of the loop body; in a <code class="language-plaintext highlighter-rouge">for</code> loop, that’s immediately before the post-expression.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">%</span> <span class="mi">2</span><span class="p">)</span>
        <span class="k">continue</span><span class="p">;</span>
    <span class="c1">// do something</span>

    <span class="c1">//continue statement will jump here</span>
<span class="p">}</span>
</code></pre></div></div>

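<p>In the example above, the loop will execute ten times, but only “do something” for even values of i, because <code class="language-plaintext highlighter-rouge">i % 2</code> is nonzero (true) whenever i is odd.</p>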
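
<p>Here’s a runnable version of that loop, as a sketch. The function name <code class="language-plaintext highlighter-rouge">sum_past_continue</code> is made up, and “do something” has been replaced with adding <code class="language-plaintext highlighter-rouge">i</code> to a running total so you can see which iterations get past the <code class="language-plaintext highlighter-rouge">continue</code>:</p>

```c
int sum_past_continue(void) {
    int sum = 0;
    for (int i = 0; i < 10; i = i + 1) {
        if (i % 2)
            continue;   // taken when i is odd
        sum = sum + i;  // only reached for i = 0, 2, 4, 6, 8
    }
    return sum;         // 0 + 2 + 4 + 6 + 8 = 20
}
```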

<h3 id="null-statements">Null statements</h3>

<p>Sort of like you can have null expressions in a <code class="language-plaintext highlighter-rouge">for</code> loop, you can also have null statements<sup id="anchor3"><a href="#fn3">3</a></sup>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">;</span> <span class="c1">// does nothing</span>
<span class="k">return</span> <span class="n">a</span><span class="p">;</span>
</code></pre></div></div>

<p>Null statements don’t really have anything to do with loops, but they share a common feature with the expressions in a <code class="language-plaintext highlighter-rouge">for</code> loop: they’re both defined in terms of optional expressions in the standard. Since we need to support optional expressions in <code class="language-plaintext highlighter-rouge">for</code> loops, it’s pretty easy to add support for null statements too.</p>
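
<p>One place null statements do show up in practice is as a loop body, when all the work happens in the loop header. A minimal sketch, with a made-up function name:</p>

```c
// Finds the first multiple of 7 that is greater than or equal to start.
int first_multiple_of_seven_from(int start) {
    int i;
    for (i = start; i % 7 != 0; i = i + 1)
        ; // null statement: the header does all the work
    return i;
}
```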

<p>As usual, we’ll update the lexing, parsing, and code generation passes, in order.</p>

<h2 id="lexing">Lexing</h2>

<p>We’re adding five (!) keywords in this post: <code class="language-plaintext highlighter-rouge">for</code>, <code class="language-plaintext highlighter-rouge">do</code>, <code class="language-plaintext highlighter-rouge">while</code>, <code class="language-plaintext highlighter-rouge">break</code>, and <code class="language-plaintext highlighter-rouge">continue</code>.
Here are all our tokens so far:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">{</code></li>
  <li><code class="language-plaintext highlighter-rouge">}</code></li>
  <li><code class="language-plaintext highlighter-rouge">(</code></li>
  <li><code class="language-plaintext highlighter-rouge">)</code></li>
  <li><code class="language-plaintext highlighter-rouge">;</code></li>
  <li><code class="language-plaintext highlighter-rouge">int</code></li>
  <li><code class="language-plaintext highlighter-rouge">return</code></li>
  <li>Identifier <code class="language-plaintext highlighter-rouge">[a-zA-Z]\w*</code></li>
  <li>Integer literal <code class="language-plaintext highlighter-rouge">[0-9]+</code></li>
  <li><code class="language-plaintext highlighter-rouge">-</code></li>
  <li><code class="language-plaintext highlighter-rouge">~</code></li>
  <li><code class="language-plaintext highlighter-rouge">!</code></li>
  <li><code class="language-plaintext highlighter-rouge">+</code></li>
  <li><code class="language-plaintext highlighter-rouge">*</code></li>
  <li><code class="language-plaintext highlighter-rouge">/</code></li>
  <li><code class="language-plaintext highlighter-rouge">&amp;&amp;</code></li>
  <li><code class="language-plaintext highlighter-rouge">||</code></li>
  <li><code class="language-plaintext highlighter-rouge">==</code></li>
  <li><code class="language-plaintext highlighter-rouge">!=</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;=</code></li>
  <li><code class="language-plaintext highlighter-rouge">&gt;</code></li>
  <li><code class="language-plaintext highlighter-rouge">&gt;=</code></li>
  <li><code class="language-plaintext highlighter-rouge">=</code></li>
  <li><code class="language-plaintext highlighter-rouge">if</code></li>
  <li><code class="language-plaintext highlighter-rouge">else</code></li>
  <li><code class="language-plaintext highlighter-rouge">:</code></li>
  <li><code class="language-plaintext highlighter-rouge">?</code></li>
  <li><strong><code class="language-plaintext highlighter-rouge">for</code></strong></li>
  <li><strong><code class="language-plaintext highlighter-rouge">while</code></strong></li>
  <li><strong><code class="language-plaintext highlighter-rouge">do</code></strong></li>
  <li><strong><code class="language-plaintext highlighter-rouge">break</code></strong></li>
  <li><strong><code class="language-plaintext highlighter-rouge">continue</code></strong></li>
</ul>
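
<p>If your lexer matches identifiers first and then checks whether the match is a keyword, adding the new keywords can be a small table lookup. Here’s a rough sketch in C; it isn’t how my lexer works, and the names <code class="language-plaintext highlighter-rouge">token_kind</code> and <code class="language-plaintext highlighter-rouge">classify_word</code> are hypothetical. It only covers the five new keywords; the existing ones would go in the same table:</p>

```c
#include <string.h>

enum token_kind {
    TOK_IDENTIFIER,
    TOK_FOR, TOK_WHILE, TOK_DO, TOK_BREAK, TOK_CONTINUE
};

// Classify a word that has already matched the identifier pattern.
enum token_kind classify_word(const char *word) {
    static const struct { const char *text; enum token_kind kind; } keywords[] = {
        {"for", TOK_FOR}, {"while", TOK_WHILE}, {"do", TOK_DO},
        {"break", TOK_BREAK}, {"continue", TOK_CONTINUE},
    };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i = i + 1) {
        if (strcmp(word, keywords[i].text) == 0)
            return keywords[i].kind;
    }
    return TOK_IDENTIFIER; // anything else is an ordinary identifier
}
```

<p>Comparing the whole word means that an identifier like <code class="language-plaintext highlighter-rouge">double</code> isn’t mistaken for the keyword <code class="language-plaintext highlighter-rouge">do</code> followed by something else.</p>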

<h4 id="-task">☑ Task:</h4>
<p>You know the drill here.</p>

<h2 id="parsing">Parsing</h2>

<p>We’re adding six kinds of statements: <code class="language-plaintext highlighter-rouge">do</code> loops, <code class="language-plaintext highlighter-rouge">while</code> loops, the two different kinds of <code class="language-plaintext highlighter-rouge">for</code> loop, <code class="language-plaintext highlighter-rouge">break</code> and <code class="language-plaintext highlighter-rouge">continue</code>.
We’re also changing the <code class="language-plaintext highlighter-rouge">Exp</code> statement; its argument is now optional, so we can use it to represent null statements.
Now we can construct a null statement in the AST like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>null_exp = Exp(None)
</code></pre></div></div>

<p>The initial expression and post-expression in a <code class="language-plaintext highlighter-rouge">for</code> loop are also optional.</p>

<p>Here’s the updated definition of statements in the AST, with new and changed parts bolded:</p>

<pre>
statement = Return(exp)
<b>          | Exp(exp option)</b>
          | Conditional(exp, statement, statement option) // exp is controlling condition
                                                          // first statement is 'if' block
                                                          // second statement is optional 'else' block
          | Compound(block_item list)
<b>          | For(exp option, exp, exp option, statement) // initial expression, condition, post-expression, body
          | ForDecl(declaration, exp, exp option, statement) // initial declaration, condition, post-expression, body
          | While(exp, statement) // condition, body
          | Do(statement, exp) // body, condition
          | Break
          | Continue</b>
</pre>
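
<p>If you’re writing your compiler in a language without option types, one way to represent the optional pieces is with nullable pointers. This is only a sketch of the new and changed variants, under the assumption that expression and declaration nodes are defined elsewhere; all of the names here are hypothetical:</p>

```c
#include <stddef.h>

struct exp;         // expression node, assumed to be defined elsewhere
struct declaration; // declaration node, assumed to be defined elsewhere

enum statement_kind {
    STMT_EXP, STMT_FOR, STMT_FOR_DECL, STMT_WHILE, STMT_DO,
    STMT_BREAK, STMT_CONTINUE
};

struct statement {
    enum statement_kind kind;
    struct exp *exp;               // Exp: NULL means a null statement
    struct declaration *init_decl; // ForDecl: initial declaration
    struct exp *init;              // For: NULL if the initial expression is empty
    struct exp *condition;         // For, ForDecl, While, Do: never NULL
    struct exp *post;              // For, ForDecl: NULL if the post-expression is empty
    struct statement *body;        // For, ForDecl, While, Do
};

// The equivalent of Exp(None): every pointer field is zero-initialized.
struct statement make_null_statement(void) {
    struct statement s = { .kind = STMT_EXP };
    return s;
}

int is_null_statement(const struct statement *s) {
    return s->kind == STMT_EXP && s->exp == NULL;
}
```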

<p>Note that our AST lets <code class="language-plaintext highlighter-rouge">break</code> and <code class="language-plaintext highlighter-rouge">continue</code> statements appear outside of loops, even though that’s illegal; we’ll catch that error during code generation, not parsing.</p>

<p>The trickiest part of the grammar here is dealing with optional expressions. I dealt with this by defining an <code class="language-plaintext highlighter-rouge">&lt;exp-option&gt;</code> symbol:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;exp-option&gt; ::= &lt;exp&gt; | ""
</code></pre></div></div>

<p>Once we’ve added that, updating the grammar for statements is pretty easy:</p>

<pre>
&lt;statement&gt; ::= "return" &lt;exp&gt; ";"
<b>              | &lt;exp-option&gt; ";"</b>
              | "if" "(" &lt;exp&gt; ")" &lt;statement&gt; [ "else" &lt;statement&gt; ]
              | "{" { &lt;block-item&gt; } "}
<b>              | "for" "(" &lt;exp-option&gt; ";" &lt;exp-option&gt; ";" &lt;exp-option&gt; ")" &lt;statement&gt;
              | "for" "(" &lt;declaration&gt; &lt;exp-option&gt; ";" &lt;exp-option&gt; ")" &lt;statement&gt;
              | "while" "(" &lt;exp&gt; ")" &lt;statement&gt;
              | "do" &lt;statement&gt; "while" "(" &lt;exp&gt; ")" ";"
              | "break" ";"
              | "continue" ";"</b>
</pre>

<p>If you’re wondering why there’s a semicolon after the initial <code class="language-plaintext highlighter-rouge">&lt;exp-option&gt;</code> in the first <code class="language-plaintext highlighter-rouge">for</code> rule, but not after the initial <code class="language-plaintext highlighter-rouge">&lt;declaration&gt;</code> in the second one, it’s because the rule for <code class="language-plaintext highlighter-rouge">&lt;declaration&gt;</code> also includes a semicolon.</p>

<p>Parsing <code class="language-plaintext highlighter-rouge">&lt;exp-option&gt;</code> isn’t entirely straightforward, because the empty string is not actually a token. I dealt with this by looking ahead to see if the next token was a close paren (after a post-expression) or a semicolon (after a statement, initial expression or controlling condition). If it was, the expression was empty; if not, not. I think this approach violates some formalisms about context-free grammars and LL parsers: in order to parse an <code class="language-plaintext highlighter-rouge">&lt;exp-option&gt;</code> symbol, you may have to look at a token that comes <em>after</em> that symbol.
This isn’t actually a problem, but if it bothers you, you can refactor the grammar to avoid it<sup id="anchor4"><a href="#fn4">4</a></sup>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;exp-option-semicolon&gt; ::= &lt;exp&gt; ";" | ";"
&lt;exp-option-close-paren&gt; ::= &lt;exp&gt; ")" | ")"
&lt;statement&gt; ::= ...
                | &lt;exp-option-semicolon&gt; // null statement
                | "for" "(" &lt;declaration&gt; &lt;exp-option-semicolon&gt; &lt;exp-option-close-paren&gt; &lt;statement&gt;
                ...
</code></pre></div></div>

<p>Note that there’s a discrepancy here between the grammar and the AST definition; the grammar allows controlling expressions in <code class="language-plaintext highlighter-rouge">for</code> loops to be empty, but the AST doesn’t. That’s because, as I mentioned earlier, an empty controlling expression needs to be replaced with a nonzero constant. So our approach to parsing controlling expressions in <code class="language-plaintext highlighter-rouge">for</code> loops will look something like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>match parse_optional_exp(controlling_expression) with
| Some e -&gt; e
| None -&gt; Const(1) // construct a constant nonzero expression
</code></pre></div></div>

<p>You could do this during the code generation stage instead of the parsing stage, if you wanted.</p>
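
<p>Here’s roughly what the lookahead approach could look like in C. This is a heavily simplified sketch, not my actual parser: it assumes a token array, pretends an expression is always a single integer literal, and returns the literal’s value instead of building an AST node. The names <code class="language-plaintext highlighter-rouge">parse_optional_exp</code> and <code class="language-plaintext highlighter-rouge">parse_controlling_exp</code> are placeholders:</p>

```c
#include <stdbool.h>
#include <stddef.h>

enum token_kind { TOK_SEMICOLON, TOK_CLOSE_PAREN, TOK_INT_LITERAL };

struct token {
    enum token_kind kind;
    int value; // only meaningful for TOK_INT_LITERAL
};

// Returns false without consuming anything if the expression is empty,
// i.e. if the next token is the ";" or ")" that would follow it.
bool parse_optional_exp(const struct token *tokens, size_t *pos, int *out) {
    enum token_kind next = tokens[*pos].kind;
    if (next == TOK_SEMICOLON || next == TOK_CLOSE_PAREN)
        return false;
    *out = tokens[*pos].value; // stand-in for parsing a full expression
    *pos = *pos + 1;
    return true;
}

// An empty controlling expression becomes the constant 1.
int parse_controlling_exp(const struct token *tokens, size_t *pos) {
    int value;
    if (parse_optional_exp(tokens, pos, &value))
        return value;
    return 1;
}
```

<p>Note that the terminating semicolon or close paren is left for the caller to consume, which is exactly the “peek at a token after the symbol” behavior described above.</p>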

<h4 id="-task-1">☑ Task:</h4>
<p>Update parsing to succeed on all valid stage 1-8 examples, and fail on all invalid stage 8 examples whose names start with <code class="language-plaintext highlighter-rouge">syntax_err</code>.</p>

<h2 id="code-generation">Code Generation</h2>

<h3 id="null-statements-1">Null Statements</h3>
<p>Don’t emit any assembly for null statements. Easy!</p>

<h3 id="while-loops"><code class="language-plaintext highlighter-rouge">while</code> loops</h3>

<p>Given a <code class="language-plaintext highlighter-rouge">while</code> loop like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while</span> <span class="p">(</span><span class="n">expression</span><span class="p">)</span>
    <span class="n">statement</span>
</code></pre></div></div>

<p>we can describe its control flow like this:</p>

<ol>
  <li>Evaluate <code class="language-plaintext highlighter-rouge">expression</code>.</li>
  <li>If it’s false, jump to step 5.</li>
  <li>Execute <code class="language-plaintext highlighter-rouge">statement</code>.</li>
  <li>Jump to step 1.</li>
  <li>Finish.</li>
</ol>

<p>I won’t show you the exact assembly you need to generate here; by now you know enough to figure it out yourself.
The main thing is labeling steps 1 and 5, so when we need a jump instruction we have somewhere to jump to.
It’s worth noting that the loop body is a new scope, and you need to reset your <code class="language-plaintext highlighter-rouge">current_scope</code> set accordingly.</p>
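
<p>Without giving away the assembly, the five steps can be written out in C using labels and <code class="language-plaintext highlighter-rouge">goto</code>, which maps pretty directly onto labels and jump instructions. This is just an illustration for the loop <code class="language-plaintext highlighter-rouge">while (i &lt; 10) i = i + 1;</code>, and the function and label names are made up:</p>

```c
int count_to_ten_with_gotos(int i) {
loop_start:               // step 1: evaluate the expression
    if (!(i < 10))        // step 2: if it's false, jump to step 5
        goto loop_end;
    i = i + 1;            // step 3: execute the statement
    goto loop_start;      // step 4: jump to step 1
loop_end:                 // step 5: finish
    return i;
}
```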

<h3 id="do-loops"><code class="language-plaintext highlighter-rouge">do</code> Loops</h3>

<p>These are basically the same as <code class="language-plaintext highlighter-rouge">while</code> loops; just evaluate the expression after the statement.</p>
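
<p>As a sketch of that control flow, here’s <code class="language-plaintext highlighter-rouge">do i = i + 1; while (i &lt; 10);</code> written with a label and a <code class="language-plaintext highlighter-rouge">goto</code>. The function and label names are made up for illustration:</p>

```c
int do_count_with_gotos(int i) {
loop_start:
    i = i + 1;            // execute the statement first
    if (i < 10)           // then evaluate the expression
        goto loop_start;  // if it's true, go around again
    return i;
}
```

<p>Because the check comes last, the body runs once even when the condition is false from the start.</p>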

<h3 id="for-loops-1"><code class="language-plaintext highlighter-rouge">for</code> loops</h3>

<p>Given a <code class="language-plaintext highlighter-rouge">for</code> loop like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="n">init</span><span class="p">;</span> <span class="n">condition</span><span class="p">;</span> <span class="n">post</span><span class="o">-</span><span class="n">expression</span><span class="p">)</span>
    <span class="n">statement</span>
</code></pre></div></div>

<p>we can break it down in the same way as <code class="language-plaintext highlighter-rouge">while</code> loops above:</p>

<ol>
  <li>Evaluate <code class="language-plaintext highlighter-rouge">init</code>.</li>
  <li>Evaluate <code class="language-plaintext highlighter-rouge">condition</code>.</li>
  <li>If it’s false, jump to step 7.</li>
  <li>Execute <code class="language-plaintext highlighter-rouge">statement</code>.</li>
  <li>Execute <code class="language-plaintext highlighter-rouge">post-expression</code>.</li>
  <li>Jump to step 2.</li>
  <li>Finish.</li>
</ol>
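
<p>The seven steps can be spelled out in C with labels and <code class="language-plaintext highlighter-rouge">goto</code>. This sketch hand-translates a loop that adds up <code class="language-plaintext highlighter-rouge">i</code> for <code class="language-plaintext highlighter-rouge">i</code> from 0 up to (but not including) <code class="language-plaintext highlighter-rouge">limit</code>; the function and label names are hypothetical:</p>

```c
// Equivalent to: for (int i = 0; i < limit; i = i + 1) sum = sum + i;
int sum_below_with_gotos(int limit) {
    int sum = 0;
    int i = 0;            // step 1: evaluate init
condition:
    if (!(i < limit))     // steps 2 and 3: evaluate condition, jump if false
        goto finish;
    sum = sum + i;        // step 4: execute statement
    i = i + 1;            // step 5: execute post-expression
    goto condition;       // step 6: jump to step 2
finish:                   // step 7: finish
    return sum;
}
```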

<p>The init and post-expression might be empty, in which case we just don’t emit any assembly for steps 1 and 5. Note that a <code class="language-plaintext highlighter-rouge">for</code> loop, including the header, is a block with its own scope, and the <em>body</em> of the <code class="language-plaintext highlighter-rouge">for</code> loop is <em>also</em> a block. That means you can have code like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">100</span><span class="p">;</span> <span class="c1">// scope 1</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// scope 2 - variable i shadows previous i</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span> <span class="c1">//scope 3 - this variable i shadows BOTH previous i's</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The main gotcha here is that you need to pop the variable declared in <code class="language-plaintext highlighter-rouge">init</code> off the stack
when you exit the block, just like you needed to handle deallocating other variables in the last post.</p>
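
<p>To convince yourself that there really are three separate variables, here’s a runnable version of the shadowing example. It’s a sketch with a made-up function name and an extra <code class="language-plaintext highlighter-rouge">iterations</code> counter; note that this is valid C, but a C++ compiler would reject the redeclaration in the loop body:</p>

```c
int shadowing_demo(void) {
    int i = 100;                           // scope 1
    int iterations = 0;
    for (int i = 0; i < 10; i = i + 1) {   // scope 2: the header's i counts 0..9
        int i = 1;                         // scope 3: a fresh i on every iteration
        iterations = iterations + i;
    }
    return i + iterations;                 // scope 1's i is still 100
}
```

<p>The header’s <code class="language-plaintext highlighter-rouge">i</code> drives the loop through ten iterations, the body’s <code class="language-plaintext highlighter-rouge">i</code> is always 1, and the outer <code class="language-plaintext highlighter-rouge">i</code> is never touched, so the function returns 110.</p>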

<h3 id="break-and-continue-1"><code class="language-plaintext highlighter-rouge">break</code> and <code class="language-plaintext highlighter-rouge">continue</code></h3>

<p>We can implement each of these with a single <code class="language-plaintext highlighter-rouge">jmp</code> instruction – the trick is just figuring out where to jump <em>to</em>. A break statement “terminates execution of the smallest enclosing <code class="language-plaintext highlighter-rouge">switch</code> or iteration statement,” so we want to jump to the point right after the loop<sup id="anchor5"><a href="#fn5">5</a></sup>. We already have an “end of loop” label, which we jump to when the controlling condition is false; we just need to pass that label around along with the variable map, stack index and current scope.</p>

<p>We also need to pass <em>another</em> label for <code class="language-plaintext highlighter-rouge">continue</code> to refer to. <code class="language-plaintext highlighter-rouge">continue</code> “causes a jump to the loop-continuation portion of the smallest
enclosing iteration statement; that is, to the end of the loop body”<sup id="anchor6"><a href="#fn6">6</a></sup> – that’s step 4 in the <code class="language-plaintext highlighter-rouge">while</code> loop or step 5 in the <code class="language-plaintext highlighter-rouge">for</code> loop above.</p>
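
<p>Here’s a sketch of where those two labels sit in a <code class="language-plaintext highlighter-rouge">for</code> loop, again written in C with <code class="language-plaintext highlighter-rouge">goto</code> standing in for <code class="language-plaintext highlighter-rouge">jmp</code>. The loop being translated breaks when <code class="language-plaintext highlighter-rouge">i</code> reaches <code class="language-plaintext highlighter-rouge">stop_at</code>, skips odd values with <code class="language-plaintext highlighter-rouge">continue</code>, and adds up the rest. All of the names are made up for this example:</p>

```c
// Equivalent to:
//   for (int i = 0; i < 10; i = i + 1) {
//       if (i == stop_at) break;
//       if (i % 2) continue;
//       sum = sum + i;
//   }
int sum_even_until(int stop_at) {
    int sum = 0;
    int i = 0;
loop_start:
    if (!(i < 10))
        goto break_label;     // controlling condition is false
    if (i == stop_at)
        goto break_label;     // break: jump to the point right after the loop
    if (i % 2)
        goto continue_label;  // continue: jump to the end of the loop body
    sum = sum + i;
continue_label:
    i = i + 1;                // post-expression still runs after a continue
    goto loop_start;
break_label:
    return sum;
}
```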

<p>Unlike the stack index, variable map and so forth, the break and continue labels can be null, if you’re not inside a loop. Hitting a <code class="language-plaintext highlighter-rouge">break</code> or <code class="language-plaintext highlighter-rouge">continue</code> statement when these labels are null should, of course, cause an error.</p>

<p>At this point, I was passing enough arguments around that I defined a <code class="language-plaintext highlighter-rouge">Context</code> type and wrapped it all up in that. You may want to do something similar, but you don’t have to.</p>
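
<p>For a rough idea of what that could look like in C, here’s a sketch. It isn’t my actual <code class="language-plaintext highlighter-rouge">Context</code> type: the field and function names are hypothetical, and the variable map and current scope are left out to keep it short.</p>

```c
#include <stddef.h>

struct context {
    int stack_index;
    const char *break_label;    // NULL when we're not inside a loop
    const char *continue_label; // NULL when we're not inside a loop
    // variable map and current scope would also live here
};

// Build the context for a loop body: same as the enclosing context,
// but with this loop's labels.
struct context enter_loop(struct context outer,
                          const char *break_label,
                          const char *continue_label) {
    struct context inner = outer;
    inner.break_label = break_label;
    inner.continue_label = continue_label;
    return inner;
}

// Label to jump to for a break statement, or NULL if break is illegal here.
const char *break_target(const struct context *ctx) {
    return ctx->break_label;
}
```

<p>Passing the context by value means the enclosing context’s labels are untouched, so they’re still in effect once code generation for the inner loop is done.</p>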

<h2 id="up-next">Up Next</h2>

<p>In the <a href="/2018/06/27/Write-a-Compiler-9.html">next post</a> we’re going to implement a pretty fundamental concept: <strong>function calls</strong>. I don’t know about you but I am VERY EXCITED for function calls. See you then!</p>

<p><em>If you have any questions, corrections, or other feedback, you can <a href="mailto:nora@norasandler.com">email me</a> or <a href="https://github.com/nlsandler/write_a_c_compiler/issues">open an issue</a>.</em></p>

<div class="footnote">
  <p><sup id="fn1">1</sup>
See section 6.8.5.3 of the C11 standard.<a href="#anchor1">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn2">2</sup>
<code class="language-plaintext highlighter-rouge">break</code> can also appear in <code class="language-plaintext highlighter-rouge">switch</code> statements, but we haven’t added those yet.<a href="#anchor2">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn3">3</sup>
C11 standard, section 6.8.3. <a href="#anchor3">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn4">4</sup>
Thank you to Ian for catching an error in the refactored grammar in an earlier version of this post.<a href="#anchor4">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn5">5</sup>
C11 standard, section 6.8.6.3.<a href="#anchor5">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn6">6</sup>
C11 standard, section 6.8.6.2.<a href="#anchor6">↩</a></p>
</div>]]></content><author><name>Nora Sandler</name></author><category term="compiler-tutorial" /><summary type="html"><![CDATA[This is the eighth post in a series. Read part 1 here.]]></summary></entry><entry><title type="html">Writing a C Compiler, Part 7</title><link href="https://norasandler.com/2018/03/14/Write-a-Compiler-7.html" rel="alternate" type="text/html" title="Writing a C Compiler, Part 7" /><published>2018-03-14T23:00:00+00:00</published><updated>2018-03-14T23:00:00+00:00</updated><id>https://norasandler.com/2018/03/14/Write-a-Compiler-7</id><content type="html" xml:base="https://norasandler.com/2018/03/14/Write-a-Compiler-7.html"><![CDATA[<h2 id="update-49">Update 4/9</h2>

<ul>
  <li>There was a pretty big mistake in the original post - I forgot to deallocate local variables! I’ve added the “Deallocating Variables” section, and added the example from that section to the test suite.</li>
</ul>

<p><em>This is the seventh post in a series. Read part 1 <a href="/2017/11/29/Write-a-Compiler.html">here</a>.</em></p>

<p>In this post we’re adding support for compound statements, which are a little weird because they don’t <em>do</em> very much.
We’ll generate almost no new assembly in this post, but we’ll be able to compile new and exciting programs at the end of it.
How is this possible? Let’s find out!</p>

<p>As usual, accompanying tests are <a href="https://github.com/nlsandler/write_a_c_compiler">here</a>.</p>

<h1 id="part-7-compound-statements">Part 7: Compound Statements</h1>

<p>A compound statement is just a list of statements and declarations wrapped in curly braces.
They’re normally used as substatements of <code class="language-plaintext highlighter-rouge">if</code>, <code class="language-plaintext highlighter-rouge">while</code>, and other control structures, like this<sup id="anchor1"><a href="#fn1">1</a></sup>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">flag</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">//this is a compound statement!</span>
    <span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>but they can also be free-standing, like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">a</span><span class="p">;</span>
    <span class="p">{</span>
        <span class="c1">//this is also a compound statement!</span>
        <span class="n">a</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>You can have deeply nested compound statements:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="c1">//compound statement #1 (function bodies are compound statements!)</span>
    <span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">{</span>
        <span class="c1">//compound statement #2</span>
        <span class="n">a</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
        <span class="p">{</span>
            <span class="c1">//compound statement #3</span>
            <span class="n">a</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="p">{</span>
                <span class="c1">//compound statement #4</span>
                <span class="n">a</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Like I mentioned in the last post, a compound statement is one type of <strong>block</strong>, and I’m going to use the terms synonymously for the rest of this post. C uses <strong>lexical scoping</strong>; a variable’s scope is dictated by the block where it’s defined. (By “scope”, I mean where in the program you’re allowed to refer to it.) More precisely, a variable’s scope starts at its definition, and ends when you exit the block where it’s defined<sup id="anchor2"><a href="#fn2">2</a></sup>. Up until this point in the series, function bodies were the only blocks around, so a variable could be used at any point in <code class="language-plaintext highlighter-rouge">main</code> after it was defined. Now it’s more complicated. I’m going to talk a bit about how scoping works in C; if you’re already familiar with this, you can skip ahead to the next section.</p>

<p>If a variable is defined in an inner scope, it can’t be accessed in an outer scope:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// here is the outer scope</span>
<span class="p">{</span>
    <span class="c1">// here is the inner scope</span>
    <span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span>

<span class="c1">// now we're back in the outer scope</span>
<span class="n">foo</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span> <span class="c1">// ERROR - foo isn't defined in this scope!</span>
</code></pre></div></div>

<p>However, code in an inner scope can access variables in an outer scope:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">{</span>
    <span class="n">a</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span> <span class="c1">// this is okay</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">a</span><span class="p">;</span> <span class="c1">// returns 4 - changes made inside the inner scope are reflected here</span>
</code></pre></div></div>

<p>You can’t have two variables with the same name in the same scope:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">//This will throw a compiler error</span>
</code></pre></div></div>

<p>But you can have two variables with the same name in <em>different</em> scopes. Once the variable in the inner scope is declared, it will shadow the variable from the outer scope; the outer variable will be inaccessible until the inner variable goes out of scope.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">foo</span><span class="p">;</span> <span class="c1">// this is a TOTALLY DIFFERENT foo, unrelated to foo from earlier</span>
    <span class="n">foo</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="c1">// this refers to the inner foo; outer foo is inaccessible</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">foo</span><span class="p">;</span> <span class="c1">//this will return 0 - it refers to the original foo, which is unchanged</span>
</code></pre></div></div>

<p>The key idea here is that the inner and outer <code class="language-plaintext highlighter-rouge">foo</code> variables are two totally unrelated variables that just happen to have the same name. When we’re in the inner block, the outer variable <code class="language-plaintext highlighter-rouge">foo</code> still exists, but we have no way to refer to it, because <code class="language-plaintext highlighter-rouge">foo</code> now refers to the inner variable.</p>

<p>Note, however, that outer <code class="language-plaintext highlighter-rouge">foo</code> is accessible in the inner block before the point where it’s shadowed:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">{</span>
    <span class="n">foo</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span> <span class="c1">//changes outer foo</span>
    <span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span> <span class="c1">//defines inner foo, shadowing outer foo</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">foo</span><span class="p">;</span> <span class="c1">//returns 3</span>
</code></pre></div></div>
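
<p>To make these rules concrete, here’s a minimal Python sketch that models them at runtime. It isn’t part of the compiler, and the <code class="language-plaintext highlighter-rouge">Scope</code> class and its method names are made up for illustration. Each block gets its own table of variables plus a link to the enclosing block, and a lookup walks outward until it finds the name.</p>

```python
class Scope:
    """One block's variables, plus a link to the enclosing block (hypothetical model)."""

    def __init__(self, parent=None):
        self.vars = {}
        self.parent = parent

    def declare(self, name, value=None):
        # Two variables with the same name can't live in the same scope
        if name in self.vars:
            raise NameError(f"{name} is already declared in this scope")
        self.vars[name] = value

    def _find(self, name):
        # Walk outward from the innermost scope until we find the name
        scope = self
        while scope is not None:
            if name in scope.vars:
                return scope
            scope = scope.parent
        raise NameError(f"{name} is not defined in this scope")

    def get(self, name):
        return self._find(name).vars[name]

    def set(self, name, value):
        self._find(name).vars[name] = value


def shadowing_example():
    # Models: int foo = 0; { foo = 3; int foo = 4; } return foo;
    outer = Scope()
    outer.declare("foo", 0)
    inner = Scope(parent=outer)
    inner.set("foo", 3)      # inner foo isn't declared yet, so this changes outer foo
    inner.declare("foo", 4)  # from here on, foo means the inner variable
    inner.set("foo", 5)      # no effect on outer foo
    return outer.get("foo")
```

<p>Calling <code class="language-plaintext highlighter-rouge">shadowing_example()</code> gives back 3, matching the last C snippet: the assignment before the inner declaration reaches the outer <code class="language-plaintext highlighter-rouge">foo</code>, and everything after it doesn’t.</p>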

<h2 id="lexing">Lexing</h2>

<p>Compound statements don’t require any new tokens, so we don’t need to touch the lexing pass this week.</p>

<h2 id="parsing">Parsing</h2>

<p>Here’s the current definition of statements in our AST:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>statement = Return(exp) 
          | Exp(exp)
          | Conditional(exp, statement, statement option) //exp is controlling condition
                                                          //first statement is 'if' block
                                                          //second statement is optional 'else' block
</code></pre></div></div>

<p>We just need to add a <code class="language-plaintext highlighter-rouge">Compound</code> statement to this definition.
Also recall that we added a <code class="language-plaintext highlighter-rouge">block_item</code> construct to the AST in our last post:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>block_item = Statement(statement) | Declaration(declaration)
</code></pre></div></div>

<p>A compound statement is just a list of statements and declarations, so our new definition of statements will look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>statement = Return(exp) 
          | Exp(exp)
          | Conditional(exp, statement, statement option) //exp is controlling condition
                                                          //first statement is 'if' block
                                                          //second statement is optional 'else' block
          | Compound(block_item list)
</code></pre></div></div>

<p>Now let’s update our grammar. The rule for blocks is extremely simple:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"{" { &lt;block-item&gt; } "}"
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">"{" "}"</code> are literal curly braces, and <code class="language-plaintext highlighter-rouge">{ }</code> indicates repetition. This is hard to read! But it just means we have an arbitrary number of block items wrapped in braces – if you refer back to the grammar for <code class="language-plaintext highlighter-rouge">&lt;function&gt;</code> you can see that we define function bodies exactly the same way.</p>

<p>Putting it all together, our updated grammar looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;statement&gt; ::= "return" &lt;exp&gt; ";"
              | &lt;exp&gt; ";"
              | "if" "(" &lt;exp&gt; ")" &lt;statement&gt; [ "else" &lt;statement&gt; ]
              | "{" { &lt;block-item&gt; } "}"
</code></pre></div></div>
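
<p>Here’s a rough Python sketch of how a recursive-descent parser might handle the new rule. It’s deliberately tiny and rests on a few assumptions: tokens are plain strings, <code class="language-plaintext highlighter-rouge">return</code> only accepts an integer literal (a stand-in for a real <code class="language-plaintext highlighter-rouge">&lt;exp&gt;</code> parser), declarations have no initializer, and the function names and tuple-based AST are invented for this example.</p>

```python
def expect(tokens, expected):
    """Pop the next token, failing if it isn't the one we need."""
    if not tokens or tokens[0] != expected:
        raise SyntaxError(f"expected {expected!r}")
    return tokens.pop(0)


def parse_statement(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of input")
    if tokens[0] == "{":
        expect(tokens, "{")
        items = []
        # { block-item } : keep parsing block items until the closing brace
        while tokens and tokens[0] != "}":
            items.append(parse_block_item(tokens))
        expect(tokens, "}")
        return ("Compound", items)
    if tokens[0] == "return":
        expect(tokens, "return")
        value = int(tokens.pop(0))  # assumption: integer literal instead of a full expression
        expect(tokens, ";")
        return ("Return", value)
    raise SyntaxError(f"unexpected token {tokens[0]!r}")


def parse_block_item(tokens):
    # A block item is either a declaration or a statement
    if tokens and tokens[0] == "int":
        expect(tokens, "int")
        name = tokens.pop(0)
        expect(tokens, ";")
        return ("Declaration", name)
    return ("Statement", parse_statement(tokens))
```

<p>Because <code class="language-plaintext highlighter-rouge">parse_block_item</code> calls back into <code class="language-plaintext highlighter-rouge">parse_statement</code>, nested blocks fall out of the recursion for free.</p>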

<h4 id="-task">☑ Task:</h4>
<p>Update the parsing pass to handle blocks. It should successfully parse all valid examples in stages 1-7. As in part 5, some invalid examples should fail during parsing and some should fail during code generation. At this point, your parsing pass should throw an appropriate error for all invalid stage 7 examples whose names start with <code class="language-plaintext highlighter-rouge">syntax_err</code>.</p>

<h2 id="code-generation">Code Generation</h2>

<p>As we saw earlier, it’s possible to have two different variables, in two different scopes, stored at two different locations on the stack, with the same name. Here’s an example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
<span class="p">{</span>
  <span class="kt">int</span> <span class="n">foo</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So, whenever the program refers to variable <code class="language-plaintext highlighter-rouge">foo</code>, our generated code needs to access the correct <code class="language-plaintext highlighter-rouge">foo</code> on the stack – or raise an error if <code class="language-plaintext highlighter-rouge">foo</code> has gone out of scope. The code generation step this week is all about managing the variable map so we always look up the right <code class="language-plaintext highlighter-rouge">foo</code>.</p>

<p>The trick here is that <strong>every block has a separate copy of the variable map</strong>. That way, defining (or redefining) a variable in an inner scope won’t interfere with an outer scope. And if you’re using an immutable map (which you should be), every block will necessarily get its own variable map, so this approach is surprisingly easy.</p>

<p>Let’s look at some pseudocode. After Part 5, your code to generate a function body probably looked something like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def generate_function_body(body):
  // initialize variable map and stack index
  var_map = Map()
  stack_index = -4

  //process statements one at a time
  for statement in body:
    var_map, stack_index = generate_statement(statement, var_map, stack_index) 
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">generate_statement</code> has to return a new <code class="language-plaintext highlighter-rouge">var_map</code>. Every declaration updates the variable map (or, more precisely, creates a new variable map), and in part 5 <code class="language-plaintext highlighter-rouge">generate_statement</code> also handled declarations. Whenever we process a declaration, we need to return the latest, greatest variable map so future statements can reference the variable we just declared.</p>

<p>But in the last post, we separated statements from declarations in our AST, so you might have changed the last line to:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    var_map, stack_index = generate_statement_or_declaration(statement, var_map, stack_index) 
</code></pre></div></div>

<p>At this point, a declaration will create a new variable map, but a statement won’t. Whatever happens in a statement – including a compound statement, which may itself contain declarations – has no impact on the variable map for the enclosing scope. Once you understand that point, handling nested scopes is easy:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def generate_function_body(body):
  // initialize variable map and stack index
  var_map = Map()
  stack_index = -4

  //process block items one at a time
  for block_item in body:
    if block_item is a declaration:
        //update the variable map
        var_map, stack_index = generate_declaration(block_item, var_map, stack_index)
    else:
        //don't update the variable map
        generate_statement(block_item, var_map, stack_index)
</code></pre></div></div>

<p>Of course you’ll need to generalize <code class="language-plaintext highlighter-rouge">generate_function_body</code> into <code class="language-plaintext highlighter-rouge">generate_block</code>; the one difference between generating a function body and any other block is that you need to initialize your empty variable map and stack index at the start of the function body.</p>

<p>Now let’s walk through a small example to see how this maintains the right variable maps for different scopes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(){</span>
    <span class="c1">// 1) function body</span>
    <span class="p">{</span>   <span class="c1">// 2) block</span>
        <span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="c1">// 3) variable declaration</span>
        <span class="n">a</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span> <span class="c1">// 4) variable reference</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">;</span> <span class="c1">// 5) return statement</span>
<span class="p">}</span>
</code></pre></div></div>

<ol>
  <li>We’ll process the function body with <code class="language-plaintext highlighter-rouge">generate_block</code>. Right now we’ve got an empty variable map.</li>
  <li>We call <code class="language-plaintext highlighter-rouge">generate_block</code> recursively to process the inner block. The variable map is still empty.</li>
  <li>This is a declaration, so we add <code class="language-plaintext highlighter-rouge">a</code> to the variable map (technically, we create a copy of the variable map that contains <code class="language-plaintext highlighter-rouge">a</code>, because all these maps are immutable).</li>
  <li>We look up <code class="language-plaintext highlighter-rouge">a</code>’s location on the stack in the variable map from step 3.</li>
  <li>Back in the outer scope, <code class="language-plaintext highlighter-rouge">var_map</code> refers to the original, <em>empty</em> variable map. Since <code class="language-plaintext highlighter-rouge">a</code> isn’t defined in this map, this will throw an error, as it should.</li>
</ol>
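
<p>The walkthrough above can be sketched in Python. This is a simplified model under several assumptions: block items are tuples, a hypothetical <code class="language-plaintext highlighter-rouge">Use</code> statement stands in for any statement that references a variable, dicts are copied rather than mutated to play the role of immutable maps, and instead of emitting assembly the code just records which stack offset each reference resolves to. It also leaves out duplicate-declaration checks and deallocation.</p>

```python
def generate_block(block, var_map, stack_index, out):
    for kind, node in block:
        if kind == "Declaration":
            # Build a new map instead of mutating, so the caller's map is untouched
            var_map = {**var_map, node: stack_index}
            stack_index -= 4
        else:
            # Statements can't change this block's variable map
            generate_statement(node, var_map, stack_index, out)


def generate_statement(statement, var_map, stack_index, out):
    kind, payload = statement
    if kind == "Compound":
        # The inner block starts from our map, but its changes never come back
        generate_block(payload, var_map, stack_index, out)
    elif kind == "Use":
        if payload not in var_map:
            raise NameError(f"{payload} is not defined in this scope")
        out.append((payload, var_map[payload]))


# Same shape as the example: { int a; (use a) } (use a)
main_body = [
    ("Statement", ("Compound", [
        ("Declaration", "a"),
        ("Statement", ("Use", "a")),
    ])),
    ("Statement", ("Use", "a")),
]
```

<p>Running <code class="language-plaintext highlighter-rouge">generate_block(main_body, {}, -4, [])</code> resolves the inner reference to offset -4 and then raises <code class="language-plaintext highlighter-rouge">NameError</code> on the outer one, just like step 5.</p>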

<p>The code for handling declarations also needs to be changed. The pseudocode for processing declarations from part 5 included this line:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if var_map.contains("a"):
  fail() //shouldn't declare a var twice
</code></pre></div></div>

<p>This is now incorrect; it’s legal to declare two variables with the same name, as long as the declarations aren’t in the same scope. To solve this, we need a way to distinguish between variables defined in the current scope, and variables defined in an outer scope. My solution was to maintain a set of variables that are defined in the current scope, which means <code class="language-plaintext highlighter-rouge">generate_block</code> now looks something like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def generate_block(block, var_map, stack_index):

  current_scope = Set()

  //process block items one at a time
  for block_item in block:
    if block_item is a declaration:
        //update the variable map
        var_map, stack_index, current_scope = generate_declaration(block_item, var_map, stack_index, current_scope)
    else:
        //don't update the variable map
        generate_statement(block_item, var_map, stack_index)
</code></pre></div></div>

<p>Finally, we check <code class="language-plaintext highlighter-rouge">current_scope</code>, rather than <code class="language-plaintext highlighter-rouge">var_map</code>, for duplicate variable declarations, and add the variable to both structures on success:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if current_scope.contains("a"):
  fail() //shouldn't declare a var twice in the same scope
else:
  //emit assembly, update stack_index and var_map as before...
  new_scope = current_scope.add("a")
  return (var_map, stack_index, new_scope)
</code></pre></div></div>

<p>This solution feels hacky, but I haven’t come up with a better one.</p>

<p>Now, if <code class="language-plaintext highlighter-rouge">a</code> is redefined in an inner scope, it just overwrites the old <code class="language-plaintext highlighter-rouge">a</code> in the variable map, so this scope and any inner ones will use the correct stack location, corresponding to the innermost definition of <code class="language-plaintext highlighter-rouge">a</code>. This won’t affect the outer scope at all, because the outer scope is still using the original, unmodified variable map.</p>
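
<p>As a small illustration, here’s one way the declaration handler could look in Python. It’s a sketch, not the real thing: it assumes <code class="language-plaintext highlighter-rouge">current_scope</code> is a <code class="language-plaintext highlighter-rouge">frozenset</code>, uses a copied dict in place of an immutable map, and skips emitting assembly entirely.</p>

```python
def generate_declaration(name, var_map, stack_index, current_scope):
    # Only names declared in *this* block count as duplicates
    if name in current_scope:
        raise NameError(f"{name} is declared twice in the same scope")
    # Overwrites any outer definition in the new map; the old map is unchanged
    new_map = {**var_map, name: stack_index}
    new_scope = current_scope | {name}
    return new_map, stack_index - 4, new_scope


# Outer block declares a, then an inner block (with a fresh, empty scope set) redeclares it
outer_map, index, outer_scope = generate_declaration("a", {}, -4, frozenset())
inner_map, index, inner_scope = generate_declaration("a", outer_map, index, frozenset())
```

<p>After this runs, <code class="language-plaintext highlighter-rouge">outer_map</code> still maps <code class="language-plaintext highlighter-rouge">a</code> to -4 while <code class="language-plaintext highlighter-rouge">inner_map</code> maps it to -8. Declaring <code class="language-plaintext highlighter-rouge">a</code> again with <code class="language-plaintext highlighter-rouge">inner_scope</code> would fail.</p>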

<h3 id="deallocating-variables">Deallocating Variables</h3>

<p>We’ve carefully managed our variable map to prevent a block from interfering with any variable declarations in its enclosing scope. But there’s one side effect we couldn’t avoid: allocating a variable changes the stack pointer. This is a problem, because the stack pointer and our <code class="language-plaintext highlighter-rouge">stack_index</code> variable will get out of sync. Consider the following example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
  <span class="k">return</span> <span class="n">j</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>At first, the variable map is empty and <code class="language-plaintext highlighter-rouge">stack_index</code> is -4, because the first empty spot on the stack is four bytes below EBP:</p>

<p><img src="/assets/bad_stack_pointer.svg" alt="EBP and ESP point to the same location, the lowest address on the stack; the stack index points to the address just below it." /></p>

<p>When we process the block in this example with <code class="language-plaintext highlighter-rouge">generate_block</code>, we’ll push <code class="language-plaintext highlighter-rouge">i</code> onto the stack:</p>

<pre><code class="language-asm">    movl $0, %eax
    push %eax
</code></pre>

<p>Now ESP is at EBP - 4, and <code class="language-plaintext highlighter-rouge">stack_index</code> is -8:</p>

<p><img src="/assets/bad_stack_pointer_2.svg" alt="ESP points at the address just below EBP on the call stack, which holds literal value 0. the stack index points just below that value. An entry in the variable map associates i with address EBP - 4" /></p>

<p>After we exit the block, we forget that we allocated <code class="language-plaintext highlighter-rouge">i</code>. That means <code class="language-plaintext highlighter-rouge">i</code> is no longer in our variable map, and we’re still working with our original stack index of -4; remember that <code class="language-plaintext highlighter-rouge">generate_block</code> doesn’t return a stack index. We <em>should</em> forget <code class="language-plaintext highlighter-rouge">i</code>, because it’s out of scope.</p>

<p>The problem is, <code class="language-plaintext highlighter-rouge">i</code> is still there, because ESP is still pointing at it.</p>

<p><img src="/assets/bad_stack_pointer_3.svg" alt="ESP and stack index both point at EBP - 4, the address where i was allocated. The variable map is empty." /></p>

<p>So when we push <code class="language-plaintext highlighter-rouge">j</code>, it will be just below <code class="language-plaintext highlighter-rouge">i</code>, at EBP - 8:</p>

<pre><code class="language-asm">  movl $1, %eax
  push %eax
</code></pre>

<p><img src="/assets/bad_stack_pointer_4.svg" alt="ESP and the stack index both point at EBP - 8, which contains literal value 1. However, the variable map associates j with EBP - 4. " /></p>

<p>But because the stack index was -4, we’ll add a mapping from <code class="language-plaintext highlighter-rouge">j</code> to -4 in our variable map. Any future references to <code class="language-plaintext highlighter-rouge">j</code> (like in the return statement) will incorrectly use the stack location of <code class="language-plaintext highlighter-rouge">i</code> instead.</p>

<p>We <em>could</em> solve this by having <code class="language-plaintext highlighter-rouge">generate_block</code> return a stack index, but it’s probably better to just pop variables off the stack when we’re done with them, right at the end of <code class="language-plaintext highlighter-rouge">generate_block</code>. Conveniently, the size of <code class="language-plaintext highlighter-rouge">current_scope</code> tells us how many variables we need to pop.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def generate_block(block, var_map, stack_index):

  current_scope = Set()
  ...as before...

  bytes_to_deallocate = 4 * current_scope.size()
  emit "    addl ${}, %esp".format(bytes_to_deallocate)
</code></pre></div></div>
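
<p>Putting the pieces together, here’s a rough Python sketch of <code class="language-plaintext highlighter-rouge">generate_block</code> with deallocation, run on the example from this section. The tuple-based AST, the <code class="language-plaintext highlighter-rouge">out</code> list of assembly lines, and the simplified <code class="language-plaintext highlighter-rouge">return</code> (which only loads the variable into EAX and skips the function epilogue) are all assumptions made to keep the sketch short.</p>

```python
def generate_block(block, var_map, stack_index, out):
    current_scope = set()
    for item in block:
        if item[0] == "Declaration":
            _, name, value = item
            if name in current_scope:
                raise NameError(f"{name} is declared twice in the same scope")
            out.append(f"    movl ${value}, %eax")
            out.append("    push %eax")
            var_map = {**var_map, name: stack_index}  # copy, don't mutate the caller's map
            stack_index -= 4
            current_scope.add(name)
        elif item[0] == "Compound":
            generate_block(item[1], var_map, stack_index, out)
        elif item[0] == "Return":
            # Simplified: load the variable, no epilogue
            out.append(f"    movl {var_map[item[1]]}(%ebp), %eax")
    # Pop every variable this block allocated so ESP matches the caller's stack_index
    bytes_to_deallocate = 4 * len(current_scope)
    out.append(f"    addl ${bytes_to_deallocate}, %esp")


# int main() { { int i = 0; } int j = 1; return j; }
main_body = [
    ("Compound", [("Declaration", "i", 0)]),
    ("Declaration", "j", 1),
    ("Return", "j"),
]
assembly = []
generate_block(main_body, {}, -4, assembly)
```

<p>With the <code class="language-plaintext highlighter-rouge">addl</code> in place, the inner block pops <code class="language-plaintext highlighter-rouge">i</code> before <code class="language-plaintext highlighter-rouge">j</code> is pushed, so <code class="language-plaintext highlighter-rouge">j</code> really does land at EBP - 4, which is what the variable map says.</p>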

<h4 id="-task-1">☑ Task:</h4>
<p>Update the code-generation pass to correctly handle compound statements. It should succeed on all valid examples and fail on all invalid examples for stages 1-7.</p>

<h2 id="up-next">Up Next</h2>
<p>In the <a href="/2018/04/10/Write-a-Compiler-8.html">next post</a>, we’ll add <code class="language-plaintext highlighter-rouge">for</code>, <code class="language-plaintext highlighter-rouge">do</code>, and <code class="language-plaintext highlighter-rouge">while</code> loops. See you then!</p>

<p><em>If you have any questions, corrections, or other feedback, you can <a href="mailto:nora@norasandler.com">email me</a> or <a href="https://github.com/nlsandler/write_a_c_compiler/issues">open an issue</a>.</em></p>

<div class="footnote">
  <p><sup id="fn1">1</sup>
I’ll use comments to clarify the code snippets throughout this post, even though we haven’t added support for comments yet.
<a href="#anchor1">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn2">2</sup>
Global variables work a bit differently but we haven’t added those yet.
<a href="#anchor2">↩</a></p>
</div>]]></content><author><name>Nora Sandler</name></author><category term="compiler-tutorial" /><summary type="html"><![CDATA[Update 4/9]]></summary></entry><entry><title type="html">Writing a C Compiler, Part 6</title><link href="https://norasandler.com/2018/02/25/Write-a-Compiler-6.html" rel="alternate" type="text/html" title="Writing a C Compiler, Part 6" /><published>2018-02-25T20:00:00+00:00</published><updated>2018-02-25T20:00:00+00:00</updated><id>https://norasandler.com/2018/02/25/Write-a-Compiler-6</id><content type="html" xml:base="https://norasandler.com/2018/02/25/Write-a-Compiler-6.html"><![CDATA[<p><em>This is the sixth post in a series. Read part 1 <a href="/2017/11/29/Write-a-Compiler.html">here</a>.</em></p>

<p>Hi, this blog isn’t dead! It was just, uh, resting. I’ve been swamped with non-blog things for the past few weeks but I’m back on track now, probably, I hope.</p>

<p>Today we’ll implement conditional statements and expressions. As usual, accompanying tests are <a href="https://github.com/nlsandler/write_a_c_compiler">here</a>.</p>

<h1 id="part-6-conditionals">Part 6: Conditionals</h1>

<p>In this post we’ll add support for two types of conditional constructs:</p>

<ol>
  <li>Conditional statements, a.k.a. <code class="language-plaintext highlighter-rouge">if</code> statements</li>
  <li>Ternary conditional expressions, which have the form <code class="language-plaintext highlighter-rouge">a ? b : c</code>. I’ll sometimes just call these “conditional expressions”.</li>
</ol>

<h3 id="if-statements">If Statements</h3>

<p>An <code class="language-plaintext highlighter-rouge">if</code> statement consists of a condition, a substatement that executes if the condition is true, and maybe another substatement that executes if the condition is false. Either of these substatements can be a single statement, like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">flag</span><span class="p">)</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>or a compound statement, like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">flag</span><span class="p">)</span> <span class="p">{</span>
  <span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
  <span class="k">return</span> <span class="n">a</span><span class="o">*</span><span class="mi">2</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Adding support for compound statements is a distinct task that we’re not going to handle in this post. So for now, we’ll only support the first of the examples above, and not the second.</p>

<p>We say a condition is <strong>false</strong> if it evaluates to zero, and <strong>true</strong> otherwise, just like when we implemented boolean operators in earlier posts.</p>

<h4 id="else-if">Else If</h4>

<p>Note that C doesn’t have an explicit <code class="language-plaintext highlighter-rouge">else if</code> construct. If an <code class="language-plaintext highlighter-rouge">if</code> keyword immediately follows an <code class="language-plaintext highlighter-rouge">else</code> keyword, the whole <code class="language-plaintext highlighter-rouge">if</code> statement gets parsed as the <code class="language-plaintext highlighter-rouge">else</code> branch. In other words, the following code snippets are equivalent:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">flag</span><span class="p">)</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">other_flag</span><span class="p">)</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">else</span>
    <span class="k">return</span> <span class="mi">2</span><span class="p">;</span>
</code></pre></div></div>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">flag</span><span class="p">)</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">else</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">other_flag</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span>
        <span class="k">return</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="conditional-expressions">Conditional Expressions</h3>

<p>These expressions take the following form:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">?</span> <span class="n">b</span> <span class="o">:</span> <span class="n">c</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">a</code> is true, the expression will evaluate to <code class="language-plaintext highlighter-rouge">b</code>; otherwise it will evaluate to <code class="language-plaintext highlighter-rouge">c</code>.</p>

<p>Note that we should only execute the expression we actually need. For example, in the following code snippet:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">0</span> <span class="o">?</span> <span class="n">foo</span><span class="p">()</span> <span class="o">:</span> <span class="n">bar</span><span class="p">()</span>
</code></pre></div></div>

<p>the function <code class="language-plaintext highlighter-rouge">foo</code> should never be called. You might be tempted to call both <code class="language-plaintext highlighter-rouge">foo</code> and <code class="language-plaintext highlighter-rouge">bar</code>, then discard the result from <code class="language-plaintext highlighter-rouge">foo</code>, but that would be wrong; <code class="language-plaintext highlighter-rouge">foo</code> could print to the console, make a network call, or dereference a null pointer and crash the program. Obviously this point is also true of <code class="language-plaintext highlighter-rouge">if</code> statements – we should execute the <code class="language-plaintext highlighter-rouge">if</code> branch or the <code class="language-plaintext highlighter-rouge">else</code> branch but definitely not both.</p>
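
<p>Here’s a small C sketch that makes this observable. The names <code class="language-plaintext highlighter-rouge">foo_calls</code>, <code class="language-plaintext highlighter-rouge">bar_calls</code> and <code class="language-plaintext highlighter-rouge">choose</code> are made up for illustration; the counters stand in for whatever side effects <code class="language-plaintext highlighter-rouge">foo</code> and <code class="language-plaintext highlighter-rouge">bar</code> might have.</p>

```c
/* Hypothetical functions with a visible side effect: each one counts
   how many times it has been called. */
static int foo_calls = 0;
static int bar_calls = 0;

static int foo(void) { foo_calls++; return 10; }
static int bar(void) { bar_calls++; return 20; }

/* Evaluates exactly one of foo() and bar(), depending on cond. */
int choose(int cond) {
    return cond ? foo() : bar();
}
```

<p>After a call to <code class="language-plaintext highlighter-rouge">choose(0)</code>, only <code class="language-plaintext highlighter-rouge">bar_calls</code> has changed. The assembly our compiler emits needs to behave the same way.</p>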

<p>Conditional expressions and <code class="language-plaintext highlighter-rouge">if</code> statements might seem very similar, but it’s important to remember that statements and expressions are used in totally different ways. For example, an expression has a value, but a statement doesn’t. So this is legal:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="n">flag</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">3</span><span class="p">;</span>
</code></pre></div></div>

<p>but this isn’t<sup id="anchor1"><a href="#fn1">1</a></sup>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//this is bogus</span>
<span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="k">if</span> <span class="p">(</span><span class="n">flag</span><span class="p">)</span>
            <span class="mi">2</span><span class="p">;</span>
        <span class="k">else</span>
            <span class="mi">3</span><span class="p">;</span>
</code></pre></div></div>

<p>On the other hand, a statement can contain other statements, but an expression can’t contain statements. For example, you can nest a <code class="language-plaintext highlighter-rouge">return</code> statement inside an <code class="language-plaintext highlighter-rouge">if</code> statement:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">flag</span><span class="p">)</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>but you can’t have a <code class="language-plaintext highlighter-rouge">return</code> statement inside a conditional expression:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//this is also bogus</span>
<span class="n">flag</span> <span class="o">?</span> <span class="k">return</span> <span class="mi">1</span> <span class="o">:</span> <span class="k">return</span> <span class="mi">2</span><span class="p">;</span>
</code></pre></div></div>

<h2 id="lexing">Lexing</h2>

<p>We need to define a few more tokens: <code class="language-plaintext highlighter-rouge">if</code> and <code class="language-plaintext highlighter-rouge">else</code> keywords for <code class="language-plaintext highlighter-rouge">if</code> statements, plus <code class="language-plaintext highlighter-rouge">:</code> and <code class="language-plaintext highlighter-rouge">?</code> operators for conditional expressions. Here’s the full list of tokens, with new tokens in bold at the bottom:</p>

<ul>
  <li>Open brace <code class="language-plaintext highlighter-rouge">{</code></li>
  <li>Close brace <code class="language-plaintext highlighter-rouge">}</code></li>
  <li>Open parenthesis <code class="language-plaintext highlighter-rouge">(</code></li>
  <li>Close parenthesis <code class="language-plaintext highlighter-rouge">)</code></li>
  <li>Semicolon <code class="language-plaintext highlighter-rouge">;</code></li>
  <li>Int keyword <code class="language-plaintext highlighter-rouge">int</code></li>
  <li>Return keyword <code class="language-plaintext highlighter-rouge">return</code></li>
  <li>Identifier <code class="language-plaintext highlighter-rouge">[a-zA-Z]\w*</code></li>
  <li>Integer literal <code class="language-plaintext highlighter-rouge">[0-9]+</code></li>
  <li>Minus <code class="language-plaintext highlighter-rouge">-</code></li>
  <li>Bitwise complement <code class="language-plaintext highlighter-rouge">~</code></li>
  <li>Logical negation <code class="language-plaintext highlighter-rouge">!</code></li>
  <li>Addition <code class="language-plaintext highlighter-rouge">+</code></li>
  <li>Multiplication <code class="language-plaintext highlighter-rouge">*</code></li>
  <li>Division <code class="language-plaintext highlighter-rouge">/</code></li>
  <li>AND <code class="language-plaintext highlighter-rouge">&amp;&amp;</code></li>
  <li>OR <code class="language-plaintext highlighter-rouge">||</code></li>
  <li>Equal <code class="language-plaintext highlighter-rouge">==</code></li>
  <li>Not Equal <code class="language-plaintext highlighter-rouge">!=</code></li>
  <li>Less than <code class="language-plaintext highlighter-rouge">&lt;</code></li>
  <li>Less than or equal <code class="language-plaintext highlighter-rouge">&lt;=</code></li>
  <li>Greater than <code class="language-plaintext highlighter-rouge">&gt;</code></li>
  <li>Greater than or equal <code class="language-plaintext highlighter-rouge">&gt;=</code></li>
  <li>Assignment <code class="language-plaintext highlighter-rouge">=</code></li>
  <li><strong>If keyword <code class="language-plaintext highlighter-rouge">if</code></strong></li>
  <li><strong>Else keyword <code class="language-plaintext highlighter-rouge">else</code></strong></li>
  <li><strong>Colon <code class="language-plaintext highlighter-rouge">:</code></strong></li>
  <li><strong>Question mark <code class="language-plaintext highlighter-rouge">?</code></strong></li>
</ul>
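
<p>If you’re writing your lexer by hand, recognizing the new tokens might look roughly like the sketch below. It’s only an illustration: the <code class="language-plaintext highlighter-rouge">TokenKind</code> names and the <code class="language-plaintext highlighter-rouge">next_token</code> function are assumptions, not part of the test suite, and every token we aren’t interested in here is lumped into <code class="language-plaintext highlighter-rouge">TOK_OTHER</code>.</p>

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token kinds; only the ones needed for this sketch. */
typedef enum {
    TOK_IF, TOK_ELSE, TOK_COLON, TOK_QUESTION,
    TOK_IDENT, TOK_OTHER, TOK_EOF
} TokenKind;

/* Reads one token starting at *src and advances *src past it.
   Anything that is not a word, ':' or '?' is returned as TOK_OTHER. */
TokenKind next_token(const char **src) {
    const char *p = *src;
    while (isspace((unsigned char)*p))
        p++;
    if (*p == '\0') {
        *src = p;
        return TOK_EOF;
    }
    if (isalpha((unsigned char)*p)) {
        const char *start = p;
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        size_t len = (size_t)(p - start);
        *src = p;
        /* Keywords are matched against the whole word, so "iffy" and
           "elsewhere" stay identifiers. */
        if (len == 2 && strncmp(start, "if", 2) == 0)
            return TOK_IF;
        if (len == 4 && strncmp(start, "else", 4) == 0)
            return TOK_ELSE;
        return TOK_IDENT;
    }
    *src = p + 1;
    if (*p == ':')
        return TOK_COLON;
    if (*p == '?')
        return TOK_QUESTION;
    return TOK_OTHER;
}
```

<p>The important detail is that keywords are only recognized after the whole word has been read; otherwise an identifier that merely starts with <code class="language-plaintext highlighter-rouge">if</code> would be split in two.</p>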

<h4 id="-task">☑ Task:</h4>
<p>Update the <em>lex</em> function to handle the new tokens. It should work for all stage 1-6 examples in the test suite, including the invalid ones.</p>

<h2 id="parsing">Parsing</h2>

<p>We’ll parse conditional expressions and <code class="language-plaintext highlighter-rouge">if</code> statements totally differently. Let’s handle <code class="language-plaintext highlighter-rouge">if</code> statements first.</p>

<h3 id="if-statements-1">If Statements</h3>

<p>So far, we’ve defined three types of statements in our AST: return statements, expressions, and variable declarations. Right now the definition looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>statement = Return(exp) 
          | Declare(string, exp option) //string is variable name
                                        //exp is optional initializer
          | Exp(exp)
</code></pre></div></div>

<p>We need to add a <code class="language-plaintext highlighter-rouge">Conditional</code> statement to represent <code class="language-plaintext highlighter-rouge">if</code> statements. It has three parts: an expression (the controlling condition), an <code class="language-plaintext highlighter-rouge">if</code> branch and an optional <code class="language-plaintext highlighter-rouge">else</code> branch. Here’s our updated AST definition for statements:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>statement = Return(exp) 
          | Declare(string, exp option) //string is variable name
                                        //exp is optional initializer
          | Exp(exp)
          | Conditional(exp, statement, statement option) //exp is controlling condition
                                                          //first statement is 'if' branch
                                                          //second statement is optional 'else' branch
</code></pre></div></div>

<p>Now let’s update our grammar. The rule for <code class="language-plaintext highlighter-rouge">if</code> statements consists of:</p>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">if</code> keyword</li>
  <li>An expression wrapped in parentheses (the condition)</li>
  <li>A statement (executed if the condition is true)</li>
  <li>Optionally, the <code class="language-plaintext highlighter-rouge">else</code> keyword, followed by another statement (executed if the condition is false)</li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"if" "(" &lt;exp&gt; ")" &lt;statement&gt; [ "else" &lt;statement&gt; ]
</code></pre></div></div>

<p>So the updated grammar for statements looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;statement&gt; ::= "return" &lt;exp&gt; ";"
              | &lt;exp&gt; ";"
              | "int" &lt;id&gt; [ "=" &lt;exp&gt; ] ";"
              | "if" "(" &lt;exp&gt; ")" &lt;statement&gt; [ "else" &lt;statement&gt; ]
</code></pre></div></div>

<p>Our definition of statements is recursive! But it’s not left-recursive, so it’s not a problem.</p>

<p>But we have another problem. We defined variable declarations as a type of statement, but declarations in C <strong>aren’t statements</strong>. For example, this code snippet isn’t valid:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//this will throw a compiler error!</span>
<span class="k">if</span> <span class="p">(</span><span class="n">flag</span><span class="p">)</span>
  <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>When we added variable declarations in the last post, it didn’t matter whether or not we defined them as statements; we could parse the same subset of C and generate the same assembly either way. Now that we’re dealing with more complex structures like <code class="language-plaintext highlighter-rouge">if</code> statements, that simplification impacts what we can and can’t parse, so we need to fix it.</p>

<p>So we need to move <code class="language-plaintext highlighter-rouge">Declare</code> out of the <code class="language-plaintext highlighter-rouge">statement</code> type and into its own type. But this introduces a new problem: we’ve defined a function body as a list of statements, but if declarations aren’t statements, then you can’t have declarations in a function body. To fix this, we’ll need to tweak how we define functions in our AST. Let’s introduce some terminology:</p>

<ul>
  <li>A <strong>block item</strong> is a statement or declaration.</li>
  <li>A <strong>block</strong> or <strong>compound statement</strong> is a list of block items wrapped in curly braces<sup id="anchor2"><a href="#fn2">2</a></sup>.</li>
</ul>

<p>Function bodies are just a special case of blocks; they contain a list of declarations and statements. To represent them, we’ll introduce a new <code class="language-plaintext highlighter-rouge">block_item</code> type that can hold either a statement or a declaration. This will also come in handy when we add support for blocks in general in the next post. With those changes, the relevant parts of our AST will look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>statement = Return(exp)                                         
          | Exp(exp)
          | Conditional(exp, statement, statement option) //exp is controlling condition
                                                          //first statement is 'if' block
                                                          //second statement is optional 'else' block

declaration = Declare(string, exp option) //string is variable name 
                                          //exp is optional initializer

block_item = Statement(statement) | Declaration(declaration)

function_declaration = Function(string, block_item list) //string is the function name                                                                                      
</code></pre></div></div>
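
<p>How you represent <code class="language-plaintext highlighter-rouge">block_item</code> depends on your implementation language. In a language without sum types, a tagged union is one option. The C sketch below is just one possible rendering; all of the type and function names are made up, and the statement and declaration types are left opaque.</p>

```c
/* Hypothetical C rendering of the block_item type as a tagged union.
   Statement and Declaration would be defined elsewhere in a real
   compiler; here they are only forward-declared. */
typedef struct Statement Statement;
typedef struct Declaration Declaration;

typedef enum { ITEM_STATEMENT, ITEM_DECLARATION } BlockItemKind;

typedef struct {
    BlockItemKind kind;            /* says which union member is valid */
    union {
        Statement *statement;      /* valid when kind == ITEM_STATEMENT */
        Declaration *declaration;  /* valid when kind == ITEM_DECLARATION */
    } as;
} BlockItem;

BlockItem make_statement_item(Statement *s) {
    BlockItem item;
    item.kind = ITEM_STATEMENT;
    item.as.statement = s;
    return item;
}

BlockItem make_declaration_item(Declaration *d) {
    BlockItem item;
    item.kind = ITEM_DECLARATION;
    item.as.declaration = d;
    return item;
}
```

<p>A function body is then just a list (or array) of <code class="language-plaintext highlighter-rouge">BlockItem</code> values, and later passes switch on <code class="language-plaintext highlighter-rouge">kind</code>.</p>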

<p>And here’s the updated grammar:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;statement&gt; ::= "return" &lt;exp&gt; ";"
              | &lt;exp&gt; ";"
              | "if" "(" &lt;exp&gt; ")" &lt;statement&gt; [ "else" &lt;statement&gt; ]
&lt;declaration&gt; ::= "int" &lt;id&gt; [ "=" &lt;exp&gt; ] ";"
&lt;block-item&gt; ::= &lt;statement&gt; | &lt;declaration&gt;
&lt;function&gt; ::= "int" &lt;id&gt; "(" ")" "{" { &lt;block-item&gt; } "}"
</code></pre></div></div>
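
<p>To make the recursion in the <code class="language-plaintext highlighter-rouge">&lt;statement&gt;</code> rule concrete, here’s a rough recursive-descent sketch in C. It makes a big simplifying assumption: an <code class="language-plaintext highlighter-rouge">&lt;exp&gt;</code> is a single identifier token, which is <em>not</em> true of our real grammar. The token and AST types and the function names are all hypothetical.</p>

```c
#include <stdlib.h>

/* Hypothetical, heavily simplified token and AST types. In this sketch
   an <exp> is a single T_ID token, just to keep the example short. */
typedef enum { T_IF, T_ELSE, T_RETURN, T_INT, T_LPAREN, T_RPAREN,
               T_SEMI, T_ID, T_EOF } Tok;

typedef enum { S_RETURN, S_EXP, S_IF } StmtKind;

typedef struct Stmt {
    StmtKind kind;
    struct Stmt *then_branch;  /* only used by S_IF */
    struct Stmt *else_branch;  /* only used by S_IF; NULL if no else */
} Stmt;

void free_stmt(Stmt *s) {
    if (s == NULL)
        return;
    free_stmt(s->then_branch);
    free_stmt(s->else_branch);
    free(s);
}

/* Consumes the next token if it is the one we want. */
static int expect(const Tok **toks, Tok want) {
    if (**toks != want)
        return 0;
    (*toks)++;
    return 1;
}

/* <statement> ::= "return" <exp> ";" | <exp> ";"
                 | "if" "(" <exp> ")" <statement> [ "else" <statement> ]
   Returns NULL on a parse error. */
Stmt *parse_statement(const Tok **toks) {
    Stmt *s = calloc(1, sizeof *s);
    if (s == NULL)
        return NULL;
    if (expect(toks, T_IF)) {
        s->kind = S_IF;
        if (!expect(toks, T_LPAREN) || !expect(toks, T_ID) ||
            !expect(toks, T_RPAREN))
            goto fail;
        s->then_branch = parse_statement(toks);
        if (s->then_branch == NULL)
            goto fail;
        if (expect(toks, T_ELSE)) {      /* the else clause is optional */
            s->else_branch = parse_statement(toks);
            if (s->else_branch == NULL)
                goto fail;
        }
        return s;
    }
    s->kind = expect(toks, T_RETURN) ? S_RETURN : S_EXP;
    if (!expect(toks, T_ID) || !expect(toks, T_SEMI))
        goto fail;
    return s;
fail:
    free_stmt(s);
    return NULL;
}
```

<p>Two things fall out of this structure. An <code class="language-plaintext highlighter-rouge">else if</code> chain needs no special handling, because the statement after <code class="language-plaintext highlighter-rouge">else</code> is parsed by the same function. And a declaration directly under an <code class="language-plaintext highlighter-rouge">if</code> is rejected, because <code class="language-plaintext highlighter-rouge">parse_statement</code> has no case that starts with the <code class="language-plaintext highlighter-rouge">int</code> keyword.</p>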

<p>Now that we have our AST and grammar, you should be able to update your compiler to parse conditional statements. You may want to do that before we move on to conditional expressions.</p>

<h4 id="-task-1">☑ Task:</h4>
<p>Update the parsing pass to handle conditional statements. It should successfully parse all valid stage 6 examples in <code class="language-plaintext highlighter-rouge">write_a_c_compiler/stage_6/valid/statement</code>, and throw an error for all invalid stage 6 examples in <code class="language-plaintext highlighter-rouge">write_a_c_compiler/stage_6/invalid/statement</code>.</p>

<h3 id="conditional-expressions-1">Conditional Expressions</h3>

<p>Now let’s add ternary conditional expressions. Here’s how we’ve defined our AST for expressions so far:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exp = Assign(string, exp)
    | Var(string) //string is variable name
    | BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)
</code></pre></div></div>

<p>It’s straightforward to add a <code class="language-plaintext highlighter-rouge">CondExp</code> form, named so that it doesn’t clash with the <code class="language-plaintext highlighter-rouge">Conditional</code> statement:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exp = Assign(string, exp)
    | Var(string) //string is variable name
    | BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)
    | CondExp(exp, exp, exp) //the three expressions are the condition, 'if' expression and 'else' expression, respectively
</code></pre></div></div>

<p>We also need to update the grammar rules for expressions, which currently look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;exp&gt; ::= &lt;id&gt; "=" &lt;exp&gt; | &lt;logical-or-exp&gt;
&lt;logical-or-exp&gt; ::= &lt;logical-and-exp&gt; { "||" &lt;logical-and-exp&gt; } 
...more rules...
</code></pre></div></div>

<p>The conditional operator has higher precedence than assignment (<code class="language-plaintext highlighter-rouge">=</code>) but lower precedence than logical OR (<code class="language-plaintext highlighter-rouge">||</code>), and it’s right-associative. We can take its grammar rule straight from section 6.5.15 of the <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf">C11 standard</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;conditional-exp&gt; ::= &lt;logical-or-exp&gt; "?" &lt;exp&gt; ":" &lt;conditional-exp&gt;
</code></pre></div></div>

<p>Let’s think about why it’s defined this way. I’ll refer to the three sub-expressions as <strong>e1</strong>, <strong>e2</strong>, and <strong>e3</strong>, such that a conditional expression has the form <code class="language-plaintext highlighter-rouge">e1 ? e2 : e3</code>. Expression <strong>e1</strong> has to be a <code class="language-plaintext highlighter-rouge">&lt;logical-or-exp&gt;</code> because it can’t be an assignment expression or a conditional expression. It can’t be an assignment expression because assignment has lower precedence than the conditional operator. In other words:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">3</span><span class="p">;</span>
</code></pre></div></div>

<p>must be parsed as:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">3</span><span class="p">);</span>
</code></pre></div></div>

<p>In our current grammar this is specified unambiguously, but if we instead defined a conditional expression as:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;conditional-exp&gt; ::= &lt;exp&gt; "?" &lt;exp&gt; ":" &lt;conditional-exp&gt;
</code></pre></div></div>

<p>then it would be ambiguous; the statement above could also be parsed as:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">3</span><span class="p">;</span>
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">(a = 1) ? 2 : 3;</code> is a valid statement, but you need the parentheses in order to parse it that way.</p>
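
<p>You can check the difference between the two parses with any C compiler. In this sketch the function names are made up; each function returns the final value of <code class="language-plaintext highlighter-rouge">a</code>, and the second one also reports the value of the whole conditional through <code class="language-plaintext highlighter-rouge">result</code>.</p>

```c
/* No parentheses: the conditional is evaluated first, and its value
   is what gets assigned to a. */
int unparenthesized(void) {
    int a = 0;
    a = 1 ? 2 : 3;
    return a;
}

/* Explicit parentheses: a is assigned 1, and that value is then used
   as the condition. The conditional's own value goes into *result. */
int parenthesized(int *result) {
    int a = 0;
    *result = (a = 1) ? 2 : 3;
    return a;
}
```

<p>The first function leaves <code class="language-plaintext highlighter-rouge">a</code> equal to 2. The second leaves <code class="language-plaintext highlighter-rouge">a</code> equal to 1, even though the conditional itself still evaluates to 2.</p>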

<p>So that’s why <strong>e1</strong> can’t be an assignment expression. It can’t be a conditional expression because <code class="language-plaintext highlighter-rouge">?</code> is right-associative. In other words:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">flag1</span> <span class="o">?</span> <span class="mi">4</span> <span class="o">:</span> <span class="n">flag2</span> <span class="o">?</span> <span class="mi">6</span> <span class="o">:</span> <span class="mi">7</span>
</code></pre></div></div>

<p>must be parsed as</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">flag1</span> <span class="o">?</span> <span class="mi">4</span> <span class="o">:</span> <span class="p">(</span><span class="n">flag2</span> <span class="o">?</span> <span class="mi">6</span> <span class="o">:</span> <span class="mi">7</span><span class="p">)</span>
</code></pre></div></div>

<p>If we had defined a conditional expression as:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;conditional-exp&gt; ::= &lt;conditional-exp&gt; "?" &lt;exp&gt; ":" &lt;conditional-exp&gt;
</code></pre></div></div>

<p>then the example above could also be parsed as:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">flag1</span> <span class="o">?</span> <span class="mi">4</span> <span class="o">:</span> <span class="n">flag2</span><span class="p">)</span> <span class="o">?</span> <span class="mi">6</span> <span class="o">:</span> <span class="mi">7</span>
</code></pre></div></div>

<p>and the grammar would be ambiguous.</p>
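
<p>The two groupings really do produce different results, which is why the ambiguity matters. Here’s a quick C sketch (the function names are invented for this example) comparing the unparenthesized expression against both explicit groupings:</p>

```c
/* What the C grammar gives us: ?: groups to the right. */
int ungrouped(int flag1, int flag2) {
    return flag1 ? 4 : flag2 ? 6 : 7;
}

/* The same grouping, spelled out with parentheses. */
int grouped_right(int flag1, int flag2) {
    return flag1 ? 4 : (flag2 ? 6 : 7);
}

/* The grouping we must NOT produce. */
int grouped_left(int flag1, int flag2) {
    return (flag1 ? 4 : flag2) ? 6 : 7;
}
```

<p>With <code class="language-plaintext highlighter-rouge">flag1 = 1</code> and <code class="language-plaintext highlighter-rouge">flag2 = 0</code>, the first two functions return 4, but <code class="language-plaintext highlighter-rouge">grouped_left</code> returns 6, because the inner conditional evaluates to 4 and 4 counts as true.</p>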

<p>Expression <strong>e2</strong> in our ternary conditional can take any form; safely fenced in by <code class="language-plaintext highlighter-rouge">?</code> and <code class="language-plaintext highlighter-rouge">:</code>, it can’t introduce any grammatical ambiguity. You can think of implicit parentheses wrapping everything between <code class="language-plaintext highlighter-rouge">?</code> and <code class="language-plaintext highlighter-rouge">:</code>.</p>

<p>Expression <strong>e3</strong> can be another ternary conditional, as in the example <code class="language-plaintext highlighter-rouge">a &gt; b ? 4 : flag ? 6 : 7</code>. But it <em>can’t</em> be an assignment expression – why not? Let’s look at the following example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">flag</span> <span class="o">?</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">:</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">0</span>
</code></pre></div></div>

<p>If we try to compile this with gcc, we’ll get something like the following error message:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>error: expression is not assignable
    flag ? a = 1 : a = 0;
    ~~~~~~~~~~~~~~~~ ^
</code></pre></div></div>

<p>In other words, gcc tried to parse the expression like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">flag</span> <span class="o">?</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">:</span> <span class="n">a</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span>
</code></pre></div></div>

<p>This obviously doesn’t work because the expression on the left isn’t a variable<sup id="anchor3"><a href="#fn3">3</a></sup>. You might wonder why we can’t use the following grammar rule:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;conditional-exp&gt; ::= &lt;logical-or-exp&gt; "?" &lt;exp&gt; ":" &lt;exp&gt;
</code></pre></div></div>

<p>Then gcc could just parse it like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">flag</span> <span class="o">?</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">:</span> <span class="p">(</span><span class="n">a</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>That grammar rule would work fine; in fact, that’s how conditional expressions are defined in C++<sup id="anchor4"><a href="#fn4">4</a></sup>. I don’t know why it’s different in C, but if <em>you</em> know I’d like to hear from you.</p>

<p>We also need a way to specify expressions that aren’t conditionals, so we’ll make the ‘conditional’ part of this grammar rule optional<sup id="anchor5"><a href="#fn5">5</a></sup>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;conditional-exp&gt; ::= &lt;logical-or-exp&gt; [ "?" &lt;exp&gt; ":" &lt;conditional-exp&gt; ]
</code></pre></div></div>

<p>Anyway, we now know the correct grammar. Here are all the new and updated grammar rules concerning expressions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;exp&gt; ::= &lt;id&gt; "=" &lt;exp&gt; | &lt;conditional-exp&gt;
&lt;conditional-exp&gt; ::= &lt;logical-or-exp&gt; [ "?" &lt;exp&gt; ":" &lt;conditional-exp&gt; ]
&lt;logical-or-exp&gt; ::= &lt;logical-and-exp&gt; { "||" &lt;logical-and-exp&gt; } 
...
</code></pre></div></div>
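
<p>Here’s a rough sketch of how the first two rules could translate into a recursive-descent parser in C. It’s deliberately tiny and makes assumptions our real compiler doesn’t: every token is a single character with no whitespace, and a one-character <code class="language-plaintext highlighter-rouge">&lt;primary&gt;</code> (a letter, a digit, or a parenthesized expression) stands in for <code class="language-plaintext highlighter-rouge">&lt;logical-or-exp&gt;</code>. The <code class="language-plaintext highlighter-rouge">Exp</code> struct and all function names are hypothetical.</p>

```c
#include <ctype.h>
#include <stddef.h>

/* Sketch only: tokens are single characters with no whitespace, and a
   one-character <primary> stands in for <logical-or-exp>. */
typedef struct Exp {
    char kind;             /* '=' assignment, '?' conditional, 'v' leaf */
    char name;             /* variable or digit, for '=' and 'v' nodes */
    struct Exp *e1, *e2, *e3;
} Exp;

static Exp pool[64];       /* fixed-size node pool; fine for tiny inputs */
static size_t pool_used = 0;

static Exp *new_exp(char kind, char name, Exp *e1, Exp *e2, Exp *e3) {
    if (pool_used == sizeof pool / sizeof pool[0])
        return NULL;
    Exp *e = &pool[pool_used++];
    e->kind = kind;
    e->name = name;
    e->e1 = e1;
    e->e2 = e2;
    e->e3 = e3;
    return e;
}

Exp *parse_exp(const char **s);

/* <primary> ::= letter | digit | "(" <exp> ")" */
static Exp *parse_primary(const char **s) {
    if (**s == '(') {
        (*s)++;
        Exp *inner = parse_exp(s);
        if (inner == NULL || **s != ')')
            return NULL;
        (*s)++;
        return inner;
    }
    if (!isalnum((unsigned char)**s))
        return NULL;
    return new_exp('v', *(*s)++, NULL, NULL, NULL);
}

/* <conditional-exp> ::= <primary> [ "?" <exp> ":" <conditional-exp> ] */
static Exp *parse_conditional(const char **s) {
    Exp *e1 = parse_primary(s);
    if (e1 == NULL || **s != '?')
        return e1;                     /* not a conditional after all */
    (*s)++;
    Exp *e2 = parse_exp(s);            /* anything may sit between ? and : */
    if (e2 == NULL || **s != ':')
        return NULL;
    (*s)++;
    Exp *e3 = parse_conditional(s);    /* recursing here groups to the right */
    if (e3 == NULL)
        return NULL;
    return new_exp('?', 0, e1, e2, e3);
}

/* <exp> ::= <id> "=" <exp> | <conditional-exp> */
Exp *parse_exp(const char **s) {
    if (isalpha((unsigned char)(*s)[0]) && (*s)[1] == '=') {
        char name = (*s)[0];
        *s += 2;
        Exp *rhs = parse_exp(s);
        return rhs ? new_exp('=', name, rhs, NULL, NULL) : NULL;
    }
    return parse_conditional(s);
}
```

<p>Given the input <code class="language-plaintext highlighter-rouge">f?a=1:a=0</code>, this parser builds a conditional whose third operand is just <code class="language-plaintext highlighter-rouge">a</code> and then stops in front of the second <code class="language-plaintext highlighter-rouge">=</code>, because <strong>e3</strong> is parsed with <code class="language-plaintext highlighter-rouge">parse_conditional</code> rather than <code class="language-plaintext highlighter-rouge">parse_exp</code>. A full parser would then report an error when it finds <code class="language-plaintext highlighter-rouge">=</code> where it expected a semicolon.</p>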

<h4 id="-task-2">☑ Task:</h4>
<p>Update the parsing pass to handle ternary conditional expressions. At this point, it should successfully parse all valid stage 6 examples, and throw an error for all invalid examples.</p>

<h3 id="put-it-all-together">Put It All Together</h3>
<p>For the sake of completeness, here’s our full AST definition and grammar, with new and changed parts bolded:</p>

<p>AST:</p>
<pre>
program = Program(function_declaration)

<b>function_declaration = Function(string, block_item list) //string is the function name

block_item = Statement(statement) | Declaration(declaration)

declaration = Declare(string, exp option) //string is variable name 
                                          //exp is optional initializer</b>

statement = Return(exp) 
          | Exp(exp)
<b>          | Conditional(exp, statement, statement option) //exp is controlling condition
                                                          //first statement is 'if' block
                                                          //second statement is optional 'else' block
</b>                                                          
exp = Assign(string, exp)
    | Var(string) //string is variable name
    | BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)
<b>    | CondExp(exp, exp, exp) //the three expressions are the condition, 'if' expression and 'else' expression, respectively</b>
</pre>

<p>Grammar:</p>
<pre>
&lt;program&gt; ::= &lt;function&gt;
<b>&lt;function&gt; ::= "int" &lt;id&gt; "(" ")" "{" { &lt;block-item&gt; } "}"
&lt;block-item&gt; ::= &lt;statement&gt; | &lt;declaration&gt;
&lt;declaration&gt; ::= "int" &lt;id&gt; [ "=" &lt;exp&gt; ] ";"</b>
&lt;statement&gt; ::= "return" &lt;exp&gt; ";"
              | &lt;exp&gt; ";"
<b>              | "if" "(" &lt;exp&gt; ")" &lt;statement&gt; [ "else" &lt;statement&gt; ]</b>
<br />
<b>&lt;exp&gt; ::= &lt;id&gt; "=" &lt;exp&gt; | &lt;conditional-exp&gt;
&lt;conditional-exp&gt; ::= &lt;logical-or-exp&gt; [ "?" &lt;exp&gt; ":" &lt;conditional-exp&gt; ]</b>
&lt;logical-or-exp&gt; ::= &lt;logical-and-exp&gt; { "||" &lt;logical-and-exp&gt; }
&lt;logical-and-exp&gt; ::= &lt;equality-exp&gt; { "&amp;&amp;" &lt;equality-exp&gt; }
&lt;equality-exp&gt; ::= &lt;relational-exp&gt; { ("!=" | "==") &lt;relational-exp&gt; }
&lt;relational-exp&gt; ::= &lt;additive-exp&gt; { ("&lt;" | "&gt;" | "&lt;=" | "&gt;=") &lt;additive-exp&gt; }
&lt;additive-exp&gt; ::= &lt;term&gt; { ("+" | "-") &lt;term&gt; }
&lt;term&gt; ::= &lt;factor&gt; { ("*" | "/") &lt;factor&gt; }
&lt;factor&gt; ::= "(" &lt;exp&gt; ")" | &lt;unary_op&gt; &lt;factor&gt; | &lt;int&gt; | &lt;id&gt;
&lt;unary_op&gt; ::= "!" | "~" | "-"
</pre>

<h2 id="code-generation">Code Generation</h2>

<p>To generate the assembly for <code class="language-plaintext highlighter-rouge">if</code> statements and conditional expressions, we’re going to need conditional and unconditional jumps, which we introduced in <a href="https://norasandler.com/2017/12/28/Write-a-Compiler-4.html">part 4</a>. We can generate assembly for the conditional expression <code class="language-plaintext highlighter-rouge">e1 ? e2 : e3</code> as follows:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="err">&lt;</span><span class="nf">CODE</span> <span class="nv">FOR</span> <span class="nv">e1</span> <span class="nv">GOES</span> <span class="nv">HERE</span><span class="o">&gt;</span>
    <span class="nf">cmpl</span> <span class="kc">$</span><span class="mi">0</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>
    <span class="nf">je</span>   <span class="nv">_e3</span>                  <span class="c1">; if e1 == 0, e1 is false so execute e3</span>
    <span class="err">&lt;</span><span class="nf">CODE</span> <span class="nv">FOR</span> <span class="nv">e2</span> <span class="nv">GOES</span> <span class="nv">HERE</span><span class="o">&gt;</span>  <span class="c1">; we're still here so e1 must be true. execute e2.</span>
    <span class="nf">jmp</span>  <span class="nv">_post_conditional</span>    <span class="c1">; jump over e3</span>
<span class="nl">_e3:</span>
    <span class="err">&lt;</span><span class="nf">CODE</span> <span class="nv">FOR</span> <span class="nv">e3</span> <span class="nv">GOES</span> <span class="nv">HERE</span><span class="o">&gt;</span>  <span class="c1">; we jumped here because e1 was false. execute e3.</span>
<span class="nl">_post_conditional:</span>            <span class="c1">; we need this label to jump over e3</span>
</code></pre></div></div>

<p>The assembly for <code class="language-plaintext highlighter-rouge">if</code> statements is quite similar, although it’s slightly complicated by the optional <code class="language-plaintext highlighter-rouge">else</code> clause. I’ll let you figure it out yourself.</p>

<p>As in the assembly for <code class="language-plaintext highlighter-rouge">&amp;&amp;</code> and <code class="language-plaintext highlighter-rouge">||</code> we saw earlier, labels have to be unique.</p>
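
<p>One common way to keep labels unique is to append a counter to each label name. The C sketch below assumes the assembly for <strong>e1</strong>, <strong>e2</strong> and <strong>e3</strong> has already been generated as text; <code class="language-plaintext highlighter-rouge">emit_conditional</code> and <code class="language-plaintext highlighter-rouge">label_counter</code> are hypothetical names, and your own code generator may be structured quite differently.</p>

```c
#include <stdio.h>

/* Hypothetical helper: formats the jump-and-label skeleton for
   e1 ? e2 : e3 into buf. The code for e1, e2 and e3 is passed in as
   already-generated assembly text. A global counter gives every
   conditional its own pair of labels. Returns what snprintf returns. */
static int label_counter = 0;

int emit_conditional(char *buf, size_t size,
                     const char *e1, const char *e2, const char *e3) {
    int id = label_counter++;
    return snprintf(buf, size,
                    "%s"
                    "    cmpl $0, %%eax\n"
                    "    je   _e3_%d\n"
                    "%s"
                    "    jmp  _post_conditional_%d\n"
                    "_e3_%d:\n"
                    "%s"
                    "_post_conditional_%d:\n",
                    e1, id, e2, id, id, e3, id);
}
```

<p>The first call produces the labels <code class="language-plaintext highlighter-rouge">_e3_0</code> and <code class="language-plaintext highlighter-rouge">_post_conditional_0</code>, the second <code class="language-plaintext highlighter-rouge">_e3_1</code> and <code class="language-plaintext highlighter-rouge">_post_conditional_1</code>, and so on, so nested or repeated conditionals never collide.</p>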

<h4 id="-task-3">☑ Task:</h4>
<p>Update the code-generation pass to correctly handle ternary conditional expressions and <code class="language-plaintext highlighter-rouge">if</code> statements. It should succeed on all valid examples and fail on all invalid examples for stages 1-6.</p>

<h2 id="up-next">Up Next</h2>
<p>In the <a href="/2018/03/14/Write-a-Compiler-7.html">next post</a>, we’ll add compound statements, so brace yourself (pun intended) for an exciting discussion of lexical scope! I <strong>hope</strong> that will be two weeks from now and not two months. See you then!</p>

<p><em>If you have any questions, corrections, or other feedback, you can <a href="mailto:nora@norasandler.com">email me</a> or <a href="https://github.com/nlsandler/write_a_c_compiler/issues">open an issue</a>.</em></p>

<div class="footnote">
  <p><sup id="fn1">1</sup>
But the <code class="language-plaintext highlighter-rouge">if</code> construct in many functional languages <em>is</em> an expression, and works just like C’s ternary conditionals. This is valid OCaml, for instance:</p>
  <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">a</span> <span class="o">=</span> <span class="k">if</span> <span class="n">b</span> <span class="k">then</span> <span class="mi">1</span> <span class="k">else</span> <span class="mi">2</span>
</code></pre></div>  </div>
  <p><a href="#anchor1">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn2">2</sup>
The terms “block” and “compound statement” aren’t 100% synonymous; compound statements are a subset of blocks. But the terms are similar enough that it’s fine to treat them as synonyms for now. <a href="#anchor2">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn3">3</sup>
Actually, any “modifiable lvalue” is allowed on the left side of an assignment expression, not just variables. A dereferenced pointer (<code class="language-plaintext highlighter-rouge">*x</code>), an array element (<code class="language-plaintext highlighter-rouge">x[i]</code>), and a struct member (<code class="language-plaintext highlighter-rouge">x.y</code>) can all be modifiable lvalues. Conditional expressions aren’t, though. <a href="#anchor3">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn4">4</sup>
See <a href="https://stackoverflow.com/a/26448707">this Stack Overflow answer</a> and the <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf">C++11 standard</a>. <a href="#anchor4">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn5">5</sup>
Thanks to Stephen Bastians for <a href="https://github.com/nlsandler/write_a_c_compiler/issues/4">pointing out a mistake in this grammar rule</a> in an earlier version of this post. <a href="#anchor5">↩</a></p>
</div>]]></content><author><name>Nora Sandler</name></author><category term="compiler-tutorial" /><summary type="html"><![CDATA[This is the sixth post in a series. Read part 1 here.]]></summary></entry><entry><title type="html">Writing a C Compiler, Part 5</title><link href="https://norasandler.com/2018/01/08/Write-a-Compiler-5.html" rel="alternate" type="text/html" title="Writing a C Compiler, Part 5" /><published>2018-01-08T20:00:00+00:00</published><updated>2018-01-08T20:00:00+00:00</updated><id>https://norasandler.com/2018/01/08/Write-a-Compiler-5</id><content type="html" xml:base="https://norasandler.com/2018/01/08/Write-a-Compiler-5.html"><![CDATA[<p><em>This is the fifth post in a series. Read part 1 <a href="/2017/11/29/Write-a-Compiler.html">here</a>.</em></p>

<p>We’ve spent the last two weeks adding binary primitives, and I don’t know about you, but I’m starting to get kind of bored with it.
This week, we’ll do something completely different and add support for local variables.
We’ll finally be able to compile functions longer than one line! Hooray!</p>

<p>As always, accompanying tests are <a href="https://github.com/nlsandler/write_a_c_compiler">here</a>.</p>

<h1 id="week-5-local-variables">Week 5: Local Variables</h1>

<p>We’re adding variables this week! Programming without variables is hard, so this is very exciting. 
To keep things simple, we’re going to support variables in a very restricted way for now:</p>

<ul>
  <li>We only support local variables, which are declared in <code class="language-plaintext highlighter-rouge">main</code>. No global variables.</li>
  <li>We only support variables of type <code class="language-plaintext highlighter-rouge">int</code>.</li>
  <li>We don’t support type modifiers like <code class="language-plaintext highlighter-rouge">short</code>, <code class="language-plaintext highlighter-rouge">long</code> or <code class="language-plaintext highlighter-rouge">unsigned</code>, storage-class specifiers like <code class="language-plaintext highlighter-rouge">static</code>,
or type qualifiers like <code class="language-plaintext highlighter-rouge">const</code>. Just plain old <code class="language-plaintext highlighter-rouge">int</code>.</li>
  <li>You can only declare one variable per declaration. We won’t support statements like <code class="language-plaintext highlighter-rouge">int a, b;</code>.</li>
</ul>

<p>There are three things you can do with a variable:</p>

<ul>
  <li>Declare it (<code class="language-plaintext highlighter-rouge">int a;</code>)
    <ul>
      <li>When you declare it, you can also optionally initialize it (<code class="language-plaintext highlighter-rouge">int a = 2;</code>)</li>
    </ul>
  </li>
  <li>Assign to it (<code class="language-plaintext highlighter-rouge">a = 3;</code>)</li>
  <li>Reference it in an expression (<code class="language-plaintext highlighter-rouge">a + 2</code>)</li>
</ul>

<p>We’ll need to add support for these three things. We’ll also add support for functions containing more than one statement.</p>

<h2 id="lexing">Lexing</h2>

<p>The only new token this week is the assignment operator, <code class="language-plaintext highlighter-rouge">=</code>. Here’s our list of tokens, with the newest addition in bold at the bottom:</p>

<ul>
  <li>Open brace <code class="language-plaintext highlighter-rouge">{</code></li>
  <li>Close brace <code class="language-plaintext highlighter-rouge">}</code></li>
  <li>Open parenthesis <code class="language-plaintext highlighter-rouge">(</code></li>
  <li>Close parenthesis <code class="language-plaintext highlighter-rouge">)</code></li>
  <li>Semicolon <code class="language-plaintext highlighter-rouge">;</code></li>
  <li>Int keyword <code class="language-plaintext highlighter-rouge">int</code></li>
  <li>Return keyword <code class="language-plaintext highlighter-rouge">return</code></li>
  <li>Identifier <code class="language-plaintext highlighter-rouge">[a-zA-Z]\w*</code></li>
  <li>Integer literal <code class="language-plaintext highlighter-rouge">[0-9]+</code></li>
  <li>Minus <code class="language-plaintext highlighter-rouge">-</code></li>
  <li>Bitwise complement <code class="language-plaintext highlighter-rouge">~</code></li>
  <li>Logical negation <code class="language-plaintext highlighter-rouge">!</code></li>
  <li>Addition <code class="language-plaintext highlighter-rouge">+</code></li>
  <li>Multiplication <code class="language-plaintext highlighter-rouge">*</code></li>
  <li>Division <code class="language-plaintext highlighter-rouge">/</code></li>
  <li>AND <code class="language-plaintext highlighter-rouge">&amp;&amp;</code></li>
  <li>OR <code class="language-plaintext highlighter-rouge">||</code></li>
  <li>Equal <code class="language-plaintext highlighter-rouge">==</code></li>
  <li>Not Equal <code class="language-plaintext highlighter-rouge">!=</code></li>
  <li>Less than <code class="language-plaintext highlighter-rouge">&lt;</code></li>
  <li>Less than or equal <code class="language-plaintext highlighter-rouge">&lt;=</code></li>
  <li>Greater than <code class="language-plaintext highlighter-rouge">&gt;</code></li>
  <li>Greater than or equal <code class="language-plaintext highlighter-rouge">&gt;=</code></li>
  <li><strong>Assignment <code class="language-plaintext highlighter-rouge">=</code></strong></li>
</ul>

<h4 id="-task">☑ Task:</h4>
<p>Update the <em>lex</em> function to handle the <code class="language-plaintext highlighter-rouge">=</code> token. It should work for all stage 1-5 examples in the test suite, including the invalid ones.</p>
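<p>Here’s a minimal sketch of one way to do this in Python, assuming a regex-based lexer. The names <code class="language-plaintext highlighter-rouge">TOKEN_PATTERN</code> and <code class="language-plaintext highlighter-rouge">lex</code> are placeholders, and the pattern only covers part of the token list above (it leaves out the relational operators, AND, and OR, which would follow the same pattern). The one thing to watch out for is ordering: <code class="language-plaintext highlighter-rouge">==</code> has to be tried before <code class="language-plaintext highlighter-rouge">=</code>, or an equality test would lex as two assignment tokens.</p>

```python
import re

# Assumption: tokens are plain strings, and only a subset of the token list is covered.
# Two-character operators come first so "==" is never split into two "=" tokens.
TOKEN_PATTERN = re.compile(r"\s*(==|!=|[a-zA-Z]\w*|[0-9]+|[{}();=!~+*/-])")


def lex(source):
    """Return the list of token strings in source, or raise ValueError."""
    tokens = []
    pos = 0
    source = source.rstrip()  # trailing whitespace would otherwise look like a bad token
    while pos != len(source):
        match = TOKEN_PATTERN.match(source, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens
```

<p>This sketch doesn’t distinguish keywords from identifiers; a real lexer would probably tag each token with its kind rather than returning bare strings.</p>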

<h2 id="parsing">Parsing</h2>

<p>We need to make a lot of changes to our AST this week. Let’s look at a sample program we’d like to handle:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="n">a</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In this program, <code class="language-plaintext highlighter-rouge">main</code> contains three statements:</p>

<ol>
  <li>A variable declaration (<code class="language-plaintext highlighter-rouge">int a = 1;</code>)</li>
  <li>A variable assignment (<code class="language-plaintext highlighter-rouge">a = a + 1;</code>)</li>
  <li>A return statement (<code class="language-plaintext highlighter-rouge">return a;</code>)</li>
</ol>

<p>We need to update the definition of <code class="language-plaintext highlighter-rouge">function_declaration</code> in the AST so a function can contain a list of statements, not just a single statement:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function_declaration = Function(string, statement list) //string is function name
</code></pre></div></div>

<p>Right now, the only statements we’ve defined are <code class="language-plaintext highlighter-rouge">return</code> statements. That’s not right either. Let’s add some more:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>statement = Return(exp) 
          | Declare(string, exp option) //string is variable name
                                        //exp is optional initializer
          | Exp(exp)
</code></pre></div></div>

<p>We’ve added <code class="language-plaintext highlighter-rouge">Declare</code> for variable declarations. We can use an option type (<code class="language-plaintext highlighter-rouge">Maybe</code> in Haskell) to represent that we may or may not have an initializer.</p>

<p>The AST for <code class="language-plaintext highlighter-rouge">int a;</code> might look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>decl = Declare("a", None) //None because we don't initialize it
</code></pre></div></div>

<p>And the AST for <code class="language-plaintext highlighter-rouge">int a = 3;</code> might look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>init_exp = Constant(3)
decl = Declare("a", Some(init_exp))
</code></pre></div></div>

<p>Note that we don’t store the variable’s type anywhere in our AST; we don’t need to, because it can only have type <code class="language-plaintext highlighter-rouge">int</code>. We’ll need to start tracking type information once we have multiple types.</p>

<p>We’ve also added a standalone <code class="language-plaintext highlighter-rouge">Exp</code> statement, which means we can now write programs like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="mi">2</span> <span class="o">+</span> <span class="mi">2</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is valid C; if you compile it with gcc, it will issue a warning but it won’t fail.</p>

<p>However, <code class="language-plaintext highlighter-rouge">2+2;</code> isn’t a very useful statement. The real reason to add an <code class="language-plaintext highlighter-rouge">Exp</code> statement is so we can write statements like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
</code></pre></div></div>

<p>Variable assignment is just an expression! That’s why this statement is valid:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">b</span> <span class="o">=</span> <span class="mi">2</span><span class="p">);</span>
</code></pre></div></div>

<p>In the code snippet above, the expression <code class="language-plaintext highlighter-rouge">b = 2</code> has the value <code class="language-plaintext highlighter-rouge">2</code>, and the side effect of updating <code class="language-plaintext highlighter-rouge">b</code> to have that value.
This would be evaluated as:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">b</span> <span class="o">=</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">a</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="mi">2</span> <span class="c1">//also b is 2 now</span>
<span class="n">a</span> <span class="o">=</span> <span class="mi">4</span>
</code></pre></div></div>

<p>Now we need to update <code class="language-plaintext highlighter-rouge">exp</code> in our AST definition to handle assignment operators. My first thought was to just add <code class="language-plaintext highlighter-rouge">=</code> as another binary operator – after all, <code class="language-plaintext highlighter-rouge">a = b</code> <em>looks</em> kind of like <code class="language-plaintext highlighter-rouge">a + b</code>. But that’s totally wrong: the two operands of a binary operator can be arbitrary expressions, but the left side of an assignment operator can’t. A statement like <code class="language-plaintext highlighter-rouge">2 = 2</code> doesn’t make any sense, because you can’t assign a new value to <code class="language-plaintext highlighter-rouge">2</code>.</p>

<p>Instead, we’ll just define assignment as a new type of expression:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exp = Assign(string, exp) //string is variable, exp is value to assign
    | BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)
</code></pre></div></div>

<p>Now we can write the AST for the statement <code class="language-plaintext highlighter-rouge">a = 2;</code> like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>assign_exp = Assign("a", Constant(2))
assign_statement = Exp(assign_exp)
</code></pre></div></div>

<p>Now we can define variables and update their values, but that’s not super helpful unless we can actually reference them.
Let’s add variable reference as another type of expression:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exp = Assign(string, exp)
    | Var(string) //string is variable name
    | BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)
</code></pre></div></div>

<p>Now we can write the AST for <code class="language-plaintext highlighter-rouge">return a;</code> like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>return_exp = Var("a")
return_statement = Return(return_exp)
</code></pre></div></div>

<p>If we put it all together, here’s our new AST, with changes bolded:</p>

<pre>
program = Program(function_declaration)
<b>function_declaration = Function(string, statement list) //string is the function name</b>

statement = Return(exp) 
<b>          | Declare(string, exp option) //string is variable name
                                        //exp is optional initializer
          | Exp(exp) </b>

exp = <b>Assign(string, exp) </b>
<b>    | Var(string) //string is variable name </b>
    | BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)
</pre>
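<p>If your implementation language doesn’t have algebraic data types, the AST above might translate into something like the following Python sketch. The field names (<code class="language-plaintext highlighter-rouge">value</code>, <code class="language-plaintext highlighter-rouge">init</code>, <code class="language-plaintext highlighter-rouge">body</code>, and so on) are my own assumptions, and the <code class="language-plaintext highlighter-rouge">Exp</code> statement is renamed <code class="language-plaintext highlighter-rouge">ExpStatement</code> so it doesn’t clash with the expression type. <code class="language-plaintext highlighter-rouge">Optional</code> plays the role of the option type.</p>

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Constant:
    value: int


@dataclass(frozen=True)
class Var:
    name: str  # variable name


@dataclass(frozen=True)
class Assign:
    name: str  # variable being assigned to
    value: "Exp"  # value to assign


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Exp"
    right: "Exp"


@dataclass(frozen=True)
class UnOp:
    op: str
    operand: "Exp"


Exp = Union[Constant, Var, Assign, BinOp, UnOp]


@dataclass(frozen=True)
class Return:
    value: Exp


@dataclass(frozen=True)
class Declare:
    name: str  # variable name
    init: Optional[Exp]  # None when there is no initializer


@dataclass(frozen=True)
class ExpStatement:  # the Exp(exp) statement from the AST definition
    value: Exp


Statement = Union[Return, Declare, ExpStatement]


@dataclass(frozen=True)
class Function:
    name: str
    body: List[Statement]


@dataclass(frozen=True)
class Program:
    function: Function


# The sample program from earlier: int a = 1; a = a + 1; return a;
sample = Program(Function("main", [
    Declare("a", Constant(1)),
    ExpStatement(Assign("a", BinOp("+", Var("a"), Constant(1)))),
    Return(Var("a")),
]))
```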

<p>We also need to update our grammar. First, we need to update <code class="language-plaintext highlighter-rouge">&lt;function&gt;</code> to allow multiple statements.</p>

<p>Old definition:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;function&gt; ::= "int" &lt;id&gt; "(" ")" "{" &lt;statement&gt; "}"
</code></pre></div></div>

<p>New definition:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;function&gt; ::= "int" &lt;id&gt; "(" ")" "{" { &lt;statement&gt; } "}"
</code></pre></div></div>

<p>Thanks to the interspersed <code class="language-plaintext highlighter-rouge">{</code>/<code class="language-plaintext highlighter-rouge">}</code>, indicating repetition, and <code class="language-plaintext highlighter-rouge">"{"</code>/<code class="language-plaintext highlighter-rouge">"}"</code>, indicating literal curly braces, this is almost completely unreadable. But it just means a function can have more than one statement now.</p>

<p>We need to handle multiple types of statement. We already have return statements:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"return" &lt;exp&gt; ";"
</code></pre></div></div>

<p>And standalone expressions are super easy:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;exp&gt; ";"
</code></pre></div></div>

<p>A variable declaration needs a type specifier (<code class="language-plaintext highlighter-rouge">int</code>) followed by a name, optionally followed by an initializer. We use <code class="language-plaintext highlighter-rouge">[]</code> here to indicate something is optional:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"int" &lt;id&gt; [ "=" &lt;exp&gt; ] ";"
</code></pre></div></div>

<p>Let’s put it all together to get our new definition of <code class="language-plaintext highlighter-rouge">&lt;statement&gt;</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;statement&gt; ::= "return" &lt;exp&gt; ";"
              | &lt;exp&gt; ";"
              | "int" &lt;id&gt; [ "=" &lt;exp&gt; ] ";"
</code></pre></div></div>

<p>Finally, we need to update <code class="language-plaintext highlighter-rouge">&lt;exp&gt;</code>. Assignment is our lowest-precedence operator, so it becomes our top-level <code class="language-plaintext highlighter-rouge">&lt;exp&gt;</code> rule. Also note that, unlike most of our other operators, it’s right-associative (<code class="language-plaintext highlighter-rouge">a = b = 2</code> means <code class="language-plaintext highlighter-rouge">a = (b = 2)</code>), which makes it a bit simpler to express.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;exp&gt; ::= &lt;id&gt; "=" &lt;exp&gt; | &lt;logical-or-exp&gt;
&lt;logical-or-exp&gt; ::= &lt;logical-and-exp&gt; { "||" &lt;logical-and-exp&gt; }
</code></pre></div></div>
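<p>To make the new <code class="language-plaintext highlighter-rouge">&lt;exp&gt;</code> rule concrete, here’s a rough Python sketch of how a recursive-descent parser might handle it. It assumes tokens are plain strings in a list and the AST is built from tuples; <code class="language-plaintext highlighter-rouge">parse_additive_stand_in</code> is a hypothetical, heavily simplified stand-in for <code class="language-plaintext highlighter-rouge">&lt;logical-or-exp&gt;</code> that only understands <code class="language-plaintext highlighter-rouge">+</code>. The point is the one-token lookahead: an identifier followed by <code class="language-plaintext highlighter-rouge">=</code> means assignment, and anything else falls through to the next rule.</p>

```python
KEYWORDS = {"int", "return"}


def is_identifier(token):
    return token[0].isalpha() and token not in KEYWORDS


def parse_exp(tokens):
    """<exp> ::= <id> "=" <exp> | <logical-or-exp>; consumes tokens from the front."""
    if tokens and is_identifier(tokens[0]) and tokens[1:2] == ["="]:
        name = tokens.pop(0)
        tokens.pop(0)  # discard "="
        # Recursing on the right-hand side is what makes "=" right-associative.
        return ("Assign", name, parse_exp(tokens))
    return parse_additive_stand_in(tokens)


def parse_additive_stand_in(tokens):
    """Hypothetical stand-in for <logical-or-exp>: operands joined by "+" only."""
    left = parse_operand(tokens)
    while tokens and tokens[0] == "+":
        tokens.pop(0)
        left = ("BinOp", "+", left, parse_operand(tokens))
    return left


def parse_operand(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token.isdigit():
        return ("Constant", int(token))
    if is_identifier(token):
        return ("Var", token)
    raise SyntaxError("unexpected token " + token)
```

<p>With this approach, <code class="language-plaintext highlighter-rouge">2 = 2</code> parses as the constant <code class="language-plaintext highlighter-rouge">2</code> and leaves <code class="language-plaintext highlighter-rouge">= 2</code> unconsumed, so the statement parser would reject it when it doesn’t find a semicolon.</p>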

<p>The grammar for all our binary operations (<code class="language-plaintext highlighter-rouge">&lt;logical-and-exp&gt;</code> on down to <code class="language-plaintext highlighter-rouge">&lt;term&gt;</code>) is unchanged. 
We just need to change <code class="language-plaintext highlighter-rouge">&lt;factor&gt;</code> so we can refer to variables as well as constants:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;factor&gt; ::= "(" &lt;exp&gt; ")" | &lt;unary_op&gt; &lt;factor&gt; | &lt;int&gt; | &lt;id&gt;
</code></pre></div></div>

<p>When you put it all together, here’s our new grammar, with changes bolded:</p>

<pre>
&lt;program&gt; ::= &lt;function&gt;
<b>&lt;function&gt; ::= "int" &lt;id&gt; "(" ")" "{" { &lt;statement&gt; } "}"</b>
&lt;statement&gt; ::= "return" &lt;exp&gt; ";"
<b>              | &lt;exp&gt; ";"
              | "int" &lt;id&gt; [ "=" &lt;exp&gt; ] ";" </b>
<b>&lt;exp&gt; ::= &lt;id&gt; "=" &lt;exp&gt; | &lt;logical-or-exp&gt;
&lt;logical-or-exp&gt; ::= &lt;logical-and-exp&gt; { "||" &lt;logical-and-exp&gt; } </b>
&lt;logical-and-exp&gt; ::= &lt;equality-exp&gt; { "&amp;&amp;" &lt;equality-exp&gt; }
&lt;equality-exp&gt; ::= &lt;relational-exp&gt; { ("!=" | "==") &lt;relational-exp&gt; }
&lt;relational-exp&gt; ::= &lt;additive-exp&gt; { ("&lt;" | "&gt;" | "&lt;=" | "&gt;=") &lt;additive-exp&gt; }
&lt;additive-exp&gt; ::= &lt;term&gt; { ("+" | "-") &lt;term&gt; }
&lt;term&gt; ::= &lt;factor&gt; { ("*" | "/") &lt;factor&gt; }
<b>&lt;factor&gt; ::= "(" &lt;exp&gt; ")" | &lt;unary_op&gt; &lt;factor&gt; | &lt;int&gt; | &lt;id&gt;</b>
&lt;unary_op&gt; ::= "!" | "~" | "-"
</pre>

<h4 id="-task-1">☑ Task:</h4>
<p>Update your parsing code to handle variable declarations, assignments, and references. It should successfully parse all valid stage 1-5 examples in the test suite. The invalid examples are a little different this week. Some of them should fail during parsing; others can be parsed successfully but should cause errors during code generation (e.g. because they reference variables that haven’t been declared). I decided to deal with this in the laziest way possible; the names of the invalid examples that should fail during parsing all start with <code class="language-plaintext highlighter-rouge">syntax_err</code>.</p>

<h2 id="code-generation">Code Generation</h2>

<p>We need to save local variables somewhere, so we’ll save them on the stack.
We also need to remember exactly where on the stack each variable was saved, so we can refer to it later.
To track this information, we’ll create a map from variable names to locations.</p>

<p>But how are we supposed to know a variable’s location at compile time? Absolute memory addresses aren’t determined until runtime. We could store the variable’s offset from ESP, except that the value of ESP changes whenever we push something onto the stack.
The solution is to store the variable’s offset from a different register, EBP.
To understand why this will work, we need to know a little bit about stack frames.</p>

<h3 id="stack-frames">Stack Frames</h3>

<p>Whenever we call a function, we allocate a chunk of memory for it on top of the stack – this memory is called the <em>stack frame</em>. The stack frame holds function arguments, the address to jump to after the function returns, and of course local variables. We already know that ESP points to the top of stack, which is also the top of the current stack frame<sup id="anchor1"><a href="#fn1">1</a></sup>. The EBP (or base pointer) register points to the bottom of the current stack frame.
Without EBP, we wouldn’t know where one stack frame ends and the next begins, and we wouldn’t be able to find important values like a function’s return address.</p>

<p><img src="/assets/call_stack.svg" alt="Obligatory call stack diagram" /></p>

<div class="screen-reader-only">
  <p>Call stack diagram, from higher address on the bottom of the stack to lower address on top:
  I. Caller’s stack frame
    * Caller’s local variable y
    * Caller’s local variable x
    * return address
  II. Callee’s stack frame
    * Saved EBP (current EBP points here)
    * local variable a
    * local variable b (top of stack; current ESP points here)</p>
</div>

<p>When a function (let’s call it <code class="language-plaintext highlighter-rouge">f</code>) returns, its caller needs to be able to pick up where it left off. That means its stack frame, and the values in ESP and EBP, all need to be exactly the same as they were before <code class="language-plaintext highlighter-rouge">f</code> was called. The first thing <code class="language-plaintext highlighter-rouge">f</code> needs to do is set up a new stack frame for itself, using the following instructions:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">push</span> <span class="o">%</span><span class="nb">ebp</span>       <span class="c1">; save old value of EBP</span>
    <span class="nf">movl</span> <span class="o">%</span><span class="nb">esp</span><span class="p">,</span> <span class="o">%</span><span class="nb">ebp</span> <span class="c1">; current top of stack is bottom of new stack frame</span>
</code></pre></div></div>

<p>These instructions are called the function prologue. 
Immediately before <code class="language-plaintext highlighter-rouge">f</code> returns, it executes the function epilogue to remove this stack frame, 
leaving everything just as it was before the function prologue:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">movl</span> <span class="o">%</span><span class="nb">ebp</span><span class="p">,</span> <span class="o">%</span><span class="nb">esp</span> <span class="c1">; restore ESP; now it points to old EBP</span>
    <span class="nf">pop</span> <span class="o">%</span><span class="nb">ebp</span>        <span class="c1">; restore old EBP; now ESP is where it was before prologue</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Up to this point, we could get away with not having a function prologue or epilogue, but now we need to add them.
Adding them helps us in two ways:</p>

<ul>
  <li><strong>We can store variable locations as offsets from EBP</strong>. We know there’s nothing above EBP (because we set up an empty stack frame in the function prologue),
and we know that EBP won’t change until the function epilogue.</li>
  <li>We can safely push local variables onto the stack without changing the caller’s stack frame<sup id="anchor2"><a href="#fn2">2</a></sup>.</li>
</ul>

<p>You should generate the function prologue at the start of the function definition, right after the function’s label.
You should generate the function epilogue as part of the return statement, right before <code class="language-plaintext highlighter-rouge">ret</code>.</p>
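<p>As a rough illustration, here’s how that placement might look in a Python code generator that builds a list of assembly lines. The function names are made up for this sketch, and it assumes the caller has already generated the code that leaves the return value in EAX.</p>

```python
PROLOGUE = ["    push %ebp", "    movl %esp, %ebp"]
EPILOGUE = ["    movl %ebp, %esp", "    pop %ebp", "    ret"]


def generate_function_start(label):
    """The function's label comes first, then the prologue."""
    return [label + ":"] + PROLOGUE


def generate_return(exp_lines):
    """Assembly for `return exp;`: compute exp into EAX, then tear down the frame."""
    return list(exp_lines) + EPILOGUE
```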

<p>Besides our variable map, we need to keep track of a <em>stack index</em>, which tells us the offset of the next available spot on the stack, relative to EBP. The next available spot is always the four-byte stack slot right after ESP, at <code class="language-plaintext highlighter-rouge">ESP - 4</code>. Right after the function prologue, EBP and ESP are the same. That means the stack index will also be -4. Whenever we push a variable onto the stack, we’ll decrement the stack index by 4<sup id="anchor3"><a href="#fn3">3</a></sup>.</p>
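<p>To see how the stack index evolves, here’s a tiny Python sketch that works out the offsets for a sequence of declarations. <code class="language-plaintext highlighter-rouge">assign_offsets</code> is a hypothetical helper for illustration only; it assumes every variable takes exactly four bytes and ignores everything except the bookkeeping.</p>

```python
def assign_offsets(names):
    """Map each declared name to its offset from EBP, in declaration order."""
    var_map = {}
    stack_index = -4  # right after the prologue, the first free slot is at EBP - 4
    for name in names:
        var_map[name] = stack_index
        stack_index -= 4  # the next variable goes four bytes lower
    return var_map, stack_index
```

<p>Declaring <code class="language-plaintext highlighter-rouge">a</code>, <code class="language-plaintext highlighter-rouge">b</code>, and <code class="language-plaintext highlighter-rouge">c</code> in that order puts them at offsets -4, -8, and -12, and leaves the stack index at -16.</p>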

<p>Now let’s look at how we can handle declaring, assigning, and referring to variables.</p>

<h3 id="variable-declaration">Variable Declaration</h3>

<p>When you encounter a variable declaration, just save the variable onto the stack and add it to the variable map<sup id="anchor4"><a href="#fn4">4</a></sup>. Note that it’s illegal to declare a variable twice in the same local scope<sup id="anchor5"><a href="#fn5">5</a></sup>, as in the following code snippet:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">a</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">a</span><span class="p">;</span>
</code></pre></div></div>

<p>So your program should fail if the variable is already in the variable map.
Here’s how you might generate assembly for the statement <code class="language-plaintext highlighter-rouge">int a = expression</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  if var_map.contains("a"):
    fail() //shouldn't declare a var twice
  generate_exp(expression)      // generate assembly to calculate expression and move it to eax
  emit "    pushl %eax" // save initial value of "a" onto the stack
  var_map = var_map.put("a", stack_index) // record location of a in the variable map
  stack_index = stack_index - 4 // stack location of next address will be 4 bytes lower
</code></pre></div></div>

<p>A few points here:</p>

<ul>
  <li>If a variable isn’t initialized, you can just initialize it to 0. Or whatever you want, really.</li>
  <li>The variable map exists during code generation, not at runtime.</li>
  <li>You should <strong>definitely use an immutable data structure</strong> for your variable map. In the next post we’ll add <code class="language-plaintext highlighter-rouge">if</code> statements, and then we’ll have nested scopes; a variable declared inside an <code class="language-plaintext highlighter-rouge">if</code> block isn’t accessible outside it. If you have to worry about code from an inner scope messing with the variable map in an outer scope, you will not be a happy camper.</li>
</ul>
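<p>Here’s one way the points above might come together in Python. This is only a sketch: the function name and signature are assumptions, and since Python’s built-in <code class="language-plaintext highlighter-rouge">dict</code> is mutable, it approximates an immutable map by copying the dictionary and returning the new one instead of updating the caller’s map in place.</p>

```python
def generate_declaration(name, init_lines, var_map, stack_index):
    """Return (assembly lines, new var_map, new stack_index) for a declaration.

    init_lines is the already generated assembly that leaves the initializer
    in EAX, or None when the declaration has no initializer.
    """
    if name in var_map:
        raise NameError("variable declared twice: " + name)
    if init_lines is None:
        init_lines = ["    movl $0, %eax"]  # arbitrary choice for uninitialized variables
    lines = list(init_lines) + ["    pushl %eax"]
    # Build a new dict rather than mutating the caller's map.
    new_map = dict(var_map)
    new_map[name] = stack_index
    return lines, new_map, stack_index - 4
```

<p>Because the caller’s map is never modified, an inner scope can add its own variables to its copy without the outer scope ever seeing them.</p>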

<h3 id="variable-assignment">Variable Assignment</h3>

<p>We can look up a variable’s location in memory in our map; to assign it a new value, just move that value to the right memory location.
Here’s how to handle <code class="language-plaintext highlighter-rouge">a = expression</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  generate_exp(expression) // generate assembly to calculate expression and move it to eax 
  var_offset = var_map.find("a") //if "a" isn't in the map, fail b/c it hasn't been declared yet
  emit "    movl %eax, {}(%ebp)".format(var_offset) //using python-style string formatting here
</code></pre></div></div>

<p>Note that the value of <code class="language-plaintext highlighter-rouge">expression</code> is still in EAX, so this assignment expression has the correct value.</p>

<h3 id="variable-reference">Variable Reference</h3>

<p>To refer to a variable in an expression, just copy it from the stack to EAX:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  var_offset = var_map.find("a") //find location of variable "a" on the stack
                                 //should fail if it hasn't been declared yet
  emit "    movl {}(%ebp), %eax".format(var_offset) //retrieve value of variable
</code></pre></div></div>
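<p>Assignment and reference fit together naturally in a recursive code generator. The Python sketch below assumes a tuple-based AST and a plain dictionary for the variable map; <code class="language-plaintext highlighter-rouge">lookup</code> is a hypothetical helper, and only constants, variables, and assignments are handled.</p>

```python
def lookup(name, var_map):
    """Return the variable's offset from EBP, failing if it was never declared."""
    if name not in var_map:
        raise NameError("undeclared variable: " + name)
    return var_map[name]


def generate_exp(exp, var_map):
    """Return assembly lines that leave the value of exp in EAX."""
    kind = exp[0]
    if kind == "Constant":
        return ["    movl ${}, %eax".format(exp[1])]
    if kind == "Var":
        return ["    movl {}(%ebp), %eax".format(lookup(exp[1], var_map))]
    if kind == "Assign":
        # Compute the right-hand side first; its value stays in EAX after the store,
        # which is why an assignment can be used as a value itself.
        lines = generate_exp(exp[2], var_map)
        lines.append("    movl %eax, {}(%ebp)".format(lookup(exp[1], var_map)))
        return lines
    raise ValueError("unsupported expression: " + kind)
```

<p>For <code class="language-plaintext highlighter-rouge">a = b = 3</code>, this emits one <code class="language-plaintext highlighter-rouge">movl</code> to load 3 into EAX, then a store to <code class="language-plaintext highlighter-rouge">b</code>, then a store to <code class="language-plaintext highlighter-rouge">a</code>.</p>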

<h3 id="missing-return-statements">Missing Return Statements</h3>

<p>Now that we support multiple types of statements, we can successfully parse programs with no return statement at all:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>What’s the expected behavior here? According to section 5.1.2.2.3 of the <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf">C11 standard</a>:</p>

<blockquote>
  <p>If the return type of the <code class="language-plaintext highlighter-rouge">main</code> function is a type compatible with <code class="language-plaintext highlighter-rouge">int</code>, a return from the
initial call to the <code class="language-plaintext highlighter-rouge">main</code> function is equivalent to calling the <code class="language-plaintext highlighter-rouge">exit</code> function with the value
returned by the <code class="language-plaintext highlighter-rouge">main</code> function as its argument; reaching the <code class="language-plaintext highlighter-rouge">}</code> that terminates the
<code class="language-plaintext highlighter-rouge">main</code> function returns a value of 0.</p>
</blockquote>

<p>So, <code class="language-plaintext highlighter-rouge">main</code> needs to return 0 if it’s missing a return statement. Right now <code class="language-plaintext highlighter-rouge">main</code> is our only function, so that’s the only case we need to handle.</p>

<p>Eventually, we’ll need to deal with this problem in functions other than <code class="language-plaintext highlighter-rouge">main</code>. Here’s what section 6.9.1 of the standard says about missing return statements in general:</p>

<blockquote>
  <p>If the <code class="language-plaintext highlighter-rouge">}</code> that terminates a function is reached, and the value of the function call is used by the caller, the behavior is undefined.</p>
</blockquote>

<p>So this program has undefined behavior:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">foo</span><span class="p">()</span> <span class="p">{</span>
  <span class="mi">1</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="k">return</span> <span class="n">foo</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>You could technically handle this however you want – fail, continue silently, issue a <a href="https://en.wikipedia.org/wiki/Halt_and_Catch_Fire">HALT AND CATCH FIRE</a> instruction.</p>

<p>This program, on the other hand, is perfectly valid, because the value returned from <code class="language-plaintext highlighter-rouge">foo()</code> is never used:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">foo</span><span class="p">()</span> <span class="p">{</span>
  <span class="mi">1</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="n">foo</span><span class="p">();</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Honestly, the specification here seems really dumb to me. If I write a non-<code class="language-plaintext highlighter-rouge">void</code> function without a return statement, that is WRONG and I want the compiler to save me from myself, even if I haven’t technically used it in an illegal way yet. I can’t think of any situation where we’d want this behavior; if you can, please let me know.</p>

<p>However, that’s the spec, so our functions have to return successfully even when they’re missing a return statement. That means you need to issue the function epilogue and <code class="language-plaintext highlighter-rouge">ret</code> instruction even if the return statement is missing. It’s probably easiest to handle <code class="language-plaintext highlighter-rouge">main</code> and all other functions uniformly, so you can just return 0 from any function without a return statement.</p>
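<p>As a rough illustration of handling every function uniformly, the sketch below copies a fallback return sequence into a buffer, to be appended after a function body's generated code. The helper name <code class="language-plaintext highlighter-rouge">emit_implicit_return</code> and the buffer-based interface are assumptions made for this example, not part of the compiler described in this series. The sequence sets EAX to 0, then runs the usual epilogue and <code class="language-plaintext highlighter-rouge">ret</code>.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fallback sequence: put the return value 0 in EAX, then run the
 * function epilogue and return to the caller. */
static const char IMPLICIT_RETURN[] =
    "    movl $0, %eax\n"
    "    movl %ebp, %esp\n"
    "    pop %ebp\n"
    "    ret\n";

/* Hypothetical helper (an assumption for this sketch): copies the fallback
 * sequence into buf. Returns the number of characters written, not counting
 * the terminating NUL, or -1 if buf is too small. */
int emit_implicit_return(char *buf, size_t size) {
    assert(buf != NULL);
    size_t len = strlen(IMPLICIT_RETURN);
    if (len + 1 > size) {
        return -1;
    }
    memcpy(buf, IMPLICIT_RETURN, len + 1);
    return (int)len;
}
```

<p>If the function body already ended with a return statement, these instructions are unreachable but harmless, so a code generator could emit them at the end of every function without checking.</p>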

<h4 id="-task-2">☑ Task:</h4>

<p>Update your code-generation pass to:</p>
<ul>
  <li>Generate function prologues and epilogues.</li>
  <li>Generate correct code for variable declarations, assignments, and references.</li>
  <li>Make <code class="language-plaintext highlighter-rouge">main</code> return 0 even if the return statement is missing.</li>
</ul>

<p>Your code should succeed on all valid examples and fail on all invalid examples for stages 1-5.</p>

<h2 id="bonus-features">Bonus features</h2>

<p>At this point, there are a handful of other features you can implement pretty easily:</p>

<h3 id="compound-assignment-operators">Compound Assignment Operators</h3>

<ul>
  <li><code class="language-plaintext highlighter-rouge">+=</code></li>
  <li><code class="language-plaintext highlighter-rouge">-=</code></li>
  <li><code class="language-plaintext highlighter-rouge">/=</code></li>
  <li><code class="language-plaintext highlighter-rouge">*=</code></li>
  <li><code class="language-plaintext highlighter-rouge">%=</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;&lt;=</code></li>
  <li><code class="language-plaintext highlighter-rouge">&gt;&gt;=</code></li>
  <li><code class="language-plaintext highlighter-rouge">&amp;=</code></li>
  <li><code class="language-plaintext highlighter-rouge">|=</code></li>
  <li><code class="language-plaintext highlighter-rouge">^=</code></li>
</ul>
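<p>Each of these operators stores the same value that the corresponding binary operator would produce, so <code class="language-plaintext highlighter-rouge">a += b</code> leaves <code class="language-plaintext highlighter-rouge">a</code> holding <code class="language-plaintext highlighter-rouge">a + b</code>. The sketch below spells out that mapping; <code class="language-plaintext highlighter-rouge">eval_compound</code> is a hypothetical helper written for this example, not something from the compiler in this series.</p>

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (an assumption for this sketch): computes the value
 * stored by `lhs op= rhs`, which is the value of `lhs op rhs`. */
int eval_compound(const char *op, int lhs, int rhs) {
    if (strcmp(op, "+=") == 0)  return lhs + rhs;
    if (strcmp(op, "-=") == 0)  return lhs - rhs;
    if (strcmp(op, "*=") == 0)  return lhs * rhs;
    if (strcmp(op, "/=") == 0)  return lhs / rhs;   /* rhs must be nonzero */
    if (strcmp(op, "%=") == 0)  return lhs % rhs;   /* rhs must be nonzero */
    if (strcmp(op, "<<=") == 0) return lhs << rhs;
    if (strcmp(op, ">>=") == 0) return lhs >> rhs;
    if (strcmp(op, "&=") == 0)  return lhs & rhs;
    if (strcmp(op, "|=") == 0)  return lhs | rhs;
    if (strcmp(op, "^=") == 0)  return lhs ^ rhs;
    assert(0 && "unknown compound assignment operator");
    return 0;
}
```

<p>One possible approach in a compiler is to treat <code class="language-plaintext highlighter-rouge">a += b</code> as <code class="language-plaintext highlighter-rouge">a = a + b</code> and reuse the code generation you already have for assignment and binary operators. That rewrite is safe at this stage because the left-hand side of an assignment is always a plain variable.</p>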

<h3 id="comma-operators">Comma Operators</h3>

<ul>
  <li><code class="language-plaintext highlighter-rouge">e1, e2</code>. e1 is evaluated first and its value is discarded; the result is the value of e2.</li>
</ul>
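<p>Here is a small example of the comma operator in plain C, showing that the left operand still runs for its side effects even though its value is thrown away. The function name <code class="language-plaintext highlighter-rouge">comma_demo</code> is made up for this illustration.</p>

```c
#include <assert.h>

/* Evaluates `(a = 5, a + 1)`. The assignment runs first and its value is
 * discarded; the whole expression takes the value of `a + 1`.
 * The final value of `a` is written to *a_after so the caller can see the
 * side effect of the left operand. */
int comma_demo(int *a_after) {
    assert(a_after != 0);
    int a = 0;
    int result = (a = 5, a + 1);
    *a_after = a;
    return result;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">comma_demo</code> returns 6 and leaves 5 in the variable passed through the pointer.</p>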

<h3 id="incrementdecrement-operators">Increment/Decrement Operators</h3>

<ul>
  <li>Prefix and postfix <code class="language-plaintext highlighter-rouge">++</code></li>
  <li>Prefix and postfix <code class="language-plaintext highlighter-rouge">--</code></li>
</ul>
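<p>The difference between the prefix and postfix forms is which value the expression produces: the postfix form yields the old value of the variable, and the prefix form yields the new one. The short example below records all four cases; <code class="language-plaintext highlighter-rouge">incdec_demo</code> is a name invented for this sketch.</p>

```c
#include <assert.h>

/* Records the value of each increment/decrement expression in out[0..3],
 * starting from a == start. */
void incdec_demo(int start, int out[4]) {
    assert(out != 0);
    int a = start;
    out[0] = a++;  /* yields start,     then a becomes start + 1 */
    out[1] = ++a;  /* a becomes start + 2, yields start + 2      */
    out[2] = a--;  /* yields start + 2, then a becomes start + 1 */
    out[3] = --a;  /* a becomes start,  yields start             */
}
```

<p>In generated code, this means the postfix forms need to save the variable's old value as the expression result before updating the variable, while the prefix forms can update it first and use the new value.</p>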

<p>This week’s tests don’t cover these, so it’s up to you whether to implement them or skip them.</p>

<h2 id="up-next">Up Next</h2>

<p>I’m going to switch to one blog post every two weeks. In the <a href="/2018/02/25/Write-a-Compiler-6.html">next post</a>, we’ll add <code class="language-plaintext highlighter-rouge">if</code> statements and conditional operators (<code class="language-plaintext highlighter-rouge">a ? b : c</code>). See you then!</p>

<h2 id="update-112">Update 1/12</h2>

<ul>
  <li>
    <p>Corrected the “Missing Return Statements” section, which previously said that the behavior of <code class="language-plaintext highlighter-rouge">main</code> is undefined when it’s missing a return statement. Also updated the test suite accordingly.</p>
  </li>
  <li>
    <p>Clarified that declaring a variable multiple times is sometimes legal at file scope.</p>
  </li>
</ul>

<p>Thanks to <a href="http://ouah.org/ogay/">Olivier Gay</a> for pointing out both of those things.</p>

<p><em>If you have any questions, corrections, or other feedback, you can <a href="mailto:nora@norasandler.com">email me</a> or <a href="https://github.com/nlsandler/write_a_c_compiler/issues">open an issue</a>.</em></p>

<div class="footnote">
  <p><sup id="fn1">1</sup>
Keep in mind that the stack grows <em>down</em> towards lower addresses; we decrement ESP whenever we push things onto the stack, and ESP will never hold a higher value than EBP. So the top of the stack is really…on the bottom ¯\_(ツ)_/¯ <a href="#anchor1">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn2">2</sup>
Even though <code class="language-plaintext highlighter-rouge">main</code> is the only function, it still has a caller: it’s called by the setup routine, <code class="language-plaintext highlighter-rouge">crt0</code>. <a href="#anchor2">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn3">3</sup>
We don’t really need to keep track of the stack index, since we can just derive it from the size of the variable map.
However, the stack index will come in handy once we add types other than <code class="language-plaintext highlighter-rouge">int</code>, since at that point our variables won’t all be the same size.
If you don’t want to keep track of it for now, that’s fine with me. <a href="#anchor3">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn4">4</sup>
This is not at all how real compilers work;
they usually allocate space for local variables all at once in the function prologue,
or just store them in registers.
Our way is less effort, though. <a href="#anchor4">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn5">5</sup>
It’s sometimes legal to declare a variable more than once at file scope, per section 6.9.2 of the C11 specification. <a href="#anchor5">↩</a></p>
</div>]]></content><author><name>Nora Sandler</name></author><category term="compiler-tutorial" /><summary type="html"><![CDATA[This is the fifth post in a series. Read part 1 here.]]></summary></entry></feed>