Compiling a program is a fascinating, complex journey through which a high-level program becomes lower-level executable code. How is such a journey planned and realized? After developing the techniques, how to teach them to others?
There is lately much discussion of the pedagogy of compiler classes. In this blog post we would like to spotlight one design of compiler construction classes: the presentation order of the material. Usually, compiler classes present a compiler’s phases in the order they’re performed. We call this the “forward” direction. At Tel Aviv University, we’ve experimented with teaching compilers “backward”. This post presents this approach and describes what we see as its key benefits.
Let’s first briefly describe, in the “forward” fashion, how a compiler operates. First, the input program’s text is converted, through lexical and syntactic analysis, to an abstract syntax tree (AST), a representation that’s easy to manipulate. Second, the compiler performs a series of semantic checks, ensuring that the program “makes sense” – variables are defined before they’re used, types add up, etc. In the process the compiler constructs a symbol table, and attributes a type to every expression, ingredients which are also used in later stages. The compiler then proceeds to the third stage of generating code in a low-level intermediate representation (IR), using the symbol table and typing information from before. Finally, the IR code is translated to the target language (say, x86 assembly).
The backward order teaches these phases in reverse: first, code generation; then semantic analysis; and then syntactic analysis. (Ours is not the first class to attempt the backward order – see the acknowledgments for an elaboration.)
Why teach the course backwards?
The first and foremost conceptual challenge in a compiler class is to understand how high-level code can be mapped to low-level code. The backward order first clarifies the compiler’s goal, the “what” of compiled programs before the “how” of the various compilation phases; and it’s useful to know where you’re going to understand how to get there.
After setting the goal, the backward order works back from this less-familiar end to the more-familiar means. This is a common problem-solving strategy. Some studies have shown that a backward heuristic is preferred by novices when they solve problems which experts solve forward (Larkin et al. 1980, Simon & Simon 1978). Crucially, we feel that studying the compiler’s stages backward better motivates each of them. (This is similar to the approach of the famous Nand to Tetris class.)
How does the backward order work? In our class, the students develop a compiler that translates from MiniJava – a subset of Java – to the LLVM low-level intermediate representation (LLVM-IR), based on materials kindly provided by Jens Palsberg, Hal Perkins, and Yannis Smaragdakis.
We structured the project as four independent exercises (see the figure):
This order of implementation is enabled by:
Listed backward, of course.
There’s more to a program than its syntax
This is a mental barrier to overcome. Working solely with an AST representation of programs until very late helps with that.
Lowering in the spotlight
In the backward approach, the correspondence between the high-level code students write and a low-level code implementation is already the topic of the second exercise, which is the first exercise that is part of the compiler itself. We think that this is a good focus.
“Well-typed programs cannot go wrong”
The way we pitched the exercise on semantic checks earlier, the role of the semantic checks is to ensure that the input program satisfies all the assumptions employed in the code generation phase. Since students have already implemented code generation and used the assumptions, this ascribes meaning to the long and seemingly-pointless list of semantic checks. This is because we expect from the compiler to generate correct code for every program that passes the semantic checking phase – a very tangible manifestation of the type safety principle that well-typed programs cannot go wrong.
Parsing is well-motivated
Parsing constructs an AST. How ASTs facilitate the compiler’s analysis, and that they are preferable to manipulating the program text, should be well understood at this point, after implementing multiple analyses in the various exercises.
There and back again
One of our takeaways from this experience is that compilation is best understood holistically; the interfaces between compilation phases and their interdependencies are truly understood only when taken as a whole. It’s an interesting thought experiment, then, to consider these dependencies in both directions, rather than only in the usual, forward, way.
We enjoyed teaching the class backward, but also found that some aspects were harder to explain in this workflow.
Where does assembly generation fit in?
We were uncertain about when to teach the translation from IR to assembly, with the special challenge of activation records. Asking students to tackle these first, as a “pure” backward approach mandates, is to start with what may be considered less-central themes. (We taught assembly generation but didn’t ask students to implement it, due to time constraints.) An alternative is to pose assembly generation as a concluding project of yet another full compiler, this time of IR to assembly. Similar questions arise for when to teach compiler optimizations.
What are static types?
Code generation in an object-oriented language builds on the difference between static and dynamic types. We assumed familiarity with these concepts would come from a prior class on object-oriented programming, but for students who struggled with this we found the distinction hard to explain before discussing type checking.
The other role of semantic checks?
Semantic checks are not only important for the compiler, to guarantee the validity of code generation; they have another role: alerting the programmer to nonsensical code (even if the compiler can somehow “overcome” the error and generate code!). This other role may seem secondary when teaching backward.
Why runtime checks?
When implementing code generation, students must generate code to perform runtime checks, but other checks are “deferred” to the earlier stage of semantic checks. The reasoning behind the stage at which each check is performed doesn’t become clear until the discussion of semantic checks and static analysis.
Why this specific AST structure?
In our current structure, “programs” are synonymous with “ASTs” until very late, in the parsing exercise. This may give the illusion that our choice of AST is intrinsic to the language. We feel that students cannot be expected to make informed choices when designing the AST before seeing a full compiler, whichever direction used, but alluding to a fixed AST structure is sometimes too easy (we were sorely tempted when we explained the difference between semantic and syntactic errors). Also, not constantly referring to the language’s grammar made it harder for students to become accustomed to the syntactic subset of the language they were compiling.
An altogether different course structure is when the compiler is built incrementally to support more and more language features; where applicable, it solves the question of order by working on all phases together. In a more traditional structure, a “backward approach” to teaching compilers is just one possible answer to several questions, most prominent of which are how to make the compiler’s end-goal tangible, and how to motivate parsing and semantic checks as effectively as possible. The experience of implementing a code analysis outside the traditional compiler setting, in the exercise on variable renaming, is a cool bonus. The backward approach also raises questions of its own, particularly along the axis of how the compiler filters invalid programs, which may benefit from the forward, operational direction. We hope to see further ideas on achieving a synthesis of these approaches.
Bio: Yotam Feldman is a PhD student at Tel Aviv University, working on formal methods, hence talking about forward and backward (reachability) on most days. Mooly Sagiv is a professor of Computer Science at Tel Aviv University and the CEO of Certora. Mooly’s hobbies include Program Analysis, Program Verification, and Running.
Acknowledgments: Our class was inspired by Steve Zdancewic’s class at UPenn (which was inspired by Greg Morrisett’s class at Harvard and Andrew Myers’s class at Cornell), and drew on materials for compiling MiniJava graciously provided by Jens Palsberg, Hal Perkins, and Yannis Smaragdakis. A back-to-front approach to compiler classes was employed in other classes as well, and also discussed in a previous PL Perspective by Lindsey Kuper.
We thank Oren Ish-Shalom, Jens Palsberg, Hila Peleg, Hal Perkins, and Steve Zdancewic for graciously lending us their materials, Oren Ish-Shalom and Hila Peleg for consulting on project design, and Oren Ish-Shalom, Jens Palsberg, James R. Wilcox, and Steve Zdancewic for comments and suggestions on a draft of this post.
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.