Home | History | Annotate | Download | only in tutorial
      1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
      2                       "http://www.w3.org/TR/html4/strict.dtd">
      3 
      4 <html>
      5 <head>
      6   <title>Kaleidoscope: Implementing a Parser and AST</title>
      7   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      8   <meta name="author" content="Chris Lattner">
      9   <link rel="stylesheet" href="../llvm.css" type="text/css">
     10 </head>
     11 
     12 <body>
     13 
     14 <h1>Kaleidoscope: Implementing a Parser and AST</h1>
     15 
     16 <ul>
     17 <li><a href="index.html">Up to Tutorial Index</a></li>
     18 <li>Chapter 2
     19   <ol>
     20     <li><a href="#intro">Chapter 2 Introduction</a></li>
     21     <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
     22     <li><a href="#parserbasics">Parser Basics</a></li>
     23     <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
     24     <li><a href="#parserbinops">Binary Expression Parsing</a></li>
     25     <li><a href="#parsertop">Parsing the Rest</a></li>
     26     <li><a href="#driver">The Driver</a></li>
     27     <li><a href="#conclusions">Conclusions</a></li>
     28     <li><a href="#code">Full Code Listing</a></li>
     29   </ol>
     30 </li>
     31 <li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
     32 </ul>
     33 
     34 <div class="doc_author">
     35   <p>Written by <a href="mailto:sabre (a] nondot.org">Chris Lattner</a></p>
     36 </div>
     37 
     38 <!-- *********************************************************************** -->
     39 <h2><a name="intro">Chapter 2 Introduction</a></h2>
     40 <!-- *********************************************************************** -->
     41 
     42 <div>
     43 
     44 <p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
     45 with LLVM</a>" tutorial.  This chapter shows you how to use the lexer, built in 
     46 <a href="LangImpl1.html">Chapter 1</a>, to build a full <a
     47 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
     48 our Kaleidoscope language.  Once we have a parser, we'll define and build an <a 
     49 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax 
     50 Tree</a> (AST).</p>
     51 
     52 <p>The parser we will build uses a combination of <a 
     53 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
     54 Parsing</a> and <a href=
     55 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence 
     56 Parsing</a> to parse the Kaleidoscope language (the latter for 
     57 binary expressions and the former for everything else).  Before we get to
     58 parsing though, lets talk about the output of the parser: the Abstract Syntax
     59 Tree.</p>
     60 
     61 </div>
     62 
     63 <!-- *********************************************************************** -->
     64 <h2><a name="ast">The Abstract Syntax Tree (AST)</a></h2>
     65 <!-- *********************************************************************** -->
     66 
     67 <div>
     68 
     69 <p>The AST for a program captures its behavior in such a way that it is easy for
     70 later stages of the compiler (e.g. code generation) to interpret.  We basically
     71 want one object for each construct in the language, and the AST should closely
     72 model the language.  In Kaleidoscope, we have expressions, a prototype, and a
     73 function object.  We'll start with expressions first:</p>
     74 
     75 <div class="doc_code">
     76 <pre>
     77 /// ExprAST - Base class for all expression nodes.
     78 class ExprAST {
     79 public:
     80   virtual ~ExprAST() {}
     81 };
     82 
     83 /// NumberExprAST - Expression class for numeric literals like "1.0".
     84 class NumberExprAST : public ExprAST {
     85   double Val;
     86 public:
     87   NumberExprAST(double val) : Val(val) {}
     88 };
     89 </pre>
     90 </div>
     91 
     92 <p>The code above shows the definition of the base ExprAST class and one
     93 subclass which we use for numeric literals.  The important thing to note about
     94 this code is that the NumberExprAST class captures the numeric value of the
     95 literal as an instance variable. This allows later phases of the compiler to
     96 know what the stored numeric value is.</p>
     97 
     98 <p>Right now we only create the AST,  so there are no useful accessor methods on
     99 them.  It would be very easy to add a virtual method to pretty print the code,
    100 for example.  Here are the other expression AST node definitions that we'll use
    101 in the basic form of the Kaleidoscope language:
    102 </p>
    103 
    104 <div class="doc_code">
    105 <pre>
    106 /// VariableExprAST - Expression class for referencing a variable, like "a".
    107 class VariableExprAST : public ExprAST {
    108   std::string Name;
    109 public:
    110   VariableExprAST(const std::string &amp;name) : Name(name) {}
    111 };
    112 
    113 /// BinaryExprAST - Expression class for a binary operator.
    114 class BinaryExprAST : public ExprAST {
    115   char Op;
    116   ExprAST *LHS, *RHS;
    117 public:
    118   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) 
    119     : Op(op), LHS(lhs), RHS(rhs) {}
    120 };
    121 
    122 /// CallExprAST - Expression class for function calls.
    123 class CallExprAST : public ExprAST {
    124   std::string Callee;
    125   std::vector&lt;ExprAST*&gt; Args;
    126 public:
    127   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
    128     : Callee(callee), Args(args) {}
    129 };
    130 </pre>
    131 </div>
    132 
    133 <p>This is all (intentionally) rather straight-forward: variables capture the
    134 variable name, binary operators capture their opcode (e.g. '+'), and calls
    135 capture a function name as well as a list of any argument expressions.  One thing 
    136 that is nice about our AST is that it captures the language features without 
    137 talking about the syntax of the language.  Note that there is no discussion about 
    138 precedence of binary operators, lexical structure, etc.</p>
    139 
    140 <p>For our basic language, these are all of the expression nodes we'll define.
    141 Because it doesn't have conditional control flow, it isn't Turing-complete;
    142 we'll fix that in a later installment.  The two things we need next are a way
    143 to talk about the interface to a function, and a way to talk about functions
    144 themselves:</p>
    145 
    146 <div class="doc_code">
    147 <pre>
    148 /// PrototypeAST - This class represents the "prototype" for a function,
    149 /// which captures its name, and its argument names (thus implicitly the number
    150 /// of arguments the function takes).
    151 class PrototypeAST {
    152   std::string Name;
    153   std::vector&lt;std::string&gt; Args;
    154 public:
    155   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
    156     : Name(name), Args(args) {}
    157 };
    158 
    159 /// FunctionAST - This class represents a function definition itself.
    160 class FunctionAST {
    161   PrototypeAST *Proto;
    162   ExprAST *Body;
    163 public:
    164   FunctionAST(PrototypeAST *proto, ExprAST *body)
    165     : Proto(proto), Body(body) {}
    166 };
    167 </pre>
    168 </div>
    169 
    170 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
    171 Since all values are double precision floating point, the type of each argument
    172 doesn't need to be stored anywhere.  In a more aggressive and realistic
    173 language, the "ExprAST" class would probably have a type field.</p>
    174 
    175 <p>With this scaffolding, we can now talk about parsing expressions and function
    176 bodies in Kaleidoscope.</p>
    177 
    178 </div>
    179 
    180 <!-- *********************************************************************** -->
    181 <h2><a name="parserbasics">Parser Basics</a></h2>
    182 <!-- *********************************************************************** -->
    183 
    184 <div>
    185 
    186 <p>Now that we have an AST to build, we need to define the parser code to build
    187 it.  The idea here is that we want to parse something like "x+y" (which is
    188 returned as three tokens by the lexer) into an AST that could be generated with
    189 calls like this:</p>
    190 
    191 <div class="doc_code">
    192 <pre>
    193   ExprAST *X = new VariableExprAST("x");
    194   ExprAST *Y = new VariableExprAST("y");
    195   ExprAST *Result = new BinaryExprAST('+', X, Y);
    196 </pre>
    197 </div>
    198 
    199 <p>In order to do this, we'll start by defining some basic helper routines:</p>
    200 
    201 <div class="doc_code">
    202 <pre>
    203 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
    204 /// token the parser is looking at.  getNextToken reads another token from the
    205 /// lexer and updates CurTok with its results.
    206 static int CurTok;
    207 static int getNextToken() {
    208   return CurTok = gettok();
    209 }
    210 </pre>
    211 </div>
    212 
    213 <p>
    214 This implements a simple token buffer around the lexer.  This allows 
    215 us to look one token ahead at what the lexer is returning.  Every function in
    216 our parser will assume that CurTok is the current token that needs to be
    217 parsed.</p>
    218 
    219 <div class="doc_code">
    220 <pre>
    221 
    222 /// Error* - These are little helper functions for error handling.
    223 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
    224 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
    225 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
    226 </pre>
    227 </div>
    228 
    229 <p>
    230 The <tt>Error</tt> routines are simple helper routines that our parser will use
    231 to handle errors.  The error recovery in our parser will not be the best and
    232 is not particular user-friendly, but it will be enough for our tutorial.  These
    233 routines make it easier to handle errors in routines that have various return
    234 types: they always return null.</p>
    235 
    236 <p>With these basic helper functions, we can implement the first
    237 piece of our grammar: numeric literals.</p>
    238 
    239 </div>
    240 
    241 <!-- *********************************************************************** -->
    242 <h2><a name="parserprimexprs">Basic Expression Parsing</a></h2>
    243 <!-- *********************************************************************** -->
    244 
    245 <div>
    246 
    247 <p>We start with numeric literals, because they are the simplest to process.
    248 For each production in our grammar, we'll define a function which parses that
    249 production.  For numeric literals, we have:
    250 </p>
    251 
    252 <div class="doc_code">
    253 <pre>
    254 /// numberexpr ::= number
    255 static ExprAST *ParseNumberExpr() {
    256   ExprAST *Result = new NumberExprAST(NumVal);
    257   getNextToken(); // consume the number
    258   return Result;
    259 }
    260 </pre>
    261 </div>
    262 
    263 <p>This routine is very simple: it expects to be called when the current token
    264 is a <tt>tok_number</tt> token.  It takes the current number value, creates 
    265 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, and finally
    266 returns.</p>
    267 
    268 <p>There are some interesting aspects to this.  The most important one is that
    269 this routine eats all of the tokens that correspond to the production and
    270 returns the lexer buffer with the next token (which is not part of the grammar
    271 production) ready to go.  This is a fairly standard way to go for recursive
    272 descent parsers.  For a better example, the parenthesis operator is defined like
    273 this:</p>
    274 
    275 <div class="doc_code">
    276 <pre>
    277 /// parenexpr ::= '(' expression ')'
    278 static ExprAST *ParseParenExpr() {
    279   getNextToken();  // eat (.
    280   ExprAST *V = ParseExpression();
    281   if (!V) return 0;
    282   
    283   if (CurTok != ')')
    284     return Error("expected ')'");
    285   getNextToken();  // eat ).
    286   return V;
    287 }
    288 </pre>
    289 </div>
    290 
    291 <p>This function illustrates a number of interesting things about the 
    292 parser:</p>
    293 
    294 <p>
    295 1) It shows how we use the Error routines.  When called, this function expects
    296 that the current token is a '(' token, but after parsing the subexpression, it
    297 is possible that there is no ')' waiting.  For example, if the user types in
    298 "(4 x" instead of "(4)", the parser should emit an error.  Because errors can
    299 occur, the parser needs a way to indicate that they happened: in our parser, we
    300 return null on an error.</p>
    301 
    302 <p>2) Another interesting aspect of this function is that it uses recursion by
    303 calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
    304 <tt>ParseParenExpr</tt>).  This is powerful because it allows us to handle 
    305 recursive grammars, and keeps each production very simple.  Note that
    306 parentheses do not cause construction of AST nodes themselves.  While we could
    307 do it this way, the most important role of parentheses are to guide the parser
    308 and provide grouping.  Once the parser constructs the AST, parentheses are not
    309 needed.</p>
    310 
    311 <p>The next simple production is for handling variable references and function
    312 calls:</p>
    313 
    314 <div class="doc_code">
    315 <pre>
    316 /// identifierexpr
    317 ///   ::= identifier
    318 ///   ::= identifier '(' expression* ')'
    319 static ExprAST *ParseIdentifierExpr() {
    320   std::string IdName = IdentifierStr;
    321   
    322   getNextToken();  // eat identifier.
    323   
    324   if (CurTok != '(') // Simple variable ref.
    325     return new VariableExprAST(IdName);
    326   
    327   // Call.
    328   getNextToken();  // eat (
    329   std::vector&lt;ExprAST*&gt; Args;
    330   if (CurTok != ')') {
    331     while (1) {
    332       ExprAST *Arg = ParseExpression();
    333       if (!Arg) return 0;
    334       Args.push_back(Arg);
    335 
    336       if (CurTok == ')') break;
    337 
    338       if (CurTok != ',')
    339         return Error("Expected ')' or ',' in argument list");
    340       getNextToken();
    341     }
    342   }
    343 
    344   // Eat the ')'.
    345   getNextToken();
    346   
    347   return new CallExprAST(IdName, Args);
    348 }
    349 </pre>
    350 </div>
    351 
    352 <p>This routine follows the same style as the other routines.  (It expects to be
    353 called if the current token is a <tt>tok_identifier</tt> token).  It also has
    354 recursion and error handling.  One interesting aspect of this is that it uses
    355 <em>look-ahead</em> to determine if the current identifier is a stand alone
    356 variable reference or if it is a function call expression.  It handles this by
    357 checking to see if the token after the identifier is a '(' token, constructing
    358 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
    359 </p>
    360 
    361 <p>Now that we have all of our simple expression-parsing logic in place, we can
    362 define a helper function to wrap it together into one entry point.  We call this
    363 class of expressions "primary" expressions, for reasons that will become more
    364 clear <a href="LangImpl6.html#unary">later in the tutorial</a>.  In order to
    365 parse an arbitrary primary expression, we need to determine what sort of
    366 expression it is:</p>
    367 
    368 <div class="doc_code">
    369 <pre>
    370 /// primary
    371 ///   ::= identifierexpr
    372 ///   ::= numberexpr
    373 ///   ::= parenexpr
    374 static ExprAST *ParsePrimary() {
    375   switch (CurTok) {
    376   default: return Error("unknown token when expecting an expression");
    377   case tok_identifier: return ParseIdentifierExpr();
    378   case tok_number:     return ParseNumberExpr();
    379   case '(':            return ParseParenExpr();
    380   }
    381 }
    382 </pre>
    383 </div>
    384 
    385 <p>Now that you see the definition of this function, it is more obvious why we
    386 can assume the state of CurTok in the various functions.  This uses look-ahead
    387 to determine which sort of expression is being inspected, and then parses it
    388 with a function call.</p>
    389 
    390 <p>Now that basic expressions are handled, we need to handle binary expressions.
    391 They are a bit more complex.</p>
    392 
    393 </div>
    394 
    395 <!-- *********************************************************************** -->
    396 <h2><a name="parserbinops">Binary Expression Parsing</a></h2>
    397 <!-- *********************************************************************** -->
    398 
    399 <div>
    400 
    401 <p>Binary expressions are significantly harder to parse because they are often
    402 ambiguous.  For example, when given the string "x+y*z", the parser can choose
    403 to parse it as either "(x+y)*z" or "x+(y*z)".  With common definitions from
    404 mathematics, we expect the later parse, because "*" (multiplication) has
    405 higher <em>precedence</em> than "+" (addition).</p>
    406 
    407 <p>There are many ways to handle this, but an elegant and efficient way is to
    408 use <a href=
    409 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence 
    410 Parsing</a>.  This parsing technique uses the precedence of binary operators to
    411 guide recursion.  To start with, we need a table of precedences:</p>
    412 
    413 <div class="doc_code">
    414 <pre>
    415 /// BinopPrecedence - This holds the precedence for each binary operator that is
    416 /// defined.
    417 static std::map&lt;char, int&gt; BinopPrecedence;
    418 
    419 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
    420 static int GetTokPrecedence() {
    421   if (!isascii(CurTok))
    422     return -1;
    423     
    424   // Make sure it's a declared binop.
    425   int TokPrec = BinopPrecedence[CurTok];
    426   if (TokPrec &lt;= 0) return -1;
    427   return TokPrec;
    428 }
    429 
    430 int main() {
    431   // Install standard binary operators.
    432   // 1 is lowest precedence.
    433   BinopPrecedence['&lt;'] = 10;
    434   BinopPrecedence['+'] = 20;
    435   BinopPrecedence['-'] = 20;
    436   BinopPrecedence['*'] = 40;  // highest.
    437   ...
    438 }
    439 </pre>
    440 </div>
    441 
    442 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
    443 (this can obviously be extended by you, our brave and intrepid reader).  The
    444 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
    445 or -1 if the token is not a binary operator.  Having a map makes it easy to add
    446 new operators and makes it clear that the algorithm doesn't depend on the
    447 specific operators involved, but it would be easy enough to eliminate the map
    448 and do the comparisons in the <tt>GetTokPrecedence</tt> function.  (Or just use
    449 a fixed-size array).</p>
    450 
    451 <p>With the helper above defined, we can now start parsing binary expressions.
    452 The basic idea of operator precedence parsing is to break down an expression
    453 with potentially ambiguous binary operators into pieces.  Consider ,for example,
    454 the expression "a+b+(c+d)*e*f+g".  Operator precedence parsing considers this
    455 as a stream of primary expressions separated by binary operators.  As such,
    456 it will first parse the leading primary expression "a", then it will see the
    457 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g].  Note that because parentheses
    458 are primary expressions, the binary expression parser doesn't need to worry
    459 about nested subexpressions like (c+d) at all. 
    460 </p>
    461 
    462 <p>
    463 To start, an expression is a primary expression potentially followed by a
    464 sequence of [binop,primaryexpr] pairs:</p>
    465 
    466 <div class="doc_code">
    467 <pre>
    468 /// expression
    469 ///   ::= primary binoprhs
    470 ///
    471 static ExprAST *ParseExpression() {
    472   ExprAST *LHS = ParsePrimary();
    473   if (!LHS) return 0;
    474   
    475   return ParseBinOpRHS(0, LHS);
    476 }
    477 </pre>
    478 </div>
    479 
    480 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
    481 us.  It takes a precedence and a pointer to an expression for the part that has been
    482 parsed so far.   Note that "x" is a perfectly valid expression: As such, "binoprhs" is
    483 allowed to be empty, in which case it returns the expression that is passed into
    484 it. In our example above, the code passes the expression for "a" into
    485 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
    486 
    487 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
    488 minimal operator precedence</em> that the function is allowed to eat.  For
    489 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
    490 passed in a precedence of 40, it will not consume any tokens (because the
    491 precedence of '+' is only 20).  With this in mind, <tt>ParseBinOpRHS</tt> starts
    492 with:</p>
    493 
    494 <div class="doc_code">
    495 <pre>
    496 /// binoprhs
    497 ///   ::= ('+' primary)*
    498 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
    499   // If this is a binop, find its precedence.
    500   while (1) {
    501     int TokPrec = GetTokPrecedence();
    502     
    503     // If this is a binop that binds at least as tightly as the current binop,
    504     // consume it, otherwise we are done.
    505     if (TokPrec &lt; ExprPrec)
    506       return LHS;
    507 </pre>
    508 </div>
    509 
    510 <p>This code gets the precedence of the current token and checks to see if if is
    511 too low.  Because we defined invalid tokens to have a precedence of -1, this 
    512 check implicitly knows that the pair-stream ends when the token stream runs out
    513 of binary operators.  If this check succeeds, we know that the token is a binary
    514 operator and that it will be included in this expression:</p>
    515 
    516 <div class="doc_code">
    517 <pre>
    518     // Okay, we know this is a binop.
    519     int BinOp = CurTok;
    520     getNextToken();  // eat binop
    521     
    522     // Parse the primary expression after the binary operator.
    523     ExprAST *RHS = ParsePrimary();
    524     if (!RHS) return 0;
    525 </pre>
    526 </div>
    527 
    528 <p>As such, this code eats (and remembers) the binary operator and then parses
    529 the primary expression that follows.  This builds up the whole pair, the first of
    530 which is [+, b] for the running example.</p>
    531 
    532 <p>Now that we parsed the left-hand side of an expression and one pair of the 
    533 RHS sequence, we have to decide which way the expression associates.  In
    534 particular, we could have "(a+b) binop unparsed"  or "a + (b binop unparsed)".
    535 To determine this, we look ahead at "binop" to determine its precedence and 
    536 compare it to BinOp's precedence (which is '+' in this case):</p>
    537 
    538 <div class="doc_code">
    539 <pre>
    540     // If BinOp binds less tightly with RHS than the operator after RHS, let
    541     // the pending operator take RHS as its LHS.
    542     int NextPrec = GetTokPrecedence();
    543     if (TokPrec &lt; NextPrec) {
    544 </pre>
    545 </div>
    546 
    547 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
    548 precedence of our current operator, then we know that the parentheses associate
    549 as "(a+b) binop ...".  In our example, the current operator is "+" and the next 
    550 operator is "+", we know that they have the same precedence.  In this case we'll
    551 create the AST node for "a+b", and then continue parsing:</p>
    552 
    553 <div class="doc_code">
    554 <pre>
    555       ... if body omitted ...
    556     }
    557     
    558     // Merge LHS/RHS.
    559     LHS = new BinaryExprAST(BinOp, LHS, RHS);
    560   }  // loop around to the top of the while loop.
    561 }
    562 </pre>
    563 </div>
    564 
    565 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
    566 iteration of the loop, with "+" as the current token.  The code above will eat, 
    567 remember, and parse "(c+d)" as the primary expression, which makes the
    568 current pair equal to [+, (c+d)].  It will then evaluate the 'if' conditional above with 
    569 "*" as the binop to the right of the primary.  In this case, the precedence of "*" is
    570 higher than the precedence of "+" so the if condition will be entered.</p>
    571 
    572 <p>The critical question left here is "how can the if condition parse the right
    573 hand side in full"?  In particular, to build the AST correctly for our example,
    574 it needs to get all of "(c+d)*e*f" as the RHS expression variable.  The code to
    575 do this is surprisingly simple (code from the above two blocks duplicated for
    576 context):</p>
    577 
    578 <div class="doc_code">
    579 <pre>
    580     // If BinOp binds less tightly with RHS than the operator after RHS, let
    581     // the pending operator take RHS as its LHS.
    582     int NextPrec = GetTokPrecedence();
    583     if (TokPrec &lt; NextPrec) {
    584       <b>RHS = ParseBinOpRHS(TokPrec+1, RHS);
    585       if (RHS == 0) return 0;</b>
    586     }
    587     // Merge LHS/RHS.
    588     LHS = new BinaryExprAST(BinOp, LHS, RHS);
    589   }  // loop around to the top of the while loop.
    590 }
    591 </pre>
    592 </div>
    593 
    594 <p>At this point, we know that the binary operator to the RHS of our primary
    595 has higher precedence than the binop we are currently parsing.  As such, we know
    596 that any sequence of pairs whose operators are all higher precedence than "+"
    597 should be parsed together and returned as "RHS".  To do this, we recursively
    598 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
    599 precedence required for it to continue.  In our example above, this will cause
    600 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
    601 of the '+' expression.</p>
    602 
    603 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed
    604 and added to the AST.  With this little bit of code (14 non-trivial lines), we
    605 correctly handle fully general binary expression parsing in a very elegant way.
    606 This was a whirlwind tour of this code, and it is somewhat subtle.  I recommend
    607 running through it with a few tough examples to see how it works.
    608 </p>
    609 
    610 <p>This wraps up handling of expressions.  At this point, we can point the
    611 parser at an arbitrary token stream and build an expression from it, stopping
    612 at the first token that is not part of the expression.  Next up we need to
    613 handle function definitions, etc.</p>
    614 
    615 </div>
    616 
    617 <!-- *********************************************************************** -->
    618 <h2><a name="parsertop">Parsing the Rest</a></h2>
    619 <!-- *********************************************************************** -->
    620 
    621 <div>
    622 
    623 <p>
    624 The next thing missing is handling of function prototypes.  In Kaleidoscope,
    625 these are used both for 'extern' function declarations as well as function body
    626 definitions.  The code to do this is straight-forward and not very interesting
    627 (once you've survived expressions):
    628 </p>
    629 
    630 <div class="doc_code">
    631 <pre>
    632 /// prototype
    633 ///   ::= id '(' id* ')'
    634 static PrototypeAST *ParsePrototype() {
    635   if (CurTok != tok_identifier)
    636     return ErrorP("Expected function name in prototype");
    637 
    638   std::string FnName = IdentifierStr;
    639   getNextToken();
    640   
    641   if (CurTok != '(')
    642     return ErrorP("Expected '(' in prototype");
    643   
    644   // Read the list of argument names.
    645   std::vector&lt;std::string&gt; ArgNames;
    646   while (getNextToken() == tok_identifier)
    647     ArgNames.push_back(IdentifierStr);
    648   if (CurTok != ')')
    649     return ErrorP("Expected ')' in prototype");
    650   
    651   // success.
    652   getNextToken();  // eat ')'.
    653   
    654   return new PrototypeAST(FnName, ArgNames);
    655 }
    656 </pre>
    657 </div>
    658 
    659 <p>Given this, a function definition is very simple, just a prototype plus
    660 an expression to implement the body:</p>
    661 
    662 <div class="doc_code">
    663 <pre>
    664 /// definition ::= 'def' prototype expression
    665 static FunctionAST *ParseDefinition() {
    666   getNextToken();  // eat def.
    667   PrototypeAST *Proto = ParsePrototype();
    668   if (Proto == 0) return 0;
    669 
    670   if (ExprAST *E = ParseExpression())
    671     return new FunctionAST(Proto, E);
    672   return 0;
    673 }
    674 </pre>
    675 </div>
    676 
    677 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
    678 well as to support forward declaration of user functions.  These 'extern's are just
    679 prototypes with no body:</p>
    680 
    681 <div class="doc_code">
    682 <pre>
    683 /// external ::= 'extern' prototype
    684 static PrototypeAST *ParseExtern() {
    685   getNextToken();  // eat extern.
    686   return ParsePrototype();
    687 }
    688 </pre>
    689 </div>
    690 
    691 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
    692 evaluate them on the fly.  We will handle this by defining anonymous nullary
    693 (zero argument) functions for them:</p>
    694 
    695 <div class="doc_code">
    696 <pre>
    697 /// toplevelexpr ::= expression
    698 static FunctionAST *ParseTopLevelExpr() {
    699   if (ExprAST *E = ParseExpression()) {
    700     // Make an anonymous proto.
    701     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
    702     return new FunctionAST(Proto, E);
    703   }
    704   return 0;
    705 }
    706 </pre>
    707 </div>
    708 
    709 <p>Now that we have all the pieces, let's build a little driver that will let us
    710 actually <em>execute</em> this code we've built!</p>
    711 
    712 </div>
    713 
    714 <!-- *********************************************************************** -->
    715 <h2><a name="driver">The Driver</a></h2>
    716 <!-- *********************************************************************** -->
    717 
    718 <div>
    719 
    720 <p>The driver for this simply invokes all of the parsing pieces with a top-level
    721 dispatch loop.  There isn't much interesting here, so I'll just include the
    722 top-level loop.  See <a href="#code">below</a> for full code in the "Top-Level
    723 Parsing" section.</p>
    724 
    725 <div class="doc_code">
    726 <pre>
    727 /// top ::= definition | external | expression | ';'
    728 static void MainLoop() {
    729   while (1) {
    730     fprintf(stderr, "ready&gt; ");
    731     switch (CurTok) {
    732     case tok_eof:    return;
    733     case ';':        getNextToken(); break;  // ignore top-level semicolons.
    734     case tok_def:    HandleDefinition(); break;
    735     case tok_extern: HandleExtern(); break;
    736     default:         HandleTopLevelExpression(); break;
    737     }
    738   }
    739 }
    740 </pre>
    741 </div>
    742 
    743 <p>The most interesting part of this is that we ignore top-level semicolons.
    744 Why is this, you ask?  The basic reason is that if you type "4 + 5" at the
    745 command line, the parser doesn't know whether that is the end of what you will type
    746 or not.  For example, on the next line you could type "def foo..." in which case
    747 4+5 is the end of a top-level expression.  Alternatively you could type "* 6",
    748 which would continue the expression.  Having top-level semicolons allows you to
    749 type "4+5;", and the parser will know you are done.</p> 
    750 
    751 </div>
    752 
    753 <!-- *********************************************************************** -->
    754 <h2><a name="conclusions">Conclusions</a></h2>
    755 <!-- *********************************************************************** -->
    756 
    757 <div>
    758 
    759 <p>With just under 400 lines of commented code (240 lines of non-comment, 
    760 non-blank code), we fully defined our minimal language, including a lexer,
    761 parser, and AST builder.  With this done, the executable will validate 
    762 Kaleidoscope code and tell us if it is grammatically invalid.  For
    763 example, here is a sample interaction:</p>
    764 
    765 <div class="doc_code">
    766 <pre>
    767 $ <b>./a.out</b>
    768 ready&gt; <b>def foo(x y) x+foo(y, 4.0);</b>
    769 Parsed a function definition.
    770 ready&gt; <b>def foo(x y) x+y y;</b>
    771 Parsed a function definition.
    772 Parsed a top-level expr
    773 ready&gt; <b>def foo(x y) x+y );</b>
    774 Parsed a function definition.
    775 Error: unknown token when expecting an expression
    776 ready&gt; <b>extern sin(a);</b>
    777 ready&gt; Parsed an extern
    778 ready&gt; <b>^D</b>
    779 $ 
    780 </pre>
    781 </div>
    782 
    783 <p>There is a lot of room for extension here.  You can define new AST nodes,
    784 extend the language in many ways, etc.  In the <a href="LangImpl3.html">next
    785 installment</a>, we will describe how to generate LLVM Intermediate
    786 Representation (IR) from the AST.</p>
    787 
    788 </div>
    789 
    790 <!-- *********************************************************************** -->
    791 <h2><a name="code">Full Code Listing</a></h2>
    792 <!-- *********************************************************************** -->
    793 
    794 <div>
    795 
    796 <p>
    797 Here is the complete code listing for this and the previous chapter.  
    798 Note that it is fully self-contained: you don't need LLVM or any external
    799 libraries at all for this.  (Besides the C and C++ standard libraries, of
    800 course.)  To build this, just compile with:</p>
    801 
    802 <div class="doc_code">
    803 <pre>
    804 # Compile
    805 clang++ -g -O3 toy.cpp
    806 # Run
    807 ./a.out 
    808 </pre>
    809 </div>
    810 
    811 <p>Here is the code:</p>
    812 
    813 <div class="doc_code">
    814 <pre>
    815 #include &lt;cstdio&gt;
    816 #include &lt;cstdlib&gt;
    817 #include &lt;string&gt;
    818 #include &lt;map&gt;
    819 #include &lt;vector&gt;
    820 
    821 //===----------------------------------------------------------------------===//
    822 // Lexer
    823 //===----------------------------------------------------------------------===//
    824 
    825 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
    826 // of these for known things.
    827 enum Token {
    828   tok_eof = -1,
    829 
    830   // commands
    831   tok_def = -2, tok_extern = -3,
    832 
    833   // primary
    834   tok_identifier = -4, tok_number = -5
    835 };
    836 
    837 static std::string IdentifierStr;  // Filled in if tok_identifier
    838 static double NumVal;              // Filled in if tok_number
    839 
    840 /// gettok - Return the next token from standard input.
    841 static int gettok() {
    842   static int LastChar = ' ';
    843 
    844   // Skip any whitespace.
    845   while (isspace(LastChar))
    846     LastChar = getchar();
    847 
    848   if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
    849     IdentifierStr = LastChar;
    850     while (isalnum((LastChar = getchar())))
    851       IdentifierStr += LastChar;
    852 
    853     if (IdentifierStr == "def") return tok_def;
    854     if (IdentifierStr == "extern") return tok_extern;
    855     return tok_identifier;
    856   }
    857 
    858   if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
    859     std::string NumStr;
    860     do {
    861       NumStr += LastChar;
    862       LastChar = getchar();
    863     } while (isdigit(LastChar) || LastChar == '.');
    864 
    865     NumVal = strtod(NumStr.c_str(), 0);
    866     return tok_number;
    867   }
    868 
    869   if (LastChar == '#') {
    870     // Comment until end of line.
    871     do LastChar = getchar();
    872     while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp;&amp; LastChar != '\r');
    873     
    874     if (LastChar != EOF)
    875       return gettok();
    876   }
    877   
    878   // Check for end of file.  Don't eat the EOF.
    879   if (LastChar == EOF)
    880     return tok_eof;
    881 
    882   // Otherwise, just return the character as its ascii value.
    883   int ThisChar = LastChar;
    884   LastChar = getchar();
    885   return ThisChar;
    886 }
    887 
    888 //===----------------------------------------------------------------------===//
    889 // Abstract Syntax Tree (aka Parse Tree)
    890 //===----------------------------------------------------------------------===//
    891 
    892 /// ExprAST - Base class for all expression nodes.
    893 class ExprAST {
    894 public:
    895   virtual ~ExprAST() {}
    896 };
    897 
    898 /// NumberExprAST - Expression class for numeric literals like "1.0".
    899 class NumberExprAST : public ExprAST {
    900   double Val;
    901 public:
    902   NumberExprAST(double val) : Val(val) {}
    903 };
    904 
    905 /// VariableExprAST - Expression class for referencing a variable, like "a".
    906 class VariableExprAST : public ExprAST {
    907   std::string Name;
    908 public:
    909   VariableExprAST(const std::string &amp;name) : Name(name) {}
    910 };
    911 
    912 /// BinaryExprAST - Expression class for a binary operator.
    913 class BinaryExprAST : public ExprAST {
    914   char Op;
    915   ExprAST *LHS, *RHS;
    916 public:
    917   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) 
    918     : Op(op), LHS(lhs), RHS(rhs) {}
    919 };
    920 
    921 /// CallExprAST - Expression class for function calls.
    922 class CallExprAST : public ExprAST {
    923   std::string Callee;
    924   std::vector&lt;ExprAST*&gt; Args;
    925 public:
    926   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
    927     : Callee(callee), Args(args) {}
    928 };
    929 
    930 /// PrototypeAST - This class represents the "prototype" for a function,
    931 /// which captures its name, and its argument names (thus implicitly the number
    932 /// of arguments the function takes).
    933 class PrototypeAST {
    934   std::string Name;
    935   std::vector&lt;std::string&gt; Args;
    936 public:
    937   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
    938     : Name(name), Args(args) {}
    939   
    940 };
    941 
    942 /// FunctionAST - This class represents a function definition itself.
    943 class FunctionAST {
    944   PrototypeAST *Proto;
    945   ExprAST *Body;
    946 public:
    947   FunctionAST(PrototypeAST *proto, ExprAST *body)
    948     : Proto(proto), Body(body) {}
    949   
    950 };
    951 
    952 //===----------------------------------------------------------------------===//
    953 // Parser
    954 //===----------------------------------------------------------------------===//
    955 
    956 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
    957 /// token the parser is looking at.  getNextToken reads another token from the
    958 /// lexer and updates CurTok with its results.
    959 static int CurTok;
    960 static int getNextToken() {
    961   return CurTok = gettok();
    962 }
    963 
    964 /// BinopPrecedence - This holds the precedence for each binary operator that is
    965 /// defined.
    966 static std::map&lt;char, int&gt; BinopPrecedence;
    967 
    968 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
    969 static int GetTokPrecedence() {
    970   if (!isascii(CurTok))
    971     return -1;
    972   
    973   // Make sure it's a declared binop.
    974   int TokPrec = BinopPrecedence[CurTok];
    975   if (TokPrec &lt;= 0) return -1;
    976   return TokPrec;
    977 }
    978 
    979 /// Error* - These are little helper functions for error handling.
    980 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
    981 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
    982 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
    983 
    984 static ExprAST *ParseExpression();
    985 
    986 /// identifierexpr
    987 ///   ::= identifier
    988 ///   ::= identifier '(' expression* ')'
    989 static ExprAST *ParseIdentifierExpr() {
    990   std::string IdName = IdentifierStr;
    991   
    992   getNextToken();  // eat identifier.
    993   
    994   if (CurTok != '(') // Simple variable ref.
    995     return new VariableExprAST(IdName);
    996   
    997   // Call.
    998   getNextToken();  // eat (
    999   std::vector&lt;ExprAST*&gt; Args;
   1000   if (CurTok != ')') {
   1001     while (1) {
   1002       ExprAST *Arg = ParseExpression();
   1003       if (!Arg) return 0;
   1004       Args.push_back(Arg);
   1005 
   1006       if (CurTok == ')') break;
   1007 
   1008       if (CurTok != ',')
   1009         return Error("Expected ')' or ',' in argument list");
   1010       getNextToken();
   1011     }
   1012   }
   1013 
   1014   // Eat the ')'.
   1015   getNextToken();
   1016   
   1017   return new CallExprAST(IdName, Args);
   1018 }
   1019 
   1020 /// numberexpr ::= number
   1021 static ExprAST *ParseNumberExpr() {
   1022   ExprAST *Result = new NumberExprAST(NumVal);
   1023   getNextToken(); // consume the number
   1024   return Result;
   1025 }
   1026 
   1027 /// parenexpr ::= '(' expression ')'
   1028 static ExprAST *ParseParenExpr() {
   1029   getNextToken();  // eat (.
   1030   ExprAST *V = ParseExpression();
   1031   if (!V) return 0;
   1032   
   1033   if (CurTok != ')')
   1034     return Error("expected ')'");
   1035   getNextToken();  // eat ).
   1036   return V;
   1037 }
   1038 
   1039 /// primary
   1040 ///   ::= identifierexpr
   1041 ///   ::= numberexpr
   1042 ///   ::= parenexpr
   1043 static ExprAST *ParsePrimary() {
   1044   switch (CurTok) {
   1045   default: return Error("unknown token when expecting an expression");
   1046   case tok_identifier: return ParseIdentifierExpr();
   1047   case tok_number:     return ParseNumberExpr();
   1048   case '(':            return ParseParenExpr();
   1049   }
   1050 }
   1051 
   1052 /// binoprhs
   1053 ///   ::= ('+' primary)*
   1054 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
   1055   // If this is a binop, find its precedence.
   1056   while (1) {
   1057     int TokPrec = GetTokPrecedence();
   1058     
   1059     // If this is a binop that binds at least as tightly as the current binop,
   1060     // consume it, otherwise we are done.
   1061     if (TokPrec &lt; ExprPrec)
   1062       return LHS;
   1063     
   1064     // Okay, we know this is a binop.
   1065     int BinOp = CurTok;
   1066     getNextToken();  // eat binop
   1067     
   1068     // Parse the primary expression after the binary operator.
   1069     ExprAST *RHS = ParsePrimary();
   1070     if (!RHS) return 0;
   1071     
   1072     // If BinOp binds less tightly with RHS than the operator after RHS, let
   1073     // the pending operator take RHS as its LHS.
   1074     int NextPrec = GetTokPrecedence();
   1075     if (TokPrec &lt; NextPrec) {
   1076       RHS = ParseBinOpRHS(TokPrec+1, RHS);
   1077       if (RHS == 0) return 0;
   1078     }
   1079     
   1080     // Merge LHS/RHS.
   1081     LHS = new BinaryExprAST(BinOp, LHS, RHS);
   1082   }
   1083 }
   1084 
   1085 /// expression
   1086 ///   ::= primary binoprhs
   1087 ///
   1088 static ExprAST *ParseExpression() {
   1089   ExprAST *LHS = ParsePrimary();
   1090   if (!LHS) return 0;
   1091   
   1092   return ParseBinOpRHS(0, LHS);
   1093 }
   1094 
   1095 /// prototype
   1096 ///   ::= id '(' id* ')'
   1097 static PrototypeAST *ParsePrototype() {
   1098   if (CurTok != tok_identifier)
   1099     return ErrorP("Expected function name in prototype");
   1100 
   1101   std::string FnName = IdentifierStr;
   1102   getNextToken();
   1103   
   1104   if (CurTok != '(')
   1105     return ErrorP("Expected '(' in prototype");
   1106   
   1107   std::vector&lt;std::string&gt; ArgNames;
   1108   while (getNextToken() == tok_identifier)
   1109     ArgNames.push_back(IdentifierStr);
   1110   if (CurTok != ')')
   1111     return ErrorP("Expected ')' in prototype");
   1112   
   1113   // success.
   1114   getNextToken();  // eat ')'.
   1115   
   1116   return new PrototypeAST(FnName, ArgNames);
   1117 }
   1118 
   1119 /// definition ::= 'def' prototype expression
   1120 static FunctionAST *ParseDefinition() {
   1121   getNextToken();  // eat def.
   1122   PrototypeAST *Proto = ParsePrototype();
   1123   if (Proto == 0) return 0;
   1124 
   1125   if (ExprAST *E = ParseExpression())
   1126     return new FunctionAST(Proto, E);
   1127   return 0;
   1128 }
   1129 
   1130 /// toplevelexpr ::= expression
   1131 static FunctionAST *ParseTopLevelExpr() {
   1132   if (ExprAST *E = ParseExpression()) {
   1133     // Make an anonymous proto.
   1134     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
   1135     return new FunctionAST(Proto, E);
   1136   }
   1137   return 0;
   1138 }
   1139 
   1140 /// external ::= 'extern' prototype
   1141 static PrototypeAST *ParseExtern() {
   1142   getNextToken();  // eat extern.
   1143   return ParsePrototype();
   1144 }
   1145 
   1146 //===----------------------------------------------------------------------===//
   1147 // Top-Level parsing
   1148 //===----------------------------------------------------------------------===//
   1149 
   1150 static void HandleDefinition() {
   1151   if (ParseDefinition()) {
   1152     fprintf(stderr, "Parsed a function definition.\n");
   1153   } else {
   1154     // Skip token for error recovery.
   1155     getNextToken();
   1156   }
   1157 }
   1158 
   1159 static void HandleExtern() {
   1160   if (ParseExtern()) {
   1161     fprintf(stderr, "Parsed an extern\n");
   1162   } else {
   1163     // Skip token for error recovery.
   1164     getNextToken();
   1165   }
   1166 }
   1167 
   1168 static void HandleTopLevelExpression() {
   1169   // Evaluate a top-level expression into an anonymous function.
   1170   if (ParseTopLevelExpr()) {
   1171     fprintf(stderr, "Parsed a top-level expr\n");
   1172   } else {
   1173     // Skip token for error recovery.
   1174     getNextToken();
   1175   }
   1176 }
   1177 
   1178 /// top ::= definition | external | expression | ';'
   1179 static void MainLoop() {
   1180   while (1) {
   1181     fprintf(stderr, "ready&gt; ");
   1182     switch (CurTok) {
   1183     case tok_eof:    return;
   1184     case ';':        getNextToken(); break;  // ignore top-level semicolons.
   1185     case tok_def:    HandleDefinition(); break;
   1186     case tok_extern: HandleExtern(); break;
   1187     default:         HandleTopLevelExpression(); break;
   1188     }
   1189   }
   1190 }
   1191 
   1192 //===----------------------------------------------------------------------===//
   1193 // Main driver code.
   1194 //===----------------------------------------------------------------------===//
   1195 
   1196 int main() {
   1197   // Install standard binary operators.
   1198   // 1 is lowest precedence.
   1199   BinopPrecedence['&lt;'] = 10;
   1200   BinopPrecedence['+'] = 20;
   1201   BinopPrecedence['-'] = 20;
   1202   BinopPrecedence['*'] = 40;  // highest.
   1203 
   1204   // Prime the first token.
   1205   fprintf(stderr, "ready&gt; ");
   1206   getNextToken();
   1207 
   1208   // Run the main "interpreter loop" now.
   1209   MainLoop();
   1210 
   1211   return 0;
   1212 }
   1213 </pre>
   1214 </div>
   1215 <a href="LangImpl3.html">Next: Implementing Code Generation to LLVM IR</a>
   1216 </div>
   1217 
   1218 <!-- *********************************************************************** -->
   1219 <hr>
   1220 <address>
   1221   <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
   1222   src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
   1223   <a href="http://validator.w3.org/check/referer"><img
   1224   src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
   1225 
   1226   <a href="mailto:sabre (a] nondot.org">Chris Lattner</a><br>
   1227   <a href="http://llvm.org/">The LLVM Compiler Infrastructure</a><br>
   1228   Last modified: $Date: 2011-10-16 04:07:38 -0400 (Sun, 16 Oct 2011) $
   1229 </address>
   1230 </body>
   1231 </html>
   1232