1 =========================================== 2 Kaleidoscope: Implementing a Parser and AST 3 =========================================== 4 5 .. contents:: 6 :local: 7 8 Chapter 2 Introduction 9 ====================== 10 11 Welcome to Chapter 2 of the "`Implementing a language with 12 LLVM <index.html>`_" tutorial. This chapter shows you how to use the 13 lexer, built in `Chapter 1 <LangImpl1.html>`_, to build a full 14 `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope 15 language. Once we have a parser, we'll define and build an `Abstract 16 Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST). 17 18 The parser we will build uses a combination of `Recursive Descent 19 Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and 20 `Operator-Precedence 21 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to 22 parse the Kaleidoscope language (the latter for binary expressions and 23 the former for everything else). Before we get to parsing though, lets 24 talk about the output of the parser: the Abstract Syntax Tree. 25 26 The Abstract Syntax Tree (AST) 27 ============================== 28 29 The AST for a program captures its behavior in such a way that it is 30 easy for later stages of the compiler (e.g. code generation) to 31 interpret. We basically want one object for each construct in the 32 language, and the AST should closely model the language. In 33 Kaleidoscope, we have expressions, a prototype, and a function object. 34 We'll start with expressions first: 35 36 .. code-block:: c++ 37 38 /// ExprAST - Base class for all expression nodes. 39 class ExprAST { 40 public: 41 virtual ~ExprAST() {} 42 }; 43 44 /// NumberExprAST - Expression class for numeric literals like "1.0". 45 class NumberExprAST : public ExprAST { 46 double Val; 47 public: 48 NumberExprAST(double val) : Val(val) {} 49 }; 50 51 The code above shows the definition of the base ExprAST class and one 52 subclass which we use for numeric literals. The important thing to note 53 about this code is that the NumberExprAST class captures the numeric 54 value of the literal as an instance variable. This allows later phases 55 of the compiler to know what the stored numeric value is. 56 57 Right now we only create the AST, so there are no useful accessor 58 methods on them. It would be very easy to add a virtual method to pretty 59 print the code, for example. Here are the other expression AST node 60 definitions that we'll use in the basic form of the Kaleidoscope 61 language: 62 63 .. code-block:: c++ 64 65 /// VariableExprAST - Expression class for referencing a variable, like "a". 66 class VariableExprAST : public ExprAST { 67 std::string Name; 68 public: 69 VariableExprAST(const std::string &name) : Name(name) {} 70 }; 71 72 /// BinaryExprAST - Expression class for a binary operator. 73 class BinaryExprAST : public ExprAST { 74 char Op; 75 ExprAST *LHS, *RHS; 76 public: 77 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) 78 : Op(op), LHS(lhs), RHS(rhs) {} 79 }; 80 81 /// CallExprAST - Expression class for function calls. 82 class CallExprAST : public ExprAST { 83 std::string Callee; 84 std::vector<ExprAST*> Args; 85 public: 86 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args) 87 : Callee(callee), Args(args) {} 88 }; 89 90 This is all (intentionally) rather straight-forward: variables capture 91 the variable name, binary operators capture their opcode (e.g. '+'), and 92 calls capture a function name as well as a list of any argument 93 expressions. One thing that is nice about our AST is that it captures 94 the language features without talking about the syntax of the language. 95 Note that there is no discussion about precedence of binary operators, 96 lexical structure, etc. 97 98 For our basic language, these are all of the expression nodes we'll 99 define. Because it doesn't have conditional control flow, it isn't 100 Turing-complete; we'll fix that in a later installment. The two things 101 we need next are a way to talk about the interface to a function, and a 102 way to talk about functions themselves: 103 104 .. code-block:: c++ 105 106 /// PrototypeAST - This class represents the "prototype" for a function, 107 /// which captures its name, and its argument names (thus implicitly the number 108 /// of arguments the function takes). 109 class PrototypeAST { 110 std::string Name; 111 std::vector<std::string> Args; 112 public: 113 PrototypeAST(const std::string &name, const std::vector<std::string> &args) 114 : Name(name), Args(args) {} 115 }; 116 117 /// FunctionAST - This class represents a function definition itself. 118 class FunctionAST { 119 PrototypeAST *Proto; 120 ExprAST *Body; 121 public: 122 FunctionAST(PrototypeAST *proto, ExprAST *body) 123 : Proto(proto), Body(body) {} 124 }; 125 126 In Kaleidoscope, functions are typed with just a count of their 127 arguments. Since all values are double precision floating point, the 128 type of each argument doesn't need to be stored anywhere. In a more 129 aggressive and realistic language, the "ExprAST" class would probably 130 have a type field. 131 132 With this scaffolding, we can now talk about parsing expressions and 133 function bodies in Kaleidoscope. 134 135 Parser Basics 136 ============= 137 138 Now that we have an AST to build, we need to define the parser code to 139 build it. The idea here is that we want to parse something like "x+y" 140 (which is returned as three tokens by the lexer) into an AST that could 141 be generated with calls like this: 142 143 .. code-block:: c++ 144 145 ExprAST *X = new VariableExprAST("x"); 146 ExprAST *Y = new VariableExprAST("y"); 147 ExprAST *Result = new BinaryExprAST('+', X, Y); 148 149 In order to do this, we'll start by defining some basic helper routines: 150 151 .. code-block:: c++ 152 153 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current 154 /// token the parser is looking at. getNextToken reads another token from the 155 /// lexer and updates CurTok with its results. 156 static int CurTok; 157 static int getNextToken() { 158 return CurTok = gettok(); 159 } 160 161 This implements a simple token buffer around the lexer. This allows us 162 to look one token ahead at what the lexer is returning. Every function 163 in our parser will assume that CurTok is the current token that needs to 164 be parsed. 165 166 .. code-block:: c++ 167 168 169 /// Error* - These are little helper functions for error handling. 170 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;} 171 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; } 172 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; } 173 174 The ``Error`` routines are simple helper routines that our parser will 175 use to handle errors. The error recovery in our parser will not be the 176 best and is not particular user-friendly, but it will be enough for our 177 tutorial. These routines make it easier to handle errors in routines 178 that have various return types: they always return null. 179 180 With these basic helper functions, we can implement the first piece of 181 our grammar: numeric literals. 182 183 Basic Expression Parsing 184 ======================== 185 186 We start with numeric literals, because they are the simplest to 187 process. For each production in our grammar, we'll define a function 188 which parses that production. For numeric literals, we have: 189 190 .. code-block:: c++ 191 192 /// numberexpr ::= number 193 static ExprAST *ParseNumberExpr() { 194 ExprAST *Result = new NumberExprAST(NumVal); 195 getNextToken(); // consume the number 196 return Result; 197 } 198 199 This routine is very simple: it expects to be called when the current 200 token is a ``tok_number`` token. It takes the current number value, 201 creates a ``NumberExprAST`` node, advances the lexer to the next token, 202 and finally returns. 203 204 There are some interesting aspects to this. The most important one is 205 that this routine eats all of the tokens that correspond to the 206 production and returns the lexer buffer with the next token (which is 207 not part of the grammar production) ready to go. This is a fairly 208 standard way to go for recursive descent parsers. For a better example, 209 the parenthesis operator is defined like this: 210 211 .. code-block:: c++ 212 213 /// parenexpr ::= '(' expression ')' 214 static ExprAST *ParseParenExpr() { 215 getNextToken(); // eat (. 216 ExprAST *V = ParseExpression(); 217 if (!V) return 0; 218 219 if (CurTok != ')') 220 return Error("expected ')'"); 221 getNextToken(); // eat ). 222 return V; 223 } 224 225 This function illustrates a number of interesting things about the 226 parser: 227 228 1) It shows how we use the Error routines. When called, this function 229 expects that the current token is a '(' token, but after parsing the 230 subexpression, it is possible that there is no ')' waiting. For example, 231 if the user types in "(4 x" instead of "(4)", the parser should emit an 232 error. Because errors can occur, the parser needs a way to indicate that 233 they happened: in our parser, we return null on an error. 234 235 2) Another interesting aspect of this function is that it uses recursion 236 by calling ``ParseExpression`` (we will soon see that 237 ``ParseExpression`` can call ``ParseParenExpr``). This is powerful 238 because it allows us to handle recursive grammars, and keeps each 239 production very simple. Note that parentheses do not cause construction 240 of AST nodes themselves. While we could do it this way, the most 241 important role of parentheses are to guide the parser and provide 242 grouping. Once the parser constructs the AST, parentheses are not 243 needed. 244 245 The next simple production is for handling variable references and 246 function calls: 247 248 .. code-block:: c++ 249 250 /// identifierexpr 251 /// ::= identifier 252 /// ::= identifier '(' expression* ')' 253 static ExprAST *ParseIdentifierExpr() { 254 std::string IdName = IdentifierStr; 255 256 getNextToken(); // eat identifier. 257 258 if (CurTok != '(') // Simple variable ref. 259 return new VariableExprAST(IdName); 260 261 // Call. 262 getNextToken(); // eat ( 263 std::vector<ExprAST*> Args; 264 if (CurTok != ')') { 265 while (1) { 266 ExprAST *Arg = ParseExpression(); 267 if (!Arg) return 0; 268 Args.push_back(Arg); 269 270 if (CurTok == ')') break; 271 272 if (CurTok != ',') 273 return Error("Expected ')' or ',' in argument list"); 274 getNextToken(); 275 } 276 } 277 278 // Eat the ')'. 279 getNextToken(); 280 281 return new CallExprAST(IdName, Args); 282 } 283 284 This routine follows the same style as the other routines. (It expects 285 to be called if the current token is a ``tok_identifier`` token). It 286 also has recursion and error handling. One interesting aspect of this is 287 that it uses *look-ahead* to determine if the current identifier is a 288 stand alone variable reference or if it is a function call expression. 289 It handles this by checking to see if the token after the identifier is 290 a '(' token, constructing either a ``VariableExprAST`` or 291 ``CallExprAST`` node as appropriate. 292 293 Now that we have all of our simple expression-parsing logic in place, we 294 can define a helper function to wrap it together into one entry point. 295 We call this class of expressions "primary" expressions, for reasons 296 that will become more clear `later in the 297 tutorial <LangImpl6.html#unary>`_. In order to parse an arbitrary 298 primary expression, we need to determine what sort of expression it is: 299 300 .. code-block:: c++ 301 302 /// primary 303 /// ::= identifierexpr 304 /// ::= numberexpr 305 /// ::= parenexpr 306 static ExprAST *ParsePrimary() { 307 switch (CurTok) { 308 default: return Error("unknown token when expecting an expression"); 309 case tok_identifier: return ParseIdentifierExpr(); 310 case tok_number: return ParseNumberExpr(); 311 case '(': return ParseParenExpr(); 312 } 313 } 314 315 Now that you see the definition of this function, it is more obvious why 316 we can assume the state of CurTok in the various functions. This uses 317 look-ahead to determine which sort of expression is being inspected, and 318 then parses it with a function call. 319 320 Now that basic expressions are handled, we need to handle binary 321 expressions. They are a bit more complex. 322 323 Binary Expression Parsing 324 ========================= 325 326 Binary expressions are significantly harder to parse because they are 327 often ambiguous. For example, when given the string "x+y\*z", the parser 328 can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common 329 definitions from mathematics, we expect the later parse, because "\*" 330 (multiplication) has higher *precedence* than "+" (addition). 331 332 There are many ways to handle this, but an elegant and efficient way is 333 to use `Operator-Precedence 334 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_. 335 This parsing technique uses the precedence of binary operators to guide 336 recursion. To start with, we need a table of precedences: 337 338 .. code-block:: c++ 339 340 /// BinopPrecedence - This holds the precedence for each binary operator that is 341 /// defined. 342 static std::map<char, int> BinopPrecedence; 343 344 /// GetTokPrecedence - Get the precedence of the pending binary operator token. 345 static int GetTokPrecedence() { 346 if (!isascii(CurTok)) 347 return -1; 348 349 // Make sure it's a declared binop. 350 int TokPrec = BinopPrecedence[CurTok]; 351 if (TokPrec <= 0) return -1; 352 return TokPrec; 353 } 354 355 int main() { 356 // Install standard binary operators. 357 // 1 is lowest precedence. 358 BinopPrecedence['<'] = 10; 359 BinopPrecedence['+'] = 20; 360 BinopPrecedence['-'] = 20; 361 BinopPrecedence['*'] = 40; // highest. 362 ... 363 } 364 365 For the basic form of Kaleidoscope, we will only support 4 binary 366 operators (this can obviously be extended by you, our brave and intrepid 367 reader). The ``GetTokPrecedence`` function returns the precedence for 368 the current token, or -1 if the token is not a binary operator. Having a 369 map makes it easy to add new operators and makes it clear that the 370 algorithm doesn't depend on the specific operators involved, but it 371 would be easy enough to eliminate the map and do the comparisons in the 372 ``GetTokPrecedence`` function. (Or just use a fixed-size array). 373 374 With the helper above defined, we can now start parsing binary 375 expressions. The basic idea of operator precedence parsing is to break 376 down an expression with potentially ambiguous binary operators into 377 pieces. Consider ,for example, the expression "a+b+(c+d)\*e\*f+g". 378 Operator precedence parsing considers this as a stream of primary 379 expressions separated by binary operators. As such, it will first parse 380 the leading primary expression "a", then it will see the pairs [+, b] 381 [+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are 382 primary expressions, the binary expression parser doesn't need to worry 383 about nested subexpressions like (c+d) at all. 384 385 To start, an expression is a primary expression potentially followed by 386 a sequence of [binop,primaryexpr] pairs: 387 388 .. code-block:: c++ 389 390 /// expression 391 /// ::= primary binoprhs 392 /// 393 static ExprAST *ParseExpression() { 394 ExprAST *LHS = ParsePrimary(); 395 if (!LHS) return 0; 396 397 return ParseBinOpRHS(0, LHS); 398 } 399 400 ``ParseBinOpRHS`` is the function that parses the sequence of pairs for 401 us. It takes a precedence and a pointer to an expression for the part 402 that has been parsed so far. Note that "x" is a perfectly valid 403 expression: As such, "binoprhs" is allowed to be empty, in which case it 404 returns the expression that is passed into it. In our example above, the 405 code passes the expression for "a" into ``ParseBinOpRHS`` and the 406 current token is "+". 407 408 The precedence value passed into ``ParseBinOpRHS`` indicates the 409 *minimal operator precedence* that the function is allowed to eat. For 410 example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is 411 passed in a precedence of 40, it will not consume any tokens (because 412 the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS`` 413 starts with: 414 415 .. code-block:: c++ 416 417 /// binoprhs 418 /// ::= ('+' primary)* 419 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) { 420 // If this is a binop, find its precedence. 421 while (1) { 422 int TokPrec = GetTokPrecedence(); 423 424 // If this is a binop that binds at least as tightly as the current binop, 425 // consume it, otherwise we are done. 426 if (TokPrec < ExprPrec) 427 return LHS; 428 429 This code gets the precedence of the current token and checks to see if 430 if is too low. Because we defined invalid tokens to have a precedence of 431 -1, this check implicitly knows that the pair-stream ends when the token 432 stream runs out of binary operators. If this check succeeds, we know 433 that the token is a binary operator and that it will be included in this 434 expression: 435 436 .. code-block:: c++ 437 438 // Okay, we know this is a binop. 439 int BinOp = CurTok; 440 getNextToken(); // eat binop 441 442 // Parse the primary expression after the binary operator. 443 ExprAST *RHS = ParsePrimary(); 444 if (!RHS) return 0; 445 446 As such, this code eats (and remembers) the binary operator and then 447 parses the primary expression that follows. This builds up the whole 448 pair, the first of which is [+, b] for the running example. 449 450 Now that we parsed the left-hand side of an expression and one pair of 451 the RHS sequence, we have to decide which way the expression associates. 452 In particular, we could have "(a+b) binop unparsed" or "a + (b binop 453 unparsed)". To determine this, we look ahead at "binop" to determine its 454 precedence and compare it to BinOp's precedence (which is '+' in this 455 case): 456 457 .. code-block:: c++ 458 459 // If BinOp binds less tightly with RHS than the operator after RHS, let 460 // the pending operator take RHS as its LHS. 461 int NextPrec = GetTokPrecedence(); 462 if (TokPrec < NextPrec) { 463 464 If the precedence of the binop to the right of "RHS" is lower or equal 465 to the precedence of our current operator, then we know that the 466 parentheses associate as "(a+b) binop ...". In our example, the current 467 operator is "+" and the next operator is "+", we know that they have the 468 same precedence. In this case we'll create the AST node for "a+b", and 469 then continue parsing: 470 471 .. code-block:: c++ 472 473 ... if body omitted ... 474 } 475 476 // Merge LHS/RHS. 477 LHS = new BinaryExprAST(BinOp, LHS, RHS); 478 } // loop around to the top of the while loop. 479 } 480 481 In our example above, this will turn "a+b+" into "(a+b)" and execute the 482 next iteration of the loop, with "+" as the current token. The code 483 above will eat, remember, and parse "(c+d)" as the primary expression, 484 which makes the current pair equal to [+, (c+d)]. It will then evaluate 485 the 'if' conditional above with "\*" as the binop to the right of the 486 primary. In this case, the precedence of "\*" is higher than the 487 precedence of "+" so the if condition will be entered. 488 489 The critical question left here is "how can the if condition parse the 490 right hand side in full"? In particular, to build the AST correctly for 491 our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression 492 variable. The code to do this is surprisingly simple (code from the 493 above two blocks duplicated for context): 494 495 .. code-block:: c++ 496 497 // If BinOp binds less tightly with RHS than the operator after RHS, let 498 // the pending operator take RHS as its LHS. 499 int NextPrec = GetTokPrecedence(); 500 if (TokPrec < NextPrec) { 501 RHS = ParseBinOpRHS(TokPrec+1, RHS); 502 if (RHS == 0) return 0; 503 } 504 // Merge LHS/RHS. 505 LHS = new BinaryExprAST(BinOp, LHS, RHS); 506 } // loop around to the top of the while loop. 507 } 508 509 At this point, we know that the binary operator to the RHS of our 510 primary has higher precedence than the binop we are currently parsing. 511 As such, we know that any sequence of pairs whose operators are all 512 higher precedence than "+" should be parsed together and returned as 513 "RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function 514 specifying "TokPrec+1" as the minimum precedence required for it to 515 continue. In our example above, this will cause it to return the AST 516 node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+' 517 expression. 518 519 Finally, on the next iteration of the while loop, the "+g" piece is 520 parsed and added to the AST. With this little bit of code (14 521 non-trivial lines), we correctly handle fully general binary expression 522 parsing in a very elegant way. This was a whirlwind tour of this code, 523 and it is somewhat subtle. I recommend running through it with a few 524 tough examples to see how it works. 525 526 This wraps up handling of expressions. At this point, we can point the 527 parser at an arbitrary token stream and build an expression from it, 528 stopping at the first token that is not part of the expression. Next up 529 we need to handle function definitions, etc. 530 531 Parsing the Rest 532 ================ 533 534 The next thing missing is handling of function prototypes. In 535 Kaleidoscope, these are used both for 'extern' function declarations as 536 well as function body definitions. The code to do this is 537 straight-forward and not very interesting (once you've survived 538 expressions): 539 540 .. code-block:: c++ 541 542 /// prototype 543 /// ::= id '(' id* ')' 544 static PrototypeAST *ParsePrototype() { 545 if (CurTok != tok_identifier) 546 return ErrorP("Expected function name in prototype"); 547 548 std::string FnName = IdentifierStr; 549 getNextToken(); 550 551 if (CurTok != '(') 552 return ErrorP("Expected '(' in prototype"); 553 554 // Read the list of argument names. 555 std::vector<std::string> ArgNames; 556 while (getNextToken() == tok_identifier) 557 ArgNames.push_back(IdentifierStr); 558 if (CurTok != ')') 559 return ErrorP("Expected ')' in prototype"); 560 561 // success. 562 getNextToken(); // eat ')'. 563 564 return new PrototypeAST(FnName, ArgNames); 565 } 566 567 Given this, a function definition is very simple, just a prototype plus 568 an expression to implement the body: 569 570 .. code-block:: c++ 571 572 /// definition ::= 'def' prototype expression 573 static FunctionAST *ParseDefinition() { 574 getNextToken(); // eat def. 575 PrototypeAST *Proto = ParsePrototype(); 576 if (Proto == 0) return 0; 577 578 if (ExprAST *E = ParseExpression()) 579 return new FunctionAST(Proto, E); 580 return 0; 581 } 582 583 In addition, we support 'extern' to declare functions like 'sin' and 584 'cos' as well as to support forward declaration of user functions. These 585 'extern's are just prototypes with no body: 586 587 .. code-block:: c++ 588 589 /// external ::= 'extern' prototype 590 static PrototypeAST *ParseExtern() { 591 getNextToken(); // eat extern. 592 return ParsePrototype(); 593 } 594 595 Finally, we'll also let the user type in arbitrary top-level expressions 596 and evaluate them on the fly. We will handle this by defining anonymous 597 nullary (zero argument) functions for them: 598 599 .. code-block:: c++ 600 601 /// toplevelexpr ::= expression 602 static FunctionAST *ParseTopLevelExpr() { 603 if (ExprAST *E = ParseExpression()) { 604 // Make an anonymous proto. 605 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>()); 606 return new FunctionAST(Proto, E); 607 } 608 return 0; 609 } 610 611 Now that we have all the pieces, let's build a little driver that will 612 let us actually *execute* this code we've built! 613 614 The Driver 615 ========== 616 617 The driver for this simply invokes all of the parsing pieces with a 618 top-level dispatch loop. There isn't much interesting here, so I'll just 619 include the top-level loop. See `below <#code>`_ for full code in the 620 "Top-Level Parsing" section. 621 622 .. code-block:: c++ 623 624 /// top ::= definition | external | expression | ';' 625 static void MainLoop() { 626 while (1) { 627 fprintf(stderr, "ready> "); 628 switch (CurTok) { 629 case tok_eof: return; 630 case ';': getNextToken(); break; // ignore top-level semicolons. 631 case tok_def: HandleDefinition(); break; 632 case tok_extern: HandleExtern(); break; 633 default: HandleTopLevelExpression(); break; 634 } 635 } 636 } 637 638 The most interesting part of this is that we ignore top-level 639 semicolons. Why is this, you ask? The basic reason is that if you type 640 "4 + 5" at the command line, the parser doesn't know whether that is the 641 end of what you will type or not. For example, on the next line you 642 could type "def foo..." in which case 4+5 is the end of a top-level 643 expression. Alternatively you could type "\* 6", which would continue 644 the expression. Having top-level semicolons allows you to type "4+5;", 645 and the parser will know you are done. 646 647 Conclusions 648 =========== 649 650 With just under 400 lines of commented code (240 lines of non-comment, 651 non-blank code), we fully defined our minimal language, including a 652 lexer, parser, and AST builder. With this done, the executable will 653 validate Kaleidoscope code and tell us if it is grammatically invalid. 654 For example, here is a sample interaction: 655 656 .. code-block:: bash 657 658 $ ./a.out 659 ready> def foo(x y) x+foo(y, 4.0); 660 Parsed a function definition. 661 ready> def foo(x y) x+y y; 662 Parsed a function definition. 663 Parsed a top-level expr 664 ready> def foo(x y) x+y ); 665 Parsed a function definition. 666 Error: unknown token when expecting an expression 667 ready> extern sin(a); 668 ready> Parsed an extern 669 ready> ^D 670 $ 671 672 There is a lot of room for extension here. You can define new AST nodes, 673 extend the language in many ways, etc. In the `next 674 installment <LangImpl3.html>`_, we will describe how to generate LLVM 675 Intermediate Representation (IR) from the AST. 676 677 Full Code Listing 678 ================= 679 680 Here is the complete code listing for this and the previous chapter. 681 Note that it is fully self-contained: you don't need LLVM or any 682 external libraries at all for this. (Besides the C and C++ standard 683 libraries, of course.) To build this, just compile with: 684 685 .. code-block:: bash 686 687 # Compile 688 clang++ -g -O3 toy.cpp 689 # Run 690 ./a.out 691 692 Here is the code: 693 694 .. code-block:: c++ 695 696 #include <cstdio> 697 #include <cstdlib> 698 #include <string> 699 #include <map> 700 #include <vector> 701 702 //===----------------------------------------------------------------------===// 703 // Lexer 704 //===----------------------------------------------------------------------===// 705 706 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one 707 // of these for known things. 708 enum Token { 709 tok_eof = -1, 710 711 // commands 712 tok_def = -2, tok_extern = -3, 713 714 // primary 715 tok_identifier = -4, tok_number = -5 716 }; 717 718 static std::string IdentifierStr; // Filled in if tok_identifier 719 static double NumVal; // Filled in if tok_number 720 721 /// gettok - Return the next token from standard input. 722 static int gettok() { 723 static int LastChar = ' '; 724 725 // Skip any whitespace. 726 while (isspace(LastChar)) 727 LastChar = getchar(); 728 729 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]* 730 IdentifierStr = LastChar; 731 while (isalnum((LastChar = getchar()))) 732 IdentifierStr += LastChar; 733 734 if (IdentifierStr == "def") return tok_def; 735 if (IdentifierStr == "extern") return tok_extern; 736 return tok_identifier; 737 } 738 739 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+ 740 std::string NumStr; 741 do { 742 NumStr += LastChar; 743 LastChar = getchar(); 744 } while (isdigit(LastChar) || LastChar == '.'); 745 746 NumVal = strtod(NumStr.c_str(), 0); 747 return tok_number; 748 } 749 750 if (LastChar == '#') { 751 // Comment until end of line. 752 do LastChar = getchar(); 753 while (LastChar != EOF && LastChar != '\n' && LastChar != '\r'); 754 755 if (LastChar != EOF) 756 return gettok(); 757 } 758 759 // Check for end of file. Don't eat the EOF. 760 if (LastChar == EOF) 761 return tok_eof; 762 763 // Otherwise, just return the character as its ascii value. 764 int ThisChar = LastChar; 765 LastChar = getchar(); 766 return ThisChar; 767 } 768 769 //===----------------------------------------------------------------------===// 770 // Abstract Syntax Tree (aka Parse Tree) 771 //===----------------------------------------------------------------------===// 772 773 /// ExprAST - Base class for all expression nodes. 774 class ExprAST { 775 public: 776 virtual ~ExprAST() {} 777 }; 778 779 /// NumberExprAST - Expression class for numeric literals like "1.0". 780 class NumberExprAST : public ExprAST { 781 double Val; 782 public: 783 NumberExprAST(double val) : Val(val) {} 784 }; 785 786 /// VariableExprAST - Expression class for referencing a variable, like "a". 787 class VariableExprAST : public ExprAST { 788 std::string Name; 789 public: 790 VariableExprAST(const std::string &name) : Name(name) {} 791 }; 792 793 /// BinaryExprAST - Expression class for a binary operator. 794 class BinaryExprAST : public ExprAST { 795 char Op; 796 ExprAST *LHS, *RHS; 797 public: 798 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) 799 : Op(op), LHS(lhs), RHS(rhs) {} 800 }; 801 802 /// CallExprAST - Expression class for function calls. 803 class CallExprAST : public ExprAST { 804 std::string Callee; 805 std::vector<ExprAST*> Args; 806 public: 807 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args) 808 : Callee(callee), Args(args) {} 809 }; 810 811 /// PrototypeAST - This class represents the "prototype" for a function, 812 /// which captures its name, and its argument names (thus implicitly the number 813 /// of arguments the function takes). 814 class PrototypeAST { 815 std::string Name; 816 std::vector<std::string> Args; 817 public: 818 PrototypeAST(const std::string &name, const std::vector<std::string> &args) 819 : Name(name), Args(args) {} 820 821 }; 822 823 /// FunctionAST - This class represents a function definition itself. 824 class FunctionAST { 825 PrototypeAST *Proto; 826 ExprAST *Body; 827 public: 828 FunctionAST(PrototypeAST *proto, ExprAST *body) 829 : Proto(proto), Body(body) {} 830 831 }; 832 833 //===----------------------------------------------------------------------===// 834 // Parser 835 //===----------------------------------------------------------------------===// 836 837 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current 838 /// token the parser is looking at. getNextToken reads another token from the 839 /// lexer and updates CurTok with its results. 840 static int CurTok; 841 static int getNextToken() { 842 return CurTok = gettok(); 843 } 844 845 /// BinopPrecedence - This holds the precedence for each binary operator that is 846 /// defined. 847 static std::map<char, int> BinopPrecedence; 848 849 /// GetTokPrecedence - Get the precedence of the pending binary operator token. 850 static int GetTokPrecedence() { 851 if (!isascii(CurTok)) 852 return -1; 853 854 // Make sure it's a declared binop. 855 int TokPrec = BinopPrecedence[CurTok]; 856 if (TokPrec <= 0) return -1; 857 return TokPrec; 858 } 859 860 /// Error* - These are little helper functions for error handling. 861 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;} 862 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; } 863 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; } 864 865 static ExprAST *ParseExpression(); 866 867 /// identifierexpr 868 /// ::= identifier 869 /// ::= identifier '(' expression* ')' 870 static ExprAST *ParseIdentifierExpr() { 871 std::string IdName = IdentifierStr; 872 873 getNextToken(); // eat identifier. 874 875 if (CurTok != '(') // Simple variable ref. 876 return new VariableExprAST(IdName); 877 878 // Call. 879 getNextToken(); // eat ( 880 std::vector<ExprAST*> Args; 881 if (CurTok != ')') { 882 while (1) { 883 ExprAST *Arg = ParseExpression(); 884 if (!Arg) return 0; 885 Args.push_back(Arg); 886 887 if (CurTok == ')') break; 888 889 if (CurTok != ',') 890 return Error("Expected ')' or ',' in argument list"); 891 getNextToken(); 892 } 893 } 894 895 // Eat the ')'. 896 getNextToken(); 897 898 return new CallExprAST(IdName, Args); 899 } 900 901 /// numberexpr ::= number 902 static ExprAST *ParseNumberExpr() { 903 ExprAST *Result = new NumberExprAST(NumVal); 904 getNextToken(); // consume the number 905 return Result; 906 } 907 908 /// parenexpr ::= '(' expression ')' 909 static ExprAST *ParseParenExpr() { 910 getNextToken(); // eat (. 911 ExprAST *V = ParseExpression(); 912 if (!V) return 0; 913 914 if (CurTok != ')') 915 return Error("expected ')'"); 916 getNextToken(); // eat ). 917 return V; 918 } 919 920 /// primary 921 /// ::= identifierexpr 922 /// ::= numberexpr 923 /// ::= parenexpr 924 static ExprAST *ParsePrimary() { 925 switch (CurTok) { 926 default: return Error("unknown token when expecting an expression"); 927 case tok_identifier: return ParseIdentifierExpr(); 928 case tok_number: return ParseNumberExpr(); 929 case '(': return ParseParenExpr(); 930 } 931 } 932 933 /// binoprhs 934 /// ::= ('+' primary)* 935 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) { 936 // If this is a binop, find its precedence. 937 while (1) { 938 int TokPrec = GetTokPrecedence(); 939 940 // If this is a binop that binds at least as tightly as the current binop, 941 // consume it, otherwise we are done. 942 if (TokPrec < ExprPrec) 943 return LHS; 944 945 // Okay, we know this is a binop. 946 int BinOp = CurTok; 947 getNextToken(); // eat binop 948 949 // Parse the primary expression after the binary operator. 950 ExprAST *RHS = ParsePrimary(); 951 if (!RHS) return 0; 952 953 // If BinOp binds less tightly with RHS than the operator after RHS, let 954 // the pending operator take RHS as its LHS. 955 int NextPrec = GetTokPrecedence(); 956 if (TokPrec < NextPrec) { 957 RHS = ParseBinOpRHS(TokPrec+1, RHS); 958 if (RHS == 0) return 0; 959 } 960 961 // Merge LHS/RHS. 962 LHS = new BinaryExprAST(BinOp, LHS, RHS); 963 } 964 } 965 966 /// expression 967 /// ::= primary binoprhs 968 /// 969 static ExprAST *ParseExpression() { 970 ExprAST *LHS = ParsePrimary(); 971 if (!LHS) return 0; 972 973 return ParseBinOpRHS(0, LHS); 974 } 975 976 /// prototype 977 /// ::= id '(' id* ')' 978 static PrototypeAST *ParsePrototype() { 979 if (CurTok != tok_identifier) 980 return ErrorP("Expected function name in prototype"); 981 982 std::string FnName = IdentifierStr; 983 getNextToken(); 984 985 if (CurTok != '(') 986 return ErrorP("Expected '(' in prototype"); 987 988 std::vector<std::string> ArgNames; 989 while (getNextToken() == tok_identifier) 990 ArgNames.push_back(IdentifierStr); 991 if (CurTok != ')') 992 return ErrorP("Expected ')' in prototype"); 993 994 // success. 995 getNextToken(); // eat ')'. 996 997 return new PrototypeAST(FnName, ArgNames); 998 } 999 1000 /// definition ::= 'def' prototype expression 1001 static FunctionAST *ParseDefinition() { 1002 getNextToken(); // eat def. 1003 PrototypeAST *Proto = ParsePrototype(); 1004 if (Proto == 0) return 0; 1005 1006 if (ExprAST *E = ParseExpression()) 1007 return new FunctionAST(Proto, E); 1008 return 0; 1009 } 1010 1011 /// toplevelexpr ::= expression 1012 static FunctionAST *ParseTopLevelExpr() { 1013 if (ExprAST *E = ParseExpression()) { 1014 // Make an anonymous proto. 1015 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>()); 1016 return new FunctionAST(Proto, E); 1017 } 1018 return 0; 1019 } 1020 1021 /// external ::= 'extern' prototype 1022 static PrototypeAST *ParseExtern() { 1023 getNextToken(); // eat extern. 1024 return ParsePrototype(); 1025 } 1026 1027 //===----------------------------------------------------------------------===// 1028 // Top-Level parsing 1029 //===----------------------------------------------------------------------===// 1030 1031 static void HandleDefinition() { 1032 if (ParseDefinition()) { 1033 fprintf(stderr, "Parsed a function definition.\n"); 1034 } else { 1035 // Skip token for error recovery. 1036 getNextToken(); 1037 } 1038 } 1039 1040 static void HandleExtern() { 1041 if (ParseExtern()) { 1042 fprintf(stderr, "Parsed an extern\n"); 1043 } else { 1044 // Skip token for error recovery. 1045 getNextToken(); 1046 } 1047 } 1048 1049 static void HandleTopLevelExpression() { 1050 // Evaluate a top-level expression into an anonymous function. 1051 if (ParseTopLevelExpr()) { 1052 fprintf(stderr, "Parsed a top-level expr\n"); 1053 } else { 1054 // Skip token for error recovery. 1055 getNextToken(); 1056 } 1057 } 1058 1059 /// top ::= definition | external | expression | ';' 1060 static void MainLoop() { 1061 while (1) { 1062 fprintf(stderr, "ready> "); 1063 switch (CurTok) { 1064 case tok_eof: return; 1065 case ';': getNextToken(); break; // ignore top-level semicolons. 1066 case tok_def: HandleDefinition(); break; 1067 case tok_extern: HandleExtern(); break; 1068 default: HandleTopLevelExpression(); break; 1069 } 1070 } 1071 } 1072 1073 //===----------------------------------------------------------------------===// 1074 // Main driver code. 1075 //===----------------------------------------------------------------------===// 1076 1077 int main() { 1078 // Install standard binary operators. 1079 // 1 is lowest precedence. 1080 BinopPrecedence['<'] = 10; 1081 BinopPrecedence['+'] = 20; 1082 BinopPrecedence['-'] = 20; 1083 BinopPrecedence['*'] = 40; // highest. 1084 1085 // Prime the first token. 1086 fprintf(stderr, "ready> "); 1087 getNextToken(); 1088 1089 // Run the main "interpreter loop" now. 1090 MainLoop(); 1091 1092 return 0; 1093 } 1094 1095 `Next: Implementing Code Generation to LLVM IR <LangImpl3.html>`_ 1096 1097