1 =========================================== 2 Kaleidoscope: Implementing a Parser and AST 3 =========================================== 4 5 .. contents:: 6 :local: 7 8 Chapter 2 Introduction 9 ====================== 10 11 Welcome to Chapter 2 of the "`Implementing a language with LLVM in 12 Objective Caml <index.html>`_" tutorial. This chapter shows you how to 13 use the lexer, built in `Chapter 1 <OCamlLangImpl1.html>`_, to build a 14 full `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our 15 Kaleidoscope language. Once we have a parser, we'll define and build an 16 `Abstract Syntax 17 Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST). 18 19 The parser we will build uses a combination of `Recursive Descent 20 Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and 21 `Operator-Precedence 22 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to 23 parse the Kaleidoscope language (the latter for binary expressions and 24 the former for everything else). Before we get to parsing though, lets 25 talk about the output of the parser: the Abstract Syntax Tree. 26 27 The Abstract Syntax Tree (AST) 28 ============================== 29 30 The AST for a program captures its behavior in such a way that it is 31 easy for later stages of the compiler (e.g. code generation) to 32 interpret. We basically want one object for each construct in the 33 language, and the AST should closely model the language. In 34 Kaleidoscope, we have expressions, a prototype, and a function object. 35 We'll start with expressions first: 36 37 .. code-block:: ocaml 38 39 (* expr - Base type for all expression nodes. *) 40 type expr = 41 (* variant for numeric literals like "1.0". *) 42 | Number of float 43 44 The code above shows the definition of the base ExprAST class and one 45 subclass which we use for numeric literals. The important thing to note 46 about this code is that the Number variant captures the numeric value of 47 the literal as an instance variable. This allows later phases of the 48 compiler to know what the stored numeric value is. 49 50 Right now we only create the AST, so there are no useful functions on 51 them. It would be very easy to add a function to pretty print the code, 52 for example. Here are the other expression AST node definitions that 53 we'll use in the basic form of the Kaleidoscope language: 54 55 .. code-block:: ocaml 56 57 (* variant for referencing a variable, like "a". *) 58 | Variable of string 59 60 (* variant for a binary operator. *) 61 | Binary of char * expr * expr 62 63 (* variant for function calls. *) 64 | Call of string * expr array 65 66 This is all (intentionally) rather straight-forward: variables capture 67 the variable name, binary operators capture their opcode (e.g. '+'), and 68 calls capture a function name as well as a list of any argument 69 expressions. One thing that is nice about our AST is that it captures 70 the language features without talking about the syntax of the language. 71 Note that there is no discussion about precedence of binary operators, 72 lexical structure, etc. 73 74 For our basic language, these are all of the expression nodes we'll 75 define. Because it doesn't have conditional control flow, it isn't 76 Turing-complete; we'll fix that in a later installment. The two things 77 we need next are a way to talk about the interface to a function, and a 78 way to talk about functions themselves: 79 80 .. code-block:: ocaml 81 82 (* proto - This type represents the "prototype" for a function, which captures 83 * its name, and its argument names (thus implicitly the number of arguments the 84 * function takes). *) 85 type proto = Prototype of string * string array 86 87 (* func - This type represents a function definition itself. *) 88 type func = Function of proto * expr 89 90 In Kaleidoscope, functions are typed with just a count of their 91 arguments. Since all values are double precision floating point, the 92 type of each argument doesn't need to be stored anywhere. In a more 93 aggressive and realistic language, the "expr" variants would probably 94 have a type field. 95 96 With this scaffolding, we can now talk about parsing expressions and 97 function bodies in Kaleidoscope. 98 99 Parser Basics 100 ============= 101 102 Now that we have an AST to build, we need to define the parser code to 103 build it. The idea here is that we want to parse something like "x+y" 104 (which is returned as three tokens by the lexer) into an AST that could 105 be generated with calls like this: 106 107 .. code-block:: ocaml 108 109 let x = Variable "x" in 110 let y = Variable "y" in 111 let result = Binary ('+', x, y) in 112 ... 113 114 The error handling routines make use of the builtin ``Stream.Failure`` 115 and ``Stream.Error``s. ``Stream.Failure`` is raised when the parser is 116 unable to find any matching token in the first position of a pattern. 117 ``Stream.Error`` is raised when the first token matches, but the rest do 118 not. The error recovery in our parser will not be the best and is not 119 particular user-friendly, but it will be enough for our tutorial. These 120 exceptions make it easier to handle errors in routines that have various 121 return types. 122 123 With these basic types and exceptions, we can implement the first piece 124 of our grammar: numeric literals. 125 126 Basic Expression Parsing 127 ======================== 128 129 We start with numeric literals, because they are the simplest to 130 process. For each production in our grammar, we'll define a function 131 which parses that production. We call this class of expressions 132 "primary" expressions, for reasons that will become more clear `later in 133 the tutorial <OCamlLangImpl6.html#unary>`_. In order to parse an 134 arbitrary primary expression, we need to determine what sort of 135 expression it is. For numeric literals, we have: 136 137 .. code-block:: ocaml 138 139 (* primary 140 * ::= identifier 141 * ::= numberexpr 142 * ::= parenexpr *) 143 parse_primary = parser 144 (* numberexpr ::= number *) 145 | [< 'Token.Number n >] -> Ast.Number n 146 147 This routine is very simple: it expects to be called when the current 148 token is a ``Token.Number`` token. It takes the current number value, 149 creates a ``Ast.Number`` node, advances the lexer to the next token, and 150 finally returns. 151 152 There are some interesting aspects to this. The most important one is 153 that this routine eats all of the tokens that correspond to the 154 production and returns the lexer buffer with the next token (which is 155 not part of the grammar production) ready to go. This is a fairly 156 standard way to go for recursive descent parsers. For a better example, 157 the parenthesis operator is defined like this: 158 159 .. code-block:: ocaml 160 161 (* parenexpr ::= '(' expression ')' *) 162 | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e 163 164 This function illustrates a number of interesting things about the 165 parser: 166 167 1) It shows how we use the ``Stream.Error`` exception. When called, this 168 function expects that the current token is a '(' token, but after 169 parsing the subexpression, it is possible that there is no ')' waiting. 170 For example, if the user types in "(4 x" instead of "(4)", the parser 171 should emit an error. Because errors can occur, the parser needs a way 172 to indicate that they happened. In our parser, we use the camlp4 173 shortcut syntax ``token ?? "parse error"``, where if the token before 174 the ``??`` does not match, then ``Stream.Error "parse error"`` will be 175 raised. 176 177 2) Another interesting aspect of this function is that it uses recursion 178 by calling ``Parser.parse_primary`` (we will soon see that 179 ``Parser.parse_primary`` can call ``Parser.parse_primary``). This is 180 powerful because it allows us to handle recursive grammars, and keeps 181 each production very simple. Note that parentheses do not cause 182 construction of AST nodes themselves. While we could do it this way, the 183 most important role of parentheses are to guide the parser and provide 184 grouping. Once the parser constructs the AST, parentheses are not 185 needed. 186 187 The next simple production is for handling variable references and 188 function calls: 189 190 .. code-block:: ocaml 191 192 (* identifierexpr 193 * ::= identifier 194 * ::= identifier '(' argumentexpr ')' *) 195 | [< 'Token.Ident id; stream >] -> 196 let rec parse_args accumulator = parser 197 | [< e=parse_expr; stream >] -> 198 begin parser 199 | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e 200 | [< >] -> e :: accumulator 201 end stream 202 | [< >] -> accumulator 203 in 204 let rec parse_ident id = parser 205 (* Call. *) 206 | [< 'Token.Kwd '('; 207 args=parse_args []; 208 'Token.Kwd ')' ?? "expected ')'">] -> 209 Ast.Call (id, Array.of_list (List.rev args)) 210 211 (* Simple variable ref. *) 212 | [< >] -> Ast.Variable id 213 in 214 parse_ident id stream 215 216 This routine follows the same style as the other routines. (It expects 217 to be called if the current token is a ``Token.Ident`` token). It also 218 has recursion and error handling. One interesting aspect of this is that 219 it uses *look-ahead* to determine if the current identifier is a stand 220 alone variable reference or if it is a function call expression. It 221 handles this by checking to see if the token after the identifier is a 222 '(' token, constructing either a ``Ast.Variable`` or ``Ast.Call`` node 223 as appropriate. 224 225 We finish up by raising an exception if we received a token we didn't 226 expect: 227 228 .. code-block:: ocaml 229 230 | [< >] -> raise (Stream.Error "unknown token when expecting an expression.") 231 232 Now that basic expressions are handled, we need to handle binary 233 expressions. They are a bit more complex. 234 235 Binary Expression Parsing 236 ========================= 237 238 Binary expressions are significantly harder to parse because they are 239 often ambiguous. For example, when given the string "x+y\*z", the parser 240 can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common 241 definitions from mathematics, we expect the later parse, because "\*" 242 (multiplication) has higher *precedence* than "+" (addition). 243 244 There are many ways to handle this, but an elegant and efficient way is 245 to use `Operator-Precedence 246 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_. 247 This parsing technique uses the precedence of binary operators to guide 248 recursion. To start with, we need a table of precedences: 249 250 .. code-block:: ocaml 251 252 (* binop_precedence - This holds the precedence for each binary operator that is 253 * defined *) 254 let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10 255 256 (* precedence - Get the precedence of the pending binary operator token. *) 257 let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1 258 259 ... 260 261 let main () = 262 (* Install standard binary operators. 263 * 1 is the lowest precedence. *) 264 Hashtbl.add Parser.binop_precedence '<' 10; 265 Hashtbl.add Parser.binop_precedence '+' 20; 266 Hashtbl.add Parser.binop_precedence '-' 20; 267 Hashtbl.add Parser.binop_precedence '*' 40; (* highest. *) 268 ... 269 270 For the basic form of Kaleidoscope, we will only support 4 binary 271 operators (this can obviously be extended by you, our brave and intrepid 272 reader). The ``Parser.precedence`` function returns the precedence for 273 the current token, or -1 if the token is not a binary operator. Having a 274 ``Hashtbl.t`` makes it easy to add new operators and makes it clear that 275 the algorithm doesn't depend on the specific operators involved, but it 276 would be easy enough to eliminate the ``Hashtbl.t`` and do the 277 comparisons in the ``Parser.precedence`` function. (Or just use a 278 fixed-size array). 279 280 With the helper above defined, we can now start parsing binary 281 expressions. The basic idea of operator precedence parsing is to break 282 down an expression with potentially ambiguous binary operators into 283 pieces. Consider ,for example, the expression "a+b+(c+d)\*e\*f+g". 284 Operator precedence parsing considers this as a stream of primary 285 expressions separated by binary operators. As such, it will first parse 286 the leading primary expression "a", then it will see the pairs [+, b] 287 [+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are 288 primary expressions, the binary expression parser doesn't need to worry 289 about nested subexpressions like (c+d) at all. 290 291 To start, an expression is a primary expression potentially followed by 292 a sequence of [binop,primaryexpr] pairs: 293 294 .. code-block:: ocaml 295 296 (* expression 297 * ::= primary binoprhs *) 298 and parse_expr = parser 299 | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream 300 301 ``Parser.parse_bin_rhs`` is the function that parses the sequence of 302 pairs for us. It takes a precedence and a pointer to an expression for 303 the part that has been parsed so far. Note that "x" is a perfectly valid 304 expression: As such, "binoprhs" is allowed to be empty, in which case it 305 returns the expression that is passed into it. In our example above, the 306 code passes the expression for "a" into ``Parser.parse_bin_rhs`` and the 307 current token is "+". 308 309 The precedence value passed into ``Parser.parse_bin_rhs`` indicates the 310 *minimal operator precedence* that the function is allowed to eat. For 311 example, if the current pair stream is [+, x] and 312 ``Parser.parse_bin_rhs`` is passed in a precedence of 40, it will not 313 consume any tokens (because the precedence of '+' is only 20). With this 314 in mind, ``Parser.parse_bin_rhs`` starts with: 315 316 .. code-block:: ocaml 317 318 (* binoprhs 319 * ::= ('+' primary)* *) 320 and parse_bin_rhs expr_prec lhs stream = 321 match Stream.peek stream with 322 (* If this is a binop, find its precedence. *) 323 | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c -> 324 let token_prec = precedence c in 325 326 (* If this is a binop that binds at least as tightly as the current binop, 327 * consume it, otherwise we are done. *) 328 if token_prec < expr_prec then lhs else begin 329 330 This code gets the precedence of the current token and checks to see if 331 if is too low. Because we defined invalid tokens to have a precedence of 332 -1, this check implicitly knows that the pair-stream ends when the token 333 stream runs out of binary operators. If this check succeeds, we know 334 that the token is a binary operator and that it will be included in this 335 expression: 336 337 .. code-block:: ocaml 338 339 (* Eat the binop. *) 340 Stream.junk stream; 341 342 (* Okay, we know this is a binop. *) 343 let rhs = 344 match Stream.peek stream with 345 | Some (Token.Kwd c2) -> 346 347 As such, this code eats (and remembers) the binary operator and then 348 parses the primary expression that follows. This builds up the whole 349 pair, the first of which is [+, b] for the running example. 350 351 Now that we parsed the left-hand side of an expression and one pair of 352 the RHS sequence, we have to decide which way the expression associates. 353 In particular, we could have "(a+b) binop unparsed" or "a + (b binop 354 unparsed)". To determine this, we look ahead at "binop" to determine its 355 precedence and compare it to BinOp's precedence (which is '+' in this 356 case): 357 358 .. code-block:: ocaml 359 360 (* If BinOp binds less tightly with rhs than the operator after 361 * rhs, let the pending operator take rhs as its lhs. *) 362 let next_prec = precedence c2 in 363 if token_prec < next_prec 364 365 If the precedence of the binop to the right of "RHS" is lower or equal 366 to the precedence of our current operator, then we know that the 367 parentheses associate as "(a+b) binop ...". In our example, the current 368 operator is "+" and the next operator is "+", we know that they have the 369 same precedence. In this case we'll create the AST node for "a+b", and 370 then continue parsing: 371 372 .. code-block:: ocaml 373 374 ... if body omitted ... 375 in 376 377 (* Merge lhs/rhs. *) 378 let lhs = Ast.Binary (c, lhs, rhs) in 379 parse_bin_rhs expr_prec lhs stream 380 end 381 382 In our example above, this will turn "a+b+" into "(a+b)" and execute the 383 next iteration of the loop, with "+" as the current token. The code 384 above will eat, remember, and parse "(c+d)" as the primary expression, 385 which makes the current pair equal to [+, (c+d)]. It will then evaluate 386 the 'if' conditional above with "\*" as the binop to the right of the 387 primary. In this case, the precedence of "\*" is higher than the 388 precedence of "+" so the if condition will be entered. 389 390 The critical question left here is "how can the if condition parse the 391 right hand side in full"? In particular, to build the AST correctly for 392 our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression 393 variable. The code to do this is surprisingly simple (code from the 394 above two blocks duplicated for context): 395 396 .. code-block:: ocaml 397 398 match Stream.peek stream with 399 | Some (Token.Kwd c2) -> 400 (* If BinOp binds less tightly with rhs than the operator after 401 * rhs, let the pending operator take rhs as its lhs. *) 402 if token_prec < precedence c2 403 then parse_bin_rhs (token_prec + 1) rhs stream 404 else rhs 405 | _ -> rhs 406 in 407 408 (* Merge lhs/rhs. *) 409 let lhs = Ast.Binary (c, lhs, rhs) in 410 parse_bin_rhs expr_prec lhs stream 411 end 412 413 At this point, we know that the binary operator to the RHS of our 414 primary has higher precedence than the binop we are currently parsing. 415 As such, we know that any sequence of pairs whose operators are all 416 higher precedence than "+" should be parsed together and returned as 417 "RHS". To do this, we recursively invoke the ``Parser.parse_bin_rhs`` 418 function specifying "token\_prec+1" as the minimum precedence required 419 for it to continue. In our example above, this will cause it to return 420 the AST node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of 421 the '+' expression. 422 423 Finally, on the next iteration of the while loop, the "+g" piece is 424 parsed and added to the AST. With this little bit of code (14 425 non-trivial lines), we correctly handle fully general binary expression 426 parsing in a very elegant way. This was a whirlwind tour of this code, 427 and it is somewhat subtle. I recommend running through it with a few 428 tough examples to see how it works. 429 430 This wraps up handling of expressions. At this point, we can point the 431 parser at an arbitrary token stream and build an expression from it, 432 stopping at the first token that is not part of the expression. Next up 433 we need to handle function definitions, etc. 434 435 Parsing the Rest 436 ================ 437 438 The next thing missing is handling of function prototypes. In 439 Kaleidoscope, these are used both for 'extern' function declarations as 440 well as function body definitions. The code to do this is 441 straight-forward and not very interesting (once you've survived 442 expressions): 443 444 .. code-block:: ocaml 445 446 (* prototype 447 * ::= id '(' id* ')' *) 448 let parse_prototype = 449 let rec parse_args accumulator = parser 450 | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e 451 | [< >] -> accumulator 452 in 453 454 parser 455 | [< 'Token.Ident id; 456 'Token.Kwd '(' ?? "expected '(' in prototype"; 457 args=parse_args []; 458 'Token.Kwd ')' ?? "expected ')' in prototype" >] -> 459 (* success. *) 460 Ast.Prototype (id, Array.of_list (List.rev args)) 461 462 | [< >] -> 463 raise (Stream.Error "expected function name in prototype") 464 465 Given this, a function definition is very simple, just a prototype plus 466 an expression to implement the body: 467 468 .. code-block:: ocaml 469 470 (* definition ::= 'def' prototype expression *) 471 let parse_definition = parser 472 | [< 'Token.Def; p=parse_prototype; e=parse_expr >] -> 473 Ast.Function (p, e) 474 475 In addition, we support 'extern' to declare functions like 'sin' and 476 'cos' as well as to support forward declaration of user functions. These 477 'extern's are just prototypes with no body: 478 479 .. code-block:: ocaml 480 481 (* external ::= 'extern' prototype *) 482 let parse_extern = parser 483 | [< 'Token.Extern; e=parse_prototype >] -> e 484 485 Finally, we'll also let the user type in arbitrary top-level expressions 486 and evaluate them on the fly. We will handle this by defining anonymous 487 nullary (zero argument) functions for them: 488 489 .. code-block:: ocaml 490 491 (* toplevelexpr ::= expression *) 492 let parse_toplevel = parser 493 | [< e=parse_expr >] -> 494 (* Make an anonymous proto. *) 495 Ast.Function (Ast.Prototype ("", [||]), e) 496 497 Now that we have all the pieces, let's build a little driver that will 498 let us actually *execute* this code we've built! 499 500 The Driver 501 ========== 502 503 The driver for this simply invokes all of the parsing pieces with a 504 top-level dispatch loop. There isn't much interesting here, so I'll just 505 include the top-level loop. See `below <#code>`_ for full code in the 506 "Top-Level Parsing" section. 507 508 .. code-block:: ocaml 509 510 (* top ::= definition | external | expression | ';' *) 511 let rec main_loop stream = 512 match Stream.peek stream with 513 | None -> () 514 515 (* ignore top-level semicolons. *) 516 | Some (Token.Kwd ';') -> 517 Stream.junk stream; 518 main_loop stream 519 520 | Some token -> 521 begin 522 try match token with 523 | Token.Def -> 524 ignore(Parser.parse_definition stream); 525 print_endline "parsed a function definition."; 526 | Token.Extern -> 527 ignore(Parser.parse_extern stream); 528 print_endline "parsed an extern."; 529 | _ -> 530 (* Evaluate a top-level expression into an anonymous function. *) 531 ignore(Parser.parse_toplevel stream); 532 print_endline "parsed a top-level expr"; 533 with Stream.Error s -> 534 (* Skip token for error recovery. *) 535 Stream.junk stream; 536 print_endline s; 537 end; 538 print_string "ready> "; flush stdout; 539 main_loop stream 540 541 The most interesting part of this is that we ignore top-level 542 semicolons. Why is this, you ask? The basic reason is that if you type 543 "4 + 5" at the command line, the parser doesn't know whether that is the 544 end of what you will type or not. For example, on the next line you 545 could type "def foo..." in which case 4+5 is the end of a top-level 546 expression. Alternatively you could type "\* 6", which would continue 547 the expression. Having top-level semicolons allows you to type "4+5;", 548 and the parser will know you are done. 549 550 Conclusions 551 =========== 552 553 With just under 300 lines of commented code (240 lines of non-comment, 554 non-blank code), we fully defined our minimal language, including a 555 lexer, parser, and AST builder. With this done, the executable will 556 validate Kaleidoscope code and tell us if it is grammatically invalid. 557 For example, here is a sample interaction: 558 559 .. code-block:: bash 560 561 $ ./toy.byte 562 ready> def foo(x y) x+foo(y, 4.0); 563 Parsed a function definition. 564 ready> def foo(x y) x+y y; 565 Parsed a function definition. 566 Parsed a top-level expr 567 ready> def foo(x y) x+y ); 568 Parsed a function definition. 569 Error: unknown token when expecting an expression 570 ready> extern sin(a); 571 ready> Parsed an extern 572 ready> ^D 573 $ 574 575 There is a lot of room for extension here. You can define new AST nodes, 576 extend the language in many ways, etc. In the `next 577 installment <OCamlLangImpl3.html>`_, we will describe how to generate 578 LLVM Intermediate Representation (IR) from the AST. 579 580 Full Code Listing 581 ================= 582 583 Here is the complete code listing for this and the previous chapter. 584 Note that it is fully self-contained: you don't need LLVM or any 585 external libraries at all for this. (Besides the ocaml standard 586 libraries, of course.) To build this, just compile with: 587 588 .. code-block:: bash 589 590 # Compile 591 ocamlbuild toy.byte 592 # Run 593 ./toy.byte 594 595 Here is the code: 596 597 \_tags: 598 :: 599 600 <{lexer,parser}.ml>: use_camlp4, pp(camlp4of) 601 602 token.ml: 603 .. code-block:: ocaml 604 605 (*===----------------------------------------------------------------------=== 606 * Lexer Tokens 607 *===----------------------------------------------------------------------===*) 608 609 (* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of 610 * these others for known things. *) 611 type token = 612 (* commands *) 613 | Def | Extern 614 615 (* primary *) 616 | Ident of string | Number of float 617 618 (* unknown *) 619 | Kwd of char 620 621 lexer.ml: 622 .. code-block:: ocaml 623 624 (*===----------------------------------------------------------------------=== 625 * Lexer 626 *===----------------------------------------------------------------------===*) 627 628 let rec lex = parser 629 (* Skip any whitespace. *) 630 | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream 631 632 (* identifier: [a-zA-Z][a-zA-Z0-9] *) 633 | [< ' ('A' .. 'Z' | 'a' .. 'z' as c); stream >] -> 634 let buffer = Buffer.create 1 in 635 Buffer.add_char buffer c; 636 lex_ident buffer stream 637 638 (* number: [0-9.]+ *) 639 | [< ' ('0' .. '9' as c); stream >] -> 640 let buffer = Buffer.create 1 in 641 Buffer.add_char buffer c; 642 lex_number buffer stream 643 644 (* Comment until end of line. *) 645 | [< ' ('#'); stream >] -> 646 lex_comment stream 647 648 (* Otherwise, just return the character as its ascii value. *) 649 | [< 'c; stream >] -> 650 [< 'Token.Kwd c; lex stream >] 651 652 (* end of stream. *) 653 | [< >] -> [< >] 654 655 and lex_number buffer = parser 656 | [< ' ('0' .. '9' | '.' as c); stream >] -> 657 Buffer.add_char buffer c; 658 lex_number buffer stream 659 | [< stream=lex >] -> 660 [< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >] 661 662 and lex_ident buffer = parser 663 | [< ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream >] -> 664 Buffer.add_char buffer c; 665 lex_ident buffer stream 666 | [< stream=lex >] -> 667 match Buffer.contents buffer with 668 | "def" -> [< 'Token.Def; stream >] 669 | "extern" -> [< 'Token.Extern; stream >] 670 | id -> [< 'Token.Ident id; stream >] 671 672 and lex_comment = parser 673 | [< ' ('\n'); stream=lex >] -> stream 674 | [< 'c; e=lex_comment >] -> e 675 | [< >] -> [< >] 676 677 ast.ml: 678 .. code-block:: ocaml 679 680 (*===----------------------------------------------------------------------=== 681 * Abstract Syntax Tree (aka Parse Tree) 682 *===----------------------------------------------------------------------===*) 683 684 (* expr - Base type for all expression nodes. *) 685 type expr = 686 (* variant for numeric literals like "1.0". *) 687 | Number of float 688 689 (* variant for referencing a variable, like "a". *) 690 | Variable of string 691 692 (* variant for a binary operator. *) 693 | Binary of char * expr * expr 694 695 (* variant for function calls. *) 696 | Call of string * expr array 697 698 (* proto - This type represents the "prototype" for a function, which captures 699 * its name, and its argument names (thus implicitly the number of arguments the 700 * function takes). *) 701 type proto = Prototype of string * string array 702 703 (* func - This type represents a function definition itself. *) 704 type func = Function of proto * expr 705 706 parser.ml: 707 .. code-block:: ocaml 708 709 (*===---------------------------------------------------------------------=== 710 * Parser 711 *===---------------------------------------------------------------------===*) 712 713 (* binop_precedence - This holds the precedence for each binary operator that is 714 * defined *) 715 let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10 716 717 (* precedence - Get the precedence of the pending binary operator token. *) 718 let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1 719 720 (* primary 721 * ::= identifier 722 * ::= numberexpr 723 * ::= parenexpr *) 724 let rec parse_primary = parser 725 (* numberexpr ::= number *) 726 | [< 'Token.Number n >] -> Ast.Number n 727 728 (* parenexpr ::= '(' expression ')' *) 729 | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e 730 731 (* identifierexpr 732 * ::= identifier 733 * ::= identifier '(' argumentexpr ')' *) 734 | [< 'Token.Ident id; stream >] -> 735 let rec parse_args accumulator = parser 736 | [< e=parse_expr; stream >] -> 737 begin parser 738 | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e 739 | [< >] -> e :: accumulator 740 end stream 741 | [< >] -> accumulator 742 in 743 let rec parse_ident id = parser 744 (* Call. *) 745 | [< 'Token.Kwd '('; 746 args=parse_args []; 747 'Token.Kwd ')' ?? "expected ')'">] -> 748 Ast.Call (id, Array.of_list (List.rev args)) 749 750 (* Simple variable ref. *) 751 | [< >] -> Ast.Variable id 752 in 753 parse_ident id stream 754 755 | [< >] -> raise (Stream.Error "unknown token when expecting an expression.") 756 757 (* binoprhs 758 * ::= ('+' primary)* *) 759 and parse_bin_rhs expr_prec lhs stream = 760 match Stream.peek stream with 761 (* If this is a binop, find its precedence. *) 762 | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c -> 763 let token_prec = precedence c in 764 765 (* If this is a binop that binds at least as tightly as the current binop, 766 * consume it, otherwise we are done. *) 767 if token_prec < expr_prec then lhs else begin 768 (* Eat the binop. *) 769 Stream.junk stream; 770 771 (* Parse the primary expression after the binary operator. *) 772 let rhs = parse_primary stream in 773 774 (* Okay, we know this is a binop. *) 775 let rhs = 776 match Stream.peek stream with 777 | Some (Token.Kwd c2) -> 778 (* If BinOp binds less tightly with rhs than the operator after 779 * rhs, let the pending operator take rhs as its lhs. *) 780 let next_prec = precedence c2 in 781 if token_prec < next_prec 782 then parse_bin_rhs (token_prec + 1) rhs stream 783 else rhs 784 | _ -> rhs 785 in 786 787 (* Merge lhs/rhs. *) 788 let lhs = Ast.Binary (c, lhs, rhs) in 789 parse_bin_rhs expr_prec lhs stream 790 end 791 | _ -> lhs 792 793 (* expression 794 * ::= primary binoprhs *) 795 and parse_expr = parser 796 | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream 797 798 (* prototype 799 * ::= id '(' id* ')' *) 800 let parse_prototype = 801 let rec parse_args accumulator = parser 802 | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e 803 | [< >] -> accumulator 804 in 805 806 parser 807 | [< 'Token.Ident id; 808 'Token.Kwd '(' ?? "expected '(' in prototype"; 809 args=parse_args []; 810 'Token.Kwd ')' ?? "expected ')' in prototype" >] -> 811 (* success. *) 812 Ast.Prototype (id, Array.of_list (List.rev args)) 813 814 | [< >] -> 815 raise (Stream.Error "expected function name in prototype") 816 817 (* definition ::= 'def' prototype expression *) 818 let parse_definition = parser 819 | [< 'Token.Def; p=parse_prototype; e=parse_expr >] -> 820 Ast.Function (p, e) 821 822 (* toplevelexpr ::= expression *) 823 let parse_toplevel = parser 824 | [< e=parse_expr >] -> 825 (* Make an anonymous proto. *) 826 Ast.Function (Ast.Prototype ("", [||]), e) 827 828 (* external ::= 'extern' prototype *) 829 let parse_extern = parser 830 | [< 'Token.Extern; e=parse_prototype >] -> e 831 832 toplevel.ml: 833 .. code-block:: ocaml 834 835 (*===----------------------------------------------------------------------=== 836 * Top-Level parsing and JIT Driver 837 *===----------------------------------------------------------------------===*) 838 839 (* top ::= definition | external | expression | ';' *) 840 let rec main_loop stream = 841 match Stream.peek stream with 842 | None -> () 843 844 (* ignore top-level semicolons. *) 845 | Some (Token.Kwd ';') -> 846 Stream.junk stream; 847 main_loop stream 848 849 | Some token -> 850 begin 851 try match token with 852 | Token.Def -> 853 ignore(Parser.parse_definition stream); 854 print_endline "parsed a function definition."; 855 | Token.Extern -> 856 ignore(Parser.parse_extern stream); 857 print_endline "parsed an extern."; 858 | _ -> 859 (* Evaluate a top-level expression into an anonymous function. *) 860 ignore(Parser.parse_toplevel stream); 861 print_endline "parsed a top-level expr"; 862 with Stream.Error s -> 863 (* Skip token for error recovery. *) 864 Stream.junk stream; 865 print_endline s; 866 end; 867 print_string "ready> "; flush stdout; 868 main_loop stream 869 870 toy.ml: 871 .. code-block:: ocaml 872 873 (*===----------------------------------------------------------------------=== 874 * Main driver code. 875 *===----------------------------------------------------------------------===*) 876 877 let main () = 878 (* Install standard binary operators. 879 * 1 is the lowest precedence. *) 880 Hashtbl.add Parser.binop_precedence '<' 10; 881 Hashtbl.add Parser.binop_precedence '+' 20; 882 Hashtbl.add Parser.binop_precedence '-' 20; 883 Hashtbl.add Parser.binop_precedence '*' 40; (* highest. *) 884 885 (* Prime the first token. *) 886 print_string "ready> "; flush stdout; 887 let stream = Lexer.lex (Stream.of_channel stdin) in 888 889 (* Run the main "interpreter loop" now. *) 890 Toplevel.main_loop stream; 891 ;; 892 893 main () 894 895 `Next: Implementing Code Generation to LLVM IR <OCamlLangImpl3.html>`_ 896 897