Home | History | Annotate | Download | only in tutorial
      1 ===========================================
      2 Kaleidoscope: Implementing a Parser and AST
      3 ===========================================
      4 
      5 .. contents::
      6    :local:
      7 
      8 Chapter 2 Introduction
      9 ======================
     10 
     11 Welcome to Chapter 2 of the "`Implementing a language with LLVM in
     12 Objective Caml <index.html>`_" tutorial. This chapter shows you how to
     13 use the lexer, built in `Chapter 1 <OCamlLangImpl1.html>`_, to build a
     14 full `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our
     15 Kaleidoscope language. Once we have a parser, we'll define and build an
     16 `Abstract Syntax
     17 Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
     18 
     19 The parser we will build uses a combination of `Recursive Descent
     20 Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
     21 `Operator-Precedence
     22 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
     23 parse the Kaleidoscope language (the latter for binary expressions and
     24 the former for everything else). Before we get to parsing though, lets
     25 talk about the output of the parser: the Abstract Syntax Tree.
     26 
     27 The Abstract Syntax Tree (AST)
     28 ==============================
     29 
     30 The AST for a program captures its behavior in such a way that it is
     31 easy for later stages of the compiler (e.g. code generation) to
     32 interpret. We basically want one object for each construct in the
     33 language, and the AST should closely model the language. In
     34 Kaleidoscope, we have expressions, a prototype, and a function object.
     35 We'll start with expressions first:
     36 
     37 .. code-block:: ocaml
     38 
     39     (* expr - Base type for all expression nodes. *)
     40     type expr =
     41       (* variant for numeric literals like "1.0". *)
     42       | Number of float
     43 
     44 The code above shows the definition of the base ExprAST class and one
     45 subclass which we use for numeric literals. The important thing to note
     46 about this code is that the Number variant captures the numeric value of
     47 the literal as an instance variable. This allows later phases of the
     48 compiler to know what the stored numeric value is.
     49 
     50 Right now we only create the AST, so there are no useful functions on
     51 them. It would be very easy to add a function to pretty print the code,
     52 for example. Here are the other expression AST node definitions that
     53 we'll use in the basic form of the Kaleidoscope language:
     54 
     55 .. code-block:: ocaml
     56 
     57       (* variant for referencing a variable, like "a". *)
     58       | Variable of string
     59 
     60       (* variant for a binary operator. *)
     61       | Binary of char * expr * expr
     62 
     63       (* variant for function calls. *)
     64       | Call of string * expr array
     65 
     66 This is all (intentionally) rather straight-forward: variables capture
     67 the variable name, binary operators capture their opcode (e.g. '+'), and
     68 calls capture a function name as well as a list of any argument
     69 expressions. One thing that is nice about our AST is that it captures
     70 the language features without talking about the syntax of the language.
     71 Note that there is no discussion about precedence of binary operators,
     72 lexical structure, etc.
     73 
     74 For our basic language, these are all of the expression nodes we'll
     75 define. Because it doesn't have conditional control flow, it isn't
     76 Turing-complete; we'll fix that in a later installment. The two things
     77 we need next are a way to talk about the interface to a function, and a
     78 way to talk about functions themselves:
     79 
     80 .. code-block:: ocaml
     81 
     82     (* proto - This type represents the "prototype" for a function, which captures
     83      * its name, and its argument names (thus implicitly the number of arguments the
     84      * function takes). *)
     85     type proto = Prototype of string * string array
     86 
     87     (* func - This type represents a function definition itself. *)
     88     type func = Function of proto * expr
     89 
     90 In Kaleidoscope, functions are typed with just a count of their
     91 arguments. Since all values are double precision floating point, the
     92 type of each argument doesn't need to be stored anywhere. In a more
     93 aggressive and realistic language, the "expr" variants would probably
     94 have a type field.
     95 
     96 With this scaffolding, we can now talk about parsing expressions and
     97 function bodies in Kaleidoscope.
     98 
     99 Parser Basics
    100 =============
    101 
    102 Now that we have an AST to build, we need to define the parser code to
    103 build it. The idea here is that we want to parse something like "x+y"
    104 (which is returned as three tokens by the lexer) into an AST that could
    105 be generated with calls like this:
    106 
    107 .. code-block:: ocaml
    108 
    109       let x = Variable "x" in
    110       let y = Variable "y" in
    111       let result = Binary ('+', x, y) in
    112       ...
    113 
    114 The error handling routines make use of the builtin ``Stream.Failure``
    115 and ``Stream.Error``s. ``Stream.Failure`` is raised when the parser is
    116 unable to find any matching token in the first position of a pattern.
    117 ``Stream.Error`` is raised when the first token matches, but the rest do
    118 not. The error recovery in our parser will not be the best and is not
    119 particular user-friendly, but it will be enough for our tutorial. These
    120 exceptions make it easier to handle errors in routines that have various
    121 return types.
    122 
    123 With these basic types and exceptions, we can implement the first piece
    124 of our grammar: numeric literals.
    125 
    126 Basic Expression Parsing
    127 ========================
    128 
    129 We start with numeric literals, because they are the simplest to
    130 process. For each production in our grammar, we'll define a function
    131 which parses that production. We call this class of expressions
    132 "primary" expressions, for reasons that will become more clear `later in
    133 the tutorial <OCamlLangImpl6.html#user-defined-unary-operators>`_. In order to parse an
    134 arbitrary primary expression, we need to determine what sort of
    135 expression it is. For numeric literals, we have:
    136 
    137 .. code-block:: ocaml
    138 
    139     (* primary
    140      *   ::= identifier
    141      *   ::= numberexpr
    142      *   ::= parenexpr *)
    143     parse_primary = parser
    144       (* numberexpr ::= number *)
    145       | [< 'Token.Number n >] -> Ast.Number n
    146 
    147 This routine is very simple: it expects to be called when the current
    148 token is a ``Token.Number`` token. It takes the current number value,
    149 creates a ``Ast.Number`` node, advances the lexer to the next token, and
    150 finally returns.
    151 
    152 There are some interesting aspects to this. The most important one is
    153 that this routine eats all of the tokens that correspond to the
    154 production and returns the lexer buffer with the next token (which is
    155 not part of the grammar production) ready to go. This is a fairly
    156 standard way to go for recursive descent parsers. For a better example,
    157 the parenthesis operator is defined like this:
    158 
    159 .. code-block:: ocaml
    160 
    161       (* parenexpr ::= '(' expression ')' *)
    162       | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
    163 
    164 This function illustrates a number of interesting things about the
    165 parser:
    166 
    167 1) It shows how we use the ``Stream.Error`` exception. When called, this
    168 function expects that the current token is a '(' token, but after
    169 parsing the subexpression, it is possible that there is no ')' waiting.
    170 For example, if the user types in "(4 x" instead of "(4)", the parser
    171 should emit an error. Because errors can occur, the parser needs a way
    172 to indicate that they happened. In our parser, we use the camlp4
    173 shortcut syntax ``token ?? "parse error"``, where if the token before
    174 the ``??`` does not match, then ``Stream.Error "parse error"`` will be
    175 raised.
    176 
    177 2) Another interesting aspect of this function is that it uses recursion
    178 by calling ``Parser.parse_primary`` (we will soon see that
    179 ``Parser.parse_primary`` can call ``Parser.parse_primary``). This is
    180 powerful because it allows us to handle recursive grammars, and keeps
    181 each production very simple. Note that parentheses do not cause
    182 construction of AST nodes themselves. While we could do it this way, the
    183 most important role of parentheses are to guide the parser and provide
    184 grouping. Once the parser constructs the AST, parentheses are not
    185 needed.
    186 
    187 The next simple production is for handling variable references and
    188 function calls:
    189 
    190 .. code-block:: ocaml
    191 
    192       (* identifierexpr
    193        *   ::= identifier
    194        *   ::= identifier '(' argumentexpr ')' *)
    195       | [< 'Token.Ident id; stream >] ->
    196           let rec parse_args accumulator = parser
    197             | [< e=parse_expr; stream >] ->
    198                 begin parser
    199                   | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
    200                   | [< >] -> e :: accumulator
    201                 end stream
    202             | [< >] -> accumulator
    203           in
    204           let rec parse_ident id = parser
    205             (* Call. *)
    206             | [< 'Token.Kwd '(';
    207                  args=parse_args [];
    208                  'Token.Kwd ')' ?? "expected ')'">] ->
    209                 Ast.Call (id, Array.of_list (List.rev args))
    210 
    211             (* Simple variable ref. *)
    212             | [< >] -> Ast.Variable id
    213           in
    214           parse_ident id stream
    215 
    216 This routine follows the same style as the other routines. (It expects
    217 to be called if the current token is a ``Token.Ident`` token). It also
    218 has recursion and error handling. One interesting aspect of this is that
    219 it uses *look-ahead* to determine if the current identifier is a stand
    220 alone variable reference or if it is a function call expression. It
    221 handles this by checking to see if the token after the identifier is a
    222 '(' token, constructing either a ``Ast.Variable`` or ``Ast.Call`` node
    223 as appropriate.
    224 
    225 We finish up by raising an exception if we received a token we didn't
    226 expect:
    227 
    228 .. code-block:: ocaml
    229 
    230       | [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
    231 
    232 Now that basic expressions are handled, we need to handle binary
    233 expressions. They are a bit more complex.
    234 
    235 Binary Expression Parsing
    236 =========================
    237 
    238 Binary expressions are significantly harder to parse because they are
    239 often ambiguous. For example, when given the string "x+y\*z", the parser
    240 can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
    241 definitions from mathematics, we expect the later parse, because "\*"
    242 (multiplication) has higher *precedence* than "+" (addition).
    243 
    244 There are many ways to handle this, but an elegant and efficient way is
    245 to use `Operator-Precedence
    246 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
    247 This parsing technique uses the precedence of binary operators to guide
    248 recursion. To start with, we need a table of precedences:
    249 
    250 .. code-block:: ocaml
    251 
    252     (* binop_precedence - This holds the precedence for each binary operator that is
    253      * defined *)
    254     let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
    255 
    256     (* precedence - Get the precedence of the pending binary operator token. *)
    257     let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
    258 
    259     ...
    260 
    261     let main () =
    262       (* Install standard binary operators.
    263        * 1 is the lowest precedence. *)
    264       Hashtbl.add Parser.binop_precedence '<' 10;
    265       Hashtbl.add Parser.binop_precedence '+' 20;
    266       Hashtbl.add Parser.binop_precedence '-' 20;
    267       Hashtbl.add Parser.binop_precedence '*' 40;    (* highest. *)
    268       ...
    269 
    270 For the basic form of Kaleidoscope, we will only support 4 binary
    271 operators (this can obviously be extended by you, our brave and intrepid
    272 reader). The ``Parser.precedence`` function returns the precedence for
    273 the current token, or -1 if the token is not a binary operator. Having a
    274 ``Hashtbl.t`` makes it easy to add new operators and makes it clear that
    275 the algorithm doesn't depend on the specific operators involved, but it
    276 would be easy enough to eliminate the ``Hashtbl.t`` and do the
    277 comparisons in the ``Parser.precedence`` function. (Or just use a
    278 fixed-size array).
    279 
    280 With the helper above defined, we can now start parsing binary
    281 expressions. The basic idea of operator precedence parsing is to break
    282 down an expression with potentially ambiguous binary operators into
    283 pieces. Consider, for example, the expression "a+b+(c+d)\*e\*f+g".
    284 Operator precedence parsing considers this as a stream of primary
    285 expressions separated by binary operators. As such, it will first parse
    286 the leading primary expression "a", then it will see the pairs [+, b]
    287 [+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
    288 primary expressions, the binary expression parser doesn't need to worry
    289 about nested subexpressions like (c+d) at all.
    290 
    291 To start, an expression is a primary expression potentially followed by
    292 a sequence of [binop,primaryexpr] pairs:
    293 
    294 .. code-block:: ocaml
    295 
    296     (* expression
    297      *   ::= primary binoprhs *)
    298     and parse_expr = parser
    299       | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
    300 
    301 ``Parser.parse_bin_rhs`` is the function that parses the sequence of
    302 pairs for us. It takes a precedence and a pointer to an expression for
    303 the part that has been parsed so far. Note that "x" is a perfectly valid
    304 expression: As such, "binoprhs" is allowed to be empty, in which case it
    305 returns the expression that is passed into it. In our example above, the
    306 code passes the expression for "a" into ``Parser.parse_bin_rhs`` and the
    307 current token is "+".
    308 
    309 The precedence value passed into ``Parser.parse_bin_rhs`` indicates the
    310 *minimal operator precedence* that the function is allowed to eat. For
    311 example, if the current pair stream is [+, x] and
    312 ``Parser.parse_bin_rhs`` is passed in a precedence of 40, it will not
    313 consume any tokens (because the precedence of '+' is only 20). With this
    314 in mind, ``Parser.parse_bin_rhs`` starts with:
    315 
    316 .. code-block:: ocaml
    317 
    318     (* binoprhs
    319      *   ::= ('+' primary)* *)
    320     and parse_bin_rhs expr_prec lhs stream =
    321       match Stream.peek stream with
    322       (* If this is a binop, find its precedence. *)
    323       | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
    324           let token_prec = precedence c in
    325 
    326           (* If this is a binop that binds at least as tightly as the current binop,
    327            * consume it, otherwise we are done. *)
    328           if token_prec < expr_prec then lhs else begin
    329 
    330 This code gets the precedence of the current token and checks to see if
    331 if is too low. Because we defined invalid tokens to have a precedence of
    332 -1, this check implicitly knows that the pair-stream ends when the token
    333 stream runs out of binary operators. If this check succeeds, we know
    334 that the token is a binary operator and that it will be included in this
    335 expression:
    336 
    337 .. code-block:: ocaml
    338 
    339             (* Eat the binop. *)
    340             Stream.junk stream;
    341 
    342             (* Parse the primary expression after the binary operator *)
    343             let rhs = parse_primary stream in
    344 
    345             (* Okay, we know this is a binop. *)
    346             let rhs =
    347               match Stream.peek stream with
    348               | Some (Token.Kwd c2) ->
    349 
    350 As such, this code eats (and remembers) the binary operator and then
    351 parses the primary expression that follows. This builds up the whole
    352 pair, the first of which is [+, b] for the running example.
    353 
    354 Now that we parsed the left-hand side of an expression and one pair of
    355 the RHS sequence, we have to decide which way the expression associates.
    356 In particular, we could have "(a+b) binop unparsed" or "a + (b binop
    357 unparsed)". To determine this, we look ahead at "binop" to determine its
    358 precedence and compare it to BinOp's precedence (which is '+' in this
    359 case):
    360 
    361 .. code-block:: ocaml
    362 
    363                   (* If BinOp binds less tightly with rhs than the operator after
    364                    * rhs, let the pending operator take rhs as its lhs. *)
    365                   let next_prec = precedence c2 in
    366                   if token_prec < next_prec
    367 
    368 If the precedence of the binop to the right of "RHS" is lower or equal
    369 to the precedence of our current operator, then we know that the
    370 parentheses associate as "(a+b) binop ...". In our example, the current
    371 operator is "+" and the next operator is "+", we know that they have the
    372 same precedence. In this case we'll create the AST node for "a+b", and
    373 then continue parsing:
    374 
    375 .. code-block:: ocaml
    376 
    377               ... if body omitted ...
    378             in
    379 
    380             (* Merge lhs/rhs. *)
    381             let lhs = Ast.Binary (c, lhs, rhs) in
    382             parse_bin_rhs expr_prec lhs stream
    383           end
    384 
    385 In our example above, this will turn "a+b+" into "(a+b)" and execute the
    386 next iteration of the loop, with "+" as the current token. The code
    387 above will eat, remember, and parse "(c+d)" as the primary expression,
    388 which makes the current pair equal to [+, (c+d)]. It will then evaluate
    389 the 'if' conditional above with "\*" as the binop to the right of the
    390 primary. In this case, the precedence of "\*" is higher than the
    391 precedence of "+" so the if condition will be entered.
    392 
    393 The critical question left here is "how can the if condition parse the
    394 right hand side in full"? In particular, to build the AST correctly for
    395 our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
    396 variable. The code to do this is surprisingly simple (code from the
    397 above two blocks duplicated for context):
    398 
    399 .. code-block:: ocaml
    400 
    401               match Stream.peek stream with
    402               | Some (Token.Kwd c2) ->
    403                   (* If BinOp binds less tightly with rhs than the operator after
    404                    * rhs, let the pending operator take rhs as its lhs. *)
    405                   if token_prec < precedence c2
    406                   then parse_bin_rhs (token_prec + 1) rhs stream
    407                   else rhs
    408               | _ -> rhs
    409             in
    410 
    411             (* Merge lhs/rhs. *)
    412             let lhs = Ast.Binary (c, lhs, rhs) in
    413             parse_bin_rhs expr_prec lhs stream
    414           end
    415 
    416 At this point, we know that the binary operator to the RHS of our
    417 primary has higher precedence than the binop we are currently parsing.
    418 As such, we know that any sequence of pairs whose operators are all
    419 higher precedence than "+" should be parsed together and returned as
    420 "RHS". To do this, we recursively invoke the ``Parser.parse_bin_rhs``
    421 function specifying "token\_prec+1" as the minimum precedence required
    422 for it to continue. In our example above, this will cause it to return
    423 the AST node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of
    424 the '+' expression.
    425 
    426 Finally, on the next iteration of the while loop, the "+g" piece is
    427 parsed and added to the AST. With this little bit of code (14
    428 non-trivial lines), we correctly handle fully general binary expression
    429 parsing in a very elegant way. This was a whirlwind tour of this code,
    430 and it is somewhat subtle. I recommend running through it with a few
    431 tough examples to see how it works.
    432 
    433 This wraps up handling of expressions. At this point, we can point the
    434 parser at an arbitrary token stream and build an expression from it,
    435 stopping at the first token that is not part of the expression. Next up
    436 we need to handle function definitions, etc.
    437 
    438 Parsing the Rest
    439 ================
    440 
    441 The next thing missing is handling of function prototypes. In
    442 Kaleidoscope, these are used both for 'extern' function declarations as
    443 well as function body definitions. The code to do this is
    444 straight-forward and not very interesting (once you've survived
    445 expressions):
    446 
    447 .. code-block:: ocaml
    448 
    449     (* prototype
    450      *   ::= id '(' id* ')' *)
    451     let parse_prototype =
    452       let rec parse_args accumulator = parser
    453         | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
    454         | [< >] -> accumulator
    455       in
    456 
    457       parser
    458       | [< 'Token.Ident id;
    459            'Token.Kwd '(' ?? "expected '(' in prototype";
    460            args=parse_args [];
    461            'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
    462           (* success. *)
    463           Ast.Prototype (id, Array.of_list (List.rev args))
    464 
    465       | [< >] ->
    466           raise (Stream.Error "expected function name in prototype")
    467 
    468 Given this, a function definition is very simple, just a prototype plus
    469 an expression to implement the body:
    470 
    471 .. code-block:: ocaml
    472 
    473     (* definition ::= 'def' prototype expression *)
    474     let parse_definition = parser
    475       | [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
    476           Ast.Function (p, e)
    477 
    478 In addition, we support 'extern' to declare functions like 'sin' and
    479 'cos' as well as to support forward declaration of user functions. These
    480 'extern's are just prototypes with no body:
    481 
    482 .. code-block:: ocaml
    483 
    484     (*  external ::= 'extern' prototype *)
    485     let parse_extern = parser
    486       | [< 'Token.Extern; e=parse_prototype >] -> e
    487 
    488 Finally, we'll also let the user type in arbitrary top-level expressions
    489 and evaluate them on the fly. We will handle this by defining anonymous
    490 nullary (zero argument) functions for them:
    491 
    492 .. code-block:: ocaml
    493 
    494     (* toplevelexpr ::= expression *)
    495     let parse_toplevel = parser
    496       | [< e=parse_expr >] ->
    497           (* Make an anonymous proto. *)
    498           Ast.Function (Ast.Prototype ("", [||]), e)
    499 
    500 Now that we have all the pieces, let's build a little driver that will
    501 let us actually *execute* this code we've built!
    502 
    503 The Driver
    504 ==========
    505 
    506 The driver for this simply invokes all of the parsing pieces with a
    507 top-level dispatch loop. There isn't much interesting here, so I'll just
    508 include the top-level loop. See `below <#full-code-listing>`_ for full code in the
    509 "Top-Level Parsing" section.
    510 
    511 .. code-block:: ocaml
    512 
    513     (* top ::= definition | external | expression | ';' *)
    514     let rec main_loop stream =
    515       match Stream.peek stream with
    516       | None -> ()
    517 
    518       (* ignore top-level semicolons. *)
    519       | Some (Token.Kwd ';') ->
    520           Stream.junk stream;
    521           main_loop stream
    522 
    523       | Some token ->
    524           begin
    525             try match token with
    526             | Token.Def ->
    527                 ignore(Parser.parse_definition stream);
    528                 print_endline "parsed a function definition.";
    529             | Token.Extern ->
    530                 ignore(Parser.parse_extern stream);
    531                 print_endline "parsed an extern.";
    532             | _ ->
    533                 (* Evaluate a top-level expression into an anonymous function. *)
    534                 ignore(Parser.parse_toplevel stream);
    535                 print_endline "parsed a top-level expr";
    536             with Stream.Error s ->
    537               (* Skip token for error recovery. *)
    538               Stream.junk stream;
    539               print_endline s;
    540           end;
    541           print_string "ready> "; flush stdout;
    542           main_loop stream
    543 
    544 The most interesting part of this is that we ignore top-level
    545 semicolons. Why is this, you ask? The basic reason is that if you type
    546 "4 + 5" at the command line, the parser doesn't know whether that is the
    547 end of what you will type or not. For example, on the next line you
    548 could type "def foo..." in which case 4+5 is the end of a top-level
    549 expression. Alternatively you could type "\* 6", which would continue
    550 the expression. Having top-level semicolons allows you to type "4+5;",
    551 and the parser will know you are done.
    552 
    553 Conclusions
    554 ===========
    555 
    556 With just under 300 lines of commented code (240 lines of non-comment,
    557 non-blank code), we fully defined our minimal language, including a
    558 lexer, parser, and AST builder. With this done, the executable will
    559 validate Kaleidoscope code and tell us if it is grammatically invalid.
    560 For example, here is a sample interaction:
    561 
    562 .. code-block:: bash
    563 
    564     $ ./toy.byte
    565     ready> def foo(x y) x+foo(y, 4.0);
    566     Parsed a function definition.
    567     ready> def foo(x y) x+y y;
    568     Parsed a function definition.
    569     Parsed a top-level expr
    570     ready> def foo(x y) x+y );
    571     Parsed a function definition.
    572     Error: unknown token when expecting an expression
    573     ready> extern sin(a);
    574     ready> Parsed an extern
    575     ready> ^D
    576     $
    577 
    578 There is a lot of room for extension here. You can define new AST nodes,
    579 extend the language in many ways, etc. In the `next
    580 installment <OCamlLangImpl3.html>`_, we will describe how to generate
    581 LLVM Intermediate Representation (IR) from the AST.
    582 
    583 Full Code Listing
    584 =================
    585 
    586 Here is the complete code listing for this and the previous chapter.
    587 Note that it is fully self-contained: you don't need LLVM or any
    588 external libraries at all for this. (Besides the ocaml standard
    589 libraries, of course.) To build this, just compile with:
    590 
    591 .. code-block:: bash
    592 
    593     # Compile
    594     ocamlbuild toy.byte
    595     # Run
    596     ./toy.byte
    597 
    598 Here is the code:
    599 
    600 \_tags:
    601     ::
    602 
    603         <{lexer,parser}.ml>: use_camlp4, pp(camlp4of)
    604 
    605 token.ml:
    606     .. code-block:: ocaml
    607 
    608         (*===----------------------------------------------------------------------===
    609          * Lexer Tokens
    610          *===----------------------------------------------------------------------===*)
    611 
    612         (* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of
    613          * these others for known things. *)
    614         type token =
    615           (* commands *)
    616           | Def | Extern
    617 
    618           (* primary *)
    619           | Ident of string | Number of float
    620 
    621           (* unknown *)
    622           | Kwd of char
    623 
    624 lexer.ml:
    625     .. code-block:: ocaml
    626 
    627         (*===----------------------------------------------------------------------===
    628          * Lexer
    629          *===----------------------------------------------------------------------===*)
    630 
    631         let rec lex = parser
    632           (* Skip any whitespace. *)
    633           | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
    634 
    635           (* identifier: [a-zA-Z][a-zA-Z0-9] *)
    636           | [< ' ('A' .. 'Z' | 'a' .. 'z' as c); stream >] ->
    637               let buffer = Buffer.create 1 in
    638               Buffer.add_char buffer c;
    639               lex_ident buffer stream
    640 
    641           (* number: [0-9.]+ *)
    642           | [< ' ('0' .. '9' as c); stream >] ->
    643               let buffer = Buffer.create 1 in
    644               Buffer.add_char buffer c;
    645               lex_number buffer stream
    646 
    647           (* Comment until end of line. *)
    648           | [< ' ('#'); stream >] ->
    649               lex_comment stream
    650 
    651           (* Otherwise, just return the character as its ascii value. *)
    652           | [< 'c; stream >] ->
    653               [< 'Token.Kwd c; lex stream >]
    654 
    655           (* end of stream. *)
    656           | [< >] -> [< >]
    657 
    658         and lex_number buffer = parser
    659           | [< ' ('0' .. '9' | '.' as c); stream >] ->
    660               Buffer.add_char buffer c;
    661               lex_number buffer stream
    662           | [< stream=lex >] ->
    663               [< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >]
    664 
    665         and lex_ident buffer = parser
    666           | [< ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream >] ->
    667               Buffer.add_char buffer c;
    668               lex_ident buffer stream
    669           | [< stream=lex >] ->
    670               match Buffer.contents buffer with
    671               | "def" -> [< 'Token.Def; stream >]
    672               | "extern" -> [< 'Token.Extern; stream >]
    673               | id -> [< 'Token.Ident id; stream >]
    674 
    675         and lex_comment = parser
    676           | [< ' ('\n'); stream=lex >] -> stream
    677           | [< 'c; e=lex_comment >] -> e
    678           | [< >] -> [< >]
    679 
    680 ast.ml:
    681     .. code-block:: ocaml
    682 
    683         (*===----------------------------------------------------------------------===
    684          * Abstract Syntax Tree (aka Parse Tree)
    685          *===----------------------------------------------------------------------===*)
    686 
    687         (* expr - Base type for all expression nodes. *)
    688         type expr =
    689           (* variant for numeric literals like "1.0". *)
    690           | Number of float
    691 
    692           (* variant for referencing a variable, like "a". *)
    693           | Variable of string
    694 
    695           (* variant for a binary operator. *)
    696           | Binary of char * expr * expr
    697 
    698           (* variant for function calls. *)
    699           | Call of string * expr array
    700 
    701         (* proto - This type represents the "prototype" for a function, which captures
    702          * its name, and its argument names (thus implicitly the number of arguments the
    703          * function takes). *)
    704         type proto = Prototype of string * string array
    705 
    706         (* func - This type represents a function definition itself. *)
    707         type func = Function of proto * expr
    708 
    709 parser.ml:
    710     .. code-block:: ocaml
    711 
    712         (*===---------------------------------------------------------------------===
    713          * Parser
    714          *===---------------------------------------------------------------------===*)
    715 
    716         (* binop_precedence - This holds the precedence for each binary operator that is
    717          * defined *)
    718         let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
    719 
    720         (* precedence - Get the precedence of the pending binary operator token. *)
    721         let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
    722 
    723         (* primary
    724          *   ::= identifier
    725          *   ::= numberexpr
    726          *   ::= parenexpr *)
    727         let rec parse_primary = parser
    728           (* numberexpr ::= number *)
    729           | [< 'Token.Number n >] -> Ast.Number n
    730 
    731           (* parenexpr ::= '(' expression ')' *)
    732           | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
    733 
    734           (* identifierexpr
    735            *   ::= identifier
    736            *   ::= identifier '(' argumentexpr ')' *)
    737           | [< 'Token.Ident id; stream >] ->
    738               let rec parse_args accumulator = parser
    739                 | [< e=parse_expr; stream >] ->
    740                     begin parser
    741                       | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
    742                       | [< >] -> e :: accumulator
    743                     end stream
    744                 | [< >] -> accumulator
    745               in
    746               let rec parse_ident id = parser
    747                 (* Call. *)
    748                 | [< 'Token.Kwd '(';
    749                      args=parse_args [];
    750                      'Token.Kwd ')' ?? "expected ')'">] ->
    751                     Ast.Call (id, Array.of_list (List.rev args))
    752 
    753                 (* Simple variable ref. *)
    754                 | [< >] -> Ast.Variable id
    755               in
    756               parse_ident id stream
    757 
    758           | [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
    759 
    760         (* binoprhs
    761          *   ::= ('+' primary)* *)
    762         and parse_bin_rhs expr_prec lhs stream =
    763           match Stream.peek stream with
    764           (* If this is a binop, find its precedence. *)
    765           | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
    766               let token_prec = precedence c in
    767 
    768               (* If this is a binop that binds at least as tightly as the current binop,
    769                * consume it, otherwise we are done. *)
    770               if token_prec < expr_prec then lhs else begin
    771                 (* Eat the binop. *)
    772                 Stream.junk stream;
    773 
    774                 (* Parse the primary expression after the binary operator. *)
    775                 let rhs = parse_primary stream in
    776 
    777                 (* Okay, we know this is a binop. *)
    778                 let rhs =
    779                   match Stream.peek stream with
    780                   | Some (Token.Kwd c2) ->
    781                       (* If BinOp binds less tightly with rhs than the operator after
    782                        * rhs, let the pending operator take rhs as its lhs. *)
    783                       let next_prec = precedence c2 in
    784                       if token_prec < next_prec
    785                       then parse_bin_rhs (token_prec + 1) rhs stream
    786                       else rhs
    787                   | _ -> rhs
    788                 in
    789 
    790                 (* Merge lhs/rhs. *)
    791                 let lhs = Ast.Binary (c, lhs, rhs) in
    792                 parse_bin_rhs expr_prec lhs stream
    793               end
    794           | _ -> lhs
    795 
    796         (* expression
    797          *   ::= primary binoprhs *)
    798         and parse_expr = parser
    799           | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
    800 
    801         (* prototype
    802          *   ::= id '(' id* ')' *)
    803         let parse_prototype =
    804           let rec parse_args accumulator = parser
    805             | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
    806             | [< >] -> accumulator
    807           in
    808 
    809           parser
    810           | [< 'Token.Ident id;
    811                'Token.Kwd '(' ?? "expected '(' in prototype";
    812                args=parse_args [];
    813                'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
    814               (* success. *)
    815               Ast.Prototype (id, Array.of_list (List.rev args))
    816 
    817           | [< >] ->
    818               raise (Stream.Error "expected function name in prototype")
    819 
    820         (* definition ::= 'def' prototype expression *)
    821         let parse_definition = parser
    822           | [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
    823               Ast.Function (p, e)
    824 
    825         (* toplevelexpr ::= expression *)
    826         let parse_toplevel = parser
    827           | [< e=parse_expr >] ->
    828               (* Make an anonymous proto. *)
    829               Ast.Function (Ast.Prototype ("", [||]), e)
    830 
    831         (*  external ::= 'extern' prototype *)
    832         let parse_extern = parser
    833           | [< 'Token.Extern; e=parse_prototype >] -> e
    834 
    835 toplevel.ml:
    836     .. code-block:: ocaml
    837 
    838         (*===----------------------------------------------------------------------===
    839          * Top-Level parsing and JIT Driver
    840          *===----------------------------------------------------------------------===*)
    841 
    842         (* top ::= definition | external | expression | ';' *)
    843         let rec main_loop stream =
    844           match Stream.peek stream with
    845           | None -> ()
    846 
    847           (* ignore top-level semicolons. *)
    848           | Some (Token.Kwd ';') ->
    849               Stream.junk stream;
    850               main_loop stream
    851 
    852           | Some token ->
    853               begin
    854                 try match token with
    855                 | Token.Def ->
    856                     ignore(Parser.parse_definition stream);
    857                     print_endline "parsed a function definition.";
    858                 | Token.Extern ->
    859                     ignore(Parser.parse_extern stream);
    860                     print_endline "parsed an extern.";
    861                 | _ ->
    862                     (* Evaluate a top-level expression into an anonymous function. *)
    863                     ignore(Parser.parse_toplevel stream);
    864                     print_endline "parsed a top-level expr";
    865                 with Stream.Error s ->
    866                   (* Skip token for error recovery. *)
    867                   Stream.junk stream;
    868                   print_endline s;
    869               end;
    870               print_string "ready> "; flush stdout;
    871               main_loop stream
    872 
    873 toy.ml:
    874     .. code-block:: ocaml
    875 
    876         (*===----------------------------------------------------------------------===
    877          * Main driver code.
    878          *===----------------------------------------------------------------------===*)
    879 
    880         let main () =
    881           (* Install standard binary operators.
    882            * 1 is the lowest precedence. *)
    883           Hashtbl.add Parser.binop_precedence '<' 10;
    884           Hashtbl.add Parser.binop_precedence '+' 20;
    885           Hashtbl.add Parser.binop_precedence '-' 20;
    886           Hashtbl.add Parser.binop_precedence '*' 40;    (* highest. *)
    887 
    888           (* Prime the first token. *)
    889           print_string "ready> "; flush stdout;
    890           let stream = Lexer.lex (Stream.of_channel stdin) in
    891 
    892           (* Run the main "interpreter loop" now. *)
    893           Toplevel.main_loop stream;
    894         ;;
    895 
    896         main ()
    897 
    898 `Next: Implementing Code Generation to LLVM IR <OCamlLangImpl3.html>`_
    899 
    900