/// \page interop Interacting with the Generated Code
///
/// \section intro Introduction
///
/// The main way to interact with the generated code is via action code placed within <code>{</code> and
/// <code>}</code> characters in your rules. In general, you are advised to keep the code you embed within
/// these actions, and the grammar itself, to an absolute minimum. Rather than embed code directly in your
/// grammar, you should construct an API that is called from the actions within your grammar. This way
/// you will keep the grammar clean and maintainable and separate the code generators or other code
/// from the definition of the grammar itself.
///
/// However, when you wish to call your API functions, or insert small pieces of code that do not
/// warrant external functions, you will need to access elements of tokens, return elements from
/// parser rules and perhaps the internals of the recognizer itself. The C runtime provides a number
/// of MACROs that you can use within your action code. It also provides a number of performant
/// structures that you may find useful for building symbol tables, lists, tries, stacks, arrays and so on (all
/// of which are managed so that your memory allocation problems are minimized).
///
/// \section rules Parameters and Returns from Parser Rules
///
/// The C target does not differ from the Java target in any major ways here, and you should consult
/// the standard documentation for the use of parameters on rules and the returns clause. You should
/// be aware, though, that the rules generate C function calls and therefore the input and returns
/// clauses are subject to the constraints of C scoping.
///
/// You should note that if your parser rule returns more than a single entity, then the return
/// type of the generated rule function is a struct, which is returned by value. This is also the case
/// if your rule is part of a tree building grammar (one that uses the <code>output=AST;</code> option).
///
/// Other than the notes above, you can use any pre-declared type as an input or output parameter
/// for your rule.
///
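/// As an illustration of the struct return described above, the following minimal sketch mimics the shape
/// of a rule function that returns two values. The names <code>expr_return</code>, <code>expr</code> and
/// <code>twice</code> are hypothetical and written by hand for this page; the code ANTLR generates for your
/// rule will use its own names and may carry additional fields.
///
```c
#include <assert.h>

/* Hypothetical return type: one field per element named in a rule's
 * returns clause, for example "returns [int value, int isConst]". */
typedef struct expr_return {
    int value;    /* computed value of the expression       */
    int isConst;  /* non-zero if the expression is constant */
} expr_return;

/* Stand-in for a generated rule function. The struct is filled in
 * locally and copied out to the caller, so no heap allocation or
 * ownership transfer is involved. */
expr_return expr(int lhs, int rhs)
{
    expr_return retval;

    retval.value   = lhs + rhs;
    retval.isConst = 1;
    return retval;   /* returned by value */
}

/* The caller reads individual fields of the returned struct, much as
 * a $expr.value reference does inside a grammar action. */
int twice(int lhs, int rhs)
{
    expr_return r = expr(lhs, rhs);

    assert(r.isConst);
    return 2 * r.value;
}
```
///
/// Because the struct is copied to the caller, there is nothing to free afterwards, which is one reason a
/// by-value return suits small groups of results.
///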
/// \section memory Memory Management
///
/// You are responsible for allocating and freeing any memory used by your own
/// constructs; ANTLR will track and release any memory allocated internally for tokens, trees, stacks, scopes
/// and so on. This memory is returned to the malloc pool when you call the free method of any
/// ANTLR3 produced structure.
///
/// For performance reasons, and to avoid thrashing the malloc allocation system, memory for many elements
/// of your generated parser is allocated in chunks and parcelled out by factories. For instance, memory
/// for tokens is created as an array of tokens, and a token factory hands out the next available slot
/// to the lexer. When you free the lexer, the allocated memory is returned to the pool. The same applies
/// to 'strings' that contain the token text and various other text elements accessed within the lexer.
///
/// The only side effect of this is that after your parse and analysis is complete, if you wish to retain
/// anything generated automatically, you must copy it before freeing the recognizer structures. In practice
/// it is usually easiest to retain the recognizer context objects until your processing is complete or
/// to use your own allocation scheme for generating output etc.
///
/// The advantage of using object factories is of course that memory leaks and accessing de-allocated
/// memory are bugs that rarely occur within the ANTLR3 C runtime. Further, allocating memory for
/// tokens, trees and so on is very fast.
///
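/// The chunk-and-factory idea can be sketched in a few lines of plain C. This is not the runtime's token
/// factory: <code>token_factory</code>, <code>factory_new_token</code>, <code>factory_free</code> and the
/// tiny <code>POOL_SIZE</code> are all assumptions made for illustration. It only shows why handing out
/// slots from pre-allocated arrays is cheap and why everything disappears in one call.
///
```c
#include <assert.h>
#include <stdlib.h>

#define POOL_SIZE 4   /* tokens per chunk; small so the sketch is easy to trace */

typedef struct token { int type; int start; int stop; } token;

/* Hypothetical factory: owns every chunk it has allocated. */
typedef struct token_factory {
    token **pools;     /* array of chunk pointers                 */
    int     nPools;    /* number of chunks allocated so far       */
    int     nextSlot;  /* next free slot in the most recent chunk */
} token_factory;

void factory_init(token_factory *f)
{
    f->pools    = NULL;
    f->nPools   = 0;
    f->nextSlot = POOL_SIZE;   /* forces a chunk on the first request */
}

/* Hand out the next free slot, allocating a new chunk only when the
 * current one is full. Returns NULL if memory cannot be obtained. */
token *factory_new_token(token_factory *f)
{
    if (f->nextSlot == POOL_SIZE) {
        token **grown = realloc(f->pools, (size_t)(f->nPools + 1) * sizeof *grown);
        if (grown == NULL) return NULL;
        f->pools = grown;
        f->pools[f->nPools] = calloc(POOL_SIZE, sizeof(token));
        if (f->pools[f->nPools] == NULL) return NULL;
        f->nPools++;
        f->nextSlot = 0;
    }
    assert(f->nPools > 0);
    return &f->pools[f->nPools - 1][f->nextSlot++];
}

/* Release every chunk at once; all tokens handed out become invalid. */
void factory_free(token_factory *f)
{
    int i;

    for (i = 0; i < f->nPools; i++) free(f->pools[i]);
    free(f->pools);
    factory_init(f);
}
```
///
/// Note that no token is ever freed individually. That is the trade-off described above: anything you
/// want to keep must be copied out before the factory is freed.
///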
/// \section ctx The CTX Macro
///
/// The CTX macro is a fundamental parameter that is passed as the first parameter to any generated function
/// concerned with your lexer, parser, or tree parser. This is the context pointer for your generated
/// recognizer and is how you invoke the generated functions and access the data embedded within your generated
/// recognizer. While you can use it to directly access stacks, scopes and so on, this is not really recommended
/// as you should use the $xxx references that are available generically within ANTLR grammars.
///
/// The context pointer is used because this removes the need for any global/static variables at all, either
/// within the generated code or the C runtime. This is of course fundamental to creating free threading
/// recognizers. Wherever a function call or rule call requires the ctx parameter, you either reference it
/// via the CTX macro, or the ctx parameter is in fact the value returned by the 'constructor'
/// function for your parser/lexer/tree parser (see the code example in "How to build Generated Code").
///
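/// To make the role of the context pointer concrete, here is a minimal sketch of the pattern in plain C.
/// The type <code>my_parser</code>, its fields, <code>my_parser_new</code> and the one-line definition of
/// <code>CTX</code> are hypothetical stand-ins, not the generated code; they only show how state that
/// would otherwise be global travels with each recognizer instance.
///
```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical recognizer context: all state lives here, none in globals. */
typedef struct my_parser {
    int  errorCount;                              /* per-instance state  */
    void (*reportError)(struct my_parser *ctx);   /* "method" taking ctx */
    void (*destroy)(struct my_parser *ctx);
} my_parser, *p_my_parser;

/* Inside these functions the parameter is named ctx, so a CTX-style
 * macro can simply expand to it (an assumption made for this sketch). */
#define CTX ctx

static void my_report_error(p_my_parser ctx)
{
    assert(CTX != NULL);
    CTX->errorCount++;
}

static void my_destroy(p_my_parser ctx)
{
    free(CTX);
}

/* 'Constructor': the returned pointer is the ctx for every later call. */
p_my_parser my_parser_new(void)
{
    p_my_parser ctx = malloc(sizeof *ctx);

    if (ctx == NULL) return NULL;
    ctx->errorCount  = 0;
    ctx->reportError = my_report_error;
    ctx->destroy     = my_destroy;
    return ctx;
}
```
///
/// Two contexts created this way share nothing, so each can be driven from its own thread without locking.
///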
/// \section macros Pre-Defined convenience MACROs
///
/// While the author is not fond of using C MACROs to hide code or structure access, in the case of generated
/// code, they serve two useful purposes. The first is to simplify the references to internal constructs;
/// the second is to facilitate the change of any internal interface without requiring you to port grammars
/// from earlier versions (just regenerate and recompile). As of release 3.1, these macros are stable and
/// will only change their usage interface in the event of bugs being discovered. You are encouraged to
/// use these macros in your code, rather than access the raw interface.
///
/// \b NB: Macros that act like statements must be terminated with a ';'. The macro body does not
/// supply this, nor should it. Macros that call functions are declared with () even if they
/// have no parameters; macros that reference fields do not have a () declaration.
///
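/// The two macro forms mentioned in the note can be sketched as follows. The macro bodies, the
/// <code>lexer_ctx</code> type and the <code>action</code> function are invented for this example and are
/// not the definitions found in the generated code; only the calling convention is the point.
///
```c
#include <assert.h>
#include <stddef.h>

typedef struct lexer_ctx {
    int line;                                /* current line number     */
    int user1;                               /* a plain data field      */
    int (*getLine)(struct lexer_ctx *ctx);   /* function-pointer method */
} lexer_ctx;

int lexer_get_line(lexer_ctx *ctx) { return ctx->line; }

/* Function-style macro: declared with () because it makes a call.
 * There is no trailing ';' in the body, so the caller supplies it. */
#define GETLINE()  (ctx->getLine(ctx))

/* Field-style macro: no (), and usable on the left of an assignment. */
#define USER1      (ctx->user1)

/* Stand-in for a rule action; ctx must be in scope for the macros. */
int action(lexer_ctx *ctx)
{
    assert(ctx != NULL);
    USER1 = 99;                 /* statement terminated by the caller   */
    return GETLINE() + USER1;   /* both forms also work in expressions  */
}
```
///
/// Leaving the ';' out of the macro body is what lets the same macro appear both as a statement and inside
/// a larger expression.
///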
/// \section lexermacros Lexer Macros
///
/// There are a number of macros that are useful exclusively within lexer rules. There are additional
/// macros, common to all recognizers, and these are documented in the section "Macros Common to All Recognizers".
///
/// \subsection lexer LEXER
///
/// The <code>LEXER</code> macro returns a pointer to the base lexer object, which is of type #pANTLR3_LEXER. This is
/// not the pointer to your generated lexer, which is supplied by the CTX macro,
/// but to the common implementation of a lexer interface,
/// which is supplied to all generated lexers.
///
/// \subsection lexstate LEXSTATE
///
/// Provides a pointer to the lexer shared state structure, which is where the tokens for a
/// rule are constructed and the status elements of the lexer are kept. This pointer is of type
/// #pANTLR3_RECOGNIZER_SHARED_STATE. In general you should only access elements of this structure
/// if there is not already another MACRO or standard $xxxx antlr reference that refers to it.
///
/// \subsection la LA(n)
///
/// The <code>LA</code> macro returns the character at offset n from the current input stream index. The return
/// type is #ANTLR3_UINT32. Hence <code>LA(1)</code> returns the character at the current input position (the
/// character that will be consumed next), <code>LA(-1)</code> returns the character that has just been consumed
/// and so on. The <code>LA(n)</code> macro is useful for constructing semantic predicates in lexer rules. The
/// reference <code>LA(0)</code> is undefined and will cause an error in your lexer.
///
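/// The offset convention is easy to get wrong by one, so here is a small stand-alone model of it. The
/// <code>char_stream</code> type, the <code>la</code> and <code>consume</code> functions and the
/// <code>CHAR_EOF</code> value are assumptions for this sketch; the real macro works on the runtime's own
/// input stream and its behaviour at the stream boundaries is not modelled here.
///
```c
#include <assert.h>
#include <stddef.h>

#define CHAR_EOF 0xFFFFFFFFu   /* hypothetical end-of-input marker */

typedef struct char_stream {
    const char *data;   /* whole input                       */
    size_t      size;   /* number of characters in data      */
    size_t      next;   /* index of the character LA(1) sees */
} char_stream;

/* Lookahead relative to the current position:
 *   la(s,  1) is the character about to be consumed,
 *   la(s, -1) is the character just consumed.
 * n == 0 is meaningless, so this sketch rejects it outright. */
unsigned int la(const char_stream *s, int n)
{
    long idx;

    assert(n != 0);
    idx = (long)s->next + (n > 0 ? n - 1 : n);
    if (idx < 0 || (size_t)idx >= s->size) return CHAR_EOF;
    return (unsigned char)s->data[idx];
}

/* Advance past one character unless the input is exhausted. */
void consume(char_stream *s)
{
    if (s->next < s->size) s->next++;
}
```
///
/// A lexer predicate built on this convention might test the next character to decide whether a '.'
/// starts a range operator or continues a number.
///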
/// \subsection getcharindex GETCHARINDEX()
///
/// The <code>GETCHARINDEX</code> macro returns the index of the current character position as a 0 based
/// offset from the start of the input stream. It returns a value type of #ANTLR3_UINT32.
///
/// \subsection getline GETLINE()
///
/// The <code>GETLINE</code> macro returns the line number of the current character (<code>LA(1)</code>) in the input
/// stream. It returns a value type of #ANTLR3_UINT32. Note that the line number is incremented
/// automatically by an input stream when it sees the input character '\n'. The character that causes
/// the line number to increment can be changed by calling the SetNewLineChar() method on the input
/// stream before invoking the lexer and after creating the input stream.
///
/// \subsection gettext GETTEXT()
///
/// The <code>GETTEXT</code> macro returns the text currently matched by the lexer rule. In general you should use the
/// generic $text reference in ANTLR to retrieve this. The return type is a reference type of #pANTLR3_STRING
/// which allows you to manipulate the text you have retrieved (\b NB this does not change the input stream,
/// only the text you copy from the input stream when you use this MACRO or $text).
///
/// The reference $text->chars or GETTEXT()->chars will reference a pointer to the '\\0' terminated character
/// string that the ANTLR3 #pANTLR3_STRING represents. String space is allocated automatically as well as
/// the structure that holds the string. The #pANTLR3_STRING_FACTORY associated with the lexer handles this
/// and when you close the lexer, it will automatically free any space allocated for strings and their structures.
///
/// \subsection getcharpositioninline GETCHARPOSITIONINLINE()
///
/// The <code>GETCHARPOSITIONINLINE</code> macro returns the zero based offset of character <code>LA(1)</code>
/// from the start of the current input line. See the macro <code>GETLINE</code> for details on what the
/// line number means.
///
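/// How the line number and the position in line relate to each other can be sketched with a small
/// tracker. Everything here is hypothetical (<code>line_tracker</code>, <code>tracker_init</code>,
/// <code>tracker_consume</code>), and the choice to start line numbers at 1 is an assumption of the
/// sketch. The <code>newlineChar</code> field plays the role of the configurable newline character
/// described under <code>GETLINE</code>.
///
```c
#include <assert.h>
#include <stddef.h>

typedef struct line_tracker {
    const char  *data;
    size_t       size;
    size_t       next;          /* 0-based offset of the next character */
    unsigned int line;          /* line of the next character, from 1   */
    unsigned int charPosition;  /* 0-based column of the next character */
    char         newlineChar;   /* character that ends a line           */
} line_tracker;

void tracker_init(line_tracker *t, const char *data, size_t size)
{
    t->data = data;
    t->size = size;
    t->next = 0;
    t->line = 1;
    t->charPosition = 0;
    t->newlineChar  = '\n';     /* default; callers may overwrite it */
}

/* Consume one character, updating line and column as a side effect. */
void tracker_consume(line_tracker *t)
{
    assert(t->next <= t->size);
    if (t->next == t->size) return;          /* already at end of input */
    if (t->data[t->next] == t->newlineChar) {
        t->line++;
        t->charPosition = 0;
    } else {
        t->charPosition++;
    }
    t->next++;
}
```
///
/// The newline character is set after initialisation and before any character is consumed, which mirrors
/// the ordering requirement stated above for the real input stream.
///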
/// \subsection emit EMIT()
///
/// The macro <code>EMIT</code> causes the text range currently matched by the lexer rule to be emitted
/// immediately as the token for the rule. Subsequent text is matched but ignored. The type used for
/// the token is the name of the lexer rule or, if you have changed this by using $type = XXX;, the type
/// XXX is used.
///
/// \subsection emitnew EMITNEW(t)
///
/// The macro <code>EMITNEW</code> causes the supplied token reference <code>t</code> to be used as the
/// token emitted by the rule. The parameter <code>t</code> must be of type #pANTLR3_COMMON_TOKEN.
///
/// \subsection index INDEX()
///
/// The <code>INDEX</code> macro returns the current input position according to the input stream. It is not
/// guaranteed to be the character offset in the input stream but is instead used as a value
/// for marking and rewinding to specific points in the input stream. Use the macro <code>GETCHARINDEX()</code>
/// to find out the position of the <code>LA(1)</code> in the input stream.
///
/// \subsection pushstream PUSHSTREAM(str)
///
/// The <code>PUSHSTREAM</code> macro, in conjunction with the <code>POPSTREAM</code> macro (usually called internally
/// by the runtime), can be used to stack many input streams to the lexer, and implement constructs such as
/// the C pre-processor \#include directive.
///
/// An input stream that is pushed on to the stack becomes the current input stream for the lexer and
/// the state of the previous stream is automatically saved. The input stream will be automatically
/// popped from the stack when it is exhausted by the lexer. You may use the macro <code>POPSTREAM</code>
/// to return to the previous input stream prior to exhausting the currently stacked input stream.
///
/// Here is an example of using the macro in a lexer to implement the C \#include pre-processor directive:
///
/// \code
/// fragment
/// STRING_GUTS : (~('\\'|'"') )* ;
///
/// LINE_COMMAND
/// : '#' (' ' | '\t')*
///     (
///         'include' (' ' | '\t')+ '"' file = STRING_GUTS '"' (' ' | '\t')* '\r'? '\n'
///             {
///                 pANTLR3_STRING          fName;
///                 pANTLR3_INPUT_STREAM    in;
///
///                 // Create an initial string, then take a substring
///                 // We can do this by messing with the start and end
///                 // pointers of tokens and so on. This shows a reasonable way to
///                 // manipulate strings.
///                 //
///                 fName = $file.text;
///                 printf("Including file '%s'\n", fName->chars);
///
///                 // Create a new input stream and take advantage of built in stream stacking
///                 // in C target runtime.
///                 //
///                 in = antlr38BitFileStreamNew(fName->chars);
///                 PUSHSTREAM(in);
///
///                 // Note that the input stream is not closed when it EOFs, I don't bother
///                 // to do it here, but it is up to you to track streams created like this
///                 // and destroy them when the whole parse session is complete. Remember that you
///                 // don't want to do this until all tokens have been manipulated all the way through
///                 // your tree parsers etc as the token does not store the text it just refers
///                 // back to the input stream and trying to get the text for it will abort if you
///                 // close the input stream too early.
///                 //
///
///             }
///         | (('0'..'9')=>('0'..'9'))+ ~('\n'|'\r')* '\r'? '\n'
///     )
///     {$channel=HIDDEN;}
///     ;
/// \endcode
///
/// \subsection popstream POPSTREAM()
///
/// Assuming that you have stacked an input stream using the PUSHSTREAM macro, you can use this macro to
/// remove it from the stream stack and revert to the previous input stream. You should be careful
/// to pop the stream at an appropriate point in your lexer action, so you do not match characters
/// from one stream with those from another in the same rule (unless this is what you want to do).
///
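/// The stacking behaviour behind <code>PUSHSTREAM</code> and <code>POPSTREAM</code> can be modelled in
/// plain C as shown below. This is only a sketch of the idea: <code>stream_stack</code>,
/// <code>push_stream</code>, <code>pop_stream</code>, <code>next_char</code> and the fixed
/// <code>MAX_DEPTH</code> are hypothetical, and strings stand in for real input streams.
///
```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8   /* arbitrary limit for this sketch */

typedef struct stream { const char *data; size_t next; } stream;

typedef struct stream_stack {
    stream saved[MAX_DEPTH];   /* suspended streams, innermost last */
    int    depth;
    stream current;            /* stream being read right now       */
} stream_stack;

/* Save the current stream and its position, then switch to a new one. */
int push_stream(stream_stack *ss, const char *data)
{
    if (ss->depth == MAX_DEPTH) return 0;
    ss->saved[ss->depth++] = ss->current;
    ss->current.data = data;
    ss->current.next = 0;
    return 1;
}

/* Return to the most recently suspended stream, exactly where it stopped. */
int pop_stream(stream_stack *ss)
{
    if (ss->depth == 0) return 0;
    ss->current = ss->saved[--ss->depth];
    return 1;
}

/* Read one character; an exhausted stream is popped automatically.
 * Returns -1 only when every stream is exhausted. */
int next_char(stream_stack *ss)
{
    while (ss->current.data[ss->current.next] == '\0') {
        if (!pop_stream(ss)) return -1;
    }
    assert(ss->depth >= 0);
    return (unsigned char)ss->current.data[ss->current.next++];
}
```
///
/// Calling <code>pop_stream</code> by hand before the pushed text is exhausted abandons the rest of it,
/// which is why the point at which you pop matters.
///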
/// \subsection settext SETTEXT(str)
///
/// A token manufactured by the lexer does not actually physically store the text from the
/// input stream to which it matches. The token string is instead created only if you ask for
/// the text. However, if you wish to change the text that the token represents you can use
/// this macro to set it explicitly. Note that this does not change the input stream text
/// but associates the supplied #pANTLR3_STRING with the token. This string is then returned
/// when the parser and tree parser reference the tokens via the $xxx.text reference.
///
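/// The lazy-text behaviour described above can be sketched as follows. The <code>lazy_token</code>
/// type and its functions are hypothetical and use plain C strings instead of #pANTLR3_STRING; the
/// sketch only shows text being built on demand and an explicit override leaving the input untouched.
///
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct lazy_token {
    const char *input;   /* the input buffer, which the token does not own */
    size_t      start;   /* offset of the first matched character          */
    size_t      stop;    /* offset of the last matched character, inclusive */
    char       *text;    /* NULL until requested or explicitly set         */
} lazy_token;

static char *copy_range(const char *src, size_t len)
{
    char *s = malloc(len + 1);

    if (s != NULL) { memcpy(s, src, len); s[len] = '\0'; }
    return s;
}

/* Build the text from the input only on first request, then cache it. */
const char *token_get_text(lazy_token *t)
{
    assert(t->stop >= t->start);
    if (t->text == NULL)
        t->text = copy_range(t->input + t->start, t->stop - t->start + 1);
    return t->text;
}

/* Attach replacement text to the token; the input is left untouched. */
void token_set_text(lazy_token *t, const char *replacement)
{
    char *copy = copy_range(replacement, strlen(replacement));

    free(t->text);
    t->text = copy;
}

void token_free_text(lazy_token *t)
{
    free(t->text);
    t->text = NULL;
}
```
///
/// Until the text is requested or set, the token costs only two offsets, which is the saving the lazy
/// scheme is after.
///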
/// \subsection user1 USER1 USER2 USER3 and CUSTOM
///
/// While you can create your own custom token class and have the lexer deal with this, this
/// is a lot of work compared to the trivial inheritance that can be achieved in the Java target.
/// In many cases though, all that is needed is the addition of a few data items such as an
/// integer or a pointer. Rather than require C programmers to create complicated structures
/// just to add a few data items, the C target provides a few custom fields in the standard
/// token, which will fulfil the needs of most lexers and parsers.
///
/// The token fields user1, user2, and user3 are all value types of #ANTLR3_UINT32. In the
/// parser you can reference these fields directly from the token: <code>x=TOKNAME { $x->user1 ...</code>
/// but when you are building the token in the lexer, you must assign to the fields using the
/// macros <code>USER1</code>, <code>USER2</code>, or <code>USER3</code>. As in:
///
/// \code
/// LEXTOK: 'AAAAA' { USER1 = 99; } ;
/// \endcode
///
/// \section parsermacros Parser and Tree Parser Macros
///
/// \subsection parser PARSER
///
/// The <code>PARSER</code> macro returns a pointer to the base parser or tree parser object, which is of type #pANTLR3_PARSER
/// or #pANTLR3_TREE_PARSER . This is not the pointer to your generated parser, which is supplied by the <code>CTX</code> macro,
/// but to the common implementation of a parser or tree parser interface, which is supplied to all generated parsers.
///
/// \subsection index INDEX()
///
/// When used in the parser, the <code>INDEX</code> macro returns the position of the current
/// token ( LT(1) ) in the input token stream. It can be used for <code>MARK</code> and <code>REWIND</code>
/// operations.
///
/// \subsection lt LT(n) and LA(n)
///
/// In the parser, the macro <code>LT(n)</code> returns the #pANTLR3_COMMON_TOKEN at offset <code>n</code> from
/// the current token stream input position. The macro <code>LA(n)</code> returns the token type of the token
/// at position <code>n</code>. The value <code>n</code> cannot be zero, and such a reference will return
/// <code>NULL</code> and possibly cause an error. <code>LA(1)</code> is the token that is about to be
/// recognized and <code>LA(-1)</code> is the token that has just been recognized. Values of n that exceed the
/// limits of the token stream boundaries will return <code>NULL</code>.
///
/// \subsection psrstate PSRSTATE
///
/// Returns the shared state pointer of type #pANTLR3_RECOGNIZER_SHARED_STATE. This is not generally
/// useful to the grammar programmer as the useful elements have generic $xxx references built in to
/// ANTLR.
///
/// \subsection adaptor ADAPTOR
///
/// When building an AST via a parser, the work of constructing and manipulating trees is done
/// by a supplied adaptor class. The default class is usually fine for most tree operations but
/// if you wish to build your own specialized linked/tree structure, then you may need to reference
/// the adaptor you supply directly. The <code>ADAPTOR</code> macro returns the reference to the tree adaptor
/// which is always of type #pANTLR3_BASE_TREE_ADAPTOR, even if it is your custom adaptor.
///
/// \section commonmacros Macros Common to All Recognizers
///
/// \subsection recognizer RECOGNIZER
///
/// Returns a reference type of #pANTLR3_BASE_RECOGNIZER, which is the base functionality supplied
/// to all recognizers, whether lexers, parsers or tree parsers. You can override methods in this
/// interface by installing your own function pointers (once you know what you are doing).
///
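/// Overriding a method by installing a function pointer looks roughly like the following. The
/// <code>base_recognizer</code> struct and its members are hypothetical and far smaller than the real
/// interface; the sketch shows only the mechanism of swapping one pointer and chaining to the default.
///
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical base interface: behaviour is reached through pointers. */
typedef struct base_recognizer {
    int  errorCount;
    int  failed;
    void (*reportError)(struct base_recognizer *rec);
} base_recognizer;

/* Default implementation installed when the recognizer is created. */
void default_report_error(base_recognizer *rec)
{
    assert(rec != NULL);
    rec->errorCount++;
}

void recognizer_init(base_recognizer *rec)
{
    rec->errorCount  = 0;
    rec->failed      = 0;
    rec->reportError = default_report_error;
}

/* Replacement supplied by the application. It chains to the default so
 * the error is still counted, then treats every error as fatal. */
void custom_report_error(base_recognizer *rec)
{
    default_report_error(rec);
    rec->failed = 1;
}
```
///
/// The override is a single assignment, <code>rec.reportError = custom_report_error;</code>, made after
/// the recognizer has been created and before it is used.
///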
/// \subsection input INPUT
///
/// Returns a reference to the input stream of the appropriate type for the recognizer. In a lexer
/// this macro returns a reference type of #pANTLR3_INPUT_STREAM, in a parser this is type
/// #pANTLR3_TOKEN_STREAM and in a tree parser this is type #pANTLR3_COMMON_TREE_NODE_STREAM.
/// You can of course provide your own implementations of any of these interfaces.
///
/// \subsection mark MARK()
///
/// This macro will cause the input stream for the current recognizer to be marked with a
/// checkpoint. It will return a value type of #ANTLR3_MARKER which you can use as the
/// parameter to a <code>REWIND</code> macro to return to the marked point in the input.
///
/// If you know you will only ever rewind to the last <code>MARK</code>, then you can ignore the return
/// value of this macro and just use the <code>REWINDLAST</code> macro to return to the last <code>MARK</code> that
/// was set in the input stream.
///
/// \subsection rewind REWIND(m)
///
/// Rewinds the appropriate input stream back to the marked checkpoint returned from a prior
/// MARK macro call and supplied as the parameter <code>m</code> to the <code>REWIND(m)</code>
/// macro.
///
/// \subsection rewindlast REWINDLAST()
///
/// Rewinds the current input stream (characters, tokens, tree nodes) back to the last checkpoint
/// marker created by a <code>MARK</code> macro call. Fails silently if there was no prior
/// <code>MARK</code> call.
///
/// \subsection seek SEEK(n)
///
/// Causes the input stream to position itself directly at offset <code>n</code> in the stream. Works for all
/// input stream types, whether lexer, parser or tree parser.
///
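/// The relationship between marking, rewinding and seeking can be sketched with a small stand-alone
/// stream. All names here (<code>markable_stream</code>, <code>stream_mark</code>,
/// <code>stream_rewind</code>, <code>stream_rewind_last</code>, <code>stream_seek</code>) are
/// hypothetical, and the decision to release a marker once it has been rewound to is an assumption of
/// the sketch, not a statement about the runtime.
///
```c
#include <assert.h>
#include <stddef.h>

#define MAX_MARKS 16   /* arbitrary limit for this sketch */

typedef struct markable_stream {
    const char *data;
    size_t      size;
    size_t      next;              /* current position                   */
    size_t      marks[MAX_MARKS];  /* saved positions, indexed by marker */
    int         markCount;         /* number of live markers             */
} markable_stream;

/* Record the current position; the returned marker identifies it.
 * Returns -1 if no marker slot is left. */
int stream_mark(markable_stream *s)
{
    if (s->markCount == MAX_MARKS) return -1;
    s->marks[s->markCount] = s->next;
    return s->markCount++;
}

/* Go back to the position recorded for marker m and release that
 * marker together with any markers issued after it. */
void stream_rewind(markable_stream *s, int m)
{
    assert(m >= 0 && m < s->markCount);
    s->next      = s->marks[m];
    s->markCount = m;
}

/* Rewind to the most recent marker; does nothing if none exists. */
void stream_rewind_last(markable_stream *s)
{
    if (s->markCount > 0) stream_rewind(s, s->markCount - 1);
}

/* Jump straight to an absolute offset, clamped to the end of input. */
void stream_seek(markable_stream *s, size_t n)
{
    s->next = n < s->size ? n : s->size;
}
```
///
/// A marker is an opaque handle rather than an offset, which is why the value returned by a mark call
/// should only ever be passed back to a rewind call.
///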