<html>
<head>
<title>"Clang" CFE Internals Manual</title>
<link type="text/css" rel="stylesheet" href="../menu.css" />
<link type="text/css" rel="stylesheet" href="../content.css" />
<style type="text/css">
td {
vertical-align: top;
}
</style>
</head>
<body>

<!--#include virtual="../menu.html.incl"-->

<div id="content">

<h1>"Clang" CFE Internals Manual</h1>

<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#libsystem">LLVM System and Support Libraries</a></li>
<li><a href="#libbasic">The Clang 'Basic' Library</a>
<ul>
<li><a href="#Diagnostics">The Diagnostics Subsystem</a></li>
<li><a href="#SourceLocation">The SourceLocation and SourceManager
classes</a></li>
<li><a href="#SourceRange">SourceRange and CharSourceRange</a></li>
</ul>
</li>
<li><a href="#libdriver">The Driver Library</a>
<ul>
</ul>
</li>
<li><a href="#pch">Precompiled Headers</a></li>
<li><a href="#libfrontend">The Frontend Library</a>
<ul>
</ul>
</li>
<li><a href="#liblex">The Lexer and Preprocessor Library</a>
<ul>
<li><a href="#Token">The Token class</a></li>
<li><a href="#Lexer">The Lexer class</a></li>
<li><a href="#AnnotationToken">Annotation Tokens</a></li>
<li><a href="#TokenLexer">The TokenLexer class</a></li>
<li><a href="#MultipleIncludeOpt">The MultipleIncludeOpt class</a></li>
</ul>
</li>
<li><a href="#libparse">The Parser Library</a>
<ul>
</ul>
</li>
<li><a href="#libast">The AST Library</a>
<ul>
<li><a href="#Type">The Type class and its subclasses</a></li>
<li><a href="#QualType">The QualType class</a></li>
<li><a href="#DeclarationName">Declaration names</a></li>
<li><a href="#DeclContext">Declaration contexts</a>
<ul>
<li><a href="#Redeclarations">Redeclarations and Overloads</a></li>
<li><a href="#LexicalAndSemanticContexts">Lexical and Semantic
Contexts</a></li>
<li><a
href="#TransparentContexts">Transparent Declaration Contexts</a></li>
<li><a href="#MultiDeclContext">Multiply-Defined Declaration Contexts</a></li>
</ul>
</li>
<li><a href="#CFG">The CFG class</a></li>
<li><a href="#Constants">Constant Folding in the Clang AST</a></li>
</ul>
</li>
<li><a href="#Howtos">Howto guides</a>
<ul>
<li><a href="#AddingAttributes">How to add an attribute</a></li>
</ul>
</li>
</ul>


<!-- ======================================================================= -->
<h2 id="intro">Introduction</h2>
<!-- ======================================================================= -->

<p>This document describes some of the more important APIs and internal design
decisions made in the Clang C front-end. Its purpose is to capture this
high-level information and to describe the design decisions behind it. It is
meant for people interested in hacking on Clang, not for end users. The
description below is organized by library, and does not describe any of the
clients of the libraries.</p>

<!-- ======================================================================= -->
<h2 id="libsystem">LLVM System and Support Libraries</h2>
<!-- ======================================================================= -->

<p>The LLVM libsystem library provides the basic Clang system abstraction layer,
which is used for file system access. The LLVM libsupport library provides many
underlying libraries and <a
href="http://llvm.org/docs/ProgrammersManual.html">data structures</a>,
including command-line option
processing and various containers.</p>

<!-- ======================================================================= -->
<h2 id="libbasic">The Clang 'Basic' Library</h2>
<!-- ======================================================================= -->

<p>This library certainly needs a better name.
The 'basic' library contains a
number of low-level utilities for tracking and manipulating source buffers,
locations within the source buffers, diagnostics, tokens, target abstraction,
and information about the subset of the language being compiled for.</p>

<p>Part of this infrastructure is specific to C (such as the TargetInfo class);
other parts could be reused for other non-C-based languages (SourceLocation,
SourceManager, Diagnostics, FileManager). When and if there is future demand
we can figure out if it makes sense to introduce a new library, move the general
classes somewhere else, or introduce some other solution.</p>

<p>We describe the roles of these classes in order of their dependencies.</p>


<!-- ======================================================================= -->
<h3 id="Diagnostics">The Diagnostics Subsystem</h3>
<!-- ======================================================================= -->

<p>The Clang Diagnostics subsystem is an important part of how the compiler
communicates with the human. Diagnostics are the warnings and errors produced
when the code is incorrect or dubious. In Clang, each diagnostic produced has
(at the minimum) a unique ID, an English translation associated with it, a <a
href="#SourceLocation">SourceLocation</a> to "put the caret", and a severity (e.g.
<tt>WARNING</tt> or <tt>ERROR</tt>). They can also optionally include a number
of arguments to the diagnostic (which fill in "%0"'s in the string) as well as a
number of source ranges that relate to the diagnostic.</p>

<p>In this section, we'll be giving examples produced by the Clang command line
driver, but diagnostics can be <a href="#DiagnosticClient">rendered in many
different ways</a> depending on how the DiagnosticClient interface is
implemented.
A representative example of a diagnostic is:</p>

<pre>
t.c:38:15: error: invalid operands to binary expression ('int *' and '_Complex float')
<font color="darkgreen">   P = (P-42) + Gamma*4;</font>
<font color="blue">       ~~~~~~ ^ ~~~~~~~</font>
</pre>

<p>In this example, you can see the English translation, the severity (error),
the source location (the caret ("^") and file/line/column info),
the source ranges "~~~~", and the arguments to the diagnostic ("int*" and
"_Complex float"). You'll have to believe me that there is a unique ID backing
the diagnostic :).</p>

<p>Getting all of this to happen has several steps and involves many moving
pieces. This section describes them and talks about best practices when adding
a new diagnostic.</p>

<!-- ============================= -->
<h4>The Diagnostic*Kinds.td files</h4>
<!-- ============================= -->

<p>Diagnostics are created by adding an entry to one of the <tt>
clang/Basic/Diagnostic*Kinds.td</tt> files, depending on what library will
be using it. From this file, tblgen generates the unique ID of the diagnostic,
the severity of the diagnostic and the English translation + format string.</p>

<p>There is little sanity with the naming of the unique IDs right now. Some
start with err_, warn_, ext_ to encode the severity into the name. Since the
enum is referenced in the C++ code that produces the diagnostic, it is somewhat
useful for it to be reasonably short.</p>

<p>The severity of the diagnostic comes from the set {<tt>NOTE</tt>,
<tt>WARNING</tt>, <tt>EXTENSION</tt>, <tt>EXTWARN</tt>, <tt>ERROR</tt>}. The
<tt>ERROR</tt> severity is used for diagnostics indicating the program is never
acceptable under any circumstances. When an error is emitted, the AST for the
input code may not be fully built.
The <tt>EXTENSION</tt> and <tt>EXTWARN</tt>
severities are used for extensions to the language that Clang accepts. This
means that Clang fully understands and can represent them in the AST, but we
produce diagnostics to tell the user their code is non-portable. The difference
is that the former are ignored by default, and the latter warn by default. The
<tt>WARNING</tt> severity is used for constructs that are valid in the currently
selected source language but that are dubious in some way. The <tt>NOTE</tt>
level is used to staple more information onto previous diagnostics.</p>

<p>These <em>severities</em> are mapped into a smaller set (the
Diagnostic::Level enum, {<tt>Ignored</tt>, <tt>Note</tt>, <tt>Warning</tt>,
<tt>Error</tt>, <tt>Fatal</tt> }) of output <em>levels</em> by the diagnostics
subsystem based on various configuration options. Clang internally supports a
fine-grained mapping mechanism that allows you to map almost any
diagnostic to the output level that you want. The only diagnostics that cannot
be mapped are <tt>NOTE</tt>s, which always follow the severity of the previously
emitted diagnostic, and <tt>ERROR</tt>s, which can only be mapped to
<tt>Fatal</tt> (it is not possible to turn an error into a warning,
for example).</p>

<p>Diagnostic mappings are used in many ways. For example, if the user
specifies <tt>-pedantic</tt>, <tt>EXTENSION</tt> maps to <tt>Warning</tt>; if
they specify <tt>-pedantic-errors</tt>, it turns into <tt>Error</tt>. This is
used to implement options like <tt>-Wunused-macros</tt>, <tt>-Wundef</tt> etc.
</p>

<p>
Mapping to <tt>Fatal</tt> should only be used for diagnostics that are
considered so severe that error recovery won't be able to recover sensibly from
them (thus spewing a ton of bogus errors). One example of this class of error
is failure to #include a file.
</p>

<!-- ================= -->
<h4>The Format String</h4>
<!-- ================= -->

<p>The format string for the diagnostic is very simple, but it has some power.
It takes the form of a string in English with markers that indicate where and
how arguments to the diagnostic are inserted and formatted. For example, here
are some simple format strings:</p>

<pre>
"binary integer literals are an extension"
"format string contains '\\0' within the string body"
"more '<b>%%</b>' conversions than data arguments"
"invalid operands to binary expression (<b>%0</b> and <b>%1</b>)"
"overloaded '<b>%0</b>' must be a <b>%select{unary|binary|unary or binary}2</b> operator"
     " (has <b>%1</b> parameter<b>%s1</b>)"
</pre>

<p>These examples show some important points of format strings. You can use any
plain ASCII character in the diagnostic string except "%" without a problem,
but these are C strings, so you have to use and be aware of all the C escape
sequences (as in the second example). If you want to produce a "%" in the
output, use the "%%" escape sequence, like the third diagnostic. Finally,
Clang uses the "%...[digit]" sequences to specify where and how arguments to
the diagnostic are formatted.</p>

<p>Arguments to the diagnostic are numbered according to how they are specified
by the C++ code that <a href="#producingdiag">produces them</a>, and are
referenced by <tt>%0</tt> .. <tt>%9</tt>. If you have more than 10 arguments
to your diagnostic, you are doing something wrong :). Unlike printf, there
is no requirement that arguments to the diagnostic end up in the output in
the same order as they are specified; you could have a format string with
<tt>"%1 %0"</tt> that swaps them, for example. The text in between the
percent and digit is the formatting instructions.
If there are no instructions,
the argument is just turned into a string and substituted in.</p>

<p>Here are some "best practices" for writing the English format string:</p>

<ul>
<li>Keep the string short. It should ideally fit in the 80 column limit of the
<tt>DiagnosticKinds.td</tt> file. This avoids the diagnostic wrapping when
printed, and forces you to think about the important point you are conveying
with the diagnostic.</li>
<li>Take advantage of location information. The user will be able to see the
line and location of the caret, so you don't need to tell them that the
problem is with the 4th argument to the function: just point to it.</li>
<li>Do not capitalize the diagnostic string, and do not end it with a
period.</li>
<li>If you need to quote something in the diagnostic string, use single
quotes.</li>
</ul>

<p>Diagnostics should never take random English strings as arguments: you
shouldn't use <tt>"you have a problem with %0"</tt> and pass in things like
<tt>"your argument"</tt> or <tt>"your return value"</tt> as arguments. Doing
this prevents <a href="#translation">translating</a> the Clang diagnostics to
other languages (because they'll get random English words in their otherwise
localized diagnostic). The exceptions to this are C/C++ language keywords
(e.g. auto, const, mutable, etc.) and C/C++ operators (<tt>/=</tt>). Note
that things like "pointer" and "reference" are not keywords. On the other
hand, you <em>can</em> include anything that comes from the user's source code,
including variable names, types, labels, etc.
The 'select' format can be
used to achieve this sort of thing in a localizable way; see below.</p>

<!-- ==================================== -->
<h4>Formatting a Diagnostic Argument</h4>
<!-- ==================================== -->

<p>Arguments to diagnostics are fully typed internally, and come from a couple
of different classes: integers, types, names, and random strings. Depending on
the class of the argument, it can be optionally formatted in different ways.
This gives the DiagnosticClient information about what the argument means
without requiring it to use a specific presentation (consider this MVC for
Clang :).</p>

<p>Here are the different diagnostic argument formats currently supported by
Clang:</p>

<table>
<tr><td colspan="2"><b>"s" format</b></td></tr>
<tr><td>Example:</td><td><tt>"requires %1 parameter%s1"</tt></td></tr>
<tr><td>Class:</td><td>Integers</td></tr>
<tr><td>Description:</td><td>This is a simple formatter for integers that is
useful when producing English diagnostics. When the integer is 1, it prints
as nothing. When the integer is not 1, it prints as "s". This allows some
simple grammatical forms to be handled correctly, and eliminates the
need to use gross things like <tt>"requires %1 parameter(s)"</tt>.</td></tr>

<tr><td colspan="2"><b>"select" format</b></td></tr>
<tr><td>Example:</td><td><tt>"must be a %select{unary|binary|unary or binary}2
operator"</tt></td></tr>
<tr><td>Class:</td><td>Integers</td></tr>
<tr><td>Description:</td><td><p>This format specifier is used to merge multiple
related diagnostics together into one common one, without requiring the
difference to be specified as an English string argument. Instead of
specifying the string, the diagnostic gets an integer argument and the
format string selects the numbered option.
In this case, the "%2" value
must be an integer in the range [0..2]. If it is 0, it prints 'unary'; if
it is 1, it prints 'binary'; if it is 2, it prints 'unary or binary'. This
allows other language translations to substitute reasonable words (or entire
phrases) based on the semantics of the diagnostic instead of having to do
things textually.</p>
<p>The selected string does undergo formatting.</p></td></tr>

<tr><td colspan="2"><b>"plural" format</b></td></tr>
<tr><td>Example:</td><td><tt>"you have %1 %plural{1:mouse|:mice}1 connected to
your computer"</tt></td></tr>
<tr><td>Class:</td><td>Integers</td></tr>
<tr><td>Description:</td><td><p>This is a formatter for complex plural forms.
It is designed to handle even the requirements of languages with very
complex plural forms, as many Baltic languages have. The argument consists
of a series of expression:form pairs separated by '|', where the first form
whose expression evaluates to true is the result of the modifier.</p>
<p>An expression can be empty, in which case it is always true. See the
example at the top. Otherwise, it is a series of one or more numeric
conditions, separated by ','. If any condition matches, the expression
matches. Each numeric condition can take one of three forms.</p>
<ul>
<li>number: A simple decimal number matches if the argument is the same
as the number. Example: <tt>"%plural{1:mouse|:mice}4"</tt></li>
<li>range: A range in square brackets matches if the argument is within
the range. The range is inclusive on both ends. Example:
<tt>"%plural{0:none|1:one|[2,5]:some|:many}2"</tt></li>
<li>modulo: A modulo operator is followed by a number, an
equals sign and either a number or a range. The tests are the
same as for plain
numbers and ranges, but the argument is taken modulo the number first.
Example: <tt>"%plural{%100=0:even hundred|%100=[1,50]:lower half|:everything
else}1"</tt></li>
</ul>
<p>The parser is very unforgiving. A syntax error, even whitespace, will
abort, as will a failure to match the argument against any
expression.</p></td></tr>

<tr><td colspan="2"><b>"ordinal" format</b></td></tr>
<tr><td>Example:</td><td><tt>"ambiguity in %ordinal0 argument"</tt></td></tr>
<tr><td>Class:</td><td>Integers</td></tr>
<tr><td>Description:</td><td><p>This is a formatter which represents the
argument number as an ordinal: the value <tt>1</tt> becomes <tt>1st</tt>,
<tt>3</tt> becomes <tt>3rd</tt>, and so on. Values less than <tt>1</tt>
are not supported.</p>
<p>This formatter is currently hard-coded to use English ordinals.</p></td></tr>

<tr><td colspan="2"><b>"objcclass" format</b></td></tr>
<tr><td>Example:</td><td><tt>"method %objcclass0 not found"</tt></td></tr>
<tr><td>Class:</td><td>DeclarationName</td></tr>
<tr><td>Description:</td><td><p>This is a simple formatter that indicates the
DeclarationName corresponds to an Objective-C class method selector. As
such, it prints the selector with a leading '+'.</p></td></tr>

<tr><td colspan="2"><b>"objcinstance" format</b></td></tr>
<tr><td>Example:</td><td><tt>"method %objcinstance0 not found"</tt></td></tr>
<tr><td>Class:</td><td>DeclarationName</td></tr>
<tr><td>Description:</td><td><p>This is a simple formatter that indicates the
DeclarationName corresponds to an Objective-C instance method selector.
As
such, it prints the selector with a leading '-'.</p></td></tr>

<tr><td colspan="2"><b>"q" format</b></td></tr>
<tr><td>Example:</td><td><tt>"candidate found by name lookup is %q0"</tt></td></tr>
<tr><td>Class:</td><td>NamedDecl*</td></tr>
<tr><td>Description:</td><td><p>This formatter indicates that the fully-qualified name of the declaration should be printed, e.g., "std::vector" rather than "vector".</p></td></tr>

</table>

<p>It is really easy to add format specifiers to the Clang diagnostics system,
but they should be discussed before they are added. If you are creating a lot
of repetitive diagnostics and/or have an idea for a useful formatter, please
bring it up on the cfe-dev mailing list.</p>

<!-- ===================================================== -->
<h4 id="producingdiag">Producing the Diagnostic</h4>
<!-- ===================================================== -->

<p>Now that you've created the diagnostic in the DiagnosticKinds.td file, you
need to write the code that detects the condition in question and emits the
new diagnostic. Various components of Clang (e.g. the preprocessor, Sema,
etc.) provide a helper function named "Diag". It creates a diagnostic and
accepts the arguments, ranges, and other information that goes along with
it.</p>

<p>For example, the binary expression error comes from code like this:</p>

<pre>
if (various things that are bad)
  Diag(Loc, diag::err_typecheck_invalid_operands)
    << lex->getType() << rex->getType()
    << lex->getSourceRange() << rex->getSourceRange();
</pre>

<p>This shows the use of the Diag method: it takes a location (a <a
href="#SourceLocation">SourceLocation</a> object) and a diagnostic enum value
(which matches the name from DiagnosticKinds.td).
If the diagnostic takes
arguments, they are specified with the << operator: the first argument
becomes %0, the second becomes %1, etc. The diagnostic interface allows you to
specify arguments of many different types, including <tt>int</tt> and
<tt>unsigned</tt> for integer arguments, <tt>const char*</tt> and
<tt>std::string</tt> for string arguments, <tt>DeclarationName</tt> and
<tt>const IdentifierInfo*</tt> for names, <tt>QualType</tt> for types, etc.
SourceRanges are also specified with the << operator, but do not have a
specific ordering requirement.</p>

<p>As you can see, adding and producing a diagnostic is pretty straightforward.
The hard part is deciding exactly what you need to say to help the user, picking
a suitable wording, and providing the information needed to format it correctly.
The good news is that the call site that issues a diagnostic should be
completely independent of how the diagnostic is formatted and in what language
it is rendered.
</p>

<!-- ==================================================== -->
<h4 id="fix-it-hints">Fix-It Hints</h4>
<!-- ==================================================== -->

<p>In some cases, the front end emits diagnostics when it is clear
that some small change to the source code would fix the problem.
Examples include a missing semicolon at the end of a statement or a use of
deprecated syntax that is easily rewritten into a more modern form.
Clang tries very hard to emit the diagnostic and recover gracefully
in these and other cases.</p>

<p>However, for these cases where the fix is obvious, the diagnostic
can be annotated with a hint (referred to as a "fix-it hint") that
describes how to change the code referenced by the diagnostic to fix
the problem.
For example, it might add the missing semicolon at the
end of the statement or rewrite the use of a deprecated construct
into something more palatable. Here is one such example from the C++
front end, where we warn about the right-shift operator changing
meaning from C++98 to C++0x:</p>

<pre>
test.cpp:3:7: warning: use of right-shift operator ('>>') in template argument will require parentheses in C++0x
A<100 >> 2> *a;
      ^
  (       )
</pre>

<p>Here, the fix-it hint is suggesting that parentheses be added,
and showing exactly where those parentheses would be inserted into the
source code. The fix-it hints themselves describe what changes to make
to the source code in an abstract manner, which the text diagnostic
printer renders as a line of "insertions" below the caret line. <a
href="#DiagnosticClient">Other diagnostic clients</a> might choose
to render the code differently (e.g., as markup inline) or even give
the user the ability to automatically fix the problem.</p>

<p>All fix-it hints are described by the <code>FixItHint</code> class,
instances of which should be attached to the diagnostic using the
<< operator in the same way that highlighted source ranges and
arguments are passed to the diagnostic.
Fix-it hints can be created
with one of three static factory methods:</p>

<dl>
<dt><code>FixItHint::CreateInsertion(Loc, Code)</code></dt>
<dd>Specifies that the given <code>Code</code> (a string) should be inserted
before the source location <code>Loc</code>.</dd>

<dt><code>FixItHint::CreateRemoval(Range)</code></dt>
<dd>Specifies that the code in the given source <code>Range</code>
should be removed.</dd>

<dt><code>FixItHint::CreateReplacement(Range, Code)</code></dt>
<dd>Specifies that the code in the given source <code>Range</code>
should be removed, and replaced with the given <code>Code</code> string.</dd>
</dl>

<!-- ============================================================= -->
<h4><a name="DiagnosticClient">The DiagnosticClient Interface</a></h4>
<!-- ============================================================= -->

<p>Once code generates a diagnostic with all of the arguments and the rest of
the relevant information, Clang needs to know what to do with it. As previously
mentioned, the diagnostic machinery goes through some filtering to map a
severity onto a diagnostic level, then (assuming the diagnostic is not mapped to
"<tt>Ignored</tt>") it invokes an object that implements the DiagnosticClient
interface with the information.</p>

<p>It is possible to implement this interface in many different ways. For
example, the normal Clang DiagnosticClient (named 'TextDiagnosticPrinter') turns
the arguments into strings (according to the various formatting rules), prints
out the file/line/column information and the string, then prints out the line of
code, the source ranges, and the caret. However, this behavior isn't required.
</p>

<p>Another implementation of the DiagnosticClient interface is the
'TextDiagnosticBuffer' class, which is used when Clang is in -verify mode.
Instead of formatting and printing out the diagnostics, this implementation just
captures and remembers the diagnostics as they fly by. Then -verify compares
the list of produced diagnostics to the list of expected ones. If they disagree,
it prints out its own output.
</p>

<p>There are many other possible implementations of this interface, and this is
why we prefer diagnostics to pass down rich structured information in arguments.
For example, an HTML output might want declaration names to be linkified to where
they come from in the source. Another example is that a GUI might let you click
on typedefs to expand them. This application would want to pass significantly
more information about types through to the GUI than a simple flat string. The
interface allows this to happen.</p>

<!-- ====================================================== -->
<h4><a name="translation">Adding Translations to Clang</a></h4>
<!-- ====================================================== -->

<p>Not possible yet! Diagnostic strings should be written in UTF-8; the client
can translate to the relevant code page if needed. Each translation completely
replaces the format string for the diagnostic.</p>


<!-- ======================================================================= -->
<h3 id="SourceLocation">The SourceLocation and SourceManager classes</h3>
<!-- ======================================================================= -->

<p>Strangely enough, the SourceLocation class represents a location within the
source code of the program. Important design points include:</p>

<ol>
<li>sizeof(SourceLocation) must be extremely small, as these are embedded into
many AST nodes and are passed around often.
Currently it is 32 bits.</li>
<li>SourceLocation must be a simple value object that can be efficiently
copied.</li>
<li>We should be able to represent a source location for any byte of any input
file. This includes in the middle of tokens, in whitespace, in trigraphs,
etc.</li>
<li>A SourceLocation must encode the current #include stack that was active when
the location was processed. For example, if the location corresponds to a
token, it should contain the set of #includes active when the token was
lexed. This allows us to print the #include stack for a diagnostic.</li>
<li>SourceLocation must be able to describe macro expansions, capturing both
the ultimate instantiation point and the source of the original character
data.</li>
</ol>

<p>In practice, the SourceLocation works together with the SourceManager class
to encode two pieces of information about a location: its spelling location
and its instantiation location. For most tokens, these will be the same.
However, for a macro expansion (or tokens that came from a _Pragma directive)
these will describe the location of the characters corresponding to the token
and the location where the token was used (i.e. the macro instantiation point
or the location of the _Pragma itself).</p>

<p>The Clang front-end inherently depends on the location of a token being
tracked correctly. If it is ever incorrect, the front-end may get confused and
die. The reason for this is that the notion of the 'spelling' of a Token in
Clang depends on being able to find the original input characters for the token.
This concept maps directly to the "spelling location" for the token.</p>


<!-- ======================================================================= -->
<h3 id="SourceRange">SourceRange and CharSourceRange</h3>
<!-- ======================================================================= -->
<!-- mostly taken from
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2010-August/010595.html -->

<p>Clang represents most source ranges by [first, last], where first and last
each point to the beginning of their respective tokens. For example
consider the SourceRange of the following statement:</p>
<pre>
x = foo + bar;
^first    ^last
</pre>

<p>To map from this representation to a character-based
representation, the 'last' location needs to be adjusted to point to
(or past) the end of that token with either
<code>Lexer::MeasureTokenLength()</code> or
<code>Lexer::getLocForEndOfToken()</code>. For the rare cases
where character-level source range information is needed we use
the <code>CharSourceRange</code> class.</p>


<!-- ======================================================================= -->
<h2 id="libdriver">The Driver Library</h2>
<!-- ======================================================================= -->

<p>The clang Driver and library are documented <a
href="DriverInternals.html">here</a>.</p>

<!-- ======================================================================= -->
<h2 id="pch">Precompiled Headers</h2>
<!-- ======================================================================= -->

<p>Clang supports two implementations of precompiled headers. The
default implementation, precompiled headers (<a
href="PCHInternals.html">PCH</a>) uses a serialized representation
of Clang's internal data structures, encoded with the <a
href="http://llvm.org/docs/BitCodeFormat.html">LLVM bitstream
format</a>.
Pretokenized headers (<a
href="PTHInternals.html">PTH</a>), on the other hand, contain a
serialized representation of the tokens encountered when
preprocessing a header (and anything that header includes).</p>


<!-- ======================================================================= -->
<h2 id="libfrontend">The Frontend Library</h2>
<!-- ======================================================================= -->

<p>The Frontend library contains functionality useful for building
tools on top of the clang libraries, for example several methods for
outputting diagnostics.</p>

<!-- ======================================================================= -->
<h2 id="liblex">The Lexer and Preprocessor Library</h2>
<!-- ======================================================================= -->

<p>The Lexer library contains several tightly-connected classes that are involved
with the nasty process of lexing and preprocessing C source code. The main
interface to this library for outside clients is the large <a
href="#Preprocessor">Preprocessor</a> class.
It contains the various pieces of state that are required to coherently read
tokens out of a translation unit.</p>

<p>The core interface to the Preprocessor object (once it is set up) is the
Preprocessor::Lex method, which returns the next <a href="#Token">Token</a> from
the preprocessor stream. There are two types of token providers that the
preprocessor is capable of reading from: a buffer lexer (provided by the <a
href="#Lexer">Lexer</a> class) and a buffered token stream (provided by the <a
href="#TokenLexer">TokenLexer</a> class).</p>
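<p>The relationship between a single "give me the next token" entry point and
the two kinds of token providers can be sketched with a small standalone toy
model. This is only an illustration under stated assumptions: every name in it
(<tt>ToyToken</tt>, <tt>ToyBufferLexer</tt>, <tt>ToyTokenStream</tt>,
<tt>ToyPreprocessor</tt>, <tt>lexAll</tt>) is hypothetical, it does not use
Clang's real classes or signatures, and the provider stack is an assumption of
the sketch rather than a description of how Clang is implemented.</p>

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy model only: these types are hypothetical stand-ins, not Clang's classes.
struct ToyToken {
  std::string Text;  // spelling of the token; empty means "end of input"
  bool isEof() const { return Text.empty(); }
};

// Common interface for anything the toy preprocessor can read tokens from.
struct ToyTokenProvider {
  virtual ~ToyTokenProvider() = default;
  virtual ToyToken next() = 0;
};

// Plays the role of a buffer lexer: splits raw characters into tokens
// (whitespace-separated here, which is far simpler than real C lexing).
class ToyBufferLexer : public ToyTokenProvider {
  std::string Buffer;
  std::size_t Pos = 0;

public:
  explicit ToyBufferLexer(std::string Text) : Buffer(std::move(Text)) {}

  ToyToken next() override {
    auto IsSpace = [](char C) {
      return std::isspace(static_cast<unsigned char>(C)) != 0;
    };
    while (Pos < Buffer.size() && IsSpace(Buffer[Pos]))
      ++Pos;
    std::size_t Start = Pos;
    while (Pos < Buffer.size() && !IsSpace(Buffer[Pos]))
      ++Pos;
    return ToyToken{Buffer.substr(Start, Pos - Start)};
  }
};

// Plays the role of a buffered token stream: replays already-lexed tokens.
class ToyTokenStream : public ToyTokenProvider {
  std::vector<ToyToken> Tokens;
  std::size_t Index = 0;

public:
  explicit ToyTokenStream(std::vector<ToyToken> Toks)
      : Tokens(std::move(Toks)) {}

  ToyToken next() override {
    return Index < Tokens.size() ? Tokens[Index++] : ToyToken{};
  }
};

// Single entry point: reads from the most recently entered provider and
// falls back to the previous one when it runs dry.
class ToyPreprocessor {
  std::vector<std::unique_ptr<ToyTokenProvider>> Stack;

public:
  void enter(std::unique_ptr<ToyTokenProvider> P) {
    assert(P && "provider must not be null");
    Stack.push_back(std::move(P));
  }

  ToyToken lex() {
    while (!Stack.empty()) {
      ToyToken Tok = Stack.back()->next();
      if (!Tok.isEof())
        return Tok;
      Stack.pop_back();  // this provider is exhausted
    }
    return ToyToken{};  // every provider is exhausted
  }
};

// Drains the toy preprocessor and returns the token spellings in order.
std::vector<std::string> lexAll(ToyPreprocessor &PP) {
  std::vector<std::string> Out;
  for (ToyToken Tok = PP.lex(); !Tok.isEof(); Tok = PP.lex())
    Out.push_back(Tok.Text);
  return Out;
}
```

<p>In this sketch a client only ever calls <tt>lex()</tt> and never needs to
know whether a token came from raw characters or from a replayed token list,
which is the property the paragraph above attributes to the core interface.</p>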


<!-- ======================================================================= -->
<h3 id="Token">The Token class</h3>
<!-- ======================================================================= -->

<p>The Token class is used to represent a single lexed token. Tokens are
intended to be used by the lexer/preprocessor and parser libraries, but are not
intended to live beyond them (for example, they should not live in the ASTs).</p>

<p>Tokens most often live on the stack (or some other location that is efficient
to access) as the parser is running, but occasionally do get buffered up. For
example, macro definitions are stored as a series of tokens, and the C++
front-end periodically needs to buffer tokens up for tentative parsing and
various pieces of look-ahead. As such, the size of a Token matters. On a 32-bit
system, sizeof(Token) is currently 16 bytes.</p>

<p>Tokens occur in two forms: "<a href="#AnnotationToken">Annotation
Tokens</a>" and normal tokens. Normal tokens are those returned by the lexer;
annotation tokens represent semantic information and are produced by the parser,
replacing normal tokens in the token stream. Normal tokens contain the
following information:</p>

<ul>
<li><b>A SourceLocation</b> - This indicates the location of the start of the
token.</li>

<li><b>A length</b> - This stores the length of the token as stored in the
SourceBuffer. For tokens that include them, this length includes trigraphs and
escaped newlines which are ignored by later phases of the compiler. By pointing
into the original source buffer, it is always possible to get the original
spelling of a token completely accurately.</li>

<li><b>IdentifierInfo</b> - If a token takes the form of an identifier, and if
identifier lookup was enabled when the token was lexed (e.g.
the lexer was not
reading in 'raw' mode) this contains a pointer to the unique hash value for the
identifier. Because the lookup happens before keyword identification, this
field is set even for language keywords like 'for'.</li>

<li><b>TokenKind</b> - This indicates the kind of token as classified by the
lexer. This includes things like <tt>tok::starequal</tt> (for the "*="
operator), <tt>tok::ampamp</tt> for the "&amp;&amp;" token, and keyword values
(e.g. <tt>tok::kw_for</tt>) for identifiers that correspond to keywords. Note
that some tokens can be spelled multiple ways. For example, C++ supports
"operator keywords", where things like "and" are treated exactly like the
"&amp;&amp;" operator. In these cases, the kind value is set to
<tt>tok::ampamp</tt>, which is good for the parser, which doesn't have to
consider both forms. For something that cares about which form is used (e.g.
the preprocessor 'stringize' operator) the spelling indicates the original
form.</li>

<li><b>Flags</b> - There are currently four flags tracked by the
lexer/preprocessor system on a per-token basis:

  <ol>
  <li><b>StartOfLine</b> - This was the first token that occurred on its input
      source line.</li>
  <li><b>LeadingSpace</b> - There was a space character either immediately
      before the token or transitively before the token as it was expanded
      through a macro. The definition of this flag is closely tied to
      the stringizing requirements of the preprocessor.</li>
  <li><b>DisableExpand</b> - This flag is used internally to the preprocessor to
      represent identifier tokens which have macro expansion disabled. This
      prevents them from being considered as candidates for macro expansion ever
      in the future.</li>
  <li><b>NeedsCleaning</b> - This flag is set if the original spelling for the
      token includes a trigraph or escaped newline.
Since this is uncommon,
      many pieces of code can fast-path on tokens that did not need cleaning.
      </li>
  </ol>
</li>
</ul>

<p>One interesting (and somewhat unusual) aspect of normal tokens is that they
don't contain any semantic information about the lexed value. For example, if
the token was a pp-number token, we do not represent the value of the number
that was lexed (this is left for later pieces of code to decide). Additionally,
the lexer library has no notion of typedef names vs variable names: both are
returned as identifiers, and the parser is left to decide whether a specific
identifier is a typedef or a variable (tracking this requires scope information
among other things). The parser can do this translation by replacing tokens
returned by the preprocessor with "Annotation Tokens".</p>

<!-- ======================================================================= -->
<h3 id="AnnotationToken">Annotation Tokens</h3>
<!-- ======================================================================= -->

<p>Annotation Tokens are tokens that are synthesized by the parser and injected
into the preprocessor's token stream (replacing existing tokens) to record
semantic information found by the parser. For example, if "foo" is found to be
a typedef, the "foo" <tt>tok::identifier</tt> token is replaced with a
<tt>tok::annot_typename</tt>. This is useful for a couple of reasons: 1) this
makes it easy to handle qualified type names (e.g. "foo::bar::baz&lt;42&gt;::t")
in C++ as a single "token" in the parser. 2) if the parser backtracks, the
reparse does not need to redo semantic analysis to determine whether a token
sequence is a variable, type, template, etc.</p>

<p>Annotation Tokens are created by the parser and reinjected into the parser's
token stream (when backtracking is enabled).
Because they can only exist in
tokens that the preprocessor-proper is done with, they don't need to keep around
flags like "start of line" that the preprocessor uses to do its job.
Additionally, an annotation token may "cover" a sequence of preprocessor tokens
(e.g. <tt>a::b::c</tt> is five preprocessor tokens). As such, the valid fields
of an annotation token are different from the fields for a normal token (but
they are multiplexed into the normal Token fields):</p>

<ul>
<li><b>SourceLocation "Location"</b> - The SourceLocation for the annotation
token indicates the first token replaced by the annotation token. In the example
above, it would be the location of the "a" identifier.</li>

<li><b>SourceLocation "AnnotationEndLoc"</b> - This holds the location of the
last token replaced with the annotation token. In the example above, it would
be the location of the "c" identifier.</li>

<li><b>void* "AnnotationValue"</b> - This contains an opaque object
that the parser gets from Sema. The parser merely preserves the
information for Sema to later interpret based on the annotation token
kind.</li>

<li><b>TokenKind "Kind"</b> - This indicates the kind of Annotation token this
is. See below for the different valid kinds.</li>
</ul>

<p>Annotation tokens currently come in three kinds:</p>

<ol>
<li><b>tok::annot_typename</b>: This annotation token represents a
resolved typename token that is potentially qualified. The
AnnotationValue field contains the <tt>QualType</tt> returned by
Sema::getTypeName(), possibly with source location information
attached.</li>

<li><b>tok::annot_cxxscope</b>: This annotation token represents a C++
scope specifier, such as "A::B::". This corresponds to the grammar
productions "::" and ":: [opt] nested-name-specifier".
The
AnnotationValue pointer is a <tt>NestedNameSpecifier*</tt> returned by
the Sema::ActOnCXXGlobalScopeSpecifier and
Sema::ActOnCXXNestedNameSpecifier callbacks.</li>

<li><b>tok::annot_template_id</b>: This annotation token represents a
C++ template-id such as "foo&lt;int, 4&gt;", where "foo" is the name
of a template. The AnnotationValue pointer is a pointer to a malloc'd
TemplateIdAnnotation object. Depending on the context, a parsed
template-id that names a type might become a typename annotation token
(if all we care about is the named type, e.g., because it occurs in a
type specifier) or might remain a template-id token (if we want to
retain more source location information or produce a new type, e.g.,
in a declaration of a class template specialization). Template-id
annotation tokens that refer to a type can be "upgraded" to typename
annotation tokens by the parser.</li>

</ol>

<p>As mentioned above, annotation tokens are not returned by the preprocessor;
they are formed on demand by the parser. This means that the parser has to be
aware of cases where an annotation could occur and form it where appropriate.
This is somewhat similar to how the parser handles Translation Phase 6 of C99:
String Concatenation (see C99 5.1.1.2). In the case of string concatenation,
the preprocessor just returns distinct tok::string_literal and
tok::wide_string_literal tokens and the parser eats a sequence of them wherever
the grammar indicates that a string literal can occur.</p>

<p>In order to do this, whenever the parser expects a tok::identifier or
tok::coloncolon, it should call the TryAnnotateTypeOrScopeToken or
TryAnnotateCXXScopeToken methods to form the annotation token. These methods
will maximally form the specified annotation tokens and replace the current
token with them, if applicable.
If the current token is not valid for an
annotation token, it will remain an identifier or :: token.</p>



<!-- ======================================================================= -->
<h3 id="Lexer">The Lexer class</h3>
<!-- ======================================================================= -->

<p>The Lexer class provides the mechanics of lexing tokens out of a source
buffer and deciding what they mean. The Lexer is complicated by the fact that
it operates on raw buffers that have not had spelling eliminated (this is a
necessity to get decent performance), but this is countered with careful coding
as well as standard performance techniques (for example, the comment handling
code is vectorized on X86 and PowerPC hosts).</p>

<p>The lexer has a couple of interesting modal features:</p>

<ul>
<li>The lexer can operate in 'raw' mode. This mode has several features that
    make it possible to quickly lex the file (e.g. it stops identifier lookup,
    doesn't specially handle preprocessor tokens, handles EOF differently, etc).
    This mode is used for lexing within an "<tt>#if 0</tt>" block, for
    example.</li>
<li>The lexer can capture and return comments as tokens. This is required to
    support the -C preprocessor mode, which passes comments through, and is
    used by the diagnostic checker to identify expect-error annotations.</li>
<li>The lexer can be in ParsingFilename mode, which happens when preprocessing
    after reading a #include directive. This mode changes the parsing of '&lt;'
    to return an "angled string" instead of a bunch of tokens for each thing
    within the filename.</li>
<li>When parsing a preprocessor directive (after "<tt>#</tt>") the
    ParsingPreprocessorDirective mode is entered.
This changes the lexer to
    return EOD at a newline.</li>
<li>The Lexer uses a LangOptions object to know whether trigraphs are enabled,
    whether C++ or ObjC keywords are recognized, etc.</li>
</ul>

<p>In addition to these modes, the lexer keeps track of a couple of other
   features that are local to a lexed buffer, which change as the buffer is
   lexed:</p>

<ul>
<li>The Lexer uses BufferPtr to keep track of the current character being
    lexed.</li>
<li>The Lexer uses IsAtStartOfLine to keep track of whether the next lexed token
    will start with its "start of line" bit set.</li>
<li>The Lexer keeps track of the current #if directives that are active (which
    can be nested).</li>
<li>The Lexer keeps track of an <a href="#MultipleIncludeOpt">
    MultipleIncludeOpt</a> object, which is used to
    detect whether the buffer uses the standard "<tt>#ifndef XX</tt> /
    <tt>#define XX</tt>" idiom to prevent multiple inclusion. If a buffer does,
    subsequent includes can be ignored if the XX macro is defined.</li>
</ul>

<!-- ======================================================================= -->
<h3 id="TokenLexer">The TokenLexer class</h3>
<!-- ======================================================================= -->

<p>The TokenLexer class is a token provider that returns tokens from a list
of tokens that came from somewhere else. It is typically used for two things: 1)
returning tokens from a macro definition as it is being expanded, and 2) returning
tokens from an arbitrary buffer of tokens.
The latter is used by _Pragma and
will most likely be used to handle unbounded look-ahead for the C++ parser.</p>

<!-- ======================================================================= -->
<h3 id="MultipleIncludeOpt">The MultipleIncludeOpt class</h3>
<!-- ======================================================================= -->

<p>The MultipleIncludeOpt class implements a really simple little state machine
that is used to detect the standard "<tt>#ifndef XX</tt> / <tt>#define XX</tt>"
idiom that people typically use to prevent multiple inclusion of headers. If a
buffer uses this idiom and is subsequently #include'd, the preprocessor can
simply check to see whether the guarding condition is defined or not. If so,
the preprocessor can completely ignore the include of the header.</p>



<!-- ======================================================================= -->
<h2 id="libparse">The Parser Library</h2>
<!-- ======================================================================= -->

<!-- ======================================================================= -->
<h2 id="libast">The AST Library</h2>
<!-- ======================================================================= -->

<!-- ======================================================================= -->
<h3 id="Type">The Type class and its subclasses</h3>
<!-- ======================================================================= -->

<p>The Type class and its subclasses are an important part of the AST. Types
are accessed through the ASTContext class, which implicitly creates and uniques
them as they are needed. Types have a couple of non-obvious features: 1) they
do not capture type qualifiers like const or volatile (see
<a href="#QualType">QualType</a>), and 2) they implicitly capture typedef
information.
Once created, types are immutable (unlike decls).</p>

<p>Typedefs in C make semantic analysis a bit more complex than it would
be without them. The issue is that we want to capture typedef information
and represent it in the AST perfectly, but the semantics of operations need to
"see through" typedefs. For example, consider this code:</p>

<code>
void func() {<br>
typedef int foo;<br>
foo X, *Y;<br>
typedef foo* bar;<br>
bar Z;<br>
*X; <i>// error</i><br>
**Y; <i>// error</i><br>
**Z; <i>// error</i><br>
}<br>
</code>

<p>The code above is illegal, and thus we expect there to be diagnostics emitted
on the annotated lines. In this example, we expect to get:</p>

<pre>
<b>test.c:6:1: error: indirection requires pointer operand ('foo' invalid)</b>
*X; // error
<font color="blue">^~</font>
<b>test.c:7:1: error: indirection requires pointer operand ('foo' invalid)</b>
**Y; // error
<font color="blue">^~~</font>
<b>test.c:8:1: error: indirection requires pointer operand ('foo' invalid)</b>
**Z; // error
<font color="blue">^~~</font>
</pre>

<p>While this example is somewhat silly, it illustrates the point: we want to
retain typedef information where possible, so that we can emit errors about
"<tt>std::string</tt>" instead of "<tt>std::basic_string&lt;char, std:...</tt>".
Doing this requires properly keeping typedef information (for example, the type
of "X" is "foo", not "int"), and requires properly propagating it through the
various operators (for example, the type of *Y is "foo", not "int"). In order
to retain this information, the type of these expressions is an instance of the
TypedefType class, which indicates that the type of these expressions is a
typedef for foo.
</p>

<p>Representing types like this is great for diagnostics, because the
user-specified type is always immediately available.
There are two problems
with this: first, various semantic checks need to make judgements about the
<em>actual structure</em> of a type, ignoring typedefs. Second, we need an
efficient way to query whether two types are structurally identical to each
other, ignoring typedefs. The solution to both of these problems is the idea of
canonical types.</p>

<!-- =============== -->
<h4>Canonical Types</h4>
<!-- =============== -->

<p>Every instance of the Type class contains a canonical type pointer. For
simple types with no typedefs involved (e.g. "<tt>int</tt>", "<tt>int*</tt>",
"<tt>int**</tt>"), the type just points to itself. For types that have a
typedef somewhere in their structure (e.g. "<tt>foo</tt>", "<tt>foo*</tt>",
"<tt>foo**</tt>", "<tt>bar</tt>"), the canonical type pointer points to their
structurally equivalent type without any typedefs (e.g. "<tt>int</tt>",
"<tt>int*</tt>", "<tt>int**</tt>", and "<tt>int*</tt>" respectively).</p>

<p>This design provides a constant time operation (dereferencing the canonical
type pointer) that gives us access to the structure of types. For example,
we can trivially tell that "bar" and "foo*" are the same type by dereferencing
their canonical type pointers and doing a pointer comparison (they both point
to the single "<tt>int*</tt>" type).</p>

<p>Canonical types and typedef types bring up some complexities that must be
carefully managed. Specifically, the "isa/cast/dyncast" operators generally
shouldn't be used in code that is inspecting the AST. For example, when type
checking the indirection operator (unary '*' on a pointer), the type checker
must verify that the operand has a pointer type.
It would not be correct to
check that with "<tt>isa&lt;PointerType&gt;(SubExpr-&gt;getType())</tt>",
because this predicate would fail if the subexpression had a typedef type.</p>

<p>The solution to this problem is a set of helper methods on Type, used to
check their properties. In this case, it would be correct to use
"<tt>SubExpr-&gt;getType()-&gt;isPointerType()</tt>" to do the check. This
predicate will return true if the <em>canonical type is a pointer</em>, which is
true any time the type is structurally a pointer type. The only hard part here
is remembering not to use the <tt>isa/cast/dyncast</tt> operations.</p>

<p>The second problem we face is how to get access to the pointer type once we
know it exists. To continue the example, the result type of the indirection
operator is the pointee type of the subexpression. In order to determine the
type, we need to get the instance of PointerType that best captures the typedef
information in the program. If the type of the expression is literally a
PointerType, we can return that, otherwise we have to dig through the
typedefs to find the pointer type. For example, if the subexpression had type
"<tt>foo*</tt>", we could return that type as the result. If the subexpression
had type "<tt>bar</tt>", we want to return "<tt>foo*</tt>" (note that we do
<em>not</em> want "<tt>int*</tt>"). In order to provide all of this, Type has
a getAsPointerType() method that checks whether the type is structurally a
PointerType and, if so, returns the best one.
If not, it returns a null
pointer.</p>

<p>This structure is somewhat mystical, but after meditating on it, it will
make sense to you :).</p>

<!-- ======================================================================= -->
<h3 id="QualType">The QualType class</h3>
<!-- ======================================================================= -->

<p>The QualType class is designed as a trivial value class that is
small, passed by-value and is efficient to query. The idea of
QualType is that it stores the type qualifiers (const, volatile,
restrict, plus some extended qualifiers required by language
extensions) separately from the types themselves. QualType is
conceptually a pair of "Type*" and the bits for these type qualifiers.</p>

<p>By storing the type qualifiers as bits in the conceptual pair, it is
extremely efficient to get the set of qualifiers on a QualType (just return the
field of the pair), add a type qualifier (which is a trivial constant-time
operation that sets a bit), and remove one or more type qualifiers (just return
a QualType with the bitfield set to empty).</p>

<p>Further, because the bits are stored outside of the type itself, we do not
need to create duplicates of types with different sets of qualifiers (i.e. there
is only a single heap allocated "int" type: "const int" and "volatile const int"
both point to the same heap allocated "int" type). This reduces the heap size
used to represent bits and also means we do not have to consider qualifiers when
uniquing types (<a href="#Type">Type</a> does not even contain qualifiers).</p>

<p>In practice, the two most common type qualifiers (const and
restrict) are stored in the low bits of the pointer to the Type
object, together with a flag indicating whether extended qualifiers
are present (which must be heap-allocated).
This means that QualType
is exactly the same size as a pointer.</p>

<!-- ======================================================================= -->
<h3 id="DeclarationName">Declaration names</h3>
<!-- ======================================================================= -->

<p>The <tt>DeclarationName</tt> class represents the name of a
  declaration in Clang. Declarations in the C family of languages can
  take several different forms. Most declarations are named by
  simple identifiers, e.g., "<code>f</code>" and "<code>x</code>" in
  the function declaration <code>f(int x)</code>. In C++, declaration
  names can also name class constructors ("<code>Class</code>"
  in <code>struct Class { Class(); }</code>), class destructors
  ("<code>~Class</code>"), overloaded operator names ("operator+"),
  and conversion functions ("<code>operator void const *</code>"). In
  Objective-C, declaration names can refer to the names of Objective-C
  methods, which involve the method name and the parameters,
  collectively called a <i>selector</i>, e.g.,
  "<code>setWidth:height:</code>". Since all of these kinds of
  entities - variables, functions, Objective-C methods, C++
  constructors, destructors, and operators - are represented as
  subclasses of Clang's common <code>NamedDecl</code>
  class, <code>DeclarationName</code> is designed to efficiently
  represent any kind of name.</p>

<p>Given
  a <code>DeclarationName</code> <code>N</code>, <code>N.getNameKind()</code>
  will produce a value that describes what kind of name <code>N</code>
  stores. There are 8 options (all of the names are inside
  the <code>DeclarationName</code> class):</p>
<dl>
  <dt>Identifier</dt>
  <dd>The name is a simple
  identifier. Use <code>N.getAsIdentifierInfo()</code> to retrieve the
  corresponding <code>IdentifierInfo*</code> pointing to the actual
  identifier.
Note that C++ overloaded operators (e.g.,
  "<code>operator+</code>") are represented as special kinds of
  identifiers. Use <code>IdentifierInfo</code>'s <code>getOverloadedOperatorID</code>
  function to determine whether an identifier is an overloaded
  operator name.</dd>

  <dt>ObjCZeroArgSelector, ObjCOneArgSelector,
  ObjCMultiArgSelector</dt>
  <dd>The name is an Objective-C selector, which can be retrieved as a
    <code>Selector</code> instance
    via <code>N.getObjCSelector()</code>. The three possible name
    kinds for Objective-C reflect an optimization within
    the <code>DeclarationName</code> class: both zero- and
    one-argument selectors are stored as a
    masked <code>IdentifierInfo</code> pointer, and therefore require
    very little space, since zero- and one-argument selectors are far
    more common than multi-argument selectors (which use a different
    structure).</dd>

  <dt>CXXConstructorName</dt>
  <dd>The name is a C++ constructor
    name. Use <code>N.getCXXNameType()</code> to retrieve
    the <a href="#QualType">type</a> that this constructor is meant to
    construct. The type is always the canonical type, since all
    constructors for a given type have the same name.</dd>

  <dt>CXXDestructorName</dt>
  <dd>The name is a C++ destructor
    name. Use <code>N.getCXXNameType()</code> to retrieve
    the <a href="#QualType">type</a> whose destructor is being
    named. This type is always a canonical type.</dd>

  <dt>CXXConversionFunctionName</dt>
  <dd>The name is a C++ conversion function. Conversion functions are
  named according to the type they convert to, e.g., "<code>operator void
      const *</code>". Use <code>N.getCXXNameType()</code> to retrieve
  the type that this conversion function converts to. This type is
    always a canonical type.</dd>

  <dt>CXXOperatorName</dt>
  <dd>The name is a C++ overloaded operator name.
Overloaded operators
  are named according to their spelling, e.g.,
  "<code>operator+</code>" or "<code>operator new
  []</code>". Use <code>N.getCXXOverloadedOperator()</code> to
  retrieve the overloaded operator (a value of
    type <code>OverloadedOperatorKind</code>).</dd>
</dl>

<p><code>DeclarationName</code>s are cheap to create, copy, and
  compare. They require only a single pointer's worth of storage in
  the common cases (identifiers, zero-
  and one-argument Objective-C selectors) and use dense, uniqued
  storage for the other kinds of
  names. Two <code>DeclarationName</code>s can be compared for
  equality (<code>==</code>, <code>!=</code>) using a simple bitwise
  comparison, can be ordered
  with <code>&lt;</code>, <code>&gt;</code>, <code>&lt;=</code>,
  and <code>&gt;=</code> (which provide a lexicographical ordering for
  normal identifiers but an unspecified ordering for other kinds of
  names), and can be placed into LLVM <code>DenseMap</code>s
  and <code>DenseSet</code>s.</p>

<p><code>DeclarationName</code> instances can be created in different
  ways depending on what kind of name the instance will store. Normal
  identifiers (<code>IdentifierInfo</code> pointers) and Objective-C selectors
  (<code>Selector</code>) can be implicitly converted
  to <code>DeclarationName</code>s. Names for C++ constructors,
  destructors, conversion functions, and overloaded operators can be retrieved from
  the <code>DeclarationNameTable</code>, an instance of which is
  available as <code>ASTContext::DeclarationNames</code>.
The member
  functions <code>getCXXConstructorName</code>, <code>getCXXDestructorName</code>,
  <code>getCXXConversionFunctionName</code>, and <code>getCXXOperatorName</code>, respectively,
  return <code>DeclarationName</code> instances for the four kinds of
  C++ special function names.</p>

<!-- ======================================================================= -->
<h3 id="DeclContext">Declaration contexts</h3>
<!-- ======================================================================= -->
<p>Every declaration in a program exists within some <i>declaration
    context</i>, such as a translation unit, namespace, class, or
    function. Declaration contexts in Clang are represented by
    the <code>DeclContext</code> class, from which the various
  declaration-context AST nodes
  (<code>TranslationUnitDecl</code>, <code>NamespaceDecl</code>, <code>RecordDecl</code>, <code>FunctionDecl</code>,
  etc.) derive. The <code>DeclContext</code> class provides
  several facilities common to each declaration context:</p>
<dl>
  <dt>Source-centric vs. Semantics-centric View of Declarations</dt>
  <dd><code>DeclContext</code> provides two views of the declarations
  stored within a declaration context. The source-centric view
  accurately represents the program source code as written, including
  multiple declarations of entities where present (see the
    section <a href="#Redeclarations">Redeclarations and
  Overloads</a>), while the semantics-centric view represents the
  program semantics. The two views are kept synchronized by semantic
  analysis while the ASTs are being constructed.</dd>

  <dt>Storage of declarations within that context</dt>
  <dd>Every declaration context can contain some number of
    declarations. For example, a C++ class (represented
    by <code>RecordDecl</code>) contains various member functions,
    fields, nested types, and so on.
All of these declarations will be
    stored within the <code>DeclContext</code>, and one can iterate
    over the declarations via
    [<code>DeclContext::decls_begin()</code>,
    <code>DeclContext::decls_end()</code>). This mechanism provides
    the source-centric view of declarations in the context.</dd>

  <dt>Lookup of declarations within that context</dt>
  <dd>The <code>DeclContext</code> structure provides efficient name
    lookup for names within that declaration context. For example,
    if <code>N</code> is a namespace we can look for the
    name <code>N::f</code>
    using <code>DeclContext::lookup</code>. The lookup itself is
    based on a lazily-constructed array (for declaration contexts
    with a small number of declarations) or hash table (for
    declaration contexts with more declarations). The lookup
    operation provides the semantics-centric view of the declarations
    in the context.</dd>

  <dt>Ownership of declarations</dt>
  <dd>The <code>DeclContext</code> owns all of the declarations that
    were declared within its declaration context, and is responsible
    for the management of their memory as well as their
    (de-)serialization.</dd>
</dl>

<p>All declarations are stored within a declaration context, and one
  can query
  information about the context in which each declaration lives. One
  can retrieve the <code>DeclContext</code> that contains a
  particular <code>Decl</code>
  using <code>Decl::getDeclContext</code>. However, see the
  section <a href="#LexicalAndSemanticContexts">Lexical and Semantic
  Contexts</a> for more information about how to interpret this
  context information.</p>

<h4 id="Redeclarations">Redeclarations and Overloads</h4>
<p>Within a translation unit, it is common for an entity to be
declared several times.
For example, we might declare a function "f"
  and then later re-declare it as part of an inlined definition:</p>

<pre>
void f(int x, int y, int z = 1);

inline void f(int x, int y, int z) { /* ... */ }
</pre>

<p>The representation of "f" differs in the source-centric and
  semantics-centric views of a declaration context. In the
  source-centric view, all redeclarations will be present, in the
  order they occurred in the source code, making
    this view suitable for clients that wish to see the structure of
    the source code. In the semantics-centric view, only the most recent "f"
  will be found by the lookup, since it effectively replaces the first
  declaration of "f".</p>

<p>In the semantics-centric view, overloading of functions is
  represented explicitly. For example, given two declarations of a
  function "g" that are overloaded, e.g.,</p>
<pre>
void g();
void g(int);
</pre>
<p>the <code>DeclContext::lookup</code> operation will return
  a <code>DeclContext::lookup_result</code> that contains a range of iterators
  over declarations of "g". Clients that perform semantic analysis on a
  program that is not concerned with the actual source code will
  primarily use this semantics-centric view.</p>

<h4 id="LexicalAndSemanticContexts">Lexical and Semantic Contexts</h4>
<p>Each declaration has two potentially different
  declaration contexts: a <i>lexical</i> context, which corresponds to
  the source-centric view of the declaration context, and
  a <i>semantic</i> context, which corresponds to the
  semantics-centric view. The lexical context is accessible
  via <code>Decl::getLexicalDeclContext</code> while the
  semantic context is accessible
  via <code>Decl::getDeclContext</code>, both of which return
  <code>DeclContext</code> pointers. For most declarations, the two
  contexts are identical.
For example:</p>

<pre>
class X {
public:
  void f(int x);
};
</pre>

<p>Here, the semantic and lexical contexts of <code>X::f</code> are
  the <code>DeclContext</code> associated with the
  class <code>X</code> (itself stored as a <code>RecordDecl</code> AST
  node). However, we can now define <code>X::f</code> out-of-line:</p>

<pre>
void X::f(int x = 17) { /* ... */ }
</pre>

<p>This definition of <code>X::f</code> has different lexical and semantic
  contexts. The lexical context corresponds to the declaration
  context in which the actual declaration occurred in the source
  code, e.g., the translation unit containing <code>X</code>. Thus,
  this declaration of <code>X::f</code> can be found by traversing
  the declarations provided by
  [<code>decls_begin()</code>, <code>decls_end()</code>) in the
  translation unit.</p>

<p>The semantic context of <code>X::f</code> corresponds to the
  class <code>X</code>, since this member function is (semantically) a
  member of <code>X</code>. Lookup of the name <code>f</code> into
  the <code>DeclContext</code> associated with <code>X</code> will
  then return the definition of <code>X::f</code> (including
  information about the default argument).</p>

<h4 id="TransparentContexts">Transparent Declaration Contexts</h4>
<p>In C and C++, there are several contexts in which names that are
  logically declared inside another declaration will actually "leak"
  out into the enclosing scope from the perspective of name
  lookup. The most obvious instance of this behavior is in
  enumeration types, e.g.,</p>
<pre>
enum Color {
  Red,
  Green,
  Blue
};
</pre>

<p>Here, <code>Color</code> is an enumeration, which is a declaration
  context that contains the
  enumerators <code>Red</code>, <code>Green</code>,
  and <code>Blue</code>.
Thus, traversing the list of declarations
  contained in the enumeration <code>Color</code> will
  yield <code>Red</code>, <code>Green</code>,
  and <code>Blue</code>. However, outside of the scope
  of <code>Color</code> one can name the enumerator <code>Red</code>
  without qualifying the name, e.g.,</p>

<pre>
Color c = Red;
</pre>

<p>There are other entities in C++ that provide similar behavior. For
  example, linkage specifications that use curly braces:</p>

<pre>
extern "C" {
  void f(int);
  void g(int);
}
// f and g are visible here
</pre>

<p>For source-level accuracy, we treat the linkage specification and
  the enumeration type each as a
  declaration context in which its enclosed declarations ("Red",
  "Green", and "Blue"; "f" and "g")
  are declared. However, these declarations are visible outside of the
  scope of the declaration context.</p>

<p>These language features (and several others, described below) have
  roughly the same set of
  requirements: declarations are declared within a particular lexical
  context, but the declarations are also found via name lookup in
  scopes enclosing the declaration itself. This feature is implemented
  via <i>transparent</i> declaration contexts
  (see <code>DeclContext::isTransparentContext()</code>), whose
  declarations are visible in the nearest enclosing non-transparent
  declaration context.
This means that the lexical context of the
  declaration (e.g., an enumerator) will be the
  transparent <code>DeclContext</code> itself, as will the semantic
  context, but the declaration will be visible in every outer context
  up to and including the first non-transparent declaration context (since
  transparent declaration contexts can be nested).</p>

<p>The transparent <code>DeclContexts</code> are:</p>
<ul>
  <li>Enumerations (but not C++0x "scoped enumerations"):
    <pre>
enum Color {
  Red,
  Green,
  Blue
};
// Red, Green, and Blue are in scope
  </pre></li>
  <li>C++ linkage specifications:
  <pre>
extern "C" {
  void f(int);
  void g(int);
}
// f and g are in scope
  </pre></li>
  <li>Anonymous unions and structs:
    <pre>
struct LookupTable {
  bool IsVector;
  union {
    std::vector&lt;Item&gt; *Vector;
    std::set&lt;Item&gt; *Set;
  };
};

LookupTable LT;
LT.Vector = 0; // Okay: finds Vector inside the unnamed union
    </pre>
  </li>
  <li>C++0x inline namespaces:
<pre>
namespace mylib {
  inline namespace debug {
    class X;
  }
}
mylib::X *xp; // okay: mylib::X refers to mylib::debug::X
</pre>
</li>
</ul>


<h4 id="MultiDeclContext">Multiply-Defined Declaration Contexts</h4>
<p>C++ namespaces have the interesting--and, so far, unique--property that
  the namespace can be defined multiple times, and the declarations
  provided by each namespace definition are effectively merged (from
  the semantic point of view).
For example, the following two code
  snippets are semantically indistinguishable:</p>
<pre>
// Snippet #1:
namespace N {
  void f();
}
namespace N {
  void f(int);
}

// Snippet #2:
namespace N {
  void f();
  void f(int);
}
</pre>

<p>In Clang's representation, the source-centric view of declaration
  contexts will actually have two separate <code>NamespaceDecl</code>
  nodes in Snippet #1, each of which is a declaration context that
  contains a single declaration of "f". However, the semantics-centric
  view provided by name lookup into the namespace <code>N</code> for
  "f" will return a <code>DeclContext::lookup_result</code> that contains
  a range of iterators over declarations of "f".</p>

<p><code>DeclContext</code> manages multiply-defined declaration
  contexts internally. The
  function <code>DeclContext::getPrimaryContext</code> retrieves the
  "primary" context for a given <code>DeclContext</code> instance,
  which is the <code>DeclContext</code> responsible for maintaining
  the lookup table used for the semantics-centric view. Given the
  primary context, one can follow the chain
  of <code>DeclContext</code> nodes that define additional
  declarations via <code>DeclContext::getNextContext</code>. Note that
  these functions are used internally within the lookup and insertion
  methods of the <code>DeclContext</code>, so the vast majority of
  clients can ignore them.</p>

<!-- ======================================================================= -->
<h3 id="CFG">The <tt>CFG</tt> class</h3>
<!-- ======================================================================= -->

<p>The <tt>CFG</tt> class is designed to represent a source-level
control-flow graph for a single statement (<tt>Stmt*</tt>).
Typically,
instances of <tt>CFG</tt> are constructed for function bodies (usually
an instance of <tt>CompoundStmt</tt>), but they can also be constructed to
represent the control-flow of an instance of any subclass of <tt>Stmt</tt>,
which includes simple expressions. Control-flow graphs are especially
useful for performing
<a href="http://en.wikipedia.org/wiki/Data_flow_analysis#Sensitivities">flow-
or path-sensitive</a> program analyses on a given function.</p>

<!-- ============ -->
<h4>Basic Blocks</h4>
<!-- ============ -->

<p>Concretely, an instance of <tt>CFG</tt> is a collection of basic
blocks. Each basic block is an instance of <tt>CFGBlock</tt>, which
simply contains an ordered sequence of <tt>Stmt*</tt> (each referring
to statements in the AST). The ordering of statements within a block
indicates unconditional flow of control from one statement to the
next. <a href="#ConditionalControlFlow">Conditional control-flow</a>
is represented using edges between basic blocks. The statements
within a given <tt>CFGBlock</tt> can be traversed using
the <tt>CFGBlock::*iterator</tt> interface.</p>

<p>
A <tt>CFG</tt> object owns the instances of <tt>CFGBlock</tt> within
the control-flow graph it represents. Each <tt>CFGBlock</tt> within a
CFG is also uniquely numbered (accessible
via <tt>CFGBlock::getBlockID()</tt>).
Currently the number is
based on the order in which the blocks were created, but no assumptions
should be made about how <tt>CFGBlock</tt>s are numbered other than that their
numbers are unique and that they are numbered from 0..N-1 (where N is
the number of basic blocks in the CFG).</p>

<!-- ===================== -->
<h4>Entry and Exit Blocks</h4>
<!-- ===================== -->

<p>Each instance of <tt>CFG</tt> contains two special blocks:
an <i>entry</i> block (accessible via <tt>CFG::getEntry()</tt>), which
has no incoming edges, and an <i>exit</i> block (accessible
via <tt>CFG::getExit()</tt>), which has no outgoing edges. Neither
block contains any statements, and they serve the role of providing a
clear entrance and exit for a body of code such as a function body.
The presence of these empty blocks greatly simplifies the
implementation of many analyses built on top of CFGs.</p>

<!-- ===================================================== -->
<h4 id="ConditionalControlFlow">Conditional Control-Flow</h4>
<!-- ===================================================== -->

<p>Conditional control-flow (such as that induced by if-statements
and loops) is represented as edges between <tt>CFGBlock</tt>s.
Because different C language constructs can induce control-flow,
each <tt>CFGBlock</tt> also records an extra <tt>Stmt*</tt> that
represents the <i>terminator</i> of the block. A terminator is simply
the statement that caused the control-flow, and is used to identify
the nature of the conditional control-flow between blocks.
For
example, in the case of an if-statement, the terminator refers to
the <tt>IfStmt</tt> object in the AST that represented the given
branch.</p>

<p>To illustrate, consider the following code example:</p>

<code>
int foo(int x) {<br>
&nbsp;&nbsp;x = x + 1;<br>
<br>
&nbsp;&nbsp;if (x &gt; 2) x++;<br>
&nbsp;&nbsp;else {<br>
&nbsp;&nbsp;&nbsp;&nbsp;x += 2;<br>
&nbsp;&nbsp;&nbsp;&nbsp;x *= 2;<br>
&nbsp;&nbsp;}<br>
<br>
&nbsp;&nbsp;return x;<br>
}
</code>

<p>After invoking the parser+semantic analyzer on this code fragment,
the AST of the body of <tt>foo</tt> is referenced by a
single <tt>Stmt*</tt>. We can then construct an instance
of <tt>CFG</tt> representing the control-flow graph of this function
body by a single call to a static class method:</p>

<code>
&nbsp;&nbsp;Stmt* FooBody = ...<br>
&nbsp;&nbsp;CFG* FooCFG = <b>CFG::buildCFG</b>(FooBody);
</code>

<p>It is the responsibility of the caller of <tt>CFG::buildCFG</tt>
to <tt>delete</tt> the returned <tt>CFG*</tt> when the CFG is no
longer needed.</p>

<p>Along with providing an interface to iterate over
its <tt>CFGBlock</tt>s, the <tt>CFG</tt> class also provides methods
that are useful for debugging and visualizing CFGs. For example, the
method
<tt>CFG::dump()</tt> dumps a pretty-printed version of the CFG to
standard error. This is especially useful when one is using a
debugger such as gdb.
For example, here is the output
of <tt>FooCFG-&gt;dump()</tt>:</p>

<code>
&nbsp;[ B5 (ENTRY) ]<br>
&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (0):<br>
&nbsp;&nbsp;&nbsp;&nbsp;Successors (1): B4<br>
<br>
&nbsp;[ B4 ]<br>
&nbsp;&nbsp;&nbsp;&nbsp;1: x = x + 1<br>
&nbsp;&nbsp;&nbsp;&nbsp;2: (x &gt; 2)<br>
&nbsp;&nbsp;&nbsp;&nbsp;<b>T: if [B4.2]</b><br>
&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (1): B5<br>
&nbsp;&nbsp;&nbsp;&nbsp;Successors (2): B3 B2<br>
<br>
&nbsp;[ B3 ]<br>
&nbsp;&nbsp;&nbsp;&nbsp;1: x++<br>
&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (1): B4<br>
&nbsp;&nbsp;&nbsp;&nbsp;Successors (1): B1<br>
<br>
&nbsp;[ B2 ]<br>
&nbsp;&nbsp;&nbsp;&nbsp;1: x += 2<br>
&nbsp;&nbsp;&nbsp;&nbsp;2: x *= 2<br>
&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (1): B4<br>
&nbsp;&nbsp;&nbsp;&nbsp;Successors (1): B1<br>
<br>
&nbsp;[ B1 ]<br>
&nbsp;&nbsp;&nbsp;&nbsp;1: return x;<br>
&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (2): B2 B3<br>
&nbsp;&nbsp;&nbsp;&nbsp;Successors (1): B0<br>
<br>
&nbsp;[ B0 (EXIT) ]<br>
&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (1): B1<br>
&nbsp;&nbsp;&nbsp;&nbsp;Successors (0):
</code>

<p>For each block, the pretty-printed output displays
the number of <i>predecessor</i> blocks (blocks that have outgoing
control-flow to the given block) and <i>successor</i> blocks (blocks
that have incoming control-flow from the given
block). We can also clearly see the special entry and exit blocks at
the beginning and end of the pretty-printed output. For the entry
block (block B5), the number of predecessor blocks is 0, while for the
exit block (block B0) the number of successor blocks is 0.</p>

<p>The most interesting block here is B4, whose outgoing control-flow
represents the branching caused by the sole if-statement
in <tt>foo</tt>. Of particular interest is the second statement in
the block, <b><tt>(x &gt; 2)</tt></b>, and the terminator, printed
as <b><tt>if [B4.2]</tt></b>. The second statement represents the
evaluation of the condition of the if-statement, which occurs before
the actual branching of control-flow. Within the <tt>CFGBlock</tt>
for B4, the <tt>Stmt*</tt> for the second statement refers to the
actual expression in the AST for <b><tt>(x &gt; 2)</tt></b>.
Thus
pointers to subclasses of <tt>Expr</tt> can appear in the list of
statements in a block, and not just subclasses of <tt>Stmt</tt> that
refer to proper C statements.</p>

<p>The terminator of block B4 is a pointer to the <tt>IfStmt</tt>
object in the AST. The pretty-printer outputs <b><tt>if
[B4.2]</tt></b> because the condition expression of the if-statement
has an actual place in the basic block, and thus the terminator is
essentially
<i>referring</i> to the expression that is the second statement of
block B4 (i.e., B4.2). In this manner, conditions for control-flow
(which also includes conditions for loops and switch statements) are
hoisted into the actual basic block.</p>

<!-- ===================== -->
<!-- <h4>Implicit Control-Flow</h4> -->
<!-- ===================== -->

<!--
<p>A key design principle of the <tt>CFG</tt> class was to not require
any transformations to the AST in order to represent control-flow.
Thus the <tt>CFG</tt> does not perform any "lowering" of the
statements in an AST: loops are not transformed into guarded gotos,
short-circuit operations are not converted to a set of if-statements,
and so on.</p>
-->


<!-- ======================================================================= -->
<h3 id="Constants">Constant Folding in the Clang AST</h3>
<!-- ======================================================================= -->

<p>There are several places where constants and constant folding matter a lot to
the Clang front-end. First, in general, we prefer the AST to retain the source
code as close to how the user wrote it as possible. This means that if they
wrote "5+4", we want to keep the addition and two constants in the AST; we don't
want to fold it to "9".
This means that constant folding, in its various forms, turns
into a tree walk that needs to handle the various cases.</p>

<p>However, there are places in both C and C++ that require constants to be
folded. For example, the C standard defines what an "integer constant
expression" (i-c-e) is with very precise and specific requirements. The
language then requires i-c-e's in a lot of places (for example, the size of a
bitfield, the value for a case statement, etc.). For these, we have to be able
to constant fold the constants, to do semantic checks (e.g. verify bitfield size
is non-negative and that case statements aren't duplicated). We aim for Clang
to be very pedantic about this, diagnosing cases when the code does not use an
i-c-e where one is required, but accepting the code unless running with
<tt>-pedantic-errors</tt>.</p>

<p>Things get a little bit more tricky when it comes to compatibility with
real-world source code. Specifically, GCC has historically accepted a huge
superset of expressions as i-c-e's, and a lot of real-world code depends on this
unfortunate accident of history (including, e.g., the glibc system headers). GCC
accepts anything its "fold" optimizer is capable of reducing to an integer
constant, which means that the definition of what it accepts changes as its
optimizer does. One example is that GCC accepts things like "case X-X:" even
when X is a variable, because it can fold this to 0.</p>

<p>Another issue is how constants interact with the extensions we support, such
as __builtin_constant_p, __builtin_inf, __extension__ and many others. C99
obviously does not specify the semantics of any of these extensions, and the
definition of i-c-e does not include them.
However, these extensions are often
used in real code, and we have to have a way to reason about them.</p>

<p>Finally, this is not just a problem for semantic analysis. The code
generator and other clients have to be able to fold constants (e.g. to
initialize global variables) and have to handle a superset of what C99 allows.
Further, these clients can benefit from extended information. For example, we
know that "foo()||1" always evaluates to true, but we can't replace the
expression with true because it has side effects.</p>

<!-- ======================= -->
<h4>Implementation Approach</h4>
<!-- ======================= -->

<p>After trying several different approaches, we've finally converged on a
design (note: at the time of this writing, not all of this has been implemented;
consider this a design goal!). Our basic approach is to define a single
recursive evaluation method (<tt>Expr::Evaluate</tt>), which is
implemented in <tt>AST/ExprConstant.cpp</tt>. Given an expression with 'scalar'
type (integer, fp, complex, or pointer) this method returns the following
information:</p>

<ul>
<li>Whether the expression is an integer constant expression, a general
    constant that was folded but has no side effects, a general constant that
    was folded but that does have side effects, or an uncomputable/unfoldable
    value.
</li>
<li>If the expression was computable in any way, this method returns the APValue
    for the result of the expression.</li>
<li>If the expression is not evaluatable at all, this method returns
    information on one of the problems with the expression. This includes a
    SourceLocation for where the problem is, and a diagnostic ID that explains
    the problem.
The diagnostic should have ERROR type.</li>
<li>If the expression is not an integer constant expression, this method returns
    information on one of the problems with the expression. This includes a
    SourceLocation for where the problem is, and a diagnostic ID that explains
    the problem. The diagnostic should have EXTENSION type.</li>
</ul>

<p>This information gives various clients the flexibility that they want, and we
will eventually have some helper methods for various extensions. For example,
Sema should have a <tt>Sema::VerifyIntegerConstantExpression</tt> method, which
calls Evaluate on the expression. If the expression is not foldable, the error
is emitted, and the method would return true. If the expression is foldable but
is not an i-c-e, the EXTENSION diagnostic is emitted. In every case other than
the unfoldable one, the method would return false to indicate that
the AST is OK.</p>

<p>Other clients can use the information in other ways; for example, codegen can
just use expressions that are foldable in any way.</p>

<!-- ========== -->
<h4>Extensions</h4>
<!-- ========== -->

<p>This section describes how some of the various extensions Clang supports
interact with constant evaluation:</p>

<ul>
<li><b><tt>__extension__</tt></b>: The expression form of this extension causes
    any evaluatable subexpression to be accepted as an integer constant
    expression.</li>
<li><b><tt>__builtin_constant_p</tt></b>: This returns true (as an integer
    constant expression) if the operand is any evaluatable constant.
As a
    special case, if <tt>__builtin_constant_p</tt> is the (potentially
    parenthesized) condition of a conditional operator expression ("?:"), only
    the true side of the conditional operator is considered, and it is evaluated
    with full constant folding.</li>
<li><b><tt>__builtin_choose_expr</tt></b>: The condition is required to be an
    integer constant expression, but we accept any constant as an "extension of
    an extension". This only evaluates one operand depending on which way the
    condition evaluates.</li>
<li><b><tt>__builtin_classify_type</tt></b>: This always returns an integer
    constant expression.</li>
<li><b><tt>__builtin_inf,nan,..</tt></b>: These are treated just like a
    floating-point literal.</li>
<li><b><tt>__builtin_abs,copysign,..</tt></b>: These are constant folded as
    general constant expressions.</li>
<li><b><tt>__builtin_strlen</tt></b> and <b><tt>strlen</tt></b>: These are
    constant folded as integer constant expressions if the argument is a string
    literal.</li>
</ul>


<!-- ======================================================================= -->
<h2 id="Howtos">How to change Clang</h2>
<!-- ======================================================================= -->

<!-- ======================================================================= -->
<h3 id="AddingAttributes">How to add an attribute</h3>
<!-- ======================================================================= -->

<p>To add an attribute, you'll have to add it to the list of attributes, add it
to the parsing phase, and look for it in the AST scan.
<a href="http://llvm.org/viewvc/llvm-project?view=rev&amp;revision=124217">r124217</a>
has a good example of adding a warning attribute.</p>

<p>(Beware that this hasn't been reviewed/fixed by the people who designed the
attributes system yet.)</p>

<h4><a
href="http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Basic/Attr.td?view=markup">include/clang/Basic/Attr.td</a></h4>

<p>Each attribute gets a <tt>def</tt> inheriting from <tt>Attr</tt> or one of
its subclasses. <tt>InheritableAttr</tt> means that the attribute also applies
to subsequent declarations of the same name.</p>

<p><tt>Spellings</tt> lists the strings that can appear in
<tt>__attribute__((here))</tt> or <tt>[[here]]</tt>. All such strings
will be synonymous. If you want to allow the <tt>[[]]</tt> C++0x
syntax, you have to define a list of <tt>Namespaces</tt>, which will
let users write <tt>[[namespace:spelling]]</tt>. Using the empty
string for a namespace will allow users to write just the spelling
with no "<tt>:</tt>".</p>

<p><tt>Subjects</tt> restricts the kinds of AST node to which this attribute
can appertain (roughly, attach).</p>

<p><tt>Args</tt> names the arguments the attribute takes, in order. If
<tt>Args</tt> is <tt>[StringArgument&lt;"Arg1"&gt;, IntArgument&lt;"Arg2"&gt;]</tt>
then <tt>__attribute__((myattribute("Hello", 3)))</tt> will be a valid use.</p>

<h4>Boilerplate</h4>

<p>Add an element to the <tt>AttributeList::Kind</tt> enum in <a
href="http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Sema/AttributeList.h?view=markup">include/clang/Sema/AttributeList.h</a>
named <tt>AT_lower_with_underscores</tt>.
That is, a CamelCased
<tt>AttributeName</tt> in <tt>Attr.td</tt> should become
<tt>AT_attribute_name</tt>.</p>

<p>Add a case to the <tt>StringSwitch</tt> in <tt>AttributeList::getKind()</tt>
in <a
href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Sema/AttributeList.cpp?view=markup">lib/Sema/AttributeList.cpp</a>
for each spelling of your attribute. Less common attributes should come toward
the end of that list.</p>

<p>Write a new <tt>HandleYourAttr()</tt> function in <a
href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Sema/SemaDeclAttr.cpp?view=markup">lib/Sema/SemaDeclAttr.cpp</a>,
and add a case to the switch in <tt>ProcessNonInheritableDeclAttr()</tt> or
<tt>ProcessInheritableDeclAttr()</tt> forwarding to it.</p>

<p>If your attribute causes extra warnings to fire, define a <tt>DiagGroup</tt>
in <a
href="http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Basic/DiagnosticGroups.td?view=markup">include/clang/Basic/DiagnosticGroups.td</a>
named after the attribute's <tt>Spelling</tt> with "_"s replaced by "-"s. If
you're only defining one diagnostic, you can skip <tt>DiagnosticGroups.td</tt>
and use <tt>InGroup&lt;DiagGroup&lt;"your-attribute"&gt;&gt;</tt> directly in <a
href="http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Basic/DiagnosticSemaKinds.td?view=markup">DiagnosticSemaKinds.td</a>.</p>

<h4>The meat of your attribute</h4>

<p>Find an appropriate place in Clang to do whatever your attribute needs to do.
Check for the attribute's presence using <tt>Decl::getAttr&lt;YourAttr&gt;()</tt>.</p>

<p>Update the <a href="LanguageExtensions.html">Clang Language Extensions</a>
document to describe your new attribute.</p>

</div>
</body>
</html>