1 =============================================================== 2 Tutorial for building tools using LibTooling and LibASTMatchers 3 =============================================================== 4 5 This document is intended to show how to build a useful source-to-source 6 translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is 7 explicitly aimed at people who are new to Clang, so all you should need 8 is a working knowledge of C++ and the command line. 9 10 In order to work on the compiler, you need some basic knowledge of the 11 abstract syntax tree (AST). To this end, the reader is incouraged to 12 skim the :doc:`Introduction to the Clang 13 AST <IntroductionToTheClangAST>` 14 15 Step 0: Obtaining Clang 16 ======================= 17 18 As Clang is part of the LLVM project, you'll need to download LLVM's 19 source code first. Both Clang and LLVM are maintained as Subversion 20 repositories, but we'll be accessing them through the git mirror. For 21 further information, see the `getting started 22 guide <http://llvm.org/docs/GettingStarted.html>`_. 23 24 .. code-block:: console 25 26 mkdir ~/clang-llvm && cd ~/clang-llvm 27 git clone http://llvm.org/git/llvm.git 28 cd llvm/tools 29 git clone http://llvm.org/git/clang.git 30 cd clang/tools 31 git clone http://llvm.org/git/clang-tools-extra.git extra 32 33 Next you need to obtain the CMake build system and Ninja build tool. You 34 may already have CMake installed, but current binary versions of CMake 35 aren't built with Ninja support. 36 37 .. code-block:: console 38 39 cd ~/clang-llvm 40 git clone https://github.com/martine/ninja.git 41 cd ninja 42 git checkout release 43 ./bootstrap.py 44 sudo cp ninja /usr/bin/ 45 46 cd ~/clang-llvm 47 git clone git://cmake.org/stage/cmake.git 48 cd cmake 49 git checkout next 50 ./bootstrap 51 make 52 sudo make install 53 54 Okay. Now we'll build Clang! 55 56 .. code-block:: console 57 58 cd ~/clang-llvm 59 mkdir build && cd build 60 cmake -G Ninja ../llvm -DLLVM_BUILD_TESTS=ON # Enable tests; default is off. 61 ninja 62 ninja check # Test LLVM only. 63 ninja clang-test # Test Clang only. 64 ninja install 65 66 And we're live. 67 68 All of the tests should pass, though there is a (very) small chance that 69 you can catch LLVM and Clang out of sync. Running ``'git svn rebase'`` 70 in both the llvm and clang directories should fix any problems. 71 72 Finally, we want to set Clang as its own compiler. 73 74 .. code-block:: console 75 76 cd ~/clang-llvm/build 77 ccmake ../llvm 78 79 The second command will bring up a GUI for configuring Clang. You need 80 to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on 81 advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to 82 ``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to 83 configure, then ``'g'`` to generate CMake's files. 84 85 Finally, run ninja one last time, and you're done. 86 87 Step 1: Create a ClangTool 88 ========================== 89 90 Now that we have enough background knowledge, it's time to create the 91 simplest productive ClangTool in existence: a syntax checker. While this 92 already exists as ``clang-check``, it's important to understand what's 93 going on. 94 95 First, we'll need to create a new directory for our tool and tell CMake 96 that it exists. As this is not going to be a core clang tool, it will 97 live in the ``tools/extra`` repository. 98 99 .. code-block:: console 100 101 cd ~/clang-llvm/llvm/tools/clang 102 mkdir tools/extra/loop-convert 103 echo 'add_subdirectory(loop-convert)' >> tools/extra/CMakeLists.txt 104 vim tools/extra/loop-convert/CMakeLists.txt 105 106 CMakeLists.txt should have the following contents: 107 108 :: 109 110 set(LLVM_LINK_COMPONENTS support) 111 set(LLVM_USED_LIBS clangTooling clangBasic clangAST) 112 113 add_clang_executable(loop-convert 114 LoopConvert.cpp 115 ) 116 target_link_libraries(loop-convert 117 clangTooling 118 clangBasic 119 clangASTMatchers 120 ) 121 122 With that done, Ninja will be able to compile our tool. Let's give it 123 something to compile! Put the following into 124 ``tools/extra/loop-convert/LoopConvert.cpp``. A detailed explanation of 125 why the different parts are needed can be found in the `LibTooling 126 documentation <LibTooling.html>`_. 127 128 .. code-block:: c++ 129 130 // Declares clang::SyntaxOnlyAction. 131 #include "clang/Frontend/FrontendActions.h" 132 #include "clang/Tooling/CommonOptionsParser.h" 133 #include "clang/Tooling/Tooling.h" 134 // Declares llvm::cl::extrahelp. 135 #include "llvm/Support/CommandLine.h" 136 137 using namespace clang::tooling; 138 using namespace llvm; 139 140 // Apply a custom category to all command-line options so that they are the 141 // only ones displayed. 142 static llvm::cl::OptionCategory MyToolCategory("my-tool options"); 143 144 // CommonOptionsParser declares HelpMessage with a description of the common 145 // command-line options related to the compilation database and input files. 146 // It's nice to have this help message in all tools. 147 static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage); 148 149 // A help message for this specific tool can be added afterwards. 150 static cl::extrahelp MoreHelp("\nMore help text..."); 151 152 int main(int argc, const char **argv) { 153 CommonOptionsParser OptionsParser(argc, argv, MyToolCategory); 154 ClangTool Tool(OptionsParser.getCompilations(), 155 OptionsParser.getSourcePathList()); 156 return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get()); 157 } 158 159 And that's it! You can compile our new tool by running ninja from the 160 ``build`` directory. 161 162 .. code-block:: console 163 164 cd ~/clang-llvm/build 165 ninja 166 167 You should now be able to run the syntax checker, which is located in 168 ``~/clang-llvm/build/bin``, on any source file. Try it! 169 170 .. code-block:: console 171 172 cat "int main() { return 0; }" > test.cpp 173 bin/loop-convert test.cpp -- 174 175 Note the two dashes after we specify the source file. The additional 176 options for the compiler are passed after the dashes rather than loading 177 them from a compilation database - there just aren't any options needed 178 right now. 179 180 Intermezzo: Learn AST matcher basics 181 ==================================== 182 183 Clang recently introduced the :doc:`ASTMatcher 184 library <LibASTMatchers>` to provide a simple, powerful, and 185 concise way to describe specific patterns in the AST. Implemented as a 186 DSL powered by macros and templates (see 187 `ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're 188 curious), matchers offer the feel of algebraic data types common to 189 functional programming languages. 190 191 For example, suppose you wanted to examine only binary operators. There 192 is a matcher to do exactly that, conveniently named ``binaryOperator``. 193 I'll give you one guess what this matcher does: 194 195 .. code-block:: c++ 196 197 binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0)))) 198 199 Shockingly, it will match against addition expressions whose left hand 200 side is exactly the literal 0. It will not match against other forms of 201 0, such as ``'\0'`` or ``NULL``, but it will match against macros that 202 expand to 0. The matcher will also not match against calls to the 203 overloaded operator ``'+'``, as there is a separate ``operatorCallExpr`` 204 matcher to handle overloaded operators. 205 206 There are AST matchers to match all the different nodes of the AST, 207 narrowing matchers to only match AST nodes fulfilling specific criteria, 208 and traversal matchers to get from one kind of AST node to another. For 209 a complete list of AST matchers, take a look at the `AST Matcher 210 References <LibASTMatchersReference.html>`_ 211 212 All matcher that are nouns describe entities in the AST and can be 213 bound, so that they can be referred to whenever a match is found. To do 214 so, simply call the method ``bind`` on these matchers, e.g.: 215 216 .. code-block:: c++ 217 218 variable(hasType(isInteger())).bind("intvar") 219 220 Step 2: Using AST matchers 221 ========================== 222 223 Okay, on to using matchers for real. Let's start by defining a matcher 224 which will capture all ``for`` statements that define a new variable 225 initialized to zero. Let's start with matching all ``for`` loops: 226 227 .. code-block:: c++ 228 229 forStmt() 230 231 Next, we want to specify that a single variable is declared in the first 232 portion of the loop, so we can extend the matcher to 233 234 .. code-block:: c++ 235 236 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl())))) 237 238 Finally, we can add the condition that the variable is initialized to 239 zero. 240 241 .. code-block:: c++ 242 243 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 244 hasInitializer(integerLiteral(equals(0)))))))) 245 246 It is fairly easy to read and understand the matcher definition ("match 247 loops whose init portion declares a single variable which is initialized 248 to the integer literal 0"), but deciding that every piece is necessary 249 is more difficult. Note that this matcher will not match loops whose 250 variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of 251 zero besides the integer 0. 252 253 The last step is giving the matcher a name and binding the ``ForStmt`` 254 as we will want to do something with it: 255 256 .. code-block:: c++ 257 258 StatementMatcher LoopMatcher = 259 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 260 hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); 261 262 Once you have defined your matchers, you will need to add a little more 263 scaffolding in order to run them. Matchers are paired with a 264 ``MatchCallback`` and registered with a ``MatchFinder`` object, then run 265 from a ``ClangTool``. More code! 266 267 Add the following to ``LoopConvert.cpp``: 268 269 .. code-block:: c++ 270 271 #include "clang/ASTMatchers/ASTMatchers.h" 272 #include "clang/ASTMatchers/ASTMatchFinder.h" 273 274 using namespace clang; 275 using namespace clang::ast_matchers; 276 277 StatementMatcher LoopMatcher = 278 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 279 hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); 280 281 class LoopPrinter : public MatchFinder::MatchCallback { 282 public : 283 virtual void run(const MatchFinder::MatchResult &Result) { 284 if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop")) 285 FS->dump(); 286 } 287 }; 288 289 And change ``main()`` to: 290 291 .. code-block:: c++ 292 293 int main(int argc, const char **argv) { 294 CommonOptionsParser OptionsParser(argc, argv, MyToolCategory); 295 ClangTool Tool(OptionsParser.getCompilations(), 296 OptionsParser.getSourcePathList()); 297 298 LoopPrinter Printer; 299 MatchFinder Finder; 300 Finder.addMatcher(LoopMatcher, &Printer); 301 302 return Tool.run(newFrontendActionFactory(&Finder).get()); 303 } 304 305 Now, you should be able to recompile and run the code to discover for 306 loops. Create a new file with a few examples, and test out our new 307 handiwork: 308 309 .. code-block:: console 310 311 cd ~/clang-llvm/llvm/llvm_build/ 312 ninja loop-convert 313 vim ~/test-files/simple-loops.cc 314 bin/loop-convert ~/test-files/simple-loops.cc 315 316 Step 3.5: More Complicated Matchers 317 =================================== 318 319 Our simple matcher is capable of discovering for loops, but we would 320 still need to filter out many more ourselves. We can do a good portion 321 of the remaining work with some cleverly chosen matchers, but first we 322 need to decide exactly which properties we want to allow. 323 324 How can we characterize for loops over arrays which would be eligible 325 for translation to range-based syntax? Range based loops over arrays of 326 size ``N`` that: 327 328 - start at index ``0`` 329 - iterate consecutively 330 - end at index ``N-1`` 331 332 We already check for (1), so all we need to add is a check to the loop's 333 condition to ensure that the loop's index variable is compared against 334 ``N`` and another check to ensure that the increment step just 335 increments this same variable. The matcher for (2) is straightforward: 336 require a pre- or post-increment of the same variable declared in the 337 init portion. 338 339 Unfortunately, such a matcher is impossible to write. Matchers contain 340 no logic for comparing two arbitrary AST nodes and determining whether 341 or not they are equal, so the best we can do is matching more than we 342 would like to allow, and punting extra comparisons to the callback. 343 344 In any case, we can start building this sub-matcher. We can require that 345 the increment step be a unary increment like this: 346 347 .. code-block:: c++ 348 349 hasIncrement(unaryOperator(hasOperatorName("++"))) 350 351 Specifying what is incremented introduces another quirk of Clang's AST: 352 Usages of variables are represented as ``DeclRefExpr``'s ("declaration 353 reference expressions") because they are expressions which refer to 354 variable declarations. To find a ``unaryOperator`` that refers to a 355 specific declaration, we can simply add a second condition to it: 356 357 .. code-block:: c++ 358 359 hasIncrement(unaryOperator( 360 hasOperatorName("++"), 361 hasUnaryOperand(declRefExpr()))) 362 363 Furthermore, we can restrict our matcher to only match if the 364 incremented variable is an integer: 365 366 .. code-block:: c++ 367 368 hasIncrement(unaryOperator( 369 hasOperatorName("++"), 370 hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger()))))))) 371 372 And the last step will be to attach an identifier to this variable, so 373 that we can retrieve it in the callback: 374 375 .. code-block:: c++ 376 377 hasIncrement(unaryOperator( 378 hasOperatorName("++"), 379 hasUnaryOperand(declRefExpr(to( 380 varDecl(hasType(isInteger())).bind("incrementVariable")))))) 381 382 We can add this code to the definition of ``LoopMatcher`` and make sure 383 that our program, outfitted with the new matcher, only prints out loops 384 that declare a single variable initialized to zero and have an increment 385 step consisting of a unary increment of some variable. 386 387 Now, we just need to add a matcher to check if the condition part of the 388 ``for`` loop compares a variable against the size of the array. There is 389 only one problem - we don't know which array we're iterating over 390 without looking at the body of the loop! We are again restricted to 391 approximating the result we want with matchers, filling in the details 392 in the callback. So we start with: 393 394 .. code-block:: c++ 395 396 hasCondition(binaryOperator(hasOperatorName("<")) 397 398 It makes sense to ensure that the left-hand side is a reference to a 399 variable, and that the right-hand side has integer type. 400 401 .. code-block:: c++ 402 403 hasCondition(binaryOperator( 404 hasOperatorName("<"), 405 hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))), 406 hasRHS(expr(hasType(isInteger()))))) 407 408 Why? Because it doesn't work. Of the three loops provided in 409 ``test-files/simple.cpp``, zero of them have a matching condition. A 410 quick look at the AST dump of the first for loop, produced by the 411 previous iteration of loop-convert, shows us the answer: 412 413 :: 414 415 (ForStmt 0x173b240 416 (DeclStmt 0x173afc8 417 0x173af50 "int i = 418 (IntegerLiteral 0x173afa8 'int' 0)") 419 <<>> 420 (BinaryOperator 0x173b060 '_Bool' '<' 421 (ImplicitCastExpr 0x173b030 'int' 422 (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int')) 423 (ImplicitCastExpr 0x173b048 'int' 424 (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int'))) 425 (UnaryOperator 0x173b0b0 'int' lvalue prefix '++' 426 (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int')) 427 (CompoundStatement ... 428 429 We already know that the declaration and increments both match, or this 430 loop wouldn't have been dumped. The culprit lies in the implicit cast 431 applied to the first operand (i.e. the LHS) of the less-than operator, 432 an L-value to R-value conversion applied to the expression referencing 433 ``i``. Thankfully, the matcher library offers a solution to this problem 434 in the form of ``ignoringParenImpCasts``, which instructs the matcher to 435 ignore implicit casts and parentheses before continuing to match. 436 Adjusting the condition operator will restore the desired match. 437 438 .. code-block:: c++ 439 440 hasCondition(binaryOperator( 441 hasOperatorName("<"), 442 hasLHS(ignoringParenImpCasts(declRefExpr( 443 to(varDecl(hasType(isInteger())))))), 444 hasRHS(expr(hasType(isInteger()))))) 445 446 After adding binds to the expressions we wished to capture and 447 extracting the identifier strings into variables, we have array-step-2 448 completed. 449 450 Step 4: Retrieving Matched Nodes 451 ================================ 452 453 So far, the matcher callback isn't very interesting: it just dumps the 454 loop's AST. At some point, we will need to make changes to the input 455 source code. Next, we'll work on using the nodes we bound in the 456 previous step. 457 458 The ``MatchFinder::run()`` callback takes a 459 ``MatchFinder::MatchResult&`` as its parameter. We're most interested in 460 its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext`` 461 class to represent contextual information about the AST, as the name 462 implies, though the most functionally important detail is that several 463 operations require an ``ASTContext*`` parameter. More immediately useful 464 is the set of matched nodes, and how we retrieve them. 465 466 Since we bind three variables (identified by ConditionVarName, 467 InitVarName, and IncrementVarName), we can obtain the matched nodes by 468 using the ``getNodeAs()`` member function. 469 470 In ``LoopConvert.cpp`` add 471 472 .. code-block:: c++ 473 474 #include "clang/AST/ASTContext.h" 475 476 Change ``LoopMatcher`` to 477 478 .. code-block:: c++ 479 480 StatementMatcher LoopMatcher = 481 forStmt(hasLoopInit(declStmt( 482 hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0)))) 483 .bind("initVarName")))), 484 hasIncrement(unaryOperator( 485 hasOperatorName("++"), 486 hasUnaryOperand(declRefExpr( 487 to(varDecl(hasType(isInteger())).bind("incVarName")))))), 488 hasCondition(binaryOperator( 489 hasOperatorName("<"), 490 hasLHS(ignoringParenImpCasts(declRefExpr( 491 to(varDecl(hasType(isInteger())).bind("condVarName"))))), 492 hasRHS(expr(hasType(isInteger())))))).bind("forLoop"); 493 494 And change ``LoopPrinter::run`` to 495 496 .. code-block:: c++ 497 498 void LoopPrinter::run(const MatchFinder::MatchResult &Result) { 499 ASTContext *Context = Result.Context; 500 const ForStmt *FS = Result.Nodes.getStmtAs<ForStmt>("forLoop"); 501 // We do not want to convert header files! 502 if (!FS || !Context->getSourceManager().isFromMainFile(FS->getForLoc())) 503 return; 504 const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName"); 505 const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName"); 506 const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName"); 507 508 if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar)) 509 return; 510 llvm::outs() << "Potential array-based loop discovered.\n"; 511 } 512 513 Clang associates a ``VarDecl`` with each variable to represent the variable's 514 declaration. Since the "canonical" form of each declaration is unique by 515 address, all we need to do is make sure neither ``ValueDecl`` (base class of 516 ``VarDecl``) is ``NULL`` and compare the canonical Decls. 517 518 .. code-block:: c++ 519 520 static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) { 521 return First && Second && 522 First->getCanonicalDecl() == Second->getCanonicalDecl(); 523 } 524 525 If execution reaches the end of ``LoopPrinter::run()``, we know that the 526 loop shell that looks like 527 528 .. code-block:: c++ 529 530 for (int i= 0; i < expr(); ++i) { ... } 531 532 For now, we will just print a message explaining that we found a loop. 533 The next section will deal with recursively traversing the AST to 534 discover all changes needed. 535 536 As a side note, it's not as trivial to test if two expressions are the same, 537 though Clang has already done the hard work for us by providing a way to 538 canonicalize expressions: 539 540 .. code-block:: c++ 541 542 static bool areSameExpr(ASTContext *Context, const Expr *First, 543 const Expr *Second) { 544 if (!First || !Second) 545 return false; 546 llvm::FoldingSetNodeID FirstID, SecondID; 547 First->Profile(FirstID, *Context, true); 548 Second->Profile(SecondID, *Context, true); 549 return FirstID == SecondID; 550 } 551 552 This code relies on the comparison between two 553 ``llvm::FoldingSetNodeIDs``. As the documentation for 554 ``Stmt::Profile()`` indicates, the ``Profile()`` member function builds 555 a description of a node in the AST, based on its properties, along with 556 those of its children. ``FoldingSetNodeID`` then serves as a hash we can 557 use to compare expressions. We will need ``areSameExpr`` later. Before 558 you run the new code on the additional loops added to 559 test-files/simple.cpp, try to figure out which ones will be considered 560 potentially convertible. 561