1 =============================================================== 2 Tutorial for building tools using LibTooling and LibASTMatchers 3 =============================================================== 4 5 This document is intended to show how to build a useful source-to-source 6 translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is 7 explicitly aimed at people who are new to Clang, so all you should need 8 is a working knowledge of C++ and the command line. 9 10 In order to work on the compiler, you need some basic knowledge of the 11 abstract syntax tree (AST). To this end, the reader is incouraged to 12 skim the :doc:`Introduction to the Clang 13 AST <IntroductionToTheClangAST>` 14 15 Step 0: Obtaining Clang 16 ======================= 17 18 As Clang is part of the LLVM project, you'll need to download LLVM's 19 source code first. Both Clang and LLVM are maintained as Subversion 20 repositories, but we'll be accessing them through the git mirror. For 21 further information, see the `getting started 22 guide <http://llvm.org/docs/GettingStarted.html>`_. 23 24 .. code-block:: console 25 26 mkdir ~/clang-llvm && cd ~/clang-llvm 27 git clone http://llvm.org/git/llvm.git 28 cd llvm/tools 29 git clone http://llvm.org/git/clang.git 30 cd clang/tools 31 git clone http://llvm.org/git/clang-tools-extra.git extra 32 33 Next you need to obtain the CMake build system and Ninja build tool. You 34 may already have CMake installed, but current binary versions of CMake 35 aren't built with Ninja support. 36 37 .. code-block:: console 38 39 cd ~/clang-llvm 40 git clone https://github.com/martine/ninja.git 41 cd ninja 42 git checkout release 43 ./bootstrap.py 44 sudo cp ninja /usr/bin/ 45 46 cd ~/clang-llvm 47 git clone git://cmake.org/stage/cmake.git 48 cd cmake 49 git checkout next 50 ./bootstrap 51 make 52 sudo make install 53 54 Okay. Now we'll build Clang! 55 56 .. code-block:: console 57 58 cd ~/clang-llvm 59 mkdir build && cd build 60 cmake -G Ninja ../llvm -DLLVM_BUILD_TESTS=ON # Enable tests; default is off. 61 ninja 62 ninja check # Test LLVM only. 63 ninja clang-test # Test Clang only. 64 ninja install 65 66 And we're live. 67 68 All of the tests should pass, though there is a (very) small chance that 69 you can catch LLVM and Clang out of sync. Running ``'git svn rebase'`` 70 in both the llvm and clang directories should fix any problems. 71 72 Finally, we want to set Clang as its own compiler. 73 74 .. code-block:: console 75 76 cd ~/clang-llvm/build 77 ccmake ../llvm 78 79 The second command will bring up a GUI for configuring Clang. You need 80 to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on 81 advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to 82 ``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to 83 configure, then ``'g'`` to generate CMake's files. 84 85 Finally, run ninja one last time, and you're done. 86 87 Step 1: Create a ClangTool 88 ========================== 89 90 Now that we have enough background knowledge, it's time to create the 91 simplest productive ClangTool in existence: a syntax checker. While this 92 already exists as ``clang-check``, it's important to understand what's 93 going on. 94 95 First, we'll need to create a new directory for our tool and tell CMake 96 that it exists. As this is not going to be a core clang tool, it will 97 live in the ``tools/extra`` repository. 98 99 .. code-block:: console 100 101 cd ~/clang-llvm/llvm/tools/clang 102 mkdir tools/extra/loop-convert 103 echo 'add_subdirectory(loop-convert)' >> tools/extra/CMakeLists.txt 104 vim tools/extra/loop-convert/CMakeLists.txt 105 106 CMakeLists.txt should have the following contents: 107 108 :: 109 110 set(LLVM_LINK_COMPONENTS support) 111 set(LLVM_USED_LIBS clangTooling clangBasic clangAST) 112 113 add_clang_executable(loop-convert 114 LoopConvert.cpp 115 ) 116 target_link_libraries(loop-convert 117 clangTooling 118 clangBasic 119 clangASTMatchers 120 ) 121 122 With that done, Ninja will be able to compile our tool. Let's give it 123 something to compile! Put the following into 124 ``tools/extra/loop-convert/LoopConvert.cpp``. A detailed explanation of 125 why the different parts are needed can be found in the `LibTooling 126 documentation <LibTooling.html>`_. 127 128 .. code-block:: c++ 129 130 // Declares clang::SyntaxOnlyAction. 131 #include "clang/Frontend/FrontendActions.h" 132 #include "clang/Tooling/CommonOptionsParser.h" 133 #include "clang/Tooling/Tooling.h" 134 // Declares llvm::cl::extrahelp. 135 #include "llvm/Support/CommandLine.h" 136 137 using namespace clang::tooling; 138 using namespace llvm; 139 140 // CommonOptionsParser declares HelpMessage with a description of the common 141 // command-line options related to the compilation database and input files. 142 // It's nice to have this help message in all tools. 143 static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage); 144 145 // A help message for this specific tool can be added afterwards. 146 static cl::extrahelp MoreHelp("\nMore help text..."); 147 148 int main(int argc, const char **argv) { 149 CommonOptionsParser OptionsParser(argc, argv); 150 ClangTool Tool(OptionsParser.getCompilations(), 151 OptionsParser.getSourcePathList()); 152 return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>()); 153 } 154 155 And that's it! You can compile our new tool by running ninja from the 156 ``build`` directory. 157 158 .. code-block:: console 159 160 cd ~/clang-llvm/build 161 ninja 162 163 You should now be able to run the syntax checker, which is located in 164 ``~/clang-llvm/build/bin``, on any source file. Try it! 165 166 .. code-block:: console 167 168 cat "int main() { return 0; }" > test.cpp 169 bin/loop-convert test.cpp -- 170 171 Note the two dashes after we specify the source file. The additional 172 options for the compiler are passed after the dashes rather than loading 173 them from a compilation database - there just aren't any options needed 174 right now. 175 176 Intermezzo: Learn AST matcher basics 177 ==================================== 178 179 Clang recently introduced the :doc:`ASTMatcher 180 library <LibASTMatchers>` to provide a simple, powerful, and 181 concise way to describe specific patterns in the AST. Implemented as a 182 DSL powered by macros and templates (see 183 `ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're 184 curious), matchers offer the feel of algebraic data types common to 185 functional programming languages. 186 187 For example, suppose you wanted to examine only binary operators. There 188 is a matcher to do exactly that, conveniently named ``binaryOperator``. 189 I'll give you one guess what this matcher does: 190 191 .. code-block:: c++ 192 193 binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0)))) 194 195 Shockingly, it will match against addition expressions whose left hand 196 side is exactly the literal 0. It will not match against other forms of 197 0, such as ``'\0'`` or ``NULL``, but it will match against macros that 198 expand to 0. The matcher will also not match against calls to the 199 overloaded operator ``'+'``, as there is a separate ``operatorCallExpr`` 200 matcher to handle overloaded operators. 201 202 There are AST matchers to match all the different nodes of the AST, 203 narrowing matchers to only match AST nodes fulfilling specific criteria, 204 and traversal matchers to get from one kind of AST node to another. For 205 a complete list of AST matchers, take a look at the `AST Matcher 206 References <LibASTMatchersReference.html>`_ 207 208 All matcher that are nouns describe entities in the AST and can be 209 bound, so that they can be referred to whenever a match is found. To do 210 so, simply call the method ``bind`` on these matchers, e.g.: 211 212 .. code-block:: c++ 213 214 variable(hasType(isInteger())).bind("intvar") 215 216 Step 2: Using AST matchers 217 ========================== 218 219 Okay, on to using matchers for real. Let's start by defining a matcher 220 which will capture all ``for`` statements that define a new variable 221 initialized to zero. Let's start with matching all ``for`` loops: 222 223 .. code-block:: c++ 224 225 forStmt() 226 227 Next, we want to specify that a single variable is declared in the first 228 portion of the loop, so we can extend the matcher to 229 230 .. code-block:: c++ 231 232 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl())))) 233 234 Finally, we can add the condition that the variable is initialized to 235 zero. 236 237 .. code-block:: c++ 238 239 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 240 hasInitializer(integerLiteral(equals(0)))))))) 241 242 It is fairly easy to read and understand the matcher definition ("match 243 loops whose init portion declares a single variable which is initialized 244 to the integer literal 0"), but deciding that every piece is necessary 245 is more difficult. Note that this matcher will not match loops whose 246 variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of 247 zero besides the integer 0. 248 249 The last step is giving the matcher a name and binding the ``ForStmt`` 250 as we will want to do something with it: 251 252 .. code-block:: c++ 253 254 StatementMatcher LoopMatcher = 255 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 256 hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); 257 258 Once you have defined your matchers, you will need to add a little more 259 scaffolding in order to run them. Matchers are paired with a 260 ``MatchCallback`` and registered with a ``MatchFinder`` object, then run 261 from a ``ClangTool``. More code! 262 263 Add the following to ``LoopConvert.cpp``: 264 265 .. code-block:: c++ 266 267 #include "clang/ASTMatchers/ASTMatchers.h" 268 #include "clang/ASTMatchers/ASTMatchFinder.h" 269 270 using namespace clang; 271 using namespace clang::ast_matchers; 272 273 StatementMatcher LoopMatcher = 274 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 275 hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); 276 277 class LoopPrinter : public MatchFinder::MatchCallback { 278 public : 279 virtual void run(const MatchFinder::MatchResult &Result) { 280 if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop")) 281 FS->dump(); 282 } 283 }; 284 285 And change ``main()`` to: 286 287 .. code-block:: c++ 288 289 int main(int argc, const char **argv) { 290 CommonOptionsParser OptionsParser(argc, argv); 291 ClangTool Tool(OptionsParser.getCompilations(), 292 OptionsParser.getSourcePathList()); 293 294 LoopPrinter Printer; 295 MatchFinder Finder; 296 Finder.addMatcher(LoopMatcher, &Printer); 297 298 return Tool.run(newFrontendActionFactory(&Finder)); 299 } 300 301 Now, you should be able to recompile and run the code to discover for 302 loops. Create a new file with a few examples, and test out our new 303 handiwork: 304 305 .. code-block:: console 306 307 cd ~/clang-llvm/llvm/llvm_build/ 308 ninja loop-convert 309 vim ~/test-files/simple-loops.cc 310 bin/loop-convert ~/test-files/simple-loops.cc 311 312 Step 3.5: More Complicated Matchers 313 =================================== 314 315 Our simple matcher is capable of discovering for loops, but we would 316 still need to filter out many more ourselves. We can do a good portion 317 of the remaining work with some cleverly chosen matchers, but first we 318 need to decide exactly which properties we want to allow. 319 320 How can we characterize for loops over arrays which would be eligible 321 for translation to range-based syntax? Range based loops over arrays of 322 size ``N`` that: 323 324 - start at index ``0`` 325 - iterate consecutively 326 - end at index ``N-1`` 327 328 We already check for (1), so all we need to add is a check to the loop's 329 condition to ensure that the loop's index variable is compared against 330 ``N`` and another check to ensure that the increment step just 331 increments this same variable. The matcher for (2) is straightforward: 332 require a pre- or post-increment of the same variable declared in the 333 init portion. 334 335 Unfortunately, such a matcher is impossible to write. Matchers contain 336 no logic for comparing two arbitrary AST nodes and determining whether 337 or not they are equal, so the best we can do is matching more than we 338 would like to allow, and punting extra comparisons to the callback. 339 340 In any case, we can start building this sub-matcher. We can require that 341 the increment step be a unary increment like this: 342 343 .. code-block:: c++ 344 345 hasIncrement(unaryOperator(hasOperatorName("++"))) 346 347 Specifying what is incremented introduces another quirk of Clang's AST: 348 Usages of variables are represented as ``DeclRefExpr``'s ("declaration 349 reference expressions") because they are expressions which refer to 350 variable declarations. To find a ``unaryOperator`` that refers to a 351 specific declaration, we can simply add a second condition to it: 352 353 .. code-block:: c++ 354 355 hasIncrement(unaryOperator( 356 hasOperatorName("++"), 357 hasUnaryOperand(declRefExpr()))) 358 359 Furthermore, we can restrict our matcher to only match if the 360 incremented variable is an integer: 361 362 .. code-block:: c++ 363 364 hasIncrement(unaryOperator( 365 hasOperatorName("++"), 366 hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger()))))))) 367 368 And the last step will be to attach an identifier to this variable, so 369 that we can retrieve it in the callback: 370 371 .. code-block:: c++ 372 373 hasIncrement(unaryOperator( 374 hasOperatorName("++"), 375 hasUnaryOperand(declRefExpr(to( 376 varDecl(hasType(isInteger())).bind("incrementVariable")))))) 377 378 We can add this code to the definition of ``LoopMatcher`` and make sure 379 that our program, outfitted with the new matcher, only prints out loops 380 that declare a single variable initialized to zero and have an increment 381 step consisting of a unary increment of some variable. 382 383 Now, we just need to add a matcher to check if the condition part of the 384 ``for`` loop compares a variable against the size of the array. There is 385 only one problem - we don't know which array we're iterating over 386 without looking at the body of the loop! We are again restricted to 387 approximating the result we want with matchers, filling in the details 388 in the callback. So we start with: 389 390 .. code-block:: c++ 391 392 hasCondition(binaryOperator(hasOperatorName("<")) 393 394 It makes sense to ensure that the left-hand side is a reference to a 395 variable, and that the right-hand side has integer type. 396 397 .. code-block:: c++ 398 399 hasCondition(binaryOperator( 400 hasOperatorName("<"), 401 hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))), 402 hasRHS(expr(hasType(isInteger()))))) 403 404 Why? Because it doesn't work. Of the three loops provided in 405 ``test-files/simple.cpp``, zero of them have a matching condition. A 406 quick look at the AST dump of the first for loop, produced by the 407 previous iteration of loop-convert, shows us the answer: 408 409 :: 410 411 (ForStmt 0x173b240 412 (DeclStmt 0x173afc8 413 0x173af50 "int i = 414 (IntegerLiteral 0x173afa8 'int' 0)") 415 <<>> 416 (BinaryOperator 0x173b060 '_Bool' '<' 417 (ImplicitCastExpr 0x173b030 'int' 418 (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int')) 419 (ImplicitCastExpr 0x173b048 'int' 420 (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int'))) 421 (UnaryOperator 0x173b0b0 'int' lvalue prefix '++' 422 (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int')) 423 (CompoundStatement ... 424 425 We already know that the declaration and increments both match, or this 426 loop wouldn't have been dumped. The culprit lies in the implicit cast 427 applied to the first operand (i.e. the LHS) of the less-than operator, 428 an L-value to R-value conversion applied to the expression referencing 429 ``i``. Thankfully, the matcher library offers a solution to this problem 430 in the form of ``ignoringParenImpCasts``, which instructs the matcher to 431 ignore implicit casts and parentheses before continuing to match. 432 Adjusting the condition operator will restore the desired match. 433 434 .. code-block:: c++ 435 436 hasCondition(binaryOperator( 437 hasOperatorName("<"), 438 hasLHS(ignoringParenImpCasts(declRefExpr( 439 to(varDecl(hasType(isInteger())))))), 440 hasRHS(expr(hasType(isInteger()))))) 441 442 After adding binds to the expressions we wished to capture and 443 extracting the identifier strings into variables, we have array-step-2 444 completed. 445 446 Step 4: Retrieving Matched Nodes 447 ================================ 448 449 So far, the matcher callback isn't very interesting: it just dumps the 450 loop's AST. At some point, we will need to make changes to the input 451 source code. Next, we'll work on using the nodes we bound in the 452 previous step. 453 454 The ``MatchFinder::run()`` callback takes a 455 ``MatchFinder::MatchResult&`` as its parameter. We're most interested in 456 its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext`` 457 class to represent contextual information about the AST, as the name 458 implies, though the most functionally important detail is that several 459 operations require an ``ASTContext*`` parameter. More immediately useful 460 is the set of matched nodes, and how we retrieve them. 461 462 Since we bind three variables (identified by ConditionVarName, 463 InitVarName, and IncrementVarName), we can obtain the matched nodes by 464 using the ``getNodeAs()`` member function. 465 466 In ``LoopConvert.cpp`` add 467 468 .. code-block:: c++ 469 470 #include "clang/AST/ASTContext.h" 471 472 Change ``LoopMatcher`` to 473 474 .. code-block:: c++ 475 476 StatementMatcher LoopMatcher = 477 forStmt(hasLoopInit(declStmt( 478 hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0)))) 479 .bind("initVarName")))), 480 hasIncrement(unaryOperator( 481 hasOperatorName("++"), 482 hasUnaryOperand(declRefExpr( 483 to(varDecl(hasType(isInteger())).bind("incVarName")))))), 484 hasCondition(binaryOperator( 485 hasOperatorName("<"), 486 hasLHS(ignoringParenImpCasts(declRefExpr( 487 to(varDecl(hasType(isInteger())).bind("condVarName"))))), 488 hasRHS(expr(hasType(isInteger())))))).bind("forLoop"); 489 490 And change ``LoopPrinter::run`` to 491 492 .. code-block:: c++ 493 494 void LoopPrinter::run(const MatchFinder::MatchResult &Result) { 495 ASTContext *Context = Result.Context; 496 const ForStmt *FS = Result.Nodes.getStmtAs<ForStmt>("forLoop"); 497 // We do not want to convert header files! 498 if (!FS || !Context->getSourceManager().isFromMainFile(FS->getForLoc())) 499 return; 500 const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName"); 501 const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName"); 502 const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName"); 503 504 if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar)) 505 return; 506 llvm::outs() << "Potential array-based loop discovered.\n"; 507 } 508 509 Clang associates a ``VarDecl`` with each variable to represent the variable's 510 declaration. Since the "canonical" form of each declaration is unique by 511 address, all we need to do is make sure neither ``ValueDecl`` (base class of 512 ``VarDecl``) is ``NULL`` and compare the canonical Decls. 513 514 .. code-block:: c++ 515 516 static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) { 517 return First && Second && 518 First->getCanonicalDecl() == Second->getCanonicalDecl(); 519 } 520 521 If execution reaches the end of ``LoopPrinter::run()``, we know that the 522 loop shell that looks like 523 524 .. code-block:: c++ 525 526 for (int i= 0; i < expr(); ++i) { ... } 527 528 For now, we will just print a message explaining that we found a loop. 529 The next section will deal with recursively traversing the AST to 530 discover all changes needed. 531 532 As a side note, it's not as trivial to test if two expressions are the same, 533 though Clang has already done the hard work for us by providing a way to 534 canonicalize expressions: 535 536 .. code-block:: c++ 537 538 static bool areSameExpr(ASTContext *Context, const Expr *First, 539 const Expr *Second) { 540 if (!First || !Second) 541 return false; 542 llvm::FoldingSetNodeID FirstID, SecondID; 543 First->Profile(FirstID, *Context, true); 544 Second->Profile(SecondID, *Context, true); 545 return FirstID == SecondID; 546 } 547 548 This code relies on the comparison between two 549 ``llvm::FoldingSetNodeIDs``. As the documentation for 550 ``Stmt::Profile()`` indicates, the ``Profile()`` member function builds 551 a description of a node in the AST, based on its properties, along with 552 those of its children. ``FoldingSetNodeID`` then serves as a hash we can 553 use to compare expressions. We will need ``areSameExpr`` later. Before 554 you run the new code on the additional loops added to 555 test-files/simple.cpp, try to figure out which ones will be considered 556 potentially convertible. 557