1 =============================================================== 2 Tutorial for building tools using LibTooling and LibASTMatchers 3 =============================================================== 4 5 This document is intended to show how to build a useful source-to-source 6 translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is 7 explicitly aimed at people who are new to Clang, so all you should need 8 is a working knowledge of C++ and the command line. 9 10 In order to work on the compiler, you need some basic knowledge of the 11 abstract syntax tree (AST). To this end, the reader is incouraged to 12 skim the :doc:`Introduction to the Clang 13 AST <IntroductionToTheClangAST>` 14 15 Step 0: Obtaining Clang 16 ======================= 17 18 As Clang is part of the LLVM project, you'll need to download LLVM's 19 source code first. Both Clang and LLVM are maintained as Subversion 20 repositories, but we'll be accessing them through the git mirror. For 21 further information, see the `getting started 22 guide <http://llvm.org/docs/GettingStarted.html>`_. 23 24 .. code-block:: console 25 26 mkdir ~/clang-llvm && cd ~/clang-llvm 27 git clone http://llvm.org/git/llvm.git 28 cd llvm/tools 29 git clone http://llvm.org/git/clang.git 30 31 Next you need to obtain the CMake build system and Ninja build tool. You 32 may already have CMake installed, but current binary versions of CMake 33 aren't built with Ninja support. 34 35 .. code-block:: console 36 37 cd ~/clang-llvm 38 git clone https://github.com/martine/ninja.git 39 cd ninja 40 git checkout release 41 ./bootstrap.py 42 sudo cp ninja /usr/bin/ 43 44 cd ~/clang-llvm 45 git clone git://cmake.org/stage/cmake.git 46 cd cmake 47 git checkout next 48 ./bootstrap 49 make 50 sudo make install 51 52 Okay. Now we'll build Clang! 53 54 .. code-block:: console 55 56 cd ~/clang-llvm 57 mkdir build && cd build 58 cmake -G Ninja ../llvm -DLLVM_BUILD_TESTS=ON # Enable tests; default is off. 59 ninja 60 ninja check # Test LLVM only. 61 ninja clang-test # Test Clang only. 62 ninja install 63 64 And we're live. 65 66 All of the tests should pass, though there is a (very) small chance that 67 you can catch LLVM and Clang out of sync. Running ``'git svn rebase'`` 68 in both the llvm and clang directories should fix any problems. 69 70 Finally, we want to set Clang as its own compiler. 71 72 .. code-block:: console 73 74 cd ~/clang-llvm/build 75 ccmake ../llvm 76 77 The second command will bring up a GUI for configuring Clang. You need 78 to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on 79 advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to 80 ``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to 81 configure, then ``'g'`` to generate CMake's files. 82 83 Finally, run ninja one last time, and you're done. 84 85 Step 1: Create a ClangTool 86 ========================== 87 88 Now that we have enough background knowledge, it's time to create the 89 simplest productive ClangTool in existence: a syntax checker. While this 90 already exists as ``clang-check``, it's important to understand what's 91 going on. 92 93 First, we'll need to create a new directory for our tool and tell CMake 94 that it exists. As this is not going to be a core clang tool, it will 95 live in the ``tools/extra`` repository. 96 97 .. code-block:: console 98 99 cd ~/clang-llvm/llvm/tools/clang 100 mkdir tools/extra/loop-convert 101 echo 'add_subdirectory(loop-convert)' >> tools/extra/CMakeLists.txt 102 vim tools/extra/loop-convert/CMakeLists.txt 103 104 CMakeLists.txt should have the following contents: 105 106 :: 107 108 set(LLVM_LINK_COMPONENTS support) 109 set(LLVM_USED_LIBS clangTooling clangBasic clangAST) 110 111 add_clang_executable(loop-convert 112 LoopConvert.cpp 113 ) 114 target_link_libraries(loop-convert 115 clangTooling 116 clangBasic 117 clangASTMatchers 118 ) 119 120 With that done, Ninja will be able to compile our tool. Let's give it 121 something to compile! Put the following into 122 ``tools/extra/loop-convert/LoopConvert.cpp``. A detailed explanation of 123 why the different parts are needed can be found in the `LibTooling 124 documentation <LibTooling.html>`_. 125 126 .. code-block:: c++ 127 128 // Declares clang::SyntaxOnlyAction. 129 #include "clang/Frontend/FrontendActions.h" 130 #include "clang/Tooling/CommonOptionsParser.h" 131 #include "clang/Tooling/Tooling.h" 132 // Declares llvm::cl::extrahelp. 133 #include "llvm/Support/CommandLine.h" 134 135 using namespace clang::tooling; 136 using namespace llvm; 137 138 // CommonOptionsParser declares HelpMessage with a description of the common 139 // command-line options related to the compilation database and input files. 140 // It's nice to have this help message in all tools. 141 static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage); 142 143 // A help message for this specific tool can be added afterwards. 144 static cl::extrahelp MoreHelp("\nMore help text..."); 145 146 int main(int argc, const char **argv) { 147 CommonOptionsParser OptionsParser(argc, argv); 148 ClangTool Tool(OptionsParser.getCompilations(), 149 OptionsParser.getSourcePathList()); 150 return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>()); 151 } 152 153 And that's it! You can compile our new tool by running ninja from the 154 ``build`` directory. 155 156 .. code-block:: console 157 158 cd ~/clang-llvm/build 159 ninja 160 161 You should now be able to run the syntax checker, which is located in 162 ``~/clang-llvm/build/bin``, on any source file. Try it! 163 164 .. code-block:: console 165 166 cat "void main() {}" > test.cpp 167 bin/loop-convert test.cpp -- 168 169 Note the two dashes after we specify the source file. The additional 170 options for the compiler are passed after the dashes rather than loading 171 them from a compilation database - there just aren't any options needed 172 right now. 173 174 Intermezzo: Learn AST matcher basics 175 ==================================== 176 177 Clang recently introduced the :doc:`ASTMatcher 178 library <LibASTMatchers>` to provide a simple, powerful, and 179 concise way to describe specific patterns in the AST. Implemented as a 180 DSL powered by macros and templates (see 181 `ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're 182 curious), matchers offer the feel of algebraic data types common to 183 functional programming languages. 184 185 For example, suppose you wanted to examine only binary operators. There 186 is a matcher to do exactly that, conveniently named ``binaryOperator``. 187 I'll give you one guess what this matcher does: 188 189 .. code-block:: c++ 190 191 binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0)))) 192 193 Shockingly, it will match against addition expressions whose left hand 194 side is exactly the literal 0. It will not match against other forms of 195 0, such as ``'\0'`` or ``NULL``, but it will match against macros that 196 expand to 0. The matcher will also not match against calls to the 197 overloaded operator ``'+'``, as there is a separate ``operatorCallExpr`` 198 matcher to handle overloaded operators. 199 200 There are AST matchers to match all the different nodes of the AST, 201 narrowing matchers to only match AST nodes fulfilling specific criteria, 202 and traversal matchers to get from one kind of AST node to another. For 203 a complete list of AST matchers, take a look at the `AST Matcher 204 References <LibASTMatchersReference.html>`_ 205 206 All matcher that are nouns describe entities in the AST and can be 207 bound, so that they can be referred to whenever a match is found. To do 208 so, simply call the method ``bind`` on these matchers, e.g.: 209 210 .. code-block:: c++ 211 212 variable(hasType(isInteger())).bind("intvar") 213 214 Step 2: Using AST matchers 215 ========================== 216 217 Okay, on to using matchers for real. Let's start by defining a matcher 218 which will capture all ``for`` statements that define a new variable 219 initialized to zero. Let's start with matching all ``for`` loops: 220 221 .. code-block:: c++ 222 223 forStmt() 224 225 Next, we want to specify that a single variable is declared in the first 226 portion of the loop, so we can extend the matcher to 227 228 .. code-block:: c++ 229 230 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl())))) 231 232 Finally, we can add the condition that the variable is initialized to 233 zero. 234 235 .. code-block:: c++ 236 237 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 238 hasInitializer(integerLiteral(equals(0)))))))) 239 240 It is fairly easy to read and understand the matcher definition ("match 241 loops whose init portion declares a single variable which is initialized 242 to the integer literal 0"), but deciding that every piece is necessary 243 is more difficult. Note that this matcher will not match loops whose 244 variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of 245 zero besides the integer 0. 246 247 The last step is giving the matcher a name and binding the ``ForStmt`` 248 as we will want to do something with it: 249 250 .. code-block:: c++ 251 252 StatementMatcher LoopMatcher = 253 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 254 hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); 255 256 Once you have defined your matchers, you will need to add a little more 257 scaffolding in order to run them. Matchers are paired with a 258 ``MatchCallback`` and registered with a ``MatchFinder`` object, then run 259 from a ``ClangTool``. More code! 260 261 Add the following to ``LoopConvert.cpp``: 262 263 .. code-block:: c++ 264 265 #include "clang/ASTMatchers/ASTMatchers.h" 266 #include "clang/ASTMatchers/ASTMatchFinder.h" 267 268 using namespace clang; 269 using namespace clang::ast_matchers; 270 271 StatementMatcher LoopMatcher = 272 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 273 hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); 274 275 class LoopPrinter : public MatchFinder::MatchCallback { 276 public : 277 virtual void run(const MatchFinder::MatchResult &Result) { 278 if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop")) 279 FS->dump(); 280 }; 281 282 And change ``main()`` to: 283 284 .. code-block:: c++ 285 286 int main(int argc, const char **argv) { 287 CommonOptionsParser OptionsParser(argc, argv); 288 ClangTool Tool(OptionsParser.getCompilations(), 289 OptionsParser.getSourcePathList()); 290 291 LoopPrinter Printer; 292 MatchFinder Finder; 293 Finder.addMatcher(LoopMatcher, &Printer); 294 295 return Tool.run(newFrontendActionFactory(&Finder)); 296 } 297 298 Now, you should be able to recompile and run the code to discover for 299 loops. Create a new file with a few examples, and test out our new 300 handiwork: 301 302 .. code-block:: console 303 304 cd ~/clang-llvm/llvm/llvm_build/ 305 ninja loop-convert 306 vim ~/test-files/simple-loops.cc 307 bin/loop-convert ~/test-files/simple-loops.cc 308 309 Step 3.5: More Complicated Matchers 310 =================================== 311 312 Our simple matcher is capable of discovering for loops, but we would 313 still need to filter out many more ourselves. We can do a good portion 314 of the remaining work with some cleverly chosen matchers, but first we 315 need to decide exactly which properties we want to allow. 316 317 How can we characterize for loops over arrays which would be eligible 318 for translation to range-based syntax? Range based loops over arrays of 319 size ``N`` that: 320 321 - start at index ``0`` 322 - iterate consecutively 323 - end at index ``N-1`` 324 325 We already check for (1), so all we need to add is a check to the loop's 326 condition to ensure that the loop's index variable is compared against 327 ``N`` and another check to ensure that the increment step just 328 increments this same variable. The matcher for (2) is straightforward: 329 require a pre- or post-increment of the same variable declared in the 330 init portion. 331 332 Unfortunately, such a matcher is impossible to write. Matchers contain 333 no logic for comparing two arbitrary AST nodes and determining whether 334 or not they are equal, so the best we can do is matching more than we 335 would like to allow, and punting extra comparisons to the callback. 336 337 In any case, we can start building this sub-matcher. We can require that 338 the increment step be a unary increment like this: 339 340 .. code-block:: c++ 341 342 hasIncrement(unaryOperator(hasOperatorName("++"))) 343 344 Specifying what is incremented introduces another quirk of Clang's AST: 345 Usages of variables are represented as ``DeclRefExpr``'s ("declaration 346 reference expressions") because they are expressions which refer to 347 variable declarations. To find a ``unaryOperator`` that refers to a 348 specific declaration, we can simply add a second condition to it: 349 350 .. code-block:: c++ 351 352 hasIncrement(unaryOperator( 353 hasOperatorName("++"), 354 hasUnaryOperand(declRefExpr()))) 355 356 Furthermore, we can restrict our matcher to only match if the 357 incremented variable is an integer: 358 359 .. code-block:: c++ 360 361 hasIncrement(unaryOperator( 362 hasOperatorName("++"), 363 hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger()))))))) 364 365 And the last step will be to attach an identifier to this variable, so 366 that we can retrieve it in the callback: 367 368 .. code-block:: c++ 369 370 hasIncrement(unaryOperator( 371 hasOperatorName("++"), 372 hasUnaryOperand(declRefExpr(to( 373 varDecl(hasType(isInteger())).bind("incrementVariable")))))) 374 375 We can add this code to the definition of ``LoopMatcher`` and make sure 376 that our program, outfitted with the new matcher, only prints out loops 377 that declare a single variable initialized to zero and have an increment 378 step consisting of a unary increment of some variable. 379 380 Now, we just need to add a matcher to check if the condition part of the 381 ``for`` loop compares a variable against the size of the array. There is 382 only one problem - we don't know which array we're iterating over 383 without looking at the body of the loop! We are again restricted to 384 approximating the result we want with matchers, filling in the details 385 in the callback. So we start with: 386 387 .. code-block:: c++ 388 389 hasCondition(binaryOperator(hasOperatorName("<")) 390 391 It makes sense to ensure that the left-hand side is a reference to a 392 variable, and that the right-hand side has integer type. 393 394 .. code-block:: c++ 395 396 hasCondition(binaryOperator( 397 hasOperatorName("<"), 398 hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))), 399 hasRHS(expr(hasType(isInteger()))))) 400 401 Why? Because it doesn't work. Of the three loops provided in 402 ``test-files/simple.cpp``, zero of them have a matching condition. A 403 quick look at the AST dump of the first for loop, produced by the 404 previous iteration of loop-convert, shows us the answer: 405 406 :: 407 408 (ForStmt 0x173b240 409 (DeclStmt 0x173afc8 410 0x173af50 "int i = 411 (IntegerLiteral 0x173afa8 'int' 0)") 412 <<>> 413 (BinaryOperator 0x173b060 '_Bool' '<' 414 (ImplicitCastExpr 0x173b030 'int' 415 (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int')) 416 (ImplicitCastExpr 0x173b048 'int' 417 (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int'))) 418 (UnaryOperator 0x173b0b0 'int' lvalue prefix '++' 419 (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int')) 420 (CompoundStatement 421 422 We already know that the declaration and increments both match, or this 423 loop wouldn't have been dumped. The culprit lies in the implicit cast 424 applied to the first operand (i.e. the LHS) of the less-than operator, 425 an L-value to R-value conversion applied to the expression referencing 426 ``i``. Thankfully, the matcher library offers a solution to this problem 427 in the form of ``ignoringParenImpCasts``, which instructs the matcher to 428 ignore implicit casts and parentheses before continuing to match. 429 Adjusting the condition operator will restore the desired match. 430 431 .. code-block:: c++ 432 433 hasCondition(binaryOperator( 434 hasOperatorName("<"), 435 hasLHS(ignoringParenImpCasts(declRefExpr( 436 to(varDecl(hasType(isInteger())))))), 437 hasRHS(expr(hasType(isInteger()))))) 438 439 After adding binds to the expressions we wished to capture and 440 extracting the identifier strings into variables, we have array-step-2 441 completed. 442 443 Step 4: Retrieving Matched Nodes 444 ================================ 445 446 So far, the matcher callback isn't very interesting: it just dumps the 447 loop's AST. At some point, we will need to make changes to the input 448 source code. Next, we'll work on using the nodes we bound in the 449 previous step. 450 451 The ``MatchFinder::run()`` callback takes a 452 ``MatchFinder::MatchResult&`` as its parameter. We're most interested in 453 its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext`` 454 class to represent contextual information about the AST, as the name 455 implies, though the most functionally important detail is that several 456 operations require an ``ASTContext*`` parameter. More immediately useful 457 is the set of matched nodes, and how we retrieve them. 458 459 Since we bind three variables (identified by ConditionVarName, 460 InitVarName, and IncrementVarName), we can obtain the matched nodes by 461 using the ``getNodeAs()`` member function. 462 463 In ``LoopActions.cpp``: 464 465 .. code-block:: c++ 466 467 #include "clang/AST/ASTContext.h" 468 469 void LoopPrinter::run(const MatchFinder::MatchResult &Result) { 470 ASTContext *Context = Result.Context; 471 const ForStmt *FS = Result.Nodes.getStmtAs<ForStmt>(LoopName); 472 // We do not want to convert header files! 473 if (!FS || !Context->getSourceManager().isFromMainFile(FS->getForLoc())) 474 return; 475 const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>(IncrementVarName); 476 const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>(ConditionVarName); 477 const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>(InitVarName); 478 479 Now that we have the three variables, represented by their respective 480 declarations, let's make sure that they're all the same, using a helper 481 function I call ``areSameVariable()``. 482 483 .. code-block:: c++ 484 485 if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar)) 486 return; 487 llvm::outs() << "Potential array-based loop discovered.\n"; 488 } 489 490 If execution reaches the end of ``LoopPrinter::run()``, we know that the 491 loop shell that looks like 492 493 .. code-block:: c++ 494 495 for (int i= 0; i < expr(); ++i) { ... } 496 497 For now, we will just print a message explaining that we found a loop. 498 The next section will deal with recursively traversing the AST to 499 discover all changes needed. 500 501 As a side note, here is the implementation of ``areSameVariable``. Clang 502 associates a ``VarDecl`` with each variable to represent the variable's 503 declaration. Since the "canonical" form of each declaration is unique by 504 address, all we need to do is make sure neither ``ValueDecl`` (base 505 class of ``VarDecl``) is ``NULL`` and compare the canonical Decls. 506 507 .. code-block:: c++ 508 509 static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) { 510 return First && Second && 511 First->getCanonicalDecl() == Second->getCanonicalDecl(); 512 } 513 514 It's not as trivial to test if two expressions are the same, though 515 Clang has already done the hard work for us by providing a way to 516 canonicalize expressions: 517 518 .. code-block:: c++ 519 520 static bool areSameExpr(ASTContext *Context, const Expr *First, 521 const Expr *Second) { 522 if (!First || !Second) 523 return false; 524 llvm::FoldingSetNodeID FirstID, SecondID; 525 First->Profile(FirstID, *Context, true); 526 Second->Profile(SecondID, *Context, true); 527 return FirstID == SecondID; 528 } 529 530 This code relies on the comparison between two 531 ``llvm::FoldingSetNodeIDs``. As the documentation for 532 ``Stmt::Profile()`` indicates, the ``Profile()`` member function builds 533 a description of a node in the AST, based on its properties, along with 534 those of its children. ``FoldingSetNodeID`` then serves as a hash we can 535 use to compare expressions. We will need ``areSameExpr`` later. Before 536 you run the new code on the additional loops added to 537 test-files/simple.cpp, try to figure out which ones will be considered 538 potentially convertible. 539