Home | History | Annotate | Download | only in docs
      1 ===============================================================
      2 Tutorial for building tools using LibTooling and LibASTMatchers
      3 ===============================================================
      4 
      5 This document is intended to show how to build a useful source-to-source
      6 translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is
      7 explicitly aimed at people who are new to Clang, so all you should need
      8 is a working knowledge of C++ and the command line.
      9 
     10 In order to work on the compiler, you need some basic knowledge of the
     11 abstract syntax tree (AST). To this end, the reader is incouraged to
     12 skim the :doc:`Introduction to the Clang
     13 AST <IntroductionToTheClangAST>`
     14 
     15 Step 0: Obtaining Clang
     16 =======================
     17 
     18 As Clang is part of the LLVM project, you'll need to download LLVM's
     19 source code first. Both Clang and LLVM are maintained as Subversion
     20 repositories, but we'll be accessing them through the git mirror. For
     21 further information, see the `getting started
     22 guide <http://llvm.org/docs/GettingStarted.html>`_.
     23 
     24 .. code-block:: console
     25 
     26       mkdir ~/clang-llvm && cd ~/clang-llvm
     27       git clone http://llvm.org/git/llvm.git
     28       cd llvm/tools
     29       git clone http://llvm.org/git/clang.git
     30       cd clang/tools
     31       git clone http://llvm.org/git/clang-tools-extra.git extra
     32 
     33 Next you need to obtain the CMake build system and Ninja build tool. You
     34 may already have CMake installed, but current binary versions of CMake
     35 aren't built with Ninja support.
     36 
     37 .. code-block:: console
     38 
     39       cd ~/clang-llvm
     40       git clone https://github.com/martine/ninja.git
     41       cd ninja
     42       git checkout release
     43       ./bootstrap.py
     44       sudo cp ninja /usr/bin/
     45 
     46       cd ~/clang-llvm
     47       git clone git://cmake.org/stage/cmake.git
     48       cd cmake
     49       git checkout next
     50       ./bootstrap
     51       make
     52       sudo make install
     53 
     54 Okay. Now we'll build Clang!
     55 
     56 .. code-block:: console
     57 
     58       cd ~/clang-llvm
     59       mkdir build && cd build
     60       cmake -G Ninja ../llvm -DLLVM_BUILD_TESTS=ON  # Enable tests; default is off.
     61       ninja
     62       ninja check       # Test LLVM only.
     63       ninja clang-test  # Test Clang only.
     64       ninja install
     65 
     66 And we're live.
     67 
     68 All of the tests should pass, though there is a (very) small chance that
     69 you can catch LLVM and Clang out of sync. Running ``'git svn rebase'``
     70 in both the llvm and clang directories should fix any problems.
     71 
     72 Finally, we want to set Clang as its own compiler.
     73 
     74 .. code-block:: console
     75 
     76       cd ~/clang-llvm/build
     77       ccmake ../llvm
     78 
     79 The second command will bring up a GUI for configuring Clang. You need
     80 to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on
     81 advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to
     82 ``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to
     83 configure, then ``'g'`` to generate CMake's files.
     84 
     85 Finally, run ninja one last time, and you're done.
     86 
     87 Step 1: Create a ClangTool
     88 ==========================
     89 
     90 Now that we have enough background knowledge, it's time to create the
     91 simplest productive ClangTool in existence: a syntax checker. While this
     92 already exists as ``clang-check``, it's important to understand what's
     93 going on.
     94 
     95 First, we'll need to create a new directory for our tool and tell CMake
     96 that it exists. As this is not going to be a core clang tool, it will
     97 live in the ``tools/extra`` repository.
     98 
     99 .. code-block:: console
    100 
    101       cd ~/clang-llvm/llvm/tools/clang
    102       mkdir tools/extra/loop-convert
    103       echo 'add_subdirectory(loop-convert)' >> tools/extra/CMakeLists.txt
    104       vim tools/extra/loop-convert/CMakeLists.txt
    105 
    106 CMakeLists.txt should have the following contents:
    107 
    108 ::
    109 
    110       set(LLVM_LINK_COMPONENTS support)
    111       set(LLVM_USED_LIBS clangTooling clangBasic clangAST)
    112 
    113       add_clang_executable(loop-convert
    114         LoopConvert.cpp
    115         )
    116       target_link_libraries(loop-convert
    117         clangTooling
    118         clangBasic
    119         clangASTMatchers
    120         )
    121 
    122 With that done, Ninja will be able to compile our tool. Let's give it
    123 something to compile! Put the following into
    124 ``tools/extra/loop-convert/LoopConvert.cpp``. A detailed explanation of
    125 why the different parts are needed can be found in the `LibTooling
    126 documentation <LibTooling.html>`_.
    127 
    128 .. code-block:: c++
    129 
    130       // Declares clang::SyntaxOnlyAction.
    131       #include "clang/Frontend/FrontendActions.h"
    132       #include "clang/Tooling/CommonOptionsParser.h"
    133       #include "clang/Tooling/Tooling.h"
    134       // Declares llvm::cl::extrahelp.
    135       #include "llvm/Support/CommandLine.h"
    136 
    137       using namespace clang::tooling;
    138       using namespace llvm;
    139 
    140       // Apply a custom category to all command-line options so that they are the
    141       // only ones displayed.
    142       static llvm::cl::OptionCategory MyToolCategory("my-tool options");
    143 
    144       // CommonOptionsParser declares HelpMessage with a description of the common
    145       // command-line options related to the compilation database and input files.
    146       // It's nice to have this help message in all tools.
    147       static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage);
    148 
    149       // A help message for this specific tool can be added afterwards.
    150       static cl::extrahelp MoreHelp("\nMore help text...");
    151 
    152       int main(int argc, const char **argv) {
    153         CommonOptionsParser OptionsParser(argc, argv, MyToolCategory);
    154         ClangTool Tool(OptionsParser.getCompilations(),
    155                        OptionsParser.getSourcePathList());
    156         return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get());
    157       }
    158 
    159 And that's it! You can compile our new tool by running ninja from the
    160 ``build`` directory.
    161 
    162 .. code-block:: console
    163 
    164       cd ~/clang-llvm/build
    165       ninja
    166 
    167 You should now be able to run the syntax checker, which is located in
    168 ``~/clang-llvm/build/bin``, on any source file. Try it!
    169 
    170 .. code-block:: console
    171 
    172       cat "int main() { return 0; }" > test.cpp
    173       bin/loop-convert test.cpp --
    174 
    175 Note the two dashes after we specify the source file. The additional
    176 options for the compiler are passed after the dashes rather than loading
    177 them from a compilation database - there just aren't any options needed
    178 right now.
    179 
    180 Intermezzo: Learn AST matcher basics
    181 ====================================
    182 
    183 Clang recently introduced the :doc:`ASTMatcher
    184 library <LibASTMatchers>` to provide a simple, powerful, and
    185 concise way to describe specific patterns in the AST. Implemented as a
    186 DSL powered by macros and templates (see
    187 `ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're
    188 curious), matchers offer the feel of algebraic data types common to
    189 functional programming languages.
    190 
    191 For example, suppose you wanted to examine only binary operators. There
    192 is a matcher to do exactly that, conveniently named ``binaryOperator``.
    193 I'll give you one guess what this matcher does:
    194 
    195 .. code-block:: c++
    196 
    197       binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0))))
    198 
    199 Shockingly, it will match against addition expressions whose left hand
    200 side is exactly the literal 0. It will not match against other forms of
    201 0, such as ``'\0'`` or ``NULL``, but it will match against macros that
    202 expand to 0. The matcher will also not match against calls to the
    203 overloaded operator ``'+'``, as there is a separate ``operatorCallExpr``
    204 matcher to handle overloaded operators.
    205 
    206 There are AST matchers to match all the different nodes of the AST,
    207 narrowing matchers to only match AST nodes fulfilling specific criteria,
    208 and traversal matchers to get from one kind of AST node to another. For
    209 a complete list of AST matchers, take a look at the `AST Matcher
    210 References <LibASTMatchersReference.html>`_
    211 
    212 All matcher that are nouns describe entities in the AST and can be
    213 bound, so that they can be referred to whenever a match is found. To do
    214 so, simply call the method ``bind`` on these matchers, e.g.:
    215 
    216 .. code-block:: c++
    217 
    218       variable(hasType(isInteger())).bind("intvar")
    219 
    220 Step 2: Using AST matchers
    221 ==========================
    222 
    223 Okay, on to using matchers for real. Let's start by defining a matcher
    224 which will capture all ``for`` statements that define a new variable
    225 initialized to zero. Let's start with matching all ``for`` loops:
    226 
    227 .. code-block:: c++
    228 
    229       forStmt()
    230 
    231 Next, we want to specify that a single variable is declared in the first
    232 portion of the loop, so we can extend the matcher to
    233 
    234 .. code-block:: c++
    235 
    236       forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl()))))
    237 
    238 Finally, we can add the condition that the variable is initialized to
    239 zero.
    240 
    241 .. code-block:: c++
    242 
    243       forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    244         hasInitializer(integerLiteral(equals(0))))))))
    245 
    246 It is fairly easy to read and understand the matcher definition ("match
    247 loops whose init portion declares a single variable which is initialized
    248 to the integer literal 0"), but deciding that every piece is necessary
    249 is more difficult. Note that this matcher will not match loops whose
    250 variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of
    251 zero besides the integer 0.
    252 
    253 The last step is giving the matcher a name and binding the ``ForStmt``
    254 as we will want to do something with it:
    255 
    256 .. code-block:: c++
    257 
    258       StatementMatcher LoopMatcher =
    259         forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    260           hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
    261 
    262 Once you have defined your matchers, you will need to add a little more
    263 scaffolding in order to run them. Matchers are paired with a
    264 ``MatchCallback`` and registered with a ``MatchFinder`` object, then run
    265 from a ``ClangTool``. More code!
    266 
    267 Add the following to ``LoopConvert.cpp``:
    268 
    269 .. code-block:: c++
    270 
    271       #include "clang/ASTMatchers/ASTMatchers.h"
    272       #include "clang/ASTMatchers/ASTMatchFinder.h"
    273 
    274       using namespace clang;
    275       using namespace clang::ast_matchers;
    276 
    277       StatementMatcher LoopMatcher =
    278         forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    279           hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
    280 
    281       class LoopPrinter : public MatchFinder::MatchCallback {
    282       public :
    283         virtual void run(const MatchFinder::MatchResult &Result) {
    284           if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop"))
    285             FS->dump();
    286         }
    287       };
    288 
    289 And change ``main()`` to:
    290 
    291 .. code-block:: c++
    292 
    293       int main(int argc, const char **argv) {
    294         CommonOptionsParser OptionsParser(argc, argv, MyToolCategory);
    295         ClangTool Tool(OptionsParser.getCompilations(),
    296                        OptionsParser.getSourcePathList());
    297 
    298         LoopPrinter Printer;
    299         MatchFinder Finder;
    300         Finder.addMatcher(LoopMatcher, &Printer);
    301 
    302         return Tool.run(newFrontendActionFactory(&Finder).get());
    303       }
    304 
    305 Now, you should be able to recompile and run the code to discover for
    306 loops. Create a new file with a few examples, and test out our new
    307 handiwork:
    308 
    309 .. code-block:: console
    310 
    311       cd ~/clang-llvm/llvm/llvm_build/
    312       ninja loop-convert
    313       vim ~/test-files/simple-loops.cc
    314       bin/loop-convert ~/test-files/simple-loops.cc
    315 
    316 Step 3.5: More Complicated Matchers
    317 ===================================
    318 
    319 Our simple matcher is capable of discovering for loops, but we would
    320 still need to filter out many more ourselves. We can do a good portion
    321 of the remaining work with some cleverly chosen matchers, but first we
    322 need to decide exactly which properties we want to allow.
    323 
    324 How can we characterize for loops over arrays which would be eligible
    325 for translation to range-based syntax? Range based loops over arrays of
    326 size ``N`` that:
    327 
    328 -  start at index ``0``
    329 -  iterate consecutively
    330 -  end at index ``N-1``
    331 
    332 We already check for (1), so all we need to add is a check to the loop's
    333 condition to ensure that the loop's index variable is compared against
    334 ``N`` and another check to ensure that the increment step just
    335 increments this same variable. The matcher for (2) is straightforward:
    336 require a pre- or post-increment of the same variable declared in the
    337 init portion.
    338 
    339 Unfortunately, such a matcher is impossible to write. Matchers contain
    340 no logic for comparing two arbitrary AST nodes and determining whether
    341 or not they are equal, so the best we can do is matching more than we
    342 would like to allow, and punting extra comparisons to the callback.
    343 
    344 In any case, we can start building this sub-matcher. We can require that
    345 the increment step be a unary increment like this:
    346 
    347 .. code-block:: c++
    348 
    349       hasIncrement(unaryOperator(hasOperatorName("++")))
    350 
    351 Specifying what is incremented introduces another quirk of Clang's AST:
    352 Usages of variables are represented as ``DeclRefExpr``'s ("declaration
    353 reference expressions") because they are expressions which refer to
    354 variable declarations. To find a ``unaryOperator`` that refers to a
    355 specific declaration, we can simply add a second condition to it:
    356 
    357 .. code-block:: c++
    358 
    359       hasIncrement(unaryOperator(
    360         hasOperatorName("++"),
    361         hasUnaryOperand(declRefExpr())))
    362 
    363 Furthermore, we can restrict our matcher to only match if the
    364 incremented variable is an integer:
    365 
    366 .. code-block:: c++
    367 
    368       hasIncrement(unaryOperator(
    369         hasOperatorName("++"),
    370         hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger())))))))
    371 
    372 And the last step will be to attach an identifier to this variable, so
    373 that we can retrieve it in the callback:
    374 
    375 .. code-block:: c++
    376 
    377       hasIncrement(unaryOperator(
    378         hasOperatorName("++"),
    379         hasUnaryOperand(declRefExpr(to(
    380           varDecl(hasType(isInteger())).bind("incrementVariable"))))))
    381 
    382 We can add this code to the definition of ``LoopMatcher`` and make sure
    383 that our program, outfitted with the new matcher, only prints out loops
    384 that declare a single variable initialized to zero and have an increment
    385 step consisting of a unary increment of some variable.
    386 
    387 Now, we just need to add a matcher to check if the condition part of the
    388 ``for`` loop compares a variable against the size of the array. There is
    389 only one problem - we don't know which array we're iterating over
    390 without looking at the body of the loop! We are again restricted to
    391 approximating the result we want with matchers, filling in the details
    392 in the callback. So we start with:
    393 
    394 .. code-block:: c++
    395 
    396       hasCondition(binaryOperator(hasOperatorName("<"))
    397 
    398 It makes sense to ensure that the left-hand side is a reference to a
    399 variable, and that the right-hand side has integer type.
    400 
    401 .. code-block:: c++
    402 
    403       hasCondition(binaryOperator(
    404         hasOperatorName("<"),
    405         hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))),
    406         hasRHS(expr(hasType(isInteger())))))
    407 
    408 Why? Because it doesn't work. Of the three loops provided in
    409 ``test-files/simple.cpp``, zero of them have a matching condition. A
    410 quick look at the AST dump of the first for loop, produced by the
    411 previous iteration of loop-convert, shows us the answer:
    412 
    413 ::
    414 
    415       (ForStmt 0x173b240
    416         (DeclStmt 0x173afc8
    417           0x173af50 "int i =
    418             (IntegerLiteral 0x173afa8 'int' 0)")
    419         <<>>
    420         (BinaryOperator 0x173b060 '_Bool' '<'
    421           (ImplicitCastExpr 0x173b030 'int'
    422             (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int'))
    423           (ImplicitCastExpr 0x173b048 'int'
    424             (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int')))
    425         (UnaryOperator 0x173b0b0 'int' lvalue prefix '++'
    426           (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int'))
    427         (CompoundStatement ...
    428 
    429 We already know that the declaration and increments both match, or this
    430 loop wouldn't have been dumped. The culprit lies in the implicit cast
    431 applied to the first operand (i.e. the LHS) of the less-than operator,
    432 an L-value to R-value conversion applied to the expression referencing
    433 ``i``. Thankfully, the matcher library offers a solution to this problem
    434 in the form of ``ignoringParenImpCasts``, which instructs the matcher to
    435 ignore implicit casts and parentheses before continuing to match.
    436 Adjusting the condition operator will restore the desired match.
    437 
    438 .. code-block:: c++
    439 
    440       hasCondition(binaryOperator(
    441         hasOperatorName("<"),
    442         hasLHS(ignoringParenImpCasts(declRefExpr(
    443           to(varDecl(hasType(isInteger())))))),
    444         hasRHS(expr(hasType(isInteger())))))
    445 
    446 After adding binds to the expressions we wished to capture and
    447 extracting the identifier strings into variables, we have array-step-2
    448 completed.
    449 
    450 Step 4: Retrieving Matched Nodes
    451 ================================
    452 
    453 So far, the matcher callback isn't very interesting: it just dumps the
    454 loop's AST. At some point, we will need to make changes to the input
    455 source code. Next, we'll work on using the nodes we bound in the
    456 previous step.
    457 
    458 The ``MatchFinder::run()`` callback takes a
    459 ``MatchFinder::MatchResult&`` as its parameter. We're most interested in
    460 its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext``
    461 class to represent contextual information about the AST, as the name
    462 implies, though the most functionally important detail is that several
    463 operations require an ``ASTContext*`` parameter. More immediately useful
    464 is the set of matched nodes, and how we retrieve them.
    465 
    466 Since we bind three variables (identified by ConditionVarName,
    467 InitVarName, and IncrementVarName), we can obtain the matched nodes by
    468 using the ``getNodeAs()`` member function.
    469 
    470 In ``LoopConvert.cpp`` add
    471 
    472 .. code-block:: c++
    473 
    474       #include "clang/AST/ASTContext.h"
    475 
    476 Change ``LoopMatcher`` to
    477 
    478 .. code-block:: c++
    479 
    480       StatementMatcher LoopMatcher =
    481           forStmt(hasLoopInit(declStmt(
    482                       hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0))))
    483                                         .bind("initVarName")))),
    484                   hasIncrement(unaryOperator(
    485                       hasOperatorName("++"),
    486                       hasUnaryOperand(declRefExpr(
    487                           to(varDecl(hasType(isInteger())).bind("incVarName")))))),
    488                   hasCondition(binaryOperator(
    489                       hasOperatorName("<"),
    490                       hasLHS(ignoringParenImpCasts(declRefExpr(
    491                           to(varDecl(hasType(isInteger())).bind("condVarName"))))),
    492                       hasRHS(expr(hasType(isInteger())))))).bind("forLoop");
    493 
    494 And change ``LoopPrinter::run`` to
    495 
    496 .. code-block:: c++
    497 
    498       void LoopPrinter::run(const MatchFinder::MatchResult &Result) {
    499         ASTContext *Context = Result.Context;
    500         const ForStmt *FS = Result.Nodes.getStmtAs<ForStmt>("forLoop");
    501         // We do not want to convert header files!
    502         if (!FS || !Context->getSourceManager().isFromMainFile(FS->getForLoc()))
    503           return;
    504         const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName");
    505         const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName");
    506         const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName");
    507 
    508         if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar))
    509           return;
    510         llvm::outs() << "Potential array-based loop discovered.\n";
    511       }
    512 
    513 Clang associates a ``VarDecl`` with each variable to represent the variable's
    514 declaration. Since the "canonical" form of each declaration is unique by
    515 address, all we need to do is make sure neither ``ValueDecl`` (base class of
    516 ``VarDecl``) is ``NULL`` and compare the canonical Decls.
    517 
    518 .. code-block:: c++
    519 
    520       static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) {
    521         return First && Second &&
    522                First->getCanonicalDecl() == Second->getCanonicalDecl();
    523       }
    524 
    525 If execution reaches the end of ``LoopPrinter::run()``, we know that the
    526 loop shell that looks like
    527 
    528 .. code-block:: c++
    529 
    530       for (int i= 0; i < expr(); ++i) { ... }
    531 
    532 For now, we will just print a message explaining that we found a loop.
    533 The next section will deal with recursively traversing the AST to
    534 discover all changes needed.
    535 
    536 As a side note, it's not as trivial to test if two expressions are the same,
    537 though Clang has already done the hard work for us by providing a way to
    538 canonicalize expressions:
    539 
    540 .. code-block:: c++
    541 
    542       static bool areSameExpr(ASTContext *Context, const Expr *First,
    543                               const Expr *Second) {
    544         if (!First || !Second)
    545           return false;
    546         llvm::FoldingSetNodeID FirstID, SecondID;
    547         First->Profile(FirstID, *Context, true);
    548         Second->Profile(SecondID, *Context, true);
    549         return FirstID == SecondID;
    550       }
    551 
    552 This code relies on the comparison between two
    553 ``llvm::FoldingSetNodeIDs``. As the documentation for
    554 ``Stmt::Profile()`` indicates, the ``Profile()`` member function builds
    555 a description of a node in the AST, based on its properties, along with
    556 those of its children. ``FoldingSetNodeID`` then serves as a hash we can
    557 use to compare expressions. We will need ``areSameExpr`` later. Before
    558 you run the new code on the additional loops added to
    559 test-files/simple.cpp, try to figure out which ones will be considered
    560 potentially convertible.
    561