Home | History | Annotate | Download | only in docs
      1 ===============================================================
      2 Tutorial for building tools using LibTooling and LibASTMatchers
      3 ===============================================================
      4 
      5 This document is intended to show how to build a useful source-to-source
      6 translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is
      7 explicitly aimed at people who are new to Clang, so all you should need
      8 is a working knowledge of C++ and the command line.
      9 
     10 In order to work on the compiler, you need some basic knowledge of the
     11 abstract syntax tree (AST). To this end, the reader is incouraged to
     12 skim the :doc:`Introduction to the Clang
     13 AST <IntroductionToTheClangAST>`
     14 
     15 Step 0: Obtaining Clang
     16 =======================
     17 
     18 As Clang is part of the LLVM project, you'll need to download LLVM's
     19 source code first. Both Clang and LLVM are maintained as Subversion
     20 repositories, but we'll be accessing them through the git mirror. For
     21 further information, see the `getting started
     22 guide <http://llvm.org/docs/GettingStarted.html>`_.
     23 
     24 .. code-block:: console
     25 
     26       mkdir ~/clang-llvm && cd ~/clang-llvm
     27       git clone http://llvm.org/git/llvm.git
     28       cd llvm/tools
     29       git clone http://llvm.org/git/clang.git
     30       cd clang/tools
     31       git clone http://llvm.org/git/clang-tools-extra.git extra
     32 
     33 Next you need to obtain the CMake build system and Ninja build tool. You
     34 may already have CMake installed, but current binary versions of CMake
     35 aren't built with Ninja support.
     36 
     37 .. code-block:: console
     38 
     39       cd ~/clang-llvm
     40       git clone https://github.com/martine/ninja.git
     41       cd ninja
     42       git checkout release
     43       ./bootstrap.py
     44       sudo cp ninja /usr/bin/
     45 
     46       cd ~/clang-llvm
     47       git clone git://cmake.org/stage/cmake.git
     48       cd cmake
     49       git checkout next
     50       ./bootstrap
     51       make
     52       sudo make install
     53 
     54 Okay. Now we'll build Clang!
     55 
     56 .. code-block:: console
     57 
     58       cd ~/clang-llvm
     59       mkdir build && cd build
     60       cmake -G Ninja ../llvm -DLLVM_BUILD_TESTS=ON  # Enable tests; default is off.
     61       ninja
     62       ninja check       # Test LLVM only.
     63       ninja clang-test  # Test Clang only.
     64       ninja install
     65 
     66 And we're live.
     67 
     68 All of the tests should pass, though there is a (very) small chance that
     69 you can catch LLVM and Clang out of sync. Running ``'git svn rebase'``
     70 in both the llvm and clang directories should fix any problems.
     71 
     72 Finally, we want to set Clang as its own compiler.
     73 
     74 .. code-block:: console
     75 
     76       cd ~/clang-llvm/build
     77       ccmake ../llvm
     78 
     79 The second command will bring up a GUI for configuring Clang. You need
     80 to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on
     81 advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to
     82 ``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to
     83 configure, then ``'g'`` to generate CMake's files.
     84 
     85 Finally, run ninja one last time, and you're done.
     86 
     87 Step 1: Create a ClangTool
     88 ==========================
     89 
     90 Now that we have enough background knowledge, it's time to create the
     91 simplest productive ClangTool in existence: a syntax checker. While this
     92 already exists as ``clang-check``, it's important to understand what's
     93 going on.
     94 
     95 First, we'll need to create a new directory for our tool and tell CMake
     96 that it exists. As this is not going to be a core clang tool, it will
     97 live in the ``tools/extra`` repository.
     98 
     99 .. code-block:: console
    100 
    101       cd ~/clang-llvm/llvm/tools/clang
    102       mkdir tools/extra/loop-convert
    103       echo 'add_subdirectory(loop-convert)' >> tools/extra/CMakeLists.txt
    104       vim tools/extra/loop-convert/CMakeLists.txt
    105 
    106 CMakeLists.txt should have the following contents:
    107 
    108 ::
    109 
    110       set(LLVM_LINK_COMPONENTS support)
    111       set(LLVM_USED_LIBS clangTooling clangBasic clangAST)
    112 
    113       add_clang_executable(loop-convert
    114         LoopConvert.cpp
    115         )
    116       target_link_libraries(loop-convert
    117         clangTooling
    118         clangBasic
    119         clangASTMatchers
    120         )
    121 
    122 With that done, Ninja will be able to compile our tool. Let's give it
    123 something to compile! Put the following into
    124 ``tools/extra/loop-convert/LoopConvert.cpp``. A detailed explanation of
    125 why the different parts are needed can be found in the `LibTooling
    126 documentation <LibTooling.html>`_.
    127 
    128 .. code-block:: c++
    129 
    130       // Declares clang::SyntaxOnlyAction.
    131       #include "clang/Frontend/FrontendActions.h"
    132       #include "clang/Tooling/CommonOptionsParser.h"
    133       #include "clang/Tooling/Tooling.h"
    134       // Declares llvm::cl::extrahelp.
    135       #include "llvm/Support/CommandLine.h"
    136 
    137       using namespace clang::tooling;
    138       using namespace llvm;
    139 
    140       // CommonOptionsParser declares HelpMessage with a description of the common
    141       // command-line options related to the compilation database and input files.
    142       // It's nice to have this help message in all tools.
    143       static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage);
    144 
    145       // A help message for this specific tool can be added afterwards.
    146       static cl::extrahelp MoreHelp("\nMore help text...");
    147 
    148       int main(int argc, const char **argv) {
    149         CommonOptionsParser OptionsParser(argc, argv);
    150         ClangTool Tool(OptionsParser.getCompilations(),
    151                        OptionsParser.getSourcePathList());
    152         return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>());
    153       }
    154 
    155 And that's it! You can compile our new tool by running ninja from the
    156 ``build`` directory.
    157 
    158 .. code-block:: console
    159 
    160       cd ~/clang-llvm/build
    161       ninja
    162 
    163 You should now be able to run the syntax checker, which is located in
    164 ``~/clang-llvm/build/bin``, on any source file. Try it!
    165 
    166 .. code-block:: console
    167 
    168       cat "int main() { return 0; }" > test.cpp
    169       bin/loop-convert test.cpp --
    170 
    171 Note the two dashes after we specify the source file. The additional
    172 options for the compiler are passed after the dashes rather than loading
    173 them from a compilation database - there just aren't any options needed
    174 right now.
    175 
    176 Intermezzo: Learn AST matcher basics
    177 ====================================
    178 
    179 Clang recently introduced the :doc:`ASTMatcher
    180 library <LibASTMatchers>` to provide a simple, powerful, and
    181 concise way to describe specific patterns in the AST. Implemented as a
    182 DSL powered by macros and templates (see
    183 `ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're
    184 curious), matchers offer the feel of algebraic data types common to
    185 functional programming languages.
    186 
    187 For example, suppose you wanted to examine only binary operators. There
    188 is a matcher to do exactly that, conveniently named ``binaryOperator``.
    189 I'll give you one guess what this matcher does:
    190 
    191 .. code-block:: c++
    192 
    193       binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0))))
    194 
    195 Shockingly, it will match against addition expressions whose left hand
    196 side is exactly the literal 0. It will not match against other forms of
    197 0, such as ``'\0'`` or ``NULL``, but it will match against macros that
    198 expand to 0. The matcher will also not match against calls to the
    199 overloaded operator ``'+'``, as there is a separate ``operatorCallExpr``
    200 matcher to handle overloaded operators.
    201 
    202 There are AST matchers to match all the different nodes of the AST,
    203 narrowing matchers to only match AST nodes fulfilling specific criteria,
    204 and traversal matchers to get from one kind of AST node to another. For
    205 a complete list of AST matchers, take a look at the `AST Matcher
    206 References <LibASTMatchersReference.html>`_
    207 
    208 All matcher that are nouns describe entities in the AST and can be
    209 bound, so that they can be referred to whenever a match is found. To do
    210 so, simply call the method ``bind`` on these matchers, e.g.:
    211 
    212 .. code-block:: c++
    213 
    214       variable(hasType(isInteger())).bind("intvar")
    215 
    216 Step 2: Using AST matchers
    217 ==========================
    218 
    219 Okay, on to using matchers for real. Let's start by defining a matcher
    220 which will capture all ``for`` statements that define a new variable
    221 initialized to zero. Let's start with matching all ``for`` loops:
    222 
    223 .. code-block:: c++
    224 
    225       forStmt()
    226 
    227 Next, we want to specify that a single variable is declared in the first
    228 portion of the loop, so we can extend the matcher to
    229 
    230 .. code-block:: c++
    231 
    232       forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl()))))
    233 
    234 Finally, we can add the condition that the variable is initialized to
    235 zero.
    236 
    237 .. code-block:: c++
    238 
    239       forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    240         hasInitializer(integerLiteral(equals(0))))))))
    241 
    242 It is fairly easy to read and understand the matcher definition ("match
    243 loops whose init portion declares a single variable which is initialized
    244 to the integer literal 0"), but deciding that every piece is necessary
    245 is more difficult. Note that this matcher will not match loops whose
    246 variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of
    247 zero besides the integer 0.
    248 
    249 The last step is giving the matcher a name and binding the ``ForStmt``
    250 as we will want to do something with it:
    251 
    252 .. code-block:: c++
    253 
    254       StatementMatcher LoopMatcher =
    255         forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    256           hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
    257 
    258 Once you have defined your matchers, you will need to add a little more
    259 scaffolding in order to run them. Matchers are paired with a
    260 ``MatchCallback`` and registered with a ``MatchFinder`` object, then run
    261 from a ``ClangTool``. More code!
    262 
    263 Add the following to ``LoopConvert.cpp``:
    264 
    265 .. code-block:: c++
    266 
    267       #include "clang/ASTMatchers/ASTMatchers.h"
    268       #include "clang/ASTMatchers/ASTMatchFinder.h"
    269 
    270       using namespace clang;
    271       using namespace clang::ast_matchers;
    272 
    273       StatementMatcher LoopMatcher =
    274         forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    275           hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
    276 
    277       class LoopPrinter : public MatchFinder::MatchCallback {
    278       public :
    279         virtual void run(const MatchFinder::MatchResult &Result) {
    280           if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop"))
    281             FS->dump();
    282         }
    283       };
    284 
    285 And change ``main()`` to:
    286 
    287 .. code-block:: c++
    288 
    289       int main(int argc, const char **argv) {
    290         CommonOptionsParser OptionsParser(argc, argv);
    291         ClangTool Tool(OptionsParser.getCompilations(),
    292                        OptionsParser.getSourcePathList());
    293 
    294         LoopPrinter Printer;
    295         MatchFinder Finder;
    296         Finder.addMatcher(LoopMatcher, &Printer);
    297 
    298         return Tool.run(newFrontendActionFactory(&Finder));
    299       }
    300 
    301 Now, you should be able to recompile and run the code to discover for
    302 loops. Create a new file with a few examples, and test out our new
    303 handiwork:
    304 
    305 .. code-block:: console
    306 
    307       cd ~/clang-llvm/llvm/llvm_build/
    308       ninja loop-convert
    309       vim ~/test-files/simple-loops.cc
    310       bin/loop-convert ~/test-files/simple-loops.cc
    311 
    312 Step 3.5: More Complicated Matchers
    313 ===================================
    314 
    315 Our simple matcher is capable of discovering for loops, but we would
    316 still need to filter out many more ourselves. We can do a good portion
    317 of the remaining work with some cleverly chosen matchers, but first we
    318 need to decide exactly which properties we want to allow.
    319 
    320 How can we characterize for loops over arrays which would be eligible
    321 for translation to range-based syntax? Range based loops over arrays of
    322 size ``N`` that:
    323 
    324 -  start at index ``0``
    325 -  iterate consecutively
    326 -  end at index ``N-1``
    327 
    328 We already check for (1), so all we need to add is a check to the loop's
    329 condition to ensure that the loop's index variable is compared against
    330 ``N`` and another check to ensure that the increment step just
    331 increments this same variable. The matcher for (2) is straightforward:
    332 require a pre- or post-increment of the same variable declared in the
    333 init portion.
    334 
    335 Unfortunately, such a matcher is impossible to write. Matchers contain
    336 no logic for comparing two arbitrary AST nodes and determining whether
    337 or not they are equal, so the best we can do is matching more than we
    338 would like to allow, and punting extra comparisons to the callback.
    339 
    340 In any case, we can start building this sub-matcher. We can require that
    341 the increment step be a unary increment like this:
    342 
    343 .. code-block:: c++
    344 
    345       hasIncrement(unaryOperator(hasOperatorName("++")))
    346 
    347 Specifying what is incremented introduces another quirk of Clang's AST:
    348 Usages of variables are represented as ``DeclRefExpr``'s ("declaration
    349 reference expressions") because they are expressions which refer to
    350 variable declarations. To find a ``unaryOperator`` that refers to a
    351 specific declaration, we can simply add a second condition to it:
    352 
    353 .. code-block:: c++
    354 
    355       hasIncrement(unaryOperator(
    356         hasOperatorName("++"),
    357         hasUnaryOperand(declRefExpr())))
    358 
    359 Furthermore, we can restrict our matcher to only match if the
    360 incremented variable is an integer:
    361 
    362 .. code-block:: c++
    363 
    364       hasIncrement(unaryOperator(
    365         hasOperatorName("++"),
    366         hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger())))))))
    367 
    368 And the last step will be to attach an identifier to this variable, so
    369 that we can retrieve it in the callback:
    370 
    371 .. code-block:: c++
    372 
    373       hasIncrement(unaryOperator(
    374         hasOperatorName("++"),
    375         hasUnaryOperand(declRefExpr(to(
    376           varDecl(hasType(isInteger())).bind("incrementVariable"))))))
    377 
    378 We can add this code to the definition of ``LoopMatcher`` and make sure
    379 that our program, outfitted with the new matcher, only prints out loops
    380 that declare a single variable initialized to zero and have an increment
    381 step consisting of a unary increment of some variable.
    382 
    383 Now, we just need to add a matcher to check if the condition part of the
    384 ``for`` loop compares a variable against the size of the array. There is
    385 only one problem - we don't know which array we're iterating over
    386 without looking at the body of the loop! We are again restricted to
    387 approximating the result we want with matchers, filling in the details
    388 in the callback. So we start with:
    389 
    390 .. code-block:: c++
    391 
    392       hasCondition(binaryOperator(hasOperatorName("<"))
    393 
    394 It makes sense to ensure that the left-hand side is a reference to a
    395 variable, and that the right-hand side has integer type.
    396 
    397 .. code-block:: c++
    398 
    399       hasCondition(binaryOperator(
    400         hasOperatorName("<"),
    401         hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))),
    402         hasRHS(expr(hasType(isInteger())))))
    403 
    404 Why? Because it doesn't work. Of the three loops provided in
    405 ``test-files/simple.cpp``, zero of them have a matching condition. A
    406 quick look at the AST dump of the first for loop, produced by the
    407 previous iteration of loop-convert, shows us the answer:
    408 
    409 ::
    410 
    411       (ForStmt 0x173b240
    412         (DeclStmt 0x173afc8
    413           0x173af50 "int i =
    414             (IntegerLiteral 0x173afa8 'int' 0)")
    415         <<>>
    416         (BinaryOperator 0x173b060 '_Bool' '<'
    417           (ImplicitCastExpr 0x173b030 'int'
    418             (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int'))
    419           (ImplicitCastExpr 0x173b048 'int'
    420             (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int')))
    421         (UnaryOperator 0x173b0b0 'int' lvalue prefix '++'
    422           (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int'))
    423         (CompoundStatement ...
    424 
    425 We already know that the declaration and increments both match, or this
    426 loop wouldn't have been dumped. The culprit lies in the implicit cast
    427 applied to the first operand (i.e. the LHS) of the less-than operator,
    428 an L-value to R-value conversion applied to the expression referencing
    429 ``i``. Thankfully, the matcher library offers a solution to this problem
    430 in the form of ``ignoringParenImpCasts``, which instructs the matcher to
    431 ignore implicit casts and parentheses before continuing to match.
    432 Adjusting the condition operator will restore the desired match.
    433 
    434 .. code-block:: c++
    435 
    436       hasCondition(binaryOperator(
    437         hasOperatorName("<"),
    438         hasLHS(ignoringParenImpCasts(declRefExpr(
    439           to(varDecl(hasType(isInteger())))))),
    440         hasRHS(expr(hasType(isInteger())))))
    441 
    442 After adding binds to the expressions we wished to capture and
    443 extracting the identifier strings into variables, we have array-step-2
    444 completed.
    445 
    446 Step 4: Retrieving Matched Nodes
    447 ================================
    448 
    449 So far, the matcher callback isn't very interesting: it just dumps the
    450 loop's AST. At some point, we will need to make changes to the input
    451 source code. Next, we'll work on using the nodes we bound in the
    452 previous step.
    453 
    454 The ``MatchFinder::run()`` callback takes a
    455 ``MatchFinder::MatchResult&`` as its parameter. We're most interested in
    456 its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext``
    457 class to represent contextual information about the AST, as the name
    458 implies, though the most functionally important detail is that several
    459 operations require an ``ASTContext*`` parameter. More immediately useful
    460 is the set of matched nodes, and how we retrieve them.
    461 
    462 Since we bind three variables (identified by ConditionVarName,
    463 InitVarName, and IncrementVarName), we can obtain the matched nodes by
    464 using the ``getNodeAs()`` member function.
    465 
    466 In ``LoopConvert.cpp`` add
    467 
    468 .. code-block:: c++
    469 
    470       #include "clang/AST/ASTContext.h"
    471 
    472 Change ``LoopMatcher`` to
    473 
    474 .. code-block:: c++
    475 
    476       StatementMatcher LoopMatcher =
    477           forStmt(hasLoopInit(declStmt(
    478                       hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0))))
    479                                         .bind("initVarName")))),
    480                   hasIncrement(unaryOperator(
    481                       hasOperatorName("++"),
    482                       hasUnaryOperand(declRefExpr(
    483                           to(varDecl(hasType(isInteger())).bind("incVarName")))))),
    484                   hasCondition(binaryOperator(
    485                       hasOperatorName("<"),
    486                       hasLHS(ignoringParenImpCasts(declRefExpr(
    487                           to(varDecl(hasType(isInteger())).bind("condVarName"))))),
    488                       hasRHS(expr(hasType(isInteger())))))).bind("forLoop");
    489 
    490 And change ``LoopPrinter::run`` to
    491 
    492 .. code-block:: c++
    493 
    494       void LoopPrinter::run(const MatchFinder::MatchResult &Result) {
    495         ASTContext *Context = Result.Context;
    496         const ForStmt *FS = Result.Nodes.getStmtAs<ForStmt>("forLoop");
    497         // We do not want to convert header files!
    498         if (!FS || !Context->getSourceManager().isFromMainFile(FS->getForLoc()))
    499           return;
    500         const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName");
    501         const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName");
    502         const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName");
    503 
    504         if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar))
    505           return;
    506         llvm::outs() << "Potential array-based loop discovered.\n";
    507       }
    508 
    509 Clang associates a ``VarDecl`` with each variable to represent the variable's
    510 declaration. Since the "canonical" form of each declaration is unique by
    511 address, all we need to do is make sure neither ``ValueDecl`` (base class of
    512 ``VarDecl``) is ``NULL`` and compare the canonical Decls.
    513 
    514 .. code-block:: c++
    515 
    516       static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) {
    517         return First && Second &&
    518                First->getCanonicalDecl() == Second->getCanonicalDecl();
    519       }
    520 
    521 If execution reaches the end of ``LoopPrinter::run()``, we know that the
    522 loop shell that looks like
    523 
    524 .. code-block:: c++
    525 
    526       for (int i= 0; i < expr(); ++i) { ... }
    527 
    528 For now, we will just print a message explaining that we found a loop.
    529 The next section will deal with recursively traversing the AST to
    530 discover all changes needed.
    531 
    532 As a side note, it's not as trivial to test if two expressions are the same,
    533 though Clang has already done the hard work for us by providing a way to
    534 canonicalize expressions:
    535 
    536 .. code-block:: c++
    537 
    538       static bool areSameExpr(ASTContext *Context, const Expr *First,
    539                               const Expr *Second) {
    540         if (!First || !Second)
    541           return false;
    542         llvm::FoldingSetNodeID FirstID, SecondID;
    543         First->Profile(FirstID, *Context, true);
    544         Second->Profile(SecondID, *Context, true);
    545         return FirstID == SecondID;
    546       }
    547 
    548 This code relies on the comparison between two
    549 ``llvm::FoldingSetNodeIDs``. As the documentation for
    550 ``Stmt::Profile()`` indicates, the ``Profile()`` member function builds
    551 a description of a node in the AST, based on its properties, along with
    552 those of its children. ``FoldingSetNodeID`` then serves as a hash we can
    553 use to compare expressions. We will need ``areSameExpr`` later. Before
    554 you run the new code on the additional loops added to
    555 test-files/simple.cpp, try to figure out which ones will be considered
    556 potentially convertible.
    557