Home | History | Annotate | Download | only in docs
      1 ===============================================================
      2 Tutorial for building tools using LibTooling and LibASTMatchers
      3 ===============================================================
      4 
      5 This document is intended to show how to build a useful source-to-source
      6 translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is
      7 explicitly aimed at people who are new to Clang, so all you should need
      8 is a working knowledge of C++ and the command line.
      9 
     10 In order to work on the compiler, you need some basic knowledge of the
     11 abstract syntax tree (AST). To this end, the reader is incouraged to
     12 skim the :doc:`Introduction to the Clang
     13 AST <IntroductionToTheClangAST>`
     14 
     15 Step 0: Obtaining Clang
     16 =======================
     17 
     18 As Clang is part of the LLVM project, you'll need to download LLVM's
     19 source code first. Both Clang and LLVM are maintained as Subversion
     20 repositories, but we'll be accessing them through the git mirror. For
     21 further information, see the `getting started
     22 guide <http://llvm.org/docs/GettingStarted.html>`_.
     23 
     24 .. code-block:: console
     25 
     26       mkdir ~/clang-llvm && cd ~/clang-llvm
     27       git clone http://llvm.org/git/llvm.git
     28       cd llvm/tools
     29       git clone http://llvm.org/git/clang.git
     30       cd clang/tools
     31       git clone http://llvm.org/git/clang-tools-extra.git extra
     32 
     33 Next you need to obtain the CMake build system and Ninja build tool. You
     34 may already have CMake installed, but current binary versions of CMake
     35 aren't built with Ninja support.
     36 
     37 .. code-block:: console
     38 
     39       cd ~/clang-llvm
     40       git clone https://github.com/martine/ninja.git
     41       cd ninja
     42       git checkout release
     43       ./bootstrap.py
     44       sudo cp ninja /usr/bin/
     45 
     46       cd ~/clang-llvm
     47       git clone git://cmake.org/stage/cmake.git
     48       cd cmake
     49       git checkout next
     50       ./bootstrap
     51       make
     52       sudo make install
     53 
     54 Okay. Now we'll build Clang!
     55 
     56 .. code-block:: console
     57 
     58       cd ~/clang-llvm
     59       mkdir build && cd build
     60       cmake -G Ninja ../llvm -DLLVM_BUILD_TESTS=ON  # Enable tests; default is off.
     61       ninja
     62       ninja check       # Test LLVM only.
     63       ninja clang-test  # Test Clang only.
     64       ninja install
     65 
     66 And we're live.
     67 
     68 All of the tests should pass, though there is a (very) small chance that
     69 you can catch LLVM and Clang out of sync. Running ``'git svn rebase'``
     70 in both the llvm and clang directories should fix any problems.
     71 
     72 Finally, we want to set Clang as its own compiler.
     73 
     74 .. code-block:: console
     75 
     76       cd ~/clang-llvm/build
     77       ccmake ../llvm
     78 
     79 The second command will bring up a GUI for configuring Clang. You need
     80 to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on
     81 advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to
     82 ``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to
     83 configure, then ``'g'`` to generate CMake's files.
     84 
     85 Finally, run ninja one last time, and you're done.
     86 
     87 Step 1: Create a ClangTool
     88 ==========================
     89 
     90 Now that we have enough background knowledge, it's time to create the
     91 simplest productive ClangTool in existence: a syntax checker. While this
     92 already exists as ``clang-check``, it's important to understand what's
     93 going on.
     94 
     95 First, we'll need to create a new directory for our tool and tell CMake
     96 that it exists. As this is not going to be a core clang tool, it will
     97 live in the ``tools/extra`` repository.
     98 
     99 .. code-block:: console
    100 
    101       cd ~/clang-llvm/llvm/tools/clang
    102       mkdir tools/extra/loop-convert
    103       echo 'add_subdirectory(loop-convert)' >> tools/extra/CMakeLists.txt
    104       vim tools/extra/loop-convert/CMakeLists.txt
    105 
    106 CMakeLists.txt should have the following contents:
    107 
    108 ::
    109 
    110       set(LLVM_LINK_COMPONENTS support)
    111 
    112       add_clang_executable(loop-convert
    113         LoopConvert.cpp
    114         )
    115       target_link_libraries(loop-convert
    116         clangTooling
    117         clangBasic
    118         clangASTMatchers
    119         )
    120 
    121 With that done, Ninja will be able to compile our tool. Let's give it
    122 something to compile! Put the following into
    123 ``tools/extra/loop-convert/LoopConvert.cpp``. A detailed explanation of
    124 why the different parts are needed can be found in the `LibTooling
    125 documentation <LibTooling.html>`_.
    126 
    127 .. code-block:: c++
    128 
    129       // Declares clang::SyntaxOnlyAction.
    130       #include "clang/Frontend/FrontendActions.h"
    131       #include "clang/Tooling/CommonOptionsParser.h"
    132       #include "clang/Tooling/Tooling.h"
    133       // Declares llvm::cl::extrahelp.
    134       #include "llvm/Support/CommandLine.h"
    135 
    136       using namespace clang::tooling;
    137       using namespace llvm;
    138 
    139       // Apply a custom category to all command-line options so that they are the
    140       // only ones displayed.
    141       static llvm::cl::OptionCategory MyToolCategory("my-tool options");
    142 
    143       // CommonOptionsParser declares HelpMessage with a description of the common
    144       // command-line options related to the compilation database and input files.
    145       // It's nice to have this help message in all tools.
    146       static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage);
    147 
    148       // A help message for this specific tool can be added afterwards.
    149       static cl::extrahelp MoreHelp("\nMore help text...");
    150 
    151       int main(int argc, const char **argv) {
    152         CommonOptionsParser OptionsParser(argc, argv, MyToolCategory);
    153         ClangTool Tool(OptionsParser.getCompilations(),
    154                        OptionsParser.getSourcePathList());
    155         return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get());
    156       }
    157 
    158 And that's it! You can compile our new tool by running ninja from the
    159 ``build`` directory.
    160 
    161 .. code-block:: console
    162 
    163       cd ~/clang-llvm/build
    164       ninja
    165 
    166 You should now be able to run the syntax checker, which is located in
    167 ``~/clang-llvm/build/bin``, on any source file. Try it!
    168 
    169 .. code-block:: console
    170 
    171       echo "int main() { return 0; }" > test.cpp
    172       bin/loop-convert test.cpp --
    173 
    174 Note the two dashes after we specify the source file. The additional
    175 options for the compiler are passed after the dashes rather than loading
    176 them from a compilation database - there just aren't any options needed
    177 right now.
    178 
    179 Intermezzo: Learn AST matcher basics
    180 ====================================
    181 
    182 Clang recently introduced the :doc:`ASTMatcher
    183 library <LibASTMatchers>` to provide a simple, powerful, and
    184 concise way to describe specific patterns in the AST. Implemented as a
    185 DSL powered by macros and templates (see
    186 `ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're
    187 curious), matchers offer the feel of algebraic data types common to
    188 functional programming languages.
    189 
    190 For example, suppose you wanted to examine only binary operators. There
    191 is a matcher to do exactly that, conveniently named ``binaryOperator``.
    192 I'll give you one guess what this matcher does:
    193 
    194 .. code-block:: c++
    195 
    196       binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0))))
    197 
    198 Shockingly, it will match against addition expressions whose left hand
    199 side is exactly the literal 0. It will not match against other forms of
    200 0, such as ``'\0'`` or ``NULL``, but it will match against macros that
    201 expand to 0. The matcher will also not match against calls to the
    202 overloaded operator ``'+'``, as there is a separate ``operatorCallExpr``
    203 matcher to handle overloaded operators.
    204 
    205 There are AST matchers to match all the different nodes of the AST,
    206 narrowing matchers to only match AST nodes fulfilling specific criteria,
    207 and traversal matchers to get from one kind of AST node to another. For
    208 a complete list of AST matchers, take a look at the `AST Matcher
    209 References <LibASTMatchersReference.html>`_
    210 
    211 All matcher that are nouns describe entities in the AST and can be
    212 bound, so that they can be referred to whenever a match is found. To do
    213 so, simply call the method ``bind`` on these matchers, e.g.:
    214 
    215 .. code-block:: c++
    216 
    217       variable(hasType(isInteger())).bind("intvar")
    218 
    219 Step 2: Using AST matchers
    220 ==========================
    221 
    222 Okay, on to using matchers for real. Let's start by defining a matcher
    223 which will capture all ``for`` statements that define a new variable
    224 initialized to zero. Let's start with matching all ``for`` loops:
    225 
    226 .. code-block:: c++
    227 
    228       forStmt()
    229 
    230 Next, we want to specify that a single variable is declared in the first
    231 portion of the loop, so we can extend the matcher to
    232 
    233 .. code-block:: c++
    234 
    235       forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl()))))
    236 
    237 Finally, we can add the condition that the variable is initialized to
    238 zero.
    239 
    240 .. code-block:: c++
    241 
    242       forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    243         hasInitializer(integerLiteral(equals(0))))))))
    244 
    245 It is fairly easy to read and understand the matcher definition ("match
    246 loops whose init portion declares a single variable which is initialized
    247 to the integer literal 0"), but deciding that every piece is necessary
    248 is more difficult. Note that this matcher will not match loops whose
    249 variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of
    250 zero besides the integer 0.
    251 
    252 The last step is giving the matcher a name and binding the ``ForStmt``
    253 as we will want to do something with it:
    254 
    255 .. code-block:: c++
    256 
    257       StatementMatcher LoopMatcher =
    258         forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    259           hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
    260 
    261 Once you have defined your matchers, you will need to add a little more
    262 scaffolding in order to run them. Matchers are paired with a
    263 ``MatchCallback`` and registered with a ``MatchFinder`` object, then run
    264 from a ``ClangTool``. More code!
    265 
    266 Add the following to ``LoopConvert.cpp``:
    267 
    268 .. code-block:: c++
    269 
    270       #include "clang/ASTMatchers/ASTMatchers.h"
    271       #include "clang/ASTMatchers/ASTMatchFinder.h"
    272 
    273       using namespace clang;
    274       using namespace clang::ast_matchers;
    275 
    276       StatementMatcher LoopMatcher =
    277         forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    278           hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
    279 
    280       class LoopPrinter : public MatchFinder::MatchCallback {
    281       public :
    282         virtual void run(const MatchFinder::MatchResult &Result) {
    283           if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop"))
    284             FS->dump();
    285         }
    286       };
    287 
    288 And change ``main()`` to:
    289 
    290 .. code-block:: c++
    291 
    292       int main(int argc, const char **argv) {
    293         CommonOptionsParser OptionsParser(argc, argv, MyToolCategory);
    294         ClangTool Tool(OptionsParser.getCompilations(),
    295                        OptionsParser.getSourcePathList());
    296 
    297         LoopPrinter Printer;
    298         MatchFinder Finder;
    299         Finder.addMatcher(LoopMatcher, &Printer);
    300 
    301         return Tool.run(newFrontendActionFactory(&Finder).get());
    302       }
    303 
    304 Now, you should be able to recompile and run the code to discover for
    305 loops. Create a new file with a few examples, and test out our new
    306 handiwork:
    307 
    308 .. code-block:: console
    309 
    310       cd ~/clang-llvm/llvm/llvm_build/
    311       ninja loop-convert
    312       vim ~/test-files/simple-loops.cc
    313       bin/loop-convert ~/test-files/simple-loops.cc
    314 
    315 Step 3.5: More Complicated Matchers
    316 ===================================
    317 
    318 Our simple matcher is capable of discovering for loops, but we would
    319 still need to filter out many more ourselves. We can do a good portion
    320 of the remaining work with some cleverly chosen matchers, but first we
    321 need to decide exactly which properties we want to allow.
    322 
    323 How can we characterize for loops over arrays which would be eligible
    324 for translation to range-based syntax? Range based loops over arrays of
    325 size ``N`` that:
    326 
    327 -  start at index ``0``
    328 -  iterate consecutively
    329 -  end at index ``N-1``
    330 
    331 We already check for (1), so all we need to add is a check to the loop's
    332 condition to ensure that the loop's index variable is compared against
    333 ``N`` and another check to ensure that the increment step just
    334 increments this same variable. The matcher for (2) is straightforward:
    335 require a pre- or post-increment of the same variable declared in the
    336 init portion.
    337 
    338 Unfortunately, such a matcher is impossible to write. Matchers contain
    339 no logic for comparing two arbitrary AST nodes and determining whether
    340 or not they are equal, so the best we can do is matching more than we
    341 would like to allow, and punting extra comparisons to the callback.
    342 
    343 In any case, we can start building this sub-matcher. We can require that
    344 the increment step be a unary increment like this:
    345 
    346 .. code-block:: c++
    347 
    348       hasIncrement(unaryOperator(hasOperatorName("++")))
    349 
    350 Specifying what is incremented introduces another quirk of Clang's AST:
    351 Usages of variables are represented as ``DeclRefExpr``'s ("declaration
    352 reference expressions") because they are expressions which refer to
    353 variable declarations. To find a ``unaryOperator`` that refers to a
    354 specific declaration, we can simply add a second condition to it:
    355 
    356 .. code-block:: c++
    357 
    358       hasIncrement(unaryOperator(
    359         hasOperatorName("++"),
    360         hasUnaryOperand(declRefExpr())))
    361 
    362 Furthermore, we can restrict our matcher to only match if the
    363 incremented variable is an integer:
    364 
    365 .. code-block:: c++
    366 
    367       hasIncrement(unaryOperator(
    368         hasOperatorName("++"),
    369         hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger())))))))
    370 
    371 And the last step will be to attach an identifier to this variable, so
    372 that we can retrieve it in the callback:
    373 
    374 .. code-block:: c++
    375 
    376       hasIncrement(unaryOperator(
    377         hasOperatorName("++"),
    378         hasUnaryOperand(declRefExpr(to(
    379           varDecl(hasType(isInteger())).bind("incrementVariable"))))))
    380 
    381 We can add this code to the definition of ``LoopMatcher`` and make sure
    382 that our program, outfitted with the new matcher, only prints out loops
    383 that declare a single variable initialized to zero and have an increment
    384 step consisting of a unary increment of some variable.
    385 
    386 Now, we just need to add a matcher to check if the condition part of the
    387 ``for`` loop compares a variable against the size of the array. There is
    388 only one problem - we don't know which array we're iterating over
    389 without looking at the body of the loop! We are again restricted to
    390 approximating the result we want with matchers, filling in the details
    391 in the callback. So we start with:
    392 
    393 .. code-block:: c++
    394 
    395       hasCondition(binaryOperator(hasOperatorName("<"))
    396 
    397 It makes sense to ensure that the left-hand side is a reference to a
    398 variable, and that the right-hand side has integer type.
    399 
    400 .. code-block:: c++
    401 
    402       hasCondition(binaryOperator(
    403         hasOperatorName("<"),
    404         hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))),
    405         hasRHS(expr(hasType(isInteger())))))
    406 
    407 Why? Because it doesn't work. Of the three loops provided in
    408 ``test-files/simple.cpp``, zero of them have a matching condition. A
    409 quick look at the AST dump of the first for loop, produced by the
    410 previous iteration of loop-convert, shows us the answer:
    411 
    412 ::
    413 
    414       (ForStmt 0x173b240
    415         (DeclStmt 0x173afc8
    416           0x173af50 "int i =
    417             (IntegerLiteral 0x173afa8 'int' 0)")
    418         <<>>
    419         (BinaryOperator 0x173b060 '_Bool' '<'
    420           (ImplicitCastExpr 0x173b030 'int'
    421             (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int'))
    422           (ImplicitCastExpr 0x173b048 'int'
    423             (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int')))
    424         (UnaryOperator 0x173b0b0 'int' lvalue prefix '++'
    425           (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int'))
    426         (CompoundStatement ...
    427 
    428 We already know that the declaration and increments both match, or this
    429 loop wouldn't have been dumped. The culprit lies in the implicit cast
    430 applied to the first operand (i.e. the LHS) of the less-than operator,
    431 an L-value to R-value conversion applied to the expression referencing
    432 ``i``. Thankfully, the matcher library offers a solution to this problem
    433 in the form of ``ignoringParenImpCasts``, which instructs the matcher to
    434 ignore implicit casts and parentheses before continuing to match.
    435 Adjusting the condition operator will restore the desired match.
    436 
    437 .. code-block:: c++
    438 
    439       hasCondition(binaryOperator(
    440         hasOperatorName("<"),
    441         hasLHS(ignoringParenImpCasts(declRefExpr(
    442           to(varDecl(hasType(isInteger())))))),
    443         hasRHS(expr(hasType(isInteger())))))
    444 
    445 After adding binds to the expressions we wished to capture and
    446 extracting the identifier strings into variables, we have array-step-2
    447 completed.
    448 
    449 Step 4: Retrieving Matched Nodes
    450 ================================
    451 
    452 So far, the matcher callback isn't very interesting: it just dumps the
    453 loop's AST. At some point, we will need to make changes to the input
    454 source code. Next, we'll work on using the nodes we bound in the
    455 previous step.
    456 
    457 The ``MatchFinder::run()`` callback takes a
    458 ``MatchFinder::MatchResult&`` as its parameter. We're most interested in
    459 its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext``
    460 class to represent contextual information about the AST, as the name
    461 implies, though the most functionally important detail is that several
    462 operations require an ``ASTContext*`` parameter. More immediately useful
    463 is the set of matched nodes, and how we retrieve them.
    464 
    465 Since we bind three variables (identified by ConditionVarName,
    466 InitVarName, and IncrementVarName), we can obtain the matched nodes by
    467 using the ``getNodeAs()`` member function.
    468 
    469 In ``LoopConvert.cpp`` add
    470 
    471 .. code-block:: c++
    472 
    473       #include "clang/AST/ASTContext.h"
    474 
    475 Change ``LoopMatcher`` to
    476 
    477 .. code-block:: c++
    478 
    479       StatementMatcher LoopMatcher =
    480           forStmt(hasLoopInit(declStmt(
    481                       hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0))))
    482                                         .bind("initVarName")))),
    483                   hasIncrement(unaryOperator(
    484                       hasOperatorName("++"),
    485                       hasUnaryOperand(declRefExpr(
    486                           to(varDecl(hasType(isInteger())).bind("incVarName")))))),
    487                   hasCondition(binaryOperator(
    488                       hasOperatorName("<"),
    489                       hasLHS(ignoringParenImpCasts(declRefExpr(
    490                           to(varDecl(hasType(isInteger())).bind("condVarName"))))),
    491                       hasRHS(expr(hasType(isInteger())))))).bind("forLoop");
    492 
    493 And change ``LoopPrinter::run`` to
    494 
    495 .. code-block:: c++
    496 
    497       void LoopPrinter::run(const MatchFinder::MatchResult &Result) {
    498         ASTContext *Context = Result.Context;
    499         const ForStmt *FS = Result.Nodes.getStmtAs<ForStmt>("forLoop");
    500         // We do not want to convert header files!
    501         if (!FS || !Context->getSourceManager().isFromMainFile(FS->getForLoc()))
    502           return;
    503         const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName");
    504         const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName");
    505         const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName");
    506 
    507         if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar))
    508           return;
    509         llvm::outs() << "Potential array-based loop discovered.\n";
    510       }
    511 
    512 Clang associates a ``VarDecl`` with each variable to represent the variable's
    513 declaration. Since the "canonical" form of each declaration is unique by
    514 address, all we need to do is make sure neither ``ValueDecl`` (base class of
    515 ``VarDecl``) is ``NULL`` and compare the canonical Decls.
    516 
    517 .. code-block:: c++
    518 
    519       static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) {
    520         return First && Second &&
    521                First->getCanonicalDecl() == Second->getCanonicalDecl();
    522       }
    523 
    524 If execution reaches the end of ``LoopPrinter::run()``, we know that the
    525 loop shell that looks like
    526 
    527 .. code-block:: c++
    528 
    529       for (int i= 0; i < expr(); ++i) { ... }
    530 
    531 For now, we will just print a message explaining that we found a loop.
    532 The next section will deal with recursively traversing the AST to
    533 discover all changes needed.
    534 
    535 As a side note, it's not as trivial to test if two expressions are the same,
    536 though Clang has already done the hard work for us by providing a way to
    537 canonicalize expressions:
    538 
    539 .. code-block:: c++
    540 
    541       static bool areSameExpr(ASTContext *Context, const Expr *First,
    542                               const Expr *Second) {
    543         if (!First || !Second)
    544           return false;
    545         llvm::FoldingSetNodeID FirstID, SecondID;
    546         First->Profile(FirstID, *Context, true);
    547         Second->Profile(SecondID, *Context, true);
    548         return FirstID == SecondID;
    549       }
    550 
    551 This code relies on the comparison between two
    552 ``llvm::FoldingSetNodeIDs``. As the documentation for
    553 ``Stmt::Profile()`` indicates, the ``Profile()`` member function builds
    554 a description of a node in the AST, based on its properties, along with
    555 those of its children. ``FoldingSetNodeID`` then serves as a hash we can
    556 use to compare expressions. We will need ``areSameExpr`` later. Before
    557 you run the new code on the additional loops added to
    558 test-files/simple.cpp, try to figure out which ones will be considered
    559 potentially convertible.
    560