Home | History | Annotate | Download | only in docs
      1 ===============================================================
      2 Tutorial for building tools using LibTooling and LibASTMatchers
      3 ===============================================================
      4 
      5 This document is intended to show how to build a useful source-to-source
      6 translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is
      7 explicitly aimed at people who are new to Clang, so all you should need
      8 is a working knowledge of C++ and the command line.
      9 
     10 In order to work on the compiler, you need some basic knowledge of the
     11 abstract syntax tree (AST). To this end, the reader is incouraged to
     12 skim the :doc:`Introduction to the Clang
     13 AST <IntroductionToTheClangAST>`
     14 
     15 Step 0: Obtaining Clang
     16 =======================
     17 
     18 As Clang is part of the LLVM project, you'll need to download LLVM's
     19 source code first. Both Clang and LLVM are maintained as Subversion
     20 repositories, but we'll be accessing them through the git mirror. For
     21 further information, see the `getting started
     22 guide <http://llvm.org/docs/GettingStarted.html>`_.
     23 
     24 .. code-block:: console
     25 
     26       mkdir ~/clang-llvm && cd ~/clang-llvm
     27       git clone http://llvm.org/git/llvm.git
     28       cd llvm/tools
     29       git clone http://llvm.org/git/clang.git
     30 
     31 Next you need to obtain the CMake build system and Ninja build tool. You
     32 may already have CMake installed, but current binary versions of CMake
     33 aren't built with Ninja support.
     34 
     35 .. code-block:: console
     36 
     37       cd ~/clang-llvm
     38       git clone https://github.com/martine/ninja.git
     39       cd ninja
     40       git checkout release
     41       ./bootstrap.py
     42       sudo cp ninja /usr/bin/
     43 
     44       cd ~/clang-llvm
     45       git clone git://cmake.org/stage/cmake.git
     46       cd cmake
     47       git checkout next
     48       ./bootstrap
     49       make
     50       sudo make install
     51 
     52 Okay. Now we'll build Clang!
     53 
     54 .. code-block:: console
     55 
     56       cd ~/clang-llvm
     57       mkdir build && cd build
     58       cmake -G Ninja ../llvm -DLLVM_BUILD_TESTS=ON  # Enable tests; default is off.
     59       ninja
     60       ninja check       # Test LLVM only.
     61       ninja clang-test  # Test Clang only.
     62       ninja install
     63 
     64 And we're live.
     65 
     66 All of the tests should pass, though there is a (very) small chance that
     67 you can catch LLVM and Clang out of sync. Running ``'git svn rebase'``
     68 in both the llvm and clang directories should fix any problems.
     69 
     70 Finally, we want to set Clang as its own compiler.
     71 
     72 .. code-block:: console
     73 
     74       cd ~/clang-llvm/build
     75       ccmake ../llvm
     76 
     77 The second command will bring up a GUI for configuring Clang. You need
     78 to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on
     79 advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to
     80 ``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to
     81 configure, then ``'g'`` to generate CMake's files.
     82 
     83 Finally, run ninja one last time, and you're done.
     84 
     85 Step 1: Create a ClangTool
     86 ==========================
     87 
     88 Now that we have enough background knowledge, it's time to create the
     89 simplest productive ClangTool in existence: a syntax checker. While this
     90 already exists as ``clang-check``, it's important to understand what's
     91 going on.
     92 
     93 First, we'll need to create a new directory for our tool and tell CMake
     94 that it exists. As this is not going to be a core clang tool, it will
     95 live in the ``tools/extra`` repository.
     96 
     97 .. code-block:: console
     98 
     99       cd ~/clang-llvm/llvm/tools/clang
    100       mkdir tools/extra/loop-convert
    101       echo 'add_subdirectory(loop-convert)' >> tools/extra/CMakeLists.txt
    102       vim tools/extra/loop-convert/CMakeLists.txt
    103 
    104 CMakeLists.txt should have the following contents:
    105 
    106 ::
    107 
    108       set(LLVM_LINK_COMPONENTS support)
    109       set(LLVM_USED_LIBS clangTooling clangBasic clangAST)
    110 
    111       add_clang_executable(loop-convert
    112         LoopConvert.cpp
    113         )
    114       target_link_libraries(loop-convert
    115         clangTooling
    116         clangBasic
    117         clangASTMatchers
    118         )
    119 
    120 With that done, Ninja will be able to compile our tool. Let's give it
    121 something to compile! Put the following into
    122 ``tools/extra/loop-convert/LoopConvert.cpp``. A detailed explanation of
    123 why the different parts are needed can be found in the `LibTooling
    124 documentation <LibTooling.html>`_.
    125 
    126 .. code-block:: c++
    127 
    128       // Declares clang::SyntaxOnlyAction.
    129       #include "clang/Frontend/FrontendActions.h"
    130       #include "clang/Tooling/CommonOptionsParser.h"
    131       #include "clang/Tooling/Tooling.h"
    132       // Declares llvm::cl::extrahelp.
    133       #include "llvm/Support/CommandLine.h"
    134 
    135       using namespace clang::tooling;
    136       using namespace llvm;
    137 
    138       // CommonOptionsParser declares HelpMessage with a description of the common
    139       // command-line options related to the compilation database and input files.
    140       // It's nice to have this help message in all tools.
    141       static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage);
    142 
    143       // A help message for this specific tool can be added afterwards.
    144       static cl::extrahelp MoreHelp("\nMore help text...");
    145 
    146       int main(int argc, const char **argv) {
    147         CommonOptionsParser OptionsParser(argc, argv);
    148         ClangTool Tool(OptionsParser.getCompilations(),
    149                        OptionsParser.getSourcePathList());
    150         return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>());
    151       }
    152 
    153 And that's it! You can compile our new tool by running ninja from the
    154 ``build`` directory.
    155 
    156 .. code-block:: console
    157 
    158       cd ~/clang-llvm/build
    159       ninja
    160 
    161 You should now be able to run the syntax checker, which is located in
    162 ``~/clang-llvm/build/bin``, on any source file. Try it!
    163 
    164 .. code-block:: console
    165 
    166       cat "void main() {}" > test.cpp
    167       bin/loop-convert test.cpp --
    168 
    169 Note the two dashes after we specify the source file. The additional
    170 options for the compiler are passed after the dashes rather than loading
    171 them from a compilation database - there just aren't any options needed
    172 right now.
    173 
    174 Intermezzo: Learn AST matcher basics
    175 ====================================
    176 
    177 Clang recently introduced the :doc:`ASTMatcher
    178 library <LibASTMatchers>` to provide a simple, powerful, and
    179 concise way to describe specific patterns in the AST. Implemented as a
    180 DSL powered by macros and templates (see
    181 `ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're
    182 curious), matchers offer the feel of algebraic data types common to
    183 functional programming languages.
    184 
    185 For example, suppose you wanted to examine only binary operators. There
    186 is a matcher to do exactly that, conveniently named ``binaryOperator``.
    187 I'll give you one guess what this matcher does:
    188 
    189 .. code-block:: c++
    190 
    191       binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0))))
    192 
    193 Shockingly, it will match against addition expressions whose left hand
    194 side is exactly the literal 0. It will not match against other forms of
    195 0, such as ``'\0'`` or ``NULL``, but it will match against macros that
    196 expand to 0. The matcher will also not match against calls to the
    197 overloaded operator ``'+'``, as there is a separate ``operatorCallExpr``
    198 matcher to handle overloaded operators.
    199 
    200 There are AST matchers to match all the different nodes of the AST,
    201 narrowing matchers to only match AST nodes fulfilling specific criteria,
    202 and traversal matchers to get from one kind of AST node to another. For
    203 a complete list of AST matchers, take a look at the `AST Matcher
    204 References <LibASTMatchersReference.html>`_
    205 
    206 All matcher that are nouns describe entities in the AST and can be
    207 bound, so that they can be referred to whenever a match is found. To do
    208 so, simply call the method ``bind`` on these matchers, e.g.:
    209 
    210 .. code-block:: c++
    211 
    212       variable(hasType(isInteger())).bind("intvar")
    213 
    214 Step 2: Using AST matchers
    215 ==========================
    216 
    217 Okay, on to using matchers for real. Let's start by defining a matcher
    218 which will capture all ``for`` statements that define a new variable
    219 initialized to zero. Let's start with matching all ``for`` loops:
    220 
    221 .. code-block:: c++
    222 
    223       forStmt()
    224 
    225 Next, we want to specify that a single variable is declared in the first
    226 portion of the loop, so we can extend the matcher to
    227 
    228 .. code-block:: c++
    229 
    230       forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl()))))
    231 
    232 Finally, we can add the condition that the variable is initialized to
    233 zero.
    234 
    235 .. code-block:: c++
    236 
    237       forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    238         hasInitializer(integerLiteral(equals(0))))))))
    239 
    240 It is fairly easy to read and understand the matcher definition ("match
    241 loops whose init portion declares a single variable which is initialized
    242 to the integer literal 0"), but deciding that every piece is necessary
    243 is more difficult. Note that this matcher will not match loops whose
    244 variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of
    245 zero besides the integer 0.
    246 
    247 The last step is giving the matcher a name and binding the ``ForStmt``
    248 as we will want to do something with it:
    249 
    250 .. code-block:: c++
    251 
    252       StatementMatcher LoopMatcher =
    253         forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    254           hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
    255 
    256 Once you have defined your matchers, you will need to add a little more
    257 scaffolding in order to run them. Matchers are paired with a
    258 ``MatchCallback`` and registered with a ``MatchFinder`` object, then run
    259 from a ``ClangTool``. More code!
    260 
    261 Add the following to ``LoopConvert.cpp``:
    262 
    263 .. code-block:: c++
    264 
    265       #include "clang/ASTMatchers/ASTMatchers.h"
    266       #include "clang/ASTMatchers/ASTMatchFinder.h"
    267 
    268       using namespace clang;
    269       using namespace clang::ast_matchers;
    270 
    271       StatementMatcher LoopMatcher =
    272         forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    273           hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
    274 
    275       class LoopPrinter : public MatchFinder::MatchCallback {
    276       public :
    277         virtual void run(const MatchFinder::MatchResult &Result) {
    278         if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop"))
    279           FS->dump();
    280       };
    281 
    282 And change ``main()`` to:
    283 
    284 .. code-block:: c++
    285 
    286       int main(int argc, const char **argv) {
    287         CommonOptionsParser OptionsParser(argc, argv);
    288         ClangTool Tool(OptionsParser.getCompilations(),
    289                        OptionsParser.getSourcePathList());
    290 
    291         LoopPrinter Printer;
    292         MatchFinder Finder;
    293         Finder.addMatcher(LoopMatcher, &Printer);
    294 
    295         return Tool.run(newFrontendActionFactory(&Finder));
    296       }
    297 
    298 Now, you should be able to recompile and run the code to discover for
    299 loops. Create a new file with a few examples, and test out our new
    300 handiwork:
    301 
    302 .. code-block:: console
    303 
    304       cd ~/clang-llvm/llvm/llvm_build/
    305       ninja loop-convert
    306       vim ~/test-files/simple-loops.cc
    307       bin/loop-convert ~/test-files/simple-loops.cc
    308 
    309 Step 3.5: More Complicated Matchers
    310 ===================================
    311 
    312 Our simple matcher is capable of discovering for loops, but we would
    313 still need to filter out many more ourselves. We can do a good portion
    314 of the remaining work with some cleverly chosen matchers, but first we
    315 need to decide exactly which properties we want to allow.
    316 
    317 How can we characterize for loops over arrays which would be eligible
    318 for translation to range-based syntax? Range based loops over arrays of
    319 size ``N`` that:
    320 
    321 -  start at index ``0``
    322 -  iterate consecutively
    323 -  end at index ``N-1``
    324 
    325 We already check for (1), so all we need to add is a check to the loop's
    326 condition to ensure that the loop's index variable is compared against
    327 ``N`` and another check to ensure that the increment step just
    328 increments this same variable. The matcher for (2) is straightforward:
    329 require a pre- or post-increment of the same variable declared in the
    330 init portion.
    331 
    332 Unfortunately, such a matcher is impossible to write. Matchers contain
    333 no logic for comparing two arbitrary AST nodes and determining whether
    334 or not they are equal, so the best we can do is matching more than we
    335 would like to allow, and punting extra comparisons to the callback.
    336 
    337 In any case, we can start building this sub-matcher. We can require that
    338 the increment step be a unary increment like this:
    339 
    340 .. code-block:: c++
    341 
    342       hasIncrement(unaryOperator(hasOperatorName("++")))
    343 
    344 Specifying what is incremented introduces another quirk of Clang's AST:
    345 Usages of variables are represented as ``DeclRefExpr``'s ("declaration
    346 reference expressions") because they are expressions which refer to
    347 variable declarations. To find a ``unaryOperator`` that refers to a
    348 specific declaration, we can simply add a second condition to it:
    349 
    350 .. code-block:: c++
    351 
    352       hasIncrement(unaryOperator(
    353         hasOperatorName("++"),
    354         hasUnaryOperand(declRefExpr())))
    355 
    356 Furthermore, we can restrict our matcher to only match if the
    357 incremented variable is an integer:
    358 
    359 .. code-block:: c++
    360 
    361       hasIncrement(unaryOperator(
    362         hasOperatorName("++"),
    363         hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger())))))))
    364 
    365 And the last step will be to attach an identifier to this variable, so
    366 that we can retrieve it in the callback:
    367 
    368 .. code-block:: c++
    369 
    370       hasIncrement(unaryOperator(
    371         hasOperatorName("++"),
    372         hasUnaryOperand(declRefExpr(to(
    373           varDecl(hasType(isInteger())).bind("incrementVariable"))))))
    374 
    375 We can add this code to the definition of ``LoopMatcher`` and make sure
    376 that our program, outfitted with the new matcher, only prints out loops
    377 that declare a single variable initialized to zero and have an increment
    378 step consisting of a unary increment of some variable.
    379 
    380 Now, we just need to add a matcher to check if the condition part of the
    381 ``for`` loop compares a variable against the size of the array. There is
    382 only one problem - we don't know which array we're iterating over
    383 without looking at the body of the loop! We are again restricted to
    384 approximating the result we want with matchers, filling in the details
    385 in the callback. So we start with:
    386 
    387 .. code-block:: c++
    388 
    389       hasCondition(binaryOperator(hasOperatorName("<"))
    390 
    391 It makes sense to ensure that the left-hand side is a reference to a
    392 variable, and that the right-hand side has integer type.
    393 
    394 .. code-block:: c++
    395 
    396       hasCondition(binaryOperator(
    397         hasOperatorName("<"),
    398         hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))),
    399         hasRHS(expr(hasType(isInteger())))))
    400 
    401 Why? Because it doesn't work. Of the three loops provided in
    402 ``test-files/simple.cpp``, zero of them have a matching condition. A
    403 quick look at the AST dump of the first for loop, produced by the
    404 previous iteration of loop-convert, shows us the answer:
    405 
    406 ::
    407 
    408       (ForStmt 0x173b240
    409         (DeclStmt 0x173afc8
    410           0x173af50 "int i =
    411             (IntegerLiteral 0x173afa8 'int' 0)")
    412         <<>>
    413         (BinaryOperator 0x173b060 '_Bool' '<'
    414           (ImplicitCastExpr 0x173b030 'int' 
    415             (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int'))
    416           (ImplicitCastExpr 0x173b048 'int' 
    417             (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int')))
    418         (UnaryOperator 0x173b0b0 'int' lvalue prefix '++'
    419           (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int'))
    420         (CompoundStatement 
    421 
    422 We already know that the declaration and increments both match, or this
    423 loop wouldn't have been dumped. The culprit lies in the implicit cast
    424 applied to the first operand (i.e. the LHS) of the less-than operator,
    425 an L-value to R-value conversion applied to the expression referencing
    426 ``i``. Thankfully, the matcher library offers a solution to this problem
    427 in the form of ``ignoringParenImpCasts``, which instructs the matcher to
    428 ignore implicit casts and parentheses before continuing to match.
    429 Adjusting the condition operator will restore the desired match.
    430 
    431 .. code-block:: c++
    432 
    433       hasCondition(binaryOperator(
    434         hasOperatorName("<"),
    435         hasLHS(ignoringParenImpCasts(declRefExpr(
    436           to(varDecl(hasType(isInteger())))))),
    437         hasRHS(expr(hasType(isInteger())))))
    438 
    439 After adding binds to the expressions we wished to capture and
    440 extracting the identifier strings into variables, we have array-step-2
    441 completed.
    442 
    443 Step 4: Retrieving Matched Nodes
    444 ================================
    445 
    446 So far, the matcher callback isn't very interesting: it just dumps the
    447 loop's AST. At some point, we will need to make changes to the input
    448 source code. Next, we'll work on using the nodes we bound in the
    449 previous step.
    450 
    451 The ``MatchFinder::run()`` callback takes a
    452 ``MatchFinder::MatchResult&`` as its parameter. We're most interested in
    453 its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext``
    454 class to represent contextual information about the AST, as the name
    455 implies, though the most functionally important detail is that several
    456 operations require an ``ASTContext*`` parameter. More immediately useful
    457 is the set of matched nodes, and how we retrieve them.
    458 
    459 Since we bind three variables (identified by ConditionVarName,
    460 InitVarName, and IncrementVarName), we can obtain the matched nodes by
    461 using the ``getNodeAs()`` member function.
    462 
    463 In ``LoopActions.cpp``:
    464 
    465 .. code-block:: c++
    466 
    467       #include "clang/AST/ASTContext.h"
    468 
    469       void LoopPrinter::run(const MatchFinder::MatchResult &Result) {
    470         ASTContext *Context = Result.Context;
    471         const ForStmt *FS = Result.Nodes.getStmtAs<ForStmt>(LoopName);
    472         // We do not want to convert header files!
    473         if (!FS || !Context->getSourceManager().isFromMainFile(FS->getForLoc()))
    474           return;
    475         const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>(IncrementVarName);
    476         const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>(ConditionVarName);
    477         const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>(InitVarName);
    478 
    479 Now that we have the three variables, represented by their respective
    480 declarations, let's make sure that they're all the same, using a helper
    481 function I call ``areSameVariable()``.
    482 
    483 .. code-block:: c++
    484 
    485       if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar))
    486         return;
    487       llvm::outs() << "Potential array-based loop discovered.\n";
    488     }
    489 
    490 If execution reaches the end of ``LoopPrinter::run()``, we know that the
    491 loop shell that looks like
    492 
    493 .. code-block:: c++
    494 
    495       for (int i= 0; i < expr(); ++i) { ... }
    496 
    497 For now, we will just print a message explaining that we found a loop.
    498 The next section will deal with recursively traversing the AST to
    499 discover all changes needed.
    500 
    501 As a side note, here is the implementation of ``areSameVariable``. Clang
    502 associates a ``VarDecl`` with each variable to represent the variable's
    503 declaration. Since the "canonical" form of each declaration is unique by
    504 address, all we need to do is make sure neither ``ValueDecl`` (base
    505 class of ``VarDecl``) is ``NULL`` and compare the canonical Decls.
    506 
    507 .. code-block:: c++
    508 
    509       static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) {
    510         return First && Second &&
    511                First->getCanonicalDecl() == Second->getCanonicalDecl();
    512       }
    513 
    514 It's not as trivial to test if two expressions are the same, though
    515 Clang has already done the hard work for us by providing a way to
    516 canonicalize expressions:
    517 
    518 .. code-block:: c++
    519 
    520       static bool areSameExpr(ASTContext *Context, const Expr *First,
    521                               const Expr *Second) {
    522         if (!First || !Second)
    523           return false;
    524         llvm::FoldingSetNodeID FirstID, SecondID;
    525         First->Profile(FirstID, *Context, true);
    526         Second->Profile(SecondID, *Context, true);
    527         return FirstID == SecondID;
    528       }
    529 
    530 This code relies on the comparison between two
    531 ``llvm::FoldingSetNodeIDs``. As the documentation for
    532 ``Stmt::Profile()`` indicates, the ``Profile()`` member function builds
    533 a description of a node in the AST, based on its properties, along with
    534 those of its children. ``FoldingSetNodeID`` then serves as a hash we can
    535 use to compare expressions. We will need ``areSameExpr`` later. Before
    536 you run the new code on the additional loops added to
    537 test-files/simple.cpp, try to figure out which ones will be considered
    538 potentially convertible.
    539