//===----------------------------------------------------------------------===//
// C Language Family Front-end
//===----------------------------------------------------------------------===//
                                                             Chris Lattner

I. Introduction:

 clang: noun
    1. A loud, resonant, metallic sound.
    2. The strident call of a crane or goose.
    3. C-language family front-end toolkit.

 The world needs better compiler tools, tools which are built as libraries. This
 design point allows reuse of the tools in new and novel ways. However, building
 the tools as libraries isn't enough: they must have clean APIs, be as
 decoupled from each other as possible, and be easy to modify/extend.  This
 requires clean layering, decent design, and avoiding tying the libraries to a
 specific use.  Oh yeah, did I mention that we want the resultant libraries to
 be as fast as possible? :)

 This front-end is built as a component of the LLVM toolkit that can be used
 with the LLVM backend or independently of it.  In this spirit, the API has been
 carefully designed as the following components:

   libsupport  - Basic support library, reused from LLVM.

   libsystem   - System abstraction library, reused from LLVM.

   libbasic    - Diagnostics, SourceLocations, SourceBuffer abstraction,
                 file system caching for input source files.  This depends on
                 libsupport and libsystem.

   libast      - Provides classes to represent the C AST, the C type system,
                 builtin functions, and various helpers for analyzing and
                 manipulating the AST (visitors, pretty printers, etc).  This
                 library depends on libbasic.

   liblex      - C/C++/ObjC lexing and preprocessing, identifier hash table,
                 pragma handling, tokens, and macros.  This depends on libbasic.

   libparse    - C (for now) parsing and local semantic analysis. This library
                 invokes coarse-grained 'Actions' provided by the client to do
                 stuff (e.g. libsema builds ASTs).  This depends on liblex.

   libsema     - Provides a set of parser actions to build a standardized AST
                 for programs.  ASTs are 'streamed' out a top-level declaration
                 at a time, allowing clients to use decl-at-a-time processing,
                 build up entire translation units, or even build 'whole
                 program' ASTs depending on how they use the APIs.  This depends
                 on libast and libparse.

   librewrite  - Fast, scalable rewriting of source code.  This operates on
                 the raw syntactic text of source code, allowing a client
                 to insert and delete text in very large source files using
                 the same source location information embedded in ASTs.  This
                 is intended to be a low-level API that is useful for
                 higher-level clients and libraries such as code refactoring.

   libanalysis - Source-level dataflow analysis useful for performing analyses
                 such as computing live variables.  It also includes a
                 path-sensitive "graph-reachability" engine for writing
                 analyses that reason about different possible paths of
                 execution through source code.  This is currently being
                 employed to write a set of checks for finding bugs in software.

   libcodegen  - Lowers the AST to LLVM IR for optimization & codegen.  Depends
                 on libast.

   clang       - An example driver, client of the libraries at various levels.
                 This depends on all these libraries, and on LLVM VMCore.

 This front-end has been intentionally built as a DAG of libraries, making it
 easy to reuse individual parts or replace pieces if desired. For example, to
 build a preprocessor, you take the Basic and Lexer libraries. If you want an
 indexer, you take those plus the Parser library and provide some actions for
 indexing.  If you want a refactoring, static analysis, or source-to-source
 compiler tool, it makes sense to take those plus the AST building and semantic
 analyzer library.  Finally, if you want to use this with the LLVM backend,
 you'd take these components plus the AST to LLVM lowering code.
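
 As a rough illustration of the layering, the dependencies stated in the
 component list can be written down as a small graph and walked to find what
 a given tool has to link.  This is only a sketch: the C++ names used here
 (DepGraph, collectDeps, linkSet) are made up for this example and are not part
 of the libraries, and librewrite and libanalysis are left out because their
 dependencies are not stated in the list.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Library name -> libraries it directly depends on (hypothetical type).
using DepGraph = std::map<std::string, std::vector<std::string>>;

// Direct dependencies exactly as stated in the component list.
// librewrite and libanalysis are omitted: the text does not say what
// they depend on.
inline DepGraph clangLibraryDeps() {
  return {
      {"libsupport", {}},
      {"libsystem", {}},
      {"libbasic", {"libsupport", "libsystem"}},
      {"libast", {"libbasic"}},
      {"liblex", {"libbasic"}},
      {"libparse", {"liblex"}},
      {"libsema", {"libast", "libparse"}},
      {"libcodegen", {"libast"}},
  };
}

// Adds `lib` and everything it transitively depends on to `out`.
inline void collectDeps(const DepGraph &graph, const std::string &lib,
                        std::set<std::string> &out) {
  if (!out.insert(lib).second)
    return;  // already visited
  auto it = graph.find(lib);
  assert(it != graph.end() && "library missing from the graph");
  for (const std::string &dep : it->second)
    collectDeps(graph, dep, out);
}

// Full set of libraries needed by a tool built on the given top-level ones.
inline std::set<std::string> linkSet(const std::vector<std::string> &roots) {
  const DepGraph graph = clangLibraryDeps();
  std::set<std::string> out;
  for (const std::string &root : roots)
    collectDeps(graph, root, out);
  return out;
}
```

 With this graph, linkSet({"liblex"}) yields the preprocessor case (the lexer,
 libbasic, and the two LLVM support libraries beneath it), while
 linkSet({"libsema", "libcodegen"}) pulls in all eight listed libraries.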

 In the future I hope this toolkit will grow to include new and interesting
 components, including a C++ front-end, ObjC support, and a whole lot of other
 things.

 Finally, it should be pointed out that the goal here is to build something that
 is high-quality and industrial-strength: all the obnoxious features of the C
 family must be correctly supported (trigraphs, preprocessor arcana, K&R-style
 prototypes, GCC/MS extensions, etc).  It cannot be used if it is not 'real'.


II. Usage of clang driver:

 * Basic Command-Line Options:
   - Help: clang --help
   - Standard GCC options accepted: -E, -I*, -i*, -pedantic, -std=c90, etc.
   - To make diagnostics more gcc-like: -fno-caret-diagnostics -fno-show-column
   - Enable metric printing: -stats

 * -fsyntax-only is currently the default mode.  It is only partially
     implemented, lacking some semantic analysis (some errors and warnings
     are not produced).

 * -E mode works the same way as in GCC.

 * -Eonly mode does all preprocessing, but does not print the output,
     useful for timing the preprocessor.

 * -parse-noop parses code without building an AST.  This is useful
     for timing the cost of the parser without including AST building
     time.

 * -parse-ast builds ASTs, but doesn't print them.  This is most
     useful for timing AST building vs -parse-noop.

 * -parse-ast-print pretty-prints most expression and statement nodes.

 * -parse-ast-check checks that diagnostic messages that are expected
     are reported and that those which are reported are expected.

 * -dump-cfg builds ASTs and then CFGs.  CFGs are then pretty-printed.

 * -view-cfg builds ASTs and then CFGs.  CFGs are then visualized by
     invoking Graphviz.

     For more information on getting Graphviz to work with clang/LLVM,
     see: http://llvm.org/docs/ProgrammersManual.html#ViewGraph
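
 The two-way comparison that -parse-ast-check performs can be pictured as a
 pair of set differences.  The sketch below is not the driver's code: Diag,
 CheckResult and checkDiagnostics are hypothetical names, and a diagnostic is
 reduced to a (line, message) pair purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <utility>

// A diagnostic reduced to (line number, message text); hypothetical type.
using Diag = std::pair<unsigned, std::string>;

struct CheckResult {
  std::multiset<Diag> missing;     // expected but never reported
  std::multiset<Diag> unexpected;  // reported but never expected
  bool ok() const { return missing.empty() && unexpected.empty(); }
};

// Compares the two multisets in both directions.  Multisets are used so that
// a diagnostic expected twice but reported once still counts as missing once.
inline CheckResult checkDiagnostics(const std::multiset<Diag> &expected,
                                    const std::multiset<Diag> &reported) {
  CheckResult result;
  std::set_difference(expected.begin(), expected.end(),
                      reported.begin(), reported.end(),
                      std::inserter(result.missing, result.missing.end()));
  std::set_difference(reported.begin(), reported.end(),
                      expected.begin(), expected.end(),
                      std::inserter(result.unexpected,
                                    result.unexpected.end()));
  // Sanity check: a difference can never be larger than its first operand.
  assert(result.missing.size() <= expected.size());
  assert(result.unexpected.size() <= reported.size());
  return result;
}
```

 A run passes only when ok() is true, i.e. when both differences are empty.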


III. Current advantages over GCC:

 * Column numbers are fully tracked (no 256 col limit, no GCC-style pruning).
 * All diagnostics have column numbers, include 'caret diagnostics', and
   highlight regions of interesting code (e.g. the LHS and RHS of a binop).
 * Full diagnostic customization by client (can format diagnostics however they
   like, e.g. in an IDE or refactoring tool) through DiagnosticClient interface.
 * Built as a framework, can be reused by multiple tools.
 * All languages supported linked into same library (no cc1, cc1obj, ...).
 * mmaps source code read-only, so it does not dirty the pages like GCC does
   (smaller memory footprint).
 * LLVM License, can be linked into non-GPL projects.
 * Full diagnostic control, per diagnostic.  Diagnostics are identified by ID.
 * Significantly faster than GCC at semantic analysis, parsing, preprocessing
   and lexing.
 * Defers exposing platform-specific stuff to as late as possible, tracks use of
   platform-specific features (e.g. #ifdef PPC) to allow 'portable bytecodes'.
 * The lexer doesn't rely on the "lexer hack": it has no notion of scope and
   does not categorize identifiers as types or variables -- this is up to the
   parser to decide.
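
 To make the last point concrete, consider the classic ambiguity "T * x;",
 which declares a pointer if T names a type and is a multiplication otherwise.
 The toy below is a sketch, not this front-end's lexer or parser: every name in
 it (Token, lex, classify, ...) is invented for illustration, and it only
 understands this one statement shape.  The lexer hands back the same tokens
 either way; the decision is made afterwards, using a set of known type names.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

enum class TokKind { Identifier, Star, Semi, Unknown };

struct Token {
  TokKind kind;
  std::string text;
};

// Scope-free lexer: every name is just an Identifier.  No symbol table is
// consulted, so the lexer cannot (and need not) know whether a name is a type.
inline std::vector<Token> lex(const std::string &src) {
  std::vector<Token> toks;
  std::size_t i = 0;
  while (i < src.size()) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    if (std::isspace(c)) {
      ++i;
      continue;
    }
    if (std::isalpha(c) || c == '_') {
      std::size_t start = i;
      while (i < src.size() &&
             (std::isalnum(static_cast<unsigned char>(src[i])) ||
              src[i] == '_'))
        ++i;
      assert(i > start);  // an identifier consumes at least one character
      toks.push_back({TokKind::Identifier, src.substr(start, i - start)});
      continue;
    }
    TokKind kind = c == '*'   ? TokKind::Star
                   : c == ';' ? TokKind::Semi
                              : TokKind::Unknown;
    toks.push_back({kind, std::string(1, src[i])});
    ++i;
  }
  return toks;
}

enum class StmtKind { PointerDeclaration, MultiplyExpression, Other };

// Parser-side decision: only here do we ask whether the first name is a type.
inline StmtKind classify(const std::vector<Token> &toks,
                         const std::set<std::string> &typeNames) {
  bool shape = toks.size() == 4 &&
               toks[0].kind == TokKind::Identifier &&
               toks[1].kind == TokKind::Star &&
               toks[2].kind == TokKind::Identifier &&
               toks[3].kind == TokKind::Semi;
  if (!shape)
    return StmtKind::Other;
  return typeNames.count(toks[0].text) ? StmtKind::PointerDeclaration
                                       : StmtKind::MultiplyExpression;
}
```

 Calling classify on the tokens of "T * x;" with {"T"} as the type names gives
 PointerDeclaration, and with an empty set gives MultiplyExpression; the token
 stream is identical in both cases.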

Potential Future Features:

 * Fine grained diag control within the source (#pragma enable/disable warning).
 * Better token tracking within macros?  (Token came from this line, which is
   a macro argument instantiated here, recursively instantiated here).
 * Fast #import with a module system.
 * Dependency tracking: a change to a header file doesn't recompile every
   function that textually depends on it; only those functions that need it
   are recompiled.  This is aka 'incremental parsing'.


IV. Missing Functionality / Improvements

Lexer:
 * Source character mapping.  GCC supports ASCII and UTF-8.
   See GCC options: -ftarget-charset and -ftarget-wide-charset.
 * Universal character support.  Experimental in GCC, enabled with
   -fextended-identifiers.
 * -fpreprocessed mode.

Preprocessor:
 * #assert/#unassert
 * MSExtension: "L#param" stringizes to a wide string literal.
 * Add support for -M*

Traditional Preprocessor:
 * Currently, we have none. :)
