Home | History | Annotate | Download | only in docs
      1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
      2           "http://www.w3.org/TR/html4/strict.dtd">
      3 <html>
      4 <head>
      5 <title>Static Analyzer Design Document: Memory Regions</title>
      6 </head>
      7 <body>
      8   
      9 <h1>Static Analyzer Design Document: Memory Regions</h1>
     10 
     11 <h3>Authors</h3>
     12 
     13 <p>Ted Kremenek, <tt>kremenek at apple</tt><br>
     14 Zhongxing Xu, <tt>xuzhongzhing at gmail</tt></p>
     15 
     16 <h2 id="intro">Introduction</h2>
     17 
     18 <p>The path-sensitive analysis engine in libAnalysis employs an extensible API
     19 for abstractly modeling the memory of an analyzed program. This API employs the
     20 concept of "memory regions" to abstractly model chunks of program memory such as
     21 program variables and dynamically allocated memory such as those returned from
     22 'malloc' and 'alloca'. Regions are hierarchical, with subregions modeling
     23 subtyping relationships, field and array offsets into larger chunks of memory,
     24 and so on.</p>
     25 
     26 <p>The region API consists of two components:</p>
     27 
     28 <ul> <li>A taxonomy and representation of regions themselves within the analyzer
     29 engine. The primary definitions and interfaces are described in <tt><a
     30 href="http://clang.llvm.org/doxygen/MemRegion_8h-source.html">MemRegion.h</a></tt>.
     31 At the root of the region hierarchy is the class <tt>MemRegion</tt> with
     32 specific subclasses refining the region concept for variables, heap allocated
     33 memory, and so forth.</li> <li>The modeling of binding of values to regions. For
     34 example, modeling the value stored to a local variable <tt>x</tt> consists of
     35 recording the binding between the region for <tt>x</tt> (which represents the
     36 raw memory associated with <tt>x</tt>) and the value stored to <tt>x</tt>. This
     37 binding relationship is captured with the notion of &quot;symbolic
     38 stores.&quot;</li> </ul>
     39 
     40 <p>Symbolic stores, which can be thought of as representing the relation
     41 <tt>regions -> values</tt>, are implemented by subclasses of the
     42 <tt>StoreManager</tt> class (<tt><a
     43 href="http://clang.llvm.org/doxygen/Store_8h-source.html">Store.h</a></tt>). A
     44 particular StoreManager implementation has complete flexibility concerning the
     45 following:
     46 
     47 <ul>
     48 <li><em>How</em> to model the binding between regions and values</li>
     49 <li><em>What</em> bindings are recorded
     50 </ul>
     51 
     52 <p>Together, both points allow different StoreManagers to tradeoff between
     53 different levels of analysis precision and scalability concerning the reasoning
     54 of program memory. Meanwhile, the core path-sensitive engine makes no
     55 assumptions about either points, and queries a StoreManager about the bindings
     56 to a memory region through a generic interface that all StoreManagers share. If
     57 a particular StoreManager cannot reason about the potential bindings of a given
     58 memory region (e.g., '<tt>BasicStoreManager</tt>' does not reason about fields
     59 of structures) then the StoreManager can simply return 'unknown' (represented by
     60 '<tt>UnknownVal</tt>') for a particular region-binding. This separation of
     61 concerns not only isolates the core analysis engine from the details of
     62 reasoning about program memory but also facilities the option of a client of the
     63 path-sensitive engine to easily swap in different StoreManager implementations
     64 that internally reason about program memory in very different ways.</p>
     65 
     66 <p>The rest of this document is divided into two parts. We first discuss region
     67 taxonomy and the semantics of regions. We then discuss the StoreManager
     68 interface, and details of how the currently available StoreManager classes
     69 implement region bindings.</p>
     70 
     71 <h2 id="regions">Memory Regions and Region Taxonomy</h2>
     72 
     73 <h3>Pointers</h3>
     74 
     75 <p>Before talking about the memory regions, we would talk about the pointers
     76 since memory regions are essentially used to represent pointer values.</p>
     77 
     78 <p>The pointer is a type of values. Pointer values have two semantic aspects.
     79 One is its physical value, which is an address or location. The other is the
     80 type of the memory object residing in the address.</p>
     81 
     82 <p>Memory regions are designed to abstract these two properties of the pointer.
     83 The physical value of a pointer is represented by MemRegion pointers. The rvalue
     84 type of the region corresponds to the type of the pointee object.</p>
     85 
     86 <p>One complication is that we could have different view regions on the same
     87 memory chunk. They represent the same memory location, but have different
     88 abstract location, i.e., MemRegion pointers. Thus we need to canonicalize the
     89 abstract locations to get a unique abstract location for one physical
     90 location.</p>
     91 
     92 <p>Furthermore, these different view regions may or may not represent memory
     93 objects of different types. Some different types are semantically the same,
     94 for example, 'struct s' and 'my_type' are the same type.</p>
     95 
     96 <pre>
     97 struct s;
     98 typedef struct s my_type;
     99 </pre>
    100 
    101 <p>But <tt>char</tt> and <tt>int</tt> are not the same type in the code below:</p>
    102 
    103 <pre>
    104 void *p;
    105 int *q = (int*) p;
    106 char *r = (char*) p;
    107 </pre>
    108 
    109 <p>Thus we need to canonicalize the MemRegion which is used in binding and
    110 retrieving.</p>
    111 
    112 <h3>Regions</h3>
    113 <p>Region is the entity used to model pointer values. A Region has the following
    114 properties:</p>
    115 
    116 <ul>
    117 <li>Kind</li>
    118 
    119 <li>ObjectType: the type of the object residing on the region.</li>
    120 
    121 <li>LocationType: the type of the pointer value that the region corresponds to.
    122   Usually this is the pointer to the ObjectType. But sometimes we want to cache
    123   this type explicitly, for example, for a CodeTextRegion.</li>
    124 
    125 <li>StartLocation</li>
    126 
    127 <li>EndLocation</li>
    128 </ul>
    129 
    130 <h3>Symbolic Regions</h3>
    131 
    132 <p>A symbolic region is a map of the concept of symbolic values into the domain
    133 of regions. It is the way that we represent symbolic pointers. Whenever a
    134 symbolic pointer value is needed, a symbolic region is created to represent
    135 it.</p>
    136 
    137 <p>A symbolic region has no type. It wraps a SymbolData. But sometimes we have
    138 type information associated with a symbolic region. For this case, a
    139 TypedViewRegion is created to layer the type information on top of the symbolic
    140 region. The reason we do not carry type information with the symbolic region is
    141 that the symbolic regions can have no type. To be consistent, we don't let them
    142 to carry type information.</p>
    143 
    144 <p>Like a symbolic pointer, a symbolic region may be NULL, has unknown extent,
    145 and represents a generic chunk of memory.</p>
    146 
    147 <p><em><b>NOTE</b>: We plan not to use loc::SymbolVal in RegionStore and remove it
    148   gradually.</em></p>
    149 
    150 <p>Symbolic regions get their rvalue types through the following ways:</p>
    151 
    152 <ul>
    153 <li>Through the parameter or global variable that points to it, e.g.:
    154 <pre>
    155 void f(struct s* p) {
    156   ...
    157 }
    158 </pre>
    159 
    160 <p>The symbolic region pointed to by <tt>p</tt> has type <tt>struct
    161 s</tt>.</p></li>
    162 
    163 <li>Through explicit or implicit casts, e.g.:
    164 <pre>
    165 void f(void* p) {
    166   struct s* q = (struct s*) p;
    167   ...
    168 }
    169 </pre>
    170 </li>
    171 </ul>
    172 
    173 <p>We attach the type information to the symbolic region lazily. For the first
    174 case above, we create the <tt>TypedViewRegion</tt> only when the pointer is
    175 actually used to access the pointee memory object, that is when the element or
    176 field region is created. For the cast case, the <tt>TypedViewRegion</tt> is
    177 created when visiting the <tt>CastExpr</tt>.</p>
    178 
    179 <p>The reason for doing lazy typing is that symbolic regions are sometimes only
    180 used to do location comparison.</p>
    181 
    182 <h3>Pointer Casts</h3>
    183 
    184 <p>Pointer casts allow people to impose different 'views' onto a chunk of
    185 memory.</p>
    186 
    187 <p>Usually we have two kinds of casts. One kind of casts cast down with in the
    188 type hierarchy. It imposes more specific views onto more generic memory regions.
    189 The other kind of casts cast up with in the type hierarchy. It strips away more
    190 specific views on top of the more generic memory regions.</p>
    191 
    192 <p>We simulate the down casts by layering another <tt>TypedViewRegion</tt> on
    193 top of the original region. We simulate the up casts by striping away the top
    194 <tt>TypedViewRegion</tt>. Down casts is usually simple. For up casts, if the
    195 there is no <tt>TypedViewRegion</tt> to be stripped, we return the original
    196 region. If the underlying region is of the different type than the cast-to type,
    197 we flag an error state.</p>
    198 
    199 <p>For toll-free bridging casts, we return the original region.</p>
    200 
    201 <p>We can set up a partial order for pointer types, with the most general type
    202 <tt>void*</tt> at the top. The partial order forms a tree with <tt>void*</tt> as
    203 its root node.</p>
    204 
    205 <p>Every <tt>MemRegion</tt> has a root position in the type tree. For example,
    206 the pointee region of <tt>void *p</tt> has its root position at the root node of
    207 the tree. <tt>VarRegion</tt> of <tt>int x</tt> has its root position at the 'int
    208 type' node.</p>
    209 
    210 <p><tt>TypedViewRegion</tt> is used to move the region down or up in the tree.
    211 Moving down in the tree adds a <tt>TypedViewRegion</tt>. Moving up in the tree
    212 removes a <Tt>TypedViewRegion</tt>.</p>
    213 
    214 <p>Do we want to allow moving up beyond the root position? This happens
    215 when:</p> <pre> int x; void *p = &amp;x; </pre>
    216 
    217 <p>The region of <tt>x</tt> has its root position at 'int*' node. the cast to
    218 void* moves that region up to the 'void*' node. I propose to not allow such
    219 casts, and assign the region of <tt>x</tt> for <tt>p</tt>.</p>
    220 
    221 <p>Another non-ideal case is that people might cast to a non-generic pointer
    222 from another non-generic pointer instead of first casting it back to the generic
    223 pointer. Direct handling of this case would result in multiple layers of
    224 TypedViewRegions. This enforces an incorrect semantic view to the region,
    225 because we can only have one typed view on a region at a time. To avoid this
    226 inconsistency, before casting the region, we strip the TypedViewRegion, then do
    227 the cast. In summary, we only allow one layer of TypedViewRegion.</p>
    228 
    229 <h3>Region Bindings</h3>
    230 
    231 <p>The following region kinds are boundable: VarRegion, CompoundLiteralRegion,
    232 StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.</p>
    233 
    234 <p>When binding regions, we perform canonicalization on element regions and field
    235 regions. This is because we can have different views on the same region, some
    236 of which are essentially the same view with different sugar type names.</p>
    237 
    238 <p>To canonicalize a region, we get the canonical types for all TypedViewRegions
    239 along the way up to the root region, and make new TypedViewRegions with those
    240 canonical types.</p>
    241 
    242 <p>For Objective-C and C++, perhaps another canonicalization rule should be
    243 added: for FieldRegion, the least derived class that has the field is used as
    244 the type of the super region of the FieldRegion.</p>
    245 
    246 <p>All bindings and retrievings are done on the canonicalized regions.</p>
    247 
    248 <p>Canonicalization is transparent outside the region store manager, and more
    249 specifically, unaware outside the Bind() and Retrieve() method. We don't need to
    250 consider region canonicalization when doing pointer cast.</p>
    251 
    252 <h3>Constraint Manager</h3>
    253 
    254 <p>The constraint manager reasons about the abstract location of memory objects.
    255 We can have different views on a region, but none of these views changes the
    256 location of that object. Thus we should get the same abstract location for those
    257 regions.</p>
    258 
    259 </body>
    260 </html>
    261