1 <html> 2 <head> 3 <title>Static Analyzer Design Document: Memory Regions</title> 4 </head> 5 <body> 6 7 <h1>Static Analyzer Design Document: Memory Regions</h1> 8 9 <h3>Authors</h3> 10 11 <p>Ted Kremenek, <tt>kremenek at apple</tt><br> 12 Zhongxing Xu, <tt>xuzhongzhing at gmail</tt></p> 13 14 <h2 id="intro">Introduction</h2> 15 16 <p>The path-sensitive analysis engine in libAnalysis employs an extensible API 17 for abstractly modeling the memory of an analyzed program. This API employs the 18 concept of "memory regions" to abstractly model chunks of program memory such as 19 program variables and dynamically allocated memory such as those returned from 20 'malloc' and 'alloca'. Regions are hierarchical, with subregions modeling 21 subtyping relationships, field and array offsets into larger chunks of memory, 22 and so on.</p> 23 24 <p>The region API consists of two components:</p> 25 26 <ul> <li>A taxonomy and representation of regions themselves within the analyzer 27 engine. The primary definitions and interfaces are described in <tt><a 28 href="http://clang.llvm.org/doxygen/MemRegion_8h-source.html">MemRegion.h</a></tt>. 29 At the root of the region hierarchy is the class <tt>MemRegion</tt> with 30 specific subclasses refining the region concept for variables, heap allocated 31 memory, and so forth.</li> <li>The modeling of binding of values to regions. For 32 example, modeling the value stored to a local variable <tt>x</tt> consists of 33 recording the binding between the region for <tt>x</tt> (which represents the 34 raw memory associated with <tt>x</tt>) and the value stored to <tt>x</tt>. This 35 binding relationship is captured with the notion of "symbolic 36 stores."</li> </ul> 37 38 <p>Symbolic stores, which can be thought of as representing the relation 39 <tt>regions -> values</tt>, are implemented by subclasses of the 40 <tt>StoreManager</tt> class (<tt><a 41 href="http://clang.llvm.org/doxygen/Store_8h-source.html">Store.h</a></tt>). A 42 particular StoreManager implementation has complete flexibility concerning the 43 following: 44 45 <ul> 46 <li><em>How</em> to model the binding between regions and values</li> 47 <li><em>What</em> bindings are recorded 48 </ul> 49 50 <p>Together, both points allow different StoreManagers to tradeoff between 51 different levels of analysis precision and scalability concerning the reasoning 52 of program memory. Meanwhile, the core path-sensitive engine makes no 53 assumptions about either points, and queries a StoreManager about the bindings 54 to a memory region through a generic interface that all StoreManagers share. If 55 a particular StoreManager cannot reason about the potential bindings of a given 56 memory region (e.g., '<tt>BasicStoreManager</tt>' does not reason about fields 57 of structures) then the StoreManager can simply return 'unknown' (represented by 58 '<tt>UnknownVal</tt>') for a particular region-binding. This separation of 59 concerns not only isolates the core analysis engine from the details of 60 reasoning about program memory but also facilities the option of a client of the 61 path-sensitive engine to easily swap in different StoreManager implementations 62 that internally reason about program memory in very different ways.</pp> 63 64 <p>The rest of this document is divided into two parts. We first discuss region 65 taxonomy and the semantics of regions. We then discuss the StoreManager 66 interface, and details of how the currently available StoreManager classes 67 implement region bindings.</p> 68 69 <h2 id="regions">Memory Regions and Region Taxonomy</h2> 70 71 <h3>Pointers</h3> 72 73 <p>Before talking about the memory regions, we would talk about the pointers 74 since memory regions are essentially used to represent pointer values.</p> 75 76 <p>The pointer is a type of values. Pointer values have two semantic aspects. 77 One is its physical value, which is an address or location. The other is the 78 type of the memory object residing in the address.</p> 79 80 <p>Memory regions are designed to abstract these two properties of the pointer. 81 The physical value of a pointer is represented by MemRegion pointers. The rvalue 82 type of the region corresponds to the type of the pointee object.</p> 83 84 <p>One complication is that we could have different view regions on the same 85 memory chunk. They represent the same memory location, but have different 86 abstract location, i.e., MemRegion pointers. Thus we need to canonicalize the 87 abstract locations to get a unique abstract location for one physical 88 location.</p> 89 90 <p>Furthermore, these different view regions may or may not represent memory 91 objects of different types. Some different types are semantically the same, 92 for example, 'struct s' and 'my_type' are the same type.</p> 93 94 <pre> 95 struct s; 96 typedef struct s my_type; 97 </pre> 98 99 <p>But <tt>char</tt> and <tt>int</tt> are not the same type in the code below:</p> 100 101 <pre> 102 void *p; 103 int *q = (int*) p; 104 char *r = (char*) p; 105 </pre 106 107 <p>Thus we need to canonicalize the MemRegion which is used in binding and 108 retrieving.</p> 109 110 <h3>Regions</h3> 111 <p>Region is the entity used to model pointer values. A Region has the following 112 properties:</p> 113 114 <ul> 115 <li>Kind</li> 116 117 <li>ObjectType: the type of the object residing on the region.</li> 118 119 <li>LocationType: the type of the pointer value that the region corresponds to. 120 Usually this is the pointer to the ObjectType. But sometimes we want to cache 121 this type explicitly, for example, for a CodeTextRegion.</li> 122 123 <li>StartLocation</li> 124 125 <li>EndLocation</li> 126 </ul> 127 128 <h3>Symbolic Regions</h3> 129 130 <p>A symbolic region is a map of the concept of symbolic values into the domain 131 of regions. It is the way that we represent symbolic pointers. Whenever a 132 symbolic pointer value is needed, a symbolic region is created to represent 133 it.</p> 134 135 <p>A symbolic region has no type. It wraps a SymbolData. But sometimes we have 136 type information associated with a symbolic region. For this case, a 137 TypedViewRegion is created to layer the type information on top of the symbolic 138 region. The reason we do not carry type information with the symbolic region is 139 that the symbolic regions can have no type. To be consistent, we don't let them 140 to carry type information.</p> 141 142 <p>Like a symbolic pointer, a symbolic region may be NULL, has unknown extent, 143 and represents a generic chunk of memory.</p> 144 145 <p><em><b>NOTE</b>: We plan not to use loc::SymbolVal in RegionStore and remove it 146 gradually.</em></p> 147 148 <p>Symbolic regions get their rvalue types through the following ways:</p> 149 150 <ul> 151 <li>Through the parameter or global variable that points to it, e.g.: 152 <pre> 153 void f(struct s* p) { 154 ... 155 } 156 </pre> 157 158 <p>The symbolic region pointed to by <tt>p</tt> has type <tt>struct 159 s</tt>.</p></li> 160 161 <li>Through explicit or implicit casts, e.g.: 162 <pre> 163 void f(void* p) { 164 struct s* q = (struct s*) p; 165 ... 166 } 167 </pre> 168 </li> 169 </ul> 170 171 <p>We attach the type information to the symbolic region lazily. For the first 172 case above, we create the <tt>TypedViewRegion</tt> only when the pointer is 173 actually used to access the pointee memory object, that is when the element or 174 field region is created. For the cast case, the <tt>TypedViewRegion</tt> is 175 created when visiting the <tt>CastExpr</tt>.</p> 176 177 <p>The reason for doing lazy typing is that symbolic regions are sometimes only 178 used to do location comparison.</p> 179 180 <h3>Pointer Casts</h3> 181 182 <p>Pointer casts allow people to impose different 'views' onto a chunk of 183 memory.</p> 184 185 <p>Usually we have two kinds of casts. One kind of casts cast down with in the 186 type hierarchy. It imposes more specific views onto more generic memory regions. 187 The other kind of casts cast up with in the type hierarchy. It strips away more 188 specific views on top of the more generic memory regions.</p> 189 190 <p>We simulate the down casts by layering another <tt>TypedViewRegion</tt> on 191 top of the original region. We simulate the up casts by striping away the top 192 <tt>TypedViewRegion</tt>. Down casts is usually simple. For up casts, if the 193 there is no <tt>TypedViewRegion</tt> to be stripped, we return the original 194 region. If the underlying region is of the different type than the cast-to type, 195 we flag an error state.</p> 196 197 <p>For toll-free bridging casts, we return the original region.</p> 198 199 <p>We can set up a partial order for pointer types, with the most general type 200 <tt>void*</tt> at the top. The partial order forms a tree with <tt>void*</tt> as 201 its root node.</p> 202 203 <p>Every <tt>MemRegion</tt> has a root position in the type tree. For example, 204 the pointee region of <tt>void *p</tt> has its root position at the root node of 205 the tree. <tt>VarRegion</tt> of <tt>int x</tt> has its root position at the 'int 206 type' node.</p> 207 208 <p><tt>TypedViewRegion</tt> is used to move the region down or up in the tree. 209 Moving down in the tree adds a <tt>TypedViewRegion</tt>. Moving up in the tree 210 removes a <Tt>TypedViewRegion</tt>.</p> 211 212 <p>Do we want to allow moving up beyond the root position? This happens 213 when:</p> <pre> int x; void *p = &x; </pre> 214 215 <p>The region of <tt>x</tt> has its root position at 'int*' node. the cast to 216 void* moves that region up to the 'void*' node. I propose to not allow such 217 casts, and assign the region of <tt>x</tt> for <tt>p</tt>.</p> 218 219 <p>Another non-ideal case is that people might cast to a non-generic pointer 220 from another non-generic pointer instead of first casting it back to the generic 221 pointer. Direct handling of this case would result in multiple layers of 222 TypedViewRegions. This enforces an incorrect semantic view to the region, 223 because we can only have one typed view on a region at a time. To avoid this 224 inconsistency, before casting the region, we strip the TypedViewRegion, then do 225 the cast. In summary, we only allow one layer of TypedViewRegion.</p> 226 227 <h3>Region Bindings</h3> 228 229 <p>The following region kinds are boundable: VarRegion, CompoundLiteralRegion, 230 StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.</p> 231 232 <p>When binding regions, we perform canonicalization on element regions and field 233 regions. This is because we can have different views on the same region, some 234 of which are essentially the same view with different sugar type names.</p> 235 236 <p>To canonicalize a region, we get the canonical types for all TypedViewRegions 237 along the way up to the root region, and make new TypedViewRegions with those 238 canonical types.</p> 239 240 <p>For Objective-C and C++, perhaps another canonicalization rule should be 241 added: for FieldRegion, the least derived class that has the field is used as 242 the type of the super region of the FieldRegion.</p> 243 244 <p>All bindings and retrievings are done on the canonicalized regions.</p> 245 246 <p>Canonicalization is transparent outside the region store manager, and more 247 specifically, unaware outside the Bind() and Retrieve() method. We don't need to 248 consider region canonicalization when doing pointer cast.</p> 249 250 <h3>Constraint Manager</h3> 251 252 <p>The constraint manager reasons about the abstract location of memory objects. 253 We can have different views on a region, but none of these views changes the 254 location of that object. Thus we should get the same abstract location for those 255 regions.</p> 256 257 </body> 258 </html> 259