Home | History | Annotate | Download | only in source
      1 FlatBuffer Internals    {#flatbuffers_internals}
      2 ====================
      3 
      4 This section is entirely optional for the use of FlatBuffers. In normal
      5 usage, you should never need the information contained herein. If you're
      6 interested however, it should give you more of an appreciation of why
      7 FlatBuffers is both efficient and convenient.
      8 
      9 ### Format components
     10 
     11 A FlatBuffer is a binary file and in-memory format consisting mostly of
     12 scalars of various sizes, all aligned to their own size. Each scalar is
     13 also always represented in little-endian format, as this corresponds to
     14 all commonly used CPUs today. FlatBuffers will also work on big-endian
     15 machines, but will be slightly slower because of additional
     16 byte-swap intrinsics.
     17 
     18 On purpose, the format leaves a lot of details about where exactly
     19 things live in memory undefined, e.g. fields in a table can have any
     20 order, and objects to some extent can be stored in many orders. This is
     21 because the format doesn't need this information to be efficient, and it
     22 leaves room for optimization and extension (for example, fields can be
     23 packed in a way that is most compact). Instead, the format is defined in
     24 terms of offsets and adjacency only. This may mean two different
     25 implementations may produce different binaries given the same input
     26 values, and this is perfectly valid.
     27 
     28 ### Format identification
     29 
     30 The format also doesn't contain information for format identification
     31 and versioning, which is also by design. FlatBuffers is a statically typed
     32 system, meaning the user of a buffer needs to know what kind of buffer
     33 it is. FlatBuffers can of course be wrapped inside other containers
     34 where needed, or you can use its union feature to dynamically identify
     35 multiple possible sub-objects stored. Additionally, it can be used
     36 together with the schema parser if full reflective capabilities are
     37 desired.
     38 
     39 Versioning is something that is intrinsically part of the format (the
     40 optionality / extensibility of fields), so the format itself does not
     41 need a version number (it's a meta-format, in a sense). We're hoping
     42 that this format can accommodate all data needed. If format breaking
     43 changes are ever necessary, it would become a new kind of format rather
     44 than just a variation.
     45 
     46 ### Offsets
     47 
     48 The most important and generic offset type (see `flatbuffers.h`) is
     49 `uoffset_t`, which is currently always a `uint32_t`, and is used to
     50 refer to all tables/unions/strings/vectors (these are never stored
     51 in-line). 32bit is
     52 intentional, since we want to keep the format binary compatible between
     53 32 and 64bit systems, and a 64bit offset would bloat the size for almost
     54 all uses. A version of this format with 64bit (or 16bit) offsets is easy to set
     55 when needed. Unsigned means they can only point in one direction, which
     56 typically is forward (towards a higher memory location). Any backwards
     57 offsets will be explicitly marked as such.
     58 
     59 The format starts with an `uoffset_t` to the root object in the buffer.
     60 
     61 We have two kinds of objects, structs and tables.
     62 
     63 ### Structs
     64 
     65 These are the simplest, and as mentioned, intended for simple data that
     66 benefits from being extra efficient and doesn't need versioning /
     67 extensibility. They are always stored inline in their parent (a struct,
     68 table, or vector) for maximum compactness. Structs define a consistent
     69 memory layout where all components are aligned to their size, and
     70 structs aligned to their largest scalar member. This is done independent
     71 of the alignment rules of the underlying compiler to guarantee a cross
     72 platform compatible layout. This layout is then enforced in the generated
     73 code.
     74 
     75 ### Tables
     76 
     77 Unlike structs, these are not stored in inline in their parent, but are
     78 referred to by offset.
     79 
     80 They start with an `soffset_t` to a vtable. This is a signed version of
     81 `uoffset_t`, since vtables may be stored anywhere relative to the object.
     82 This offset is substracted (not added) from the object start to arrive at
     83 the vtable start. This offset is followed by all the
     84 fields as aligned scalars (or offsets). Unlike structs, not all fields
     85 need to be present. There is no set order and layout.
     86 
     87 To be able to access fields regardless of these uncertainties, we go
     88 through a vtable of offsets. Vtables are shared between any objects that
     89 happen to have the same vtable values.
     90 
     91 The elements of a vtable are all of type `voffset_t`, which is
     92 a `uint16_t`. The first element is the size of the vtable in bytes,
     93 including the size element. The second one is the size of the object, in bytes
     94 (including the vtable offset). This size could be used for streaming, to know
     95 how many bytes to read to be able to access all *inline* fields of the object.
     96 The remaining elements are the N offsets, where N is the amount of fields
     97 declared in the schema when the code that constructed this buffer was
     98 compiled (thus, the size of the table is N + 2).
     99 
    100 All accessor functions in the generated code for tables contain the
    101 offset into this table as a constant. This offset is checked against the
    102 first field (the number of elements), to protect against newer code
    103 reading older data. If this offset is out of range, or the vtable entry
    104 is 0, that means the field is not present in this object, and the
    105 default value is return. Otherwise, the entry is used as offset to the
    106 field to be read.
    107 
    108 ### Strings and Vectors
    109 
    110 Strings are simply a vector of bytes, and are always
    111 null-terminated. Vectors are stored as contiguous aligned scalar
    112 elements prefixed by a 32bit element count (not including any
    113 null termination). Neither is stored inline in their parent, but are referred to
    114 by offset.
    115 
    116 ### Construction
    117 
    118 The current implementation constructs these buffers backwards (starting
    119 at the highest memory address of the buffer), since
    120 that significantly reduces the amount of bookkeeping and simplifies the
    121 construction API.
    122 
    123 ### Code example
    124 
    125 Here's an example of the code that gets generated for the `samples/monster.fbs`.
    126 What follows is the entire file, broken up by comments:
    127 
    128     // automatically generated, do not modify
    129 
    130     #include "flatbuffers/flatbuffers.h"
    131 
    132     namespace MyGame {
    133     namespace Sample {
    134 
    135 Nested namespace support.
    136 
    137     enum {
    138       Color_Red = 0,
    139       Color_Green = 1,
    140       Color_Blue = 2,
    141     };
    142 
    143     inline const char **EnumNamesColor() {
    144       static const char *names[] = { "Red", "Green", "Blue", nullptr };
    145       return names;
    146     }
    147 
    148     inline const char *EnumNameColor(int e) { return EnumNamesColor()[e]; }
    149 
    150 Enums and convenient reverse lookup.
    151 
    152     enum {
    153       Any_NONE = 0,
    154       Any_Monster = 1,
    155     };
    156 
    157     inline const char **EnumNamesAny() {
    158       static const char *names[] = { "NONE", "Monster", nullptr };
    159       return names;
    160     }
    161 
    162     inline const char *EnumNameAny(int e) { return EnumNamesAny()[e]; }
    163 
    164 Unions share a lot with enums.
    165 
    166     struct Vec3;
    167     struct Monster;
    168 
    169 Predeclare all data types since circular references between types are allowed
    170 (circular references between object are not, though).
    171 
    172     FLATBUFFERS_MANUALLY_ALIGNED_STRUCT(4) Vec3 {
    173      private:
    174       float x_;
    175       float y_;
    176       float z_;
    177 
    178      public:
    179       Vec3(float x, float y, float z)
    180         : x_(flatbuffers::EndianScalar(x)), y_(flatbuffers::EndianScalar(y)), z_(flatbuffers::EndianScalar(z)) {}
    181 
    182       float x() const { return flatbuffers::EndianScalar(x_); }
    183       float y() const { return flatbuffers::EndianScalar(y_); }
    184       float z() const { return flatbuffers::EndianScalar(z_); }
    185     };
    186     FLATBUFFERS_STRUCT_END(Vec3, 12);
    187 
    188 These ugly macros do a couple of things: they turn off any padding the compiler
    189 might normally do, since we add padding manually (though none in this example),
    190 and they enforce alignment chosen by FlatBuffers. This ensures the layout of
    191 this struct will look the same regardless of compiler and platform. Note that
    192 the fields are private: this is because these store little endian scalars
    193 regardless of platform (since this is part of the serialized data).
    194 `EndianScalar` then converts back and forth, which is a no-op on all current
    195 mobile and desktop platforms, and a single machine instruction on the few
    196 remaining big endian platforms.
    197 
    198     struct Monster : private flatbuffers::Table {
    199       const Vec3 *pos() const { return GetStruct<const Vec3 *>(4); }
    200       int16_t mana() const { return GetField<int16_t>(6, 150); }
    201       int16_t hp() const { return GetField<int16_t>(8, 100); }
    202       const flatbuffers::String *name() const { return GetPointer<const flatbuffers::String *>(10); }
    203       const flatbuffers::Vector<uint8_t> *inventory() const { return GetPointer<const flatbuffers::Vector<uint8_t> *>(14); }
    204       int8_t color() const { return GetField<int8_t>(16, 2); }
    205     };
    206 
    207 Tables are a bit more complicated. A table accessor struct is used to point at
    208 the serialized data for a table, which always starts with an offset to its
    209 vtable. It derives from `Table`, which contains the `GetField` helper functions.
    210 GetField takes a vtable offset, and a default value. It will look in the vtable
    211 at that offset. If the offset is out of bounds (data from an older version) or
    212 the vtable entry is 0, the field is not present and the default is returned.
    213 Otherwise, it uses the entry as an offset into the table to locate the field.
    214 
    215     struct MonsterBuilder {
    216       flatbuffers::FlatBufferBuilder &fbb_;
    217       flatbuffers::uoffset_t start_;
    218       void add_pos(const Vec3 *pos) { fbb_.AddStruct(4, pos); }
    219       void add_mana(int16_t mana) { fbb_.AddElement<int16_t>(6, mana, 150); }
    220       void add_hp(int16_t hp) { fbb_.AddElement<int16_t>(8, hp, 100); }
    221       void add_name(flatbuffers::Offset<flatbuffers::String> name) { fbb_.AddOffset(10, name); }
    222       void add_inventory(flatbuffers::Offset<flatbuffers::Vector<uint8_t>> inventory) { fbb_.AddOffset(14, inventory); }
    223       void add_color(int8_t color) { fbb_.AddElement<int8_t>(16, color, 2); }
    224       MonsterBuilder(flatbuffers::FlatBufferBuilder &_fbb) : fbb_(_fbb) { start_ = fbb_.StartTable(); }
    225       flatbuffers::Offset<Monster> Finish() { return flatbuffers::Offset<Monster>(fbb_.EndTable(start_, 7)); }
    226     };
    227 
    228 `MonsterBuilder` is the base helper struct to construct a table using a
    229 `FlatBufferBuilder`. You can add the fields in any order, and the `Finish`
    230 call will ensure the correct vtable gets generated.
    231 
    232     inline flatbuffers::Offset<Monster> CreateMonster(flatbuffers::FlatBufferBuilder &_fbb,
    233                                                       const Vec3 *pos, int16_t mana,
    234                                                       int16_t hp,
    235                                                       flatbuffers::Offset<flatbuffers::String> name,
    236                                                       flatbuffers::Offset<flatbuffers::Vector<uint8_t>> inventory,
    237                                                       int8_t color) {
    238       MonsterBuilder builder_(_fbb);
    239       builder_.add_inventory(inventory);
    240       builder_.add_name(name);
    241       builder_.add_pos(pos);
    242       builder_.add_hp(hp);
    243       builder_.add_mana(mana);
    244       builder_.add_color(color);
    245       return builder_.Finish();
    246     }
    247 
    248 `CreateMonster` is a convenience function that calls all functions in
    249 `MonsterBuilder` above for you. Note that if you pass values which are
    250 defaults as arguments, it will not actually construct that field, so
    251 you can probably use this function instead of the builder class in
    252 almost all cases.
    253 
    254     inline const Monster *GetMonster(const void *buf) { return flatbuffers::GetRoot<Monster>(buf); }
    255 
    256 This function is only generated for the root table type, to be able to
    257 start traversing a FlatBuffer from a raw buffer pointer.
    258 
    259     }; // namespace MyGame
    260     }; // namespace Sample
    261 
    262 ### Encoding example.
    263 
    264 Below is a sample encoding for the following JSON corresponding to the above
    265 schema:
    266 
    267     { pos: { x: 1, y: 2, z: 3 }, name: "fred", hp: 50 }
    268 
    269 Resulting in this binary buffer:
    270 
    271     // Start of the buffer:
    272     uint32_t 20  // Offset to the root table.
    273 
    274     // Start of the vtable. Not shared in this example, but could be:
    275     uint16_t 16 // Size of table, starting from here.
    276     uint16_t 22 // Size of object inline data.
    277     uint16_t 4, 0, 20, 16, 0, 0  // Offsets to fields from start of (root) table, 0 for not present.
    278 
    279     // Start of the root table:
    280     int32_t 16     // Offset to vtable used (default negative direction)
    281     float 1, 2, 3  // the Vec3 struct, inline.
    282     uint32_t 8     // Offset to the name string.
    283     int16_t 50     // hp field.
    284     int16_t 0      // Padding for alignment.
    285 
    286     // Start of name string:
    287     uint32_t 4  // Length of string.
    288     int8_t 'f', 'r', 'e', 'd', 0, 0, 0, 0  // Text + 0 termination + padding.
    289 
    290 Note that this not the only possible encoding, since the writer has some
    291 flexibility in which of the children of root object to write first (though in
    292 this case there's only one string), and what order to write the fields in.
    293 Different orders may also cause different alignments to happen.
    294 
    295 ### Additional reading.
    296 
    297 The author of the C language implementation has made a similar
    298 [document](https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#flatbuffers-binary-format)
    299 that may further help clarify the format.
    300 
    301 # FlexBuffers
    302 
    303 The [schema-less](@ref flexbuffers) version of FlatBuffers have their
    304 own encoding, detailed here.
    305 
    306 It shares many properties mentioned above, in that all data is accessed
    307 over offsets, all scalars are aligned to their own size, and
    308 all data is always stored in little endian format.
    309 
    310 One difference is that FlexBuffers are built front to back, so children are
    311 stored before parents, and the root of the data starts at the last byte.
    312 
    313 Another difference is that scalar data is stored with a variable number of bits
    314 (8/16/32/64). The current width is always determined by the *parent*, i.e. if
    315 the scalar sits in a vector, the vector determines the bit width for all
    316 elements at once. Selecting the minimum bit width for a particular vector is
    317 something the encoder does automatically and thus is typically of no concern
    318 to the user, though being aware of this feature (and not sticking a double in
    319 the same vector as a bunch of byte sized elements) is helpful for efficiency.
    320 
    321 Unlike FlatBuffers there is only one kind of offset, and that is an unsigned
    322 integer indicating the number of bytes in a negative direction from the address
    323 of itself (where the offset is stored).
    324 
    325 ### Vectors
    326 
    327 The representation of the vector is at the core of how FlexBuffers works (since
    328 maps are really just a combination of 2 vectors), so it is worth starting there.
    329 
    330 As mentioned, a vector is governed by a single bit width (supplied by its
    331 parent). This includes the size field. For example, a vector that stores the
    332 integer values `1, 2, 3` is encoded as follows:
    333 
    334     uint8_t 3, 1, 2, 3, 4, 4, 4
    335 
    336 The first `3` is the size field, and is placed before the vector (an offset
    337 from the parent to this vector points to the first element, not the size
    338 field, so the size field is effectively at index -1).
    339 Since this is an untyped vector `SL_VECTOR`, it is followed by 3 type
    340 bytes (one per element of the vector), which are always following the vector,
    341 and are always a uint8_t even if the vector is made up of bigger scalars.
    342 
    343 ### Types
    344 
    345 A type byte is made up of 2 components (see flexbuffers.h for exact values):
    346 
    347 * 2 lower bits representing the bit-width of the child (8, 16, 32, 64).
    348   This is only used if the child is accessed over an offset, such as a child
    349   vector. It is ignored for inline types.
    350 * 6 bits representing the actual type (see flexbuffers.h).
    351 
    352 Thus, in this example `4` means 8 bit child (value 0, unused, since the value is
    353 in-line), type `SL_INT` (value 1).
    354 
    355 ### Typed Vectors
    356 
    357 These are like the Vectors above, but omit the type bytes. The type is instead
    358 determined by the vector type supplied by the parent. Typed vectors are only
    359 available for a subset of types for which these savings can be significant,
    360 namely inline signed/unsigned integers (`TYPE_VECTOR_INT` / `TYPE_VECTOR_UINT`),
    361 floats (`TYPE_VECTOR_FLOAT`), and keys (`TYPE_VECTOR_KEY`, see below).
    362 
    363 Additionally, for scalars, there are fixed length vectors of sizes 2 / 3 / 4
    364 that don't store the size (`TYPE_VECTOR_INT2` etc.), for an additional savings
    365 in space when storing common vector or color data.
    366 
    367 ### Scalars
    368 
    369 FlexBuffers supports integers (`TYPE_INT` and `TYPE_UINT`) and floats
    370 (`TYPE_FLOAT`), available in the bit-widths mentioned above. They can be stored
    371 both inline and over an offset (`TYPE_INDIRECT_*`).
    372 
    373 The offset version is useful to encode costly 64bit (or even 32bit) quantities
    374 into vectors / maps of smaller sizes, and to share / repeat a value multiple
    375 times.
    376 
    377 ### Booleans and Nulls
    378 
    379 Booleans (`TYPE_BOOL`) and nulls (`TYPE_NULL`) are encoded as inlined unsigned integers.
    380 
    381 ### Blobs, Strings and Keys.
    382 
    383 A blob (`TYPE_BLOB`) is encoded similar to a vector, with one difference: the
    384 elements are always `uint8_t`. The parent bit width only determines the width of
    385 the size field, allowing blobs to be large without the elements being large.
    386 
    387 Strings (`TYPE_STRING`) are similar to blobs, except they have an additional 0
    388 termination byte for convenience, and they MUST be UTF-8 encoded (since an
    389 accessor in a language that does not support pointers to UTF-8 data may have to
    390 convert them to a native string type).
    391 
    392 A "Key" (`TYPE_KEY`) is similar to a string, but doesn't store the size
    393 field. They're so named because they are used with maps, which don't care
    394 for the size, and can thus be even more compact. Unlike strings, keys cannot
    395 contain bytes of value 0 as part of their data (size can only be determined by
    396 `strlen`), so while you can use them outside the context of maps if you so
    397 desire, you're usually better off with strings.
    398 
    399 ### Maps
    400 
    401 A map (`TYPE_MAP`) is like an (untyped) vector, but with 2 prefixes before the
    402 size field:
    403 
    404 | index | field                                                        |
    405 | ----: | :----------------------------------------------------------- |
    406 | -3    | An offset to the keys vector (may be shared between tables). |
    407 | -2    | Byte width of the keys vector.                               |
    408 | -1    | Size (from here on it is compatible with `TYPE_VECTOR`)      |
    409 | 0     | Elements.                                                    |
    410 | Size  | Types.                                                       |
    411 
    412 Since a map is otherwise the same as a vector, it can be iterated like
    413 a vector (which is probably faster than lookup by key).
    414 
    415 The keys vector is a typed vector of keys. Both the keys and corresponding
    416 values *have* to be stored in sorted order (as determined by `strcmp`), such
    417 that lookups can be made using binary search.
    418 
    419 The reason the key vector is a seperate structure from the value vector is
    420 such that it can be shared between multiple value vectors, and also to
    421 allow it to be treated as its own individual vector in code.
    422 
    423 An example map { foo: 13, bar: 14 } would be encoded as:
    424 
    425     0 : uint8_t 'b', 'a', 'r', 0
    426     4 : uint8_t 'f', 'o', 'o', 0
    427     8 : uint8_t 2      // key vector of size 2
    428     // key vector offset points here
    429     9 : uint8_t 9, 6   // offsets to bar_key and foo_key
    430     11: uint8_t 2, 1   // offset to key vector, and its byte width
    431     13: uint8_t 2      // value vector of size
    432     // value vector offset points here
    433     14: uint8_t 14, 13 // values
    434     16: uint8_t 4, 4   // types
    435 
    436 ### The root
    437 
    438 As mentioned, the root starts at the end of the buffer.
    439 The last uint8_t is the width in bytes of the root (normally the parent
    440 determines the width, but the root has no parent). The uint8_t before this is
    441 the type of the root, and the bytes before that are the root value (of the
    442 number of bytes specified by the last byte).
    443 
    444 So for example, the integer value `13` as root would be:
    445 
    446     uint8_t 13, 4, 1    // Value, type, root byte width.
    447 
    448 
    449 <br>
    450