Home | History | Annotate | Download | only in doc
      1 # Hyb (hyphenation pattern binary) file format
      2 
      3 The hyb file format is how hyphenation patterns are stored in the system image.
      4 
      5 Goals include:
      6 
      7 * Concise (system image space is at a premium)
      8 * Usable when mmap'ed, so it doesn't take significant physical RAM
      9 * Fast to compute
     10 * Simple
     11 
     12 It is _not_ intended as an interchange format, so there is no attempt to make the format
     13 extensible or facilitate backward and forward compatibility.
     14 
     15 Further, at some point we will probably pack patterns for multiple languages into a single
     16 physical file, to reduce number of open mmap'ed files. This document doesn't cover that.
     17 
     18 ## Theoretical basis
     19 
     20 At heart, the file contains packed tries with suffix compression, actually quite similar
     21 to the implementation in TeX.
     22 
     23 The file contains three sections. The first section represents the "alphabet," including
     24 case folding. It is effectively a map from Unicode code point to a small integer.
     25 
     26 The second section contains the trie in packed form. It is an array of 3-tuples, packed
     27 into a 32 bit integer. Each (suffix-compressed) trie node has a unique index within this
     28 array, and the pattern field in the tuple is the pattern for that node. Further, each edge
     29 in the trie has an entry in the array, and the character and link fields in the tuple
     30 represent the label and destination of the edge. The packing strategy is as in
     31 [Word Hy-phen-a-tion by Com-put-er](http://www.tug.org/docs/liang/liang-thesis.pdf) by
     32 Franklin Mark Liang.
     33 
     34 The trie representation is similar but not identical to the "double-array trie".
     35 The fundamental operation of lookup of the edge from `s` to `t` with label `c` is
     36 to compare `c == character[s + c]`, and if so, `t = link[s + c]`.
     37 
     38 The third section contains the pattern strings. This section is in two parts: first,
     39 an array with a 3-tuple for each pattern (length, number of trailing 0's, and offset
     40 into the string pool); and second, the string pool. Each pattern is encoded as a byte
     41 (packing 2 per byte would be possible but the space savings would not be signficant).
     42 
     43 As much as possible of the file is represented as 32 bit integers, as that is especially
     44 efficent to access. All are little-endian (this could be revised if the code ever needs
     45 to be ported to big-endian systems).
     46 
     47 ## Header
     48 
     49 ```
     50 uint32_t magic == 0x62ad7968
     51 uint32_t version = 0
     52 uint32_t alphabet_offset (in bytes)
     53 uint32_t trie_offset (in bytes)
     54 uint32_t pattern_offset (in bytes)
     55 uint32_t file_size (in bytes)
     56 ```
     57 
     58 Offsets are from the front of the file, and in bytes.
     59 
     60 ## Alphabet
     61 
     62 The alphabet table comes in two versions. The first is well suited to dense Unicode
     63 ranges and is limited to 256. The second is more general, but lookups will be slower.
     64 
     65 ### Alphabet, direct version
     66 
     67 ```
     68 uint32_t version = 0
     69 uint32_t min_codepoint
     70 uint32_t max_codepoint (exclusive)
     71 uint8_t[] data
     72 ```
     73 
     74 The size of the data array is max_codepoint - min_codepoint. 0 represents an unmapped
     75 character. Note that, in the current implementation, automatic hyphenation is disabled
     76 for any word containing an unmapped character.
     77 
     78 In general, pad bytes follow this table, aligning the next table to a 4-byte boundary.
     79 
     80 ### Alphabet, general version
     81 
     82 ```
     83 uint32_t version = 1
     84 uint32_t n_entries
     85 uint32_t[n_entries] data
     86 ```
     87 
     88 Each element in the data table is `(codepoint << 11) | value`. Note that this is
     89 restricted to 11 bits (2048 possible values). The largest known current value is 483
     90 (for Sanskrit).
     91 
     92 The entries are sorted by codepoint, to facilitate binary search. Another reasonable
     93 implementation for consumers of the data would be to build a hash table at load time.
     94 
     95 ## Trie
     96 
     97 ```
     98 uint32_t version = 0
     99 uint32_t char_mask
    100 uint32_t link_shift
    101 uint32_t link_mask
    102 uint32_t pattern_shift
    103 uint32_t n_entries
    104 uint32_t[n_entries] data
    105 ```
    106 
    107 Each element in the data table is `(pattern << pattern_shift) | (link << link_shift) | char`.
    108 
    109 All known pattern tables fit in 32 bits total. If this is exceeded, there is a fairly
    110 straightforward tweak, where each node occupies a slot by itself (as opposed to sharing
    111 it with edge slots), which would require very minimal changes to the implementation (TODO
    112 present in more detail).
    113 
    114 ## Pattern
    115 
    116 ```
    117 uint32_t version = 0
    118 uint32_t n_entries
    119 uint32_t pattern_offset (in bytes)
    120 uint32_t pattern_size (in bytes)
    121 uint32_t[n_entries] data
    122 uint8_t[] pattern_buf
    123 ```
    124 
    125 Each element in data table is `(len << 26) | (shift << 20) | offset`, where an offset of 0
    126 points to the first byte of pattern_buf.
    127 
    128 Generally pattern_offset is `16 + 4 * n_entries`.
    129 
    130 For example, 'a4m5ato' would be represented as `[4, 5, 0, 0, 0]`, then len = 2, shift = 3, and
    131 offset points to [4, 5] in the pattern buffer.
    132 
    133 Future extension: additional data representing nonstandard hyphenation. See
    134 [Automatic non-standard hyphenation in OpenOffice.org](https://www.tug.org/TUGboat/tb27-1/tb86nemeth.pdf)
    135 for more information about that issue.
    136