<chapter id="clusters">
<sect1 id="clusters">
  <title>Clusters</title>
  <para>
    In shaping text, a <emphasis>cluster</emphasis> is a sequence of
    code points that needs to be treated as a single, indivisible unit.
  </para>
  <para>
    When you add text to a HarfBuzz buffer, each character is
    associated with a <emphasis>cluster value</emphasis>. As far as
    HarfBuzz is concerned, this is an arbitrary number.
  </para>
  <para>
    Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the
    actual number does not matter. Cluster values are not even required
    to be monotonically increasing, although nearly all of HarfBuzz's
    tests are performed on monotonically increasing cluster numbers.
    The code itself makes no such assumption. With that in mind, let's
    examine what happens to cluster values during shaping at each
    cluster level.
  </para>
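  <para>
    As an illustration of the index-based convention mentioned above,
    the following Python sketch computes one cluster value per code
    point as its offset in UTF-8, UTF-16, or UTF-32 code units. The
    helper name <literal>initial_clusters</literal> is hypothetical and
    is not part of the HarfBuzz API; it only models the values a client
    might choose to pass in.
  </para>
  <programlisting language="python"><![CDATA[
```python
def initial_clusters(text, encoding="utf-8"):
    """Return one cluster value per code point: its offset in code units."""
    # Bytes per code unit, and a codec without a byte-order mark.
    unit_size = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[encoding]
    codec = {"utf-8": "utf-8", "utf-16": "utf-16-le", "utf-32": "utf-32-le"}[encoding]
    clusters = []
    offset = 0
    for ch in text:
        clusters.append(offset)
        offset += len(ch.encode(codec)) // unit_size
    return clusters


# "a", e-acute (2 UTF-8 bytes), an emoji outside the BMP, "b"
text = "a\u00e9\U0001F600b"
print(initial_clusters(text, "utf-8"))   # [0, 1, 3, 7]
print(initial_clusters(text, "utf-16"))  # [0, 1, 2, 4]
print(initial_clusters(text, "utf-32"))  # [0, 1, 2, 3]
```
]]></programlisting>
  <para>
    All three lists are valid inputs; what matters is that the client
    can map the numbers it gets back onto its own text storage.
  </para>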
  <para>
    HarfBuzz provides three <emphasis>levels</emphasis> of clustering
    support. Level 0 is the default behavior and reproduces the behavior
    of the old HarfBuzz library. Level 1 tweaks this behavior slightly
    to produce better results, so level 1 clustering is recommended for
    code that is not required to implement backward compatibility with
    the old HarfBuzz.
  </para>
  <para>
    Level 2 differs significantly in how it treats cluster values.
    Levels 0 and 1 both process ligatures and glyph decomposition by
    merging clusters; level 2 does not.
  </para>
  <para>
    The conceptual model for what the cluster values mean, in levels 0
    and 1, is this:
  </para>
  <itemizedlist spacing="compact">
    <listitem>
      <para>
        the sequence of cluster values will always remain monotone
      </para>
    </listitem>
    <listitem>
      <para>
        each value represents a single cluster
      </para>
    </listitem>
    <listitem>
      <para>
        each cluster contains one or more glyphs and one or more
        characters
      </para>
    </listitem>
  </itemizedlist>
  <para>
    If the initial cluster numbers were monotonically increasing and
    distinct, then all adjacent glyphs having the same cluster number
    belong to the same cluster, and each character belongs to the
    cluster that has the highest number not larger than the character's
    initial cluster number. This will become clearer with an example.
  </para>
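  <para>
    The rule for assigning characters to clusters can be sketched in
    Python as a lookup for the highest glyph cluster value that does
    not exceed the character's initial value. The function name
    <literal>cluster_of_character</literal> is hypothetical, and the
    sketch assumes distinct, increasing initial values. The glyph
    cluster values used here are the ones the example in the next
    section ends with.
  </para>
  <programlisting language="python"><![CDATA[
```python
from bisect import bisect_right


def cluster_of_character(initial_cluster, glyph_clusters):
    """Return the cluster a character ended up in after shaping.

    Assumes the initial cluster values were distinct and increasing.
    """
    distinct = sorted(set(glyph_clusters))
    # Highest cluster value that is not larger than the initial value.
    position = bisect_right(distinct, initial_cluster) - 1
    if position < 0:
        raise ValueError("no cluster at or below this value")
    return distinct[position]


# Cluster values of the glyphs after shaping.
glyph_clusters = [0, 1, 1, 1, 1, 4]
# Characters A..E started with cluster values 0..4.
assignment = {
    char: cluster_of_character(initial, glyph_clusters)
    for initial, char in enumerate("ABCDE")
}
print(assignment)  # {'A': 0, 'B': 1, 'C': 1, 'D': 1, 'E': 4}
```
]]></programlisting>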
</sect1>
<sect1 id="a-clustering-example-for-levels-0-and-1">
  <title>A clustering example for levels 0 and 1</title>
  <para>
    Let's say we start with the following character sequence and cluster
    values:
  </para>
  <programlisting>
   A,B,C,D,E
   0,1,2,3,4
</programlisting>
  <para>
    We then map the characters to glyphs. For simplicity, let's assume
    that each character maps to the corresponding, identical-looking
    glyph:
  </para>
  <programlisting>
   A,B,C,D,E
   0,1,2,3,4
</programlisting>
  <para>
    Now if, for example, <literal>B</literal> and <literal>C</literal>
    ligate, then the clusters to which they belong &quot;merge&quot;.
    This merged cluster takes for its cluster number the minimum of all
    the cluster numbers of the clusters that went in. In this case, we
    get:
  </para>
  <programlisting>
   A,BC,D,E
   0,1 ,3,4
</programlisting>
  <para>
    Now let's assume that the <literal>BC</literal> glyph decomposes
    into three components, and <literal>D</literal> decomposes into
    two. The components each inherit the cluster value of their parent:
  </para>
  <programlisting>
   A,BC0,BC1,BC2,D0,D1,E
   0,1  ,1  ,1  ,3 ,3 ,4
</programlisting>
  <para>
    Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then
    their clusters (numbers 1 and 3) merge into
    <literal>min(1,3) = 1</literal>:
  </para>
  <programlisting>
   A,BC0,BC1,BC2D0,D1,E
   0,1  ,1  ,1    ,1 ,4
</programlisting>
  <para>
    At this point, cluster 1 means: the character sequence
    <literal>BCD</literal> is represented by glyphs
    <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
    further.
  </para>
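  <para>
    The walk-through above can be replayed with a small Python model.
    This is only a sketch of the bookkeeping, not HarfBuzz's
    implementation: glyphs are modelled as (name, cluster) pairs, and
    the helpers <literal>merge_clusters</literal>,
    <literal>ligate</literal>, and <literal>decompose</literal> are
    hypothetical names chosen for the illustration.
  </para>
  <programlisting language="python"><![CDATA[
```python
def merge_clusters(glyphs, start, end):
    """Merge every cluster touched by glyphs[start:end] into the minimum value."""
    touched = {cluster for _, cluster in glyphs[start:end]}
    merged = min(touched)
    # Relabel all glyphs of the touched clusters, not only the slice itself.
    return [(name, merged if cluster in touched else cluster)
            for name, cluster in glyphs]


def ligate(glyphs, start, end, name):
    """Replace glyphs[start:end] with a single glyph after merging clusters."""
    glyphs = merge_clusters(glyphs, start, end)
    return glyphs[:start] + [(name, glyphs[start][1])] + glyphs[end:]


def decompose(glyphs, index, parts):
    """Replace one glyph with several; each part inherits the parent's cluster."""
    _, cluster = glyphs[index]
    return glyphs[:index] + [(part, cluster) for part in parts] + glyphs[index + 1:]


glyphs = list(zip("ABCDE", range(5)))
glyphs = ligate(glyphs, 1, 3, "BC")                    # B and C ligate
glyphs = decompose(glyphs, 1, ["BC0", "BC1", "BC2"])   # BC splits in three
glyphs = decompose(glyphs, 4, ["D0", "D1"])            # D splits in two
glyphs = ligate(glyphs, 3, 5, "BC2D0")                 # BC2 and D0 ligate

print([name for name, _ in glyphs])
# ['A', 'BC0', 'BC1', 'BC2D0', 'D1', 'E']
print([cluster for _, cluster in glyphs])
# [0, 1, 1, 1, 1, 4]
```
]]></programlisting>
  <para>
    Note that <literal>D1</literal> ends up in cluster 1 even though it
    took no part in the second ligature: it shared cluster 3 with
    <literal>D0</literal>, and the whole cluster was merged.
  </para>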
</sect1>
<sect1 id="reordering-in-levels-0-and-1">
  <title>Reordering in levels 0 and 1</title>
  <para>
    Another common operation in the more complex shapers is glyph
    reordering. In those cases, to maintain monotone clusters, HarfBuzz
    merges the clusters of everything in the reordering sequence. For
    example, let's again start with the character sequence:
  </para>
  <programlisting>
   A,B,C,D,E
   0,1,2,3,4
</programlisting>
  <para>
    If <literal>D</literal> is reordered before <literal>B</literal>,
    then the <literal>B</literal>, <literal>C</literal>, and
    <literal>D</literal> clusters merge, and we get:
  </para>
  <programlisting>
   A,D,B,C,E
   0,1,1,1,4
</programlisting>
  <para>
    This is clearly not ideal, but it is the only sensible way to
    maintain monotone indices and retain the true relationship between
    glyphs and characters.
  </para>
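  <para>
    A minimal Python sketch of this reordering rule follows. It assumes
    that every cluster between the old and the new position of the
    moved glyph is merged into the minimum value before the move; the
    function name <literal>reorder</literal> is hypothetical.
  </para>
  <programlisting language="python"><![CDATA[
```python
def reorder(glyphs, source, target):
    """Move glyphs[source] to index target, merging all clusters in between."""
    low, high = min(source, target), max(source, target)
    touched = {cluster for _, cluster in glyphs[low:high + 1]}
    merged = min(touched)
    relabeled = [(name, merged if cluster in touched else cluster)
                 for name, cluster in glyphs]
    # Perform the actual move on the relabeled sequence.
    relabeled.insert(target, relabeled.pop(source))
    return relabeled


glyphs = list(zip("ABCDE", range(5)))
result = reorder(glyphs, 3, 1)  # D moves before B
print(result)
# [('A', 0), ('D', 1), ('B', 1), ('C', 1), ('E', 4)]
```
]]></programlisting>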
</sect1>
<sect1 id="the-distinction-between-levels-0-and-1">
  <title>The distinction between levels 0 and 1</title>
  <para>
    The above is essentially what cluster levels 0 and 1 do. The only
    difference between the two is this: in level 0, at the very
    beginning of the shaping process, we also merge clusters between
    base characters and all Unicode marks (combining or not) following
    them. For example:
  </para>
  <programlisting>
  A,acute,B
  0,1    ,2
</programlisting>
  <para>
    will become:
  </para>
  <programlisting>
  A,acute,B
  0,0    ,2
</programlisting>
  <para>
    This is the default behavior. We do it because Windows did it and
    old HarfBuzz did it, so it remained the default. But this behavior
    makes it impossible to color diacritic marks differently from their
    base characters. That's why in level 1 we do not perform this
    initial merging step.
  </para>
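  <para>
    The level 0 pre-pass can be approximated in Python as shown here.
    The sketch assumes that a &quot;mark&quot; is any character whose
    Unicode general category starts with <literal>M</literal>, which is
    a simplification for illustration; the function name
    <literal>merge_marks_into_bases</literal> is hypothetical.
  </para>
  <programlisting language="python"><![CDATA[
```python
import unicodedata


def merge_marks_into_bases(text, clusters):
    """Level 0 pre-pass: each mark takes the cluster of the character before it."""
    merged = list(clusters)
    for i in range(1, len(text)):
        # Assumption: general categories Mn, Mc and Me count as marks.
        if unicodedata.category(text[i]).startswith("M"):
            merged[i] = merged[i - 1]
    return merged


text = "A\u0301B"  # A, COMBINING ACUTE ACCENT, B
print(merge_marks_into_bases(text, [0, 1, 2]))  # [0, 0, 2]
```
]]></programlisting>
  <para>
    Under level 1 this step is skipped, so the accent keeps cluster 1
    and can be styled separately from its base.
  </para>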
  <para>
    For clients, level 0 is more convenient if they rely on HarfBuzz
    clusters for cursor positioning. But that approach is wrong anyway:
    cursor positions should be determined based on Unicode grapheme
    boundaries, not shaping clusters. As such, level 1 clusters are
    preferred.
  </para>
  <para>
    One last note about levels 0 and 1. We currently don't allow a
    <literal>MultipleSubst</literal> lookup to replace a glyph with zero
    glyphs (i.e., to delete a glyph). But in some other situations,
    glyphs can be deleted. In those cases, if the glyph being deleted is
    the last glyph of its cluster, we make sure to merge the cluster
    with a neighboring cluster.
  </para>
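  <para>
    The deletion rule might be modelled as in the following Python
    sketch. Which neighbor receives the merged cluster is an assumption
    made for the illustration (the previous glyph if there is one,
    otherwise the next one); the text above only says that a
    neighboring cluster is used. The function name
    <literal>delete_glyph</literal> is likewise hypothetical.
  </para>
  <programlisting language="python"><![CDATA[
```python
def delete_glyph(glyphs, index):
    """Delete glyphs[index]; if its cluster would vanish, merge it into a neighbor."""
    _, cluster = glyphs[index]
    rest = glyphs[:index] + glyphs[index + 1:]
    if not rest or any(c == cluster for _, c in rest):
        return rest  # nothing left, or the cluster survives in another glyph
    # Assumption: prefer the previous glyph's cluster, otherwise the next one.
    neighbor_cluster = rest[index - 1][1] if index > 0 else rest[0][1]
    merged = min(cluster, neighbor_cluster)
    return [(name, merged if c == neighbor_cluster else c) for name, c in rest]


glyphs = [("A", 0), ("B", 1), ("C", 2)]
print(delete_glyph(glyphs, 0))  # [('B', 0), ('C', 2)]
print(delete_glyph(glyphs, 1))  # [('A', 0), ('C', 2)]
```
]]></programlisting>
  <para>
    In the first call the surviving glyph <literal>B</literal> takes
    over cluster 0, so the first cluster of the run still points at the
    start of the text.
  </para>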
  <para>
    This is primarily to make sure that the starting cluster of the
    text always has the cluster index pointing to the start of the text
    for the run; more than one client currently relies on this
    guarantee.
  </para>
  <para>
    Incidentally, Apple's CoreText does something else to maintain the
    same promise: it inserts a glyph with id 65535 at the beginning of
    the glyph string if the glyph corresponding to the first character
    in the run was deleted. HarfBuzz might do something similar in the
    future.
  </para>
</sect1>
<sect1 id="level-2">
  <title>Level 2</title>
  <para>
    Level 2 is a different beast from levels 0 and 1. It is simple to
    describe, but hard to make sense of. It simply doesn't do any
    cluster merging whatsoever. When glyphs ligate, or when multiple
    glyphs otherwise turn into one, the cluster value of the first
    glyph is retained.
  </para>
  <para>
    Here are a few examples of why processing cluster values produced at
    this level might be tricky:
  </para>
  <sect2 id="ligatures-with-combining-marks">
    <title>Ligatures with combining marks</title>
    <para>
      Imagine capital letters are bases and lower case letters are
      combining marks. With an input sequence like this:
    </para>
    <programlisting>
  A,a,B,b,C,c
  0,1,2,3,4,5
</programlisting>
    <para>
      if <literal>A,B,C</literal> ligate, then here are the cluster
      values one would get under the various levels:
    </para>
    <para>
      level 0:
    </para>
    <programlisting>
  ABC,a,b,c
  0  ,0,0,0
</programlisting>
    <para>
      level 1:
    </para>
    <programlisting>
  ABC,a,b,c
  0  ,0,0,5
</programlisting>
    <para>
      level 2:
    </para>
    <programlisting>
  ABC,a,b,c
  0  ,1,3,5
</programlisting>
    <para>
      Making sense of the last example is the hardest for a client,
      because there is nothing in the cluster values to suggest that
      <literal>B</literal> and <literal>C</literal> ligated with
      <literal>A</literal>.
    </para>
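    <para>
      The three results above can be reproduced with a small Python
      model. It is a sketch under simplifying assumptions, not
      HarfBuzz's code: lower case names stand for marks, merging covers
      everything from the first to the last ligature component and is
      widened to whole clusters, and the names
      <literal>merge_clusters</literal> and <literal>shape</literal>
      are hypothetical.
    </para>
    <programlisting language="python"><![CDATA[
```python
def merge_clusters(clusters, start, end):
    """Merge clusters[start:end], widened to whole clusters, to their minimum."""
    touched = set(clusters[start:end])
    merged = min(touched)
    return [merged if c in touched else c for c in clusters]


def shape(level):
    """Ligate A, B and C in A,a,B,b,C,c under the given cluster level."""
    names = ["A", "a", "B", "b", "C", "c"]
    clusters = [0, 1, 2, 3, 4, 5]
    if level == 0:
        # Level 0 pre-pass: each mark joins the cluster of the preceding base.
        for i in range(1, len(names)):
            if names[i].islower():
                clusters[i] = clusters[i - 1]
    components = [0, 2, 4]  # positions of A, B and C
    if level in (0, 1):
        # Levels 0 and 1 merge the whole range spanned by the ligature.
        clusters = merge_clusters(clusters, components[0], components[-1] + 1)
    # The ligature carries the cluster of its first component.
    ligature = ("ABC", clusters[components[0]])
    marks = [(names[i], clusters[i])
             for i in range(len(names)) if i not in components]
    return [ligature] + marks


for level in (0, 1, 2):
    print(level, [cluster for _, cluster in shape(level)])
# 0 [0, 0, 0, 0]
# 1 [0, 0, 0, 5]
# 2 [0, 1, 3, 5]
```
]]></programlisting>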
  </sect2>
  <sect2 id="reordering">
    <title>Reordering</title>
    <para>
      Another tricky case is reordering. Under level 2, start with:
    </para>
    <programlisting>
  A,B,C,D,E
  0,1,2,3,4
</programlisting>
    <para>
      Now imagine <literal>D</literal> moves before
      <literal>B</literal>:
    </para>
    <programlisting>
  A,D,B,C,E
  0,3,1,2,4
</programlisting>
    <para>
      Now, if <literal>D</literal> ligates with <literal>B</literal>, we
      get:
    </para>
    <programlisting>
  A,DB,C,E
  0,3 ,2,4
</programlisting>
    <para>
      In a different scenario, <literal>A</literal> and
      <literal>B</literal> could have ligated
      <emphasis>before</emphasis> <literal>D</literal> reordered; that
      would have resulted in:
    </para>
    <programlisting>
  AB,D,C,E
  0 ,3,2,4
</programlisting>
    <para>
      There's no way to differentiate between these two scenarios based
      on the cluster numbers alone.
    </para>
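    <para>
      The ambiguity can be checked with a short Python sketch that
      replays both scenarios under the level 2 rules (no merging, a
      ligature keeps its first glyph's cluster). The helper names
      <literal>ligate_level2</literal> and <literal>move</literal> are
      hypothetical, and the ligature helper assumes the two glyphs are
      adjacent.
    </para>
    <programlisting language="python"><![CDATA[
```python
def ligate_level2(glyphs, first):
    """Ligate glyphs[first] with the next glyph; keep the first glyph's cluster."""
    name = glyphs[first][0] + glyphs[first + 1][0]
    return glyphs[:first] + [(name, glyphs[first][1])] + glyphs[first + 2:]


def move(glyphs, source, target):
    """Move one glyph to a new index; cluster values are left untouched."""
    glyphs = list(glyphs)
    glyphs.insert(target, glyphs.pop(source))
    return glyphs


start = list(zip("ABCDE", range(5)))

# Scenario 1: D moves before B, then D and B ligate.
scenario_1 = ligate_level2(move(start, 3, 1), 1)
# Scenario 2: A and B ligate first, then D moves before C.
scenario_2 = move(ligate_level2(start, 0), 2, 1)

print(scenario_1)  # [('A', 0), ('DB', 3), ('C', 2), ('E', 4)]
print(scenario_2)  # [('AB', 0), ('D', 3), ('C', 2), ('E', 4)]

clusters_1 = [cluster for _, cluster in scenario_1]
clusters_2 = [cluster for _, cluster in scenario_2]
print(clusters_1 == clusters_2)  # True
```
]]></programlisting>
    <para>
      The glyph sequences differ, but the cluster lists are identical,
      so a client looking only at cluster values cannot tell which
      characters each glyph covers.
    </para>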
    <para>
      Another problem happens with ligatures under level 2 if the
      direction of the text is forced to the opposite of its natural
      direction (e.g. left-to-right Arabic). But that's too much of a
      corner case to worry about.
    </para>
  </sect2>
</sect1>
</chapter>