<chapter id="clusters">
  <sect1 id="clusters">
    <title>Clusters</title>
    <para>
      In shaping text, a <emphasis>cluster</emphasis> is a sequence of
      code points that needs to be treated as a single, indivisible unit.
    </para>
    <para>
      When you add text to a HarfBuzz buffer, each character is
      associated with a <emphasis>cluster value</emphasis>. This is an
      arbitrary number as far as HarfBuzz is concerned.
    </para>
    <para>
      Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the
      actual number does not matter. Moreover, cluster values are not
      required to be monotonically increasing. Nearly all of HarfBuzz's
      tests are performed on monotonically increasing cluster numbers,
      but the code itself makes no such assumption. With that in mind,
      let's examine what happens to cluster values during shaping under
      each cluster level.
    </para>
    <para>
      HarfBuzz provides three <emphasis>levels</emphasis> of clustering
      support. Level 0 is the default behavior and reproduces the behavior
      of the old HarfBuzz library. Level 1 tweaks this behavior slightly
      to produce better results, so level 1 clustering is recommended for
      code that does not need to remain backward compatible with the old
      HarfBuzz.
    </para>
    <para>
      Level 2 differs significantly in how it treats cluster values.
      Levels 0 and 1 both process ligatures and glyph decomposition by
      merging clusters; level 2 does not.
    </para>
    <para>
      The conceptual model for what the cluster values mean, in levels 0
      and 1, is this:
    </para>
    <itemizedlist spacing="compact">
      <listitem>
        <para>
          the sequence of cluster values will always remain monotone
        </para>
      </listitem>
      <listitem>
        <para>
          each value represents a single cluster
        </para>
      </listitem>
      <listitem>
        <para>
          each cluster contains one or more glyphs and one or more
          characters
        </para>
      </listitem>
    </itemizedlist>
    <para>
      Assuming that initial cluster numbers were monotonically increasing
      and distinct, then all adjacent glyphs having the same cluster
      number belong to the same cluster, and all characters belong to the
      cluster that has the highest number not larger than their initial
      cluster number. This will become clearer with an example.
    </para>
  </sect1>
  <sect1 id="a-clustering-example-for-levels-0-and-1">
    <title>A clustering example for levels 0 and 1</title>
    <para>
      Let's say we start with the following character sequence and cluster
      values:
    </para>
    <programlisting>
      A,B,C,D,E
      0,1,2,3,4
    </programlisting>
    <para>
      We then map the characters to glyphs. For simplicity, let's assume
      that each character maps to the corresponding, identical-looking
      glyph:
    </para>
    <programlisting>
      A,B,C,D,E
      0,1,2,3,4
    </programlisting>
    <para>
      Now if, for example, <literal>B</literal> and <literal>C</literal>
      ligate, then the clusters to which they belong "merge".
      This merged cluster takes for its cluster number the minimum of all
      the cluster numbers of the clusters that went in. In this case, we
      get:
    </para>
    <programlisting>
      A,BC,D,E
      0,1 ,3,4
    </programlisting>
    <para>
      Now let's assume that the <literal>BC</literal> glyph decomposes
      into three components, and <literal>D</literal> also decomposes into
      two.
      The components each inherit the cluster value of their parent:
    </para>
    <programlisting>
      A,BC0,BC1,BC2,D0,D1,E
      0,1  ,1  ,1  ,3 ,3 ,4
    </programlisting>
    <para>
      Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then
      their clusters (numbers 1 and 3) merge into
      <literal>min(1,3) = 1</literal>:
    </para>
    <programlisting>
      A,BC0,BC1,BC2D0,D1,E
      0,1  ,1  ,1    ,1 ,4
    </programlisting>
    <para>
      At this point, cluster 1 means: the character sequence
      <literal>BCD</literal> is represented by glyphs
      <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
      further.
    </para>
  </sect1>
  <sect1 id="reordering-in-levels-0-and-1">
    <title>Reordering in levels 0 and 1</title>
    <para>
      Another common operation in the more complex shapers is when things
      reorder. In those cases, to maintain monotone clusters, HarfBuzz
      merges the clusters of everything in the reordering sequence. For
      example, let's again start with the character sequence:
    </para>
    <programlisting>
      A,B,C,D,E
      0,1,2,3,4
    </programlisting>
    <para>
      If <literal>D</literal> is reordered before <literal>B</literal>,
      then the <literal>B</literal>, <literal>C</literal>, and
      <literal>D</literal> clusters merge, and we get:
    </para>
    <programlisting>
      A,D,B,C,E
      0,1,1,1,4
    </programlisting>
    <para>
      This is clearly not ideal, but it is the only sensible way to
      maintain monotone indices and retain the true relationship between
      glyphs and characters.
    </para>
  </sect1>
  <sect1 id="the-distinction-between-levels-0-and-1">
    <title>The distinction between levels 0 and 1</title>
    <para>
      So, the above is pretty much what cluster levels 0 and 1 do.
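      As a rough illustration, the merging rule can be modeled in a few
      lines of Python. This is only a sketch of the behavior described
      above, not HarfBuzz's implementation; the function names
      <literal>merge_clusters</literal> and <literal>ligate</literal>
      are hypothetical and exist only for this example.
      <programlisting language="python">
def merge_clusters(clusters, start, end):
    """Return a copy in which clusters[start:end] share their minimum value.

    The range is first widened over neighbours that already share a
    cluster value with either end, so every cluster stays contiguous.
    """
    while start != 0 and clusters[start - 1] == clusters[start]:
        start -= 1
    while end != len(clusters) and clusters[end] == clusters[end - 1]:
        end += 1
    lowest = min(clusters[start:end])
    return clusters[:start] + [lowest] * (end - start) + clusters[end:]


def ligate(glyphs, clusters, start, end, name):
    """Replace glyphs[start:end] with one glyph, merging their clusters."""
    merged = merge_clusters(clusters, start, end)
    new_glyphs = glyphs[:start] + [name] + glyphs[end:]
    new_clusters = merged[:start] + [merged[start]] + merged[end:]
    return new_glyphs, new_clusters


# B and C ligate: clusters 1 and 2 merge into min(1, 2) = 1.
glyphs, clusters = ligate(list("ABCDE"), [0, 1, 2, 3, 4], 1, 3, "BC")
print(glyphs, clusters)

# After decomposition, BC2 and D0 ligate: clusters 1 and 3 merge into 1.
glyphs = ["A", "BC0", "BC1", "BC2", "D0", "D1", "E"]
glyphs, clusters = ligate(glyphs, [0, 1, 1, 1, 3, 3, 4], 3, 5, "BC2D0")
print(glyphs, clusters)

# D is reordered before B: the B, C and D clusters all merge.
print(merge_clusters([0, 1, 2, 3, 4], 1, 4))
</programlisting>
      The three calls correspond to the ligature, decomposition, and
      reordering examples given earlier, all of which describe behavior
      shared by levels 0 and 1.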
      The only difference between the two is this: in level 0, at the very
      beginning of the shaping process, we also merge clusters between
      base characters and all Unicode marks (combining or not) following
      them. E.g.:
    </para>
    <programlisting>
      A,acute,B
      0,1    ,2
    </programlisting>
    <para>
      will become:
    </para>
    <programlisting>
      A,acute,B
      0,0    ,2
    </programlisting>
    <para>
      This is the default behavior. We do it because Windows did it and
      old HarfBuzz did it, so this remained the default. But this behavior
      makes it impossible to color diacritic marks differently from their
      base characters. That's why in level 1 we do not perform this
      initial merging step.
    </para>
    <para>
      For clients, level 0 is more convenient if they rely on HarfBuzz
      clusters for cursor positioning. But that's wrong anyway: cursor
      positions should be determined based on Unicode grapheme boundaries,
      NOT shaping clusters. As such, level 1 clusters are preferred.
    </para>
    <para>
      One last note about levels 0 and 1. We currently don't allow a
      <literal>MultipleSubst</literal> lookup to replace a glyph with zero
      glyphs (i.e., to delete a glyph). But in some other situations,
      glyphs can be deleted. In those cases, if the glyph being deleted is
      the last glyph of its cluster, we make sure to merge the cluster
      with a neighboring cluster.
    </para>
    <para>
      This is, primarily, to make sure that the starting cluster of the
      text always has the cluster index pointing to the start of the text
      for the run; more than one client currently relies on this
      guarantee.
    </para>
    <para>
      Incidentally, Apple's CoreText does something else to maintain the
      same promise: it inserts a glyph with id 65535 at the beginning of
      the glyph string if the glyph corresponding to the first character
      in the run was deleted.
      HarfBuzz might do something similar in the future.
    </para>
  </sect1>
  <sect1 id="level-2">
    <title>Level 2</title>
    <para>
      Level 2 is a different beast from levels 0 and 1. It is simple to
      describe, but hard to make sense of. It simply doesn't do any
      cluster merging whatsoever. When glyphs ligate, or when multiple
      glyphs otherwise turn into one, the cluster value of the first
      glyph is retained.
    </para>
    <para>
      Here are a few examples of why processing cluster values produced at
      this level might be tricky:
    </para>
    <sect2 id="ligatures-with-combining-marks">
      <title>Ligatures with combining marks</title>
      <para>
        Imagine capital letters are bases and lower case letters are
        combining marks. With an input sequence like this:
      </para>
      <programlisting>
        A,a,B,b,C,c
        0,1,2,3,4,5
      </programlisting>
      <para>
        if <literal>A,B,C</literal> ligate, then here are the cluster
        values one would get under the various levels:
      </para>
      <para>
        level 0:
      </para>
      <programlisting>
        ABC,a,b,c
        0  ,0,0,0
      </programlisting>
      <para>
        level 1:
      </para>
      <programlisting>
        ABC,a,b,c
        0  ,0,0,5
      </programlisting>
      <para>
        level 2:
      </para>
      <programlisting>
        ABC,a,b,c
        0  ,1,3,5
      </programlisting>
      <para>
        Making sense of the last example is the hardest for a client,
        because there is nothing in the cluster values to suggest that
        <literal>B</literal> and <literal>C</literal> ligated with
        <literal>A</literal>.
      </para>
    </sect2>
    <sect2 id="reordering">
      <title>Reordering</title>
      <para>
        Another tricky case is when things reorder.
        Under level 2:
      </para>
      <programlisting>
        A,B,C,D,E
        0,1,2,3,4
      </programlisting>
      <para>
        Now imagine <literal>D</literal> moves before
        <literal>B</literal>:
      </para>
      <programlisting>
        A,D,B,C,E
        0,3,1,2,4
      </programlisting>
      <para>
        Now, if <literal>D</literal> ligates with <literal>B</literal>, we
        get:
      </para>
      <programlisting>
        A,DB,C,E
        0,3 ,2,4
      </programlisting>
      <para>
        In a different scenario, <literal>A</literal> and
        <literal>B</literal> could have ligated
        <emphasis>before</emphasis> <literal>D</literal> reordered; that
        would have resulted in:
      </para>
      <programlisting>
        AB,D,C,E
        0 ,3,2,4
      </programlisting>
      <para>
        There's no way to differentiate between these two scenarios based
        on the cluster numbers alone.
      </para>
      <para>
        Another problem happens with ligatures under level 2 if the
        direction of the text is forced to the opposite of its natural
        direction (e.g. left-to-right Arabic). But that's too much of a
        corner case to worry about.
      </para>
    </sect2>
  </sect1>
</chapter>