<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
                  "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
  <title>Shaping concepts</title>
  <section id="text-shaping-concepts">
    <title>Text shaping</title>
    <para>
      Text shaping is the process of transforming a sequence of Unicode
      codepoints that represent individual characters (letters,
      diacritics, tone marks, numbers, symbols, etc.) into the
      orthographically and linguistically correct two-dimensional layout
      of glyph shapes taken from a specified font.
    </para>
    <para>
      For some writing systems (or <emphasis>scripts</emphasis>) and
      languages, the process is simple, requiring the shaper to do
      little more than advance the horizontal position forward by the
      correct amount for each successive glyph.
    </para>
    <para>
      But, for <emphasis>complex scripts</emphasis>, any combination of
      several shaping operations may be required, and the rules for how
      and when they are applied vary from script to script. HarfBuzz and
      other shaping engines implement these rules.
    </para>
    <para>
      The exact rules and necessary operations for a particular script
      constitute a shaping <emphasis>model</emphasis>. OpenType
      specifies a set of shaping models that covers all of
      Unicode. Other shaping models are available, however, including
      Graphite and Apple Advanced Typography (AAT).
    </para>
  </section>

  <section id="complex-scripts">
    <title>Complex scripts</title>
    <para>
      In text-shaping terminology, scripts are generally classified as
      either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>.
    </para>
    <para>
      Complex scripts are those for which transforming the input
      sequence into the final layout requires some combination of
      operations, such as context-dependent substitutions,
      context-dependent mark positioning, glyph-to-glyph joining,
      glyph reordering, or glyph stacking.
    </para>
    <para>
      In some complex scripts, the shaping rules require that a text
      run be divided into syllables before the operations can be
      applied. Other complex scripts may apply shaping operations over
      entire words or over the entire text run, with no subdivision
      required.
    </para>
    <para>
      Non-complex scripts, by definition, do not require these
      operations. However, correctly shaping a text run in a
      non-complex script may still involve Unicode normalization,
      ligature substitutions, mark positioning, kerning, and applying
      other font features. The key difference is that a text run in a
      non-complex script can be processed sequentially and in the same
      order as the input sequence of Unicode codepoints, without
      requiring an analysis stage.
    </para>
  </section>

  <section id="shaping-operations">
    <title>Shaping operations</title>
    <para>
      Shaping a complex-script text run involves transforming the
      input sequence of Unicode codepoints with some combination of
      operations that is specified in the shaping model for the
      script.
    </para>
    <para>
      The specific conditions that trigger a given operation for a
      text run vary from script to script, as do the order in which
      the operations are performed and the codepoints that are
      affected. However, the same general set of shaping operations is
      common to all of the complex-script shaping models.
    </para>

    <itemizedlist>
      <listitem>
        <para>
          A <emphasis>reordering</emphasis> operation moves a glyph
          from its original ("logical") position in the sequence to
          some other ("visual") position.
        </para>
        <para>
          The shaping model for a given complex script might involve
          more than one reordering step.
        </para>
      </listitem>

      <listitem>
        <para>
          A <emphasis>joining</emphasis> operation replaces a glyph
          with an alternate form that is designed to connect with one
          or more of the adjacent glyphs in the sequence.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>substitution</emphasis> operation
          replaces either a single glyph or a subsequence of several
          glyphs with an alternate glyph. This substitution is
          performed when the original glyph or subsequence of glyphs
          occurs in a specified position with respect to the
          surrounding sequence. For example, one substitution might be
          performed only when the target glyph is the first glyph in
          the sequence, while another substitution is performed only
          when a different target glyph occurs immediately after a
          particular string pattern.
        </para>
        <para>
          The shaping model for a given complex script might involve
          multiple contextual-substitution operations, each applying
          to different target glyphs and patterns and each performed
          in a separate step.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>positioning</emphasis> operation
          moves the horizontal and/or vertical position of a
          glyph. This positioning move is performed when the glyph
          occurs in a specified position with respect to the
          surrounding sequence.
        </para>
        <para>
          Many contextual positioning operations are used to place
          <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
          signs, and tone markers) with respect to
          <emphasis>base</emphasis> glyphs. However, some complex
          scripts may use contextual positioning operations to
          correctly place base glyphs as well, such as
          when the script uses <emphasis>stacking</emphasis> characters.
        </para>
      </listitem>

    </itemizedlist>
  </section>

  <section id="unicode-character-categories">
    <title>Unicode character categories</title>
    <para>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </para>
    <para>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
      or something else (<emphasis>Other</emphasis>).
    </para>
    <para>
      These UGC properties are "Major" categories. Each codepoint is
      further assigned to a "minor" category within its Major
      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
      "Letter, modifier" (<literal>Lm</literal>).
    </para>
    <para>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<literal>Mn</literal>), spacing combining
      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
    </para>
    <para>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
      properties that provide more detailed information needed for
      shaping.
    </para>
    <para>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks.
      The UIPC property sub-categorizes Mark codepoints by the visual
      position that they occupy (above, below, right, left, or in
      multiple positions).
    </para>
    <para>
      Some complex scripts require that the text run be split into
      syllables, and what constitutes a valid syllable in these
      scripts is specified by regular expressions over the Letter and
      Mark codepoints that take the UISC and UIPC properties into account.
    </para>

  </section>

  <section id="text-runs">
    <title>Text runs</title>
    <para>
      Real-world text usually contains codepoints from a mixture of
      different Unicode scripts (including punctuation, numbers, symbols,
      white-space characters, and other codepoints that do not belong
      to any script). Real-world text may also be marked up with
      formatting that changes font properties (including the font,
      font style, and font size).
    </para>
    <para>
      For shaping purposes, all real-world text streams must first be
      segmented into runs that have a uniform set of properties.
    </para>
    <para>
      In particular, shaping models always assume that every codepoint
      in a text run has the same <emphasis>direction</emphasis>,
      <emphasis>script</emphasis> tag, and
      <emphasis>language</emphasis> tag.
    </para>
  </section>

  <section id="opentype-shaping-models">
    <title>OpenType shaping models</title>
    <para>
      OpenType provides shaping models for the following scripts:
    </para>

    <itemizedlist>
      <listitem>
        <para>
          The <emphasis>default</emphasis> shaping model handles all
          non-complex scripts, and may also be used as a fallback for
          handling unrecognized scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Indic</emphasis> shaping model handles the Indic
          scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
          Malayalam, Oriya, Tamil, Telugu, and Sinhala.
        </para>
        <para>
          The Indic shaping model was revised significantly in
          2005. To denote the change, a new set of <emphasis>script
          tags</emphasis> was assigned for Bengali, Devanagari,
          Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
          Telugu. For the sake of clarity, the term "Indic2" is
          sometimes used to refer to the current, revised shaping
          model.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Arabic</emphasis> shaping model supports
          Arabic, Mongolian, N'Ko, Syriac, and several other connected
          or cursive scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Thai/Lao</emphasis> shaping model supports
          the Thai and Lao scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Khmer</emphasis> shaping model supports the
          Khmer script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Myanmar</emphasis> shaping model supports the
          Myanmar (or Burmese) script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Tibetan</emphasis> shaping model supports the
          Tibetan script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hangul</emphasis> shaping model supports the
          Hangul script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hebrew</emphasis> shaping model supports the
          Hebrew script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Universal Shaping Engine</emphasis> (USE)
          shaping model supports complex scripts not covered by one of
          the above, script-specific shaping models, including
          Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
          Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
          Viet, and many others.
        </para>
      </listitem>

      <listitem>
        <para>
          Text runs that do not fall under one of the above shaping
          models may still require processing by a shaping engine. Of
          particular note is <emphasis>Emoji</emphasis> shaping, which
          may involve variation-selector sequences and glyph
          substitution. Emoji shaping is handled by the default
          shaping model.
        </para>
      </listitem>

    </itemizedlist>

  </section>

  <section id="graphite-shaping">
    <title>Graphite shaping</title>
    <para>
      In contrast to OpenType shaping, Graphite shaping does not
      specify a predefined set of shaping models or a set of supported
      scripts.
    </para>
    <para>
      Instead, each Graphite font contains a complete set of rules that
      implement the required shaping model for the intended
      script. These rules include finite-state machines to match
      sequences of codepoints to the shaping operations to perform.
    </para>
    <para>
      Graphite shaping can perform the same shaping operations used in
      OpenType shaping, as well as other functions that have not been
      defined for OpenType shaping.
    </para>
  </section>

  <section id="aat-shaping">
    <title>AAT shaping</title>
    <para>
      In contrast to OpenType shaping, AAT shaping does not specify a
      predefined set of shaping models or a set of supported scripts.
    </para>
    <para>
      Instead, each AAT font includes a complete set of rules that
      implement the desired shaping model for the intended
      script. These rules include finite-state machines to match glyph
      sequences to the shaping operations to perform.
    </para>
    <para>
      Notably, AAT shaping rules are expressed for glyphs in the font,
      not for Unicode codepoints. AAT shaping can perform the same
      shaping operations used in OpenType shaping, as well as other
      functions that have not been defined for OpenType shaping.
    </para>
  </section>
</chapter>