<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
  <title>Shaping concepts</title>
  <section id="text-shaping-concepts">
    <title>Text shaping</title>
    <para>
      Text shaping is the process of transforming a sequence of Unicode
      codepoints that represent individual characters (letters,
      diacritics, tone marks, numbers, symbols, etc.) into the
      orthographically and linguistically correct two-dimensional layout
      of glyph shapes taken from a specified font.
    </para>
    <para>
      For some writing systems (or <emphasis>scripts</emphasis>) and
      languages, the process is simple, requiring the shaper to do
      little more than advance the horizontal position forward by the
      correct amount for each successive glyph.
    </para>
    <para>
      For <emphasis>complex scripts</emphasis>, however, any combination of
      several shaping operations may be required, and the rules for how
      and when they are applied vary from script to script. HarfBuzz and
      other shaping engines implement these rules.
    </para>
    <para>
      The exact rules and necessary operations for a particular script
      constitute a shaping <emphasis>model</emphasis>. OpenType
      specifies a set of shaping models that covers all of
      Unicode. Other shaping models are available, however, including
      Graphite and Apple Advanced Typography (AAT).
    </para>
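    <para>
      The simple case described above, where the shaper only advances
      the horizontal position for each glyph, can be sketched in a few
      lines of Python. This is an illustration only: the
      <literal>ADVANCES</literal> table and the
      <literal>layout_simple</literal> function are hypothetical, and
      the advance widths are made-up values rather than metrics read
      from a real font.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Hypothetical advance widths in font units; a real shaper reads these
# from the font's horizontal metrics.
ADVANCES = {"H": 722, "i": 278, "!": 333}

def layout_simple(text, advances):
    """Return (character, x_offset) pairs and the total run width."""
    pen_x = 0
    placed = []
    for ch in text:
        placed.append((ch, pen_x))  # the glyph is placed at the current pen position
        pen_x += advances[ch]       # then the pen moves forward by the glyph's advance
    return placed, pen_x

placed, width = layout_simple("Hi!", ADVANCES)
print(placed)
print(width)
```
]]></programlisting>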
  </section>

  <section id="complex-scripts">
    <title>Complex scripts</title>
    <para>
      In text-shaping terminology, scripts are generally classified as
      either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>.
    </para>
    <para>
      Complex scripts are those for which transforming the input
      sequence into the final layout requires some combination of
      operations&mdash;such as context-dependent substitutions,
      context-dependent mark positioning, glyph-to-glyph joining,
      glyph reordering, or glyph stacking.
    </para>
    <para>
      In some complex scripts, the shaping rules require that a text
      run be divided into syllables before the operations can be
      applied. Other complex scripts may apply shaping operations over
      entire words or over the entire text run, with no subdivision
      required.
    </para>
    <para>
      Non-complex scripts, by definition, do not require these
      operations. However, correctly shaping a text run in a
      non-complex script may still involve Unicode normalization,
      ligature substitutions, mark positioning, kerning, and applying
      other font features. The key difference is that a text run in a
      non-complex script can be processed sequentially and in the same
      order as the input sequence of Unicode codepoints, without
      requiring an analysis stage.
    </para>
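    <para>
      As a small illustration of the Unicode normalization step
      mentioned above, the following sketch uses Python's standard
      <literal>unicodedata</literal> module to compose a base letter
      and a combining mark into a single precomposed codepoint. Which
      normalization form a shaper actually applies, and when, is
      implementation-specific; NFC is used here only as an example.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata

# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT
decomposed = "e\u0301"

# NFC composes the pair into LATIN SMALL LETTER E WITH ACUTE (U+00E9)
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(f"U+{ord(composed):04X}")
```
]]></programlisting>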
  </section>

  <section id="shaping-operations">
    <title>Shaping operations</title>
    <para>
      Shaping a complex-script text run involves transforming the
      input sequence of Unicode codepoints with some combination of
      operations that is specified in the shaping model for the
      script.
    </para>
    <para>
      The specific conditions that trigger a given operation for a
      text run vary from script to script, as do the order in which
      the operations are performed and the codepoints that are
      affected. However, the same general set of shaping operations is
      common to all of the complex-script shaping models.
    </para>

    <itemizedlist>
      <listitem>
        <para>
          A <emphasis>reordering</emphasis> operation moves a glyph
          from its original ("logical") position in the sequence to
          some other ("visual") position.
        </para>
        <para>
          The shaping model for a given complex script might involve
          more than one reordering step.
        </para>
      </listitem>

      <listitem>
        <para>
          A <emphasis>joining</emphasis> operation replaces a glyph
          with an alternate form that is designed to connect with one
          or more of the adjacent glyphs in the sequence.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>substitution</emphasis> operation
          replaces either a single glyph or a subsequence of several
          glyphs with an alternate glyph. This substitution is
          performed when the original glyph or subsequence of glyphs
          occurs in a specified position with respect to the
          surrounding sequence. For example, one substitution might be
          performed only when the target glyph is the first glyph in
          the sequence, while another substitution is performed only
          when a different target glyph occurs immediately after a
          particular string pattern.
        </para>
        <para>
          The shaping model for a given complex script might involve
          multiple contextual-substitution operations, each applying
          to different target glyphs and patterns and each performed
          in a separate step.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>positioning</emphasis> operation
          moves the horizontal and/or vertical position of a
          glyph. This positioning move is performed when the glyph
          occurs in a specified position with respect to the
          surrounding sequence.
        </para>
        <para>
          Many contextual positioning operations are used to place
          <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
          signs, and tone marks) with respect to
          <emphasis>base</emphasis> glyphs. However, some complex
          scripts may use contextual positioning operations to
          correctly place base glyphs as well, such as
          when the script uses <emphasis>stacking</emphasis> characters.
        </para>
      </listitem>

    </itemizedlist>
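    <para>
      The reordering operation in the list above can be sketched with
      a deliberately simplified example. In Devanagari and Bengali,
      the vowel sign I is stored after its consonant in logical order
      but is drawn to the consonant's left. The set
      <literal>PRE_BASE_VOWEL_SIGNS</literal> and the function
      <literal>reorder_pre_base</literal> below are hypothetical, and
      the sketch works on codepoints rather than glyphs and ignores
      consonant clusters; a real shaping model moves the sign relative
      to the base of the whole syllable.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Vowel signs that follow the consonant in logical order but are drawn
# to its left. Only two examples are listed; this is not a complete set.
PRE_BASE_VOWEL_SIGNS = {
    "\u093F",  # DEVANAGARI VOWEL SIGN I
    "\u09BF",  # BENGALI VOWEL SIGN I
}

def reorder_pre_base(logical):
    """Swap each pre-base vowel sign with the character just before it."""
    visual = list(logical)
    for i in range(1, len(visual)):
        if visual[i] in PRE_BASE_VOWEL_SIGNS:
            visual[i - 1], visual[i] = visual[i], visual[i - 1]
    return visual

logical = ["\u0915", "\u093F"]  # KA, VOWEL SIGN I (logical order)
visual = reorder_pre_base(logical)
print([f"U+{ord(c):04X}" for c in visual])
```
]]></programlisting>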
  </section>

  <section id="unicode-character-categories">
    <title>Unicode character categories</title>
    <para>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </para>
    <para>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
      or something else (<emphasis>Other</emphasis>).
    </para>
    <para>
      These UGC properties are "Major" categories. Each codepoint is
      further assigned to a "minor" category within its Major
      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
      "Letter, modifier" (<literal>Lm</literal>).
    </para>
    <para>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<literal>Mn</literal>), spacing combining
      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
    </para>
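    <para>
      The Major and minor categories described above can be inspected
      with Python's standard <literal>unicodedata</literal> module,
      whose <literal>category()</literal> function returns the
      two-letter minor category; its first letter is the Major
      category. The helper <literal>general_category</literal> is a
      name chosen for this sketch.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata

def general_category(ch):
    """Return (major, minor) for one character, e.g. ('L', 'Lu')."""
    minor = unicodedata.category(ch)
    return minor[0], minor

# One Letter of each minor category named above, then one Mark of each kind.
for ch in ["A", "\u02B0", "\u0301", "\u093E", "\u20DD"]:
    major, minor = general_category(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {major} / {minor}")
```
]]></programlisting>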
    <para>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
      properties that provide more detailed information needed for
      shaping.
    </para>
    <para>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks. The
      UIPC property sub-categorizes Mark codepoints by the visual
      position that they occupy (above, below, right, left, or in
      multiple positions).
    </para>
    <para>
      Some complex scripts require that the text run be split into
      syllables, and what constitutes a valid syllable in these
      scripts is specified by regular expressions over the Letter and
      Mark codepoints that take the UISC and UIPC properties into account.
    </para>
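    <para>
      The idea of specifying syllables with regular expressions over
      character classes can be sketched as follows. Everything in this
      example is an assumption made for illustration: the
      <literal>CLASSES</literal> table is hand-written for six
      Devanagari codepoints instead of being derived from the UISC and
      UIPC data, and the <literal>SYLLABLE</literal> pattern is far
      simpler than the pattern of any real shaping model.
    </para>
    <programlisting language="python"><![CDATA[
```python
import re

# Hypothetical class table: C = consonant, H = virama, M = vowel sign.
CLASSES = {
    "\u0928": "C",  # DEVANAGARI LETTER NA
    "\u092E": "C",  # DEVANAGARI LETTER MA
    "\u0938": "C",  # DEVANAGARI LETTER SA
    "\u0924": "C",  # DEVANAGARI LETTER TA
    "\u094D": "H",  # DEVANAGARI SIGN VIRAMA
    "\u0947": "M",  # DEVANAGARI VOWEL SIGN E
}

# Simplified syllable: zero or more consonant+virama pairs, a base
# consonant, then an optional vowel sign.
SYLLABLE = re.compile(r"(?:CH)*CM?")

def split_syllables(text):
    """Match the pattern against the class string, one class per codepoint."""
    classes = "".join(CLASSES.get(ch, "X") for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]

word = "\u0928\u092E\u0938\u094D\u0924\u0947"  # NA, MA, SA, VIRAMA, TA, E
print(split_syllables(word))
```
]]></programlisting>
    <para>
      In this sketch the word is split into three syllables: NA, MA,
      and the cluster SA + VIRAMA + TA + VOWEL SIGN E. Codepoints that
      are missing from the table receive the class
      <literal>X</literal> and are simply skipped.
    </para>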

  </section>

  <section id="text-runs">
    <title>Text runs</title>
    <para>
      Real-world text usually contains codepoints from a mixture of
      different Unicode scripts (including punctuation, numbers, symbols,
      white-space characters, and other codepoints that do not belong
      to any script). Real-world text may also be marked up with
      formatting that changes font properties (including the font,
      font style, and font size).
    </para>
    <para>
      For shaping purposes, all real-world text streams must first be
      segmented into runs that have a uniform set of properties.
    </para>
    <para>
      In particular, shaping models always assume that every codepoint
      in a text run has the same <emphasis>direction</emphasis>,
      <emphasis>script</emphasis> tag, and
      <emphasis>language</emphasis> tag.
    </para>
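    <para>
      A minimal sketch of segmenting text into runs by script is shown
      below. It is not how any particular engine performs
      segmentation: the <literal>SCRIPT_RANGES</literal> table is a
      hypothetical and deliberately incomplete list of block ranges,
      characters with no script of their own simply join the run that
      precedes them, and direction and language are not considered at
      all.
    </para>
    <programlisting language="python"><![CDATA[
```python
from itertools import groupby

# Hypothetical, deliberately incomplete script table (block ranges only).
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),
    (0x0061, 0x007A, "Latn"),
    (0x0590, 0x05FF, "Hebr"),
    (0x0600, 0x06FF, "Arab"),
]

def script_of(ch):
    cp = ord(ch)
    for first, last, tag in SCRIPT_RANGES:
        if first <= cp <= last:
            return tag
    return None  # spaces, digits, punctuation: no script of their own

def split_runs(text):
    """Split text into (script, substring) runs."""
    resolved, current = [], None
    for ch in text:
        current = script_of(ch) or current  # neutrals keep the previous script
        resolved.append(current)
    # Leading neutrals take the first real script, or "Zyyy" (Common) if none.
    first = next((s for s in resolved if s), "Zyyy")
    resolved = [s or first for s in resolved]
    runs, pos = [], 0
    for script, group in groupby(resolved):
        n = len(list(group))
        runs.append((script, text[pos:pos + n]))
        pos += n
    return runs

print(split_runs("abc \u05D0\u05D1 123"))
```
]]></programlisting>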
  </section>

  <section id="opentype-shaping-models">
    <title>OpenType shaping models</title>
    <para>
      OpenType provides the following shaping models:
    </para>

    <itemizedlist>
      <listitem>
        <para>
          The <emphasis>default</emphasis> shaping model handles all
          non-complex scripts, and may also be used as a fallback for
          handling unrecognized scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Indic</emphasis> shaping model handles the Indic
          scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
          Malayalam, Oriya, Tamil, Telugu, and Sinhala.
        </para>
        <para>
          The Indic shaping model was revised significantly in
          2005. To denote the change, a new set of <emphasis>script
          tags</emphasis> was assigned for Bengali, Devanagari,
          Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
          Telugu. For the sake of clarity, the term "Indic2" is
          sometimes used to refer to the current, revised shaping
          model.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Arabic</emphasis> shaping model supports
          Arabic, Mongolian, N'Ko, Syriac, and several other connected
          or cursive scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Thai/Lao</emphasis> shaping model supports
          the Thai and Lao scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Khmer</emphasis> shaping model supports the
          Khmer script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Myanmar</emphasis> shaping model supports the
          Myanmar (or Burmese) script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Tibetan</emphasis> shaping model supports the
          Tibetan script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hangul</emphasis> shaping model supports the
          Hangul script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hebrew</emphasis> shaping model supports the
          Hebrew script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Universal Shaping Engine</emphasis> (USE)
          shaping model supports complex scripts not covered by one of
          the above, script-specific shaping models, including
          Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
          Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
          Viet, and many others.
        </para>
      </listitem>

      <listitem>
        <para>
          Text runs that do not fall under one of the above shaping
          models may still require processing by a shaping engine. Of
          particular note is <emphasis>Emoji</emphasis> shaping, which
          may involve variation-selector sequences and glyph
          substitution. Emoji shaping is handled by the default
          shaping model.
        </para>
      </listitem>

    </itemizedlist>
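    <para>
      The original and revised ("Indic2") script tags mentioned in the
      list above are the four-character tags registered by OpenType
      for each script. The sketch below lists both tags for the nine
      revised scripts and shows one plausible way to choose between
      them. The function <literal>pick_indic_tag</literal> is
      hypothetical, and its policy of preferring the revised tag
      whenever the font provides it is an assumption of this example
      rather than a rule stated in this chapter.
    </para>
    <programlisting language="python"><![CDATA[
```python
# OpenType script tags for the nine revised scripts: (original, revised).
INDIC_TAGS = {
    "Bengali":    ("beng", "bng2"),
    "Devanagari": ("deva", "dev2"),
    "Gujarati":   ("gujr", "gjr2"),
    "Gurmukhi":   ("guru", "gur2"),
    "Kannada":    ("knda", "knd2"),
    "Malayalam":  ("mlym", "mlm2"),
    "Oriya":      ("orya", "ory2"),
    "Tamil":      ("taml", "tml2"),
    "Telugu":     ("telu", "tel2"),
}

def pick_indic_tag(script, font_script_tags):
    """Assumed policy: use the revised tag if the font has it, else the original."""
    old, new = INDIC_TAGS[script]
    if new in font_script_tags:
        return new
    if old in font_script_tags:
        return old
    return None  # the font supports neither tag for this script

print(pick_indic_tag("Devanagari", {"deva", "dev2", "latn"}))
print(pick_indic_tag("Tamil", {"taml"}))
```
]]></programlisting>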

  </section>

  <section id="graphite-shaping">
    <title>Graphite shaping</title>
    <para>
      In contrast to OpenType shaping, Graphite shaping does not
      specify a predefined set of shaping models or a set of supported
      scripts.
    </para>
    <para>
      Instead, each Graphite font contains a complete set of rules that
      implement the required shaping model for the intended
      script. These rules include finite-state machines to match
      sequences of codepoints to the shaping operations to perform.
    </para>
    <para>
      Graphite shaping can perform the same shaping operations used in
      OpenType shaping, as well as other functions that have not been
      defined for OpenType shaping.
    </para>
  </section>

  <section id="aat-shaping">
    <title>AAT shaping</title>
    <para>
      In contrast to OpenType shaping, AAT shaping does not specify a
      predefined set of shaping models or a set of supported scripts.
    </para>
    <para>
      Instead, each AAT font includes a complete set of rules that
      implement the desired shaping model for the intended
      script. These rules include finite-state machines to match glyph
      sequences to the shaping operations to perform.
    </para>
    <para>
      Notably, AAT shaping rules are expressed for glyphs in the font,
      not for Unicode codepoints. AAT shaping can perform the same
      shaping operations used in OpenType shaping, as well as other
      functions that have not been defined for OpenType shaping.
    </para>
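    <para>
      To illustrate the general idea of a finite-state machine that
      matches glyph sequences and triggers shaping operations, the
      following sketch forms an "f_i" ligature. It does not reproduce
      the actual table formats used by AAT or Graphite fonts; the
      states, glyph names, and the <literal>ligate_fi</literal> action
      are all hypothetical.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Hypothetical transition table: (state, glyph) -> (next_state, action).
# Glyph names and actions are illustrative, not taken from a real font.
START, SAW_F = "start", "saw_f"
TRANSITIONS = {
    (START, "f"): (SAW_F, None),
    (SAW_F, "f"): (SAW_F, None),
    (SAW_F, "i"): (START, "ligate_fi"),
}

def run_machine(glyphs):
    """Feed glyph names through the machine and apply its actions."""
    out, state = [], START
    for glyph in glyphs:
        # Any (state, glyph) pair not in the table resets the machine.
        state, action = TRANSITIONS.get((state, glyph), (START, None))
        if action == "ligate_fi":
            out[-1] = "f_i"  # replace the pending "f" with the ligature glyph
        else:
            out.append(glyph)
    return out

print(run_machine(["o", "f", "f", "i", "c", "e"]))
```
]]></programlisting>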
  </section>
</chapter>