1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 2 "http://www.w3.org/TR/html4/loose.dtd"> 3 <html> 4 5 <head> 6 <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 7 <meta http-equiv="Content-Language" content="en-us"> 8 <link rel="stylesheet" href="http://www.unicode.org/reports/reports.css" 9 type="text/css"> 10 <title>UTS #35: Unicode LDML: Collation</title> 11 <style type="text/css"> 12 <!-- 13 .dtd { 14 font-family: monospace; 15 font-size: 90%; 16 background-color: #CCCCFF; 17 border-style: dotted; 18 border-width: 1px; 19 } 20 21 .xmlExample { 22 font-family: monospace; 23 font-size: 80% 24 } 25 26 .blockedInherited { 27 font-style: italic; 28 font-weight: bold; 29 border-style: dashed; 30 border-width: 1px; 31 background-color: #FF0000 32 } 33 34 .inherited { 35 font-weight: bold; 36 border-style: dashed; 37 border-width: 1px; 38 background-color: #00FF00 39 } 40 41 .element { 42 font-weight: bold; 43 color: red; 44 } 45 46 .attribute { 47 font-weight: bold; 48 color: maroon; 49 } 50 51 .attributeValue { 52 font-weight: bold; 53 color: blue; 54 } 55 56 li, p { 57 margin-top: 0.5em; 58 margin-bottom: 0.5em 59 } 60 61 h2, h3, h4, table { 62 margin-top: 1.5em; 63 margin-bottom: 0.5em; 64 } 65 --> 66 </style> 67 </head> 68 69 <body> 70 71 <table class="header" width="100%"> 72 <tr> 73 <td class="icon"><a href="http://unicode.org"> <img 74 alt="[Unicode]" src="http://unicode.org/webscripts/logo60s2.gif" 75 width="34" height="33" 76 style="vertical-align: middle; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px; border-top-width: 0px;"></a> 77 <a class="bar" href="http://www.unicode.org/reports/">Technical 78 Reports</a></td> 79 </tr> 80 <tr> 81 <td class="gray"> </td> 82 </tr> 83 </table> 84 <div class="body"> 85 <h2 style="text-align: center"> 86 Unicode Technical 87 Standard #35 88 </h2> 89 <h1> 90 Unicode Locale Data Markup Language (LDML)<br>Part 5: Collation 91 </h1> 92 93 <!-- At least the first row of this header table should be identical across the parts of this UTS. --> 94 <table border="1" cellpadding="2" cellspacing="0" class="wide"> 95 <tr> 96 <td>Version</td> 97 <td>34</td> 98 </tr> 99 <tr> 100 <td>Editors</td> 101 <td><a 102 href="https://plus.google.com/117587389715494866571?rel=author"> 103 Markus Scherer</a> (<a href="mailto:markus.icu (a] gmail.com">markus.icu (a] gmail.com</a>) 104 and <a href="tr35.html#Acknowledgments">other CLDR committee 105 members</a></td> 106 </tr> 107 </table> 108 109 <p> 110 For the full header, summary, and status, see <a href="tr35.html"> 111 Part 1: Core</a> 112 </p> 113 114 <h3> 115 <i>Summary</i> 116 </h3> 117 <p> 118 This document describes parts of an XML format (<i>vocabulary</i>) 119 for the exchange of structured locale data. This format is used in 120 the <a href="http://cldr.unicode.org/">Unicode Common Locale Data 121 Repository</a>. 122 </p> 123 124 <p> 125 This is a partial document, describing only those parts of the LDML 126 that are relevant for collation (sorting, searching & grouping). 127 For the other parts of the LDML see the <a href="tr35.html">main 128 LDML document</a> and the links above. 129 </p> 130 131 <h3> 132 <i>Status</i> 133 </h3> 134 135 <!-- NOT YET APPROVED 136 <p> 137 <i class="changed">This is a<b><font color="#ff3333"> 138 draft </font></b>document which may be updated, replaced, or superseded by 139 other documents at any time. Publication does not imply endorsement 140 by the Unicode Consortium. This is not a stable document; it is 141 inappropriate to cite this document as other than a work in 142 progress. 143 </i> 144 </p> 145 END NOT YET APPROVED --> 146 <!-- APPROVED --> 147 <p> 148 <i>This document has been reviewed by Unicode members and other 149 interested parties, and has been approved for publication by the 150 Unicode Consortium. This is a stable document and may be used as 151 reference material or cited as a normative reference by other 152 specifications.</i> 153 </p> 154 <!-- END APPROVED --> 155 156 157 <blockquote> 158 <p> 159 <i><b>A Unicode Technical Standard (UTS)</b> is an independent 160 specification. Conformance to the Unicode Standard does not imply 161 conformance to any UTS.</i> 162 </p> 163 </blockquote> 164 <p> 165 <i>Please submit corrigenda and other comments with the CLDR bug 166 reporting form [<a href="tr35.html#Bugs">Bugs</a>]. Related 167 information that is useful in understanding this document is found 168 in the <a href="tr35.html#References">References</a>. For the latest 169 version of the Unicode Standard see [<a href="tr35.html#Unicode">Unicode</a>]. 170 For a list of current Unicode Technical Reports see [<a 171 href="tr35.html#Reports">Reports</a>]. For more information about 172 versions of the Unicode Standard, see [<a href="tr35.html#Versions">Versions</a>]. 173 </i> 174 </p> 175 <h2> 176 <a name="Parts" href="#Parts">Parts</a> 177 </h2> 178 179 <!-- This section of Parts should be identical in all of the parts of this UTS. --> 180 <p>The LDML specification is divided into the following parts:</p> 181 <ul class="toc"> 182 <li>Part 1: <a href="tr35.html#Contents">Core</a> (languages, 183 locales, basic structure) 184 </li> 185 <li>Part 2: <a href="tr35-general.html#Contents">General</a> 186 (display names & transforms, etc.) 187 </li> 188 <li>Part 3: <a href="tr35-numbers.html#Contents">Numbers</a> 189 (number & currency formatting) 190 </li> 191 <li>Part 4: <a href="tr35-dates.html#Contents">Dates</a> (date, 192 time, time zone formatting) 193 </li> 194 <li>Part 5: <a href="tr35-collation.html#Contents">Collation</a> 195 (sorting, searching, grouping) 196 </li> 197 <li>Part 6: <a href="tr35-info.html#Contents">Supplemental</a> 198 (supplemental data) 199 </li> 200 <li>Part 7: <a href="tr35-keyboards.html#Contents">Keyboards</a> 201 (keyboard mappings) 202 </li> 203 </ul> 204 205 <h2> 206 <a name="Contents" href="#Contents">Contents of Part 5, Collation</a> 207 </h2> 208 <!-- START Generated TOC: CheckHtmlFiles --> 209 <ul class="toc"> 210 <li>1 <a href="#CLDR_Collation">CLDR Collation</a> 211 <ul class="toc"> 212 <li>1.1 <a href="#CLDR_Collation_Algorithm">CLDR Collation 213 Algorithm</a> 214 <ul class="toc"> 215 <li>1.1.1 <a href="#Algorithm_FFFE">U+FFFE</a></li> 216 <li>1.1.2 <a href="#Context_Sensitive_Mappings">Context-Sensitive 217 Mappings</a></li> 218 <li>1.1.3 <a href="#Algorithm_Case">Case Handling</a></li> 219 <li>1.1.4 <a href="#Algorithm_Reordering_Groups">Reordering 220 Groups</a></li> 221 <li>1.1.5 <a href="#Combining_Rules">Combining Rules</a></li> 222 </ul> 223 </li> 224 </ul> 225 </li> 226 <li>2 <a href="#Root_Collation">Root Collation</a> 227 <ul class="toc"> 228 <li>2.1 <a href="#grouping_classes_of_characters">Grouping 229 classes of characters</a></li> 230 <li>2.2 <a href="#non_variable_symbols">Non-variable 231 symbols</a></li> 232 <li>2.3 <a href="#tibetan_contractions">Additional 233 contractions for Tibetan</a></li> 234 <li>2.4 <a href="#tailored_noncharacter_weights">Tailored 235 noncharacter weights</a></li> 236 <li>2.5 <a href="#Root_Data_Files">Root Collation Data 237 Files</a></li> 238 <li>2.6 <a href="#Root_Data_File_Formats">Root Collation 239 Data File Formats</a> 240 <ul class="toc"> 241 <li>2.6.1 <a href="#File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a></li> 242 <li>2.6.2 <a href="#File_Format_FractionalUCA_txt">FractionalUCA.txt</a></li> 243 <li>2.6.3 <a href="#File_Format_UCA_Rules_txt">UCA_Rules.txt</a></li> 244 </ul> 245 </li> 246 </ul> 247 </li> 248 <li>3 <a href="#Collation_Tailorings">Collation Tailorings</a> 249 <ul class="toc"> 250 <li>3.1 <a href="#Collation_Types">Collation Types</a> 251 <ul class="toc"> 252 <li>3.1.1 <a href="#Collation_Type_Fallback">Collation 253 Type Fallback</a> 254 <ul class="toc"> 255 <li>Table: <a 256 href="#Sample_requested_and_actual_collation_locales_and_types">Sample 257 requested and actual collation locales and types</a></li> 258 </ul> 259 </li> 260 </ul> 261 </li> 262 <li>3.2 <a href="#Collation_Version">Version</a></li> 263 <li>3.3 <a href="#Collation_Element">Collation Element</a></li> 264 <li>3.4 <a href="#Setting_Options">Setting Options</a> 265 <ul class="toc"> 266 <li>Table: <a href="#Collation_Settings">Collation 267 Settings</a></li> 268 <li>3.4.1 <a href="#Common_Settings">Common settings 269 combinations</a></li> 270 <li>3.4.2 <a href="#Normalization_Setting">Notes on the 271 normalization setting</a></li> 272 <li>3.4.3 <a href="#Variable_Top_Settings">Notes on 273 variable top settings</a></li> 274 </ul> 275 </li> 276 <li>3.5 <a href="#Rules">Collation Rule Syntax</a></li> 277 <li>3.6 <a href="#Orderings">Orderings</a> 278 <ul class="toc"> 279 <li>Table: <a href="#Specifying_Collation_Ordering">Specifying 280 Collation Ordering</a></li> 281 <li>Table: <a href="#Abbreviating_Ordering_Specifications">Abbreviating 282 Ordering Specifications</a></li> 283 </ul> 284 </li> 285 <li>3.7 <a href="#Contractions">Contractions</a> 286 <ul class="toc"> 287 <li>Table: <a href="#Specifying_Contractions">Specifying 288 Contractions</a></li> 289 </ul> 290 </li> 291 <li>3.8 <a href="#Expansions">Expansions</a></li> 292 <li>3.9 <a href="#Context_Before">Context Before</a> 293 <ul class="toc"> 294 <li>Table: <a href="#Specifying_Previous_Context">Specifying 295 Previous Context</a></li> 296 </ul> 297 </li> 298 <li>3.10 <a href="#Placing_Characters_Before_Others">Placing 299 Characters Before Others</a></li> 300 <li>3.11 <a href="#Logical_Reset_Positions">Logical Reset 301 Positions</a> 302 <ul class="toc"> 303 <li>Table: <a href="#Specifying_Logical_Positions">Specifying 304 Logical Positions</a></li> 305 </ul> 306 </li> 307 <li>3.12 <a href="#Special_Purpose_Commands">Special-Purpose 308 Commands</a> 309 <ul class="toc"> 310 <li>Table: <a href="#Special_Purpose_Elements">Special-Purpose 311 Elements</a></li> 312 </ul> 313 </li> 314 <li>3.13 <a href="#Script_Reordering">Collation Reordering</a> 315 <ul class="toc"> 316 <li>3.13.1 <a href="#Interpretation_reordering">Interpretation 317 of a reordering list</a></li> 318 <li>3.13.2 <a href="#Reordering_Groups_allkeys">Reordering 319 Groups for allkeys.txt</a></li> 320 </ul> 321 </li> 322 <li>3.14 <a href="#Case_Parameters">Case Parameters</a> 323 <ul class="toc"> 324 <li>3.14.1 <a href="#Case_Untailored">Untailored 325 Characters</a></li> 326 <li>3.14.2 <a href="#Case_Weights">Compute Modified 327 Collation Elements</a></li> 328 <li>3.14.3 <a href="#Case_Tailored">Tailored Strings</a></li> 329 </ul> 330 </li> 331 <li>3.15 <a href="#Visibility">Visibility</a></li> 332 <li>3.16 <a href="#Collation_Indexes">Collation Indexes</a> 333 <ul class="toc"> 334 <li>3.16.1 <a href="#Index_Characters">Index Characters</a></li> 335 <li>3.16.2 <a href="#CJK_Index_Markers">CJK Index 336 Markers</a></li> 337 </ul> 338 </li> 339 </ul> 340 </li> 341 </ul> 342 <!-- END Generated TOC: CheckHtmlFiles --> 343 344 <h2> 345 1 <a name="CLDR_Collation" href="#CLDR_Collation">CLDR Collation</a> 346 </h2> 347 <p>Collation is the general term for the process and function of 348 determining the sorting order of strings of characters, for example 349 for lists of strings presented to users, or in databases for sorting 350 and selecting records.</p> 351 352 <p>Collation varies by language, by application (some languages 353 use special phonebook sorting), and other criteria (for example, 354 phonetic vs. visual).</p> 355 356 <p> 357 CLDR provides collation data for many languages and styles. The data 358 supports not only sorting but also language-sensitive searching and 359 grouping under index headers. All CLDR collations are based on the [<a 360 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] default 361 order, with common modifications applied in the CLDR root collation, 362 and further tailored for language and style as needed. 363 </p> 364 365 <h3> 366 1.1 <a name="CLDR_Collation_Algorithm" 367 href="#CLDR_Collation_Algorithm">CLDR Collation Algorithm</a> 368 </h3> 369 370 <p> 371 The CLDR collation algorithm is an extension of the <a 372 href="http://www.unicode.org/reports/tr10/#Main_Algorithm">Unicode 373 Collation Algorithm</a>. 374 </p> 375 376 <h4> 377 1.1.1 <a name="Algorithm_FFFE" href="#Algorithm_FFFE">U+FFFE</a> 378 </h4> 379 380 <p> 381 U+FFFE maps to a CE with a minimal, unique primary weight. Its 382 primary weight is not "variable": U+FFFE must not become ignorable in 383 alternate handling. On the identical level, a minimal, unique 384 weight must be emitted for U+FFFE as well. This allows for <a 385 href="http://www.unicode.org/reports/tr10/#Merging_Sort_Keys">Merging 386 Sort Keys</a> within code point space. 387 </p> 388 <p> 389 For example, when sorting names in a database, a sortable string can 390 be formed with <em>last_name</em> + '\uFFFE' + <em>first_name</em>. 391 These strings would sort properly, without ever comparing the last 392 part of a last name with the first part of another first name. 393 </p> 394 395 <p> 396 For backwards secondary level sorting, text <i>segments</i> separated 397 by U+FFFE are processed in forward segment order, and <i>within</i> 398 each segment the secondary weights are compared backwards. This is so 399 that such combined strings are processed consistently with merging 400 their sort keys (for example, by concatenating them level by level 401 with a low separator). 402 </p> 403 404 <p class="note"> 405 Note: With unique, low weights on <i>all</i> levels it is possible to 406 achieve 407 <code>sortkey(str1 + "\uFFFE" + str2) == 408 mergeSortkeys(sortkey(str1), sortkey(str2))</code> 409 . When that is not necessary, then code can be a little simpler (no 410 special handling for U+FFFE except for backwards-secondary), sort 411 keys can be a little shorter (when using compressible common 412 non-primary weights for U+FFFE), and another low weight can be used 413 in tailorings. 414 </p> 415 416 <h4> 417 1.1.2 <a name="Context_Sensitive_Mappings" 418 href="#Context_Sensitive_Mappings">Context-Sensitive Mappings</a> 419 </h4> 420 421 <p>Contraction matching, as in the UCA, starts from the first 422 character of the contraction string. It slows down processing of that 423 first character even when none of its contractions matches. In some 424 cases, it is preferrable to change such contractions to mappings with 425 a prefix (context before a character), so that complex processing is 426 done only when the less-frequently occurring trailing character is 427 encountered.</p> 428 429 <p>For example, the DUCET contains contractions for several 430 variants of L (L followed by middle dot). Collating ASCII text is 431 slowed down by contraction matching starting with L/l. In the CLDR 432 root collation, these contractions are replaced by prefix mappings 433 (L|) which are triggered only when the middle dot is encountered. 434 CLDR also uses prefix rules in the Japanese tailoring, for processing 435 of Hiragana/Katakana length and iteration marks.</p> 436 437 <p>The mapping is conditional on the prefix match but does not 438 change the mappings for the preceding text. As a result, a 439 contraction mapping for "px" can be replaced by a prefix rule "p|x" 440 only if px maps to the collation elements for p followed by the 441 collation elements for "x if after p". In the DUCET, L maps to CE(L) 442 followed by a special secondary CE (which differs from CE() when 443 is not preceded by L). In the CLDR root collation, L has no 444 context-sensitive mappings, but maps to that special secondary CE 445 if preceded by L.</p> 446 447 <p>A prefix mapping for p|x behaves mostly like the contraction 448 px, except when there is a contraction that overlaps with the prefix, 449 for example one for "op". A contraction matches only new text (and 450 consumes it), while a prefix matches only already-consumed text.</p> 451 <ul> 452 <li>With mappings for "op" and "px", only the first contraction 453 matches in text "opx". (It consumes the "op" characters, and there 454 is no context-sensitive mapping for x.)</li> 455 <li>With mappings for "op" and "p|x", both the contraction and 456 the prefix rule match in text "opx". (The prefix always matches 457 already-consumed characters, regardless of whether they mapped as 458 part of contractions.)</li> 459 </ul> 460 461 <p class="note"> 462 Note: Matching of discontiguous contractions should be implemented 463 without rewriting the text (unlike in the [<a 464 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] algorithm 465 specification), so that prefix matching is predictable. (It should 466 also help with contraction matching performance.) An implementation 467 that does rewrite the text, as in the UCA, will get different results 468 for some (unusual) combinations of contractions, prefix rules, and 469 input text. 470 </p> 471 472 <p>Prefix matching uses a simple longest-match algorithm (op|c 473 wins over p|c). It is recommended that prefix rules be limited to 474 mappings where both the prefix string and the mapped string begin 475 with an NFC boundary (that is, with a normalization starter that does 476 not combine backwards). (In op|ch both o and c should be starters 477 (ccc=0) and NFC_QC=Yes.) Otherwise, prefix matching would be affected 478 by canonical reordering and discontiguous matching, like 479 contractions. Prefix matching is thus always contiguous.</p> 480 481 <p>A character can have mappings with both prefixes (context 482 before) and contraction suffixes. Prefixes are matched first. This is 483 to keep them reasonably implementable: When there is a mapping with 484 both a prefix and a contraction suffix (like in Japanese: |), then 485 the matching needs to go in both directions. The contraction might 486 involve discontiguous matching, which needs complex text iteration 487 and handling of skipped combining marks, and will consume the 488 matching suffix. Prefix matching should be first because, regardless 489 of whether there is a match, the implementation will always return to 490 the original text index (right after the prefix) from where it will 491 start to look at all of the contractions for that prefix.</p> 492 493 <p>If there is a match for a prefix but no match for any of the 494 suffixes for that prefix, then fall back to mappings with the 495 next-longest matching prefix, and so on, ultimately to mappings with 496 no prefix. (Otherwise mappings with longer prefixes would hide 497 mappings with shorter prefixes.)</p> 498 499 <p>Consider the following mappings.</p> 500 <ol> 501 <li>p CE(p)</li> 502 <li>h CE(h)</li> 503 <li>c CE(c)</li> 504 <li>ch CE(d)</li> 505 <li>p|c CE(u)</li> 506 <li>p|ci CE(v)</li> 507 <li>p| CE(w)</li> 508 <li>op|ck CE(x)</li> 509 </ol> 510 511 <p>With these, text collates like this:</p> 512 <ul> 513 <li>pc CE(p)CE(u)</li> 514 <li>pci CE(p)CE(v)</li> 515 <li>pch CE(p)CE(u)CE(h)</li> 516 <li>p CE(p)CE(w)</li> 517 <li>p CE(p)CE(w)CE(U+0323) // discontiguous</li> 518 <li>opck CE(o)CE(p)CE(x)</li> 519 <li>opch CE(o)CE(p)CE(u)CE(h)</li> 520 </ul> 521 522 <p> 523 However, if the mapping p|c CE(u) is missing, then text "pch" maps 524 to CE(p)CE(d), "opch" maps to CE(o)CE(p)CE(d), and "p" maps to 525 CE(p)CE(c)CE(U+0323)CE(U+0302) (because discontiguous contraction 526 matching extends <i>an existing match</i> by one non-starter at a 527 time). 528 </p> 529 530 <h4> 531 1.1.3 <a name="Algorithm_Case" href="#Algorithm_Case">Case 532 Handling</a> 533 </h4> 534 <p> 535 CLDR specifies how to sort lowercase or uppercase first, as a 536 stronger distinction than other tertiary variants (<strong>caseFirst</strong>) 537 or while completely ignoring all other tertiary distinctions (<strong>caseLevel</strong>). 538 See <i>Section 3.3 <a href="#Setting_Options">Setting Options</a></i> 539 and <i>Section 3.13 <a href="#Case_Parameters">Case 540 Parameters</a></i>. 541 </p> 542 543 <h4> 544 1.1.4 <a name="Algorithm_Reordering_Groups" 545 href="#Algorithm_Reordering_Groups">Reordering Groups</a> 546 </h4> 547 <p>CLDR specifies how to do parametric reordering of groups of 548 scripts (e.g., native script first) as well as special groups 549 (e.g., digits after letters), and provides data for the effective 550 implementation of such reordering.</p> 551 552 <h4> 553 1.1.5 <a name="Combining_Rules" 554 href="#Combining_Rules">Combining Rules</a> 555 </h4> 556 <p>Rules from different sources can be combined, with the later rules overriding the earlier ones. The following is an example of how this can be useful.</p> 557 <p>There is a root collation for "emoji" in CLDR. So use of "-u-co-emoji" in a Unicode locale identifier will access that ordering. </p> 558 <p>Example, using ICU:</p> 559 <blockquote> 560 <p>collator = Collator.getInstance(ULocale.forLanguageTag("en-u-co-emoji")); </p> 561 </blockquote> 562 <p>However, use of the emoji will supplant the language's customizations. So the above is the equivalent of: </p> 563 <blockquote> 564 <p>collator = Collator.getInstance(ULocale.forLanguageTag("und-u-co-emoji")); </p> 565 </blockquote> 566 <p>The same structure will not work for a language that does require customization, like Danish. That is, the following will fail.</p> 567 <blockquote> 568 <p> collator = Collator.getInstance(ULocale.forLanguageTag("da-u-co-emoji")); </p> 569 </blockquote> 570 <p>For that, a slightly more cumbersome method needs to be employed, which is to take the rules for Danish, and explicitly add the rules for emoji. </p> 571 <blockquote> 572 <p>RuleBasedCollator collator = new RuleBasedCollator(<br> 573 ((RuleBasedCollator) Collator.getInstance(ULocale.forLanguageTag("da"))).getRules() +<br> 574 ((RuleBasedCollator) Collator.getInstance(ULocale.forLanguageTag("und-u-co-emoji")))<br> 575 .getRules());</p> 576 </blockquote> 577 <p>The following table shows the differences. When emoji ordering is supported, the two faces will be adjacent. When Danish ordering is supported, the is after the y.</p> 578 <table class='simple'> 579 <tbody> 580 <tr> 581 <td>code point order</td> 582 <td>,</td> 583 <td></td> 584 <td></td> 585 <td>Z</td> 586 <td>a</td> 587 <td>y</td> 588 <td></td> 589 <td></td> 590 <td></td> 591 <td></td> 592 <td></td> 593 </tr> 594 <tr> 595 <td>en</td> 596 <td>,</td> 597 <td></td> 598 <td></td> 599 <td></td> 600 <td>a</td> 601 <td></td> 602 <td>y</td> 603 <td>Z</td> 604 <td></td> 605 </tr> 606 <tr> 607 <td>en-u-co-emoji</td> 608 <td>,</td> 609 <td></td> 610 <td></td> 611 <td></td> 612 <td>a</td> 613 <td></td> 614 <td>y</td> 615 <td>Z</td> 616 <td></td> 617 </tr> 618 <tr> 619 <td>da</td> 620 <td>,</td> 621 <td></td> 622 <td></td> 623 <td></td> 624 <td>a</td> 625 <td>y</td> 626 <td><strong><u></u></strong></td> 627 <td>Z</td> 628 <td></td> 629 </tr> 630 <tr> 631 <td>da-u-co-emoji</td> 632 <td>,</td> 633 <td></td> 634 <td></td> 635 <td></td> 636 <td>a</td> 637 <td><strong><u></u></strong></td> 638 <td>y</td> 639 <td>Z</td> 640 <td></td> 641 </tr> 642 <tr> 643 <td>combined rules</td> 644 <td>,</td> 645 <td></td> 646 <td></td> 647 <td></td> 648 <td>a</td> 649 <td>y</td> 650 <td><strong><u></u></strong></td> 651 <td>Z</td> 652 <td></td> 653 </tr> 654 </tbody> 655 </table> 656 657 <br> 658 <p> </p> 659 <p> </p> 660 661 <h2> 662 2 <a name="Root_Collation" href="#Root_Collation">Root Collation</a> 663 </h2> 664 <p> 665 The CLDR root collation order is based on the <a 666 href="http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table">Default 667 Unicode Collation Element Table (DUCET)</a> defined in <em>UTS #10: 668 Unicode Collation Algorithm</em> [<a 669 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>]. It is 670 used by all other locales by default, or as the base for their 671 tailorings. (For a chart view of the UCA, see Collation Chart [<a 672 href="tr35.html#UCAChart">UCAChart</a>].) 673 </p> 674 <p>Starting with CLDR 1.9, CLDR uses modified tables for the root 675 collation order. The root locale ordering is tailored in the 676 following ways:</p> 677 678 <h3> 679 2.1 <a name="grouping_classes_of_characters" 680 href="#grouping_classes_of_characters">Grouping classes of 681 characters</a> 682 </h3> 683 <p>As of Version 6.1.0, the DUCET puts characters into the 684 following ordering:</p> 685 <ul> 686 <li>First "common characters": whitespace, 687 punctuation, general symbols, some numbers, currency symbols, and 688 other numbers.</li> 689 <li>Then "script characters": Latin, Greek, and the 690 rest of the scripts.</li> 691 </ul> 692 <p>(There are a few exceptions to this general ordering.)</p> 693 <p>The CLDR root locale modifies the DUCET tailoring by ordering 694 the common characters more strictly by category:</p> 695 <ul> 696 <li>whitespace, punctuation, general symbols, currency symbols, 697 and numbers.</li> 698 </ul> 699 <p>What the regrouping allows is for users to parametrically 700 reorder the groups. For example, users can reorder numbers after all 701 scripts, or reorder Greek before Latin.</p> 702 <p>The relative order within each of these groups still matches 703 the DUCET. Symbols, punctuation, and numbers that are grouped with a 704 particular script stay with that script. The differences between CLDR 705 and the DUCET order are:</p> 706 <ol> 707 <li>CLDR groups the numbers together after currency symbols, 708 instead of splitting them with some before and some after. Thus the 709 following are put <em>after</em> currencies and just before all the 710 other numbers. 711 <blockquote> 712 <p> 713 U+09F4 ( ) [No] BENGALI CURRENCY NUMERATOR ONE<br> ...<br> 714 U+1D371 ( ) [No] COUNTING ROD TENS DIGIT NINE 715 </p> 716 </blockquote> 717 </li> 718 <li>CLDR handles a few other characters differently 719 <ol> 720 <li>U+10A7F ( ) [Po] OLD SOUTH ARABIAN NUMERIC INDICATOR is 721 put with punctuation, not symbols</li> 722 <li>U+20A8 ( ) [Sc] RUPEE SIGN and U+FDFC ( ) [Sc] RIAL 723 SIGN are put with currency signs, not with R and REH.</li> 724 </ol> 725 </li> 726 </ol> 727 728 <h3> 729 2.2 <a name="non_variable_symbols" href="#non_variable_symbols">Non-variable 730 symbols</a> 731 </h3> 732 <p> 733 There are multiple <a 734 href="http://www.unicode.org/reports/tr10/#Variable_Weighting">Variable-Weighting</a> 735 options in the UCA for symbols and punctuation, including <em>non-ignorable</em> 736 and <em>shifted</em>. With the <em>shifted</em> option, almost all 737 symbols and punctuation are ignoredexcept at a fourth level. The 738 CLDR root locale ordering is modified so that symbols are not 739 affected by the <em>shifted</em> option. That is, by default, symbols 740 are not variable in CLDR. So <em>shifted</em> only causes 741 whitespace and punctuation to be ignored, but not symbols (like ). 742 The DUCET behavior can be specified with a locale ID using the 743 "kv" keyword, to set the Variable section to include all of 744 the symbols below it, or be set parametrically where implementations 745 allow access. 746 </p> 747 <p>See also:</p> 748 <ul> 749 <li><i>Section 3.3, <a href="#Setting_Options">Setting 750 Options</a></i></li> 751 <li><a href="http://www.unicode.org/charts/collation/">http://www.unicode.org/charts/collation/</a></li> 752 </ul> 753 754 <h3> 755 2.3 <a name="tibetan_contractions" href="#tibetan_contractions">Additional 756 contractions for Tibetan</a> 757 </h3> 758 <p> 759 Ten contractions are added for Tibetan: Two to fulfill <a 760 href="http://www.unicode.org/reports/tr10/#WF5">well-formedness 761 condition 5</a>, and eight more to preserve the default order for 762 Tibetan. For details see <i>UTS #10, Section 3.8.2, <a 763 href="http://www.unicode.org/reports/tr10/#Well_Formed_DUCET">Well-Formedness 764 of the DUCET</a></i>. 765 </p> 766 767 <h3> 768 2.4 <a name="tailored_noncharacter_weights" 769 href="#tailored_noncharacter_weights">Tailored noncharacter 770 weights</a> 771 </h3> 772 <p>U+FFFE and U+FFFF have special tailorings:</p> 773 <blockquote> 774 <p> 775 <strong>U+FFFF: </strong>This code point is tailored to have a 776 primary weight higher than all other characters. This allows the 777 reliable specification of a range, such as “Sch” X 778 “Sch\uFFFF”, to include all strings starting with 779 "sch" or equivalent. 780 </p> 781 <p> 782 <strong>U+FFFE: </strong>This code point produces a CE with minimal, 783 unique weights on primary and identical levels. For details see the 784 <i><a href="#Algorithm_FFFE">CLDR Collation Algorithm</a></i> above. 785 </p> 786 </blockquote> 787 <p> 788 UCA (beginning with version 6.3) also maps <strong>U+FFFD</strong> to 789 a special collation element with a very high primary weight, so that 790 it is reliably non-<a 791 href="http://www.unicode.org/reports/tr10/#Variable_Weighting">variable</a>, 792 for use with <a 793 href="http://www.unicode.org/reports/tr10/#Handling_Illformed">ill-formed 794 code unit sequences</a>. 795 </p> 796 <p> 797 In CLDR, so as to maintain the special collation elements, <strong>U+FFFD..U+FFFF 798 </strong> are not further tailorable, and nothing can tailor to them. That is, 799 neither can occur in a collation rule. For example, the following 800 rules are illegal: 801 </p> 802 <p> 803 <code>&\uFFFF < x</code> 804 </p> 805 <p> 806 <code>&x <\uFFFF</code> 807 <br> 808 </p> 809 810 <p class="note"> 811 <b>Note:</b> 812 </p> 813 <ul> 814 <li class="note">Java uses an early version of this collation 815 syntax, but has not been updated recently. It does not support any 816 of the syntax marked with [...], and its default table is not the 817 DUCET nor the CLDR root collation.</li> 818 </ul> 819 820 <h3> 821 2.5 <a name="Root_Data_Files" href="#Root_Data_Files">Root 822 Collation Data Files</a> 823 </h3> 824 <p> 825 The CLDR root collation data files are in the CLDR repository and 826 release, under the path <a 827 href="http://unicode.org/repos/cldr/tags/latest/common/uca/">common/uca/</a>. 828 </p> 829 830 <p> 831 For most data files there are <strong>_SHORT</strong> versions 832 available. They contain the same data but only minimal comments, to 833 reduce the file sizes. 834 </p> 835 836 <p>Comments with DUCET-style weights in files other than 837 allkeys_CLDR.txt and allkeys_DUCET.txt use the weights defined in 838 allkeys_CLDR.txt.</p> 839 <ul> 840 <li><strong>allkeys_CLDR</strong> - A file that provides a 841 remapping of UCA DUCET weights for use with CLDR.</li> 842 <li><strong>allkeys_DUCET</strong> - The same as DUCET 843 allkeys.txt, but in alternate=non-ignorable sort order, for easier 844 comparison with allkeys_CLDR.txt.</li> 845 <li><strong>FractionalUCA</strong> - A file that provides a 846 remapping of UCA DUCET weights for use with CLDR. The weight values 847 are modified: 848 <ul> 849 <li>The weights have variable length, with 1..4 bytes each. 850 Each secondary or tertiary weight currently uses at most 2 bytes.</li> 851 <li>There are tailoring gaps between adjacent weights, so that 852 a number of characters can be tailored to sort between any two 853 root collation elements.</li> 854 <li>There are collation elements with primary weights at the 855 boundaries between reordering groups and Unicode scripts, so that 856 tailoring around the first or last primary of a group/script 857 results in new collation elements that sort and reorder together 858 with that group or script. These boundary weights also define the 859 primary weight ranges for parametric group and script reordering. 860 </li> 861 </ul> An implementation may modify the weights further to fit the needs 862 of its data structures.</li> 863 <li><strong>UCA_Rules</strong> - A file that specifies the root 864 collation order in the form of <a href="#Collation_Tailorings">tailoring 865 rules</a>. This is only an approximation of the FractionalUCA data, 866 since the rule syntax cannot express every detail of the collation 867 elements. For example, in the DUCET and in FractionalUCA, tertiary 868 differences are usually expressed with special tertiary weights on 869 all collation elements of an expansion, while a typical from-rules 870 builder will modify the tertiary weight of only one of the collation 871 elements.</li> 872 <li><strong>CollationTest_CLDR</strong> - The CLDR versions of 873 the CollationTest files, which use the tailorings for CLDR. For 874 information on the format, see <a 875 href="http://www.unicode.org/Public/UCA/latest/CollationTest.html">CollationTest.html</a> 876 in the <a href="http://www.unicode.org/reports/tr10/#Data10">UCA 877 data directory</a>. 878 <ul> 879 <li>CollationTest_CLDR_NON_IGNORABLE.txt</li> 880 <li>CollationTest_CLDR_SHIFTED.txt</li> 881 </ul></li> 882 </ul> 883 884 <h3> 885 2.6 <a name="Root_Data_File_Formats" href="#Root_Data_File_Formats">Root 886 Collation Data File Formats</a> 887 </h3> 888 889 <p>The file formats may change between versions of CLDR. The 890 formats for CLDR 23 and beyond are as follows. As usual, text after a 891 # is a comment.</p> 892 893 <h4> 894 2.6.1 <a name="File_Format_allkeys_CLDR_txt" 895 href="#File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a> 896 </h4> 897 <p> 898 This file defines CLDRs tailoring of the DUCET, as described in <i>Section 899 2, <a href="#Root_Collation">Root Collation</a> 900 </i>. 901 </p> 902 <p> 903 The format is similar to that of <a 904 href="http://www.unicode.org/reports/tr10/#File_Format">allkeys.txt</a>, 905 although there may be some differences in whitespace. 906 </p> 907 908 <h4> 909 2.6.2 <a name="File_Format_FractionalUCA_txt" 910 href="#File_Format_FractionalUCA_txt">FractionalUCA.txt</a> 911 </h4> 912 <p>The format is illustrated by the following sample lines, with 913 commentary afterwards.</p> 914 <pre>[UCA version = 6.0.0]</pre> 915 <blockquote> 916 <p>Provides the version number of the UCA table.</p> 917 </blockquote> 918 919 <pre>[Unified_Ideograph 4E00..9FCC FA0E..FA0F FA11 FA13..FA14 FA1F FA21 FA23..FA24 FA27..FA29 3400..4DB5 20000..2A6D6 2A700..2B734 2B740..2B81D]</pre> 920 <blockquote> 921 <p> 922 Lists the ranges of Unified_Ideograph characters in collation order. 923 (New in CLDR 24.) They map to collation elements with <a 924 href="http://www.unicode.org/reports/tr10/#Implicit_Weights">implicit 925 (constructed) primary weights</a>. 926 </p> 927 </blockquote> 928 929 <pre>[radical 6=:----] 930 [radical 210=:--] 931 [radical 210'=:] 932 [radical end]</pre> 933 <blockquote> 934 <p> 935 Data for Unihan radical-stroke order. (New in CLDR 26.) Following 936 the [Unified_Ideograph] line, a section of 937 <code>[radical ...]</code> 938 lines defines a radical-stroke order of the Unified_Ideograph 939 characters. 940 </p> 941 942 <p> 943 For Han characters, an implementation may choose either to implement 944 the order defined in the UCA and the [Unified_Ideograph] data, or to 945 implement the order defined by the 946 <code>[radical ...]</code> 947 lines. Beginning with CLDR 26, the CJK type="unihan" tailorings 948 assume that the root collation order sorts Han characters in Unihan 949 radical-stroke order according to the 950 <code>[radical ...]</code> 951 data. The CollationTest_CLDR files only contain Han characters that 952 are in the same relative order using implicit weights or the 953 radical-stroke order. 954 </p> 955 956 <p> 957 The root collation radical-stroke order is derived from the first 958 (normative) values of the <a 959 href="http://www.unicode.org/reports/tr38/#kRSUnicode">Unihan 960 kRSUnicode</a> field for each Han character. Han characters are ordered 961 by radical, with traditional forms sorting before simplified ones. 962 Characters with the same radical are ordered by residual stroke 963 count. Characters with the same radical-stroke values are ordered by 964 block and code point, as for <a 965 href="http://www.unicode.org/reports/tr10/#Implicit_Weights">UCA 966 implicit weights</a>. 967 </p> 968 969 <p> 970 There is one 971 <code>[radical ...]</code> 972 line per radical, in the order of radical numbers. Each line shows 973 the radical number and the representative characters from the <a 974 href="http://www.unicode.org/reports/tr44/#UCD_Files_Table">UCD 975 file CJKRadicals.txt</a>, followed by a colon (:) and the Han 976 characters with that radical in the order as described above. A 977 range like 978 <code>-</code> 979 indicates that the code points in that range sort in code point 980 order. 981 </p> 982 983 <p> 984 The radical number and characters are informational. The sort order 985 is established only by the order of the 986 <code>[radical ...]</code> 987 lines, and within each line by the characters and ranges between the 988 colon (:) and the bracket (]). 989 </p> 990 991 <p> 992 Each Unified_Ideograph occurs exactly once. Only Unified_Ideograph 993 characters are listed on 994 <code>[radical ...]</code> 995 lines. 996 </p> 997 998 <p> 999 This section is terminated with one 1000 <code>[radical end]</code> 1001 line. 1002 </p> 1003 </blockquote> 1004 1005 <pre>0000; [,,] # Zyyy Cc [0000.0000.0000] * <NULL></pre> 1006 <blockquote> 1007 <p> 1008 Provides a weight line. The first element (before the ";") 1009 is a hex codepoint sequence. The second field is a sequence of 1010 collation elements. Each collation element has 3 parts separated by 1011 commas: the primary weight, secondary weight, and tertiary weight. 1012 The tertiary weight actually consists of two components: the top two 1013 bits (0xC0) are used for the <em>case level</em>, and should be 1014 masked off where a case level is not used. 1015 </p> 1016 <p>A weight is either empty (meaning a zero or ignorable weight) 1017 or is a sequence of one or more bytes. The bytes are interpreted as 1018 a "fraction", meaning that the ordering is 04 < 05 05 1019 < 06. The weights are constructed so that no weight is an initial 1020 subsequence of another: that is, having both the weights 05 and 05 1021 05 is illegal. The above line consists of all ignorable weights.</p> 1022 <p>The vertical bar (|) character is used to indicate context, 1023 as in:</p> 1024 </blockquote> 1025 <pre>006C | 00B7; [, DB A9, 05]</pre> 1026 <blockquote> 1027 This example indicates that if U+00B7 appears immediately after 1028 U+006C, it is given the corresponding collation element instead. This 1029 syntax is roughly equivalent to the following contraction, but is 1030 more efficient. For details see the specification of <i><a 1031 href="#Context_Sensitive_Mappings">Context-Sensitive Mappings</a></i> 1032 above. 1033 </blockquote> 1034 <pre>006C 00B7; <em>CE(006C)</em> [, DB A9, 05]</pre> 1035 <blockquote> 1036 <p>Single-byte primary weights are given to particularly frequent 1037 characters, such as space, digits, and a-z. More frequent characters 1038 are given two-byte weights, while relatively infrequent characters 1039 are given three-byte weights. For example:</p> 1040 </blockquote> 1041 <pre>... 1042 0009; [03 05, 05, 05] # Zyyy Cc [0100.0020.0002] * <CHARACTER TABULATION> 1043 ... 1044 1B60; [06 14 0C, 05, 05] # Bali Po [0111.0020.0002] * BALINESE PAMENENG 1045 ... 1046 0031; [14, 05, 05] # Zyyy Nd [149B.0020.0002] * DIGIT ONE</pre> 1047 <blockquote> 1048 <p>The assignment of 2 vs 3 bytes does not reflect importance, or 1049 exact frequency.</p> 1050 </blockquote> 1051 1052 <pre> 1053 3041; [76 06, 05, 03] # Hira Lo [3888.0020.000D] * HIRAGANA LETTER SMALL A 1054 3042; [76 06, 05, 85] # Hira Lo [3888.0020.000E] * HIRAGANA LETTER A 1055 30A1; [76 06, 05, 10] # Kana Lo [3888.0020.000F] * KATAKANA LETTER SMALL A 1056 30A2; [76 06, 05, 9E] # Kana Lo [3888.0020.0011] * KATAKANA LETTER A</pre> 1057 <blockquote> 1058 <p> 1059 Beginning with CLDR 27, some primary or secondary collation elements 1060 may have below-common tertiary weights (e.g., 1061 <code>03</code> 1062 ), in particular to allow normal Hiragana letters to have common 1063 tertiary weights. 1064 </p> 1065 </blockquote> 1066 1067 <pre># SPECIAL MAX/MIN COLLATION ELEMENTS 1068 FFFE; [02, 05, 05] # Special LOWEST primary, for merge/interleaving 1069 FFFF; [EF FE, 05, 05] # Special HIGHEST primary, for ranges</pre> 1070 <blockquote> 1071 <p>The two tailored noncharacters have their own primary weights. 1072 </p> 1073 </blockquote> 1074 1075 <pre> 1076 F967; [U+4E0D] # Hani Lo [FB40.0020.0002][CE0D.0000.0000] * CJK COMPATIBILITY IDEOGRAPH-F967 1077 2F02; [U+4E36, 10] # Hani So [FB40.0020.0004][CE36.0000.0000] * KANGXI RADICAL DOT 1078 2E80; [U+4E36, 70, 20] # Hani So [FB40.0020.0004][CE36.0000.0000][0000.00FC.0004] * CJK RADICAL REPEAT</pre> 1079 <blockquote> 1080 <p>Some collation elements are specified by reference to other 1081 mappings. This is particularly useful for Han characters which are 1082 given implicit/constructed primary weights; the reference to a 1083 Unified_Ideograph makes these mappings independent of implementation 1084 details. This technique may also be used in other mappings to show 1085 the relationship of character variants.</p> 1086 <p>The referenced character must have a mapping listed earlier in 1087 the file, or the mapping must have been defined via the 1088 [Unified_Ideograph] data line. The referenced character must map to 1089 exactly one collation element.</p> 1090 <p> 1091 <code>[U+4E0D]</code> 1092 copies U+4E0Ds entire collation element. 1093 <code>[U+4E36, 10]</code> 1094 copies U+4E36s primary and secondary weights and specifies a 1095 different tertiary weight. 1096 <code>[U+4E36, 70, 20]</code> 1097 only copies U+4E36s primary weight and specifies other secondary 1098 and tertiary weights. 1099 </p> 1100 <p>FractionalUCA.txt does not have any explicit mappings for 1101 implicit weights. Therefore, an implementation is free to choose an 1102 algorithm for computing implicit weights according to the principles 1103 specified in the UCA.</p> 1104 </blockquote> 1105 1106 <pre> 1107 FDD1 20AC; [0D 20 02, 05, 05] # CURRENCY first primary 1108 FDD1 0034; [0E 02 02, 05, 05] # DIGIT first primary starts new lead byte 1109 FDD0 FF21; [26 02 02, 05, 05] # REORDER_RESERVED_BEFORE_LATIN first primary starts new lead byte 1110 FDD1 004C; [28 02 02, 05, 05] # LATIN first primary starts new lead byte 1111 FDD0 FF3A; [5D 02 02, 05, 05] # REORDER_RESERVED_AFTER_LATIN first primary starts new lead byte 1112 FDD1 03A9; [5F 04 02, 05, 05] # GREEK first primary starts new lead byte (compressible) 1113 FDD1 03E2; [5F 60 02, 05, 05] # COPTIC first primary (compressible)</pre> 1114 <blockquote> 1115 <p> 1116 These are special mappings with primaries at the boundaries of 1117 scripts and reordering groups. They serve as tailoring boundaries, 1118 so that tailoring near the first or last character of a script or 1119 group places the tailored item into the same group. Beginning with 1120 CLDR 24, each of these is a contraction of U+FDD1 with 1121 a character of the corresponding script 1122 (or of the General_Category [Z, P, S, Sc, Nd] 1123 corresponding to a special reordering group), 1124 mapping to the first possible primary weight per 1125 script or group. They can be enumerated for implementations of <a 1126 href="#Collation_Indexes">Collation Indexes</a>. (Earlier versions 1127 mapped contractions with U+FDD0 to the last primary weights of each 1128 group but not each script.) 1129 </p> 1130 <p>Beginning with CLDR 27, these mappings alone define the 1131 boundaries for reordering single scripts. (There are no mappings for 1132 Hrkt, Hans, or Hant because they are not fully distinct scripts; 1133 they share primary weights with other scripts: Hrkt=Hira=Kana & 1134 Hans=Hant=Hani.) There are some reserved ranges, beginning at 1135 boundaries marked with U+FDD0 plus following characters as shown 1136 above. The reserved ranges are not used for collation elements and 1137 are not available for tailoring.</p> 1138 <p>Some primary lead bytes must be reserved so that reordering of 1139 scripts along partial-lead-byte boundaries can split the primary 1140 lead byte and use up a reserved byte. This is for implementations 1141 that write sort keys, which must reorder primary weights by 1142 offsetting them by whole lead bytes. There are reorder-reserved 1143 ranges before and after Latin, so that reordering scripts with few 1144 primary lead bytes relative to Latin can move those scripts into the 1145 reserved ranges without changing the primary weights of any other 1146 script. Each of these boundaries begins with a new two-byte primary; 1147 that is, no two groups/scripts/ranges share the top 16 bits of their 1148 primary weights.</p> 1149 </blockquote> 1150 1151 <pre> 1152 FDD0 0034; [11, 05, 05] # lead byte for numeric sorting</pre> 1153 <blockquote> 1154 <p>This mapping specifies the lead byte for numeric sorting. It 1155 must be different from the lead byte of any other primary weight, 1156 otherwise numeric sorting would generate ill-formed collation 1157 elements. Therefore, this mapping itself must be excluded from the 1158 set of regular mappings. This value can be ignored by 1159 implementations that do not support numeric sorting. (Other 1160 contractions with U+FDD0 can normally be ignored altogether.)</p> 1161 </blockquote> 1162 1163 <pre> 1164 # HOMELESS COLLATION ELEMENTS 1165 FDD0 0063; [, 97, 3D] # [15E4.0020.0004] [1844.0020.0004] [0000.0041.001F] * U+01C6 LATIN SMALL LETTER DZ WITH CARON 1166 FDD0 0064; [, A7, 09] # [15D1.0020.0004] [0000.0056.0004] * U+1DD7 COMBINING LATIN SMALL LETTER C CEDILLA 1167 FDD0 0065; [, B1, 09] # [1644.0020.0004] [0000.0061.0004] * U+A7A1 LATIN SMALL LETTER G WITH OBLIQUE STROKE</pre> 1168 <blockquote> 1169 <p>The DUCET has some weights that don't correspond directly to a 1170 character. To allow for implementations to have a mapping for each 1171 collation element (necessary for certain implementations of 1172 tailoring), this requires the construction of special sequences for 1173 those weights. These collation elements can normally be ignored.</p> 1174 </blockquote> 1175 1176 <p>Next, a number of tables are defined. The function of each of 1177 the tables is summarized afterwards.</p> 1178 1179 <pre># VALUES BASED ON UCA 1180 ... 1181 [first regular [0D 0A, 05, 05]] # U+0060 GRAVE ACCENT 1182 [last regular [7A FE, 05, 05]] # U+1342E EGYPTIAN HIEROGLYPH AA032 1183 [first implicit [E0 04 06, 05, 05]] # CONSTRUCTED 1184 [last implicit [E4 DF 7E 20, 05, 05]] # CONSTRUCTED 1185 [first trailing [E5, 05, 05]] # CONSTRUCTED 1186 [last trailing [E5, 05, 05]] # CONSTRUCTED 1187 ...</pre> 1188 <blockquote> 1189 <p>This table summarizes ranges of important groups of characters 1190 for implementations.</p> 1191 </blockquote> 1192 <pre># Top Byte => Reordering Tokens 1193 [top_byte 00 TERMINATOR ] # [0] TERMINATOR=1 1194 [top_byte 01 LEVEL-SEPARATOR ] # [0] LEVEL-SEPARATOR=1 1195 [top_byte 02 FIELD-SEPARATOR ] # [0] FIELD-SEPARATOR=1 1196 [top_byte 03 SPACE ] # [9] SPACE=1 Cc=6 Zl=1 Zp=1 Zs=1 1197 ...</pre> 1198 <blockquote> 1199 <p>This table defines the reordering groups, for script 1200 reordering. The table maps from the first bytes of the fractional 1201 weights to a reordering token. The format is "[top_byte " 1202 byte-value reordering-token "COMPRESS"? "]". The 1203 "COMPRESS" value is present when there is only one byte in 1204 the reordering token, and primary-weight compression can be applied. 1205 Most reordering tokens are script values; others are special-purpose 1206 values, such as PUNCTUATION. Beginning with CLDR 24, this table 1207 precedes the regular mappings, so that parsers can use this 1208 information while processing and optimizing mappings. Beginning with 1209 CLDR 27, most of this data is irrelevant because single scripts can 1210 be reordered. Only the "COMPRESS" data is still useful.</p> 1211 </blockquote> 1212 <pre># Reordering Tokens => Top Bytes 1213 [reorderingTokens Arab 61=910 62=910 ] 1214 [reorderingTokens Armi 7A=22 ] 1215 [reorderingTokens Armn 5F=82 ] 1216 [reorderingTokens Avst 7A=54 ] 1217 ...</pre> 1218 <blockquote> 1219 <p>This table is an inverse mapping from reordering token to top 1220 byte(s). In terms like "61=910", the first value is the 1221 top byte, while the second is informational, indicating the number 1222 of primaries assigned with that top byte.</p> 1223 </blockquote> 1224 <pre># General Categories => Top Byte 1225 [categories Cc 03{SPACE}=6 ] 1226 [categories Cf 77{Khmr Tale Talu Lana Cham Bali Java Mong Olck Cher Cans Ogam Runr Orkh Vaii Bamu}=2 ] 1227 [categories Lm 0D{SYMBOL}=25 0E{SYMBOL}=22 27{Latn}=12 28{Latn}=12 29{Latn}=12 2A{Latn}=12...</pre> 1228 <blockquote> 1229 <p>This table is informational, providing the top bytes, scripts, 1230 and primaries associated with each general category value.</p> 1231 </blockquote> 1232 <pre># FIXED VALUES 1233 [fixed first implicit byte E0] 1234 [fixed last implicit byte E4] 1235 [fixed first trail byte E5] 1236 [fixed last trail byte EF] 1237 [fixed first special byte F0] 1238 [fixed last special byte FF] 1239 1240 [fixed secondary common byte 05] 1241 [fixed last secondary common byte 45] 1242 [fixed first ignorable secondary byte 80] 1243 1244 [fixed tertiary common byte 05] 1245 [fixed first ignorable tertiary byte 3C] 1246 </pre> 1247 <blockquote> 1248 <p>The final table gives certain hard-coded byte values. The 1249 "trail" area is provided for implementation of the 1250 "trailing weights" as described in the UCA.</p> 1251 </blockquote> 1252 1253 <p class="note">Note: The particular primary lead bytes for Hani 1254 vs. IMPLICIT vs. TRAILING are only an example. An implementation is 1255 free to move them if it also moves the explicit TRAILING weights. 1256 This affects only a small number of explicit mappings in 1257 FractionalUCA.txt, such as for U+FFFD, U+FFFF, and the unassigned 1258 first primary. It is possible to use no SPECIAL bytes at all, and to 1259 use only the one primary lead byte FF for TRAILING weights.</p> 1260 1261 <h4> 1262 2.6.3 <a name="File_Format_UCA_Rules_txt" 1263 href="#File_Format_UCA_Rules_txt">UCA_Rules.txt</a> 1264 </h4> 1265 <p> 1266 The format for this file uses the CLDR collation syntax, see <i>Section 1267 3, <a href="#Collation_Tailorings">Collation Tailorings</a> 1268 </i>. 1269 </p> 1270 1271 1272 <h2> 1273 3 <a name="Collation_Tailorings" href="#Collation_Tailorings">Collation 1274 Tailorings</a> 1275 </h2> 1276 <p class="dtd"><!ELEMENT collations (alias | 1277 (defaultCollation?, collation*, special*)) ></p> 1278 <p class="dtd"><!ELEMENT defaultCollation ( #PCDATA ) ></p> 1279 <p> 1280 This element of the LDML format contains one or more <span 1281 class="element">collation</span> elements, distinguished by type. 1282 Each <span class="element">collation</span> contains elements with 1283 parametric settings, or rules that specify a certain sort order, as a 1284 tailoring of the root order, or both. 1285 </p> 1286 <p class="note"> 1287 Note: CLDR collation tailoring data should follow the <a 1288 href="http://cldr.unicode.org/index/cldr-spec/collation-guidelines">CLDR 1289 Collation Guidelines</a>. 1290 </p> 1291 1292 <h3> 1293 3.1 <a name="Collation_Types" href="#Collation_Types">Collation 1294 Types</a> 1295 </h3> 1296 <p> 1297 Each locale may have multiple sort orders (types). The <span 1298 class="element">defaultCollation</span> element defines the default 1299 tailoring for a locale and its sublocales. For example: 1300 </p> 1301 <ul> 1302 <li>root.xml: <code><defaultCollation>standard</defaultCollation></code></li> 1303 <li>zh.xml: <code><defaultCollation>pinyin</defaultCollation></code></li> 1304 <li>zh_Hant.xml: <code><defaultCollation>stroke</defaultCollation></code></li> 1305 </ul> 1306 1307 <p> 1308 To allow implementations in reduced memory environments to use CJK 1309 sorting, there are also short forms of each of these collation 1310 sequences. These provide for the most common characters in common 1311 use, and are marked with <span class="attribute">alt</span>="<span 1312 class="attributeValue">short</span>". 1313 </p> 1314 1315 <p>A collation type name that starts with "private-", for example, 1316 "private-kana", indicates an incomplete tailoring that is only 1317 intended for import into one or more other tailorings (usually for 1318 sharing common rules). It does not establish a complete sort order. 1319 An implementation should not build data tables for a private 1320 collation type, and should not include a private collation type in a 1321 list of available types.</p> 1322 1323 <p class="note"> 1324 <b>Note:</b> 1325 </p> 1326 <ul> 1327 <li>There is an on-line demonstration of collation at [<a 1328 href="tr35.html#LocaleExplorer">LocaleExplorer</a>] that uses the 1329 same rule syntax. (Pick the locale and scroll to "Collation 1330 Rules", near the end.) 1331 </li> 1332 <li class="note">In CLDR 23 and before, LDML collation files 1333 used an XML format. Starting with CLDR 24, the XML collation syntax 1334 is deprecated and no longer used. See the <i><a 1335 href="http://www.unicode.org/reports/tr35/tr35-31/tr35-collation.html#Collation_Tailorings">CLDR 1336 23 version of this document</a></i> for details about the XML collation 1337 syntax. 1338 </li> 1339 </ul> 1340 1341 <h4> 1342 3.1.1 <a name="Collation_Type_Fallback" 1343 href="#Collation_Type_Fallback">Collation Type Fallback</a> 1344 </h4> 1345 <p>When loading a requested tailoring from its data file and the 1346 parent file chain, use the following type fallback to find the 1347 tailoring.</p> 1348 <ol> 1349 <li>Determine the default type from the <defaultCollation> 1350 element; map the default type to its alias if one is defined. If 1351 there is no <defaultCollation> element, then use "standard" as 1352 the default type.</li> 1353 <li>If the request language tag specifies the collation type 1354 (keyword "co"), then map it to its alias if one is defined (e.g., 1355 "-co-phonebk" "phonebook"). If the language tag does not specify 1356 the type, then use the default type.</li> 1357 <li>Use the <collation> element with this type.</li> 1358 <li>If it does not exist, and the type starts with "search" but 1359 is longer, then set the type to "search" and use that 1360 <collation> element. (For example, "searchjl" "search".)</li> 1361 <li>If it does not exist, and the type is not the default type, 1362 then set the type to the default type and use that <collation> 1363 element.</li> 1364 <li>If it does not exist, and the type is not "standard", then 1365 set the type to "standard" and use that <collation> element.</li> 1366 <li>If it does not exist, then use the CLDR root collation.</li> 1367 </ol> 1368 <p class="note">Note that the CLDR collation/root.xml contains 1369 <defaultCollation>standard</defaultCollation>, 1370 <collation type="standard"> (with an empty tailoring, so this 1371 is the same as the CLDR root collation), and <collation 1372 type="search">.</p> 1373 1374 <p>For example, assume that we have collation data for the 1375 following tailorings. ("da/search" is shorthand for 1376 "da-u-co-search".)</p> 1377 <ul> 1378 <li>root/defaultCollation=standard</li> 1379 <li>root/standard (this is the same as the CLDR root collator)</li> 1380 <li>root/search</li> 1381 <li>da/standard</li> 1382 <li>da/search</li> 1383 <li>el/standard</li> 1384 <li>ko/standard</li> 1385 <li>ko/search</li> 1386 <li>ko/searchjl</li> 1387 <li>zh/defaultCollation=pinyin</li> 1388 <li>zh/pinyin</li> 1389 <li>zh/stroke</li> 1390 <li>zh-Hant/defaultCollation=stroke</li> 1391 </ul> 1392 <table> 1393 <caption> 1394 <a name="Sample_requested_and_actual_collation_locales_and_types" 1395 href="#Sample_requested_and_actual_collation_locales_and_types">Sample 1396 requested and actual collation locales and types</a> 1397 </caption> 1398 <tr> 1399 <th>requested</th> 1400 <th>actual</th> 1401 <th>comment</th> 1402 </tr> 1403 <tr> 1404 <td>da/phonebook</td> 1405 <td>da/standard</td> 1406 <td>default type for Danish</td> 1407 </tr> 1408 <tr> 1409 <td>zh</td> 1410 <td>zh/pinyin</td> 1411 <td>default type for zh</td> 1412 </tr> 1413 <tr> 1414 <td>zh/standard</td> 1415 <td>root/standard</td> 1416 <td>no "standard" tailoring for zh, falls back to root</td> 1417 </tr> 1418 <tr> 1419 <td>zh/phonebook</td> 1420 <td>zh/pinyin</td> 1421 <td>default type for zh</td> 1422 </tr> 1423 <tr> 1424 <td>zh-Hant/phonebook</td> 1425 <td>zh/stroke</td> 1426 <td>default type for zh-Hant is "stroke"</td> 1427 </tr> 1428 <tr> 1429 <td>da/searchjl</td> 1430 <td>da/search</td> 1431 <td>"search.+" falls back to "search"</td> 1432 </tr> 1433 <tr> 1434 <td>el/search</td> 1435 <td>root/search</td> 1436 <td>no "search" tailoring for Greek</td> 1437 </tr> 1438 <tr> 1439 <td>el/searchjl</td> 1440 <td>root/search</td> 1441 <td>"search.+" falls back to "search", found in root</td> 1442 </tr> 1443 <tr> 1444 <td>ko/searchjl</td> 1445 <td>ko/searchjl</td> 1446 <td>requested data is actually available</td> 1447 </tr> 1448 </table> 1449 1450 <h3> 1451 3.2 <a name="Collation_Version" href="#Collation_Version">Version</a> 1452 </h3> 1453 <p>The version attribute is used in case a specific version of the 1454 UCA is to be specified. It is optional, and is specified if the 1455 results are to be identical on different systems. If it is not 1456 supplied, then the version is assumed to be the same as the Unicode 1457 version for the system as a whole.</p> 1458 <blockquote> 1459 <p class="note"> 1460 <b>Note: </b>For version 3.1.1 of the UCA, the version of Unicode 1461 must also be specified with any versioning information; an example 1462 would be "3.1.1/3.2" for version 3.1.1 of the UCA, for 1463 version 3.2 of Unicode. This was changed by decision of the UTC, so 1464 that dual versions were no longer necessary. So for UCA 4.0 and 1465 beyond, the version just has a single number. 1466 </p> 1467 </blockquote> 1468 1469 <h3> 1470 3.3 <a name="Collation_Element" href="#Collation_Element">Collation 1471 Element</a> 1472 </h3> 1473 <p class="dtd"><!ELEMENT collation (alias | (cr*, special*)) 1474 ></p> 1475 <p> 1476 The tailoring syntax is designed to be independent of the actual 1477 weights used in any particular UCA table. That way the same rules can 1478 be applied to UCA versions over time, even if the underlying weights 1479 change. The following illustrates the overall structure of a <span 1480 class="element">collation</span>: 1481 </p> 1482 <pre><collation type="phonebook"> 1483 <cr><![CDATA[ 1484 [caseLevel on] 1485 &c < k 1486 ]]></cr> 1487 </collation></pre> 1488 1489 <h3> 1490 3.4 <a name="Setting_Options" href="#Setting_Options">Setting 1491 Options</a> 1492 </h3> 1493 <p> 1494 Parametric settings can be specified in language tags or in rule 1495 syntax (in the form 1496 <code>[keyword value]</code> 1497 ). For example, 1498 <code>-ks-level2</code> 1499 or 1500 <code>[strength 2]</code> 1501 will only compare strings based on their primary and secondary 1502 weights. 1503 </p> 1504 <p> 1505 If a setting is not present, the CLDR default (or the default for the 1506 locale, if there is one) is used. That default is listed in bold 1507 italics. Where there is a UCA default that is different, it is listed 1508 in bold with (<strong>UCA default</strong>). Note that the default 1509 value for a locale may be different than the normal default value for 1510 the setting. 1511 </p> 1512 1513 <table> 1514 <caption> 1515 <a name="Collation_Settings" href="#Collation_Settings">Collation 1516 Settings</a> 1517 </caption> 1518 <tr> 1519 <th>BCP47 Key</th> 1520 <th>BCP47 Value</th> 1521 <th>Rule Syntax</th> 1522 <th>Description</th> 1523 </tr> 1524 <tr> 1525 <td rowspan="5">ks</td> 1526 <td>level1</td> 1527 <td><code>[strength 1]</code><br>(primary)</td> 1528 <td rowspan="5">Sets the default strength for comparison, as 1529 described in the [<a 1530 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>].<em> 1531 Note that a strength setting of greater than 4 may have the same 1532 effect as <strong>identical</strong>, depending on the locale and 1533 implementation. 1534 </em> 1535 </td> 1536 </tr> 1537 <tr> 1538 <td>level2</td> 1539 <td><code>[strength 2]</code><br>(secondary)</td> 1540 </tr> 1541 <tr> 1542 <td>level3</td> 1543 <td><em><strong><code>[strength 3]</code><br>(tertiary)</strong></em></td> 1544 </tr> 1545 <tr> 1546 <td>level4</td> 1547 <td><code>[strength 4]</code><br>(quaternary)</td> 1548 </tr> 1549 <tr> 1550 <td>identic</td> 1551 <td><code>[strength I]</code><br>(identical)</td> 1552 </tr> 1553 <tr> 1554 <td rowspan="3">ka</td> 1555 <td>noignore</td> 1556 <td><i><strong><code>[alternate 1557 non-ignorable]</code></strong></i><br></td> 1558 <td rowspan="3">Sets alternate handling for variable weights, 1559 as described in [<a 1560 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>], where 1561 "shifted" causes certain characters to be ignored in 1562 comparison. <em>The default for LDML is different than it is 1563 in the UCA. In LDML, the default for alternate handling is <strong>non-ignorable</strong>, 1564 while in UCA it is <strong>shifted</strong>. In addition, in LDML 1565 only whitespace and punctuation are variable by default. 1566 </em> 1567 </td> 1568 </tr> 1569 <tr> 1570 <td>shifted</td> 1571 <td><strong><code>[alternate shifted]</code><br>(UCA 1572 default)</strong></td> 1573 </tr> 1574 <tr> 1575 <td><em>n/a</em></td> 1576 <td><i>n/a</i><br>(blanked)</td> 1577 </tr> 1578 <tr> 1579 <td rowspan="2">kb</td> 1580 <td>true</td> 1581 <td><code>[backwards 2]</code></td> 1582 <td rowspan="2">Sets the comparison for the second level to be 1583 <strong>backwards</strong>, as described in [<a 1584 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>]. 1585 </td> 1586 </tr> 1587 <tr> 1588 <td>false</td> 1589 <td><i><strong>n/a</strong></i></td> 1590 </tr> 1591 <tr> 1592 <td rowspan="2">kk</td> 1593 <td>true</td> 1594 <td><strong><code>[normalization on]</code><br>(UCA 1595 default)</strong></td> 1596 <td rowspan="2">If <strong>on</strong>, then the normal [<a 1597 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] 1598 algorithm is used. If <strong>off</strong>, then most strings 1599 should still sort correctly despite not normalizing to NFD first.<br> 1600 <em>Note that the default for CLDR locales may be different 1601 than in the UCA. The rules for particular locales have it set to <strong>on</strong>: 1602 those locales whose exemplar characters (in forms commonly 1603 interchanged) would be affected by normalization. 1604 </em> 1605 </td> 1606 </tr> 1607 <tr> 1608 <td>false</td> 1609 <td><i><strong><code>[normalization off]</code></strong></i></td> 1610 </tr> 1611 <tr> 1612 <td rowspan="2">kc</td> 1613 <td>true</td> 1614 <td><code>[caseLevel on]</code></td> 1615 <td rowspan="2">If set to <strong>on</strong><i>,</i> a level 1616 consisting only of case characteristics will be inserted in front 1617 of tertiary level, as a "Level 2.5". To ignore accents 1618 but take case into account, set strength to <strong>primary</strong> 1619 and case level to <strong>on</strong>. For details, see <em>Section 1620 3.14, <a href="#Case_Parameters">Case Parameters</a> 1621 </em>. 1622 </td> 1623 </tr> 1624 <tr> 1625 <td>false</td> 1626 <td><i><strong><code>[caseLevel off]</code></strong></i></td> 1627 </tr> 1628 <tr> 1629 <td rowspan="3">kf</td> 1630 <td>upper</td> 1631 <td><code>[caseFirst upper]</code></td> 1632 <td rowspan="3">If set to <strong>upper</strong>, causes upper 1633 case to sort before lower case. If set to <strong>lower</strong>, 1634 causes lower case to sort before upper case. Useful for locales 1635 that have already supported ordering but require different order of 1636 cases. Affects case and tertiary levels. For details, see <em>Section 1637 3.14, <a href="#Case_Parameters">Case Parameters</a> 1638 </em>. 1639 </td> 1640 </tr> 1641 <tr> 1642 <td>lower</td> 1643 <td><code>[caseFirst lower]</code></td> 1644 </tr> 1645 <tr> 1646 <td>false</td> 1647 <td><i><strong><code>[caseFirst off]</code></strong></i></td> 1648 </tr> 1649 <tr> 1650 <td rowspan="2">kh</td> 1651 <td>true<br> <i><strong>Deprecated:</strong></i> Use rules 1652 with quater­nary relations instead. 1653 </td> 1654 <td><code>[hiraganaQ on]</code></td> 1655 <td rowspan="2">Controls special treatment of Hiragana code 1656 points on quaternary level. If turned <strong>on</strong>, Hiragana 1657 codepoints will get lower values than all the other non-variable 1658 code points in <strong>shifted</strong>. That is, the normal Level 1659 4 value for a regular collation element is FFFF, as described in [<a 1660 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>], <em>Section 1661 3.6, <a 1662 href="http://www.unicode.org/reports/tr10/#Variable_Weighting">Variable 1663 Weighting</a> 1664 </em>. This is changed to FFFE for [:script=Hiragana:] characters. The 1665 strength must be greater or equal than quaternary if this attribute 1666 is to have any effect. 1667 </td> 1668 </tr> 1669 <tr> 1670 <td>false</td> 1671 <td><i><strong><code>[hiraganaQ off]</code></strong></i></td> 1672 </tr> 1673 <tr> 1674 <td rowspan="2">kn</td> 1675 <td>true</td> 1676 <td><code>[numericOrdering on]</code></td> 1677 <td rowspan="2">If set to <strong>on</strong>, any sequence of 1678 Decimal Digits (General_Category = Nd in the [<a 1679 href="http://www.unicode.org/reports/tr41/#UAX44">UAX44</a>]) is 1680 sorted at a primary level with its numeric value. For example, 1681 "A-21" < "A-123". The computed primary 1682 weights are all at the start of the <strong>digit</strong> 1683 reordering group. Thus with an untailored UCA table, "a$" 1684 < "a0" < "a2" < "a12" < 1685 "a" < "aa". 1686 </td> 1687 </tr> 1688 <tr> 1689 <td>false</td> 1690 <td><i><strong><code>[numericOrdering off]</code></strong></i></td> 1691 </tr> 1692 <tr> 1693 <td>kr</td> 1694 <td>a sequence of one or more reorder codes: <strong>space, 1695 punct, symbol, currency, digit</strong>, or any BCP47 script ID 1696 </td> 1697 <td><code>[reorder Grek digit]</code></td> 1698 <td>Specifies a reordering of scripts or other significant 1699 blocks of characters such as symbols, punctuation, and digits. For 1700 the precise meaning and usage of the reorder codes, see <em>Section 1701 3.13, <a href="#Script_Reordering">Collation Reordering</a>. 1702 </em> 1703 </td> 1704 </tr> 1705 <tr> 1706 <td rowspan="4">kv</td> 1707 <td>space</td> 1708 <td><code>[maxVariable space]</code></td> 1709 <td rowspan="4">Sets the variable top to the top of the 1710 specified reordering group. All code points with primary weights 1711 less than or equal to the variable top will be considered variable, 1712 and thus affected by the alternate handling. Variables are 1713 ignorable by default in [<a 1714 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>], but not 1715 in CLDR. 1716 </td> 1717 </tr> 1718 <tr> 1719 <td>punct</td> 1720 <td><i><strong><code>[maxVariable punct]</code></strong></i></td> 1721 </tr> 1722 <tr> 1723 <td>symbol</td> 1724 <td><strong><code>[maxVariable symbol]</code><br>(UCA 1725 default)</strong></td> 1726 </tr> 1727 <tr> 1728 <td>currency</td> 1729 <td><code>[maxVariable currency]</code></td> 1730 </tr> 1731 <tr> 1732 <td>vt</td> 1733 <td>See <i>Part 1 Section 3.6.4, <a 1734 href="tr35.html#Unicode_Locale_Extension_Data_Files">U 1735 Extension Data Files</a></i>.<br> <i><strong>Deprecated:</strong></i> 1736 Use maxVariable instead. 1737 </td> 1738 <td><code>&\u00XX\uYYYY < [variable top]</code><br> 1739 <br> (the default is set to the highest punctuation, thus 1740 including spaces and punctuation, but not symbols)</td> 1741 <td> 1742 <p> 1743 The BCP47 value is described in <i>Appendix Q: <a 1744 href="tr35.html#Locale_Extension_Key_and_Type_Data">Locale 1745 Extension Keys and Types</a>. 1746 </i> 1747 </p> 1748 <p> 1749 Sets the string value for the variable top. All the code points 1750 with primary weights less than or equal to the variable top will 1751 be considered variable, and thus affected by the alternate 1752 handling.<br> An implementation that supports the variableTop 1753 setting should also support the maxVariable setting, and it should 1754 "pin" ("round up") the variableTop to the top of the containing 1755 reordering group.<br> Variables are ignorable by default in [<a 1756 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>], but 1757 not in CLDR. See below for more information. 1758 </p> 1759 </td> 1760 </tr> 1761 <tr> 1762 <td><em>n/a</em></td> 1763 <td><em>n/a</em></td> 1764 <td><em>n/a</em></td> 1765 <td>match-boundaries: <em><strong>none</strong></em> | 1766 whole-character | whole-word <br> Defined by <em>Section 1767 8, <a href="http://www.unicode.org/reports/tr10/#Searching">Searching 1768 and Matching</a> 1769 </em> of [<a href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>]. 1770 </td> 1771 </tr> 1772 <tr> 1773 <td><em>n/a</em></td> 1774 <td><em>n/a</em></td> 1775 <td><em>n/a</em></td> 1776 <td>match-style: <em><strong>minimal</strong></em> | medial | 1777 maximal <br> Defined by <em>Section 8, <a 1778 href="http://www.unicode.org/reports/tr10/#Searching">Searching 1779 and Matching</a></em> of [<a 1780 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>]. 1781 </td> 1782 </tr> 1783 </table> 1784 1785 <h4> 1786 3.4.1 <a name="Common_Settings" href="#Common_Settings">Common 1787 settings combinations</a> 1788 </h4> 1789 <p>Some commonly used parametric collation settings are available 1790 via combinations of LDML settings attributes:</p> 1791 <ul> 1792 <li>Ignore accents: <strong>strength=primary</strong></li> 1793 <li>Ignore accents but take case into account: <strong>strength=primary 1794 caseLevel=on</strong></li> 1795 <li>Ignore case: <strong>strength=secondary</strong></li> 1796 <li>Ignore punctuation (completely): <strong>strength=tertiary 1797 alternate=shifted</strong></li> 1798 <li>Ignore punctuation but distinguish among punctuation 1799 marks: <strong>strength=quaternary alternate=shifted</strong> 1800 </li> 1801 </ul> 1802 1803 <h4> 1804 3.4.2 <a name="Normalization_Setting" href="#Normalization_Setting">Notes 1805 on the normalization setting</a> 1806 </h4> 1807 <p>The UCA always normalizes input strings into NFD form before 1808 the rest of the algorithm. However, this results in poor performance.</p> 1809 <p> 1810 With <strong>normalization=off</strong>, strings that are in [<a 1811 href="tr35.html#FCD">FCD</a>] and do not contain Tibetan precomposed 1812 vowels (U+0F73, U+0F75, U+0F81) should sort correctly. With <strong>normalization=on</strong>, 1813 an implementation that does not normalize to NFD must at least 1814 perform an incremental FCD check and normalize substrings as 1815 necessary. It should also always decompose the Tibetan precomposed 1816 vowels. (Otherwise discontiguous contractions across their leading 1817 components cannot be handled correctly.) 1818 </p> 1819 <p>Another complication for an implementation that does not always 1820 use NFD arises when contraction mappings overlap with canonical 1821 Decomposition_Mapping strings. For example, the Danish contraction 1822 aa overlaps with the decompositions of , , and other 1823 characters. In the root collation (and in the DUCET), Cyrillic 1824 maps to a single collation element, which means that its 1825 decomposition +◌̈ forms a contraction, and its 1826 second character (U+0308) is the same as the first character in the 1827 Decomposition_Mapping of U+0344 1828 ◌̈́=◌̈+◌́.</p> 1829 <p>In order to handle strings with these characters (e.g., a 1830 and ̈́ [which are in FCD]) exactly as with prior NFD 1831 normalization, an implementation needs to either add overlap 1832 contractions to its data (e.g., a+ and +◌̈́), or 1833 it needs to decompose the relevant composites (e.g., and 1834 ◌̈́) as soon as they are encountered.</p> 1835 1836 <h4> 1837 3.4.3 <a name="Variable_Top_Settings" href="#Variable_Top_Settings">Notes 1838 on variable top settings</a> 1839 </h4> 1840 <p> 1841 Users may want to include more or fewer characters as Variable. For 1842 example, someone could want to restrict the Variable characters to 1843 just include space marks. In that case, maxVariable would be set to 1844 "space". (In CLDR 24 and earlier, the now-deprecated variableTop 1845 would be set to U+1680, see the Whitespace <a 1846 href="http://unicode.org/charts/collation/">UCA collation chart</a>). 1847 Alternatively, someone could want more of the Common characters in 1848 them, and include characters up to (but not including) '0', by 1849 setting maxVariable to "currency". (In CLDR 24 and earlier, the 1850 now-deprecated variableTop would be set to U+20BA, see the 1851 Currency-Symbol collation chart). 1852 </p> 1853 <p>The effect of these settings is to customize to ignore 1854 different sets of characters when comparing strings. For example, the 1855 locale identifier "de-u-ka-shifted-kv-currency" is requesting 1856 settings appropriate for German, including German sorting 1857 conventions, and that currency symbols and characters sorting below 1858 them are ignored in sorting.</p> 1859 1860 <h3> 1861 3.5 <a name="Rules" href="#Rules">Collation Rule Syntax</a> 1862 </h3> 1863 <p class="dtd"><!ELEMENT cr #PCDATA ></p> 1864 <p> 1865 The goal for the collation rule syntax is to have clearly expressed 1866 rules with a concise format. The CLDR rule syntax is a subset of the 1867 [<a href="tr35.html#ICUCollation">ICUCollation</a>] syntax. 1868 </p> 1869 1870 <p> 1871 For the CLDR root collation, the FractionalUCA.txt file defines all 1872 mappings for all of Unicode directly, and it also provides 1873 information about script boundaries, reordering groups, and other 1874 details. For tailorings, this is neither necessary nor practical. In 1875 particular, while the root collation sort order rarely changes for 1876 existing characters, their numeric collation weights change with 1877 every version. If tailorings also specified numeric weights directly, 1878 then they would have to change with every version, parallel with the 1879 root collation. Instead, for tailorings, mappings are added and 1880 modified relative to the root collation. (There is no syntax to <i>remove</i> 1881 mappings, except via <a href="#Special_Purpose_Commands">special 1882 [suppressContractions [...]] </a>.) 1883 </p> 1884 1885 <p> 1886 The ASCII [:P:] and [:S:] characters are reserved for collation 1887 syntax: 1888 <code>[\u0021-\u002F \u003A-\u0040 \u005B-\u0060 1889 \u007B-\u007E]</code> 1890 </p> 1891 1892 <p>Unicode Pattern_White_Space characters between tokens are 1893 ignored. Unquoted white space terminates reset and relation strings.</p> 1894 1895 <p>A pair of ASCII apostrophes encloses quoted literal text. They 1896 are normally used to enclose a syntax character or white space, or a 1897 whole reset/relation string containing one or more such characters, 1898 so that those are parsed as part of the reset/relation strings rather 1899 than treated as syntax. A pair of immediately adjacent apostrophes is 1900 used to encode one apostrophe.</p> 1901 1902 <p> 1903 Code points can be escaped with 1904 <code>\uhhhh</code> 1905 and 1906 <code>\U00hhhhhh</code> 1907 escapes, as well as common escapes like 1908 <code>\t</code> 1909 and 1910 <code>\n</code> 1911 . (For details see the documentation of ICU 1912 UnicodeString::unescape().) This is particularly useful for 1913 default-ignorable code points, combining marks, visually indistinct 1914 variants, hard-to-type characters, etc. These sequences are unescaped 1915 before the rules are parsed; this means that even escaped syntax and 1916 white space characters need to be enclosed in apostrophes. For 1917 example: 1918 <code>&'\u0020'='\u3000'</code>. 1919 Note: The unescaping is done by ICU tools (genrb) and demos before passing 1920 rule strings into the ICU library code. 1921 The ICU collation API does not unescape rule strings. 1922 </p> 1923 1924 <p> 1925 The ASCII double quote must be both escaped (so that the collation 1926 syntax can be enclosed in pairs of double quotes in programming 1927 environments such as ICU resource bundle .txt files) 1928 and quoted. For example: 1929 <code>&'\u0022'<<<x</code> 1930 </p> 1931 1932 <p> 1933 Comments are allowed at the beginning, and after any complete reset, 1934 relation, setting, or command. A comment begins with a 1935 <code>#</code> 1936 and extends to the end of the line (according to the Unicode Newline 1937 Guidelines). 1938 </p> 1939 1940 <p>The collation syntax is case-sensitive.</p> 1941 1942 <h3> 1943 3.6 <a name="Orderings" href="#Orderings">Orderings</a> 1944 </h3> 1945 1946 <p>The root collation mappings form the initial state. Mappings 1947 are added and removed via a sequence of rule chains. Each tailoring 1948 rule builds on the current state after all of the preceding rules 1949 (and is not affected by any following rules). Rule chains may 1950 alternate with comments, settings, and special commands.</p> 1951 1952 <p>A rule chain consists of a reset followed by one or more 1953 relations. The reset position is a string which maps to one or more 1954 collation elements according to the current state. A relation 1955 consists of an operator and a string; it maps the string to the 1956 current collation elements, modified according to the operator.</p> 1957 1958 <table> 1959 <caption> 1960 <a name="Specifying_Collation_Ordering" 1961 href="#Specifying_Collation_Ordering">Specifying Collation 1962 Ordering</a> 1963 1964 </caption> 1965 <tr> 1966 <th>Relation Operator</th> 1967 <th> Example</th> 1968 <th>Description</th> 1969 </tr> 1970 <tr> 1971 <td><code>&</code></td> 1972 <td><code>& Z</code></td> 1973 <td>Map Z to collation elements according to the current state. 1974 These will be modified according to the following relation 1975 operators and then assigned to the corresponding relation strings.</td> 1976 </tr> 1977 <tr> 1978 <td><code><</code></td> 1979 <td><code> 1980 & a<br> < b 1981 </code></td> 1982 <td>Make 'b' sort after 'a', as a <i>primary</i> 1983 (base-character) difference 1984 </td> 1985 </tr> 1986 <tr> 1987 <td><code><<</code></td> 1988 <td><code> 1989 & a<br> << 1990 </code></td> 1991 <td>Make '' sort after 'a' as a <i>secondary</i> 1992 (accent) difference 1993 </td> 1994 </tr> 1995 <tr> 1996 <td><code><<<</code></td> 1997 <td><code> 1998 & a<br> <<< A 1999 </code></td> 2000 <td>Make 'A' sort after 'a' as a <i>tertiary</i> 2001 (case/variant) difference 2002 </td> 2003 </tr> 2004 <tr> 2005 <td><code><<<<</code></td> 2006 <td><code> 2007 & <br> <<<< 2008 </code></td> 2009 <td>Make '' (Katakana Ka) sort after '' 2010 (Hiragana Ka) as a <i>quaternary</i> difference 2011 </td> 2012 </tr> 2013 <tr> 2014 <td><code>= </code></td> 2015 <td><code> 2016 & v<br> = w 2017 </code></td> 2018 <td>Make 'w' sort <i>identically</i> to 'v' 2019 </td> 2020 </tr> 2021 </table> 2022 <p>The following shows the result of serially applying three 2023 rules.</p> 2024 <table> 2025 <tr> 2026 <th> </th> 2027 <th>Rules</th> 2028 <th>Result</th> 2029 <th>Comment</th> 2030 </tr> 2031 <tr> 2032 <td>1</td> 2033 <td>& a < g</td> 2034 <td>... a<font color="red"> <<sub>1</sub> g 2035 </font> ... 2036 </td> 2037 <td>Put g after a.</td> 2038 </tr> 2039 <tr> 2040 <td>2</td> 2041 <td>& a < h < k</td> 2042 <td>... a<font color="red"> <<sub>1</sub> h <<sub>1</sub> 2043 k 2044 </font> <<sub>1</sub> g ... 2045 </td> 2046 <td>Now put h and k after a (inserting before the g).</td> 2047 </tr> 2048 <tr> 2049 <td>3</td> 2050 <td>& h << g</td> 2051 <td>... a <<sub>1</sub> h<font color="red"> <<sub>1</sub> 2052 g 2053 </font> <<sub>1</sub> k ... 2054 </td> 2055 <td>Now put g after h (inserting before k).</td> 2056 </tr> 2057 </table> 2058 <p>Notice that relation strings can occur multiple times, and thus 2059 override previous rules.</p> 2060 2061 <p>Each relation uses and modifies the collation elements of the 2062 immediately preceding reset position or relation. A rule chain with 2063 two or more relations is equivalent to a sequence of atomic rules 2064 where each rule chain has exactly one relation, and each relation is 2065 followed by a reset to this same relation string.</p> 2066 2067 <p> 2068 <i>Example:</i> 2069 </p> 2070 <table> 2071 <tr> 2072 <th>Rules</th> 2073 <th>Equivalent Atomic Rules</th> 2074 </tr> 2075 <tr> 2076 <td>& b < q <<< Q<br> & a < x 2077 <<< X << q <<< Q < z 2078 </td> 2079 <td>& b < q<br> & q <<< Q<br> 2080 & a < x<br> & x <<< X<br> & X 2081 << q<br> & q <<< Q<br> & Q < z 2082 </td> 2083 </tr> 2084 </table> 2085 <p>This is not always possible because prefix and extension 2086 strings can occur in a relation but not in a reset (see below).</p> 2087 2088 <p> 2089 The relation operator 2090 <code>=</code> 2091 maps its relation string to the current collation elements. Any other 2092 relation operator modifies the current collation elements as follows. 2093 </p> 2094 <ul> 2095 <li>Find the <i>last</i> collation element whose strength is at 2096 least as great as the strength of the operator. For example, for <code><<</code> 2097 find the last primary or secondary CE. This CE will be modified; all 2098 following CEs should be removed. If there is no such CE, then reset 2099 the collation elements to a single completely-ignorable CE. 2100 </li> 2101 <li>Increment the collation element weight corresponding to the 2102 strength of the operator. For example, for <code><<</code> 2103 increment the secondary weight. 2104 </li> 2105 <li>The new weight must be less than the next weight for the 2106 same combination of higher-level weights of any collation element 2107 according to the current state.</li> 2108 <li>Weights must be allocated in accordance with the <a 2109 href="http://www.unicode.org/reports/tr10/#Well-Formed">UCA 2110 well-formedness conditions</a>. 2111 </li> 2112 <li>When incrementing any weight, lower-level weights should be 2113 reset to the common values, to help with sort key compression.</li> 2114 </ul> 2115 2116 <p> 2117 In all cases, even for 2118 <code>=</code> 2119 , the case bits are recomputed according to <i>Section 3.13, <a 2120 href="#Case_Parameters">Case Parameters</a></i>. (This can be skipped if 2121 an implementation does not support the caseLevel or caseFirst 2122 settings.) 2123 </p> 2124 2125 <p> 2126 For example, 2127 <code>&ae<x</code> 2128 maps x to two collation elements. The first one is the same as for 2129 a, and the second one has a primary weight between those for e 2130 and f. As a result, x sorts between ae and af. (If the 2131 primary of the first collation element was incremented instead, then 2132 x would sort after az. While also sorting primary-after ae this 2133 would be surprising and sub-optimal.) 2134 </p> 2135 2136 <p>Some additional operators are provided to save space with large 2137 tailorings. The addition of a * to the relation operator indicates 2138 that each of the following single characters are to be handled as if 2139 they were separate relations with the corresponding strength. Each of 2140 the following single characters must be NFD-inert, that is, it does 2141 not have a canonical decomposition and it does not reorder (ccc=0). 2142 This keeps abbreviated rules unambiguous.</p> 2143 <p> 2144 A starred relation operator is followed by a sequence of characters 2145 with the same quoting/escaping rules as normal relation strings. Such 2146 a sequence can also be followed by one or more pairs of - and 2147 another sequence of characters. The single characters adjacent to the 2148 - establish a code point order range. The same character cannot be 2149 both the end of a range and the start of another range. (For example, 2150 <code><a-d-g</code> 2151 is not allowed.) 2152 </p> 2153 <table> 2154 <caption> 2155 <a name="Abbreviating_Ordering_Specifications" 2156 href="#Abbreviating_Ordering_Specifications">Abbreviating 2157 Ordering Specifications</a> 2158 </caption> 2159 <tr> 2160 <th>Relation Operator</th> 2161 <th>Example</th> 2162 <th>Equivalent</th> 2163 </tr> 2164 <tr> 2165 <td><code><*</code></td> 2166 <td><code> 2167 & <span style="color: blue">a</span><br> <* <span 2168 style="color: blue">bcd-gp-s</span> 2169 </code></td> 2170 <td><code> 2171 & <span style="color: blue">a</span><br> < <span 2172 style="color: blue">b </span><<span style="color: blue"> 2173 c </span><<span style="color: blue"> d</span> < <span 2174 style="color: blue">e</span> < <span style="color: blue">f</span> 2175 < <span style="color: blue">g</span> < <span 2176 style="color: blue">p</span> < <span style="color: blue">q</span> 2177 < <span style="color: blue">r</span> < <span 2178 style="color: blue">s</span> 2179 </code></td> 2180 </tr> 2181 <tr> 2182 <td><code><<*</code></td> 2183 <td><code> 2184 &<span style="color: blue"> a</span><br> <<*<span 2185 style="color: blue"> </span> 2186 </code></td> 2187 <td><code> 2188 &<span style="color: blue"> a</span><br> <<<span 2189 style="color: blue"> </span><< <span style="color: blue"> 2190 </span><< <span style="color: blue"></span> 2191 </code></td> 2192 </tr> 2193 <tr> 2194 <td><code><<<*</code></td> 2195 <td><code> 2196 &<span style="color: blue"> p</span><br> <<<* <span 2197 style="color: blue">P</span> 2198 </code></td> 2199 <td><code> 2200 &<span style="color: blue"> p</span><br> <<< <span 2201 style="color: blue">P</span> <<< <span 2202 style="color: blue"></span> <<< <span 2203 style="color: blue"></span> 2204 </code></td> 2205 </tr> 2206 <tr> 2207 <td><code><<<<*</code></td> 2208 <td><code> 2209 &<span style="color: blue"> k</span><br> 2210 <<<<* <span style="color: blue">qQ</span> 2211 </code></td> 2212 <td><code> 2213 &<span style="color: blue"> k</span><br> <<<< 2214 <span style="color: blue">q</span> <<<< <span 2215 style="color: blue">Q</span> 2216 </code></td> 2217 </tr> 2218 <tr> 2219 <td><code>=*</code></td> 2220 <td><code> 2221 &<span style="color: blue"> v</span><br> =* <span 2222 style="color: blue">VwW</span> 2223 </code></td> 2224 <td><code> 2225 &<span style="color: blue"> v</span><br> = <span 2226 style="color: blue">V </span>= <span style="color: blue">w 2227 </span>= <span style="color: blue">W</span> 2228 </code></td> 2229 </tr> 2230 </table> 2231 <h3> 2232 3.7 <a name="Contractions" href="#Contractions">Contractions</a> 2233 </h3> 2234 2235 <p>A multi-character relation string defines a contraction.</p> 2236 2237 <table> 2238 <caption> 2239 <a name="Specifying_Contractions" href="#Specifying_Contractions">Specifying 2240 Contractions</a> 2241 </caption> 2242 <tr> 2243 <th>Example</th> 2244 <th>Description</th> 2245 </tr> 2246 <tr> 2247 <td><code> 2248 & k<br> < ch 2249 </code></td> 2250 <td>Make the sequence 'ch' sort after 'k', as a 2251 primary (base-character) difference</td> 2252 </tr> 2253 </table> 2254 2255 <h3> 2256 3.8 <a name="Expansions" href="#Expansions">Expansions</a> 2257 </h3> 2258 <p> 2259 A mapping to multiple collation elements defines an expansion. This 2260 is normally the result of a reset position (and/or preceding 2261 relation) that yields multiple collation elements, for example 2262 <code>&ae<x</code> 2263 or 2264 <code>&<y</code> 2265 . 2266 </p> 2267 2268 <p> 2269 A relation string can also be followed by 2270 <code>/</code> 2271 and an <i>extension string</i>. The extension string is mapped to 2272 collation elements according to the current state, and the relation 2273 string is mapped to the concatenation of the regular CEs and the 2274 extension CEs. The extension CEs are not modified, not even their 2275 case bits. The extension CEs are <i>not</i> retained for following 2276 relations. 2277 </p> 2278 2279 <p> 2280 For example, 2281 <code>&a<z/e</code> 2282 maps z to an expansion similar to 2283 <code>&ae<x</code> 2284 . However, the first CE of z is primary-after that of a, and the 2285 second CE is exactly that of e, which yields the order ae < x 2286 < af < ag < ... < az < z < b. 2287 </p> 2288 2289 <p> 2290 The choice of reset-to-expansion vs. use of an extension string can 2291 be exploited to affect contextual mappings. For example, 2292 <code>&L=x</code> 2293 yields a second CE for x equal to the context-sensitive 2294 middle-dot-after-L (which is a secondary CE in the root collation). 2295 On the other hand, 2296 <code>&L=x/</code> 2297 yields a second CE of the middle dot by itself (which is a primary 2298 CE). 2299 </p> 2300 2301 <p> 2302 The two ways of specifying expansions also differ in how case bits 2303 are computed. When some of the CEs are copied verbatim from an 2304 extension string, then the relation strings case bits are 2305 distributed over a smaller number of normal CEs. For example, 2306 <code>&aE=Ch</code> 2307 yields an uppercase CE and a lowercase CE, but 2308 <code>&a=Ch/E</code> 2309 yields a mixed-case CE (for C and h together) followed by an 2310 uppercase CE (copied from E). 2311 </p> 2312 2313 <p>In summary, there are two ways of specifying expansions which 2314 produce subtly different mappings. The use of extension strings is 2315 unusual but sometimes necessary.</p> 2316 2317 2318 <h3> 2319 3.9 <a name="Context_Before" href="#Context_Before">Context 2320 Before</a> 2321 </h3> 2322 <p> 2323 A relation string can have a prefix (context before) which makes the 2324 mapping from the relation string to its tailored position conditional 2325 on the string occurring after that prefix. For details see the 2326 specification of <i><a href="#Context_Sensitive_Mappings">Context-Sensitive 2327 Mappings</a></i>. 2328 </p> 2329 <p>For example, suppose that "-" is sorted like the 2330 previous vowel. Then one could have rules that take "a-", 2331 "e-", and so on. However, that means that every time a very 2332 common character (a, e, ...) is encountered, a system will slow down 2333 as it looks for possible contractions. An alternative is to indicate 2334 that when "-" is encountered, and it comes after an 2335 'a', it sorts like an 'a', and so on.</p> 2336 <table> 2337 <caption> 2338 <a name="Specifying_Previous_Context" 2339 href="#Specifying_Previous_Context">Specifying Previous Context</a> 2340 </caption> 2341 <tr> 2342 <th>Rules</th> 2343 </tr> 2344 <tr> 2345 <td><code> 2346 & a <<< a | '-'<br> & e <<< e | '-'<br> 2347 ... 2348 </code></td> 2349 </tr> 2350 </table> 2351 <p>Both the prefix and extension strings can occur in a relation. 2352 For example, the following are allowed:</p> 2353 <ul> 2354 <li><code>< abc | def / ghi</code></li> 2355 <li><code>< def / ghi</code></li> 2356 <li><code>< abc | def</code></li> 2357 </ul> 2358 <h3> 2359 3.10 <a name="Placing_Characters_Before_Others" 2360 href="#Placing_Characters_Before_Others">Placing Characters 2361 Before Others</a> 2362 </h3> 2363 <p>There are certain circumstances where characters need to be 2364 placed before a given character, rather than after. This is the case 2365 with Pinyin, for example, where certain accented letters are 2366 positioned before the base letter. That is accomplished with the 2367 following syntax.</p> 2368 <pre>&[before 2] a << </pre> 2369 <p>The before-strength can be 1 (primary), 2 (secondary), or 3 2370 (tertiary).</p> 2371 <p>It is an error if the strength of the reset-before differs from 2372 the strength of the immediately following relation. Thus the 2373 following are errors.</p> 2374 <ul> 2375 <li><code>&[before 2] a < # error</code></li> 2376 <li><code>&[before 2] a <<< # error</code></li> 2377 </ul> 2378 2379 <h3> 2380 3.11 <a name="Logical_Reset_Positions" 2381 href="#Logical_Reset_Positions">Logical Reset Positions</a> 2382 </h3> 2383 2384 <p>The CLDR table (based on UCA) has the following overall 2385 structure for weights, going from low to high.</p> 2386 <table> 2387 <caption> 2388 <a name="Specifying_Logical_Positions" 2389 href="#Specifying_Logical_Positions">Specifying Logical 2390 Positions</a> 2391 </caption> 2392 <tr> 2393 <th>Name</th> 2394 <th>Description</th> 2395 <th>UCA Examples</th> 2396 </tr> 2397 <tr> 2398 <td>first tertiary ignorable<br> ...<br> last 2399 tertiary ignorable 2400 </td> 2401 <td>p, s, t = ignore</td> 2402 <td>Control Codes<br> Format Characters<br> Hebrew 2403 Points<br> Tibetan Signs<br> ... 2404 </td> 2405 </tr> 2406 <tr> 2407 <td>first secondary ignorable<br> ...<br> last 2408 secondary ignorable 2409 </td> 2410 <td>p, s = ignore</td> 2411 <td>None in UCA</td> 2412 </tr> 2413 <tr> 2414 <td>first primary ignorable<br> ...<br> last primary 2415 ignorable 2416 </td> 2417 <td>p = ignore</td> 2418 <td>Most combining marks</td> 2419 </tr> 2420 <tr> 2421 <td>first variable<br> ...<br> last variable 2422 </td> 2423 <td><i><b>if</b> alternate = non-ignorable<br> </i>p != 2424 ignore,<br> <i><b>if</b> alternate = shifted</i><br> p, 2425 s, t = ignore</td> 2426 <td>Whitespace,<br> Punctuation 2427 </td> 2428 </tr> 2429 <tr> 2430 <td>first regular<br> ...<br> last regular 2431 </td> 2432 <td>p != ignore</td> 2433 <td>General Symbols<br> Currency Symbols<br> Numbers<br> 2434 Latin<br> Greek<br> ... 2435 </td> 2436 </tr> 2437 <tr> 2438 <td>first implicit<br>...<br>last implicit 2439 </td> 2440 <td>p != ignore, assigned automatically</td> 2441 <td>CJK, CJK compatibility (those that are not decomposed)<br> 2442 CJK Extension A, B, C, ...<br> Unassigned 2443 </td> 2444 </tr> 2445 <tr> 2446 <td>first trailing<br> ...<br> last trailing 2447 </td> 2448 <td>p != ignore,<br> used for trailing syllable components 2449 </td> 2450 <td>Jamo Trailing<br> Jamo Leading<br>U+FFFD<br>U+FFFF 2451 </td> 2452 </tr> 2453 </table> 2454 <p> 2455 Each of the above Names can be used with a reset to position 2456 characters relative to that logical position. That allows characters 2457 to be ordered before or after a <i>logical</i> position rather than a 2458 specific character. 2459 </p> 2460 <blockquote> 2461 <p class="note"> 2462 <b>Note: </b>The reason for this is so that tailorings can be more 2463 stable. A future version of the UCA might add characters at any 2464 point in the above list. Suppose that you set character X to be 2465 after Y. It could be that you want X to come after Y, no matter what 2466 future characters are added; or it could be that you just want Y to 2467 come after a given logical position, for example, after the last 2468 primary ignorable. 2469 </p> 2470 </blockquote> 2471 2472 <p>Each of these special reset positions always maps to a single 2473 collation element.</p> 2474 2475 <p>Here is an example of the syntax:</p> 2476 <pre>& [first tertiary ignorable] << </pre> 2477 <p>For example, to make a character be a secondary ignorable, one 2478 can make it be immediately after (at a secondary level) a specific 2479 character (like a combining diaeresis), or one can make it be 2480 immediately after the last secondary ignorable.</p> 2481 2482 <p> 2483 Each special reset position adjusts to the effects of preceding 2484 rules, just like normal reset position strings. For example, if a 2485 tailoring rule creates a new collation element after 2486 <code>&[last variable]</code> 2487 (via explicit tailoring after that, or via tailoring after the 2488 relevant character), then this new CE becomes the new <i>last 2489 variable</i> CE, and is used in following resets to 2490 <code>[last variable]</code> 2491 . 2492 </p> 2493 2494 <p>[first variable] and [first regular] and [first trailing] 2495 should be the first real such CEs (e.g., CE(U+0060 `)), as 2496 adjusted according to the tailoring, not the boundary CEs (see the 2497 FractionalUCA.txt first primary mappings starting with U+FDD1).</p> 2498 2499 <p> 2500 <code>[last regular]</code> 2501 is not actually the last normal CE with a primary weight before 2502 implicit primaries. It is used to tailor large numbers of characters, 2503 usually CJK, into the script=Hani range between the last regular 2504 script and the first implicit CE. (The first group of implicit CEs is 2505 for Han characters.) Therefore, 2506 <code>[last regular]</code> 2507 is set to the first Hani CE, the artificial script boundary CE at the 2508 beginning of this range. For example: 2509 <code>&[last regular]<*...</code> 2510 </p> 2511 2512 <p>The [last trailing] is the CE of U+FFFF. Tailoring to that is 2513 not allowed.</p> 2514 2515 <p> 2516 The 2517 <code>[last variable]</code> 2518 indicates the "highest" character that is treated as 2519 punctuation with alternate handling. 2520 </p> 2521 <p> 2522 The value can be changed by using the maxVariable setting. This takes 2523 effect, however, after the rules have been built, and does not affect 2524 any characters that are reset relative to the 2525 <code>[last variable]</code> 2526 value when the rules are being built. The maxVariable setting might 2527 also be changed via a runtime parameter. That also does not affect 2528 the rules.<br> (In CLDR 24 and earlier, the variable top could 2529 also be set by using a tailoring rule with 2530 <code>[variable top]</code> 2531 in the place of a relation string.) 2532 </p> 2533 2534 <h3> 2535 3.12 <a name="Special_Purpose_Commands" 2536 href="#Special_Purpose_Commands">Special-Purpose Commands</a> 2537 </h3> 2538 <p>The import command imports rules from another collation. This 2539 allows for better maintenance and smaller rule sizes. The source is a 2540 BCP 47 language tag with an optional collation type but without other 2541 extensions. The collation type is the BCP 47 form of the collation 2542 type in the source; it defaults to "standard".</p> 2543 <p> 2544 <em>Examples: </em> 2545 </p> 2546 <ul> 2547 <li><code>[import de-u-co-phonebk]</code> (not 2548 "...-co-phonebook")</li> 2549 <li><code>[import und-u-co-search]</code> (not 2550 "root-...")</li> 2551 <li><code>[import ja-u-co-private-kana]</code> (language 2552 "ja" required even when this import itself is in another "ja" 2553 tailoring.)</li> 2554 </ul> 2555 2556 <table> 2557 <caption> 2558 <a name="Special_Purpose_Elements" href="#Special_Purpose_Elements">Special-Purpose 2559 Elements</a> 2560 </caption> 2561 <tr> 2562 <th>Rule Syntax</th> 2563 </tr> 2564 <tr> 2565 <td>[suppressContractions [-]]</td> 2566 </tr> 2567 <tr> 2568 <td>[optimize [-]]</td> 2569 </tr> 2570 </table> 2571 <p> 2572 The <i>suppress contractions</i> tailoring command turns off any 2573 existing contractions that begin with those characters, as well as 2574 any prefixes for those characters. It is typically used to turn off 2575 the Cyrillic contractions in the UCA, since they are not used in many 2576 languages and have a considerable performance penalty. The argument 2577 is a <a href="tr35.html#Unicode_Sets">Unicode Set</a>. 2578 </p> 2579 2580 <p> 2581 The <i>suppress contractions</i> command has immediate effect on the 2582 current set of mappings, including mappings added by preceding rules. 2583 Following rules are processed after removing any context-sensitive 2584 mappings originating from any of the characters in the set. 2585 </p> 2586 2587 <p> 2588 The <i>optimize</i> tailoring command is purely for performance. It 2589 indicates that those characters are sufficiently common in the target 2590 language for the tailoring that their performance should be enhanced. 2591 </p> 2592 <p>The reason that these are not settings is so that their 2593 contents can be arbitrary characters.</p> 2594 2595 <hr width="50%"> 2596 <p> 2597 <i>Example:</i> 2598 </p> 2599 <p> 2600 The following is a simple example that combines portions of different 2601 tailorings for illustration. For more complete examples, see the 2602 actual locale data: <a 2603 href="http://unicode.org/repos/cldr/tags/latest/common/collation/ja.xml">Japanese</a>, 2604 <a 2605 href="http://unicode.org/repos/cldr/tags/latest/common/collation/zh.xml">Chinese</a>, 2606 <a 2607 href="http://unicode.org/repos/cldr/tags/latest/common/collation/sv.xml">Swedish</a>, 2608 and <a 2609 href="http://unicode.org/repos/cldr/tags/latest/common/collation/de.xml">German</a> 2610 (type="phonebook") are particularly illustrative. 2611 </p> 2612 <pre><collation> 2613 <cr><![CDATA[ 2614 [caseLevel on] 2615 &Z 2616 < <<< 2617 < <<< <<< aa <<< aA <<< Aa <<< AA 2618 < <<< 2619 < <<< << <<< 2620 < <<< << <<< 2621 &V <<<* wW 2622 &Y <<<* 2623 &[last non-ignorable] 2624 <span style="color: green"># The following is equivalent to <<<...</span> 2625 <* 2626 <* 2627 ]]></cr> 2628 </collation></pre> 2629 2630 <h3> 2631 3.13 <a name="Script_Reordering" href="#Script_Reordering">Collation 2632 Reordering</a> 2633 </h3> 2634 <p>Collation reordering allows scripts and certain other defined 2635 blocks of characters to be moved relative to each other 2636 parametrically, without changing the detailed rules for all the 2637 characters involved. This reordering is done on top of any specific 2638 ordering rules within the script or block currently in effect. 2639 Reordering can specify groups to be placed at the start and/or the 2640 end of the collation order. For example, to reorder Greek characters 2641 before Latin characters, and digits afterwards (but before other 2642 scripts), the following can be used:</p> 2643 <table> 2644 <tr> 2645 <th>Rule Syntax</th> 2646 <th>Locale Identifier</th> 2647 </tr> 2648 <tr> 2649 <td><code>[reorder Grek Latn digit]</code></td> 2650 <td><code>en-u-kr-grek-latn-digit</code></td> 2651 </tr> 2652 </table> 2653 <p> 2654 In each case, a sequence of <em><strong>reorder_codes</strong></em> 2655 is used, separated by spaces in the settings attribute and in rule 2656 syntax, and by hyphens in locale identifiers. 2657 </p> 2658 <p> 2659 A <strong><em>reorder_code</em></strong> is any of the following 2660 special codes: 2661 </p> 2662 <ol> 2663 <li><strong>space, punct, symbol, currency, digit</strong> - 2664 core groups of characters below 'a'</li> 2665 <li><strong>any script code</strong> except <strong>Common</strong> 2666 and <strong>Inherited</strong>. 2667 <ul> 2668 <li>Some pairs of scripts sort primary-equal and always 2669 reorder together. For example, Katakana characters are are always 2670 reordered with Hiragana.</li> 2671 </ul></li> 2672 <li><strong>others</strong> - where all codes not explicitly 2673 mentioned should be ordered. The script code <strong>Zzzz</strong> 2674 (Unknown Script) is a synonym for <strong>others</strong>.</li> 2675 </ol> 2676 <p>It is an error if a code occurs multiple times.</p> 2677 2678 <p> 2679 It is an error if the sequence of reorder codes is empty in the XML 2680 attribute or in the locale identifier. Some implementations may 2681 interpret an empty sequence in the 2682 <code>[reorder]</code> 2683 rule syntax as a reset to the DUCET ordering, synonymous with 2684 <code>[reorder others]</code> 2685 ; other implementations may forbid an empty sequence in the rule 2686 syntax as well. 2687 </p> 2688 2689 <p> 2690 Interaction with <strong>alternate=shifted</strong>: Whether a 2691 primary weight is variable is determined according to the variable 2692 top, before applying script reordering. Once that is determined, 2693 script reordering is applied to the primary weight regardless of 2694 whether it is regular (used in the primary level) or shifted 2695 (used in the quaternary level). 2696 </p> 2697 2698 <h4> 2699 3.13.1 <a name="Interpretation_reordering" 2700 href="#Interpretation_reordering">Interpretation of a reordering 2701 list</a> 2702 </h4> 2703 <p>The reordering list is interpreted as if it were processed in 2704 the following way.</p> 2705 <ol> 2706 <li>If any core code is not present, then it is inserted at the 2707 front of the list in the order given above.</li> 2708 <li>If the <strong>others</strong> code is not present, then it 2709 is inserted at the end of the list. 2710 </li> 2711 <li>The <strong>others</strong> code is replaced by the list of 2712 all script codes not explicitly mentioned, in DUCET order. 2713 </li> 2714 <li>The reordering list is now complete, and used to reorder 2715 characters in collation accordingly.</li> 2716 </ol> 2717 <p> 2718 The locale data may have a particular ordering. For example, the 2719 Czech locale data could put digits after all letters, with 2720 <code>[reorder others digit]</code> 2721 . Any reordering codes specified on top of that (such as with a bcp47 2722 locale identifier) completely replace what was there. To specify a 2723 version of collation that completely resets any existing reordering 2724 to the DUCET ordering, the single code <strong>Zzzz</strong> or <strong>others</strong> 2725 can be used, as below<strong></strong>. 2726 </p> 2727 <p> 2728 <em>Examples: </em> 2729 </p> 2730 <table cellpadding="0" cellspacing="0"> 2731 <tbody> 2732 <tr> 2733 <th>Locale Identifier</th> 2734 <th>Effect</th> 2735 </tr> 2736 <tr> 2737 <td><code>en-u-kr-latn-digit</code></td> 2738 <td>Reorder digits after Latin characters (but before other 2739 scripts like Cyrillic).</td> 2740 </tr> 2741 <tr> 2742 <td><code>en-u-kr-others-digit</code></td> 2743 <td>Reorder digits after all other characters.</td> 2744 </tr> 2745 <tr> 2746 <td><code>en-u-kr-arab-cyrl-others-symbol</code></td> 2747 <td>Reorder Arabic characters first, then Cyrillic, and put 2748 symbols at the endafter all other characters.</td> 2749 </tr> 2750 <tr> 2751 <td><code>en-u-kr-others</code></td> 2752 <td>Remove any locale-specific reordering, and use DUCET order 2753 for reordering blocks.</td> 2754 </tr> 2755 </tbody> 2756 </table> 2757 <p> 2758 The default reordering groups are defined by the FractionalUCA.txt 2759 file, based on the primary weights of associated collation elements. 2760 The file contains special mappings for the start of each group, 2761 script, and reorder-reserved range, see <i>Section 2.6.2, <a 2762 href="#File_Format_FractionalUCA_txt">FractionalUCA.txt</a></i>. 2763 </p> 2764 2765 <p>There are some special cases:</p> 2766 <ul> 2767 <li>The <strong>Hani</strong> group includes implicit weights 2768 for <em>Han characters</em> according to the UCA as well as any 2769 characters tailored relative to a Han character, or after <code>&[first 2770 Hani]</code>. 2771 </li> 2772 <li>Implicit weights for <em>unassigned code points</em> 2773 according to the UCA reorder as the last weights in the <strong>others</strong> 2774 (<strong>Zzzz</strong>) group.<br> There is no script code to 2775 explicitly reorder the unassigned-implicit weights into a particular 2776 position. (Unassigned-implicit weights are used for non-Hani code 2777 points without any mappings. For a given Unicode version they are 2778 the code points with General_Category values Cn, Co, Cs.) 2779 </li> 2780 <li>The TRAILING group, the FIELD-SEPARATOR (associated with 2781 U+FFFE), and collation elements with only zero primary weights are 2782 not reordered.</li> 2783 <li>The TERMINATOR, LEVEL-SEPARATOR, and SPECIAL groups are 2784 never associated with characters.</li> 2785 </ul> 2786 <p> 2787 For example, 2788 <code>reorder="Hani Zzzz Grek"</code> 2789 sorts Hani, Latin, Cyrillic, ... (all other scripts) ..., unassigned, 2790 Greek, TRAILING. 2791 </p> 2792 2793 <p>Notes for implementations that write sort keys:</p> 2794 <ul> 2795 <li>Primaries must always be offset by one or more whole primary 2796 lead bytes. (Otherwise the number of bytes in a fractional weight 2797 may change, compressible scripts may span multiple lead bytes, or 2798 trailing primary bytes may collide with separators and 2799 primary-compression terminators.)</li> 2800 <li>When a script is reordered that does not start and end on 2801 whole-primary-lead-byte boundaries, then the lead byte needs to be 2802 split, and a reserved byte is used up. The data supports this via 2803 reorder-reserved ranges of primary weights that are not used for 2804 collation elements.</li> 2805 <li>Primary weights from different original lead bytes can be 2806 reordered to a shared lead byte, as long as they do not overlap. 2807 Primary compression ends when the target lead byte differs or when 2808 the original lead byte of the next primary is not compressible.</li> 2809 <li>Non-compressible groups and scripts begin or end on 2810 whole-primary-lead-byte boundaries (or both), so that reordering 2811 cannot surround a non-compressible script by two compressible ones 2812 within the same target lead byte. This is so that primary 2813 compression can be terminated reliably (choosing the low or high 2814 terminator byte) simply by comparing the previous and current 2815 primary weights. Otherwise it would have to also check for another 2816 condition (e.g., equal scripts).</li> 2817 </ul> 2818 2819 <h4> 2820 3.13.2 <a name="Reordering_Groups_allkeys" 2821 href="#Reordering_Groups_allkeys">Reordering Groups for 2822 allkeys.txt</a> 2823 </h4> 2824 <p> 2825 For allkeys_CLDR.txt, the start of each reordering group can be 2826 determined from FractionalUCA.txt, by finding the first real mapping 2827 (after xyz first primary) of that group (e.g., 2828 <code>0060; [0D 07, 05, 05] # Zyyy Sk [0312.0020.0002] * GRAVE 2829 ACCENT</code> 2830 ), and looking for that mapping's character sequence ( 2831 <code>0060</code> 2832 ) in allkeys_CLDR.txt. The comment in FractionalUCA.txt ( 2833 <code>[0312.0020.0002]</code> 2834 ) also shows the allkeys_CLDR.txt collation elements. 2835 </p> 2836 2837 <p>The DUCET ordering of some characters is slightly different 2838 from the CLDR root collation order. The reordering groups for the 2839 DUCET are not specified. The following describes how reordering 2840 groups for the DUCET can be derived.</p> 2841 <p> 2842 For allkeys_DUCET.txt, the start of each reordering group is normally 2843 the primary weight corresponding to the same character sequence as 2844 for allkeys_CLDR.txt. In a few cases this requires adjustment, 2845 especially for the special reordering groups, due to CLDRs ordering 2846 the common characters more strictly by category than the DUCET (as 2847 described in <i>Section 2, <a href="#Root_Collation">Root 2848 Collation</a></i>). The necessary adjustment would set the start of each 2849 allkeys_DUCET.txt reordering group to the primary weight of the first 2850 mapping for the relevant General_Category for a special reordering 2851 group (for characters that sort before a), or the primary weight of 2852 the first mapping for the first script (e.g., sc=Grek) of an 2853 alphabetic group (for characters that sort at or after a). 2854 </p> 2855 <p>Note that the following only applies to primary weights greater 2856 than the one for U+FFFE and less than "trailing" weights.</p> 2857 <p>The special reordering groups correspond to General_Category 2858 values as follows:</p> 2859 <ul> 2860 <li>punct: P</li> 2861 <li>symbol: Sk, Sm, So</li> 2862 <li>space: Z, Cc</li> 2863 <li>currency: Sc</li> 2864 <li>digit: Nd</li> 2865 </ul> 2866 <p>In the DUCET, some characters that sort below a and have 2867 other General_Category values not mentioned above (e.g., gc=Lm) are 2868 also grouped with symbols. Variants of numbers (gc=No or Nl) can be 2869 found among punctuation, symbols, and digits.</p> 2870 <p>Each collation element of an expansion may be in a different 2871 reordering group, for example for parenthesized characters.</p> 2872 2873 <h3> 2874 3.14 <a name="Case_Parameters" href="#Case_Parameters">Case 2875 Parameters</a> 2876 </h3> 2877 <p> 2878 The <strong>case level</strong> is an <em>optional</em> intermediate 2879 level ("2.5") between Level 2 and Level 3 (or after Level 2880 1, if there is no Level 2 due to strength settings). The case level 2881 is used to support two parametric features: ignoring non-case 2882 variants (Level 3 differences) except for case, and giving case 2883 differences a higher-level priority than other tertiary differences. 2884 Distinctions between small and large Kana characters are also 2885 included as case differences, to support Japanese collation. 2886 </p> 2887 <p> 2888 The <strong>case first</strong> parameter controls whether to swap 2889 the order of upper and lowercase. It can be used with or without the 2890 case level. 2891 </p> 2892 <p> 2893 Importantly, the case parameters have no effect in many instances. 2894 For example, they have no effect on the comparison of two 2895 non-ignorable characters with different primary weights, or with 2896 different secondary weights if the strength = <strong>secondary 2897 (or higher).</strong> 2898 </p> 2899 <p> 2900 When either the <strong>case level</strong> or <strong>case 2901 first</strong> parameters are set, the following describes the derivation of 2902 the modified collation elements. It assumes the original levels for 2903 the code point are [p.s.t] (primary, secondary, tertiary). This 2904 derivation may change in future versions of LDML, to track the case 2905 characteristics more closely. 2906 </p> 2907 2908 <h4> 2909 3.14.1 <a name="Case_Untailored" href="#Case_Untailored">Untailored 2910 Characters</a> 2911 </h4> 2912 <p>For untailored characters and strings, that is, for mappings in 2913 the root collation, the case value for each collation element is 2914 computed from the tertiary weight listed in allkeys_CLDR.txt. This is 2915 used to modify the collation element.</p> 2916 <p>Look up a case value for the tertiary weight x of each 2917 collation element:</p> 2918 <ol> 2919 <li>UPPER if x {08-0C, 0E, 11, 12, 1D}</li> 2920 <li>UNCASED otherwise</li> 2921 <li>FractionalUCA.txt encodes the case information in bits 6 and 2922 7 of the first byte in each tertiary weight. The case bits are set 2923 to 00 for UNCASED and LOWERCASE, and 10 for UPPER. There is no MIXED 2924 case value (01) in the root collation.</li> 2925 </ol> 2926 2927 <h4> 2928 3.14.2 <a name="Case_Weights" href="#Case_Weights">Compute 2929 Modified Collation Elements</a> 2930 </h4> 2931 <p> 2932 From a computed case value, set a weight <strong>c</strong> according 2933 to the following. 2934 </p> 2935 <ol> 2936 <li>If <strong>CaseFirst=UpperFirst</strong>, set <strong>c</strong> 2937 = UPPER ? <strong>1</strong> : MIXED ? 2 : <strong>3</strong></li> 2938 <li>Otherwise set <strong>c</strong> = UPPER ? <strong>3</strong> 2939 : MIXED ? 2 : <strong>1</strong></li> 2940 </ol> 2941 <p> 2942 Compute a new collation element according to the following table. The 2943 notation <em>xt</em> means that the values are numerically combined 2944 into a single level, such that xt < yu whenever x < y. The 2945 fourth level (if it exists) is unaffected. Note that a secondary CE 2946 must have a secondary weight S which is greater than the secondary 2947 weight s of any primary CE; and a tertiary CE must have a tertiary 2948 weight T which is greater than the tertiary weight t of any primary 2949 or secondary CE ([<a 2950 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] <a 2951 href="http://www.unicode.org/reports/tr10/#WF2">WF2</a>). 2952 </p> 2953 2954 <div align="center"> 2955 <table> 2956 <tbody> 2957 <tr> 2958 <th>Case Level</th> 2959 <th>Strength</th> 2960 <th>Original CE</th> 2961 <th>Modified CE</th> 2962 <th>Comment</th> 2963 </tr> 2964 <tr> 2965 <td rowspan="5"><strong>on</strong></td> 2966 <td rowspan="2"><strong>primary</strong></td> 2967 <td><code>0.S.t</code></td> 2968 <td><code>0.0</code></td> 2969 <td rowspan="2">ignore case level weights of 2970 primary-ignorable CEs</td> 2971 </tr> 2972 <tr> 2973 <td><code>p.s.t</code></td> 2974 <td><code>p.c</code></td> 2975 </tr> 2976 <tr> 2977 <td rowspan="3"><strong>secondary<br> 2978 </strong>or higher</td> 2979 <td><code>0.0.T</code></td> 2980 <td><code>0.0.0.T</code></td> 2981 <td rowspan="3">ignore case level weights of 2982 secondary-ignorable CEs</td> 2983 </tr> 2984 <tr> 2985 <td><code>0.S.t</code></td> 2986 <td><code>0.S.c.t</code></td> 2987 </tr> 2988 <tr> 2989 <td><code>p.s.t</code></td> 2990 <td><code>p.s.c.t</code></td> 2991 </tr> 2992 <tr> 2993 <td rowspan="4"><strong>off</strong></td> 2994 <td rowspan="4">any</td> 2995 <td><code>0.0.0</code></td> 2996 <td><code>0.0.00</code></td> 2997 <td rowspan="4">ignore case level weights of 2998 tertiary-ignorable CEs</td> 2999 </tr> 3000 <tr> 3001 <td><code>0.0.T</code></td> 3002 <td><code> 0.0.3T </code></td> 3003 </tr> 3004 <tr> 3005 <td><code>0.S.t</code></td> 3006 <td><code>0.S.ct</code></td> 3007 </tr> 3008 <tr> 3009 <td><code>p.s.t</code></td> 3010 <td><code>p.s.ct</code></td> 3011 </tr> 3012 </tbody> 3013 </table> 3014 </div> 3015 3016 <p>For primary+case, which is used for ignore accents but not 3017 case collation, primary ignorables are ignored so that a = . For 3018 secondary+case, which would by analogy mean ignore variants but not 3019 case, secondary ignorables are ignored for equivalent behavior.</p> 3020 <p> 3021 When using <strong>caseFirst</strong> but not <strong>caseLevel</strong>, 3022 the combined case+tertiary weight of a tertiary CE must be greater 3023 than the combined case+tertiary weight of any primary or secondary CE 3024 so that [<a href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] 3025 <a href="http://www.unicode.org/reports/tr10/#WF2">well-formedness 3026 condition 2</a> is fulfilled. Since the tertiary CEs tertiary weight T 3027 is already greater than any t of primary or secondary CEs, it is 3028 sufficient to set its case weight to UPPER=3. It must not be affected 3029 by <strong>caseFirst=upper</strong>. (The table uses the constant 3 3030 in this case rather than the computed c.) 3031 </p> 3032 <p> 3033 The case weight of a tertiary-ignorable CE must be 0 so that [<a 3034 href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] <a 3035 href="http://www.unicode.org/reports/tr10/#WF1">well-formedness 3036 condition 1</a> is fulfilled. 3037 </p> 3038 3039 <h4> 3040 3.14.3 <a name="Case_Tailored" href="#Case_Tailored">Tailored 3041 Strings</a> 3042 </h4> 3043 <p>Characters and strings that are tailored have case values 3044 computed from their root collation case bits.</p> 3045 3046 <ol> 3047 <li>Look up the tailored strings root CEs. (Ignore any prefix 3048 or extension strings.) N=number of primary root CEs.</li> 3049 <li>Determine the number and type (primary vs. weaker) of CEs a 3050 tailored string maps to. M=number of primary tailored CEs.</li> 3051 <li>If N<=M (no more root than tailoring primary CEs): Copy 3052 the root case bits for primary CEs 0..N-1. 3053 <ul> 3054 <li>If N<M (fewer root primary CEs): Clear the case bits of 3055 the remaining tailored primary CEs. (uncased/lowercase/small Kana)</li> 3056 </ul> 3057 </li> 3058 <li>If N>M (more root primary CEs): Copy the root case bits 3059 for primary CEs 0..M-2. Set the case bits for tailored primary CE 3060 M-1 according to the remaining root primary CEs M-1..N-1: 3061 <ul> 3062 <li>Set to uncased/lower if all remaining root primary CEs 3063 have uncased/lower.</li> 3064 <li>Set to uppercase if all remaining root primary CEs have 3065 uppercase.</li> 3066 <li>Otherwise, set to mixed.</li> 3067 </ul> 3068 </li> 3069 <li>Clear the case bits for secondary CEs 0.s.t.</li> 3070 <li>Tertiary CEs 0.0.t must get uppercase bits.</li> 3071 <li>Tertiary-ignorable CEs 0.0.0 must get 3072 ignorable-case=lowercase bits.</li> 3073 </ol> 3074 <p class="note">Note: Almost all Cased characters have primary 3075 (non-ignorable) root collation CEs, except for U+0345 Combining 3076 Ypogegrammeni which is Lowercase. All Uppercase characters have 3077 primary root collation CEs.</p> 3078 3079 3080 <h3> 3081 3.15 <a name="Visibility" href="#Visibility">Visibility</a> 3082 </h3> 3083 <p> 3084 Collations have external visibility by default, meaning that they can 3085 be displayed in a list of collation options for users to choose from. 3086 A collation whose type name starts with "private-" is internal and 3087 should not be shown in such a list. Collations are typically internal 3088 when they are partial sequences included in other collations. See <i>Section 3089 3.1, <a href="#Collation_Types">Collation Types</a> 3090 </i>. 3091 </p> 3092 3093 <h3> 3094 3.16 <a name="Collation_Indexes" href="#Collation_Indexes">Collation 3095 Indexes</a> 3096 </h3> 3097 <h4> 3098 3.16.1 <a name="Index_Characters" href="#Index_Characters">Index 3099 Characters</a> 3100 </h4> 3101 <p> 3102 The main data includes <exemplarCharacters> for collation 3103 indexes. See <i>Part 2 General, Section 3, <a 3104 href="tr35-general.html#Character_Elements">Character Elements</a></i>, 3105 for general information about exemplar characters. 3106 </p> 3107 <p>The index characters are a set of characters for use as a UI 3108 "index", that is, a list of clickable characters (or character 3109 sequences) that allow the user to see a segment of a larger "target" 3110 list. Each character corresponds to a bucket in the target list. One 3111 may have different kinds of index lists; one that produces an index 3112 list that is relatively static, and the other is a list that produces 3113 roughly equally-sized buckets. While CLDR is mostly focused on the 3114 first, there is provision for supporting the second as well.</p> 3115 <p>The index characters need to be used in conjunction with a 3116 collation for the locale, which will determine the order of the 3117 characters. It will also determine which index characters show up.</p> 3118 <p>The static list would be presented as something like the 3119 following (either vertically or horizontally):</p> 3120 <p align="center">A B C D E F G H CH I J K L M N O P Q R S T U V 3121 W X Y Z</p> 3122 <p>In the "A" bucket, you would find all items that are primary 3123 greater than or equal to "A" in collation order, and primary less 3124 than "B". The use of the list requires that the target list be sorted 3125 according to the locale that is used to create that list. Although we 3126 say "character" above, the index character could be a sequence, like 3127 "CH" above. The index exemplar characters must always be used with a 3128 collation appropriate for the locale. Any characters that do not have 3129 primary differences from others in the set should be removed.</p> 3130 <p>Details:</p> 3131 <ol> 3132 <li>The primary weight (according to the collation) is used to 3133 determine which bucket a string is in. There are special buckets for 3134 before the first character, between buckets of different scripts, 3135 and after the last bucket (and of a different script).</li> 3136 <li>Characters in the <em>index characters</em> do not need to 3137 have distinct primary weights. That is, the <em>index 3138 characters</em> are adapted to the underlying collation: normally is 3139 in the bucket for Russian, but if someone used a variant of 3140 Russian collation that distinguished them on a primary level, then 3141 would show up as its own bucket. 3142 </li> 3143 <li>If an <em>index character</em> string ends with a single "*" 3144 (U+002A), for example "Sch*" and "St*" in German, then there will be 3145 a separate bucket for the string minus the "*", for example "Sch" 3146 and "St", even if that string does not sort distinctly. 3147 </li> 3148 <li>An <em>index character</em> can have multiple primary 3149 weights, for example "" and "Sch". Names that have the same initial 3150 primary weights sort into this <em>index character</em>s bucket. 3151 This can be achieved by using an upper-boundary string that is the 3152 concatenation of the <em>index character</em> and U+FFFF, for 3153 example "\uFFFF" and "Sch\uFFFF". Names that sort greater than this 3154 upper boundary but less than the next index character are redirected 3155 to the last preceding single-primary index character (A and S for 3156 the examples here). 3157 </li> 3158 </ol> 3159 <p> 3160 For example, for index characters 3161 <code>[A B R S {Sch*} {St*} T]</code> 3162 the following sample names are sorted into an index as shown. 3163 </p> 3164 <ul> 3165 <li>A — Adelbert, Afrika</li> 3166 <li> — sculap, Aesthet</li> 3167 <li>B — Berlin</li> 3168 <li>R — Rilke</li> 3169 <li>S — Sacher, Seiler, Sultan</li> 3170 <li>Sch — Schiller</li> 3171 <li>St — Steiff</li> 3172 <li>T — Thomas</li> 3173 </ul> 3174 <p> 3175 Theitems are special: each is a bucket for everything else, either 3176 less or greater. They are inserted at the start and end of the index 3177 list, <em>and</em> on script boundaries. Each script has its own 3178 range, except where scripts sort primary-equal (e.g., Hira & 3179 Kana). All characters that sort in one of the low reordering groups 3180 (whitespace, punctuation, symbols, currency symbols, digits) are 3181 treated as a single script for this purpose. 3182 </p> 3183 <p>If you tailor a Greek character into the Cyrillic script, that 3184 Greek character will be bucketed (and sorted) among the Cyrillic 3185 ones.</p> 3186 3187 <p> 3188 Even in an implementation that reorders groups of scripts rather than 3189 single scripts, for example Hebrew together with Phoenician and 3190 Samaritan, the index boundaries are really script boundaries, <em>not</em> 3191 multi-script-group boundaries. So if you had a collation that 3192 reordered Hebrew after Ethiopic, you would still get index boundaries 3193 between the following (and in that order): 3194 </p> 3195 <ol> 3196 <li>Ethiopic</li> 3197 <li>Hebrew</li> 3198 <li>Phoenician<em>// included in the Hebrew reordering 3199 group</em></li> 3200 <li>Samaritan<em>// included in the Hebrew reordering 3201 group</em></li> 3202 <li>Devanagari</li> 3203 </ol> 3204 <p>(Beginning with CLDR 27, single scripts can be reordered.)</p> 3205 <p>In the UI, an index character could also be omitted or grayed 3206 out if its bucket is empty. For example, if there is nothing in the 3207 bucket for Q, then Q could be omitted. That would be up to the 3208 implementation. Additional buckets could be added if other characters 3209 are present. For example, we might see something like the following:</p> 3210 <table border="1" cellspacing="0"> 3211 <tbody> 3212 <tr align="center"> 3213 <td><div align="center"> 3214 <strong>Sample Greek Index<br> 3215 </strong> 3216 </div></td> 3217 <td><strong>Contents<br> 3218 </strong></td> 3219 </tr> 3220 <tr align="center"> 3221 <td><div align="center"> 3222 </div></td> 3223 <td>With only content beginning with Greek letters<br> 3224 </td> 3225 </tr> 3226 <tr align="center"> 3227 <td><div align="center"> 3228 </div></td> 3229 <td>With some content before or after</td> 3230 </tr> 3231 <tr align="center"> 3232 <td><div align="center"> 9 3233 </div></td> 3234 <td>With numbers, and nothing between 9 and Alpha</td> 3235 </tr> 3236 <tr align="center"> 3237 <td><div align="center"> 3238 9<em>A-Z</em> 3239 3240 </div></td> 3241 <td>With numbers, some Latin</td> 3242 </tr> 3243 </tbody> 3244 </table> 3245 <p>Here is a sample of the XML structure:</p> 3246 <pre><exemplarCharacters type="index">[A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]</exemplarCharacters></pre> 3247 <p> 3248 The display of the index characters can be modified with the Index 3249 labels elements, discussed in the <i>Part 2 General, Section 3.3, 3250 <a href="tr35-general.html#IndexLabels">Index Labels</a> 3251 </i>. 3252 </p> 3253 3254 <h4> 3255 3.16.2 <a name="CJK_Index_Markers" href="#CJK_Index_Markers">CJK 3256 Index Markers</a> 3257 </h4> 3258 <p>Special index markers have been added to the CJK collations for 3259 stroke, pinyin, zhuyin, and unihan. These markers allow for effective 3260 and robust use of indexes for these collations.</p> 3261 <p>The per-language index exemplar characters are not useful for 3262 collation indexes for CJK because for each such language there are 3263 multiple sort orders in use (for example, Chinese pinyin vs. stroke 3264 vs. unihan vs. zhuyin), and these sort orders use very different 3265 index characters. In addition, sometimes the boundary strings are 3266 different from the bucket label strings. For collations that contain 3267 index markers, the boundary strings and bucket labels should be 3268 derived from those index markers, ignoring the index exemplar 3269 characters.</p> 3270 <p>For example, near the start of the pinyin tailoring there is 3271 the following:</p> 3272 <p> 3273 <p> A</p><!-- INDEX A --><br> 3274 <pc></pc><!-- --> 3275 </p> 3276 <p></p> 3277 <p> 3278 <pc></pc><!-- ao --><br> <p> 3279 B</p><!-- INDEX B --> 3280 </p> 3281 <p>These indicate the boundaries of "buckets" that can 3282 be used for indexing. They are always two characters starting with 3283 the noncharacter U+FDD0, and thus will not occur in normal text. For 3284 pinyin the second character is A-Z; for unihan it is one of the 3285 radicals; and for stroke it is a character after U+2800 indicating 3286 the number of strokes, such as . For zhuyin the second character is 3287 one of the standard Bopomofo characters in the range U+3105 through 3288 U+3129.</p> 3289 3290 <p>The corresponding bucket label strings are the boundary strings 3291 with the leading U+FDD0 removed. For example, the Pinyin boundary 3292 string "\uFDD0A" yields the label string "A".</p> 3293 3294 <p>However, for stroke order, the label string is the stroke count 3295 (second character minus U+2800) as a decimal-digit number followed by 3296 劃 (U+5283). For example, the stroke order boundary string 3297 "\uFDD0\u2805" yields the label string "5劃".</p> 3298 3299 <hr> 3300 <p class="copyright"> 3301 Copyright 20012018 Unicode, Inc. All 3302 Rights Reserved. The Unicode Consortium makes no expressed or implied 3303 warranty of any kind, and assumes no liability for errors or 3304 omissions. No liability is assumed for incidental and consequential 3305 damages in connection with or arising out of the use of the 3306 information or programs contained or accompanying this technical 3307 report. The Unicode <a href="http://unicode.org/copyright.html">Terms 3308 of Use</a> apply. 3309 </p> 3310 <p class="copyright">Unicode and the Unicode logo are trademarks 3311 of Unicode, Inc., and are registered in some jurisdictions.</p> 3312 </div> 3313 3314 </body> 3315 3316 </html> 3317