      1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
      2 "http://www.w3.org/TR/html4/loose.dtd">
      3 <html>
      4 
      5 <head>
      6 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      7 <meta http-equiv="Content-Language" content="en-us">
      8 <link rel="stylesheet" href="http://www.unicode.org/reports/reports.css"
      9 	type="text/css">
     10 <title>UTS #35: Unicode LDML: Collation</title>
     11 <style type="text/css">
     12 <!--
     13 .dtd {
     14 	font-family: monospace;
     15 	font-size: 90%;
     16 	background-color: #CCCCFF;
     17 	border-style: dotted;
     18 	border-width: 1px;
     19 }
     20 
     21 .xmlExample {
     22 	font-family: monospace;
     23 	font-size: 80%
     24 }
     25 
     26 .blockedInherited {
     27 	font-style: italic;
     28 	font-weight: bold;
     29 	border-style: dashed;
     30 	border-width: 1px;
     31 	background-color: #FF0000
     32 }
     33 
     34 .inherited {
     35 	font-weight: bold;
     36 	border-style: dashed;
     37 	border-width: 1px;
     38 	background-color: #00FF00
     39 }
     40 
     41 .element {
     42 	font-weight: bold;
     43 	color: red;
     44 }
     45 
     46 .attribute {
     47 	font-weight: bold;
     48 	color: maroon;
     49 }
     50 
     51 .attributeValue {
     52 	font-weight: bold;
     53 	color: blue;
     54 }
     55 
     56 li, p {
     57 	margin-top: 0.5em;
     58 	margin-bottom: 0.5em
     59 }
     60 
     61 h2, h3, h4, table {
     62 	margin-top: 1.5em;
     63 	margin-bottom: 0.5em;
     64 }
     65 -->
     66 </style>
     67 </head>
     68 
     69 <body>
     70 
     71 	<table class="header" width="100%">
     72 		<tr>
     73 			<td class="icon"><a href="http://unicode.org"> <img
     74 					alt="[Unicode]" src="http://unicode.org/webscripts/logo60s2.gif"
     75 					width="34" height="33"
     76 					style="vertical-align: middle; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px; border-top-width: 0px;"></a>&nbsp;
     77 				<a class="bar" href="http://www.unicode.org/reports/">Technical
     78 					Reports</a></td>
     79 		</tr>
     80 		<tr>
     81 			<td class="gray">&nbsp;</td>
     82 		</tr>
     83 	</table>
     84 	<div class="body">
     85 		<h2 style="text-align: center">
     86 			Unicode Technical
     87 			Standard #35
     88 		</h2>
     89 		<h1>
     90 			Unicode Locale Data Markup Language (LDML)<br>Part 5: Collation
     91 		</h1>
     92 
     93 		<!-- At least the first row of this header table should be identical across the parts of this UTS. -->
     94 		<table border="1" cellpadding="2" cellspacing="0" class="wide">
     95 			<tr>
     96 				<td>Version</td>
     97 				<td>34</td>
     98 			</tr>
     99 			<tr>
    100 				<td>Editors</td>
    101 				<td><a
    102 					href="https://plus.google.com/117587389715494866571?rel=author">
						Markus Scherer</a> (<a href="mailto:markus.icu@gmail.com">markus.icu@gmail.com</a>)
    104 					and <a href="tr35.html#Acknowledgments">other CLDR committee
    105 						members</a></td>
    106 			</tr>
    107 		</table>
    108 
    109 		<p>
    110 			For the full header, summary, and status, see <a href="tr35.html">
    111 				Part 1: Core</a>
    112 		</p>
    113 
    114 		<h3>
    115 			<i>Summary</i>
    116 		</h3>
    117 		<p>
    118 			This document describes parts of an XML format (<i>vocabulary</i>)
    119 			for the exchange of structured locale data. This format is used in
    120 			the <a href="http://cldr.unicode.org/">Unicode Common Locale Data
    121 				Repository</a>.
    122 		</p>
    123 
    124 		<p>
    125 			This is a partial document, describing only those parts of the LDML
    126 			that are relevant for collation (sorting, searching &amp; grouping).
    127 			For the other parts of the LDML see the <a href="tr35.html">main
    128 				LDML document</a> and the links above.
    129 		</p>
    130 
    131 		<h3>
    132 			<i>Status</i>
    133 		</h3>
    134 
    135 		<!-- NOT YET APPROVED 
    136 		<p>
    137 				<i class="changed">This is a<b><font color="#ff3333">
    138 				draft </font></b>document which may be updated, replaced, or superseded by
    139 				other documents at any time. Publication does not imply endorsement
    140 				by the Unicode Consortium. This is not a stable document; it is
    141 				inappropriate to cite this document as other than a work in
    142 				progress.
    143 			</i>
    144 		</p>
    145 		 END NOT YET APPROVED -->
    146 		<!-- APPROVED -->
    147 		<p>
    148 			<i>This document has been reviewed by Unicode members and other
    149 				interested parties, and has been approved for publication by the
    150 				Unicode Consortium. This is a stable document and may be used as
    151 				reference material or cited as a normative reference by other
    152 				specifications.</i>
    153 		</p>
    154 		<!-- END APPROVED -->
    155 
    156 
    157 		<blockquote>
    158 			<p>
    159 				<i><b>A Unicode Technical Standard (UTS)</b> is an independent
    160 					specification. Conformance to the Unicode Standard does not imply
    161 					conformance to any UTS.</i>
    162 			</p>
    163 		</blockquote>
    164 		<p>
    165 			<i>Please submit corrigenda and other comments with the CLDR bug
    166 				reporting form [<a href="tr35.html#Bugs">Bugs</a>]. Related
    167 				information that is useful in understanding this document is found
    168 				in the <a href="tr35.html#References">References</a>. For the latest
    169 				version of the Unicode Standard see [<a href="tr35.html#Unicode">Unicode</a>].
    170 				For a list of current Unicode Technical Reports see [<a
    171 				href="tr35.html#Reports">Reports</a>]. For more information about
    172 				versions of the Unicode Standard, see [<a href="tr35.html#Versions">Versions</a>].
    173 			</i>
    174 		</p>
    175 		<h2>
    176 			<a name="Parts" href="#Parts">Parts</a>
    177 		</h2>
    178 
    179 		<!-- This section of Parts should be identical in all of the parts of this UTS. -->
    180 		<p>The LDML specification is divided into the following parts:</p>
    181 		<ul class="toc">
    182 			<li>Part 1: <a href="tr35.html#Contents">Core</a> (languages,
    183 				locales, basic structure)
    184 			</li>
    185 			<li>Part 2: <a href="tr35-general.html#Contents">General</a>
    186 				(display names &amp; transforms, etc.)
    187 			</li>
    188 			<li>Part 3: <a href="tr35-numbers.html#Contents">Numbers</a>
    189 				(number &amp; currency formatting)
    190 			</li>
    191 			<li>Part 4: <a href="tr35-dates.html#Contents">Dates</a> (date,
    192 				time, time zone formatting)
    193 			</li>
    194 			<li>Part 5: <a href="tr35-collation.html#Contents">Collation</a>
    195 				(sorting, searching, grouping)
    196 			</li>
    197 			<li>Part 6: <a href="tr35-info.html#Contents">Supplemental</a>
    198 				(supplemental data)
    199 			</li>
    200 			<li>Part 7: <a href="tr35-keyboards.html#Contents">Keyboards</a>
    201 				(keyboard mappings)
    202 			</li>
    203 		</ul>
    204 
    205 		<h2>
    206 			<a name="Contents" href="#Contents">Contents of Part 5, Collation</a>
    207 		</h2>
    208 		<!-- START Generated TOC: CheckHtmlFiles -->
    209 		<ul class="toc">
    210 			<li>1 <a href="#CLDR_Collation">CLDR Collation</a>
    211 				<ul class="toc">
    212 					<li>1.1 <a href="#CLDR_Collation_Algorithm">CLDR Collation
    213 							Algorithm</a>
    214 						<ul class="toc">
    215 							<li>1.1.1 <a href="#Algorithm_FFFE">U+FFFE</a></li>
    216 							<li>1.1.2 <a href="#Context_Sensitive_Mappings">Context-Sensitive
    217 									Mappings</a></li>
    218 							<li>1.1.3 <a href="#Algorithm_Case">Case Handling</a></li>
    219 							<li>1.1.4 <a href="#Algorithm_Reordering_Groups">Reordering
    220 									Groups</a></li>
    221 							<li>1.1.5 <a href="#Combining_Rules">Combining Rules</a></li>
    222 						</ul>
    223 					</li>
    224 				</ul>
    225 			</li>
    226 			<li>2 <a href="#Root_Collation">Root Collation</a>
    227 				<ul class="toc">
    228 					<li>2.1 <a href="#grouping_classes_of_characters">Grouping
    229 							classes of characters</a></li>
    230 					<li>2.2 <a href="#non_variable_symbols">Non-variable
    231 							symbols</a></li>
    232 					<li>2.3 <a href="#tibetan_contractions">Additional
    233 							contractions for Tibetan</a></li>
    234 					<li>2.4 <a href="#tailored_noncharacter_weights">Tailored
    235 							noncharacter weights</a></li>
    236 					<li>2.5 <a href="#Root_Data_Files">Root Collation Data
    237 							Files</a></li>
    238 					<li>2.6 <a href="#Root_Data_File_Formats">Root Collation
    239 							Data File Formats</a>
    240 						<ul class="toc">
    241 							<li>2.6.1 <a href="#File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a></li>
    242 							<li>2.6.2 <a href="#File_Format_FractionalUCA_txt">FractionalUCA.txt</a></li>
    243 							<li>2.6.3 <a href="#File_Format_UCA_Rules_txt">UCA_Rules.txt</a></li>
    244 						</ul>
    245 					</li>
    246 				</ul>
    247 			</li>
    248 			<li>3 <a href="#Collation_Tailorings">Collation Tailorings</a>
    249 				<ul class="toc">
    250 					<li>3.1 <a href="#Collation_Types">Collation Types</a>
    251 						<ul class="toc">
    252 							<li>3.1.1 <a href="#Collation_Type_Fallback">Collation
    253 									Type Fallback</a>
    254 								<ul class="toc">
    255 									<li>Table: <a
    256 										href="#Sample_requested_and_actual_collation_locales_and_types">Sample
    257 											requested and actual collation locales and types</a></li>
    258 								</ul>
    259 							</li>
    260 						</ul>
    261 					</li>
    262 					<li>3.2 <a href="#Collation_Version">Version</a></li>
    263 					<li>3.3 <a href="#Collation_Element">Collation Element</a></li>
    264 					<li>3.4 <a href="#Setting_Options">Setting Options</a>
    265 						<ul class="toc">
    266 							<li>Table: <a href="#Collation_Settings">Collation
    267 									Settings</a></li>
    268 							<li>3.4.1 <a href="#Common_Settings">Common settings
    269 									combinations</a></li>
    270 							<li>3.4.2 <a href="#Normalization_Setting">Notes on the
    271 									normalization setting</a></li>
    272 							<li>3.4.3 <a href="#Variable_Top_Settings">Notes on
    273 									variable top settings</a></li>
    274 						</ul>
    275 					</li>
    276 					<li>3.5 <a href="#Rules">Collation Rule Syntax</a></li>
    277 					<li>3.6 <a href="#Orderings">Orderings</a>
    278 						<ul class="toc">
    279 							<li>Table: <a href="#Specifying_Collation_Ordering">Specifying
    280 									Collation Ordering</a></li>
    281 							<li>Table: <a href="#Abbreviating_Ordering_Specifications">Abbreviating
    282 									Ordering Specifications</a></li>
    283 						</ul>
    284 					</li>
    285 					<li>3.7 <a href="#Contractions">Contractions</a>
    286 						<ul class="toc">
    287 							<li>Table: <a href="#Specifying_Contractions">Specifying
    288 									Contractions</a></li>
    289 						</ul>
    290 					</li>
    291 					<li>3.8 <a href="#Expansions">Expansions</a></li>
    292 					<li>3.9 <a href="#Context_Before">Context Before</a>
    293 						<ul class="toc">
    294 							<li>Table: <a href="#Specifying_Previous_Context">Specifying
    295 									Previous Context</a></li>
    296 						</ul>
    297 					</li>
    298 					<li>3.10 <a href="#Placing_Characters_Before_Others">Placing
    299 							Characters Before Others</a></li>
    300 					<li>3.11 <a href="#Logical_Reset_Positions">Logical Reset
    301 							Positions</a>
    302 						<ul class="toc">
    303 							<li>Table: <a href="#Specifying_Logical_Positions">Specifying
    304 									Logical Positions</a></li>
    305 						</ul>
    306 					</li>
    307 					<li>3.12 <a href="#Special_Purpose_Commands">Special-Purpose
    308 							Commands</a>
    309 						<ul class="toc">
    310 							<li>Table: <a href="#Special_Purpose_Elements">Special-Purpose
    311 									Elements</a></li>
    312 						</ul>
    313 					</li>
    314 					<li>3.13 <a href="#Script_Reordering">Collation Reordering</a>
    315 						<ul class="toc">
    316 							<li>3.13.1 <a href="#Interpretation_reordering">Interpretation
    317 									of a reordering list</a></li>
    318 							<li>3.13.2 <a href="#Reordering_Groups_allkeys">Reordering
    319 									Groups for allkeys.txt</a></li>
    320 						</ul>
    321 					</li>
    322 					<li>3.14 <a href="#Case_Parameters">Case Parameters</a>
    323 						<ul class="toc">
    324 							<li>3.14.1 <a href="#Case_Untailored">Untailored
    325 									Characters</a></li>
    326 							<li>3.14.2 <a href="#Case_Weights">Compute Modified
    327 									Collation Elements</a></li>
    328 							<li>3.14.3 <a href="#Case_Tailored">Tailored Strings</a></li>
    329 						</ul>
    330 					</li>
    331 					<li>3.15 <a href="#Visibility">Visibility</a></li>
    332 					<li>3.16 <a href="#Collation_Indexes">Collation Indexes</a>
    333 						<ul class="toc">
    334 							<li>3.16.1 <a href="#Index_Characters">Index Characters</a></li>
    335 							<li>3.16.2 <a href="#CJK_Index_Markers">CJK Index
    336 									Markers</a></li>
    337 						</ul>
    338 					</li>
    339 				</ul>
    340 			</li>
    341 		</ul>
    342 		<!-- END Generated TOC: CheckHtmlFiles -->
    343 
    344 		<h2>
    345 			1 <a name="CLDR_Collation" href="#CLDR_Collation">CLDR Collation</a>
    346 		</h2>
    347 		<p>Collation is the general term for the process and function of
    348 			determining the sorting order of strings of characters, for example
    349 			for lists of strings presented to users, or in databases for sorting
    350 			and selecting records.</p>
    351 
    352 		<p>Collation varies by language, by application (some languages
    353 			use special phonebook sorting), and other criteria (for example,
    354 			phonetic vs. visual).</p>
    355 
    356 		<p>
    357 			CLDR provides collation data for many languages and styles. The data
    358 			supports not only sorting but also language-sensitive searching and
    359 			grouping under index headers. All CLDR collations are based on the [<a
    360 				href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] default
    361 			order, with common modifications applied in the CLDR root collation,
    362 			and further tailored for language and style as needed.
    363 		</p>
    364 
    365 		<h3>
    366 			1.1 <a name="CLDR_Collation_Algorithm"
    367 				href="#CLDR_Collation_Algorithm">CLDR Collation Algorithm</a>
    368 		</h3>
    369 
    370 		<p>
    371 			The CLDR collation algorithm is an extension of the <a
    372 				href="http://www.unicode.org/reports/tr10/#Main_Algorithm">Unicode
    373 				Collation Algorithm</a>.
    374 		</p>
    375 
    376 		<h4>
    377 			1.1.1 <a name="Algorithm_FFFE" href="#Algorithm_FFFE">U+FFFE</a>
    378 		</h4>
    379 
    380 		<p>
    381 			U+FFFE maps to a CE with a minimal, unique primary weight. Its
    382 			primary weight is not "variable": U+FFFE must not become ignorable in
    383 			alternate handling. On the identical level, a minimal, unique
    384 			weight must be emitted for U+FFFE as well. This allows for <a
    385 				href="http://www.unicode.org/reports/tr10/#Merging_Sort_Keys">Merging
    386 				Sort Keys</a> within code point space.
    387 		</p>
    388 		<p>
    389 			For example, when sorting names in a database, a sortable string can
    390 			be formed with <em>last_name</em> + '\uFFFE' + <em>first_name</em>.
    391 			These strings would sort properly, without ever comparing the last
    392 			part of a last name with the first part of another first name.
    393 		</p>
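		<p>The effect of a minimal separator weight can be sketched as
			follows. This is an illustration only: the class, the method names,
			and the toy single-level weights are assumptions, not CLDR data or
			an ICU API. U+FFFE gets a weight below every other character, so the
			comparison of two merged strings is decided by the last names before
			any character of a first name is looked at.</p>

```java
// Illustrative sketch: toy one-level weights, not real collation elements.
final class LowSeparatorDemo {
    static final char SEP = '\uFFFE';

    // Toy primary weight: U+FFFE gets the minimal weight 0,
    // every other character gets its code unit value plus one.
    static int weight(char c) {
        return c == SEP ? 0 : c + 1;
    }

    // Compares two strings weight by weight (a stand-in for a collator).
    static int compare(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int d = Integer.compare(weight(a.charAt(i)), weight(b.charAt(i)));
            if (d != 0) {
                return d;
            }
        }
        return Integer.compare(a.length(), b.length());
    }

    // Hypothetical helper: last_name + U+FFFE + first_name.
    static String sortable(String lastName, String firstName) {
        return lastName + SEP + firstName;
    }

    public static void main(String[] args) {
        // Plain concatenation compares "z" of the first name with "n" of the
        // other last name, so "li zoe" wrongly sorts after "lin amy".
        System.out.println(compare("li" + "zoe", "lin" + "amy") > 0);
        // With the separator, "li" sorts before "lin" as intended.
        System.out.println(compare(sortable("li", "zoe"), sortable("lin", "amy")) < 0);
    }
}
```

		<p>A conformant collator performs the equivalent weighting
			internally; application code only needs to build the merged string.</p>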
    394 
    395 		<p>
    396 			For backwards secondary level sorting, text <i>segments</i> separated
    397 			by U+FFFE are processed in forward segment order, and <i>within</i>
    398 			each segment the secondary weights are compared backwards. This is so
    399 			that such combined strings are processed consistently with merging
    400 			their sort keys (for example, by concatenating them level by level
    401 			with a low separator).
    402 		</p>
    403 
    404 		<p class="note">
    405 			Note: With unique, low weights on <i>all</i> levels it is possible to
    406 			achieve
    407 			<code>sortkey(str1 + "\uFFFE" + str2) ==
    408 				mergeSortkeys(sortkey(str1), sortkey(str2))</code>
    409 			. When that is not necessary, then code can be a little simpler (no
    410 			special handling for U+FFFE except for backwards-secondary), sort
    411 			keys can be a little shorter (when using compressible common
    412 			non-primary weights for U+FFFE), and another low weight can be used
    413 			in tailorings.
    414 		</p>
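		<p>The equality in the note can be illustrated with toy sort keys.
			The sketch below assumes two levels (primary and tertiary) with
			invented weights, and assumes Java 9 or later for
			<code>Arrays.compare</code>; none of the names are part of CLDR or
			ICU. U+FFFE receives the same low weight on every level, so the key
			of the merged string equals the level-by-level merge of the two
			separate keys.</p>

```java
import java.util.Arrays;

// Illustrative sketch: toy two-level sort keys with invented weights.
final class MergeSortKeysDemo {
    static final char SEP = '\uFFFE';
    static final int LOW = 1; // separator weight, below every real weight

    // Returns {primary weights, tertiary weights}; letters only.
    // U+FFFE maps to LOW on both levels.
    static int[][] sortKey(String s) {
        int[] primary = new int[s.length()];
        int[] tertiary = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            primary[i] = (c == SEP) ? LOW : Character.toLowerCase(c);
            tertiary[i] = (c == SEP) ? LOW : (Character.isUpperCase(c) ? 3 : 2);
        }
        return new int[][] {primary, tertiary};
    }

    // Merges two keys level by level, with LOW between the two parts.
    static int[][] merge(int[][] a, int[][] b) {
        int[][] out = new int[a.length][];
        for (int level = 0; level < a.length; level++) {
            int[] m = Arrays.copyOf(a[level], a[level].length + 1 + b[level].length);
            m[a[level].length] = LOW;
            System.arraycopy(b[level], 0, m, a[level].length + 1, b[level].length);
            out[level] = m;
        }
        return out;
    }

    // Compares keys level by level: all primaries first, then tertiaries.
    static int compare(int[][] x, int[][] y) {
        for (int level = 0; level < x.length; level++) {
            int d = Arrays.compare(x[level], y[level]);
            if (d != 0) {
                return d;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        int[][] merged = merge(sortKey("a"), sortKey("B"));
        System.out.println(Arrays.deepEquals(sortKey("a\uFFFEB"), merged));
    }
}
```

		<p>Because all primary weights are compared before any tertiary
			weight, a primary difference in the second part outranks a case
			difference in the first part.</p>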
    415 
    416 		<h4>
    417 			1.1.2 <a name="Context_Sensitive_Mappings"
    418 				href="#Context_Sensitive_Mappings">Context-Sensitive Mappings</a>
    419 		</h4>
    420 
    421 		<p>Contraction matching, as in the UCA, starts from the first
    422 			character of the contraction string. It slows down processing of that
    423 			first character even when none of its contractions matches. In some
			cases, it is preferable to change such contractions to mappings with
    425 			a prefix (context before a character), so that complex processing is
    426 			done only when the less-frequently occurring trailing character is
    427 			encountered.</p>
    428 
		<p>For example, the DUCET contains contractions for several
			variants of L&middot; (L followed by middle dot). Collating ASCII text is
			slowed down by contraction matching starting with L/l. In the CLDR
			root collation, these contractions are replaced by prefix mappings
			(L|&middot;) which are triggered only when the middle dot is encountered.
			CLDR also uses prefix rules in the Japanese tailoring, for processing
			of Hiragana/Katakana length and iteration marks.</p>
    436 
		<p>The mapping is conditional on the prefix match but does not
			change the mappings for the preceding text. As a result, a
			contraction mapping for "px" can be replaced by a prefix rule "p|x"
			only if px maps to the collation elements for p followed by the
			collation elements for "x if after p". In the DUCET, L&middot; maps to CE(L)
			followed by a special secondary CE (which differs from CE(&middot;) when &middot;
			is not preceded by L). In the CLDR root collation, L has no
			context-sensitive mappings, but &middot; maps to that special secondary CE
			if preceded by L.</p>
    446 
    447 		<p>A prefix mapping for p|x behaves mostly like the contraction
    448 			px, except when there is a contraction that overlaps with the prefix,
    449 			for example one for "op". A contraction matches only new text (and
    450 			consumes it), while a prefix matches only already-consumed text.</p>
    451 		<ul>
    452 			<li>With mappings for "op" and "px", only the first contraction
    453 				matches in text "opx". (It consumes the "op" characters, and there
    454 				is no context-sensitive mapping for x.)</li>
    455 			<li>With mappings for "op" and "p|x", both the contraction and
    456 				the prefix rule match in text "opx". (The prefix always matches
    457 				already-consumed characters, regardless of whether they mapped as
    458 				part of contractions.)</li>
    459 		</ul>
    460 
    461 		<p class="note">
    462 			Note: Matching of discontiguous contractions should be implemented
    463 			without rewriting the text (unlike in the [<a
    464 				href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] algorithm
    465 			specification), so that prefix matching is predictable. (It should
    466 			also help with contraction matching performance.) An implementation
    467 			that does rewrite the text, as in the UCA, will get different results
    468 			for some (unusual) combinations of contractions, prefix rules, and
    469 			input text.
    470 		</p>
    471 
    472 		<p>Prefix matching uses a simple longest-match algorithm (op|c
    473 			wins over p|c). It is recommended that prefix rules be limited to
    474 			mappings where both the prefix string and the mapped string begin
    475 			with an NFC boundary (that is, with a normalization starter that does
    476 			not combine backwards). (In op|ch both o and c should be starters
    477 			(ccc=0) and NFC_QC=Yes.) Otherwise, prefix matching would be affected
    478 			by canonical reordering and discontiguous matching, like
    479 			contractions. Prefix matching is thus always contiguous.</p>
    480 
    481 		<p>A character can have mappings with both prefixes (context
    482 			before) and contraction suffixes. Prefixes are matched first. This is
			to keep them reasonably implementable: When there is a mapping with
			both a prefix and a contraction suffix (as in the Japanese tailoring), then
    485 			the matching needs to go in both directions. The contraction might
    486 			involve discontiguous matching, which needs complex text iteration
    487 			and handling of skipped combining marks, and will consume the
    488 			matching suffix. Prefix matching should be first because, regardless
    489 			of whether there is a match, the implementation will always return to
    490 			the original text index (right after the prefix) from where it will
    491 			start to look at all of the contractions for that prefix.</p>
    492 
    493 		<p>If there is a match for a prefix but no match for any of the
    494 			suffixes for that prefix, then fall back to mappings with the
    495 			next-longest matching prefix, and so on, ultimately to mappings with
    496 			no prefix. (Otherwise mappings with longer prefixes would hide
    497 			mappings with shorter prefixes.)</p>
    498 
		<p>Consider the following mappings.</p>
		<ol>
			<li>p &rarr; CE(p)</li>
			<li>h &rarr; CE(h)</li>
			<li>c &rarr; CE(c)</li>
			<li>ch &rarr; CE(d)</li>
			<li>p|c &rarr; CE(u)</li>
			<li>p|ci &rarr; CE(v)</li>
			<li>p|&#x0109; &rarr; CE(w)</li>
			<li>op|ck &rarr; CE(x)</li>
		</ol>

		<p>With these, text collates like this:</p>
		<ul>
			<li>pc &rarr; CE(p)CE(u)</li>
			<li>pci &rarr; CE(p)CE(v)</li>
			<li>pch &rarr; CE(p)CE(u)CE(h)</li>
			<li>p&#x0109; &rarr; CE(p)CE(w)</li>
			<li>pc&#x0323;&#x0302; &rarr; CE(p)CE(w)CE(U+0323) // discontiguous</li>
			<li>opck &rarr; CE(o)CE(p)CE(x)</li>
			<li>opch &rarr; CE(o)CE(p)CE(u)CE(h)</li>
		</ul>

		<p>
			However, if the mapping p|c &rarr; CE(u) is missing, then text "pch" maps
			to CE(p)CE(d), "opch" maps to CE(o)CE(p)CE(d), and "pc&#x0323;&#x0302;" maps to
			CE(p)CE(c)CE(U+0323)CE(U+0302) (because discontiguous contraction
			matching extends <i>an existing match</i> by one non-starter at a
			time).
		</p>
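		<p>The matching order described in this section (longest matching
			prefix first, then the longest contraction for that prefix, falling
			back to shorter prefixes) can be sketched as below. This is a
			minimal illustration under stated assumptions: the class and method
			names are hypothetical, collation elements are represented by
			labels such as "CE(u)", characters without a mapping fall back to
			"CE(char)", and discontiguous matching is not implemented, so the
			mapping with the combining circumflex is left out.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of contiguous prefix and contraction matching.
final class PrefixMappingDemo {
    // One mapping: context before (may be empty), mapped string, CE label.
    static final class Rule {
        final String prefix;
        final String str;
        final String ce;

        Rule(String prefix, String str, String ce) {
            this.prefix = prefix;
            this.str = str;
            this.ce = ce;
        }
    }

    // Parses "p|ci" into prefix "p" and string "ci"; "ch" has no prefix.
    static Rule rule(String spec, String ce) {
        int bar = spec.indexOf('|');
        return bar < 0
                ? new Rule("", spec, ce)
                : new Rule(spec.substring(0, bar), spec.substring(bar + 1), ce);
    }

    static List<String> map(String text, List<Rule> rules) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            Rule best = null;
            for (Rule r : rules) {
                // The prefix matches already-consumed text; the string
                // matches (and will consume) new text starting at i.
                boolean matches = text.substring(0, i).endsWith(r.prefix)
                        && text.startsWith(r.str, i);
                if (!matches) {
                    continue;
                }
                // Prefer the longest prefix, then the longest string.
                // Shorter prefixes are used only when no longer one matches.
                if (best == null
                        || r.prefix.length() > best.prefix.length()
                        || (r.prefix.length() == best.prefix.length()
                                && r.str.length() > best.str.length())) {
                    best = r;
                }
            }
            if (best == null) {
                out.add("CE(" + text.charAt(i) + ")"); // assumed default mapping
                i++;
            } else {
                out.add(best.ce);
                i += best.str.length();
            }
        }
        return out;
    }

    // The sample mappings from the text, optionally without p|c.
    static List<Rule> sampleRules(boolean withPC) {
        List<Rule> rules = new ArrayList<>(Arrays.asList(
                rule("p", "CE(p)"), rule("h", "CE(h)"), rule("c", "CE(c)"),
                rule("ch", "CE(d)"), rule("p|ci", "CE(v)"), rule("op|ck", "CE(x)")));
        if (withPC) {
            rules.add(rule("p|c", "CE(u)"));
        }
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(map("opch", sampleRules(true)));
        System.out.println(map("opch", sampleRules(false)));
    }
}
```

		<p>In this sketch, removing p|c lets the contraction "ch" match
			after p, which mirrors the "pch" and "opch" cases above.</p>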
    529 
    530 		<h4>
    531 			1.1.3 <a name="Algorithm_Case" href="#Algorithm_Case">Case
    532 				Handling</a>
    533 		</h4>
    534 		<p>
    535 			CLDR specifies how to sort lowercase or uppercase first, as a
    536 			stronger distinction than other tertiary variants (<strong>caseFirst</strong>)
    537 			or while completely ignoring all other tertiary distinctions (<strong>caseLevel</strong>).
			See <i>Section 3.4 <a href="#Setting_Options">Setting Options</a></i>
			and <i>Section 3.14 <a href="#Case_Parameters">Case
					Parameters</a></i>.
    541 		</p>
    542 
    543 		<h4>
    544 			1.1.4 <a name="Algorithm_Reordering_Groups"
    545 				href="#Algorithm_Reordering_Groups">Reordering Groups</a>
    546 		</h4>
    547 		<p>CLDR specifies how to do parametric reordering of groups of
    548 			scripts (e.g., native script first) as well as special groups
    549 			(e.g., digits after letters), and provides data for the effective
    550 			implementation of such reordering.</p>
    551 
    552 		<h4>
    553 			1.1.5 <a name="Combining_Rules"
    554 				href="#Combining_Rules">Combining Rules</a>
    555 		</h4>
    556 		<p>Rules from different sources can be combined, with the later rules overriding the earlier ones. The following is an example of how this can be useful.</p>
		<p>CLDR provides a root collation of type &quot;emoji&quot;, so &quot;-u-co-emoji&quot; in a Unicode locale identifier selects that ordering.</p>
    558 		<p>Example, using ICU:</p>
    559 		<blockquote>
    560 		  <p>collator = Collator.getInstance(ULocale.forLanguageTag(&quot;en-u-co-emoji&quot;));  </p>
    561 	  </blockquote>
		<p>However, the emoji collation type supplants the language's customizations, so the above is equivalent to:</p>
    563 		<blockquote>
    564 		  <p>collator = Collator.getInstance(ULocale.forLanguageTag(&quot;und-u-co-emoji&quot;));  </p>
    565 	  </blockquote>
		<p>The same structure will not work for a language that does require customization, like Danish. That is, the following does not produce the Danish ordering; it yields the root emoji ordering instead.</p>
    567 		<blockquote>
    568 		  <p> collator = Collator.getInstance(ULocale.forLanguageTag(&quot;da-u-co-emoji&quot;));  </p>
    569 	  </blockquote>
    570 		<p>For that, a slightly more cumbersome method needs to be employed, which is to take the rules for Danish, and explicitly add the rules for emoji. </p>
    571 		<blockquote>
    572 		  <p>RuleBasedCollator collator = new RuleBasedCollator(<br>
    573 		    ((RuleBasedCollator) Collator.getInstance(ULocale.forLanguageTag(&quot;da&quot;))).getRules() +<br>
    574 		    ((RuleBasedCollator) Collator.getInstance(ULocale.forLanguageTag(&quot;und-u-co-emoji&quot;)))<br>
    575 	      .getRules());</p>
    576 	  </blockquote>
		<p>The following table shows the differences. When emoji ordering is supported, the two faces will be adjacent. When Danish ordering is supported, the &uuml; is after the y.</p>
    578 		<table class='simple'>
    579 		  <tbody>
    580 		    <tr>
    581 		      <td>code point order</td>
    582 		      <td>,</td>
    583 		      <td></td>
    584 		      <td></td>
    585 		      <td>Z</td>
    586 		      <td>a</td>
    587 		      <td>y</td>
    588 		      <td></td>
    589 		      <td></td>
    590 		      <td></td>
    591 		      <td></td>
    592 		      <td></td>
    593 	        </tr>
    594 		    <tr>
    595 		      <td>en</td>
    596 		      <td>,</td>
    597 		      <td></td>
    598 		      <td></td>
    599 		      <td></td>
    600 		      <td>a</td>
    601 		      <td></td>
    602 		      <td>y</td>
    603 		      <td>Z</td>
    604 		      <td></td>
    605 	        </tr>
    606 		    <tr>
    607 		      <td>en-u-co-emoji</td>
    608 		      <td>,</td>
    609 		      <td></td>
    610 		      <td></td>
    611 		      <td></td>
    612 		      <td>a</td>
    613 		      <td></td>
    614 		      <td>y</td>
    615 		      <td>Z</td>
    616 		      <td></td>
    617 	        </tr>
    618 		    <tr>
    619 		      <td>da</td>
    620 		      <td>,</td>
    621 		      <td></td>
    622 		      <td></td>
    623 		      <td></td>
    624 		      <td>a</td>
    625 		      <td>y</td>
    626 		      <td><strong><u></u></strong></td>
    627 		      <td>Z</td>
    628 		      <td></td>
    629 	        </tr>
    630 		    <tr>
    631 		      <td>da-u-co-emoji</td>
    632 		      <td>,</td>
    633 		      <td></td>
    634 		      <td></td>
    635 		      <td></td>
    636 		      <td>a</td>
    637 		      <td><strong><u></u></strong></td>
    638 		      <td>y</td>
    639 		      <td>Z</td>
    640 		      <td></td>
    641 	        </tr>
    642 		    <tr>
    643 		      <td>combined rules</td>
    644 		      <td>,</td>
    645 		      <td></td>
    646 		      <td></td>
    647 		      <td></td>
    648 		      <td>a</td>
    649 		      <td>y</td>
    650 		      <td><strong><u></u></strong></td>
    651 		      <td>Z</td>
    652 		      <td></td>
    653 	        </tr>
    654 	      </tbody>
    655 	  </table>
    656 
    660 
    661 		<h2>
    662 			2 <a name="Root_Collation" href="#Root_Collation">Root Collation</a>
    663 		</h2>
    664 		<p>
    665 			The CLDR root collation order is based on the <a
    666 				href="http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table">Default
    667 				Unicode Collation Element Table (DUCET)</a> defined in <em>UTS #10:
    668 				Unicode Collation Algorithm</em> [<a
    669 				href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>]. It is
    670 			used by all other locales by default, or as the base for their
    671 			tailorings. (For a chart view of the UCA, see Collation Chart [<a
    672 				href="tr35.html#UCAChart">UCAChart</a>].)
    673 		</p>
    674 		<p>Starting with CLDR 1.9, CLDR uses modified tables for the root
    675 			collation order. The root locale ordering is tailored in the
    676 			following ways:</p>
    677 
    678 		<h3>
    679 			2.1 <a name="grouping_classes_of_characters"
    680 				href="#grouping_classes_of_characters">Grouping classes of
    681 				characters</a>
    682 		</h3>
    683 		<p>As of Version 6.1.0, the DUCET puts characters into the
    684 			following ordering:</p>
    685 		<ul>
    686 			<li>First &quot;common characters&quot;: whitespace,
    687 				punctuation, general symbols, some numbers, currency symbols, and
    688 				other numbers.</li>
    689 			<li>Then &quot;script characters&quot;: Latin, Greek, and the
    690 				rest of the scripts.</li>
    691 		</ul>
    692 		<p>(There are a few exceptions to this general ordering.)</p>
    693 		<p>The CLDR root locale modifies the DUCET tailoring by ordering
    694 			the common characters more strictly by category:</p>
    695 		<ul>
    696 			<li>whitespace, punctuation, general symbols, currency symbols,
    697 				and numbers.</li>
    698 		</ul>
		<p>The regrouping allows users to parametrically
			reorder the groups. For example, users can reorder numbers after all
			scripts, or reorder Greek before Latin.</p>
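		<p>The idea of parametric group reordering can be sketched with a
			toy comparator. Everything here is an assumption made for
			illustration: the group names follow the categories listed above,
			characters are classified with <code>java.lang.Character</code>
			rather than with CLDR data, and order within a group is plain code
			point order. A real implementation reorders ranges of primary
			weights instead.</p>

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch: compares strings by (group rank, code point).
final class GroupReorderDemo {
    // Toy default order: common characters first, then scripts.
    static final List<String> DEFAULT = Arrays.asList(
            "space", "punct", "symbol", "currency", "digit", "LATIN", "GREEK");

    // Coarse group of a code point, based on its general category or script.
    static String group(int cp) {
        if (Character.isWhitespace(cp)) {
            return "space";
        }
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return "punct";
            case Character.CURRENCY_SYMBOL:
                return "currency";
            case Character.MATH_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return "symbol";
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.LETTER_NUMBER:
            case Character.OTHER_NUMBER:
                return "digit";
            default:
                return Character.UnicodeScript.of(cp).name();
        }
    }

    // Groups missing from the list sort after all listed groups.
    static int rank(List<String> order, int cp) {
        int idx = order.indexOf(group(cp));
        return idx < 0 ? order.size() : idx;
    }

    static Comparator<String> byGroups(List<String> order) {
        return (a, b) -> {
            int[] ca = a.codePoints().toArray();
            int[] cb = b.codePoints().toArray();
            for (int i = 0; i < Math.min(ca.length, cb.length); i++) {
                int d = Integer.compare(rank(order, ca[i]), rank(order, cb[i]));
                if (d == 0) {
                    d = Integer.compare(ca[i], cb[i]);
                }
                if (d != 0) {
                    return d;
                }
            }
            return Integer.compare(ca.length, cb.length);
        };
    }

    public static void main(String[] args) {
        List<String> digitsLast = Arrays.asList(
                "space", "punct", "symbol", "currency", "LATIN", "GREEK", "digit");
        System.out.println(byGroups(DEFAULT).compare("1", "a") < 0);
        System.out.println(byGroups(digitsLast).compare("1", "a") > 0);
    }
}
```

		<p>Changing only the order list moves whole groups, while the
			relative order inside each group stays the same.</p>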
    702 		<p>The relative order within each of these groups still matches
    703 			the DUCET. Symbols, punctuation, and numbers that are grouped with a
    704 			particular script stay with that script. The differences between CLDR
    705 			and the DUCET order are:</p>
    706 		<ol>
    707 			<li>CLDR groups the numbers together after currency symbols,
    708 				instead of splitting them with some before and some after. Thus the
    709 				following are put <em>after</em> currencies and just before all the
    710 				other numbers.
    711 				<blockquote>
    712 					<p>
						U+09F4 ( &#x09F4; ) [No] BENGALI CURRENCY NUMERATOR ONE<br> ...<br>
						U+1D371 ( &#x1D371; ) [No] COUNTING ROD TENS DIGIT NINE
    715 					</p>
    716 				</blockquote>
    717 			</li>
    718 			<li>CLDR handles a few other characters differently
    719 				<ol>
					<li>U+10A7F ( &#x10A7F; ) [Po] OLD SOUTH ARABIAN NUMERIC INDICATOR is
						put with punctuation, not symbols</li>
					<li>U+20A8 ( &#x20A8; ) [Sc] RUPEE SIGN and U+FDFC ( &#xFDFC; ) [Sc] RIAL
						SIGN are put with currency signs, not with R and REH.</li>
    724 				</ol>
    725 			</li>
    726 		</ol>
    727 
    728 		<h3>
    729 			2.2 <a name="non_variable_symbols" href="#non_variable_symbols">Non-variable
    730 				symbols</a>
    731 		</h3>
    732 		<p>
    733 			There are multiple <a
    734 				href="http://www.unicode.org/reports/tr10/#Variable_Weighting">Variable-Weighting</a>
    735 			options in the UCA for symbols and punctuation, including <em>non-ignorable</em>
    736 			and <em>shifted</em>. With the <em>shifted</em> option, almost all
    737 			symbols and punctuation are ignored, except at a fourth level. The
    738 			CLDR root locale ordering is modified so that symbols are not
    739 			affected by the <em>shifted</em> option. That is, by default, symbols
    740 			are not variable in CLDR. So <em>shifted</em> only causes
    741 			whitespace and punctuation to be ignored, but not symbols (like &#x2665;).
    742 			The DUCET behavior can be specified with a locale ID using the
    743 			&quot;kv&quot; keyword, to set the Variable section to include all of
    744 			the symbols below it, or be set parametrically where implementations
    745 			allow access.
    746 		</p>
    747 		<p>See also:</p>
    748 		<ul>
    749 			<li><i>Section 3.3, <a href="#Setting_Options">Setting
    750 						Options</a></i></li>
    751 			<li><a href="http://www.unicode.org/charts/collation/">http://www.unicode.org/charts/collation/</a></li>
    752 		</ul>
    753 
    754 		<h3>
    755 			2.3 <a name="tibetan_contractions" href="#tibetan_contractions">Additional
    756 				contractions for Tibetan</a>
    757 		</h3>
    758 		<p>
    759 			Ten contractions are added for Tibetan: Two to fulfill <a
    760 				href="http://www.unicode.org/reports/tr10/#WF5">well-formedness
    761 				condition 5</a>, and eight more to preserve the default order for
    762 			Tibetan. For details see <i>UTS #10, Section 3.8.2, <a
    763 				href="http://www.unicode.org/reports/tr10/#Well_Formed_DUCET">Well-Formedness
    764 					of the DUCET</a></i>.
    765 		</p>
    766 
    767 		<h3>
    768 			2.4 <a name="tailored_noncharacter_weights"
    769 				href="#tailored_noncharacter_weights">Tailored noncharacter
    770 				weights</a>
    771 		</h3>
    772 		<p>U+FFFE and U+FFFF have special tailorings:</p>
    773 		<blockquote>
    774 			<p>
    775 				<strong>U+FFFF: </strong>This code point is tailored to have a
    776 				primary weight higher than all other characters. This allows the
    777 				reliable specification of a range, such as &ldquo;Sch&rdquo; &le; X &le;
    778 				&ldquo;Sch\uFFFF&rdquo;, to include all strings starting with
    779 				&quot;sch&quot; or equivalent.
    780 			</p>
    781 			<p>
    782 				<strong>U+FFFE: </strong>This code point produces a CE with minimal,
    783 				unique weights on primary and identical levels. For details see the
    784 				<i><a href="#Algorithm_FFFE">CLDR Collation Algorithm</a></i> above.
    785 			</p>
    786 		</blockquote>
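		<p>As a rough illustration of the U+FFFF range technique, the
			following Python sketch selects all strings in such a range. It
			does not use a real collator: <code>toy_key</code> is a hypothetical
			primary-only key (case-folded code points), and U+FFFF is mapped to
			a value above every code point to mimic the tailored highest primary
			weight.</p>

```python
import bisect

# Stands in for the tailored "higher than all other characters" primary of U+FFFF.
HIGHEST = 0x110000

def toy_key(s):
    """Hypothetical primary-only sort key: case-folded code points."""
    return tuple(HIGHEST if ch == "\uFFFF" else ord(ch) for ch in s.casefold())

def prefix_range(sorted_names, prefix):
    """Return the names X with key(prefix) <= key(X) <= key(prefix + U+FFFF)."""
    keys = [toy_key(name) for name in sorted_names]
    lo = bisect.bisect_left(keys, toy_key(prefix))
    hi = bisect.bisect_right(keys, toy_key(prefix + "\uFFFF"))
    return sorted_names[lo:hi]

names = sorted(["Sauer", "Sch", "schmidt", "Schulz", "Sci", "Tang"], key=toy_key)
matches = prefix_range(names, "Sch")
print(matches)
```

		<p>With a real collator, the same idea would apply to its sort keys
			instead of <code>toy_key</code>.</p>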
    787 		<p>
    788 			UCA (beginning with version 6.3) also maps <strong>U+FFFD</strong> to
    789 			a special collation element with a very high primary weight, so that
    790 			it is reliably non-<a
    791 				href="http://www.unicode.org/reports/tr10/#Variable_Weighting">variable</a>,
    792 			for use with <a
    793 				href="http://www.unicode.org/reports/tr10/#Handling_Illformed">ill-formed
    794 				code unit sequences</a>.
    795 		</p>
    796 		<p>
    797 			In CLDR, so as to maintain the special collation elements, <strong>U+FFFD..U+FFFF
    798 			</strong> are not further tailorable, and nothing can tailor to them. That is,
    799 			none of them can occur in a collation rule. For example, the following
    800 			rules are illegal:
    801 		</p>
    802 		<p>
    803 			<code>&amp;\uFFFF &lt; x</code>
    804 		</p>
    805 		<p>
    806 			<code>&amp;x &lt;\uFFFF</code>
    807 			<br>
    808 		</p>
    809 
    810 		<p class="note">
    811 			<b>Note:</b>
    812 		</p>
    813 		<ul>
    814 			<li class="note">Java uses an early version of this collation
    815 				syntax, but has not been updated recently. It does not support any
    816 				of the syntax marked with [...], and its default table is neither the
    817 				DUCET nor the CLDR root collation.</li>
    818 		</ul>
    819 
    820 		<h3>
    821 			2.5 <a name="Root_Data_Files" href="#Root_Data_Files">Root
    822 				Collation Data Files</a>
    823 		</h3>
    824 		<p>
    825 			The CLDR root collation data files are in the CLDR repository and
    826 			release, under the path <a
    827 				href="http://unicode.org/repos/cldr/tags/latest/common/uca/">common/uca/</a>.
    828 		</p>
    829 
    830 		<p>
    831 			For most data files there are <strong>_SHORT</strong> versions
    832 			available. They contain the same data but only minimal comments, to
    833 			reduce the file sizes.
    834 		</p>
    835 
    836 		<p>Comments with DUCET-style weights in files other than
    837 			allkeys_CLDR.txt and allkeys_DUCET.txt use the weights defined in
    838 			allkeys_CLDR.txt.</p>
    839 		<ul>
    840 			<li><strong>allkeys_CLDR</strong> - A file that provides a
    841 				remapping of UCA DUCET weights for use with CLDR.</li>
    842 			<li><strong>allkeys_DUCET</strong> - The same as DUCET
    843 				allkeys.txt, but in alternate=non-ignorable sort order, for easier
    844 				comparison with allkeys_CLDR.txt.</li>
    845 			<li><strong>FractionalUCA</strong> - A file that provides a
    846 				remapping of UCA DUCET weights for use with CLDR. The weight values
    847 				are modified:
    848 				<ul>
    849 					<li>The weights have variable length, with 1..4 bytes each.
    850 						Each secondary or tertiary weight currently uses at most 2 bytes.</li>
    851 					<li>There are tailoring gaps between adjacent weights, so that
    852 						a number of characters can be tailored to sort between any two
    853 						root collation elements.</li>
    854 					<li>There are collation elements with primary weights at the
    855 						boundaries between reordering groups and Unicode scripts, so that
    856 						tailoring around the first or last primary of a group/script
    857 						results in new collation elements that sort and reorder together
    858 						with that group or script. These boundary weights also define the
    859 						primary weight ranges for parametric group and script reordering.
    860 					</li>
    861 				</ul> An implementation may modify the weights further to fit the needs
    862 				of its data structures.</li>
    863 			<li><strong>UCA_Rules</strong> - A file that specifies the root
    864 				collation order in the form of <a href="#Collation_Tailorings">tailoring
    865 					rules</a>. This is only an approximation of the FractionalUCA data,
    866 				since the rule syntax cannot express every detail of the collation
    867 				elements. For example, in the DUCET and in FractionalUCA, tertiary
    868 				differences are usually expressed with special tertiary weights on
    869 				all collation elements of an expansion, while a typical from-rules
    870 				builder will modify the tertiary weight of only one of the collation
    871 				elements.</li>
    872 			<li><strong>CollationTest_CLDR</strong> - The CLDR versions of
    873 				the CollationTest files, which use the tailorings for CLDR. For
    874 				information on the format, see <a
    875 				href="http://www.unicode.org/Public/UCA/latest/CollationTest.html">CollationTest.html</a>
    876 				in the <a href="http://www.unicode.org/reports/tr10/#Data10">UCA
    877 					data directory</a>.
    878 				<ul>
    879 					<li>CollationTest_CLDR_NON_IGNORABLE.txt</li>
    880 					<li>CollationTest_CLDR_SHIFTED.txt</li>
    881 				</ul></li>
    882 		</ul>
    883 
    884 		<h3>
    885 			2.6 <a name="Root_Data_File_Formats" href="#Root_Data_File_Formats">Root
    886 				Collation Data File Formats</a>
    887 		</h3>
    888 
    889 		<p>The file formats may change between versions of CLDR. The
    890 			formats for CLDR 23 and beyond are as follows. As usual, text after a
    891 			# is a comment.</p>
    892 
    893 		<h4>
    894 			2.6.1 <a name="File_Format_allkeys_CLDR_txt"
    895 				href="#File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a>
    896 		</h4>
    897 		<p>
    898 			This file defines CLDR's tailoring of the DUCET, as described in <i>Section
    899 				2, <a href="#Root_Collation">Root Collation</a>
    900 			</i>.
    901 		</p>
    902 		<p>
    903 			The format is similar to that of <a
    904 				href="http://www.unicode.org/reports/tr10/#File_Format">allkeys.txt</a>,
    905 			although there may be some differences in whitespace.
    906 		</p>
    907 
    908 		<h4>
    909 			2.6.2 <a name="File_Format_FractionalUCA_txt"
    910 				href="#File_Format_FractionalUCA_txt">FractionalUCA.txt</a>
    911 		</h4>
    912 		<p>The format is illustrated by the following sample lines, with
    913 			commentary afterwards.</p>
    914 		<pre>[UCA version = 6.0.0]</pre>
    915 		<blockquote>
    916 			<p>Provides the version number of the UCA table.</p>
    917 		</blockquote>
    918 
    919 		<pre>[Unified_Ideograph 4E00..9FCC FA0E..FA0F FA11 FA13..FA14 FA1F FA21 FA23..FA24 FA27..FA29 3400..4DB5 20000..2A6D6 2A700..2B734 2B740..2B81D]</pre>
    920 		<blockquote>
    921 			<p>
    922 				Lists the ranges of Unified_Ideograph characters in collation order.
    923 				(New in CLDR 24.) They map to collation elements with <a
    924 					href="http://www.unicode.org/reports/tr10/#Implicit_Weights">implicit
    925 					(constructed) primary weights</a>.
    926 			</p>
    927 		</blockquote>
    928 
    929 		<pre>[radical 6=:----]
    930 [radical 210=:--]
    931 [radical 210'=:]
    932 [radical end]</pre>
    933 		<blockquote>
    934 			<p>
    935 				Data for Unihan radical-stroke order. (New in CLDR 26.) Following
    936 				the [Unified_Ideograph] line, a section of
    937 				<code>[radical ...]</code>
    938 				lines defines a radical-stroke order of the Unified_Ideograph
    939 				characters.
    940 			</p>
    941 
    942 			<p>
    943 				For Han characters, an implementation may choose either to implement
    944 				the order defined in the UCA and the [Unified_Ideograph] data, or to
    945 				implement the order defined by the
    946 				<code>[radical ...]</code>
    947 				lines. Beginning with CLDR 26, the CJK type="unihan" tailorings
    948 				assume that the root collation order sorts Han characters in Unihan
    949 				radical-stroke order according to the
    950 				<code>[radical ...]</code>
    951 				data. The CollationTest_CLDR files only contain Han characters that
    952 				are in the same relative order using implicit weights or the
    953 				radical-stroke order.
    954 			</p>
    955 
    956 			<p>
    957 				The root collation radical-stroke order is derived from the first
    958 				(normative) values of the <a
    959 					href="http://www.unicode.org/reports/tr38/#kRSUnicode">Unihan
    960 					kRSUnicode</a> field for each Han character. Han characters are ordered
    961 				by radical, with traditional forms sorting before simplified ones.
    962 				Characters with the same radical are ordered by residual stroke
    963 				count. Characters with the same radical-stroke values are ordered by
    964 				block and code point, as for <a
    965 					href="http://www.unicode.org/reports/tr10/#Implicit_Weights">UCA
    966 					implicit weights</a>.
    967 			</p>
    968 
    969 			<p>
    970 				There is one
    971 				<code>[radical ...]</code>
    972 				line per radical, in the order of radical numbers. Each line shows
    973 				the radical number and the representative characters from the <a
    974 					href="http://www.unicode.org/reports/tr44/#UCD_Files_Table">UCD
    975 					file CJKRadicals.txt</a>, followed by a colon (:) and the Han
    976 				characters with that radical in the order as described above. A
    977 				range like
    978 				<code>-</code>
    979 				indicates that the code points in that range sort in code point
    980 				order.
    981 			</p>
    982 
    983 			<p>
    984 				The radical number and characters are informational. The sort order
    985 				is established only by the order of the
    986 				<code>[radical ...]</code>
    987 				lines, and within each line by the characters and ranges between the
    988 				colon (:) and the bracket (]).
    989 			</p>
    990 
    991 			<p>
    992 				Each Unified_Ideograph occurs exactly once. Only Unified_Ideograph
    993 				characters are listed on
    994 				<code>[radical ...]</code>
    995 				lines.
    996 			</p>
    997 
    998 			<p>
    999 				This section is terminated with one
   1000 				<code>[radical end]</code>
   1001 				line.
   1002 			</p>
   1003 		</blockquote>
   1004 
   1005 		<pre>0000; [,,]     # Zyyy Cc       [0000.0000.0000]        * &lt;NULL&gt;</pre>
   1006 		<blockquote>
   1007 			<p>
   1008 				Provides a weight line. The first element (before the &quot;;&quot;)
   1009 				is a hex codepoint sequence. The second field is a sequence of
   1010 				collation elements. Each collation element has 3 parts separated by
   1011 				commas: the primary weight, secondary weight, and tertiary weight.
   1012 				The tertiary weight actually consists of two components: the top two
   1013 				bits (0xC0) are used for the <em>case level</em>, and should be
   1014 				masked off where a case level is not used.
   1015 			</p>
   1016 			<p>A weight is either empty (meaning a zero or ignorable weight)
   1017 				or is a sequence of one or more bytes. The bytes are interpreted as
   1018 				a &quot;fraction&quot;, meaning that the ordering is 04 &lt; 05 05
   1019 				&lt; 06. The weights are constructed so that no weight is an initial
   1020 				subsequence of another: that is, having both the weights 05 and 05
   1021 				05 is illegal. The above line consists of all ignorable weights.</p>
   1022 			<p>The vertical bar (|) character is used to indicate context,
   1023 				as in:</p>
   1024 		</blockquote>
   1025 		<pre>006C | 00B7; [, DB A9, 05]</pre>
   1026 		<blockquote>
   1027 			This example indicates that if U+00B7 appears immediately after
   1028 			U+006C, it is given the corresponding collation element instead. This
   1029 			syntax is roughly equivalent to the following contraction, but is
   1030 			more efficient. For details see the specification of <i><a
   1031 				href="#Context_Sensitive_Mappings">Context-Sensitive Mappings</a></i>
   1032 			above.
   1033 		</blockquote>
   1034 		<pre>006C 00B7; <em>CE(006C)</em> [, DB A9, 05]</pre>
   1035 		<blockquote>
   1036 			<p>Single-byte primary weights are given to particularly frequent
   1037 				characters, such as space, digits, and a-z. More frequent characters
   1038 				are given two-byte weights, while relatively infrequent characters
   1039 				are given three-byte weights. For example:</p>
   1040 		</blockquote>
   1041 		<pre>...
   1042 0009; [03 05, 05, 05] # Zyyy Cc       [0100.0020.0002]        * &lt;CHARACTER TABULATION&gt;
   1043 ...
   1044 1B60; [06 14 0C, 05, 05]    # Bali Po       [0111.0020.0002]        * BALINESE PAMENENG
   1045 ...
   1046 0031; [14, 05, 05]    # Zyyy Nd       [149B.0020.0002]        * DIGIT ONE</pre>
   1047 		<blockquote>
   1048 			<p>The assignment of 2 vs 3 bytes does not reflect importance, or
   1049 				exact frequency.</p>
   1050 		</blockquote>
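		<p>The weight-line format described above can be read with a small
			parser. The following Python sketch is only an illustration: the
			function names are hypothetical, it handles plain weight lines only
			(not bracketed settings or <code>[U+...]</code> references), and it
			assumes that the case bits are the top two bits of the first
			tertiary byte.</p>

```python
import re

# code points (optionally "context | chars") ";" one or more [p, s, t] elements
LINE_RE = re.compile(r"^\s*([0-9A-Fa-f |]+);\s*((?:\[[^\]]*\]\s*)+)")
CE_RE = re.compile(r"\[([^\]]*)\]")

def parse_weight(text):
    """Turn '03 05' into a byte string; an empty field is an ignorable weight."""
    return bytes(int(b, 16) for b in text.split())

def parse_code_points(text):
    return tuple(int(cp, 16) for cp in text.split())

def parse_mapping(line):
    """Parse one weight line into (context, code_points, collation_elements)."""
    match = LINE_RE.match(line.split("#", 1)[0])  # drop the trailing comment
    if match is None:
        return None
    context, _, chars = match.group(1).rpartition("|")
    ces = [tuple(parse_weight(w) for w in ce.split(","))
           for ce in CE_RE.findall(match.group(2))]
    return parse_code_points(context), parse_code_points(chars), ces

def strip_case_bits(tertiary):
    """Mask off the case-level bits (0xC0), assumed to be in the first byte."""
    if not tertiary:
        return tertiary
    return bytes([tertiary[0] & 0x3F]) + tertiary[1:]

tab = parse_mapping("0009; [03 05, 05, 05] # Zyyy Cc  * <CHARACTER TABULATION>")
ctx = parse_mapping("006C | 00B7; [, DB A9, 05]")
print(tab, ctx)

# Byte strings compare lexicographically, which matches the "fraction" ordering.
assert b"\x04" < b"\x05\x05" < b"\x06"
```

		<p>Storing each weight as a byte string keeps the fractional ordering
			without any padding to a fixed width.</p>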
   1051 
   1052 		<pre>
   1053 3041; [76 06, 05, 03]	# Hira Lo	[3888.0020.000D]	* HIRAGANA LETTER SMALL A
   1054 3042; [76 06, 05, 85]	# Hira Lo	[3888.0020.000E]	* HIRAGANA LETTER A
   1055 30A1; [76 06, 05, 10]	# Kana Lo	[3888.0020.000F]	* KATAKANA LETTER SMALL A
   1056 30A2; [76 06, 05, 9E]	# Kana Lo	[3888.0020.0011]	* KATAKANA LETTER A</pre>
   1057 		<blockquote>
   1058 			<p>
   1059 				Beginning with CLDR 27, some primary or secondary collation elements
   1060 				may have below-common tertiary weights (e.g.,
   1061 				<code>03</code>
   1062 				), in particular to allow normal Hiragana letters to have common
   1063 				tertiary weights.
   1064 			</p>
   1065 		</blockquote>
   1066 
   1067 		<pre># SPECIAL MAX/MIN COLLATION ELEMENTS
   1068 FFFE; [02, 05, 05]     # Special LOWEST primary, for merge/interleaving
   1069 FFFF; [EF FE, 05, 05]  # Special HIGHEST primary, for ranges</pre>
   1070 		<blockquote>
   1071 			<p>The two tailored noncharacters have their own primary weights.
   1072 			</p>
   1073 		</blockquote>
   1074 
   1075 		<pre>
   1076 F967; [U+4E0D]  # Hani Lo       [FB40.0020.0002][CE0D.0000.0000]        * CJK COMPATIBILITY IDEOGRAPH-F967
   1077 2F02; [U+4E36, 10]      # Hani So       [FB40.0020.0004][CE36.0000.0000]        * KANGXI RADICAL DOT
   1078 2E80; [U+4E36, 70, 20]  # Hani So       [FB40.0020.0004][CE36.0000.0000][0000.00FC.0004]        * CJK RADICAL REPEAT</pre>
   1079 		<blockquote>
   1080 			<p>Some collation elements are specified by reference to other
   1081 				mappings. This is particularly useful for Han characters which are
   1082 				given implicit/constructed primary weights; the reference to a
   1083 				Unified_Ideograph makes these mappings independent of implementation
   1084 				details. This technique may also be used in other mappings to show
   1085 				the relationship of character variants.</p>
   1086 			<p>The referenced character must have a mapping listed earlier in
   1087 				the file, or the mapping must have been defined via the
   1088 				[Unified_Ideograph] data line. The referenced character must map to
   1089 				exactly one collation element.</p>
   1090 			<p>
   1091 				<code>[U+4E0D]</code>
   1092 				copies U+4E0D's entire collation element.
   1093 				<code>[U+4E36, 10]</code>
   1094 				copies U+4E36's primary and secondary weights and specifies a
   1095 				different tertiary weight.
   1096 				<code>[U+4E36, 70, 20]</code>
   1097 				only copies U+4E36's primary weight and specifies other secondary
   1098 				and tertiary weights.
   1099 			</p>
   1100 			<p>FractionalUCA.txt does not have any explicit mappings for
   1101 				implicit weights. Therefore, an implementation is free to choose an
   1102 				algorithm for computing implicit weights according to the principles
   1103 				specified in the UCA.</p>
   1104 		</blockquote>
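		<p>One possible way to resolve such references is sketched below in
			Python. The helper names are hypothetical, and the collation element
			used for U+4E36 is a made-up placeholder, since the actual implicit
			weights are left to the implementation.</p>

```python
def parse_weight(text):
    """Turn '70' or '0D 20' into a byte string."""
    return bytes(int(b, 16) for b in text.split())

def resolve_reference(ce_text, known):
    """Resolve '[U+XXXX]', '[U+XXXX, t]' or '[U+XXXX, s, t]' against known CEs.

    `known` maps a code point to its single (primary, secondary, tertiary) CE.
    """
    fields = [f.strip() for f in ce_text.strip().strip("[]").split(",")]
    if not fields[0].startswith("U+"):
        raise ValueError("not a reference: " + ce_text)
    primary, secondary, tertiary = known[int(fields[0][2:], 16)]
    if len(fields) == 2:      # copy primary and secondary, override tertiary
        tertiary = parse_weight(fields[1])
    elif len(fields) == 3:    # copy primary only
        secondary, tertiary = parse_weight(fields[1]), parse_weight(fields[2])
    return primary, secondary, tertiary

# Placeholder CE for U+4E36; the primary bytes here are invented for the example.
known = {0x4E36: (b"\xE0\x0A\x0B", b"\x05", b"\x05")}
radical_dot = resolve_reference("[U+4E36, 10]", known)
radical_repeat = resolve_reference("[U+4E36, 70, 20]", known)
print(radical_dot, radical_repeat)
```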
   1105 
   1106 		<pre>
   1107 FDD1 20AC;	[0D 20 02, 05, 05]	# CURRENCY first primary
   1108 FDD1 0034;	[0E 02 02, 05, 05]	# DIGIT first primary starts new lead byte
   1109 FDD0 FF21;	[26 02 02, 05, 05]	# REORDER_RESERVED_BEFORE_LATIN first primary starts new lead byte
   1110 FDD1 004C;	[28 02 02, 05, 05]	# LATIN first primary starts new lead byte
   1111 FDD0 FF3A;	[5D 02 02, 05, 05]	# REORDER_RESERVED_AFTER_LATIN first primary starts new lead byte
   1112 FDD1 03A9;	[5F 04 02, 05, 05]	# GREEK first primary starts new lead byte (compressible)
   1113 FDD1 03E2;	[5F 60 02, 05, 05]	# COPTIC first primary (compressible)</pre>
   1114 		<blockquote>
   1115 			<p>
   1116 				These are special mappings with primaries at the boundaries of
   1117 				scripts and reordering groups. They serve as tailoring boundaries,
   1118 				so that tailoring near the first or last character of a script or
   1119 				group places the tailored item into the same group. Beginning with
   1120 				CLDR 24, each of these is a contraction of U+FDD1 with
   1121 				a character of the corresponding script
   1122 				(or of the General_Category [Z, P, S, Sc, Nd]
   1123 				corresponding to a special reordering group),
   1124 				mapping to the first possible primary weight per
   1125 				script or group. They can be enumerated for implementations of <a
   1126 					href="#Collation_Indexes">Collation Indexes</a>. (Earlier versions
   1127 				mapped contractions with U+FDD0 to the last primary weights of each
   1128 				group but not each script.)
   1129 			</p>
   1130 			<p>Beginning with CLDR 27, these mappings alone define the
   1131 				boundaries for reordering single scripts. (There are no mappings for
   1132 				Hrkt, Hans, or Hant because they are not fully distinct scripts;
   1133 				they share primary weights with other scripts: Hrkt=Hira=Kana &amp;
   1134 				Hans=Hant=Hani.) There are some reserved ranges, beginning at
   1135 				boundaries marked with U+FDD0 plus following characters as shown
   1136 				above. The reserved ranges are not used for collation elements and
   1137 				are not available for tailoring.</p>
   1138 			<p>Some primary lead bytes must be reserved so that reordering of
   1139 				scripts along partial-lead-byte boundaries can split the primary
   1140 				lead byte and use up a reserved byte. This is for implementations
   1141 				that write sort keys, which must reorder primary weights by
   1142 				offsetting them by whole lead bytes. There are reorder-reserved
   1143 				ranges before and after Latin, so that reordering scripts with few
   1144 				primary lead bytes relative to Latin can move those scripts into the
   1145 				reserved ranges without changing the primary weights of any other
   1146 				script. Each of these boundaries begins with a new two-byte primary;
   1147 				that is, no two groups/scripts/ranges share the top 16 bits of their
   1148 				primary weights.</p>
   1149 		</blockquote>
   1150 
   1151 		<pre>
   1152 FDD0 0034;      [11, 05, 05]    # lead byte for numeric sorting</pre>
   1153 		<blockquote>
   1154 			<p>This mapping specifies the lead byte for numeric sorting. It
   1155 				must be different from the lead byte of any other primary weight,
   1156 				otherwise numeric sorting would generate ill-formed collation
   1157 				elements. Therefore, this mapping itself must be excluded from the
   1158 				set of regular mappings. This value can be ignored by
   1159 				implementations that do not support numeric sorting. (Other
   1160 				contractions with U+FDD0 can normally be ignored altogether.)</p>
   1161 		</blockquote>
   1162 
   1163 		<pre>
   1164 # HOMELESS COLLATION ELEMENTS
   1165 FDD0 0063; [, 97, 3D]       # [15E4.0020.0004] [1844.0020.0004] [0000.0041.001F]    * U+01C6 LATIN SMALL LETTER DZ WITH CARON
   1166 FDD0 0064; [, A7, 09]       # [15D1.0020.0004] [0000.0056.0004]     * U+1DD7 COMBINING LATIN SMALL LETTER C CEDILLA
   1167 FDD0 0065; [, B1, 09]       # [1644.0020.0004] [0000.0061.0004]     * U+A7A1 LATIN SMALL LETTER G WITH OBLIQUE STROKE</pre>
   1168 		<blockquote>
   1169 			<p>The DUCET has some weights that don't correspond directly to a
   1170 				character. So that implementations can have a mapping for each
   1171 				collation element (necessary for certain implementations of
   1172 				tailoring), special sequences are constructed for
   1173 				those weights. These collation elements can normally be ignored.</p>
   1174 		</blockquote>
   1175 
   1176 		<p>Next, a number of tables are defined. The function of each of
   1177 			the tables is summarized afterwards.</p>
   1178 
   1179 		<pre># VALUES BASED ON UCA
   1180 ...
   1181 [first regular [0D 0A, 05, 05]] # U+0060 GRAVE ACCENT
   1182 [last regular [7A FE, 05, 05]] # U+1342E EGYPTIAN HIEROGLYPH AA032
   1183 [first implicit [E0 04 06, 05, 05]] # CONSTRUCTED
   1184 [last implicit [E4 DF 7E 20, 05, 05]] # CONSTRUCTED
   1185 [first trailing [E5, 05, 05]] # CONSTRUCTED
   1186 [last trailing [E5, 05, 05]] # CONSTRUCTED
   1187 ...</pre>
   1188 		<blockquote>
   1189 			<p>This table summarizes ranges of important groups of characters
   1190 				for implementations.</p>
   1191 		</blockquote>
   1192 		<pre># Top Byte =&gt; Reordering Tokens
   1193 [top_byte     00      TERMINATOR ]    #       [0]     TERMINATOR=1
   1194 [top_byte     01      LEVEL-SEPARATOR ]       #       [0]     LEVEL-SEPARATOR=1
   1195 [top_byte     02      FIELD-SEPARATOR ]       #       [0]     FIELD-SEPARATOR=1
   1196 [top_byte     03      SPACE ] #       [9]     SPACE=1 Cc=6 Zl=1 Zp=1 Zs=1
   1197 ...</pre>
   1198 		<blockquote>
   1199 			<p>This table defines the reordering groups, for script
   1200 				reordering. The table maps from the first bytes of the fractional
   1201 				weights to a reordering token. The format is &quot;[top_byte &quot;
   1202 				byte-value reordering-token &quot;COMPRESS&quot;? &quot;]&quot;. The
   1203 				&quot;COMPRESS&quot; value is present when there is only one byte in
   1204 				the reordering token, and primary-weight compression can be applied.
   1205 				Most reordering tokens are script values; others are special-purpose
   1206 				values, such as PUNCTUATION. Beginning with CLDR 24, this table
   1207 				precedes the regular mappings, so that parsers can use this
   1208 				information while processing and optimizing mappings. Beginning with
   1209 				CLDR 27, most of this data is irrelevant because single scripts can
   1210 				be reordered. Only the "COMPRESS" data is still useful.</p>
   1211 		</blockquote>
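		<p>A minimal Python sketch of reading a <code>[top_byte ...]</code>
			line follows. The function name is hypothetical, and the second
			sample line (with COMPRESS) is invented to exercise the optional
			flag; it is not copied from the data file.</p>

```python
import re

TOP_BYTE_RE = re.compile(r"^\[top_byte\s+([0-9A-Fa-f]{2})\s+([^\]]*)\]")

def parse_top_byte(line):
    """Return (byte_value, reordering_tokens, compressible) or None."""
    match = TOP_BYTE_RE.match(line.strip())
    if match is None:
        return None
    tokens = match.group(2).split()
    compress = bool(tokens) and tokens[-1] == "COMPRESS"
    if compress:
        tokens = tokens[:-1]  # the flag is not itself a reordering token
    return int(match.group(1), 16), tokens, compress

space = parse_top_byte("[top_byte     03      SPACE ] #       [9]     SPACE=1 Cc=6")
latin = parse_top_byte("[top_byte 28 Latn COMPRESS ]")  # invented sample line
print(space, latin)
```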
   1212 		<pre># Reordering Tokens =&gt; Top Bytes
   1213 [reorderingTokens     Arab    61=910 62=910 ]
   1214 [reorderingTokens     Armi    7A=22 ]
   1215 [reorderingTokens     Armn    5F=82 ]
   1216 [reorderingTokens     Avst    7A=54 ]
   1217 ...</pre>
   1218 		<blockquote>
   1219 			<p>This table is an inverse mapping from reordering token to top
   1220 				byte(s). In terms like &quot;61=910&quot;, the first value is the
   1221 				top byte, while the second is informational, indicating the number
   1222 				of primaries assigned with that top byte.</p>
   1223 		</blockquote>
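		<p>The inverse table can be read in much the same way. In this
			Python sketch (the function name is hypothetical), each
			&quot;byte=count&quot; term becomes a dictionary entry from top byte
			to the informational number of primaries.</p>

```python
import re

TOKENS_RE = re.compile(r"^\[reorderingTokens\s+(\S+)\s+([^\]]*)\]")

def parse_reordering_tokens(line):
    """Return (reordering_token, {top_byte: primary_count}) or None."""
    match = TOKENS_RE.match(line.strip())
    if match is None:
        return None
    top_bytes = {}
    for term in match.group(2).split():
        byte_text, _, count_text = term.partition("=")
        top_bytes[int(byte_text, 16)] = int(count_text)  # byte is hex, count is decimal
    return match.group(1), top_bytes

arab = parse_reordering_tokens("[reorderingTokens     Arab    61=910 62=910 ]")
print(arab)
```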
   1224 		<pre># General Categories =&gt; Top Byte
   1225 [categories   Cc      03{SPACE}=6 ]
   1226 [categories   Cf      77{Khmr Tale Talu Lana Cham Bali Java Mong Olck Cher Cans Ogam Runr Orkh Vaii Bamu}=2 ]
   1227 [categories   Lm      0D{SYMBOL}=25 0E{SYMBOL}=22 27{Latn}=12 28{Latn}=12 29{Latn}=12 2A{Latn}=12...</pre>
   1228 		<blockquote>
   1229 			<p>This table is informational, providing the top bytes, scripts,
   1230 				and primaries associated with each general category value.</p>
   1231 		</blockquote>
   1232 		<pre># FIXED VALUES
   1233 [fixed first implicit byte E0]
   1234 [fixed last implicit byte E4]
   1235 [fixed first trail byte E5]
   1236 [fixed last trail byte EF]
   1237 [fixed first special byte F0]
   1238 [fixed last special byte FF]
   1239 
   1240 [fixed secondary common byte 05]
   1241 [fixed last secondary common byte 45]
   1242 [fixed first ignorable secondary byte 80]
   1243 
   1244 [fixed tertiary common byte 05]
   1245 [fixed first ignorable tertiary byte 3C]
   1246 		</pre>
   1247 		<blockquote>
   1248 			<p>The final table gives certain hard-coded byte values. The
   1249 				&quot;trail&quot; area is provided for implementation of the
   1250 				&quot;trailing weights&quot; as described in the UCA.</p>
   1251 		</blockquote>
   1252 
   1253 		<p class="note">Note: The particular primary lead bytes for Hani
   1254 			vs. IMPLICIT vs. TRAILING are only an example. An implementation is
   1255 			free to move them if it also moves the explicit TRAILING weights.
   1256 			This affects only a small number of explicit mappings in
   1257 			FractionalUCA.txt, such as for U+FFFD, U+FFFF, and the unassigned
   1258 			first primary. It is possible to use no SPECIAL bytes at all, and to
   1259 			use only the one primary lead byte FF for TRAILING weights.</p>
   1260 
   1261 		<h4>
   1262 			2.6.3 <a name="File_Format_UCA_Rules_txt"
   1263 				href="#File_Format_UCA_Rules_txt">UCA_Rules.txt</a>
   1264 		</h4>
   1265 		<p>
   1266 			The format for this file uses the CLDR collation syntax; see <i>Section
   1267 				3, <a href="#Collation_Tailorings">Collation Tailorings</a>
   1268 			</i>.
   1269 		</p>
   1270 
   1271 
   1272 		<h2>
   1273 			3 <a name="Collation_Tailorings" href="#Collation_Tailorings">Collation
   1274 				Tailorings</a>
   1275 		</h2>
   1276 		<p class="dtd">&lt;!ELEMENT collations (alias |
   1277 			(defaultCollation?, collation*, special*)) &gt;</p>
   1278 		<p class="dtd">&lt;!ELEMENT defaultCollation ( #PCDATA ) &gt;</p>
   1279 		<p>
   1280 			This element of the LDML format contains one or more <span
   1281 				class="element">collation</span> elements, distinguished by type.
   1282 			Each <span class="element">collation</span> contains elements with
   1283 			parametric settings, or rules that specify a certain sort order, as a
   1284 			tailoring of the root order, or both.
   1285 		</p>
   1286 		<p class="note">
   1287 			Note: CLDR collation tailoring data should follow the <a
   1288 				href="http://cldr.unicode.org/index/cldr-spec/collation-guidelines">CLDR
   1289 				Collation Guidelines</a>.
   1290 		</p>
   1291 
   1292 		<h3>
   1293 			3.1 <a name="Collation_Types" href="#Collation_Types">Collation
   1294 				Types</a>
   1295 		</h3>
   1296 		<p>
   1297 			Each locale may have multiple sort orders (types). The <span
   1298 				class="element">defaultCollation</span> element defines the default
   1299 			tailoring for a locale and its sublocales. For example:
   1300 		</p>
   1301 		<ul>
   1302 			<li>root.xml: <code>&lt;defaultCollation&gt;standard&lt;/defaultCollation&gt;</code></li>
   1303 			<li>zh.xml: <code>&lt;defaultCollation&gt;pinyin&lt;/defaultCollation&gt;</code></li>
   1304 			<li>zh_Hant.xml: <code>&lt;defaultCollation&gt;stroke&lt;/defaultCollation&gt;</code></li>
   1305 		</ul>
   1306 
   1307 		<p>
   1308 			To allow implementations in reduced memory environments to use CJK
   1309 			sorting, there are also short forms of each of these collation
   1310 			sequences. These provide for the most common characters in common
   1311 			use, and are marked with <span class="attribute">alt</span>=&quot;<span
   1312 				class="attributeValue">short</span>&quot;.
   1313 		</p>
   1314 
   1315 		<p>A collation type name that starts with "private-", for example,
   1316 			"private-kana", indicates an incomplete tailoring that is only
   1317 			intended for import into one or more other tailorings (usually for
   1318 			sharing common rules). It does not establish a complete sort order.
   1319 			An implementation should not build data tables for a private
   1320 			collation type, and should not include a private collation type in a
   1321 			list of available types.</p>
   1322 
   1323 		<p class="note">
   1324 			<b>Note:</b>
   1325 		</p>
   1326 		<ul>
   1327 			<li>There is an on-line demonstration of collation at [<a
   1328 				href="tr35.html#LocaleExplorer">LocaleExplorer</a>] that uses the
   1329 				same rule syntax. (Pick the locale and scroll to &quot;Collation
   1330 				Rules&quot;, near the end.)
   1331 			</li>
   1332 			<li class="note">In CLDR 23 and before, LDML collation files
   1333 				used an XML format. Starting with CLDR 24, the XML collation syntax
   1334 				is deprecated and no longer used. See the <i><a
   1335 					href="http://www.unicode.org/reports/tr35/tr35-31/tr35-collation.html#Collation_Tailorings">CLDR
   1336 						23 version of this document</a></i> for details about the XML collation
   1337 				syntax.
   1338 			</li>
   1339 		</ul>
   1340 
   1341 		<h4>
   1342 			3.1.1 <a name="Collation_Type_Fallback"
   1343 				href="#Collation_Type_Fallback">Collation Type Fallback</a>
   1344 		</h4>
   1345 		<p>When loading a requested tailoring from its data file and the
   1346 			parent file chain, use the following type fallback to find the
   1347 			tailoring.</p>
   1348 		<ol>
   1349 			<li>Determine the default type from the &lt;defaultCollation&gt;
   1350 				element; map the default type to its alias if one is defined. If
   1351 				there is no &lt;defaultCollation&gt; element, then use "standard" as
   1352 				the default type.</li>
   1353 			<li>If the request language tag specifies the collation type
   1354 				(keyword "co"), then map it to its alias if one is defined (e.g.,
				"-co-phonebk" &rarr; "phonebook"). If the language tag does not specify
   1356 				the type, then use the default type.</li>
   1357 			<li>Use the &lt;collation&gt; element with this type.</li>
   1358 			<li>If it does not exist, and the type starts with "search" but
   1359 				is longer, then set the type to "search" and use that
				&lt;collation&gt; element. (For example, "searchjl" &rarr; "search".)</li>
   1361 			<li>If it does not exist, and the type is not the default type,
   1362 				then set the type to the default type and use that &lt;collation&gt;
   1363 				element.</li>
   1364 			<li>If it does not exist, and the type is not "standard", then
   1365 				set the type to "standard" and use that &lt;collation&gt; element.</li>
   1366 			<li>If it does not exist, then use the CLDR root collation.</li>
   1367 		</ol>
   1368 		<p class="note">Note that the CLDR collation/root.xml contains
   1369 			&lt;defaultCollation&gt;standard&lt;/defaultCollation&gt;,
   1370 			&lt;collation type="standard"&gt; (with an empty tailoring, so this
   1371 			is the same as the CLDR root collation), and &lt;collation
   1372 			type="search"&gt;.</p>
   1373 
   1374 		<p>For example, assume that we have collation data for the
   1375 			following tailorings. ("da/search" is shorthand for
   1376 			"da-u-co-search".)</p>
   1377 		<ul>
   1378 			<li>root/defaultCollation=standard</li>
   1379 			<li>root/standard (this is the same as the CLDR root collator)</li>
   1380 			<li>root/search</li>
   1381 			<li>da/standard</li>
   1382 			<li>da/search</li>
   1383 			<li>el/standard</li>
   1384 			<li>ko/standard</li>
   1385 			<li>ko/search</li>
   1386 			<li>ko/searchjl</li>
   1387 			<li>zh/defaultCollation=pinyin</li>
   1388 			<li>zh/pinyin</li>
   1389 			<li>zh/stroke</li>
   1390 			<li>zh-Hant/defaultCollation=stroke</li>
   1391 		</ul>
   1392 		<table>
   1393 			<caption>
   1394 				<a name="Sample_requested_and_actual_collation_locales_and_types"
   1395 					href="#Sample_requested_and_actual_collation_locales_and_types">Sample
   1396 					requested and actual collation locales and types</a>
   1397 			</caption>
   1398 			<tr>
   1399 				<th>requested</th>
   1400 				<th>actual</th>
   1401 				<th>comment</th>
   1402 			</tr>
   1403 			<tr>
   1404 				<td>da/phonebook</td>
   1405 				<td>da/standard</td>
   1406 				<td>default type for Danish</td>
   1407 			</tr>
   1408 			<tr>
   1409 				<td>zh</td>
   1410 				<td>zh/pinyin</td>
   1411 				<td>default type for zh</td>
   1412 			</tr>
   1413 			<tr>
   1414 				<td>zh/standard</td>
   1415 				<td>root/standard</td>
   1416 				<td>no "standard" tailoring for zh, falls back to root</td>
   1417 			</tr>
   1418 			<tr>
   1419 				<td>zh/phonebook</td>
   1420 				<td>zh/pinyin</td>
   1421 				<td>default type for zh</td>
   1422 			</tr>
   1423 			<tr>
   1424 				<td>zh-Hant/phonebook</td>
   1425 				<td>zh/stroke</td>
   1426 				<td>default type for zh-Hant is "stroke"</td>
   1427 			</tr>
   1428 			<tr>
   1429 				<td>da/searchjl</td>
   1430 				<td>da/search</td>
   1431 				<td>"search.+" falls back to "search"</td>
   1432 			</tr>
   1433 			<tr>
   1434 				<td>el/search</td>
   1435 				<td>root/search</td>
   1436 				<td>no "search" tailoring for Greek</td>
   1437 			</tr>
   1438 			<tr>
   1439 				<td>el/searchjl</td>
   1440 				<td>root/search</td>
   1441 				<td>"search.+" falls back to "search", found in root</td>
   1442 			</tr>
   1443 			<tr>
   1444 				<td>ko/searchjl</td>
   1445 				<td>ko/searchjl</td>
   1446 				<td>requested data is actually available</td>
   1447 			</tr>
   1448 		</table>
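		<p>The fallback steps and the sample table above can be traced with a
			short Python sketch. It is an illustration under stated assumptions,
			not a normative algorithm: the names <code>DEFAULTS</code>,
			<code>TAILORINGS</code>, <code>ALIASES</code>,
			<code>parent_chain</code>, <code>find</code>, and
			<code>resolve</code> are hypothetical, the data is a hand-written
			stand-in for the sample tailorings, and the parent file chain is
			approximated by truncating the locale ID at its last hyphen.</p>

```python
# Hypothetical in-memory stand-in for the sample collation data listed above.
DEFAULTS = {"root": "standard", "zh": "pinyin", "zh-Hant": "stroke"}
TAILORINGS = {
    "root": {"standard", "search"},
    "da": {"standard", "search"},
    "el": {"standard"},
    "ko": {"standard", "search", "searchjl"},
    "zh": {"pinyin", "stroke"},
}
# Only the alias mentioned in the text; a real alias table would be larger.
ALIASES = {"phonebk": "phonebook"}


def parent_chain(locale):
    """Assumed parent chain: drop trailing subtags one by one, then end at root."""
    chain = []
    while locale and locale != "root":
        chain.append(locale)
        locale = locale.rpartition("-")[0]
    chain.append("root")
    return chain


def find(chain, ctype):
    """Return the first locale in the chain that has a tailoring of this type."""
    for loc in chain:
        if ctype in TAILORINGS.get(loc, ()):
            return loc
    return None


def resolve(locale, requested=None):
    """Return (locale, type) of the tailoring that would actually be used."""
    chain = parent_chain(locale)
    # Step 1: default type from the nearest <defaultCollation>, else "standard".
    default = next((DEFAULTS[loc] for loc in chain if loc in DEFAULTS), "standard")
    default = ALIASES.get(default, default)
    # Step 2: requested type (alias-mapped), else the default type.
    ctype = ALIASES.get(requested, requested) if requested else default
    # Step 3: look for the <collation> element with this type.
    found = find(chain, ctype)
    # Step 4: "search..." falls back to "search".
    if found is None and ctype.startswith("search") and ctype != "search":
        ctype = "search"
        found = find(chain, ctype)
    # Step 5: fall back to the default type.
    if found is None and ctype != default:
        ctype = default
        found = find(chain, ctype)
    # Step 6: fall back to "standard".
    if found is None and ctype != "standard":
        ctype = "standard"
        found = find(chain, ctype)
    # Step 7: fall back to the CLDR root collation.
    if found is None:
        return ("root", "standard")
    return (found, ctype)
```

		<p>For instance, <code>resolve("zh-Hant", "phonebook")</code> walks
			through steps 3 and 5 and ends at the "stroke" tailoring stored
			under zh, which matches the corresponding row of the table.</p>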
   1449 
   1450 		<h3>
   1451 			3.2 <a name="Collation_Version" href="#Collation_Version">Version</a>
   1452 		</h3>
		<p>The version attribute specifies a particular version of the UCA.
			It is optional; specify it when results must be identical on
			different systems. If it is not supplied, then the version is
			assumed to be the same as the Unicode version for the system as a
			whole.</p>
   1458 		<blockquote>
   1459 			<p class="note">
   1460 				<b>Note: </b>For version 3.1.1 of the UCA, the version of Unicode
   1461 				must also be specified with any versioning information; an example
   1462 				would be &quot;3.1.1/3.2&quot; for version 3.1.1 of the UCA, for
   1463 				version 3.2 of Unicode. This was changed by decision of the UTC, so
   1464 				that dual versions were no longer necessary. So for UCA 4.0 and
   1465 				beyond, the version just has a single number.
   1466 			</p>
   1467 		</blockquote>
   1468 
   1469 		<h3>
   1470 			3.3 <a name="Collation_Element" href="#Collation_Element">Collation
   1471 				Element</a>
   1472 		</h3>
   1473 		<p class="dtd">&lt;!ELEMENT collation (alias | (cr*, special*))
   1474 			&gt;</p>
   1475 		<p>
   1476 			The tailoring syntax is designed to be independent of the actual
   1477 			weights used in any particular UCA table. That way the same rules can
   1478 			be applied to UCA versions over time, even if the underlying weights
   1479 			change. The following illustrates the overall structure of a <span
   1480 				class="element">collation</span>:
   1481 		</p>
   1482 		<pre>&lt;collation type="phonebook"&gt;
   1483   &lt;cr&gt;&lt;![CDATA[
   1484     [caseLevel on]
   1485     &amp;c &lt; k
   1486   ]]&gt;&lt;/cr&gt;
   1487 &lt;/collation&gt;</pre>
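		<p>As a rough illustration of how a tool might read such an element,
			the following Python sketch extracts the type and the rule lines
			with the standard library XML parser. The function name
			<code>read_collation</code> is hypothetical, and the sketch reads
			only the first <code>cr</code> child although the DTD allows
			several.</p>

```python
import xml.etree.ElementTree as ET

# The example element from the text, as it would appear in an LDML file.
LDML_SNIPPET = """<collation type="phonebook">
  <cr><![CDATA[
    [caseLevel on]
    &c < k
  ]]></cr>
</collation>"""


def read_collation(xml_text):
    """Return (type, non-empty rule lines) for one <collation> element."""
    element = ET.fromstring(xml_text)
    # The parser delivers the CDATA section as plain element text.
    rules = element.findtext("cr", default="")
    lines = [line.strip() for line in rules.splitlines() if line.strip()]
    return element.get("type"), lines


collation_type, rule_lines = read_collation(LDML_SNIPPET)
```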
   1488 
   1489 		<h3>
   1490 			3.4 <a name="Setting_Options" href="#Setting_Options">Setting
   1491 				Options</a>
   1492 		</h3>
   1493 		<p>
   1494 			Parametric settings can be specified in language tags or in rule
   1495 			syntax (in the form
			<code>[keyword value]</code>). For example,
   1498 			<code>-ks-level2</code>
   1499 			or
   1500 			<code>[strength 2]</code>
   1501 			will only compare strings based on their primary and secondary
   1502 			weights.
   1503 		</p>
   1504 		<p>
   1505 			If a setting is not present, the CLDR default (or the default for the
   1506 			locale, if there is one) is used. That default is listed in bold
   1507 			italics. Where there is a UCA default that is different, it is listed
   1508 			in bold with (<strong>UCA default</strong>). Note that the default
   1509 			value for a locale may be different than the normal default value for
   1510 			the setting.
   1511 		</p>
   1512 
   1513 		<table>
   1514 			<caption>
   1515 				<a name="Collation_Settings" href="#Collation_Settings">Collation
   1516 					Settings</a>
   1517 			</caption>
   1518 			<tr>
   1519 				<th>BCP47 Key</th>
   1520 				<th>BCP47 Value</th>
   1521 				<th>Rule Syntax</th>
   1522 				<th>Description</th>
   1523 			</tr>
   1524 			<tr>
   1525 				<td rowspan="5">ks</td>
   1526 				<td>level1</td>
   1527 				<td><code>[strength 1]</code><br>(primary)</td>
   1528 				<td rowspan="5">Sets the default strength for comparison, as
   1529 					described in the [<a
   1530 					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>].<em>
						Note that a strength setting of greater than 3 may have the same
   1532 						effect as <strong>identical</strong>, depending on the locale and
   1533 						implementation.
   1534 				</em>
   1535 				</td>
   1536 			</tr>
   1537 			<tr>
   1538 				<td>level2</td>
   1539 				<td><code>[strength 2]</code><br>(secondary)</td>
   1540 			</tr>
   1541 			<tr>
   1542 				<td>level3</td>
   1543 				<td><em><strong><code>[strength 3]</code><br>(tertiary)</strong></em></td>
   1544 			</tr>
   1545 			<tr>
   1546 				<td>level4</td>
   1547 				<td><code>[strength 4]</code><br>(quaternary)</td>
   1548 			</tr>
   1549 			<tr>
   1550 				<td>identic</td>
   1551 				<td><code>[strength I]</code><br>(identical)</td>
   1552 			</tr>
   1553 			<tr>
   1554 				<td rowspan="3">ka</td>
   1555 				<td>noignore</td>
   1556 				<td><i><strong><code>[alternate
   1557 								non-ignorable]</code></strong></i><br></td>
   1558 				<td rowspan="3">Sets alternate handling for variable weights,
   1559 					as described in [<a
   1560 					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>], where
   1561 					&quot;shifted&quot; causes certain characters to be ignored in
   1562 					comparison. <em>The default for LDML is different than it is
   1563 						in the UCA. In LDML, the default for alternate handling is <strong>non-ignorable</strong>,
   1564 						while in UCA it is <strong>shifted</strong>. In addition, in LDML
   1565 						only whitespace and punctuation are variable by default.
   1566 				</em>
   1567 				</td>
   1568 			</tr>
   1569 			<tr>
   1570 				<td>shifted</td>
   1571 				<td><strong><code>[alternate shifted]</code><br>(UCA
   1572 						default)</strong></td>
   1573 			</tr>
   1574 			<tr>
   1575 				<td><em>n/a</em></td>
   1576 				<td><i>n/a</i><br>(blanked)</td>
   1577 			</tr>
   1578 			<tr>
   1579 				<td rowspan="2">kb</td>
   1580 				<td>true</td>
   1581 				<td><code>[backwards 2]</code></td>
   1582 				<td rowspan="2">Sets the comparison for the second level to be
   1583 					<strong>backwards</strong>, as described in [<a
   1584 					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>].
   1585 				</td>
   1586 			</tr>
   1587 			<tr>
   1588 				<td>false</td>
   1589 				<td><i><strong>n/a</strong></i></td>
   1590 			</tr>
   1591 			<tr>
   1592 				<td rowspan="2">kk</td>
   1593 				<td>true</td>
   1594 				<td><strong><code>[normalization on]</code><br>(UCA
   1595 						default)</strong></td>
   1596 				<td rowspan="2">If <strong>on</strong>, then the normal [<a
   1597 					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>]
   1598 					algorithm is used. If <strong>off</strong>, then most strings
   1599 					should still sort correctly despite not normalizing to NFD first.<br>
   1600 					<em>Note that the default for CLDR locales may be different
   1601 						than in the UCA. The rules for particular locales have it set to <strong>on</strong>:
   1602 						those locales whose exemplar characters (in forms commonly
   1603 						interchanged) would be affected by normalization.
   1604 				</em>
   1605 				</td>
   1606 			</tr>
   1607 			<tr>
   1608 				<td>false</td>
   1609 				<td><i><strong><code>[normalization off]</code></strong></i></td>
   1610 			</tr>
   1611 			<tr>
   1612 				<td rowspan="2">kc</td>
   1613 				<td>true</td>
   1614 				<td><code>[caseLevel on]</code></td>
   1615 				<td rowspan="2">If set to <strong>on</strong><i>,</i> a level
   1616 					consisting only of case characteristics will be inserted in front
   1617 					of tertiary level, as a &quot;Level 2.5&quot;. To ignore accents
   1618 					but take case into account, set strength to <strong>primary</strong>
   1619 					and case level to <strong>on</strong>. For details, see <em>Section
   1620 						3.14, <a href="#Case_Parameters">Case Parameters</a>
   1621 				</em>.
   1622 				</td>
   1623 			</tr>
   1624 			<tr>
   1625 				<td>false</td>
   1626 				<td><i><strong><code>[caseLevel off]</code></strong></i></td>
   1627 			</tr>
   1628 			<tr>
   1629 				<td rowspan="3">kf</td>
   1630 				<td>upper</td>
   1631 				<td><code>[caseFirst upper]</code></td>
   1632 				<td rowspan="3">If set to <strong>upper</strong>, causes upper
   1633 					case to sort before lower case. If set to <strong>lower</strong>,
					causes lower case to sort before upper case. Useful for locales
					whose ordering is already supported but that require a different
					order of cases. Affects case and tertiary levels. For details, see <em>Section
   1637 						3.14, <a href="#Case_Parameters">Case Parameters</a>
   1638 				</em>.
   1639 				</td>
   1640 			</tr>
   1641 			<tr>
   1642 				<td>lower</td>
   1643 				<td><code>[caseFirst lower]</code></td>
   1644 			</tr>
   1645 			<tr>
   1646 				<td>false</td>
   1647 				<td><i><strong><code>[caseFirst off]</code></strong></i></td>
   1648 			</tr>
   1649 			<tr>
   1650 				<td rowspan="2">kh</td>
   1651 				<td>true<br> <i><strong>Deprecated:</strong></i> Use rules
   1652 					with quater&shy;nary relations instead.
   1653 				</td>
   1654 				<td><code>[hiraganaQ on]</code></td>
   1655 				<td rowspan="2">Controls special treatment of Hiragana code
   1656 					points on quaternary level. If turned <strong>on</strong>, Hiragana
   1657 					codepoints will get lower values than all the other non-variable
   1658 					code points in <strong>shifted</strong>. That is, the normal Level
   1659 					4 value for a regular collation element is FFFF, as described in [<a
   1660 					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>], <em>Section
   1661 						3.6, <a
   1662 						href="http://www.unicode.org/reports/tr10/#Variable_Weighting">Variable
   1663 							Weighting</a>
   1664 				</em>. This is changed to FFFE for [:script=Hiragana:] characters. The
					strength must be greater than or equal to quaternary if this attribute
   1666 					is to have any effect.
   1667 				</td>
   1668 			</tr>
   1669 			<tr>
   1670 				<td>false</td>
   1671 				<td><i><strong><code>[hiraganaQ off]</code></strong></i></td>
   1672 			</tr>
   1673 			<tr>
   1674 				<td rowspan="2">kn</td>
   1675 				<td>true</td>
   1676 				<td><code>[numericOrdering on]</code></td>
   1677 				<td rowspan="2">If set to <strong>on</strong>, any sequence of
   1678 					Decimal Digits (General_Category = Nd in the [<a
   1679 					href="http://www.unicode.org/reports/tr41/#UAX44">UAX44</a>]) is
   1680 					sorted at a primary level with its numeric value. For example,
   1681 					&quot;A-21&quot; &lt; &quot;A-123&quot;. The computed primary
   1682 					weights are all at the start of the <strong>digit</strong>
   1683 					reordering group. Thus with an untailored UCA table, &quot;a$&quot;
   1684 					&lt; &quot;a0&quot; &lt; &quot;a2&quot; &lt; &quot;a12&quot; &lt;
					&quot;a&#x24EA;&quot; &lt; &quot;aa&quot;.
   1686 				</td>
   1687 			</tr>
   1688 			<tr>
   1689 				<td>false</td>
   1690 				<td><i><strong><code>[numericOrdering off]</code></strong></i></td>
   1691 			</tr>
   1692 			<tr>
   1693 				<td>kr</td>
   1694 				<td>a sequence of one or more reorder codes: <strong>space,
   1695 						punct, symbol, currency, digit</strong>, or any BCP47 script ID
   1696 				</td>
   1697 				<td><code>[reorder Grek digit]</code></td>
   1698 				<td>Specifies a reordering of scripts or other significant
   1699 					blocks of characters such as symbols, punctuation, and digits. For
   1700 					the precise meaning and usage of the reorder codes, see <em>Section
   1701 						3.13, <a href="#Script_Reordering">Collation Reordering</a>.
   1702 				</em>
   1703 				</td>
   1704 			</tr>
   1705 			<tr>
   1706 				<td rowspan="4">kv</td>
   1707 				<td>space</td>
   1708 				<td><code>[maxVariable space]</code></td>
   1709 				<td rowspan="4">Sets the variable top to the top of the
   1710 					specified reordering group. All code points with primary weights
   1711 					less than or equal to the variable top will be considered variable,
   1712 					and thus affected by the alternate handling. Variables are
   1713 					ignorable by default in [<a
   1714 					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>], but not
   1715 					in CLDR.
   1716 				</td>
   1717 			</tr>
   1718 			<tr>
   1719 				<td>punct</td>
   1720 				<td><i><strong><code>[maxVariable punct]</code></strong></i></td>
   1721 			</tr>
   1722 			<tr>
   1723 				<td>symbol</td>
   1724 				<td><strong><code>[maxVariable symbol]</code><br>(UCA
   1725 						default)</strong></td>
   1726 			</tr>
   1727 			<tr>
   1728 				<td>currency</td>
   1729 				<td><code>[maxVariable currency]</code></td>
   1730 			</tr>
   1731 			<tr>
   1732 				<td>vt</td>
   1733 				<td>See <i>Part 1 Section 3.6.4, <a
   1734 						href="tr35.html#Unicode_Locale_Extension_Data_Files">U
   1735 							Extension Data Files</a></i>.<br> <i><strong>Deprecated:</strong></i>
   1736 					Use maxVariable instead.
   1737 				</td>
   1738 				<td><code>&amp;\u00XX\uYYYY &lt; [variable top]</code><br>
   1739 					<br> (the default is set to the highest punctuation, thus
   1740 					including spaces and punctuation, but not symbols)</td>
   1741 				<td>
   1742 					<p>
   1743 						The BCP47 value is described in <i>Appendix Q: <a
   1744 							href="tr35.html#Locale_Extension_Key_and_Type_Data">Locale
   1745 								Extension Keys and Types</a>.
   1746 						</i>
   1747 					</p>
   1748 					<p>
   1749 						Sets the string value for the variable top. All the code points
   1750 						with primary weights less than or equal to the variable top will
   1751 						be considered variable, and thus affected by the alternate
   1752 						handling.<br> An implementation that supports the variableTop
   1753 						setting should also support the maxVariable setting, and it should
   1754 						"pin" ("round up") the variableTop to the top of the containing
   1755 						reordering group.<br> Variables are ignorable by default in [<a
   1756 							href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>], but
   1757 						not in CLDR. See below for more information.
   1758 					</p>
   1759 				</td>
   1760 			</tr>
   1761 			<tr>
   1762 				<td><em>n/a</em></td>
   1763 				<td><em>n/a</em></td>
   1764 				<td><em>n/a</em></td>
   1765 				<td>match-boundaries: <em><strong>none</strong></em> |
   1766 					whole-character | whole-word <br> Defined by <em>Section
   1767 						8, <a href="http://www.unicode.org/reports/tr10/#Searching">Searching
   1768 							and Matching</a>
   1769 				</em> of [<a href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>].
   1770 				</td>
   1771 			</tr>
   1772 			<tr>
   1773 				<td><em>n/a</em></td>
   1774 				<td><em>n/a</em></td>
   1775 				<td><em>n/a</em></td>
   1776 				<td>match-style: <em><strong>minimal</strong></em> | medial |
   1777 					maximal <br> Defined by <em>Section 8, <a
   1778 						href="http://www.unicode.org/reports/tr10/#Searching">Searching
   1779 							and Matching</a></em> of [<a
   1780 					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>].
   1781 				</td>
   1782 			</tr>
   1783 		</table>
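		<p>The correspondence between BCP47 keywords and rule syntax in the
			table above can be sketched as a lookup. This is a simplified
			illustration, not a complete locale ID parser: the names
			<code>RULE_SYNTAX</code> and <code>rule_settings</code> are
			hypothetical, the "kr" and "vt" keys are left out, and "kb-false"
			is skipped because the table lists no rule syntax for it.</p>

```python
# BCP47 key -> value -> rule syntax, copied from the Collation Settings table.
RULE_SYNTAX = {
    "ks": {"level1": "[strength 1]", "level2": "[strength 2]",
           "level3": "[strength 3]", "level4": "[strength 4]",
           "identic": "[strength I]"},
    "ka": {"noignore": "[alternate non-ignorable]", "shifted": "[alternate shifted]"},
    "kb": {"true": "[backwards 2]"},
    "kk": {"true": "[normalization on]", "false": "[normalization off]"},
    "kc": {"true": "[caseLevel on]", "false": "[caseLevel off]"},
    "kf": {"upper": "[caseFirst upper]", "lower": "[caseFirst lower]",
           "false": "[caseFirst off]"},
    "kh": {"true": "[hiraganaQ on]", "false": "[hiraganaQ off]"},
    "kn": {"true": "[numericOrdering on]", "false": "[numericOrdering off]"},
    "kv": {"space": "[maxVariable space]", "punct": "[maxVariable punct]",
           "symbol": "[maxVariable symbol]", "currency": "[maxVariable currency]"},
}


def rule_settings(locale_id):
    """List the rule-syntax settings equivalent to the -u- keywords of a locale ID."""
    subtags = locale_id.lower().split("-")
    if "u" not in subtags:
        return []
    settings = []
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:        # another singleton ends the -u- extension
            break
        if len(subtag) == 2:        # two-letter subtags are keys
            key = subtag
        elif key in RULE_SYNTAX and subtag in RULE_SYNTAX[key]:
            settings.append(RULE_SYNTAX[key][subtag])
    return settings
```

		<p>For example, <code>rule_settings("sv-u-ks-level2")</code> yields the
			single setting <code>[strength 2]</code>.</p>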
   1784 
   1785 		<h4>
   1786 			3.4.1 <a name="Common_Settings" href="#Common_Settings">Common
   1787 				settings combinations</a>
   1788 		</h4>
   1789 		<p>Some commonly used parametric collation settings are available
   1790 			via combinations of LDML settings attributes:</p>
   1791 		<ul>
   1792 			<li>Ignore accents: <strong>strength=primary</strong></li>
   1793 			<li>Ignore accents but take case into account: <strong>strength=primary
   1794 					caseLevel=on</strong></li>
   1795 			<li>Ignore case: <strong>strength=secondary</strong></li>
   1796 			<li>Ignore punctuation (completely): <strong>strength=tertiary
   1797 					alternate=shifted</strong></li>
   1798 			<li>Ignore punctuation but distinguish among punctuation
   1799 				marks: <strong>strength=quaternary alternate=shifted</strong>
   1800 			</li>
   1801 		</ul>
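		<p>Expressed as BCP47 keywords from the Collation Settings table, the
			combinations above might look as follows. The dictionary
			<code>COMMON_SETTINGS</code>, its labels, and the helper
			<code>with_settings</code> are hypothetical names used only for
			this sketch, which assumes that the base locale ID carries no
			extension of its own.</p>

```python
# Keywords for each combination, with keys in alphabetical order.
COMMON_SETTINGS = {
    "ignore accents": "ks-level1",
    "ignore accents, keep case": "kc-true-ks-level1",
    "ignore case": "ks-level2",
    "ignore punctuation": "ka-shifted-ks-level3",
    "ignore punctuation, distinguish marks": "ka-shifted-ks-level4",
}


def with_settings(base_locale, combination):
    """Append the keywords as a -u- extension to a plain locale ID."""
    return f"{base_locale}-u-{COMMON_SETTINGS[combination]}"
```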
   1802 
   1803 		<h4>
   1804 			3.4.2 <a name="Normalization_Setting" href="#Normalization_Setting">Notes
   1805 				on the normalization setting</a>
   1806 		</h4>
   1807 		<p>The UCA always normalizes input strings into NFD form before
   1808 			the rest of the algorithm. However, this results in poor performance.</p>
   1809 		<p>
   1810 			With <strong>normalization=off</strong>, strings that are in [<a
   1811 				href="tr35.html#FCD">FCD</a>] and do not contain Tibetan precomposed
   1812 			vowels (U+0F73, U+0F75, U+0F81) should sort correctly. With <strong>normalization=on</strong>,
   1813 			an implementation that does not normalize to NFD must at least
   1814 			perform an incremental FCD check and normalize substrings as
   1815 			necessary. It should also always decompose the Tibetan precomposed
   1816 			vowels. (Otherwise discontiguous contractions across their leading
   1817 			components cannot be handled correctly.)
   1818 		</p>
		<p>Another complication for an implementation that does not always
			use NFD arises when contraction mappings overlap with canonical
			Decomposition_Mapping strings. For example, the Danish contraction
			&quot;aa&quot; overlaps with the decompositions of &#39;&#x00E4;&#39;, &#39;&#x00E5;&#39;, and other
			characters. In the root collation (and in the DUCET), Cyrillic &#39;&#x04DB;&#39;
			maps to a single collation element, which means that its
			decomposition &quot;&#x04D9;+&#x25CC;&#x0308;&quot; forms a contraction, and its
			second character (U+0308) is the same as the first character in the
			Decomposition_Mapping of U+0344
			&#39;&#x25CC;&#x0344;&#39;=&quot;&#x25CC;&#x0308;+&#x25CC;&#x0301;&quot;.</p>
		<p>In order to handle strings with these characters (e.g., &quot;a&#x00E4;&quot;
			and &quot;&#x04D9;&#x0344;&quot; [which are in FCD]) exactly as with prior NFD
			normalization, an implementation needs to either add overlap
			contractions to its data (e.g., &quot;a+&#x00E4;&quot; and &quot;&#x04D9;+&#x25CC;&#x0344;&quot;), or
			it needs to decompose the relevant composites (e.g., &#39;&#x00E4;&#39; and
			&#39;&#x25CC;&#x0344;&#39;) as soon as they are encountered.</p>
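		<p>The overlap can be observed with Python's standard
			<code>unicodedata</code> module. The sketch below is an
			illustration only: <code>is_fcd</code> is a hypothetical helper
			that applies the usual FCD condition (the leading combining class
			of each code point's decomposition must be zero or not less than
			the trailing combining class of the preceding one), and the sample
			strings use U+00E4 and U+04D9 U+0344 as inputs that are in FCD yet
			contain a contraction only after decomposition.</p>

```python
import unicodedata


def is_fcd(text):
    """Check that decomposing each code point separately keeps marks in order."""
    prev_trail = 0
    for ch in text:
        nfd = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(nfd[0])
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = unicodedata.combining(nfd[-1])
    return True


# U+00E4 decomposes to "a" + U+0308, so "a" + U+00E4 contains the Danish
# contraction "aa" only once it has been decomposed.
danish = "a\u00e4"
danish_nfd = unicodedata.normalize("NFD", danish)

# U+0344 decomposes to U+0308 + U+0301, so U+04D9 + U+0344 begins with the
# decomposition of U+04DB (U+04D9 + U+0308) only once it has been decomposed.
cyrillic = "\u04d9\u0344"
cyrillic_nfd = unicodedata.normalize("NFD", cyrillic)
```

		<p>Both sample strings pass the FCD check, so an implementation that
			skips normalization for FCD input sees neither contraction unless
			it adds overlap contractions or decomposes these composites.</p>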
   1835 
   1836 		<h4>
   1837 			3.4.3 <a name="Variable_Top_Settings" href="#Variable_Top_Settings">Notes
   1838 				on variable top settings</a>
   1839 		</h4>
   1840 		<p>
   1841 			Users may want to include more or fewer characters as Variable. For
   1842 			example, someone could want to restrict the Variable characters to
   1843 			just include space marks. In that case, maxVariable would be set to
   1844 			"space". (In CLDR 24 and earlier, the now-deprecated variableTop
   1845 			would be set to U+1680, see the Whitespace <a
   1846 				href="http://unicode.org/charts/collation/">UCA collation chart</a>).
			Alternatively, someone could want to treat more of the Common
			characters as Variable, including all characters up to (but not
			including) '0', by setting maxVariable to "currency". (In CLDR 24
			and earlier, the
   1850 			now-deprecated variableTop would be set to U+20BA, see the
   1851 			Currency-Symbol collation chart).
   1852 		</p>
		<p>These settings customize which sets of characters are ignored
			when comparing strings. For example, the
   1855 			locale identifier "de-u-ka-shifted-kv-currency" is requesting
   1856 			settings appropriate for German, including German sorting
   1857 			conventions, and that currency symbols and characters sorting below
   1858 			them are ignored in sorting.</p>
   1859 
   1860 		<h3>
   1861 			3.5 <a name="Rules" href="#Rules">Collation Rule Syntax</a>
   1862 		</h3>
   1863 		<p class="dtd">&lt;!ELEMENT cr #PCDATA &gt;</p>
   1864 		<p>
   1865 			The goal for the collation rule syntax is to have clearly expressed
   1866 			rules with a concise format. The CLDR rule syntax is a subset of the
   1867 			[<a href="tr35.html#ICUCollation">ICUCollation</a>] syntax.
   1868 		</p>
   1869 
   1870 		<p>
   1871 			For the CLDR root collation, the FractionalUCA.txt file defines all
   1872 			mappings for all of Unicode directly, and it also provides
   1873 			information about script boundaries, reordering groups, and other
   1874 			details. For tailorings, this is neither necessary nor practical. In
   1875 			particular, while the root collation sort order rarely changes for
   1876 			existing characters, their numeric collation weights change with
   1877 			every version. If tailorings also specified numeric weights directly,
   1878 			then they would have to change with every version, parallel with the
   1879 			root collation. Instead, for tailorings, mappings are added and
   1880 			modified relative to the root collation. (There is no syntax to <i>remove</i>
   1881 			mappings, except via <a href="#Special_Purpose_Commands">special
   1882 				[suppressContractions [...]] </a>.)
   1883 		</p>
   1884 
   1885 		<p>
   1886 			The ASCII [:P:] and [:S:] characters are reserved for collation
   1887 			syntax:
   1888 			<code>[\u0021-\u002F \u003A-\u0040 \u005B-\u0060
   1889 				\u007B-\u007E]</code>
   1890 		</p>
   1891 
   1892 		<p>Unicode Pattern_White_Space characters between tokens are
   1893 			ignored. Unquoted white space terminates reset and relation strings.</p>
   1894 
   1895 		<p>A pair of ASCII apostrophes encloses quoted literal text. They
   1896 			are normally used to enclose a syntax character or white space, or a
   1897 			whole reset/relation string containing one or more such characters,
   1898 			so that those are parsed as part of the reset/relation strings rather
   1899 			than treated as syntax. A pair of immediately adjacent apostrophes is
   1900 			used to encode one apostrophe.</p>
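		<p>A rule generator could apply the quoting conventions above roughly
			as follows. This is a minimal sketch with hypothetical names
			(<code>PATTERN_WHITE_SPACE</code>, <code>is_syntax_char</code>,
			<code>quote_literal</code>); it quotes a whole reset or relation
			string whenever the string contains a reserved ASCII character or
			Pattern_White_Space, and doubles any apostrophe.</p>

```python
# The Pattern_White_Space code points.
PATTERN_WHITE_SPACE = set("\t\n\x0b\x0c\r \x85\u200e\u200f\u2028\u2029")


def is_syntax_char(ch):
    """True for the ASCII punctuation and symbols reserved for collation syntax."""
    return ("\u0021" <= ch <= "\u002f" or "\u003a" <= ch <= "\u0040"
            or "\u005b" <= ch <= "\u0060" or "\u007b" <= ch <= "\u007e")


def quote_literal(text):
    """Enclose a reset or relation string in apostrophes if it needs quoting."""
    if not any(is_syntax_char(ch) or ch in PATTERN_WHITE_SPACE for ch in text):
        return text
    # A pair of adjacent apostrophes encodes one literal apostrophe.
    return "'" + text.replace("'", "''") + "'"
```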
   1901 
   1902 		<p>
   1903 			Code points can be escaped with
   1904 			<code>\uhhhh</code>
   1905 			and
   1906 			<code>\U00hhhhhh</code>
   1907 			escapes, as well as common escapes like
   1908 			<code>\t</code>
   1909 			and
			<code>\n</code>. (For details see the documentation of ICU
   1912 			UnicodeString::unescape().) This is particularly useful for
   1913 			default-ignorable code points, combining marks, visually indistinct
   1914 			variants, hard-to-type characters, etc. These sequences are unescaped
   1915 			before the rules are parsed; this means that even escaped syntax and
   1916 			white space characters need to be enclosed in apostrophes. For
   1917 			example:
   1918 			<code>&amp;'\u0020'='\u3000'</code>.
   1919 			Note: The unescaping is done by ICU tools (genrb) and demos before passing
   1920 			rule strings into the ICU library code.
   1921 			The ICU collation API does not unescape rule strings.
   1922 		</p>
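		<p>The order of operations (unescape first, then parse) can be
			illustrated in Python. Note the substitution: the sketch uses
			Python's built-in <code>unicode_escape</code> codec as a stand-in
			for ICU's unescaping, which is an assumption made for illustration;
			the two are not guaranteed to agree on every escape. The helper
			name <code>unescape</code> is hypothetical.</p>

```python
def unescape(rule_text):
    """Resolve \\uhhhh, \\Uhhhhhhhh, \\t and \\n escapes in an ASCII rule string."""
    return rule_text.encode("ascii").decode("unicode_escape")


# The example from the text: both strings stay inside apostrophes, because the
# unescaped U+0020 would otherwise be parsed as white space between tokens.
escaped = r"&'\u0020'='\u3000'"
unescaped = unescape(escaped)
```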
   1923 
   1924 		<p>
   1925 			The ASCII double quote must be both escaped (so that the collation
   1926 			syntax can be enclosed in pairs of double quotes in programming
   1927 			environments such as ICU resource bundle .txt files)
   1928 			and quoted. For example:
   1929 			<code>&amp;'\u0022'&lt;&lt;&lt;x</code>
   1930 		</p>
   1931 
   1932 		<p>
   1933 			Comments are allowed at the beginning, and after any complete reset,
   1934 			relation, setting, or command. A comment begins with a
   1935 			<code>#</code>
   1936 			and extends to the end of the line (according to the Unicode Newline
   1937 			Guidelines).
   1938 		</p>
   1939 
   1940 		<p>The collation syntax is case-sensitive.</p>
   1941 
   1942 		<h3>
   1943 			3.6 <a name="Orderings" href="#Orderings">Orderings</a>
   1944 		</h3>
   1945 
   1946 		<p>The root collation mappings form the initial state. Mappings
   1947 			are added and removed via a sequence of rule chains. Each tailoring
   1948 			rule builds on the current state after all of the preceding rules
   1949 			(and is not affected by any following rules). Rule chains may
   1950 			alternate with comments, settings, and special commands.</p>
   1951 
   1952 		<p>A rule chain consists of a reset followed by one or more
   1953 			relations. The reset position is a string which maps to one or more
   1954 			collation elements according to the current state. A relation
   1955 			consists of an operator and a string; it maps the string to the
   1956 			current collation elements, modified according to the operator.</p>
   1957 
   1958 		<table>
   1959 			<caption>
   1960 				<a name="Specifying_Collation_Ordering"
   1961 					href="#Specifying_Collation_Ordering">Specifying Collation
   1962 					Ordering</a>
   1963 
   1964 			</caption>
   1965 			<tr>
   1966 				<th>Relation Operator</th>
   1967 				<th>&nbsp;Example</th>
   1968 				<th>Description</th>
   1969 			</tr>
   1970 			<tr>
   1971 				<td><code>&amp;</code></td>
   1972 				<td><code>&amp; Z</code></td>
   1973 				<td>Map Z to collation elements according to the current state.
   1974 					These will be modified according to the following relation
   1975 					operators and then assigned to the corresponding relation strings.</td>
   1976 			</tr>
   1977 			<tr>
   1978 				<td><code>&lt;</code></td>
   1979 				<td><code>
   1980 						&amp; a<br> &lt; b
   1981 					</code></td>
   1982 				<td>Make &#39;b&#39; sort after &#39;a&#39;, as a <i>primary</i>
   1983 					(base-character) difference
   1984 				</td>
   1985 			</tr>
   1986 			<tr>
   1987 				<td><code>&lt;&lt;</code></td>
   1988 				<td><code>
						&amp; a<br> &lt;&lt; &#x00E4;
					</code></td>
				<td>Make &#39;&#x00E4;&#39; sort after &#39;a&#39; as a <i>secondary</i>
   1992 					(accent) difference
   1993 				</td>
   1994 			</tr>
   1995 			<tr>
   1996 				<td><code>&lt;&lt;&lt;</code></td>
   1997 				<td><code>
   1998 						&amp; a<br> &lt;&lt;&lt; A
   1999 					</code></td>
   2000 				<td>Make &#39;A&#39; sort after &#39;a&#39; as a <i>tertiary</i>
   2001 					(case/variant) difference
   2002 				</td>
   2003 			</tr>
   2004 			<tr>
   2005 				<td><code>&lt;&lt;&lt;&lt;</code></td>
   2006 				<td><code>
						&amp; &#x304B;<br> &lt;&lt;&lt;&lt; &#x30AB;
					</code></td>
				<td>Make &#39;&#x30AB;&#39; (Katakana Ka) sort after &#39;&#x304B;&#39;
   2010 					(Hiragana Ka) as a <i>quaternary</i> difference
   2011 				</td>
   2012 			</tr>
   2013 			<tr>
   2014 				<td><code>=&nbsp; </code></td>
   2015 				<td><code>
   2016 						&amp; v<br> = w&nbsp;
   2017 					</code></td>
   2018 				<td>Make &#39;w&#39; sort <i>identically</i> to &#39;v&#39;
   2019 				</td>
   2020 			</tr>
   2021 		</table>
   2022 		<p>The following shows the result of serially applying three
   2023 			rules.</p>
   2024 		<table>
   2025 			<tr>
   2026 				<th>&nbsp;</th>
   2027 				<th>Rules</th>
   2028 				<th>Result</th>
   2029 				<th>Comment</th>
   2030 			</tr>
   2031 			<tr>
   2032 				<td>1</td>
   2033 				<td>&amp; a &lt; g</td>
   2034 				<td>... a<font color="red"> &lt;<sub>1</sub> g
   2035 				</font> ...
   2036 				</td>
   2037 				<td>Put g after a.</td>
   2038 			</tr>
   2039 			<tr>
   2040 				<td>2</td>
   2041 				<td>&amp; a &lt; h &lt; k</td>
   2042 				<td>... a<font color="red"> &lt;<sub>1</sub> h &lt;<sub>1</sub>
   2043 						k
   2044 				</font> &lt;<sub>1</sub> g ...
   2045 				</td>
   2046 				<td>Now put h and k after a (inserting before the g).</td>
   2047 			</tr>
   2048 			<tr>
   2049 				<td>3</td>
   2050 				<td>&amp; h &lt;&lt; g</td>
				<td>... a &lt;<sub>1</sub> h<font color="red"> &lt;<sub>2</sub>
   2052 						g
   2053 				</font> &lt;<sub>1</sub> k ...
   2054 				</td>
   2055 				<td>Now put g after h (inserting before k).</td>
   2056 			</tr>
   2057 		</table>
   2058 		<p>Notice that relation strings can occur multiple times, and thus
   2059 			override previous rules.</p>
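		<p>The serial application of rules can be imitated with a toy model
			in Python. It is a deliberate simplification and not how weights
			are actually built: the order is kept as a list of (string, level)
			pairs, where the level is the strength of the difference from the
			preceding entry, and contractions, expansions, and numeric weights
			are ignored. The function name <code>apply_relation</code> is
			hypothetical.</p>

```python
def apply_relation(order, reset, level, string):
    """Place `string` after `reset` with a difference of the given level.

    `order` is a list of (string, level) pairs; level 1 is primary,
    2 is secondary, and so on.
    """
    # An earlier mapping for the same string is overridden.
    for i, (s, lvl) in enumerate(order):
        if s == string:
            del order[i]
            if i < len(order):
                nxt, nxt_lvl = order[i]
                order[i] = (nxt, min(nxt_lvl, lvl))
            break
    pos = [s for s, _ in order].index(reset) + 1
    # Skip entries that differ from the reset only at weaker levels.
    while pos < len(order) and order[pos][1] > level:
        pos += 1
    order.insert(pos, (string, level))
    return order


order = [("a", 1)]
apply_relation(order, "a", 1, "g")    # & a < g
apply_relation(order, "a", 1, "h")    # & a < h ...
apply_relation(order, "h", 1, "k")    #         ... < k
after_rule_2 = [s for s, _ in order]
apply_relation(order, "h", 2, "g")    # & h << g
```

		<p>Under this model the third rule moves g directly after h, before
			k, and records a secondary difference between h and g.</p>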
   2060 
   2061 		<p>Each relation uses and modifies the collation elements of the
   2062 			immediately preceding reset position or relation. A rule chain with
   2063 			two or more relations is equivalent to a sequence of atomic rules
   2064 			where each rule chain has exactly one relation, and each relation is
   2065 			followed by a reset to this same relation string.</p>
   2066 
   2067 		<p>
   2068 			<i>Example:</i>
   2069 		</p>
   2070 		<table>
   2071 			<tr>
   2072 				<th>Rules</th>
   2073 				<th>Equivalent Atomic Rules</th>
   2074 			</tr>
   2075 			<tr>
   2076 				<td>&amp; b &lt; q &lt;&lt;&lt; Q<br> &amp; a &lt; x
   2077 					&lt;&lt;&lt; X &lt;&lt; q &lt;&lt;&lt; Q &lt; z
   2078 				</td>
   2079 				<td>&amp; b &lt; q<br> &amp; q &lt;&lt;&lt; Q<br>
   2080 					&amp; a &lt; x<br> &amp; x &lt;&lt;&lt; X<br> &amp; X
   2081 					&lt;&lt; q<br> &amp; q &lt;&lt;&lt; Q<br> &amp; Q &lt; z
   2082 				</td>
   2083 			</tr>
   2084 		</table>
		<p>Such a decomposition into atomic rules is not always possible,
			because prefix and extension strings can occur in a relation but not
			in a reset (see below).</p>
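		<p>The decomposition shown in the table above can be sketched in a
			few lines of Python. This is only an illustration, not part of the
			specification: the function name <code>to_atomic_rules</code> is
			hypothetical, and the sketch assumes whitespace-separated tokens
			with no quoting, prefix, or extension strings.</p>

```python
RELATION_OPERATORS = {"<", "<<", "<<<", "<<<<", "="}


def to_atomic_rules(chain: str) -> list[str]:
    """Split one rule chain into atomic rules with one relation each.

    Hypothetical helper: assumes whitespace-separated tokens and no
    quoting, prefix (|) or extension (/) strings.
    """
    tokens = chain.split()
    if len(tokens) < 4 or tokens[0] != "&" or len(tokens) % 2 != 0:
        raise ValueError("expected '& reset (operator string)+'")
    reset = tokens[1]
    atomic = []
    for operator, string in zip(tokens[2::2], tokens[3::2]):
        if operator not in RELATION_OPERATORS:
            raise ValueError(f"not a relation operator: {operator!r}")
        atomic.append(f"& {reset} {operator} {string}")
        # The next atomic rule resets to this relation string.
        reset = string
    return atomic


if __name__ == "__main__":
    for rule in to_atomic_rules("& a < x <<< X << q <<< Q < z"):
        print(rule)
```

		<p>A chain containing a prefix or extension string would need
			different handling, because those strings cannot be carried over
			into a reset.</p>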
   2087 
   2088 		<p>
   2089 			The relation operator
   2090 			<code>=</code>
   2091 			maps its relation string to the current collation elements. Any other
   2092 			relation operator modifies the current collation elements as follows.
   2093 		</p>
   2094 		<ul>
   2095 			<li>Find the <i>last</i> collation element whose strength is at
   2096 				least as great as the strength of the operator. For example, for <code>&lt;&lt;</code>
   2097 				find the last primary or secondary CE. This CE will be modified; all
   2098 				following CEs should be removed. If there is no such CE, then reset
   2099 				the collation elements to a single completely-ignorable CE.
   2100 			</li>
   2101 			<li>Increment the collation element weight corresponding to the
   2102 				strength of the operator. For example, for <code>&lt;&lt;</code>
   2103 				increment the secondary weight.
   2104 			</li>
   2105 			<li>The new weight must be less than the next weight for the
   2106 				same combination of higher-level weights of any collation element
   2107 				according to the current state.</li>
   2108 			<li>Weights must be allocated in accordance with the <a
   2109 				href="http://www.unicode.org/reports/tr10/#Well-Formed">UCA
   2110 					well-formedness conditions</a>.
   2111 			</li>
   2112 			<li>When incrementing any weight, lower-level weights should be
   2113 				reset to the common values, to help with sort key compression.</li>
   2114 		</ul>
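		<p>The first, second, and last list items above can be sketched as
			follows, under strong simplifying assumptions: weights are small
			integers, &quot;increment&quot; simply adds 1, and the constraints
			on weight allocation (the third and fourth items) are not modelled.
			The names <code>tailor_after</code> and <code>COMMON</code> and all
			weight values are made up for illustration; a real implementation
			allocates weights quite differently.</p>

```python
CE = tuple[int, int, int]  # (primary, secondary, tertiary); 0 means ignorable

COMMON = 5  # hypothetical "common" secondary/tertiary weight


def tailor_after(ces: list[CE], level: int) -> list[CE]:
    """Apply a relation of the given level (1=primary, 2=secondary,
    3=tertiary) to the current collation elements.

    Toy model only: it does not check that the new weight stays below
    the next existing weight, nor UCA well-formedness.
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    # Find the last CE that has a non-zero weight at `level` or above.
    index = None
    for i, ce in enumerate(ces):
        if any(weight != 0 for weight in ce[:level]):
            index = i
    if index is None:
        # No such CE: start from a single completely-ignorable CE.
        kept, base = [], (0, 0, 0)
    else:
        # All CEs following the one being modified are removed.
        kept, base = list(ces[:index]), ces[index]
    weights = list(base)
    weights[level - 1] += 1
    # Reset lower-level weights to the common value.
    for lower in range(level, 3):
        weights[lower] = COMMON
    return kept + [(weights[0], weights[1], weights[2])]


if __name__ == "__main__":
    ae = [(100, COMMON, COMMON), (140, COMMON, COMMON)]  # invented weights
    print(tailor_after(ae, 1))
```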
   2115 
   2116 		<p>
   2117 			In all cases, even for
   2118 			<code>=</code>
			, the case bits are recomputed according to <i>Section 3.14, <a
   2120 				href="#Case_Parameters">Case Parameters</a></i>. (This can be skipped if
   2121 			an implementation does not support the caseLevel or caseFirst
   2122 			settings.)
   2123 		</p>
   2124 
   2125 		<p>
   2126 			For example,
   2127 			<code>&amp;ae&lt;x</code>
   2128 			maps x to two collation elements. The first one is the same as for
   2129 			a, and the second one has a primary weight between those for e
			and f. As a result, x sorts between ae and af. (If the
			primary of the first collation element were incremented instead, then
			x would sort after az. Although x would then still sort
			primary-after ae, that result would be surprising and sub-optimal.)
   2134 		</p>
   2135 
   2136 		<p>Some additional operators are provided to save space with large
   2137 			tailorings. The addition of a * to the relation operator indicates
			that each of the following single characters is to be handled as if
   2139 			they were separate relations with the corresponding strength. Each of
   2140 			the following single characters must be NFD-inert, that is, it does
   2141 			not have a canonical decomposition and it does not reorder (ccc=0).
   2142 			This keeps abbreviated rules unambiguous.</p>
   2143 		<p>
   2144 			A starred relation operator is followed by a sequence of characters
   2145 			with the same quoting/escaping rules as normal relation strings. Such
   2146 			a sequence can also be followed by one or more pairs of - and
   2147 			another sequence of characters. The single characters adjacent to the
   2148 			- establish a code point order range. The same character cannot be
   2149 			both the end of a range and the start of another range. (For example,
			<code>&lt;*a-d-g</code>
   2151 			is not allowed.)
   2152 		</p>
   2153 		<table>
   2154 			<caption>
   2155 				<a name="Abbreviating_Ordering_Specifications"
   2156 					href="#Abbreviating_Ordering_Specifications">Abbreviating
   2157 					Ordering Specifications</a>
   2158 			</caption>
   2159 			<tr>
   2160 				<th>Relation Operator</th>
   2161 				<th>Example</th>
   2162 				<th>Equivalent</th>
   2163 			</tr>
   2164 			<tr>
   2165 				<td><code>&lt;*</code></td>
   2166 				<td><code>
   2167 						&amp; <span style="color: blue">a</span><br> &lt;* <span
   2168 							style="color: blue">bcd-gp-s</span>&nbsp;
   2169 					</code></td>
   2170 				<td><code>
   2171 						&amp; <span style="color: blue">a</span><br> &lt; <span
   2172 							style="color: blue">b </span>&lt;<span style="color: blue">
   2173 							c </span>&lt;<span style="color: blue"> d</span> &lt; <span
   2174 							style="color: blue">e</span> &lt; <span style="color: blue">f</span>
   2175 						&lt; <span style="color: blue">g</span> &lt; <span
   2176 							style="color: blue">p</span> &lt; <span style="color: blue">q</span>
   2177 						&lt; <span style="color: blue">r</span> &lt; <span
   2178 							style="color: blue">s</span>
   2179 					</code></td>
   2180 			</tr>
   2181 			<tr>
   2182 				<td><code>&lt;&lt;*</code></td>
   2183 				<td><code>
   2184 						&amp;<span style="color: blue"> a</span><br> &lt;&lt;*<span
   2185 							style="color: blue"> </span>
   2186 					</code></td>
   2187 				<td><code>
   2188 						&amp;<span style="color: blue"> a</span><br> &lt;&lt;<span
   2189 							style="color: blue">  </span>&lt;&lt; <span style="color: blue">
   2190 						</span>&lt;&lt; <span style="color: blue"></span>
   2191 					</code></td>
   2192 			</tr>
   2193 			<tr>
   2194 				<td><code>&lt;&lt;&lt;*</code></td>
   2195 				<td><code>
   2196 						&amp;<span style="color: blue"> p</span><br> &lt;&lt;&lt;* <span
   2197 							style="color: blue">P</span>
   2198 					</code></td>
   2199 				<td><code>
   2200 						&amp;<span style="color: blue"> p</span><br> &lt;&lt;&lt; <span
   2201 							style="color: blue">P</span> &lt;&lt;&lt; <span
   2202 							style="color: blue"></span> &lt;&lt;&lt; <span
   2203 							style="color: blue"></span>
   2204 					</code></td>
   2205 			</tr>
   2206 			<tr>
   2207 				<td><code>&lt;&lt;&lt;&lt;*</code></td>
   2208 				<td><code>
   2209 						&amp;<span style="color: blue"> k</span><br>
   2210 						&lt;&lt;&lt;&lt;* <span style="color: blue">qQ</span>
   2211 					</code></td>
   2212 				<td><code>
   2213 						&amp;<span style="color: blue"> k</span><br> &lt;&lt;&lt;&lt;
   2214 						<span style="color: blue">q</span> &lt;&lt;&lt;&lt; <span
   2215 							style="color: blue">Q</span>
   2216 					</code></td>
   2217 			</tr>
   2218 			<tr>
   2219 				<td><code>=*</code></td>
   2220 				<td><code>
   2221 						&amp;<span style="color: blue"> v</span><br> =* <span
   2222 							style="color: blue">VwW</span>
   2223 					</code></td>
   2224 				<td><code>
   2225 						&amp;<span style="color: blue"> v</span><br> = <span
   2226 							style="color: blue">V </span>= <span style="color: blue">w
   2227 						</span>= <span style="color: blue">W</span>
   2228 					</code></td>
   2229 			</tr>
   2230 		</table>
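		<p>The expansion of a starred operator into separate relations,
			as in the table above, can be sketched as follows. The function
			names are hypothetical, quoting and escaping are ignored, and the
			sketch assumes that a range must ascend in code point order. The
			NFD-inert test follows the description given earlier in this
			section (no canonical decomposition, ccc=0).</p>

```python
import unicodedata


def is_nfd_inert(char: str) -> bool:
    """True if char has ccc=0 and is unchanged by NFD normalization."""
    return (unicodedata.combining(char) == 0
            and unicodedata.normalize("NFD", char) == char)


def expand_starred(operator: str, chars: str) -> list[tuple[str, str]]:
    """Expand e.g. ('<*', 'bcd-gp-s') into [('<', 'b'), ('<', 'c'), ...].

    Hypothetical helper: no quoting/escaping; '-' always marks a range.
    """
    if not operator.endswith("*") or len(operator) < 2:
        raise ValueError("expected a starred relation operator")
    plain = operator[:-1]
    expanded: list[str] = []
    ended_range = False  # True if the last character closed a range
    i = 0
    while i < len(chars):
        if chars[i] == "-":
            if (not expanded or ended_range
                    or i + 1 >= len(chars) or chars[i + 1] == "-"):
                raise ValueError("malformed range")
            start, end = ord(expanded[-1]), ord(chars[i + 1])
            if end <= start:
                raise ValueError("range must ascend in code point order")
            expanded.extend(chr(cp) for cp in range(start + 1, end + 1))
            ended_range = True
            i += 2
        else:
            expanded.append(chars[i])
            ended_range = False
            i += 1
    for char in expanded:
        if not is_nfd_inert(char):
            raise ValueError(f"U+{ord(char):04X} is not NFD-inert")
    return [(plain, char) for char in expanded]


if __name__ == "__main__":
    print(expand_starred("<*", "bcd-gp-s"))
```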
   2231 		<h3>
   2232 			3.7 <a name="Contractions" href="#Contractions">Contractions</a>
   2233 		</h3>
   2234 
   2235 		<p>A multi-character relation string defines a contraction.</p>
   2236 
   2237 		<table>
   2238 			<caption>
   2239 				<a name="Specifying_Contractions" href="#Specifying_Contractions">Specifying
   2240 					Contractions</a>
   2241 			</caption>
   2242 			<tr>
   2243 				<th>Example</th>
   2244 				<th>Description</th>
   2245 			</tr>
   2246 			<tr>
   2247 				<td><code>
   2248 						&amp; k<br> &lt; ch
   2249 					</code></td>
   2250 				<td>Make the sequence &#39;ch&#39; sort after &#39;k&#39;, as a
   2251 					primary (base-character) difference</td>
   2252 			</tr>
   2253 		</table>
   2254 
   2255 		<h3>
   2256 			3.8 <a name="Expansions" href="#Expansions">Expansions</a>
   2257 		</h3>
   2258 		<p>
   2259 			A mapping to multiple collation elements defines an expansion. This
   2260 			is normally the result of a reset position (and/or preceding
   2261 			relation) that yields multiple collation elements, for example
   2262 			<code>&amp;ae&lt;x</code>
   2263 			or
			<code>&amp;&aelig;&lt;y</code>
   2265 			.
   2266 		</p>
   2267 
   2268 		<p>
   2269 			A relation string can also be followed by
   2270 			<code>/</code>
   2271 			and an <i>extension string</i>. The extension string is mapped to
   2272 			collation elements according to the current state, and the relation
   2273 			string is mapped to the concatenation of the regular CEs and the
   2274 			extension CEs. The extension CEs are not modified, not even their
   2275 			case bits. The extension CEs are <i>not</i> retained for following
   2276 			relations.
   2277 		</p>
   2278 
   2279 		<p>
   2280 			For example,
   2281 			<code>&amp;a&lt;z/e</code>
   2282 			maps z to an expansion similar to
   2283 			<code>&amp;ae&lt;x</code>
   2284 			. However, the first CE of z is primary-after that of a, and the
   2285 			second CE is exactly that of e, which yields the order ae &lt; x
   2286 			&lt; af &lt; ag &lt; ... &lt; az &lt; z &lt; b.
   2287 		</p>
   2288 
   2289 		<p>
   2290 			The choice of reset-to-expansion vs. use of an extension string can
   2291 			be exploited to affect contextual mappings. For example,
			<code>&amp;L&middot;=x</code>
			yields a second CE for x equal to the context-sensitive
			middle-dot-after-L (which is a secondary CE in the root collation).
			On the other hand,
			<code>&amp;L=x/&middot;</code>
   2297 			yields a second CE of the middle dot by itself (which is a primary
   2298 			CE).
   2299 		</p>
   2300 
   2301 		<p>
   2302 			The two ways of specifying expansions also differ in how case bits
   2303 			are computed. When some of the CEs are copied verbatim from an
			extension string, then the relation string's case bits are
   2305 			distributed over a smaller number of normal CEs. For example,
   2306 			<code>&amp;aE=Ch</code>
   2307 			yields an uppercase CE and a lowercase CE, but
   2308 			<code>&amp;a=Ch/E</code>
   2309 			yields a mixed-case CE (for C and h together) followed by an
   2310 			uppercase CE (copied from E).
   2311 		</p>
   2312 
   2313 		<p>In summary, there are two ways of specifying expansions which
   2314 			produce subtly different mappings. The use of extension strings is
   2315 			unusual but sometimes necessary.</p>
   2316 
   2317 
   2318 		<h3>
   2319 			3.9 <a name="Context_Before" href="#Context_Before">Context
   2320 				Before</a>
   2321 		</h3>
   2322 		<p>
   2323 			A relation string can have a prefix (context before) which makes the
   2324 			mapping from the relation string to its tailored position conditional
   2325 			on the string occurring after that prefix. For details see the
   2326 			specification of <i><a href="#Context_Sensitive_Mappings">Context-Sensitive
   2327 					Mappings</a></i>.
   2328 		</p>
   2329 		<p>For example, suppose that &quot;-&quot; is sorted like the
   2330 			previous vowel. Then one could have rules that take &quot;a-&quot;,
   2331 			&quot;e-&quot;, and so on. However, that means that every time a very
   2332 			common character (a, e, ...) is encountered, a system will slow down
   2333 			as it looks for possible contractions. An alternative is to indicate
   2334 			that when &quot;-&quot; is encountered, and it comes after an
   2335 			&#39;a&#39;, it sorts like an &#39;a&#39;, and so on.</p>
   2336 		<table>
   2337 			<caption>
   2338 				<a name="Specifying_Previous_Context"
   2339 					href="#Specifying_Previous_Context">Specifying Previous Context</a>
   2340 			</caption>
   2341 			<tr>
   2342 				<th>Rules</th>
   2343 			</tr>
   2344 			<tr>
   2345 				<td><code>
   2346 						&amp; a &lt;&lt;&lt; a | '-'<br> &amp; e &lt;&lt;&lt; e | '-'<br>
   2347 						...
   2348 					</code></td>
   2349 			</tr>
   2350 		</table>
   2351 		<p>Both the prefix and extension strings can occur in a relation.
   2352 			For example, the following are allowed:</p>
   2353 		<ul>
   2354 			<li><code>&lt; abc | def / ghi</code></li>
   2355 			<li><code>&lt; def / ghi</code></li>
   2356 			<li><code>&lt; abc | def</code></li>
   2357 		</ul>
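		<p>As a rough illustration of how the three parts of such a
			relation fit together, the following sketch splits the text after
			the relation operator into prefix, relation string, and extension.
			The names <code>Relation</code> and <code>split_relation</code> are
			hypothetical, and quoting and escaping (for example
			<code>'-'</code>) are not handled.</p>

```python
from typing import NamedTuple


class Relation(NamedTuple):
    prefix: str      # context before, empty if absent
    string: str      # the relation string itself
    extension: str   # extension string, empty if absent


def split_relation(text: str) -> Relation:
    """Split 'prefix | string / extension' into its three parts.

    Hypothetical helper: treats every '|' and '/' as syntax, so quoted
    or escaped occurrences are not supported.
    """
    prefix, bar, rest = text.partition("|")
    if not bar:
        # No '|' present: there is no prefix.
        prefix, rest = "", text
    string, _, extension = rest.partition("/")
    return Relation(prefix.strip(), string.strip(), extension.strip())


if __name__ == "__main__":
    for example in ("abc | def / ghi", "def / ghi", "abc | def"):
        print(split_relation(example))
```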
   2358 		<h3>
   2359 			3.10 <a name="Placing_Characters_Before_Others"
   2360 				href="#Placing_Characters_Before_Others">Placing Characters
   2361 				Before Others</a>
   2362 		</h3>
   2363 		<p>There are certain circumstances where characters need to be
   2364 			placed before a given character, rather than after. This is the case
   2365 			with Pinyin, for example, where certain accented letters are
   2366 			positioned before the base letter. That is accomplished with the
   2367 			following syntax.</p>
   2368 		<pre>&amp;[before 2] a &lt;&lt; </pre>
   2369 		<p>The before-strength can be 1 (primary), 2 (secondary), or 3
   2370 			(tertiary).</p>
   2371 		<p>It is an error if the strength of the reset-before differs from
   2372 			the strength of the immediately following relation. Thus the
   2373 			following are errors.</p>
   2374 		<ul>
   2375 			<li><code>&amp;[before 2] a &lt;  # error</code></li>
   2376 			<li><code>&amp;[before 2] a &lt;&lt;&lt;  # error</code></li>
   2377 		</ul>
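		<p>A rule parser could enforce this constraint with a check along
			the following lines. This is a sketch only; the names
			<code>OPERATOR_STRENGTH</code> and <code>check_reset_before</code>
			are hypothetical, and starred operators are not considered.</p>

```python
OPERATOR_STRENGTH = {"<": 1, "<<": 2, "<<<": 3}


def check_reset_before(before_strength: int, operator: str) -> None:
    """Raise ValueError unless the relation operator that follows
    '&[before N]' has strength N.

    Hypothetical helper: only the plain operators <, << and <<< can
    match, since the before-strength is limited to 1, 2 or 3.
    """
    if before_strength not in (1, 2, 3):
        raise ValueError("before-strength must be 1, 2 or 3")
    if OPERATOR_STRENGTH.get(operator) != before_strength:
        raise ValueError(
            f"[before {before_strength}] cannot be followed by {operator!r}"
        )


if __name__ == "__main__":
    check_reset_before(2, "<<")      # accepted
    try:
        check_reset_before(2, "<")   # rejected
    except ValueError as error:
        print(error)
```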
   2378 
   2379 		<h3>
   2380 			3.11 <a name="Logical_Reset_Positions"
   2381 				href="#Logical_Reset_Positions">Logical Reset Positions</a>
   2382 		</h3>
   2383 
   2384 		<p>The CLDR table (based on UCA) has the following overall
   2385 			structure for weights, going from low to high.</p>
   2386 		<table>
   2387 			<caption>
   2388 				<a name="Specifying_Logical_Positions"
   2389 					href="#Specifying_Logical_Positions">Specifying Logical
   2390 					Positions</a>
   2391 			</caption>
   2392 			<tr>
   2393 				<th>Name</th>
   2394 				<th>Description</th>
   2395 				<th>UCA Examples</th>
   2396 			</tr>
   2397 			<tr>
   2398 				<td>first tertiary ignorable<br> ...<br> last
   2399 					tertiary ignorable
   2400 				</td>
   2401 				<td>p, s, t = ignore</td>
   2402 				<td>Control Codes<br> Format Characters<br> Hebrew
   2403 					Points<br> Tibetan Signs<br> ...
   2404 				</td>
   2405 			</tr>
   2406 			<tr>
   2407 				<td>first secondary ignorable<br> ...<br> last
   2408 					secondary ignorable
   2409 				</td>
   2410 				<td>p, s = ignore</td>
   2411 				<td>None in UCA</td>
   2412 			</tr>
   2413 			<tr>
   2414 				<td>first primary ignorable<br> ...<br> last primary
   2415 					ignorable
   2416 				</td>
   2417 				<td>p = ignore</td>
   2418 				<td>Most combining marks</td>
   2419 			</tr>
   2420 			<tr>
   2421 				<td>first variable<br> ...<br> last variable
   2422 				</td>
   2423 				<td><i><b>if</b> alternate = non-ignorable<br> </i>p !=
   2424 					ignore,<br> <i><b>if</b> alternate = shifted</i><br> p,
   2425 					s, t = ignore</td>
   2426 				<td>Whitespace,<br> Punctuation
   2427 				</td>
   2428 			</tr>
   2429 			<tr>
   2430 				<td>first regular<br> ...<br> last regular
   2431 				</td>
   2432 				<td>p != ignore</td>
   2433 				<td>General Symbols<br> Currency Symbols<br> Numbers<br>
   2434 					Latin<br> Greek<br> ...
   2435 				</td>
   2436 			</tr>
   2437 			<tr>
   2438 				<td>first implicit<br>...<br>last implicit
   2439 				</td>
   2440 				<td>p != ignore, assigned automatically</td>
   2441 				<td>CJK, CJK compatibility (those that are not decomposed)<br>
   2442 					CJK Extension A, B, C, ...<br> Unassigned
   2443 				</td>
   2444 			</tr>
   2445 			<tr>
   2446 				<td>first trailing<br> ...<br> last trailing
   2447 				</td>
   2448 				<td>p != ignore,<br> used for trailing syllable components
   2449 				</td>
   2450 				<td>Jamo Trailing<br> Jamo Leading<br>U+FFFD<br>U+FFFF
   2451 				</td>
   2452 			</tr>
   2453 		</table>
   2454 		<p>
   2455 			Each of the above Names can be used with a reset to position
   2456 			characters relative to that logical position. That allows characters
   2457 			to be ordered before or after a <i>logical</i> position rather than a
   2458 			specific character.
   2459 		</p>
   2460 		<blockquote>
   2461 			<p class="note">
   2462 				<b>Note: </b>The reason for this is so that tailorings can be more
   2463 				stable. A future version of the UCA might add characters at any
   2464 				point in the above list. Suppose that you set character X to be
   2465 				after Y. It could be that you want X to come after Y, no matter what
				future characters are added; or it could be that you just want X to
   2467 				come after a given logical position, for example, after the last
   2468 				primary ignorable.
   2469 			</p>
   2470 		</blockquote>
   2471 
   2472 		<p>Each of these special reset positions always maps to a single
   2473 			collation element.</p>
   2474 
   2475 		<p>Here is an example of the syntax:</p>
   2476 		<pre>&amp; [first tertiary ignorable] &lt;&lt;  </pre>
   2477 		<p>For example, to make a character be a secondary ignorable, one
   2478 			can make it be immediately after (at a secondary level) a specific
   2479 			character (like a combining diaeresis), or one can make it be
   2480 			immediately after the last secondary ignorable.</p>
   2481 
   2482 		<p>
   2483 			Each special reset position adjusts to the effects of preceding
   2484 			rules, just like normal reset position strings. For example, if a
   2485 			tailoring rule creates a new collation element after
   2486 			<code>&amp;[last variable]</code>
   2487 			(via explicit tailoring after that, or via tailoring after the
   2488 			relevant character), then this new CE becomes the new <i>last
   2489 				variable</i> CE, and is used in following resets to
   2490 			<code>[last variable]</code>
   2491 			.
   2492 		</p>
   2493 
   2494 		<p>[first variable] and [first regular] and [first trailing]
   2495 			should be the first real such CEs (e.g., CE(U+0060 &#x0060;)), as
   2496 			adjusted according to the tailoring, not the boundary CEs (see the
   2497 			FractionalUCA.txt first primary mappings starting with U+FDD1).</p>
   2498 
   2499 		<p>
   2500 			<code>[last regular]</code>
   2501 			is not actually the last normal CE with a primary weight before
   2502 			implicit primaries. It is used to tailor large numbers of characters,
   2503 			usually CJK, into the script=Hani range between the last regular
   2504 			script and the first implicit CE. (The first group of implicit CEs is
   2505 			for Han characters.) Therefore,
   2506 			<code>[last regular]</code>
   2507 			is set to the first Hani CE, the artificial script boundary CE at the
   2508 			beginning of this range. For example:
   2509 			<code>&amp;[last regular]&lt;*...</code>
   2510 		</p>
   2511 
   2512 		<p>The [last trailing] is the CE of U+FFFF. Tailoring to that is
   2513 			not allowed.</p>
   2514 
   2515 		<p>
   2516 			The
   2517 			<code>[last variable]</code>
   2518 			indicates the &quot;highest&quot; character that is treated as
   2519 			punctuation with alternate handling.
   2520 		</p>
   2521 		<p>
   2522 			The value can be changed by using the maxVariable setting. This takes
   2523 			effect, however, after the rules have been built, and does not affect
   2524 			any characters that are reset relative to the
   2525 			<code>[last variable]</code>
   2526 			value when the rules are being built. The maxVariable setting might
   2527 			also be changed via a runtime parameter. That also does not affect
   2528 			the rules.<br> (In CLDR 24 and earlier, the variable top could
   2529 			also be set by using a tailoring rule with
   2530 			<code>[variable top]</code>
   2531 			in the place of a relation string.)
   2532 		</p>
   2533 
   2534 		<h3>
   2535 			3.12 <a name="Special_Purpose_Commands"
   2536 				href="#Special_Purpose_Commands">Special-Purpose Commands</a>
   2537 		</h3>
   2538 		<p>The import command imports rules from another collation. This
   2539 			allows for better maintenance and smaller rule sizes. The source is a
   2540 			BCP 47 language tag with an optional collation type but without other
   2541 			extensions. The collation type is the BCP 47 form of the collation
   2542 			type in the source; it defaults to "standard".</p>
   2543 		<p>
   2544 			<em>Examples: </em>
   2545 		</p>
   2546 		<ul>
   2547 			<li><code>[import de-u-co-phonebk]</code> &nbsp; (not
   2548 				"...-co-phonebook")</li>
   2549 			<li><code>[import und-u-co-search]</code> &nbsp; (not
   2550 				"root-...")</li>
   2551 			<li><code>[import ja-u-co-private-kana]</code> &nbsp; (language
   2552 				"ja" required even when this import itself is in another "ja"
   2553 				tailoring.)</li>
   2554 		</ul>
   2555 
   2556 		<table>
   2557 			<caption>
   2558 				<a name="Special_Purpose_Elements" href="#Special_Purpose_Elements">Special-Purpose
   2559 					Elements</a>
   2560 			</caption>
   2561 			<tr>
   2562 				<th>Rule Syntax</th>
   2563 			</tr>
   2564 			<tr>
   2565 				<td>[suppressContractions [-]]</td>
   2566 			</tr>
   2567 			<tr>
   2568 				<td>[optimize [-]]</td>
   2569 			</tr>
   2570 		</table>
   2571 		<p>
   2572 			The <i>suppress contractions</i> tailoring command turns off any
   2573 			existing contractions that begin with those characters, as well as
   2574 			any prefixes for those characters. It is typically used to turn off
   2575 			the Cyrillic contractions in the UCA, since they are not used in many
   2576 			languages and have a considerable performance penalty. The argument
   2577 			is a <a href="tr35.html#Unicode_Sets">Unicode Set</a>.
   2578 		</p>
   2579 
   2580 		<p>
   2581 			The <i>suppress contractions</i> command has immediate effect on the
   2582 			current set of mappings, including mappings added by preceding rules.
   2583 			Following rules are processed after removing any context-sensitive
   2584 			mappings originating from any of the characters in the set.
   2585 		</p>
   2586 
   2587 		<p>
   2588 			The <i>optimize</i> tailoring command is purely for performance. It
   2589 			indicates that those characters are sufficiently common in the target
   2590 			language for the tailoring that their performance should be enhanced.
   2591 		</p>
   2592 		<p>The reason that these are not settings is so that their
   2593 			contents can be arbitrary characters.</p>
   2594 
   2595 		<hr width="50%">
   2596 		<p>
   2597 			<i>Example:</i>
   2598 		</p>
   2599 		<p>
   2600 			The following is a simple example that combines portions of different
   2601 			tailorings for illustration. For more complete examples, see the
   2602 			actual locale data: <a
   2603 				href="http://unicode.org/repos/cldr/tags/latest/common/collation/ja.xml">Japanese</a>,
   2604 			<a
   2605 				href="http://unicode.org/repos/cldr/tags/latest/common/collation/zh.xml">Chinese</a>,
   2606 			<a
   2607 				href="http://unicode.org/repos/cldr/tags/latest/common/collation/sv.xml">Swedish</a>,
   2608 			and <a
   2609 				href="http://unicode.org/repos/cldr/tags/latest/common/collation/de.xml">German</a>
   2610 			(type=&quot;phonebook&quot;) are particularly illustrative.
   2611 		</p>
   2612 		<pre>&lt;collation&gt;
   2613   &lt;cr&gt;&lt;![CDATA[
   2614     [caseLevel on]
   2615     &amp;Z
   2616     &lt;  &lt;&lt;&lt; 
   2617     &lt;  &lt;&lt;&lt;  &lt;&lt;&lt; aa &lt;&lt;&lt; aA &lt;&lt;&lt; Aa &lt;&lt;&lt; AA
   2618     &lt;  &lt;&lt;&lt; 
   2619     &lt;  &lt;&lt;&lt;  &lt;&lt;  &lt;&lt;&lt; 
   2620     &lt;  &lt;&lt;&lt;  &lt;&lt;  &lt;&lt;&lt; 
   2621     &amp;V &lt;&lt;&lt;* wW
   2622     &amp;Y &lt;&lt;&lt;* 
   2623     &amp;[last non-ignorable]
   2624     <span style="color: green"># The following is equivalent to &lt;&lt;&lt;...</span>
   2625     &lt;* 
   2626     &lt;* 
   2627   ]]&gt;&lt;/cr&gt;
   2628 &lt;/collation&gt;</pre>
   2629 
   2630 		<h3>
   2631 			3.13 <a name="Script_Reordering" href="#Script_Reordering">Collation
   2632 				Reordering</a>
   2633 		</h3>
   2634 		<p>Collation reordering allows scripts and certain other defined
   2635 			blocks of characters to be moved relative to each other
   2636 			parametrically, without changing the detailed rules for all the
   2637 			characters involved. This reordering is done on top of any specific
   2638 			ordering rules within the script or block currently in effect.
   2639 			Reordering can specify groups to be placed at the start and/or the
   2640 			end of the collation order. For example, to reorder Greek characters
   2641 			before Latin characters, and digits afterwards (but before other
   2642 			scripts), the following can be used:</p>
   2643 		<table>
   2644 			<tr>
   2645 				<th>Rule Syntax</th>
   2646 				<th>Locale Identifier</th>
   2647 			</tr>
   2648 			<tr>
   2649 				<td><code>[reorder Grek Latn digit]</code></td>
   2650 				<td><code>en-u-kr-grek-latn-digit</code></td>
   2651 			</tr>
   2652 		</table>
   2653 		<p>
   2654 			In each case, a sequence of <em><strong>reorder_codes</strong></em>
   2655 			is used, separated by spaces in the settings attribute and in rule
   2656 			syntax, and by hyphens in locale identifiers.
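		<p>The correspondence between the two forms in the table above can
			be sketched as a small conversion. The function name
			<code>reorder_locale_id</code> is hypothetical, and the sketch
			assumes a language tag that carries no other <code>-u-</code>
			keywords.</p>

```python
def reorder_locale_id(language: str, rule: str) -> str:
    """Convert a '[reorder ...]' rule into a locale identifier that
    carries the same codes in a -u-kr- keyword.

    Hypothetical helper: `language` is assumed to be a plain language
    tag without an existing -u- extension.
    """
    text = rule.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected a bracketed rule")
    words = text[1:-1].split()
    if not words or words[0] != "reorder":
        raise ValueError("expected a [reorder ...] rule")
    codes = words[1:]
    if not codes:
        # An empty sequence is an error in a locale identifier.
        raise ValueError("no reorder codes given")
    # Codes are space-separated in rules and hyphen-separated in locale IDs.
    return f"{language}-u-kr-" + "-".join(code.lower() for code in codes)


if __name__ == "__main__":
    print(reorder_locale_id("en", "[reorder Grek Latn digit]"))
```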
   2657 		</p>
   2658 		<p>
   2659 			A <strong><em>reorder_code</em></strong> is any of the following
   2660 			special codes:
   2661 		</p>
   2662 		<ol>
   2663 			<li><strong>space, punct, symbol, currency, digit</strong> -
   2664 				core groups of characters below 'a'</li>
   2665 			<li><strong>any script code</strong> except <strong>Common</strong>
   2666 				and <strong>Inherited</strong>.
   2667 				<ul>
   2668 					<li>Some pairs of scripts sort primary-equal and always
						reorder together. For example, Katakana characters are always
   2670 						reordered with Hiragana.</li>
   2671 				</ul></li>
   2672 			<li><strong>others</strong> - where all codes not explicitly
   2673 				mentioned should be ordered. The script code <strong>Zzzz</strong>
   2674 				(Unknown Script) is a synonym for <strong>others</strong>.</li>
   2675 		</ol>
   2676 		<p>It is an error if a code occurs multiple times.</p>
   2677 
   2678 		<p>
   2679 			It is an error if the sequence of reorder codes is empty in the XML
   2680 			attribute or in the locale identifier. Some implementations may
   2681 			interpret an empty sequence in the
   2682 			<code>[reorder]</code>
   2683 			rule syntax as a reset to the DUCET ordering, synonymous with
   2684 			<code>[reorder others]</code>
   2685 			; other implementations may forbid an empty sequence in the rule
   2686 			syntax as well.
   2687 		</p>
   2688 
   2689 		<p>
   2690 			Interaction with <strong>alternate=shifted</strong>: Whether a
   2691 			primary weight is variable is determined according to the variable
   2692 			top, before applying script reordering. Once that is determined,
   2693 			script reordering is applied to the primary weight regardless of
   2694 			whether it is regular (used in the primary level) or shifted
   2695 			(used in the quaternary level).
   2696 		</p>
   2697 
   2698 		<h4>
   2699 			3.13.1 <a name="Interpretation_reordering"
   2700 				href="#Interpretation_reordering">Interpretation of a reordering
   2701 				list</a>
   2702 		</h4>
   2703 		<p>The reordering list is interpreted as if it were processed in
   2704 			the following way.</p>
   2705 		<ol>
   2706 			<li>If any core code is not present, then it is inserted at the
   2707 				front of the list in the order given above.</li>
   2708 			<li>If the <strong>others</strong> code is not present, then it
   2709 				is inserted at the end of the list.
   2710 			</li>
   2711 			<li>The <strong>others</strong> code is replaced by the list of
   2712 				all script codes not explicitly mentioned, in DUCET order.
   2713 			</li>
   2714 			<li>The reordering list is now complete, and used to reorder
   2715 				characters in collation accordingly.</li>
   2716 		</ol>
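		<p>The steps above can be sketched as follows. The function name
			<code>complete_reordering</code> and its
			<code>ducet_scripts</code> parameter are hypothetical: the caller
			supplies the script codes in DUCET order, and the short list used
			in the demonstration is an abbreviated stand-in, not the real
			data. Codes are assumed to use the capitalization of the rule
			syntax.</p>

```python
CORE_CODES = ["space", "punct", "symbol", "currency", "digit"]


def complete_reordering(codes: list[str],
                        ducet_scripts: list[str]) -> list[str]:
    """Complete a reordering list following the four steps above.

    Hypothetical helper: `ducet_scripts` must list every script code in
    DUCET order; it is supplied by the caller.
    """
    # Zzzz is a synonym for "others".
    codes = ["others" if code == "Zzzz" else code for code in codes]
    if len(set(codes)) != len(codes):
        raise ValueError("a code occurs multiple times")
    # Step 1: insert missing core codes at the front, in the given order.
    codes = [core for core in CORE_CODES if core not in codes] + codes
    # Step 2: append "others" if it is not present.
    if "others" not in codes:
        codes.append("others")
    # Step 3: replace "others" with all scripts not explicitly mentioned.
    mentioned = set(codes)
    remaining = [s for s in ducet_scripts if s not in mentioned]
    position = codes.index("others")
    # Step 4: the list is now complete.
    return codes[:position] + remaining + codes[position + 1:]


if __name__ == "__main__":
    sample_scripts = ["Latn", "Grek", "Cyrl", "Arab", "Hani"]  # abbreviated
    print(complete_reordering(["Grek", "Latn", "digit"], sample_scripts))
```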
   2717 		<p>
   2718 			The locale data may have a particular ordering. For example, the
   2719 			Czech locale data could put digits after all letters, with
   2720 			<code>[reorder others digit]</code>
			. Any reordering codes specified on top of that (such as with a BCP 47
   2722 			locale identifier) completely replace what was there. To specify a
   2723 			version of collation that completely resets any existing reordering
   2724 			to the DUCET ordering, the single code <strong>Zzzz</strong> or <strong>others</strong>
			can be used, as below.
   2726 		</p>
   2727 		<p>
   2728 			<em>Examples: </em>
   2729 		</p>
   2730 		<table cellpadding="0" cellspacing="0">
   2731 			<tbody>
   2732 				<tr>
   2733 					<th>Locale Identifier</th>
   2734 					<th>Effect</th>
   2735 				</tr>
   2736 				<tr>
   2737 					<td><code>en-u-kr-latn-digit</code></td>
   2738 					<td>Reorder digits after Latin characters (but before other
   2739 						scripts like Cyrillic).</td>
   2740 				</tr>
   2741 				<tr>
   2742 					<td><code>en-u-kr-others-digit</code></td>
   2743 					<td>Reorder digits after all other characters.</td>
   2744 				</tr>
   2745 				<tr>
   2746 					<td><code>en-u-kr-arab-cyrl-others-symbol</code></td>
   2747 					<td>Reorder Arabic characters first, then Cyrillic, and put
						symbols at the end, after all other characters.</td>
   2749 				</tr>
   2750 				<tr>
   2751 					<td><code>en-u-kr-others</code></td>
   2752 					<td>Remove any locale-specific reordering, and use DUCET order
   2753 						for reordering blocks.</td>
   2754 				</tr>
   2755 			</tbody>
   2756 		</table>
   2757 		<p>
   2758 			The default reordering groups are defined by the FractionalUCA.txt
   2759 			file, based on the primary weights of associated collation elements.
   2760 			The file contains special mappings for the start of each group,
			script, and reorder-reserved range; see <i>Section 2.6.2, <a
   2762 				href="#File_Format_FractionalUCA_txt">FractionalUCA.txt</a></i>.
   2763 		</p>
   2764 
   2765 		<p>There are some special cases:</p>
   2766 		<ul>
   2767 			<li>The <strong>Hani</strong> group includes implicit weights
   2768 				for <em>Han characters</em> according to the UCA as well as any
   2769 				characters tailored relative to a Han character, or after <code>&amp;[first
   2770 					Hani]</code>.
   2771 			</li>
   2772 			<li>Implicit weights for <em>unassigned code points</em>
   2773 				according to the UCA reorder as the last weights in the <strong>others</strong>
   2774 				(<strong>Zzzz</strong>) group.<br> There is no script code to
   2775 				explicitly reorder the unassigned-implicit weights into a particular
   2776 				position. (Unassigned-implicit weights are used for non-Hani code
   2777 				points without any mappings. For a given Unicode version they are
   2778 				the code points with General_Category values Cn, Co, Cs.)
   2779 			</li>
   2780 			<li>The TRAILING group, the FIELD-SEPARATOR (associated with
   2781 				U+FFFE), and collation elements with only zero primary weights are
   2782 				not reordered.</li>
   2783 			<li>The TERMINATOR, LEVEL-SEPARATOR, and SPECIAL groups are
   2784 				never associated with characters.</li>
   2785 		</ul>
   2786 		<p>
   2787 			For example,
   2788 			<code>reorder="Hani Zzzz Grek"</code>
   2789 			sorts Hani, Latin, Cyrillic, ... (all other scripts) ..., unassigned,
   2790 			Greek, TRAILING.
   2791 		</p>
   2792 
   2793 		<p>Notes for implementations that write sort keys:</p>
   2794 		<ul>
   2795 			<li>Primaries must always be offset by one or more whole primary
   2796 				lead bytes. (Otherwise the number of bytes in a fractional weight
   2797 				may change, compressible scripts may span multiple lead bytes, or
   2798 				trailing primary bytes may collide with separators and
   2799 				primary-compression terminators.)</li>
   2800 			<li>When a script is reordered that does not start and end on
   2801 				whole-primary-lead-byte boundaries, then the lead byte needs to be
   2802 				split, and a reserved byte is used up. The data supports this via
   2803 				reorder-reserved ranges of primary weights that are not used for
   2804 				collation elements.</li>
   2805 			<li>Primary weights from different original lead bytes can be
   2806 				reordered to a shared lead byte, as long as they do not overlap.
   2807 				Primary compression ends when the target lead byte differs or when
   2808 				the original lead byte of the next primary is not compressible.</li>
   2809 			<li>Non-compressible groups and scripts begin or end on
   2810 				whole-primary-lead-byte boundaries (or both), so that reordering
   2811 				cannot surround a non-compressible script by two compressible ones
   2812 				within the same target lead byte. This is so that primary
   2813 				compression can be terminated reliably (choosing the low or high
   2814 				terminator byte) simply by comparing the previous and current
   2815 				primary weights. Otherwise it would have to also check for another
   2816 				condition (e.g., equal scripts).</li>
   2817 		</ul>
   2818 
   2819 		<h4>
   2820 			3.13.2 <a name="Reordering_Groups_allkeys"
   2821 				href="#Reordering_Groups_allkeys">Reordering Groups for
   2822 				allkeys.txt</a>
   2823 		</h4>
   2824 		<p>
   2825 			For allkeys_CLDR.txt, the start of each reordering group can be
   2826 			determined from FractionalUCA.txt, by finding the first real mapping
   2827 			(after xyz first primary) of that group (e.g.,
   2828 			<code>0060; [0D 07, 05, 05] # Zyyy Sk [0312.0020.0002] * GRAVE
   2829 				ACCENT</code>
   2830 			), and looking for that mapping's character sequence (
   2831 			<code>0060</code>
   2832 			) in allkeys_CLDR.txt. The comment in FractionalUCA.txt (
   2833 			<code>[0312.0020.0002]</code>
   2834 			) also shows the allkeys_CLDR.txt collation elements.
   2835 		</p>
   2836 
   2837 		<p>The DUCET ordering of some characters is slightly different
   2838 			from the CLDR root collation order. The reordering groups for the
   2839 			DUCET are not specified. The following describes how reordering
   2840 			groups for the DUCET can be derived.</p>
   2841 		<p>
   2842 			For allkeys_DUCET.txt, the start of each reordering group is normally
   2843 			the primary weight corresponding to the same character sequence as
   2844 			for allkeys_CLDR.txt. In a few cases this requires adjustment,
   2845 			especially for the special reordering groups, due to CLDR's ordering
   2846 			the common characters more strictly by category than the DUCET (as
   2847 			described in <i>Section 2, <a href="#Root_Collation">Root
   2848 					Collation</a></i>). The necessary adjustment would set the start of each
   2849 			allkeys_DUCET.txt reordering group to the primary weight of the first
   2850 			mapping for the relevant General_Category for a special reordering
   2851 			group (for characters that sort before "a"), or the primary weight of
   2852 			the first mapping for the first script (e.g., sc=Grek) of an
   2853 			alphabetic group (for characters that sort at or after "a").
   2854 		</p>
   2855 		<p>Note that the following only applies to primary weights greater
   2856 			than the one for U+FFFE and less than "trailing" weights.</p>
   2857 		<p>The special reordering groups correspond to General_Category
   2858 			values as follows:</p>
   2859 		<ul>
   2860 			<li>punct: P</li>
   2861 			<li>symbol: Sk, Sm, So</li>
   2862 			<li>space: Z, Cc</li>
   2863 			<li>currency: Sc</li>
   2864 			<li>digit: Nd</li>
   2865 		</ul>
   2866 		<p>In the DUCET, some characters that sort below "a" and have
   2867 			other General_Category values not mentioned above (e.g., gc=Lm) are
   2868 			also grouped with symbols. Variants of numbers (gc=No or Nl) can be
   2869 			found among punctuation, symbols, and digits.</p>
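		<p>The General_Category correspondence above can be sketched with
			Python's standard <code>unicodedata</code> module. The function
			name <code>special_group_for</code> is an assumption made for
			illustration, and the sketch ignores the DUCET exceptions just
			described (for example gc=Lm characters grouped with symbols).</p>

```python
import unicodedata


def special_group_for(ch):
    """Return the special reordering group suggested by gc, or None."""
    gc = unicodedata.category(ch)
    if gc.startswith("P"):
        return "punct"
    if gc in ("Sk", "Sm", "So"):
        return "symbol"
    if gc.startswith("Z") or gc == "Cc":
        return "space"
    if gc == "Sc":
        return "currency"
    if gc == "Nd":
        return "digit"
    return None  # not in a special group according to this simple rule


for sample in ["!", "+", " ", "$", "7", "a"]:
    print(repr(sample), special_group_for(sample))
```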
   2870 		<p>Each collation element of an expansion may be in a different
   2871 			reordering group, for example for parenthesized characters.</p>
   2872 
   2873 		<h3>
   2874 			3.14 <a name="Case_Parameters" href="#Case_Parameters">Case
   2875 				Parameters</a>
   2876 		</h3>
   2877 		<p>
   2878 			The <strong>case level</strong> is an <em>optional</em> intermediate
   2879 			level (&quot;2.5&quot;) between Level 2 and Level 3 (or after Level
   2880 			1, if there is no Level 2 due to strength settings). The case level
   2881 			is used to support two parametric features: ignoring non-case
   2882 			variants (Level 3 differences) except for case, and giving case
   2883 			differences a higher-level priority than other tertiary differences.
   2884 			Distinctions between small and large Kana characters are also
   2885 			included as case differences, to support Japanese collation.
   2886 		</p>
   2887 		<p>
   2888 			The <strong>case first</strong> parameter controls whether to swap
   2889 			the order of upper and lowercase. It can be used with or without the
   2890 			case level.
   2891 		</p>
   2892 		<p>
   2893 			Importantly, the case parameters have no effect in many instances.
   2894 			For example, they have no effect on the comparison of two
   2895 			non-ignorable characters with different primary weights, or with
   2896 			different secondary weights if the strength = <strong>secondary
   2897 				(or higher).</strong>
   2898 		</p>
   2899 		<p>
   2900 			When either the <strong>case level</strong> or <strong>case
   2901 				first</strong> parameter is set, the following describes the derivation of
   2902 			the modified collation elements. It assumes the original levels for
   2903 			the code point are [p.s.t] (primary, secondary, tertiary). This
   2904 			derivation may change in future versions of LDML, to track the case
   2905 			characteristics more closely.
   2906 		</p>
   2907 
   2908 		<h4>
   2909 			3.14.1 <a name="Case_Untailored" href="#Case_Untailored">Untailored
   2910 				Characters</a>
   2911 		</h4>
   2912 		<p>For untailored characters and strings, that is, for mappings in
   2913 			the root collation, the case value for each collation element is
   2914 			computed from the tertiary weight listed in allkeys_CLDR.txt. This is
   2915 			used to modify the collation element.</p>
   2916 		<p>Look up a case value for the tertiary weight x of each
   2917 			collation element:</p>
   2918 		<ol>
   2919 			<li>UPPER if x &isin; {08-0C, 0E, 11, 12, 1D}</li>
   2920 			<li>UNCASED otherwise</li>
   2921 			<li>FractionalUCA.txt encodes the case information in bits 6 and
   2922 				7 of the first byte in each tertiary weight. The case bits are set
   2923 				to 00 for UNCASED and LOWERCASE, and 10 for UPPER. There is no MIXED
   2924 				case value (01) in the root collation.</li>
   2925 		</ol>
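		<p>The lookup above can be sketched as follows. The names
			<code>case_value</code> and <code>case_bits</code> are assumptions
			made for illustration; the tertiary weights are those listed above,
			and the bit extraction assumes that bits 6 and 7 are the two most
			significant bits of the byte.</p>

```python
# Tertiary weights (hex) that indicate uppercase, as listed above.
UPPER_TERTIARY_WEIGHTS = set(range(0x08, 0x0D)) | {0x0E, 0x11, 0x12, 0x1D}


def case_value(tertiary):
    """Case value for one allkeys_CLDR.txt tertiary weight."""
    return "UPPER" if tertiary in UPPER_TERTIARY_WEIGHTS else "UNCASED"


def case_bits(first_tertiary_byte):
    """Extract bits 7 and 6 of a FractionalUCA.txt tertiary lead byte."""
    return (first_tertiary_byte >> 6) & 0b11


print(case_value(0x08), case_value(0x02))
```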
   2926 
   2927 		<h4>
   2928 			3.14.2 <a name="Case_Weights" href="#Case_Weights">Compute
   2929 				Modified Collation Elements</a>
   2930 		</h4>
   2931 		<p>
   2932 			From a computed case value, set a weight <strong>c</strong> according
   2933 			to the following.
   2934 		</p>
   2935 		<ol>
   2936 			<li>If <strong>CaseFirst=UpperFirst</strong>, set <strong>c</strong>
   2937 				= UPPER ? <strong>1</strong> : MIXED ? 2 : <strong>3</strong></li>
   2938 			<li>Otherwise set <strong>c</strong> = UPPER ? <strong>3</strong>
   2939 				: MIXED ? 2 : <strong>1</strong></li>
   2940 		</ol>
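		<p>The two rules above amount to a small lookup. A minimal sketch,
			in which the function name <code>case_weight</code> and the string
			constants for the case values are assumptions:</p>

```python
def case_weight(case_value, upper_first=False):
    """Return the case weight c for a case value.

    case_value is "UPPER", "MIXED", or anything else for uncased/lowercase.
    """
    if upper_first:
        return {"UPPER": 1, "MIXED": 2}.get(case_value, 3)
    return {"UPPER": 3, "MIXED": 2}.get(case_value, 1)


print(case_weight("UPPER"), case_weight("UPPER", upper_first=True))
# prints 3 1
```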
   2941 		<p>
   2942 			Compute a new collation element according to the following table. The
   2943 			notation <em>xt</em> means that the values are numerically combined
   2944 			into a single level, such that xt &lt; yu whenever x &lt; y. The
   2945 			fourth level (if it exists) is unaffected. Note that a secondary CE
   2946 			must have a secondary weight S which is greater than the secondary
   2947 			weight s of any primary CE; and a tertiary CE must have a tertiary
   2948 			weight T which is greater than the tertiary weight t of any primary
   2949 			or secondary CE ([<a
   2950 				href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] <a
   2951 				href="http://www.unicode.org/reports/tr10/#WF2">WF2</a>).
   2952 		</p>
   2953 
   2954 		<div align="center">
   2955 			<table>
   2956 				<tbody>
   2957 					<tr>
   2958 						<th>Case Level</th>
   2959 						<th>Strength</th>
   2960 						<th>Original CE</th>
   2961 						<th>Modified CE</th>
   2962 						<th>Comment</th>
   2963 					</tr>
   2964 					<tr>
   2965 						<td rowspan="5"><strong>on</strong></td>
   2966 						<td rowspan="2"><strong>primary</strong></td>
   2967 						<td><code>0.S.t</code></td>
   2968 						<td><code>0.0</code></td>
   2969 						<td rowspan="2">ignore case level weights of
   2970 							primary-ignorable CEs</td>
   2971 					</tr>
   2972 					<tr>
   2973 						<td><code>p.s.t</code></td>
   2974 						<td><code>p.c</code></td>
   2975 					</tr>
   2976 					<tr>
   2977 						<td rowspan="3"><strong>secondary<br>
   2978 						</strong>or higher</td>
   2979 						<td><code>0.0.T</code></td>
   2980 						<td><code>0.0.0.T</code></td>
   2981 						<td rowspan="3">ignore case level weights of
   2982 							secondary-ignorable CEs</td>
   2983 					</tr>
   2984 					<tr>
   2985 						<td><code>0.S.t</code></td>
   2986 						<td><code>0.S.c.t</code></td>
   2987 					</tr>
   2988 					<tr>
   2989 						<td><code>p.s.t</code></td>
   2990 						<td><code>p.s.c.t</code></td>
   2991 					</tr>
   2992 					<tr>
   2993 						<td rowspan="4"><strong>off</strong></td>
   2994 						<td rowspan="4">any</td>
   2995 						<td><code>0.0.0</code></td>
   2996 						<td><code>0.0.00</code></td>
   2997 						<td rowspan="4">ignore case level weights of
   2998 							tertiary-ignorable CEs</td>
   2999 					</tr>
   3000 					<tr>
   3001 						<td><code>0.0.T</code></td>
   3002 						<td><code> 0.0.3T </code></td>
   3003 					</tr>
   3004 					<tr>
   3005 						<td><code>0.S.t</code></td>
   3006 						<td><code>0.S.ct</code></td>
   3007 					</tr>
   3008 					<tr>
   3009 						<td><code>p.s.t</code></td>
   3010 						<td><code>p.s.ct</code></td>
   3011 					</tr>
   3012 				</tbody>
   3013 			</table>
   3014 		</div>
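		<p>The table can be sketched as a single function. This is an
			illustration under stated assumptions: the name
			<code>modified_ce</code> is hypothetical, a combined level such as
			<em>ct</em> is modeled as the pair (c, t), and with case level on
			and primary strength every primary-ignorable CE is mapped to 0.0,
			although the table only lists 0.S.t.</p>

```python
def modified_ce(p, s, t, c, case_level, strength):
    """Apply the table above to one collation element [p.s.t].

    c is the case weight already computed for this CE. strength is 1 for
    primary, 2 or more for secondary or higher. A combined level "xt" is
    modeled as the pair (x, t); tuples compare lexicographically, so
    (x, t) < (y, u) whenever x < y.
    """
    if case_level:
        if strength == 1:
            # Primary-ignorable CEs carry no case-level weight.
            return (p, c) if p != 0 else (0, 0)
        if p == 0 and s == 0:
            # Secondary-ignorable CEs carry no case-level weight.
            return (0, 0, 0, t)
        return (p, s, c, t)
    # Case level off: the case weight is combined into the tertiary level.
    if p == 0 and s == 0:
        if t == 0:
            return (0, 0, (0, 0))  # tertiary-ignorable CE
        return (0, 0, (3, t))      # tertiary CE: constant 3, not c
    return (p, s, (c, t))


print(modified_ce(0x20, 0x05, 0x08, 3, case_level=True, strength=3))
# prints (32, 5, 3, 8)
```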
   3015 
   3016 		<p>For primary+case, which is used for &quot;ignore accents but not
   3017 			case&quot; collation, primary ignorables are ignored so that a = &auml;. For
   3018 			secondary+case, which would by analogy mean &quot;ignore variants but not
   3019 			case&quot;, secondary ignorables are ignored for equivalent behavior.</p>
   3020 		<p>
   3021 			When using <strong>caseFirst</strong> but not <strong>caseLevel</strong>,
   3022 			the combined case+tertiary weight of a tertiary CE must be greater
   3023 			than the combined case+tertiary weight of any primary or secondary CE
   3024 			so that [<a href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>]
   3025 			<a href="http://www.unicode.org/reports/tr10/#WF2">well-formedness
   3026 				condition 2</a> is fulfilled. Since the tertiary CE's tertiary weight T
   3027 			is already greater than any t of primary or secondary CEs, it is
   3028 			sufficient to set its case weight to UPPER=3. It must not be affected
   3029 			by <strong>caseFirst=upper</strong>. (The table uses the constant 3
   3030 			in this case rather than the computed c.)
   3031 		</p>
   3032 		<p>
   3033 			The case weight of a tertiary-ignorable CE must be 0 so that [<a
   3034 				href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] <a
   3035 				href="http://www.unicode.org/reports/tr10/#WF1">well-formedness
   3036 				condition 1</a> is fulfilled.
   3037 		</p>
   3038 
   3039 		<h4>
   3040 			3.14.3 <a name="Case_Tailored" href="#Case_Tailored">Tailored
   3041 				Strings</a>
   3042 		</h4>
   3043 		<p>Characters and strings that are tailored have case values
   3044 			computed from their root collation case bits.</p>
   3045 
   3046 		<ol>
   3047 			<li>Look up the tailored string's root CEs. (Ignore any prefix
   3048 				or extension strings.) N=number of primary root CEs.</li>
   3049 			<li>Determine the number and type (primary vs. weaker) of CEs a
   3050 				tailored string maps to. M=number of primary tailored CEs.</li>
   3051 			<li>If N&lt;=M (no more root than tailoring primary CEs): Copy
   3052 				the root case bits for primary CEs 0..N-1.
   3053 				<ul>
   3054 					<li>If N&lt;M (fewer root primary CEs): Clear the case bits of
   3055 						the remaining tailored primary CEs. (uncased/lowercase/small Kana)</li>
   3056 				</ul>
   3057 			</li>
   3058 			<li>If N&gt;M (more root primary CEs): Copy the root case bits
   3059 				for primary CEs 0..M-2. Set the case bits for tailored primary CE
   3060 				M-1 according to the remaining root primary CEs M-1..N-1:
   3061 				<ul>
   3062 					<li>Set to uncased/lower if all remaining root primary CEs
   3063 						have uncased/lower.</li>
   3064 					<li>Set to uppercase if all remaining root primary CEs have
   3065 						uppercase.</li>
   3066 					<li>Otherwise, set to mixed.</li>
   3067 				</ul>
   3068 			</li>
   3069 			<li>Clear the case bits for secondary CEs 0.s.t.</li>
   3070 			<li>Tertiary CEs 0.0.t must get uppercase bits.</li>
   3071 			<li>Tertiary-ignorable CEs 0.0.0 must get
   3072 				ignorable-case=lowercase bits.</li>
   3073 		</ol>
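		<p>The steps above can be sketched as follows. The function name
			<code>tailored_case_values</code>, the string constants, and the
			representation of CEs by their level only are assumptions made for
			illustration. Uncased, lowercase, and cleared case bits are all
			represented by <code>LOWER</code>.</p>

```python
LOWER, MIXED, UPPER = "lower", "mixed", "upper"


def tailored_case_values(root_primary_cases, tailored_levels):
    """Assign a case value to each tailored CE.

    root_primary_cases: case value of each primary root CE (length N).
    tailored_levels: level of each tailored CE, one of "primary",
    "secondary", "tertiary", or "ignorable".
    """
    n = len(root_primary_cases)
    m = tailored_levels.count("primary")
    result = []
    i = 0  # index among the tailored primary CEs
    for level in tailored_levels:
        if level == "primary":
            if n > m and i == m - 1:
                # Last tailored primary CE summarizes root CEs M-1..N-1.
                remaining = set(root_primary_cases[i:])
                if remaining == {UPPER}:
                    case = UPPER
                elif remaining == {LOWER}:
                    case = LOWER
                else:
                    case = MIXED
            elif i < n:
                case = root_primary_cases[i]  # copy the root case bits
            else:
                case = LOWER  # more tailored than root primary CEs
            i += 1
        elif level == "tertiary":
            case = UPPER  # tertiary CEs 0.0.t get uppercase bits
        else:
            case = LOWER  # secondary CEs and tertiary-ignorable CEs
        result.append(case)
    return result


print(tailored_case_values([UPPER, LOWER], ["primary"]))
# prints ['mixed']
```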
   3074 		<p class="note">Note: Almost all Cased characters have primary
   3075 			(non-ignorable) root collation CEs, except for U+0345 Combining
   3076 			Ypogegrammeni which is Lowercase. All Uppercase characters have
   3077 			primary root collation CEs.</p>
   3078 
   3079 
   3080 		<h3>
   3081 			3.15 <a name="Visibility" href="#Visibility">Visibility</a>
   3082 		</h3>
   3083 		<p>
   3084 			Collations have external visibility by default, meaning that they can
   3085 			be displayed in a list of collation options for users to choose from.
   3086 			A collation whose type name starts with "private-" is internal and
   3087 			should not be shown in such a list. Collations are typically internal
   3088 			when they are partial sequences included in other collations. See <i>Section
   3089 				3.1, <a href="#Collation_Types">Collation Types</a>
   3090 			</i>.
   3091 		</p>
   3092 
   3093 		<h3>
   3094 			3.16 <a name="Collation_Indexes" href="#Collation_Indexes">Collation
   3095 				Indexes</a>
   3096 		</h3>
   3097 		<h4>
   3098 			3.16.1 <a name="Index_Characters" href="#Index_Characters">Index
   3099 				Characters</a>
   3100 		</h4>
   3101 		<p>
   3102 			The main data includes &lt;exemplarCharacters&gt; for collation
   3103 			indexes. See <i>Part 2 General, Section 3, <a
   3104 				href="tr35-general.html#Character_Elements">Character Elements</a></i>,
   3105 			for general information about exemplar characters.
   3106 		</p>
   3107 		<p>The index characters are a set of characters for use as a UI
   3108 			"index", that is, a list of clickable characters (or character
   3109 			sequences) that allow the user to see a segment of a larger "target"
   3110 			list. Each character corresponds to a bucket in the target list.
   3111 			There are two kinds of index lists: one in which the set of
   3112 			buckets is relatively static, and one that produces
   3113 			roughly equally-sized buckets. While CLDR is mostly focused on the
   3114 			first, there is provision for supporting the second as well.</p>
   3115 		<p>The index characters need to be used in conjunction with a
   3116 			collation for the locale, which will determine the order of the
   3117 			characters. It will also determine which index characters show up.</p>
   3118 		<p>The static list would be presented as something like the
   3119 			following (either vertically or horizontally):</p>
   3120 		<p align="center">A B C D E F G H CH I J K L M N O P Q R S T U V
   3121 			W X Y Z</p>
   3122 		<p>In the "A" bucket, you would find all items that are primary
   3123 			greater than or equal to "A" in collation order, and primary less
   3124 			than "B". The use of the list requires that the target list be sorted
   3125 			according to the locale that is used to create that list. Although we
   3126 			say "character" above, the index character could be a sequence, like
   3127 			"CH" above. The index exemplar characters must always be used with a
   3128 			collation appropriate for the locale. Any characters that do not have
   3129 			primary differences from others in the set should be removed.</p>
   3130 		<p>Details:</p>
   3131 		<ol>
   3132 			<li>The primary weight (according to the collation) is used to
   3133 				determine which bucket a string is in. There are special buckets for
   3134 				before the first character, between buckets of different scripts,
   3135 				and after the last bucket (and of a different script).</li>
   3136 			<li>Characters in the <em>index characters</em> do not need to
   3137 				have distinct primary weights. That is, the <em>index
   3138 					characters</em> are adapted to the underlying collation: normally &#x451; is
   3139 				in the &#x435; bucket for Russian, but if someone used a variant of
   3140 				Russian collation that distinguished them on a primary level, then &#x451;
   3141 				would show up as its own bucket.
   3142 			</li>
   3143 			<li>If an <em>index character</em> string ends with a single "*"
   3144 				(U+002A), for example "Sch*" and "St*" in German, then there will be
   3145 				a separate bucket for the string minus the "*", for example "Sch"
   3146 				and "St", even if that string does not sort distinctly.
   3147 			</li>
   3148 			<li>An <em>index character</em> can have multiple primary
   3149 				weights, for example "&AElig;" and "Sch". Names that have the same initial
   3150 				primary weights sort into this <em>index character</em>'s bucket.
   3151 				This can be achieved by using an upper-boundary string that is the
   3152 				concatenation of the <em>index character</em> and U+FFFF, for
   3153 				example "&AElig;\uFFFF" and "Sch\uFFFF". Names that sort greater than this
   3154 				upper boundary but less than the next index character are redirected
   3155 				to the last preceding single-primary index character (A and S for
   3156 				the examples here).
   3157 			</li>
   3158 		</ol>
   3159 		<p>
   3160 			For example, for index characters
   3161 			<code>[A &AElig; B R S {Sch*} {St*} T]</code>
   3162 			the following sample names are sorted into an index as shown.
   3163 		</p>
   3164 		<ul>
   3165 			<li>A &mdash; Adelbert, Afrika</li>
   3166 			<li>&AElig; &mdash; &AElig;sculap, Aesthet</li>
   3167 			<li>B &mdash; Berlin</li>
   3168 			<li>R &mdash; Rilke</li>
   3169 			<li>S &mdash; Sacher, Seiler, Sultan</li>
   3170 			<li>Sch &mdash; Schiller</li>
   3171 			<li>St &mdash; Steiff</li>
   3172 			<li>T &mdash; Thomas</li>
   3173 		</ul>
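		<p>The bucketing with upper-boundary strings can be sketched as
			follows. This is a toy illustration only: <code>primary_key</code>
			stands in for a real collator's primary sort key (here simply
			case-folded code point order, which is an assumption and not a real
			collation), the index characters are assumed to be given in
			collation order, and the function names are hypothetical.</p>

```python
import bisect


def primary_key(s):
    # Toy stand-in for a primary-strength sort key (assumption).
    return s.casefold()


def build_boundaries(index_chars):
    """Return sorted (key, label) pairs, including upper boundaries."""
    boundaries = []
    last_single = None
    for entry in index_chars:
        label = entry.rstrip("*")
        boundaries.append((primary_key(label), label))
        if len(label) == 1:
            last_single = label
        else:
            # Names after label + U+FFFF go back to the last
            # preceding single-primary index character.
            boundaries.append((primary_key(label + "\uFFFF"), last_single))
    return boundaries


def bucket_for(name, boundaries):
    keys = [key for key, _ in boundaries]
    i = bisect.bisect_right(keys, primary_key(name)) - 1
    # "…" is the special bucket before the first index character.
    return boundaries[i][1] if i >= 0 else "\u2026"


boundaries = build_boundaries(["A", "B", "R", "S", "Sch*", "St*", "T"])
for name in ["Sacher", "Schiller", "Seiler", "Steiff", "Sultan", "Thomas"]:
    print(bucket_for(name, boundaries), name)
```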
   3174 		<p>
   3175 			The &hellip; items are special: each is a bucket for everything else, either
   3176 			less or greater. They are inserted at the start and end of the index
   3177 			list, <em>and</em> on script boundaries. Each script has its own
   3178 			range, except where scripts sort primary-equal (e.g., Hira &amp;
   3179 			Kana). All characters that sort in one of the low reordering groups
   3180 			(whitespace, punctuation, symbols, currency symbols, digits) are
   3181 			treated as a single script for this purpose.
   3182 		</p>
   3183 		<p>If you tailor a Greek character into the Cyrillic script, that
   3184 			Greek character will be bucketed (and sorted) among the Cyrillic
   3185 			ones.</p>
   3186 
   3187 		<p>
   3188 			Even in an implementation that reorders groups of scripts rather than
   3189 			single scripts, for example Hebrew together with Phoenician and
   3190 			Samaritan, the index boundaries are really script boundaries, <em>not</em>
   3191 			multi-script-group boundaries. So if you had a collation that
   3192 			reordered Hebrew after Ethiopic, you would still get index boundaries
   3193 			between the following (and in that order):
   3194 		</p>
   3195 		<ol>
   3196 			<li>Ethiopic</li>
   3197 			<li>Hebrew</li>
   3198 			<li>Phoenician <em>// included in the Hebrew reordering
   3199 					group</em></li>
   3200 			<li>Samaritan <em>// included in the Hebrew reordering
   3201 					group</em></li>
   3202 			<li>Devanagari</li>
   3203 		</ol>
   3204 		<p>(Beginning with CLDR 27, single scripts can be reordered.)</p>
   3205 		<p>In the UI, an index character could also be omitted or grayed
   3206 			out if its bucket is empty. For example, if there is nothing in the
   3207 			bucket for Q, then Q could be omitted. That would be up to the
   3208 			implementation. Additional buckets could be added if other characters
   3209 			are present. For example, we might see something like the following:</p>
   3210 		<table border="1" cellspacing="0">
   3211 			<tbody>
   3212 				<tr align="center">
   3213 					<td><div align="center">
   3214 							<strong>Sample Greek Index<br>
   3215 							</strong>
   3216 						</div></td>
   3217 					<td><strong>Contents<br>
   3218 					</strong></td>
   3219 				</tr>
   3220 				<tr align="center">
   3221 					<td><div align="center">Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο
   3222 							Π Ρ Σ Τ Υ Φ Χ Ψ Ω</div></td>
   3223 					<td>With only content beginning with Greek letters<br>
   3224 					</td>
   3225 				</tr>
   3226 				<tr align="center">
   3227 					<td><div align="center">&hellip; Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο
   3228 							Π Ρ Σ Τ Υ Φ Χ Ψ Ω &hellip;</div></td>
   3229 					<td>With some content before or after</td>
   3230 				</tr>
   3231 				<tr align="center">
   3232 					<td><div align="center">&hellip; 9 Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο
   3233 							Π Ρ Σ Τ Υ Φ Χ Ψ Ω &hellip;</div></td>
   3234 					<td>With numbers, and nothing between 9 and Alpha</td>
   3235 				</tr>
   3236 				<tr align="center">
   3237 					<td><div align="center">
   3238 							&hellip; 9 <em>A-Z</em> &hellip; Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω
   3239 							&hellip;
   3240 						</div></td>
   3241 					<td>With numbers, some Latin</td>
   3242 				</tr>
   3243 			</tbody>
   3244 		</table>
   3245 		<p>Here is a sample of the XML structure:</p>
   3246 		<pre>&lt;exemplarCharacters type=&quot;index&quot;&gt;[A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]&lt;/exemplarCharacters&gt;</pre>
   3247 		<p>
   3248 			The display of the index characters can be modified with the Index
   3249 			labels elements, discussed in the <i>Part 2 General, Section 3.3,
   3250 				<a href="tr35-general.html#IndexLabels">Index Labels</a>
   3251 			</i>.
   3252 		</p>
   3253 
   3254 		<h4>
   3255 			3.16.2 <a name="CJK_Index_Markers" href="#CJK_Index_Markers">CJK
   3256 				Index Markers</a>
   3257 		</h4>
   3258 		<p>Special index markers have been added to the CJK collations for
   3259 			stroke, pinyin, zhuyin, and unihan. These markers allow for effective
   3260 			and robust use of indexes for these collations.</p>
   3261 		<p>The per-language index exemplar characters are not useful for
   3262 			collation indexes for CJK because for each such language there are
   3263 			multiple sort orders in use (for example, Chinese pinyin vs. stroke
   3264 			vs. unihan vs. zhuyin), and these sort orders use very different
   3265 			index characters. In addition, sometimes the boundary strings are
   3266 			different from the bucket label strings. For collations that contain
   3267 			index markers, the boundary strings and bucket labels should be
   3268 			derived from those index markers, ignoring the index exemplar
   3269 			characters.</p>
   3270 		<p>For example, near the start of the pinyin tailoring there is
   3271 			the following:</p>
   3272 		<p>
   3273 			&lt;p&gt; A&lt;/p&gt;&lt;!-- INDEX A --&gt;<br>
   3274 			&lt;pc&gt;&lt;/pc&gt;&lt;!--  --&gt;
   3275 		</p>
   3276 		<p>&hellip;</p>
   3277 		<p>
   3278 			&lt;pc&gt;&lt;/pc&gt;&lt;!-- ao --&gt;<br> &lt;p&gt;
   3279 			B&lt;/p&gt;&lt;!-- INDEX B --&gt;
   3280 		</p>
   3281 		<p>These indicate the boundaries of &quot;buckets&quot; that can
   3282 			be used for indexing. They are always two characters starting with
   3283 			the noncharacter U+FDD0, and thus will not occur in normal text. For
   3284 			pinyin the second character is A-Z; for unihan it is one of the
   3285 			radicals; and for stroke it is a character after U+2800 indicating
   3286 			the number of strokes, such as U+2805 for five strokes. For zhuyin the second character is
   3287 			one of the standard Bopomofo characters in the range U+3105 through
   3288 			U+3129.</p>
   3289 
   3290 		<p>The corresponding bucket label strings are the boundary strings
   3291 			with the leading U+FDD0 removed. For example, the Pinyin boundary
   3292 			string "\uFDD0A" yields the label string "A".</p>
   3293 
   3294 		<p>However, for stroke order, the label string is the stroke count
   3295 			(second character minus U+2800) as a decimal-digit number followed by
   3296 			&#x5283; (U+5283). For example, the stroke order boundary string
   3297 			"\uFDD0\u2805" yields the label string "5&#x5283;".</p>
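		<p>Deriving a bucket label from a boundary string can be sketched
			as follows. The function name <code>bucket_label</code> and the
			<code>collation_type</code> parameter are assumptions made for
			illustration; only the U+FDD0 prefix, the U+2800 offset, and the
			U+5283 suffix come from the description above.</p>

```python
INDEX_MARKER_PREFIX = "\uFDD0"


def bucket_label(boundary, collation_type):
    """Derive the bucket label from a two-character index marker."""
    if len(boundary) != 2 or boundary[0] != INDEX_MARKER_PREFIX:
        raise ValueError("not an index marker boundary string")
    second = boundary[1]
    if collation_type == "stroke":
        # Stroke count is the offset from U+2800, followed by U+5283.
        stroke_count = ord(second) - 0x2800
        return f"{stroke_count}\u5283"
    # Other collations: the label is the boundary without U+FDD0.
    return second


print(bucket_label("\uFDD0A", "pinyin"))
print(bucket_label("\uFDD0\u2805", "stroke"))
```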
   3298 
   3299 		<hr>
   3300 		<p class="copyright">
   3301 			Copyright &copy; 2001&ndash;2018 Unicode, Inc. All
   3302 			Rights Reserved. The Unicode Consortium makes no expressed or implied
   3303 			warranty of any kind, and assumes no liability for errors or
   3304 			omissions. No liability is assumed for incidental and consequential
   3305 			damages in connection with or arising out of the use of the
   3306 			information or programs contained or accompanying this technical
   3307 			report. The Unicode <a href="http://unicode.org/copyright.html">Terms
   3308 				of Use</a> apply.
   3309 		</p>
   3310 		<p class="copyright">Unicode and the Unicode logo are trademarks
   3311 			of Unicode, Inc., and are registered in some jurisdictions.</p>
   3312 	</div>
   3313 
   3314 </body>
   3315 
   3316 </html>
   3317