*   Copyright (C) 2004-2010, International Business Machines
*   Corporation and others. All Rights Reserved.
*
*   file name: changes.txt
*   encoding: US-ASCII
*   tab size: 8 (not used)
*   indentation:4
*
*   created on: 2004may06
*   created by: Markus W. Scherer
*
*   change log for Unicode updates

---------------------------------------------------------------------------- ***

Unicode 6.0 update

*** related ICU Trac tickets

7264 Unicode 6.0 Update

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
- This now prepares both unidata and testdata files in respective output subfolders.

* PropertyAliases.txt changes
- new Script_Extensions property defined in the new ScriptExtensions.txt file
  but not listed in PropertyAliases.txt; reported to unicode.org;
  -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
        scx; Script_Extensions
  -> uchar.h with new UProperty section
  -> com.ibm.icu.lang.UProperty, parallel with uchar.h

* PropertyValueAliases.txt changes
- 12 new block names:
    Alchemical_Symbols
    Bamum_Supplement
    Batak
    Brahmi
    CJK_Unified_Ideographs_Extension_D
    Emoticons
    Ethiopic_Extended_A
    Kana_Supplement
    Mandaic
    Miscellaneous_Symbols_And_Pictographs
    Playing_Cards
    Transport_And_Map_Symbols
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock
        Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
                replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- Joining_Group (jg) values:
    Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
  -> uchar.h & UCharacter.JoiningGroup
- 3 new scripts:
    sc ; Batk ; Batak
    sc ; Brah ; Brahmi
    sc ; Mand ; Mandaic
  -> remove these from SyntheticPropertyValueAliases.txt
  -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2009-11-11..2010-07-18)
    Bass 259 Bassa Vah
    Dupl 755 Duployan shorthand
    Elba 226 Elbasan
    Gran 343 Grantha
    Kpel 436 Kpelle
    Loma 437 Loma
    Mend 438 Mende
    Merc 101 Meroitic Cursive
    Narb 106 Old North Arabian
    Nbat 159 Nabataean
    Palm 126 Palmyrene
    Sind 318 Sindhi
    Wara 262 Warang Citi
  -> uscript.h
  -> com.ibm.icu.lang.UScript
        find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
        replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- ISO 15924 name change
    Mero 100 Meroitic Hieroglyphs (was Meroitic)
  -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt

* UnicodeData.txt changes
- new CJK block:
    2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
    2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
  -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion

* build Unicode tools using CMake+make

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- rebuild ICU & tools

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0: U+2260, U+226E, U+226F

* generate core properties data files
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
- run makeuca.sh so that genuca picks up the new nfc.nrm:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* implement new Script_Extensions property (provisional)
- parser & generator: genprops & uprops.icu
- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java

* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
- (one-time change)
- genbidi/gencase/genprops tools changes
- re-run makeprops.sh (see above)
- UCharacterProperty.java, UCharacterTypeIterator.java,
  UBiDiProps.java, UCaseProps.java, and several others with minor changes;
  UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt45l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
- refresh ICU4J
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* un-hardcode normalization skippable (NF*_Inert) test data
- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools

* copy updated break iterator test files
- now handled by early ucdcopy.py and
  copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
  (old instructions:
   copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
   to ~/svn.icu/trunk/src/source/test/testdata)
- they are not used in ICU4J

* UCA

- get output from Mark's tools; look in
    http://www.unicode.org/~book/incoming/mark/uca6.0.0/
    http://www.macchiato.com/unicode/utc/additional-uca-files
    http://www.unicode.org/Public/UCA/6.0.0/
    http://www.unicode.org/~mdavis/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
- update Han-implicit ranges for new CJK extensions:
  swapCJK() in ucol.cpp & ImplicitCEGenerator.java
- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
  do not add it into invuca so that tailoring primary-after an ignorable works
- genuca: permit space between [variable top] bytes
- ucol.cpp: treat noncharacters like unassigned rather than ignorable
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
  or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

---------------------------------------------------------------------------- ***

Unicode 5.2 update

*** related ICU Trac tickets

7084 Unicode 5.2

7167 verify collation bytes
7235 Java test NAME_ALIAS
7236 Java DerivedCoreProperties.txt test
7237 Java BidiTest.txt
7238 UTrie2 in core unidata
7239 test for tailoring gaps
7240 Java fix CollationMiscTest
7243 update layout engine for Unicode 5.2

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation

python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
- includes finding files regardless of version numbers,
  copying them, and performing the equivalent processing of the
  ucdstrip and ucdmerge tools on the desired set of files

* notes on changes
- PropertyAliases.txt
    moved from numeric to enumerated:
        ccc ; Canonical_Combining_Class
    new string properties:
        NFKC_CF ; NFKC_Casefold
        Name_Alias; Name_Alias
    new binary properties:
        Cased ; Cased
        CI ; Case_Ignorable
        CWCF ; Changes_When_Casefolded
        CWCM ; Changes_When_Casemapped
        CWKCF ; Changes_When_NFKC_Casefolded
        CWL ; Changes_When_Lowercased
        CWT ; Changes_When_Titlecased
        CWU ; Changes_When_Uppercased
    new CJK Unihan properties (not supported by ICU)
- PropertyValueAliases.txt
    new block names
    new scripts
    one script code change:
        sc ; Qaai ; Inherited
        ->
        sc ; Zinh ; Inherited ; Qaai
    new Line_Break (lb) value:
        lb ; CP ; Close_Parenthesis
    new Joining_Group (jg) values: Farsi_Yeh, Nya
    other new values:
        ccc; 214; ATA ; Attached_Above
- DerivedBidiClass.txt
    new default-R range: U+1E800 - U+1EFFF
- UnicodeData.txt
    all of the ISO comments are gone
    new CJK block end:
        9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
    new CJK block:
        2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
        2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\trunk\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
  + preparse.pl complains with errors like the following:
        Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
    This is because ICU 4.0 had scripts from ISO 15924 which are now
    added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
  + preparse.pl complains with errors about block names missing from uchar.h; add them

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 26 new blocks
    copy new blocks from Blocks.txt
    MS VC++ 2008 regular expression:
        find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
        replace with " UBLOCK_\3 = 172, /*[\1]*/"
  + several new script values already added in ICU 4.0 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
  + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
  + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
    (added to SyntheticPropertyValueAliases.txt)
- new Joining Group (JG) values: Farsi_Yeh, Nya
- new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FC3 to 9FCB
  search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
  + do change gennames.c

* Compare definitions of new binary properties with what we used to use
  in algorithms, to see if the definitions changed.
- Verified that definitions for Cased and Case_Ignorable are unchanged.
  The gencase tool now parses the newly public Case_Ignorable values
  in case the definition changes in the future.

* uchar.c & uprops.h & uprops.c & genprops
- new numeric values that didn't exist in Unicode data before:
  1/7, 1/9, 1/10, 3/10, 1/16, 3/16
  the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
  therefore redesign the encoding of numeric types and values for formatVersion 6;
  design for simple numbers up to at least 144 ("one gross"),
  large values up to at least 10^20,
  and fractions with numerators -1..17 and denominators 1..16
  to cover current and expected future values
  (e.g., more Han numeric values, Meroitic twelfths)

* reimplement Hangul_Syllable_Type for new Jamo characters
- the old code assumed that all Jamo characters are in the 11xx block
- Unicode 5.2 fills holes there and adds new Jamo characters in
    A960..A97F; Hangul Jamo Extended-A
  and in
    D7B0..D7FF; Hangul Jamo Extended-B
- Hangul_Syllable_Type can be trivially derived from a subset of
  Grapheme_Cluster_Break values

* build Unicode data source code for hardcoding core data
    C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data

    ICU data make path is \svn\icuproj\icu\trunk\source\data\
    ICU root path is \svn\icuproj\icu\trunk
    Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
    Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
    Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
    Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
    Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
    Creating data file for Unicode Property Names
    Creating data file for Unicode Character Properties
    Creating data file for Unicode Case Mapping Properties
    Creating data file for Unicode BiDi/Shaping Properties
    Creating data file for Unicode Normalization
    Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
    Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
  and rebuild the common library

*** UCA

- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
  [ Begin obsolete instructions:
    Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
    - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
      on Windows:
        python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
        python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
    End obsolete instructions]
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
  not just the *_STUB.txt files
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

*** Implement Cased & Case_Ignorable properties
- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
- Problem: These properties should be disjoint, but aren't
- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
- change ucase.icu to be able to store any combination of Cased and Case_Ignorable

*** Implement Changes_When_Xyz properties
- without stored data

*** Implement Name_Alias property
- add it as another name field in unames.icu
- make it available via u_charName() and UCharNameChoice
- consider it in u_charFromName()

*** Break iterators

* Update break iterator rules to new UAX versions and new property values
* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary

*** new BidiTest file
- review format and data
- copy BidiTest.txt to source/test/testdata
- write test code using this data
- fix ICU code where it fails the conformance test

*** Java
- generally, find and update code corresponding to C/C++
- UCharacter.UnicodeBlock constants:
  a) add an _ID integer per new block, update COUNT
  b) add a class instance per new block
     Visual Studio regex:
        find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
        replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()

- port test changes to Java

*** LayoutEngine script information

(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

  -> Eric Mader wrote in email on 20090930:
     "I think the tool has been modified to update @draft to @stable for
     older scripts and to add @draft for new scripts.
     (I worked with an intern on this last year.)
     You should check the output after you run it."

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

  Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
  and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

  -> Eric Mader wrote in email on 20090930:
     "This is just a matter of making sure that all the per-script tables have
     entries for any new scripts that were added.
     If any new Indic characters were added, then the class tables in
     IndicClassTables.cpp should be updated to reflect this.
     John Emmons should know how to do this if it's required."

* rebuild the layout and layoutex libraries.
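
The "fix mixed line endings" step above could be scripted rather than done by hand.
The following is only a minimal sketch, not part of the ICU tools: the function names
normalize_line_endings and has_mixed_line_endings are hypothetical, and the choice of
LF as the default target is an assumption; pass whichever convention the repository expects.

```python
def normalize_line_endings(data: bytes, newline: bytes = b"\n") -> bytes:
    """Return data with every CRLF, lone CR, and LF replaced by `newline`.

    Hypothetical helper for illustration; LF as the default is an assumption.
    """
    # Collapse CRLF first so that it is not treated as two separate line breaks.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified if newline == b"\n" else unified.replace(b"\n", newline)


def has_mixed_line_endings(data: bytes) -> bool:
    """Report whether data uses more than one line-ending style."""
    crlf = data.count(b"\r\n")
    lone_cr = data.count(b"\r") - crlf   # CRs that are not part of a CRLF
    lone_lf = data.count(b"\n") - crlf   # LFs that are not part of a CRLF
    return sum(1 for n in (crlf, lone_cr, lone_lf) if n > 0) > 1
```

A caller would read each generated file in binary mode, check it with
has_mixed_line_endings(), and write back the result of normalize_line_endings()
only when the check reports a mix, so that untouched files do not show up in the diff.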

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.1 update

*** related ICU Trac tickets

5696 Update to Unicode 5.1

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.1.0\ucd\Blocks.txt ..\unidata\
copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.1.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\uni51\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
  + preparse.pl complains with errors like the following:
        Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
    This is because ICU 3.8 had scripts from ISO 15924 which are now
    added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
  + PropertyValueAliases.txt now explicitly contains values for boolean properties:
    N/Y, No/Yes, F/T, False/True
    -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
       It will use further values from the file if present.

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 17 new blocks
  + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt)
  + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
    (added to SyntheticPropertyValueAliases.txt)
- uprops.icu (uprops.h) only provides 7 bits for script codes.
  In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
  No script code above 127 is assigned to any Unicode character yet,
  so ICU 4.0 uprops.icu does not store any
  script code values greater than 127.
  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
  in a parallel bit field, and that overflows now.
  Also, future values >=128 would be incompatible anyway.
  uprops.h is modified to move around several of the bit fields
  in the properties vector words, and now uses 8 bits for the script code.
  Two other bit fields also grow to accommodate future growth:
  Block (current count: 172) grows from 8 to 9 bits,
  and Word_Break grows from 4 to 5 bits.
- renamed property Simple_Case_Folding (sfc->scf)
  + nothing to be done: handled as normal alias
- new property JSN Jamo_Short_Name
  + no new API: only contributes to the Name property
- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
- new Joining Group (JG) value: Burushashki_Yeh_Barree
- new Sentence_Break (SB) values:
    SB ; CR ; CR
    SB ; EX ; Extend
    SB ; LF ; LF
    SB ; SC ; SContinue
- new Word_Break (WB) values:
    WB ; CR ; CR
    WB ; Extend ; Extend
    WB ; LF ; LF
    WB ; MB ; MidNumLet

* Further changes in the 2008-02-29 update:
- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
  because they should not normally be invisible.
- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
- new Word_Break (WB) value: NL=Newline

* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
- Unihan range end moves from 9FBB to 9FC3
  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
  + do change gennames.c

* build Unicode data source code for hardcoding core data
    C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

    ICU data make path is \svn\icuproj\icu\uni51\source\data\
    ICU root path is \svn\icuproj\icu\uni51
    Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
    Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
    Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
    Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
    Creating data file for Unicode Character Properties
    Creating data file for Unicode Case Mapping Properties
    Creating data file for Unicode BiDi/Shaping Properties
    Creating data file for Unicode Normalization
    Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
    Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
  and rebuild the common library

*** Break iterators

* Update break iterator rules to new UAX versions and new property values

*** UCA

* update FractionalUCA.txt and UCARules.txt with new canonical closure

*** Test suites
- Test that APIs using Unicode property value aliases (like UnicodeSet)
  support all of the boolean values N/Y, No/Yes, F/T, False/True
  -> TestBinaryValues() tests in both cintltst and intltest

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

  Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
  and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.
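
The bit-field arithmetic in the uprops.h notes earlier in this Unicode 5.1 section
(7 vs. 8 bits for script codes, 8 vs. 9 bits for Block, 4 vs. 5 bits for Word_Break)
can be checked with a few lines of Python. This is only an illustrative sketch: the
helper names bits_needed and max_storable are hypothetical, and the constants are
copied from the notes, not read from ICU headers.

```python
def bits_needed(max_value: int) -> int:
    """Number of bits needed to store unsigned values 0..max_value."""
    return max(1, max_value.bit_length())


def max_storable(bits: int) -> int:
    """Largest unsigned value that fits into a field of `bits` bits."""
    return (1 << bits) - 1


# Values quoted in the notes (assumed here, not taken from uscript.h/uchar.h).
USCRIPT_CODE_LIMIT = 130
BLOCK_COUNT = 172

if __name__ == "__main__":
    max_script = USCRIPT_CODE_LIMIT - 1
    # 7 bits hold at most 127, so the maximum script value 129 overflows.
    print("script: max value", max_script, "needs", bits_needed(max_script), "bits;",
          "7 bits hold up to", max_storable(7))
    # The highest block value still fits into 8 bits; 9 bits leave room to grow.
    print("block: max value", BLOCK_COUNT - 1, "needs", bits_needed(BLOCK_COUNT - 1), "bits;",
          "9 bits hold up to", max_storable(9))
    # Word_Break grows from 16 to 32 representable values.
    print("Word_Break: 4 bits hold", max_storable(4) + 1, "values;",
          "5 bits hold", max_storable(5) + 1, "values")
```

The script field is the only one of the three that had actually overflowed; the Block
and Word_Break fields are widened ahead of need, as the notes say.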

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084 RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
    C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

    ICU data make path is \cvs\oss\icu\source\data\
    ICU root path is \cvs\oss\icu
    Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
    [etc.]
    Creating data file for Unicode Character Properties
    Creating data file for Unicode Case Mapping Properties
    Creating data file for Unicode BiDi/Shaping Properties
    Creating data file for Unicode Normalization
    Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
    Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

  Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
  and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332    RFE: Update to Unicode 4.1
4157    RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
    + Pattern_Syntax
    + Pattern_White_Space
- new enumerated properties
    + Grapheme_Cluster_Break
    + Sentence_Break
    + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
    + do not modify BOCU/BOCSU code because that would change the encoding
      and break binary compatibility!
    + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
      NamePrepProfile.txt
    + ignore trietest.c: test data is arbitrary
    + ignore tstnorm.cpp: test optimization, not important
    + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
    + do change line_th.txt and word_th.txt
      by replacing hardcoded ranges with the new property values
    + do change gennames.c

source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170    RFE: Update to Unicode 4.0.1
3171    Add new Unicode 4.0.1 properties
3520    use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
    + put short name comment only on line with new constant
      for genpname perl script parser
- new binary properties
    + STerm
    + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
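
Illustration for the gennorm.c item above: a minimal Python sketch (not the gennorm.c code) of a
parser that accepts both spellings of a DerivedNormalizationProps.txt data line and normalizes
them to the newer form. The function name parse_norm_prop_line is hypothetical. Only the two
changes listed above come from this log; the exact field layout (semicolon-separated fields,
"#" comments, "first..last" ranges) and the other old spellings such as "NFC_MAYBE" are assumptions.

```python
def parse_norm_prop_line(line):
    """Parse one data line; return (first, last, property, value) or None.

    Hypothetical helper. Old-style names are rewritten to the newer style:
      "FNC"        -> "FC_NFKC"
      "NFD_NO"     -> ("NFD_QC", "N")
      "NFC_MAYBE"  -> ("NFC_QC", "M")   (assumed to follow the same pattern)
    """
    data = line.split("#", 1)[0].strip()      # drop trailing comment
    if not data:
        return None                           # blank or comment-only line
    fields = [f.strip() for f in data.split(";")]
    code_range, name = fields[0], fields[1]
    if ".." in code_range:
        first, last = code_range.split("..")
    else:
        first = last = code_range
    start, end = int(first, 16), int(last, 16)

    if name == "FNC":
        name = "FC_NFKC"
    if len(fields) == 2 and name.endswith(("_NO", "_MAYBE")):
        # Old single-field form: split into property name and value.
        form, _, answer = name.rpartition("_")
        return start, end, form + "_QC", {"NO": "N", "MAYBE": "M"}[answer]

    value = fields[2] if len(fields) > 2 else None
    return start, end, name, value


if __name__ == "__main__":
    print(parse_norm_prop_line("00C0..00C5 ; NFD_NO"))
    print(parse_norm_prop_line("00C0..00C5 ; NFD_QC; N"))
```

Both calls in the usage lines yield the same tuple, which is the point of normalizing at parse
time: code after the parser only ever sees the newer names.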

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
    + assume that the type is always numeric for Han characters,
      and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ