Unicode Collation Algorithm
The following files provide conformance tests for the Unicode Collation Algorithm (UTS #10: Unicode Collation Algorithm).
These files are large, and thus packaged in zip format to save download time.
There are four different files:
The format is illustrated by the following example:
0385 0021; # (΅) GREEK DIALYTIKA TONOS [0316 015D | 0020 0032 0020 | 0002 0002 0002 |]
The part before the semicolon is the hex representation of a sequence of Unicode code points. After the hash mark is a comment. This comment is purely informational, and may change in the future. Currently it consists of the characters of the sequence in parentheses, the name of the first code point, and a representation of the sort key for the sequence.
The sort key representation is in square brackets. It uses a vertical bar for the ZERO separator. Between the bars are the primary, secondary, tertiary, and quaternary weights (if any), in hex.
Note: The sort key is purely informational. UCA does not require the production of any particular sort key, as long as the results of comparisons match.
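The line format described above can be read with a few lines of code. The following is a minimal sketch, not part of the test suite: the function names parse_test_line and parse_sort_key are hypothetical, and the sketch assumes that the sort key is the last bracketed group in the comment. Because the comment is informational and may change, a test program should rely only on the code points.

```python
def parse_test_line(line):
    """Split a test-file line into (code points, comment text).

    Returns None for blank or comment-only lines.
    """
    data, _, comment = line.partition("#")
    # Drop the semicolon that terminates the code point sequence, if present.
    data = data.strip().rstrip(";").strip()
    if not data:
        return None
    code_points = [int(token, 16) for token in data.split()]
    return code_points, comment.strip()


def parse_sort_key(comment):
    """Extract the weight levels from the bracketed sort key, if any.

    Assumption: the sort key is the last [...] group in the comment.
    Each vertical bar ends one level, so a trailing bar yields a final
    empty list.
    """
    start = comment.rfind("[")
    end = comment.rfind("]")
    if start == -1 or end < start:
        return None
    levels = comment[start + 1:end].split("|")
    return [[int(weight, 16) for weight in level.split()] for level in levels]


example = ("0385 0021; # (΅) GREEK DIALYTIKA TONOS "
           "[0316 015D | 0020 0032 0020 | 0002 0002 0002 |]")
code_points, comment = parse_test_line(example)
sort_key = parse_sort_key(comment)
```

For the example line, code_points holds the two values 0x0385 and 0x0021, and the first three entries of sort_key hold the primary, secondary, and tertiary weights.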
The files are designed so that each line orders as greater than or equal to the previous one when using the UCA and the Default Unicode Collation Element Table. A test program can read in each line, compare it to the previous line, and signal an error if the order is not correct. The exact comparison that should be used is as follows:
If there are any errors, then the UCA implementation is not compliant.
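The line-by-line check can be sketched as a small harness. This is an illustration under stated assumptions: check_order and its compare parameter are hypothetical names, and compare stands for the implementation under test, returning a negative, zero, or positive number. The code_point_compare placeholder below is NOT a UCA comparison; it exists only so that the sketch runs on its own.

```python
def check_order(strings, compare):
    """Return (index, previous, current) for every adjacent pair that is
    out of order, that is, where the previous string compares greater."""
    errors = []
    previous = None
    for index, current in enumerate(strings):
        if previous is not None and compare(previous, current) > 0:
            errors.append((index, previous, current))
        previous = current
    return errors


def code_point_compare(a, b):
    """Placeholder comparison by code point order (not UCA)."""
    return (a > b) - (a < b)


# Equal neighbours are allowed; only a decrease is reported.
errors = check_order(["a", "b", "b", "a"], code_point_compare)
```

An empty result means that no adjacent pair was out of order for the comparison supplied. In a real test run, compare would call the collation implementation being tested.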
These files contain test cases that include ill-formed strings, with surrogate code points. Implementations that do not weight surrogate code points the same way as reserved code points may filter out such lines in the test cases before testing for conformance.
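Filtering such lines is straightforward once each line has been parsed into a list of code points. The sketch below is illustrative only; the function names are hypothetical, and the input is assumed to be a list of integer code point lists.

```python
def has_surrogate(code_points):
    """True if any code point is in the surrogate range U+D800..U+DFFF."""
    return any(0xD800 <= cp <= 0xDFFF for cp in code_points)


def drop_surrogate_cases(test_cases):
    """Keep only the test cases that contain no surrogate code points."""
    return [case for case in test_cases if not has_surrogate(case)]


cases = [[0x0385, 0x0021], [0xD800, 0x0021], [0x0061]]
filtered = drop_surrogate_cases(cases)  # the second case is removed
```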
Note: This test is only valid for an untailored DUCET table.
Beginning with UCA 6.2, the test data strings are compared with strength = identical, using UCA S3.10 as a tie-breaker, which compares the NFD forms of the strings in code point order. Before UCA 6.2, the test files did not use strength = identical, and instead used as a tie-breaker the comparison of the unnormalized strings. Therefore, implementations which use the UCA test files to test multiple versions of UCA need to use different tie-breaker comparisons depending on the UCA version.
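A version-dependent tie-breaker might look like the following sketch. The function name tie_breaker and the (major, minor) version tuple are hypothetical, and the sketch assumes that the pre-6.2 comparison of unnormalized strings is also in code point order. Python compares str values by code point, and unicodedata.normalize uses the Unicode version bundled with the Python build, which may differ from the UCA version under test.

```python
import unicodedata


def tie_breaker(a, b, uca_version):
    """Compare two strings whose collation weights are equal.

    uca_version is a (major, minor) tuple, for example (6, 2).
    Returns a negative number, zero, or a positive number.
    """
    if uca_version >= (6, 2):
        # UCA 6.2 and later: compare the NFD forms in code point order.
        a = unicodedata.normalize("NFD", a)
        b = unicodedata.normalize("NFD", b)
    # Before UCA 6.2: compare the strings as they are (assumed code point order).
    return (a > b) - (a < b)


precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
new_result = tie_breaker(precomposed, decomposed, (6, 2))
old_result = tie_breaker(precomposed, decomposed, (6, 1))
```

The two strings have the same NFD form, so new_result is zero, while old_result is positive because U+00E9 is greater than U+0065 in code point order.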
Test data files for UCA 6.1 and earlier versions were generated with code that had a bug in the contraction matching. In that code, matches for certain contractions of Tibetan characters were found despite intervening combining marks, so that some test cases were not in proper order according to the UCA and the DUCET. UCA 6.2 test files omitted the relevant test cases. For UCA 6.3, the test data generation code was fixed and those test cases were restored.
For example, in the defective test data generation code, the strings 0FB2 0F80 0F71 0334 and 0F77 0334 compared equal. (U+0F77 is the TIBETAN VOWEL SIGN VOCALIC RR.) However, UCA processing with the DUCET will not find the contraction 0FB2 0F71 0F80:
See “Also note that the Algorithm employs two distinct contraction matching methods:” at the end of Section 7.2, Produce Collation Element Arrays.
© 2020 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.