Unicode Character Database | |
Revision | 13.0.0 |
Authors | Asmus Freytag, Ken Whistler |
Date | 2020-02-12 |
This Version | http://www.unicode.org/Public/13.0.0/ucd/NamesList.html |
Previous Version | http://www.unicode.org/Public/12.1.0/ucd/NamesList.html |
Latest Version | http://www.unicode.org/Public/UCD/latest/ucd/NamesList.html |
This file describes the format and contents of NamesList.txt.
The file and the files described herein are part of the Unicode Character Database (UCD). The Unicode Terms of Use apply.
The Unicode name list file NamesList.txt (also NamesList.lst) is a plain text file used to drive the layout of the character code charts in the Unicode Standard. The information in this file is a combination of several fields from the UnicodeData.txt and Blocks.txt files, together with additional annotations for many characters.
This document describes the syntax rules for the file format, but also gives brief information on how each construct is rendered when laid out for the code charts. Some of the syntax elements are used only in preparation of the drafts of the code charts and are not present in the final, released form of the NamesList.txt file.
Over time, the syntax has been extended by adding new features. The syntax for formal aliases and index tabs was introduced with Unicode 5.0. The syntax for marginal sidebar comments is utilized extensively in draft versions of the NamesList.txt file. The support for UTF-8 encoded files and the syntax for the UTF-8 charset declaration in a comment at the head of the file were introduced after Unicode 6.1.0 was published, as was the syntax for the specification of variation sequences and alternate glyphs and their respective summaries. The repertoire restriction in comments and aliases in the names list format was loosened from the prior limitation to U+0020..U+00FF, to include the wider range U+0020..U+02FF, as of Unicode 11.0.
The same input file can be used for the preparation of drafts and final editions for ISO/IEC 10646. Earlier versions of that standard used a different style, referred to below as ISO-style. That style necessitated the presence of some information in the name list file that is not needed (and in fact removed during parsing) for the Unicode code charts.
With access to the layout program (Unibook), it is simple to create name lists for formatting working drafts or other documents containing proposed characters.
The content of the NamesList.txt file is optimized for code chart creation. Some information that can be inferred by the reader from context has been suppressed to make the code charts more readable. See the chapter on Code Charts in the Unicode Standard.
The NamesList files are plain text files which, in their simplest form, look like this:
@@<tab>0020<tab>BASIC LATIN<tab>007F
; this is a file comment (ignored)
0020<tab>SPACE
0021<tab>EXCLAMATION MARK
0022<tab>QUOTATION MARK
. . .
007F<tab>DELETE
The semicolon (as first character), @ and <tab> characters are used by the file syntax and must be provided as shown. Hexadecimal digits must be in UPPERCASE. A double @@ introduces a block header, with the title, and start and ending code of the block provided as shown.
For a minimal name list, only the NAME_LINE and BLOCKHEADER and their constituent syntax elements are needed.
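As an illustration of the minimal form described above, the following Python sketch reads only BLOCKHEADER and NAME_LINE entries and skips everything else. The names `Block` and `parse_minimal` are hypothetical and not part of the UCD or of Unibook; the sketch assumes the input is already decoded text and does not strip ISO-style comments from names.

```python
import re
from dataclasses import dataclass, field

# CHAR is four to six uppercase hexadecimal digits.
CHAR = r"[0-9A-F]{4,6}"
# BLOCKHEADER: "@@" TAB BLOCKSTART TAB BLOCKNAME TAB BLOCKEND
BLOCKHEADER_RE = re.compile(rf"^@@\t+({CHAR})\t+(.+?)\t+({CHAR})$")
# NAME_LINE: CHAR TAB NAME (the rest of the line is kept verbatim here)
NAME_LINE_RE = re.compile(rf"^({CHAR})\t+(.+)$")


@dataclass
class Block:
    """Hypothetical container for one block and its named characters."""
    start: int
    end: int
    name: str
    chars: dict = field(default_factory=dict)  # code point -> name


def parse_minimal(text):
    """Collect block headers and name lines; ignore all other lines."""
    blocks = []
    for line in text.splitlines():
        if line.startswith(";"):
            continue  # file comment
        m = BLOCKHEADER_RE.match(line)
        if m:
            blocks.append(Block(int(m.group(1), 16), int(m.group(3), 16), m.group(2)))
            continue
        m = NAME_LINE_RE.match(line)
        if m and blocks:
            blocks[-1].chars[int(m.group(1), 16)] = m.group(2)
    return blocks


sample = (
    "@@\t0020\tBASIC LATIN\t007F\n"
    "; this is a file comment (ignored)\n"
    "0020\tSPACE\n"
    "0021\tEXCLAMATION MARK\n"
    "007F\tDELETE\n"
)
blocks = parse_minimal(sample)
print(blocks[0].name, hex(blocks[0].start), hex(blocks[0].end))
print(blocks[0].chars[0x21])
```

Name lines that appear before any block header are dropped in this sketch, which is a simplification rather than a rule of the format.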
The full syntax with all the options is provided in the following sections.
This section defines the overall file structure.

NAMELIST:       TITLE_PAGE* EXTENDED_BLOCK*

TITLE_PAGE:     TITLE
                | TITLE_PAGE SUBTITLE
                | TITLE_PAGE SUBHEADER
                | TITLE_PAGE IGNORED_LINE
                | TITLE_PAGE EMPTY_LINE
                | TITLE_PAGE NOTICE_LINE
                | TITLE_PAGE COMMENT_LINE
                | TITLE_PAGE PAGEBREAK
                | TITLE_PAGE FILE_COMMENT
                | FILE_COMMENT

EXTENDED_BLOCK: BLOCK
                | BLOCK SUMMARY

BLOCK:          BLOCKHEADER
                | BLOCKHEADER INDEX_TAB
                | BLOCK CHAR_ENTRY
                | BLOCK SUBHEADER
                | BLOCK NOTICE_LINE
                | BLOCK EMPTY_LINE
                | BLOCK IGNORED_LINE
                | BLOCK SIDEBAR_LINE
                | BLOCK PAGEBREAK
                | BLOCK FILE_COMMENT
                | BLOCK CROSS_REF

CHAR_ENTRY:     NAME_LINE
                | RESERVED_LINE
                | CHAR_ENTRY ALIAS_LINE
                | CHAR_ENTRY FORMALALIAS_LINE
                | CHAR_ENTRY COMMENT_LINE
                | CHAR_ENTRY CROSS_REF
                | CHAR_ENTRY DECOMPOSITION
                | CHAR_ENTRY COMPAT_MAPPING
                | CHAR_ENTRY IGNORED_LINE
                | CHAR_ENTRY EMPTY_LINE
                | CHAR_ENTRY NOTICE_LINE
                | CHAR_ENTRY FILE_COMMENT
                | CHAR_ENTRY VARIATION_LINE
In other words:
Neither TITLE nor SUBTITLE may occur after the first BLOCKHEADER.
Only TITLE, SUBTITLE, SUBHEADER, PAGEBREAK, COMMENT_LINE, NOTICE_LINE, EMPTY_LINE, IGNORED_LINE and FILE_COMMENT may occur before the first BLOCKHEADER.
Directly following either a NAME_LINE or a RESERVED_LINE an uninterrupted sequence of the following lines may occur (in any order and repeated as often as needed): ALIAS_LINE, CROSS_REF, DECOMPOSITION, COMPAT_MAPPING, FORMALALIAS_LINE, NOTICE_LINE, EMPTY_LINE, IGNORED_LINE, VARIATION_LINE and FILE_COMMENT.
Except for CROSS_REF, NOTICE_LINE, SIDEBAR_LINE, EMPTY_LINE, IGNORED_LINE and FILE_COMMENT, none of these lines may occur in any other place.
A PAGEBREAK may appear anywhere, except in the middle of a CHAR_ENTRY. A PAGEBREAK before the file title lines may not be supported. INDEX_TABs may appear after any block header.
If the first line of a file is a file comment, it may contain a UTF-8 charset declaration (see below). Alternatively, or in addition, a BOM may be present at the very beginning of the file, forcing the encoding to be interpreted as UTF-16 (little-endian only) or UTF-8. When declared as UTF-8, the names list format will support use of characters in the range U+0020..U+02FF in LINE and LABEL elements. Otherwise, the supported repertoire is limited to Latin-1, and attempted use of characters outside the Latin-1 range will result in data corruption.
Several of these elements, while part of the formal definition of the file format, do not occur in final published versions of NamesList.txt in the UCD.
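The BOM rule described above can be sketched in Python as follows. The function names `sniff_encoding` and `decode_nameslist` are hypothetical. The sketch looks only at a leading BOM and falls back to Latin-1 otherwise; it does not inspect a charset declaration in a first-line file comment, since the exact form of that declaration is not shown in this part of the document.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Return a Python codec name based on a leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the BOM
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"  # BOM-aware codec; only little-endian is expected here
    return "latin-1"  # fallback repertoire when no BOM is present


def decode_nameslist(data: bytes) -> str:
    """Decode raw file bytes using the sniffed encoding."""
    return data.decode(sniff_encoding(data))


raw = codecs.BOM_UTF8 + "0020\tSPACE\n".encode("utf-8")
print(repr(decode_nameslist(raw)))
```

A real reader would also need to honor the charset declaration before falling back to Latin-1; without that step, a BOM-less UTF-8 file would be decoded incorrectly by this sketch.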
A block may be extended by a summary of standard variation sequences or selected alternate glyphs (or both) defined for characters in the block:
SUMMARY:            ALTGLYPH_SUMMARY
                    | VARIATION_SUMMARY
                    | ALTGLYPH_SUMMARY VARIATION_SUMMARY
                    | MIXED_SUMMARY

ALTGLYPH_SUMMARY:   ALTGLYPH_SUBHEADER
                    | ALTGLYPH_SUMMARY SUMMARY_LINE

VARIATION_SUMMARY:  VARIATION_SUBHEADER
                    | VARIATION_SUMMARY SUMMARY_LINE

MIXED_SUMMARY:      MIXED_SUBHEADER
                    | MIXED_SUMMARY SUMMARY_LINE

SUMMARY_LINE:       SUBHEADER
                    | NOTICE_LINE
                    | FILE_COMMENT
                    | EMPTY_LINE
When formatted for display, each summary will recap the information presented in the VARIATION_LINE elements of the preceding block, grouped by alternate glyph variants and standardized variation sequences, and preceded by the corresponding subheader. Additional SUBHEADER and NOTICE lines, if provided, immediately follow the ALTGLYPH_SUBHEADER, VARIATION_SUBHEADER or MIXED_SUBHEADER. There is no provision to provide subheaders that are interspersed between items in the summary.
These syntax constructs are entirely optional. If the ALTGLYPH_SUBHEADER or VARIATION_SUBHEADER is omitted from the names list, but the preceding block nevertheless contains VARIATION_LINE elements as described below, Unibook will automatically generate any required summaries using a default format for the headers.
Thus, the main purpose of providing ALTGLYPH_SUBHEADER or VARIATION_SUBHEADER elements is to supply specific contents for these summary titles and to allow additional information to be added via SUBHEADER and NOTICE elements. The final published version of the Unicode names list is machine generated and will always explicitly provide any summary subheaders.
This section provides the details of the syntax for the individual elements.
ELEMENT SYNTAX      // How rendered

NAME_LINE:          CHAR TAB NAME LF
                        // The CHAR and the corresponding image are echoed,
                        // followed by the name as given in NAME
                    | CHAR TAB "<" LCNAME ">" LF
                        // Control and noncharacters use this form of
                        // lowercase, bracketed pseudo character name
                    | CHAR TAB NAME SP COMMENT LF
                        // Names may have a comment, which is stripped off
                        // unless the file is parsed for an ISO style list
                    | CHAR TAB "<" LCNAME ">" SP COMMENT LF
                        // Control and noncharacters may also have comments

RESERVED_LINE:      CHAR TAB "<reserved>" LF
                        // The CHAR is echoed followed by an icon for the
                        // reserved character and a fixed string e.g. "<reserved>"

COMMENT_LINE:       TAB "*" SP EXPAND_LINE
                        // * is replaced by BULLET, output line as comment
                    | TAB EXPAND_LINE
                        // Output line as comment

ALIAS_LINE:         TAB "=" SP LINE
                        // Replace = by itself, output line as alias

FORMALALIAS_LINE:   TAB "%" SP NAME LF
                        // Replace % by U+203B, output line as formal alias

CROSS_REF:          TAB "x" SP CHAR SP LCNAME LF
                    | TAB "x" SP CHAR SP "<" LCNAME ">" LF
                        // x is replaced by a right arrow
                    | TAB "x" SP "(" LCNAME SP "-" SP CHAR ")" LF
                    | TAB "x" SP "(" "<" LCNAME ">" SP "-" SP CHAR ")" LF
                        // x is replaced by a right arrow;
                        // (second type as used for control and noncharacters)
                        // In the forms with parentheses the "(","-" and ")" are removed
                        // and the order of CHAR and LCNAME is reversed;
                        // i.e. all inputs result in the same order of output
                    | TAB "x" SP CHAR LF
                        // x is replaced by a right arrow
                        // (this type is the only one without LCNAME
                        // and is used for ideographs)

VARIATION_LINE:     TAB "~" SP CHAR VARSEL SP LABEL LF
                    | TAB "~" SP CHAR VARSEL SP LABEL "(" LCTAG ")" LF
                        // output standardized variation sequence or simply the char code
                        // in case of alternate glyphs, followed by the alternate glyph
                        // or variation glyph and the label and context

FILE_COMMENT:       ";" LINE
EMPTY_LINE:         LF
                        // Empty and ignored lines as well as
                        // file comments are ignored
IGNORED_LINE:       TAB ";" LINE
                        // Ignore LINE

SIDEBAR_LINE:       ";;" LINE
                        // Output LINE as marginal note

DECOMPOSITION:      TAB ":" SP EXPAND_LINE
                    | TAB ":" SP "<" TAG ">" SP EXPAND_LINE
                        // Replace ':' by EQUIV, expand line into decomposition
                        // The <tag> gives optional information,
                        // e.g., about composition exclusion.
                        // by convention the tag has initial lowercase

COMPAT_MAPPING:     TAB "#" SP EXPAND_LINE
                    | TAB "#" SP "<" TAG ">" SP EXPAND_LINE
                        // Replace '#' by APPROX, output line as mapping
                        // The <tag> is the optional compatibility decomposition tag.
                        // by convention the tag has initial lowercase

NOTICE_LINE:        "@+" TAB LINE
                        // Output LINE as notice
                    | "@+" TAB "*" SP LINE
                        // Output LINE as notice
                        // "*" expands to a bullet character
                        // Notices following a character code apply to the
                        // character and are indented. Notices not following
                        // a character code apply to the page/block/column
                        // and are italicized, but not indented

TITLE:              "@@@" TAB LINE
                        // Output LINE as text
                        // Title is used in page headers

SUBTITLE:           "@@@+" TAB LINE
                        // Output LINE as subtitle

SUBHEADER:          "@" TAB LINE
                        // Output LINE as column header

VARIATION_SUBHEADER: "@~" TAB LINE
                        // Output LINE as column header (summary subheader)
                    | "@~"
                        // Output a default standard variation sequences summary subheader
                    | "@~" TAB "!"
                        // Suppress output of a default standard variant sequences
                        // summary subheader and disable display of summary
                    | "@~" TAB "!" VARSEL_LIST
                    | "@~" TAB "!" VARSEL_LIST LINE
                        // Output a standard summary subheader, using default or LINE respectively
                        // Suppress any std variation sequences using selectors from the list

ALTGLYPH_SUBHEADER: "@@~" TAB LINE
                        // Output LINE as column header (summary subheader)
                    | "@@~"
                        // Output a default alternate glyph summary subheader
                    | "@@~" TAB "!"
                        // Suppress output of a default alternate glyph summary subheader
                        // and disable display of summary

MIXED_SUBHEADER:    "@@@~" TAB LINE
                        // Output LINE as column header (summary subheader)
                    | "@@@~"
                        // Output a default combined variation and alternate glyph
                        // summary subheader
                    | "@@@~" TAB "!"
                        // Suppress output of a default alternate glyph summary subheader
                        // and disable display of summary
                    | "@@@~" TAB "!" VARSEL_LIST
                    | "@@@~" TAB "!" VARSEL_LIST LINE
                        // Output a combined summary subheader, using default or LINE respectively
                        // Suppress any std variation sequences using selectors from the list

BLOCKHEADER:        "@@" TAB BLOCKSTART TAB BLOCKNAME TAB BLOCKEND LF
                        // Cause a page break and optional
                        // blank page, then output one or more charts
                        // followed by the list of character names.
                        // Use BLOCKSTART and BLOCKEND to define
                        // what characters belong to a block.
                        // Use BLOCKNAME in page and table headers

BLOCKNAME:          LABEL
                    | LABEL SP "(" LABEL ")"
                        // If an alternate label is present it replaces
                        // the BLOCKNAME when an ISO-style names list is
                        // laid out; it is ignored in the Unicode charts

BLOCKSTART:         CHAR
                        // First character position in block

BLOCKEND:           CHAR
                        // Last character position in block

PAGEBREAK:          "@@"
                        // Insert a (column) break

INDEX_TAB:          "@@+"
                        // Start a new index tab at latest BLOCKSTART

EXPAND_LINE:        {ESC_CHAR | CHAR | STRING | ESC +}+ LF
                        // Instances of CHAR (see Notes) are replaced by
                        // CHAR NBSP x NBSP where x is the single Unicode
                        // character corresponding to CHAR.
                        // If character is combining, it is replaced with
                        // CHAR NBSP <circ> x NBSP where <circ> is the
                        // dotted circle

Notes:
The following are the primitives and terminals for the NamesList syntax.
LINE:               STRING LF

COMMENT:            "(" LABEL ")"
                    | "(" LABEL ")" SP "*"
                    | "*"

NAME:               <sequence of uppercase ASCII letters, digits, space and hyphen>

LCNAME:             <sequence of lowercase ASCII letters, digits, space and hyphen>
                    | LCNAME "-" CHAR

TAG:                <sequence of ASCII letters>

LCTAG:              <sequence of lowercase ASCII letters>

STRING:             <sequence of characters in the range U+0020..U+02FF, except controls>

LABEL:              <sequence of characters in the range U+0020..U+02FF, except controls, "(" or ")">

VARSEL:             CHAR
                    | ALT ( "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" )

VARSEL_LIST:        "{" CHAR_LIST "}"

CHAR_LIST:          CHAR
                    | CHAR_LIST SP CHAR

CHAR:               X X X X
                    | X X X X X
                    | X X X X X X

X:                  "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"|"A"|"B"|"C"|"D"|"E"|"F"

ESC_CHAR:           ESC CHAR

ESC:                "\"
                        // Special semantics of backslash (\) are supported
                        // only in EXPAND_LINE.

TAB:                <sequence of one or more ASCII tab characters 0x09>

SP:                 <ASCII 20>

LF:                 <any sequence of ASCII 0A and 0D>
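Because every element is identified by its leading characters, a line can be classified by prefix alone. The Python sketch below shows one way to do that. The function `classify` and the two marker tables are hypothetical names, and the sketch only looks at the start of each line; it does not validate the remainder against the productions above.

```python
import re

CHAR = r"[0-9A-F]{4,6}"  # four to six uppercase hex digits

# "@" markers, ordered so that longer prefixes are tested first.
AT_MARKERS = [
    ("@@@~", "MIXED_SUBHEADER"),
    ("@@@+", "SUBTITLE"),
    ("@@@", "TITLE"),
    ("@@~", "ALTGLYPH_SUBHEADER"),
    ("@@+", "INDEX_TAB"),
    ("@@\t", "BLOCKHEADER"),
    ("@@", "PAGEBREAK"),
    ("@~", "VARIATION_SUBHEADER"),
    ("@+", "NOTICE_LINE"),
    ("@\t", "SUBHEADER"),
]

# Markers that follow the leading TAB and are themselves followed by SP.
TAB_MARKERS = {
    "=": "ALIAS_LINE",
    "%": "FORMALALIAS_LINE",
    "x": "CROSS_REF",
    "~": "VARIATION_LINE",
    ":": "DECOMPOSITION",
    "#": "COMPAT_MAPPING",
}


def classify(line):
    """Return the element name suggested by the start of the line."""
    line = line.rstrip("\r\n")
    if line == "":
        return "EMPTY_LINE"
    if line.startswith(";;"):
        return "SIDEBAR_LINE"
    if line.startswith(";"):
        return "FILE_COMMENT"
    if line.startswith("@"):
        for prefix, kind in AT_MARKERS:
            if line.startswith(prefix):
                return kind
        return "UNKNOWN"
    if line.startswith("\t"):
        body = line.lstrip("\t")
        if body.startswith(";"):
            return "IGNORED_LINE"
        if len(body) >= 2 and body[1] == " " and body[0] in TAB_MARKERS:
            return TAB_MARKERS[body[0]]
        return "COMMENT_LINE"  # covers both the bulleted and plain forms
    m = re.match(rf"^{CHAR}\t+(.*)$", line)
    if m:
        return "RESERVED_LINE" if m.group(1) == "<reserved>" else "NAME_LINE"
    return "UNKNOWN"


for text in ["@@\t0020\tBASIC LATIN\t007F", "0020\tSPACE", "\t= blank", "@@"]:
    print(classify(text))
```

Testing the longest "@" prefixes first is what keeps "@@@~" from being mistaken for a TITLE or a PAGEBREAK; a stricter reader would follow the classification with a full match against the relevant production.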
Notes:
Version 13.0.0
Version 12.1.0
Version 12.0.0
Version 11.0.0
Version 10.0.0
Version 9.0.0
Version 8.0.0
Version 7.0.0
Version 6.3.0
Version 6.2.0
Version 6.1.0
Version 6.0.0
Version 5.2.0
Version 5.1.0
Version 5.0.0
Version 4.0.0
Version 3.2.0
Version 3.1.0 (2)