Unicode NamesList File Format

Revision	6.0.0
Authors	Asmus Freytag, Ken Whistler
Date	2010-06-25
This Version	http://www.unicode.org/Public/6.0.0/ucd/NamesList.html
Previous Version	http://www.unicode.org/Public/5.2.0/ucd/NamesList.html
Latest Version	http://www.unicode.org/Public/UNIDATA/NamesList.html

Summary

This file describes the format and contents of NamesList.txt.

Status

The file and the files described herein are part of the Unicode Character Database (UCD) and are governed by the UCD Terms of Use stated at the end.

1.0 Introduction

The Unicode name list file NamesList.txt (also NamesList.lst) is a plain text file used to drive the layout of the character code charts in the Unicode Standard. The information in this file is a combination of several fields from the UnicodeData.txt and Blocks.txt files, together with additional annotations for many characters.

This document describes the syntax rules for the file format, but also gives brief information on how each construct is rendered when laid out for the code charts. Some of the syntax elements are used only in preparation of the drafts of the code charts and are not present in the final, released form of the NamesList.txt file.

The syntax for formal aliases and index tabs was introduced with Unicode 5.0. The syntax for marginal sidebar comments is utilized extensively in draft versions of the NamesList.txt file.

The same input file can be used for the draft preparation for ISO/IEC 10646 (referred to below as ISO-style). This necessitates the presence of some information in the name list file that is not needed (and is in fact removed during parsing) for the Unicode code charts.

With access to the layout program (unibook.exe), it is a simple matter to create name lists for formatting working drafts that contain proposed characters.

The content of the NamesList.txt file is optimized for code chart creation. Some information that can be inferred by the reader from context has been suppressed to make the code charts more readable.

1.1 NamesList File Overview

The NamesList files are plain text files which in their simplest form look like this:

@@<tab>0020<tab>BASIC LATIN<tab>007F
; this is a file comment (ignored)
0020<tab>SPACE
0021<tab>EXCLAMATION MARK
0022<tab>QUOTATION MARK
. . .
007F<tab>DELETE

The semicolon (as first character), @ and <tab> characters are used by the file syntax and must be provided as shown. Hexadecimal digits must be in UPPERCASE. A double @@ introduces a block header, with the title, and start and ending code of the block provided as shown.

For a minimal name list, only the NAME_LINE and BLOCKHEADER and their constituent syntax elements are needed.
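
As an illustration only, a minimal name list of this kind could be read with a few lines of Python. The function name `parse_minimal` and the regular expressions are assumptions made for this sketch; they are not part of the file format or of the code chart formatter, and every other element described in the following sections is simply skipped.

```python
import re

# CHAR is four to six uppercase hexadecimal digits (see the CHAR primitive).
CHAR = r"[0-9A-F]{4,6}"
BLOCKHEADER = re.compile(rf"^@@\t+({CHAR})\t+(.+?)\t+({CHAR})$")
NAME_LINE = re.compile(rf"^({CHAR})\t+(.+)$")


def parse_minimal(text):
    """Return (blocks, names) for a name list that uses only
    BLOCKHEADER, NAME_LINE and file comments (hypothetical helper)."""
    blocks = []  # (start, title, end), with start and end as integers
    names = {}   # code point -> character name
    for line in text.splitlines():
        if not line or line.startswith(";"):
            continue  # EMPTY_LINE or FILE_COMMENT
        m = BLOCKHEADER.match(line)
        if m:
            blocks.append((int(m.group(1), 16), m.group(2), int(m.group(3), 16)))
            continue
        m = NAME_LINE.match(line)
        if m:
            names[int(m.group(1), 16)] = m.group(2)
    return blocks, names


sample = (
    "@@\t0020\tBASIC LATIN\t007F\n"
    "; this is a file comment (ignored)\n"
    "0020\tSPACE\n"
    "0021\tEXCLAMATION MARK\n"
    "007F\tDELETE\n"
)
blocks, names = parse_minimal(sample)
print(blocks)        # [(32, 'BASIC LATIN', 127)]
print(names[0x21])   # EXCLAMATION MARK
```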

The full syntax with all the options is provided in the following sections.

1.2 NamesList File Structure

This section defines the overall file structure.

NAMELIST:     TITLE_PAGE* BLOCK* 

TITLE_PAGE:   TITLE 
		| TITLE_PAGE SUBTITLE 
		| TITLE_PAGE SUBHEADER 
		| TITLE_PAGE IGNORED_LINE 
		| TITLE_PAGE EMPTY_LINE
		| TITLE_PAGE NOTICE_LINE
		| TITLE_PAGE COMMENT_LINE
		| TITLE_PAGE PAGEBREAK 
		| TITLE_PAGE FILE_COMMENT 

BLOCK:	      BLOCKHEADER 
		| BLOCKHEADER INDEX_TAB
		| BLOCK CHAR_ENTRY 
		| BLOCK SUBHEADER 
		| BLOCK NOTICE_LINE 
		| BLOCK EMPTY_LINE 
		| BLOCK IGNORED_LINE
		| BLOCK SIDEBAR_LINE
		| BLOCK PAGEBREAK
		| BLOCK FILE_COMMENT 

CHAR_ENTRY:   NAME_LINE | RESERVED_LINE
		| CHAR_ENTRY ALIAS_LINE
		| CHAR_ENTRY FORMALALIAS_LINE
		| CHAR_ENTRY COMMENT_LINE
		| CHAR_ENTRY CROSS_REF
		| CHAR_ENTRY DECOMPOSITION
		| CHAR_ENTRY COMPAT_MAPPING
		| CHAR_ENTRY IGNORED_LINE
		| CHAR_ENTRY EMPTY_LINE
		| CHAR_ENTRY NOTICE_LINE
		| CHAR_ENTRY FILE_COMMENT

In other words:

Neither TITLE nor SUBTITLE may occur after the first BLOCKHEADER.

Only TITLE, SUBTITLE, SUBHEADER, PAGEBREAK, COMMENT_LINE, NOTICE_LINE, EMPTY_LINE, IGNORED_LINE and FILE_COMMENT may occur before the first BLOCKHEADER.

CROSS_REF, DECOMPOSITION, COMPAT_MAPPING, ALIAS_LINE and FORMALALIAS_LINE lines occurring before the first block header are treated as if they were COMMENT_LINEs.

Directly following either a NAME_LINE or a RESERVED_LINE, an uninterrupted sequence of the following lines may occur (in any order and repeated as often as needed): ALIAS_LINE, FORMALALIAS_LINE, COMMENT_LINE, CROSS_REF, DECOMPOSITION, COMPAT_MAPPING, NOTICE_LINE, EMPTY_LINE, IGNORED_LINE and FILE_COMMENT.

The conventional order of elements in a char entry is: NAME_LINE, FORMALALIAS_LINE, ALIAS_LINE, COMMENT_LINE or NOTICE_LINE, CROSS_REFs, optionally ending in either DECOMPOSITION or COMPAT_MAPPING. This order is not enforced by the code chart formatter.

Except for EMPTY_LINE, NOTICE_LINE, SIDEBAR_LINE, IGNORED_LINE and FILE_COMMENT, none of these lines may occur in any other place.

A NOTICE_LINE displays differently depending on whether it follows a header or title or is part of a CHAR_ENTRY.

A PAGEBREAK may appear anywhere, except in the middle of a CHAR_ENTRY. A PAGEBREAK before the file title lines may not be supported. INDEX_TABs may appear after any block header.

Several of these elements, while part of the formal definition of the file format, do not occur in final published versions of the nameslist.
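
The CHAR_ENTRY rule above can be pictured as a grouping pass over the lines of a block. The Python sketch below is one possible reading, not the behavior of the code chart formatter: the name `group_entries` is hypothetical, the sample lines are illustrative, and the sketch assumes that any other line beginning with "@" (block header, subheader, page break) closes the current entry.

```python
import re

# A NAME_LINE or RESERVED_LINE starts with CHAR followed by TAB.
ENTRY_START = re.compile(r"^([0-9A-F]{4,6})\t+(.*)$")


def group_entries(lines):
    """Group lines into (code, name, annotations) tuples (hypothetical helper)."""
    entries = []
    current = None
    for line in lines:
        body = line.lstrip("\t")
        if not body.strip() or body.startswith(";"):
            # EMPTY_LINE, FILE_COMMENT, SIDEBAR_LINE and IGNORED_LINE
            # do not interrupt an entry and carry no annotation.
            continue
        m = ENTRY_START.match(line)
        if m:
            current = (m.group(1), m.group(2), [])
            entries.append(current)
        elif current is not None and (line.startswith("\t") or line.startswith("@+")):
            # ALIAS_LINE, CROSS_REF, COMMENT_LINE, NOTICE_LINE and so on
            current[2].append(line.strip())
        elif line.startswith("@"):
            current = None  # assumption: headers and page breaks end the entry
    return entries


sample_lines = [
    "@@\t0000\tC0 Controls and Basic Latin\t007F",
    "0021\tEXCLAMATION MARK",
    "\t= factorial",
    "\tx (inverted exclamation mark - 00A1)",
    "; a file comment between annotations",
    "@+\tan illustrative notice",
    "0022\tQUOTATION MARK",
]
for code, name, notes in group_entries(sample_lines):
    print(code, name, notes)
```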

1.3 NamesList File Elements

This section provides the details of the syntax for the individual elements.

ELEMENT		SYNTAX	// How rendered

NAME_LINE:	CHAR TAB NAME LF
			// The CHAR and the corresponding image are echoed, 
			// followed by the name as given in NAME

		CHAR TAB "<" LCNAME ">" LF
			// Control and noncharacters use this form of
			// lowercase, bracketed pseudo character name
		CHAR TAB NAME SP COMMENT LF
			// Names may have a comment, which is stripped off
			// unless the file is parsed for an ISO style list
		CHAR TAB "<" LCNAME ">" SP COMMENT LF
			// Control and noncharacters may also have comments
										
RESERVED_LINE:	CHAR TAB "<reserved>" LF
			// The CHAR is echoed followed by an icon for the
			// reserved character and a fixed string e.g. "<reserved>"
	
COMMENT_LINE:	TAB "*" SP EXPAND_LINE
			// * is replaced by BULLET, output line as comment
		TAB EXPAND_LINE	
			// Output line as comment

ALIAS_LINE:	TAB "=" SP LINE	
			// Replace = by itself, output line as alias

FORMALALIAS_LINE:
		TAB "%" SP NAME LF	
			// Replace % by U+203B, output line as formal alias

CROSS_REF:	TAB "x" SP CHAR SP LCNAME LF	
		TAB "x" SP CHAR SP "<" LCNAME ">" LF
			// x is replaced by a right arrow
		TAB "x" SP "(" LCNAME SP "-" SP CHAR ")" LF	
		TAB "x" SP "(" "<" LCNAME ">" SP "-" SP CHAR ")" LF	
			// x is replaced by a right arrow;
			// (second type as used for control and noncharacters)

			// In the forms with parens the "(","-" and ")" are removed
			// and the order of CHAR and LCNAME is reversed;
			// i.e. all inputs result in the same order of output

		TAB "x" SP CHAR LF
			// x is replaced by a right arrow
			// (this type is the only one without LCNAME 
			// and is used for ideographs)

FILE_COMMENT:	";"  LINE 
EMPTY_LINE:	LF			
			// Empty and ignored lines as well as 
			// file comments are ignored

IGNORED_LINE:	TAB ";" EXPAND_LINE
			// Skip ';' character, ignore text

SIDEBAR_LINE: 	";;" LINE
			// Skip ';;' characters, output line
			// as marginal note

DECOMPOSITION:	TAB ":" SP EXPAND_LINE	
			// Replace ':' by EQUIV, expand line into 
			// decomposition 

COMPAT_MAPPING:	TAB "#" SP EXPAND_LINE	
COMPAT_MAPPING:	TAB "#" SP "<" LCTAG ">" SP EXPAND_LINE	
			// Replace '#' by APPROX, output line as mapping;
			// check the <tag> for balanced < >

NOTICE_LINE:	"@+" TAB LINE
			// Skip '@+', output text as notice
		"@+" TAB "*" SP LINE
			// Skip '@+', output text as notice
			// "*" expands to a bullet character
			// Notices following a character code apply to the
			// character and are indented. Notices not following
			// a character code apply to the page/block/column 
			// and are italicized, but not indented

SUBTITLE:	"@@@+" TAB LINE	
			// Skip "@@@+", output text as subtitle

SUBHEADER:	"@" TAB LINE	
			// Skip '@', output line as column header

BLOCKHEADER:	"@@" TAB BLOCKSTART TAB BLOCKNAME TAB BLOCKEND LF
			// Skip "@@", cause a page break and optional
			// blank page, then output one or more charts
			// followed by the list of character names. 
			// Use BLOCKSTART and BLOCKEND to define
			// what characters belong to a block.
			// Use blockname in page and table headers
		
BLOCKNAME:	LABEL
		LABEL SP "(" LABEL ")"			
			// If an alternate label is present it replaces 
			// the blockname when an ISO-style namelist is
			// laid out; it is ignored in the Unicode charts

BLOCKSTART:	CHAR	// First character position in block
BLOCKEND:	CHAR	// Last character position in block
PAGEBREAK:	"@@"	// Insert a (column) break
INDEX_TAB:		"@@+"	// Start a new index tab at latest BLOCKSTART

TITLE:		"@@@" TAB LINE	
			// Skip "@@@", output line as text
			// Title is used in page headers

EXPAND_LINE:	{ESC_CHAR | CHAR | STRING | ESC +}+ LF
			// Instances of CHAR (see Notes) are replaced by 
			// CHAR NBSP x NBSP where x is the single Unicode
			// character corresponding to CHAR.
			// If character is combining, it is replaced with
			// CHAR NBSP <circ> x NBSP where <circ> is the 
			// dotted circle
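
Because every element above is identified by its leading characters, a parser can classify lines with an ordered table of patterns. The following Python sketch shows one way this might be done. The `RULES` table and the `classify` function are hypothetical, and context-dependent behavior (for example, treating cross references before the first block header as comments) is not modeled.

```python
import re

CHAR = r"[0-9A-F]{4,6}"

# Order matters: longer markers are tested before the shorter markers
# they begin with, and specific TAB forms before the generic comment.
RULES = [
    ("SUBTITLE", re.compile(r"^@@@\+\t")),
    ("TITLE", re.compile(r"^@@@\t")),
    ("INDEX_TAB", re.compile(r"^@@\+$")),
    ("BLOCKHEADER", re.compile(rf"^@@\t+{CHAR}\t+.+\t+{CHAR}$")),
    ("PAGEBREAK", re.compile(r"^@@$")),
    ("NOTICE_LINE", re.compile(r"^@\+\t")),
    ("SUBHEADER", re.compile(r"^@\t")),
    ("SIDEBAR_LINE", re.compile(r"^;;")),
    ("FILE_COMMENT", re.compile(r"^;")),
    ("IGNORED_LINE", re.compile(r"^\t+;")),
    ("RESERVED_LINE", re.compile(rf"^{CHAR}\t+<reserved>$")),
    ("NAME_LINE", re.compile(rf"^{CHAR}\t+\S")),
    ("ALIAS_LINE", re.compile(r"^\t+= ")),
    ("FORMALALIAS_LINE", re.compile(r"^\t+% ")),
    ("CROSS_REF", re.compile(r"^\t+x ")),
    ("DECOMPOSITION", re.compile(r"^\t+: ")),
    ("COMPAT_MAPPING", re.compile(r"^\t+# ")),
    ("COMMENT_LINE", re.compile(r"^\t+\S")),
    ("EMPTY_LINE", re.compile(r"^$")),
]


def classify(line):
    """Return the element name for one line without its line terminator."""
    for name, pattern in RULES:
        if pattern.search(line):
            return name
    return "UNKNOWN"


for text in ("@@\t0020\tBASIC LATIN\t007F", "0021\tEXCLAMATION MARK", "\t= factorial"):
    print(classify(text))
```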

Notes:

Blocks must be aligned on 16-code point boundary and contain an integer multiple of 16-code point columns. The exception to that rule is for blocks of ideographs, etc., for which no names are listed in the file. Such blocks must end on the actual last character.
Blocks must be non-overlapping and in ascending order. NAME_LINEs must be in ascending order and follow the block header for the block to which they belong.
Reserved entries are optional, and will normally be supplied automatically. They are required whenever followed by ALIAS_LINE, COMMENT_LINE, NOTICE_LINE or CROSS_REF.
The French version of the nameslist uses French rules, which allow apostrophe and accented letters in character names.
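
The block constraints in these notes lend themselves to a mechanical check. The Python sketch below is a hypothetical validator, not part of any Unicode tool. The function name `check_blocks` and the `unnamed` parameter are assumptions, as is the reading that blocks without listed names are exempt only from the rule about where a block ends.

```python
def check_blocks(blocks, unnamed=()):
    """Check block ranges against the notes above.

    blocks:  list of (start, end, name) in file order, with integer code points.
    unnamed: names of blocks for which no character names are listed
             (assumption: only their end position is exempt).
    Returns a list of error strings; an empty list means no problem was found.
    """
    errors = []
    prev_end = -1
    for start, end, name in blocks:
        if end < start:
            errors.append(f"{name}: end {end:04X} precedes start {start:04X}")
        if start % 16 != 0:
            errors.append(f"{name}: start {start:04X} is not on a 16-code point boundary")
        if name not in unnamed and (end + 1) % 16 != 0:
            errors.append(f"{name}: end {end:04X} does not complete a 16-code point column")
        if start <= prev_end:
            errors.append(f"{name}: overlaps or is out of order with the previous block")
        prev_end = max(prev_end, end)
    return errors


print(check_blocks([(0x0000, 0x007F, "Basic Latin"), (0x0080, 0x00FF, "Latin-1 Supplement")]))
print(check_blocks([(0x0020, 0x007E, "Misaligned")]))
```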

1.4 NamesList File Primitives

The following are the primitives and terminals for the NamesList syntax.

LINE:		STRING LF
COMMENT:	"(" LABEL ")"
		"(" LABEL ")" SP "*"
		"*"

NAME:	  	<sequence of uppercase ASCII letters, digits, space and hyphen> 
LCNAME:		<sequence of lowercase ASCII letters, digits, space and hyphen>
		LCNAME "-" CHAR

LCTAG:		<sequence of lowercase ASCII letters>
STRING:	  	<sequence of Latin-1 characters, except controls> 
LABEL:	  	<sequence of Latin-1 characters, except controls, "(" or ")"> 
CHAR:		X X X X
		| X X X X X 
		| X X X X X X 
X:	  	"0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"|"A"|"B"|"C"|"D"|"E"|"F" 
ESC_CHAR:	ESC CHAR	
ESC:	        "\"	
			// Special semantics of backslash (\) are supported
			// only in EXPAND_LINE.
TAB:	  	<sequence of one or more ASCII tab characters 0x09>	
SP:	  	<ASCII 20>
LF:	  	<any sequence of ASCII 0A and 0D>
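
To make the NAME and LCNAME primitives concrete, the following Python sketch checks a candidate NAME and derives an LCNAME from it, using the restrictions and the lowercasing convention given in the notes that follow. The helper names `is_valid_name` and `to_lcname` are hypothetical. The sketch assumes that "leading" refers to the start of the whole name, that words are separated by spaces only, and that a trailing hyphen followed by four to six hexadecimal digits is a code point.

```python
import re

NAME_CHARS = re.compile(r"[A-Z0-9 -]+")
TRAILING_CODE = re.compile(r"-[0-9A-F]{4,6}$")


def is_valid_name(name):
    """Check a NAME against the character set and the restrictions in the notes."""
    if not NAME_CHARS.fullmatch(name):
        return False  # only uppercase ASCII letters, digits, space and hyphen
    if name.startswith((" ", "-")):
        return False  # leading space or hyphen
    if "  " in name or "--" in name:
        return False  # multiple spaces or hyphens
    # Word-initial digits are illegal (assumption: words are split on spaces).
    return not any(word[:1].isdigit() for word in name.split(" "))


def to_lcname(name):
    """Lowercase a NAME, keeping a trailing code point in uppercase.

    Heuristic: a name whose last hyphenated part happens to consist of
    four to six letters from A to F would be mistaken for a code point.
    """
    m = TRAILING_CODE.search(name)
    if m:
        return name[:m.start()].lower() + m.group(0)
    return name.lower()


print(is_valid_name("EXCLAMATION MARK"))              # True
print(to_lcname("CJK COMPATIBILITY IDEOGRAPH-F900"))  # cjk compatibility ideograph-F900
```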

Notes:

Multiple or leading spaces, multiple or leading hyphens, as well as word-initial digits in NAMEs or LCNAMEs are illegal.
Special lookahead logic prevents a 4 digit number for a standard, such as ISO 9999 from being misinterpreted as ISO CHAR. Currently recognized are "ISO", "DIN", "IEC" and "S X" and "S C" for the JIS X and JIS C series of standards. For other standards, or for four-digit years in a comment, use a NOTICE_LINE instead, which prevents expansion.
The hyphen in a character range CHAR-CHAR is replaced by an EN DASH on output.
The final LF in the file must be present.
While the format allows multiple <tab> characters, by convention the actual number of tabs is always one or two, chosen to provide the best layout of the plaintext file.
A CHAR inside ' or " is expanded, but only its glyph image is printed, the code value is not echoed.
Single and double straight quotes in an EXPAND_LINE are replaced by curly quotes using English rules. Smart apostrophes are supported, but nested quotes are not. Single quotes can only be applied around a single word.
Inside an EXPAND_LINE, backslash is treated as an escape character that removes the special meaning of any literal character and also prevents the following digit sequence from being expanded. A backslash character in isolation is never displayed. A sequence of two backslash characters results in display of a single backslash, but has no effect on the interpretation of following characters.
The NamesList.txt file is encoded in Latin-1. While the code chart formatter can accept files in either Latin-1 or little-endian UTF-16, prefixed with a BOM, the character repertoire for running text (anything other than CHAR) is effectively restricted to Latin-1 characters.
When names containing code points are lowercased to make them LCNAMEs, the code point values remain uppercase. Such code points by convention follow a hyphen and are the last element in the name.
Earlier published versions of the NamesList file may contain extra spaces or tab characters; while these are errors in the files, they are not being corrected, to retain stability of the published versions. Anyone writing a parser for older versions of this file may need to be prepared to handle such exceptions.
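
The notes on CHAR expansion, lookahead for standards numbers, and backslash escapes can be combined into a rough model of how an EXPAND_LINE is processed. The Python sketch below is an approximation under stated assumptions and does not reproduce the code chart formatter: the name `expand_line` is hypothetical, "combining" is taken to mean a General Category starting with M, the lookahead for standards is modeled with regular expression lookbehinds, and quote handling and EN DASH substitution are left out.

```python
import re
import unicodedata

NBSP = "\u00a0"
DOTTED_CIRCLE = "\u25cc"

TOKEN = re.compile(
    r"\\\\"                        # two backslashes: one literal backslash
    r"|\\(?P<esc>[0-9A-F]+|.)?"    # escape: keep what follows, unexpanded
    # CHAR, unless it follows a recognized standards prefix
    r"|(?<!ISO )(?<!DIN )(?<!IEC )(?<!S X )(?<!S C )\b(?P<char>[0-9A-F]{4,6})\b"
)


def _replace(m):
    if m.group(0) == "\\\\":
        return "\\"
    code = m.group("char")
    if code is None:
        return m.group("esc") or ""  # an isolated backslash is never displayed
    value = int(code, 16)
    if value > 0x10FFFF:
        return code  # not a code point; leave the digits alone
    ch = chr(value)
    if unicodedata.category(ch).startswith("M"):
        # Combining character: show it on a dotted circle.
        return f"{code}{NBSP}{DOTTED_CIRCLE}{ch}{NBSP}"
    return f"{code}{NBSP}{ch}{NBSP}"


def expand_line(text):
    """Expand CHAR instances in the text of an EXPAND_LINE (sketch only)."""
    return TOKEN.sub(_replace, text)


print(repr(expand_line("see 0041 here")))
print(repr(expand_line(r"introduced in \2010")))
print(repr(expand_line("ISO 8859")))
```

Note that any run of four to six characters drawn from 0-9 and A-F is treated as a CHAR in this sketch, so an uppercase word such as ADDED would be expanded unless escaped.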

Modifications

Version 6.0.0

Added definitions for ESC_CHAR and ESC primitives.
Clarified interpretation of backslash escapes in EXPAND_LINE.

Version 5.2.0

Better aligned the rules section with the actual published files and behavior of existing parsers. This included fixing some obvious typos and clarifying some notes as well as the following changes, which are listed individually.
Replaced instances of <tab> by TAB throughout.
NAME_LINE for special names may have trailing COMMENTs including COMMENTs consisting entirely of "*".
In CROSS_REF added the form without LCNAME, fixed the literal to the correct lowercase "x" and noted that LCNAME may have "<" and ">" around it in the data. Also added missing LF in the rules.
Removed a redundant rule for BLOCKHEADER.
Changed FORMALALIAS_LINE from LINE to NAME to match actual restriction on contents.
Extended the documentation of lookahead logic for CHAR.
Accounted for FILE_COMMENT in overall file structure.

Version 5.1.0

Noted that comments in NAME_LINEs must be preceded by SP.
Provided additional information on allowable characters in names.
Added SIDEBAR_LINE.
Noted that CROSS_REF must contain a SP and CHAR, and that COMPAT_MAPPING must contain a SP and may contain a <tag>.
Noted that LCNAME may contain uppercase characters under exceptional circumstances.
Relaxed the restriction on lines starting with #, :, %, x and = on the TITLE_PAGE. These are now treated as comments.

Version 5.0.0

Added FORMALALIAS_LINE and INDEX_TAB to syntax.
Fixed the list of lines that may appear before a blockheader by adding NOTICE_LINE.
Minor fixes to the wording of several syntax definitions.

Version 4.0.0

Fixed syntax to better reflect restrictions on characters in character and block names.
Better document treatment of comments in block names, plus French name rules.

Version 3.2.0

Fixed several broken links, added a left margin, changed version numbering.

Version 3.1.0 (2)

Use of 4-6 digit hex notation is now supported.

UCD Terms of Use

Disclaimer

The Unicode Character Database is provided as is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. If this file has been purchased on magnetic or optical media from Unicode, Inc., the sole remedy for any claim will be exchange of defective media within 90 days of receipt.

This disclaimer is applicable for all other data files accompanying the Unicode Character Database, some of which have been compiled by the Unicode Consortium, and some of which have been supplied by other sources.

Limitations on Rights to Redistribute This Data

Recipient is granted the right to make copies in any form for internal distribution and to freely use the information supplied in the creation of products supporting the Unicode™ Standard. The files in the Unicode Character Database can be redistributed to third parties or other organizations (whether for profit or not) as long as this notice and the disclaimer notice are retained. Information can be extracted from these files and used in documentation or programs, as long as there is an accompanying notice indicating the source.