UCD: Extended Character Properties

PropList.txt contains extended properties that supplement the General Category property described in UnicodeData.html. Unlike the derived properties, the properties in PropList.txt cannot be derived directly from UnicodeData.txt or other data files of the UCD. These properties are listed in the following table.

Property	Status (N = normative, I = informative)	Definition and Usage
White_space	N	Space characters and those format control characters (such as TAB, CR, and LF) that programming languages should treat as "white space" when parsing elements. Note: ZERO WIDTH SPACE and ZERO WIDTH NO-BREAK SPACE are not included, because their functions are restricted to line-break control; their names are misleading in this respect. Note: Other senses of "whitespace" encompass different sets of characters.
Bidi_Control	N	Those format control characters which have specific functions in the Bidirectional Algorithm.
Join_Control	N	Those format control characters which have specific functions for control of cursive joining and ligation.
Dash	I	Those punctuation characters explicitly called out as dashes in the Unicode Standard, plus compatibility equivalents to those. Most of these have the Pd General Category, but some have the Sm General Category because of their use in mathematics.
Hyphen	I	Those dashes used to mark connections between pieces of words, plus the Katakana middle dot. The Katakana middle dot functions like a hyphen, but is shaped like a dot rather than a dash.
Quotation_Mark	I	Those punctuation characters that function as quotation marks.
Terminal_Punctuation	I	Those punctuation characters that generally mark the end of textual units.
Other_Math	I	Math characters that do not have the Sm General Category.
Hex_Digit	I	Characters commonly used for the representation of hexadecimal numbers, plus their compatibility equivalents.
Other_Alphabetic	I	Alphabetic characters whose General Category is not one of the letter categories (Lu, Ll, Lt, Lm, Lo), that is, whose major class is not L.
Ideographic	I	Characters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese) ideographs.
Diacritic	I	Characters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics.
Extender	I	Characters whose principal function is to extend the value or shape of a preceding alphabetic character. Typical of these are length and iteration marks.
Other_Lowercase	I	Lowercase characters that do not have the Ll General Category.
Other_Uppercase	I	Uppercase characters that do not have the Lu General Category.
Noncharacter_Code_Point	N	Code points that are explicitly defined as illegal for the encoding of characters. See Unicode 3.1 for more information.
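
The properties in the table above are assigned to code points and code point ranges in PropList.txt. The following is a minimal sketch of loading such data in Python. It assumes a semicolon-separated line layout of the form "code point or range ; property name # comment"; that layout, the sample excerpt, and the helper names parse_proplist and has_property are illustrative assumptions and are not specified by this description. Check them against the actual data file before relying on them.

```python
from collections import defaultdict

# Hypothetical excerpt in the ASSUMED "range ; property # comment" layout.
# It is not a copy of the real PropList.txt and is far from complete.
SAMPLE = """\
# comment lines and blank lines are skipped
0009..000D    ; White_space
0020          ; White_space
0030..0039    ; Hex_Digit # DIGIT ZERO..DIGIT NINE
0041..0046    ; Hex_Digit
0061..0066    ; Hex_Digit
FDD0..FDEF    ; Noncharacter_Code_Point
FFFE..FFFF    ; Noncharacter_Code_Point
"""


def parse_proplist(text):
    """Return {property name: [(first, last), ...]} for PropList-style text."""
    props = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop any trailing comment
        if not line:
            continue  # blank line or comment-only line
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # not a "range ; property" line
        first, _, last = fields[0].partition("..")
        low = int(first, 16)
        high = int(last, 16) if last else low  # single code point
        props[fields[1]].append((low, high))
    return dict(props)


def has_property(props, name, char):
    """True if the character falls in any range recorded for the property."""
    code_point = ord(char)
    return any(low <= code_point <= high for low, high in props.get(name, ()))


props = parse_proplist(SAMPLE)
print(has_property(props, "White_space", "\t"))      # TAB is listed
print(has_property(props, "White_space", "\u200b"))  # ZERO WIDTH SPACE is not
```

A linear scan over the ranges keeps the sketch short. For repeated lookups against the full data, sorting the ranges and using binary search would be a reasonable alternative.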

Revision	3.1.0
Authors	Mark Davis
Date	2001-02-28
This Version	http://www.unicode.org/Public/3.1-Update/PropList-3.1.0.html
Previous Version	n/a
Latest Version	http://www.unicode.org/Public/UNIDATA/PropList.html
