Word Break Chart

Unicode Version: 5.0.0

Date: 2006-06-13, 23:23:45 GMT

This page illustrates the application of the boundary specifications. The first chart shows where breaks would appear between different sample characters or strings: each row gives the sample before the boundary and each column the sample after it, with ÷ marking a position where a break is allowed and × one where it is not. The sample characters are chosen mechanically to represent the different properties used by the specification. Where properties used in the rules have 'overlaps', the samples are given 'composed' names. For example, SentenceBreak uses GCLF_Sep: Sep is the SentenceBreak property, but it overlaps with the GraphemeClusterBreak property LF.

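| | Other | GCControl | GCExtend | GCLF_Sep | GCCR_Sep | GCControl_Sep | GCControl_Format | Katakana | ALetter | MidLetter | MidNum | Numeric | ExtendNumLet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Other | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
| GCControl | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
| GCExtend | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
| GCLF_Sep | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
| GCCR_Sep | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
| GCControl_Sep | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
| GCControl_Format | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
| Katakana | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | × |
| ALetter | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | × | ÷ | ÷ | × | × |
| MidLetter | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
| MidNum | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
| Numeric | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | × | ÷ | ÷ | × | × |
| ExtendNumLet | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | × | × | ÷ | ÷ | × | × |

| | Other | GCControl | GCExtend | GCLF_Sep | GCCR_Sep | GCControl_Sep | GCControl_Format | Katakana | ALetter | MidLetter | MidNum | Numeric | ExtendNumLet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALetter GCControl_Format | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | × | ÷ | ÷ | × | × |
| ALetter MidLetter | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | × | ÷ | ÷ | ÷ | ÷ |
| ALetter MidLetter | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | × | ÷ | ÷ | ÷ | ÷ |
| ALetter MidLetter GCControl_Format | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | × | ÷ | ÷ | ÷ | ÷ |
| ALetter MidNum | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
| Numeric MidLetter | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
| Numeric MidLetter | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
| Numeric MidNum | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | × | ÷ |
| Numeric MidNum GCControl_Format | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | × | ÷ |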
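
The single-sample rows of the chart can also be read as a lookup table. The following is a minimal Python sketch, not part of the original page: the names `COLUMNS`, `ROWS`, and `is_break` are hypothetical, and the data is copied from the chart rows, with the row taken as the sample before the boundary and the column as the sample after it.

```python
# Column order as it appears in the chart header.
COLUMNS = [
    "Other", "GCControl", "GCExtend", "GCLF_Sep", "GCCR_Sep",
    "GCControl_Sep", "GCControl_Format", "Katakana", "ALetter",
    "MidLetter", "MidNum", "Numeric", "ExtendNumLet",
]

# One string per chart row: "÷" = break allowed, "×" = no break.
ROWS = {
    "Other":            "÷÷×÷÷÷×÷÷÷÷÷÷",
    "GCControl":        "÷÷×÷÷÷×÷÷÷÷÷÷",
    "GCExtend":         "÷÷×÷÷÷×÷÷÷÷÷÷",
    "GCLF_Sep":         "÷÷÷÷÷÷÷÷÷÷÷÷÷",
    "GCCR_Sep":         "÷÷÷×÷÷÷÷÷÷÷÷÷",
    "GCControl_Sep":    "÷÷÷÷÷÷÷÷÷÷÷÷÷",
    "GCControl_Format": "÷÷×÷÷÷×÷÷÷÷÷÷",
    "Katakana":         "÷÷×÷÷÷××÷÷÷÷×",
    "ALetter":          "÷÷×÷÷÷×÷×÷÷××",
    "MidLetter":        "÷÷×÷÷÷×÷÷÷÷÷÷",
    "MidNum":           "÷÷×÷÷÷×÷÷÷÷÷÷",
    "Numeric":          "÷÷×÷÷÷×÷×÷÷××",
    "ExtendNumLet":     "÷÷×÷÷÷×××÷÷××",
}

# Sanity check: every row must have one cell per column.
assert all(len(cells) == len(COLUMNS) for cells in ROWS.values())


def is_break(before: str, after: str) -> bool:
    """Return True if the chart shows a break between the two samples."""
    return ROWS[before][COLUMNS.index(after)] == "÷"
```

For instance, `is_break("GCCR_Sep", "GCLF_Sep")` is `False` (no break inside a CR LF pair), while `is_break("ALetter", "Katakana")` is `True`. The multi-sample rows (such as ALetter MidLetter) could be added to `ROWS` in the same way.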

Rules

Due to the way they have been mechanically processed for generation, the following rules do not match the UAX rules precisely. In particular:

  1. The rules are cast into a more regex-like style.
  2. The rules "sot ÷", "÷ eot", and "÷ Any" are added mechanically, and have artificial numbers.
  3. The rules are given decimal numbers, so rules such as 11a are given a number using tenths, such as 11.1.
  4. Where a rule has multiple parts (lines), each one is numbered using hundredths, such as 21.01) × BA, 21.02) × HY,...
  5. Any 'treat as' or 'ignore' rules are handled as discussed in Unicode Standard Annex #29, and are thus reflected in a transformation of the rules that is not visible here.

For the original rules, see the UAX.

Sample Strings

The following samples illustrate the application of the rules. The blue lines indicate possible break points. If your browser supports titles, then positioning the mouse over each character will show its name, while positioning between characters shows the number of the rule responsible for the break status.

  1.   c  a  n  '  t  
  2.   c  a  n    t  
  3.   a  b    b  y  
  4.   a  $  -  3  4  ,  5  6  7  .  1  4  %  b  
  5.   3  a  
  6.     c    a    n    '    t      
  7.     c    a    n        t      
  8.     a    b        b    y      
  9.     a    $    -    3    4    ,    5    6    7    .    1    4    %    b      
  10.     3    a
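
To make the samples concrete, here is a minimal Python sketch of a word segmenter that covers only the letter and number rules of UAX #29 (WB5 through WB12). It is a simplification under stated assumptions, not the implementation behind this page: the classifier `word_class` is hypothetical and recognizes only the ASCII characters used in samples 1, 4, and 5, and the Extend, Format, Katakana, and ExtendNumLet handling is omitted.

```python
def word_class(ch: str) -> str:
    """Hypothetical, ASCII-only stand-in for the Word_Break property."""
    if ch == "'":
        return "MidLetter"
    if ch in ",.":
        return "MidNum"
    if ch.isdigit():
        return "Numeric"
    if ch.isalpha():
        return "ALetter"
    return "Other"


def no_break(classes: list[str], i: int) -> bool:
    """True if the simplified rules forbid a break before position i."""
    before, after = classes[i - 1], classes[i]
    prev = classes[i - 2] if i >= 2 else None
    nxt = classes[i + 1] if i + 1 < len(classes) else None
    if before in ("ALetter", "Numeric") and after in ("ALetter", "Numeric"):
        return True  # WB5, WB8, WB9, WB10
    if before == "ALetter" and after == "MidLetter" and nxt == "ALetter":
        return True  # WB6: ALetter × MidLetter ALetter
    if prev == "ALetter" and before == "MidLetter" and after == "ALetter":
        return True  # WB7: ALetter MidLetter × ALetter
    if prev == "Numeric" and before == "MidNum" and after == "Numeric":
        return True  # WB11: Numeric MidNum × Numeric
    if before == "Numeric" and after == "MidNum" and nxt == "Numeric":
        return True  # WB12: Numeric × MidNum Numeric
    return False     # otherwise break everywhere


def split_words(text: str) -> list[str]:
    """Split text at every position where a break is allowed."""
    if not text:
        return []
    classes = [word_class(ch) for ch in text]
    pieces = [text[0]]
    for i in range(1, len(text)):
        if no_break(classes, i):
            pieces[-1] += text[i]
        else:
            pieces.append(text[i])
    return pieces
```

With these simplified rules, `split_words("can't")` returns `["can't"]`, `split_words("3a")` returns `["3a"]`, and `split_words("a$-34,567.14%b")` returns `["a", "$", "-", "34,567.14", "%", "b"]`.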