A Call for a New Structure for Character Code Sets
Copyright 1999 Joel Matthew Rees
Takino-cho, Kato-gun, Hyogo-ken, Japan
Always Under Construction
We need extensible, deterministically parsable character codes.
Extensible code sets are necessary for two reasons. First, the sizes of the
common Chinese Hanji and Japanese Kanji character sets are chronically
underestimated. Any one
ordinary person uses no more than around 3,000 on a regular basis. Specialists
may use as many as 9,000. But there are always the occasions, once a year or so,
when you need a few odd characters. Far worse, the subsets used by particular
individuals and groups are always disjoint. Even the native users of both sets
tend to be unaware how disjoint their spheres of use are. The original estimates
for a minimal useful subset of Chinese have grown from around 3000 to over 10,000
as computer use enters more fields. Relatively complete modern dictionaries of
Chinese register in the range of 50,000+ characters. Japanese estimates have
followed a similar pattern as specialists in various fields begin to realize the
benefits of computers. Again, relatively complete dictionaries of modern Japanese
also include in the range of 50,000 characters. The Japanese 50,000 are not the
same as the Chinese 50,000. Taiwan uses older characters, Japan uses some
abbreviations, and the mainland invented a general simplification in the 50's
that is in some ways as distinct from either the Taiwanese or Japanese characters
as Cyrillic is from Anglo/Latin.
To make computers as useful to Chinese and Japanese users as they are to users
of alphabetic languages, all these characters must be reproducibly representable.
Second, character sets are dynamic. Fifteen years ago, in college, I was content
to ignore the problems I knew would exist with Japanese character sets, because
the 8-bit ASCII set formed such a very convenient ring, and it was so easily
adapted to various uses not originally intended. But I knew even then where the
edges would fray, even for the small character set languages. Japanese and Chinese
both get new characters every year. Most of these are in specialty fields, in which
specialists are really not anxious to worry about making gaiji (non-standard
characters), and are especially not interested in worrying about how to make sure
the receiver of an electronic document will be able to read the gaiji.
Specialists of any national background need to be able to make up new characters
and mix them freely with the standard characters.
Incidentally, the Hanji (and Kanji) characters are
themselves conglomerations of simpler characters, called in Japanese, bushu,
but in English, radicals. There are fewer than 500 radicals. Each
dictionary uses a slightly different set. Also, the methods of combining radicals
are neither linear nor context-free, which makes it difficult for computers to use
them for encoding characters. (But I think they must be the foundation for any
truly useful encoding.)
On a separate track, present code sets pose some difficult parsing problems.
Specifically, it is often hard to tell where the boundaries between characters
are. If you scan from the beginning of a transmission or document, it is not too
bad, but if you need to scan backwards or start in the middle, finding the
boundaries often requires either guessing or going back to the beginning.
A technique that has mostly lost its popularity is the use of escape codes to
switch into the extension sets, just like switching between fundamental sets.
This is rather clumsy, and tends to lead to character codes of varying width.
When using such escape code sequences, you don't know what character set or
extension set you are in unless you start from the beginning. That probably
means you don't know how wide the characters are, either.
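The statefulness of escape-switched encodings can be sketched in a few lines. The shift bytes below are simplified stand-ins, not any real escape-sequence standard, and the point is only that the width of a byte depends on state accumulated since the start of the stream:

```python
SHIFT_TO_TWO_BYTE = 0x0E   # hypothetical "switch to two-byte set" code
SHIFT_TO_ONE_BYTE = 0x0F   # hypothetical "switch back to one-byte set" code

def char_widths(stream):
    """Return the byte width of each character in the stream.

    The current character set is implicit state, so this only works
    as a full front-to-back scan; there is no way to classify a byte
    in the middle of the stream without replaying everything before it.
    """
    widths = []
    i, two_byte = 0, False
    while i < len(stream):
        b = stream[i]
        if b == SHIFT_TO_TWO_BYTE:
            two_byte = True
            i += 1
        elif b == SHIFT_TO_ONE_BYTE:
            two_byte = False
            i += 1
        elif two_byte:
            widths.append(2)
            i += 2
        else:
            widths.append(1)
            i += 1
    return widths
```

Starting the same scan at an arbitrary offset would silently produce different widths, which is exactly the parsing problem described above.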
Another technique is the reservation of certain ranges as lead words for
extension ranges. For example, SJIS (Shift-JIS) defines all the codes for
Japanese characters (plus two-byte codes for Anglo/Roman, Russian, Greek,
Japanese Kana, and various decorative and symbolic characters) in the ranges
from 0x8140 to 0x84BE, 0x889F to 0x9FFC, and 0xE040 to 0xEAA4. This leaves
gaps for mixing seven-bit (modified) ANSI and one-byte katakana (which was
convenient in the initial ports of certain major software products but is
now a source of significant headache).
(Edited 3-5 June 2000 starting here.)
The ranges are as follows:
(This is from an old word processing character dictionary, circa 1991. The current
standard has a lot more characters, I think, but I don't know where they squeeze
them all in, because I'm too cheap to buy an official copy of the standard.)
- 0x00~0x1F, 0x7F: The usual control codes.
- 0x20~0x7E: ASCII/ANSI/ISO whatever it is.
- 0x80: ?? -- Not sure what the standard says about this unused code.
- 0x8100~0x9FFF: Symbols, English, Greek, Russian, kana, kanji.
(valid character ranges: 0xXX40~0xXX7E, 0xXX80~0xXXFC)
- 0xA0: ?? -- Not sure about this one either.
- 0xA1~0xDF: half-width katakana, now officially discouraged.
- 0xE000~0xEAFF: More kanji.
(valid character ranges: 0xXX40~0xXX7E, 0xXX80~0xXXFC)
The holes where codes are not defined in the SJIS ranges make it fairly easy to
guess where the boundaries between characters are, but there are some bad cases
which programs must be able to handle. All the backing up and going forward to
find character boundaries, of course, chews up processor time.
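The ambiguity is easy to demonstrate with the ranges from the (circa 1991) table above. A sketch, not production SJIS handling:

```python
def is_sjis_lead(b):
    """Lead bytes of two-byte SJIS codes, per the table above."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEA

def is_sjis_trail(b):
    """Valid second bytes of two-byte SJIS codes, per the table above."""
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC
```

A byte like 0x83 is in both the lead range and the trail range, so seen in isolation it cannot be classified; a scanner must back up, possibly all the way to the start of the buffer, to decide whether it begins a character or ends one.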
Even UNICODE, as it becomes evident that 65,536 (2^16) characters are not sufficient,
is being defined to include a range of codes that will be lead words for expansion.
However, UTF-16 (look it up on your favorite search engine) defines for UNICODE not
really lead words, but rather "surrogate pairs". In other words, the lead words are
from one specific range and the tail words are from another, both borrowed from ranges
initially intended for custom character codes.
The two ranges include 1,024 values each, for a total of 20 (actually a little more
than 20 and a little less than 21) usable bits for extension, over a million character
codes total. I don't think this is quite enough, but I am admittedly strange.
Anyway, the use of surrogate pairs will allow predictable parsing, even from the
middle or end of a string. Both lead and tail words for the extension codes are now
defined to be outside the ranges for one-word codes, so we can start in the middle
and go either forward or backward.
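The surrogate arithmetic itself is simple: a code point above 0xFFFF is split into a lead word in 0xD800..0xDBFF and a tail word in 0xDC00..0xDFFF, each carrying 10 bits of the offset from 0x10000. A minimal sketch:

```python
def to_surrogates(cp):
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(hi, lo):
    """Recombine a lead and tail surrogate into a code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
```

Because the lead range, the tail range, and the one-word codes are mutually disjoint, a scanner landing on any 16-bit word knows immediately which of the three it is looking at.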
But UNICODE text streams contain lots of octets (half words) that cause many
standard and de-facto standard analysis algorithms to cough, break, and/or explode.
Once again, how do we avoid the necessity of guessing at character boundaries? If
the lead words or bytes are disjoint from the tail words or bytes, and both disjoint
from non-extended codes, the boundaries are immediately apparent. I had originally
thought that making the lead and tail word ranges disjoint would be a waste of code
space, but working with variable width characters proves otherwise. I have not
performed a formal analysis, but the principle of squareness leads me to believe
that the most efficient distributions will be derived from the use of the lead
bit for distinguishing between lead and trailing words. If we want some
compatibility with the present 7-bit ASCII/ANSI/ISO code sets (including one-byte
NUL characters), the high bit should be clear for trailing bytes, including
single byte codes, and it should be set for the lead byte of multiple-byte codes.
The ranges allowed would be as follows:
- 0x00..0x7F (2^7 codes)
- 0x80..0xFF : 0x00..0x7F (2^14 codes)
- 0x80..0xFF : 0x80..0xFF : 0x00..0x7F (2^21 codes)
- 0x80..0xFF : 0x80..0xFF : 0x80..0xFF : 0x00..0x7F (2^28 codes)
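The scheme above can be sketched in a few lines. Note that the text defines only the byte ranges; how values map into each length (shortest form, here) is my assumption:

```python
def encode(cp):
    """Encode a value per the scheme above: every byte of a character
    has the high bit set except the last. Shortest-form encoding is
    assumed for illustration."""
    if cp < 0x80:
        return bytes([cp])
    out = [cp & 0x7F]              # final byte, high bit clear
    cp >>= 7
    while cp:
        out.append(0x80 | (cp & 0x7F))  # leading bytes, high bit set
        cp >>= 7
    return bytes(reversed(out))

def char_start(buf, i):
    """From an arbitrary offset, find the start of the character
    containing buf[i]: it is the byte after the previous byte whose
    high bit is clear."""
    j = i
    while j > 0 and buf[j - 1] & 0x80:
        j -= 1
    return j
```

Parsing needs no state: a character ends at the first byte with the high bit clear, so scanning forward or backward from any offset resynchronizes immediately.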
An application would necessarily be free in how it responds to excessively long
character codes. Unfortunately, giving up in the middle may allow sneaky people to
hide potentially dangerous characters, like the characters that delimit parts of
file and file path names, or the characters that allow inserting shell commands in
text that we want to think is ordinary text.
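The same trap exists in UTF-8 as the "overlong" sequence: 0xC0 0xAF re-encodes '/' (0x2F) in two bytes, so a decoder that accepts it lets '/' slip past a filter that only checks for the one-byte form. This is why strict decoders, including Python's, reject overlong forms outright:

```python
def decodes_cleanly(data):
    """Return True if the bytes are accepted by a strict UTF-8 decoder."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# b'/' is the legitimate one-byte encoding of the path separator;
# b'\xc0\xaf' is the overlong two-byte form of the same character.
```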
One thing I would like to do is to map existing code sets into three and four byte
ranges. UNICODE, for instance, might be given the ranges
0xA0..0xA7 : 0x80..0xFF : 0x00..0x7F (2^17 codes)
with JIS in the ranges
0xA8..0xA9 : 0x80..0xFF : 0x00..0x7F (2^15 codes)
and the existing Chinese and Korean encodings in following ranges.
Another thing that must be done is to define a set of command codes for defining
new characters. The definitions of the new characters (and of characters that the
recipient might not have fonts for) would be pasted on the front of transmissions
or files.
(Added 2 Feb 2000 edited 3-5 June 2000:)
Well, now I've seen an encoding method for UNICODE (and perhaps others?) being promoted
in the UNIX sector, UTF-8 or something like that, which uses an expandable approach
like the one above, but also uses the lead bits of the lead octet to indicate the
total number of octets included in the character, and makes lead and tail
ranges disjoint. I think it lays out something like this:
THIS IS NOT UTF-8!!! (See below for the real thing.)
(My gray matter failed me here.)
- 0x00..0x7F (2^7 codes)
- 0xE0..0xEF : 0x80..0xBF (2^10 codes)
- 0xF0..0xF7 : 0xC0..0xDF : 0x80..0xBF (2^15 codes)
- 0xF8..0xFB : 0xC0..0xDF : 0xC0..0xDF : 0x80..0xBF (2^20 codes)
- 0xFC..0xFD : 0xC0..0xDF : 0xC0..0xDF : 0xC0..0xDF : 0x80..0xBF (2^25 codes)
As I recall, it's supposed
to provide at least thirty-one usable bits of encoding, with 0x00 occurring only as a
single-octet NUL, and the particularly problematic 0xFF and 0xFE not occurring at all.
I am not sure of the utility of knowing before you scan how far to scan, and I
instinctively want to leave lots more room to expand. But this encoding should work
for the next ten years or so, at least. Look it up on google.com.
(Added 3 June 2000:) I finally found the time to look it up
again, (on google, check out the LINUX man page references for conciseness) and put
up a real table:
- 0x00..0x7F (2^7 codes)
- 0xC0..0xDF : 0x80..0xBF (2^11 codes)
- 0xE0..0xEF : 0x80..0xBF : 0x80..0xBF (2^16 codes)
- 0xF0..0xF7 : 0x80..0xBF : 0x80..0xBF : 0x80..0xBF (2^21 codes)
- 0xF8..0xFB : 0x80..0xBF : 0x80..0xBF : 0x80..0xBF : 0x80..0xBF (2^26 codes)
- 0xFC..0xFD : 0x80..0xBF : 0x80..0xBF : 0x80..0xBF : 0x80..0xBF : 0x80..0xBF (2^31 codes)
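The table above translates directly into an encoder. This is a sketch of the original (up to six bytes) form of UTF-8, where an n-byte sequence carries 5n+1 bits, not the later four-byte restriction:

```python
def utf8_encode(cp):
    """Encode a code point per the original UTF-8 table above."""
    if cp < 0x80:
        return bytes([cp])
    # n bytes carry 5n+1 bits: 2 -> 11, 3 -> 16, 4 -> 21, 5 -> 26, 6 -> 31.
    n = next(i for i in range(2, 7) if cp < (1 << (5 * i + 1)))
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (cp & 0x3F))   # continuation bytes: 0x80..0xBF
        cp >>= 6
    lead_prefix = (0xFF << (8 - n)) & 0xFF  # n=2 -> 0xC0, n=3 -> 0xE0, ...
    out.append(lead_prefix | cp)
    return bytes(reversed(out))
```

The lead byte announces the sequence length in its own high bits, which is the property questioned below about knowing how far to scan before scanning.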
If you are interested in this kind of stuff, you'll want to look up UCS-2 and UCS-4, and
UTF-16 as well.
A key point of UTF-8 is that 0x00 and all the traditional control codes occur only as
single-octet codes, never inside multi-octet codes, and the octets 0xFE and 0xFF never
occur at all.
UTF-16 requires the system to be re-written from the ground up with 16 bit characters,
and it does not like to have 8 bit characters mixed into things like file names and command
streams. It's hard to get a sixteen bit clean system up without 16 bit clean tools, and
it's hard to make 16 bit clean tools without a 16 bit clean system.
And then there is the confusion of little endianism. A sixteen bit character is not a sixteen
bit integer on least significant byte first CPUs.
UTF-8 avoids endianism problems, allows using existing 8 bit text of all sorts, and only
breaks a (relatively) small number of the traditional character processing algorithms.
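The resynchronization property takes only a few lines to show. This is a sketch of the standard skip-the-continuation-bytes technique, not code from any particular library:

```python
def utf8_char_start(buf, i):
    """Back up from an arbitrary offset to the start of the UTF-8
    character containing buf[i]. Continuation bytes (0x80..0xBF)
    never begin a character, so we skip at most a handful of them."""
    while i > 0 and 0x80 <= buf[i] <= 0xBF:
        i -= 1
    return i
```

Dropped into the middle of a multi-octet character, the scanner recovers the boundary locally, with no need to rewind to the start of the stream.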
But all of the above is irrelevant. I waved my hands at the real issues above, but then
got sidetracked into math games and bit pattern problems instead.
The real problem with CJKV characters is that the industry over here can't see
the character processing for the characters. Imagine what computer systems, database
operations, applications, etc. would be like if every word in the (say, English) language
had its own code. Kanji are not the fundamental characters of any of the languages
that use them.
If we want to do with Japanese or Chinese the things we take for granted with English, we
must develop an encoding scheme based on the radical elements, of which there are fewer
than 500, maybe even fewer than 100, if we analyze them right.
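A purely illustrative sketch of the idea: the decomposition of 明 ("bright") into 日 (sun) and 月 (moon) is standard, but the numeric radical codes and the layout operator below are invented for this example, not part of any proposal:

```python
RADICAL = {'日': 72, '月': 74}   # hypothetical radical code assignments
LEFT_RIGHT = 0x01                # hypothetical "place side by side" operator

def encode_char(layout, *radicals):
    """A character becomes a layout operator followed by radical codes,
    so a brand-new character needs no new code point, only a new
    sequence of existing codes."""
    return [layout] + [RADICAL[r] for r in radicals]
```

This is where the non-linear, non-context-free combination rules mentioned earlier bite: a real scheme needs more layout operators than a single left-right one, which is exactly the hard part.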
Also, the unified encoding should not try to cover all the gadgetry of each language. I'm
still not convinced that we are well served by a unified encoding. It seems to me to make
more sense to keep the traditional code sets, let the people who use them the most control
them, and simply define a common interface across which we can send renderings of unusual
included characters in headers to files traveling language boundaries. It seems complicated,
but when you look at the complexity of UNICODE, maybe it isn't really so complicated after
all.
^v:charcode c00.00.05e 2000.06.06