A Call for a New Structure for Character Code Sets
Copyright 1999 Joel Matthew Rees
Takino-cho, Kato-gun, Hyogo-ken, Japan
Always Under Construction


We need extensible, deterministically parsable character codes.

Extensible code sets are necessary for two reasons. First, the sizes of the common Chinese Hanji and Japanese Kanji character sets are always underestimated. Any one ordinary person uses no more than around 3000 on a regular basis. Specialists may use as many as 9000. But there are always the occasions, once a year or so, when you need a few odd characters. Far worse, the subsets used by particular individuals and groups are always disjoint. Even the native users of both sets tend to be unaware of how disjoint their spheres of use are. The original estimates for a minimal useful subset of Chinese have grown from around 3000 to over 10,000 as computer use enters more fields. Relatively complete modern dictionaries of Chinese register in the range of 50,000+ characters. Japanese estimates have followed a similar pattern as specialists in various fields begin to realize the benefits of computers. Again, relatively complete dictionaries of modern Japanese also include on the order of 50,000 characters. The Japanese 50,000 are not the same as the Chinese 50,000. Taiwan uses older characters, Japan uses some abbreviations, and the mainland invented a general simplification in the 1950s that is in some ways as distinct from either the Taiwanese or Japanese characters as Cyrillic is from Anglo/Latin.

To make computers as useful to Chinese and Japanese users as they are to users of alphabetic scripts, all of these characters must be reproducibly representable.

Second, character sets are dynamic. Fifteen years ago, in college, I was content to ignore the problems I knew would exist with Japanese character sets, because the 8-bit ASCII set formed such a convenient ring, and it was so easily adapted to various uses not originally intended. But I knew even then where the edges would fray, even for the small character set languages. Japanese and Chinese both gain new characters every year. Most of these arise in specialty fields, where the specialists are really not anxious to worry about making gaiji (non-standard characters), and are especially not interested in worrying about how to make sure the receiver of an electronic document will be able to read the gaiji. Specialists of any national background need to be able to make up new characters and mix them freely with the standard characters.

Incidentally, the Hanji (and Kanji) characters are themselves conglomerations of simpler characters, called bushu in Japanese and radicals in English. There are fewer than 500 radicals, and each dictionary uses a slightly different set. Also, the methods of combining radicals are neither linear nor context-free, which makes it difficult for computers to use them for encoding characters. (But I think they must be the foundation for ideographic characters.)

On a separate track, present code sets pose some difficult parsing problems. Specifically, it is often hard to tell where the boundaries between characters are. If you scan from the beginning of a transmission or document, it is not too bad, but if you need to scan backwards or start in the middle, finding the boundaries often requires either guessing or going back to the beginning.

A technique that has mostly lost its popularity is the use of escape codes to switch into the extension sets, just like switching between fundamental sets. This is rather clumsy, and tends to lead to character codes of varying width. When using such escape code sequences, you don't know what character set or extension set you are in unless you start from the beginning. That probably means you don't know how wide the characters are, either.
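As a concrete illustration, here is a minimal sketch in C, assuming ISO-2022-JP-style escapes (ESC ( B to switch to one-byte ASCII, ESC $ B to switch to the two-byte JIS X 0208 set). The function name is mine; the point is only that the scanner has to carry state from the very beginning of the stream before it can know how wide the current characters are.

    #include <stdio.h>

    /* A toy forward-only scanner for an ISO-2022-JP-style stream.
       Widths can only be known by tracking every escape sequence
       seen since the start of the buffer. */
    enum charset { ASCII, JIS_X_0208 };

    static void scan_iso2022(const unsigned char *s, size_t len)
    {
        enum charset state = ASCII;     /* defined only at the start */
        size_t i = 0;

        while (i < len) {
            if (s[i] == 0x1B && i + 2 < len) {          /* ESC ... */
                if (s[i + 1] == '(' && s[i + 2] == 'B')
                    state = ASCII;                      /* ESC ( B  */
                else if (s[i + 1] == '$' && s[i + 2] == 'B')
                    state = JIS_X_0208;                 /* ESC $ B  */
                i += 3;
            } else if (state == JIS_X_0208) {
                printf("two-byte character at %zu\n", i);
                i += 2;                                 /* two octets per character */
            } else {
                printf("one-byte character at %zu\n", i);
                i += 1;
            }
        }
        /* Starting this loop in the middle of the buffer is hopeless:
           'state' would be unknown, and so would the character width. */
    }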

Another technique is the reservation of certain ranges as lead words for extension ranges. For example, SJIS (Shift-JIS) defines all the codes for Japanese characters (plus two-byte codes for Anglo/Roman, Russian, Greek, Japanese Kana, and various decorative and symbolic characters) in the ranges from 0x8140 to 0x84BE, 0x889F to 0x9FFC, and 0xE040 to 0xEAA4. This leaves gaps for mixing seven-bit (modified) ANSI and one-byte katakana (which was convenient in the initial ports of certain major software products but is now a source of significant headache).

(Edited 3-5 June 2000 starting here.)

The ranges are as follows:
(This is from an old word processing character dictionary, circa 1991. The current standard has a lot more characters, I think, but I don't know where they squeeze them all in, because I'm too cheap to buy an official copy of the standard.)

    0x8140 - 0x84BE  (symbols, Anglo/Roman, Russian, Greek, and the kana)
    0x889F - 0x9FFC  (the first, more common block of Kanji)
    0xE040 - 0xEAA4  (the second block of Kanji)


The holes where codes are not defined in the SJIS ranges make it fairly easy to guess where the boundaries between characters are, but there are some bad cases which programs must be able to handle: a second byte can look like a valid first byte, or like an ordinary one-byte code. All the backing up and going forward to find character boundaries, of course, chews up processor time.
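To make the cost concrete, here is a rough sketch in C of the usual resynchronization trick for Shift-JIS: back up over everything that could be a first byte, then use the parity of the count to decide where the boundary is. The lead-byte ranges used below (0x81-0x9F and 0xE0-0xEF) are the commonly quoted ones, slightly wider than the table above; the function names are mine.

    #include <stddef.h>

    /* Could this be the first byte of a two-byte Shift-JIS code? */
    static int sjis_lead(unsigned char b)
    {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF);
    }

    /* Given an arbitrary offset into the buffer, return the offset of the
       character that contains (or starts at) it.  Because second bytes
       overlap both the ASCII range and the lead-byte range, we have to
       back up until we find a byte that cannot be a lead byte, then count
       the consecutive possible lead bytes: an even count means 'pos'
       starts a character, an odd count means it is a second byte. */
    static size_t sjis_char_start(const unsigned char *buf, size_t pos)
    {
        size_t back = pos;
        while (back > 0 && sjis_lead(buf[back - 1]))
            back--;                      /* skip over possible lead bytes */
        if ((pos - back) % 2 == 1)
            return pos - 1;              /* odd: pos is the second byte */
        return pos;                      /* even: pos begins a character */
    }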

Even UNICODE, as it becomes evident that 2^16 (65,536) characters are not sufficient, is being defined to include a range of codes that will be lead words for expansion codes.

However, UTF-16 (look it up on your favorite search engine) defines for UNICODE not really lead words, but rather "surrogate pairs". In other words, the lead words are from one specific range and the tail words are from another, both borrowed from ranges initially intended for custom character codes.

The two ranges include 1024 values each, for a total of 20 (actually a little more than 20 and a little less than 21) usable bits for extension, over a million character codes total. I don't think this is quite enough, but I am admittedly strange.

Anyway, the use of surrogate pairs will allow predictable parsing, even from the middle or end of a string. Both lead and tail words for the extension codes are now defined to be outside the ranges for one-word codes, so we can start in the middle and go either forward or backward.
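A small sketch in C of what that buys you: any sixteen-bit word can be classified in isolation, and a pair can be decoded without looking at anything else in the string. The surrogate ranges (0xD800-0xDBFF for lead words, 0xDC00-0xDFFF for tail words) are the standard UTF-16 ones; the function names are mine.

    #include <stdint.h>

    /* Classify a single 16-bit unit in isolation. */
    static int is_lead_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
    static int is_tail_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

    /* Combine a lead/tail pair into a code point beyond 0xFFFF.
       Each surrogate contributes 10 bits, for 2^20 extension codes. */
    static uint32_t combine_surrogates(uint16_t lead, uint16_t tail)
    {
        return 0x10000u + (((uint32_t)(lead - 0xD800) << 10) | (tail - 0xDC00));
    }

    /* Because the two ranges are disjoint from each other and from ordinary
       one-word codes, a scanner dropped into the middle of a string can look
       at one unit and know whether to step back one unit, take one unit, or
       take two. */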

But UNICODE text streams contain lots of octets (half words) that cause many standard and de-facto standard analysis algorithms to cough, break, and/or explode.


Once again, how do we avoid the necessity of guessing at character boundaries? If the lead words or bytes are disjoint from the tail words or bytes, and both are disjoint from non-extended codes, the boundaries are immediately apparent. I had originally thought that making the lead and tail word ranges disjoint would be a waste of code space, but working with variable width characters proves otherwise. I have not performed a formal analysis, but the principle of squareness leads me to believe that the most efficient distributions will be derived from the use of the lead bit to distinguish between lead and trailing words. If we want some compatibility with the present 7-bit ASCII/ANSI/ISO code sets (including one-byte NUL characters), the high bit should be clear for trailing bytes, including single byte codes, and it should be set for the lead byte and any middle bytes of multiple-byte codes. The ranges allowed would be as follows:

    one byte:     0x00..0x7F
    two bytes:    0x80..0xFF : 0x00..0x7F
    three bytes:  0x80..0xFF : 0x80..0xFF : 0x00..0x7F
    four bytes:   0x80..0xFF : 0x80..0xFF : 0x80..0xFF : 0x00..0x7F
    and so on, with the high bit set on every byte except the last.
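Here is a minimal sketch in C of how such codes might be packed and found again, assuming the layout just described (seven payload bits per byte, high bit set on every byte except the last). This is only my illustration of the idea, not anything standardized, and the function names are mine.

    #include <stddef.h>
    #include <stdint.h>

    /* Pack a value seven bits at a time: high bit set on every byte
       except the last, so the last byte of a character is always 0x00..0x7F. */
    static size_t put_code(uint32_t value, unsigned char *out)
    {
        unsigned char tmp[8];
        size_t n = 0;

        tmp[n++] = value & 0x7F;              /* final byte, high bit clear */
        value >>= 7;
        while (value != 0) {
            tmp[n++] = 0x80 | (value & 0x7F); /* earlier bytes, high bit set */
            value >>= 7;
        }
        for (size_t i = 0; i < n; i++)        /* emit most significant first */
            out[i] = tmp[n - 1 - i];
        return n;
    }

    /* Finding a boundary from an arbitrary position needs no guessing:
       a character ends at every byte with the high bit clear. */
    static size_t next_char_start(const unsigned char *buf, size_t pos)
    {
        while (buf[pos] & 0x80)
            pos++;                            /* still inside a character */
        return pos + 1;                       /* byte after the final byte */
    }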


An application would necessarily need some freedom in how it responds to excessively long character codes. Unfortunately, giving up in the middle of a code may allow sneaky people to hide potentially dangerous characters, like the characters that delimit parts of file and path names, or the characters that allow inserting shell commands into text that we want to think of as ordinary text.

One thing I would like to do is to map existing code sets into three and four byte ranges. UNICODE, for instance, might be given the ranges

0xA0..0xA7 : 0x80..0xFF : 0x00..0x7F (2^17 codes)

with JIS in the ranges

0xA8..0xA9 : 0x80..0xFF : 0x00..0x7F (2^15 codes)

and the existing Chinese and Korean encodings in the ranges following those.
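As a sketch of what such a mapping might look like, here is one possible packing in C of a seventeen-bit value into the hypothetical 0xA0..0xA7 : 0x80..0xFF : 0x00..0x7F range suggested above for UNICODE. The 3 + 7 + 7 bit split is my assumption; the text above only fixes the byte ranges.

    #include <stdint.h>

    /* Pack a 17-bit value (0 .. 0x1FFFF) into three bytes within the
       hypothetical range 0xA0..0xA7 : 0x80..0xFF : 0x00..0x7F.
       3 bits go in the first byte, 7 in the second, 7 in the last. */
    static void pack_unicode_17(uint32_t v, unsigned char out[3])
    {
        out[0] = 0xA0 | ((v >> 14) & 0x07);  /* 0xA0..0xA7 */
        out[1] = 0x80 | ((v >> 7) & 0x7F);   /* 0x80..0xFF */
        out[2] = v & 0x7F;                   /* 0x00..0x7F, ends the character */
    }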

Another thing that must be done is to define a set of command codes for defining new characters. The definitions of the new characters (and of characters that the recipient might not have fonts for) would be pasted on the front of transmissions/files.

(Added 2 Feb 2000 edited 3-5 June 2000:) Well, now I've seen an encoding method for UNICODE (and perhaps others?) being promoted in the UNIX sector, UTF-8 or something like that, which uses an expandable approach like the one above, but also uses the lead bits of the lead octet to indicate the total number of octets included in the character, and makes lead and tail ranges disjoint. I think it lays out something like this:

THIS IS NOT UTF-8!!! (See below for the real thing.) (My gray matter failed me here.)

As I recall, it's supposed to provide at least thirty-one usable bits of encoding, with 0x00 occurring only as a single-octet NUL, and the particularly problematic 0xFF and 0xFE not occurring at all. I am not sure of the utility of knowing before you scan how far to scan, and I instinctively want to leave lots more room to expand. But this encoding should work for the next ten years or so, at least. Look it up on google.com.

(Added 3 June 2000:) I finally found the time to look it up again (on google; check out the LINUX man page references for conciseness) and put up a real table:

    0x00000000 - 0x0000007F:  0xxxxxxx
    0x00000080 - 0x000007FF:  110xxxxx 10xxxxxx
    0x00000800 - 0x0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
    0x00010000 - 0x001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    0x00200000 - 0x03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    0x04000000 - 0x7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
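For illustration, a compact encoder in C that follows the table above (the original thirty-one-bit scheme, not the later restriction to 0x10FFFF). The function name is mine.

    #include <stdint.h>
    #include <stddef.h>

    /* Encode a value up to 0x7FFFFFFF as UTF-8 per the table above.
       Returns the number of octets written (1..6). */
    static size_t utf8_encode(uint32_t c, unsigned char *out)
    {
        size_t len;
        if      (c < 0x80)       len = 1;
        else if (c < 0x800)      len = 2;
        else if (c < 0x10000)    len = 3;
        else if (c < 0x200000)   len = 4;
        else if (c < 0x4000000)  len = 5;
        else                     len = 6;

        if (len == 1) {
            out[0] = (unsigned char)c;           /* plain 7-bit code */
            return 1;
        }
        /* Fill tail octets (10xxxxxx) from the right, six bits at a time. */
        for (size_t i = len - 1; i > 0; i--) {
            out[i] = 0x80 | (c & 0x3F);
            c >>= 6;
        }
        /* Lead octet: as many leading 1 bits as the total length. */
        out[0] = (unsigned char)(((0xFF00 >> len) & 0xFF) | c);
        return len;
    }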


If you are interested in this kind of stuff, you'll want to look up UCS-2 and UCS-4, and also UTF-7.

A key point of UTF-8 is that the 0xFE and 0xFF octets never occur at all, while 0x00 and all the traditional control codes occur only as their own single-octet codes, never appearing inside the encoding of any other valid character.
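That property is exactly what makes finding a boundary from the middle of a UTF-8 string trivial. A short sketch in C, my own illustration:

    #include <stddef.h>

    /* Continuation (tail) octets all look like 10xxxxxx. */
    static int utf8_tail(unsigned char b)
    {
        return (b & 0xC0) == 0x80;
    }

    /* Step back from an arbitrary offset to the start of the character
       containing it: just skip tail octets.  No state, no guessing,
       no rescanning from the top of the buffer. */
    static size_t utf8_char_start(const unsigned char *buf, size_t pos)
    {
        while (pos > 0 && utf8_tail(buf[pos]))
            pos--;
        return pos;
    }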

UTF-16 requires the system to be rewritten from the ground up with 16-bit characters, and it does not like to have 8-bit characters mixed into things like file names and command streams. It's hard to get a sixteen-bit-clean system up without sixteen-bit-clean tools, and it's hard to make sixteen-bit-clean tools without a sixteen-bit-clean system.

And then there is the confusion of little-endianism. A sixteen-bit character is not a sixteen-bit integer on least-significant-byte-first CPUs.
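For illustration, here is a small sketch in C of the usual consequence: a UTF-16 stream has to declare or reveal its byte order, typically with the byte order mark U+FEFF, because the same sixteen-bit value arrives as two octets in either order. The BOM convention is standard; the function name is mine.

    /* Inspect the first two octets of a UTF-16 stream.
       U+FEFF is the byte order mark; byte-swapped it reads as 0xFFFE,
       which is not a valid character, so the order can be inferred. */
    enum byte_order { ORDER_BIG, ORDER_LITTLE, ORDER_UNKNOWN };

    static enum byte_order utf16_order(const unsigned char *buf)
    {
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return ORDER_BIG;       /* U+FEFF stored big-endian    */
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return ORDER_LITTLE;    /* U+FEFF stored little-endian */
        return ORDER_UNKNOWN;       /* no BOM: order must be agreed elsewhere */
    }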

UTF-8 avoids endianism problems, allows using existing 8 bit text of all sorts, and only breaks a (relatively) small number of the traditional character processing algorithms.

But all of the above is irrelevant. I waved my hands at the real issues above, but then got sidetracked into math games and bit pattern problems instead.

The real problem with CJKV characters is that the industry over here can't see the character processing for the characters. Imagine what computer systems, database operations, applications, etc. would be like if every word in the (say, English) language had its own code. Kanji are not the fundamental characters of any of the CJKV languages.

If we want to do with Japanese or Chinese the things we take for granted with English, we must develop an encoding scheme based on the radical elements, of which there are fewer than 500, maybe even fewer than 100, if we analyze them right.

Also, the unified encoding should not try to cover all the gadgetry of each language. I'm still not convinced that we are well served by a unified encoding. It seems to me to make more sense to keep the traditional code sets, let the people who use them the most control them, and simply define a common interface across which we can send renderings of unusual included characters in headers to files traveling across language boundaries. It seems complicated, but when you look at the complexity of UNICODE, maybe it isn't really so complicated after all.




^v:charcode c00.00.05e 2000.06.06