A Plan for Character Typing for Kanji
(JIS Family Encodings)
Copyright 2000, 2001 Joel Matthew Rees
Takino-cho, Kato-gun, Hyogo-ken, Japan

See the include file for the actual specs. The slow version of the library is almost through the test phase. Check the test project description for information on building. Check the testbed source to find out how far I've got in testing.

実際の 定義を 見るには その インクルード ファイルを 御覧ください。 ビルドの 説明を テスト プロジェクトの 描写文に 記しました。 ライブラリの 最適化 されていない スローバージョンが 只今 テスト中です。 テスト プログラムの ソースを 見ると、 テストが どこまで すすんでいるかが わかると 思います。

The reasons for building this library are almost no longer valid. I am leaving these superseded specs here for historical purposes.

この ライブラリを 作成する 必要が もうほとんど なくなった 現在です。 定義の 古いバージョンの このファイルを 歴史的な 目的に ここに 置いておきます。



The ctype functions are perhaps the cornerstone of C's standard libraries. It is the ability to analyze text concisely that makes C a complete tool for developing applications. Unfortunately, these functions are not yet available for kanji, or at least not widely available. 個人の 意見ですが、 多分、 ctype の 関数が C言語の 標準ライブラリの 礎石機能なの でしょう。 C言語が アプリケーションの 開発に 適していますが、 簡潔に 文書の 文字を 解析する 表現が できるため 機能が 揃っています。 残念ながら、 漢字用の ctype ライブラリが 未だ できていません。 ある物は 在りますが、 一般的に は手に 入りません。
(Microsoft almost included the bare minimum necessary for business apps in Visual Studio 6. I haven't yet had a chance to look at Apple's text analysis engine that is being built in connection with their move to FreeBSD, but it looks kind of complicated from the READMEs.) (Microsoft社が 業務用の アプリケーションに 必要だけの 最低限の ほとんどを ビジュアルスタジオ6に 提供して 下さったのです。 アップル社の フリーBSDへの 移動の 関連に 開発を 進ませている 「text analysis engine」を 吟味する 機会が まだ できていませんが、 READMEファイルを 読んだら 少々複雑に 見えます。)
The sheer number of kanji is one of the impediments. A couple of hundred characters can be characterized, classified, and typed in a few man-months of research. Six thousand characters are going to require a few man-years, and the product will need to be used several years to establish a moderate level of confidence as a tool. The real number of kanji is in the range of sixty thousand, which pushes the end of the project beyond the fiscal edge. Now I am asserting elsewhere that kanji are even less a closed set than Latin-whatever-the-ISO. Be warned.
In truth, the ctype libraries are only a close approximation. A similar kanji typing library would then appear to be an order of magnitude or so less close. Not useless, but less "correct".
For a variety of reasons, some more of which I ought to outline sometime, Japanese programmers seem in general to try to avoid text analysis. When dealing with things like numeric input and e-mail addresses, they like to use the Latin (Roman over here) encoding. A foreigner who suggests such things as using kanji for variable names in programs is begging to be treated as a novice or a crazy.
Hey! that's me! I don't deny it. I lack experience, and I live just a wee bit over what most people seem to consider to be the edge.
So, this page is an outline of the approach I want to take to write a jtype library.
First, the time problem: I don't have time to consider every character individually. So I am going to implement simple range tests. I will take the usual two step approach, using the range test version in a loop to generate the faster bit table version. That means there will probably be some special cases I miss. As I find them, I can add exceptions to the range tests and re-build the tables. I hope.
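A minimal sketch of that two-step approach, restricted to single bytes to keep it short. The names slowIsDigit1(), fastIsDigit1(), buildTypeTable(), typeTable, and kDigitBit are hypothetical, made up for this illustration, and are not taken from the include file.

```c
#include <limits.h>

#define kDigitBit 0x01u  /* hypothetical type bit, for illustration only */

/* Step one: the slow range test (hypothetical, one byte digits only). */
static int slowIsDigit1(unsigned char c)
{
    return c >= 0x30 && c <= 0x39;
}

/* One entry of type bits per possible byte value. */
static unsigned char typeTable[UCHAR_MAX + 1];

/* Step two: run the slow test over every byte value to fill the table.
   Adding an exception to the range test and re-running this rebuilds it. */
static void buildTypeTable(void)
{
    unsigned int c;
    for (c = 0; c <= UCHAR_MAX; ++c) {
        typeTable[c] = slowIsDigit1((unsigned char)c) ? kDigitBit : 0;
    }
}

/* The fast version is then a table lookup and a bit-and. */
static int fastIsDigit1(unsigned char c)
{
    return (typeTable[c] & kDigitBit) != 0;
}
```

A real table for two byte characters would need an index scheme for lead and tail bytes, which this sketch does not attempt.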
Of course, since this is my spare time, I may never finish the first step. I tried to do this from a slightly different approach at work once over a year and a half back, but my boss stopped me. When I asked permission to publish on the web what I had produced to that point (the first step slow version), he never got back to me. But he says I can publish my own extra-curricular work as much as I please, so, if he will ever leave me any spare time again, I will build what I describe below, and upload it.
Also, I don't really have the tools or the time for dealing with the auxiliary set established in 1990. So, if I ever finish this project, it will only deal with the 6,355 or so kanji specified in JIS Levels 1 and 2. Since my old Mac is shift-JIS, that will be the initial encoding supported. If I ever make it to EUC and JIS, I will be flying blind (unless I get a job that pays me enough that I can buy some new boxes to play with BSD, Linux, and/or Solaris). UTF-8 (am I remembering the acronym right?) Unicode is a whole 'nuther ball of wax, for which I have no plans as yet.
Second, the typedef for the characters: This is the tough one. JIS EUC and shift-JIS are both variable width. JIS seems to be constant width until you remember that shifting to Latin doesn't really take you out of the Japanese locale. We have a lot of data in variable width characters. Either we convert to a constant width encoding before analyzing, or we build the typing functions to work with variable width characters. From what I've seen of conversions, I tend to think that working with variable width characters will be more reliable.
(I'm talking through my hat here. I just don't want to take the time to develop an argument for something I take for granted.)
This means that my jtype library will depart from the design of the ctype libraries, in that there is no way to directly return variable width characters in standard C. (Just try making a buffer full of something like
struct { unsigned char width; unsigned char ch[ 3 ]; }.)
I'll have to pass char pointers around, and remember that they point to bytes, not characters. Where I must return characters, I must add a parameter for the return buffer, and code very defensively. (And I am going to refuse to compile if CHAR_BIT is not 8, at least in the first versions.)
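A minimal sketch of what that defensive style might look like. The helper copyMbc() and its outSize parameter are assumptions made for this illustration; the function headers in this plan take only the input and output pointers.

```c
#include <limits.h>
#include <stddef.h>

/* Refuse to compile unless char is exactly eight bits. */
#if CHAR_BIT != 8
#error "this code assumes that char is exactly eight bits wide"
#endif

/* Hypothetical helper: copy one character of the given byte width into
   the caller's return buffer. Returns the number of bytes written, or
   zero if the arguments are unusable. The outSize parameter is an
   assumption of this sketch, not part of the planned headers. */
static int copyMbc(unsigned char *mbcin, int width,
                   unsigned char *mbcout, size_t outSize)
{
    int i;
    if (mbcin == NULL || mbcout == NULL
        || width < 1 || (size_t)width > outSize) {
        return 0;
    }
    for (i = 0; i < width; ++i) {
        mbcout[i] = mbcin[i];
    }
    return width;
}
```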
Returning a byte count instead of the result of the bit-and function (the usual approach for the ctype macros) will get in the way of optimizing complex tests, so, to help optimization, I will support an additional call syntax that makes direct use of the type bits.
Initially, I will build the functionality provided by ctype, but eventually I want to add conversion and analysis specific to JIS -- isRoman(), isGreek(), isKana(), toKata(), toHira(), etc., and isNumberKanji() and the like. So I need to be careful to allow adding other range tests, and allow adding other type bits to the type tables.
Note, please, that this is at significant variance with the approaches taken by the Japanese Linux community (<linux.or.jp>), among others. Look their approaches up with such keywords as: "JIS ctype" and "EUC ctype".
Most of the functions will return a byte count, or zero if not the queried type. Note that I am not using octets, just specifying that the char type must be eight bits.
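A minimal sketch of how a caller might use the byte count convention to step through a buffer. Both toyIsSpace() and spanOfType() are hypothetical names for this illustration; toyIsSpace() stands in for any of the tests and only knows the one byte ASCII space.

```c
#include <stddef.h>

/* Toy stand-in for one of the tests: one byte ASCII space only.
   Returns the byte count (1) on a match, zero otherwise. */
static int toyIsSpace(unsigned char *mbc)
{
    return (*mbc == 0x20) ? 1 : 0;
}

/* Count the bytes in the leading run of characters accepted by test.
   The byte count returned by the test is what advances the position,
   so one and two byte characters are handled by the same loop. */
static size_t spanOfType(unsigned char *s, int (*test)(unsigned char *))
{
    size_t n = 0;
    int width;
    while (s[n] != '\0' && (width = test(s + n)) > 0) {
        n += (size_t)width;
    }
    return n;
}
```

Because zero doubles as "not this type", the loop needs no separate boolean result.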

So, the initial, standard function headers:

int sjIsCntrl( unsigned char * mbc )
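As near as I can tell, all one byte, between 0 and 0x1f, inclusive. (The ANSI/ISO ctype iscntrl() also counts DEL, 0x7f, as a control character in the "C" locale, so that byte may belong here too.) Returns byte count.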

int sjIsSpace( unsigned char * mbc )
Adds one two byte version of the space character. Returns byte count.

int sjIsPrint( unsigned char * mbc )
All graphic characters, including non-control space characters. Returns byte count.

int sjIsGraph( unsigned char * mbc )
All graphic non-space characters. Returns byte count.

int sjIsPunct( unsigned char * mbc )
All non-word-forming characters. Will later be subdivided for the richer JIS set. Returns byte count.

int sjIsDigit( unsigned char * mbc )
The standard digits 0..9, as specified in ANSI/ISO ctype. Includes both one and two byte digits. Does not include kanji numbers. Returns byte count.
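A minimal sketch of what the slow, range-test version of this one might look like for shift-JIS, assuming the buffer is NUL terminated so that the second byte can be read safely after a lead byte. The body is an illustration, not the code in the library.

```c
/* Slow, range-test sketch of sjIsDigit() for shift-JIS.
   Returns the byte count of the digit at mbc, or zero if it is not
   one of the standard digits. */
int sjIsDigit(unsigned char *mbc)
{
    /* One byte digits, 0x30..0x39. */
    if (mbc[0] >= 0x30 && mbc[0] <= 0x39) {
        return 1;
    }
    /* Two byte (full width) digits, 0x824F..0x8258 in shift-JIS. */
    if (mbc[0] == 0x82 && mbc[1] >= 0x4F && mbc[1] <= 0x58) {
        return 2;
    }
    return 0;
}
```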

int sjIsXDigit( unsigned char * mbc )
The standard hexadecimal digits specified in ANSI/ISO ctype. Includes both one and two byte digits. Does not include kanji numbers. Returns byte count.

int sjIsAlpha( unsigned char * mbc )
Characters used to form words, as used by non-programmers. Does not include the standard decimal digits, but does include the kanji numbers. Includes a lot of caseless characters, of course. Returns byte count.

int sjIsAlNum( unsigned char * mbc )
Characters used to form words, as used by programmers, thus including digits. Returns byte count.

int sjIsUpper( unsigned char * mbc )
Upper cased characters, includes 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count.

int sjIsLower( unsigned char * mbc )
Lower cased characters, includes 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count.

int sjToLower( unsigned char * mbcin, unsigned char * mbcout )
Converts cased word forming characters to lower case, including 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count converted or zero.

int sjToUpper( unsigned char * mbcin, unsigned char * mbcout )
Converts cased word forming characters to upper case, including 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count converted or zero.
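A minimal sketch of the upper case conversion for shift-JIS, covering only the one byte Latin letters and the two byte JIS-Roman letters. Greek and Russian are left out to keep it short. Returning zero and leaving the output buffer untouched when nothing is converted is an assumption of this sketch.

```c
/* Sketch of sjToUpper() for one byte Latin and two byte JIS-Roman only.
   Returns the byte count converted, or zero if the character at mbcin
   is not a lower case letter this sketch knows about. */
int sjToUpper(unsigned char *mbcin, unsigned char *mbcout)
{
    /* One byte lower case a..z, 0x61..0x7A. */
    if (mbcin[0] >= 0x61 && mbcin[0] <= 0x7A) {
        mbcout[0] = (unsigned char)(mbcin[0] - 0x20);
        return 1;
    }
    /* Two byte lower case is 0x8281..0x829A, and the matching upper
       case is 0x8260..0x8279, so the tail bytes differ by 0x21. */
    if (mbcin[0] == 0x82 && mbcin[1] >= 0x81 && mbcin[1] <= 0x9A) {
        mbcout[0] = 0x82;
        mbcout[1] = (unsigned char)(mbcin[1] - 0x21);
        return 2;
    }
    return 0;
}
```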

Now, some of the foreseeable necessary extensions:

int sjIsMath( unsigned char * mbc )
The plethora of math and logic symbols in JIS. Returns byte count.

int sjIsUnit( unsigned char * mbc )
The plethora of unit symbols in JIS, but not system specific extensions like m². Does not include kanji. Returns byte count.

int sjIsQuote( unsigned char * mbc )
The plethora of quoting and parenthetic characters in JIS. Returns byte count.

int sjIsKanji( unsigned char * mbc )
All the proper kanji characters. Returns byte count.

int isNumberKanji( unsigned char * mbc )
All the number kanji, including the special ones used, for example, on currency and bank notes. Returns byte count.

int sjIsKana( unsigned char * mbc )
All the katakana and hiragana characters, including the one byte katakana. Also including the free-standing voicing and plosive symbols, dakuten and handakuten. Returns byte count.

int sjIsKata( unsigned char * mbc )
All the katakana, including the SJIS one byte katakana, but not the free-standing voicing and plosive symbols, dakuten and handakuten. Returns byte count.

int sjIsHira( unsigned char * mbc )
All the hiragana, not including the free-standing voicing and plosive symbols, dakuten and handakuten. Returns byte count.

int sjToKata( unsigned char * mbcin, unsigned char * mbcout )
Converts hiragana to katakana. Returns byte count converted or zero.
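A minimal sketch of the hiragana to katakana conversion for shift-JIS. It relies on the two sets being laid out in the same order, with hiragana at 0x829F..0x82F1 and katakana starting at 0x8340, and on 0x7F never being used as a tail byte. The body is an illustration, not the code in the library.

```c
/* Sketch of sjToKata() for shift-JIS.
   Returns 2 when a hiragana was converted, zero otherwise. */
int sjToKata(unsigned char *mbcin, unsigned char *mbcout)
{
    unsigned int tail;

    /* Hiragana occupy 0x829F..0x82F1. */
    if (mbcin[0] != 0x82 || mbcin[1] < 0x9F || mbcin[1] > 0xF1) {
        return 0;
    }
    /* Katakana start at 0x8340 and follow the same order. */
    tail = 0x40u + (mbcin[1] - 0x9Fu);
    /* The katakana row skips the tail byte 0x7F. */
    if (tail >= 0x7Fu) {
        ++tail;
    }
    mbcout[0] = 0x83;
    mbcout[1] = (unsigned char)tail;
    return 2;
}
```

The skipped 0x7F is the kind of special case that a plain offset calculation misses.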

int sjToHira( unsigned char * mbcin, unsigned char * mbcout )
Converts katakana to hiragana, where possible. Copies the unconvertible katakana as they are. Does not convert the one byte katakana. Returns byte count converted or zero.

int sjTo16Kata( unsigned char * mbcin, unsigned char * mbcout )
Converts the one byte katakana to two byte katakana. Round trip sjTo16Kata() -> sjTo8Kata() should be guaranteeable. Returns byte count converted or zero.

int sjTo8Kata( unsigned char * mbcin, unsigned char * mbcout )
Converts two byte katakana to one byte katakana, where possible. Round trip sjTo8Kata() -> sjTo16Kata() may be guaranteeable, I'm not sure yet. Returns byte count converted or zero.

Some of the hypothetical extensions:

int sjIsMusic( unsigned char * mbc )
The music symbols in JIS. Returns byte count.

int sjIsKanjiUnit( unsigned char * mbc )
The kanji version of units, including also ten, hundred, thousand, ten-thousand, etc. Returns byte count.

int sjIsRoman( unsigned char * mbc )
All the JIS Roman (two byte Latin) characters. Returns byte count.

int sjIsGreek( unsigned char * mbc )
All the JIS Greek characters. Returns byte count.

int sjIsRussian( unsigned char * mbc )
All the JIS Russian characters. Returns byte count.

int sjIsLatin( unsigned char * mbc )
All the Latin characters, including the two byte Roman (Latin) and one byte Latin. Returns byte count.

int sjIs1Byte( unsigned char * mbc )
Valid one byte character. Returns byte count.

int sjIs2Byte( unsigned char * mbc )
Valid two byte character? Returns byte count.

int jOughtToBe2Byte( unsigned char * mbc )
A combination of valid lead byte and valid tail byte? Returns byte count.
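A minimal sketch of the lead byte and tail byte test for shift-JIS. The helpers isLeadByte() and isTailByte() are hypothetical names. The lead byte ranges used are 0x81..0x9F and 0xE0..0xEF; vendor extensions that use lead bytes up to 0xFC are deliberately left out, which is an assumption of this sketch.

```c
/* Hypothetical helper: shift-JIS lead bytes for the JIS two byte set. */
static int isLeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF);
}

/* Hypothetical helper: shift-JIS tail bytes, 0x40..0xFC except 0x7F. */
static int isTailByte(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

/* Returns 2 if mbc points at a valid lead byte followed by a valid
   tail byte, zero otherwise. */
int jOughtToBe2Byte(unsigned char *mbc)
{
    if (isLeadByte(mbc[0]) && isTailByte(mbc[1])) {
        return 2;
    }
    return 0;
}
```

This only checks the byte ranges. Whether the pair is actually assigned to a character is a separate question that this sketch does not answer.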

int sjToRoman( unsigned char * mbcin, unsigned char * mbcout )
Convert one byte Latin to two byte JIS Roman (Latin). Returns byte count converted or zero.

int sjToLatin( unsigned char * mbcin, unsigned char * mbcout )
Convert two byte JIS Roman (Latin) to one byte Latin. Returns byte count converted or zero.

The second, or fast, version of the sjIsXX() functions will use constants of the pattern sjIsXX_k. The constants and the general call will also be provided in the source header, as mentioned above, for optimization:

int sjIsCType( unsigned long type, unsigned char * mbc )
Test the type formed by the bit-or of the type constants passed as the first parameter. Returns byte count on test true or zero on test false.
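A minimal sketch of the general call, covering only digits and Latin letters. Only the sjIsXX_k naming pattern comes from this plan. The constant values, the helper typeBitsOf(), and the reading that the test is true when the character has any one of the requested types are all assumptions of this illustration.

```c
/* Hypothetical values; only the sjIsXX_k naming pattern is from the plan. */
#define sjIsDigit_k 0x01UL
#define sjIsUpper_k 0x02UL
#define sjIsLower_k 0x04UL

/* Hypothetical helper: classify the character at mbc by range tests and
   store its byte width. A fast version would use a table lookup here. */
static unsigned long typeBitsOf(unsigned char *mbc, int *width)
{
    unsigned char c = mbc[0];

    *width = 1;
    if (c >= 0x30 && c <= 0x39) return sjIsDigit_k;
    if (c >= 0x41 && c <= 0x5A) return sjIsUpper_k;
    if (c >= 0x61 && c <= 0x7A) return sjIsLower_k;
    if (c == 0x82) {
        unsigned char t = mbc[1];
        *width = 2;
        if (t >= 0x4F && t <= 0x58) return sjIsDigit_k;
        if (t >= 0x60 && t <= 0x79) return sjIsUpper_k;
        if (t >= 0x81 && t <= 0x9A) return sjIsLower_k;
    }
    return 0;
}

/* Returns the byte count if the character has any of the requested
   type bits, zero otherwise. */
int sjIsCType(unsigned long type, unsigned char *mbc)
{
    int width;
    unsigned long bits = typeBitsOf(mbc, &width);
    return (bits & type) != 0 ? width : 0;
}
```

A combined test such as sjIsCType( sjIsUpper_k | sjIsLower_k, p ) then costs one classification instead of two separate calls.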

The initial slow version functions will have names of the pattern slowsjIsXX() so they can co-exist during debugging.





^v:jtype c00.00.04e 2001.09.19