| 1 | Mapping files for Japanese encodings |
| 2 | |
| 3 | 1998 12/25 |
| 4 | |
| 5 | Fuji Xerox Information Systems |
| 6 | MURATA Makoto |
| 7 | |
| 8 | 1. Overview |
| 9 | |
| 10 | This version of XML::Parser and XML::Encoding does not come with map files for |
| 11 | the charset "Shift_JIS" and the charset "euc-jp". Unfortunately, each of these |
| 12 | charsets has more than one mapping to Unicode, and none of these mappings is |
| 13 | considered authoritative. |
| 14 | |
| 15 | Therefore, we have come to believe that it is dangerous to provide map files |
| 16 | for these charsets. Rather, we introduce several private charsets and map |
| 17 | files for these private charsets. If IANA, the Unicode Consortium, and JIS |
| 18 | eventually reach a consensus, we will be able to provide map files for |
| 19 | "Shift_JIS" and "euc-jp". |
| 20 | |
| 21 | 2. Different mappings from existing charsets to Unicode |
| 22 | |
| 23 | 1) Different mappings in JIS X0221 and Unicode |
| 24 | |
| 25 | The mapping between JIS X0208:1990 and Unicode 1.1 and the mapping |
| 26 | between JIS X0212:1990 and Unicode 1.1 are published by the Unicode |
| 27 | Consortium. They are available at |
| 28 | ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT and |
| 29 | ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0212.TXT, |
| 30 | respectively. These mapping files carry the following note: |
| 31 | |
| 32 | # The kanji mappings are a normative part of ISO/IEC 10646. The |
| 33 | # non-kanji mappings are provisional, pending definition of |
| 34 | # official mappings by Japanese standards bodies. |
| 35 | |
| 36 | Unfortunately, the non-kanji mappings in the Japanese standard for ISO/IEC 10646-1, |
| 37 | namely JIS X 0221:1995, differ from the Unicode Consortium mapping in that |
| 38 | 0x213D of JIS X 0208 is mapped to U+2014 (em dash) rather than U+2015 |
| 39 | (horizontal bar). Furthermore, JIS X 0221 clearly says that the mapping is |
| 40 | informational and non-normative. As a result, some companies (e.g., Microsoft and |
| 41 | Apple) have introduced slightly different mappings. Therefore, neither the |
| 42 | Unicode Consortium mapping nor the JIS X 0221 mapping is considered |
| 43 | authoritative. |
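
To make the disagreement concrete, the following minimal Python sketch records
the two candidate targets for 0x213D and prints their Unicode character names.
The dictionary and function names are hypothetical and chosen only for
illustration; only the code points come from the text above.

```python
import unicodedata

# Candidate Unicode targets for JIS X 0208 code 0x213D, as described above.
# The keys are illustrative labels, not registered charset names.
CANDIDATES_213D = {
    "unicode-consortium": 0x2015,  # JIS0208.TXT
    "jisx0221": 0x2014,            # JIS X 0221:1995 (informative)
}

def describe(source):
    """Return 'U+XXXX NAME' for the target that `source` assigns to 0x213D."""
    cp = CANDIDATES_213D[source]
    return "U+%04X %s" % (cp, unicodedata.name(chr(cp)))

for source in CANDIDATES_213D:
    print(source, "->", describe(source))
```

The two characters look nearly identical in most fonts, which is part of why
the discrepancy is easy to overlook.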
| 44 | |
| 45 | 2) Shift-JIS |
| 46 | |
| 47 | This charset is especially problematic, because its definition has been unclear |
| 48 | since its inception. |
| 49 | |
| 50 | The current registration of the charset "Shift_JIS" is as follows: |
| 51 | |
| 52 | >Name: Shift_JIS (preferred MIME name) |
| 53 | >MIBenum: 17 |
| 54 | >Source: A Microsoft code that extends csHalfWidthKatakana to include |
| 55 | > kanji by adding a second byte when the value of the first |
| 56 | > byte is in the ranges 81-9F or E0-EF. |
| 57 | >Alias: MS_Kanji |
| 58 | >Alias: csShiftJIS |
| 59 | |
| 60 | First, this registration does not reference the mapping "Shift-JIS to Unicode" |
| 61 | published by the Unicode Consortium (available at |
| 62 | ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/SHIFTJIS.TXT). |
| 63 | |
| 64 | Second, "kanji" in this registration can be interpreted in different ways. |
| 65 | Does this "kanji" refer to JIS X0208:1978, JIS X0208:1983, or JIS |
| 66 | X0208:1990 (== JIS X0208:1997)? These three standards are *incompatible* with |
| 67 | each other. Moreover, one could even argue that "kanji" refers to JIS X0212 or |
| 68 | to ideographic characters used in other countries. |
| 69 | |
| 70 | Third, each company has extended Shift JIS. For example, Microsoft introduced |
| 71 | OEM extensions (NEC extensions and IBM extensions). |
| 72 | |
| 73 | Fourth, Shift JIS uses JIS X0201, which is almost, but not quite, a superset of |
| 74 | US-ASCII: 5C and 7E of JIS X 0201 are the yen sign and the overline rather than |
| 75 | backslash and tilde, respectively. However, many programming languages (e.g., Java) |
| 76 | ignore this difference and assume that 5C and 7E of Shift JIS are backslash |
| 77 | and tilde. |
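
The byte-level rule in the registration and the 5C/7E ambiguity can be sketched
as follows. This is a minimal illustration in Python, not a decoder: the
function and table names are hypothetical, the code only splits a byte string
into one-byte and two-byte units using the lead-byte ranges quoted in the
registration, and it does not validate trail bytes. The JIS X 0201 reading of
5C and 7E is taken to be the yen sign (U+00A5) and the overline (U+203E).

```python
def is_lead_byte(b):
    """True if b starts a two-byte character per the registration (81-9F, E0-EF)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF

def split_units(data):
    """Split Shift JIS bytes into 1- and 2-byte units (trail bytes unchecked)."""
    units, i = [], 0
    while i < len(data):
        n = 2 if is_lead_byte(data[i]) and i + 1 < len(data) else 1
        units.append(data[i:i + n])
        i += n
    return units

# The two readings of the single bytes 5C and 7E discussed above.
JISX0201_READING = {0x5C: 0x00A5, 0x7E: 0x203E}  # yen sign, overline
ASCII_READING = {0x5C: 0x005C, 0x7E: 0x007E}     # backslash, tilde

# 0x81 0x5C is one two-byte character, so its trail byte 5C is not a
# single-byte 5C; only the final byte is subject to the two readings.
print(split_units(b"A\x81\x5c\x5c"))
```

Note that 5C can also occur as a trail byte, so software that treats every 5C
as a backslash without tracking lead bytes will misread two-byte characters.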
| 78 | |
| 79 | |
| 80 | 3. Proposed charsets and mappings |
| 81 | |
| 82 | As a tentative solution, we introduce two private charsets for EUC-JP and four |
| 83 | private charsets for Shift JIS. |
| 84 | |
| 85 | 1) EUC-JP |
| 86 | |
| 87 | We have two charsets, namely "x-eucjp-unicode" and "x-eucjp-jisx0221". They |
| 88 | differ in only one code point. The mapping for the former is based |
| 89 | on the Unicode Consortium mapping, while the latter is based on the JIS X0221 |
| 90 | mapping. |
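
As a rough illustration of how small the difference is, the sketch below derives
the EUC-JP bytes for a JIS X 0208 code by setting the high bit of each byte, and
looks up the one code point on which the two charsets disagree. It assumes that
this code point is 0x213D, as described in section 2. The function and table
names are hypothetical, and the table holds only this single entry, not a full
mapping.

```python
def jis0208_to_eucjp(code):
    """EUC-JP bytes for a two-byte JIS X 0208 code (high bit set on each byte)."""
    return bytes([(code >> 8) | 0x80, (code & 0xFF) | 0x80])

# The EUC-JP sequence for JIS X 0208 code 0x213D.
DASH = jis0208_to_eucjp(0x213D)

# Illustrative one-entry tables; a real map file covers every code point.
TARGET = {
    "x-eucjp-unicode": {DASH: 0x2015},   # horizontal bar
    "x-eucjp-jisx0221": {DASH: 0x2014},  # em dash
}

for charset, table in TARGET.items():
    print(charset, DASH.hex(), "U+%04X" % table[DASH])
```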
| 91 | |
| 92 | 2) Shift JIS |
| 93 | |
| 94 | We have four charsets, namely x-sjis-unicode, x-sjis-jisx0221, |
| 95 | x-sjis-jdk117, and x-sjis-cp932. |
| 96 | |
| 97 | The mapping for the charset x-sjis-unicode is the one published by the Unicode |
| 98 | Consortium. The mapping for x-sjis-jisx0221 is almost identical to that of |
| 99 | x-sjis-unicode, but 0x213D of JIS X 0208 is mapped to U+2014 (em dash) rather |
| 100 | than U+2015. The mapping for x-sjis-jdk117 is again almost identical to that of |
| 101 | x-sjis-unicode, but 0x5C and 0x7E of JIS X0201 are mapped to backslash and |
| 102 | tilde. |
| 103 | |
| 104 | The charset x-sjis-cp932 is used by Microsoft Windows, and its mapping is |
| 105 | published by the Unicode Consortium (available at: |
| 106 | ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.txt). The |
| 107 | coded character set for this charset includes NEC-extensions and |
| 108 | IBM-extensions. 0x5C and 0x7E of JIS X0201 are mapped to backslash and tilde; |
| 109 | 0x213D is mapped to U+2015; and 0x2140, 0x2141, 0x2142, and 0x215D of JIS X |
| 110 | 0208 are mapped to compatibility characters. |
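
The differences among the four charsets that are spelled out above can be
summarized as overrides on top of x-sjis-unicode. The Python sketch below is
illustrative only: the helper names are hypothetical, the base values for 0x5C
and 0x7E (yen sign and overline) are assumed from JIS X 0201, and the NEC/IBM
extensions and the compatibility-character mappings of x-sjis-cp932 are
deliberately left out.

```python
def jis0208_to_sjis(code):
    """Shift JIS bytes for a two-byte JIS X 0208 code (usual row/cell arithmetic)."""
    j1, j2 = code >> 8, code & 0xFF
    s1 = ((j1 + 1) >> 1) + (0x70 if j1 < 0x5F else 0xB0)
    if j1 & 1:
        s2 = j2 + 0x1F + (1 if j2 >= 0x60 else 0)
    else:
        s2 = j2 + 0x7E
    return bytes([s1, s2])

# Shift JIS form of JIS X 0208 code 0x213D.
DASH = jis0208_to_sjis(0x213D)

# Base entries follow x-sjis-unicode; each charset lists only its deviations.
BASE = {b"\x5c": 0x00A5, b"\x7e": 0x203E, DASH: 0x2015}
OVERRIDES = {
    "x-sjis-unicode": {},
    "x-sjis-jisx0221": {DASH: 0x2014},
    "x-sjis-jdk117": {b"\x5c": 0x005C, b"\x7e": 0x007E},
    "x-sjis-cp932": {b"\x5c": 0x005C, b"\x7e": 0x007E},  # extensions omitted
}

def lookup(charset, unit):
    """Unicode code point for one Shift JIS unit under the given charset."""
    return {**BASE, **OVERRIDES[charset]}[unit]

for name in OVERRIDES:
    print(name, ["U+%04X" % lookup(name, u) for u in (b"\x5c", b"\x7e", DASH)])
```

Within this three-entry excerpt x-sjis-jdk117 and x-sjis-cp932 look the same;
they differ in the parts of the mapping that the sketch omits.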
| 111 | |
| 112 | Makoto |
| 113 | |
| 114 | Fuji Xerox Information Systems |
| 115 | |
| 116 | Tel: +81-44-812-7230 Fax: +81-44-812-7231 |
| 117 | E-mail: murata@apsdc.ksp.fujixerox.co.jp |