Commit | Line | Data |
---|---|---|
920dae64 AT |
1 | Mapping files for Japanese encodings |
2 | ||
3 | 1998 12/25 | |
4 | ||
5 | Fuji Xerox Information Systems | |
6 | MURATA Makoto | |
7 | ||
8 | 1. Overview | |
9 | ||
10 | This version of XML::Parser and XML::Encoding does not come with map files for | |
11 | the charset "Shift_JIS" and the charset "euc-jp". Unfortunately, each of these | |
12 | charsets has more than one mapping to Unicode, and none of these mappings is | |
13 | considered authoritative. | |
14 | ||
15 | Therefore, we have come to believe that it is dangerous to provide map files | |
16 | for these charsets. Rather, we introduce several private charsets and map | |
17 | files for these private charsets. If IANA, the Unicode Consortium, and JIS | |
18 | eventually reach a consensus, we will be able to provide map files for | |
19 | "Shift_JIS" and "euc-jp". | |
20 | ||
21 | 2. Different mappings from existing charsets to Unicode | |
22 | ||
23 | 1) Different mappings in JIS X0221 and Unicode | |
24 | ||
25 | The mapping between JIS X0208:1990 and Unicode 1.1 and the mapping | |
26 | between JIS X0212:1990 and Unicode 1.1 are published by the Unicode | |
27 | Consortium. (They are available at | |
28 | ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT and | |
29 | ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0212.TXT, | |
30 | respectively.) These mapping files carry the following note: | |
31 | ||
32 | # The kanji mappings are a normative part of ISO/IEC 10646. The | |
33 | # non-kanji mappings are provisional, pending definition of | |
34 | # official mappings by Japanese standards bodies. | |
35 | ||
36 | Unfortunately, the non-kanji mappings in the Japanese standard for ISO 10646/1, | |
37 | namely JIS X 0221:1995, differ from the Unicode Consortium mapping: in JIS X 0221, | |
38 | 0x213D of JIS X 0208 is mapped to U+2014 (em dash) rather than U+2015 | |
39 | (horizontal bar). Furthermore, JIS X 0221 clearly says that the mapping is | |
40 | informative and non-normative. As a result, some companies (e.g., Microsoft and | |
41 | Apple) have introduced slightly different mappings. Therefore, neither the | |
42 | Unicode Consortium mapping nor the JIS X 0221 mapping is considered | |
43 | authoritative. | |
44 | ||
45 | 2) Shift-JIS | |
46 | ||
47 | This charset is especially problematic, since its definition has been unclear | |
48 | since its inception. | |
49 | ||
50 | The current registration of the charset "Shift_JIS" is as below: | |
51 | ||
52 | >Name: Shift_JIS (preferred MIME name) | |
53 | >MIBenum: 17 | |
54 | >Source: A Microsoft code that extends csHalfWidthKatakana to include | |
55 | > kanji by adding a second byte when the value of the first | |
56 | > byte is in the ranges 81-9F or E0-EF. | |
57 | >Alias: MS_Kanji | |
58 | >Alias: csShiftJIS | |
59 | ||
60 | First, this registration does not reference the mapping "Shift-JIS to Unicode" | |
61 | published by the Unicode Consortium (available at | |
62 | ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/SHIFTJIS.TXT). | |
63 | ||
64 | Second, "kanji" in this registration can be interpreted in different ways. | |
65 | Does this "kanji" refer to JIS X0208:1978, JIS X0208:1983, or JIS | |
66 | X0208:1990 (== JIS X0208:1997)? These three standards are *incompatible* with | |
67 | each other. Moreover, one could even argue that "kanji" refers to JIS X0212 or | |
68 | to ideographic characters used in other countries. | |
69 | ||
70 | Third, individual companies have extended Shift JIS. For example, Microsoft introduced | |
71 | OEM extensions (NEC extensions and IBM extensions). | |
72 | ||
73 | Fourth, Shift JIS uses JIS X0201, which is almost, but not quite, upward | |
74 | compatible with US-ASCII: 5C and 7E of JIS X 0201 are different from backslash and | |
75 | tilde, respectively. However, many programming languages (e.g., Java) | |
76 | ignore this difference and assume that 5C and 7E of Shift JIS are backslash | |
77 | and tilde. | |
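
The lead-byte ranges quoted in the registration, and the two readings of 5C and 7E, can be sketched in Python as follows. This is a minimal illustration and not part of XML::Parser or XML::Encoding. The names `is_lead_byte`, `split_units`, and `decode_single_byte` are hypothetical. The JIS X 0201 readings of 5C and 7E as YEN SIGN (U+00A5) and OVERLINE (U+203E) are an assumption taken from the Unicode Consortium tables, since this note only says that they differ from backslash and tilde.

```python
from typing import List


def is_lead_byte(b: int) -> bool:
    """True if b opens a two-byte character (ranges from the registration text)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def split_units(data: bytes) -> List[bytes]:
    """Split a Shift JIS byte string into one-byte and two-byte units."""
    units = []
    i = 0
    while i < len(data):
        width = 2 if is_lead_byte(data[i]) else 1
        if i + width > len(data):
            raise ValueError("truncated two-byte character at offset %d" % i)
        units.append(data[i:i + width])
        i += width
    return units


# Two readings of the same single bytes (assumed values, see the lead-in).
JISX0201_READING = {0x5C: "\u00A5", 0x7E: "\u203E"}  # yen sign, overline
ASCII_READING = {0x5C: "\\", 0x7E: "~"}              # backslash, tilde


def decode_single_byte(b: int, ascii_style: bool) -> str:
    """Decode one byte below 0x80 under the chosen convention."""
    if not 0 <= b < 0x80:
        raise ValueError("expected a 7-bit byte")
    table = ASCII_READING if ascii_style else JISX0201_READING
    return table.get(b, chr(b))


if __name__ == "__main__":
    for unit in split_units(b"A\x81\x5c\x5c"):
        print(unit.hex())
    for style in (False, True):
        print("U+%04X" % ord(decode_single_byte(0x5C, style)))
```

Note that 5C can also occur as the second byte of a two-byte character, which is why the split has to happen before any single-byte interpretation is applied.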
78 | ||
79 | ||
80 | 3. Proposed charsets and mappings | |
81 | ||
82 | As a tentative solution, we introduce two private charsets for EUC-JP and four | |
83 | private charsets for Shift JIS. | |
84 | ||
85 | 1) EUC-JP | |
86 | ||
87 | We have two charsets, namely "x-eucjp-unicode" and "x-eucjp-jisx0221". They | |
88 | differ in only one code point. The mapping for the former is based | |
89 | on the Unicode Consortium mapping, while that for the latter is based on the JIS X0221 | |
90 | mapping. | |
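
To make the one-code-point difference concrete, the sketch below converts a JIS X 0208 code to its EUC-JP byte form and looks up the single disputed entry, 0x213D, under each private charset. The table `DISPUTED` and the function names are hypothetical, and setting the high bit of each byte is the usual EUC-JP encoding of JIS X 0208 rather than something stated in this note. A real map file covers the whole coded character set; only the disputed entry is shown here.

```python
import unicodedata

# The one JIS X 0208 code whose Unicode target differs between the charsets.
DISPUTED = {
    0x213D: {
        "x-eucjp-unicode": "\u2015",   # Unicode Consortium mapping
        "x-eucjp-jisx0221": "\u2014",  # JIS X 0221 mapping
    }
}


def jis_to_eucjp(code: int) -> bytes:
    """Encode a two-byte JIS X 0208 code as EUC-JP by setting both high bits."""
    hi, lo = code >> 8, code & 0xFF
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("not a JIS X 0208 code: 0x%04X" % code)
    return bytes([hi | 0x80, lo | 0x80])


def describe(code: int, charset: str) -> str:
    """Return 'U+XXXX NAME' for a disputed code under the given charset."""
    ch = DISPUTED[code][charset]
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


if __name__ == "__main__":
    print(jis_to_eucjp(0x213D).hex())
    for name in ("x-eucjp-unicode", "x-eucjp-jisx0221"):
        print(name, "->", describe(0x213D, name))
```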
91 | ||
92 | 2) Shift JIS | |
93 | ||
94 | We have four charsets, namely x-sjis-unicode, x-sjis-jisx0221, | |
95 | x-sjis-jdk117, and x-sjis-cp932. | |
96 | ||
97 | The mapping for the charset x-sjis-unicode is the one published by the Unicode | |
98 | Consortium. The mapping for x-sjis-jisx0221 is almost identical to that for | |
99 | x-sjis-unicode, but 0x213D of JIS X 0208 is mapped to U+2014 (em dash) rather | |
100 | than U+2015. The charset x-sjis-jdk117 is again almost equivalent to | |
101 | x-sjis-unicode, but 0x5C and 0x7E of JIS X0201 are mapped to backslash and | |
102 | tilde. | |
103 | ||
104 | The charset x-sjis-cp932 is used by Microsoft Windows, and its mapping is | |
105 | published by the Unicode Consortium (available at: | |
106 | ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.txt). The | |
107 | coded character set for this charset includes NEC-extensions and | |
108 | IBM-extensions. 0x5C and 0x7E of JIS X0201 are mapped to backslash and tilde; | |
109 | 0x213D is mapped to U+2015; and 0x2140, 0x2141, 0x2142, and 0x215D of JIS X | |
110 | 0208 are mapped to compatibility characters. | |
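
One way to picture the four Shift JIS charsets is as a shared base table plus a small set of per-charset overrides. The sketch below is an illustration under stated assumptions, not the format of the actual map files. `BASE`, `OVERRIDES`, `jis_to_sjis`, and `lookup` are hypothetical names. The base values for 5C and 7E (YEN SIGN and OVERLINE) are assumed from the Unicode Consortium table rather than taken from this note. The row and cell arithmetic in `jis_to_sjis` is the conventional JIS X 0208 to Shift JIS conversion. The compatibility-character entries of x-sjis-cp932 are deliberately left out.

```python
# Entries of x-sjis-unicode that at least one other charset overrides.
BASE = {
    b"\x5c": "\u00A5",      # assumed: yen sign
    b"\x7e": "\u203E",      # assumed: overline
    b"\x81\x5c": "\u2015",  # JIS X 0208 0x213D -> horizontal bar
}

# Differences from x-sjis-unicode, as described in the text above.
OVERRIDES = {
    "x-sjis-unicode": {},
    "x-sjis-jisx0221": {b"\x81\x5c": "\u2014"},
    "x-sjis-jdk117": {b"\x5c": "\\", b"\x7e": "~"},
    # NEC/IBM extensions and compatibility characters are omitted here.
    "x-sjis-cp932": {b"\x5c": "\\", b"\x7e": "~"},
}


def jis_to_sjis(code: int) -> bytes:
    """Convert a two-byte JIS X 0208 code to its Shift JIS byte pair."""
    j1, j2 = code >> 8, code & 0xFF
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("not a JIS X 0208 code: 0x%04X" % code)
    s1 = (j1 + 1) // 2 + (0x70 if j1 <= 0x5E else 0xB0)
    if j1 % 2:  # odd row: second byte lands in 0x40..0x9E, skipping 0x7F
        s2 = j2 + (0x1F if j2 <= 0x5F else 0x20)
    else:       # even row: second byte lands in 0x9F..0xFC
        s2 = j2 + 0x7E
    return bytes([s1, s2])


def lookup(charset: str, unit: bytes) -> str:
    """Resolve one byte unit under a charset: overrides first, then the base."""
    table = dict(BASE)
    table.update(OVERRIDES[charset])
    return table[unit]


if __name__ == "__main__":
    unit = jis_to_sjis(0x213D)
    for name in OVERRIDES:
        print(name, unit.hex(), "U+%04X" % ord(lookup(name, unit)))
```

Keeping the overrides separate from the base makes it easy to see that x-sjis-jisx0221 and x-sjis-jdk117 each depart from x-sjis-unicode in only a few entries.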
111 | ||
112 | Makoto | |
113 | ||
114 | Fuji Xerox Information Systems | |
115 | ||
116 | Tel: +81-44-812-7230 Fax: +81-44-812-7231 | |
117 | E-mail: murata@apsdc.ksp.fujixerox.co.jp |