Commit | Line | Data |
---|---|---|
920dae64 AT |
1 | =head1 NAME |
2 | ||
3 | Encode::Supported -- Encodings supported by Encode | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
7 | =head2 Encoding Names | |
8 | ||
9 | Encoding names are case insensitive. White space in names | |
10 | is ignored. In addition, an encoding may have aliases. | |
11 | Each encoding has one "canonical" name. The "canonical" | |
12 | name is chosen from the names of the encoding by picking | |
13 | the first in the following sequence (with a few exceptions). | |
14 | ||
15 | =over 4 | |
16 | ||
17 | =item * | |
18 | ||
19 | The name used by the Perl community. That includes 'utf8' and 'ascii'. | |
20 | Unlike aliases, canonical names reach the encoding method directly, so | |
21 | frequently used names like 'utf8' need no alias lookup. | |
22 | ||
23 | =item * | |
24 | ||
25 | The MIME name as defined in IETF RFCs. This includes all "iso-"s. | |
26 | ||
27 | =item * | |
28 | ||
29 | The name in the IANA registry. | |
30 | ||
31 | =item * | |
32 | ||
33 | The name used by the organization that defined it. | |
34 | ||
35 | =back | |
36 | ||
37 | Where a I<de jure> canonical name differs from the one Encode uses, | |
38 | it is always aliased, provided the encoding is implemented. So you can | |
39 | safely tell if a given encoding is implemented or not just by passing | |
40 | the canonical name. | |
41 | ||
42 | Because of all the alias issues, and because in the general case | |
43 | encodings have state, "Encode" uses an encoding object internally | |
44 | once an operation is in progress. | |
45 | ||
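As a minimal sketch of the naming rules above, the following looks up one encoding under several spellings and asks for its canonical name. It assumes a stock Encode installation; C<no-such-encoding> is a made-up name used only to show the failure case.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# find_encoding() accepts a canonical name or an alias and returns the
# encoding object, or undef when the encoding is not implemented.
for my $name ('latin1', 'Latin1', 'iso-8859-1') {
    my $enc = find_encoding($name);
    printf "%-12s => %s\n", $name, $enc ? $enc->name : '(not implemented)';
}

# resolve_alias() returns the canonical name, or a false value.
my $canonical = resolve_alias('latin1');
my $missing   = resolve_alias('no-such-encoding');   # hypothetical name

print "latin1 resolves to $canonical\n";
print "no-such-encoding is not implemented\n" unless $missing;
```

The object returned by find_encoding() is the "encoding object" mentioned above; its encode() and decode() methods can be called directly to skip repeated name lookups.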
46 | =head1 Supported Encodings | |
47 | ||
48 | As of Perl 5.8.0, at least the following encodings are recognized. | |
49 | Note that unless otherwise specified, they are all case insensitive | |
50 | (via alias) and all occurrences of spaces are replaced with '-'. | |
51 | In other words, "ISO 8859 1" and "iso-8859-1" are identical. | |
52 | ||
53 | Encodings are categorized and implemented in several different modules | |
54 | but you don't have to C<use Encode::XX> to make them available for | |
55 | most cases. Encode.pm will automatically load those modules on demand. | |
56 | ||
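The on-demand loading described above can be illustrated with a short sketch: only C<Encode> itself is loaded, yet Japanese encodings are usable. The character chosen (U+3042) is just an illustrative sample.

```perl
use strict;
use warnings;
use Encode qw(encode decode);   # note: no "use Encode::JP"

sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $hiragana_a = "\x{3042}";    # HIRAGANA LETTER A

# Encode loads the module that implements each encoding on first use.
my $euc  = encode('euc-jp',   $hiragana_a);   # "\xA4\xA2"
my $sjis = encode('shiftjis', $hiragana_a);   # "\x82\xA0"

print "euc-jp:   ", hex_bytes($euc),  "\n";
print "shiftjis: ", hex_bytes($sjis), "\n";

# Round trip back to a Perl character string.
my $back = decode('euc-jp', $euc);
print "round trip ok\n" if $back eq $hiragana_a;
```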
57 | =head2 Built-in Encodings | |
58 | ||
59 | The following encodings are always available. | |
60 | ||
61 | Canonical Aliases Comments & References | |
62 | ---------------------------------------------------------------- | |
63 | ascii US-ascii ISO-646-US [ECMA] | |
64 | ascii-ctrl Special Encoding | |
65 | iso-8859-1 latin1 [ISO] | |
66 | null Special Encoding | |
67 | utf8 UTF-8 [RFC2279] | |
68 | ---------------------------------------------------------------- | |
69 | ||
70 | I<null> and I<ascii-ctrl> are special. "null" fails for every character, | |
71 | so when you set fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL | |
72 | CHARACTERS will fall back to character references. Ditto for | |
73 | "ascii-ctrl" except for control characters. For fallback modes, see | |
74 | L<Encode>. | |
75 | ||
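The fallback modes mentioned above can be tried with a sketch like the following. It first uses C<ascii>, where only the non-ASCII character falls back, and then C<null>, where, per the description above, every character is expected to fall back. The sample strings are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{e9}";   # U+00E9 has no mapping in ascii

my %mode = (
    PERLQQ   => Encode::FB_PERLQQ(),
    HTMLCREF => Encode::FB_HTMLCREF(),
    XMLCREF  => Encode::FB_XMLCREF(),
);

for my $name (sort keys %mode) {
    my $copy = $text;   # work on a copy; CHECK modes may touch the input
    printf "%-8s ascii: %s\n", $name, encode('ascii', $copy, $mode{$name});
}
# HTMLCREF ascii: caf&#233;
# PERLQQ   ascii: caf\x{00e9}
# XMLCREF  ascii: caf&#xe9;

# "null" maps nothing, so the whole string should come back as references.
my $all = "Perl";
print "HTMLCREF null:  ", encode('null', $all, Encode::FB_HTMLCREF()), "\n";
```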
76 | =head2 Encode::Unicode -- other Unicode encodings | |
77 | ||
78 | Unicode coding schemes other than native utf8 are supported by | |
79 | Encode::Unicode, which will be autoloaded on demand. | |
80 | ||
81 | ---------------------------------------------------------------- | |
82 | UCS-2BE UCS-2, iso-10646-1 [IANA, UC] | |
83 | UCS-2LE [UC] | |
84 | UTF-16 [UC] | |
85 | UTF-16BE [UC] | |
86 | UTF-16LE [UC] | |
87 | UTF-32 [UC] | |
88 | UTF-32BE UCS-4 [UC] | |
89 | UTF-32LE [UC] | |
90 | UTF-7 [RFC2152] | |
91 | ---------------------------------------------------------------- | |
92 | ||
93 | To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another, | |
94 | see L<Encode::Unicode>. | |
95 | ||
96 | UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit | |
97 | encoding. It is implemented separately by Encode::Unicode::UTF7. | |
98 | ||
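To get a feel for how the byte-order variants differ, here is a small sketch that encodes a single character under several of the names listed above. C<hex_bytes> is a helper defined only for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper: show octets as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $char = "A";   # U+0041

print "UTF-16BE: ", hex_bytes(encode('UTF-16BE', $char)), "\n";   # 00 41
print "UTF-16LE: ", hex_bytes(encode('UTF-16LE', $char)), "\n";   # 41 00
print "UTF-16:   ", hex_bytes(encode('UTF-16',   $char)), "\n";   # FE FF 00 41
print "UTF-32BE: ", hex_bytes(encode('UTF-32BE', $char)), "\n";   # 00 00 00 41
```

When neither BE nor LE is given, encode() emits big-endian data with a BOM prepended, which is why the plain C<UTF-16> line is two bytes longer.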
99 | =head2 Encode::Byte -- Extended ASCII | |
100 | ||
101 | Encode::Byte implements most single-byte encodings except for | |
102 | Symbols and EBCDIC. The following encodings are based on single-byte | |
103 | encodings implemented as extended ASCII. Most of them map | |
104 | \x80-\xff (upper half) to non-ASCII characters. | |
105 | ||
106 | =over 4 | |
107 | ||
108 | =item ISO-8859 and corresponding vendor mappings | |
109 | ||
110 | Since there are so many, they are presented in table format with | |
111 | languages and corresponding encoding names by vendors. Note that | |
112 | the table is sorted in order of ISO-8859 and the corresponding vendor | |
113 | mappings are slightly different from those of ISO. See | |
114 | L<http://czyborra.com/charsets/iso8859.html> for details. | |
115 | ||
116 | Lang/Regions ISO/Other Std. DOS Windows Macintosh Others | |
117 | ---------------------------------------------------------------- | |
118 | N. America (ASCII) cp437 AdobeStandardEncoding | |
119 | cp863 (DOSCanadaF) | |
120 | W. Europe iso-8859-1 cp850 cp1252 MacRoman nextstep | |
121 | hp-roman8 | |
122 | cp860 (DOSPortuguese) | |
123 | Cntrl. Europe iso-8859-2 cp852 cp1250 MacCentralEurRoman | |
124 | MacCroatian | |
125 | MacRomanian | |
126 | MacRumanian | |
127 | Latin3[1] iso-8859-3 | |
128 | Latin4[2] iso-8859-4 | |
129 | Cyrillics iso-8859-5 cp855 cp1251 MacCyrillic | |
130 | (See also next section) cp866 MacUkrainian | |
131 | Arabic iso-8859-6 cp864 cp1256 MacArabic | |
132 | cp1006 MacFarsi | |
133 | Greek iso-8859-7 cp737 cp1253 MacGreek | |
134 | cp869 (DOSGreek2) | |
135 | Hebrew iso-8859-8 cp862 cp1255 MacHebrew | |
136 | Turkish iso-8859-9 cp857 cp1254 MacTurkish | |
137 | Nordics iso-8859-10 cp865 | |
138 | cp861 MacIcelandic | |
139 | MacSami | |
140 | Thai iso-8859-11[3] cp874 MacThai | |
141 | (iso-8859-12 is nonexistent. Reserved for Indics?) | |
142 | Baltics iso-8859-13 cp775 cp1257 | |
143 | Celtics iso-8859-14 | |
144 | Latin9 [4] iso-8859-15 | |
145 | Latin10 iso-8859-16 | |
146 | Vietnamese viscii cp1258 MacVietnamese | |
147 | ---------------------------------------------------------------- | |
148 | ||
149 | [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9. | |
150 | [2] Baltics. Now on 8859-10, except for Latvian. | |
151 | [3] TIS 620 + Non-Breaking Space (0xA0 / U+00A0) | |
152 | [4] Nicknamed Latin0; the Euro sign as well as French and Finnish | |
153 | letters that are missing from 8859-1 were added. | |
154 | ||
155 | All cp* are also available as ibm-*, ms-*, and windows-* . See also | |
156 | L<http://czyborra.com/charsets/codepages.html>. | |
157 | ||
158 | Macintosh encodings don't seem to be registered in such entities as | |
159 | IANA. "Canonical" names in Encode are based upon Apple's Tech Note | |
160 | 1150. See L<http://developer.apple.com/technotes/tn/tn1150.html> | |
161 | for details. | |
162 | ||
163 | =item KOI8 - De Facto Standard for the Cyrillic world | |
164 | ||
165 | Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more | |
166 | popular on the Net. L<Encode> comes with the following KOI charsets. | |
167 | For gory details, see L<http://czyborra.com/charsets/cyrillic.html>. | |
168 | ||
169 | ---------------------------------------------------------------- | |
170 | koi8-f | |
171 | koi8-r cp878 [RFC1489] | |
172 | koi8-u [RFC2319] | |
173 | ---------------------------------------------------------------- | |
174 | ||
175 | =item gsm0338 - Hentai Latin 1 | |
176 | ||
177 | GSM0338 is for GSM handsets. Though it shares alphanumerals with | |
178 | ASCII, control character ranges and other parts are mapped very | |
179 | differently, mainly to store Greek characters. There are also escape | |
180 | sequences (starting with 0x1B) to cover e.g. the Euro sign. Some | |
181 | special cases like a trailing 0x00 byte or a lone 0x1B byte are not | |
182 | well-defined and decode() will return an empty string for them. | |
183 | One possible workaround is | |
184 | ||
185 | $gsm =~ s/\x00\z/\x00\x00/; | |
186 | $uni = decode("gsm0338", $gsm); | |
187 | $uni .= "\xA0" if $gsm =~ /\x1B\z/; | |
188 | ||
189 | Note that the Encode implementation of GSM0338 does not implement the | |
190 | reuse of Latin capital letters as Greek capital letters. For example, | |
191 | 0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL | |
192 | LETTER ZETA). | |
193 | ||
194 | The GSM0338 is also covered in Encode::Byte even though it is not | |
195 | an "extended ASCII" encoding. | |
196 | ||
197 | =back | |
198 | ||
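The single-byte tables above can be explored with a sketch like this one. It checks that vendor prefixes resolve to the same C<cp*> encoding and shows how one character lands on different bytes in ISO, Windows, and KOI8 mappings. The sample characters are arbitrary, and C<hex_bytes> is a helper for this example only.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# Vendor prefixes are expected to resolve to one canonical cp* name.
for my $alias (qw(cp1252 windows-1252 ibm-1252 ms-1252)) {
    my $enc = find_encoding($alias);
    printf "%-13s => %s\n", $alias, $enc ? $enc->name : '(not found)';
}

# The same character sits at a different byte in each mapping.
my $euro = "\x{20AC}";   # EURO SIGN
print "cp1252:      ", hex_bytes(encode('cp1252',      $euro)), "\n";   # 80
print "iso-8859-15: ", hex_bytes(encode('iso-8859-15', $euro)), "\n";   # A4

my $cyr_a = "\x{0410}";  # CYRILLIC CAPITAL LETTER A
print "koi8-r:      ", hex_bytes(encode('koi8-r',      $cyr_a)), "\n";  # E1
print "iso-8859-5:  ", hex_bytes(encode('iso-8859-5',  $cyr_a)), "\n";  # B0
print "cp1251:      ", hex_bytes(encode('cp1251',      $cyr_a)), "\n";  # C0
```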
199 | =head2 CJK: Chinese, Japanese, Korean (Multibyte) | |
200 | ||
201 | Note that Vietnamese is listed above. Also read "Encoding vs Charset" | |
202 | below. Also note that these are implemented in distinct modules by | |
203 | countries, due to the size concerns (simplified Chinese is mapped | |
204 | to 'CN', continental China, while traditional Chinese is mapped to | |
205 | 'TW', Taiwan). Please refer to their respective documentation pages. | |
206 | ||
207 | =over 4 | |
208 | ||
209 | =item Encode::CN -- Continental China | |
210 | ||
211 | Standard DOS/Win Macintosh Comment/Reference | |
212 | ---------------------------------------------------------------- | |
213 | euc-cn [1] MacChineseSimp | |
214 | (gbk) cp936 [2] | |
215 | gb12345-raw { GB12345 without CES } | |
216 | gb2312-raw { GB2312 without CES } | |
217 | hz | |
218 | iso-ir-165 | |
219 | ---------------------------------------------------------------- | |
220 | ||
221 | [1] GB2312 is aliased to this. See L<Microsoft-related naming mess> | |
222 | [2] gbk is aliased to this. See L<Microsoft-related naming mess> | |
223 | ||
224 | =item Encode::JP -- Japan | |
225 | ||
226 | Standard DOS/Win Macintosh Comment/Reference | |
227 | ---------------------------------------------------------------- | |
228 | euc-jp | |
229 | shiftjis cp932 macJapanese | |
230 | 7bit-jis | |
231 | iso-2022-jp [RFC1468] | |
232 | iso-2022-jp-1 [RFC2237] | |
233 | jis0201-raw { JIS X 0201 (roman + halfwidth kana) without CES } | |
234 | jis0208-raw { JIS X 0208 (Kanji + fullwidth kana) without CES } | |
235 | jis0212-raw { JIS X 0212 (Extended Kanji) without CES } | |
236 | ---------------------------------------------------------------- | |
237 | ||
238 | =item Encode::KR -- Korea | |
239 | ||
240 | Standard DOS/Win Macintosh Comment/Reference | |
241 | ---------------------------------------------------------------- | |
242 | euc-kr MacKorean [RFC1557] | |
243 | cp949 [1] | |
244 | iso-2022-kr [RFC1557] | |
245 | johab [KS X 1001:1998, Annex 3] | |
246 | ksc5601-raw { KSC5601 without CES } | |
247 | ---------------------------------------------------------------- | |
248 | ||
249 | [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this. | |
250 | See below. | |
251 | ||
252 | =item Encode::TW -- Taiwan | |
253 | ||
254 | Standard DOS/Win Macintosh Comment/Reference | |
255 | ---------------------------------------------------------------- | |
256 | big5-eten cp950 MacChineseTrad {big5 aliased to big5-eten} | |
257 | big5-hkscs | |
258 | ---------------------------------------------------------------- | |
259 | ||
260 | =item Encode::HanExtra -- More Chinese via CPAN | |
261 | ||
262 | Due to the size concerns, additional Chinese encodings below are | |
263 | distributed separately on CPAN, under the name Encode::HanExtra. | |
264 | ||
265 | Standard DOS/Win Macintosh Comment/Reference | |
266 | ---------------------------------------------------------------- | |
267 | big5ext CMEX's Big5e Extension | |
268 | big5plus CMEX's Big5+ Extension | |
269 | cccii Chinese Character Code for Information Interchange | |
270 | euc-tw EUC (Extended Unix Code) | |
271 | gb18030 GBK with Traditional Characters | |
272 | ---------------------------------------------------------------- | |
273 | ||
274 | =item Encode::JIS2K -- JIS X 0213 encodings via CPAN | |
275 | ||
276 | Due to size concerns, additional Japanese encodings below are | |
277 | distributed separately on CPAN, under the name Encode::JIS2K. | |
278 | ||
279 | Standard DOS/Win Macintosh Comment/Reference | |
280 | ---------------------------------------------------------------- | |
281 | euc-jisx0213 | |
282 | shiftjisx0213 | |
283 | iso-2022-jp-3 | |
284 | jis0213-1-raw | |
285 | jis0213-2-raw | |
286 | ---------------------------------------------------------------- | |
287 | ||
288 | =back | |
289 | ||
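Converting between two of the CJK encodings listed above is a common task. The following sketch uses Encode's from_to(), which rewrites an octet string in place; the sample octets are one Shift_JIS character chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(from_to decode);

my $octets = "\x82\xA0";   # HIRAGANA LETTER A in shiftjis

# from_to() converts in place and returns the new length in octets,
# or undef on failure.
my $len = from_to($octets, 'shiftjis', 'euc-jp');
die "conversion failed\n" unless defined $len;

printf "euc-jp octets: %s (%d bytes)\n",
    join(' ', map { sprintf '%02X', ord } split //, $octets), $len;

# Decoding the result yields the original character, U+3042.
my $char = decode('euc-jp', $octets);
printf "decoded: U+%04X\n", ord $char;
```

Because from_to() works on octets, neither the input nor the output is a Perl character string; use decode() when you need characters.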
290 | =head2 Miscellaneous encodings | |
291 | ||
292 | =over 4 | |
293 | ||
294 | =item Encode::EBCDIC | |
295 | ||
296 | See L<perlebcdic> for details. | |
297 | ||
298 | ---------------------------------------------------------------- | |
299 | cp37 | |
300 | cp500 | |
301 | cp875 | |
302 | cp1026 | |
303 | cp1047 | |
304 | posix-bc | |
305 | ---------------------------------------------------------------- | |
306 | ||
307 | =item Encode::Symbol | |
308 | ||
309 | For symbols and dingbats. | |
310 | ||
311 | ---------------------------------------------------------------- | |
312 | symbol | |
313 | dingbats | |
314 | MacDingbats | |
315 | AdobeZdingbat | |
316 | AdobeSymbol | |
317 | ---------------------------------------------------------------- | |
318 | ||
319 | =item Encode::MIME::Header | |
320 | ||
321 | Strictly speaking, the MIME header encoding documented in RFC 2047 is more | |
322 | of an encapsulation than an encoding. However, support for it is | |
323 | imperative in the modern world, so it is supported. | |
324 | ||
325 | ---------------------------------------------------------------- | |
326 | MIME-Header [RFC2047] | |
327 | MIME-B [RFC2047] | |
328 | MIME-Q [RFC2047] | |
329 | ---------------------------------------------------------------- | |
330 | ||
331 | =item Encode::Guess | |
332 | ||
333 | This one is not the name of an encoding but a utility that lets you pick | |
334 | the most appropriate encoding for your data out of given I<suspects>. See | |
335 | L<Encode::Guess> for details. | |
336 | ||
337 | =back | |
338 | ||
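A minimal sketch of Encode::Guess follows, assuming the data is either in one of the default suspects or in C<euc-jp>. The sample octets are illustrative. The result is either an encoding object or a plain error string, so both cases are handled.

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding()

my $data = "\xE3\x81\x82";   # octets of unknown origin (sample input)

# Extra suspects are passed as a list after the data.
my $guess = guess_encoding($data, qw(euc-jp));

if (ref $guess) {
    # A single suspect matched: $guess is an encoding object.
    printf "looks like %s\n", $guess->name;
    my $text = $guess->decode($data);
    printf "decoded %d character(s)\n", length $text;
}
else {
    # No match, or more than one: $guess is an error message.
    print "could not decide: $guess\n";
}
```

Guessing is heuristic; short or ambiguous input can match several suspects, in which case the error-string branch is taken.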
339 | =head1 Unsupported encodings | |
340 | ||
341 | The following encodings are not supported as yet; some because they | |
342 | are rarely used, some because of technical difficulties. They may | |
343 | be supported by external modules via CPAN in the future, however. | |
344 | ||
345 | =over 4 | |
346 | ||
347 | =item ISO-2022-JP-2 [RFC1554] | |
348 | ||
349 | Not very popular yet. Needs Unicode Database or equivalent to | |
350 | implement encode() (because it includes JIS X 0208/0212, KSC5601, and | |
351 | GB2312 simultaneously, whose code points in Unicode overlap. So you | |
352 | need to look up the database to determine to what character set a given | |
353 | Unicode character should belong). | |
354 | ||
355 | =item ISO-2022-CN [RFC1922] | |
356 | ||
357 | Not very popular. Needs CNS 11643-1 and -2 which are not available in | |
358 | this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra. | |
359 | Autrijus Tang may add support for this encoding in his module in future. | |
360 | ||
361 | =item Various HP-UX encodings | |
362 | ||
363 | The following are unsupported due to the lack of mapping data. | |
364 | ||
365 | '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8 | |
366 | '15' - japanese15, korean15, and roi15 | |
367 | ||
368 | =item Cyrillic encoding ISO-IR-111 | |
369 | ||
370 | Anton Tagunov doubts its usefulness. | |
371 | ||
372 | =item ISO-8859-8-1 [Hebrew] | |
373 | ||
374 | None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and | |
375 | MacHebrew are supported only because there were mappings | |
376 | available at L<http://www.unicode.org/>). Contributions welcome. | |
377 | ||
378 | =item ISIRI 3342, Iran System, ISIRI 2900 [Farsi] | |
379 | ||
380 | Ditto. | |
381 | ||
382 | =item Thai encoding TCVN | |
383 | ||
384 | Ditto. | |
385 | ||
386 | =item Vietnamese encodings VPS | |
387 | ||
388 | Though Jungshik Shin has reported that Mozilla supports this encoding, | |
389 | it was too late before 5.8.0 for us to add it. In the future, it | |
390 | may be available via a separate module. See | |
391 | L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> | |
392 | and | |
393 | L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut> | |
394 | if you are interested in helping us. | |
395 | ||
396 | =item Various Mac encodings | |
397 | ||
398 | The following are unsupported due to the lack of mapping data. | |
399 | ||
400 | MacArmenian, MacBengali, MacBurmese, MacEthiopic | |
401 | MacExtArabic, MacGeorgian, MacKannada, MacKhmer | |
402 | MacLaotian, MacMalayalam, MacMongolian, MacOriya | |
403 | MacSinhalese, MacTamil, MacTelugu, MacTibetan | |
404 | MacVietnamese | |
405 | ||
406 | The rest which are already available are based upon the vendor mappings | |
407 | at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> . | |
408 | ||
409 | =item (Mac) Indic encodings | |
410 | ||
411 | The maps for the following are available at L<http://www.unicode.org/> | |
412 | but remain unsupported because those encodings need an algorithmic | |
413 | approach, currently unsupported by F<enc2xs>: | |
414 | ||
415 | MacDevanagari | |
416 | MacGurmukhi | |
417 | MacGujarati | |
418 | ||
419 | For details, please see C<Unicode mapping issues and notes:> at | |
420 | L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> . | |
421 | ||
422 | I believe this issue is prevalent not only for Mac Indics but also in | |
423 | other Indic encodings, but the above were the only Indic encoding | |
424 | maps that I could find at L<http://www.unicode.org/> . | |
425 | ||
426 | =back | |
427 | ||
428 | =head1 Encoding vs. Charset -- terminology | |
429 | ||
430 | We are used to using the term (character) I<encoding> and I<character | |
431 | set> interchangeably. But just as confusing the terms byte and | |
432 | character is dangerous and the terms should be differentiated when | |
433 | needed, we need to differentiate I<encoding> and I<character set>. | |
434 | ||
435 | To understand that, here is a description of how we make computers | |
436 | grok our characters. | |
437 | ||
438 | =over 4 | |
439 | ||
440 | =item * | |
441 | ||
442 | First we start with which characters to include. We call this | |
443 | collection of characters a I<character repertoire>. | |
444 | ||
445 | =item * | |
446 | ||
447 | Then we have to give each character a unique ID so your computer can | |
448 | tell the difference between 'a' and 'A'. This itemized character | |
449 | repertoire is now a I<character set>. | |
450 | ||
451 | =item * | |
452 | ||
453 | If your computer can grok the character set without further | |
454 | processing, you can go ahead and use it. This is called a I<coded | |
455 | character set> (CCS) or I<raw character encoding>. ASCII is used this | |
456 | way for most cases. | |
457 | ||
458 | =item * | |
459 | ||
460 | But in many cases, especially multi-byte CJK encodings, you have to | |
461 | tweak a little more. Your network connection may not accept any data | |
462 | with the Most Significant Bit set, and your computer may not be able to | |
463 | tell if a given byte is a whole character or just half of it. So you | |
464 | have to I<encode> the character set to use it. | |
465 | ||
466 | A I<character encoding scheme> (CES) determines how to encode a given | |
467 | character set, or a set of multiple character sets. 7bit ISO-2022 is | |
468 | an example of a CES. You switch between character sets via I<escape | |
469 | sequences>. | |
470 | ||
471 | =back | |
472 | ||
473 | Technically, or mathematically, speaking, a character set encoded in | |
474 | such a CES that maps character by character may form a CCS. EUC is such | |
475 | an example. The CES of EUC is as follows: | |
476 | ||
477 | =over 4 | |
478 | ||
479 | =item * | |
480 | ||
481 | Map ASCII unchanged. | |
482 | ||
483 | =item * | |
484 | ||
485 | Map a character set that consists of 94 or 96 to the power of N | |
486 | members by adding 0x80 to each byte. | |
487 | ||
488 | =item * | |
489 | ||
490 | You can also use 0x8e and 0x8f to indicate that the following sequence of | |
491 | characters belongs to yet another character set. To each following byte | |
492 | is added the value 0x80. | |
493 | ||
494 | =back | |
495 | ||
496 | By carefully looking at the encoded byte sequence, you can find that the | |
497 | byte sequence corresponds to a unique number. In that sense, EUC is a CCS | |
498 | generated by a CES above from up to four CCS (complicated?). UTF-8 | |
499 | falls into this category. See L<perlunicode/"UTF-8"> to find out how | |
500 | UTF-8 maps Unicode to a byte sequence. | |
501 | ||
502 | You may also have found out by now why 7bit ISO-2022 cannot comprise | |
503 | a CCS. If you look at a byte sequence \x21\x21, you can't tell if | |
504 | it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 | |
505 | so you have no trouble differentiating between "!!" and S<" ">. | |
506 | ||
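The EUC rule and the "!!" ambiguity described above can be checked with a short sketch. It encodes IDEOGRAPHIC SPACE with the raw JIS X 0208 character set and with C<euc-jp>, then applies the "add 0x80 to each byte" step by hand. C<add_0x80> is a helper written only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Set the high bit of every byte: the EUC step for a 94^N character set.
sub add_0x80 { join '', map { chr(ord($_) | 0x80) } split //, $_[0] }

my $space = "\x{3000}";   # IDEOGRAPHIC SPACE

my $raw = encode('jis0208-raw', $space);   # raw CCS code, no CES applied
my $euc = encode('euc-jp',      $space);   # same character under EUC

print "raw JIS X 0208 is indistinguishable from ASCII '!!'\n"
    if $raw eq '!!';
print "EUC is the raw code with 0x80 added to each byte\n"
    if add_0x80($raw) eq $euc;
```

In 7-bit ISO-2022 the raw bytes are used as-is and only the preceding escape sequence tells the two readings apart, which is the reason it cannot form a CCS.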
507 | =head1 Encoding Classification (by Anton Tagunov and Dan Kogai) | |
508 | ||
509 | This section tries to classify the supported encodings by their | |
510 | applicability for information exchange over the Internet and to | |
511 | choose the most suitable aliases to name them in the context of | |
512 | such communication. | |
513 | ||
514 | =over 4 | |
515 | ||
516 | =item * | |
517 | ||
518 | To (en|de)code encodings marked by C<(**)>, you need | |
519 | C<Encode::HanExtra>, available from CPAN. | |
520 | ||
521 | =back | |
522 | ||
523 | Encoding names | |
524 | ||
525 | US-ASCII UTF-8 ISO-8859-* KOI8-R | |
526 | Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1 | |
527 | EUC-KR Big5 GB2312 | |
528 | ||
529 | are registered with IANA as preferred MIME names and may | |
530 | be used over the Internet. | |
531 | ||
532 | C<Shift_JIS> has been made official by JIS X 0208:1997. | |
533 | L<Microsoft-related naming mess> gives details. | |
534 | ||
535 | C<GB2312> is the IANA name for C<EUC-CN>. | |
536 | See L<Microsoft-related naming mess> for details. | |
537 | ||
538 | C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw> | |
539 | with Encode. See L<Encode::CN> for details. | |
540 | ||
541 | EUC-CN | |
542 | KOI8-U [RFC2319] | |
543 | ||
544 | have not been registered with IANA (as of March 2002) but | |
545 | seem to be supported by major web browsers. | |
546 | The IANA name for C<EUC-CN> is C<GB2312>. | |
547 | ||
548 | KS_C_5601-1987 | |
549 | ||
550 | is heavily misused. | |
551 | See L<Microsoft-related naming mess> for details. | |
552 | ||
553 | C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw> | |
554 | with Encode. See L<Encode::KR> for details. | |
555 | ||
556 | UTF-16 UTF-16BE UTF-16LE | |
557 | ||
558 | are IANA-registered C<charset>s. See [RFC 2781] for details. | |
559 | Jungshik Shin reports that UTF-16 with a BOM is well accepted | |
560 | by MS IE 5/6 and NS 4/6. Beware however that | |
561 | ||
562 | =over 4 | |
563 | ||
564 | =item * | |
565 | ||
566 | C<UTF-16> support in any software you're going to be | |
567 | using/interoperating with has probably been less tested | |
568 | than C<UTF-8> support | |
569 | ||
570 | =item * | |
571 | ||
572 | C<UTF-8> coded data seamlessly passes traditional | |
573 | command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded | |
574 | data is likely to cause confusion (with its zero bytes, | |
575 | for example) | |
576 | ||
577 | =item * | |
578 | ||
579 | it is beyond the power of words to describe the way HTML browsers | |
580 | encode non-C<ASCII> form data. To get a general impression, visit | |
581 | L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>. | |
582 | While encoding of form data has stabilized for C<UTF-8> encoded pages | |
583 | (at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to | |
584 | expect fun (and cross-browser discrepancies) with C<UTF-16> encoded | |
585 | pages! | |
586 | ||
587 | =back | |
588 | ||
589 | The rule of thumb is to use C<UTF-8> unless you know what | |
590 | you're doing and unless you really benefit from using C<UTF-16>. | |
591 | ||
592 | ISO-IR-165 [RFC1345] | |
593 | VISCII | |
594 | GB 12345 | |
595 | GB 18030 (**) (see links below) | |
596 | EUC-TW (**) | |
597 | ||
598 | are totally valid encodings but not registered at IANA. | |
599 | The names under which they are listed here are probably the | |
600 | most widely-known names for these encodings and are recommended | |
601 | names. | |
602 | ||
603 | BIG5PLUS (**) | |
604 | ||
605 | is a proprietary name. | |
606 | ||
607 | =head2 Microsoft-related naming mess | |
608 | ||
609 | Microsoft products misuse the following names: | |
610 | ||
611 | =over 4 | |
612 | ||
613 | =item KS_C_5601-1987 | |
614 | ||
615 | Microsoft extension to C<EUC-KR>. | |
616 | ||
617 | Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla). | |
618 | ||
619 | See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html> | |
620 | for details. | |
621 | ||
622 | Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common | |
623 | misusage. I<Raw> C<KS_C_5601-1987> encoding is available as | |
624 | C<ksc5601-raw>. | |
625 | ||
626 | See L<Encode::KR> for details. | |
627 | ||
628 | =item GB2312 | |
629 | ||
630 | Microsoft extension to C<EUC-CN>. | |
631 | ||
632 | Proper names: C<CP936>, C<GBK>. | |
633 | ||
634 | C<GB2312> has been registered in the C<EUC-CN> meaning at | |
635 | IANA. This has partially repaired the situation: Microsoft's | |
636 | C<GB2312> has become a superset of the official C<GB2312>. | |
637 | ||
638 | Encode aliases C<GB2312> to C<euc-cn> in full agreement with | |
639 | IANA registration. C<cp936> is supported separately. | |
640 | I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>. | |
641 | ||
642 | See L<Encode::CN> for details. | |
643 | ||
644 | =item Big5 | |
645 | ||
646 | Microsoft extension to C<Big5>. | |
647 | ||
648 | Proper name: C<CP950>. | |
649 | ||
650 | Encode separately supports C<Big5> and C<cp950>. | |
651 | ||
652 | =item Shift_JIS | |
653 | ||
654 | Microsoft's understanding of C<Shift_JIS>. | |
655 | ||
656 | JIS has not endorsed the full Microsoft standard however. | |
657 | The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208 | |
658 | character sets, while Microsoft has always used C<Shift_JIS> | |
659 | to encode a wider character repertoire. See C<IANA> registration for | |
660 | C<Windows-31J>. | |
661 | ||
662 | As a historical predecessor, Microsoft's variant | |
663 | probably has more rights for the name, though it may be objected | |
664 | that Microsoft shouldn't have used JIS as part of the name | |
665 | in the first place. | |
666 | ||
667 | Unambiguous name: C<CP932>. C<IANA> name (also used by Mozilla, and | |
668 | provided as an alias by Encode): C<Windows-31J>. | |
669 | ||
670 | Encode separately supports C<Shift_JIS> and C<cp932>. | |
671 | ||
672 | =back | |
673 | ||
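To see how Encode resolves the contested names discussed above, a sketch like the following prints the canonical encoding behind each one. The list of names is taken from this section; the output depends on the aliases shipped with your Encode version.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my @names = qw(GB2312 GBK KS_C_5601-1987 UHC Big5 Shift_JIS Windows-31J);

for my $name (@names) {
    my $enc = find_encoding($name);
    printf "%-15s => %s\n", $name, $enc ? $enc->name : '(not implemented)';
}
# Per the text above, GB2312 should resolve to euc-cn and
# KS_C_5601-1987 to cp949.
```

If you receive data labelled with one of these names from a Microsoft product, comparing a decode with the alias against a decode with the corresponding C<cp*> encoding is one way to spot vendor extensions.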
674 | =head1 Glossary | |
675 | ||
676 | =over 4 | |
677 | ||
678 | =item character repertoire | |
679 | ||
680 | A collection of unique characters. A I<character> set in the strictest | |
681 | sense. At this stage, characters are not numbered. | |
682 | ||
683 | =item coded character set (CCS) | |
684 | ||
685 | A character set that is mapped in a way computers can use directly. | |
686 | Many character encodings, including EUC, fall in this category. | |
687 | ||
688 | =item character encoding scheme (CES) | |
689 | ||
690 | An algorithm to map a character set to a byte sequence. You don't | |
691 | have to be able to tell which character set a given byte sequence | |
692 | belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an | |
693 | example of being both a CCS and CES. | |
694 | ||
695 | =item charset (in MIME context) | |
696 | ||
697 | has long been used in the meaning of C<encoding>, CES. | |
698 | ||
699 | While the word combination C<character set> has lost this meaning | |
700 | in MIME context since [RFC 2130], the C<charset> abbreviation has | |
701 | retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>: | |
702 | ||
703 | This document uses the term "charset" to mean a set of rules for | |
704 | mapping from a sequence of octets to a sequence of characters, such | |
705 | as the combination of a coded character set and a character encoding | |
706 | scheme; this is also what is used as an identifier in MIME "charset=" | |
707 | parameters, and registered in the IANA charset registry ... (Note | |
708 | that this is NOT a term used by other standards bodies, such as ISO). | |
709 | [RFC 2277] | |
710 | ||
711 | =item EUC | |
712 | ||
713 | Extended Unix Code. See ISO-2022. | |
714 | ||
715 | =item ISO-2022 | |
716 | ||
717 | A CES that was carefully designed to coexist with ASCII. There are a 7 | |
718 | bit version and an 8 bit version. | |
719 | ||
720 | The 7 bit version switches character set via escape sequence so it | |
721 | cannot form a CCS. Since this is more difficult to handle in programs | |
722 | than the 8 bit version, the 7 bit version is not very popular except for | |
723 | iso-2022-jp, the I<de facto> standard CES for e-mails. | |
724 | ||
725 | The 8 bit version can form a CCS. EUC and ISO-8859 are two examples | |
726 | thereof. Pre-5.6 perl could use them as string literals. | |
727 | ||
728 | =item UCS | |
729 | ||
730 | Short for I<Universal Character Set>. When you say just UCS, it means | |
731 | I<Unicode>. | |
732 | ||
733 | =item UCS-2 | |
734 | ||
735 | ISO/IEC 10646 encoding form: Universal Character Set coded in two | |
736 | octets. | |
737 | ||
738 | =item Unicode | |
739 | ||
740 | A character set that aims to include all character repertoires of the | |
741 | world. Many character sets in various national as well as industrial | |
742 | standards have become, in a way, just subsets of Unicode. | |
743 | ||
744 | =item UTF | |
745 | ||
746 | Short for I<Unicode Transformation Format>. Determines how to map a | |
747 | Unicode character into a byte sequence. | |
748 | ||
749 | =item UTF-16 | |
750 | ||
751 | A UTF in 16-bit encoding. Can either be in big endian or little | |
752 | endian. The big endian version is called UTF-16BE (equal to UCS-2 + | |
753 | surrogate support) and the little endian version is called UTF-16LE. | |
754 | ||
755 | =back | |
756 | ||
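The "UCS-2 + surrogate support" remark in the UTF-16 entry above can be made concrete with a sketch that encodes one character inside the 16-bit range and one beyond it. The two sample code points are arbitrary, and C<hex_bytes> is a helper for this example only.

```perl
use strict;
use warnings;
use Encode qw(encode);

sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $bmp    = "\x{3042}";    # fits in 16 bits
my $beyond = "\x{10000}";   # first code point above U+FFFF

# Within 16 bits, UCS-2BE and UTF-16BE produce the same octets.
print "UCS-2BE  U+3042:  ", hex_bytes(encode('UCS-2BE',  $bmp)), "\n";    # 30 42
print "UTF-16BE U+3042:  ", hex_bytes(encode('UTF-16BE', $bmp)), "\n";    # 30 42

# Above U+FFFF, UTF-16 uses a surrogate pair (two 16-bit units).
print "UTF-16BE U+10000: ", hex_bytes(encode('UTF-16BE', $beyond)), "\n"; # D8 00 DC 00
```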
757 | =head1 See Also | |
758 | ||
759 | L<Encode>, | |
760 | L<Encode::Byte>, | |
761 | L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>, | |
762 | L<Encode::EBCDIC>, L<Encode::Symbol>, | |
763 | L<Encode::MIME::Header>, L<Encode::Guess> | |
764 | ||
765 | =head1 References | |
766 | ||
767 | =over 4 | |
768 | ||
769 | =item ECMA | |
770 | ||
771 | European Computer Manufacturers Association | |
772 | L<http://www.ecma.ch> | |
773 | ||
774 | =over 4 | |
775 | ||
776 | =item ECMA-035 (eq C<ISO-2022>) | |
777 | ||
778 | L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> | |
779 | ||
780 | The specification of ISO-2022 is available from the link above. | |
781 | ||
782 | =back | |
783 | ||
784 | =item IANA | |
785 | ||
786 | Internet Assigned Numbers Authority | |
787 | L<http://www.iana.org/> | |
788 | ||
789 | =over 4 | |
790 | ||
791 | =item Assigned Charset Names by IANA | |
792 | ||
793 | L<http://www.iana.org/assignments/character-sets> | |
794 | ||
795 | Most of the C<canonical names> in Encode derive from this list | |
796 | so you can directly apply the string you have extracted from MIME | |
797 | header of mails and web pages. | |
798 | ||
799 | =back | |
800 | ||
801 | =item ISO | |
802 | ||
803 | International Organization for Standardization | |
804 | L<http://www.iso.ch/> | |
805 | ||
806 | =item RFC | |
807 | ||
808 | Request For Comments -- need I say more? | |
809 | L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>, | |
810 | L<http://www.faqs.org/rfcs/> | |
811 | ||
812 | =item UC | |
813 | ||
814 | Unicode Consortium | |
815 | L<http://www.unicode.org/> | |
816 | ||
817 | =over 4 | |
818 | ||
819 | =item Unicode Glossary | |
820 | ||
821 | L<http://www.unicode.org/glossary/> | |
822 | ||
823 | The glossary of this document is based upon this site. | |
824 | ||
825 | =back | |
826 | ||
827 | =back | |
828 | ||
829 | =head2 Other Notable Sites | |
830 | ||
831 | =over 4 | |
832 | ||
833 | =item czyborra.com | |
834 | ||
835 | L<http://czyborra.com/> | |
836 | ||
837 | Contains a lot of useful information, especially gory details of ISO | |
838 | vs. vendor mappings. | |
839 | ||
840 | =item CJK.inf | |
841 | ||
842 | L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html> | |
843 | ||
844 | Somewhat obsolete (last update in 1996), but still useful. Also try | |
845 | ||
846 | L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf> | |
847 | ||
848 | You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>. | |
849 | ||
850 | =item Jungshik Shin's Hangul FAQ | |
851 | ||
852 | L<http://jshin.net/faq> | |
853 | ||
854 | And especially its subject 8. | |
855 | ||
856 | L<http://jshin.net/faq/qa8.html> | |
857 | ||
858 | A comprehensive overview of the Korean (C<KS *>) standards. | |
859 | ||
860 | =item debian.org: "Introduction to i18n" | |
861 | ||
862 | A brief description for most of the mentioned CJK encodings is | |
863 | contained in | |
864 | L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html> | |
865 | ||
866 | =back | |
867 | ||
868 | =head2 Offline sources | |
869 | ||
870 | =over 4 | |
871 | ||
872 | =item C<CJKV Information Processing> by Ken Lunde | |
873 | ||
874 | CJKV Information Processing | |
875 | 1999 O'Reilly & Associates, ISBN : 1-56592-224-7 | |
876 | ||
877 | The modern successor of C<CJK.inf>. | |
878 | ||
879 | Features a comprehensive coverage of CJKV character sets and | |
880 | encodings along with many other issues faced by anyone trying | |
881 | to better support CJKV languages/scripts in all the areas of | |
882 | information processing. | |
883 | ||
884 | To purchase this book, visit | |
885 | L<http://www.oreilly.com/catalog/cjkvinfo/> | |
886 | or your favourite bookstore. | |
887 | ||
888 | =back | |
889 | ||
890 | =cut |