git.subgeniuskitty.com - OpenSPARC-T2-SAM/.git/blame_incremental - sam-t2/devtools/v8plus/lib/perl5/5.8.8/Encode/Supported.pod

Commit	Line	Data
	1	=head1 NAME
	2
	3	Encode::Supported -- Encodings supported by Encode
	4
	5	=head1 DESCRIPTION
	6
	7	=head2 Encoding Names
	8
	9	Encoding names are case insensitive. White space in names
	10	is ignored. In addition, an encoding may have aliases.
	11	Each encoding has one "canonical" name. The "canonical"
	12	name is chosen from the names of the encoding by picking
	13	the first in the following sequence (with a few exceptions).
	14
	15	=over 4
	16
	17	=item *
	18
	19	The name used by the Perl community. That includes 'utf8' and 'ascii'.
	20	Unlike aliases, canonical names reach the encoding method directly, so
	21	frequently used names like 'utf8' need no alias lookup.
	22
	23	=item *
	24
	25	The MIME name as defined in IETF RFCs. This includes all "iso-"s.
	26
	27	=item *
	28
	29	The name in the IANA registry.
	30
	31	=item *
	32
	33	The name used by the organization that defined it.
	34
	35	=back
	36
	37	Where a I<de jure> canonical name differs from the one used by the
	38	Encode module, it is always aliased, provided the encoding is implemented
	39	at all. So you can safely tell whether a given encoding is implemented
	40	just by passing its canonical name.
	41
	42	Because of all the alias issues, and because in the general case
	43	encodings have state, "Encode" uses an encoding object internally
	44	once an operation is in progress.
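
As a minimal sketch of the naming rules above, the following resolves a few spellings to their canonical names with C<find_encoding>. It assumes a stock Encode installation with its default alias table; C<no-such-encoding> is a made-up name used only to show the failure case.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map several spellings to the canonical name Encode uses internally.
# 'no-such-encoding' is a hypothetical name used to show the failure case.
my %canonical;
for my $name ('latin1', 'ISO-8859-1', 'US-ascii', 'no-such-encoding') {
    my $enc = find_encoding($name);    # an encoding object, or undef
    $canonical{$name} = $enc ? $enc->name : undef;
    printf "%-16s => %s\n", $name,
        defined $canonical{$name} ? $canonical{$name} : '(not implemented)';
}
```

An undefined result from C<find_encoding> is the signal that the encoding is not implemented under any known name.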
	45
	46	=head1 Supported Encodings
	47
	48	As of Perl 5.8.0, at least the following encodings are recognized.
	49	Note that unless otherwise specified, they are all case insensitive
	50	(via alias) and all occurrences of spaces are replaced with '-'.
	51	In other words, "ISO 8859 1" and "iso-8859-1" are identical.
	52
	53	Encodings are categorized and implemented in several different modules
	54	but you don't have to C<use Encode::XX> to make them available for
	55	most cases. Encode.pm will automatically load those modules on demand.
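
The on-demand loading described above can be sketched as follows. This is an illustration rather than a definitive recipe: it assumes the bundled Encode::JP module is installed, and the choice of C<euc-jp> and U+3042 is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Names of every encoding this installation can provide, loaded or not.
my @all = Encode->encodings(':all');
my $has_euc_jp = grep { $_ eq 'euc-jp' } @all;

# No "use Encode::JP" here: Encode loads the module on first use.
my $bytes = encode('euc-jp', "\x{3042}");    # HIRAGANA LETTER A
printf "%d encodings available; euc-jp %s; U+3042 is %d bytes in euc-jp\n",
    scalar @all, ($has_euc_jp ? 'listed' : 'missing'), length $bytes;
```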
	56
	57	=head2 Built-in Encodings
	58
	59	The following encodings are always available.
	60
	61	  Canonical   Aliases                        Comments & References
	62	  ----------------------------------------------------------------
	63	  ascii       US-ascii ISO-646-US                           [ECMA]
	64	  ascii-ctrl                                      Special Encoding
	65	  iso-8859-1  latin1                                         [ISO]
	66	  null                                            Special Encoding
	67	  utf8        UTF-8                                      [RFC2279]
	68	  ----------------------------------------------------------------
	69
	70	I<null> and I<ascii-ctrl> are special. "null" fails for all characters,
	71	so when you set the fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
	72	CHARACTERS will fall back to character references. Ditto for
	73	"ascii-ctrl" except for control characters. For fallback modes, see
	74	L<Encode>.
	75
	76	=head2 Encode::Unicode -- other Unicode encodings
	77
	78	Unicode coding schemes other than native utf8 are supported by
	79	Encode::Unicode, which will be autoloaded on demand.
	80
	81	  ----------------------------------------------------------------
	82	  UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
	83	  UCS-2LE                                                     [UC]
	84	  UTF-16                                                      [UC]
	85	  UTF-16BE                                                    [UC]
	86	  UTF-16LE                                                    [UC]
	87	  UTF-32                                                      [UC]
	88	  UTF-32BE      UCS-4                                         [UC]
	89	  UTF-32LE                                                    [UC]
	90	  UTF-7                                                  [RFC2152]
	91	  ----------------------------------------------------------------
	92
	93	To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
	94	see L<Encode::Unicode>.
	95
	96	UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
	97	encoding. It is implemented separately by Encode::Unicode::UTF7.
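
To give a rough feel for how the byte-order variants differ, this sketch encodes the single character "A" under three UTF-16 names and dumps the bytes as hex. The variable names are illustrative only; see L<Encode::Unicode> for the authoritative description.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encode the single character "A" (U+0041) under three UTF-16 flavours.
my %hex;
for my $name ('UTF-16BE', 'UTF-16LE', 'UTF-16') {
    $hex{$name} = unpack 'H*', encode($name, 'A');
    print "$name: $hex{$name}\n";
}

# When decoding plain UTF-16, the BOM selects the byte order.
my $decoded = decode('UTF-16', "\xFF\xFEA\x00");    # little-endian BOM
print "decoded: $decoded\n";
```

The BE and LE forms carry no byte order mark, while plain UTF-16 prepends one when encoding and expects one when decoding.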
	98
	99	=head2 Encode::Byte -- Extended ASCII
	100
	101	Encode::Byte implements most single-byte encodings except for
	102	Symbols and EBCDIC. The following encodings are based on single-byte
	103	encodings implemented as extended ASCII. Most of them map
	104	\x80-\xff (upper half) to non-ASCII characters.
	105
	106	=over 4
	107
	108	=item ISO-8859 and corresponding vendor mappings
	109
	110	Since there are so many, they are presented in table format with
	111	languages and corresponding encoding names by vendors. Note that
	112	the table is sorted in order of ISO-8859 and the corresponding vendor
	113	mappings are slightly different from those of ISO. See
	114	L<http://czyborra.com/charsets/iso8859.html> for details.
	115
	116	  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
	117	  ----------------------------------------------------------------
	118	  N. America    (ASCII)         cp437        AdobeStandardEncoding
	119	                                cp863 (DOSCanadaF)
	120	  W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
	121	                                                         hp-roman8
	122	                                cp860 (DOSPortuguese)
	123	  Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
	124	                                                MacCroatian
	125	                                                MacRomanian
	126	                                                MacRumanian
	127	  Latin3[1]     iso-8859-3
	128	  Latin4[2]     iso-8859-4
	129	  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
	130	    (See also next section)     cp866           MacUkrainian
	131	  Arabic        iso-8859-6      cp864   cp1256  MacArabic
	132	                                cp1006          MacFarsi
	133	  Greek         iso-8859-7      cp737   cp1253  MacGreek
	134	                                cp869 (DOSGreek2)
	135	  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
	136	  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
	137	  Nordics       iso-8859-10     cp865
	138	                                cp861           MacIcelandic
	139	                                                MacSami
	140	  Thai          iso-8859-11[3]  cp874           MacThai
	141	  (iso-8859-12 is nonexistent. Reserved for Indics?)
	142	  Baltics       iso-8859-13     cp775   cp1257
	143	  Celtics       iso-8859-14
	144	  Latin9 [4]    iso-8859-15
	145	  Latin10       iso-8859-16
	146	  Vietnamese    viscii                  cp1258  MacVietnamese
	147	  ----------------------------------------------------------------
	148
	149	[1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
	150	[2] Baltics. Now on 8859-10, except for Latvian.
	151	[3] TIS 620 + Non-Breaking Space (0xA0 / U+00A0)
	152	[4] Nicknamed Latin0; the Euro sign as well as French and Finnish
	153	letters that are missing from 8859-1 were added.
	154
	155	All cp* are also available as ibm-, ms-, and windows-* . See also
	156	L<http://czyborra.com/charsets/codepages.html>.
	157
	158	Macintosh encodings don't seem to be registered in such entities as
	159	IANA. "Canonical" names in Encode are based upon Apple's Tech Note
	160	1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
	161	for details.
	162
	163	=item KOI8 - De Facto Standard for the Cyrillic world
	164
	165	Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
	166	popular on the Net. L<Encode> comes with the following KOI charsets.
	167	For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
	168
	169	  ----------------------------------------------------------------
	170	  koi8-f
	171	  koi8-r cp878                                           [RFC1489]
	172	  koi8-u                                                 [RFC2319]
	173	  ----------------------------------------------------------------
	174
	175	=item gsm0338 - Hentai Latin 1
	176
	177	GSM0338 is for GSM handsets. Though it shares alphanumerals with
	178	ASCII, control character ranges and other parts are mapped very
	179	differently, mainly to store Greek characters. There are also escape
	180	sequences (starting with 0x1B) to cover e.g. the Euro sign. Some
	181	special cases like a trailing 0x00 byte or a lone 0x1B byte are not
	182	well-defined and decode() will return an empty string for them.
	183	One possible workaround is
	184
	185	  $gsm =~ s/\x00\z/\x00\x00/;            # double a trailing 0x00 byte
	186	  $uni = decode("gsm0338", $gsm);
	187	  $uni .= "\xA0" if $gsm =~ /\x1B\z/;    # input ended in a lone 0x1B
	188
	189	Note that the Encode implementation of GSM0338 does not implement the
	190	reuse of Latin capital letters as Greek capital letters (for example,
	191	0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
	192	LETTER ZETA)).
	193
	194	The GSM0338 is also covered in Encode::Byte even though it is not
	195	an "extended ASCII" encoding.
	196
	197	=back
	198
	199	=head2 CJK: Chinese, Japanese, Korean (Multibyte)
	200
	201	Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
	202	below. Also note that these are implemented in distinct modules by
	203	country, due to size concerns (simplified Chinese is mapped
	204	to 'CN', continental China, while traditional Chinese is mapped to
	205	'TW', Taiwan). Please refer to their respective documentation pages.
	206
	207	=over 4
	208
	209	=item Encode::CN -- Continental China
	210
	211	  Standard      DOS/Win Macintosh                Comment/Reference
	212	  ----------------------------------------------------------------
	213	  euc-cn [1]            MacChineseSimp
	214	  (gbk)         cp936 [2]
	215	  gb12345-raw                              { GB12345 without CES }
	216	  gb2312-raw                                { GB2312 without CES }
	217	  hz
	218	  iso-ir-165
	219	  ----------------------------------------------------------------
	220
	221	[1] GB2312 is aliased to this. See L<Microsoft-related naming mess>
	222	[2] gbk is aliased to this. See L<Microsoft-related naming mess>
	223
	224	=item Encode::JP -- Japan
	225
	226	  Standard      DOS/Win Macintosh                Comment/Reference
	227	  ----------------------------------------------------------------
	228	  euc-jp
	229	  shiftjis      cp932   macJapanese
	230	  7bit-jis
	231	  iso-2022-jp                                            [RFC1468]
	232	  iso-2022-jp-1                                          [RFC2237]
	233	  jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
	234	  jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
	235	  jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
	236	  ----------------------------------------------------------------
	237
	238	=item Encode::KR -- Korea
	239
	240	  Standard      DOS/Win Macintosh                Comment/Reference
	241	  ----------------------------------------------------------------
	242	  euc-kr                MacKorean                        [RFC1557]
	243	                cp949 [1]
	244	  iso-2022-kr                                            [RFC1557]
	245	  johab                                  [KS X 1001:1998, Annex 3]
	246	  ksc5601-raw                              { KSC5601 without CES }
	247	  ----------------------------------------------------------------
	248
	249	[1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
	250	See below.
	251
	252	=item Encode::TW -- Taiwan
	253
	254	  Standard      DOS/Win Macintosh                Comment/Reference
	255	  ----------------------------------------------------------------
	256	  big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
	257	  big5-hkscs
	258	  ----------------------------------------------------------------
	259
	260	=item Encode::HanExtra -- More Chinese via CPAN
	261
	262	Due to size concerns, the additional Chinese encodings below are
	263	distributed separately on CPAN, under the name Encode::HanExtra.
	264
	265	  Standard      DOS/Win Macintosh                Comment/Reference
	266	  ----------------------------------------------------------------
	267	  big5ext                                   CMEX's Big5e Extension
	268	  big5plus                                  CMEX's Big5+ Extension
	269	  cccii         Chinese Character Code for Information Interchange
	270	  euc-tw                                  EUC (Extended Unix Code)
	271	  gb18030                          GBK with Traditional Characters
	272	  ----------------------------------------------------------------
	273
	274	=item Encode::JIS2K -- JIS X 0213 encodings via CPAN
	275
	276	Due to size concerns, additional Japanese encodings below are
	277	distributed separately on CPAN, under the name Encode::JIS2K.
	278
	279	  Standard      DOS/Win Macintosh                Comment/Reference
	280	  ----------------------------------------------------------------
	281	  euc-jisx0213
	282	  shiftjisx0213
	283	  iso-2022-jp-3
	284	  jis0213-1-raw
	285	  jis0213-2-raw
	286	  ----------------------------------------------------------------
	287
	288	=back
	289
	290	=head2 Miscellaneous encodings
	291
	292	=over 4
	293
	294	=item Encode::EBCDIC
	295
	296	See L<perlebcdic> for details.
	297
	298	  ----------------------------------------------------------------
	299	  cp37
	300	  cp500
	301	  cp875
	302	  cp1026
	303	  cp1047
	304	  posix-bc
	305	  ----------------------------------------------------------------
	306
	307	=item Encode::Symbol
	308
	309	For symbols and dingbats.
	310
	311	  ----------------------------------------------------------------
	312	  symbol
	313	  dingbats
	314	  MacDingbats
	315	  AdobeZdingbat
	316	  AdobeSymbol
	317	  ----------------------------------------------------------------
	318
	319	=item Encode::MIME::Header
	320
	321	Strictly speaking, the MIME header encoding documented in RFC 2047 is
	322	more of an encapsulation than an encoding. However, support for it is
	323	imperative in the modern world, so it is supported.
	324
	325	  ----------------------------------------------------------------
	326	  MIME-Header                                            [RFC2047]
	327	  MIME-B                                                 [RFC2047]
	328	  MIME-Q                                                 [RFC2047]
	329	  ----------------------------------------------------------------
	330
	331	=item Encode::Guess
	332
	333	This one is not the name of an encoding but a utility that lets you pick
	334	the most appropriate encoding for your data out of given I<suspects>. See
	335	L<Encode::Guess> for details.
	336
	337	=back
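
The two utility entries in the list above, MIME-Header and Encode::Guess, can be tried out with a sketch like the following. The sample text and the list of suspects are arbitrary choices made for illustration, and the exact form of the generated header may vary between Encode versions, so only the round trip is relied upon here.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Encode::Guess;    # exports guess_encoding()

my $text = "\x{65E5}\x{672C}\x{8A9E}";    # three kanji, "Japanese language"

# MIME-Header: wrap the text for a mail header, then unwrap it again.
my $header    = encode('MIME-Header', $text);
my $roundtrip = decode('MIME-Header', $header);
my $known     = decode('MIME-Header', '=?UTF-8?B?5pel5pys6Kqe?=');

# Encode::Guess: pick an encoding for some bytes out of a list of suspects.
my $bytes   = encode('euc-jp', $text);
my $guessed = guess_encoding($bytes, qw(euc-jp shiftjis 7bit-jis));
print ref($guessed) ? $guessed->name : "no guess: $guessed", "\n";
```

C<guess_encoding> returns an encoding object on success and a plain error string otherwise, which is why the result is checked with C<ref> before use.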
	338
	339	=head1 Unsupported encodings
	340
	341	The following encodings are not supported as yet; some because they
	342	are rarely used, some because of technical difficulties. They may
	343	be supported by external modules via CPAN in the future, however.
	344
	345	=over 4
	346
	347	=item ISO-2022-JP-2 [RFC1554]
	348
	349	Not very popular yet. Needs the Unicode Database or an equivalent to
	350	implement encode(), because it includes JIS X 0208/0212, KSC5601, and
	351	GB2312 simultaneously, whose code points in Unicode overlap. So you
	352	need to look up the database to determine to which character set a given
	353	Unicode character should belong.
	354
	355	=item ISO-2022-CN [RFC1922]
	356
	357	Not very popular. Needs CNS 11643-1 and -2, which are not available in
	358	this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
	359	Autrijus Tang may add support for this encoding in his module in the future.
	360
	361	=item Various HP-UX encodings
	362
	363	The following are unsupported due to the lack of mapping data.
	364
	365	  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
	366	  '15' - japanese15, korean15, and roi15
	367
	368	=item Cyrillic encoding ISO-IR-111
	369
	370	Anton Tagunov doubts its usefulness.
	371
	372	=item ISO-8859-8-I [Hebrew]
	373
	374	None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
	375	MacHebrew are supported only because there were mappings
	376	available at L<http://www.unicode.org/>). Contributions welcome.
	377
	378	=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
	379
	380	Ditto.
	381
	382	=item Vietnamese encoding TCVN
	383
	384	Ditto.
	385
	386	=item Vietnamese encodings VPS
	387
	388	Though Jungshik Shin has reported that Mozilla supports this encoding,
	389	it was too late before 5.8.0 for us to add it. In the future, it
	390	may be available via a separate module. See
	391	L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
	392	and
	393	L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
	394	if you are interested in helping us.
	395
	396	=item Various Mac encodings
	397
	398	The following are unsupported due to the lack of mapping data.
	399
	400	  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
	401	  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
	402	  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
	403	  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
	404	  MacVietnamese
	405
	406	The rest which are already available are based upon the vendor mappings
	407	at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
	408
	409	=item (Mac) Indic encodings
	410
	411	The maps for the following are available at L<http://www.unicode.org/>
	412	but remain unsupported because those encodings need an algorithmic
	413	approach, currently unsupported by F<enc2xs>:
	414
	415	  MacDevanagari
	416	  MacGurmukhi
	417	  MacGujarati
	418
	419	For details, please see C<Unicode mapping issues and notes:> at
	420	L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
	421
	422	I believe this issue is prevalent not only for Mac Indics but also in
	423	other Indic encodings, but the above were the only Indic encoding
	424	maps that I could find at L<http://www.unicode.org/> .
	425
	426	=back
	427
	428	=head1 Encoding vs. Charset -- terminology
	429
	430	We are used to using the terms (character) I<encoding> and I<character
	431	set> interchangeably. But just as confusing the terms byte and
	432	character is dangerous and the terms should be differentiated when
	433	needed, we need to differentiate I<encoding> and I<character set>.
	434
	435	To understand that, here is a description of how we make computers
	436	grok our characters.
	437
	438	=over 4
	439
	440	=item *
	441
	442	First we start with which characters to include. We call this
	443	collection of characters a I<character repertoire>.
	444
	445	=item *
	446
	447	Then we have to give each character a unique ID so your computer can
	448	tell the difference between 'a' and 'A'. This itemized character
	449	repertoire is now a I<character set>.
	450
	451	=item *
	452
	453	If your computer can grok the character set without further
	454	processing, you can go ahead and use it. This is called a I<coded
	455	character set> (CCS) or I<raw character encoding>. ASCII is used this
	456	way for most cases.
	457
	458	=item *
	459
	460	But in many cases, especially multi-byte CJK encodings, you have to
	461	tweak a little more. Your network connection may not accept any data
	462	with the Most Significant Bit set, and your computer may not be able to
	463	tell if a given byte is a whole character or just half of it. So you
	464	have to I<encode> the character set to use it.
	465
	466	A I<character encoding scheme> (CES) determines how to encode a given
	467	character set, or a set of multiple character sets. 7bit ISO-2022 is
	468	an example of a CES. You switch between character sets via I<escape
	469	sequences>.
	470
	471	=back
	472
	473	Technically, or mathematically, speaking, a character set encoded in
	474	such a CES that maps character by character may form a CCS. EUC is such
	475	an example. The CES of EUC is as follows:
	476
	477	=over 4
	478
	479	=item *
	480
	481	Map ASCII unchanged.
	482
	483	=item *
	484
	485	Map a character set that consists of 94**N or 96**N members (that is,
	486	N bytes per character) by adding 0x80 to each byte.
	487
	488	=item *
	489
	490	You can also use 0x8e and 0x8f to indicate that the following sequence of
	491	characters belongs to yet another character set. The value 0x80 is
	492	added to each following byte.
	493
	494	=back
	495
	496	By carefully looking at the encoded byte sequence, you can find that the
	497	byte sequence corresponds to a unique number. In that sense, EUC is a CCS
	498	generated by the CES above from up to four CCSs (complicated?). UTF-8
	499	falls into this category. See L<perlunicode/"UTF-8"> to find out how
	500	UTF-8 maps Unicode to a byte sequence.
	501
	502	You may also have found out by now why 7bit ISO-2022 cannot comprise
	503	a CCS. If you look at a byte sequence \x21\x21, you can't tell if
	504	it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1
	505	so you have no trouble differentiating between "!!" and S<"  ">.
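
The IDEOGRAPHIC SPACE case can be checked with a small sketch. It encodes U+3000 three ways and re-derives the EUC bytes by hand from the raw JIS X 0208 form; the variable names are illustrative, and the bundled Encode::JP module is assumed to be installed.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

my $raw = encode('jis0208-raw', $space);    # bare CCS code point
my $euc = encode('euc-jp',      $space);    # the same bytes with 0x80 added
my $jis = encode('iso-2022-jp', $space);    # raw bytes wrapped in escape sequences

# Apply the EUC rule by hand: add 0x80 to every byte of the raw form.
my $by_hand = join '', map { chr(ord($_) + 0x80) } split //, $raw;

printf "raw=%s euc=%s by_hand=%s jis=%s\n",
    map { unpack 'H*', $_ } $raw, $euc, $by_hand, $jis;
```

In the 7-bit form the raw bytes \x21\x21 only mean IDEOGRAPHIC SPACE because of the escape sequence in front of them, whereas the EUC bytes stand on their own.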
	506
	507	=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
	508
	509	This section tries to classify the supported encodings by their
	510	applicability for information exchange over the Internet and to
	511	choose the most suitable aliases to name them in the context of
	512	such communication.
	513
	514	=over 4
	515
	516	=item *
	517
	518	To (en|de)code encodings marked by C<(**)>, you need
	519	C<Encode::HanExtra>, available from CPAN.
	520
	521	=back
	522
	523	Encoding names
	524
	525	  US-ASCII      UTF-8         ISO-8859-*  KOI8-R
	526	  Shift_JIS     EUC-JP        ISO-2022-JP ISO-2022-JP-1
	527	  EUC-KR        Big5          GB2312
	528
	529	are registered with IANA as preferred MIME names and may
	530	be used over the Internet.
	531
	532	C<Shift_JIS> has been made official by JIS X 0208:1997.
	533	L<Microsoft-related naming mess> gives details.
	534
	535	C<GB2312> is the IANA name for C<EUC-CN>.
	536	See L<Microsoft-related naming mess> for details.
	537
	538	C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
	539	with Encode. See L<Encode::CN> for details.
	540
	541	  EUC-CN
	542	  KOI8-U        [RFC2319]
	543
	544	have not been registered with IANA (as of March 2002) but
	545	seem to be supported by major web browsers.
	546	The IANA name for C<EUC-CN> is C<GB2312>.
	547
	548	  KS_C_5601-1987
	549
	550	is heavily misused.
	551	See L<Microsoft-related naming mess> for details.
	552
	553	C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
	554	with Encode. See L<Encode::KR> for details.
	555
	556	  UTF-16 UTF-16BE UTF-16LE
	557
	558	are IANA-registered C<charset>s. See [RFC 2781] for details.
	559	Jungshik Shin reports that UTF-16 with a BOM is well accepted
	560	by MS IE 5/6 and NS 4/6. Beware however that
	561
	562	=over 4
	563
	564	=item *
	565
	566	C<UTF-16> support in any software you're going to be
	567	using/interoperating with has probably been less tested
	568	than C<UTF-8> support
	569
	570	=item *
	571
	572	C<UTF-8> coded data seamlessly passes traditional
	573	command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
	574	data is likely to cause confusion (with its zero bytes,
	575	for example)
	576
	577	=item *
	578
	579	it is beyond the power of words to describe the way HTML browsers
	580	encode non-C<ASCII> form data. To get a general impression, visit
	581	L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
	582	While encoding of form data has stabilized for C<UTF-8> encoded pages
	583	(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
	584	expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
	585	pages!
	586
	587	=back
	588
	589	The rule of thumb is to use C<UTF-8> unless you know what
	590	you're doing and unless you really benefit from using C<UTF-16>.
	591
	592	  ISO-IR-165    [RFC1345]
	593	  VISCII
	594	  GB 12345
	595	  GB 18030 (**)  (see links below)
	596	  EUC-TW   (**)
	597
	598	are totally valid encodings but not registered at IANA.
	599	The names under which they are listed here are probably the
	600	most widely-known names for these encodings and are recommended
	601	names.
	602
	603	  BIG5PLUS (**)
	604
	605	is a proprietary name.
	606
	607	=head2 Microsoft-related naming mess
	608
	609	Microsoft products misuse the following names:
	610
	611	=over 4
	612
	613	=item KS_C_5601-1987
	614
	615	Microsoft extension to C<EUC-KR>.
	616
	617	Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).
	618
	619	See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
	620	for details.
	621
	622	Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
	623	misuse. I<Raw> C<KS_C_5601-1987> encoding is available as
	624	C<ksc5601-raw>.
	625
	626	See L<Encode::KR> for details.
	627
	628	=item GB2312
	629
	630	Microsoft extension to C<EUC-CN>.
	631
	632	Proper names: C<CP936>, C<GBK>.
	633
	634	C<GB2312> has been registered in the C<EUC-CN> meaning at
	635	IANA. This has partially repaired the situation: Microsoft's
	636	C<GB2312> has become a superset of the official C<GB2312>.
	637
	638	Encode aliases C<GB2312> to C<euc-cn> in full agreement with
	639	IANA registration. C<cp936> is supported separately.
	640	I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
	641
	642	See L<Encode::CN> for details.
	643
	644	=item Big5
	645
	646	Microsoft extension to C<Big5>.
	647
	648	Proper name: C<CP950>.
	649
	650	Encode separately supports C<Big5> and C<cp950>.
	651
	652	=item Shift_JIS
	653
	654	Microsoft's understanding of C<Shift_JIS>.
	655
	656	JIS has not endorsed the full Microsoft standard, however.
	657	The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
	658	character sets, while Microsoft has always used C<Shift_JIS>
	659	to encode a wider character repertoire. See C<IANA> registration for
	660	C<Windows-31J>.
	661
	662	As a historical predecessor, Microsoft's variant
	663	probably has more rights for the name, though it may be objected
	664	that Microsoft shouldn't have used JIS as part of the name
	665	in the first place.
	666
	667	Unambiguous name: C<CP932>. C<IANA> name (also used by Mozilla, and
	668	provided as an alias by Encode): C<Windows-31J>.
	669
	670	Encode separately supports C<Shift_JIS> and C<cp932>.
	671
	672	=back
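
To see how Encode treats the names discussed in this section, a sketch along these lines prints the canonical name each one resolves to. It assumes the default alias table of a stock Encode installation with the bundled CJK modules; the selection of names is taken from the items above.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Names as they appear "in the wild" and where Encode sends them.
my %resolved;
for my $name (qw(KS_C_5601-1987 GB2312 GBK Big5 Shift_JIS Windows-31J)) {
    my $enc = find_encoding($name);
    $resolved{$name} = $enc ? $enc->name : '(not implemented)';
    printf "%-15s => %s\n", $name, $resolved{$name};
}
```

When the Microsoft variant is what the data really uses, asking for C<cp949>, C<cp936>, C<cp950> or C<cp932> by name avoids any ambiguity.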
	673
	674	=head1 Glossary
	675
	676	=over 4
	677
	678	=item character repertoire
	679
	680	A collection of unique characters. A I<character> set in the strictest
	681	sense. At this stage, characters are not numbered.
	682
	683	=item coded character set (CCS)
	684
	685	A character set that is mapped in a way computers can use directly.
	686	Many character encodings, including EUC, fall in this category.
	687
	688	=item character encoding scheme (CES)
	689
	690	An algorithm to map a character set to a byte sequence. You don't
	691	have to be able to tell which character set a given byte sequence
	692	belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
	693	example of being both a CCS and CES.
	694
	695	=item charset (in MIME context)
	696
	697	has long been used in the meaning of C<encoding>, CES.
	698
	699	While the word combination C<character set> has lost this meaning
	700	in MIME context since [RFC 2130], the C<charset> abbreviation has
	701	retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:
	702
	703	  This document uses the term "charset" to mean a set of rules for
	704	  mapping from a sequence of octets to a sequence of characters, such
	705	  as the combination of a coded character set and a character encoding
	706	  scheme; this is also what is used as an identifier in MIME "charset="
	707	  parameters, and registered in the IANA charset registry ... (Note
	708	  that this is NOT a term used by other standards bodies, such as ISO).
	709	  [RFC 2277]
	710
	711	=item EUC
	712
	713	Extended Unix Code. See ISO-2022.
	714
	715	=item ISO-2022
	716
	717	A CES that was carefully designed to coexist with ASCII. There is a
	718	7-bit version and an 8-bit version.
	719
	720	The 7 bit version switches character set via escape sequence so it
	721	cannot form a CCS. Since this is more difficult to handle in programs
	722	than the 8 bit version, the 7 bit version is not very popular except for
	723	iso-2022-jp, the I<de facto> standard CES for e-mails.
	724
	725	The 8 bit version can form a CCS. EUC and ISO-8859 are two examples
	726	thereof. Pre-5.6 perl could use them as string literals.
	727
	728	=item UCS
	729
	730	Short for I<Universal Character Set>. When you say just UCS, it means
	731	I<Unicode>.
	732
	733	=item UCS-2
	734
	735	ISO/IEC 10646 encoding form: Universal Character Set coded in two
	736	octets.
	737
	738	=item Unicode
	739
	740	A character set that aims to include all character repertoires of the
	741	world. Many character sets in various national as well as industrial
	742	standards have become, in a way, just subsets of Unicode.
	743
	744	=item UTF
	745
	746	Short for I<Unicode Transformation Format>. Determines how to map a
	747	Unicode character into a byte sequence.
	748
	749	=item UTF-16
	750
	751	A UTF in 16-bit encoding. Can either be in big endian or little
	752	endian. The big endian version is called UTF-16BE (equal to UCS-2 +
	753	surrogate support) and the little endian version is called UTF-16LE.
	754
	755	=back
	756
	757	=head1 See Also
	758
	759	L<Encode>,
	760	L<Encode::Byte>,
	761	L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
	762	L<Encode::EBCDIC>, L<Encode::Symbol>,
	763	L<Encode::MIME::Header>, L<Encode::Guess>
	764
	765	=head1 References
	766
	767	=over 4
	768
	769	=item ECMA
	770
	771	European Computer Manufacturers Association
	772	L<http://www.ecma.ch>
	773
	774	=over 4
	775
	776	=item ECMA-035 (eq C<ISO-2022>)
	777
	778	L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
	779
	780	The specification of ISO-2022 is available from the link above.
	781
	782	=back
	783
	784	=item IANA
	785
	786	Internet Assigned Numbers Authority
	787	L<http://www.iana.org/>
	788
	789	=over 4
	790
	791	=item Assigned Charset Names by IANA
	792
	793	L<http://www.iana.org/assignments/character-sets>
	794
	795	Most of the C<canonical names> in Encode derive from this list
	796	so you can directly apply the string you have extracted from MIME
	797	header of mails and web pages.
	798
	799	=back
	800
	801	=item ISO
	802
	803	International Organization for Standardization
	804	L<http://www.iso.ch/>
	805
	806	=item RFC
	807
	808	Request For Comments -- need I say more?
	809	L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>,
	810	L<http://www.faqs.org/rfcs/>
	811
	812	=item UC
	813
	814	Unicode Consortium
	815	L<http://www.unicode.org/>
	816
	817	=over 4
	818
	819	=item Unicode Glossary
	820
	821	L<http://www.unicode.org/glossary/>
	822
	823	The glossary of this document is based upon this site.
	824
	825	=back
	826
	827	=back
	828
	829	=head2 Other Notable Sites
	830
	831	=over 4
	832
	833	=item czyborra.com
	834
	835	L<http://czyborra.com/>
	836
	837	Contains a lot of useful information, especially gory details of ISO
	838	vs. vendor mappings.
	839
	840	=item CJK.inf
	841
	842	L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
	843
	844	Somewhat obsolete (last update in 1996), but still useful. Also try
	845
	846	L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
	847
	848	You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.
	849
	850	=item Jungshik Shin's Hangul FAQ
	851
	852	L<http://jshin.net/faq>
	853
	854	And especially its subject 8.
	855
	856	L<http://jshin.net/faq/qa8.html>
	857
	858	A comprehensive overview of the Korean (C<KS *>) standards.
	859
	860	=item debian.org: "Introduction to i18n"
	861
	862	A brief description for most of the mentioned CJK encodings is
	863	contained in
	864	L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
	865
	866	=back
	867
	868	=head2 Offline sources
	869
	870	=over 4
	871
	872	=item C<CJKV Information Processing> by Ken Lunde
	873
	874	CJKV Information Processing
	875	1999 O'Reilly & Associates, ISBN : 1-56592-224-7
	876
	877	The modern successor of C<CJK.inf>.
	878
	879	Features comprehensive coverage of CJKV character sets and
	880	encodings along with many other issues faced by anyone trying
	881	to better support CJKV languages/scripts in all the areas of
	882	information processing.
	883
	884	To purchase this book, visit
	885	L<http://www.oreilly.com/catalog/cjkvinfo/>
	886	or your favourite bookstore.
	887
	888	=back
	889
	890	=cut