.\" Automatically generated by Pod::Man v1.37, Pod::Parser v1.32
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sh \" Subsection heading
.br
.if t .Sp
.ne 5
.PP
\fB\\$1\fR
.PP
..
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings. \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote. | will give a
.\" real vertical bar. \*(C+ will give a nicer C++. Capital omega is used to
.\" do unbreakable dashes and therefore won't be available. \*(C` and \*(C'
.\" expand to `' in nroff, nothing in troff, for use with C<>.
.tr \(*W-|\(bv\*(Tr
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
. ds -- \(*W-
. ds PI pi
. if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
. if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
. ds L" ""
. ds R" ""
. ds C` ""
. ds C' ""
'br\}
.el\{\
. ds -- \|\(em\|
. ds PI \(*p
. ds L" ``
. ds R" ''
'br\}
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.Sh), items (.Ip), and index
.\" entries marked with X<> in POD. Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.if \nF \{\
. de IX
. tm Index:\\$1\t\\n%\t"\\$2"
..
. nr % 0
. rr F
.\}
.\"
.\" For nroff, turn off justification. Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.hy 0
.if n .na
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear. Run. Save yourself. No user-serviceable parts.
. \" fudge factors for nroff and troff
.if n \{\
. ds #H 0
. ds #V .8m
. ds #F .3m
. ds #[ \f1
. ds #] \fP
.\}
.if t \{\
. ds #H ((1u-(\\\\n(.fu%2u))*.13m)
. ds #V .6m
. ds #F 0
. ds #[ \&
. ds #] \&
.\}
. \" simple accents for nroff and troff
.if n \{\
. ds ' \&
. ds ` \&
. ds ^ \&
. ds , \&
. ds ~ ~
. ds /
.\}
.if t \{\
. ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
. ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
. ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
. ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
. ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
. ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
. \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
. \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
. \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
. ds : e
. ds 8 ss
. ds o a
. ds d- d\h'-1'\(ga
. ds D- D\h'-1'\(hy
. ds th \o'bp'
. ds Th \o'LP'
. ds ae ae
. ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "encoding 3"
.TH encoding 3 "2001-09-21" "perl v5.8.8" "Perl Programmers Reference Guide"
.SH "NAME"
encoding \- allows you to write your script in non\-ascii or non\-utf8
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 2
\& use encoding "greek"; # Perl like Greek to you?
\& use encoding "euc-jp"; # Jperl!
.Ve
.PP
.Vb 1
\& # or you can even do this if your shell supports your native encoding
.Ve
.PP
.Vb 2
\& perl -Mencoding=latin2 -e '...' # Feeling centrally European?
\& perl -Mencoding=euc-kr -e '...' # Or Korean?
.Ve
.PP
.Vb 1
\& # more control
.Ve
.PP
.Vb 2
\& # A simple euc-cn => utf-8 converter
\& use encoding "euc-cn", STDOUT => "utf8"; while(<>){print};
.Ve
.PP
.Vb 2
\& # "no encoding;" supported (but not scoped!)
\& no encoding;
.Ve
.PP
.Vb 3
\& # an alternate way, Filter
\& use encoding "euc-jp", Filter=>1;
\& # now you can use kanji identifiers -- in euc-jp!
.Ve
.PP
.Vb 7
\& # switch on locale -
\& # note that unless you have complete control over every environment
\& # the application will ever run in, you should NOT use the feature of
\& # the encoding pragma that lets you write your script in any recognized
\& # encoding, because changing locale settings will wreck the script;
\& # you can of course still use the other features of the pragma.
\& use encoding ':locale';
.Ve
.SH "ABSTRACT"
.IX Header "ABSTRACT"
Let's start with a bit of history: Perl 5.6.0 introduced Unicode
support. You could apply \f(CW\*(C`substr()\*(C'\fR and regexes even to complex \s-1CJK\s0
characters \*(-- so long as the script was written in \s-1UTF\-8\s0. But back
then, text editors that supported \s-1UTF\-8\s0 were still rare and many users
instead chose to write scripts in legacy encodings, giving up a whole
new feature of Perl 5.6.
.PP
Rewind to the future: starting from perl 5.8.0 with the \fBencoding\fR
pragma, you can write your script in any encoding you like (so long
as the \f(CW\*(C`Encode\*(C'\fR module supports it) and still enjoy Unicode support.
This pragma achieves that by doing the following:
.IP "\(bu" 4
Internally converts all literals (\f(CW\*(C`q//, qq//, qr//, qw//, qx//\*(C'\fR) from
the encoding specified to utf8. In Perl 5.8.1 and later, literals in
\&\f(CW\*(C`tr///\*(C'\fR and the \f(CW\*(C`DATA\*(C'\fR pseudo-filehandle are also converted.
.IP "\(bu" 4
Changes the PerlIO layers of \f(CW\*(C`STDIN\*(C'\fR and \f(CW\*(C`STDOUT\*(C'\fR to the encoding
specified.
.Sh "Literal Conversions"
.IX Subsection "Literal Conversions"
You can write code in EUC-JP as follows:
.PP
.Vb 3
\& my $Rakuda = "\exF1\exD1\exF1\exCC"; # Camel in Kanji
\&              #<-char-><-char->   # 4 octets
\& s/\ebCamel\eb/$Rakuda/;
.Ve
.PP
With \f(CW\*(C`use encoding "euc\-jp"\*(C'\fR in effect, this is equivalent to
the following code in \s-1UTF\-8:\s0
.PP
.Vb 2
\& my $Rakuda = "\ex{99F1}\ex{99DD}"; # two Unicode Characters
\& s/\ebCamel\eb/$Rakuda/;
.Ve
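.PP
The equivalence above can be checked without the pragma. The following is a
minimal sketch, assuming only the core \f(CWEncode\fR module and its
\f(CWdecode\fR function; the variable names are illustrative and it does not
show how the pragma is implemented.

```perl
use strict;
use warnings;
use Encode qw(decode);

# EUC-JP octets for the two kanji of "rakuda" (camel), as in the example above.
my $octets = "\xF1\xD1\xF1\xCC";

# Decode the legacy octets into a Perl character string.
my $rakuda = decode("euc-jp", $octets);

# List the code points so they can be compared with \x{99F1}\x{99DD}.
my @code_points = map { sprintf "U+%04X", ord } split //, $rakuda;
print "@code_points\n";
```

.PP
Four octets go in and two characters come out, which is the conversion the
pragma applies to string literals.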
.ie n .Sh "PerlIO layers for ""STD(IN|OUT)"""
.el .Sh "PerlIO layers for \f(CWSTD(IN|OUT)\fP"
.IX Subsection "PerlIO layers for STD(IN|OUT)"
The \fBencoding\fR pragma also modifies the filehandle layers of
\&\s-1STDIN\s0 and \s-1STDOUT\s0 to the specified encoding. Therefore,
.PP
.Vb 5
\& use encoding "euc-jp";
\& my $message = "Camel is the symbol of perl.\en";
\& my $Rakuda = "\exF1\exD1\exF1\exCC"; # Camel in Kanji
\& $message =~ s/\ebCamel\eb/$Rakuda/;
\& print $message;
.Ve
.PP
will print \*(L"\exF1\exD1\exF1\exCC is the symbol of perl.\en\*(R",
not \*(L"\ex{99F1}\ex{99DD} is the symbol of perl.\en\*(R".
.PP
You can override this by giving extra arguments; see below.
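.PP
The effect of an output layer can be observed in isolation. This is a rough
sketch that assumes an in-memory filehandle opened with an
\f(CW:encoding(euc\-jp)\fR layer stands in for \s-1STDOUT\s0; the handle and
variable names are illustrative, not part of the pragma.

```perl
use strict;
use warnings;

# Write to an in-memory buffer through an encoding layer, which is roughly
# what the pragma arranges for STDOUT.
my $buffer = '';
open(my $out, '>:encoding(euc-jp)', \$buffer)
    or die "cannot open in-memory handle: $!";

my $message = "Camel is the symbol of perl.\n";
my $rakuda  = "\x{99F1}\x{99DD}";    # the same two kanji, as characters
$message =~ s/\bCamel\b/$rakuda/;

print {$out} $message;
close($out) or die "close failed: $!";

# $buffer now holds EUC-JP octets, not Perl's internal representation.
printf "%d octets written\n", length $buffer;
```

.PP
For a filehandle that is already open, \f(CWbinmode\fR with the same layer
string gives the same result.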
.Sh "Implicit upgrading for byte strings"
.IX Subsection "Implicit upgrading for byte strings"
By default, if strings operating under byte semantics and strings
with Unicode character data are concatenated, the new string will
be created by decoding the byte strings as \fI\s-1ISO\s0 8859\-1 (Latin\-1)\fR.
.PP
The \fBencoding\fR pragma changes this to use the specified encoding
instead. For example:
.PP
.Vb 5
\& use encoding 'utf8';
\& my $string = chr(20000); # a Unicode string
\& utf8::encode($string); # now it's a UTF-8 encoded byte string
\& # concatenate with another Unicode string
\& print length($string . chr(20000));
.Ve
.PP
This will print \f(CW2\fR, because \f(CW$string\fR is upgraded as \s-1UTF\-8\s0. Without
\&\f(CW\*(C`use encoding 'utf8';\*(C'\fR, it will print \f(CW4\fR instead, since \f(CW$string\fR
is three octets when interpreted as Latin\-1.
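.PP
Both lengths can be reproduced without the pragma. In this sketch an explicit
\f(CWEncode::decode\fR call stands in for the upgrade that the pragma performs;
that substitution is an assumption made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = chr(20000);    # one character, U+4E20
utf8::encode($bytes);      # now three UTF-8 octets

# Default behaviour: the octets are upgraded as Latin-1, one character each.
my $default_length = length($bytes . chr(20000));

# Decoding explicitly as UTF-8 gives the length the pragma would produce.
my $decoded_length = length(decode("UTF-8", $bytes) . chr(20000));

print "$default_length $decoded_length\n";    # prints "4 2"
```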
.SH "FEATURES THAT REQUIRE 5.8.1"
.IX Header "FEATURES THAT REQUIRE 5.8.1"
Some of the features offered by this pragma require perl 5.8.1. Most
of these were implemented by Inaba Hiroto. All other features and changes
work in 5.8.0.
.ie n .IP """\s-1NON\-EUC\s0"" doublebyte encodings" 4
.el .IP "``\s-1NON\-EUC\s0'' doublebyte encodings" 4
.IX Item "NON-EUC doublebyte encodings"
Because perl needs to parse the script before applying this pragma,
encodings such as Shift_JIS and Big\-5, which may contain '\e' (\s-1BACKSLASH\s0;
\&\ex5c) in the second byte, fail because the second byte may
accidentally escape the quoting character that follows. Perl 5.8.1
and later fix this problem.
.IP "tr//" 4
.IX Item "tr//"
\&\f(CW\*(C`tr//\*(C'\fR was overlooked by Perl 5 porters when they released perl 5.8.0.
See the section below for details.
.IP "\s-1DATA\s0 pseudo-filehandle" 4
.IX Item "DATA pseudo-filehandle"
Another feature that was overlooked was \f(CW\*(C`DATA\*(C'\fR.
.SH "USAGE"
.IX Header "USAGE"
.IP "use encoding [\fI\s-1ENCNAME\s0\fR] ;" 4
.IX Item "use encoding [ENCNAME] ;"
Sets the script encoding to \fI\s-1ENCNAME\s0\fR. Unless ${^UNICODE}
exists and is non\-zero, the PerlIO layers of \s-1STDIN\s0 and \s-1STDOUT\s0 are also set to
":encoding(\fI\s-1ENCNAME\s0\fR)".
.Sp
Note that \s-1STDERR\s0 \s-1WILL\s0 \s-1NOT\s0 be changed.
.Sp
Also note that non-STD file handles remain unaffected. Use \f(CW\*(C`use
open\*(C'\fR or \f(CW\*(C`binmode\*(C'\fR to change layers of those.
.Sp
If no encoding is specified, the environment variable \s-1PERL_ENCODING\s0
is consulted. If no encoding can be found, the error \f(CW\*(C`Unknown encoding
\&'\f(CI\s-1ENCNAME\s0\f(CW'\*(C'\fR will be thrown.
.IP "use encoding \fI\s-1ENCNAME\s0\fR [ \s-1STDIN\s0 => \fI\s-1ENCNAME_IN\s0\fR ...] ;" 4
.IX Item "use encoding ENCNAME [ STDIN => ENCNAME_IN ...] ;"
You can also individually set encodings of \s-1STDIN\s0 and \s-1STDOUT\s0 via the
\&\f(CW\*(C`STDIN => \f(CI\s-1ENCNAME\s0\f(CW\*(C'\fR form. In this case, you cannot omit the
first \fI\s-1ENCNAME\s0\fR. \f(CW\*(C`STDIN => undef\*(C'\fR turns the \s-1IO\s0 transcoding
completely off.
.Sp
When ${^UNICODE} exists and is non\-zero, these options will be completely
ignored. ${^UNICODE} is a variable introduced in perl 5.8.1. See
\&\*(L"${^UNICODE}\*(R" in perlvar and \*(L"\-C\*(R" in perlrun for
details (perl 5.8.1 and later).
.IP "use encoding \fI\s-1ENCNAME\s0\fR Filter=>1;" 4
.IX Item "use encoding ENCNAME Filter=>1;"
This turns the encoding pragma into a source filter. While the
default approach just decodes interpolated literals (in \fIqq()\fR and
\&\fIqr()\fR), this will apply a source filter to the entire source code. See
\&\*(L"The Filter Option\*(R" below for details.
.IP "no encoding;" 4
.IX Item "no encoding;"
Unsets the script encoding. The layers of \s-1STDIN\s0 and \s-1STDOUT\s0 are
reset to \*(L":raw\*(R" (the default unprocessed raw stream of bytes).
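.PP
Since an unrecognized name makes the pragma fail, it can help to check a name
beforehand. The sketch below uses \f(CWEncode::find_encoding\fR, which returns
an undefined value for names it cannot resolve; the helper
\f(CWencoding_is_known\fR is hypothetical and not part of the pragma.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: report whether Encode recognises an encoding name.
sub encoding_is_known {
    my ($name) = @_;
    return defined(find_encoding($name)) ? 1 : 0;
}

# Try a canonical name, an alias, and a name that should not resolve.
for my $name ("euc-jp", "greek", "no-such-encoding") {
    printf "%-18s %s\n", $name,
        encoding_is_known($name) ? "known" : "unknown";
}
```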
.SH "The Filter Option"
.IX Header "The Filter Option"
The magic of \f(CW\*(C`use encoding\*(C'\fR is not applied to the names of
identifiers. In order to make \f(CW\*(C`${"\ex{4eba}"}++\*(C'\fR ($human++, where human
is a single Han ideograph) work, you still need to write your script
in \s-1UTF\-8\s0 \*(-- or use a source filter. That's what 'Filter=>1' does.
.PP
What does this mean? Your source code behaves as if it were written in
\&\s-1UTF\-8\s0 with 'use utf8' in effect. So even if your editor only supports
Shift_JIS, for example, you can still try the examples in Chapter 15 of
\&\f(CW\*(C`Programming Perl, 3rd Ed.\*(C'\fR. For instance, you can use \s-1UTF\-8\s0
identifiers.
.PP
This option is significantly slower and (as of this writing) non-ASCII
identifiers are not very stable \s-1WITHOUT\s0 this option and with the
source code written in \s-1UTF\-8\s0.
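.PP
The symbolic-reference form mentioned above can be tried on its own. This is
a small sketch that assumes a package variable in \f(CWmain\fR and needs
neither the pragma nor a source filter; \f(CW$name\fR and \f(CW$count\fR are
illustrative names.

```perl
use strict;
use warnings;

# Build the variable name from a code point so this file stays pure ASCII.
my $name = "main::\x{4eba}";    # U+4EBA, the Han ideograph for "human"

{
    no strict 'refs';           # symbolic references are needed here
    ${$name}++;                 # same effect as ${"\x{4eba}"}++ in main
}

# Read the value back through the same symbolic reference.
my $count = do { no strict 'refs'; ${$name} };
print "count = $count\n";
```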
.Sh "Filter-related changes at Encode version 1.87"
.IX Subsection "Filter-related changes at Encode version 1.87"
.IP "\(bu" 4
The Filter option now sets \s-1STDIN\s0 and \s-1STDOUT\s0 like non-filter options.
\&\f(CW\*(C`STDIN=>\f(CI\s-1ENCODING\s0\f(CW\*(C'\fR and \f(CW\*(C`STDOUT=>\f(CI\s-1ENCODING\s0\f(CW\*(C'\fR also work as in the
non-filter version.
.IP "\(bu" 4
\&\f(CW\*(C`use utf8\*(C'\fR is implicitly declared so you no longer have to \f(CW\*(C`use
utf8\*(C'\fR to \f(CW\*(C`${"\ex{4eba}"}++\*(C'\fR.
.SH "CAVEATS"
.IX Header "CAVEATS"
.Sh "\s-1NOT\s0 \s-1SCOPED\s0"
.IX Subsection "NOT SCOPED"
The pragma is per script, not a per-block lexical. Only the last
\&\f(CW\*(C`use encoding\*(C'\fR or \f(CW\*(C`no encoding\*(C'\fR matters, and it affects
\&\fBthe whole script\fR. However, the \fBno encoding\fR pragma is supported and
\&\fBuse encoding\fR can appear as many times as you want in a given script.
Using this pragma more than once is discouraged.
.PP
For the same reason, the use of this pragma inside modules is also
discouraged (though not as strongly as the case above;
see below).
.PP
If you still have to write a module with this pragma, be very careful
of the load order. See the code below:
.PP
.Vb 5
\& # called module
\& package Module_IN_BAR;
\& use encoding "bar";
\& # stuff in "bar" encoding here
\& 1;
.Ve
.PP
.Vb 4
\& # caller script
\& use encoding "foo";
\& use Module_IN_BAR;
\& # surprise! use encoding "bar" is in effect.
.Ve
.PP
The best way to avoid this oddity is to use this pragma \s-1RIGHT\s0 \s-1AFTER\s0
other modules are loaded, i.e.
.PP
.Vb 2
\& use Module_IN_BAR;
\& use encoding "foo";
.Ve
.Sh "\s-1DO\s0 \s-1NOT\s0 \s-1MIX\s0 \s-1MULTIPLE\s0 \s-1ENCODINGS\s0"
.IX Subsection "DO NOT MIX MULTIPLE ENCODINGS"
Notice that only literals (string or regular expression) having only
legacy code points are affected: if you mix data like this
.PP
.Vb 1
\& \exDF\ex{100}
.Ve
.PP
the data is assumed to be in (Latin 1 and) Unicode, not in your native
encoding. In other words, this will match in \*(L"greek\*(R":
.PP
.Vb 1
\& "\exDF" =~ /\ex{3af}/
.Ve
.PP
but this will not
.PP
.Vb 1
\& "\exDF\ex{100}" =~ /\ex{3af}\ex{100}/
.Ve
.PP
since the \f(CW\*(C`\exDF\*(C'\fR (\s-1ISO\s0 8859\-7 \s-1GREEK\s0 \s-1SMALL\s0 \s-1LETTER\s0 \s-1IOTA\s0 \s-1WITH\s0 \s-1TONOS\s0) on
the left will \fBnot\fR be upgraded to \f(CW\*(C`\ex{3af}\*(C'\fR (Unicode \s-1GREEK\s0 \s-1SMALL\s0
\&\s-1LETTER\s0 \s-1IOTA\s0 \s-1WITH\s0 \s-1TONOS\s0) because of the \f(CW\*(C`\ex{100}\*(C'\fR that follows it in
the same string. You
should not be mixing your legacy data and Unicode in the same string.
.PP
This pragma also affects encoding of the 0x80..0xFF code point range:
normally characters in that range are left as eight-bit bytes (unless
they are combined with characters with code points 0x100 or larger,
in which case all characters need to become \s-1UTF\-8\s0 encoded), but if
the \f(CW\*(C`encoding\*(C'\fR pragma is present, even the 0x80..0xFF range always
gets \s-1UTF\-8\s0 encoded.
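.PP
The default treatment of the 0x80..0xFF range can be observed directly. The
following sketch runs without the pragma; \f(CWEncode::decode\fR and
\f(CWutf8::is_utf8\fR are used only to make the difference visible, and the
variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $legacy = "\xDF";    # one octet, no pragma in effect

# Default: the octet is treated as Latin-1 when joined with a wide character.
my $mixed = $legacy . "\x{100}";
printf "mixed:   U+%04X\n", ord($mixed);

# Explicit decoding as ISO 8859-7 gives the Greek letter instead.
my $greek = decode("iso-8859-7", $legacy) . "\x{100}";
printf "decoded: U+%04X\n", ord($greek);

# The internal UTF-8 flag shows which strings have been upgraded.
printf "flag before: %d, after: %d\n",
    utf8::is_utf8($legacy) ? 1 : 0, utf8::is_utf8($mixed) ? 1 : 0;
```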
.PP
After all, the best thing about this pragma is that you don't have to
resort to \ex{....} just to spell your name in a native encoding.
So feel free to put your strings in your encoding in quotes and
regexes.
.Sh "tr/// with ranges"
.IX Subsection "tr/// with ranges"
The \fBencoding\fR pragma works by decoding string literals in
\&\f(CW\*(C`q//, qq//, qr//, qw//, qx//\*(C'\fR and so forth. In perl 5.8.0, this
does not apply to \f(CW\*(C`tr///\*(C'\fR. Therefore,
.PP
.Vb 4
\& use encoding 'euc-jp';
\& #....
\& $kana =~ tr/\exA4\exA1-\exA4\exF3/\exA5\exA1-\exA5\exF3/;
\& #           -------- -------- -------- --------
.Ve
.PP
does not work like
.PP
.Vb 1
\& $kana =~ tr/\ex{3041}-\ex{3093}/\ex{30a1}-\ex{30f3}/;
.Ve
.IP "Legend of characters above" 4
.IX Item "Legend of characters above"
.Vb 6
\& utf8     euc-jp   charnames::viacode()
\& -----------------------------------------
\& \ex{3041} \exA4\exA1 HIRAGANA LETTER SMALL A
\& \ex{3093} \exA4\exF3 HIRAGANA LETTER N
\& \ex{30a1} \exA5\exA1 KATAKANA LETTER SMALL A
\& \ex{30f3} \exA5\exF3 KATAKANA LETTER N
.Ve
.PP
This counterintuitive behavior has been fixed in perl 5.8.1.
.PP
\fIworkaround to tr///;\fR
.IX Subsection "workaround to tr///;"
.PP
In perl 5.8.0, you can work around it as follows:
.PP
.Vb 3
\& use encoding 'euc-jp';
\& # ....
\& eval qq{ \e$kana =~ tr/\exA4\exA1-\exA4\exF3/\exA5\exA1-\exA5\exF3/ };
.Ve
.PP
Note that the \f(CW\*(C`tr//\*(C'\fR expression is surrounded by \f(CW\*(C`qq{}\*(C'\fR. The idea
is the same as the classic idiom that makes \f(CW\*(C`tr///\*(C'\fR 'interpolate'.
.PP
.Vb 2
\& tr/$from/$to/; # wrong!
\& eval qq{ tr/$from/$to/ }; # workaround.
.Ve
.PP
However, under the \fBencoding\fR pragma even \f(CW\*(C`q//\*(C'\fR is affected, so
\&\f(CW\*(C`tr///\*(C'\fR not being decoded was clearly contrary to the intent of the Perl 5
Porters. This has been fixed in Perl 5.8.1 and later.
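.PP
Both ideas from this section can be tried without the pragma. In this sketch
the octets are decoded explicitly with \f(CWEncode::decode\fR before
\f(CWtr///\fR sees them, and the eval idiom is shown on plain \s-1ASCII\s0
ranges; the sample data and variable names are assumptions made for
illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hiragana "a" and "n" given as EUC-JP octets, decoded to characters first.
my $kana = decode("euc-jp", "\xA4\xA2\xA4\xF3");

# With character data, tr/// ranges can use code points directly.
(my $katakana = $kana) =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/;

# The classic eval idiom: build the tr/// at run time from variables.
my ($from, $to) = ("a-c", "A-C");
my $text = "abcabc";
eval "\$text =~ tr/$from/$to/; 1" or die "tr failed: $@";

printf "U+%04X U+%04X %s\n", (map { ord } split //, $katakana), $text;
```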
.SH "EXAMPLE \- Greekperl"
.IX Header "EXAMPLE - Greekperl"
.Vb 1
\& use encoding "iso 8859-7";
.Ve
.PP
.Vb 1
\& # \exDF in ISO 8859-7 (Greek) is \ex{3af} in Unicode.
.Ve
.PP
.Vb 2
\& $a = "\exDF";
\& $b = "\ex{100}";
.Ve
.PP
.Vb 1
\& printf "%#x\en", ord($a); # will print 0x3af, not 0xdf
.Ve
.PP
.Vb 1
\& $c = $a . $b;
.Ve
.PP
.Vb 1
\& # $c will be "\ex{3af}\ex{100}", not "\ex{df}\ex{100}".
.Ve
.PP
.Vb 1
\& # chr() is affected, and ...
.Ve
.PP
.Vb 1
\& print "mega\en" if ord(chr(0xdf)) == 0x3af;
.Ve
.PP
.Vb 1
\& # ... ord() is affected by the encoding pragma ...
.Ve
.PP
.Vb 1
\& print "tera\en" if ord(pack("C", 0xdf)) == 0x3af;
.Ve
.PP
.Vb 1
\& # ... as are eq and cmp ...
.Ve
.PP
.Vb 2
\& print "peta\en" if "\ex{3af}" eq pack("C", 0xdf);
\& print "exa\en" if ("\ex{3af}" cmp pack("C", 0xdf)) == 0;
.Ve
.PP
.Vb 2
\& # ... but pack/unpack C are not affected, in case you still
\& # want to go back to your native encoding
.Ve
.PP
.Vb 1
\& print "zetta\en" if unpack("C", (pack("C", 0xdf))) == 0xdf;
.Ve
.SH "KNOWN PROBLEMS"
.IX Header "KNOWN PROBLEMS"
.IP "literals in regex that are longer than 127 bytes" 4
.IX Item "literals in regex that are longer than 127 bytes"
For native multibyte encodings (either fixed or variable length),
the current implementation of the regular expressions may introduce
recoding errors for regular expression literals longer than 127 bytes.
.IP "\s-1EBCDIC\s0" 4
.IX Item "EBCDIC"
The encoding pragma is not supported on \s-1EBCDIC\s0 platforms.
(Porters who are willing and able to remove this limitation are
welcome.)
.IP "format" 4
.IX Item "format"
This pragma doesn't work well with format because PerlIO does not
get along very well with it. When a format contains non-ascii
characters, it prints garbled output or triggers \*(L"wide character\*(R" warnings.
To see the problem, try the code below.
.Sp
.Vb 11
\& # Save this one in utf8
\& # replace *non-ascii* with a non-ascii string
\& my $camel;
\& format STDOUT =
\& *non-ascii*@>>>>>>>
\& $camel
\& .
\& $camel = "*non-ascii*";
\& binmode(STDOUT=>':encoding(utf8)'); # bang!
\& write; # funny
\& print $camel, "\en"; # fine
.Ve
.Sp
Without binmode, \fIwrite()\fR happens to work, but \fIprint()\fR
fails instead.
.Sp
At any rate, the very use of format is questionable when it comes to
unicode characters since you have to consider such things as character
width (e.g. double-width for ideographs) and directions (e.g. \s-1BIDI\s0 for
Arabic and Hebrew).
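.PP
The \*(L"wide character\*(R" warning mentioned for format can be reproduced
with \fIprint()\fR alone. This sketch assumes in-memory filehandles in place of
\s-1STDOUT\s0 and a local warning handler to capture the message; it does not
exercise format or \fIwrite()\fR itself.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# No encoding layer: printing a character above 0xFF triggers a warning.
my $raw = '';
open(my $plain, '>', \$raw) or die "open failed: $!";
print {$plain} "\x{99F1}\n";
close($plain) or die "close failed: $!";

# With an encoding layer the same print is silent.
my $encoded = '';
open(my $layered, '>:encoding(UTF-8)', \$encoded) or die "open failed: $!";
print {$layered} "\x{99F1}\n";
close($layered) or die "close failed: $!";

print scalar(@warnings), " warning(s) captured\n";
```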
.Sh "The Logic of :locale"
.IX Subsection "The Logic of :locale"
The logic of \f(CW\*(C`:locale\*(C'\fR is as follows:
.IP "1." 4
If the platform supports the langinfo(\s-1CODESET\s0) interface, the codeset
returned is used as the default encoding for the open pragma.
.IP "2." 4
If 1. didn't work but we are under the locale pragma, the environment
variables \s-1LC_ALL\s0 and \s-1LANG\s0 (in that order) are matched for encodings
(the part after \f(CW\*(C`.\*(C'\fR, if any), and if any is found, that is used
as the default encoding for the open pragma.
.IP "3." 4
If 1. and 2. didn't work, the environment variables \s-1LC_ALL\s0 and \s-1LANG\s0
(in that order) are matched for anything looking like \s-1UTF\-8\s0, and if
any is found, \f(CW\*(C`:utf8\*(C'\fR is used as the default encoding for the open
pragma.
.PP
If your locale environment variables (\s-1LC_ALL\s0, \s-1LC_CTYPE\s0, \s-1LANG\s0)
contain the strings '\s-1UTF\-8\s0' or '\s-1UTF8\s0' (case\-insensitive matching),
the default encoding of your \s-1STDIN\s0, \s-1STDOUT\s0, and \s-1STDERR\s0, and of
\&\fBany subsequent file open\fR, is \s-1UTF\-8\s0.
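.PP
Steps 2 and 3 above can be sketched as a small function. The helper
\f(CWguess_locale_encoding\fR is hypothetical: it takes the environment as a
list of name/value pairs rather than reading \f(CW%ENV\fR, skips the langinfo
step entirely, and its regular expressions are an approximation, not the
pragma's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of encoding.pm): mirrors steps 2 and 3
# for a list of environment name/value pairs.
sub guess_locale_encoding {
    my (%env) = @_;

    # Step 2: take the codeset part after the "." in LC_ALL, then LANG.
    for my $name (qw(LC_ALL LANG)) {
        my $value = $env{$name};
        next unless defined $value;
        return $1 if $value =~ /\.([^.@]+)/;
    }

    # Step 3: fall back to anything that looks like UTF-8.
    for my $name (qw(LC_ALL LANG)) {
        my $value = $env{$name};
        next unless defined $value;
        return 'utf8' if $value =~ /utf-?8/i;
    }

    return;    # nothing usable found
}

for my $lang ("ja_JP.eucJP", "en_US.UTF-8", "C") {
    my $guess = guess_locale_encoding(LANG => $lang);
    printf "%-12s => %s\n", $lang, defined $guess ? $guess : "(none)";
}
```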
.SH "HISTORY"
.IX Header "HISTORY"
This pragma first appeared in Perl 5.8.0. For features that require
5.8.1 or later, see above.
.PP
The \f(CW\*(C`:locale\*(C'\fR subpragma was implemented in 2.01, or Perl 5.8.6.
.SH "SEE ALSO"
.IX Header "SEE ALSO"
perlunicode, Encode, open, Filter::Util::Call
.PP
Ch. 15 of \f(CW\*(C`Programming Perl (3rd Edition)\*(C'\fR
by Larry Wall, Tom Christiansen, Jon Orwant;
O'Reilly & Associates; \s-1ISBN\s0 0\-596\-00027\-8