| 1 | =head1 NAME |
| 2 | |
| 3 | Encode::PerlIO -- a detailed document on Encode and PerlIO |
| 4 | |
| 5 | =head1 Overview |
| 6 | |
| 7 | It is very common to want to do encoding transformations when |
| 8 | reading or writing files, network connections, pipes etc. |
| 9 | If Perl is configured to use the new 'perlio' IO system then |
| 10 | C<Encode> provides a "layer" (see L<PerlIO>) which can transform |
| 11 | data as it is read or written. |
| 12 | |
| 13 | Here is how the blind poet would modernise the encoding: |
| 14 | |
| 15 | use Encode; |
| 16 | open(my $iliad, '<:encoding(iso-8859-7)', 'iliad.greek') or die "iliad.greek: $!"; |
| 17 | open(my $utf8, '>:utf8', 'iliad.utf8') or die "iliad.utf8: $!"; |
| 18 | my @epic = <$iliad>; |
| 19 | print $utf8 @epic; |
| 20 | close($utf8); |
| 21 | close($iliad); |
| 22 | |
| 23 | In addition, the new IO system can be configured to read and write |
| 24 | UTF-8 encoded characters directly, which is efficient: |
| 25 | |
| 26 | open(my $fh, '>:utf8', 'anything') or die "anything: $!"; |
| 27 | print $fh "Any \x{0021} string \N{WHITE SMILING FACE}\n"; |
| 28 | |
| 29 | Either of the above forms of "layer" specifications can be made the default |
| 30 | for a lexical scope with the C<use open ...> pragma. See L<open>. |
| 31 | |
| 32 | Once a handle is open, its layers can be altered using C<binmode>. |
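The C<binmode> route can be sketched as follows. This is only an illustration, not part of the module: the scratch file name C<binmode_demo.txt> is a placeholder invented for the example. The handle is opened with the default layers, given an encoding layer afterwards, and the file is then read back as raw octets to show what was written.

```perl
use strict;
use warnings;

# Placeholder scratch file, invented for this illustration.
my $file = 'binmode_demo.txt';

# Open with the default layers, then push an encoding layer on top.
open(my $out, '>', $file) or die "$file: $!";
binmode($out, ':encoding(UTF-8)') or die "binmode: $!";
print $out "caf\x{e9}";          # four characters; U+00E9 is e-acute
close($out) or die "$file: $!";

# Read the same file back as raw octets to see what was written.
open(my $in, '<', $file) or die "$file: $!";
binmode($in, ':raw') or die "binmode: $!";
my $octets = do { local $/; <$in> };
close($in);
unlink $file;

printf "%d octets: %s\n", length($octets), unpack('H*', $octets);
```

The four characters should come back as five octets, because U+00E9 takes two octets in UTF-8.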
| 33 | |
| 34 | Without any such configuration, or if Perl itself is built using the |
| 35 | system's own IO, then write operations assume that the file handle |
| 36 | accepts only I<bytes>; writing a character larger than 255 to the handle |
| 37 | produces a C<Wide character> warning. When reading, each octet from the handle becomes |
| 38 | a byte-in-a-character. Note that this default is the same behaviour |
| 39 | as bytes-only languages (including Perl before v5.6) would have, |
| 40 | and is sufficient to handle native 8-bit encodings e.g. iso-8859-1, |
| 41 | EBCDIC etc. and any legacy mechanisms for handling other encodings |
| 42 | and binary data. |
| 43 | |
| 44 | In other cases, it is the program's responsibility to transform |
| 45 | characters into bytes with C<encode> (see L<Encode>) before doing writes, and to |
| 46 | transform the bytes read from a handle into characters with C<decode> before doing |
| 47 | "character operations" (e.g. C<lc>, C</\W+/>, ...). |
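A minimal sketch of doing that transformation by hand, assuming the C<encode> and C<decode> functions exported by L<Encode>. The sample string and variable names are chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Three Greek characters: alpha, beta, gamma.
my $chars = "\x{3b1}\x{3b2}\x{3b3}";

# Characters -> bytes, done by the program before writing to a bytes-only handle.
my $octets = encode('iso-8859-7', $chars);

# Bytes -> characters, done after reading and before any character operation.
my $decoded = decode('iso-8859-7', $octets);
my $upper   = uc $decoded;       # operates on characters, not octets

printf "%d characters, %d octets (%s)\n",
    length($chars), length($octets), unpack('H*', $octets);
```

In ISO-8859-7 each of these characters occupies a single octet, so the character and octet counts match here. With a multi-byte encoding they would differ.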
| 48 | |
| 49 | You can also use PerlIO to convert larger amounts of data you don't |
| 50 | want to bring into memory. For example, to convert between ISO-8859-1 |
| 51 | (Latin 1) and UTF-8 (or UTF-EBCDIC on EBCDIC machines): |
| 52 | |
| 53 | open(F, "<:encoding(iso-8859-1)", "data.txt") or die $!; |
| 54 | open(G, ">:utf8", "data.utf") or die $!; |
| 55 | while (<F>) { print G } |
| 56 | |
| 57 | # Could also do "print G <F>" but that would pull |
| 58 | # the whole file into memory just to write it out again. |
| 59 | |
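| 60 | More examples (C<$infile>, C<$outfile> and C<$outfile2> are placeholder paths): |
| 61 | |
| 62 | open(my $f, "<:encoding(cp1252)", $infile) or die "$infile: $!"; |
| 63 | open(my $g, ">:encoding(iso-8859-2)", $outfile) or die "$outfile: $!"; |
| 64 | open(my $h, ">:encoding(latin9)", $outfile2) or die "$outfile2: $!"; # iso-8859-15 |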
| 65 | |
| 66 | See also L<encoding> for how to change the default encoding of the |
| 67 | data in your script. |
| 68 | |
| 69 | =head1 How does it work? |
| 70 | |
| 71 | Here is a crude diagram of how filehandle, PerlIO, and Encode |
| 72 | interact. |
| 73 | |
| 74 | filehandle <-> PerlIO        PerlIO <-> scalar (read/printed) |
| 75 |                      \      / |
| 76 |                       Encode |
| 77 | |
| 78 | When PerlIO receives data from either direction, it fills a buffer |
| 79 | (currently with 1024 bytes) and passes the buffer to Encode. |
| 80 | Encode tries to convert the valid part and passes it back to PerlIO, |
| 81 | leaving invalid parts (usually a partial character) in the buffer. |
| 82 | PerlIO then appends more data to the buffer, calls Encode again, |
| 83 | and so on until the data stream ends. |
| 84 | |
| 85 | To do so, PerlIO always calls the (de|en)code methods with CHECK set to 1. |
| 86 | This ensures that the method stops at the right place when it |
| 87 | encounters a partial character. The following is what happens when |
| 88 | PerlIO and Encode try to encode (from utf8) more than 1024 bytes |
| 89 | and the buffer boundary happens to be in the middle of a character. |
| 90 | |
| 91 |  A   B   C   ....   ~     \x{3000}    .... |
| 92 | 41  42  43   ....  7E   e3   80   80  .... |
| 93 | <- buffer ---------------> |
| 94 | << encoded >>>>>>>>>> |
| 95 |                         <- next buffer ------ |
| 96 | |
| 97 | Encode converts from the beginning to \x7E, leaving \xe3 in the buffer |
| 98 | because it is invalid (partial character). |
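The stop-at-a-partial-character behaviour can be reproduced at the Encode API level. The sketch below is an illustration rather than what PerlIO does internally: it uses the C<Encode::FB_QUIET> check value, and a three-octet chunk size picked so that the boundary falls inside U+3000.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for "AB", U+3000, "C".
my $stream = "AB\xe3\x80\x80C";

# First chunk: the boundary falls right after \xe3, in mid-character.
my $buf  = substr($stream, 0, 3);
my $text = decode('UTF-8', $buf, Encode::FB_QUIET());
my $leftover = unpack('H*', $buf);   # what decode left unprocessed in $buf

# Append the rest of the stream to the leftover and decode again.
$buf  .= substr($stream, 3);
$text .= decode('UTF-8', $buf, Encode::FB_QUIET());

printf "left over after the first call: %s\n", $leftover;
printf "decoded %d characters in total\n", length $text;
```

With a check value such as C<FB_QUIET>, C<decode> modifies its source argument in place, so the unprocessed octets stay in C<$buf> and can simply be extended with the next chunk.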
| 99 | |
| 100 | Unfortunately, this scheme does not work well with escape-based |
| 101 | encodings such as ISO-2022-JP. |
| 102 | |
| 103 | =head1 Line Buffering |
| 104 | |
| 105 | Now let's see what happens when you try to decode from ISO-2022-JP and |
| 106 | the buffer ends in the middle of a character. |
| 107 | |
| 108 |                         JIS208-ESC \x{5f3e} |
| 109 |  A   B   C   ....   ~   \e   $   B  |DAN | .... |
| 110 | 41  42  43   ....  7E   1b  24  42  43  46 .... |
| 111 | <- buffer ---------------------------> |
| 112 | << encoded >>>>>>>>>>>>>>>>>>>>>>> |
| 113 | |
| 114 | As you can see, the next buffer begins with \x43. But \x43 is 'C' in |
| 115 | ASCII, which is wrong in this case because we are now in the JIS X 0208 |
| 116 | area, so it has to convert \x43\x46, not \x43. Unlike utf8 and EUC, |
| 117 | in escape-based encodings you can't tell if a given octet is a whole |
| 118 | character or just part of it. |
| 119 | |
| 120 | Fortunately PerlIO also supports a line buffer if you tell PerlIO to use |
| 121 | one instead of a fixed buffer. Since ISO-2022-JP is guaranteed to revert to |
| 122 | ASCII at the end of each line, a partial character can never occur when a line buffer is used. |
| 123 | |
| 124 | To tell PerlIO to use a line buffer, implement a C<< ->needs_lines >> method |
| 125 | for your encoding object. See L<Encode::Encoding> for details. |
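As a rough sketch, the C<needs_lines> method can also be queried on encoding objects obtained from C<find_encoding>. The two encoding names below are picked only for illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Ask each encoding object whether it wants PerlIO to hand it whole lines.
for my $name (qw(iso-2022-jp utf8)) {
    my $enc = find_encoding($name) or die "unknown encoding: $name";
    printf "%-12s %s\n", $name,
        $enc->needs_lines ? 'line buffer' : 'fixed buffer';
}
```

An escape-based encoding such as ISO-2022-JP is expected to ask for a line buffer, while utf8 is not.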
| 126 | |
| 127 | Thanks to these efforts, most encodings that come with Encode support |
| 128 | PerlIO, but that still leaves the following encodings unsupported: |
| 129 | |
| 130 | iso-2022-kr |
| 131 | MIME-B |
| 132 | MIME-Header |
| 133 | MIME-Q |
| 134 | |
| 135 | Fortunately iso-2022-kr is hardly used (according to Jungshik) and |
| 136 | MIME-* are very unlikely to be fed to PerlIO because they are for mail |
| 137 | headers. See L<Encode::MIME::Header> for details. |
| 138 | |
| 139 | =head2 How can I tell whether my encoding fully supports PerlIO? |
| 140 | |
| 141 | As of this writing, any encoding whose class belongs to Encode::XS or |
| 142 | Encode::Unicode works. The Encode module has a C<perlio_ok> function (not exported by default; |
| 143 | import it with C<use Encode qw(perlio_ok)>) which you can call before applying a PerlIO encoding layer to the filehandle. |
| 144 | Here is an example: |
| 145 | |
| 146 | my $use_perlio = perlio_ok($enc); |
| 147 | my $layer = $use_perlio ? "<:encoding($enc)" : "<:raw"; |
| 148 | open my $fh, $layer, $file or die "$file : $!"; |
| 149 | while(<$fh>){ |
| 150 | $_ = decode($enc, $_) unless $use_perlio; |
| 151 | # .... |
| 152 | } |
| 153 | |
| 154 | =head1 SEE ALSO |
| 155 | |
| 156 | L<Encode::Encoding>, |
| 157 | L<Encode::Supported>, |
| 158 | L<Encode>, |
| 159 | L<encoding>, |
| 160 | L<perlebcdic>, |
| 161 | L<perlfunc/open>, |
| 162 | L<perlunicode>, |
| 163 | L<utf8>, |
| 164 | the Perl Unicode Mailing List E<lt>perl-unicode@perl.orgE<gt> |
| 165 | |
| 166 | =cut |
| 167 | |