Commit | Line | Data |
---|---|---|
920dae64 AT |
1 | =head1 NAME |
2 | ||
3 | perlebcdic - Considerations for running Perl on EBCDIC platforms | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
7 | An exploration of some of the issues facing Perl programmers | |
8 | on EBCDIC based computers. We do not cover localization, | |
9 | internationalization, or multibyte character set issues other | |
10 | than some discussion of UTF-8 and UTF-EBCDIC. | |
11 | ||
12 | Portions that are still incomplete are marked with XXX. | |
13 | ||
14 | =head1 COMMON CHARACTER CODE SETS | |
15 | ||
16 | =head2 ASCII | |
17 | ||
18 | The American Standard Code for Information Interchange is a set of | |
19 | integers running from 0 to 127 (decimal) that imply character | |
20 | interpretation by the display and other system(s) of computers. | |
21 | The range 0..127 can be covered by setting the bits in a 7-bit binary | |
22 | number, hence the set is sometimes referred to as "7-bit ASCII". | |
23 | ASCII was described by the American National Standards Institute | |
24 | document ANSI X3.4-1986. It was also described by ISO 646:1991 | |
25 | (with localization for currency symbols). The full ASCII set is | |
26 | given in the table below as the first 128 elements. Languages that | |
27 | can be written adequately with the characters in ASCII include | |
28 | English, Hawaiian, Indonesian, Swahili and some Native American | |
29 | languages. | |
30 | ||
31 | There are many character sets that extend the range of integers | |
32 | from 0..2**7-1 up to 2**8-1, or 8 bit bytes (octets if you prefer). | |
33 | One common one is the ISO 8859-1 character set. | |
34 | ||
35 | =head2 ISO 8859 | |
36 | ||
37 | The ISO 8859-$n are a collection of character code sets from the | |
38 | International Organization for Standardization (ISO) each of which | |
39 | adds characters to the ASCII set that are typically found in European | |
40 | languages many of which are based on the Roman, or Latin, alphabet. | |
41 | ||
42 | =head2 Latin 1 (ISO 8859-1) | |
43 | ||
44 | A particular 8-bit extension to ASCII that includes grave and acute | |
45 | accented Latin characters. Languages that can employ ISO 8859-1 | |
46 | include all the languages covered by ASCII as well as Afrikaans, | |
47 | Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian, | |
48 | Portuguese, Spanish, and Swedish. Dutch is covered albeit without | |
49 | the ij ligature. French is covered too but without the oe ligature. | |
50 | German can use ISO 8859-1 but must do so without German-style | |
51 | quotation marks. This set is based on Western European extensions | |
52 | to ASCII and is commonly encountered in world wide web work. | |
53 | In IBM character code set identification terminology ISO 8859-1 is | |
54 | also known as CCSID 819 (or sometimes 0819 or even 00819). | |
55 | ||
56 | =head2 EBCDIC | |
57 | ||
58 | The Extended Binary Coded Decimal Interchange Code refers to a | |
59 | large collection of slightly different single and multi byte | |
60 | coded character sets that are different from ASCII or ISO 8859-1 | |
61 | and typically run on host computers. The EBCDIC encodings derive | |
62 | from 8 bit byte extensions of Hollerith punched card encodings. | |
63 | The layout on the cards was such that high bits were set for the | |
64 | upper and lower case alphabet characters [a-z] and [A-Z], but there | |
65 | were gaps within each Latin alphabet range. | |
66 | ||
67 | Some IBM EBCDIC character sets may be known by character code set | |
68 | identification numbers (CCSID numbers) or code page numbers. Leading | |
69 | zero digits in CCSID numbers within this document are insignificant. | |
70 | E.g. CCSID 0037 may be referred to as 37 in places. | |
71 | ||
72 | =head2 13 variant characters | |
73 | ||
74 | Among IBM EBCDIC character code sets there are 13 characters that | |
75 | are often mapped to different integer values. Those characters | |
76 | are known as the 13 "variant" characters and are: | |
77 | ||
78 | \ [ ] { } ^ ~ ! # | $ @ ` | |
79 | ||
80 | =head2 0037 | |
81 | ||
82 | Character code set ID 0037 is a mapping of the ASCII plus Latin-1 | |
83 | characters (i.e. ISO 8859-1) to an EBCDIC set. 0037 is used | |
84 | in North American English locales on the OS/400 operating system | |
85 | that runs on AS/400 computers. CCSID 37 differs from ISO 8859-1 | |
86 | in 237 places, in other words they agree on only 19 code point values. | |
87 | ||
88 | =head2 1047 | |
89 | ||
90 | Character code set ID 1047 is also a mapping of the ASCII plus | |
91 | Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set. 1047 is | |
92 | used under Unix System Services for OS/390 or z/OS, and OpenEdition | |
93 | for VM/ESA. CCSID 1047 differs from CCSID 0037 in eight places. | |
94 | ||
95 | =head2 POSIX-BC | |
96 | ||
97 | The EBCDIC code page in use on Siemens' BS2000 system is distinct from | |
98 | 1047 and 0037. It is identified below as the POSIX-BC set. | |
99 | ||
100 | =head2 Unicode code points versus EBCDIC code points | |
101 | ||
102 | In Unicode terminology a I<code point> is the number assigned to a | |
103 | character: for example, in EBCDIC the character "A" is usually assigned | |
104 | the number 193. In Unicode the character "A" is assigned the number 65. | |
105 | This causes a problem with the semantics of the pack/unpack "U" template, which | |
106 | is supposed to pack Unicode code points to characters and back to numbers. | |
107 | The problem is: which code points should be used below 256? | |
108 | (For 256 and over there is no problem: Unicode code points are used.) | |
109 | In EBCDIC, for the low 256 the EBCDIC code points are used. This | |
110 | means that the equivalences | |
111 | ||
112 | pack("U", ord($character)) eq $character | |
113 | unpack("U", $character) == ord $character | |
114 | ||
115 | will hold. (If Unicode code points were applied consistently over | |
116 | all the possible code points, pack("U",ord("A")) would in EBCDIC | |
117 | equal I<A with acute> or chr(101), and unpack("U", "A") would equal | |
118 | 65, or I<non-breaking space>, not 193, or ord "A".) | |
119 | ||
120 | =head2 Remaining Perl Unicode problems in EBCDIC | |
121 | ||
122 | =over 4 | |
123 | ||
124 | =item * | |
125 | ||
126 | Many of the remaining problems seem to be related to case-insensitive matching: | |
127 | for example, C<< /[\x{131}]/ >> (LATIN SMALL LETTER DOTLESS I) does | |
128 | not match "I" case-insensitively, as it should under Unicode. | |
129 | (The match succeeds in ASCII-derived platforms.) | |
130 | ||
131 | =item * | |
132 | ||
133 | The extensions Unicode::Collate and Unicode::Normalize are not | |
134 | supported under EBCDIC; the same goes for the encoding pragma. | |
135 | ||
136 | =back | |
137 | ||
138 | =head2 Unicode and UTF | |
139 | ||
140 | UTF is a Unicode Transformation Format. UTF-8 is a Unicode conforming | |
141 | representation of the Unicode standard that looks very much like ASCII. | |
142 | UTF-EBCDIC is an attempt to represent Unicode characters in an EBCDIC | |
143 | transparent manner. | |
144 | ||
145 | =head2 Using Encode | |
146 | ||
147 | Starting from Perl 5.8 you can use the new standard module Encode | |
148 | to translate from EBCDIC to Latin-1 code points: | |
149 | ||
150 | use Encode 'from_to'; | |
151 | ||
152 | my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' ); | |
153 | ||
154 | # $a is in EBCDIC code points | |
155 | from_to($a, $ebcdic{ord '^'}, 'latin1'); | |
156 | # $a is ISO 8859-1 code points | |
157 | ||
158 | and from Latin-1 code points to EBCDIC code points: | |
159 | ||
160 | use Encode 'from_to'; | |
161 | ||
162 | my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' ); | |
163 | ||
164 | # $a is ISO 8859-1 code points | |
165 | from_to($a, 'latin1', $ebcdic{ord '^'}); | |
166 | # $a is in EBCDIC code points | |
167 | ||
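The two translations above can be checked against the table in this document. What follows is a minimal sketch, assuming it is run on an ASCII-based perl 5.8 or later where Encode provides the cp37 encoding; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Assumption: an ASCII platform, so this literal holds Latin-1 bytes.
my $text = "Perl";

# Latin-1 -> CCSID 0037, converted in place.
my $ebcdic = $text;
from_to($ebcdic, 'latin1', 'cp37');

# CCSID 0037 -> Latin-1, converted in place; this undoes the step above.
my $round_trip = $ebcdic;
from_to($round_trip, 'cp37', 'latin1');

# Show the EBCDIC code points in decimal, as the table does.
print join(' ', map { ord } split //, $ebcdic), "\n";   # 215 133 153 147
```

The printed numbers correspond to the 0037 column for P, e, r and l.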
168 | For doing I/O it is suggested that you use the autotranslating features | |
169 | of PerlIO; see L<perluniintro>. | |
170 | ||
171 | Since version 5.8 Perl uses the new PerlIO I/O library. This enables | |
172 | you to use different encodings per IO channel. For example you may use | |
173 | ||
174 | use Encode; | |
175 | open($f, ">:encoding(ascii)", "test.ascii"); | |
176 | print $f "Hello World!\n"; | |
177 | open($f, ">:encoding(cp37)", "test.ebcdic"); | |
178 | print $f "Hello World!\n"; | |
179 | open($f, ">:encoding(latin1)", "test.latin1"); | |
180 | print $f "Hello World!\n"; | |
181 | open($f, ">:encoding(utf8)", "test.utf8"); | |
182 | print $f "Hello World!\n"; | |
183 | ||
184 | to get four files containing "Hello World!\n" in ASCII, CP 37 EBCDIC, | |
185 | ISO 8859-1 (Latin-1) (in this example identical to ASCII), and | |
186 | UTF-EBCDIC (in this example identical to normal EBCDIC), respectively. See the | |
187 | documentation of Encode::PerlIO for details. | |
188 | ||
189 | As the PerlIO layer uses raw IO (bytes) internally, all this totally | |
190 | ignores things like the type of your filesystem (ASCII or EBCDIC). | |
191 | ||
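One way to see what an encoding layer actually writes is to read the file back as raw bytes. This is a minimal sketch, assuming an ASCII platform and a scratch file created with File::Temp; the file and variable names are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file; any writable path would do.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Write through the cp37 layer ...
open(my $out, '>:encoding(cp37)', $path) or die "Cannot write $path: $!";
print {$out} "Hello";
close($out) or die "Cannot close $path: $!";

# ... then read the same file back without any translation.
open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

my @codes = map { ord } split //, $bytes;
print "@codes\n";   # 200 133 147 147 150
```

The byte values match the 0037 column for H, e, l, l and o, regardless of the character set the file system itself uses.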
192 | =head1 SINGLE OCTET TABLES | |
193 | ||
194 | The following tables list the ASCII and Latin 1 ordered sets including | |
195 | the subsets (given as hexadecimal ranges): C0 controls (00..1f), ASCII graphics (20..7e), delete (7f), | |
196 | C1 controls (80..9f), and Latin-1 (a.k.a. ISO 8859-1) (a0..ff). In the | |
197 | table non-printing control character names as well as the Latin 1 | |
198 | extensions to ASCII have been labelled with character names roughly | |
199 | corresponding to I<The Unicode Standard, Version 3.0> albeit with | |
200 | substitutions such as s/LATIN// and s/VULGAR// in all cases, | |
201 | s/CAPITAL LETTER// in some cases, and s/SMALL LETTER ([A-Z])/\l$1/ | |
202 | in some other cases (the C<charnames> pragma names unfortunately do | |
203 | not list explicit names for the C0 or C1 control characters). The | |
204 | "names" of the C1 control set (128..159 in ISO 8859-1) listed here are | |
205 | somewhat arbitrary. The differences between the 0037 and 1047 sets are | |
206 | flagged with ***. The differences between the 1047 and POSIX-BC sets | |
207 | are flagged with ###. All ord() numbers listed are decimal. If you | |
208 | would rather see this table listing octal values then run the table | |
209 | (that is, the pod version of this document since this recipe may not | |
210 | work with a pod2_other_format translation) through: | |
211 | ||
212 | =over 4 | |
213 | ||
214 | =item recipe 0 | |
215 | ||
216 | =back | |
217 | ||
218 | perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ | |
219 | -e '{printf("%s%-9o%-9o%-9o%o\n",$1,$2,$3,$4,$5)}' perlebcdic.pod | |
220 | ||
221 | If you want to retain the UTF-x code points then in script form you | |
222 | might want to write: | |
223 | ||
224 | =over 4 | |
225 | ||
226 | =item recipe 1 | |
227 | ||
228 | =back | |
229 | ||
230 | open(FH,"<perlebcdic.pod") or die "Could not open perlebcdic.pod: $!"; | |
231 | while (<FH>) { | |
232 | if (/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/) { | |
233 | if ($7 ne '' && $9 ne '') { | |
234 | printf("%s%-9o%-9o%-9o%-9o%-3o.%-5o%-3o.%o\n",$1,$2,$3,$4,$5,$6,$7,$8,$9); | |
235 | } | |
236 | elsif ($7 ne '') { | |
237 | printf("%s%-9o%-9o%-9o%-9o%-3o.%-5o%o\n",$1,$2,$3,$4,$5,$6,$7,$8); | |
238 | } | |
239 | else { | |
240 | printf("%s%-9o%-9o%-9o%-9o%-9o%o\n",$1,$2,$3,$4,$5,$6,$8); | |
241 | } | |
242 | } | |
243 | } | |
244 | ||
245 | If you would rather see this table listing hexadecimal values then | |
246 | run the table through: | |
247 | ||
248 | =over 4 | |
249 | ||
250 | =item recipe 2 | |
251 | ||
252 | =back | |
253 | ||
254 | perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ | |
255 | -e '{printf("%s%-9X%-9X%-9X%X\n",$1,$2,$3,$4,$5)}' perlebcdic.pod | |
256 | ||
257 | Or, in order to retain the UTF-x code points in hexadecimal: | |
258 | ||
259 | =over 4 | |
260 | ||
261 | =item recipe 3 | |
262 | ||
263 | =back | |
264 | ||
265 | open(FH,"<perlebcdic.pod") or die "Could not open perlebcdic.pod: $!"; | |
266 | while (<FH>) { | |
267 | if (/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/) { | |
268 | if ($7 ne '' && $9 ne '') { | |
269 | printf("%s%-9X%-9X%-9X%-9X%-2X.%-6X%-2X.%X\n",$1,$2,$3,$4,$5,$6,$7,$8,$9); | |
270 | } | |
271 | elsif ($7 ne '') { | |
272 | printf("%s%-9X%-9X%-9X%-9X%-2X.%-6X%X\n",$1,$2,$3,$4,$5,$6,$7,$8); | |
273 | } | |
274 | else { | |
275 | printf("%s%-9X%-9X%-9X%-9X%-9X%X\n",$1,$2,$3,$4,$5,$6,$8); | |
276 | } | |
277 | } | |
278 | } | |
279 | ||
280 | ||
281 | chr 8859-1/0819 0037 1047 POSIX-BC UTF-8 UTF-EBCDIC | |
282 | (the UTF-8 and UTF-EBCDIC columns are incomplete) | |
284 | ------------------------------------------------------------------------------------ | |
285 | <NULL> 0 0 0 0 0 0 | |
286 | <START OF HEADING> 1 1 1 1 1 1 | |
287 | <START OF TEXT> 2 2 2 2 2 2 | |
288 | <END OF TEXT> 3 3 3 3 3 3 | |
289 | <END OF TRANSMISSION> 4 55 55 55 4 55 | |
290 | <ENQUIRY> 5 45 45 45 5 45 | |
291 | <ACKNOWLEDGE> 6 46 46 46 6 46 | |
292 | <BELL> 7 47 47 47 7 47 | |
293 | <BACKSPACE> 8 22 22 22 8 22 | |
294 | <HORIZONTAL TABULATION> 9 5 5 5 9 5 | |
295 | <LINE FEED> 10 37 21 21 10 21 *** | |
296 | <VERTICAL TABULATION> 11 11 11 11 11 11 | |
297 | <FORM FEED> 12 12 12 12 12 12 | |
298 | <CARRIAGE RETURN> 13 13 13 13 13 13 | |
299 | <SHIFT OUT> 14 14 14 14 14 14 | |
300 | <SHIFT IN> 15 15 15 15 15 15 | |
301 | <DATA LINK ESCAPE> 16 16 16 16 16 16 | |
302 | <DEVICE CONTROL ONE> 17 17 17 17 17 17 | |
303 | <DEVICE CONTROL TWO> 18 18 18 18 18 18 | |
304 | <DEVICE CONTROL THREE> 19 19 19 19 19 19 | |
305 | <DEVICE CONTROL FOUR> 20 60 60 60 20 60 | |
306 | <NEGATIVE ACKNOWLEDGE> 21 61 61 61 21 61 | |
307 | <SYNCHRONOUS IDLE> 22 50 50 50 22 50 | |
308 | <END OF TRANSMISSION BLOCK> 23 38 38 38 23 38 | |
309 | <CANCEL> 24 24 24 24 24 24 | |
310 | <END OF MEDIUM> 25 25 25 25 25 25 | |
311 | <SUBSTITUTE> 26 63 63 63 26 63 | |
312 | <ESCAPE> 27 39 39 39 27 39 | |
313 | <FILE SEPARATOR> 28 28 28 28 28 28 | |
314 | <GROUP SEPARATOR> 29 29 29 29 29 29 | |
315 | <RECORD SEPARATOR> 30 30 30 30 30 30 | |
316 | <UNIT SEPARATOR> 31 31 31 31 31 31 | |
317 | <SPACE> 32 64 64 64 32 64 | |
318 | ! 33 90 90 90 33 90 | |
319 | " 34 127 127 127 34 127 | |
320 | # 35 123 123 123 35 123 | |
321 | $ 36 91 91 91 36 91 | |
322 | % 37 108 108 108 37 108 | |
323 | & 38 80 80 80 38 80 | |
324 | ' 39 125 125 125 39 125 | |
325 | ( 40 77 77 77 40 77 | |
326 | ) 41 93 93 93 41 93 | |
327 | * 42 92 92 92 42 92 | |
328 | + 43 78 78 78 43 78 | |
329 | , 44 107 107 107 44 107 | |
330 | - 45 96 96 96 45 96 | |
331 | . 46 75 75 75 46 75 | |
332 | / 47 97 97 97 47 97 | |
333 | 0 48 240 240 240 48 240 | |
334 | 1 49 241 241 241 49 241 | |
335 | 2 50 242 242 242 50 242 | |
336 | 3 51 243 243 243 51 243 | |
337 | 4 52 244 244 244 52 244 | |
338 | 5 53 245 245 245 53 245 | |
339 | 6 54 246 246 246 54 246 | |
340 | 7 55 247 247 247 55 247 | |
341 | 8 56 248 248 248 56 248 | |
342 | 9 57 249 249 249 57 249 | |
343 | : 58 122 122 122 58 122 | |
344 | ; 59 94 94 94 59 94 | |
345 | < 60 76 76 76 60 76 | |
346 | = 61 126 126 126 61 126 | |
347 | > 62 110 110 110 62 110 | |
348 | ? 63 111 111 111 63 111 | |
349 | @ 64 124 124 124 64 124 | |
350 | A 65 193 193 193 65 193 | |
351 | B 66 194 194 194 66 194 | |
352 | C 67 195 195 195 67 195 | |
353 | D 68 196 196 196 68 196 | |
354 | E 69 197 197 197 69 197 | |
355 | F 70 198 198 198 70 198 | |
356 | G 71 199 199 199 71 199 | |
357 | H 72 200 200 200 72 200 | |
358 | I 73 201 201 201 73 201 | |
359 | J 74 209 209 209 74 209 | |
360 | K 75 210 210 210 75 210 | |
361 | L 76 211 211 211 76 211 | |
362 | M 77 212 212 212 77 212 | |
363 | N 78 213 213 213 78 213 | |
364 | O 79 214 214 214 79 214 | |
365 | P 80 215 215 215 80 215 | |
366 | Q 81 216 216 216 81 216 | |
367 | R 82 217 217 217 82 217 | |
368 | S 83 226 226 226 83 226 | |
369 | T 84 227 227 227 84 227 | |
370 | U 85 228 228 228 85 228 | |
371 | V 86 229 229 229 86 229 | |
372 | W 87 230 230 230 87 230 | |
373 | X 88 231 231 231 88 231 | |
374 | Y 89 232 232 232 89 232 | |
375 | Z 90 233 233 233 90 233 | |
376 | [ 91 186 173 187 91 173 *** ### | |
377 | \ 92 224 224 188 92 224 ### | |
378 | ] 93 187 189 189 93 189 *** | |
379 | ^ 94 176 95 106 94 95 *** ### | |
380 | _ 95 109 109 109 95 109 | |
381 | ` 96 121 121 74 96 121 ### | |
382 | a 97 129 129 129 97 129 | |
383 | b 98 130 130 130 98 130 | |
384 | c 99 131 131 131 99 131 | |
385 | d 100 132 132 132 100 132 | |
386 | e 101 133 133 133 101 133 | |
387 | f 102 134 134 134 102 134 | |
388 | g 103 135 135 135 103 135 | |
389 | h 104 136 136 136 104 136 | |
390 | i 105 137 137 137 105 137 | |
391 | j 106 145 145 145 106 145 | |
392 | k 107 146 146 146 107 146 | |
393 | l 108 147 147 147 108 147 | |
394 | m 109 148 148 148 109 148 | |
395 | n 110 149 149 149 110 149 | |
396 | o 111 150 150 150 111 150 | |
397 | p 112 151 151 151 112 151 | |
398 | q 113 152 152 152 113 152 | |
399 | r 114 153 153 153 114 153 | |
400 | s 115 162 162 162 115 162 | |
401 | t 116 163 163 163 116 163 | |
402 | u 117 164 164 164 117 164 | |
403 | v 118 165 165 165 118 165 | |
404 | w 119 166 166 166 119 166 | |
405 | x 120 167 167 167 120 167 | |
406 | y 121 168 168 168 121 168 | |
407 | z 122 169 169 169 122 169 | |
408 | { 123 192 192 251 123 192 ### | |
409 | | 124 79 79 79 124 79 | |
410 | } 125 208 208 253 125 208 ### | |
411 | ~ 126 161 161 255 126 161 ### | |
412 | <DELETE> 127 7 7 7 127 7 | |
413 | <C1 0> 128 32 32 32 194.128 32 | |
414 | <C1 1> 129 33 33 33 194.129 33 | |
415 | <C1 2> 130 34 34 34 194.130 34 | |
416 | <C1 3> 131 35 35 35 194.131 35 | |
417 | <C1 4> 132 36 36 36 194.132 36 | |
418 | <C1 5> 133 21 37 37 194.133 37 *** | |
419 | <C1 6> 134 6 6 6 194.134 6 | |
420 | <C1 7> 135 23 23 23 194.135 23 | |
421 | <C1 8> 136 40 40 40 194.136 40 | |
422 | <C1 9> 137 41 41 41 194.137 41 | |
423 | <C1 10> 138 42 42 42 194.138 42 | |
424 | <C1 11> 139 43 43 43 194.139 43 | |
425 | <C1 12> 140 44 44 44 194.140 44 | |
426 | <C1 13> 141 9 9 9 194.141 9 | |
427 | <C1 14> 142 10 10 10 194.142 10 | |
428 | <C1 15> 143 27 27 27 194.143 27 | |
429 | <C1 16> 144 48 48 48 194.144 48 | |
430 | <C1 17> 145 49 49 49 194.145 49 | |
431 | <C1 18> 146 26 26 26 194.146 26 | |
432 | <C1 19> 147 51 51 51 194.147 51 | |
433 | <C1 20> 148 52 52 52 194.148 52 | |
434 | <C1 21> 149 53 53 53 194.149 53 | |
435 | <C1 22> 150 54 54 54 194.150 54 | |
436 | <C1 23> 151 8 8 8 194.151 8 | |
437 | <C1 24> 152 56 56 56 194.152 56 | |
438 | <C1 25> 153 57 57 57 194.153 57 | |
439 | <C1 26> 154 58 58 58 194.154 58 | |
440 | <C1 27> 155 59 59 59 194.155 59 | |
441 | <C1 28> 156 4 4 4 194.156 4 | |
442 | <C1 29> 157 20 20 20 194.157 20 | |
443 | <C1 30> 158 62 62 62 194.158 62 | |
444 | <C1 31> 159 255 255 95 194.159 255 ### | |
445 | <NON-BREAKING SPACE> 160 65 65 65 194.160 128.65 | |
446 | <INVERTED EXCLAMATION MARK> 161 170 170 170 194.161 128.66 | |
447 | <CENT SIGN> 162 74 74 176 194.162 128.67 ### | |
448 | <POUND SIGN> 163 177 177 177 194.163 128.68 | |
449 | <CURRENCY SIGN> 164 159 159 159 194.164 128.69 | |
450 | <YEN SIGN> 165 178 178 178 194.165 128.70 | |
451 | <BROKEN BAR> 166 106 106 208 194.166 128.71 ### | |
452 | <SECTION SIGN> 167 181 181 181 194.167 128.72 | |
453 | <DIAERESIS> 168 189 187 121 194.168 128.73 *** ### | |
454 | <COPYRIGHT SIGN> 169 180 180 180 194.169 128.74 | |
455 | <FEMININE ORDINAL INDICATOR> 170 154 154 154 194.170 128.81 | |
456 | <LEFT POINTING GUILLEMET> 171 138 138 138 194.171 128.82 | |
457 | <NOT SIGN> 172 95 176 186 194.172 128.83 *** ### | |
458 | <SOFT HYPHEN> 173 202 202 202 194.173 128.84 | |
459 | <REGISTERED TRADE MARK SIGN> 174 175 175 175 194.174 128.85 | |
460 | <MACRON> 175 188 188 161 194.175 128.86 ### | |
461 | <DEGREE SIGN> 176 144 144 144 194.176 128.87 | |
462 | <PLUS-OR-MINUS SIGN> 177 143 143 143 194.177 128.88 | |
463 | <SUPERSCRIPT TWO> 178 234 234 234 194.178 128.89 | |
464 | <SUPERSCRIPT THREE> 179 250 250 250 194.179 128.98 | |
465 | <ACUTE ACCENT> 180 190 190 190 194.180 128.99 | |
466 | <MICRO SIGN> 181 160 160 160 194.181 128.100 | |
467 | <PARAGRAPH SIGN> 182 182 182 182 194.182 128.101 | |
468 | <MIDDLE DOT> 183 179 179 179 194.183 128.102 | |
469 | <CEDILLA> 184 157 157 157 194.184 128.103 | |
470 | <SUPERSCRIPT ONE> 185 218 218 218 194.185 128.104 | |
471 | <MASC. ORDINAL INDICATOR> 186 155 155 155 194.186 128.105 | |
472 | <RIGHT POINTING GUILLEMET> 187 139 139 139 194.187 128.106 | |
473 | <FRACTION ONE QUARTER> 188 183 183 183 194.188 128.112 | |
474 | <FRACTION ONE HALF> 189 184 184 184 194.189 128.113 | |
475 | <FRACTION THREE QUARTERS> 190 185 185 185 194.190 128.114 | |
476 | <INVERTED QUESTION MARK> 191 171 171 171 194.191 128.115 | |
477 | <A WITH GRAVE> 192 100 100 100 195.128 138.65 | |
478 | <A WITH ACUTE> 193 101 101 101 195.129 138.66 | |
479 | <A WITH CIRCUMFLEX> 194 98 98 98 195.130 138.67 | |
480 | <A WITH TILDE> 195 102 102 102 195.131 138.68 | |
481 | <A WITH DIAERESIS> 196 99 99 99 195.132 138.69 | |
482 | <A WITH RING ABOVE> 197 103 103 103 195.133 138.70 | |
483 | <CAPITAL LIGATURE AE> 198 158 158 158 195.134 138.71 | |
484 | <C WITH CEDILLA> 199 104 104 104 195.135 138.72 | |
485 | <E WITH GRAVE> 200 116 116 116 195.136 138.73 | |
486 | <E WITH ACUTE> 201 113 113 113 195.137 138.74 | |
487 | <E WITH CIRCUMFLEX> 202 114 114 114 195.138 138.81 | |
488 | <E WITH DIAERESIS> 203 115 115 115 195.139 138.82 | |
489 | <I WITH GRAVE> 204 120 120 120 195.140 138.83 | |
490 | <I WITH ACUTE> 205 117 117 117 195.141 138.84 | |
491 | <I WITH CIRCUMFLEX> 206 118 118 118 195.142 138.85 | |
492 | <I WITH DIAERESIS> 207 119 119 119 195.143 138.86 | |
493 | <CAPITAL LETTER ETH> 208 172 172 172 195.144 138.87 | |
494 | <N WITH TILDE> 209 105 105 105 195.145 138.88 | |
495 | <O WITH GRAVE> 210 237 237 237 195.146 138.89 | |
496 | <O WITH ACUTE> 211 238 238 238 195.147 138.98 | |
497 | <O WITH CIRCUMFLEX> 212 235 235 235 195.148 138.99 | |
498 | <O WITH TILDE> 213 239 239 239 195.149 138.100 | |
499 | <O WITH DIAERESIS> 214 236 236 236 195.150 138.101 | |
500 | <MULTIPLICATION SIGN> 215 191 191 191 195.151 138.102 | |
501 | <O WITH STROKE> 216 128 128 128 195.152 138.103 | |
502 | <U WITH GRAVE> 217 253 253 224 195.153 138.104 ### | |
503 | <U WITH ACUTE> 218 254 254 254 195.154 138.105 | |
504 | <U WITH CIRCUMFLEX> 219 251 251 221 195.155 138.106 ### | |
505 | <U WITH DIAERESIS> 220 252 252 252 195.156 138.112 | |
506 | <Y WITH ACUTE> 221 173 186 173 195.157 138.113 *** ### | |
507 | <CAPITAL LETTER THORN> 222 174 174 174 195.158 138.114 | |
508 | <SMALL LETTER SHARP S> 223 89 89 89 195.159 138.115 | |
509 | <a WITH GRAVE> 224 68 68 68 195.160 139.65 | |
510 | <a WITH ACUTE> 225 69 69 69 195.161 139.66 | |
511 | <a WITH CIRCUMFLEX> 226 66 66 66 195.162 139.67 | |
512 | <a WITH TILDE> 227 70 70 70 195.163 139.68 | |
513 | <a WITH DIAERESIS> 228 67 67 67 195.164 139.69 | |
514 | <a WITH RING ABOVE> 229 71 71 71 195.165 139.70 | |
515 | <SMALL LIGATURE ae> 230 156 156 156 195.166 139.71 | |
516 | <c WITH CEDILLA> 231 72 72 72 195.167 139.72 | |
517 | <e WITH GRAVE> 232 84 84 84 195.168 139.73 | |
518 | <e WITH ACUTE> 233 81 81 81 195.169 139.74 | |
519 | <e WITH CIRCUMFLEX> 234 82 82 82 195.170 139.81 | |
520 | <e WITH DIAERESIS> 235 83 83 83 195.171 139.82 | |
521 | <i WITH GRAVE> 236 88 88 88 195.172 139.83 | |
522 | <i WITH ACUTE> 237 85 85 85 195.173 139.84 | |
523 | <i WITH CIRCUMFLEX> 238 86 86 86 195.174 139.85 | |
524 | <i WITH DIAERESIS> 239 87 87 87 195.175 139.86 | |
525 | <SMALL LETTER eth> 240 140 140 140 195.176 139.87 | |
526 | <n WITH TILDE> 241 73 73 73 195.177 139.88 | |
527 | <o WITH GRAVE> 242 205 205 205 195.178 139.89 | |
528 | <o WITH ACUTE> 243 206 206 206 195.179 139.98 | |
529 | <o WITH CIRCUMFLEX> 244 203 203 203 195.180 139.99 | |
530 | <o WITH TILDE> 245 207 207 207 195.181 139.100 | |
531 | <o WITH DIAERESIS> 246 204 204 204 195.182 139.101 | |
532 | <DIVISION SIGN> 247 225 225 225 195.183 139.102 | |
533 | <o WITH STROKE> 248 112 112 112 195.184 139.103 | |
534 | <u WITH GRAVE> 249 221 221 192 195.185 139.104 ### | |
535 | <u WITH ACUTE> 250 222 222 222 195.186 139.105 | |
536 | <u WITH CIRCUMFLEX> 251 219 219 219 195.187 139.106 | |
537 | <u WITH DIAERESIS> 252 220 220 220 195.188 139.112 | |
538 | <y WITH ACUTE> 253 141 141 141 195.189 139.113 | |
539 | <SMALL LETTER thorn> 254 142 142 142 195.190 139.114 | |
540 | <y WITH DIAERESIS> 255 223 223 223 195.191 139.115 | |
541 | ||
542 | If you would rather see the above table in CCSID 0037 order rather than | |
543 | ASCII + Latin-1 order then run the table through: | |
544 | ||
545 | =over 4 | |
546 | ||
547 | =item recipe 4 | |
548 | ||
549 | =back | |
550 | ||
551 | perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\ | |
552 | -e '{push(@l,$_)}' \ | |
553 | -e 'END{print map{$_->[0]}' \ | |
554 | -e ' sort{$a->[1] <=> $b->[1]}' \ | |
555 | -e ' map{[$_,substr($_,42,3)]}@l;}' perlebcdic.pod | |
556 | ||
557 | If you would rather see it in CCSID 1047 order then change the number | |
558 | 42 in the last line to 51, like this: | |
559 | ||
560 | =over 4 | |
561 | ||
562 | =item recipe 5 | |
563 | ||
564 | =back | |
565 | ||
566 | perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\ | |
567 | -e '{push(@l,$_)}' \ | |
568 | -e 'END{print map{$_->[0]}' \ | |
569 | -e ' sort{$a->[1] <=> $b->[1]}' \ | |
570 | -e ' map{[$_,substr($_,51,3)]}@l;}' perlebcdic.pod | |
571 | ||
572 | If you would rather see it in POSIX-BC order then change the number | |
573 | 51 in the last line to 60, like this: | |
574 | ||
575 | =over 4 | |
576 | ||
577 | =item recipe 6 | |
578 | ||
579 | =back | |
580 | ||
581 | perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\ | |
582 | -e '{push(@l,$_)}' \ | |
583 | -e 'END{print map{$_->[0]}' \ | |
584 | -e ' sort{$a->[1] <=> $b->[1]}' \ | |
585 | -e ' map{[$_,substr($_,60,3)]}@l;}' perlebcdic.pod | |
586 | ||
587 | ||
588 | =head1 IDENTIFYING CHARACTER CODE SETS | |
589 | ||
590 | To determine the character set you are running under from perl one | |
591 | could use the return value of ord() or chr() to test one or more | |
592 | character values. For example: | |
593 | ||
594 | $is_ascii = "A" eq chr(65); | |
595 | $is_ebcdic = "A" eq chr(193); | |
596 | ||
597 | Also, "\t" is a C<HORIZONTAL TABULATION> character so that: | |
598 | ||
599 | $is_ascii = ord("\t") == 9; | |
600 | $is_ebcdic = ord("\t") == 5; | |
601 | ||
602 | To distinguish EBCDIC code pages try looking at one or more of | |
603 | the characters that differ between them. For example: | |
604 | ||
605 | $is_ebcdic_37 = "\n" eq chr(37); | |
606 | $is_ebcdic_1047 = "\n" eq chr(21); | |
607 | ||
608 | Or better still choose a character that is uniquely encoded in each | |
609 | of the code sets, e.g.: | |
610 | ||
611 | $is_ascii = ord('[') == 91; | |
612 | $is_ebcdic_37 = ord('[') == 186; | |
613 | $is_ebcdic_1047 = ord('[') == 173; | |
614 | $is_ebcdic_POSIX_BC = ord('[') == 187; | |
615 | ||
616 | However, it would be unwise to write tests such as: | |
617 | ||
618 | $is_ascii = "\r" ne chr(13); # WRONG | |
619 | $is_ascii = "\n" ne chr(10); # ILL ADVISED | |
620 | ||
621 | Obviously the first of these will fail to distinguish most ASCII machines | |
622 | from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC machine since "\r" eq | |
623 | chr(13) under all of those coded character sets. But note too that | |
624 | because "\n" is chr(13) and "\r" is chr(10) on the Macintosh (which is an | |
625 | ASCII machine) the second C<$is_ascii> test will lead to trouble there. | |
626 | ||
627 | To determine whether or not perl was built under an EBCDIC | |
628 | code page you can use the Config module like so: | |
629 | ||
630 | use Config; | |
631 | $is_ebcdic = $Config{'ebcdic'} eq 'define'; | |
632 | ||
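The single-character tests above can be folded into one lookup. This is a minimal sketch: guess_code_set() is a hypothetical helper, not part of perl, and it only knows the four code sets tabulated in this document.

```perl
use strict;
use warnings;

# Code points of '[' taken from the table in this document.
my %code_set_for_bracket = (
    91  => 'ascii',
    186 => 'cp37',
    173 => 'cp1047',
    187 => 'posix-bc',
);

# Hypothetical helper: name the code set perl is running under,
# or 'unknown' if '[' has some other code point.
sub guess_code_set {
    my $name = $code_set_for_bracket{ ord('[') };
    return defined $name ? $name : 'unknown';
}

print guess_code_set(), "\n";
```

The names returned are the ones Encode uses for these code pages, so the result could be passed to from_to() directly.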
633 | =head1 CONVERSIONS | |
634 | ||
635 | =head2 tr/// | |
636 | ||
637 | In order to convert a string of characters from one character set to | |
638 | another a simple list of numbers, such as in the right columns in the | |
639 | above table, along with perl's tr/// operator is all that is needed. | |
640 | The data in the table are in ASCII order hence the EBCDIC columns | |
641 | provide easy to use ASCII to EBCDIC operations that are also easily | |
642 | reversed. | |
643 | ||
644 | For example, to convert ASCII to code page 037, list the ISO 8859-1 code | |
645 | points in CCSID 0037 order (the first numeric column of the output of recipe 4, | |
646 | written in octal with \ characters added) and use that list in tr/// like so: | |
647 | ||
648 | $cp_037 = | |
649 | '\000\001\002\003\234\011\206\177\227\215\216\013\014\015\016\017' . | |
650 | '\020\021\022\023\235\205\010\207\030\031\222\217\034\035\036\037' . | |
651 | '\200\201\202\203\204\012\027\033\210\211\212\213\214\005\006\007' . | |
652 | '\220\221\026\223\224\225\226\004\230\231\232\233\024\025\236\032' . | |
653 | '\040\240\342\344\340\341\343\345\347\361\242\056\074\050\053\174' . | |
654 | '\046\351\352\353\350\355\356\357\354\337\041\044\052\051\073\254' . | |
655 | '\055\057\302\304\300\301\303\305\307\321\246\054\045\137\076\077' . | |
656 | '\370\311\312\313\310\315\316\317\314\140\072\043\100\047\075\042' . | |
657 | '\330\141\142\143\144\145\146\147\150\151\253\273\360\375\376\261' . | |
658 | '\260\152\153\154\155\156\157\160\161\162\252\272\346\270\306\244' . | |
659 | '\265\176\163\164\165\166\167\170\171\172\241\277\320\335\336\256' . | |
660 | '\136\243\245\267\251\247\266\274\275\276\133\135\257\250\264\327' . | |
661 | '\173\101\102\103\104\105\106\107\110\111\255\364\366\362\363\365' . | |
662 | '\175\112\113\114\115\116\117\120\121\122\271\373\374\371\372\377' . | |
663 | '\134\367\123\124\125\126\127\130\131\132\262\324\326\322\323\325' . | |
664 | '\060\061\062\063\064\065\066\067\070\071\263\333\334\331\332\237' ; | |
665 | ||
666 | my $ebcdic_string = $ascii_string; | |
667 | eval '$ebcdic_string =~ tr/' . $cp_037 . '/\000-\377/'; | |
668 | ||
669 | To convert from EBCDIC 037 to ASCII just reverse the order of the tr/// | |
670 | arguments like so: | |
671 | ||
672 | my $ascii_string = $ebcdic_string; | |
673 | eval '$ascii_string =~ tr/\000-\377/' . $cp_037 . '/'; | |
674 | ||
675 | Similarly one could list the ISO 8859-1 code points in CCSID 1047 order (recipe 5) to | |
676 | obtain a C<$cp_1047> table. The same list in POSIX-BC order (recipe | |
677 | 6) could provide a C<$cp_posix_bc> table suitable for transcoding as well. | |
678 | ||
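A transcoding table can also be generated rather than typed by hand. This is a minimal sketch, assuming an ASCII platform and the cp37 encoding supplied with Encode; it swaps the tr/// eval for a substr lookup, and latin1_to_cp37 is a hypothetical name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: an ASCII platform, where chr(0)..chr(255) are the
# ISO 8859-1 characters in table order.
my $latin1_order = join '', map { chr } 0 .. 255;

# Byte N of this string is the CCSID 0037 code for Latin-1 code N.
my $latin1_to_cp37 = encode('cp37', $latin1_order);

# Hypothetical helper built on that lookup string.
sub latin1_to_cp37 {
    my ($string) = @_;
    return join '',
        map { substr($latin1_to_cp37, ord($_), 1) } split //, $string;
}

my $ebcdic_string = latin1_to_cp37("JAPH");
print join(' ', map { ord } split //, $ebcdic_string), "\n";   # 209 193 215 200
```

Substituting 'cp1047' or 'posix-bc' for 'cp37' should give the corresponding tables for the other two code pages.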
679 | =head2 iconv | |
680 | ||
681 | XPG operability often implies the presence of an I<iconv> utility | |
682 | available from the shell or from the C library. Consult your system's | |
683 | documentation for information on iconv. | |
684 | ||
685 | On OS/390 or z/OS see the iconv(1) manpage. One way to invoke the iconv | |
686 | shell utility from within perl would be to: | |
687 | ||
688 | # OS/390 or z/OS example | |
689 | $ascii_data = `echo '$ebcdic_data'| iconv -f IBM-1047 -t ISO8859-1`; | |
690 | ||
691 | or the inverse map: | |
692 | ||
693 | # OS/390 or z/OS example | |
694 | $ebcdic_data = `echo '$ascii_data'| iconv -f ISO8859-1 -t IBM-1047`; | |
695 | ||
696 | For other perl based conversion options see the Convert::* modules on CPAN. | |
697 | ||
698 | =head2 C RTL | |
699 | ||
700 | The OS/390 and z/OS C run time libraries provide _atoe() and _etoa() functions. | |
701 | ||
702 | =head1 OPERATOR DIFFERENCES | |
703 | ||
704 | The C<..> range operator treats certain character ranges with | |
705 | care on EBCDIC machines. For example the following array | |
706 | will have twenty six elements on either an EBCDIC machine | |
707 | or an ASCII machine: | |
708 | ||
709 | @alphabet = ('A'..'Z'); # $#alphabet == 25 | |
710 | ||
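The care taken by the range operator matters because the letters are not contiguous in EBCDIC. A minimal sketch contrasting the portable range with a count based on code points; the variable names are illustrative.

```perl
use strict;
use warnings;

# Portable: the range operator yields the 26 letters everywhere.
my @alphabet = ('A' .. 'Z');

# Not portable: counting code points between 'A' and 'Z' gives 26 on
# ASCII but 41 on the EBCDIC code pages in this document (193..233).
my $span = ord('Z') - ord('A') + 1;

printf "%d letters, code point span %d\n", scalar(@alphabet), $span;
```

Loops written as C<for my $code (ord('A') .. ord('Z'))> therefore visit non-letters on an EBCDIC machine.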
711 | The bitwise operators such as & ^ | may return different results | |
712 | when operating on string or character data in a perl program running | |
713 | on an EBCDIC machine than when run on an ASCII machine. Here is | |
714 | an example adapted from the one in L<perlop>: | |
715 | ||
716 | # EBCDIC-based examples | |
717 | print "j p \n" ^ " a h"; # prints "JAPH\n" | |
718 | print "JA" | " ph\n"; # prints "japh\n" | |
719 | print "JAPH\nJunk" & "\277\277\277\277\277"; # prints "japh\n"; | |
720 | print 'p N$' ^ " E<H\n"; # prints "Perl\n"; | |
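
The results differ because the bit patterns of the letters differ. The sketch below is illustrative only: it prints the bit that separates "a" from "A" on whatever platform runs it, then shows C<uc()> and C<lc()> as a way to change case without relying on those bit patterns.

```perl
use strict;
use warnings;

# The bit that separates "a" from "A" depends on the coded character set:
# 32 (0x20) in ASCII; on the EBCDIC code pages discussed here, where
# ord("A") is 193 and ord("a") is 129, it would be 64 (0x40).
my $case_bit = ord("a") ^ ord("A");
printf "case bit on this platform: %d\n", $case_bit;

# Portable alternative to flipping bits by hand: let perl do the mapping.
my $upper = uc("japh");
my $lower = lc("JAPH");
print "$upper $lower\n";
```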
721 | ||
722 | An interesting property of the 32 C0 control characters | |
723 | in the ASCII table is that they can "literally" be constructed | |
724 | as control characters in perl, e.g. C<(chr(0) eq "\c@")>, | |
725 | C<(chr(1) eq "\cA")>, and so on. Perl on EBCDIC machines has been | |
726 | ported to take "\c@" to chr(0) and "\cA" to chr(1) as well, but the | |
727 | thirty-three characters that result (the 32 above plus "\c?") depend on | |
728 | which code page you are using. The table below uses the character names | |
729 | from the previous table but with substitutions such as s/START OF/S.O./; | |
730 | s/END OF/E.O./; s/TRANSMISSION/TRANS./; s/TABULATION/TAB./; | |
731 | s/VERTICAL/VERT./; s/HORIZONTAL/HORIZ./; s/DEVICE CONTROL/D.C./; | |
732 | s/SEPARATOR/SEP./; s/NEGATIVE ACKNOWLEDGE/NEG. ACK./;. The POSIX-BC and | |
733 | 1047 sets are identical throughout this range and differ from the 0037 | |
734 | set at only one spot (21 decimal). Note that the C<LINE FEED> character | |
735 | may be generated by "\cJ" on ASCII machines but by "\cU" on 1047 or POSIX-BC | |
736 | machines and cannot be generated as a C<"\c.letter."> control character on | |
737 | 0037 machines. Note also that "\c\\" maps to two characters, | |
738 | not one: the C<FILE SEPARATOR> followed by a backslash. | |
739 | ||
740 | chr    ord  8859-1               0037                 1047 && POSIX-BC | |
741 | -------------------------------------------------------------------------------- | |
742 | "\c?"  127  <DELETE>             " "                                       ***>< | |
743 | "\c@"    0  <NULL>               <NULL>               <NULL>               ***>< | |
744 | "\cA"    1  <S.O. HEADING>       <S.O. HEADING>       <S.O. HEADING> | |
745 | "\cB"    2  <S.O. TEXT>          <S.O. TEXT>          <S.O. TEXT> | |
746 | "\cC"    3  <E.O. TEXT>          <E.O. TEXT>          <E.O. TEXT> | |
747 | "\cD"    4  <E.O. TRANS.>        <C1 28>              <C1 28> | |
748 | "\cE"    5  <ENQUIRY>            <HORIZ. TAB.>        <HORIZ. TAB.> | |
749 | "\cF"    6  <ACKNOWLEDGE>        <C1 6>               <C1 6> | |
750 | "\cG"    7  <BELL>               <DELETE>             <DELETE> | |
751 | "\cH"    8  <BACKSPACE>          <C1 23>              <C1 23> | |
752 | "\cI"    9  <HORIZ. TAB.>        <C1 13>              <C1 13> | |
753 | "\cJ"   10  <LINE FEED>          <C1 14>              <C1 14> | |
754 | "\cK"   11  <VERT. TAB.>         <VERT. TAB.>         <VERT. TAB.> | |
755 | "\cL"   12  <FORM FEED>          <FORM FEED>          <FORM FEED> | |
756 | "\cM"   13  <CARRIAGE RETURN>    <CARRIAGE RETURN>    <CARRIAGE RETURN> | |
757 | "\cN"   14  <SHIFT OUT>          <SHIFT OUT>          <SHIFT OUT> | |
758 | "\cO"   15  <SHIFT IN>           <SHIFT IN>           <SHIFT IN> | |
759 | "\cP"   16  <DATA LINK ESCAPE>   <DATA LINK ESCAPE>   <DATA LINK ESCAPE> | |
760 | "\cQ"   17  <D.C. ONE>           <D.C. ONE>           <D.C. ONE> | |
761 | "\cR"   18  <D.C. TWO>           <D.C. TWO>           <D.C. TWO> | |
762 | "\cS"   19  <D.C. THREE>         <D.C. THREE>         <D.C. THREE> | |
763 | "\cT"   20  <D.C. FOUR>          <C1 29>              <C1 29> | |
764 | "\cU"   21  <NEG. ACK.>          <C1 5>               <LINE FEED>          *** | |
765 | "\cV"   22  <SYNCHRONOUS IDLE>   <BACKSPACE>          <BACKSPACE> | |
766 | "\cW"   23  <E.O. TRANS. BLOCK>  <C1 7>               <C1 7> | |
767 | "\cX"   24  <CANCEL>             <CANCEL>             <CANCEL> | |
768 | "\cY"   25  <E.O. MEDIUM>        <E.O. MEDIUM>        <E.O. MEDIUM> | |
769 | "\cZ"   26  <SUBSTITUTE>         <C1 18>              <C1 18> | |
770 | "\c["   27  <ESCAPE>             <C1 15>              <C1 15> | |
771 | "\c\\"  28  <FILE SEP.>\         <FILE SEP.>\         <FILE SEP.>\ | |
772 | "\c]"   29  <GROUP SEP.>         <GROUP SEP.>         <GROUP SEP.> | |
773 | "\c^"   30  <RECORD SEP.>        <RECORD SEP.>        <RECORD SEP.>        ***>< | |
774 | "\c_"   31  <UNIT SEP.>          <UNIT SEP.>          <UNIT SEP.>          ***>< | |
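
To check a row of the table on a given machine, a small probe can help. The sketch below is a hypothetical helper, not part of perl: it evaluates each C<"\cX"> escape and reports which one, if any, equals a given character. On an ASCII machine it reports J for the newline character; what it reports on an EBCDIC machine depends on the code page.

```perl
use strict;
use warnings;

# Hypothetical helper: return the letter (or symbol) X such that "\cX"
# equals the given character on the platform running the script, or
# undef if no such control escape exists. "\c\\" is skipped because
# it yields two characters rather than one.
sub control_letter_for {
    my ($char) = @_;
    for my $letter ('@', 'A' .. 'Z', '[', ']', '^', '_', '?') {
        # Build and evaluate the escape, e.g. the source text "\cA".
        my $control = eval qq{"\\c$letter"};
        next unless defined $control;
        return $letter if $control eq $char;
    }
    return;
}

my $letter = control_letter_for("\n");
print defined $letter
    ? "newline is \\c$letter here\n"
    : "newline has no \\c escape here\n";
```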
775 | ||
776 | ||
777 | =head1 FUNCTION DIFFERENCES | |
778 | ||
779 | =over 8 | |
780 | ||
781 | =item chr() | |
782 | ||
783 | chr() must be given an EBCDIC code number argument to yield a desired | |
784 | character return value on an EBCDIC machine. For example: | |
785 | ||
786 | $CAPITAL_LETTER_A = chr(193); | |
787 | ||
788 | =item ord() | |
789 | ||
790 | ord() will return EBCDIC code number values on an EBCDIC machine. | |
791 | For example: | |
792 | ||
793 | $the_number_193 = ord("A"); | |
794 | ||
795 | =item pack() | |
796 | ||
797 | The c and C templates for pack() are dependent upon character set | |
798 | encoding. Examples of usage on EBCDIC include: | |
799 | ||
800 | $foo = pack("CCCC",193,194,195,196); | |
801 | # $foo eq "ABCD" | |
802 | $foo = pack("C4",193,194,195,196); | |
803 | # same thing | |
804 | ||
805 | $foo = pack("ccxxcc",193,194,195,196); | |
806 | # $foo eq "AB\0\0CD" | |
807 | ||
808 | =item print() | |
809 | ||
810 | One must be careful with scalars and strings that are passed to | |
811 | print that contain ASCII encodings. One common place | |
812 | for this to occur is in the output of the MIME type header for | |
813 | CGI script writing. For example, many perl programming guides | |
814 | recommend something similar to: | |
815 | ||
816 | print "Content-type:\ttext/html\015\012\015\012"; | |
817 | # this may be wrong on EBCDIC | |
818 | ||
819 | Under the IBM OS/390 USS Web Server or WebSphere on z/OS for example | |
820 | you should instead write that as: | |
821 | ||
822 | print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et alia | |
823 | ||
824 | That is because the translation from EBCDIC to ASCII is done | |
825 | by the web server in this case (such code will not be appropriate for | |
826 | the Macintosh however). Consult your web server's documentation for | |
827 | further details. | |
828 | ||
829 | =item printf() | |
830 | ||
831 | The formats that can convert characters to numbers and vice versa | |
832 | will be different from their ASCII counterparts when executed | |
833 | on an EBCDIC machine. Examples include: | |
834 | ||
835 | printf("%c%c%c",193,194,195); # prints ABC | |
836 | ||
837 | =item sort() | |
838 | ||
839 | EBCDIC sort results may differ from ASCII sort results especially for | |
840 | mixed case strings. This is discussed in more detail below. | |
841 | ||
842 | =item sprintf() | |
843 | ||
844 | See the discussion of printf() above. An example of the use | |
845 | of sprintf would be: | |
846 | ||
847 | $CAPITAL_LETTER_A = sprintf("%c",193); | |
848 | ||
849 | =item unpack() | |
850 | ||
851 | See the discussion of pack() above. | |
852 | ||
853 | =back | |
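
The numeric arguments in the items above are tied to one coded character set. As a hedged sketch of one way to avoid hard coding them, the following assumes a perl recent enough to provide the built-in C<utf8::unicode_to_native()> and C<utf8::native_to_unicode()> functions, and derives the platform's own code number from the ASCII/Latin-1 one.

```perl
use strict;
use warnings;

# Assumption: utf8::unicode_to_native() and utf8::native_to_unicode()
# are available (they are built into more recent perls).
# 65 is the ASCII/Latin-1 code number of "A".
my $native_code = utf8::unicode_to_native(65);

my $capital_a = chr($native_code);           # via chr()
my $packed    = pack("C", $native_code);     # same octet via pack()
my $formatted = sprintf("%c", $native_code); # same character via sprintf()

printf "native code for A: %d\n", $native_code;
print "all three agree\n"
    if $capital_a eq "A" && $packed eq "A" && $formatted eq "A";

# Going the other way: from a native ord() back to the Latin-1 number.
my $latin1_code = utf8::native_to_unicode(ord("A"));
```

Where a character literal will do, writing C<ord("A")> or simply C<"A"> is simpler still.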
854 | ||
855 | =head1 REGULAR EXPRESSION DIFFERENCES | |
856 | ||
857 | As of perl 5.005_03 letter ranges in regular expressions such as | |
858 | [A-Z] and [a-z] have been specially coded to not pick up gap | |
859 | characters. For example, characters such as E<ocirc> C<o WITH CIRCUMFLEX> | |
860 | that lie between I and J would not be matched by the | |
861 | regular expression range C</[H-K]/>. This works in | |
862 | the other direction, too, if either of the range end points is | |
863 | explicitly numeric: C<[\x89-\x91]> will match C<\x8e>, even | |
864 | though C<\x89> is C<i> and C<\x91> is C<j>, and C<\x8e> | |
865 | is a gap character from the alphabetic viewpoint. | |
866 | ||
867 | If you do want to match the alphabet gap characters in a single octet | |
868 | regular expression try matching the hex or octal code such | |
869 | as C</\313/> on EBCDIC or C</\364/> on ASCII machines to | |
870 | have your regular expression match C<o WITH CIRCUMFLEX>. | |
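
The difference between a letter range and a numeric range can be observed directly. The following sketch counts, among the 256 single octet characters, how many match C<[a-z]> and how many fall numerically between C<ord("a")> and C<ord("z")>. On an ASCII machine both counts are 26; on the EBCDIC code pages discussed here, where C<a> is C<\x81> and C<z> is C<\xa9>, the numeric span would cover 41 code points.

```perl
use strict;
use warnings;

# Every single-octet character, in code number order.
my @octets = map { chr } 0 .. 255;

# Characters matched by the letter range.
my $by_letters = grep { /[a-z]/ } @octets;

# Characters whose code number lies between the two end points.
my ($low, $high) = (ord("a"), ord("z"));
my $by_numbers = grep { ord($_) >= $low && ord($_) <= $high } @octets;

print "matched by [a-z]:            $by_letters\n";
print "code numbers ord(a)..ord(z): $by_numbers\n";
```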
871 | ||
872 | Another construct to be wary of is the inappropriate use of hex or | |
873 | octal constants in regular expressions. Consider the following | |
874 | set of subs: | |
875 | ||
876 | sub is_c0 { | |
877 | my $char = substr(shift,0,1); | |
878 | $char =~ /[\000-\037]/; | |
879 | } | |
880 | ||
881 | sub is_print_ascii { | |
882 | my $char = substr(shift,0,1); | |
883 | $char =~ /[\040-\176]/; | |
884 | } | |
885 | ||
886 | sub is_delete { | |
887 | my $char = substr(shift,0,1); | |
888 | $char eq "\177"; | |
889 | } | |
890 | ||
891 | sub is_c1 { | |
892 | my $char = substr(shift,0,1); | |
893 | $char =~ /[\200-\237]/; | |
894 | } | |
895 | ||
896 | sub is_latin_1 { | |
897 | my $char = substr(shift,0,1); | |
898 | $char =~ /[\240-\377]/; | |
899 | } | |
900 | ||
901 | The above would be adequate if the concern was only with numeric code points. | |
902 | However, the concern may be with characters rather than code points | |
903 | and on an EBCDIC machine it may be desirable for constructs such as | |
904 | C<if (is_print_ascii("A")) {print "A is a printable character\n";}> to print | |
905 | out the expected message. One way to represent the above collection | |
906 | of character classification subs that is capable of working across the | |
907 | four coded character sets discussed in this document is as follows: | |
908 | ||
909 | sub Is_c0 { | |
910 | my $char = substr(shift,0,1); | |
911 | if (ord('^')==94) { # ascii | |
912 | return $char =~ /[\000-\037]/; | |
913 | } | |
914 | if (ord('^')==176) { # 37 | |
915 | return $char =~ /[\000-\003\067\055-\057\026\005\045\013-\023\074\075\062\046\030\031\077\047\034-\037]/; | |
916 | } | |
917 | if (ord('^')==95 || ord('^')==106) { # 1047 || posix-bc | |
918 | return $char =~ /[\000-\003\067\055-\057\026\005\025\013-\023\074\075\062\046\030\031\077\047\034-\037]/; | |
919 | } | |
920 | } | |
921 | ||
922 | sub Is_print_ascii { | |
923 | my $char = substr(shift,0,1); | |
924 | $char =~ /[ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/; | |
925 | } | |
926 | ||
927 | sub Is_delete { | |
928 | my $char = substr(shift,0,1); | |
929 | if (ord('^')==94) { # ascii | |
930 | return $char eq "\177"; | |
931 | } | |
932 | else { # ebcdic | |
933 | return $char eq "\007"; | |
934 | } | |
935 | } | |
936 | ||
937 | sub Is_c1 { | |
938 | my $char = substr(shift,0,1); | |
939 | if (ord('^')==94) { # ascii | |
940 | return $char =~ /[\200-\237]/; | |
941 | } | |
942 | if (ord('^')==176) { # 37 | |
943 | return $char =~ /[\040-\044\025\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/; | |
944 | } | |
945 | if (ord('^')==95) { # 1047 | |
946 | return $char =~ /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/; | |
947 | } | |
948 | if (ord('^')==106) { # posix-bc | |
949 | return $char =~ | |
950 | /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\137]/; | |
951 | } | |
952 | } | |
953 | ||
954 | sub Is_latin_1 { | |
955 | my $char = substr(shift,0,1); | |
956 | if (ord('^')==94) { # ascii | |
957 | return $char =~ /[\240-\377]/; | |
958 | } | |
959 | if (ord('^')==176) { # 37 | |
960 | return $char =~ | |
961 | /[\101\252\112\261\237\262\152\265\275\264\232\212\137\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/; | |
962 | } | |
963 | if (ord('^')==95) { # 1047 | |
964 | return $char =~ | |
965 | /[\101\252\112\261\237\262\152\265\273\264\232\212\260\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\272\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/; | |
966 | } | |
967 | if (ord('^')==106) { # posix-bc | |
968 | return $char =~ | |
969 | /[\101\252\260\261\237\262\320\265\171\264\232\212\272\312\257\241\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\340\376\335\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\300\336\333\334\215\216\337]/; | |
970 | } | |
971 | } | |
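
Each sub above repeats the same C<ord('^')> probe. As one possible tidying, sketched here with hypothetical names that are not part of the subs above, the probe can be done in a single helper and the per character set data kept in a lookup table. Only the DELETE test is shown.

```perl
use strict;
use warnings;

# Hypothetical helper: name the coded character set once, using the
# same ord('^') probe as the subs above.
sub coded_character_set {
    my %name_for = (
         94 => 'ascii',      # ASCII / ISO 8859-1
        176 => 'cp037',
         95 => 'cp1047',
        106 => 'posix-bc',
    );
    return $name_for{ ord('^') } || 'unknown';
}

# One entry per character set; values taken from Is_delete() above.
my %delete_for = (
    'ascii'    => "\177",
    'cp037'    => "\007",
    'cp1047'   => "\007",
    'posix-bc' => "\007",
);

sub is_delete_char {
    my $char   = substr(shift, 0, 1);
    my $delete = $delete_for{ coded_character_set() };
    return defined $delete && $char eq $delete;
}

print "running on: ", coded_character_set(), "\n";
```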
972 | ||
973 | Note however that only the C<Is_print_ascii()> sub is really independent | |
974 | of coded character set. Another way to write C<Is_latin_1()> would be | |
975 | to use the characters in the range explicitly: | |
976 | ||
977 | sub Is_latin_1 { | |
978 | my $char = substr(shift,0,1); | |
979 |