<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<link rel="STYLESHEET" href="lib.css" type='text/css' />
<link rel="SHORTCUT ICON" href="../icons/pyfav.png" type="image/png" />
<link rel='start' href='../index.html' title='Python Documentation Index' />
<link rel="first" href="lib.html" title='Python Library Reference' />
<link rel='contents' href='contents.html' title="Contents" />
<link rel='index' href='genindex.html' title='Index' />
<link rel='last' href='about.html' title='About this document...' />
<link rel='help' href='about.html' title='About this document...' />
<link rel="next" href="module-email.Encoders.html" />
<link rel="prev" href="module-email.Header.html" />
<link rel="parent" href="module-email.html" />
<meta name='aesop' content='information' />
<title>12.2.6 Representing character sets</title>
</head>
<body>
<DIV CLASS="navigation">
<div id='top-navigation-panel' xml:id='top-navigation-panel'>
<table align="center" width="100%" cellpadding="0" cellspacing="2">
<tr>
<td class='online-navigation'><a rel="prev" title="12.2.5 Internationalized headers"
 href="module-email.Header.html"><img src='../icons/previous.png'
 border='0' height='32' alt='Previous Page' width='32' /></A></td>
<td class='online-navigation'><a rel="parent" title="12.2 email "
 href="module-email.html"><img src='../icons/up.png'
 border='0' height='32' alt='Up One Level' width='32' /></A></td>
<td class='online-navigation'><a rel="next" title="12.2.7 Encoders"
 href="module-email.Encoders.html"><img src='../icons/next.png'
 border='0' height='32' alt='Next Page' width='32' /></A></td>
<td align="center" width="100%">Python Library Reference</td>
<td class='online-navigation'><a rel="contents" title="Table of Contents"
 href="contents.html"><img src='../icons/contents.png'
 border='0' height='32' alt='Contents' width='32' /></A></td>
<td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png'
 border='0' height='32' alt='Module Index' width='32' /></a></td>
<td class='online-navigation'><a rel="index" title="Index"
 href="genindex.html"><img src='../icons/index.png'
 border='0' height='32' alt='Index' width='32' /></A></td>
</tr></table>
<div class='online-navigation'>
<b class="navlabel">Previous:</b>
<a class="sectref" rel="prev" href="module-email.Header.html">12.2.5 Internationalized headers</A>
<b class="navlabel">Up:</b>
<a class="sectref" rel="parent" href="module-email.html">12.2 email </A>
<b class="navlabel">Next:</b>
<a class="sectref" rel="next" href="module-email.Encoders.html">12.2.7 Encoders</A>
</div>
<hr /></div>
</DIV>
<!--End of Navigation Panel-->

<H2><A NAME="SECTION0014260000000000000000">
12.2.6 Representing character sets</A>
</H2>
<A NAME="module-email.Charset"></A>

<P>
This module provides a class <tt class="class">Charset</tt> for representing
character sets and character set conversions in email messages, as
well as a character set registry and several convenience functions for
manipulating this registry. Instances of <tt class="class">Charset</tt> are used in
several other modules within the <tt class="module">email</tt> package.

<P>

<span class="versionnote">New in version 2.2.2.</span>

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><span class="typelabel">class</span>&nbsp;<tt id='l2h-3909' xml:id='l2h-3909' class="class">Charset</tt></b>(</nobr></td>
 <td><var></var><big>[</big><var>input_charset</var><big>]</big><var></var>)</td></tr></table></dt>
<dd>
Map character sets to their email properties.

<P>
This class provides information about the requirements imposed on
email for a specific character set. It also provides convenience
routines for converting between character sets, given the availability
of the applicable codecs. Given a character set, it will do its best
to provide information on how to use that character set in an email
message in an RFC-compliant way.

<P>
Certain character sets must be encoded with quoted-printable or base64
when used in email headers or bodies. Certain other character sets are
not allowed in email at all and must be converted outright.

<P>
Optional <var>input_charset</var> is as described below; it is always
coerced to lower case. After alias normalization, it is also used
as a key into the registry of character sets to find the header
encoding, body encoding, and output conversion codec to be used for
the character set. For example, if
<var>input_charset</var> is <code>iso-8859-1</code>, then headers and bodies will
be encoded using quoted-printable and no output conversion codec is
necessary. If <var>input_charset</var> is <code>euc-jp</code>, then headers will
be encoded with base64, bodies will not be encoded, but output text
will be converted from the <code>euc-jp</code> character set to the
<code>iso-2022-jp</code> character set.
</dl>
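<P>
The registry lookup described above can be sketched as follows. This is an
illustration only, and it assumes a Python 3 interpreter, where the module is
spelled <code>email.charset</code> rather than <code>email.Charset</code>; the
variable names are chosen for this example.

```python
# Assumption: Python 3, where the module is named email.charset.
from email.charset import Charset, QP, BASE64

latin1 = Charset('ISO-8859-1')   # the name is coerced to lower case
eucjp = Charset('euc-jp')

# Show what the registry reports for each character set.
for cs in (latin1, eucjp):
    print(cs.input_charset, cs.header_encoding, cs.body_encoding,
          cs.output_charset)
```

<P>
For <code>euc-jp</code> the reported output character set is
<code>iso-2022-jp</code>, matching the conversion described in the text.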

<P>
<tt class="class">Charset</tt> instances have the following data attributes:

<P>
<dl><dt><b><tt id='l2h-3910' xml:id='l2h-3910'>input_charset</tt></b></dt>
<dd>
The initial character set specified. Common aliases are converted to
their <em>official</em> email names (e.g. <code>latin_1</code> is converted to
<code>iso-8859-1</code>). Defaults to 7-bit <code>us-ascii</code>.
</dd></dl>
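<P>
A short sketch of alias normalization and the default value, assuming Python 3
and its <code>email.charset</code> spelling of this module:

```python
# Assumption: Python 3, where the module is named email.charset.
from email.charset import Charset

aliased = Charset('latin_1')   # a common alias
default = Charset()            # no character set given

print(aliased.input_charset)   # the official email name for the alias
print(default.input_charset)   # the default character set
```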

<P>
<dl><dt><b><tt id='l2h-3911' xml:id='l2h-3911'>header_encoding</tt></b></dt>
<dd>
If the character set must be encoded before it can be used in an
email header, this attribute will be set to <code>Charset.QP</code> (for
quoted-printable), <code>Charset.BASE64</code> (for base64 encoding), or
<code>Charset.SHORTEST</code> (for the shorter of the QP and base64 encodings).
Otherwise, it will be <code>None</code>.
</dd></dl>

<P>
<dl><dt><b><tt id='l2h-3912' xml:id='l2h-3912'>body_encoding</tt></b></dt>
<dd>
Same as <var>header_encoding</var>, but describes the encoding for the
mail message's body, which may differ from the header
encoding. <code>Charset.SHORTEST</code> is not allowed for
<var>body_encoding</var>.
</dd></dl>

<P>
<dl><dt><b><tt id='l2h-3913' xml:id='l2h-3913'>output_charset</tt></b></dt>
<dd>
Some character sets must be converted before they can be used in
email headers or bodies. If the <var>input_charset</var> is one of
them, this attribute will contain the name of the character set
to which output will be converted. Otherwise, it will be <code>None</code>.
</dd></dl>

<P>
<dl><dt><b><tt id='l2h-3914' xml:id='l2h-3914'>input_codec</tt></b></dt>
<dd>
The name of the Python codec used to convert the <var>input_charset</var> to
Unicode. If no conversion codec is necessary, this attribute will be
<code>None</code>.
</dd></dl>

<P>
<dl><dt><b><tt id='l2h-3915' xml:id='l2h-3915'>output_codec</tt></b></dt>
<dd>
The name of the Python codec used to convert Unicode to the
<var>output_charset</var>. If no conversion codec is necessary, this
attribute will have the same value as the <var>input_codec</var>.
</dd></dl>

<P>
<tt class="class">Charset</tt> instances also have the following methods:

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3916' xml:id='l2h-3916' class="method">get_body_encoding</tt></b>(</nobr></td>
 <td><var></var>)</td></tr></table></dt>
<dd>
Return the content transfer encoding used for body encoding.

<P>
This is either the string "<tt class="samp">quoted-printable</tt>" or "<tt class="samp">base64</tt>",
depending on the encoding used, or it is a function, in which case you
should call the function with a single argument, the Message object
being encoded. The function should then set the
<span class="mailheader">Content-Transfer-Encoding:</span> header itself to whatever is
appropriate.

<P>
Returns the string "<tt class="samp">quoted-printable</tt>" if
<var>body_encoding</var> is <code>QP</code>, returns the string
"<tt class="samp">base64</tt>" if <var>body_encoding</var> is <code>BASE64</code>, and returns the
string "<tt class="samp">7bit</tt>" otherwise.
</dl>
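<P>
The two string results can be observed with a small sketch such as the one
below. It assumes Python 3 (<code>email.charset</code>) and only covers
character sets whose body encoding is quoted-printable or base64; the result
for unencoded bodies may differ between versions and is not shown.

```python
# Assumption: Python 3, where the module is named email.charset.
from email.charset import Charset

qp_name = Charset('iso-8859-1').get_body_encoding()
b64_name = Charset('utf-8').get_body_encoding()

# Each value is suitable for a Content-Transfer-Encoding: header.
print(qp_name)
print(b64_name)
```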

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3917' xml:id='l2h-3917' class="method">convert</tt></b>(</nobr></td>
 <td><var>s</var>)</td></tr></table></dt>
<dd>
Convert the string <var>s</var> from the <var>input_codec</var> to the
<var>output_codec</var>.
</dl>
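<P>
The conversion amounts to decoding with the input codec and re-encoding with
the output codec. The sketch below spells that out with plain codec calls
instead of calling <tt class="method">convert()</tt> itself, because that
method may not exist in every version of the package. It assumes Python 3
(<code>email.charset</code>), and the sample text is arbitrary.

```python
# Assumption: Python 3, where the module is named email.charset.
from email.charset import Charset

cs = Charset('euc-jp')
text = '\u65e5\u672c\u8a9e'                  # arbitrary Japanese sample text

euc_bytes = text.encode(cs.input_codec)      # data in the input charset
# Input charset -> Unicode -> output charset.
jis_bytes = euc_bytes.decode(cs.input_codec).encode(cs.output_codec)

print(cs.input_codec, '->', cs.output_codec)
```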

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3918' xml:id='l2h-3918' class="method">to_splittable</tt></b>(</nobr></td>
 <td><var>s</var>)</td></tr></table></dt>
<dd>
Convert a possibly multibyte string to a safely splittable format.
<var>s</var> is the string to split.

<P>
Uses the <var>input_codec</var> to try to convert the string to Unicode,
so that it can be safely split on character boundaries (even for multibyte
characters).

<P>
Returns the string as-is if it isn't known how to convert <var>s</var> to
Unicode with the <var>input_charset</var>.

<P>
Characters that could not be converted to Unicode will be replaced
with the Unicode replacement character "<tt class="character">U+FFFD</tt>".
</dl>

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3919' xml:id='l2h-3919' class="method">from_splittable</tt></b>(</nobr></td>
 <td><var>ustr</var><big>[</big><var>, to_output</var><big>]</big><var></var>)</td></tr></table></dt>
<dd>
Convert a splittable string back into an encoded string. <var>ustr</var>
is a Unicode string to "unsplit".

<P>
This method uses the proper codec to try to convert the string from
Unicode back into an encoded format. Returns the string as-is if it is
not Unicode, or if it could not be converted from Unicode.

<P>
Characters that could not be converted from Unicode will be replaced
with an appropriate character (usually "<tt class="character">?</tt>").

<P>
If <var>to_output</var> is <code>True</code> (the default), it uses
<var>output_codec</var> to convert to an
encoded format. If <var>to_output</var> is <code>False</code>, it uses
<var>input_codec</var>.
</dl>
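<P>
The round trip behind <tt class="method">to_splittable()</tt> and
<tt class="method">from_splittable()</tt> can be sketched with plain codec
calls. The helpers <code>to_text</code> and <code>to_bytes</code> below are
hypothetical names invented for this illustration and are not part of the
module; the sketch is written for Python 3, where encoded data is a bytes
object.

```python
def to_text(raw, codec):
    # Undecodable bytes become U+FFFD, the Unicode replacement character.
    return raw.decode(codec, 'replace')

def to_bytes(text, codec):
    # Unencodable characters become '?'.
    return text.encode(codec, 'replace')

raw = '\u65e5\u672c\u8a9e'.encode('euc-jp')   # three two-byte characters
text = to_text(raw, 'euc-jp')

# Split on a character boundary, never in the middle of a multibyte sequence.
head, tail = text[:1], text[1:]
rejoined = to_bytes(head, 'euc-jp') + to_bytes(tail, 'euc-jp')
```

<P>
Splitting the Unicode form and re-encoding each piece yields the same bytes as
the original, which is the property that makes the format safe to split.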

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3920' xml:id='l2h-3920' class="method">get_output_charset</tt></b>(</nobr></td>
 <td><var></var>)</td></tr></table></dt>
<dd>
Return the output character set.

<P>
This is the <var>output_charset</var> attribute if that is not <code>None</code>,
otherwise it is <var>input_charset</var>.
</dl>
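<P>
A brief sketch of the fallback, assuming Python 3 (<code>email.charset</code>):

```python
# Assumption: Python 3, where the module is named email.charset.
from email.charset import Charset

converted = Charset('euc-jp').get_output_charset()        # has a conversion
unchanged = Charset('iso-8859-1').get_output_charset()    # no conversion

print(converted)
print(unchanged)
```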

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3921' xml:id='l2h-3921' class="method">encoded_header_len</tt></b>(</nobr></td>
 <td><var>s</var>)</td></tr></table></dt>
<dd>
Return the length of the string <var>s</var> after header encoding, properly
calculating for quoted-printable or base64 encoding.
</dl>

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3922' xml:id='l2h-3922' class="method">header_encode</tt></b>(</nobr></td>
 <td><var>s</var><big>[</big><var>, convert</var><big>]</big><var></var>)</td></tr></table></dt>
<dd>
Header-encode the string <var>s</var>.

<P>
If <var>convert</var> is <code>True</code>, the string will be converted from the
input charset to the output charset automatically. This is not useful
for multibyte character sets, which have line length issues (multibyte
characters must be split on a character boundary, not a byte boundary); use the
higher-level <tt class="class">Header</tt> class to deal with these issues (see
<tt class="module"><a href="module-email.Header.html">email.Header</a></tt>). <var>convert</var> defaults to <code>False</code>.

<P>
The type of encoding (base64 or quoted-printable) will be based on
the <var>header_encoding</var> attribute.
</dl>
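<P>
A minimal sketch of header encoding for a single-byte character set, assuming
Python 3 (<code>email.charset</code>). Only the one-argument form is used,
since the optional <var>convert</var> argument may not be accepted by every
version; the sample word is arbitrary.

```python
# Assumption: Python 3, where the module is named email.charset.
from email.charset import Charset
from email.header import decode_header

cs = Charset('iso-8859-1')          # header_encoding is quoted-printable
encoded = cs.header_encode('Caf\xe9')
print(encoded)

# Decoding the encoded word recovers the original bytes and charset.
decoded_bytes, decoded_charset = decode_header(encoded)[0]
```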

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3923' xml:id='l2h-3923' class="method">body_encode</tt></b>(</nobr></td>
 <td><var>s</var><big>[</big><var>, convert</var><big>]</big><var></var>)</td></tr></table></dt>
<dd>
Body-encode the string <var>s</var>.

<P>
If <var>convert</var> is <code>True</code> (the default), the string will be
converted from the input charset to the output charset automatically.
Unlike <tt class="method">header_encode()</tt>, there are no issues with byte
boundaries and multibyte charsets in email bodies, so this is usually
safe.

<P>
The type of encoding (base64 or quoted-printable) will be based on
the <var>body_encoding</var> attribute.
</dl>

<P>
The <tt class="class">Charset</tt> class also provides a number of methods to support
standard operations and built-in functions.

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3924' xml:id='l2h-3924' class="method">__str__</tt></b>(</nobr></td>
 <td><var></var>)</td></tr></table></dt>
<dd>
Returns <var>input_charset</var> as a string coerced to lower case.
<tt class="method">__repr__()</tt> is an alias for <tt class="method">__str__()</tt>.
</dl>

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3925' xml:id='l2h-3925' class="method">__eq__</tt></b>(</nobr></td>
 <td><var>other</var>)</td></tr></table></dt>
<dd>
This method allows you to compare two <tt class="class">Charset</tt> instances for equality.
</dl>

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3926' xml:id='l2h-3926' class="method">__ne__</tt></b>(</nobr></td>
 <td><var>other</var>)</td></tr></table></dt>
<dd>
This method allows you to compare two <tt class="class">Charset</tt> instances for inequality.
</dl>
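<P>
The string conversion and comparison methods can be exercised as in the sketch
below, which assumes Python 3 (<code>email.charset</code>); the variable names
are chosen for this example.

```python
# Assumption: Python 3, where the module is named email.charset.
from email.charset import Charset

a = Charset('latin_1')       # alias for iso-8859-1
b = Charset('ISO-8859-1')    # same character set, different spelling
c = Charset('utf-8')

print(str(a))                # lower-case canonical name
print(a == b, a != c)
```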

<P>
The <tt class="module">email.Charset</tt> module also provides the following
functions for adding new entries to the global character set, alias,
and codec registries:

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3927' xml:id='l2h-3927' class="function">add_charset</tt></b>(</nobr></td>
 <td><var>charset</var><big>[</big><var>, header_enc</var><big>[</big><var>,
 body_enc</var><big>[</big><var>, output_charset</var><big>]</big><var></var><big>]</big><var></var><big>]</big><var></var>)</td></tr></table></dt>
<dd>
Add character set properties to the global registry.

<P>
<var>charset</var> is the input character set, and must be the canonical
name of a character set.

<P>
Optional <var>header_enc</var> and <var>body_enc</var> are each either
<code>Charset.QP</code> for quoted-printable, <code>Charset.BASE64</code> for
base64 encoding, <code>Charset.SHORTEST</code> for the shorter of
quoted-printable or base64 encoding, or <code>None</code> for no encoding.
<code>SHORTEST</code> is only valid for <var>header_enc</var>. The default is
<code>None</code> for no encoding.

<P>
Optional <var>output_charset</var> is the character set that the output
should be in. Conversions will proceed from the input charset, to
Unicode, to the output charset when the method
<tt class="method">Charset.convert()</tt> is called. The default is to output in the
same character set as the input.

<P>
Both <var>charset</var> and <var>output_charset</var> must have Unicode
codec entries in the module's character set-to-codec mapping; use
<tt class="function">add_codec()</tt> to add codecs the module does
not know about. See the <tt class="module"><a href="module-codecs.html">codecs</a></tt> module's documentation for
more information.

<P>
The global character set registry is kept in the module global
dictionary <code>CHARSETS</code>.
</dl>
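<P>
Registering a character set might look like the sketch below. It assumes
Python 3 (<code>email.charset</code>), and <code>x-example</code> is a
made-up character set name used only for illustration. Note that the call
modifies the global registry for the whole process.

```python
# Assumption: Python 3, where the module is named email.charset.
from email.charset import Charset, add_charset, CHARSETS, QP, BASE64

# 'x-example' is a hypothetical name, not a real character set.
add_charset('x-example', QP, BASE64)

cs = Charset('x-example')
print(cs.header_encoding == QP, cs.body_encoding == BASE64)
```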

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3928' xml:id='l2h-3928' class="function">add_alias</tt></b>(</nobr></td>
 <td><var>alias, canonical</var>)</td></tr></table></dt>
<dd>
Add a character set alias. <var>alias</var> is the alias name,
e.g. <code>latin-1</code>. <var>canonical</var> is the character set's canonical
name, e.g. <code>iso-8859-1</code>.

<P>
The global charset alias registry is kept in the module global
dictionary <code>ALIASES</code>.
</dl>
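<P>
A sketch of adding an alias, assuming Python 3 (<code>email.charset</code>).
The alias <code>x-latin-one</code> is a made-up name for this example.

```python
# Assumption: Python 3, where the module is named email.charset.
from email.charset import Charset, add_alias, ALIASES

# 'x-latin-one' is a hypothetical alias for the canonical iso-8859-1.
add_alias('x-latin-one', 'iso-8859-1')

cs = Charset('X-Latin-One')   # lower-cased, then resolved through ALIASES
print(cs.input_charset)
```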

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-3929' xml:id='l2h-3929' class="function">add_codec</tt></b>(</nobr></td>
 <td><var>charset, codecname</var>)</td></tr></table></dt>
<dd>
Add a codec that maps characters in the given character set to and from
Unicode.

<P>
<var>charset</var> is the canonical name of a character set.
<var>codecname</var> is the name of a Python codec, as appropriate for the
second argument to the <tt class="function">unicode()</tt> built-in, or to the
<tt class="method">encode()</tt> method of a Unicode string.
</dl>
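<P>
The effect of registering a codec can be sketched as follows, assuming
Python 3 (<code>email.charset</code>). The character set name
<code>x-example-text</code> is hypothetical, and mapping it to the
<code>utf-8</code> codec is an arbitrary choice made for the example.

```python
# Assumption: Python 3, where the module is named email.charset.
from email.charset import Charset, add_charset, add_codec

# Hypothetical character set with no header or body encoding.
add_charset('x-example-text', None, None)
# Tell the module which Python codec handles it.
add_codec('x-example-text', 'utf-8')

cs = Charset('x-example-text')
print(cs.input_codec, cs.output_codec)
```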

<P>

<DIV CLASS="navigation">
<hr />
<span class="release-info">Release 2.4.2, documentation updated on 28 September 2005.</span>
</DIV>
<!--End of Navigation Panel-->
<ADDRESS>
See <i><a href="about.html">About this document...</a></i> for information on suggesting changes.
</ADDRESS>
</BODY>
</HTML>