Commit | Line | Data |
---|---|---|
86530b38 AT |
1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
2 | <html> | |
3 | <head> | |
4 | <link rel="STYLESHEET" href="lib.css" type='text/css' /> | |
5 | <link rel="SHORTCUT ICON" href="../icons/pyfav.png" type="image/png" /> | |
6 | <link rel='start' href='../index.html' title='Python Documentation Index' /> | |
7 | <link rel="first" href="lib.html" title='Python Library Reference' /> | |
8 | <link rel='contents' href='contents.html' title="Contents" /> | |
9 | <link rel='index' href='genindex.html' title='Index' /> | |
10 | <link rel='last' href='about.html' title='About this document...' /> | |
11 | <link rel='help' href='about.html' title='About this document...' /> | |
12 | <link rel="prev" href="node773.html" /> | |
13 | <link rel="parent" href="node772.html" /> | |
14 | <link rel="next" href="module-symbol.html" /> | |
15 | <meta name='aesop' content='information' /> | |
16 | <title>18.1.6.2 Information Discovery</title> | |
17 | </head> | |
18 | <body> | |
19 | <DIV CLASS="navigation"> | |
20 | <div id='top-navigation-panel' xml:id='top-navigation-panel'> | |
21 | <table align="center" width="100%" cellpadding="0" cellspacing="2"> | |
22 | <tr> | |
23 | <td class='online-navigation'><a rel="prev" title="18.1.6.1 Emulation of compile()" | |
24 | href="node773.html"><img src='../icons/previous.png' | |
25 | border='0' height='32' alt='Previous Page' width='32' /></A></td> | |
26 | <td class='online-navigation'><a rel="parent" title="18.1.6 Examples" | |
27 | href="node772.html"><img src='../icons/up.png' | |
28 | border='0' height='32' alt='Up One Level' width='32' /></A></td> | |
29 | <td class='online-navigation'><a rel="next" title="18.2 symbol " | |
30 | href="module-symbol.html"><img src='../icons/next.png' | |
31 | border='0' height='32' alt='Next Page' width='32' /></A></td> | |
32 | <td align="center" width="100%">Python Library Reference</td> | |
33 | <td class='online-navigation'><a rel="contents" title="Table of Contents" | |
34 | href="contents.html"><img src='../icons/contents.png' | |
35 | border='0' height='32' alt='Contents' width='32' /></A></td> | |
36 | <td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png' | |
37 | border='0' height='32' alt='Module Index' width='32' /></a></td> | |
38 | <td class='online-navigation'><a rel="index" title="Index" | |
39 | href="genindex.html"><img src='../icons/index.png' | |
40 | border='0' height='32' alt='Index' width='32' /></A></td> | |
41 | </tr></table> | |
42 | <div class='online-navigation'> | |
43 | <b class="navlabel">Previous:</b> | |
44 | <a class="sectref" rel="prev" href="node773.html">18.1.6.1 Emulation of compile()</A> | |
45 | <b class="navlabel">Up:</b> | |
46 | <a class="sectref" rel="parent" href="node772.html">18.1.6 Examples</A> | |
47 | <b class="navlabel">Next:</b> | |
48 | <a class="sectref" rel="next" href="module-symbol.html">18.2 symbol </A> | |
49 | </div> | |
50 | <hr /></div> | |
51 | </DIV> | |
52 | <!--End of Navigation Panel--> | |
53 | ||
54 | <H3><A NAME="SECTION0020162000000000000000"> | |
55 | 18.1.6.2 Information Discovery</A> | |
56 | </H3> | |
57 | ||
58 | <P> | |
59 | Some applications benefit from direct access to the parse tree. The | |
60 | remainder of this section demonstrates how the parse tree provides | |
61 | access to module documentation defined in | |
62 | docstrings<a id='l2h-4957' xml:id='l2h-4957'></a> without | |
63 | requiring that the code being examined be loaded into a running | |
64 | interpreter via <tt class="keyword">import</tt>. This can be very useful for | |
65 | performing analyses of untrusted code. | |
66 | ||
67 | <P> | |
68 | Generally, the example will demonstrate how the parse tree may be | |
69 | traversed to distill interesting information. Two functions and a set | |
70 | of classes are developed which provide programmatic access to high | |
71 | level function and class definitions provided by a module. The | |
72 | classes extract information from the parse tree and provide access to | |
73 | the information at a useful semantic level, one function provides a | |
74 | simple low-level pattern matching capability, and the other function | |
75 | defines a high-level interface to the classes by handling file | |
76 | operations on behalf of the caller. All source files mentioned here | |
77 | which are not part of the Python installation are located in the | |
78 | <span class="file">Demo/parser/</span> directory of the distribution. | |
79 | ||
80 | <P> | |
81 | The dynamic nature of Python allows the programmer a great deal of | |
82 | flexibility, but most modules need only a limited measure of this when | |
83 | defining classes, functions, and methods. In this example, the only | |
84 | definitions that will be considered are those which are defined in the | |
85 | top level of their context, e.g., a function defined by a <tt class="keyword">def</tt> | |
86 | statement at column zero of a module, but not a function defined | |
87 | within a branch of an <tt class="keyword">if</tt> ... <tt class="keyword">else</tt> construct, though | |
88 | there are some good reasons for doing so in some situations. Nesting | |
89 | of definitions will be handled by the code developed in the example. | |
90 | ||
91 | <P> | |
92 | To construct the upper-level extraction methods, we need to know what | |
93 | the parse tree structure looks like and how much of it we actually | |
94 | need to be concerned about. Python uses a moderately deep parse tree | |
95 | so there are a large number of intermediate nodes. It is important to | |
96 | read and understand the formal grammar used by Python. This is | |
97 | specified in the file <span class="file">Grammar/Grammar</span> in the distribution. | |
98 | Consider the simplest case of interest when searching for docstrings: | |
99 | a module consisting of a docstring and nothing else. (See file | |
100 | <span class="file">docstring.py</span>.) | |
101 | ||
102 | <P> | |
103 | <div class="verbatim"><pre> | |
104 | """Some documentation. | |
105 | """ | |
106 | </pre></div> | |
107 | ||
108 | <P> | |
109 | Using the interpreter to take a look at the parse tree, we find a | |
110 | bewildering mass of numbers and parentheses, with the documentation | |
111 | buried deep in nested tuples. | |
112 | ||
113 | <P> | |
114 | <div class="verbatim"><pre> | |
115 | >>> import parser | |
116 | >>> import pprint | |
117 | >>> ast = parser.suite(open('docstring.py').read()) | |
118 | >>> tup = ast.totuple() | |
119 | >>> pprint.pprint(tup) | |
120 | (257, | |
121 | (264, | |
122 | (265, | |
123 | (266, | |
124 | (267, | |
125 | (307, | |
126 | (287, | |
127 | (288, | |
128 | (289, | |
129 | (290, | |
130 | (292, | |
131 | (293, | |
132 | (294, | |
133 | (295, | |
134 | (296, | |
135 | (297, | |
136 | (298, | |
137 | (299, | |
138 | (300, (3, '"""Some documentation.\n"""'))))))))))))))))), | |
139 | (4, ''))), | |
140 | (4, ''), | |
141 | (0, '')) | |
142 | </pre></div> | |
143 | ||
144 | <P> | |
145 | The numbers at the first element of each node in the tree are the node | |
146 | types; they map directly to terminal and non-terminal symbols in the | |
147 | grammar. Unfortunately, they are represented as integers in the | |
148 | internal representation, and the Python structures generated do not | |
149 | change that. However, the <tt class="module"><a href="module-symbol.html">symbol</a></tt> and <tt class="module"><a href="module-token.html">token</a></tt> modules | |
150 | provide symbolic names for the node types and dictionaries which map | |
151 | from the integers to the symbolic names for the node types. | |
152 | ||
153 | <P> | |
154 | In the output presented above, the outermost tuple contains four | |
155 | elements: the integer <code>257</code> and three additional tuples. Node | |
156 | type <code>257</code> has the symbolic name <tt class="constant">file_input</tt>. Each of | |
157 | these inner tuples contains an integer as the first element; these | |
158 | integers, <code>264</code>, <code>4</code>, and <code>0</code>, represent the node types | |
159 | <tt class="constant">stmt</tt>, <tt class="constant">NEWLINE</tt>, and <tt class="constant">ENDMARKER</tt>, | |
160 | respectively. | |
161 | Note that these values may change depending on the version of Python | |
162 | you are using; consult <span class="file">symbol.py</span> and <span class="file">token.py</span> for | |
163 | details of the mapping. It should be fairly clear that the outermost | |
164 | node is related primarily to the input source rather than the contents | |
165 | of the file, and may be disregarded for the moment. The <tt class="constant">stmt</tt> | |
166 | node is much more interesting. In particular, all docstrings are | |
167 | found in subtrees which are formed exactly as this node is formed, | |
168 | with the only difference being the string itself. The association | |
169 | between the docstring in a similar tree and the defined entity (class, | |
170 | function, or module) which it describes is given by the position of | |
171 | the docstring subtree within the tree defining the described | |
172 | structure. | |
173 | ||
174 | <P> | |
175 | By replacing the actual docstring with something to signify a variable | |
176 | component of the tree, we allow a simple pattern matching approach to | |
177 | check any given subtree for equivalence to the general pattern for | |
178 | docstrings. Since the example demonstrates information extraction, we | |
179 | can safely require that the tree be in tuple form rather than list | |
180 | form, allowing a simple variable representation to be | |
181 | <code>['variable_name']</code>. A simple recursive function can implement | |
182 | the pattern matching, returning a Boolean and a dictionary of variable | |
183 | name to value mappings. (See file <span class="file">example.py</span>.) | |
184 | ||
185 | <P> | |
186 | <div class="verbatim"><pre> | |
187 | from types import ListType, TupleType | |
188 | ||
189 | def match(pattern, data, vars=None): | |
190 | if vars is None: | |
191 | vars = {} | |
192 | if type(pattern) is ListType: | |
193 | vars[pattern[0]] = data | |
194 | return 1, vars | |
195 | if type(pattern) is not TupleType: | |
196 | return (pattern == data), vars | |
197 | if len(data) != len(pattern): | |
198 | return 0, vars | |
199 | for pattern, data in map(None, pattern, data): | |
200 | same, vars = match(pattern, data, vars) | |
201 | if not same: | |
202 | break | |
203 | return same, vars | |
204 | </pre></div> | |
205 | ||
206 | <P> | |
207 | Using this simple representation for syntactic variables and the symbolic | |
208 | node types, the pattern for the candidate docstring subtrees becomes | |
209 | fairly readable. (See file <span class="file">example.py</span>.) | |
210 | ||
211 | <P> | |
212 | <div class="verbatim"><pre> | |
213 | import symbol | |
214 | import token | |
215 | ||
216 | DOCSTRING_STMT_PATTERN = ( | |
217 | symbol.stmt, | |
218 | (symbol.simple_stmt, | |
219 | (symbol.small_stmt, | |
220 | (symbol.expr_stmt, | |
221 | (symbol.testlist, | |
222 | (symbol.test, | |
223 | (symbol.and_test, | |
224 | (symbol.not_test, | |
225 | (symbol.comparison, | |
226 | (symbol.expr, | |
227 | (symbol.xor_expr, | |
228 | (symbol.and_expr, | |
229 | (symbol.shift_expr, | |
230 | (symbol.arith_expr, | |
231 | (symbol.term, | |
232 | (symbol.factor, | |
233 | (symbol.power, | |
234 | (symbol.atom, | |
235 | (token.STRING, ['docstring']) | |
236 | )))))))))))))))), | |
237 | (token.NEWLINE, '') | |
238 | )) | |
239 | </pre></div> | |
240 | ||
241 | <P> | |
242 | Using the <tt class="function">match()</tt> function with this pattern, extracting the | |
243 | module docstring from the parse tree created previously is easy: | |
244 | ||
245 | <P> | |
246 | <div class="verbatim"><pre> | |
247 | >>> found, vars = match(DOCSTRING_STMT_PATTERN, tup[1]) | |
248 | >>> found | |
249 | 1 | |
250 | >>> vars | |
251 | {'docstring': '"""Some documentation.\n"""'} | |
252 | </pre></div> | |
253 | ||
254 | <P> | |
255 | Once specific data can be extracted from a location where it is | |
256 | expected, the question of where information can be expected | |
257 | needs to be answered. When dealing with docstrings, the answer is | |
258 | fairly simple: the docstring is the first <tt class="constant">stmt</tt> node in a code | |
259 | block (<tt class="constant">file_input</tt> or <tt class="constant">suite</tt> node types). A module | |
260 | consists of a single <tt class="constant">file_input</tt> node, and class and function | |
261 | definitions each contain exactly one <tt class="constant">suite</tt> node. Classes and | |
262 | functions are readily identified as subtrees of code block nodes which | |
263 | start with <code>(stmt, (compound_stmt, (classdef, ...</code> or | |
264 | <code>(stmt, (compound_stmt, (funcdef, ...</code>. Note that these subtrees | |
265 | cannot be matched by <tt class="function">match()</tt> since it does not support multiple | |
266 | sibling nodes to match without regard to number. A more elaborate | |
267 | matching function could be used to overcome this limitation, but this | |
268 | is sufficient for the example. | |
269 | ||
270 | <P> | |
271 | Given the ability to determine whether a statement might be a | |
272 | docstring and extract the actual string from the statement, some work | |
273 | needs to be performed to walk the parse tree for an entire module and | |
274 | extract information about the names defined in each context of the | |
275 | module and associate any docstrings with the names. The code to | |
276 | perform this work is not complicated, but bears some explanation. | |
277 | ||
278 | <P> | |
279 | The public interface to the classes is straightforward and should | |
280 | probably be somewhat more flexible. Each ``major'' block of the | |
281 | module is described by an object providing several methods for inquiry | |
282 | and a constructor which accepts at least the subtree of the complete | |
283 | parse tree which it represents. The <tt class="class">ModuleInfo</tt> constructor | |
284 | accepts an optional <var>name</var> parameter since it cannot | |
285 | otherwise determine the name of the module. | |
286 | ||
287 | <P> | |
288 | The public classes include <tt class="class">ClassInfo</tt>, <tt class="class">FunctionInfo</tt>, | |
289 | and <tt class="class">ModuleInfo</tt>. All objects provide the | |
290 | methods <tt class="method">get_name()</tt>, <tt class="method">get_docstring()</tt>, | |
291 | <tt class="method">get_class_names()</tt>, and <tt class="method">get_class_info()</tt>. The | |
292 | <tt class="class">ClassInfo</tt> objects support <tt class="method">get_method_names()</tt> and | |
293 | <tt class="method">get_method_info()</tt> while the other classes provide | |
294 | <tt class="method">get_function_names()</tt> and <tt class="method">get_function_info()</tt>. | |
295 | ||
296 | <P> | |
297 | Within each of the forms of code block that the public classes | |
298 | represent, most of the required information is in the same form and is | |
299 | accessed in the same way, with classes having the distinction that | |
300 | functions defined at the top level are referred to as ``methods.'' | |
301 | Since the difference in nomenclature reflects a real semantic | |
302 | distinction from functions defined outside of a class, the | |
303 | implementation needs to maintain the distinction. | |
304 | Hence, most of the functionality of the public classes can be | |
305 | implemented in a common base class, <tt class="class">SuiteInfoBase</tt>, with the | |
306 | accessors for function and method information provided elsewhere. | |
307 | Note that there is only one class which represents function and method | |
308 | information; this parallels the use of the <tt class="keyword">def</tt> statement to | |
309 | define both types of elements. | |
310 | ||
311 | <P> | |
312 | Most of the accessor functions are declared in <tt class="class">SuiteInfoBase</tt> | |
313 | and do not need to be overridden by subclasses. More importantly, the | |
314 | extraction of most information from a parse tree is handled through a | |
315 | method called by the <tt class="class">SuiteInfoBase</tt> constructor. The example | |
316 | code for most of the classes is clear when read alongside the formal | |
317 | grammar, but the method which recursively creates new information | |
318 | objects requires further examination. Here is the relevant part of | |
319 | the <tt class="class">SuiteInfoBase</tt> definition from <span class="file">example.py</span>: | |
320 | ||
321 | <P> | |
322 | <div class="verbatim"><pre> | |
323 | class SuiteInfoBase: | |
324 | _docstring = '' | |
325 | _name = '' | |
326 | ||
327 | def __init__(self, tree = None): | |
328 | self._class_info = {} | |
329 | self._function_info = {} | |
330 | if tree: | |
331 | self._extract_info(tree) | |
332 | ||
333 | def _extract_info(self, tree): | |
334 | # extract docstring | |
335 | if len(tree) == 2: | |
336 | found, vars = match(DOCSTRING_STMT_PATTERN[1], tree[1]) | |
337 | else: | |
338 | found, vars = match(DOCSTRING_STMT_PATTERN, tree[3]) | |
339 | if found: | |
340 | self._docstring = eval(vars['docstring']) | |
341 | # discover inner definitions | |
342 | for node in tree[1:]: | |
343 | found, vars = match(COMPOUND_STMT_PATTERN, node) | |
344 | if found: | |
345 | cstmt = vars['compound'] | |
346 | if cstmt[0] == symbol.funcdef: | |
347 | name = cstmt[2][1] | |
348 | self._function_info[name] = FunctionInfo(cstmt) | |
349 | elif cstmt[0] == symbol.classdef: | |
350 | name = cstmt[2][1] | |
351 | self._class_info[name] = ClassInfo(cstmt) | |
352 | </pre></div> | |
353 | ||
354 | <P> | |
355 | After initializing some internal state, the constructor calls the | |
356 | <tt class="method">_extract_info()</tt> method. This method performs the bulk of the | |
357 | information extraction which takes place in the entire example. The | |
358 | extraction has two distinct phases: the location of the docstring for | |
359 | the parse tree passed in, and the discovery of additional definitions | |
360 | within the code block represented by the parse tree. | |
361 | ||
362 | <P> | |
363 | The initial <tt class="keyword">if</tt> test determines whether the nested suite is of | |
364 | the ``short form'' or the ``long form.'' The short form is used when | |
365 | the code block is on the same line as the definition of the code | |
366 | block, as in | |
367 | ||
368 | <P> | |
369 | <div class="verbatim"><pre> | |
370 | def square(x): "Square an argument."; return x ** 2 | |
371 | </pre></div> | |
372 | ||
373 | <P> | |
374 | while the long form uses an indented block and allows nested | |
375 | definitions: | |
376 | ||
377 | <P> | |
378 | <div class="verbatim"><pre> | |
379 | def make_power(exp): | |
380 | "Make a function that raises an argument to the exponent `exp'." | |
381 | def raiser(x, y=exp): | |
382 | return x ** y | |
383 | return raiser | |
384 | </pre></div> | |
385 | ||
386 | <P> | |
387 | When the short form is used, the code block may contain a docstring as | |
388 | the first, and possibly only, <tt class="constant">small_stmt</tt> element. The | |
389 | extraction of such a docstring is slightly different and requires only | |
390 | a portion of the complete pattern used in the more common case. As | |
391 | implemented, the docstring will only be found if there is only | |
392 | one <tt class="constant">small_stmt</tt> node in the <tt class="constant">simple_stmt</tt> node. | |
393 | Since most functions and methods which use the short form do not | |
394 | provide a docstring, this may be considered sufficient. The | |
395 | extraction of the docstring proceeds using the <tt class="function">match()</tt> function | |
396 | as described above, and the value of the docstring is stored as an | |
397 | attribute of the <tt class="class">SuiteInfoBase</tt> object. | |
398 | ||
399 | <P> | |
400 | After docstring extraction, a simple definition discovery | |
401 | algorithm operates on the <tt class="constant">stmt</tt> nodes of the | |
402 | <tt class="constant">suite</tt> node. The special case of the short form is not | |
403 | tested; since there are no <tt class="constant">stmt</tt> nodes in the short form, | |
404 | the algorithm will silently skip the single <tt class="constant">simple_stmt</tt> | |
405 | node and correctly not discover any nested definitions. | |
406 | ||
407 | <P> | |
408 | Each statement in the code block is categorized as | |
409 | a class definition, function or method definition, or | |
410 | something else. For the definition statements, the name of the | |
411 | element defined is extracted and a representation object | |
412 | appropriate to the definition is created with the defining subtree | |
413 | passed as an argument to the constructor. The representation objects | |
414 | are stored in instance variables and may be retrieved by name using | |
415 | the appropriate accessor methods. | |
416 | ||
417 | <P> | |
418 | The public classes provide any accessors required which are more | |
419 | specific than those provided by the <tt class="class">SuiteInfoBase</tt> class, but | |
420 | the real extraction algorithm remains common to all forms of code | |
421 | blocks. A high-level function can be used to extract the complete set | |
422 | of information from a source file. (See file <span class="file">example.py</span>.) | |
423 | ||
424 | <P> | |
425 | <div class="verbatim"><pre> | |
426 | def get_docs(fileName): | |
427 | import os | |
428 | import parser | |
429 | ||
430 | source = open(fileName).read() | |
431 | basename = os.path.basename(os.path.splitext(fileName)[0]) | |
432 | ast = parser.suite(source) | |
433 | return ModuleInfo(ast.totuple(), basename) | |
434 | </pre></div> | |
435 | ||
436 | <P> | |
437 | This provides an easy-to-use interface to the documentation of a | |
438 | module. If information is required which is not extracted by the code | |
439 | of this example, the code may be extended at clearly defined points to | |
440 | provide additional capabilities. | |
441 | ||
442 | <DIV CLASS="navigation"> | |
443 | <div class='online-navigation'> | |
444 | <p></p><hr /> | |
445 | <table align="center" width="100%" cellpadding="0" cellspacing="2"> | |
446 | <tr> | |
447 | <td class='online-navigation'><a rel="prev" title="18.1.6.1 Emulation of compile()" | |
448 | href="node773.html"><img src='../icons/previous.png' | |
449 | border='0' height='32' alt='Previous Page' width='32' /></A></td> | |
450 | <td class='online-navigation'><a rel="parent" title="18.1.6 Examples" | |
451 | href="node772.html"><img src='../icons/up.png' | |
452 | border='0' height='32' alt='Up One Level' width='32' /></A></td> | |
453 | <td class='online-navigation'><a rel="next" title="18.2 symbol " | |
454 | href="module-symbol.html"><img src='../icons/next.png' | |
455 | border='0' height='32' alt='Next Page' width='32' /></A></td> | |
456 | <td align="center" width="100%">Python Library Reference</td> | |
457 | <td class='online-navigation'><a rel="contents" title="Table of Contents" | |
458 | href="contents.html"><img src='../icons/contents.png' | |
459 | border='0' height='32' alt='Contents' width='32' /></A></td> | |
460 | <td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png' | |
461 | border='0' height='32' alt='Module Index' width='32' /></a></td> | |
462 | <td class='online-navigation'><a rel="index" title="Index" | |
463 | href="genindex.html"><img src='../icons/index.png' | |
464 | border='0' height='32' alt='Index' width='32' /></A></td> | |
465 | </tr></table> | |
466 | <div class='online-navigation'> | |
467 | <b class="navlabel">Previous:</b> | |
468 | <a class="sectref" rel="prev" href="node773.html">18.1.6.1 Emulation of compile()</A> | |
469 | <b class="navlabel">Up:</b> | |
470 | <a class="sectref" rel="parent" href="node772.html">18.1.6 Examples</A> | |
471 | <b class="navlabel">Next:</b> | |
472 | <a class="sectref" rel="next" href="module-symbol.html">18.2 symbol </A> | |
473 | </div> | |
474 | </div> | |
475 | <hr /> | |
476 | <span class="release-info">Release 2.4.2, documentation updated on 28 September 2005.</span> | |
477 | </DIV> | |
478 | <!--End of Navigation Panel--> | |
479 | <ADDRESS> | |
480 | See <i><a href="about.html">About this document...</a></i> for information on suggesting changes. | |
481 | </ADDRESS> | |
482 | </BODY> | |
483 | </HTML> |