Commit | Line | Data |
---|---|---|
d509cef7 LC |
1 | .EQ |
2 | delim $$ | |
3 | .EN | |
4 | .NH 1 | |
5 | Introduction | |
6 | .PP | |
7 | Computers have become important | |
8 | in the document preparation process, with programs | |
9 | to check for spelling errors and to format documents. | |
10 | As the amount of text stored on line increases, it becomes | |
11 | feasible and attractive to study writing | |
12 | style and to attempt to help the writer in producing readable | |
13 | documents. | |
14 | The system of writing tools described here is a first step toward such help. | |
15 | The system includes programs and a data base to | |
16 | analyze writing style at the word and sentence level. | |
17 | We use the term ``style'' in this paper to describe the | |
18 | results of a writer's particular choices among individual words and | |
19 | sentence forms. | |
20 | Although many judgements of style are subjective, | |
21 | particularly those of word choice, | |
22 | there are some objective measures that experts | |
23 | agree lead to good style. | |
24 | Three programs have been written to measure some of | |
25 | the objectively definable characteristics of writing style | |
26 | and to identify some commonly misused or unnecessary phrases. | |
27 | Although a document that conforms to the stylistic rules | |
28 | is not guaranteed to be coherent and readable, one that | |
29 | violates all of the rules is likely to be | |
30 | difficult or tedious to read. | |
31 | The program STYLE calculates readability, sentence length variability, | |
32 | sentence type, word usage and sentence openers at a rate of about 400 words per second | |
33 | on a PDP-11/70 running the | |
34 | .UX | |
35 | Operating System. | |
36 | It assumes that the sentences are well-formed, i.e., that | |
37 | each sentence has a verb and that the subject and verb agree in number. | |
38 | DICTION identifies phrases that are either bad usage or unnecessarily wordy. | |
39 | EXPLAIN acts as a thesaurus for the phrases found by DICTION. | |
40 | Sections 2, 3, and 4 describe the programs; Section 5 gives the results | |
41 | on a cross-section of technical documents; Section 6 discusses | |
42 | accuracy and problems; Section 7 gives implementation details. | |
43 | .NH 1 | |
44 | STYLE | |
45 | .PP | |
46 | The program STYLE reads a document and prints a summary of | |
47 | readability indices, sentence length and type, word usage, | |
48 | and sentence openers. | |
49 | It may also be used to locate all sentences in a document | |
50 | longer than a given length, of readability index higher than a given | |
51 | number, those containing a passive verb, or those beginning with an expletive. | |
52 | STYLE | |
53 | is based on the system for finding English word classes or parts of speech, PARTS [1]. | |
54 | PARTS is a set of programs that uses a small dictionary (about 350 words) | |
55 | and suffix rules to partially assign word classes to | |
56 | English text. | |
57 | It then uses experimentally derived rules of word order to assign | |
58 | word classes to all words in the text with an accuracy of about 95%. | |
59 | Because PARTS uses only a small dictionary and general rules, it works | |
60 | on text about any subject, from physics to psychology. | |
61 | Style measures have been built into the output phase | |
62 | of the programs that make up PARTS. | |
63 | Some of the measures are simple counters of the word classes | |
64 | found by PARTS; many are more complicated. | |
65 | For example, the verb count is the total number of verb phrases. | |
66 | This includes phrases like: | |
67 | .DS | |
68 | has been going | |
69 | was only going | |
70 | to go | |
71 | .DE | |
72 | each of which counts as one verb. | |
73 | Figure 1 shows the output of STYLE run on a paper by Kernighan and Mashey | |
74 | about the | |
75 | .UX | |
76 | programming environment [2]. | |
77 | .KF | |
78 | .sp 2 | |
79 | .TS | |
80 | box; | |
81 | l1l. | |
82 | programming environment | |
83 | readability grades: | |
84 | (Kincaid) 12.3 (auto) 12.8 (Coleman-Liau) 11.8 (Flesch) 13.5 (46.3) | |
85 | sentence info: | |
86 | no. sent 335 no. wds 7419 | |
87 | av sent leng 22.1 av word leng 4.91 | |
88 | no. questions 0 no. imperatives 0 | |
89 | no. nonfunc wds 4362 58.8% av leng 6.38 | |
90 | short sent (<17) 35% (118) long sent (>32) 16% (55) | |
91 | longest sent 82 wds at sent 174; shortest sent 1 wds at sent 117 | |
92 | sentence types: | |
93 | simple 34% (114) complex 32% (108) | |
94 | compound 12% (41) compound-complex 21% (72) | |
95 | word usage: | |
96 | verb types as % of total verbs | |
97 | tobe 45% (373) aux 16% (133) inf 14% (114) | |
98 | passives as % of non-inf verbs 20% (144) | |
99 | types as % of total | |
100 | prep 10.8% (804) conj 3.5% (262) adv 4.8% (354) | |
101 | noun 26.7% (1983) adj 18.7% (1388) pron 5.3% (393) | |
102 | nominalizations 2 % (155) | |
103 | sentence beginnings: | |
104 | subject opener: noun (63) pron (43) pos (0) adj (58) art (62) tot 67% | |
105 | prep 12% (39) adv 9% (31) | |
106 | verb 0% (1) sub_conj 6% (20) conj 1% (5) | |
107 | expletives 4% (13) | |
108 | .TE | |
109 | .sp | |
110 | .ce | |
111 | Figure 1 | |
112 | .sp | |
113 | .KE | |
114 | As the example shows, STYLE output is in five parts. | |
115 | After a brief discussion of sentences, we will describe the parts in order. | |
116 | .NH 2 | |
117 | What is a sentence? | |
118 | .PP | |
119 | Readers of documents have little | |
120 | trouble deciding where the sentences end. | |
121 | People don't even have to stop and think about uses of the | |
122 | character ``.'' in constructions like | |
123 | 1.25, A. J. Jones, Ph.D., i. e., or etc. . | |
124 | When a computer reads a document, | |
125 | finding the end of sentences is not as easy. | |
126 | First we must throw away the printer's marks and formatting | |
127 | commands that litter the text in computer form. | |
128 | Then STYLE | |
129 | defines a sentence | |
130 | as a string of words ending in one of: | |
131 | .DS | |
132 | . ! ? /. | |
133 | .DE | |
134 | The end marker ``/.'' may be used to indicate an imperative sentence. | |
135 | Imperative sentences that are not so marked are not identified as imperative. | |
136 | STYLE properly handles numbers with embedded decimal points and commas, | |
137 | strings of letters and numbers with embedded decimal points used for | |
138 | computer file names, and | |
139 | the common | |
140 | abbreviations listed in Appendix 1. | |
141 | Numbers that end sentences, like the preceding sentence, cause | |
142 | a sentence break if the next word begins with a capital letter. | |
143 | Initials cause a sentence break only if the next word begins with | |
144 | a capital and is found in the dictionary of function words used by PARTS. | |
145 | So the string | |
146 | .DS | |
147 | J. D. JONES | |
148 | .DE | |
149 | does not cause a break, but the string | |
150 | .DS | |
151 | ... system H. The ... | |
152 | .DE | |
153 | does. | |
154 | With these rules most sentences are broken at the proper place, | |
155 | although occasionally | |
156 | either two sentences are called one or a fragment is called | |
157 | a sentence. | |
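The rules above can be sketched in a few lines of Python.
This is only an illustration, not the STYLE implementation:
the abbreviation set and the function-word set are small hypothetical stand-ins
for Appendix 1 and for the PARTS dictionary, and formatting commands are assumed
to have been stripped already.

```python
import re

# Hypothetical stand-ins: the real abbreviation list is in Appendix 1 and the
# real function-word dictionary belongs to PARTS; neither is reproduced here.
ABBREVIATIONS = {"i.e.", "e.g.", "etc.", "ph.d."}
FUNCTION_WORDS = {"the", "a", "an", "it", "this", "that", "in", "on", "of"}

INITIAL = re.compile(r"^[A-Z]\.$")        # a single capital letter plus "."
NUMBER = re.compile(r"^\d[\d,.]*\.$")     # a number followed by a final "."


def ends_sentence(token, next_token):
    """Decide whether token closes a sentence, given the token after it."""
    if token.endswith(("!", "?", "/.")):
        return True
    if not token.endswith("."):
        # Embedded points, as in 1.25 or a.out, never end a sentence.
        return False
    if token.lower() in ABBREVIATIONS:
        return False
    if next_token is None:
        return True
    next_is_capital = next_token[:1].isupper()
    if INITIAL.match(token):
        # Initials break only before a capitalized function word.
        next_word = next_token.strip(".,;:").lower()
        return next_is_capital and next_word in FUNCTION_WORDS
    if NUMBER.match(token):
        # A number ends the sentence only if a capitalized word follows.
        return next_is_capital
    return True


def split_sentences(text):
    """Split whitespace-separated text into sentences using the rules above."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if ends_sentence(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

With these stand-in word lists, the string "Ask J. D. Jones about system H. The file a.out has 1.25 pages."
is split after "H." and nowhere else before the final period.
Like the rules it imitates, the sketch will sometimes join two sentences or cut one short,
for instance when an abbreviation really does end a sentence.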
158 | ||
159 | .NH 2 | |
160 | Readability Grades | |
161 | .PP | |
162 | The first section of STYLE output consists of four readability indices. | |
163 | As Klare points out in [3], readability indices may be used to | |
164 | estimate the reading skills needed by the reader to understand a document. | |
165 | The readability indices reported by STYLE are based on | |
166 | measures of sentence and word lengths. | |
167 | Although the indices | |
168 | may not measure whether the document is coherent | |
169 | and well organized, | |
170 | experience has shown that high indices seem to be indicators of stylistic | |
171 | difficulty. | |
172 | Documents with short sentences and short words have low scores; | |
173 | those with long sentences and many polysyllabic words have high scores. | |
174 | The four formulae reported are the Kincaid Formula [4], the Automated Readability Index [5], | |
175 | the Coleman-Liau Formula [6], | |
176 | and a normalized version of the Flesch Reading Ease Score [7]. | |
177 | The formulae differ because they were experimentally derived using different texts | |
178 | and subject groups. | |
179 | We will discuss each of the formulae briefly; for a more | |
180 | detailed discussion the reader should see [3]. | |
181 | .PP | |
182 | The Kincaid Formula, given by: | |
183 | .EQ | |
184 | Reading_Grade = 11.8 * syl_per_wd + .39 * wds_per_sent - 15.59 | |
185 | .EN | |
186 | .br | |
187 | was based on Navy training manuals that ranged in difficulty | |
188 | from 5.5 to 16.3 in reading grade level. | |
189 | The score reported by this formula tends to be in the mid-range of the | |
190 | 4 scores. | |
191 | Because it is based on adult training manuals rather than | |
192 | school book text, this formula is probably the best | |
193 | one to apply to technical documents. | |
194 | .PP | |
195 | The Automated Readability Index (ARI), based on text from | |
196 | grades 0 to 7, was derived to be easy to automate. | |
197 | The formula is: | |
198 | .EQ | |
199 | Reading_Grade = 4.71 * let_per_wd + .5 * wds_per_sent - 21.43 | |
200 | .EN | |
201 | .br | |
202 | ARI tends to produce scores that are higher than Kincaid and | |
203 | Coleman-Liau but are usually slightly lower than Flesch. | |
204 | .PP | |
205 | The Coleman-Liau Formula, based on text ranging in | |
206 | difficulty from .4 to 16.3, is: | |
207 | .EQ | |
208 | Reading_Grade = 5.89 * let_per_wd - .3 * sent_per_100_wds - 15.8 | |
209 | .EN | |
210 | .br | |
211 | Of the four formulae, this one usually gives the lowest | |
212 | grade when applied to technical documents. | |
213 | .PP | |
214 | The last formula, the Flesch Reading Ease Score, is based | |
215 | on grade school text covering grades 3 to 12. | |
216 | The formula, given by: | |
217 | .EQ | |
218 | Reading_Score = 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent | |
219 | .EN | |
220 | .br | |
221 | is usually reported in the range 0 (very difficult) to 100 (very easy). | |
222 | The score reported by STYLE is scaled to be comparable to | |
223 | the other formulas, | |
224 | except that the maximum grade level reported is set to 17. | |
225 | The Flesch score is usually the highest of the 4 scores | |
226 | on technical documents. | |
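The four formulae can be written down directly as functions.
The following Python sketch uses the coefficients given above; the function names are
chosen here for illustration and are not taken from STYLE.
The Flesch function returns the raw 0 to 100 score, because the scaling that STYLE applies
to turn it into a grade level is not specified in this paper.

```python
def kincaid(syl_per_wd, wds_per_sent):
    """Kincaid reading grade from syllables per word and words per sentence."""
    return 11.8 * syl_per_wd + 0.39 * wds_per_sent - 15.59


def automated_readability_index(let_per_wd, wds_per_sent):
    """ARI reading grade from letters per word and words per sentence."""
    return 4.71 * let_per_wd + 0.5 * wds_per_sent - 21.43


def coleman_liau(let_per_wd, sent_per_100_wds):
    """Coleman-Liau reading grade from letters per word and sentences per 100 words."""
    return 5.89 * let_per_wd - 0.3 * sent_per_100_wds - 15.8


def flesch_reading_ease(syl_per_wd, wds_per_sent):
    """Raw Flesch Reading Ease score (0 = very difficult, 100 = very easy)."""
    return 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent


# Counts taken from Figure 1: 335 sentences, 7419 words, average word length 4.91.
sentences, words, let_per_wd = 335, 7419, 4.91
ari = automated_readability_index(let_per_wd, words / sentences)
cl = coleman_liau(let_per_wd, 100 * sentences / words)
print(round(ari, 1), round(cl, 1))
```

Fed the counts printed in Figure 1, the two letter-based formulae give values that round to
the grades shown there for ``(auto)'' and ``(Coleman-Liau)''.
Figure 1 does not print a syllable count, so the Kincaid and Flesch values cannot be checked the same way.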
227 | .PP | |
228 | Coke [8] found that the Kincaid Formula is probably the best predictor for | |
229 | technical documents; | |
230 | both ARI and Flesch tend to overestimate | |
231 | the difficulty; Coleman-Liau tends to underestimate. | |
232 | On text in the range of grades 7 to 9 | |
233 | the four formulas tend to be about the same. | |
234 | On easy text the Coleman-Liau formula is probably | |
235 | preferred since it is reasonably accurate at the lower | |
236 | grades and it is safer to present text that is a little too | |
237 | easy than a little too hard. | |
238 | .PP | |
239 | If a document has particularly difficult technical content, especially if | |
240 | it includes a lot of mathematics, | |
241 | it is probably best to make the text very easy to read, i.e., to lower | |
242 | the readability index by shortening the sentences and words. | |
243 | This will allow the reader to concentrate on the technical | |
244 | content and not the long sentences. | |
245 | The user should remember that these indices are estimators; | |
246 | they should not be taken as absolute numbers. | |
247 | STYLE called with ``\-r number'' will print all sentences with | |
248 | an Automated Readability Index equal to or greater than ``number''. | |
249 | .NH 2 | |
250 | Sentence length and structure | |
251 | .PP | |
252 | The next two sections of STYLE output deal with sentence length and structure. | |
253 | Almost all books on writing style or effective writing emphasize | |
254 | the importance of variety in sentence length and structure for good writing. | |
255 | Ewing's first rule in discussing style in the book | |
256 | .I | |
257 | Writing for Results | |
258 | .R | |
259 | [9] is: | |
260 | .DS | |
261 | ``Vary the sentence structure and length of your sentences.'' | |
262 | .DE | |
263 | Leggett, Mead and Charvat break this rule into 3 in | |
264 | .I | |
265 | Prentice-Hall Handbook for Writers | |
266 | .R | |
267 | [10] as follows: | |
268 | .DS | |
269 | ``34a. Avoid the overuse of short simple sentences.'' | |
270 | ``34b. Avoid the overuse of long compound sentences.'' | |
271 | ``34c. Use various sentence structures to avoid monotony and increase effectiveness.'' | |
272 | .DE | |
273 | Although experts agree that these rules are important, not all writers | |
274 | follow them. | |
275 | Sample technical documents have been found with almost no | |
276 | sentence length or type variability. | |
277 | One document had 90% of its sentences about the same | |
278 | length as the average; | |
279 | another was made up almost entirely of simple sentences (80%). | |
280 | .PP | |
281 | The output sections labeled ``sentence info'' and ``sentence types'' give | |
282 | both length and structure measures. | |
283 | STYLE reports on the number and average length of both | |
284 | sentences and words, | |
285 | and the number of questions and imperative sentences (those ending in ``/.''). | |
286 | The measures of non-function words are an attempt to look at the content | |
287 | words in the document. | |
288 | In English | |
289 | non-function words are nouns, adjectives, adverbs, and non-auxiliary verbs; | |
290 | function words are prepositions, conjunctions, articles, and auxiliary | |
291 | verbs. | |
292 | Since most function words are short, they tend to lower the average | |
293 | word length. | |
294 | The average length of non-function words may be a more useful measure for comparing | |
295 | word choice of different writers than the total average word length. | |
296 | The percentages of short and long sentences measure sentence | |
297 | length variability. | |
298 | Short sentences are those at least 5 words shorter than the | |
299 | average; long sentences are those at least 10 words longer than the average. | |
300 | Last in the sentence information section are the | |
301 | length and location of the longest and shortest sentences. | |
302 | If the flag ``\-l number'' is used, STYLE will print all sentences | |
303 | longer than ``number''. | |
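The short and long sentence percentages can be illustrated with a small Python sketch.
It follows the definitions in the text literally (at least 5 words below the average,
at least 10 words above it); the function name and the returned dictionary are assumptions made for this example.
Figure 1 prints its cutoffs as (<17) and (>32) for an average of 22.1, so the exact
boundary handling in STYLE may differ from this reading.

```python
def length_variability(sentence_lengths):
    """Return the average length and the percentages of short and long sentences.

    sentence_lengths is a list of word counts, one per sentence.
    """
    total = len(sentence_lengths)
    average = sum(sentence_lengths) / total
    # Short: at least 5 words below the average.
    short = sum(1 for n in sentence_lengths if n <= average - 5)
    # Long: at least 10 words above the average.
    long_ = sum(1 for n in sentence_lengths if n >= average + 10)
    return {
        "average": average,
        "short_pct": 100 * short / total,
        "long_pct": 100 * long_ / total,
    }


stats = length_variability([10, 20, 20, 20, 20, 20, 20, 20, 20, 30])
print(stats)
```

A document whose sentences all fall between the two cutoffs reports 0% for both measures,
which is the lack of variability described earlier.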
304 | .PP | |
305 | Because of the difficulties in dealing with the many uses of commas and conjunctions | |
306 | in English, sentence type definitions | |
307 | vary slightly from those of standard textbooks, but still measure | |
308 | the same constructional activity. | |
309 | .IP 1. | |
310 | A simple sentence has one verb and no dependent clause. | |
311 | .IP 2. | |
312 | A complex sentence has one independent | |
313 | clause and one dependent clause, each with one verb. | |
314 | Complex sentences are found by identifying sentences that contain either | |
315 | a subordinate conjunction or a clause beginning with words like ``that'' | |
316 | or ``who''. | |
317 | The preceding sentence has such a clause. | |
318 | .IP 3. | |
319 | A compound sentence has more than one verb and no dependent | |
320 | clause. | |
321 | Sentences joined by ``;'' are also counted as compound. | |
322 | .IP 4. | |
323 | A compound-complex sentence has either several dependent clauses | |
324 | or one dependent clause and a compound verb in either | |
325 | the dependent or independent clause. | |
326 | .PP | |
327 | Even using these broader definitions, simple | |
328 | sentences dominate many of the technical documents that | |
329 | have been tested, | |
330 | but the example in Figure 1 shows variety in both sentence structure and | |
331 | sentence length. | |
332 | .NH 2 | |
333 | Word Usage | |
334 | .PP | |
335 | The word usage measures are an attempt to identify | |
336 | some other constructional features of writing style. | |
337 | There are many different ways in English to | |
338 | say the same thing. | |
339 | The constructions differ from one another | |
340 | in the form of the words used. | |
341 | The following sentences all convey approximately the | |
342 | same meaning but differ in word usage: | |
343 | .DS | |
344 | The cxio program is used to perform all communication between the systems. | |
345 | The cxio program performs all communications between the systems. | |
346 | The cxio program is used to communicate between the systems. | |
347 | The cxio program communicates between the systems. | |
348 | All communication between the systems is performed by the cxio program. | |
349 | .DE | |
350 | The distribution of the parts of speech and verb constructions | |
351 | helps identify overuse of particular constructions. | |
352 | Although the measures used by STYLE are crude, they do point out | |
353 | problem areas. | |
354 | For each category, STYLE reports a percentage and a raw count. | |
355 | In addition to looking at the percentage, the user | |
356 | may find it useful to compare the raw count with the number of sentences. | |
357 | If, for example, the number of infinitives is almost equal to the number | |
358 | of sentences, then many of the sentences in the document are constructed | |
359 | like the first and third in the preceding example. | |
360 | The user may want to transform some of these sentences into another form. | |
361 | Some of the implications of the word usage measures are discussed below. | |
362 | .IP "\fIVerbs\fR " | |
363 | are measured in several different ways to | |
364 | try to determine what types of verb constructions are | |
365 | most frequent in the document. | |
366 | Technical writing tends to contain many | |
367 | passive verb constructions and other uses of the verb ``to be''. | |
368 | The category of verbs labeled ``tobe'' measures both passives and sentences of | |
369 | the form: | |
370 | .DS | |
371 | .I | |
372 | subject tobe predicate | |
373 | .R | |
374 | .DE | |
375 | In counting verbs, whole verb phrases are counted as one verb. | |
376 | Verb phrases containing auxiliary verbs are counted in the category | |
377 | ``aux''. | |
378 | The verb phrases counted here are those whose tense is not | |
379 | simple present or simple past. | |
380 | It might eventually be useful to do more detailed measures | |
381 | of verb tense or mood. | |
382 | Infinitives are listed as ``inf''. | |
383 | The percentages reported for these three categories are based on | |
384 | the total number of verb phrases found. | |
385 | These categories are not mutually exclusive; | |
386 | they cannot be added, since, for example, | |
387 | ``to be going'' counts as both ``tobe'' and ``inf''. | |
388 | Use of these three types of verb constructions varies significantly among authors. | |
389 | .sp 2 | |
390 | STYLE reports passive verbs as a percentage of the finite verbs in the | |
391 | document. | |
392 | Most style books warn against the overuse of passive verbs. | |
393 | Coleman [11] has shown that sentences with | |
394 | active verbs are easier to learn than those | |
395 | with passive verbs. | |
396 | Although the inverted object-subject order of the passive | |
397 | voice seems to emphasize the object, Coleman's experiments | |
398 | showed that there is little difference in retention | |
399 | by word position. He also showed that the direct object of an active verb | |
400 | is retained better than the subject of a passive verb. | |
401 | These experiments support the advice of the style books suggesting | |
402 | that writers should try to use active verbs wherever possible. | |
403 | The flag ``\-p'' causes STYLE to print all sentences containing passive verbs. | |
404 | .PP | |
405 | .IP "\fIPronouns\fR " | |
406 | add cohesiveness and connectivity to a document | |
407 | by providing back-reference. | |
408 | They are often a short-hand notation for something | |
409 | previously mentioned, and therefore connect the sentence containing the pronoun with the | |
410 | word to which the pronoun refers. | |
411 | Although there are other mechanisms for such connections, documents | |
412 | with no pronouns tend to be wordy and to have little connectivity. | |
413 | .IP "\fIAdverbs\fR " | |
414 | can provide transition between sentences and order | |
415 | in time and space. | |
416 | In performing these functions, adverbs, like pronouns, provide | |
417 | connectivity and cohesiveness. | |
418 | .IP "\fIConjunctions\fR " | |
419 | provide parallelism in a document by connecting two or more | |
420 | equal units. | |
421 | These units may be whole sentences, verb phrases, nouns, adjectives, or | |
422 | prepositional phrases. | |
423 | The compound and compound-complex sentences reported under | |
424 | sentence type are parallel structures. | |
425 | Other uses of parallel structures are indicated by the degree to which the | |
426 | number of conjunctions reported under word usage exceeds the | |
427 | compound sentence measures. | |
428 | .IP "\fINouns and Adjectives.\fR " | |
429 | A ratio of nouns to adjectives near unity may indicate the over-use of modifiers. | |
430 | Some technical writers qualify every noun with one or more | |
431 | adjectives. | |
432 | Qualifiers in phrases like ``simple linear single-link network model'' | |
433 | often lend more obscurity than precision to a text. | |
434 | .IP "\fINominalizations\fR " | |
435 | are verbs that are changed to nouns by adding one of the suffixes | |
436 | ``ment'', ``ance'', ``ence'', or ``ion''. | |
437 | Examples are accomplishment, admittance, adherence, and abbreviation. | |
438 | When a writer transforms a nominalized sentence to a non-nominalized | |
439 | sentence, she/he increases the effectiveness of the sentence in | |
440 | several ways. | |
441 | The noun becomes an active verb and frequently one complicated clause | |
442 | becomes two shorter clauses. | |
443 | For example, | |
444 | .DS | |
445 | Their inclusion of this provision is admission of the importance of the system. | |
446 | When they included this provision, they admitted the importance of the system. | |
447 | .DE | |
448 | Coleman found that the transformed sentences were easier to | |
449 | learn, even when the transformation produced sentences that were | |
450 | slightly longer, provided the transformation broke one clause into two. | |
451 | Writers who find their document contains many | |
452 | nominalizations may want to transform some of the sentences | |
453 | to use active verbs. | |
454 | .NH 2 | |
455 | Sentence openers | |
456 | .PP | |
457 | Another agreed-upon principle of style is variety in sentence openers. | |
458 | Because STYLE determines the type of sentence opener by | |
459 | looking at the part of speech of the first word in the sentence, | |
460 | the sentences counted under the heading ``subject opener'' may not | |
461 | all really begin with the subject. | |
462 | However, a large percentage of sentences in this category | |
463 | still indicates lack of variety in sentence openers. | |
464 | Other sentence opener measures help the user determine | |
465 | if there are transitions between sentences and where | |
466 | the subordination occurs. | |
467 | Adverbs and conjunctions at the beginning of sentences are mechanisms for | |
468 | transition between sentences. | |
469 | A pronoun at the beginning shows a link to something previously mentioned | |
470 | and indicates connectivity. | |
471 | .PP | |
472 | The location of subordination can be determined by comparing | |
473 | the number of sentences that begin with a subordinator with | |
474 | the number of sentences with complex clauses. | |
475 | If few sentences start with subordinate conjunctions then | |
476 | the subordination is embedded or at the end of the complex sentences. | |
477 | For variety the writer may want to transform some sentences | |
478 | to have leading subordination. | |
479 | .PP | |
480 | The last category of openers, expletives, is commonly | |
481 | overworked in technical writing. | |
482 | Expletives are the words ``it'' and ``there'', usually with the verb ``to be'', | |
483 | in constructions where the subject follows the verb. | |
484 | For example, | |
485 | .DS | |
486 | There are three streets used by the traffic. | |
487 | There are too many users on this system. | |
488 | .DE | |
489 | This construction tends to emphasize the object rather than the | |
490 | subject of the sentence. | |
491 | The flag ``\-e'' will cause STYLE to print all | |
492 | sentences that begin with an expletive. | |
493 | .NH 1 | |
494 | DICTION | |
495 | .PP | |
496 | The program DICTION prints all sentences in a document containing | |
497 | phrases that are either frequently misused or indicate wordiness. | |
498 | The program, an extension of Aho's FGREP [12] string | |
499 | matching program, | |
500 | takes as input a file of phrases or patterns to be matched and a file | |
501 | of text to be searched. | |
502 | A data base of about 450 phrases has been compiled as a default | |
503 | pattern file for DICTION. | |
504 | Before attempting to locate phrases, the program maps | |
505 | upper case letters to lower case and substitutes blanks for | |
506 | punctuation. | |
507 | Sentence boundaries were deemed less critical in DICTION than | |
508 | in STYLE, so abbreviations and other uses of the character | |
509 | ``.'' are not treated specially. | |
510 | DICTION brackets all pattern matches in a sentence with the characters | |
511 | ``['' and ``]''. | |
512 | Although many of the phrases in the default data base are correct | |
513 | in some contexts, in others they indicate wordiness. | |
514 | Some examples of the phrases and suggested alternatives are: | |
515 | .DS | |
516 | .TS | |
517 | cc | |
518 | ll. | |
519 | Phrase Alternative | |
520 | a large number of many | |
521 | arrive at a decision decide | |
522 | collect together collect | |
523 | for this reason so | |
524 | pertaining to about | |
525 | through the use of by or with | |
526 | utilize use | |
527 | with the exception of except | |
528 | .TE | |
529 | .DE | |
530 | Appendix 2 contains a complete list of the default file. | |
531 | Some of the entries are short forms of problem phrases. | |
532 | For example, the phrase ``the fact'' is found in all of the following | |
533 | and is sufficient to point out the wordiness to the user: | |
534 | .DS | |
535 | .TS | |
536 | cc | |
537 | ll. | |
538 | Phrase Alternative | |
539 | accounted for by the fact that caused by | |
540 | an example of this is the fact that thus | |
541 | based on the fact that because | |
542 | despite the fact that although | |
543 | due to the fact that because | |
544 | in light of the fact that because | |
545 | in view of the fact that since | |
546 | notwithstanding the fact that although | |
547 | .TE | |
548 | .DE | |
549 | Entries in Appendix 2 preceded by ``~'' are not matched. | |
550 | See Section 7 for details on the use of ``~''. | |
551 | .PP | |
552 | The user may supply her/his own pattern file with the flag ``\-f patfile''. | |
553 | In this case the default file will be loaded first, followed by the user file. | |
554 | This mechanism allows users to suppress | |
555 | patterns contained in the default file or to include their own pet peeves that are not in the default file. | |
556 | The flag ``\-n'' will exclude the default file altogether. | |
557 | In constructing a pattern file, blanks should be used before and after each | |
558 | phrase to avoid matching substrings in words. | |
559 | For example, to find all occurrences of the word ``the'', the pattern | |
560 | `` the '' should be used. | |
561 | The blanks cause only the word ``the'' to be matched and not the | |
562 | string ``the'' in words like there, other, and therefore. | |
563 | One side effect of surrounding the words with blanks is that | |
564 | when two phrases occur without intervening words, only the | |
565 | first will be matched. | |
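The effect of the surrounding blanks can be seen in a small Python sketch.
It is an illustration only: it uses a naive left-to-right scan rather than the FGREP
matching machinery, tries patterns in list order, brackets matches in the normalized
(lower case, punctuation-free) text, and does not model the ``~'' entries.
The function names are hypothetical.

```python
import string


def normalize(sentence):
    """Map upper case to lower case and punctuation to blanks; pad both ends."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " " + sentence.lower().translate(table) + " "


def bracket_matches(sentence, patterns):
    """Bracket each pattern found in sentence; patterns carry their own blanks."""
    text = normalize(sentence)
    out = []
    pos = 0
    while pos < len(text):
        # First pattern (in list order) that matches at this position, if any.
        hit = next((p for p in patterns if text.startswith(p, pos)), None)
        if hit is None:
            out.append(text[pos])
            pos += 1
        else:
            out.append(" [" + hit.strip() + "] ")
            # The trailing blank of the match is consumed here, so a second
            # phrase that starts immediately afterwards cannot match.
            pos += len(hit)
    return " ".join("".join(out).split())


print(bracket_matches("The other day, the cat ran.", [" the "]))
print(bracket_matches("Utilize the fact", [" utilize ", " the fact "]))
```

In the first call only the two occurrences of the word ``the'' are bracketed; ``other'' is left alone.
In the second call ``utilize'' is bracketed but ``the fact'' is not, because the blank
between the two phrases was used up by the first match.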
566 | .NH 1 | |
567 | EXPLAIN | |
568 | .PP | |
569 | The last program, EXPLAIN, is an interactive thesaurus for | |
570 | phrases found by DICTION. | |
571 | The user types one of the phrases bracketed by DICTION | |
572 | and EXPLAIN responds with suggested substitutions for the phrase | |
573 | that will improve the diction of the document. | |
574 | .KF | |
575 | .DS C | |
576 | Table 1 | |
577 | Text Statistics on 20 Technical Documents | |
578 | .TS | |
579 | cccccc | |
580 | llnnnn. | |
581 | variable minimum maximum mean standard deviation | |
582 | _ | |
583 | Readability Kincaid 9.5 16.9 13.3 2.2 | |
584 | automated 9.0 17.4 13.3 2.5 | |
585 | Coleman-Liau 10.0 16.0 12.7 1.8 | |
586 | Flesch 8.9 17.0 14.4 2.2 | |
587 | _ | |
588 | sentence info. av sent length 15.5 30.3 21.6 4.0 | |
589 | av word length 4.61 5.63 5.08 .29 | |
590 | av nonfunction length 5.72 7.30 6.52 .45 | |
591 | short sent 23% 46% 33% 5.9 | |
592 | long sent 7% 20% 14% 2.9 | |
593 | _ | |
594 | sentence types simple 31% 71% 49% 11.4 | |
595 | complex 19% 50% 33% 8.3 | |
596 | compound 2% 14% 7% 3.3 | |
597 | compound-complex 2% 19% 10% 4.8 | |
598 | _ | |
599 | verb types tobe 26% 64% 44.7% 10.3 | |
600 | auxiliary 10% 40% 21% 8.7 | |
601 | infinitives 8% 24% 15.1% 4.8 | |
602 | passives 12% 50% 29% 9.3 | |
603 | _ | |
604 | word usage prepositions 10.1% 15.0% 12.3% 1.6 | |
605 | conjunction 1.8% 4.8% 3.4% .9 | |
606 | adverbs 1.2% 5.0% 3.4% 1.0 | |
607 | nouns 23.6% 31.6% 27.8% 1.7 | |
608 | adjectives 15.4% 27.1% 21.1% 3.4 | |
609 | pronouns 1.2% 8.4% 2.5% 1.1 | |
610 | nominalizations 2% 5% 3.3% .8 | |
611 | _ | |
612 | sentence openers prepositions 6% 19% 12% 3.4 | |
613 | adverbs 0% 20% 9% 4.6 | |
614 | subject 56% 85% 70% 8.0 | |
615 | verbs 0% 4% 1% 1.0 | |
616 | subordinating conj 1% 12% 5% 2.7 | |
617 | conjunctions 0% 4% 0% 1.5 | |
618 | expletives 0% 6% 2% 1.7 | |
619 | .TE | |
620 | .DE | |
621 | .KE | |
622 | .NH 1 | |
623 | Results | |
624 | .NH 2 | |
625 | STYLE | |
626 | .PP | |
627 | To get baseline statistics and check the program's accuracy, | |
628 | we ran STYLE on 20 technical documents. | |
629 | There were a total of 3287 sentences in the sample. | |
630 | The shortest document was 67 sentences long; the longest 339 sentences. | |
631 | The documents covered a wide range of subject matter, including | |
632 | theoretical computing, physics, psychology, engineering, and | |
633 | affirmative action. | |
634 | Table 1 gives the range, median, and standard deviation of the various style measures. | |
635 | As the table shows, most of the measurements have a fairly wide range of values | |
636 | across the sample documents. | |
637 | .PP | |
638 | As a comparison, Table 2 gives the median results | |
639 | for two different technical authors, a sample of instructional material, and a sample of the | |
640 | Federalist Papers. | |
641 | The two authors show similar styles, although author 2 | |
642 | uses somewhat shorter sentences and longer words than author 1. | |
643 | Author 1 uses all types of sentences, while author 2 prefers | |
644 | simple and complex sentences, using few compound or compound-complex sentences. | |
645 | The other major difference in the styles of these authors is the location | |
646 | of subordination. | |
647 | Author 1 seems to prefer embedded or trailing subordination, while | |
648 | author 2 begins many sentences with the subordinate clause. | |
649 | The documents tested for both authors 1 and 2 were technical documents, | |
650 | written for a technical audience. | |
651 | The instructional documents, which are written for craftspeople, | |
652 | vary surprisingly little from the two technical samples. | |
653 | The sentences and words are a little longer, | |
654 | and they contain many passive and auxiliary verbs, few adverbs, and almost | |
655 | no pronouns. | |
656 | The instructional documents contain many imperative sentences, so there are | |
657 | many sentences with verb openers. | |
658 | The sample of Federalist Papers contrasts with the other | |
659 | samples in almost every way. | |
660 | .KF | |
661 | .DS C | |
662 | Table 2 | |
663 | Text Statistics on Single Authors | |
664 | .TS | |
665 | cccccc | |
666 | llnnnn. | |
667 | variable author 1 author 2 inst. FED | |
668 | _ | |
669 | readability Kincaid 11.0 10.3 10.8 16.3 | |
670 | automated 11.0 10.3 11.9 17.8 | |
671 | Coleman-Liau 9.3 10.1 10.2 12.3 | |
672 | Flesch 10.3 10.7 10.1 15.0 | |
673 | _ | |
674 | sentence info av sent length 22.64 19.61 22.78 31.85 | |
675 | av word length 4.47 4.66 4.65 4.95 | |
676 | av nonfunction length 5.64 5.92 6.04 6.87 | |
677 | short sent 35% 43% 35% 40% | |
678 | long sent 18% 15% 16% 21% | |
679 | _ | |
680 | sentence types simple 36% 43% 40% 31% | |
681 | complex 34% 41% 37% 34% | |
682 | compound 13% 7% 4% 10% | |
683 | compound-complex 16% 8% 14% 25% | |
684 | _ | |
685 | verb type tobe 42% 43% 45% 37% | |
686 | auxiliary 17% 19% 32% 32% | |
687 | infinitives 17% 15% 12% 21% | |
688 | passives 20% 19% 36% 20% | |
689 | _ | |
690 | word usage prepositions 10.0% 10.8% 12.3% 15.9% | |
691 | conjunctions 3.2% 2.4% 3.9% 3.4% | |
692 | adverbs 5.05% 4.6% 3.5% 3.7% | |
693 | nouns 27.7% 26.5% 29.1% 24.9% | |
694 | adjectives 17.0% 19.0% 15.4% 12.4% | |
695 | pronouns 5.3% 4.3% 2.1% 6.5% | |
696 | nominalizations 1% 2% 2% 3% | |
697 | _ | |
698 | sentence openers prepositions 11% 14% 6% 5% | |
699 | adverbs 9% 9% 6% 4% | |
700 | subject 65% 59% 54% 66% | |
701 | verb 3% 2% 14% 2% | |
702 | subordinating conj 8% 14% 11% 3% | |
703 | conjunction 1% 0% 0% 3% | |
704 | expletives 3% 3% 0% 3% | |
705 | .TE | |
706 | .DE | |
707 | .KE | |
708 | .NH 2 | |
709 | DICTION | |
710 | .PP | |
711 | In the few weeks that DICTION has been available | |
712 | to users, | |
713 | about 35,000 sentences have been run with about | |
714 | 5,000 string matches. | |
715 | The authors using the program seem to make | |
716 | the suggested changes about 50-75% of the time. | |
717 | To date, almost 200 of the 450 strings in the default | |
718 | file have been matched. | |
719 | Although most of these phrases are valid and correct | |
720 | in some contexts, the 50-75% change rate seems to | |
721 | show that the phrases are used much more often than | |
722 | concise diction warrants. | |
723 | .NH 1 | |
724 | Accuracy | |
725 | .NH 2 | |
726 | Sentence Identification | |
727 | .PP | |
728 | The correctness of the STYLE output on the 20 document sample was checked | |
729 | in detail. | |
730 | STYLE misidentified | |
731 | 129 sentence fragments as sentences | |
732 | and incorrectly joined two or more sentences 75 times | |
733 | in the 3287 sentence sample. | |
734 | The problems were usually caused by nonstandard formatting | |
735 | commands, unknown abbreviations, or lists of non-sentences. | |
736 | An impossibly long sentence found as the longest sentence in | |
737 | the document is usually the result of a long list | |
738 | of non-sentences. | |
739 | .NH 2 | |
740 | Sentence Types | |
741 | .PP | |
742 | STYLE correctly identified sentence type on 86.5% of | |
743 | the sentences in the sample. | |
744 | The type distribution of the sentences was | |
745 | 52.5% simple, 29.9% complex, 8.5% compound and | |
746 | 9% compound-complex. | |
747 | The program reported 49.5% simple, 31.9% complex, | |
748 | 8% compound and 10.4% compound-complex. | |
749 | On the individual | |
750 | documents, the number of simple sentences was | |
751 | under-reported by about 4% and the complex and compound-complex | |
752 | were over-reported by 3% and 2%, respectively. | |
753 | The following matrix shows the program's output | |
754 | vs. the actual sentence type. | |
755 | .DS C | |
756 | .TS | |
757 | csssss | |
758 | cccccc | |
759 | clnnnn. | |
760 | Program Results | |
761 | simple complex compound comp-complex | |
762 | Actual simple 1566 132 49 17 | |
763 | Sentence complex 47 892 6 65 | |
764 | Type compound 40 6 207 23 | |
765 | comp-complex 0 52 5 249 | |
766 | .TE | |
767 | .DE | |
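.PP
The agreement figures can be rechecked from the matrix itself.
The following is a minimal sketch in Python, not part of STYLE; the names TYPES, MATRIX and summarize are introduced here for illustration only.
It sums the diagonal to get the number of correctly typed sentences and sums rows and columns to get the actual and reported distributions.
Percentages derived from the matrix as printed may differ slightly from those quoted in the text.

```python
# Confusion matrix as printed above.
# Rows: actual sentence type. Columns: type reported by the program.
TYPES = ["simple", "complex", "compound", "comp-complex"]
MATRIX = [
    [1566, 132, 49, 17],
    [47, 892, 6, 65],
    [40, 6, 207, 23],
    [0, 52, 5, 249],
]


def summarize(matrix):
    """Return (total, correct, actual_counts, reported_counts)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))        # diagonal entries
    actual = [sum(row) for row in matrix]                 # row sums
    reported = [sum(row[j] for row in matrix) for j in range(n)]  # column sums
    return total, correct, actual, reported


total, correct, actual, reported = summarize(MATRIX)
print("sentences in matrix: %d" % total)
print("correctly typed:     %.1f%%" % (100.0 * correct / total))
for name, a, r in zip(TYPES, actual, reported):
    print("%-13s actual %5.1f%%  reported %5.1f%%"
          % (name, 100.0 * a / total, 100.0 * r / total))
```

An off-diagonal entry such as the 132 in the first row counts simple sentences that the program reported as complex.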
768 | .PP | |
769 | The system's inability to find imperative sentences seems to | |
770 | have little effect on most of the style statistics. | |
771 | A document with half of its sentences imperative was run, with and | |
772 | without the imperative end marker. | |
773 | The results were identical except for the expected errors of not finding | |
774 | verbs as sentence openers, not counting the imperative sentences, | |
775 | and a slight difference (1%) in the number of nouns | |
776 | and adjectives reported. | |
777 | .NH 2 | |
778 | Word Usage | |
779 | .PP | |
780 | The accuracy of identifying word types reflects | |
781 | that of PARTS, which is about 95% correct. | |
782 | The largest source of confusion is between nouns and | |
783 | adjectives. | |
784 | The verb counts were checked on about 20 sentences from each | |
785 | document and found to be about 98% correct. | |
786 | .NH 1 | |
787 | Technical Details | |
788 | .NH 2 | |
789 | Finding Sentences | |
790 | .PP | |
791 | The formatting commands embedded in the text increase the difficulty | |
792 | of finding sentences. | |
793 | Not all text in a document is in sentence form; there are headings, | |
794 | tables, equations and lists, for example. | |
795 | Headings like ``Finding Sentences'' above should be discarded, not | |
796 | attached to the next sentence. | |
797 | However, since many of the documents are formatted to be phototypeset, | |
798 | and contain font changes, which usually operate on the | |
799 | most important words in the document, | |
800 | discarding all formatting commands is not correct. | |
801 | To improve the programs' ability to find sentence boundaries, the deformatting program, DEROFF [13], | |
802 | has been given some knowledge of the formatting packages used on the | |
803 | .UX | |
804 | operating system. | |
805 | DEROFF will now do the following: | |
806 | .IP 1. | |
807 | Suppress all formatting macros that | |
808 | are used for titles, headings, author's name, etc. | |
809 | .IP 2. | |
810 | Suppress the arguments to the macros for titles, headings, author's name, etc. | |
811 | .IP 3. | |
812 | Suppress displays, tables, footnotes and text that is centered or in no-fill mode. | |
813 | .IP 4. | |
814 | Substitute a place holder for equations and check | |
815 | for hidden end markers. | |
816 | The place holder is necessary because many typists and authors use | |
817 | the equation setter to change fonts on important words. | |
818 | For this reason, header files containing the definition of | |
819 | the EQN delimiters must also be included as input to STYLE. | |
820 | End markers are often hidden when an equation ends a sentence | |
821 | and the period is typed | |
822 | inside the EQN delimiters. | |
823 | .IP 5. | |
824 | Add a "." after lists. | |
825 | If the flag \-ml is also used, all lists are suppressed. | |
826 | This is a separate flag because of the variety of ways the | |
827 | list macros are used. | |
828 | Often, lists are sentences that should be included in the analysis. | |
829 | The user must determine how lists are used in the document to be analyzed. | |
830 | .PP | |
831 | Both STYLE and DICTION call DEROFF before they look at the text. | |
832 | The user should supply the \-ml flag if the document contains | |
833 | many lists of non-sentences that should be skipped. | |
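.PP
Steps 1 through 4 above can be illustrated with a short sketch.
This is not the DEROFF source; it is a simplified Python model that assumes \-ms macro names, and the names deformat and PLACEHOLDER are hypothetical.
It handles only .EQ/.EN equation blocks, not in-line EQN delimiters, and it does not model the list handling of step 5.

```python
# Assumed -ms macro names; a real deformatter knows several macro packages.
HEADING_MACROS = {".TL", ".AU", ".AI", ".NH", ".SH"}   # titles, authors, headings
PARAGRAPH_MACROS = {".PP", ".LP", ".IP"}               # these end a heading
BLOCKS = {".DS": ".DE", ".TS": ".TE", ".FS": ".FE"}    # display, table, footnote
PLACEHOLDER = "EQUATION"                                # hypothetical token


def deformat(lines):
    """Return only the text lines that should be scanned for sentences."""
    out = []
    skip_until = None     # closing macro of a block being suppressed
    in_heading = False    # True while reading the arguments of a heading
    in_equation = False
    eq_body = []
    for line in lines:
        macro = line.split()[0] if line.startswith(".") else None
        if skip_until is not None:
            if macro == skip_until:
                skip_until = None
            continue
        if in_equation:
            if macro == ".EN":
                in_equation = False
                # A period typed inside the equation is a hidden end marker.
                hidden = bool(eq_body) and eq_body[-1].rstrip().endswith(".")
                out.append(PLACEHOLDER + ("." if hidden else ""))
            else:
                eq_body.append(line)
            continue
        if macro in BLOCKS:
            skip_until = BLOCKS[macro]
        elif macro == ".EQ":
            in_equation, eq_body = True, []
        elif macro in HEADING_MACROS:
            in_heading = True
        elif macro in PARAGRAPH_MACROS:
            in_heading = False
        elif macro is None and not in_heading:
            out.append(line)
        # Any other request line is dropped; the text lines around it are kept.
    return out


sample = [".NH 2", "Finding Sentences", ".PP", "The value is",
          ".EQ", "x = 1 .", ".EN", "It grows.",
          ".DS", "not a sentence", ".DE", "Done."]
print(deformat(sample))
```

In the sample, the heading text and the display are suppressed, and the equation is replaced by the place holder with its hidden period restored.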
834 | .NH 2 | |
835 | Details of DICTION | |
836 | .PP | |
837 | The program DICTION is based on the string matching program FGREP. | |
838 | FGREP takes as input a file of patterns to be matched and a file | |
839 | to be searched and outputs each line that contains | |
840 | any of the patterns | |
841 | with no indication of which pattern was matched. | |
842 | The following changes have been made to FGREP: | |
843 | .IP 1. | |
844 | The basic unit that DICTION operates on is a sentence rather than a line. | |
845 | Each sentence that contains one of the patterns is output. | |
846 | .IP 2. | |
847 | Upper case letters are mapped to lower case. | |
848 | .IP 3. | |
849 | Punctuation is replaced by blanks. | |
850 | .IP 4. | |
851 | All pattern matches in the sentence are found and surrounded with | |
852 | ``['' ``]'' . | |
853 | .IP 5. | |
854 | A method for suppressing a string match has been added. | |
855 | Any pattern that begins with ``~'' will not be matched. | |
856 | Because the matching algorithm finds the longest | |
857 | substring, the suppression of a match allows words in some | |
858 | correct contexts not to be matched while allowing | |
859 | the word in another context to be found. | |
860 | For example, the word ``which'' is often incorrectly used | |
861 | instead of ``that'' in restrictive clauses. | |
862 | However, ``which'' is usually correct when preceded by a preposition | |
863 | or ``,''. | |
864 | The default pattern file suppresses the match | |
865 | of the common prepositions or a double | |
866 | blank followed by ``which'' and therefore matches only | |
867 | the suspect uses. | |
868 | The double blank accounts for the replaced comma. | |
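.PP
The five changes can be illustrated with a small sketch.
This is not the DICTION source: it is a Python model that uses a simple left-to-right scan in place of FGREP's own matching algorithm, and the function names and the four sample patterns are assumptions made for the example.
Patterns are written with explicit blanks so that the double blank left by a replaced comma can be expressed.

```python
import string

# Every punctuation character is replaced by a blank (change 3).
PUNCT_TO_BLANK = str.maketrans({c: " " for c in string.punctuation})


def normalize(sentence):
    """Map to lower case, blank out punctuation, pad with blanks at both ends."""
    return " " + sentence.lower().translate(PUNCT_TO_BLANK) + " "


def mark(sentence, patterns):
    """Return the normalized sentence with matches in [ ], or None if no match."""
    text = normalize(sentence)
    flagged = [p for p in patterns if not p.startswith("~") and p.strip()]
    suppressed = [p[1:] for p in patterns if p.startswith("~") and p[1:].strip()]
    out, i, found = [], 0, False
    while i < len(text):
        best, is_flag = "", False
        # Longest pattern starting at position i wins; suppressions win ties.
        for group, flag in ((suppressed, False), (flagged, True)):
            for p in group:
                if len(p) > len(best) and text.startswith(p, i):
                    best, is_flag = p, flag
        if not best:
            out.append(text[i])
            i += 1
            continue
        body = best.rstrip()          # leave the trailing blank for the next match
        if is_flag:
            lead = body[:len(body) - len(body.lstrip())]
            out.append(lead + "[" + body.lstrip() + "]")
            found = True
        else:
            out.append(body)          # suppressed: copy through without marking
        i += len(body)
    return "".join(out).strip() if found else None


# Hypothetical pattern file: "~" marks a suppressed match.
patterns = [" which ", "~ of which ", "~  which ", " in order to "]
print(mark("The file which holds the patterns is small.", patterns))
print(mark("The file, which is small, holds the patterns.", patterns))
print(mark("The file of which I spoke is small.", patterns))
```

Only the first sentence is output, because in the other two the longer suppressed pattern consumes ``which'' before the shorter pattern can match.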
869 | .NH 1 | |
870 | Conclusions | |
871 | .PP | |
872 | A system of writing tools that measure some of the | |
873 | objective characteristics of writing style has been developed. | |
874 | The tools are sufficiently general that they may be applied to | |
875 | documents on any subject with equal accuracy. | |
876 | Although the measurements are only of the surface | |
877 | structure of the text, they do point out problem areas. | |
878 | In addition to helping writers produce better documents, | |
879 | these programs may be useful for studying | |
880 | the writing process and finding other formulae for measuring | |
881 | readability. |