1.EQ
2delim $$
3.EN
4.NH 1
5Introduction
6.PP
7Computers have become important
8in the document preparation process, with programs
9to check for spelling errors and to format documents.
10As the amount of text stored on line increases, it becomes
11feasible and attractive to study writing
12style and to attempt to help the writer in producing readable
13documents.
14The system of writing tools described here is a first step toward such help.
15The system includes programs and a data base to
16analyze writing style at the word and sentence level.
17We use the term ``style'' in this paper to describe the
18results of a writer's particular choices among individual words and
19sentence forms.
20Although many judgements of style are subjective,
21particularly those of word choice,
22there are some objective measures that experts
23agree lead to good style.
24Three programs have been written to measure some of
25the objectively definable characteristics of writing style
26and to identify some commonly misused or unnecessary phrases.
27Although a document that conforms to the stylistic rules
28is not guaranteed to be coherent and readable, one that
29violates all of the rules is likely to be
30difficult or tedious to read.
31The program STYLE calculates readability, sentence length variability,
32sentence type, word usage and sentence openers at a rate of about 400 words per second
33on a PDP11/70 running the
34.UX
35Operating System.
36It assumes that the sentences are well-formed, i. e. that
37each sentence has a verb and that the subject and verb agree in number.
38DICTION identifies phrases that are either bad usage or unnecessarily wordy.
39EXPLAIN acts as a thesaurus for the phrases found by DICTION.
40Sections 2, 3, and 4 describe the programs; Section 5 gives the results
41on a cross-section of technical documents; Section 6 discusses
42accuracy and problems; Section 7 gives implementation details.
43.NH 1
44STYLE
45.PP
46The program STYLE reads a document and prints a summary of
47readability indices, sentence length and type, word usage,
48and sentence openers.
49It may also be used to locate all sentences in a document
50longer than a given length, of readability index higher than a given
51number, those containing a passive verb, or those beginning with an expletive.
52STYLE
53is based on the system for finding English word classes or parts of speech, PARTS [1].
54PARTS is a set of programs that uses a small dictionary (about 350 words)
55and suffix rules to partially assign word classes to
56English text.
57It then uses experimentally derived rules of word order to assign
58word classes to all words in the text with an accuracy of about 95%.
59Because PARTS uses only a small dictionary and general rules, it works
60on text about any subject, from physics to psychology.
61Style measures have been built into the output phase
62of the programs that make up PARTS.
63Some of the measures are simple counters of the word classes
64found by PARTS; many are more complicated.
65For example, the verb count is the total number of verb phrases.
66This includes phrases like:
67.DS
68has been going
69was only going
70to go
71.DE
each of which counts as one verb.
73Figure 1 shows the output of STYLE run on a paper by Kernighan and Mashey
74about the
75.UX
76programming environment [2].
77.KF
78.sp 2
.TS
box tab(@);
l1l.
programming environment
readability grades:
@(Kincaid) 12.3 (auto) 12.8 (Coleman-Liau) 11.8 (Flesch) 13.5 (46.3)
sentence info:
@no. sent 335 no. wds 7419
@av sent leng 22.1 av word leng 4.91
@no. questions 0 no. imperatives 0
@no. nonfunc wds 4362 58.8% av leng 6.38
@short sent (<17) 35% (118) long sent (>32) 16% (55)
@longest sent 82 wds at sent 174; shortest sent 1 wds at sent 117
sentence types:
@simple 34% (114) complex 32% (108)
@compound 12% (41) compound-complex 21% (72)
word usage:
@verb types as % of total verbs
@tobe 45% (373) aux 16% (133) inf 14% (114)
@passives as % of non-inf verbs 20% (144)
@types as % of total
@prep 10.8% (804) conj 3.5% (262) adv 4.8% (354)
@noun 26.7% (1983) adj 18.7% (1388) pron 5.3% (393)
@nominalizations 2% (155)
sentence beginnings:
@subject opener: noun (63) pron (43) pos (0) adj (58) art (62) tot 67%
@prep 12% (39) adv 9% (31)
@verb 0% (1) sub_conj 6% (20) conj 1% (5)
@expletives 4% (13)
.TE
109.sp
110.ce
111Figure 1
112.sp
113.KE
114As the example shows, STYLE output is in five parts.
115After a brief discussion of sentences, we will describe the parts in order.
116.NH 2
117What is a sentence?
118.PP
119Readers of documents have little
120trouble deciding where the sentences end.
121People don't even have to stop and think about uses of the
122character ``.'' in constructions like
1231.25, A. J. Jones, Ph.D., i. e., or etc. .
124When a computer reads a document,
125finding the end of sentences is not as easy.
126First we must throw away the printer's marks and formatting
127commands that litter the text in computer form.
128Then STYLE
129defines a sentence
130as a string of words ending in one of:
131.DS
132 . ! ? /.
133.DE
134The end marker ``/.'' may be used to indicate an imperative sentence.
135Imperative sentences that are not so marked are not identified as imperative.
136STYLE properly handles numbers with embedded decimal points and commas,
strings of letters and numbers with embedded decimal points used as
computer file names, and
139the common
140abbreviations listed in Appendix 1.
141Numbers that end sentences, like the preceding sentence, cause
142a sentence break if the next word begins with a capital letter.
Initials cause a sentence break only if the next word begins with
a capital and is found in the dictionary of function words used by PARTS.
145So the string
146.DS
147J. D. JONES
148.DE
149does not cause a break, but the string
150.DS
151 ... system H. The ...
152.DE
153does.
154With these rules most sentences are broken at the proper place,
155although occasionally
156either two sentences are called one or a fragment is called
157a sentence.
158More on this later.
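.PP
The rules above can be illustrated with a minimal sketch.
It is not the STYLE implementation: the abbreviation list and the
function-word list are small hypothetical stand-ins for Appendix 1 and
the PARTS dictionary, and the imperative marker ``/.'' is treated as an
ordinary period.

```python
import re

# Hypothetical stand-ins: the real lists are Appendix 1 and the PARTS dictionary.
ABBREVIATIONS = {"ph.d.", "i.e.", "e.g.", "etc."}
FUNCTION_WORDS = {"the", "it", "this", "in", "we"}


def ends_sentence(word, next_word):
    """Decide whether `word` closes a sentence, given the word that follows."""
    if word.endswith(("!", "?")):
        return True
    if not word.endswith("."):
        return False  # covers numbers such as 1.25 with an embedded point
    if word.lower() in ABBREVIATIONS:
        return False
    body = word[:-1]
    next_is_capitalized = next_word[:1].isupper()
    if re.fullmatch(r"[0-9.,]+", body):
        # A number followed by a period breaks only before a capital
        # (or at the very end of the text).
        return next_is_capitalized or next_word == ""
    if re.fullmatch(r"[A-Za-z]", body):
        # An initial breaks only before a capitalized function word.
        return (next_is_capitalized
                and next_word.lower().strip(".,") in FUNCTION_WORDS)
    return True


def split_sentences(text):
    """Split whitespace-separated text into sentences using ends_sentence."""
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        next_word = words[i + 1] if i + 1 < len(words) else ""
        if ends_sentence(word, next_word):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

With these stand-in lists, the string ``J. D. JONES'' stays inside one
sentence, while ``system H. The'' is split after the initial.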
159.NH 2
160Readability Grades
161.PP
162The first section of STYLE output consists of four readability indices.
As Klare points out in [3], readability indices may be used to
164estimate the reading skills needed by the reader to understand a document.
165The readability indices reported by STYLE are based on
166measures of sentence and word lengths.
167Although the indices
168may not measure whether the document is coherent
169and well organized,
170experience has shown that high indices seem to be indicators of stylistic
171difficulty.
172Documents with short sentences and short words have low scores;
173those with long sentences and many polysyllabic words have high scores.
The four formulae reported are the Kincaid Formula [4], the Automated Readability Index [5],
the Coleman-Liau Formula [6],
and a normalized version of the Flesch Reading Ease Score [7].
177The formulae differ because they were experimentally derived using different texts
178and subject groups.
179We will discuss each of the formulae briefly; for a more
180detailed discussion the reader should see [3].
181.PP
182The Kincaid Formula, given by:
183.EQ
184Reading_Grade = 11.8 * syl_per_wd + .39 * wds_per_sent - 15.59
185.EN
186.br
187was based on Navy training manuals that ranged in difficulty
188from 5.5 to 16.3 in reading grade level.
189The score reported by this formula tends to be in the mid-range of the
1904 scores.
191Because it is based on adult training manuals rather than
192school book text, this formula is probably the best
193one to apply to technical documents.
194.PP
195The Automated Readability Index (ARI), based on text from
196grades 0 to 7, was derived to be easy to automate.
197The formula is:
198.EQ
199Reading_Grade = 4.71 * let_per_wd + .5 * wds_per_sent - 21.43
200.EN
201.br
202ARI tends to produce scores that are higher than Kincaid and
203Coleman-Liau but are usually slightly lower than Flesch.
204.PP
205The Coleman-Liau Formula, based on text ranging in
206difficulty from .4 to 16.3, is:
207.EQ
208Reading_Grade = 5.89 * let_per_wd - .3 * sent_per_100_wds - 15.8
209.EN
210.br
211Of the four formulae this one usually gives the lowest
212grade when applied to technical documents.
213.PP
214The last formula, the Flesch Reading Ease Score, is based
215on grade school text covering grades 3 to 12.
216The formula, given by:
217.EQ
218Reading_Score = 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent
219.EN
220.br
221is usually reported in the range 0 (very difficult) to 100 (very easy).
222The score reported by STYLE is scaled to be comparable to
223the other formulas,
224except that the maximum grade level reported is set to 17.
225The Flesch score is usually the highest of the 4 scores
226on technical documents.
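.PP
As a worked check, the four formulae can be written as small functions
and applied to the counts printed in Figure 1.
This is a sketch, not the STYLE source: the function names are
illustrative, and ``av word leng 4.91'' is assumed to mean letters per word.
Figure 1 does not report syllables per word, so only ARI and
Coleman-Liau are evaluated here, and the scaling that STYLE applies to
the Flesch score is not reproduced.

```python
def kincaid(syl_per_wd, wds_per_sent):
    """Kincaid reading grade."""
    return 11.8 * syl_per_wd + 0.39 * wds_per_sent - 15.59


def ari(let_per_wd, wds_per_sent):
    """Automated Readability Index reading grade."""
    return 4.71 * let_per_wd + 0.5 * wds_per_sent - 21.43


def coleman_liau(let_per_wd, sent_per_100_wds):
    """Coleman-Liau reading grade."""
    return 5.89 * let_per_wd - 0.3 * sent_per_100_wds - 15.8


def flesch_reading_ease(syl_per_wd, wds_per_sent):
    """Unscaled Flesch Reading Ease Score (0 = very difficult, 100 = very easy)."""
    return 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent


# Counts taken from Figure 1; letters per word is an assumed reading of
# the "av word leng" entry.
words, sentences, let_per_wd = 7419, 335, 4.91
wds_per_sent = words / sentences
sent_per_100_wds = 100 * sentences / words

ari_grade = round(ari(let_per_wd, wds_per_sent), 1)
coleman_liau_grade = round(coleman_liau(let_per_wd, sent_per_100_wds), 1)
print(ari_grade, coleman_liau_grade)
```

Under these assumptions the two computed grades round to 12.8 and 11.8,
the values shown for ``(auto)'' and ``(Coleman-Liau)'' in Figure 1.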
227.PP
228Coke [8] found that the Kincaid Formula is probably the best predictor for
229technical documents;
both ARI and Flesch tend to overestimate
the difficulty; Coleman-Liau tends to underestimate it.
232On text in the range of grades 7 to 9
233the four formulas tend to be about the same.
234On easy text the Coleman-Liau formula is probably
235preferred since it is reasonably accurate at the lower
236grades and it is safer to present text that is a little too
237easy than a little too hard.
238.PP
239If a document has particularly difficult technical content, especially if
240it includes a lot of mathematics,
it is probably best to make the text very easy to read, i.e., to lower the
readability index by shortening the sentences and words.
243This will allow the reader to concentrate on the technical
244content and not the long sentences.
245The user should remember that these indices are estimators;
246they should not be taken as absolute numbers.
247STYLE called with ``\-r number'' will print all sentences with
248an Automated Readability Index equal to or greater than ``number''.
249.NH 2
250Sentence length and structure
251.PP
252The next two sections of STYLE output deal with sentence length and structure.
253Almost all books on writing style or effective writing emphasize
254the importance of variety in sentence length and structure for good writing.
255Ewing's first rule in discussing style in the book
256.I
257Writing for Results
258.R
259[9] is:
260.DS
261``Vary the sentence structure and length of your sentences.''
262.DE
263Leggett, Mead and Charvat break this rule into 3 in
264.I
265Prentice-Hall Handbook for Writers
266.R
267[10] as follows:
268.DS
269``34a. Avoid the overuse of short simple sentences.''
270``34b. Avoid the overuse of long compound sentences.''
271``34c. Use various sentence structures to avoid monotony and increase effectiveness.''
272.DE
273Although experts agree that these rules are important, not all writers
274follow them.
275Sample technical documents have been found with almost no
276sentence length or type variability.
277One document had 90% of its sentences about the same
278length as the average;
279another was made up almost entirely of simple sentences (80%).
280.PP
281The output sections labeled ``sentence info'' and ``sentence types'' give
282both length and structure measures.
283STYLE reports on the number and average length of both
284sentences and words,
285and number of questions and imperative sentences (those ending in ``/.'').
286The measures of non-function words are an attempt to look at the content
287words in the document.
288In English
289non-function words are nouns, adjectives, adverbs, and non-auxiliary verbs;
290function words are prepositions, conjunctions, articles, and auxiliary
291verbs.
292Since most function words are short, they tend to lower the average
293word length.
294The average length of non-function words may be a more useful measure for comparing
295word choice of different writers than the total average word length.
296The percentages of short and long sentences measure sentence
297length variability.
Short sentences are those at least 5 words shorter than the
average; long sentences are those at least 10 words longer than the average.
300Last in the sentence information section is the
301length and location of the longest and shortest sentences.
302If the flag ``\-l number'' is used, STYLE will print all sentences
303longer than ``number''.
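.PP
The short and long sentence percentages can be sketched as follows.
The exact cutoffs are an assumption: they are chosen so that an average
of 22.1 words gives the labels ``(<17)'' and ``(>32)'' shown in Figure 1,
which is obtained here by truncating the average before adding the
offsets.
STYLE itself may compute the limits differently.

```python
def thresholds(average):
    """Return (short_limit, long_limit) for a given average sentence length.

    Assumption: the average is truncated to an integer first, which
    reproduces the "<17" and ">32" labels of Figure 1 for 22.1.
    """
    base = int(average)
    return base - 5, base + 10


def length_variability(lengths):
    """Percentage of short and long sentences in a list of sentence lengths."""
    average = sum(lengths) / len(lengths)
    short_limit, long_limit = thresholds(average)
    short = sum(1 for n in lengths if n < short_limit)
    long_ = sum(1 for n in lengths if n > long_limit)
    return {
        "average": average,
        "short_pct": 100 * short / len(lengths),
        "long_pct": 100 * long_ / len(lengths),
    }
```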
304.PP
Because of the difficulties in dealing with the many uses of commas and conjunctions
in English, the sentence type definitions used by STYLE
vary slightly from those of standard textbooks, but they still measure
the same constructional activity.
309.IP 1.
310A simple sentence has one verb and no dependent clause.
311.IP 2.
312A complex sentence has one independent
313clause and one dependent clause, each with one verb.
314Complex sentences are found by identifying sentences that contain either
315a subordinate conjunction or a clause beginning with words like ``that''
316or ``who''.
317The preceding sentence has such a clause.
318.IP 3.
319A compound sentence has more than one verb and no dependent
320clause.
321Sentences joined by ``;'' are also counted as compound.
322.IP 4.
323A compound-complex sentence has either several dependent clauses
324or one dependent clause and a compound verb in either
325the dependent or independent clause.
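.PP
The four definitions above amount to a small decision rule on two counts.
The sketch below assumes that the number of verb phrases and the number
of dependent clauses in a sentence are already known; in STYLE these
come from the word classes assigned by PARTS.
The function name is illustrative, and the special treatment of ``;'' is
not modeled.

```python
def sentence_type(verbs, dependent_clauses):
    """Classify a sentence from its verb-phrase and dependent-clause counts."""
    if dependent_clauses == 0:
        # No subordination: one verb is simple, several are compound.
        return "simple" if verbs <= 1 else "compound"
    if dependent_clauses == 1 and verbs <= 2:
        # One independent and one dependent clause, each with one verb.
        return "complex"
    # Several dependent clauses, or one dependent clause plus a compound verb.
    return "compound-complex"
```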
326.PP
327Even using these broader definitions, simple
328sentences dominate many of the technical documents that
329have been tested,
330but the example in Figure 1 shows variety in both sentence structure and
331sentence length.
332.NH 2
333Word Usage
334.PP
335The word usage measures are an attempt to identify
336some other constructional features of writing style.
337There are many different ways in English to
338say the same thing.
339The constructions differ from one another
340in the form of the words used.
341The following sentences all convey approximately the
342same meaning but differ in word usage:
343.DS
344The cxio program is used to perform all communication between the systems.
345The cxio program performs all communications between the systems.
346The cxio program is used to communicate between the systems.
347The cxio program communicates between the systems.
348All communication between the systems is performed by the cxio program.
349.DE
350The distribution of the parts of speech and verb constructions
351helps identify overuse of particular constructions.
352Although the measures used by STYLE are crude, they do point out
353problem areas.
354For each category, STYLE reports a percentage and a raw count.
355In addition to looking at the percentage, the user
356may find it useful to compare the raw count with the number of sentences.
357If, for example, the number of infinitives is almost equal to the number
358of sentences, then many of the sentences in the document are constructed
359like the first and third in the preceding example.
360The user may want to transform some of these sentences into another form.
361Some of the implications of the word usage measures are discussed below.
362.IP "\fIVerbs\fR "
363are measured in several different ways to
364try to determine what types of verb constructions are
365most frequent in the document.
Technical writing tends to contain many
passive verb constructions and other uses of the verb ``to be''.
368The category of verbs labeled ``tobe'' measures both passives and sentences of
369the form:
370.DS
371.I
372subject tobe predicate
373.R
374.DE
375In counting verbs, whole verb phrases are counted as one verb.
376Verb phrases containing auxiliary verbs are counted in the category
377``aux''.
378The verb phrases counted here are those whose tense is not
379simple present or simple past.
380It might eventually be useful to do more detailed measures
381of verb tense or mood.
382Infinitives are listed as ``inf''.
383The percentages reported for these three categories are based on
384the total number of verb phrases found.
385These categories are not mutually exclusive;
386they cannot be added, since, for example,
387``to be going'' counts as both ``tobe'' and ``inf''.
388Use of these three types of verb constructions varies significantly among authors.
389.sp 2
390STYLE reports passive verbs as a percentage of the finite verbs in the
391document.
392Most style books warn against the overuse of passive verbs.
393Coleman [11] has shown that sentences with
394active verbs are easier to learn than those
395with passive verbs.
396Although the inverted object-subject order of the passive
397voice seems to emphasize the object, Coleman's experiments
398showed that there is little difference in retention
399by word position. He also showed that the direct object of an active verb
400is retained better than the subject of a passive verb.
401These experiments support the advice of the style books suggesting
402that writers should try to use active verbs wherever possible.
403The flag ``\-p'' causes STYLE to print all sentences containing passive verbs.
404.PP
405.IP "\fIPronouns\fR "
406add cohesiveness and connectivity to a document
407by providing back-reference.
408They are often a short-hand notation for something
409previously mentioned, and therefore connect the sentence containing the pronoun with the
410word to which the pronoun refers.
411Although there are other mechanisms for such connections, documents
412with no pronouns tend to be wordy and to have little connectivity.
413.IP "\fIAdverbs\fR "
414can provide transition between sentences and order
415in time and space.
416In performing these functions, adverbs, like pronouns, provide
417connectivity and cohesiveness.
418.IP "\fIConjunctions\fR "
419provide parallelism in a document by connecting two or more
420equal units.
421These units may be whole sentences, verb phrases, nouns, adjectives, or
422prepositional phrases.
423The compound and compound-complex sentences reported under
424sentence type are parallel structures.
425Other uses of parallel structures are indicated by the degree that the
426number of conjunctions reported under word usage exceeds the
427compound sentence measures.
428.IP "\fINouns and Adjectives.\fR "
429A ratio of nouns to adjectives near unity may indicate the over-use of modifiers.
430Some technical writers qualify every noun with one or more
431adjectives.
432Qualifiers in phrases like ``simple linear single-link network model''
433often lend more obscurity than precision to a text.
434.IP "\fINominalizations\fR "
435are verbs that are changed to nouns by adding one of the suffixes
436``ment'', ``ance'', ``ence'', or ``ion''.
437Examples are accomplishment, admittance, adherence, and abbreviation.
438When a writer transforms a nominalized sentence to a non-nominalized
439sentence, she/he increases the effectiveness of the sentence in
440several ways.
441The noun becomes an active verb and frequently one complicated clause
442becomes two shorter clauses.
443For example,
444.DS
445Their inclusion of this provision is admission of the importance of the system.
446When they included this provision, they admitted the importance of the system.
447.DE
448Coleman found that the transformed sentences were easier to
449learn, even when the transformation produced sentences that were
450slightly longer, provided the transformation broke one clause into two.
451Writers who find their document contains many
452nominalizations may want to transform some of the sentences
453to use active verbs.
454.NH 2
455Sentence openers
456.PP
Another agreed-upon principle of style is variety in sentence openers.
458Because STYLE determines the type of sentence opener by
459looking at the part of speech of the first word in the sentence,
460the sentences counted under the heading ``subject opener'' may not
461all really begin with the subject.
462However, a large percentage of sentences in this category
463still indicates lack of variety in sentence openers.
464Other sentence opener measures help the user determine
465if there are transitions between sentences and where
466the subordination occurs.
467Adverbs and conjunctions at the beginning of sentences are mechanisms for
468transition between sentences.
469A pronoun at the beginning shows a link to something previously mentioned
470and indicates connectivity.
471.PP
472The location of subordination can be determined by comparing
473the number of sentences that begin with a subordinator with
474the number of sentences with complex clauses.
475If few sentences start with subordinate conjunctions then
476the subordination is embedded or at the end of the complex sentences.
477For variety the writer may want to transform some sentences
478to have leading subordination.
479.PP
480The last category of openers, expletives, is commonly
481overworked in technical writing.
482Expletives are the words ``it'' and ``there'', usually with the verb ``to be'',
483in constructions where the subject follows the verb.
484For example,
485.DS
486There are three streets used by the traffic.
487There are too many users on this system.
488.DE
489This construction tends to emphasize the object rather than the
490subject of the sentence.
491The flag ``\-e'' will cause STYLE to print all
492sentences that begin with an expletive.
493.NH 1
494DICTION
495.PP
496The program DICTION prints all sentences in a document containing
497phrases that are either frequently misused or indicate wordiness.
498The program, an extension of Aho's FGREP [12] string
499matching program,
500takes as input a file of phrases or patterns to be matched and a file
501of text to be searched.
502A data base of about 450 phrases has been compiled as a default
503pattern file for DICTION.
504Before attempting to locate phrases, the program maps
505upper case letters to lower case and substitutes blanks for
506punctuation.
507Sentence boundaries were deemed less critical in DICTION than
508in STYLE, so abbreviations and other uses of the character
509``.'' are not treated specially.
510DICTION brackets all pattern matches in a sentence with the characters
511``['' ``]'' .
512Although many of the phrases in the default data base are correct
513in some contexts, in others they indicate wordiness.
514Some examples of the phrases and suggested alternatives are:
515.DS
.TS
tab(@);
cc
ll.
Phrase@Alternative
a large number of@many
arrive at a decision@decide
collect together@collect
for this reason@so
pertaining to@about
through the use of@by or with
utilize@use
with the exception of@except
.TE
529.DE
530Appendix 2 contains a complete list of the default file.
531Some of the entries are short forms of problem phrases.
532For example, the phrase ``the fact'' is found in all of the following
533and is sufficient to point out the wordiness to the user:
534.DS
.TS
tab(@);
cc
ll.
Phrase@Alternative
accounted for by the fact that@caused by
an example of this is the fact that@thus
based on the fact that@because
despite the fact that@although
due to the fact that@because
in light of the fact that@because
in view of the fact that@since
notwithstanding the fact that@although
.TE
548.DE
549Entries in Appendix 2 preceded by ``~'' are not matched.
550See Section 7 for details on the use of ``~''.
551.PP
552The user may supply her/his own pattern file with the flag ``\-f patfile''.
553In this case the default file will be loaded first, followed by the user file.
554This mechanism allows users to suppress
555patterns contained in the default file or to include their own pet peeves that are not in the default file.
556The flag ``\-n'' will exclude the default file altogether.
557In constructing a pattern file, blanks should be used before and after each
558phrase to avoid matching substrings in words.
559For example, to find all occurrences of the word ``the'', the pattern
560`` the '' should be used.
561The blanks cause only the word ``the'' to be matched and not the
562string ``the'' in words like there, other, and therefore.
563One side effect of surrounding the words with blanks is that
564when two phrases occur without intervening words, only the
565first will be matched.
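.PP
The effect of the surrounding blanks can be seen in a small sketch.
It is not the FGREP-based implementation: it assumes a simple
left-to-right scan that prefers the longest pattern at each position
and resumes after the end of each match, and the function names are
illustrative.

```python
import string

# Map every punctuation character to a blank, as DICTION does before matching.
PUNCT_TO_BLANK = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(sentence):
    """Lower-case the sentence, blank out punctuation, and pad with blanks."""
    return " " + sentence.lower().translate(PUNCT_TO_BLANK) + " "


def find_matches(sentence, patterns):
    """Return the patterns found, scanning left to right without overlap."""
    text = normalize(sentence)
    found = []
    pos = 0
    while pos < len(text):
        hit = None
        for pattern in patterns:
            if text.startswith(pattern, pos) and (hit is None or len(pattern) > len(hit)):
                hit = pattern
        if hit is not None:
            found.append(hit.strip())
            pos += len(hit)  # the trailing blank is consumed by this match
        else:
            pos += 1
    return found
```

In this sketch the pattern `` the '' is found twice in ``The other day, the cat.''
but not inside ``other''.
In ``utilize utilize'' only the first occurrence of `` utilize '' is
reported, because the blank between the two words is used up by the
first match.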
566.NH 1
567EXPLAIN
568.PP
569The last program, EXPLAIN, is an interactive thesaurus for
570phrases found by DICTION.
571The user types one of the phrases bracketed by DICTION
572and EXPLAIN responds with suggested substitutions for the phrase
573that will improve the diction of the document.
574.KF
575.DS C
576Table 1
577Text Statistics on 20 Technical Documents
.TS
tab(@);
cccccc
llnnnn.
@variable@minimum@maximum@mean@standard deviation
_
Readability@Kincaid@9.5@16.9@13.3@2.2
@automated@9.0@17.4@13.3@2.5
@Cole-Liau@10.0@16.0@12.7@1.8
@Flesch@8.9@17.0@14.4@2.2
_
sentence info.@av sent length@15.5@30.3@21.6@4.0
@av word length@4.61@5.63@5.08@.29
@av nonfunction length@5.72@7.30@6.52@.45
@short sent@23%@46%@33%@5.9
@long sent@7%@20%@14%@2.9
_
sentence types@simple@31%@71%@49%@11.4
@complex@19%@50%@33%@8.3
@compound@2%@14%@7%@3.3
@compound-complex@2%@19%@10%@4.8
_
verb types@tobe@26%@64%@44.7%@10.3
@auxiliary@10%@40%@21%@8.7
@infinitives@8%@24%@15.1%@4.8
@passives@12%@50%@29%@9.3
_
word usage@prepositions@10.1%@15.0%@12.3%@1.6
@conjunction@1.8%@4.8%@3.4%@.9
@adverbs@1.2%@5.0%@3.4%@1.0
@nouns@23.6%@31.6%@27.8%@1.7
@adjectives@15.4%@27.1%@21.1%@3.4
@pronouns@1.2%@8.4%@2.5%@1.1
@nominalizations@2%@5%@3.3%@.8
_
sentence openers@prepositions@6%@19%@12%@3.4
@adverbs@0%@20%@9%@4.6
@subject@56%@85%@70%@8.0
@verbs@0%@4%@1%@1.0
@subordinating conj@1%@12%@5%@2.7
@conjunctions@0%@4%@0%@1.5
@expletives@0%@6%@2%@1.7
.TE
620.DE
621.KE
622.NH 1
623Results
624.NH 2
625STYLE
626.PP
627To get baseline statistics and check the program's accuracy,
628we ran STYLE on 20 technical documents.
629There were a total of 3287 sentences in the sample.
630The shortest document was 67 sentences long; the longest 339 sentences.
631The documents covered a wide range of subject matter, including
632theoretical computing, physics, psychology, engineering, and
633affirmative action.
Table 1 gives the range, mean, and standard deviation of the various style measures.
As the table shows, most of the measurements have a fairly wide range of values
across the sample documents.
637.PP
638As a comparison, Table 2 gives the median results
639for two different technical authors, a sample of instructional material, and a sample of the
640Federalist Papers.
641The two authors show similar styles, although author 2
642uses somewhat shorter sentences and longer words than author 1.
643Author 1 uses all types of sentences, while author 2 prefers
644simple and complex sentences, using few compound or compound-complex sentences.
645The other major difference in the styles of these authors is the location
646of subordination.
647Author 1 seems to prefer embedded or trailing subordination, while
648author 2 begins many sentences with the subordinate clause.
649The documents tested for both authors 1 and 2 were technical documents,
650written for a technical audience.
651The instructional documents, which are written for craftspeople,
652vary surprisingly little from the two technical samples.
653The sentences and words are a little longer,
654and they contain many passive and auxiliary verbs, few adverbs, and almost
655no pronouns.
The instructional documents contain many imperative sentences, so there are
many sentences with verb openers.
658The sample of Federalist Papers contrasts with the other
659samples in almost every way.
660.KF
661.DS C
662Table 2
663Text Statistics on Single Authors
.TS
tab(@);
cccccc
llnnnn.
@variable@author 1@author 2@inst.@FED
_
readability@Kincaid@11.0@10.3@10.8@16.3
@automated@11.0@10.3@11.9@17.8
@Coleman-Liau@9.3@10.1@10.2@12.3
@Flesch@10.3@10.7@10.1@15.0
_
sentence info@av sent length@22.64@19.61@22.78@31.85
@av word length@4.47@4.66@4.65@4.95
@av nonfunction length@5.64@5.92@6.04@6.87
@short sent@35%@43%@35%@40%
@long sent@18%@15%@16%@21%
_
sentence types@simple@36%@43%@40%@31%
@complex@34%@41%@37%@34%
@compound@13%@7%@4%@10%
@compound-complex@16%@8%@14%@25%
_
verb type@tobe@42%@43%@45%@37%
@auxiliary@17%@19%@32%@32%
@infinitives@17%@15%@12%@21%
@passives@20%@19%@36%@20%
_
word usage@prepositions@10.0%@10.8%@12.3%@15.9%
@conjunctions@3.2%@2.4%@3.9%@3.4%
@adverbs@5.05%@4.6%@3.5%@3.7%
@nouns@27.7%@26.5%@29.1%@24.9%
@adjectives@17.0%@19.0%@15.4%@12.4%
@pronouns@5.3%@4.3%@2.1%@6.5%
@nominalizations@1%@2%@2%@3%
_
sentence openers@prepositions@11%@14%@6%@5%
@adverbs@9%@9%@6%@4%
@subject@65%@59%@54%@66%
@verb@3%@2%@14%@2%
@subordinating conj@8%@14%@11%@3%
@conjunction@1%@0%@0%@3%
@expletives@3%@3%@0%@3%
.TE
706.DE
707.KE
708.NH 2
709DICTION
710.PP
711In the few weeks that DICTION has been available
712to users
713about 35,000 sentences have been run with about
7145,000 string matches.
715The authors using the program seem to make
716the suggested changes about 50-75% of the time.
717To date, almost 200 of the 450 strings in the default
718file have been matched.
719Although most of these phrases are valid and correct
720in some contexts, the 50-75% change rate seems to
721show that the phrases are used much more often than
722concise diction warrants.
723.NH 1
724Accuracy
725.NH 2
726Sentence Identification
727.PP
728The correctness of the STYLE output on the 20 document sample was checked
729in detail.
730STYLE misidentified
731129 sentence fragments as sentences
732and incorrectly joined two or more sentences 75 times
733in the 3287 sentence sample.
The problems were usually caused by nonstandard formatting
commands, unknown abbreviations, or lists of non-sentences.
736An impossibly long sentence found as the longest sentence in
737the document usually is the result of a long list
738of non-sentences.
739.NH 2
740Sentence Types
741.PP
STYLE correctly identified sentence type on 86.5% of
743the sentences in the sample.
744The type distribution of the sentences was
74552.5% simple, 29.9% complex, 8.5% compound and
7469% compound-complex.
747The program reported 49.5% simple, 31.9% complex,
7488% compound and 10.4% compound-complex.
749Looking at the errors on the individual
750documents, the number of simple sentences was
751under-reported by about 4% and the complex and compound-complex
752were over-reported by 3% and 2%, respectively.
The following matrix shows the program's output
vs. the actual sentence type.
755.DS C
.TS
tab(@);
csssss
cccccc
clnnnn.
Program Results
@@simple@complex@compound@comp-complex
Actual@simple@1566@132@49@17
Sentence@complex@47@892@6@65
Type@compound@40@6@207@23
@comp-complex@0@52@5@249
.TE
767.DE
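.PP
The summary figures can be recomputed from the matrix, as in the sketch
below (the names are illustrative).
The entries as printed sum to 3356, so the percentages obtained this way
are close to, but not identical with, those quoted in the text; for
example the overall agreement comes out at about 86.8%.

```python
LABELS = ["simple", "complex", "compound", "comp-complex"]

# Rows are the actual sentence type, columns are the type reported by STYLE.
MATRIX = [
    [1566, 132, 49, 17],
    [47, 892, 6, 65],
    [40, 6, 207, 23],
    [0, 52, 5, 249],
]


def summarize(matrix):
    """Return total count, overall accuracy, and actual/reported distributions."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(size))
    actual = [sum(row) / total for row in matrix]
    reported = [sum(row[j] for row in matrix) / total for j in range(size)]
    return total, correct / total, actual, reported


total, accuracy, actual, reported = summarize(MATRIX)
for label, a, r in zip(LABELS, actual, reported):
    print(f"{label:13s} actual {100 * a:5.1f}%  reported {100 * r:5.1f}%")
print(f"agreement {100 * accuracy:.1f}% of {total} sentences")
```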
768.PP
769The system's inability to find imperative sentences seems to
770have little effect on most of the style statistics.
771A document with half of its sentences imperative was run, with and
772without the imperative end marker.
773The results were identical except for the expected errors of not finding
774verbs as sentence openers, not counting the imperative sentences,
775and a slight difference (1%) in the number of nouns
776and adjectives reported.
777.NH 2
778Word Usage
779.PP
780The accuracy of identifying word types reflects
781that of PARTS, which is about 95% correct.
782The largest source of confusion is between nouns and
783adjectives.
784The verb counts were checked on about 20 sentences from each
785document and found to be about 98% correct.
786.NH 1
787Technical Details
788.NH 2
789Finding Sentences
790.PP
791The formatting commands embedded in the text increase the difficulty
792of finding sentences.
793Not all text in a document is in sentence form; there are headings,
794tables, equations and lists, for example.
795Headings like ``Finding Sentences'' above should be discarded, not
796attached to the next sentence.
797However, since many of the documents are formatted to be phototypeset,
798and contain font changes, which usually operate on the
799most important words in the document,
800discarding all formatting commands is not correct.
801To improve the programs' ability to find sentence boundaries, the deformatting program, DEROFF [13],
802has been given some knowledge of the formatting packages used on the
803.UX
804operating system.
805DEROFF will now do the following:
806.IP 1.
807Suppress all formatting macros that
808are used for titles, headings, author's name, etc.
809.IP 2.
810Suppress the arguments to the macros for titles, headings, author's name, etc.
811.IP 3.
812Suppress displays, tables, footnotes and text that is centered or in no-fill mode.
813.IP 4.
814Substitute a place holder for equations and check
815for hidden end markers.
816The place holder is necessary because many typists and authors use
817the equation setter to change fonts on important words.
818For this reason, header files containing the definition of
819the EQN delimiters must also be included as input to STYLE.
820End markers are often hidden when an equation ends a sentence
821and the period is typed
822inside the EQN delimiters.
823.IP 5.
824Add a "." after lists.
825If the flag \-ml is also used, all lists are suppressed.
826This is a separate flag because of the variety of ways the
827list macros are used.
828Often, lists are sentences that should be included in the analysis.
829The user must determine how lists are used in the document to be analyzed.
830.PP
831Both STYLE and DICTION call DEROFF before they look at the text.
832The user should supply the \-ml flag if the document contains
833many lists of non-sentences that should be skipped.
834.NH 2
835Details of DICTION
836.PP
837The program DICTION is based on the string matching program FGREP.
838FGREP takes as input a file of patterns to be matched and a file
839to be searched and outputs each line that contains
840any of the patterns
841with no indication of which pattern was matched.
The following changes have been made to FGREP:
843.IP 1.
844The basic unit that DICTION operates on is a sentence rather than a line.
845Each sentence that contains one of the patterns is output.
846.IP 2.
847Upper case letters are mapped to lower case.
848.IP 3.
849Punctuation is replaced by blanks.
.IP 4.
851All pattern matches in the sentence are found and surrounded with
852``['' ``]'' .
853.IP 5.
854A method for suppressing a string match has been added.
855Any pattern that begins with ``~'' will not be matched.
856Because the matching algorithm finds the longest
857substring, the suppression of a match allows words in some
858correct contexts not to be matched while allowing
859the word in another context to be found.
860For example, the word ``which'' is often incorrectly used
861instead of ``that'' in restrictive clauses.
862However, ``which'' is usually correct when preceded by a preposition
863or ``,''.
864The default pattern file suppresses the match
865of the common prepositions or a double
866blank followed by ``which'' and therefore matches only
867the suspect uses.
868The double blank accounts for the replaced comma.
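.PP
The suppression mechanism can be sketched in a few lines.
This is an illustration under stated assumptions rather than the
DICTION source: the three patterns are a hypothetical excerpt, not the
actual default file, and the scan simply takes the longest pattern at
each position and reports it only if it is not marked with ``~''.

```python
import string

PUNCT_TO_BLANK = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(sentence):
    """Lower-case the sentence, blank out punctuation, and pad with blanks."""
    return " " + sentence.lower().translate(PUNCT_TO_BLANK) + " "


def match_with_suppression(sentence, pattern_lines):
    """Report matched phrases; patterns starting with "~" only block a match."""
    active = {p for p in pattern_lines if not p.startswith("~")}
    suppressed = {p[1:] for p in pattern_lines if p.startswith("~")}
    text = normalize(sentence)
    found = []
    pos = 0
    while pos < len(text):
        candidates = [p for p in active | suppressed if text.startswith(p, pos)]
        if candidates:
            longest = max(candidates, key=len)
            if longest in active:
                found.append(longest.strip())
            pos += len(longest)  # a suppressed match is skipped silently
        else:
            pos += 1
    return found


# Hypothetical excerpt of a pattern file: flag "which" unless it follows
# the preposition "of" or a double blank (a replaced comma).
PATTERNS = [" which ", "~ of which ", "~  which "]
```

With these patterns ``The house which is red.'' is reported, while
``The house, which is red.'' and ``The box of which I spoke.'' are not.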
869.NH
870Conclusions
871.PP
872A system of writing tools that measure some of the
873objective characteristics of writing style has been developed.
874The tools are sufficiently general that they may be applied to
875documents on any subject with equal accuracy.
876Although the measurements are only of the surface
877structure of the text, they do point out problem areas.
878In addition to helping writers produce better documents,
879these programs may be useful for studying
880the writing process and finding other formulae for measuring
881readability.