# This document contains text in Perl "POD" format.
# Use a POD viewer like perldoc or perlman to render it.

=head1 NAME

Locale::Maketext::TPJ13 -- article about software localization

=head1 SYNOPSIS

  # This is an article, not a module.

=head1 DESCRIPTION

The following article by Sean M. Burke and Jordan Lachler
first appeared in I<The Perl
Journal> #13 and is copyright 1999 The Perl Journal. It appears
courtesy of Jon Orwant and The Perl Journal. This document may be
distributed under the same terms as Perl itself.

=head1 Localization and Perl: gettext breaks, Maketext fixes

by Sean M. Burke and Jordan Lachler

This article points out cases where gettext (a common system for
localizing software interfaces -- i.e., making them work in the user's
language of choice) fails because of basic differences between human
languages. This article then describes Maketext, a new system capable
of correctly treating these differences.

=head2 A Localization Horror Story: It Could Happen To You

=over

"There are a number of languages spoken by human beings in this
world."

-- Harald Tveit Alvestrand, in RFC 1766, "Tags for the
Identification of Languages"

=back

Imagine that your task for the day is to localize a piece of software
-- and luckily for you, the only output the program emits is two
messages, like this:

  I scanned 12 directories.

  Your query matched 10 files in 4 directories.

So how hard could that be? You look at the code that
produces the first item, and it reads:

  printf("I scanned %g directories.",
         $directory_count);

You think about that, and realize that it doesn't even work right for
English, as it can produce this output:

  I scanned 1 directories.

So you rewrite it to read:

  printf("I scanned %g %s.",
         $directory_count,
         $directory_count == 1 ?
           "directory" : "directories",
  );

...which does the Right Thing. (In case you don't recall, "%g" is for
locale-specific number interpolation, and "%s" is for string
interpolation.)

But you still have to localize it for all the languages you're
producing this software for, so you pull Locale::gettext off of CPAN
so you can access the C<gettext> C functions you've heard are standard
for localization tasks.

And you write:

  printf(gettext("I scanned %g %s."),
         $dir_scan_count,
         $dir_scan_count == 1 ?
           gettext("directory") : gettext("directories"),
  );

But you then read in the gettext manual (Drepper, Miller, and Pinard 1995)
that this is not a good idea, since how a single word like "directory"
or "directories" is translated may depend on context -- and this is
true, since in a case language like German or Russian, you may need
these words with a different case ending in the first instance (where the
word is the object of a verb) than in the second instance, which you haven't even
gotten to yet (where the word is the object of a preposition, "in %g
directories") -- assuming these keep the same syntax when translated
into those languages.

So, on the advice of the gettext manual, you rewrite:

  printf( $dir_scan_count == 1 ?
            gettext("I scanned %g directory.") :
            gettext("I scanned %g directories."),
          $dir_scan_count );

So, you email your various translators (the boss decides that the
languages du jour are Chinese, Arabic, Russian, and Italian, so you
have one translator for each), asking for translations for "I scanned
%g directory." and "I scanned %g directories.". When they reply,
you'll put that in the lexicons for gettext to use when it localizes
your software, so that when the user is running under the "zh"
(Chinese) locale, gettext("I scanned %g directory.") will return the
appropriate Chinese text, with a "%g" in there where printf can then
interpolate $dir_scan_count.

Your Chinese translator emails right back -- he says both of these
phrases translate to the same thing in Chinese, because, in linguistic
jargon, Chinese "doesn't have number as a grammatical category" --
whereas English does. That is, English has grammatical rules that
refer to "number", i.e., whether something is grammatically singular
or plural; and one of these rules is the one that forces nouns to take
a plural suffix (generally "s") when in a plural context, as they are when
they follow a number other than "one" (including, oddly enough, "zero").
Chinese has no such rules, and so has just the one phrase where English
has two. But, no problem, you can have this one Chinese phrase appear
as the translation for the two English phrases in the "zh" gettext
lexicon for your program.

Emboldened by this, you dive into the second phrase that your software
needs to output: "Your query matched 10 files in 4 directories.". You notice
that if you want to treat phrases as indivisible, as the gettext
manual wisely advises, you need four cases now, instead of two, to
cover the permutations of singular and plural on the two items,
$directory_count and $file_count. So you try this:

  printf( $file_count == 1 ?
    ( $directory_count == 1 ?
      gettext("Your query matched %g file in %g directory.") :
      gettext("Your query matched %g file in %g directories.") ) :
    ( $directory_count == 1 ?
      gettext("Your query matched %g files in %g directory.") :
      gettext("Your query matched %g files in %g directories.") ),
    $file_count, $directory_count,
  );

(The case of "1 file in 2 [or more] directories" could, I suppose,
occur in the case of symlinking or something of the sort.)

It occurs to you that this is not the prettiest code you've ever
written, but this seems the way to go. You mail off to the
translators asking for translations for these four cases. The
Chinese guy replies with the one phrase that these all translate to in
Chinese, and that phrase has two "%g"s in it, as it should -- but
there's a problem. He translates it word-for-word back: "In %g
directories contains %g files match your query." The %g
slots are in an order reverse to what they are in English. You wonder
how you'll get gettext to handle that.

But you put it aside for the moment, and optimistically hope that the
other translators won't have this problem, and that their languages
will be better behaved -- i.e., that they will be just like English.

But the Arabic translator is the next to write back. First off, your
code for "I scanned %g directory." or "I scanned %g directories."
assumes there's only singular or plural. But, to use linguistic
jargon again, Arabic has grammatical number, like English (but unlike
Chinese), but it's a three-term category: singular, dual, and plural.
In other words, the way you say "directory" depends on whether there's
one directory, or I<two> of them, or I<more than two> of them. Your
test of C<($directory_count == 1)> no longer does the job. And it means
that where English's grammatical category of number necessitates
only the two permutations of the first sentence based on "directory
[singular]" and "directories [plural]", Arabic has three -- and,
worse, in the second sentence ("Your query matched %g file in %g
directory."), where English has four, Arabic has nine. You sense
an unwelcome, exponential trend taking shape.

Your Italian translator emails you back and says that "I scanned 0
directories" (a possible English output of your program) is stilted,
and if you think that's fine English, that's your problem, but that
I<just will not do> in the language of Dante. He insists that where
$directory_count is 0, your program should produce the Italian text
for "I I<didn't> scan I<any> directories.". And ditto for "I didn't
match any files in any directories", although he says the last part
about "in any directories" should probably just be left off.

You wonder how you'll get gettext to handle this; to accommodate the
ways Arabic, Chinese, and Italian deal with numbers in just these few
very simple phrases, you need to write code that will ask gettext for
different queries depending on whether the numerical values in
question are 1, 2, more than 2, or in some cases 0, and you still haven't
figured out the problem with the different word order in Chinese.

Then your Russian translator calls on the phone, to I<personally> tell
you the bad news about how really unpleasant your life is about to
become:

Russian, like German or Latin, is an inflectional language; that is, nouns
and adjectives have to take endings that depend on their case
(i.e., nominative, accusative, genitive, etc...) -- which is roughly a matter of
what role they have in the syntax of the sentence --
as well as on the grammatical gender (i.e., masculine, feminine, neuter)
and number (i.e., singular or plural) of the noun, as well as on the
declension class of the noun. But unlike with most other inflected languages,
putting a number-phrase (like "ten" or "forty-three", or their Arabic
numeral equivalents) in front of a noun in Russian can change the case and
number that noun is, and therefore the endings you have to put on it.

He elaborates: In "I scanned %g directories", you'd I<expect>
"directories" to be in the accusative case (since it is the direct
object in the sentence) and the plural number,
except where $directory_count is 1, then you'd expect the singular, of
course. Just like Latin or German. I<But!> Where $directory_count %
10 is 1 ("%" for modulo, remember), assuming $directory_count is an
integer, and except where $directory_count % 100 is 11, "directories"
is forced to become grammatically singular, which means it gets the
ending for the accusative singular... You begin to visualize the code
it'd take to test for the problem so far, I<and still work for Chinese
and Arabic and Italian>, and how many gettext items that'd take, but
he keeps going... But where $directory_count % 10 is 2, 3, or 4
(except where $directory_count % 100 is 12, 13, or 14), the word for
"directories" is forced to be genitive singular -- which means another
ending... The room begins to spin around you, slowly at first... But
with I<all other> integer values, since "directory" is an inanimate
noun, when preceded by a number and in the nominative or accusative
cases (as it is here, just your luck!), it does stay plural, but it is
forced into the genitive case -- yet another ending... And
you never hear him get to the part about how you're going to run into
similar (but maybe subtly different) problems with other Slavic
languages like Polish, because the floor comes up to meet you, and you
fade into unconsciousness.

The above cautionary tale relates how an attempt at localization can
lead from programmer consternation, to program obfuscation, to a need
for sedation. But careful evaluation shows that your choice of tools
merely needed further consideration.

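For readers who would rather see the Russian translator's rules than
hear them over the phone, here is a minimal sketch of them as one Perl
function. The function name and the category labels are illustrative
assumptions, the count is assumed to be a non-negative integer, and only
the case the story describes (an inanimate noun in the nominative or
accusative) is covered.

```perl
use strict;
use warnings;

# Map a count to the noun form the story describes. The returned labels
# are illustrative names, not part of any library.
sub russian_noun_form {
    my ($n) = @_;
    my $mod10  = $n % 10;
    my $mod100 = $n % 100;

    # 1, 21, 31, ... but not 11
    return 'accusative singular'
        if $mod10 == 1 && $mod100 != 11;

    # 2-4, 22-24, ... but not 12-14
    return 'genitive singular'
        if $mod10 >= 2 && $mod10 <= 4
        && !($mod100 >= 12 && $mod100 <= 14);

    # Everything else: 0, 5-20, 25-30, ...
    return 'genitive plural';
}

printf "%3d => %s\n", $_, russian_noun_form($_)
    for 1, 3, 5, 11, 12, 21, 22, 101, 112;
```

The point is less the arithmetic than where it lives: a test like this
belongs in one place per language, not in every call site that prints a
number.
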
=head2 The Linguistic View

=over

"It is more complicated than you think."

-- The Eighth Networking Truth, from RFC 1925

=back

The field of Linguistics has expended a great deal of effort over the
past century trying to find grammatical patterns which hold across
languages; it's been a constant process
of people making generalizations that should apply to all languages,
only to find out that, all too often, these generalizations fail --
sometimes failing for just a few languages, sometimes whole classes of
languages, and sometimes nearly every language in the world except
English. Broad statistical trends are evident in what the "average
language" is like as far as what its rules can look like, must look
like, and cannot look like. But the "average language" is just as
unreal a concept as the "average person" -- it runs up against the
fact no language (or person) is, in fact, average. The wisdom of past
experience leads us to believe that any given language can do whatever
it wants, in any order, with appeal to any kind of grammatical
categories it wants -- case, number, tense, real or metaphoric
characteristics of the things that words refer to, arbitrary or
predictable classifications of words based on what endings or prefixes
they can take, degree or means of certainty about the truth of
statements expressed, and so on, ad infinitum.

Mercifully, most localization tasks are a matter of finding ways to
translate whole phrases, generally sentences, where the context is
relatively set, and where the only variation in content is I<usually>
in a number being expressed -- as in the example sentences above.
Translating specific, fully-formed sentences is, in practice, fairly
foolproof -- which is good, because that's what's in the phrasebooks
that so many tourists rely on. Now, a given phrase (whether in a
phrasebook or in a gettext lexicon) in one language I<might> have a
greater or lesser applicability than that phrase's translation into
another language -- for example, strictly speaking, in Arabic, the
"your" in "Your query matched..." would take a different form
depending on whether the user is male or female; so the Arabic
translation "your[feminine] query" is applicable in fewer cases than
the corresponding English phrase, which doesn't distinguish the user's
gender. (In practice, it's not feasible to have a program know the
user's gender, so the masculine "you" in Arabic is usually used, by
default.)

But in general, such surprises are rare when entire sentences are
being translated, especially when the functional context is restricted
to that of a computer interacting with a user either to convey a fact
or to prompt for a piece of information. So, for purposes of
localization, translation by phrase (generally by sentence) is both the
simplest and the least problematic.

=head2 Breaking gettext

=over

"It Has To Work."

-- First Networking Truth, RFC 1925

=back

Consider that sentences in a tourist phrasebook are of two types: ones
like "How do I get to the marketplace?" that don't have any blanks to
fill in, and ones like "How much do these ___ cost?", where there's
one or more blanks to fill in (and these are usually linked to a
list of words that you can put in that blank: "fish", "potatoes",
"tomatoes", etc.) The ones with no blanks are no problem, but the
fill-in-the-blank ones may not be really straightforward. If it's a
Swahili phrasebook, for example, the authors probably didn't bother to
tell you the complicated ways that the verb "cost" changes its
inflectional prefix depending on the noun you're putting in the blank.
The trader in the marketplace will still understand what you're saying if
you say "how much do these potatoes cost?" with the wrong
inflectional prefix on "cost". After all, I<you> can't speak proper Swahili,
I<you're> just a tourist. But while tourists can be stupid, computers
are supposed to be smart; the computer should be able to fill in the
blank, and still have the results be grammatical.

In other words, a phrasebook entry takes some values as parameters
(the things that you fill in the blank or blanks), and provides a value
based on these parameters, where the way you get that final value from
the given values can, properly speaking, involve an arbitrarily
complex series of operations. (In the case of Chinese, it'd be not at
all complex, at least in cases like the examples at the beginning of
this article; whereas in the case of Russian it'd be a rather complex
series of operations. And in some languages, the
complexity could be spread around differently: while the act of
putting a number-expression in front of a noun phrase might not be
complex by itself, it may change how you have to, for example, inflect
a verb elsewhere in the sentence. This is what in syntax is called
"long-distance dependencies".)

This talk of parameters and arbitrary complexity is just another way
to say that an entry in a phrasebook is what in a programming language
would be called a "function". Just so you don't miss it, this is the
crux of this article: I<A phrase is a function; a phrasebook is a
bunch of functions.>

The reason that using gettext runs into walls (as in the above
second-person horror story) is that you're trying to use a string (or
worse, a choice among a bunch of strings) to do what you really need a
function for -- which is futile. Performing (s)printf interpolation
on the strings which you get back from gettext does allow you to do I<some>
common things passably well... sometimes... sort of; but, to paraphrase
what some people say about C<csh> script programming, "it fools you
into thinking you can use it for real things, but you can't, and you
don't discover this until you've already spent too much time trying,
and by then it's too late."

=head2 Replacing gettext

So, what needs to replace gettext is a system that supports lexicons
of functions instead of lexicons of strings. An entry in a lexicon
from such a system should I<not> look like this:

  "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"

[\xE9 is e-acute in Latin-1. Some pod renderers would
scream if I used the actual character here. -- SB]

but instead like this, bearing in mind that this is just a first stab:

  sub I_found_X1_files_in_X2_directories {
    my( $files, $dirs ) = @_[0,1];
    $files = sprintf("%g %s", $files,
      $files == 1 ? 'fichier' : 'fichiers');
    $dirs = sprintf("%g %s", $dirs,
      $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
    return "J'ai trouv\xE9 $files dans $dirs.";
  }

Now, there's no particularly obvious way to store anything but strings
in a gettext lexicon; so it looks like we just have to start over and
make something better, from scratch. I call my shot at a
gettext-replacement system "Maketext", or, in CPAN terms,
Locale::Maketext.

When designing Maketext, I chose to plan its main features in terms of
"buzzword compliance". And here are the buzzwords:

=head2 Buzzwords: Abstraction and Encapsulation

The complexity of the language you're trying to output a phrase in is
entirely abstracted inside (and encapsulated within) the Maketext module
for that interface. When you call:

  print $lang->maketext("You have [quant,_1,piece] of new mail.",
                       scalar(@messages));

you don't know (and in fact can't easily find out) whether this will
involve lots of figuring, as in Russian (if $lang is a handle to the
Russian module), or relatively little, as in Chinese. That kind of
abstraction and encapsulation may encourage other pleasant buzzwords
like modularization and stratification, depending on what design
decisions you make.

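To make that call concrete, here is a minimal, self-contained sketch
using the Locale::Maketext module as distributed with Perl. The
C<MyApp::L10N> class names are hypothetical, and a real project would
normally put each language class in its own file rather than in one
script.

```perl
use strict;
use warnings;

# Hypothetical project base class for all language classes.
package MyApp::L10N;
use base 'Locale::Maketext';

# English language class. _AUTO => 1 lets any key that is missing from
# the lexicon act as its own bracket-notation entry.
package MyApp::L10N::en;
our @ISA     = ('MyApp::L10N');
our %Lexicon = ( _AUTO => 1 );

package main;

# Ask for an English handle; the caller never sees how quant works.
my $lang = MyApp::L10N->get_handle('en')
    or die "No language handle available";

my @messages = ('first', 'second', 'third');
print $lang->maketext("You have [quant,_1,piece] of new mail.",
                      scalar(@messages)), "\n";
```

Swapping in a different language class changes what C<quant> does
without touching the calling code, which is the encapsulation being
described.
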
=head2 Buzzword: Isomorphism

"Isomorphism" means "having the same structure or form"; in discussions
of program design, the word takes on the special, specific meaning that
your implementation of a solution to a problem I<has the same
structure> as, say, an informal verbal description of the solution, or
maybe of the problem itself. Isomorphism is, all things considered,
a good thing -- it's what problem-solving (and solution-implementing)
should look like.

What's wrong with gettext-using code like this...

  printf( $file_count == 1 ?
    ( $directory_count == 1 ?
      "Your query matched %g file in %g directory." :
      "Your query matched %g file in %g directories." ) :
    ( $directory_count == 1 ?
      "Your query matched %g files in %g directory." :
      "Your query matched %g files in %g directories." ),
    $file_count, $directory_count,
  );

is first off that it's not well abstracted -- these ways of testing
for grammatical number (as in the expressions like C<foo == 1 ?
singular_form : plural_form>) should be abstracted to each language
module, since how you get grammatical number is language-specific.

But second off, it's not isomorphic -- the "solution" (i.e., the
phrasebook entries) for Chinese maps from these four English phrases to
the one Chinese phrase that fits for all of them. In other words, the
informal solution would be "The way to say what you want in Chinese is
with the one phrase 'For your question, in Y directories you would
find X files'" -- and so the implemented solution should be,
isomorphically, just a straightforward way to spit out that one
phrase, with numerals properly interpolated. It shouldn't have to map
from the complexity of other languages to the simplicity of this one.

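A small sketch may make the isomorphism point easier to see. It keeps
one phrase-function per language, each taking the same two counts; the
hash name is an illustrative assumption, and the "zh" entry uses the
English gloss from the paragraph above in place of real Chinese text.

```perl
use strict;
use warnings;

# One phrase-function per language, all taking (files, directories).
my %matched_files_in_dirs = (
    en => sub {
        my ($files, $dirs) = @_;
        # English needs a number test on each noun.
        return sprintf "Your query matched %d %s in %d %s.",
            $files, ($files == 1 ? 'file'      : 'files'),
            $dirs,  ($dirs  == 1 ? 'directory' : 'directories');
    },
    zh => sub {
        my ($files, $dirs) = @_;
        # One phrase for every count, with the directory count first.
        return "For your question, in $dirs directories "
             . "you would find $files files.";
    },
);

for my $language (qw(en zh)) {
    print $matched_files_in_dirs{$language}->(1, 4), "\n";
}
```

Each entry is exactly as complicated as its own language requires: the
"zh" function has no number tests at all, and it puts its arguments in
whatever order the phrase wants.
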
=head2 Buzzword: Inheritance

There's a great deal of reuse possible for sharing of phrases between
modules for related dialects, or for sharing of auxiliary functions
between related languages. (By "auxiliary functions", I mean
functions that don't produce phrase-text, but which, say, return an
answer to "does this number require a plural noun after it?". Such
auxiliary functions would be used in the internal logic of functions
that actually do produce phrase-text.)

In the case of sharing phrases, consider that you have an interface
already localized for American English (probably by having been
written with that as the native locale, but that's incidental).
Localizing it for UK English should, in practical terms, be just a
matter of running it past a British person with the instructions to
indicate what few phrases would benefit from a change in spelling or
possibly minor rewording. In that case, you should be able to put in
the UK English localization module I<only> those phrases that are
UK-specific, and for all the rest, I<inherit> from the American
English module. (And I expect this same situation would apply with
Brazilian and Continental Portuguese, possibly with some I<very>
closely related languages like Czech and Slovak, and possibly with the
slightly different "versions" of written Mandarin Chinese, as I hear exist in
Taiwan and mainland China.)

As to sharing of auxiliary functions, consider the problem of Russian
numbers from the beginning of this article; obviously, you'd want to
write only once the hairy code that, given a numeric value, would
return some specification of which case and number a given quantified
noun should use. But suppose that you discover, while localizing an
interface for, say, Ukrainian (a Slavic language related to Russian,
spoken by tens of millions of people, many of whom would be relieved to
find that your Web site's or software's interface is available in
their language), that the rules in Ukrainian are the same as in Russian
for quantification, and probably for many other grammatical functions.
While there may well be no phrases in common between Russian and
Ukrainian, you could still choose to have the Ukrainian module inherit
from the Russian module, just for the sake of inheriting all the
various grammatical methods. Or, probably better organizationally,
you could move those functions to a module called C<_E_Slavic> or
something, which Russian and Ukrainian could inherit useful functions
from, but which would (presumably) provide no lexicon.

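The C<_E_Slavic> arrangement could be laid out roughly as follows. This
is a sketch under assumptions: the C<My::L10N> namespace, the method
names, and the bare-bones constructor are all hypothetical, and the
shared rule is simply the Russian one from the beginning of the article.

```perl
use strict;
use warnings;

# Hypothetical shared parent: grammar helpers only, no lexicon.
package My::L10N::_E_Slavic;

sub new { my ($class) = @_; return bless {}, $class }

# Auxiliary function: which noun form does this count call for?
sub noun_form_for {
    my ($self, $n) = @_;
    my ($mod10, $mod100) = ($n % 10, $n % 100);
    return 'accusative singular' if $mod10 == 1 && $mod100 != 11;
    return 'genitive singular'   if $mod10 >= 2 && $mod10 <= 4
                                 && ($mod100 < 12 || $mod100 > 14);
    return 'genitive plural';
}

# Each language class inherits the helper and adds only its own parts.
package My::L10N::ru;
our @ISA = ('My::L10N::_E_Slavic');
sub language_name { return 'Russian' }

package My::L10N::uk;
our @ISA = ('My::L10N::_E_Slavic');
sub language_name { return 'Ukrainian' }

package main;

for my $class ('My::L10N::ru', 'My::L10N::uk') {
    my $lang = $class->new;
    printf "%s, 22 items: %s\n",
        $lang->language_name, $lang->noun_form_for(22);
}
```

If it later turns out that one language's rule differs after all, that
class can override C<noun_form_for> without disturbing its sibling.
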
=head2 Buzzword: Concision

Okay, concision isn't a buzzword. But it should be, so I decree that
as a new buzzword, "concision" means that simple common things should
be expressible in very few lines (or maybe even just a few characters)
of code -- call it a special case of "making simple things easy and
hard things possible", and see also the role it played in the
MIDI::Simple language, discussed elsewhere in this issue [TPJ#13].

Consider our first stab at an entry in our "phrasebook of functions":

  sub I_found_X1_files_in_X2_directories {
    my( $files, $dirs ) = @_[0,1];
    $files = sprintf("%g %s", $files,
      $files == 1 ? 'fichier' : 'fichiers');
    $dirs = sprintf("%g %s", $dirs,
      $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
    return "J'ai trouv\xE9 $files dans $dirs.";
  }

You may sense that a lexicon (to use a non-committal catch-all term for a
collection of things you know how to say, regardless of whether they're
phrases or words) consisting of functions I<expressed> as above would
make for rather long-winded and repetitive code -- even if you wisely
rewrote this to have quantification (as we call adding a number
expression to a noun phrase) be a function called like:

  sub I_found_X1_files_in_X2_directories {
    my( $files, $dirs ) = @_[0,1];
    $files = quant($files, "fichier");
    $dirs = quant($dirs, "r\xE9pertoire");
    return "J'ai trouv\xE9 $files dans $dirs.";
  }

And you may also sense that you do not want to bother your translators
with having to write Perl code -- you'd much rather that they spend
their I<very costly time> on just translation. And this is to say
nothing of the near impossibility of finding a commercial translator
who would know even simple Perl.

In a first-hack implementation of Maketext, each language-module's
lexicon looked like this:

  %Lexicon = (
    "I found %g files in %g directories"
    => sub {
      my( $files, $dirs ) = @_[0,1];
      $files = quant($files, "fichier");
      $dirs = quant($dirs, "r\xE9pertoire");
      return "J'ai trouv\xE9 $files dans $dirs.";
    },
    ... and so on with other phrase => sub mappings ...
  );

but I immediately went looking for some more concise way to basically
denote the same phrase-function -- a way that would also serve to
concisely denote I<most> phrase-functions in the lexicon for I<most>
languages. After much time and even some actual thought, I decided on
this system:

* Where a value in a %Lexicon hash is a contentful string instead of
an anonymous sub (or, conceivably, a coderef), it would be interpreted
as a sort of shorthand expression of what the sub does. When accessed
for the first time in a session, it is parsed, turned into Perl code,
and then eval'd into an anonymous sub; then that sub replaces the
original string in that lexicon. (That way, the work of parsing and
evaling the shorthand form for a given phrase is done no more than
once per session.)

* Calls to C<maketext> (as Maketext's main function is called) happen
thru a "language session handle", notionally very much like an IO
handle, in that you open one at the start of the session, and use it
for "sending signals" to an object in order to have it return the text
you want.

So, this:

  $lang->maketext("You have [quant,_1,piece] of new mail.",
                 scalar(@messages));

basically means this: look in the lexicon for $lang (which may inherit
from any number of other lexicons), and find the function that we
happen to associate with the string "You have [quant,_1,piece] of new
mail" (which is, and should be, a functioning "shorthand" for this
function in the native locale -- English in this case). If you find
such a function, call it with $lang as its first parameter (as if it
were a method), and then a copy of scalar(@messages) as its second,
and then return that value. If that function was found, but was in
string shorthand instead of being a fully specified function, parse it
and make it into a function before calling it the first time.

* The shorthand uses code in brackets to indicate method calls that
should be performed. A full explanation is not in order here, but a
few examples will suffice:

  "You have [quant,_1,piece] of new mail."

The above code is shorthand for, and will be interpreted as,
this:

  sub {
    my $handle = $_[0];
    my(@params) = @_;    # $params[0] is the handle itself
    return join '',
      "You have ",
      $handle->quant($params[1], 'piece'),
      " of new mail.";
  }

where "quant" is the name of a method you're using to quantify the
noun "piece" with the number $params[1].

A string with no brackety calls, like this:

  "Your search expression was malformed."

is somewhat of a degenerate case, and just gets turned into:

  sub { return "Your search expression was malformed." }

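The compile-once behavior described above can be sketched in a few
dozen lines. This is a toy under stated assumptions, not Maketext's
implementation: it builds a closure directly rather than generating
Perl source and eval'ing it, it understands only C<[method,_N,...]> and
bare C<[_N]>, and the C<Tiny::Handle> class, C<compile_shorthand>, and
the free-standing C<maketext> function are hypothetical names.

```perl
use strict;
use warnings;

package Tiny::Handle;   # hypothetical stand-in for a language handle

sub new { return bless {}, shift }

# Naive English quantifier: add "s" unless the count is 1.
sub quant {
    my ($self, $n, $noun) = @_;
    return $n == 1 ? "$n $noun" : "$n ${noun}s";
}

package main;

my %Lexicon = (
    "You have [quant,_1,piece] of new mail." =>
        "You have [quant,_1,piece] of new mail.",
);
my $compilations = 0;

# Turn a shorthand string into an anonymous sub.
sub compile_shorthand {
    my ($text) = @_;
    # With a capturing group, split leaves literal text at even indexes
    # and bracket contents at odd indexes.
    my @chunks = split /\[([^\]]*)\]/, $text;
    return sub {
        my ($handle, @params) = @_;
        my $out = '';
        for my $i (0 .. $#chunks) {
            if ($i % 2 == 0) { $out .= $chunks[$i]; next }
            # _1, _2, ... stand for the caller's parameters.
            my @parts = map { /^_(\d+)$/ ? $params[$1 - 1] : $_ }
                        split /,/, $chunks[$i];
            if (@parts == 1) { $out .= $parts[0]; next }   # bare [_1]
            my ($method, @args) = @parts;
            $out .= $handle->$method(@args);
        }
        return $out;
    };
}

sub maketext {
    my ($handle, $key, @params) = @_;
    my $entry = $Lexicon{$key};
    die "No lexicon entry for '$key'" unless defined $entry;
    unless (ref($entry) eq 'CODE') {
        # First use: compile, then replace the string in the lexicon.
        $entry = $Lexicon{$key} = compile_shorthand($entry);
        $compilations++;
    }
    return $entry->($handle, @params);
}

my $handle = Tiny::Handle->new;
print maketext($handle, "You have [quant,_1,piece] of new mail.", $_), "\n"
    for 1, 3;
print "compiled $compilations time(s)\n";
```

After the first call the lexicon holds a coderef instead of a string,
so the second call skips parsing entirely.
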
However, not everything you can write in Perl code can be written in
the above shorthand system -- not by a long shot. For example, consider
the Italian translator from the beginning of this article, who wanted
the Italian for "I didn't find any files" as a special case, instead
of "I found 0 files". That couldn't be specified (at least not easily
or simply) in our shorthand system, and it would have to be written
out in full, like this:

  sub { # pretend the English strings are in Italian
    my($handle, $files, $dirs) = @_[0,1,2];
    return "I didn't find any files" unless $files;
    return join '',
      "I found ",
      $handle->quant($files, 'file'),
      " in ",
      $handle->quant($dirs, 'directory'),
      ".";
  }

Next to a lexicon full of shorthand code, that sort of sticks out like a
sore thumb -- but this I<is> a special case, after all; and at least
it's possible, if not as concise as usual.

As to how you'd implement the Russian example from the beginning of
the article, well, There's More Than One Way To Do It, but it could be
something like this (using English words for Russian, just so you know
what's going on):

  "I [quant,_1,directory,accusative] scanned."

This shifts the burden of complexity off to the quant method. That
method's parameters are: the numeric value it's going to use to
quantify something; the Russian word it's going to quantify; and the
parameter "accusative", which you're using to mean that this
sentence's syntax wants a noun in the accusative case there, although
that quantification method may have to overrule, for grammatical
reasons you may recall from the beginning of this article.

Now, the Russian quant method here is responsible not only for
implementing the strange logic necessary for figuring out how Russian
number-phrases impose case and number on their noun-phrases, but also
for inflecting the Russian word for "directory". How that inflection
is to be carried out is no small issue, and among the solutions I've
seen, some (like variations on a simple lookup in a hash where all
possible forms are provided for all necessary words) are
straightforward but I<can> become cumbersome when you need to inflect
more than a few dozen words; and other solutions (like using
algorithms to model the inflections, storing only root forms and
irregularities) I<can> involve more overhead than is justifiable for
all but the largest lexicons.

Mercifully, this design decision becomes crucial only in the hairiest
of inflected languages, of which Russian is by no means the I<worst> case
scenario, but is worse than most. Most languages have simpler
inflection systems; for example, in English or Swahili, there are
generally no more than two possible inflected forms for a given noun
("error/errors"; "kosa/makosa"), and the
rules for producing these forms are fairly simple -- or at least,
simple rules can be formulated that work for most words, and you can
then treat the exceptions as just "irregular", at least relative to
your ad hoc rules. A simpler inflection system (simpler rules, fewer
forms) means that design decisions are less crucial to maintaining
sanity, whereas the same decisions could incur
overhead-versus-scalability problems in languages like Russian. It
may I<also> be likely that code (possibly in Perl, as with
Lingua::EN::Inflect, for English nouns) has already
been written for the language in question, whether simple or complex.

Moreover, a third possibility may even be simpler than anything
discussed above: "Just require that all possible (or at least
applicable) forms be provided in the call to the given language's quant
method, as in:"

 "I found [quant,_1,file,files]."

That way, quant just has to choose which form it needs, without having
to look up or generate anything. While possibly not optimal for
Russian, this should work well for most other languages, where
quantification is not as complicated an operation.

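As a rough sketch of that third possibility in practice, the fragment
below feeds the phrase above to Locale::Maketext's stock quant method.
The class names C<BogoQuery::L10N> and C<BogoQuery::L10N::en> are
hypothetical, and the packages are declared inline only to keep the
example in one file; C<_AUTO> is switched on so that the phrase can
serve as its own lexicon entry.

```perl
use strict;
use warnings;
use Locale::Maketext;

{
    # Hypothetical project base class (the name is illustrative only).
    package BogoQuery::L10N;
    our @ISA = ('Locale::Maketext');
}

{
    # English lexicon; _AUTO => 1 lets any phrase act as its own entry.
    package BogoQuery::L10N::en;
    our @ISA     = ('BogoQuery::L10N');
    our %Lexicon = (_AUTO => 1);
}

my $lh = BogoQuery::L10N::en->new;

# quant picks "file" or "files" based on the number passed as _1.
my @messages = map { $lh->maketext("I found [quant,_1,file,files].", $_) }
               (1, 3);
print "$_\n" for @messages;
```

Nothing is looked up or generated here: both forms travel with the
phrase, and quant only has to pick one.
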
=head2 The Devil in the Details

There's plenty more to Maketext than described above -- for example,
there's the details of how language tags ("en-US", "i-pwn", "fi",
etc.) or locale IDs ("en_US") interact with actual module naming
("BogoQuery/Locale/en_us.pm"), and what magic can ensue; there's the
details of how to record (and possibly negotiate) what character
encoding Maketext will return text in (UTF8? Latin-1? KOI8?). There's
the interesting fact that Maketext is for localization, but nowhere
actually has a "C<use locale;>" anywhere in it. For the curious,
there's the somewhat frightening details of how I actually
implement something like data inheritance so that searches across
modules' %Lexicon hashes can parallel how Perl implements method
inheritance.

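A small sketch may make two of those details less mysterious: how a
language tag ends up selecting a class, and how a lexicon lookup falls
through to a parent class. The class names and lexicon entries below
are hypothetical, and the packages are declared inline (rather than in
separate .pm files) purely to keep the example self-contained.

```perl
use strict;
use warnings;
use Locale::Maketext;

{
    # Hypothetical project base class (the name is illustrative only).
    package BogoQuery::L10N;
    our @ISA = ('Locale::Maketext');
}

{
    # General English lexicon.
    package BogoQuery::L10N::en;
    our @ISA     = ('BogoQuery::L10N');
    our %Lexicon = (
        'Search' => 'Search',
        'Color'  => 'Colour',
    );
}

{
    # US English: inherits from en and overrides a single entry.
    package BogoQuery::L10N::en_us;
    our @ISA     = ('BogoQuery::L10N::en');
    our %Lexicon = (
        'Color' => 'Color',
    );
}

# The tag "en-US" is matched against the class name suffix "en_us".
my $lh = BogoQuery::L10N->get_handle('en-US')
    or die "No language handle available\n";

my $class  = ref $lh;
my $search = $lh->maketext('Search');   # not in en_us, so found in en
my $color  = $lh->maketext('Color');    # found directly in en_us
print "$class: $search, $color\n";
```

The en_us lexicon holds only what differs from en; everything else is
found by walking the same @ISA chain that Perl uses for methods.
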
And, most importantly, there's all the practical details of how to
actually go about deriving from Maketext so you can use it for your
interfaces, and the various tools and conventions for starting out and
maintaining individual language modules.

That is all covered in the documentation for Locale::Maketext and the
modules that come with it, available in CPAN. After having read this
article, which covers the why's of Maketext, the documentation,
which covers the how's of it, should be quite straightforward.

=head2 The Proof in the Pudding: Localizing Web Sites

Maketext and gettext have a notable difference: gettext is in C,
accessible thru C library calls, whereas Maketext is in Perl, and
really can't work without a Perl interpreter (although I suppose
something like it could be written for C). Accidents of history (and
not necessarily lucky ones) have made C++ the most common language for
the implementation of applications like word processors, Web browsers,
and even many in-house applications like custom query systems. Current
conditions make it somewhat unlikely that the next one of any of these
kinds of applications will be written in Perl, albeit clearly more for
reasons of custom and inertia than out of consideration of what is the
right tool for the job.

However, other accidents of history have made Perl a well-accepted
language for design of server-side programs (generally in CGI form)
for Web site interfaces. Localization of static pages in Web sites is
trivial, feasible either with simple language-negotiation features in
servers like Apache, or with some kind of server-side inclusions of
language-appropriate text into layout templates. However, I think
that the localization of Perl-based search systems (or other kinds of
dynamic content) in Web sites, be they public or access-restricted,
is where Maketext will see the greatest use.

I presume that it would be only the exceptional Web site that gets
localized for English I<and> Chinese I<and> Italian I<and> Arabic
I<and> Russian, to recall the languages from the beginning of this
article -- to say nothing of German, Spanish, French, Japanese,
Finnish, and Hindi, to name a few languages that benefit from large
numbers of programmers or Web viewers or both.

However, the ever-increasing internationalization of the Web (whether
measured in terms of amount of content, of numbers of content writers
or programmers, or of size of content audiences) makes it increasingly
likely that the interface to the average Web-based dynamic content
service will be localized for two or maybe three languages. It is my
hope that Maketext will make that task as simple as possible, and will
remove previous barriers to localization for languages dissimilar to
English.

 __END__

Sean M. Burke (sburkeE<64>cpan.org) has a Master's in linguistics
from Northwestern University; he specializes in language technology.
Jordan Lachler (lachlerE<64>unm.edu) is a PhD student in the Department of
Linguistics at the University of New Mexico; he specializes in
morphology and pedagogy of North American native languages.

=head2 References

Alvestrand, Harald Tveit. 1995. I<RFC 1766: Tags for the
Identification of Languages.>
C<ftp://ftp.isi.edu/in-notes/rfc1766.txt>
[Now see RFC 3066.]

Callon, Ross, editor. 1996. I<RFC 1925: The Twelve
Networking Truths.>
C<ftp://ftp.isi.edu/in-notes/rfc1925.txt>

Drepper, Ulrich, Peter Miller,
and FranE<ccedil>ois Pinard. 1995-2001. GNU
C<gettext>. Available in C<ftp://prep.ai.mit.edu/pub/gnu/>, with
extensive docs in the distribution tarball. [Since
I wrote this article in 1998, I now see that the
gettext docs are now trying more to come to terms with
plurality. Whether useful conclusions have come from it
is another question altogether. -- SMB, May 2001]

Forbes, Nevill. 1964. I<Russian Grammar.> Third Edition, revised
by J. C. Dumbreck. Oxford University Press.

=cut

#End
