git.subgeniuskitty.com - OpenSPARC-T2-SAM/.git/blame_incremental - sam-t2/devtools/amd64/lib/perl5/5.8.8/Locale/Maketext/TPJ13.pod

Commit	Line	Data
	1
	2	# This document contains text in Perl "POD" format.
	3	# Use a POD viewer like perldoc or perlman to render it.
	4
	5	# This corrects some typos in the previous release.
	6
	7	=head1 NAME
	8
	9	Locale::Maketext::TPJ13 -- article about software localization
	10
	11	=head1 SYNOPSIS
	12
	13	# This is an article, not a module.
	14
	15	=head1 DESCRIPTION
	16
	17	The following article by Sean M. Burke and Jordan Lachler
	18	first appeared in I<The Perl
	19	Journal> #13 and is copyright 1999 The Perl Journal. It appears
	20	courtesy of Jon Orwant and The Perl Journal. This document may be
	21	distributed under the same terms as Perl itself.
	22
	23	=head1 Localization and Perl: gettext breaks, Maketext fixes
	24
	25	by Sean M. Burke and Jordan Lachler
	26
	27	This article points out cases where gettext (a common system for
	28	localizing software interfaces -- i.e., making them work in the user's
	29	language of choice) fails because of basic differences between human
	30	languages. This article then describes Maketext, a new system capable
	31	of correctly treating these differences.
	32
	33	=head2 A Localization Horror Story: It Could Happen To You
	34
	35	=over
	36
	37	"There are a number of languages spoken by human beings in this
	38	world."
	39
	40	-- Harald Tveit Alvestrand, in RFC 1766, "Tags for the
	41	Identification of Languages"
	42
	43	=back
	44
	45	Imagine that your task for the day is to localize a piece of software
	46	-- and luckily for you, the only output the program emits is two
	47	messages, like this:
	48
	49	  I scanned 12 directories.
	50
	51	  Your query matched 10 files in 4 directories.
	52
	53	So how hard could that be? You look at the code that
	54	produces the first item, and it reads:
	55
	56	  printf("I scanned %g directories.",
	57	         $directory_count);
	58
	59	You think about that, and realize that it doesn't even work right for
	60	English, as it can produce this output:
	61
	62	  I scanned 1 directories.
	63
	64	So you rewrite it to read:
	65
	66	  printf("I scanned %g %s.",
	67	         $directory_count,
	68	         $directory_count == 1 ?
	69	           "directory" : "directories",
	70	  );
	71
	72	...which does the Right Thing. (In case you don't recall, "%g" is for
	73	locale-specific number interpolation, and "%s" is for string
	74	interpolation.)
	75
	76	But you still have to localize it for all the languages you're
	77	producing this software for, so you pull Locale::gettext off of CPAN
	78	so you can access the C<gettext> C functions you've heard are standard
	79	for localization tasks.
	80
	81	And you write:
	82
	83	  printf(gettext("I scanned %g %s."),
	84	         $dir_scan_count,
	85	         $dir_scan_count == 1 ?
	86	           gettext("directory") : gettext("directories"),
	87	  );
	88
	89	But you then read in the gettext manual (Drepper, Miller, and Pinard 1995)
	90	that this is not a good idea, since how a single word like "directory"
	91	or "directories" is translated may depend on context -- and this is
	92	true, since in a case language like German or Russian, you may need
	93	these words with a different case ending in the first instance (where the
	94	word is the object of a verb) than in the second instance, which you haven't even
	95	gotten to yet (where the word is the object of a preposition, "in %g
	96	directories") -- assuming these keep the same syntax when translated
	97	into those languages.
	98
	99	So, on the advice of the gettext manual, you rewrite:
	100
	101	  printf( $dir_scan_count == 1 ?
	102	           gettext("I scanned %g directory.") :
	103	           gettext("I scanned %g directories."),
	104	         $dir_scan_count );
	105
	106	So, you email your various translators (the boss decides that the
	107	languages du jour are Chinese, Arabic, Russian, and Italian, so you
	108	have one translator for each), asking for translations for "I scanned
	109	%g directory." and "I scanned %g directories.". When they reply,
	110	you'll put that in the lexicons for gettext to use when it localizes
	111	your software, so that when the user is running under the "zh"
	112	(Chinese) locale, gettext("I scanned %g directory.") will return the
	113	appropriate Chinese text, with a "%g" in there where printf can then
	114	interpolate $dir_scan_count.
	115
	116	Your Chinese translator emails right back -- he says both of these
	117	phrases translate to the same thing in Chinese, because, in linguistic
	118	jargon, Chinese "doesn't have number as a grammatical category" --
	119	whereas English does. That is, English has grammatical rules that
	120	refer to "number", i.e., whether something is grammatically singular
	121	or plural; and one of these rules is the one that forces nouns to take
	122	a plural suffix (generally "s") when in a plural context, as they are when
	123	they follow a number other than "one" (including, oddly enough, "zero").
	124	Chinese has no such rules, and so has just the one phrase where English
	125	has two. But, no problem, you can have this one Chinese phrase appear
	126	as the translation for the two English phrases in the "zh" gettext
	127	lexicon for your program.
	128
	129	Emboldened by this, you dive into the second phrase that your software
	130	needs to output: "Your query matched 10 files in 4 directories.". You notice
	131	that if you want to treat phrases as indivisible, as the gettext
	132	manual wisely advises, you need four cases now, instead of two, to
	133	cover the permutations of singular and plural on the two items,
	134	$directory_count and $file_count. So you try this:
	135
	136	  printf( $file_count == 1 ?
	137	    ( $directory_count == 1 ?
	138	     gettext("Your query matched %g file in %g directory.") :
	139	     gettext("Your query matched %g file in %g directories.") ) :
	140	    ( $directory_count == 1 ?
	141	     gettext("Your query matched %g files in %g directory.") :
	142	     gettext("Your query matched %g files in %g directories.") ),
	143	    $file_count, $directory_count,
	144	  );
	145
	146	(The case of "1 file in 2 [or more] directories" could, I suppose,
	147	occur in the case of symlinking or something of the sort.)
	148
	149	It occurs to you that this is not the prettiest code you've ever
	150	written, but this seems the way to go. You mail off to the
	151	translators asking for translations for these four cases. The
	152	Chinese guy replies with the one phrase that these all translate to in
	153	Chinese, and that phrase has two "%g"s in it, as it should -- but
	154	there's a problem. He translates it word-for-word back: "In %g
	155	directories contains %g files match your query." The %g
	156	slots are in an order reverse to what they are in English. You wonder
	157	how you'll get gettext to handle that.
	158
	159	But you put it aside for the moment, and optimistically hope that the
	160	other translators won't have this problem, and that their languages
	161	will be better behaved -- i.e., that they will be just like English.
	162
	163	But the Arabic translator is the next to write back. First off, your
	164	code for "I scanned %g directory." or "I scanned %g directories."
	165	assumes there's only singular or plural. But, to use linguistic
	166	jargon again, Arabic has grammatical number, like English (but unlike
	167	Chinese), but it's a three-term category: singular, dual, and plural.
	168	In other words, the way you say "directory" depends on whether there's
	169	one directory, or I<two> of them, or I<more than two> of them. Your
	170	test of C<($directory_count == 1)> no longer does the job. And it means
	171	that where English's grammatical category of number necessitates
	172	only the two permutations of the first sentence based on "directory
	173	[singular]" and "directories [plural]", Arabic has three -- and,
	174	worse, in the second sentence ("Your query matched %g file in %g
	175	directory."), where English has four, Arabic has nine. You sense
	176	an unwelcome, exponential trend taking shape.
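
To put numbers on that trend: if each number slot in a phrase can take any of a language's grammatical-number forms, the count of phrases you have to ask the translator for is the number of forms raised to the number of slots. Here is a minimal sketch of that arithmetic. The C<phrase_variants> helper is hypothetical, and the form counts are the simplified ones from the story above, not a full description of any of these languages.

```perl
use strict;
use warnings;

# Hypothetical helper: phrases needed = forms ** slots.
sub phrase_variants {
    my ($forms, $slots) = @_;
    return $forms ** $slots;
}

# Number forms per language, as simplified in the story above.
my %number_forms = (Chinese => 1, English => 2, Arabic => 3);

for my $lang (sort keys %number_forms) {
    printf "%-8s one slot: %d, two slots: %d\n",
        $lang,
        phrase_variants($number_forms{$lang}, 1),
        phrase_variants($number_forms{$lang}, 2);
}
```

A third number slot in the sentence would take the Arabic count from 9 to 27, which is why enumerating strings by hand does not scale.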
	177
	178	Your Italian translator emails you back and says that "I scanned 0
	179	directories" (a possible English output of your program) is stilted,
	180	and if you think that's fine English, that's your problem, but that
	181	I<just will not do> in the language of Dante. He insists that where
	182	$directory_count is 0, your program should produce the Italian text
	183	for "I I<didn't> scan I<any> directories.". And ditto for "I didn't
	184	match any files in any directories", although he says the last part
	185	about "in any directories" should probably just be left off.
	186
	187	You wonder how you'll get gettext to handle this; to accommodate the
	188	ways Arabic, Chinese, and Italian deal with numbers in just these few
	189	very simple phrases, you need to write code that will ask gettext for
	190	different queries depending on whether the numerical values in
	191	question are 1, 2, more than 2, or in some cases 0, and you still haven't
	192	figured out the problem with the different word order in Chinese.
	193
	194	Then your Russian translator calls on the phone, to I<personally> tell
	195	you the bad news about how really unpleasant your life is about to
	196	become:
	197
	198	Russian, like German or Latin, is an inflectional language; that is, nouns
	199	and adjectives have to take endings that depend on their case
	200	(e.g., nominative, accusative, genitive) -- which is roughly a matter of
	201	what role they have in the syntax of the sentence --
	202	as well as on the grammatical gender (i.e., masculine, feminine, neuter)
	203	and number (i.e., singular or plural) of the noun, as well as on the
	204	declension class of the noun. But unlike with most other inflected languages,
	205	putting a number-phrase (like "ten" or "forty-three", or their Arabic
	206	numeral equivalents) in front of a noun in Russian can change the case and
	207	number that noun is, and therefore the endings you have to put on it.
	208
	209	He elaborates: In "I scanned %g directories", you'd I<expect>
	210	"directories" to be in the accusative case (since it is the direct
	211	object in the sentence) and the plural number,
	212	except where $directory_count is 1, then you'd expect the singular, of
	213	course. Just like Latin or German. I<But!> Where $directory_count %
	214	10 is 1 ("%" for modulo, remember), assuming $directory_count is an
	215	integer, and except where $directory_count % 100 is 11, "directories"
	216	is forced to become grammatically singular, which means it gets the
	217	ending for the accusative singular... You begin to visualize the code
	218	it'd take to test for the problem so far, I<and still work for Chinese
	219	and Arabic and Italian>, and how many gettext items that'd take, but
	220	he keeps going... But where $directory_count % 10 is 2, 3, or 4
	221	(except where $directory_count % 100 is 12, 13, or 14), the word for
	222	"directories" is forced to be genitive singular -- which means another
	223	ending... The room begins to spin around you, slowly at first... But
	224	with I<all other> integer values, since "directory" is an inanimate
	225	noun, when preceded by a number and in the nominative or accusative
	226	cases (as it is here, just your luck!), it does stay plural, but it is
	227	forced into the genitive case -- yet another ending... And
	228	you never hear him get to the part about how you're going to run into
	229	similar (but maybe subtly different) problems with other Slavic
	230	languages like Polish, because the floor comes up to meet you, and you
	231	fade into unconsciousness.
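
Written down as code, the rule the translator was reciting is short; the trouble is only that no lexicon of plain strings has anywhere to put it. The sketch below just restates his description for a direct-object noun. The function name C<russian_noun_form> and the labels it returns are made up for illustration, and it assumes a non-negative integer.

```perl
use strict;
use warnings;

# Hypothetical helper: which form a direct-object noun takes after
# the number $n, per the translator's description above.
sub russian_noun_form {
    my ($n) = @_;
    my $last_digit = $n % 10;
    my $last_two   = $n % 100;

    # 1, 21, 31, ... but not 11, 111, ...
    return 'accusative singular'
        if $last_digit == 1 && $last_two != 11;

    # 2-4, 22-24, ... but not 12-14, 112-114, ...
    return 'genitive singular'
        if $last_digit >= 2 && $last_digit <= 4
        && !($last_two >= 12 && $last_two <= 14);

    # Everything else: 0, 5-20, 25-30, ...
    return 'genitive plural';
}

for my $n (0, 1, 2, 5, 11, 12, 21, 22, 25, 101, 111) {
    printf "%3d -> %s\n", $n, russian_noun_form($n);
}
```

The point is not that the test is long, but that it is a test: something has to run it each time a number is interpolated.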
	232
	233
	234	The above cautionary tale relates how an attempt at localization can
	235	lead from programmer consternation, to program obfuscation, to a need
	236	for sedation. But careful evaluation shows that your choice of tools
	237	merely needed further consideration.
	238
	239	=head2 The Linguistic View
	240
	241	=over
	242
	243	"It is more complicated than you think."
	244
	245	-- The Eighth Networking Truth, from RFC 1925
	246
	247	=back
	248
	249	The field of Linguistics has expended a great deal of effort over the
	250	past century trying to find grammatical patterns which hold across
	251	languages; it's been a constant process
	252	of people making generalizations that should apply to all languages,
	253	only to find out that, all too often, these generalizations fail --
	254	sometimes failing for just a few languages, sometimes whole classes of
	255	languages, and sometimes nearly every language in the world except
	256	English. Broad statistical trends are evident in what the "average
	257	language" is like as far as what its rules can look like, must look
	258	like, and cannot look like. But the "average language" is just as
	259	unreal a concept as the "average person" -- it runs up against the
	260	fact that no language (or person) is, in fact, average. The wisdom of past
	261	experience leads us to believe that any given language can do whatever
	262	it wants, in any order, with appeal to any kind of grammatical
	263	categories it wants -- case, number, tense, real or metaphoric
	264	characteristics of the things that words refer to, arbitrary or
	265	predictable classifications of words based on what endings or prefixes
	266	they can take, degree or means of certainty about the truth of
	267	statements expressed, and so on, ad infinitum.
	268
	269	Mercifully, most localization tasks are a matter of finding ways to
	270	translate whole phrases, generally sentences, where the context is
	271	relatively set, and where the only variation in content is I<usually>
	272	in a number being expressed -- as in the example sentences above.
	273	Translating specific, fully-formed sentences is, in practice, fairly
	274	foolproof -- which is good, because that's what's in the phrasebooks
	275	that so many tourists rely on. Now, a given phrase (whether in a
	276	phrasebook or in a gettext lexicon) in one language I<might> have a
	277	greater or lesser applicability than that phrase's translation into
	278	another language -- for example, strictly speaking, in Arabic, the
	279	"your" in "Your query matched..." would take a different form
	280	depending on whether the user is male or female; so the Arabic
	281	translation "your[feminine] query" is applicable in fewer cases than
	282	the corresponding English phrase, which doesn't distinguish the user's
	283	gender. (In practice, it's not feasible to have a program know the
	284	user's gender, so the masculine "you" in Arabic is usually used, by
	285	default.)
	286
	287	But in general, such surprises are rare when entire sentences are
	288	being translated, especially when the functional context is restricted
	289	to that of a computer interacting with a user either to convey a fact
	290	or to prompt for a piece of information. So, for purposes of
	291	localization, translation by phrase (generally by sentence) is both the
	292	simplest and the least problematic.
	293
	294	=head2 Breaking gettext
	295
	296	=over
	297
	298	"It Has To Work."
	299
	300	-- First Networking Truth, RFC 1925
	301
	302	=back
	303
	304	Consider that sentences in a tourist phrasebook are of two types: ones
	305	like "How do I get to the marketplace?" that don't have any blanks to
	306	fill in, and ones like "How much do these ___ cost?", where there's
	307	one or more blanks to fill in (and these are usually linked to a
	308	list of words that you can put in that blank: "fish", "potatoes",
	309	"tomatoes", etc.) The ones with no blanks are no problem, but the
	310	fill-in-the-blank ones may not be really straightforward. If it's a
	311	Swahili phrasebook, for example, the authors probably didn't bother to
	312	tell you the complicated ways that the verb "cost" changes its
	313	inflectional prefix depending on the noun you're putting in the blank.
	314	The trader in the marketplace will still understand what you're saying if
	315	you say "how much do these potatoes cost?" with the wrong
	316	inflectional prefix on "cost". After all, I<you> can't speak proper Swahili,
	317	I<you're> just a tourist. But while tourists can be stupid, computers
	318	are supposed to be smart; the computer should be able to fill in the
	319	blank, and still have the results be grammatical.
	320
	321	In other words, a phrasebook entry takes some values as parameters
	322	(the things that you fill in the blank or blanks), and provides a value
	323	based on these parameters, where the way you get that final value from
	324	the given values can, properly speaking, involve an arbitrarily
	325	complex series of operations. (In the case of Chinese, it'd be not at
	326	all complex, at least in cases like the examples at the beginning of
	327	this article; whereas in the case of Russian it'd be a rather complex
	328	series of operations. And in some languages, the
	329	complexity could be spread around differently: while the act of
	330	putting a number-expression in front of a noun phrase might not be
	331	complex by itself, it may change how you have to, for example, inflect
	332	a verb elsewhere in the sentence. This is what in syntax is called
	333	"long-distance dependencies".)
	334
	335	This talk of parameters and arbitrary complexity is just another way
	336	to say that an entry in a phrasebook is what in a programming language
	337	would be called a "function". Just so you don't miss it, this is the
	338	crux of this article: I<A phrase is a function; a phrasebook is a
	339	bunch of functions.>
	340
	341	The reason that using gettext runs into walls (as in the above
	342	second-person horror story) is that you're trying to use a string (or
	343	worse, a choice among a bunch of strings) to do what you really need a
	344	function for -- which is futile. Performing (s)printf interpolation
	345	on the strings which you get back from gettext does allow you to do I<some>
	346	common things passably well... sometimes... sort of; but, to paraphrase
	347	what some people say about C<csh> script programming, "it fools you
	348	into thinking you can use it for real things, but you can't, and you
	349	don't discover this until you've already spent too much time trying,
	350	and by then it's too late."
	351
	352	=head2 Replacing gettext
	353
	354	So, what needs to replace gettext is a system that supports lexicons
	355	of functions instead of lexicons of strings. An entry in a lexicon
	356	from such a system should I<not> look like this:
	357
	358	  "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"
	359
	360	[\xE9 is e-acute in Latin-1. Some pod renderers would
	361	scream if I used the actual character here. -- SB]
	362
	363	but instead like this, bearing in mind that this is just a first stab:
	364
	365	  sub I_found_X1_files_in_X2_directories {
	366	    my( $files, $dirs ) = @_[0,1];
	367	    $files = sprintf("%g %s", $files,
	368	      $files == 1 ? 'fichier' : 'fichiers');
	369	    $dirs = sprintf("%g %s", $dirs,
	370	      $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
	371	    return "J'ai trouv\xE9 $files dans $dirs.";
	372	  }
	373
	374	Now, there's no particularly obvious way to store anything but strings
	375	in a gettext lexicon; so it looks like we just have to start over and
	376	make something better, from scratch. I call my shot at a
	377	gettext-replacement system "Maketext", or, in CPAN terms,
	378	Locale::Maketext.
	379
	380	When designing Maketext, I chose to plan its main features in terms of
	381	"buzzword compliance". And here are the buzzwords:
	382
	383	=head2 Buzzwords: Abstraction and Encapsulation
	384
	385	The complexity of the language you're trying to output a phrase in is
	386	entirely abstracted inside (and encapsulated within) the Maketext module
	387	for that interface. When you call:
	388
	389	  print $lang->maketext("You have [quant,_1,piece] of new mail.",
	390	                        scalar(@messages));
	391
	392	you don't know (and in fact can't easily find out) whether this will
	393	involve lots of figuring, as in Russian (if $lang is a handle to the
	394	Russian module), or relatively little, as in Chinese. That kind of
	395	abstraction and encapsulation may encourage other pleasant buzzwords
	396	like modularization and stratification, depending on what design
	397	decisions you make.
	398
	399	=head2 Buzzword: Isomorphism
	400
	401	"Isomorphism" means "having the same structure or form"; in discussions
	402	of program design, the word takes on the special, specific meaning that
	403	your implementation of a solution to a problem I<has the same
	404	structure> as, say, an informal verbal description of the solution, or
	405	maybe of the problem itself. Isomorphism is, all things considered,
	406	a good thing -- it's what problem-solving (and solution-implementing)
	407	should look like.
	408
	409	What's wrong with gettext-using code like this...
	410
	411	  printf( $file_count == 1 ?
	412	    ( $directory_count == 1 ?
	413	     "Your query matched %g file in %g directory." :
	414	     "Your query matched %g file in %g directories." ) :
	415	    ( $directory_count == 1 ?
	416	     "Your query matched %g files in %g directory." :
	417	     "Your query matched %g files in %g directories." ),
	418	    $file_count, $directory_count,
	419	  );
	420
	421	is first off that it's not well abstracted -- these ways of testing
	422	for grammatical number (as in the expressions like C<foo == 1 ?
	423	singular_form : plural_form>) should be abstracted to each language
	424	module, since how you get grammatical number is language-specific.
	425
	426	But second off, it's not isomorphic -- the "solution" (i.e., the
	427	phrasebook entries) for Chinese maps from these four English phrases to
	428	the one Chinese phrase that fits for all of them. In other words, the
	429	informal solution would be "The way to say what you want in Chinese is
	430	with the one phrase 'For your question, in Y directories you would
	431	find X files'" -- and so the implemented solution should be,
	432	isomorphically, just a straightforward way to spit out that one
	433	phrase, with numerals properly interpolated. It shouldn't have to map
	434	from the complexity of other languages to the simplicity of this one.
	435
	436	=head2 Buzzword: Inheritance
	437
	438	There's a great deal of reuse possible for sharing of phrases between
	439	modules for related dialects, or for sharing of auxiliary functions
	440	between related languages. (By "auxiliary functions", I mean
	441	functions that don't produce phrase-text, but which, say, return an
	442	answer to "does this number require a plural noun after it?". Such
	443	auxiliary functions would be used in the internal logic of functions
	444	that actually do produce phrase-text.)
	445
	446	In the case of sharing phrases, consider that you have an interface
	447	already localized for American English (probably by having been
	448	written with that as the native locale, but that's incidental).
	449	Localizing it for UK English should, in practical terms, be just a
	450	matter of running it past a British person with the instructions to
	451	indicate what few phrases would benefit from a change in spelling or
	452	possibly minor rewording. In that case, you should be able to put in
	453	the UK English localization module I<only> those phrases that are
	454	UK-specific, and for all the rest, I<inherit> from the American
	455	English module. (And I expect this same situation would apply with
	456	Brazilian and Continental Portuguese, possibly with some I<very>
	457	closely related languages like Czech and Slovak, and possibly with the
	458	slightly different "versions" of written Mandarin Chinese, as I hear exist in
	459	Taiwan and mainland China.)
	460
	461	As to sharing of auxiliary functions, consider the problem of Russian
	462	numbers from the beginning of this article; obviously, you'd want to
	463	write only once the hairy code that, given a numeric value, would
	464	return some specification of which case and number a given quantified
	465	noun should use. But suppose that you discover, while localizing an
	466	interface for, say, Ukrainian (a Slavic language related to Russian,
	467	spoken by tens of millions of people, many of whom would be relieved to
	468	find that your Web site's or software's interface is available in
	469	their language), that the rules in Ukrainian are the same as in Russian
	470	for quantification, and probably for many other grammatical functions.
	471	While there may well be no phrases in common between Russian and
	472	Ukrainian, you could still choose to have the Ukrainian module inherit
	473	from the Russian module, just for the sake of inheriting all the
	474	various grammatical methods. Or, probably better organizationally,
	475	you could move those functions to a module called C<_E_Slavic> or
	476	something, which Russian and Ukrainian could inherit useful functions
	477	from, but which would (presumably) provide no lexicon.
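
The phrase-sharing case can be sketched with nothing more than Perl's ordinary C<@ISA> mechanism. Everything named here is hypothetical: the C<Sketch::en_us> and C<Sketch::en_gb> packages, the short lexicon keys, and the C<phrase> method are stand-ins for illustration, not Maketext's actual interface.

```perl
use strict;
use warnings;

{
    package Sketch::en_us;    # hypothetical American English module

    our %Lexicon = (
        pick_color => 'Pick a color:',
        file_saved => 'File saved.',
    );

    sub new { my $class = shift; return bless {}, $class }

    # Look in this class's %Lexicon first, then in each parent class's.
    sub phrase {
        my ($self, $key) = @_;
        my @classes = (ref $self);
        while (@classes) {
            my $class = shift @classes;
            no strict 'refs';
            my $lexicon = \%{"${class}::Lexicon"};
            return $lexicon->{$key} if exists $lexicon->{$key};
            push @classes, @{"${class}::ISA"};
        }
        die "No phrase for '$key'\n";
    }
}

{
    package Sketch::en_gb;    # only the UK-specific phrases live here

    our @ISA = ('Sketch::en_us');
    our %Lexicon = (
        pick_color => 'Pick a colour:',
    );
}

my $uk = Sketch::en_gb->new;
print $uk->phrase('pick_color'), "\n";    # overridden in en_gb
print $uk->phrase('file_saved'), "\n";    # inherited from en_us
```

Auxiliary grammar methods come along through the same C<@ISA> chain with no extra work, which is what makes a lexicon-free base class like C<_E_Slavic> practical.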
	478
	479	=head2 Buzzword: Concision
	480
	481	Okay, concision isn't a buzzword. But it should be, so I decree that
	482	as a new buzzword, "concision" means that simple common things should
	483	be expressible in very few lines (or maybe even just a few characters)
	484	of code -- call it a special case of "making simple things easy and
	485	hard things possible", and see also the role it played in the
	486	MIDI::Simple language, discussed elsewhere in this issue [TPJ#13].
	487
	488	Consider our first stab at an entry in our "phrasebook of functions":
	489
	490	  sub I_found_X1_files_in_X2_directories {
	491	    my( $files, $dirs ) = @_[0,1];
	492	    $files = sprintf("%g %s", $files,
	493	      $files == 1 ? 'fichier' : 'fichiers');
	494	    $dirs = sprintf("%g %s", $dirs,
	495	      $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
	496	    return "J'ai trouv\xE9 $files dans $dirs.";
	497	  }
	498
	499	You may sense that a lexicon (to use a non-committal catch-all term for a
	500	collection of things you know how to say, regardless of whether they're
	501	phrases or words) consisting of functions I<expressed> as above would
	502	make for rather long-winded and repetitive code -- even if you wisely
	503	rewrote this to have quantification (as we call adding a number
	504	expression to a noun phrase) be a function called like:
	505
	506	  sub I_found_X1_files_in_X2_directories {
	507	    my( $files, $dirs ) = @_[0,1];
	508	    $files = quant($files, "fichier");
	509	    $dirs = quant($dirs, "r\xE9pertoire");
	510	    return "J'ai trouv\xE9 $files dans $dirs.";
	511	  }
	512
	513	And you may also sense that you do not want to bother your translators
	514	with having to write Perl code -- you'd much rather that they spend
	515	their I<very costly time> on just translation. And this is to say
	516	nothing of the near impossibility of finding a commercial translator
	517	who would know even simple Perl.
	518
	519	In a first-hack implementation of Maketext, each language-module's
	520	lexicon looked like this:
	521
	522	  %Lexicon = (
	523	    "I found %g files in %g directories"
	524	    => sub {
	525	      my( $files, $dirs ) = @_[0,1];
	526	      $files = quant($files, "fichier");
	527	      $dirs = quant($dirs, "r\xE9pertoire");
	528	      return "J'ai trouv\xE9 $files dans $dirs.";
	529	    },
	530	    # ... and so on with other phrase => sub mappings ...
	531	  );
	532
	533	but I immediately went looking for some more concise way to basically
	534	denote the same phrase-function -- a way that would also serve to
	535	concisely denote I<most> phrase-functions in the lexicon for I<most>
	536	languages. After much time and even some actual thought, I decided on
	537	this system:
	538
	539	* Where a value in a %Lexicon hash is a contentful string instead of
	540	an anonymous sub (or, conceivably, a coderef), it would be interpreted
	541	as a sort of shorthand expression of what the sub does. When accessed
	542	for the first time in a session, it is parsed, turned into Perl code,
	543	and then eval'd into an anonymous sub; then that sub replaces the
	544	original string in that lexicon. (That way, the work of parsing and
	545	evaling the shorthand form for a given phrase is done no more than
	546	once per session.)
	547
	548	* Calls to C<maketext> (as Maketext's main function is called) happen
	549	through a "language session handle", notionally very much like an IO
	550	handle, in that you open one at the start of the session, and use it
	551	for "sending signals" to an object in order to have it return the text
	552	you want.
	553
	554	So, this:
	555
	556	  $lang->maketext("You have [quant,_1,piece] of new mail.",
	557	                  scalar(@messages));
	558
	559	basically means this: look in the lexicon for $lang (which may inherit
	560	from any number of other lexicons), and find the function that we
	561	happen to associate with the string "You have [quant,_1,piece] of new
	562	mail" (which is, and should be, a functioning "shorthand" for this
	563	function in the native locale -- English in this case). If you find
	564	such a function, call it with $lang as its first parameter (as if it
	565	were a method), and then a copy of scalar(@messages) as its second,
	566	and then return that value. If that function was found, but was in
	567	string shorthand instead of being a fully specified function, parse it
	568	and make it into a function before calling it the first time.
	569
	570	* The shorthand uses code in brackets to indicate method calls that
	571	should be performed. A full explanation is not in order here, but a
	572	few examples will suffice:
	573
	574	  "You have [quant,_1,piece] of new mail."
	575
	576	The above code is shorthand for, and will be interpreted as,
	577	this:
	578
	579	  sub {
	580	    my $handle = $_[0];
	581	    my(@params) = @_;
	582	    return join '',
	583	      "You have ",
	584	      $handle->quant($params[1], 'piece'),
	585	      " of new mail.";
	586	  }
	587
	588	where "quant" is the name of a method you're using to quantify the
	589	noun "piece" with the number $params[1].
	590
	591	A string with no brackety calls, like this:
	592
	593	  "Your search expression was malformed."
	594
	595	is somewhat of a degenerate case, and just gets turned into:
	596
	597	  sub { return "Your search expression was malformed." }
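
The parse-once-then-cache behavior described above can be sketched in a few dozen lines. This is only a toy under stated assumptions: the C<Sketch::Handle> package and its C<compile> function are hypothetical, it builds closures directly rather than generating Perl source and eval'ing it as described above, and it handles only plain text and simple C<[method,arg,...]> groups, with no escaping or nesting.

```perl
use strict;
use warnings;

{
    package Sketch::Handle;    # hypothetical stand-in for a language module

    our %Lexicon = (
        'You have [quant,_1,piece] of new mail.'
            => 'You have [quant,_1,piece] of new mail.',
        'Your search expression was malformed.'
            => 'Your search expression was malformed.',
    );

    sub new { my $class = shift; return bless {}, $class }

    # Naive English quantifier: "1 piece", "3 pieces".
    sub quant {
        my ($self, $n, $noun) = @_;
        return $n == 1 ? "$n $noun" : "$n ${noun}s";
    }

    # Turn a shorthand string into a sub. Here _N means the Nth
    # parameter after the handle, so _1 is $params[0].
    sub compile {
        my ($shorthand) = @_;
        my @steps;
        for my $chunk (split /(\[[^\]]*\])/, $shorthand) {
            next if $chunk eq '';
            if ($chunk =~ /^\[(.*)\]$/) {
                my ($method, @args) = split /,/, $1;
                push @steps, sub {
                    my ($handle, @params) = @_;
                    my @values;
                    for my $arg (@args) {
                        push @values,
                            $arg =~ /^_(\d+)$/ ? $params[$1 - 1] : $arg;
                    }
                    return $handle->$method(@values);
                };
            }
            else {
                my $text = $chunk;
                push @steps, sub { return $text };
            }
        }
        return sub {
            my @call = @_;
            return join '', map { $_->(@call) } @steps;
        };
    }

    sub maketext {
        my ($self, $key, @params) = @_;
        my $entry = $Lexicon{$key};
        die "No lexicon entry for '$key'\n" unless defined $entry;
        # First use: compile the string and cache the sub in its place.
        $entry = $Lexicon{$key} = compile($entry)
            unless ref($entry) eq 'CODE';
        return $entry->($self, @params);
    }
}

my $lang = Sketch::Handle->new;
print $lang->maketext('You have [quant,_1,piece] of new mail.', 3), "\n";
print $lang->maketext('Your search expression was malformed.'), "\n";
```

After the first call, the lexicon slot holds a code reference instead of the string, so later calls skip the parsing step entirely.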
	598
	599	However, not everything you can write in Perl code can be written in
	600	the above shorthand system -- not by a long shot. For example, consider
	601	the Italian translator from the beginning of this article, who wanted
	602	the Italian for "I didn't find any files" as a special case, instead
	603	of "I found 0 files". That couldn't be specified (at least not easily
	604	or simply) in our shorthand system, and it would have to be written
	605	out in full, like this:
	606
	607	  sub {  # pretend the English strings are in Italian
	608	    my($handle, $files, $dirs) = @_[0,1,2];
	609	    return "I didn't find any files" unless $files;
	610	    return join '',
	611	      "I found ",
	612	      $handle->quant($files, 'file'),
	613	      " in ",
	614	      $handle->quant($dirs, 'directory'),
	615	      ".";
	616	  }
	617
	618	Next to a lexicon full of shorthand code, that sort of sticks out like a
	619	sore thumb -- but this I<is> a special case, after all; and at least
	620	it's possible, if not as concise as usual.
	621
	622	As to how you'd implement the Russian example from the beginning of
	623	the article, well, There's More Than One Way To Do It, but it could be
	624	something like this (using English words for Russian, just so you know
	625	what's going on):
	626
	627	  "I [quant,_1,directory,accusative] scanned."
	628
	629	This shifts the burden of complexity off to the quant method. That
	630	method's parameters are: the numeric value it's going to use to
	631	quantify something; the Russian word it's going to quantify; and the
	632	parameter "accusative", which you're using to mean that this
	633	sentence's syntax wants a noun in the accusative case there, although
	634	that quantification method may have to overrule, for grammatical
	635	reasons you may recall from the beginning of this article.
	636
	637	Now, the Russian quant method here is responsible not only for
	638	implementing the strange logic necessary for figuring out how Russian
	639	number-phrases impose case and number on their noun-phrases, but also
	640	for inflecting the Russian word for "directory". How that inflection
	641	is to be carried out is no small issue, and among the solutions I've
	642	seen, some (like variations on a simple lookup in a hash where all
	643	possible forms are provided for all necessary words) are
	644	straightforward but I<can> become cumbersome when you need to inflect
	645	more than a few dozen words; and other solutions (like using
	646	algorithms to model the inflections, storing only root forms and
	647	irregularities) I<can> involve more overhead than is justifiable for
	648	all but the largest lexicons.
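
As a rough sketch of the first kind of solution (a plain hash lookup), here is one way such a quantification routine might look. Everything in it is an assumption made for illustration: the name C<quant_ru>, the C<%FORMS> table, the form labels, and the transliterated word forms are not part of Maketext. A real version would be a method in the Russian language module and would use properly encoded Russian text.

```perl
use strict;
use warnings;

# Hypothetical form table: every needed form of every needed word,
# keyed by the English gloss used in the lexicon entry.
# The forms are transliterated Russian, for illustration only.
my %FORMS = (
    directory => {
        'nominative-singular' => 'direktoriya',
        'accusative-singular' => 'direktoriyu',
        'genitive-singular'   => 'direktorii',
        'genitive-plural'     => 'direktoriy',
    },
);

# Hypothetical stand-in for a Russian quant method.
# $num is assumed to be a non-negative integer; $case is the case
# that the sentence's syntax asks for (e.g. 'accusative').
sub quant_ru {
    my ($num, $word, $case) = @_;
    my $forms = $FORMS{$word}
        or die "No forms listed for '$word'";

    my $last     = $num % 10;
    my $last_two = $num % 100;

    my $key;
    if ($last == 1 and $last_two != 11) {
        # 1, 21, 31...: singular, in the case the syntax wants
        $key = "$case-singular";
    }
    elsif ($last >= 2 and $last <= 4
           and not ($last_two >= 12 and $last_two <= 14)) {
        # 2-4, 22-24...: the number overrules, forcing genitive singular
        $key = 'genitive-singular';
    }
    else {
        # everything else (0, 5-20, 25-30...): genitive plural
        $key = 'genitive-plural';
    }

    defined $forms->{$key}
        or die "No '$key' form listed for '$word'";
    return "$num $forms->{$key}";
}

print quant_ru($_, 'directory', 'accusative'), "\n" for 1, 3, 5, 11, 21;
```

The table grows by one entry per word and one key per form, which is exactly why this approach is easy to start with and can become tedious for a large vocabulary.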
	649
	650	Mercifully, this design decision becomes crucial only in the hairiest
	651	of inflected languages, of which Russian is by no means the I<worst> case
	652	scenario, but is worse than most. Most languages have simpler
	653	inflection systems; for example, in English or Swahili, there are
	654	generally no more than two possible inflected forms for a given noun
	655	("error/errors"; "kosa/makosa"), and the
	656	rules for producing these forms are fairly simple -- or at least,
	657	simple rules can be formulated that work for most words, and you can
	658	then treat the exceptions as just "irregular", at least relative to
	659	your ad hoc rules. A simpler inflection system (simpler rules, fewer
	660	forms) means that design decisions are less crucial to maintaining
	661	sanity, whereas the same decisions could incur
	662	overhead-versus-scalability problems in languages like Russian. It
	663	may I<also> be likely that code (possibly in Perl, as with
	664	Lingua::EN::Inflect, for English nouns) has already
	665	been written for the language in question, whether simple or complex.
	666
	667	Moreover, a third possibility may even be simpler than anything
	668	discussed above: "Just require that all possible (or at least
	669	applicable) forms be provided in the call to the given language's quant
	670	method, as in:"
	671
	672	"I found [quant,_1,file,files]."
	673
	674	That way, quant just has to choose which form it needs, without having
	675	to look up or generate anything. While possibly not optimal for
	676	Russian, this should work well for most other languages, where
	677	quantification is not as complicated an operation.
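
A minimal sketch of this approach, using the C<quant> method that the base Locale::Maketext class provides, might look like the following. The class names C<BogoQuery::Locale> and C<BogoQuery::Locale::en> and the lexicon key C<found_files> are made up for this example, and a real project would normally keep each language class in its own file rather than inline.

```perl
use strict;
use warnings;
use Locale::Maketext;

# Hypothetical project base class, derived from Locale::Maketext.
package BogoQuery::Locale;
our @ISA = ('Locale::Maketext');

# Hypothetical English language class, with its lexicon.
package BogoQuery::Locale::en;
our @ISA = ('BogoQuery::Locale');
our %Lexicon = (
    # Both forms are spelled out in the call, so quant only has to
    # choose between them; nothing is looked up or generated.
    found_files => 'I found [quant,_1,file,files].',
);

package main;

# Ask for an English language handle.
my $lh = BogoQuery::Locale->get_handle('en')
    or die "Couldn't get a language handle";

for my $count (1, 3) {
    print $lh->maketext('found_files', $count), "\n";
}
```

Run as written, this should print "I found 1 file." and then "I found 3 files."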
	678
	679	=head2 The Devil in the Details
	680
	681	There's plenty more to Maketext than described above -- for example,
	682	there's the details of how language tags ("en-US", "i-pwn", "fi",
	683	etc.) or locale IDs ("en_US") interact with actual module naming
	684	("BogoQuery/Locale/en_us.pm"), and what magic can ensue; there's the
	685	details of how to record (and possibly negotiate) what character
	686	encoding Maketext will return text in (UTF8? Latin-1? KOI8?). There's
	687	the interesting fact that Maketext is for localization, but nowhere
	688	actually has a "C<use locale;>" anywhere in it. For the curious,
	689	there's the somewhat frightening details of how I actually
	690	implement something like data inheritance so that searches across
	691	modules' %Lexicon hashes can parallel how Perl implements method
	692	inheritance.
	693
	694	And, most importantly, there's all the practical details of how to
	695	actually go about deriving from Maketext so you can use it for your
	696	interfaces, and the various tools and conventions for starting out and
	697	maintaining individual language modules.
	698
	699	That is all covered in the documentation for Locale::Maketext and the
	700	modules that come with it, available in CPAN. After having read this
	701	article, which covers the why's of Maketext, the documentation,
	702	which covers the how's of it, should be quite straightforward.
	703
	704	=head2 The Proof in the Pudding: Localizing Web Sites
	705
	706	Maketext and gettext have a notable difference: gettext is in C,
	707	accessible thru C library calls, whereas Maketext is in Perl, and
	708	really can't work without a Perl interpreter (although I suppose
	709	something like it could be written for C). Accidents of history (and
	710	not necessarily lucky ones) have made C++ the most common language for
	711	the implementation of applications like word processors, Web browsers,
	712	and even many in-house applications like custom query systems. Current
	713	conditions make it somewhat unlikely that the next one of any of these
	714	kinds of applications will be written in Perl, albeit clearly more for
	715	reasons of custom and inertia than out of consideration of what is the
	716	right tool for the job.
	717
	718	However, other accidents of history have made Perl a well-accepted
	719	language for design of server-side programs (generally in CGI form)
	720	for Web site interfaces. Localization of static pages in Web sites is
	721	trivial, feasible either with simple language-negotiation features in
	722	servers like Apache, or with some kind of server-side inclusions of
	723	language-appropriate text into layout templates. However, I think
	724	that the localization of Perl-based search systems (or other kinds of
	725	dynamic content) in Web sites, be they public or access-restricted,
	726	is where Maketext will see the greatest use.
	727
	728	I presume that it would be only the exceptional Web site that gets
	729	localized for English I<and> Chinese I<and> Italian I<and> Arabic
	730	I<and> Russian, to recall the languages from the beginning of this
	731	article -- to say nothing of German, Spanish, French, Japanese,
	732	Finnish, and Hindi, to name a few languages that benefit from large
	733	numbers of programmers or Web viewers or both.
	734
	735	However, the ever-increasing internationalization of the Web (whether
	736	measured in terms of amount of content, of numbers of content writers
	737	or programmers, or of size of content audiences) makes it increasingly
	738	likely that the interface to the average Web-based dynamic content
	739	service will be localized for two or maybe three languages. It is my
	740	hope that Maketext will make that task as simple as possible, and will
	741	remove previous barriers to localization for languages dissimilar to
	742	English.
	743
	744	__END__
	745
	746	Sean M. Burke (sburkeE<64>cpan.org) has a Master's in linguistics
	747	from Northwestern University; he specializes in language technology.
	748	Jordan Lachler (lachlerE<64>unm.edu) is a PhD student in the Department of
	749	Linguistics at the University of New Mexico; he specializes in
	750	morphology and pedagogy of North American native languages.
	751
	752	=head2 References
	753
	754	Alvestrand, Harald Tveit. 1995. I<RFC 1766: Tags for the
	755	Identification of Languages.>
	756	C<ftp://ftp.isi.edu/in-notes/rfc1766.txt>
	757	[Now see RFC 3066.]
	758
	759	Callon, Ross, editor. 1996. I<RFC 1925: The Twelve
	760	Networking Truths.>
	761	C<ftp://ftp.isi.edu/in-notes/rfc1925.txt>
	762
	763	Drepper, Ulrich, Peter Miller,
	764	and FranE<ccedil>ois Pinard. 1995-2001. GNU
	765	C<gettext>. Available in C<ftp://prep.ai.mit.edu/pub/gnu/>, with
	766	extensive docs in the distribution tarball. [Since
	767	I wrote this article in 1998, I now see that the
	768	gettext docs are now trying more to come to terms with
	769	plurality. Whether useful conclusions have come from it
	770	is another question altogether. -- SMB, May 2001]
	771
	772	Forbes, Nevill. 1964. I<Russian Grammar.> Third Edition, revised
	773	by J. C. Dumbreck. Oxford University Press.
	774
	775	=cut
	776
	777	#End
	778