package I18N::LangTags::List;
# Time-stamp: "2004-10-06 23:26:21 ADT"
use vars qw(%Name %Is_Disrec $Debug $VERSION);
#----------------------------------------------------------------------
# read the table out of our own POD!
{
  my($seeking, $count, $last_name) = (1, 0, '');
  my($disrec, $tag, $name);
  while(<I18N::LangTags::List::DATA>) {
    if($seeking) {
      $seeking = 0 if m/=for woohah/;
    } elsif( ($disrec, $tag, $name) =
      m/(\[?)\{([-0-9a-zA-Z]+)\}(?:\s*:)?\s*([^\[\]]+)/
    ) {
      $name =~ s/\s*[;\.]*\s*$//g;
      next unless $name;
      ++$count;
      print "<$tag> <$name>\n" if $Debug;
      $last_name = $Name{$tag} = $name;
      $Is_Disrec{$tag} = 1 if $disrec;
    } elsif (m/[Ff]ormerly \"([-a-z0-9]+)\"/) {
      $Name{$1} = "$last_name (old tag)" if $last_name;
    }
  }
  die "No tags read??" unless $count;
}
#----------------------------------------------------------------------
my $tag = lc($_[0] || return);
} elsif($tag =~ m/^i-(.+)/) {
print "Input: {$tag}\n" if $Debug;
last if $name = $Name{$tag};
last if $name = $Name{$alt};
if($tag =~ s/(-[a-z0-9]+)$//s) {
print "Shaving off: $1 leaving $tag\n" if $Debug;
$alt =~ s/(-[a-z0-9]+)$//s && $Debug && print " alt -> $alt\n";
# we're trying to pull a subform off a primary tag. TILT!
print "Aborting on: {$name}{$subform}\n" if $Debug;
print "Output: {$name}{$subform}\n" if $Debug;
return unless $name; # Failure
return $name unless $subform; # Exact match
return "$name (Subform \"$subform\")";
#--------------------------------------------------------------------------
my $tag = lc($_[0] || return 0);
foreach my $bit (split('-', $tag)) {
scalar(@supers) ? ($supers[-1] . '-' . $bit) : $bit;
shift @supers if $supers[0] =~ m<^(i|x|sgn)$>s;
foreach my $f ($tag, @supers) {
return 0 if $Is_Disrec{$f};
# so that decent subforms of indecent tags are decent
return 2 if $Name{$tag}; # not only is it decent, it's known!
#--------------------------------------------------------------------------
I18N::LangTags::List -- tags and names for human languages
  use I18N::LangTags::List;
  print "Parlez-vous... ", join(', ',
      I18N::LangTags::List::name('elx') || 'unknown_language',
      I18N::LangTags::List::name('ar-Kw') || 'unknown_language',
      I18N::LangTags::List::name('en') || 'unknown_language',
      I18N::LangTags::List::name('en-CA') || 'unknown_language',
    ), "?\n";

prints:

  Parlez-vous... Elamite, Kuwait Arabic, English, Canadian English?
This module provides a function
C<I18N::LangTags::List::name( I<langtag> )> that takes
a language tag (see L<I18N::LangTags|I18N::LangTags>)
and returns the best attempt at an English name for it, or
undef if it can't make sense of the tag.
The function I18N::LangTags::List::name(...) is not exported.
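As a rough illustration of the "best attempt" behaviour, the sketch below looks up two listed tags and one tag with an unlisted subtag. The subtag "bogus" is invented for this example; judging from the code above, name() should fall back to the longest listed prefix and report the remainder as a subform, but the exact wording of that result is not shown here because it depends on the installed version.

```perl
use strict;
use warnings;
use I18N::LangTags::List;

# "en" and "en-CA" appear in the list below; "en-bogus" does not
# (the subtag "bogus" is made up for this sketch).
foreach my $tag (qw(en en-CA en-bogus)) {
    # name() is not exported, so call it by its full package name.
    my $name = I18N::LangTags::List::name($tag);
    printf "%-9s => %s\n", $tag, defined $name ? $name : 'unknown_language';
}
```

Because name() returns undef on failure, testing the result with defined (or with || as in the synopsis) is enough to supply a fallback string.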
This module also provides a function
C<I18N::LangTags::List::is_decent( I<langtag> )> that returns true iff
the language tag is syntactically valid and is for general use (like
"fr" or "fr-ca", below). That is, it returns false for tags that are
syntactically invalid and for tags, like "aus", that are listed in
brackets below. This function is not exported.
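A minimal sketch of is_decent() in use, with the three tags mentioned in the paragraph above as sample input. Only the truth value of the result is relied on here.

```perl
use strict;
use warnings;
use I18N::LangTags::List;

# "fr" and "fr-ca" are for general use; "aus" is bracketed in the list.
foreach my $tag (qw(fr fr-ca aus)) {
    # is_decent() is not exported, so call it by its full package name.
    my $ok = I18N::LangTags::List::is_decent($tag);
    print "$tag: ", ($ok ? "for general use" : "not for general use"), "\n";
}
```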
The map of tags-to-names that it uses is accessible as
%I18N::LangTags::List::Name, and it's the same as the list
that follows in this documentation, which should be useful
to you even if you don't use this module.
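One way to read the map directly is sketched below: it copies the hash and prints every tag with its name, ordered by name. The local variable %name is just a name chosen for this example; copying is a precaution so that the module's own table is not modified by accident.

```perl
use strict;
use warnings;
use I18N::LangTags::List;

# Work on a copy so the module's table stays untouched.
my %name = %I18N::LangTags::List::Name;

# Order by English name, then by tag to keep the output stable.
foreach my $tag (sort { $name{$a} cmp $name{$b} or $a cmp $b } keys %name) {
    print "$tag\t$name{$tag}\n";
}
```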
=head1 ABOUT LANGUAGE TAGS
Internet language tags, as defined in RFC 3066, are a formalism
for denoting human languages. The two-letter ISO 639-1 language
codes are well known (as "en" for English), as are their forms
when qualified by a country code ("en-US"). Less well-known are the
arbitrary-length non-ISO codes (like "i-mingo"), and the
recently (in 2001) introduced three-letter ISO 639-2 codes.
Remember these important facts:
Language tags are not locale IDs. A locale ID is written with a "_"
instead of a "-", (almost?) always matches C<m/^\w\w_\w\w\b/>, and
I<means> something different than a language tag. A language tag
denotes a language. A locale ID denotes a language I<as used in>
a particular place, in combination with non-linguistic
location-specific information such as what currency is used
there. Locales I<also> often denote character set information,
Language tags are not for computer languages.
"Dialect" is not a useful term, since there is no objective
criterion for establishing when two language-forms are
dialects of each other, or are separate languages.
Language tags are not case-sensitive. en-US, en-us, En-Us, etc.,
are all the same tag, and denote the same language.
Not every language tag really refers to a single language. Some
language tags refer to conditions: i-default (system-message text
in English plus maybe other languages), und (undetermined
language). Others (notably lots of the three-letter codes) are
bibliographic tags that classify whole groups of languages, as
with cus for "Cushitic (Other)" (i.e., a
language that has been classed as Cushitic, but which has no more
specific code) or the even less linguistically coherent
sai for "South American Indian (Other)". Though useful in
bibliography, B<SUCH TAGS ARE NOT
FOR GENERAL USE>. For further guidance, email me.
Language tags are not country codes. In fact, they are often
distinct codes, as with language tag ja for Japanese, and
ISO 3166 country code C<.jp> for Japan.
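Two of the points above (locale IDs are not language tags, and language tags are not case-sensitive) can be sketched in code. Both helpers below, locale_to_langtag and same_langtag, are hypothetical names written for this illustration and are not part of this module. The locale pattern is an assumption that covers only the common "ll_CC" shape with an optional charset or modifier suffix.

```perl
use strict;
use warnings;

# Hypothetical helper: derive a language tag from a locale ID such as
# "en_US.ISO8859-1". Returns nothing for IDs like "C" or "POSIX" that
# do not have the assumed ll or ll_CC shape.
sub locale_to_langtag {
    my ($locale) = @_;
    return unless defined $locale;
    return unless
        $locale =~ m/^([a-zA-Z]{2,3})(?:_([a-zA-Z]{2}))?(?:[.\@].*)?$/s;
    # Swap "_" for "-" and drop the charset/modifier part.
    my $tag = defined $2 ? "$1-$2" : $1;
    return lc $tag;
}

# Hypothetical helper: compare two language tags ignoring case.
sub same_langtag {
    my ($x, $y) = @_;
    return lc($x) eq lc($y);
}

print locale_to_langtag('en_US.ISO8859-1'), "\n";
print same_langtag('en-US', 'En-Us') ? "same tag\n" : "different tags\n";
```

Note that the conversion discards the character set and any location-specific meaning of the locale; what remains only denotes a language.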
The first part of each item is the language tag, between {...}. It
is followed by an English name for the language or language-group.
Language tags that I judge to be not for general use are bracketed.
This list is in alphabetical order by English name of the language.
The name in the =item line MUST NOT have E<...>'s in it!!
=item [{afa} : Afro-Asiatic (Other)]
=item [{alg} : Algonquian languages]
=item [{tut} : Altaic (Other)]
eq Amis. eq 'Amis. eq Pangca.
=item [{apa} : Apache languages]
Many forms are mutually un-intelligible in spoken media.
{ar-jo} Jordanian Arabic;
NOT Amharic! NOT Samaritan Aramaic!
=item [{art} : Artificial (Other)]
=item [{ath} : Athapascan languages]
eq Athabaskan. eq Athapaskan. eq Athabascan.
=item [{aus} : Australian languages]
=item [{map} : Austronesian (Other)]
{az-Arab} Azerbaijani in Arabic script;
{az-Cyrl} Azerbaijani in Cyrillic script;
{az-Latn} Azerbaijani in Latin script.
=item [{bat} : Baltic (Other)]
=item [{bai} : Bamileke languages]
=item [{bnt} : Bantu (Other)]
=item {btk} : Batak (Indonesia)
eq Belarussian. eq Byelarussian.
eq Belorussian. eq Byelorussian.
eq White Russian. eq White Ruthenian.
=item [{ber} : Berber (Other)]
eq CatalE<aacute>n. eq Catalonian.
=item [{cau} : Caucasian (Other)]
=item [{cel} : Celtic (Other)]
{cel-gaulish} Gaulish (Historical)
=item [{cai} : Central American Indian (Other)]
=item [{cmc} : Chamic languages]
(Historical) NOT Chibchan (which is a language family).
Many forms are mutually un-intelligible in spoken media.
{zh-Hans} Chinese, in simplified script;
{zh-Hant} Chinese, in traditional script;
{zh-sg} Singapore Chinese;
{zh-hk} Hong Kong Chinese;
{zh-guoyu} Mandarin [Putonghua/Guoyu];
{zh-hakka} Hakka [formerly "i-hakka"];
{zh-min-nan} Southern Hokkien;
{i-hakka} Hakka (old tag)
=item {chn} : Chinook Jargon
=item {cu} : Church Slavic
eq Trukese. eq Chuuk. eq Truk. eq Ruk.
NOT Creek! (Formerly "cre".)
=item [{cpe} : English-based Creoles and pidgins (Other)]
=item [{cpf} : French-based Creoles and pidgins (Other)]
=item [{cpp} : Portuguese-based Creoles and pidgins (Other)]
=item [{crp} : Creoles and pidgins (Other)]
=item [{cus} : Cushitic (Other)]
=item {i-default} : Default (Fallthru) Language
Defined in RFC 2277, this is for tagging text
(which must include English text, and might/should include text
in other appropriate languages) that is emitted in a context
where language-negotiation wasn't possible -- in SMTP mail failure
eq Maldivian. (Formerly "div".)
=item [{dra} : Dravidian (Other)]
eq Netherlander. Notable forms:
{nl-nl} Netherlands Dutch;
=item {dum} : Middle Dutch (ca.1050-1350)
=item {egy} : Ancient Egyptian
{en-au} Australian English;
{en-ca} Canadian English;
{en-jm} Jamaican English;
{en-nz} New Zealand English;
{en-ph} Philippine English;
{en-tt} Trinidad English;
{en-za} South African English;
{en-zw} Zimbabwe English.
=item {enm} : Middle English (1100-1500)
=item {ang} : Old English (ca.450-1100)
eq Anglo-Saxon. (Historical)
=item {i-enochian} : Enochian (Artificial)
=item [{fiu} : Finno-Ugrian (Other)]
eq Finno-Ugric. NOT Ugaritic!
{fr-lu} Luxembourg French;
=item {frm} : Middle French (ca.1400-1600)
=item {fro} : Old French (842-ca.1400)
=item {gd} : Scots Gaelic
{de-li} Liechtenstein German;
{de-lu} Luxembourg German.
=item {gmh} : Middle High German (ca.1050-1500)
=item {goh} : Old High German (ca.750-1050)
=item [{gem} : Germanic (Other)]
=item {grc} : Ancient Greek
(Historical) (Until 15th century or so.)
=item {el} : Modern Greek
(Since 15th century or so.)
=item [{inc} : Indic (Other)]
=item [{ine} : Indo-European (Other)]
{in} Indonesian (old tag)
=item {ia} : Interlingua (International Auxiliary Language Association)
(Artificial) NOT Interlingue!
(Artificial) NOT Interlingua!
=item [{ira} : Iranian (Other)]
=item {mga} : Middle Irish (900-1200)
=item {sga} : Old Irish (to 900)
=item [{iro} : Iroquoian languages]
(Formerly "jw" because of a typo.)
=item {jrb} : Judeo-Arabic
=item {jpr} : Judeo-Persian
eq Kanarese. NOT Canadian!
=item {krc} : Karachay-Balkar
=item {kaa} : Kara-Kalpak
eq Cambodian. eq Kampuchean.
=item [{khi} : Khoisan (Other)]
=item {i-klingon} : Klingon
eq Judeo-Spanish. NOT Ladin (a minority language in Italy).
(Historical) NOT Ladin! NOT Ladino!
=item {lb} : Letzeburgesch
eq Luxemburgian, eq Luxemburger. (Formerly "i-lux".)
{i-lux} Letzeburgesch (old tag)
eq Limburger, eq Limburgan. NOT Letzeburgesch!
eq Low Saxon. eq Low German.
=item {art-lojban} : Lojban (Artificial)
=item {lu} : Luba-Katanga
=item {luo} : Luo (Kenya and Tanzania)
eq the modern Slavic language spoken in what was Yugoslavia.
NOT the form of Greek spoken in Greek Macedonia!
=item [{mno} : Manobo languages]
=item [{myn} : Mayan languages]
=item {min} : Minangkabau
eq the Iroquoian language West Virginia Seneca. NOT New York Seneca!
=item [{mis} : Miscellaneous languages]
=item [{mkh} : Mon-Khmer (Other)]
=item [{mul} : Multiple languages]
=item [{mun} : Munda languages]
eq Navaho. (Formerly "i-navajo".)
{i-navajo} Navajo (old tag)
=item {nd} : North Ndebele
=item {nr} : South Ndebele
eq Nepalese. Notable forms:
=item [{nic} : Niger-Kordofanian (Other)]
=item [{ssa} : Nilo-Saharan (Other)]
=item [{nai} : North American Indian]
Note the two following forms:
=item {nb} : Norwegian Bokmal
eq BokmE<aring>l. (A form of Norwegian.) (Formerly "no-bok".)
{no-bok} Norwegian Bokmal (old tag)
=item {nn} : Norwegian Nynorsk
(A form of Norwegian.) (Formerly "no-nyn".)
{no-nyn} Norwegian Nynorsk (old tag)
=item [{nub} : Nubian languages]
=item {oc} : Occitan (post 1500)
eq ProvenE<ccedil>al, eq Provencal
eq Ojibwe. (Formerly "oji".)
=item {os} : Ossetian; Ossetic
=item [{oto} : Otomian languages]
Group of languages collectively called "OtomE<iacute>".
=item [{paa} : Papuan (Other)]
=item {peo} : Old Persian (ca.600-400 B.C.)
=item [{phi} : Philippine (Other)]
eq Portugese. Notable forms:
{pt-pt} Portugal Portuguese;
{pt-br} Brazilian Portuguese.
=item [{pra} : Prakrit languages]
=item {pro} : Old Provencal (to 1500)
eq Old ProvenE<ccedil>al. (Historical.)
=item {rm} : Raeto-Romance
=item [{qaa - qtz} : Reserved for local use.]
=item [{roa} : Romance (Other)]
NOT Romanian! NOT Romany! NOT Romansh!
NOT White Russian! NOT Rusyn!
=item [{sal} : Salishan languages]
=item {sam} : Samaritan Aramaic
=item {se} : Northern Sami
eq Lappish. eq Lapp. eq (Northern) Saami.
=item {sma} : Southern Sami
=item [{smi} : Sami languages (Other)]
=item [{sem} : Semitic (Other)]
{sr-Cyrl} Serbian in Cyrillic script;
{sr-Latn} Serbian in Latin script.
=item {sgn-...} : Sign Languages
Always use with a subtag. Notable forms:
{sgn-gb} British Sign Language (BSL);
{sgn-ie} Irish Sign Language (ISL);
{sgn-ni} Nicaraguan Sign Language (ISN);
{sgn-us} American Sign Language (ASL).
(And so on with other country codes as the subtag.)
eq Blackfoot. eq Pikanii.
=item [{sit} : Sino-Tibetan (Other)]
=item [{sio} : Siouan languages]
=item {den} : Slave (Athapascan)
=item [{sla} : Slavic (Other)]
=item {wen} : Sorbian languages
eq Wendish. eq Sorb. eq Lusatian. eq Wend. NOT Venda! NOT Serbian!
=item {nso} : Northern Sotho
=item {st} : Southern Sotho
=item [{sai} : South American Indian (Other)]
{es-ar} Argentine Spanish;
{es-bo} Bolivian Spanish;
{es-co} Colombian Spanish;
{es-do} Dominican Spanish;
{es-ec} Ecuadorian Spanish;
{es-gt} Guatemalan Spanish;
{es-hn} Honduran Spanish;
{es-pa} Panamanian Spanish;
{es-pe} Peruvian Spanish;
{es-pr} Puerto Rican Spanish;
{es-py} Paraguay Spanish;
{es-sv} Salvadoran Spanish;
{es-uy} Uruguayan Spanish;
{es-ve} Venezuelan Spanish.
=item [{tai} : Tai (Other)]
=item {tog} : Tonga (Nyasa)
=item {to} : Tonga (Tonga Islands)
(Pronounced "Tong-a", not "Tong-ga")
=item [{tup} : Tupi languages]
(Typically in Roman script)
=item {ota} : Ottoman Turkish (1500-1928)
(Typically in Arabic script) (Historical)
=item {crh} : Crimean Turkish
=item {und} : Undetermined
Not a tag for normal use.
{uz-Cyrl} Uzbek in Cyrillic script;
{uz-Latn} Uzbek in Latin script.
NOT Wendish! NOT Wend! NOT Avestan! (Formerly "ven".)
eq VolapE<uuml>k. (Artificial)
=item [{wak} : Wakashan languages]
Presumably the Philippine language Waray-Waray (SamareE<ntilde>o),
not the smaller Philippine language Waray Sorsogon, nor the extinct
Australian language Waray.
=item {x-...} : Unregistered (Semi-Private Use)
"x-" is a prefix for language tags that are not registered with ISO
or IANA. Example: x-double-dutch.
Formerly "ji". Usually in Hebrew script.
{yi-latn} Yiddish in Latin script
=item [{ypk} : Yupik languages]
Several "Eskimo" languages.
L<I18N::LangTags|I18N::LangTags> and its "See Also" section.
=head1 COPYRIGHT AND DISCLAIMER
Copyright (c) 2001+ Sean M. Burke. All rights reserved.
You can redistribute and/or
modify this document under the same terms as Perl itself.
This document is provided in the hope that it will be
useful, but without any warranty;
without even the implied warranty of accuracy, authoritativeness,
completeness, merchantability, or fitness for a particular purpose.
Email any corrections or questions to me.
Sean M. Burke, sburkeE<64>cpan.org
# To generate a list of just the two and three-letter codes:
require 5; # Time-stamp: "2001-03-13 21:53:39 MST"
# Sean M. Burke, sburke@cpan.org
# This program is for generating the language_codes.txt file
use strict;
use LWP::Simple; # provides get()
use HTML::TreeBuilder 3.10;
my $root = HTML::TreeBuilder->new();
my $url = 'http://lcweb.loc.gov/standards/iso639-2/bibcodes.html';
$root->parse(get($url) || die "Can't get $url");
$root->eof();
my @codes;
foreach my $tr ($root->find_by_tag_name('tr')) {
  my @f = map $_->as_text(), $tr->content_list();
  #print map("<$_> ", @f), "\n";
  pop @f; # nix the French name
  next if $f[-1] eq 'Language Name (English)'; # it's a header line
  my $xx = splice(@f, 2,1); # pull out the two-letter code
  if($xx =~ m/[a-zA-Z]/) { # there's a two-letter code for it
    push @codes, [ lc($f[-1]), "$xx\t$f[-1]\n" ];
  } else { # print the three-letter codes.
    # assumed condition: the two three-letter code columns agree
    if($f[0] eq $f[1]) {
      push @codes, [ lc($f[-1]), "$f[1]\t$f[2]\n" ];
    } else { # shouldn't happen
      push @codes, [ lc($f[-1]), "@f !!!!!!!!!!\n" ];
    }
  }
}
print map $_->[1], sort {; $a->[0] cmp $b->[0] } @codes;
print "[ based on $url\n at ", scalar(localtime), "]\n",
  "[Note: doesn't include IANA-registered codes.]\n";