package I18N::LangTags::List;
# Time-stamp: "2002-02-02 20:13:58 MST"
use vars
qw(%Name $Debug $VERSION);
#----------------------------------------------------------------------
# read the table out of our own POD!
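my($seeking, $count, $tag, $name) = (1, 0);
while(<I18N::LangTags::List::DATA>) {
if($seeking) { $seeking = 0 if m/=for woohah/; next; } # skip ahead to the list
next unless ($tag, $name) =
m/\{([-0-9a-zA-Z]+)\}(?:\s*:)?\s*([^\[\]]+)/;
$name =~ s/\s*[;\.]*\s*$//g; # trim trailing punctuation and whitespace
next unless $name;
print "<$tag> <$name>\n" if $Debug;
$Name{$tag} = $name; # record the tag-to-name mapping
++$count;
}
die "No tags read??" unless $count;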
#----------------------------------------------------------------------
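sub name {
my $tag = lc($_[0] || return);
my($alt, $subform, $name) = ('', '', '');
if($tag =~ m/^x-(.+)/) { # assumed: x- and i- forms are alternates
$alt = "i-$1";
} elsif($tag =~ m/^i-(.+)/) {
$alt = "x-$1";
}
print "Input: {$tag}\n" if $Debug;
while(length $tag) {
last if $name = $Name{$tag};
last if $name = $Name{$alt};
if($tag =~ s/(-[a-z0-9]+)$//s) {
print "Shaving off: $1 leaving $tag\n" if $Debug;
$subform = "$1$subform"; # remember what was shaved off
$alt =~ s/(-[a-z0-9]+)$//s && $Debug && print " alt -> $alt\n";
} else {
# we're trying to pull a subform off a primary tag. TILT!
print "Aborting on: {$name}{$subform}\n" if $Debug;
last;
}
}
print "Output: {$name}{$subform}\n" if $Debug;
return unless $name; # Failure
return $name unless $subform; # Exact match
$subform =~ s/^-//s; # drop the leading hyphen
return "$name (Subform \"$subform\")";
}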
1;

__END__

=head1 NAME

I18N::LangTags::List -- tags and names for human languages

=head1 SYNOPSIS

  use I18N::LangTags::List;
  print "Parlez-vous... ", join(', ',
    I18N::LangTags::List::name('elx') || 'unknown_language',
    I18N::LangTags::List::name('ar-Kw') || 'unknown_language',
    I18N::LangTags::List::name('en') || 'unknown_language',
    I18N::LangTags::List::name('en-CA') || 'unknown_language',
  ), "?\n";

prints:

  Parlez-vous... Elamite, Kuwait Arabic, English, Canadian English?

=head1 DESCRIPTION

This module provides a function
C<I18N::LangTags::List::name( I<langtag> ) > that takes
a language tag (see L<I18N::LangTags|I18N::LangTags>)
and returns the best attempt at an English name for it, or
undef if it can't make sense of the tag.
The function I18N::LangTags::List::name(...) is not exported.
The map of tags-to-names that it uses is accessible as
%I18N::LangTags::List::Name, and it's the same as the list
that follows in this documentation, which should be useful
to you even if you don't use this module.
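
As a rough illustration of what "best attempt" means, the sketch below
looks a tag up in a small stand-in table and, when there is no exact
match, shaves trailing subtags off until something matches. The table
C<%Demo_Name> and the function C<demo_name> are hypothetical and exist only
for this example; the module's own lookup also consults an alternate
form of the tag, which is left out here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for %I18N::LangTags::List::Name (a tiny subset).
my %Demo_Name = (
  'en'    => 'English',
  'en-ca' => 'Canadian English',
  'elx'   => 'Elamite',
);

# Hypothetical helper: return an English name for a tag, or undef.
sub demo_name {
  my $tag = lc($_[0] || return);     # tags are case-insensitive
  my $subform = '';
  my $name;
  while (length $tag) {
    last if $name = $Demo_Name{$tag};        # found, possibly after shaving
    # No match: move the last subtag from $tag onto $subform and retry.
    last unless $tag =~ s/(-[a-z0-9]+)$//s;
    $subform = "$1$subform";
  }
  return unless $name;               # failure
  return $name unless $subform;      # exact match
  $subform =~ s/^-//s;               # drop the leading hyphen
  return "$name (Subform \"$subform\")";
}

print demo_name('en-CA'), "\n";        # exact match after lowercasing
print demo_name('en-US-texas'), "\n";  # falls back to 'en' plus a subform
print defined demo_name('zz') ? "found\n" : "undef\n";
```

Callers that need a printable string can follow the pattern from the
synopsis and supply their own fallback with C<||>.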
=head1 ABOUT LANGUAGE TAGS
Internet language tags, as defined in RFC 3066, are a formalism
for denoting human languages. The two-letter ISO 639-1 language
codes are well known (as "en" for English), as are their forms
when qualified by a country code ("en-US"). Less well-known are the
arbitrary-length non-ISO codes (like "i-mingo"), and the
recently (in 2001) introduced three-letter ISO-639-2 codes.
Remember these important facts:
Language tags are not locale IDs. A locale ID is written with a "_"
instead of a "-", (almost?) always matches C<m/^\w\w_\w\w\b/>, and
I<means> something different than a language tag. A language tag
denotes a language. A locale ID denotes a language I<as used in>
a particular place, in combination with non-linguistic
location-specific information such as what currency is used
there. Locales I<also> often denote character set information.
Language tags are not for computer languages.
"Dialect" is not a useful term, since there is no objective
criterion for establishing when two language-forms are
dialects of each other, or are separate languages.
Language tags are not case-sensitive. en-US, en-us, En-Us, etc.,
are all the same tag, and denote the same language.
Not every language tag really refers to a single language. Some
language tags refer to conditions: i-default (system-message text
in English plus maybe other languages), und (undetermined
language). Others (notably lots of the three-letter codes) are
bibliographic tags that classify whole groups of languages, as
with cus for "Cushitic (Other)" (i.e., a
language that has been classed as Cushitic, but which has no more
specific code) or the even less linguistically coherent
sai for "South American Indian (Other)". Though useful in
bibliography, B<SUCH TAGS ARE NOT
FOR GENERAL USE>. For further guidance, email me.
Language tags are not country codes. In fact, they are often
distinct codes, as with language tag ja for Japanese, and
ISO 3166 country code C<.jp> for Japan.
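
Two of the points above lend themselves to a short sketch: telling a
locale ID apart from a language tag with the pattern quoted earlier, and
comparing tags without regard to case. The helper names
C<looks_like_locale_id> and C<same_tag> are made up for this example and
are not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: does this string look like a locale ID
# (e.g. "en_US") rather than a language tag (e.g. "en-US")?
sub looks_like_locale_id {
  my $id = shift;
  return $id =~ m/^\w\w_\w\w\b/ ? 1 : 0;
}

# Hypothetical helper: compare two language tags, ignoring case.
sub same_tag {
  my ($x, $y) = @_;
  return lc($x) eq lc($y) ? 1 : 0;
}

print looks_like_locale_id('en_US'), "\n";   # a locale ID
print looks_like_locale_id('en-US'), "\n";   # a language tag
print same_tag('en-US', 'En-us'), "\n";      # the same tag
```

The pattern is only a heuristic, in keeping with the "(almost?) always"
caveat above.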
=head1 LIST OF LANGUAGES

The first part of each item is the language tag, between {...}. It
is followed by an English name for the language or language-group.
Language tags that I judge to be not for general use are bracketed.
This list is in alphabetical order by English name of the language.
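
The layout described above can be picked apart with the same pattern the
module applies to its own POD. A small sketch on a few sample lines; the
C<@sample> array and C<%demo> hash are illustrative only.

```perl
use strict;
use warnings;

# Sample lines in the same layout as the list (illustrative only).
my @sample = (
  '=item [{afa} : Afro-Asiatic (Other)]',
  '=item {chn} : Chinook Jargon',
  '{en-ca} Canadian English;',
  'eq Anglo-Saxon. (Historical)',      # no {tag}, so it is skipped
);

my %demo;
for my $line (@sample) {
  # Capture the tag between braces and the name that follows it.
  my ($tag, $name) =
    $line =~ m/\{([-0-9a-zA-Z]+)\}(?:\s*:)?\s*([^\[\]]+)/
    or next;
  $name =~ s/\s*[;\.]*\s*$//g;   # drop trailing punctuation and spaces
  $demo{$tag} = $name;
}

print "$_ => $demo{$_}\n" for sort keys %demo;
```

Because the name stops at the first square bracket, bracketed entries
yield the same kind of name as unbracketed ones.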
The name in the =item line MUST NOT have E<...>'s in it!!
=item [{afa} : Afro-Asiatic (Other)]
=item [{alg} : Algonquian languages]
=item [{tut} : Altaic (Other)]
eq Amis. eq 'Amis. eq Pangca.
=item [{apa} : Apache languages]
Many forms are mutually un-intelligible in spoken media.
{ar-jo} Jordanian Arabic;
NOT Amharic! NOT Samaritan Aramaic!
=item [{art} : Artificial (Other)]
=item [{ath} : Athapascan languages]
eq Athabaskan. eq Athapaskan. eq Athabascan.
=item [{aus} : Australian languages]
=item [{map} : Austronesian (Other)]
=item [{bat} : Baltic (Other)]
=item [{bai} : Bamileke languages]
=item [{bnt} : Bantu (Other)]
=item {btk} : Batak (Indonesia)
eq Belarussian. eq Byelarussian.
eq Belorussian. eq Byelorussian.
eq White Russian. eq White Ruthenian.
=item [{ber} : Berber (Other)]
eq CatalE<aacute>n. eq Catalonian.
=item [{cau} : Caucasian (Other)]
=item [{cel} : Celtic (Other)]
{cel-gaulish} Gaulish (Historical)
=item [{cai} : Central American Indian (Other)]
=item [{cmc} : Chamic languages]
(Historical) NOT Chibchan (which is a language family).
Many forms are mutually un-intelligible in spoken media.
{zh-hk} Hong Kong Chinese;
{zh-sg} Singapore Chinese;
{zh-guoyu} Mandarin [Putonghua/Guoyu];
{zh-hakka} Hakka [formerly i-hakka];
{zh-min-nan} Southern Hokkien;
{i-hakka} Hakka (old tag)
=item {chn} : Chinook Jargon
=item {cu} : Church Slavic
eq Trukese. eq Chuuk. eq Truk. eq Ruk.
=item [{cpe} : English-based Creoles and pidgins (Other)]
=item [{cpf} : French-based Creoles and pidgins (Other)]
=item [{cpp} : Portuguese-based Creoles and pidgins (Other)]
=item [{crp} : Creoles and pidgins (Other)]
=item [{cus} : Cushitic (Other)]
=item {i-default} : Default (Fallthru) Language
Defined in RFC 2277, this is for tagging text
(which must include English text, and might/should include text
in other appropriate languages) that is emitted in a context
where language-negotiation wasn't possible -- in SMTP mail failure
messages, for example.
=item [{dra} : Dravidian (Other)]
eq Netherlander. Notable forms:
{nl-nl} Netherlands Dutch;
=item {dum} : Middle Dutch (ca.1050-1350)
=item {egy} : Ancient Egyptian
{en-au} Australian English;
{en-ca} Canadian English;
{en-jm} Jamaican English;
{en-nz} New Zealand English;
{en-ph} Philippine English;
{en-tt} Trinidad English;
{en-za} South African English;
{en-zw} Zimbabwe English.
=item {enm} : Middle English (1100-1500)
=item {ang} : Old English (ca.450-1100)
eq Anglo-Saxon. (Historical)
=item [{fiu} : Finno-Ugrian (Other)]
eq Finno-Ugric. NOT Ugaritic!
{fr-lu} Luxembourg French;
=item {frm} : Middle French (ca.1400-1600)
=item {fro} : Old French (842-ca.1400)
=item {gd} : Scots Gaelic
{de-li} Liechtenstein German;
{de-lu} Luxembourg German.
=item {gmh} : Middle High German (ca.1050-1500)
=item {goh} : Old High German (ca.750-1050)
=item [{gem} : Germanic (Other)]
=item {grc} : Ancient Greek
(Historical) (Until 15th century or so.)
=item {el} : Modern Greek
(Since 15th century or so.)
=item [{inc} : Indic (Other)]
=item [{ine} : Indo-European (Other)]
{in} Indonesian (old tag)
=item {ia} : Interlingua (International Auxiliary Language Association)
(Artificial) NOT Interlingue!
(Artificial) NOT Interlingua!
=item [{ira} : Iranian (Other)]
=item {mga} : Middle Irish (900-1200)
=item {sga} : Old Irish (to 900)
=item [{iro} : Iroquoian languages]
=item {jrb} : Judeo-Arabic
=item {jpr} : Judeo-Persian
eq Kanarese. NOT Canadian!
=item {kaa} : Kara-Kalpak
eq Cambodian. eq Kampuchean.
=item [{khi} : Khoisan (Other)]
=item {i-klingon} : Klingon
eq Judeo-Spanish. NOT Ladin (a minority language in Italy).
(Historical) NOT Ladin! NOT Ladino!
=item {lb} : Letzeburgesch
eq Luxemburgian, eq Luxemburger. (Formerly i-lux.)
{i-lux} Letzeburgesch (old tag)
eq Low Saxon. eq Low German.
=item {lub} : Luba-Katanga
=item {luo} : Luo (Kenya and Tanzania)
eq the modern Slavic language spoken in what was Yugoslavia.
NOT the form of Greek spoken in Greek Macedonia!
=item [{mno} : Manobo languages]
=item [{myn} : Mayan languages]
=item {min} : Minangkabau
eq the Iroquoian language West Virginia Seneca. NOT New York Seneca!
=item [{mis} : Miscellaneous languages]
=item [{mkh} : Mon-Khmer (Other)]
=item [{mul} : Multiple languages]
=item [{mun} : Munda languages]
eq Navaho. (Formerly i-navajo.)
{i-navajo} Navajo (old tag)
=item {nd} : North Ndebele
=item {nr} : South Ndebele
eq Nepalese. Notable forms:
=item [{nic} : Niger-Kordofanian (Other)]
=item [{ssa} : Nilo-Saharan (Other)]
=item [{nai} : North American Indian]
=item {se} : Northern Sami
eq Lappish. eq Lapp. eq (Northern) Saami.
Note the two following forms:
=item {nb} : Norwegian Bokmal
eq BokmE<aring>l. (A form of Norwegian.) (Formerly no-bok.)
{no-bok} Norwegian Bokmal (old tag)
=item {nn} : Norwegian Nynorsk
(A form of Norwegian.) (Formerly no-nyn.)
{no-nyn} Norwegian Nynorsk (old tag)
=item [{nub} : Nubian languages]
=item {oc} : Occitan (post 1500)
eq ProvenE<ccedil>al, eq Provencal
=item {os} : Ossetian; Ossetic
=item [{oto} : Otomian languages]
Group of languages collectively called "OtomE<iacute>".
=item [{paa} : Papuan (Other)]
=item {peo} : Old Persian (ca.600-400 B.C.)
=item [{phi} : Philippine (Other)]
eq Portugese. Notable forms:
{pt-pt} Portugal Portuguese;
{pt-br} Brazilian Portuguese.
=item [{pra} : Prakrit languages]
=item {pro} : Old Provencal (to 1500)
eq Old ProvenE<ccedil>al. (Historical.)
=item {rm} : Raeto-Romance
=item [{qaa - qtz} : Reserved for local use.]
=item [{roa} : Romance (Other)]
NOT Romanian! NOT Romany! NOT Romansh!
NOT White Russian! NOT Rusyn!
=item [{sal} : Salishan languages]
=item {sam} : Samaritan Aramaic
=item [{smi} : Sami languages (Other)]
=item [{sem} : Semitic (Other)]
=item {sgn-...} : Sign Languages
Always use with a subtag. Notable forms:
{sgn-gb} British Sign Language (BSL);
{sgn-ie} Irish Sign Language (ISL);
{sgn-ni} Nicaraguan Sign Language (ISN);
{sgn-us} American Sign Language (ASL).
eq Blackfoot. eq Pikanii.
=item [{sit} : Sino-Tibetan (Other)]
=item [{sio} : Siouan languages]
=item {den} : Slave (Athapascan)
=item [{sla} : Slavic (Other)]
=item {wen} : Sorbian languages
eq Wendish. eq Sorb. eq Lusatian. eq Wend. NOT Venda! NOT Serbian!
=item {nso} : Northern Sotho
=item {st} : Southern Sotho
=item [{sai} : South American Indian (Other)]
{es-ar} Argentine Spanish;
{es-bo} Bolivian Spanish;
{es-co} Colombian Spanish;
{es-do} Dominican Spanish;
{es-ec} Ecuadorian Spanish;
{es-gt} Guatemalan Spanish;
{es-hn} Honduran Spanish;
{es-pa} Panamanian Spanish;
{es-pe} Peruvian Spanish;
{es-pr} Puerto Rican Spanish;
{es-py} Paraguay Spanish;
{es-sv} Salvadoran Spanish;
{es-uy} Uruguayan Spanish;
{es-ve} Venezuelan Spanish.
=item [{tai} : Tai (Other)]
=item {tog} : Tonga (Nyasa)
=item {to} : Tonga (Tonga Islands)
(Pronounced "Tong-a", not "Tong-ga")
(Typically in Roman script)
=item {ota} : Ottoman Turkish (1500-1928)
(Typically in Arabic script) (Historical)
=item {und} : Undetermined
Not a tag for normal use.
NOT Wendish! NOT Wend! NOT Avestan!
eq VolapE<uuml>k. (Artificial)
=item [{wak} : Wakashan languages]
Presumably the Philippine language Waray-Waray (SamareE<ntilde>o),
not the smaller Philippine language Waray Sorsogon, nor the extinct
Australian language Waray.
=item {x-...} : Unregistered (Semi-Private Use)
"x-" is a prefix for language tags that are not registered with ISO
or IANA. Example: x-double-dutch.
Formerly "ji". Sometimes in Roman script, sometimes in Hebrew script.
=item [{ypk} : Yupik languages]
Several "Eskimo" languages.
=head1 SEE ALSO

L<I18N::LangTags|I18N::LangTags> and its "See Also" section.

=head1 COPYRIGHT AND DISCLAIMER
Copyright (c) 2001,2002 Sean M. Burke. All rights reserved.
You can redistribute and/or
modify this document under the same terms as Perl itself.
This document is provided in the hope that it will be
useful, but without any warranty;
without even the implied warranty of accuracy, authoritativeness,
completeness, merchantability, or fitness for a particular purpose.
Email any corrections or questions to me.

=head1 AUTHOR

Sean M. Burke, sburkeE<64>cpan.org

# To generate a list of just the two and three-letter codes:
require 5; # Time-stamp: "2001-03-13 21:53:39 MST"
# Sean M. Burke, sburke@cpan.org
# This program is for generating the language_codes.txt file
use LWP::Simple; # provides get()
use HTML::TreeBuilder 3.10;
my $root = HTML::TreeBuilder->new();
my $url = 'http://lcweb.loc.gov/standards/iso639-2/bibcodes.html';
$root->parse(get($url) || die "Can't get $url");
foreach my $tr ($root->find_by_tag_name('tr')) {
my @f = map $_->as_text(), $tr->content_list();
#print map("<$_> ", @f), "\n";
pop @f; # nix the French name
next if $f[-1] eq 'Language Name (English)'; # it's a header line
my $xx = splice(@f, 2,1); # pull out the two-letter code
if($xx =~ m/[a-zA-Z]/) { # there's a two-letter code for it
push @codes, [ lc($f[-1]), "$xx\t$f[-1]\n" ];
} else { # print the three-letter codes.
if($f[0] eq $f[1]) { # assumed guard: the two three-letter columns agree
push @codes, [ lc($f[-1]), "$f[1]\t$f[2]\n" ];
} else { # shouldn't happen
push @codes, [ lc($f[-1]), "@f !!!!!!!!!!\n" ];
}
}
}
print map $_->[1], sort {; $a->[0] cmp $b->[0] } @codes;
print "[ based on $url\n at ", scalar(localtime), "]\n",
"[Note: doesn't include IANA-registered codes.]\n";