TEIhub
Discover TEI-encoded documents from GitHub public repositories.

Languages

In an effort to make it easier to locate TEI-encoded texts in a particular language, I have recorded the ISO 639 language codes for each repository where languages were clearly specified in TEI files.

Each time a repository appears in a batch of 1000 GitHub search results, I download the first .xml file for that repository in that batch and check whether a language is specified in any of the following ways, in order:

  1. <langUsage><language id(ent)="xx(x)">...</language></langUsage>
  2. <text xml:lang="xx(x)">
  3. <body xml:lang="xx(x)">
  4. <textLang mainLang="xx(x)">. While this technically applies to the language of the manuscript, not necessarily of the TEI file, it was often used in files where only a facsimile was provided in the body.
  5. <TEI xml:lang="xx(x)">, but only where the language code is not en, as I found this did not reliably match the language used in the rest of the file.
  6. xml:lang on any element within <body>, excluding <foreign> elements
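
The ordered checks above might be sketched as follows. This is a minimal illustration using Python's standard library, not the code used to build this site. The function name detect_languages is hypothetical, and it assumes that the first rule to yield any code wins, that the "not en" test in rule 5 is an exact match, and that only the <foreign> element itself (not its descendants) is skipped in rule 6.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local(tag):
    """Return an element name without its namespace prefix."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def find_all(root, name):
    """All descendants (and root itself) whose local name matches."""
    return [el for el in root.iter() if local(el.tag) == name]


def detect_languages(xml_text):
    """Apply the six checks in order; assumes the first rule that matches wins."""
    root = ET.fromstring(xml_text)

    # 1. <langUsage><language ident="..."> (or id="..." in older files)
    codes = []
    for usage in find_all(root, "langUsage"):
        for lang in find_all(usage, "language"):
            code = lang.get("ident") or lang.get("id")
            if code:
                codes.append(code)
    if codes:
        return codes

    # 2 and 3. xml:lang on <text>, then on <body>
    for name in ("text", "body"):
        codes = [el.get(XML_LANG) for el in find_all(root, name) if el.get(XML_LANG)]
        if codes:
            return codes

    # 4. <textLang mainLang="...">
    codes = [el.get("mainLang") for el in find_all(root, "textLang") if el.get("mainLang")]
    if codes:
        return codes

    # 5. xml:lang on the root <TEI>, ignored when it is exactly "en"
    if local(root.tag) == "TEI":
        code = root.get(XML_LANG)
        if code and code != "en":
            return [code]

    # 6. xml:lang on any element inside <body>, skipping <foreign>
    codes = []
    for body in find_all(root, "body"):
        for el in body.iter():
            if el is body or local(el.tag) == "foreign":
                continue
            code = el.get(XML_LANG)
            if code and code not in codes:
                codes.append(code)
    return codes
```

Matching on local names rather than on a fixed namespace is a deliberate choice here, so that files without the TEI namespace are handled in the same way as namespaced ones.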

Based on a sample of 500 repositories, approximately 40% declare a language in one of these ways. If the same repository appears in a different batch of search results, I apply the same algorithm to the first .xml file found in that search, and append any new languages to the end of the list. Repositories, files, and languages are not currently removed from this database, even if the underlying files are changed or removed from GitHub.
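
The append-only behaviour described here can be illustrated with a short sketch. The function name merge_languages is hypothetical, and the sketch assumes each repository's languages are kept as an ordered list of codes.

```python
def merge_languages(existing, found):
    """Append codes from a later batch, keeping the original order.

    Nothing is ever removed, mirroring the append-only behaviour
    described in the text.
    """
    merged = list(existing)
    for code in found:
        if code not in merged:
            merged.append(code)
    return merged
```

For example, merging ["lat", "grc"] with a later result of ["grc", "eng"] gives ["lat", "grc", "eng"].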

The TEI Guidelines recommend the use of BCP 47 language tags. In their fullest form, such tags allow you to encode language, region, script, and certain other information. For simplicity, I have recorded only the language subtag and discarded the other information. BCP 47 prefers the shortest available form of ISO 639 language codes for the language subtag (e.g. de instead of deu for German); however, this means some languages have a two-character code and others a three-character code. For consistency, and to avoid privileging “more common” languages, I have normalised all two-character language codes to their three-character ISO 639-2 equivalent, choosing the Terminology variant where applicable (e.g. de becomes deu, not ger).
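
A minimal sketch of this normalisation, assuming a lookup table from two-character codes to their ISO 639-2 Terminology equivalents. The table below is deliberately partial and for illustration only, the function name normalise is hypothetical, and leaving unknown two-character codes unchanged is an assumption.

```python
# Partial, illustrative map from ISO 639-1 to ISO 639-2 (Terminology) codes.
# A real table would cover every two-character code.
ISO_639_1_TO_639_2T = {
    "cs": "ces",
    "cy": "cym",
    "de": "deu",
    "el": "ell",
    "en": "eng",
    "fr": "fra",
    "it": "ita",
    "la": "lat",
    "nl": "nld",
}


def normalise(tag):
    """Keep only the language subtag of a BCP 47 tag and widen it to three letters."""
    # The language subtag is the first hyphen-separated part of the tag.
    primary = tag.strip().split("-")[0].lower()
    if len(primary) == 2:
        # Unknown two-character codes are passed through unchanged (an assumption).
        return ISO_639_1_TO_639_2T.get(primary, primary)
    return primary
```

With this table, normalise("de-CH") and normalise("de") both give "deu", while a three-character tag such as "grc" passes through unchanged.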

I have not provided filtering of the search results by language, as this becomes too complicated to implement with a large data set on a static website. Data files containing all the results are provided on the home page in CSV and JSON formats if you need to do more advanced filtering.
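
As a rough idea of what such filtering could look like on the CSV download, here is a short sketch. The column names "repository" and "languages", the ";" separator between codes, and the function name repos_with_language are all assumptions for illustration; check the actual data file for its real layout.

```python
import csv
import io


def repos_with_language(csv_text, code, repo_col="repository",
                        lang_col="languages", sep=";"):
    """Return repositories whose language list contains the given code.

    The column names and separator are hypothetical defaults, not the
    documented layout of the real data file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[repo_col]
        for row in reader
        if code in row[lang_col].split(sep)
    ]


# Made-up sample data in the assumed layout.
sample = (
    "repository,languages\n"
    "example/one,lat;grc\n"
    "example/two,eng\n"
    "example/three,grc\n"
)
matches = repos_with_language(sample, "grc")
```

Splitting the language field before comparing avoids partial matches, so a search for one code does not pick up a longer string that merely contains it.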

Language Code Language Name Matching Repositories
abk Аҧсуа 1
abq 1
ady Aдыгэбзэ 1
aeb تونسي 2
aer 1
afr Afrikaans 14
ags 1
agx 1
aii ܣܘܪܝܬ 1
ajp اللهجة الشامية الجنوبية 3
ajw Ajawa 1
aka Akana 1
akk Akkadian 12
alg Algonquian languages 1
all 1
alt Southern Altai 1
amh አማርኛ 16
ang English, Old (ca.450-1100) 18
ani 1
apa Ndéé 1
apc اللهجة الشامي الشمال 1
aqc 1
ara العربية 200
arb العربية الفصحى 3
arc Official Aramaic (700-300 BCE); Imperial Aramaic (700-300 BCE) 13
arg Aragonés 1
arn Mapudungun; Mapuche 2
ars 1
art Artificial languages 1
ary الدارجة المغربية 1
arz مصرى 3
asm অসমীয়া 1
ava Авар 13
ave Avestan 2
awa Awadhi 1
aze Azərbaycanca / آذربايجان 1
bag 1
bak Башҡорт 2
bar Boarisch, Bairisch 7
bat Baltic languages 1
bbc 2
bbl 1
bel Беларуская 3
ben বাংলা 6
ber Tamaziɣt 6
bik Bikol 3
bis Bislama 1
bnt Bantu languages 2
bod དབུས་སྐད་ 20
bos Босански / Bosanski 1
bra Braj 1
bre Brezhoneg 6
brh براہوئی 2
brx बर' 1
bsk 1
btd 1
bua Buriat 1
bul Български 25
bya Palawan Batak 1
cai Central American Indian languages 1
cat Català 16
cau Caucasian languages 1
ccs 1
ceb Bisaya / Sinugbuanon 5
cel Celtic languages 3
ces Český 48
cha Chamoru 1
che Нохчийн 2
chg Chagatai 1
chm Mari 1
chr Cherokee 1
chu словѣньскъ / slověnĭskŭ 35
chv Чăваш 1
ckb 1
cmn 官話 / 官话 3
cop Coptic 48
cor Kernewek 2
cos Corsu 2
cpe Creoles and pidgins, English based 1
cym Cymraeg 35
da 1
dak Dakȟótiyapi 1
dan Dansk 67
dar Дарган Mез 1
ddo 1
del Delaware 2
des 3
deu Deutsch 760
dum Dutch, Middle (ca.1050-1350) 4
dzo ཇོང་ཁ 2
egy Egyptian (Ancient) 10
ell Ελληνικά 147
elx Elamite 1
eng English 1748
enm English, Middle (1100-1500) 13
enn 1
epo Esperanto 6
ess 1
est Eesti 14
esx Eskaleut 1
ett 1
eus Euskara 5
ewe Ɛʋɛ 1
fas فارسی 41
fij Na Vosa Vakaviti 1
fil Filipino; Pilipino 1
fin Suomi 20
fra Français 1207
frm French, Middle (ca.1400-1600) 26
fro French, Old (842-ca.1400) 65
fry Frysk 2
gad Gaddang 1
gae 1
gai 1
gbz 1
gcr 1
gda 5
gem Germanic languages 2
gez Geez 34
gig 2
gla Gàidhlig 7
gle Gaeilge 12
glg Galego 9
glv Gaelg 2
gmh German, Middle High (ca.1050-1500) 16
gml German, Middle Low (ca.1200-1650) 4
goh German, Old High (ca.750-1050) 4
got Gothic 2
grc Greek, Ancient (to 1453) 397
grk Hellenic languages 10
gsw Schwyzerdütsch 5
gug avañeʼẽ 1
guj ગુજરાતી 3
hac 1
hau هَوُسَ 5
haw ʻŌlelo Hawaiʻi 3
hbo Ancient Hebrew 3
heb עברית 151
hil Hiligaynon 2
hin हिन्दी 20
hit Hittite 1
hrv Hrvatski 30
hun Magyar 40
hye Հայերեն 22
ibe Abesabesi 2
ido Ido 2
iir 1
ile Interlingue 1
ilo Iloko 2
inc Indic languages 2
ind Bahasa Indonesia 10
ine Indo-European languages 3
ing 1
inh ГӀалгӀай мотт 1
inm Minaean 1
ira Iranian languages 1
isk 1
isl Íslenska 31
iso 1
ita Italiano 607
itl 1
jav Basa Jawa 1
jpa Jewish Palestinian Aramaic 1
jpn 日本語 58
jpr Judeo-Persian 4
jrb Judeo-Arabic 7
kan ಕನ್ನಡ 6
kap 1
kar Karen languages 1
kat ქართული 33
kau Kanuri 6
kaw Kawi 4
kaz Қазақ Tілі 1
kbd Kъэбэрдеибзэ 1
kca 1
kdr 1
khm ភាសាខ្មែរ 1
kho Khotanese; Sakan 1
khw 1
kir Kırgızca / Кыргызча 1
kjj 1
kom Коми 1
kor 조선말 / 한국어 37
krc Karachay-Balkar 1
krl Karelian 1
kum Къумукъ Tил 1
kur Kurdí, کوردی, or K’öрди 2
lad Ladino 3
lao ລາວ / Pha xa lao 2
lat Latina 1141
lav Latviešu 11
lbe 1
ldd 1
lea Shabunda Lega 2
lez Lezghian 1
lik 1
lin Lingála 1
lit Lietuvių 35
lng Lombardic 4
loz Lozi 2
lub Luba-Katanga 1
lun Lunda 2
lzh 古文 / 文言 1
lzz 1
mag मगही 1
mal മലയാളം 4
map Austronesian languages 1
mar मराठी 4
mck Mbúùnda / Chimbúùnda 2
mga Irish, Middle (900-1200) 1
mhd 1
mig San Miguel El Grande Mixtec 3
mis Uncoded languages 2
mix Mixtepec Mixtec 9
miy Ayutla Mixtec 3
miz Coatzospan Mixtec 3
mkd Македонски 8
mlg Malagasy 4
mlt Malti 2
mnc ᠮᠠᠨᠵᡠ ᡤᡳᠰᡠᠨ 4
mnj 1
mns 1
moh Mohawk 1
mon Монгол Хэл / ᠮᠣᠨᠭᠭᠣᠯ ᠬᠡᠯᠡ 7
mri Māori 5
msa Bahasa Melayu 10
mul Multiple languages 18
mwr Marwari 1
mya မြန်မာစာ / မြန်မာစကား 6
myn Mayan languages 1
nah Nahuatl languages 4
nai North American Indian languages 5
nau Dorerin Naoero 4
nbo 1
ndl 1
nds Plattdüütsch 6
nep नेपाली 6
new नेपाल भाषा 2
nil 1
nld Nederlands 222
nno Norsk (nynorsk) 1
nob Bokmål, Norwegian; Norwegian Bokmål 5
nog Nogai 1
non Norse, Old 27
nor Norsk (bokmål / riksmål) 27
nxq Naqxi geezheeq 2
nym Nyamwezi 2
nzz Naŋa tegu 1
oar Old Aramaic 1
oci Occitan 7