TEIhub
Discover TEI-encoded documents from GitHub public repositories.

Languages

To make it easier to locate TEI-encoded texts in a particular language, I have recorded the ISO 639 language codes for each repository whose TEI files clearly specify a language.

Each time a repository appears in a batch of 1000 GitHub search results, I download the first .xml file listed for that repository in the current batch and check whether a language is specified in any of the following ways, in order:

  1. <langUsage><language id(ent)="xx(x)">...</language></langUsage>
  2. <text xml:lang="xx(x)">
  3. <body xml:lang="xx(x)">
  4. <textLang mainLang="xx(x)">. Technically this describes the language of the manuscript, not necessarily of the TEI file, but it was often used in files where the body contained only a facsimile.
  5. <TEI xml:lang="xx(x)">, but only where the language code is not en, as I found this did not reliably match the language used in the rest of the file.
  6. xml:lang on any element within body, excluding <foreign> tags
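
The checks above could be sketched in Python roughly as follows. This is a minimal illustration, not the code used to build this site: the function names are hypothetical, and it assumes that the first check to yield any code wins, that namespaces can be ignored when matching element names, and that "excluding <foreign> tags" means skipping only the <foreign> elements themselves.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local(tag):
    """Return a tag name without its namespace part."""
    return tag.rsplit("}", 1)[-1]


def find_all(root, name):
    """All elements under root (inclusive) whose local name matches."""
    return [el for el in root.iter() if local(el.tag) == name]


def unique(values):
    """Drop empty values and duplicates, keeping first-seen order."""
    return list(dict.fromkeys(v for v in values if v))


def detect_languages(xml_text):
    """Hypothetical helper: apply the six checks in order (assumed first match wins)."""
    root = ET.fromstring(xml_text)

    # 1. <langUsage><language ident="..."> (or the older id="...")
    found = unique(
        lang.get("ident") or lang.get("id")
        for usage in find_all(root, "langUsage")
        for lang in find_all(usage, "language")
    )
    if found:
        return found

    # 2 and 3. xml:lang on <text>, then on <body>
    for name in ("text", "body"):
        found = unique(el.get(XML_LANG) for el in find_all(root, name))
        if found:
            return found

    # 4. <textLang mainLang="...">
    found = unique(el.get("mainLang") for el in find_all(root, "textLang"))
    if found:
        return found

    # 5. xml:lang on the root <TEI>, ignored when it is exactly "en"
    root_lang = root.get(XML_LANG)
    if local(root.tag) == "TEI" and root_lang and root_lang != "en":
        return [root_lang]

    # 6. xml:lang on anything inside <body>, skipping <foreign> elements
    return unique(
        el.get(XML_LANG)
        for body in find_all(root, "body")
        for el in body.iter()
        if local(el.tag) != "foreign"
    )
```

The values returned here are raw attribute values; any trimming to the language subtag would happen in a separate step.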

Based on a sample of 500 repositories, approximately 40% declare a language in one of these ways. If the same repository appears in a different batch of search results, I apply the same algorithm to the first .xml file found in that search, and append any new languages to the end of that repository's list. Repositories, files, and languages are not currently removed from this database, even if the underlying files are changed or removed from GitHub.
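
The append-only behaviour described above might look something like this in Python. The dictionary layout and the name merge_languages are assumptions made for illustration only.

```python
def merge_languages(database, repository, detected):
    """Append newly detected codes to a repository's list, never removing any.

    `database` is a hypothetical dict mapping repository name to a list of codes.
    """
    known = database.setdefault(repository, [])
    for code in detected:
        if code not in known:
            known.append(code)  # new languages go at the end of the list
    return known


database = {}
merge_languages(database, "example/repo", ["lat", "eng"])
merge_languages(database, "example/repo", ["eng", "grc"])
print(database["example/repo"])  # ['lat', 'eng', 'grc']
```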

The TEI Guidelines recommend the use of BCP 47 language tags. In their fullest form such tags can encode language, region, script, and certain other information. For simplicity I have recorded only the language subtag and discarded the rest. BCP 47 prefers the shortest available ISO 639 code for the language subtag (e.g. de instead of deu for German); however, this means some languages have a two-character code and others a three-character code. For consistency, and to avoid privileging “more common” languages, I have normalised all two-character language codes to their three-character ISO 639-2 equivalents, choosing the Terminology (T) code where a Bibliographic (B) alternative also exists (e.g. de becomes deu, not ger).
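
As a rough sketch of this normalisation, assuming a lookup table from two-character to three-character Terminology codes (only a small illustrative subset is shown, and the name normalise_language is hypothetical):

```python
# Illustrative subset only; a full table would cover every ISO 639-1 code.
ISO_639_1_TO_2T = {
    "de": "deu",  # Terminology code, not the Bibliographic "ger"
    "en": "eng",
    "fr": "fra",
    "el": "ell",
    "nl": "nld",
    "zh": "zho",
}


def normalise_language(tag):
    """Reduce a BCP 47 tag to its language subtag, in three-character form where known."""
    tag = tag.strip().lower()
    if tag.startswith("x-"):
        # Assumption: private-use tags are kept whole, as they appear in the table below.
        return tag
    primary = tag.split("-", 1)[0]  # drop script, region, and other subtags
    if len(primary) == 2:
        # Two-character codes missing from the table are left unchanged.
        return ISO_639_1_TO_2T.get(primary, primary)
    return primary


print(normalise_language("de"))          # deu
print(normalise_language("zh-Hant-TW"))  # zho
print(normalise_language("grc"))         # grc
```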

I have not provided filtering of the search results by language, as this becomes too complicated to implement with a large data set on a static website. CSV and JSON data files containing all the results are provided on the home page if you need to do more advanced filtering.
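
For example, filtering a downloaded CSV file by language could be done with a few lines of Python. The column names used here (repository, languages) and the space-separated layout of the languages column are assumptions for illustration; check the header row of the real data file and adjust accordingly.

```python
import csv
import io


def repositories_for_language(csv_file, code):
    """Return repositories whose (assumed space-separated) language list contains code."""
    reader = csv.DictReader(csv_file)
    return [
        row["repository"]
        for row in reader
        if code in row["languages"].split()
    ]


# Made-up sample data standing in for the real download.
sample = io.StringIO(
    "repository,languages\n"
    "example/alpha,eng lat\n"
    "example/beta,fra\n"
)
print(repositories_for_language(sample, "lat"))  # ['example/alpha']
```

With a real file, an open file object would be passed in place of the in-memory sample.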

Language Code Language Name Matching Repositories
eng English 1748
fra Français 1207
lat Latina 1141
deu Deutsch 760
ita Italiano 607
grc Greek, Ancient (to 1453) 397
spa Español 333
nld Nederlands 222
ara العربية 200
heb עברית 151
ell Ελληνικά 147
rus Русский 129
san संस्कृतम् 118
slv Slovenščina 100
por Português 87
zho 中文 86
pol Polski 82
dan Dansk 67
fro French, Old (842-ca.1400) 65
jpn 日本語 58
tur Türkçe 51
syr Syriac 51
ces Český 48
cop Coptic 48
swe Svenska 44
fas فارسی 41
srp Српски 40
hun Magyar 40
und Undetermined 40
kor 조선말 / 한국어 37
lit Lietuvių 35
cym Cymraeg 35
chu словѣньскъ / slověnĭskŭ 35
gez Geez 34
kat ქართული 33
pli Pāli / पाऴि 31
isl Íslenska 31
hrv Hrvatski 30
ron Română 28
nor Norsk (bokmål / riksmål) 27
non Norse, Old 27
frm French, Middle (ca.1400-1600) 26
bul Български 25
sqi Shqip 23
hye Հայերեն 22
bod དབུས་སྐད་ 20
fin Suomi 20
hin हिन्दी 20
ukr Українська 20
x-lap Igpay Atinlay 19
mul Multiple languages 18
ang English, Old (ca.450-1100) 18
slk Slovenčina 17
gmh German, Middle High (ca.1050-1500) 16
cat Català 16
amh አማርኛ 16
tam தமிழ் 16
ota Turkish, Ottoman (1500-1928) 15
est Eesti 14
afr Afrikaans 14
sog Sogdian 13
arc Official Aramaic (700-300 BCE); Imperial Aramaic (700-300 BCE) 13
enm English, Middle (1100-1500) 13
ava Авар 13
gle Gaeilge 12
akk Akkadian 12
tgl Tagalog 11
sh 11
lav Latviešu 11
msa Bahasa Melayu 10
phn Phoenician 10
sux Sumerian 10
grk Hellenic languages 10
egy Egyptian (Ancient) 10
vie Tiếng Việt 10
ind Bahasa Indonesia 10
urd اردو 9
glg Galego 9
mix Mixtepec Mixtec 9
pra Prakrit languages 9
xno Anglo-Normaund 9
pie 9
quc Qatzijobʼal 8
srd Sardu 8
mkd Македонски 8
swa Swahili languages 8
oci Occitan 7
x-unknown 7
gla Gàidhlig 7
peo Persian, Old (ca.600-400 B.C.) 7
mon Монгол Хэл / ᠮᠣᠨᠭᠭᠣᠯ ᠬᠡᠯᠡ 7
bar Boarisch, Bairisch 7
sme Sámegiella 7
uig Uyƣurqə / ئۇيغۇرچە 7
tel తెలుగు 7
jrb Judeo-Arabic 7
yid ייִדיש 7
mya မြန်မာစာ / မြန်မာစကား 6
rom Romany 6
kau Kanuri 6
kan ಕನ್ನಡ 6
syc Classical Syriac 6
bre Brezhoneg 6
pro Provençal, Old (to 1500); Occitan, Old (to 1500) 6
epo Esperanto 6
ber Tamaziɣt 6
nds Plattdüütsch 6
unk 6
ben বাংলা 6
nep नेपाली 6
x-verlan 6
gda 5
som اللغة الصومالية 5
zxx No linguistic content; Not applicable 5
hau هَوُسَ 5
eus Euskara 5
pan ਪੰਜਾਬੀ / पंजाबी / پنجابي 5
xcl գրաբար 5
gsw Schwyzerdütsch 5
tir ትግርኛ 5
tha ภาษาไทย 5
osc Oscan 5
ceb Bisaya / Sinugbuanon 5
mri Māori 5
nob Bokmål, Norwegian; Norwegian Bokmål 5
sco Scots 5
nai North American Indian languages 5
mlg Malagasy 4
gml German, Middle Low (ca.1200-1650) 4
jpr Judeo-Persian 4
goh German, Old High (ca.750-1050) 4
pus پښتو 4
lng Lombardic 4
mnc ᠮᠠᠨᠵᡠ ᡤᡳᠰᡠᠨ 4
kaw Kawi 4
mal മലയാളം 4
tsn Setswana 4
nau Dorerin Naoero 4
dum Dutch, Middle (ca.1050-1350) 4
mar मराठी 4
pag Pangasinan 4
wln Walon 4
sin සිංහල 4
nah Nahuatl languages 4
scx Sicel 4
tat Tatarça 4
pal Pahlavi 4
arz مصرى 3
spn 3
cmn 官話 / 官话 3
sav 3
oss Иронау 3
ine Indo-European languages 3
guj ગુજરાતી 3
ajp اللهجة الشامية الجنوبية 3
hbo Ancient Hebrew 3
lad Ladino 3
bik Bikol 3
haw ʻŌlelo Hawaiʻi 3
bel Беларуская 3
arb العربية الفصحى 3
mig San Miguel El Grande Mixtec 3
sot Sesotho 3
miy Ayutla Mixtec 3
tah Reo Mā`ohi 3
miz Coatzospan Mixtec 3
smd 3
xpu Punic 3
xly Elymian 3
xda 3
sxc Sicanian 3
pka 𑀅𑀭𑁆𑀥𑀫𑀸𑀕𑀥𑀻 3
cel Celtic languages 3
swh Kiswahili 3
des 3
arn Mapudungun; Mapuche 2
inc Indic languages 2
gig 2
xml 2
cor Kernewek 2
x-oldcam Old Cam 2
fry Frysk 2
txb 2
tso Xitsonga 2
sla Slavic languages 2
got Gothic 2
bbc 2
bak Башҡорт 2
glv Gaelg 2
roh Rumantsch 2
ilo Iloko 2
ido Ido 2
ave Avestan 2
kur Kurdí, کوردی, or K’öрди 2
cos Corsu 2
pam Pampanga; Kapampangan 2
ton Lea Faka-Tonga 2
xct Classical Tibetan 2
new नेपाल भाषा 2
gem Germanic languages 2
hil Hiligaynon 2
osx Sahsisk 2
mis Uncoded languages 2
pit 2
ibe Abesabesi 2
che Нохчийн 2
smi Sami languages 2
mlt Malti 2
x-grc 2
del Delaware 2
ofs Frysk 2
aeb تونسي 2
nym Nyamwezi 2
lea Shabunda Lega 2
bnt Bantu languages 2
loz Lozi 2
lun Lunda 2
mck Mbúùnda / Chimbúùnda 2
toi iSitonga 2
x-sa 2
dzo ཇོང་ཁ 2
lao ລາວ / Pha xa lao 2
nxq Naqxi geezheeq 2
brh براہوئی 2
bua Buriat 1
iir 1
ett 1
zkz 1
xld 1
trk 1
tgk Тоҷикӣ 1
paq 1
apc اللهجة الشامي الشمال 1
xsc 1
aer 1
x-dardic 1
roa Romance languages 1
sva 1
pau Palauan 1
inm Minaean 1
x-vicav 1
zza Zaza; Dimili; Dimli; Kirdki; Kirmanjki; Zazaki 1
agx 1
krc Karachay-Balkar 1
nog Nogai 1
kum Къумукъ Tил 1
dar Дарган Mез 1
cai Central American Indian languages 1
aqc 1
tkr 1