TEIhub
Discover TEI-encoded documents from GitHub public repositories.

Languages

In an effort to make it easier to locate TEI-encoded texts in a particular language, I have recorded the ISO 639 language codes for each repository where languages were clearly specified in TEI files.

Each time a repository appears in a batch of 1000 GitHub search results, I download the first .xml file for that repository in the batch and check whether a language is specified in any of the following ways, in order:

  1. <langUsage><language id(ent)="xx(x)">...</language></langUsage>
  2. <text xml:lang="xx(x)">
  3. <body xml:lang="xx(x)">
  4. <textLang mainLang="xx(x)">. While this technically applies to the language of the manuscript, not necessarily of the TEI file, it was often used in files where only a facsimile was provided in the body.
  5. <TEI xml:lang="xx(x)">, but only where the language code is not en, as I found this did not reliably match the language used in the rest of the file.
  6. xml:lang on any element within body, excluding <foreign> tags
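
The ordered checks above can be sketched roughly as follows. This is an illustration only, not the code behind this site. It assumes that the first check that yields any language wins, that element names are matched regardless of namespace, and that check 6 skips only the <foreign> element itself. The function and helper names (detect_languages, local, unique) are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local(tag):
    """Return an element name without its namespace prefix."""
    return tag.rsplit("}", 1)[-1]


def unique(codes):
    """Drop empty values and duplicates, keeping first-seen order."""
    return list(dict.fromkeys(c for c in codes if c))


def detect_languages(xml_text):
    """Return the language codes found by the first check that matches."""
    root = ET.fromstring(xml_text)
    elements = list(root.iter())

    def named(name):
        return [el for el in elements if local(el.tag) == name]

    # 1. <langUsage><language ident="..."> (or the older id="...")
    found = unique(
        lang.get("ident") or lang.get("id")
        for usage in named("langUsage")
        for lang in usage
        if local(lang.tag) == "language"
    )
    if found:
        return found

    # 2 and 3. xml:lang on <text>, then on <body>
    for name in ("text", "body"):
        found = unique(el.get(XML_LANG) for el in named(name))
        if found:
            return found

    # 4. <textLang mainLang="...">
    found = unique(el.get("mainLang") for el in named("textLang"))
    if found:
        return found

    # 5. xml:lang on the root <TEI>, ignored when it is "en"
    root_lang = root.get(XML_LANG)
    if local(root.tag) == "TEI" and root_lang and root_lang != "en":
        return [root_lang]

    # 6. xml:lang on any element inside <body> except <foreign> itself
    return unique(
        el.get(XML_LANG)
        for body in named("body")
        for el in body.iter()
        if local(el.tag) != "foreign"
    )
```

An empty list means no language was detected by any of the six checks.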

Based on a sample of 500 repositories, approximately 40% declare a language in one of these ways. If the same repository appears in a different batch of search results, I apply the same algorithm to the first .xml file found in that search, and append any new languages to the end of the list. Repositories/files/languages are not currently removed from this database even if the underlying files are changed or removed from GitHub.
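
Appending newly found languages to a repository's existing list, without duplicates and without reordering earlier entries, might look like the sketch below. The function name merge_languages is hypothetical, and skipping codes that are already recorded is my reading of "new languages".

```python
def merge_languages(existing, detected):
    """Append codes from `detected` that are not already in `existing`."""
    merged = list(existing)  # copy, so the stored list is not mutated
    for code in detected:
        if code not in merged:
            merged.append(code)
    return merged
```

For example, merging ["grc", "deu"] into an existing ["lat", "grc"] would leave "lat" and "grc" in place and add "deu" at the end.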

The TEI Guidelines recommend the use of BCP 47 language tags. In their fullest form, such tags allow you to encode language, region, script, and certain other information. For simplicity I have only recorded the language subtag and discarded the other information. BCP 47 prefers the shortest available form of ISO 639 language codes for the language subtag (e.g. de instead of deu for German); however, this means that some languages have a two-character code and others a three-character code. For consistency, and to avoid privileging “more common” languages, I have normalised all two-character language codes to their three-character ISO 639-2 equivalent, choosing the Terminology variant where applicable (e.g. de becomes deu, not ger).
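
A minimal sketch of this normalisation, for illustration only. The mapping table is a small hand-written excerpt rather than the full ISO 639-1 to ISO 639-2/T list, and the function name normalise is hypothetical. Keeping private-use tags (those starting with x-) whole, and leaving two-character codes without a known equivalent unchanged, are assumptions based on rows such as x-vicav and sh in the table below.

```python
# Excerpt only: a complete table would cover every ISO 639-1 code.
ISO_639_1_TO_639_2T = {
    "de": "deu",  # Terminology code, not the Bibliographic "ger"
    "fr": "fra",  # not "fre"
    "nl": "nld",  # not "dut"
    "zh": "zho",  # not "chi"
    "en": "eng",
    "es": "spa",
    "la": "lat",
}


def normalise(tag):
    """Reduce a BCP 47 tag to a language code, preferring three characters."""
    tag = tag.strip().lower()
    if tag.startswith("x-"):
        # Private-use tag: there is no language subtag to extract.
        return tag
    primary = tag.split("-", 1)[0]  # discard script, region, variants
    return ISO_639_1_TO_639_2T.get(primary, primary)
```

With this excerpt, de and de-AT both become deu, zh-Hant-TW becomes zho, and a three-character code such as grc passes through unchanged.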

I have not provided filtering of the search results by language, as this becomes too complicated to implement with a large data set in the context of a static website. CSV and JSON data files containing all the results are available on the home page if you need to do more advanced filtering.
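
If you download the JSON data, filtering by language takes only a few lines. The sketch below assumes a hypothetical record layout, a list of objects with "repo" and "languages" keys, which may not match the real files. Check the actual field names before using it. The sample data is invented for illustration.

```python
import json


def repos_for_language(records, code):
    """Return the repositories whose language list contains `code`."""
    return [r["repo"] for r in records if code in r.get("languages", [])]


# Invented sample in the assumed layout; replace with the downloaded file,
# e.g. records = json.load(open(path_to_downloaded_json)).
records = json.loads("""
[
  {"repo": "example/one", "languages": ["lat", "grc"]},
  {"repo": "example/two", "languages": ["deu"]},
  {"repo": "example/three", "languages": []}
]
""")

latin_repos = repos_for_language(records, "lat")
```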

Language Code Language Name Matching Repositories
zza Zaza; Dimili; Dimli; Kirdki; Kirmanjki; Zazaki 1
zxx No linguistic content; Not applicable 5
zul isiZulu 1
zkz 1
zkh 1
zho 中文 86
zab 1
yua maayaʼ tʼàan 1
ysc 1
yor Yorùbá 1
yid ייִדיש 7
yej יעואני גלוסא 1
ydg 1
yai 1
yah 1
xzp Zapotec 1
xtq 1
xto 1
xsc 1
xpu Punic 3
xpr 1
xno Anglo-Normaund 9
xml 2
xmf 1
xme 1
xly Elymian 3
xln 1
xld 1
xho isiXhosa 1
xda 3
xct Classical Tibetan 2
xcl գրաբար 5
xbo 1
xbc 1
x-vicav 1
x-verlan 6
x-vaynakh 1
x-unknown 7
x-tchr 1
x-sarm 1
x-sa 2
x-rushani 1
x-pamir 1
x-oldkhmer 1
x-oldirn 1
x-oldcam Old Cam 2
x-nuristan 1
x-mordvin 1
x-lap Igpay Atinlay 19
x-grc 2
x-dardic 1
x-balochi 1
wrm Warumungu 1
wne 1
wln Walon 4
win Hoocą́k hoit'éra 1
wbl 1
vol Volapük 1
vie Tiếng Việt 10
vep 1
ven Tshivenḓa 1
uzb Ўзбек 1
urd اردو 9
unk 6
und Undetermined 40
ukr Українська 20
uig Uyƣurqə / ئۇيغۇرچە 7
udm Udmurt 1
udi 1
uby 1
txh 1
txb 2
tur Türkçe 51
tup Tupi languages 1
ttt 1
tso Xitsonga 2
tsn Setswana 4
tru ܛܘܪܝܐ 1
trk 1
tpw Tupinambá 1
ton Lea Faka-Tonga 2
toi iSitonga 2
tmr Jewish Babylonian Aramaic 1
tly 1
tkr 1
tir ትግርኛ 5
tha ภาษาไทย 5
tgl Tagalog 11
tgk Тоҷикӣ 1
tel తెలుగు 7
ted 1
tat Tatarça 4
tam தமிழ் 16
tah Reo Mā`ohi 3
tab 1
syr Syriac 51
syc Classical Syriac 6
sxc Sicanian 3
swh Kiswahili 3
swe Svenska 44
swa Swahili languages 8
sve 1
sva 1
sux Sumerian 10
srp Српски 40
srh 1
srd Sardu 8
sqj 1
sqi Shqip 23
spn 3
spa Español 333
sot Sesotho 3
som اللغة الصومالية 5
sog Sogdian 13
snd सिनधि 1
smy 1
smi Sami languages 2
sme Sámegiella 7
smd 3
slv Slovenščina 100
slk Slovenčina 17
sla Slavic languages 2
sin සිංහල 4
shn Shan 1
sh 11
sgy 1
sgh 1
sga Irish, Old (to 900) 1
sem Semitic languages 1
scx Sicel 4
sco Scots 5
sav 3
san संस्कृतम् 118
sah Yakut 1
rut 1
rus Русский 129
rue русинськый язык; руски язик 1
ron Română 28
rom Romany 6
roh Rumantsch 2
roa Romance languages 1
reg 1
qwm 1
que Runa Simi 1
quc Qatzijobʼal 8
pus پښتو 4
prs دری 1
pro Provençal, Old (to 1500);Occitan, Old (to 1500) 6
prg 1
prc 1
pra Prakrit languages 9
por Português 87
pol Polski 82
pli Pāli / पाऴि 31
pka 𑀅𑀭𑁆𑀥𑀫𑀸𑀕𑀥𑀻 3
pit 2
pie 9
phn Phoenician 10
peo Persian, Old (ca.600-400 B.C.) 7
pcd Picard 1
pau Palauan 1
paq 1
pan ਪੰਜਾਬੀ / पंजाबी / پنجابي 5
pam Pampanga; Kapampangan 2
pal Pahlavi 4
pag Pangasinan 4
ott 1
ota Turkish, Ottoman (1500-1928) 15
osx Sahsisk 2
oss Иронау 3
osp 1
osc Oscan 5
orv 1
oru 1
oro 1
orm Oromoo 1
ori ଓଡ଼ିଆ 1
oos 1
onw 1
olt Old Lithuanian 1
oji ᐊᓂᔑᓈᐯᒧᐎᓐ / Anishinaabemowin 1
ohu 1
ofs Frysk 2
oci Occitan 7
oar Old Aramaic 1
nzz Naŋa tegu 1
nym Nyamwezi 2
nxq Naqxi geezheeq 2
nor Norsk (bokmål / riksmål) 27
non Norse, Old 27
nog Nogai 1
nob Bokmål, Norwegian; Norwegian Bokmål 5
nno Norsk (nynorsk) 1
nld Nederlands 222
nil 1
new नेपाल भाषा 2
nep नेपाली 6
nds Plattdüütsch 6
ndl 1
nbo 1
nau Dorerin Naoero 4
nai North American Indian languages 5
nah Nahuatl languages 4
myn Mayan languages 1
mya မြန်မာစာ / မြန်မာစကား 6
mwr Marwari 1
mul Multiple languages 18
msa Bahasa Melayu 10
mri Māori 5
mon Монгол Хэл / ᠮᠣᠨᠭᠭᠣᠯ ᠬᠡᠯᠡ 7
moh Mohawk 1
mns 1
mnj 1
mnc ᠮᠠᠨᠵᡠ ᡤᡳᠰᡠᠨ 4
mlt Malti 2
mlg Malagasy 4
mkd Македонски 8
miz Coatzospan Mixtec 3
miy Ayutla Mixtec 3
mix Mixtepec Mixtec 9
mis Uncoded languages 2
mig San Miguel El Grande Mixtec 3
mhd 1
mga Irish, Middle (900-1200) 1
mck Mbúùnda / Chimbúùnda 2
mar मराठी 4
map Austronesian languages 1
mal മലയാളം 4
mag मगही 1
lzz 1
lzh 古文 / 文言 1
lun Lunda 2
lub Luba-Katanga 1
loz Lozi 2
lng Lombardic 4
lit Lietuvių 35
lin Lingála 1
lik 1
lez Lezghian 1
lea Shabunda Lega 2
ldd 1
lbe 1
lav Latviešu 11
lat Latina 1141
lao ລາວ / Pha xa lao 2
lad Ladino 3
kur Kurdí, کوردی, or K’öрди 2
kum Къумукъ Tил 1
krl Karelian 1
krc Karachay-Balkar 1