TEIhub
Discover TEI-encoded documents from GitHub public repositories.

Languages

In an effort to make it easier to locate TEI-encoded texts in a particular language, I have recorded the ISO 639 language codes for each repository where languages were clearly specified in TEI files.

Each time a repository appears in a batch of 1000 GitHub search results, I download the first .xml file from that repository in the batch and check whether a language is specified in any of the following ways, in order:

  1. <langUsage><language id(ent)="xx(x)">...</language></langUsage>
  2. <text xml:lang="xx(x)">
  3. <body xml:lang="xx(x)">
  4. <textLang mainLang="xx(x)">. While this technically applies to the language of the manuscript, not necessarily of the TEI file, it was often used in files where only a facsimile was provided in the body.
  5. <TEI xml:lang="xx(x)">, but only where the language code is not en, as I found that a root-level en did not reliably match the language used in the rest of the file.
  6. xml:lang on any element within body, excluding <foreign> tags
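
The checks above can be sketched in Python roughly as follows. This is an illustration only, not the code used to build this site. It assumes that "in order" means the first check that yields a language wins, that element names are matched without regard to namespace, and that only the <foreign> element itself (not its descendants) is skipped in step 6. The names detect_languages, local and find_all are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local(tag):
    """Return the element name without its namespace prefix."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def find_all(root, name):
    """All elements (root included) whose local name equals `name`."""
    return [el for el in root.iter() if local(el.tag) == name]


def detect_languages(xml_text):
    """Apply the six checks in order; return the codes from the first that matches."""
    root = ET.fromstring(xml_text)

    # 1. <langUsage><language ident="..."> (or the older id="...")
    codes = []
    for usage in find_all(root, "langUsage"):
        for child in usage:
            if local(child.tag) == "language":
                code = child.get("ident") or child.get("id")
                if code:
                    codes.append(code)
    if codes:
        return codes

    # 2. <text xml:lang>, then 3. <body xml:lang>
    for name in ("text", "body"):
        codes = [el.get(XML_LANG) for el in find_all(root, name) if el.get(XML_LANG)]
        if codes:
            return codes

    # 4. <textLang mainLang="...">
    codes = [el.get("mainLang") for el in find_all(root, "textLang") if el.get("mainLang")]
    if codes:
        return codes

    # 5. <TEI xml:lang>, ignored when the code is "en"
    if local(root.tag) == "TEI":
        code = root.get(XML_LANG)
        if code and code != "en":
            return [code]

    # 6. xml:lang on any element inside <body>, skipping <foreign> elements
    codes = []
    for body in find_all(root, "body"):
        for el in body.iter():
            if el is body or local(el.tag) == "foreign":
                continue
            code = el.get(XML_LANG)
            if code and code not in codes:
                codes.append(code)
    return codes
```

A file that matches none of the checks returns an empty list, which corresponds to a repository with no recorded language.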

Based on a sample of 500 repositories, approximately 40% declare a language in one of these ways. If the same repository appears in a different batch of search results, I apply the same algorithm to the first .xml file found in that search, and append any new languages to the end of the list. Repositories/files/languages are not currently removed from this database even if the underlying files are changed or removed from GitHub.
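
The append-only behaviour described above could look something like this minimal sketch. The in-memory dictionary and the function name record_languages are hypothetical; the actual storage format is not described here.

```python
def record_languages(database, repository, found):
    """Append newly found language codes for a repository, never removing any.

    `database` maps a repository name to an ordered list of language codes.
    """
    known = database.setdefault(repository, [])
    for code in found:
        if code not in known:
            known.append(code)  # new languages go at the end of the list
    return known


# Hypothetical example: the same repository seen in two search batches.
db = {}
record_languages(db, "example/repo", ["deu", "lat"])
record_languages(db, "example/repo", ["lat", "fra"])
print(db["example/repo"])  # ['deu', 'lat', 'fra']
```

Because nothing is ever deleted, a language stays in the list even if the file that declared it later disappears from GitHub.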

The TEI Guidelines recommend the use of BCP 47 language tags. In their fullest form, such tags can encode language, region, script, and certain other information. For simplicity, I have recorded only the language subtag and discarded the rest. BCP 47 prefers the shortest available ISO 639 code for the language subtag (e.g. de instead of deu for German), which means that some languages have a two-character code and others a three-character code. For consistency, and to avoid privileging “more common” languages, I have normalised all two-character language codes to their three-character ISO 639-2 equivalents, choosing the Terminology variant where applicable (e.g. de becomes deu, not ger).
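
As a rough illustration of this normalisation, the sketch below keeps only the language subtag and maps two-character codes to ISO 639-2 Terminology codes. The mapping table is a small hand-picked excerpt rather than the full ISO 639 list. Keeping private-use tags such as x-pamir whole and leaving unknown two-character codes unchanged are assumptions, as is the function name normalise.

```python
# Excerpt only: ISO 639-1 codes mapped to ISO 639-2 Terminology (T) codes.
ISO_639_1_TO_2T = {
    "cs": "ces",  # not the Bibliographic code "cze"
    "da": "dan",
    "de": "deu",  # not "ger"
    "el": "ell",  # not "gre"
    "en": "eng",
    "fr": "fra",  # not "fre"
    "nl": "nld",  # not "dut"
    "zh": "zho",  # not "chi"
}


def normalise(tag):
    """Reduce a BCP 47 tag to a three-character language code where possible."""
    tag = tag.strip().lower()
    if tag.startswith("x-"):
        return tag  # assumption: private-use tags are kept as written
    language = tag.split("-", 1)[0]  # drop script, region and other subtags
    if len(language) == 2:
        # assumption: codes missing from the excerpt are left unchanged
        return ISO_639_1_TO_2T.get(language, language)
    return language


print(normalise("de-AT"))       # deu
print(normalise("zh-Hant-TW"))  # zho
print(normalise("grc"))         # grc
```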

I have not provided filtering of the search results by language, as this becomes too complicated to implement with a large data set in the context of a static website. If you need more advanced filtering, CSV and JSON data files containing all the results are provided on the home page.
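
If you do download the JSON data, filtering by language can be done in a few lines. The sketch below is illustrative only: the field names repository and languages and the sample records are invented for the example and may not match the actual files, so check the real structure first.

```python
import json

# Invented sample data; the real file's structure may differ.
SAMPLE = """[
  {"repository": "example/one", "languages": ["deu", "lat"]},
  {"repository": "example/two", "languages": ["fra"]},
  {"repository": "example/three", "languages": []}
]"""


def repositories_with_language(records, code):
    """Return the names of repositories whose language list contains `code`."""
    return [r["repository"] for r in records if code in r.get("languages", [])]


records = json.loads(SAMPLE)
print(repositories_with_language(records, "lat"))  # ['example/one']
```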

Language Code Language Name Matching Repositories
sah Yakut 1
nbo 1
que Runa Simi 1
btd 1
kir Kırgızca / Кыргызча 1
jav Basa Jawa 1
ven Tshivenḓa 1
apa Ndéé 1
fil Filipino; Pilipino 1
moh Mohawk 1
itl 1
ary الدارجة المغربية 1
chm Mari 1
ldd 1
ess 1
onw 1
ile Interlingue 1
da 1
txh 1
olt Old Lithuanian 1
aii ܣܘܪܝܬ 1
yej יעואני גלוסא 1
ted 1
wrm Warumungu 1
vol Volapük 1
tpw Tupinambá 1
gug avañeʼẽ 1
nzz Naŋa tegu 1
lub Luba-Katanga 1
ott 1
gcr 1
sem Semitic languages 1
osp 1
cpe Creoles and pidgins, English based 1
arg Aragonés 1
chr Cherokee 1
art Artificial languages 1
aka Akana 1
tmr Jewish Babylonian Aramaic 1
all 1
tup Tupi languages 1
pcd Picard 1
oar Old Aramaic 1
jpa Jewish Palestinian Aramaic 1
kho Khotanese; Sakan 1
kaz Қазақ Tілі 1
wbl 1
ira Iranian languages 1
vep 1
elx Elamite 1
krl Karelian 1
ajw Ajawa 1
x-pamir 1
xpr 1
x-balochi 1
ddo 1
bat Baltic languages 1
fij Na Vosa Vakaviti 1
alg Algonquian languages 1
sve 1
gae 1
x-sarm 1
gai 1
yor Yorùbá 1
enn 1
xbo 1
xbc 1
xme 1
hac 1
zab 1
xzp Zapotec 1
udi 1
ars 1
bag 1
x-oldirn 1
iso 1
bsk 1
aze Azərbaycanca / آذربايجان 1
gad Gaddang 1
bis Bislama 1
zul isiZulu 1
yua maayaʼ tʼàan 1
xho isiXhosa 1
win Hoocą́k hoit'éra 1
mga Irish, Middle (900-1200) 1
x-nuristan 1
rue русинськый язык; руски язик 1
ckb 1
ohu 1
x-vaynakh 1
mwr Marwari 1
chv Чăваш 1
mag मगही 1
kar Karen languages 1
ani 1
kap 1
yai 1
wne 1
esx Eskaleut 1
dak Dakȟótiyapi 1
ysc 1
brx बर' 1
asm অসমীয়া 1
alt Southern Altai 1
mhd 1
lin Lingála 1
gbz 1
prs دری 1
cau Caucasian languages 1
snd सिनधि 1
map Austronesian languages 1
chg Chagatai 1
oos 1
ttt 1
rut 1
ori ଓଡ଼ିଆ 1
lzh 古文 / 文言 1
ags 1
xtq 1
bya Palawan Batak 1
oji ᐊᓂᔑᓈᐯᒧᐎᓐ / Anishinaabemowin 1
ndl 1
khm ភាសាខ្មែរ 1
x-oldkhmer 1
tab 1
mns 1
kca 1
tly 1
x-rushani 1
x-mordvin 1
uby 1
cha Chamoru 1
ccs 1
reg 1
kdr 1
nil 1
udm Udmurt 1
ydg 1
sqj 1
xmf 1
ewe Ɛʋɛ 1
x-tchr 1
bra Braj 1
xln 1
lik 1
ady Aдыгэбзэ 1
kom Коми 1
ing 1
lbe 1
myn Mayan languages 1
prg 1
orv 1
lez Lezghian 1
kbd Kъэбэрдеибзэ 1
bbl 1
isk 1
khw 1
uzb Ўзбек 1
kjj 1
qwm 1
hit Hittite 1
nno Norsk (nynorsk) 1
sga Irish, Old (to 900) 1
abq 1
xto 1
shn Shan 1
sgy 1
yah 1
srh 1
sgh 1
mnj 1
prc 1
awa Awadhi 1
oru 1
smy 1
zkh 1
abk Аҧсуа 1
bos Босански / Bosanski 1
oro 1
orm Oromoo 1
lzz 1
inh ГӀалгӀай мотт 1
tru ܛܘܪܝܐ 1
tkr 1
aqc 1
cai Central American Indian languages 1
dar Дарган Mез 1
kum Къумукъ Tил 1
nog Nogai 1
krc Karachay-Balkar 1
agx 1
zza Zaza; Dimili; Dimli; Kirdki; Kirmanjki; Zazaki 1
x-vicav 1
inm Minaean 1
pau Palauan 1
sva 1
roa Romance languages 1
x-dardic 1
aer 1
xsc 1
apc اللهجة الشامي الشمال 1
paq 1
tgk Тоҷикӣ 1
trk 1
xld 1
zkz 1
ett 1
iir 1
bua Buriat 1
brh براہوئی 2
nxq Naqxi geezheeq 2
lao ລາວ / Pha xa lao 2
dzo ཇོང་ཁ 2
x-sa 2
toi iSitonga 2
mck Mbúùnda / Chimbúùnda 2
lun Lunda 2
loz Lozi 2
bnt Bantu languages 2
lea Shabunda Lega 2
nym Nyamwezi 2
aeb تونسي 2
ofs Frysk 2
del Delaware 2
x-grc 2
mlt Malti 2
smi Sami languages 2
che Нохчийн 2
ibe Abesabesi 2
pit 2
mis Uncoded languages 2
osx Sahsisk 2
hil Hiligaynon 2
gem Germanic languages 2
new नेपाल भाषा 2
xct Classical Tibetan 2
ton Lea Faka-Tonga 2
pam Pampanga; Kapampangan 2
cos Corsu 2
kur Kurdí, کوردی, or K’öрди 2
ave Avestan 2
ido Ido 2
ilo Iloko 2
roh Rumantsch 2
glv Gaelg 2
bak Башҡорт 2
bbc 2
got Gothic 2
sla Slavic languages 2
tso Xitsonga 2