In an effort to make it easier to locate TEI-encoded texts in a particular language, I have recorded the ISO 639 language codes for each repository where languages were clearly specified in TEI files.
Each time a repository appears in a batch of 1000 GitHub search results, I download the first .xml file for that repository that appears in the current batch of search results and check whether a language is specified in any of the following ways, in order:
1. `<langUsage><language id(ent)="xx(x)">...</language></langUsage>`
2. `<text xml:lang="xx(x)">`
3. `<body xml:lang="xx(x)">`
4. `<textLang mainLang="xx(x)">`. While this technically applies to the language of the manuscript, not necessarily of the TEI file, it was often used in files where only a facsimile was provided in the body.
5. `<TEI xml:lang="xx(x)">`, but only where the language code is not `en`, as I found this did not reliably match the language used in the rest of the file.
6. `xml:lang` on any element within `body`, excluding `<foreign>` tags

Based on a sample of 500 repositories, approximately 40% declare a language in one of these ways. If the same repository appears in a different batch of search results, I apply the same algorithm to the first .xml file found in that search, and append any new languages to the end of the list. Repositories/files/languages are not currently removed from this database even if the underlying files are changed or removed from GitHub.
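
The precedence checks described above can be sketched in Python as follows. This is an illustrative sketch, not the code that produced the table below: the function name `detect_languages`, the helper names, the use of `xml.etree.ElementTree`, the "first matching rule wins" reading of "in order", and the choice to skip only the `<foreign>` element itself (not its descendants) are all assumptions.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local(tag):
    """Return the tag name without its namespace, so namespaced and
    un-namespaced TEI files are matched the same way."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def find_all(root, name):
    """All elements at or below root whose local name equals name."""
    return [el for el in root.iter() if local(el.tag) == name]


def detect_languages(xml_text):
    """Return the language codes declared by the first rule that matches."""
    root = ET.fromstring(xml_text)

    # 1. <langUsage><language ident="..."> (or the older id="...")
    codes = []
    for usage in find_all(root, "langUsage"):
        for lang in find_all(usage, "language"):
            code = lang.get("ident") or lang.get("id")
            if code:
                codes.append(code)
    if codes:
        return codes

    # 2 and 3. xml:lang on <text>, then on <body>
    for name in ("text", "body"):
        codes = [el.get(XML_LANG) for el in find_all(root, name) if el.get(XML_LANG)]
        if codes:
            return codes

    # 4. <textLang mainLang="...">
    codes = [el.get("mainLang") for el in find_all(root, "textLang") if el.get("mainLang")]
    if codes:
        return codes

    # 5. xml:lang on the root <TEI> element, ignoring "en"
    if local(root.tag) == "TEI":
        code = root.get(XML_LANG)
        if code and code != "en":
            return [code]

    # 6. xml:lang on any element inside <body>, skipping <foreign> itself
    codes = []
    for body in find_all(root, "body"):
        for el in body.iter():
            if local(el.tag) != "foreign" and el.get(XML_LANG):
                codes.append(el.get(XML_LANG))
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(codes))
```

Matching on local names rather than on the TEI namespace is a deliberate simplification here, so that older files without a namespace declaration are still checked.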
The TEI Guidelines recommend the use of BCP 47 language tags. In their fullest version such tags allow you to encode language, region, script, and certain other information. For simplicity I have only recorded the language subtag, and discarded the other information. BCP 47 prefers the shortest available form of ISO 639 language codes for the language subtag (e.g. `de` instead of `deu` for German). However, this means some languages have a two-character code and some a three-character code. For consistency, and to avoid privileging “more common” languages, I have normalised all two-character language codes to their three-character ISO 639-2 equivalent, choosing the Terminology variant where applicable (e.g. `de` becomes `deu`, not `ger`).
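
As a rough illustration of this normalisation, the sketch below keeps only the language subtag of a BCP 47 tag and maps two-character codes to their ISO 639-2 Terminology equivalents. The function name `normalise_language` and the dictionary name are hypothetical, and the mapping is a small hand-picked subset rather than the full ISO 639 table. Leaving private-use tags (`x-...`) and unmapped codes unchanged is an assumption, based on entries such as `x-lap` and `sh` in the table below.

```python
# Partial, illustrative ISO 639-1 -> ISO 639-2/T mapping (not the full table).
ISO_639_1_TO_2T = {
    "de": "deu",  # Terminology code; the Bibliographic code would be "ger"
    "en": "eng",
    "fr": "fra",
    "la": "lat",
    "it": "ita",
    "el": "ell",
    "nl": "nld",
    "zh": "zho",
    "cs": "ces",
    "fa": "fas",
}


def normalise_language(tag):
    """Reduce a BCP 47 tag to a single language code.

    Region, script and other subtags are discarded. Two-character codes
    found in the mapping are replaced by their three-character form.
    """
    tag = tag.strip().lower()
    # Assumption: private-use tags are kept whole rather than truncated to "x".
    if tag.startswith("x-"):
        return tag
    primary = tag.split("-", 1)[0]
    # Codes without a known mapping are returned unchanged.
    return ISO_639_1_TO_2T.get(primary, primary)


print(normalise_language("de-AT"))     # deu
print(normalise_language("grc-Latn"))  # grc
```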
I have not provided filtering of the search results by language, as this becomes too complicated to implement with a large data set in the context of a static website. Data files in CSV and JSON formats of all the results are provided on the home page if you need to do more advanced filtering.
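
If you download the CSV data, filtering by language can be done in a few lines. The sketch below is only an outline: the column names `repository` and `languages`, the `;` separator between codes, and the sample rows are all hypothetical placeholders, so check them against the real file before use.

```python
import csv
import io

# Hypothetical sample standing in for the downloaded CSV file.
SAMPLE_CSV = """repository,languages
example/repo-a,lat;grc
example/repo-b,eng
example/repo-c,deu;lat
"""


def repos_with_language(csv_file, code):
    """Return repositories whose language list contains the given code."""
    reader = csv.DictReader(csv_file)
    return [
        row["repository"]
        for row in reader
        if code in row["languages"].split(";")
    ]


matches = repos_with_language(io.StringIO(SAMPLE_CSV), "lat")
print(matches)  # ['example/repo-a', 'example/repo-c']
```

To run this against a real download, pass an open file object (for example `open("data.csv", newline="", encoding="utf-8")`, where the filename is again a placeholder) instead of the `io.StringIO` sample.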
Language Code | Language Name | Matching Repositories |
---|---|---|
eng | English | 1751 |
fra | Français | 1210 |
lat | Latina | 1145 |
deu | Deutsch | 763 |
ita | Italiano | 611 |
grc | Greek, Ancient (to 1453) | 398 |
spa | Español | 334 |
nld | Nederlands | 223 |
ara | العربية | 201 |
heb | עברית | 152 |
ell | Ελληνικά | 147 |
rus | Русский | 130 |
san | संस्कृतम् | 119 |
slv | Slovenščina | 100 |
por | Português | 88 |
zho | 中文 | 86 |
pol | Polski | 82 |
dan | Dansk | 67 |
fro | French, Old (842-ca.1400) | 65 |
jpn | 日本語 | 58 |
syr | Syriac | 52 |
tur | Türkçe | 51 |
cop | Coptic | 49 |
ces | Český | 48 |
swe | Svenska | 44 |
fas | فارسی | 41 |
und | Undetermined | 40 |
hun | Magyar | 40 |
srp | Српски | 40 |
kor | 조선말 / 한국어 | 37 |
chu | словѣньскъ / slověnĭskŭ | 36 |
gez | Geez | 35 |
cym | Cymraeg | 35 |
lit | Lietuvių | 35 |
kat | ქართული | 34 |
pli | Pāli / पाऴि | 31 |
isl | Íslenska | 31 |
hrv | Hrvatski | 30 |
ron | Română | 28 |
nor | Norsk (bokmål / riksmål) | 27 |
non | Norse, Old | 27 |
frm | French, Middle (ca.1400-1600) | 26 |
tam | தமிழ் | 25 |
bul | Български | 25 |
hye | Հայերեն | 23 |
sqi | Shqip | 23 |
ukr | Українська | 20 |
bod | དབུས་སྐད་ | 20 |
fin | Suomi | 20 |
hin | हिन्दी | 20 |
x-lap | Igpay Atinlay | 19 |
mul | Multiple languages | 18 |
ang | English, Old (ca.450-1100) | 18 |
slk | Slovenčina | 18 |
gmh | German, Middle High (ca.1050-1500) | 16 |
amh | አማርኛ | 16 |
cat | Català | 16 |
ota | Turkish, Ottoman (1500-1928) | 15 |
afr | Afrikaans | 14 |
est | Eesti | 14 |
sog | Sogdian | 14 |
enm | English, Middle (1100-1500) | 13 |
arc | Official Aramaic (700-300 BCE); Imperial Aramaic (700-300 BCE) | 13 |
ava | Авар | 13 |
gle | Gaeilge | 12 |
akk | Akkadian | 12 |
sh | | 11 |
lav | Latviešu | 11 |
tgl | Tagalog | 11 |
msa | Bahasa Melayu | 10 |
phn | Phoenician | 10 |
sux | Sumerian | 10 |
grk | Hellenic languages | 10 |
egy | Egyptian (Ancient) | 10 |
vie | Tiếng Việt | 10 |
ind | Bahasa Indonesia | 10 |
urd | اردو | 9 |
glg | Galego | 9 |
mix | Mixtepec Mixtec | 9 |
pra | Prakrit languages | 9 |
xno | Anglo-Normaund | 9 |
pie | | 9 |
quc | Qatzijobʼal | 8 |
srd | Sardu | 8 |
mkd | Македонски | 8 |
swa | Swahili languages | 8 |
oci | Occitan | 7 |
x-unknown | | 7 |
gla | Gàidhlig | 7 |
peo | Persian, Old (ca.600-400 B.C.) | 7 |
mon | Монгол Хэл / ᠮᠣᠨᠭᠭᠣᠯ ᠬᠡᠯᠡ | 7 |
bar | Boarisch, Bairisch | 7 |
sme | Sámegiella | 7 |
uig | Uyƣurqə / ئۇيغۇرچە | 7 |
tel | తెలుగు | 7 |
jrb | Judeo-Arabic | 7 |
yid | ייִדיש | 7 |
mya | မြန်မာစာ / မြန်မာစကား | 6 |
rom | Romany | 6 |
kau | Kanuri | 6 |
kan | ಕನ್ನಡ | 6 |
syc | Classical Syriac | 6 |
bre | Brezhoneg | 6 |
pro | Provençal, Old (to 1500);Occitan, Old (to 1500) | 6 |
epo | Esperanto | 6 |
ber | Tamaziɣt | 6 |
nds | Plattdüütsch | 6 |
unk | | 6 |
ben | বাংলা | 6 |
nep | नेपाली | 6 |
x-verlan | | 6 |
gda | | 5 |
som | اللغة الصومالية | 5 |
zxx | No linguistic content; Not applicable | 5 |
hau | هَوُسَ | 5 |
eus | Euskara | 5 |
pan | ਪੰਜਾਬੀ / पंजाबी / پنجابي | 5 |
xcl | գրաբար | 5 |
gsw | Schwyzerdütsch | 5 |
tir | ትግርኛ | 5 |
tha | ภาษาไทย | 5 |
osc | Oscan | 5 |
ceb | Bisaya / Sinugbuanon | 5 |
mri | Māori | 5 |
nob | Bokmål, Norwegian; Norwegian Bokmål | 5 |
sco | Scots | 5 |
nai | North American Indian languages | 5 |
mlg | Malagasy | 4 |
gml | German, Middle Low (ca.1200-1650) | 4 |
jpr | Judeo-Persian | 4 |
goh | German, Old High (ca.750-1050) | 4 |
pus | پښتو | 4 |
lng | Lombardic | 4 |
mnc | ᠮᠠᠨᠵᡠ ᡤᡳᠰᡠᠨ | 4 |
kaw | Kawi | 4 |
mal | മലയാളം | 4 |
tsn | Setswana | 4 |
nau | Dorerin Naoero | 4 |
dum | Dutch, Middle (ca.1050-1350) | 4 |
mar | मराठी | 4 |
pag | Pangasinan | 4 |
wln | Walon | 4 |
sin | සිංහල | 4 |
nah | Nahuatl languages | 4 |
scx | Sicel | 4 |
tat | Tatarça | 4 |
pal | Pahlavi | 4 |
arz | مصرى | 3 |
spn | | 3 |
cmn | 官話 / 官话 | 3 |
sav | | 3 |
oss | Иронау | 3 |
ine | Indo-European languages | 3 |
guj | ગુજરાતી | 3 |
ajp | اللهجة الشامية الجنوبية | 3 |
hbo | Ancient Hebrew | 3 |
lad | Ladino | 3 |
bik | Bikol | 3 |
haw | ʻŌlelo Hawaiʻi | 3 |
bel | Беларуская | 3 |
arb | العربية الفصحى | 3 |
mig | San Miguel El Grande Mixtec | 3 |
sot | Sesotho | 3 |
miy | Ayutla Mixtec | 3 |
tah | Reo Mā`ohi | 3 |
miz | Coatzospan Mixtec | 3 |
smd | | 3 |
xpu | Punic | 3 |
xly | Elymian | 3 |
xda | | 3 |
sxc | Sicanian | 3 |
pka | 𑀅𑀭𑁆𑀥𑀫𑀸𑀕𑀥𑀻 | 3 |
cel | Celtic languages | 3 |
swh | Kiswahili | 3 |
des | | 3 |
arn | Mapudungun; Mapuche | 2 |
inc | Indic languages | 2 |
gig | | 2 |
xml | | 2 |
cor | Kernewek | 2 |
ess | | 2 |
x-oldcam | Old Cam | 2 |
fry | Frysk | 2 |
txb | | 2 |
tso | Xitsonga | 2 |
sla | Slavic languages | 2 |
got | Gothic | 2 |
bbc | | 2 |
bak | Башҡорт | 2 |
glv | Gaelg | 2 |
roh | Rumantsch | 2 |
ilo | Iloko | 2 |
ido | Ido | 2 |
ave | Avestan | 2 |
kur | Kurdí, کوردی, or K’öрди | 2 |
cos | Corsu | 2 |
pam | Pampanga; Kapampangan | 2 |
ton | Lea Faka-Tonga | 2 |
xct | Classical Tibetan | 2 |
new | नेपाल भाषा | 2 |
gem | Germanic languages | 2 |
hil | Hiligaynon | 2 |
osx | Sahsisk | 2 |
mis | Uncoded languages | 2 |
pit | | 2 |
ibe | Abesabesi | 2 |
che | Нохчийн | 2 |
smi | Sami languages | 2 |
mlt | Malti | 2 |
x-grc | | 2 |
del | Delaware | 2 |
ofs | Frysk | 2 |
aeb | تونسي | 2 |
nym | Nyamwezi | 2 |
lea | Shabunda Lega | 2 |
bnt | Bantu languages | 2 |
loz | Lozi | 2 |
lun | Lunda | 2 |
mck | Mbúùnda / Chimbúùnda | 2 |
toi | iSitonga | 2 |
x-sa | | 2 |
dzo | ཇོང་ཁ | 2 |
lao | ລາວ / Pha xa lao | 2 |
nxq | Naqxi geezheeq | 2 |
brh | براہوئی | 2 |
iir | | 1 |
ett | | 1 |
zkz | | 1 |
xld | | 1 |
trk | | 1 |
tgk | Тоҷикӣ | 1 |
paq | | 1 |
apc | اللهجة الشامي الشمال | 1 |
xsc | | 1 |
aer | | 1 |
x-dardic | | 1 |
roa | Romance languages | 1 |
sva | | 1 |
pau | Palauan | 1 |
inm | Minaean | 1 |
x-vicav | | 1 |
zza | Zaza; Dimili; Dimli; Kirdki; Kirmanjki; Zazaki | 1 |
agx | | 1 |
krc | Karachay-Balkar | 1 |
nog | Nogai | 1 |
kum | Къумукъ Tил | 1 |
dar | Дарган Mез | 1 |
cai | Central American Indian languages | 1 |
aqc | | 1 |
tkr | | 1 |