TEIhub

Languages

In an effort to make it easier to locate TEI-encoded texts in a particular language, I have recorded the ISO 639 language codes for each repository where languages were clearly specified in TEI files.

Each time a repository appears in a batch of 1000 GitHub search results, I download the first .xml file listed for that repository in that batch and check whether a language is specified in any of the following ways, in order:

<langUsage><language id(ent)="xx(x)">...</language></langUsage>
<text xml:lang="xx(x)">
<body xml:lang="xx(x)">
<textLang mainLang="xx(x)">. While this technically applies to the language of the manuscript, not necessarily of the TEI file, it was often used in files where only a facsimile was provided in the body.
<TEI xml:lang="xx(x)">, but only where the language code is not en, as I found this did not reliably match the language used in the rest of the file.
xml:lang on any element within body, excluding <foreign> tags
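
The checks above could be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the code used to build the database. The function names (detect_languages, find_all, local) are hypothetical. The sketch assumes that the search stops at the first rule that yields a code, that element names are compared without their namespace so that older, non-namespaced TEI files are also handled, and that only the <foreign> element itself (not its descendants) is skipped in the last step.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local(tag):
    """Return an element's tag name without any namespace part."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def find_all(root, name):
    """All elements under root (root included) with the given local name."""
    return [el for el in root.iter() if local(el.tag) == name]


def detect_languages(xml_text):
    """Apply the six checks in order; return the codes from the first that matches."""
    root = ET.fromstring(xml_text)

    # 1. <langUsage><language ident="..."> (or id="..." in older files)
    codes = []
    for usage in find_all(root, "langUsage"):
        for lang in find_all(usage, "language"):
            code = lang.get("ident") or lang.get("id")
            if code:
                codes.append(code)
    if codes:
        return codes

    # 2 and 3. xml:lang on <text>, then on <body>
    for name in ("text", "body"):
        codes = [el.get(XML_LANG) for el in find_all(root, name) if el.get(XML_LANG)]
        if codes:
            return codes

    # 4. <textLang mainLang="...">
    codes = [el.get("mainLang") for el in find_all(root, "textLang") if el.get("mainLang")]
    if codes:
        return codes

    # 5. xml:lang on the root <TEI> element, ignored when it is "en"
    if local(root.tag) == "TEI":
        code = root.get(XML_LANG)
        if code and code != "en":
            return [code]

    # 6. xml:lang on any element inside <body>, skipping <foreign> elements
    codes = []
    for body in find_all(root, "body"):
        for el in body.iter():
            if local(el.tag) == "foreign":
                continue
            code = el.get(XML_LANG)
            if code and code not in codes:
                codes.append(code)
    return codes
```

The function returns the raw attribute values (for example fr-FR), so any reduction to a bare language code would happen in a separate step.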

Based on a sample of 500 repositories, approximately 40% declare a language in one of these ways. If the same repository appears in a different batch of search results, I apply the same algorithm to the first .xml file found in that search, and append any new languages to the end of the list. Repositories/files/languages are not currently removed from this database even if the underlying files are changed or removed from GitHub.

The TEI Guidelines recommend the use of BCP 47 language tags. In their fullest form, such tags can encode language, script, region, and certain other information. For simplicity, I have recorded only the language subtag and discarded the rest. BCP 47 prefers the shortest available ISO 639 code for the language subtag (e.g. de instead of deu for German), which means that some languages have a two-character code and others a three-character code. For consistency, and to avoid privileging “more common” languages, I have normalised all two-character language codes to their three-character ISO 639-2 equivalent, choosing the Terminology (T) variant over the Bibliographic (B) variant where the two differ (e.g. de becomes deu, not ger).
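
As a rough illustration of this normalisation, the following Python sketch reduces a BCP 47 tag to its language subtag and maps two-character codes to three-character ones. The function name normalise and the TWO_TO_THREE table are hypothetical, and the table holds only a handful of ISO 639-1 to ISO 639-2/T pairs for demonstration; a real implementation would need the complete list. Two behaviours are assumptions suggested by the table below rather than documented rules: private-use tags beginning with x- are kept whole, and two-character codes with no known mapping are passed through unchanged.

```python
# Illustrative subset of ISO 639-1 -> ISO 639-2/T mappings (not complete).
TWO_TO_THREE = {
    "cs": "ces",
    "de": "deu",
    "el": "ell",
    "en": "eng",
    "fr": "fra",
    "nl": "nld",
    "zh": "zho",
}


def normalise(tag):
    """Reduce a BCP 47 tag to a single language code."""
    tag = tag.strip().lower()

    # Assumption: private-use tags such as x-lap are recorded as they are.
    if tag.startswith("x-"):
        return tag

    # The language subtag is everything before the first hyphen.
    primary = tag.split("-")[0]

    # Two-character codes are mapped where a mapping is known;
    # anything else (including three-character codes) is returned unchanged.
    if len(primary) == 2:
        return TWO_TO_THREE.get(primary, primary)
    return primary
```

For example, normalise("de-DE") and normalise("de") both give deu, while normalise("grc") is returned as grc.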

I have not provided filtering of the search results by language, as this becomes too complicated to implement with a large data set in the context of a static website. Data files containing all the results, in CSV and JSON formats, are provided on the home page if you need to do more advanced filtering.
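
Filtering a downloaded CSV file by language could look something like the Python sketch below. The layout of the data files is not described here, so the column names (repository, languages), the semicolon separator, and the sample rows are all invented for illustration; adjust them to match the actual file.

```python
import csv
import io

# Made-up sample standing in for the downloaded CSV file.
SAMPLE = """repository,languages
example/repo-a,lat;grc
example/repo-b,eng
example/repo-c,fra;lat
"""


def repos_with_language(csv_file, code, column="languages", sep=";"):
    """Return the repositories whose language list contains the given code."""
    reader = csv.DictReader(csv_file)
    return [row["repository"] for row in reader if code in row[column].split(sep)]


# With a real file, pass open("results.csv", newline="", encoding="utf-8") instead.
matches = repos_with_language(io.StringIO(SAMPLE), "lat")
```

Splitting the language field before comparing avoids false matches on substrings of longer codes.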

Language Code	Language Name	Matching Repositories
eng	English	1751
fra	Français	1209
lat	Latina	1143
deu	Deutsch	762
ita	Italiano	609
grc	Greek, Ancient (to 1453)	398
spa	Español	334
nld	Nederlands	223
ara	العربية	201
heb	עברית	151
ell	Ελληνικά	147
rus	Русский	130
san	संस्कृतम्	118
slv	Slovenščina	100
por	Português	88
zho	中文	86
pol	Polski	82
dan	Dansk	67
fro	French, Old (842-ca.1400)	65
jpn	日本語	58
syr	Syriac	52
tur	Türkçe	51
cop	Coptic	49
ces	Český	48
swe	Svenska	44
fas	فارسی	41
und	Undetermined	40
hun	Magyar	40
srp	Српски	40
kor	조선말 / 한국어	37
chu	словѣньскъ / slověnĭskŭ	36
gez	Geez	35
lit	Lietuvių	35
cym	Cymraeg	35
kat	ქართული	34
pli	Pāli / पाऴि	31
isl	Íslenska	31
hrv	Hrvatski	30
ron	Română	28
non	Norse, Old	27
nor	Norsk (bokmål / riksmål)	27
frm	French, Middle (ca.1400-1600)	26
bul	Български	25
sqi	Shqip	23
hye	Հայերեն	23
tam	தமிழ்	21
hin	हिन्दी	20
fin	Suomi	20
ukr	Українська	20
bod	དབུས་སྐད་	20
x-lap	Igpay Atinlay	19
ang	English, Old (ca.450-1100)	18
mul	Multiple languages	18
slk	Slovenčina	17
amh	አማርኛ	16
cat	Català	16
gmh	German, Middle High (ca.1050-1500)	16
ota	Turkish, Ottoman (1500-1928)	15
afr	Afrikaans	14
sog	Sogdian	14
est	Eesti	14
arc	Official Aramaic (700-300 BCE); Imperial Aramaic (700-300 BCE)	13
ava	Авар	13
enm	English, Middle (1100-1500)	13
gle	Gaeilge	12
akk	Akkadian	12
tgl	Tagalog	11
sh		11
lav	Latviešu	11
msa	Bahasa Melayu	10
phn	Phoenician	10
sux	Sumerian	10
grk	Hellenic languages	10
egy	Egyptian (Ancient)	10
vie	Tiếng Việt	10
ind	Bahasa Indonesia	10
urd	اردو	9
glg	Galego	9
mix	Mixtepec Mixtec	9
pra	Prakrit languages	9
xno	Anglo-Normaund	9
pie		9
quc	Qatzijobʼal	8
srd	Sardu	8
mkd	Македонски	8
swa	Swahili languages	8
oci	Occitan	7
x-unknown		7
gla	Gàidhlig	7
peo	Persian, Old (ca.600-400 B.C.)	7
mon	Монгол Хэл / ᠮᠣᠨᠭᠭᠣᠯ ᠬᠡᠯᠡ	7
bar	Boarisch, Bairisch	7
sme	Sámegiella	7
uig	Uyƣurqə / ئۇيغۇرچە	7
tel	తెలుగు	7
jrb	Judeo-Arabic	7
yid	ייִדיש	7
mya	မြန်မာစာ / မြန်မာစကား	6
rom	Romany	6
kau	Kanuri	6
kan	ಕನ್ನಡ	6
syc	Classical Syriac	6
bre	Brezhoneg	6
pro	Provençal, Old (to 1500);Occitan, Old (to 1500)	6
epo	Esperanto	6
ber	Tamaziɣt	6
nds	Plattdüütsch	6
unk		6
ben	বাংলা	6
nep	नेपाली	6
x-verlan		6
gda		5
som	اللغة الصومالية	5
zxx	No linguistic content; Not applicable	5
hau	هَوُسَ	5
eus	Euskara	5
pan	ਪੰਜਾਬੀ / पंजाबी / پنجابي	5
xcl	գրաբար	5
gsw	Schwyzerdütsch	5
tir	ትግርኛ	5
tha	ภาษาไทย	5
osc	Oscan	5
ceb	Bisaya / Sinugbuanon	5
mri	Māori	5
nob	Bokmål, Norwegian; Norwegian Bokmål	5
sco	Scots	5
nai	North American Indian languages	5
mlg	Malagasy	4
gml	German, Middle Low (ca.1200-1650)	4
jpr	Judeo-Persian	4
goh	German, Old High (ca.750-1050)	4
pus	پښتو	4
lng	Lombardic	4
mnc	ᠮᠠᠨᠵᡠ ᡤᡳᠰᡠᠨ	4
kaw	Kawi	4
mal	മലയാളം	4
tsn	Setswana	4
nau	Dorerin Naoero	4
dum	Dutch, Middle (ca.1050-1350)	4
mar	मराठी	4
pag	Pangasinan	4
wln	Walon	4
sin	සිංහල	4
nah	Nahuatl languages	4
scx	Sicel	4
tat	Tatarça	4
pal	Pahlavi	4
arz	مصرى	3
spn		3
cmn	官話 / 官话	3
sav		3
oss	Иронау	3
ine	Indo-European languages	3
guj	ગુજરાતી	3
ajp	اللهجة الشامية الجنوبية	3
hbo	Ancient Hebrew	3
lad	Ladino	3
bik	Bikol	3
haw	ʻŌlelo Hawaiʻi	3
bel	Беларуская	3
arb	العربية الفصحى	3
mig	San Miguel El Grande Mixtec	3
sot	Sesotho	3
miy	Ayutla Mixtec	3
tah	Reo Mā`ohi	3
miz	Coatzospan Mixtec	3
smd		3
xpu	Punic	3
xly	Elymian	3
xda		3
sxc	Sicanian	3
pka	𑀅𑀭𑁆𑀥𑀫𑀸𑀕𑀥𑀻	3
cel	Celtic languages	3
swh	Kiswahili	3
des		3
arn	Mapudungun; Mapuche	2
inc	Indic languages	2
gig		2
xml		2
cor	Kernewek	2
ess		2
x-oldcam	Old Cam	2
fry	Frysk	2
txb		2
tso	Xitsonga	2
sla	Slavic languages	2
got	Gothic	2
bbc		2
bak	Башҡорт	2
glv	Gaelg	2
roh	Rumantsch	2
ilo	Iloko	2
ido	Ido	2
ave	Avestan	2
kur	Kurdí, کوردی, or K’öрди	2
cos	Corsu	2
pam	Pampanga; Kapampangan	2
ton	Lea Faka-Tonga	2
xct	Classical Tibetan	2
new	नेपाल भाषा	2
gem	Germanic languages	2
hil	Hiligaynon	2
osx	Sahsisk	2
mis	Uncoded languages	2
pit		2
ibe	Abesabesi	2
che	Нохчийн	2
smi	Sami languages	2
mlt	Malti	2
x-grc		2
del	Delaware	2
ofs	Frysk	2
aeb	تونسي	2
nym	Nyamwezi	2
lea	Shabunda Lega	2
bnt	Bantu languages	2
loz	Lozi	2
lun	Lunda	2
mck	Mbúùnda / Chimbúùnda	2
toi	iSitonga	2
x-sa		2
dzo	ཇོང་ཁ	2
lao	ລາວ / Pha xa lao	2
nxq	Naqxi geezheeq	2
brh	براہوئی	2
iir		1
ett		1
zkz		1
xld		1
trk		1
tgk	Тоҷикӣ	1
paq		1
apc	اللهجة الشامي الشمال	1
xsc		1
aer		1
x-dardic		1
roa	Romance languages	1
sva		1
pau	Palauan	1
inm	Minaean	1
x-vicav		1
zza	Zaza; Dimili; Dimli; Kirdki; Kirmanjki; Zazaki	1
agx		1
krc	Karachay-Balkar	1
nog	Nogai	1
kum	Къумукъ Tил	1
dar	Дарган Mез	1
cai	Central American Indian languages	1
aqc		1
tkr		1

TEIhub Discover TEI-encoded documents from GitHub public repositories.
