TEIhub

Languages

To make it easier to locate TEI-encoded texts in a particular language, I have recorded the ISO 639 language codes for each repository whose TEI files clearly specify a language.

Each time a repository appears in a batch of 1000 GitHub search results, I download the first .xml file from that repository in the batch and check whether a language is specified in any of the following ways, in order:

1. <langUsage><language id(ent)="xx(x)">...</language></langUsage>
2. <text xml:lang="xx(x)">
3. <body xml:lang="xx(x)">
4. <textLang mainLang="xx(x)">. Technically this describes the language of the manuscript, not necessarily of the TEI file, but it was often used in files where the body contained only a facsimile.
5. <TEI xml:lang="xx(x)">, but only where the language code is not en, as I found that en did not reliably match the language used in the rest of the file.
6. xml:lang on any element within body, excluding <foreign> elements
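
The checks above can be sketched in Python using only the standard library. This is an illustrative sketch, not the code behind this site: the function name detect_languages is hypothetical, and it assumes that the first check that yields a result wins, that element names are compared without their namespace (so both namespaced and non-namespaced TEI files match), and that "excluding <foreign>" means skipping the <foreign> element itself.

```python
import xml.etree.ElementTree as ET

# Expanded name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local(tag):
    """Return an element name without its namespace part."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def find_all(root, name):
    """All elements at or below root whose local name equals name."""
    return [el for el in root.iter() if local(el.tag) == name]


def detect_languages(xml_text):
    """Return the language codes from the first check that matches."""
    root = ET.fromstring(xml_text)

    # 1. <langUsage><language ident="..."> (or the older id="...")
    codes = []
    for lang_usage in find_all(root, "langUsage"):
        for language in find_all(lang_usage, "language"):
            code = language.get("ident") or language.get("id")
            if code:
                codes.append(code)
    if codes:
        return codes

    # 2 and 3. xml:lang on <text>, then on <body>
    for name in ("text", "body"):
        codes = [el.get(XML_LANG) for el in find_all(root, name) if el.get(XML_LANG)]
        if codes:
            return codes

    # 4. <textLang mainLang="...">
    codes = [el.get("mainLang") for el in find_all(root, "textLang") if el.get("mainLang")]
    if codes:
        return codes

    # 5. xml:lang on the root <TEI> element, ignoring "en"
    if local(root.tag) == "TEI":
        code = root.get(XML_LANG)
        if code and code != "en":
            return [code]

    # 6. xml:lang on any element inside <body>, skipping <foreign>
    codes = []
    for body in find_all(root, "body"):
        for el in body.iter():
            if el is body or local(el.tag) == "foreign":
                continue
            code = el.get(XML_LANG)
            if code and code not in codes:
                codes.append(code)
    return codes
```

The returned values are the raw attribute values; any normalisation of the codes would be a separate step.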

Based on a sample of 500 repositories, approximately 40% declare a language in one of these ways. If the same repository appears in a different batch of search results, I apply the same algorithm to the first .xml file found in that search, and append any new languages to the end of the list. Repositories/files/languages are not currently removed from this database even if the underlying files are changed or removed from GitHub.
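
Appending only new languages to a repository's existing list could look like the following. This is a minimal sketch; the function name append_new_languages and the use of plain Python lists are assumptions made for illustration.

```python
def append_new_languages(known, found):
    """Return known plus any codes from found that are not yet listed.

    Existing entries keep their position, new codes go at the end,
    and nothing is ever removed.
    """
    merged = list(known)
    for code in found:
        if code not in merged:
            merged.append(code)
    return merged
```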

The TEI Guidelines recommend the use of BCP 47 language tags. In their fullest form, such tags can encode language, region, script, and certain other information. For simplicity, I have recorded only the language subtag and discarded the rest. BCP 47 prefers the shortest available ISO 639 code for the language subtag (e.g. de instead of deu for German), which means some languages have a two-character code and others a three-character code. For consistency, and to avoid privileging “more common” languages, I have normalised all two-character language codes to their three-character ISO 639-2 equivalent, choosing the Terminology variant where applicable (e.g. de becomes deu, not ger).
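
As a rough sketch of this normalisation, the function below keeps only the language subtag and maps two-character codes to their three-character Terminology equivalents. The name normalise and the lookup table are hypothetical, and the table is deliberately tiny; a real one would cover every ISO 639-1 code. Keeping private-use tags such as x-pamir whole is an assumption inferred from the table below, where such tags appear unshortened.

```python
# Illustrative subset of ISO 639-1 to ISO 639-2 (Terminology) codes.
ISO_639_1_TO_2T = {
    "de": "deu",  # not "ger", which is the Bibliographic variant
    "fr": "fra",
    "el": "ell",
    "zh": "zho",
    "en": "eng",
    "la": "lat",
}


def normalise(tag):
    """Reduce a BCP 47 tag to a three-character language code where possible."""
    tag = tag.strip().lower()
    if not tag:
        return None
    # Private-use tags have no language subtag to extract; keep them as they are.
    if tag.startswith("x-"):
        return tag
    # The language subtag is everything before the first hyphen.
    primary = tag.split("-", 1)[0]
    # Codes that are not in the table (including three-character codes) pass through.
    return ISO_639_1_TO_2T.get(primary, primary)
```

For example, normalise("de-CH") gives "deu", while a three-character code such as "grc" is returned unchanged.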

I have not provided filtering of the search results by language, as this is too complicated to implement for a large data set on a static website. CSV and JSON data files containing all the results are available on the home page if you need more advanced filtering.
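
Filtering a downloaded CSV file by language might look something like the sketch below. The layout of the data files is not described here, so the column names repository and languages, the semicolon separator, and the function name repositories_for_language are all assumptions; adjust them to match the actual file.

```python
import csv
import io


def repositories_for_language(csv_text, code):
    """Return the repositories whose language list contains code.

    Assumes a header row with hypothetical columns "repository" and
    "languages", where "languages" holds semicolon-separated codes.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["repository"]
        for row in reader
        if code in row["languages"].split(";")
    ]


# Made-up sample data in the assumed layout.
sample = (
    "repository,languages\n"
    "example/repo-a,deu;lat\n"
    "example/repo-b,fra\n"
    "example/repo-c,lat\n"
)
matches = repositories_for_language(sample, "lat")
```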

Language Code	Language Name	Matching Repositories
sah	Yakut	1
nbo		1
que	Runa Simi	1
btd		1
bua	Buriat	1
jav	Basa Jawa	1
ven	Tshivenḓa	1
apa	Ndéé	1
fil	Filipino; Pilipino	1
moh	Mohawk	1
itl		1
kir	Kırgızca / Кыргызча	1
ary	الدارجة المغربية	1
chm	Mari	1
onw		1
ile	Interlingue	1
da		1
ldd		1
olt	Old Lithuanian	1
aii	ܣܘܪܝܬ	1
yej	יעואני גלוסא	1
ted		1
wrm	Warumungu	1
vol	Volapük	1
tpw	Tupinambá	1
gug	avañeʼẽ	1
nzz	Naŋa tegu	1
lub	Luba-Katanga	1
ott		1
gcr		1
sem	Semitic languages	1
osp		1
cpe	Creoles and pidgins, English based	1
arg	Aragonés	1
chr	Cherokee	1
art	Artificial languages	1
aka	Akana	1
tmr	Jewish Babylonian Aramaic	1
all		1
tup	Tupi languages	1
pcd	Picard	1
oar	Old Aramaic	1
jpa	Jewish Palestinian Aramaic	1
txh		1
kho	Khotanese; Sakan	1
kaz	Қазақ Tілі	1
wbl		1
ira	Iranian languages	1
vep		1
elx	Elamite	1
ajw	Ajawa	1
krl	Karelian	1
x-pamir		1
xpr		1
x-balochi		1
ddo		1
fij	Na Vosa Vakaviti	1
alg	Algonquian languages	1
sve		1
gae		1
bat	Baltic languages	1
gai		1
yor	Yorùbá	1
enn		1
x-sarm		1
xbo		1
xbc		1
xme		1
zab		1
xzp	Zapotec	1
hac		1
ars		1
bag		1
udi		1
iso		1
x-oldirn		1
bsk		1
gad	Gaddang	1
bis	Bislama	1
zul	isiZulu	1
yua	maayaʼ tʼàan	1
xho	isiXhosa	1
win	Hoocą́k hoit'éra	1
aze	Azərbaycanca / آذربايجان	1
mga	Irish, Middle (900-1200)	1
rue	русинськый язык; руски язик	1
x-nuristan		1
ckb		1
ohu		1
mwr	Marwari	1
x-vaynakh		1
mag	मगही	1
kar	Karen languages	1
chv	Чăваш	1
ani		1
kap		1
yai		1
esx	Eskaleut	1
dak	Dakȟótiyapi	1
wne		1
brx	बर'	1
asm	অসমীয়া	1
ysc		1
mhd		1
lin	Lingála	1
alt	Southern Altai	1
prs	دری	1
gbz		1
snd	सिनधि	1
map	Austronesian languages	1
chg	Chagatai	1
cau	Caucasian languages	1
ttt		1
oos		1
ori	ଓଡ଼ିଆ	1
rut		1
ags		1
lzh	古文 / 文言	1
bya	Palawan Batak	1
oji	ᐊᓂᔑᓈᐯᒧᐎᓐ / Anishinaabemowin	1
ndl		1
khm	ភាសាខ្មែរ	1
x-oldkhmer		1
xtq		1
tab		1
mns		1
kca		1
tly		1
x-rushani		1
x-mordvin		1
uby		1
cha	Chamoru	1
ccs		1
reg		1
kdr		1
nil		1
udm	Udmurt	1
ydg		1
sqj		1
xmf		1
ewe	Ɛʋɛ	1
x-tchr		1
bra	Braj	1
xln		1
lik		1
ady	Aдыгэбзэ	1
kom	Коми	1
ing		1
lbe		1
myn	Mayan languages	1
prg		1
orv		1
lez	Lezghian	1
kbd	Kъэбэрдеибзэ	1
bbl		1
isk		1
khw		1
uzb	Ўзбек	1
kjj		1
qwm		1
hit	Hittite	1
nno	Norsk (nynorsk)	1
sga	Irish, Old (to 900)	1
abq		1
xto		1
shn	Shan	1
sgy		1
yah		1
srh		1
sgh		1
mnj		1
prc		1
awa	Awadhi	1
oru		1
smy		1
zkh		1
abk	Аҧсуа	1
bos	Босански / Bosanski	1
oro		1
orm	Oromoo	1
lzz		1
inh	ГӀалгӀай мотт	1
tru	ܛܘܪܝܐ	1
tkr		1
aqc		1
cai	Central American Indian languages	1
dar	Дарган Mез	1
kum	Къумукъ Tил	1
nog	Nogai	1
krc	Karachay-Balkar	1
agx		1
zza	Zaza; Dimili; Dimli; Kirdki; Kirmanjki; Zazaki	1
x-vicav		1
inm	Minaean	1
pau	Palauan	1
sva		1
roa	Romance languages	1
x-dardic		1
aer		1
xsc		1
apc	اللهجة الشامي الشمال	1
paq		1
tgk	Тоҷикӣ	1
trk		1
xld		1
zkz		1
ett		1
iir		1
brh	براہوئی	2
nxq	Naqxi geezheeq	2
lao	ລາວ / Pha xa lao	2
dzo	ཇོང་ཁ	2
x-sa		2
toi	iSitonga	2
mck	Mbúùnda / Chimbúùnda	2
lun	Lunda	2
loz	Lozi	2
bnt	Bantu languages	2
lea	Shabunda Lega	2
nym	Nyamwezi	2
aeb	تونسي	2
ofs	Frysk	2
del	Delaware	2
x-grc		2
mlt	Malti	2
smi	Sami languages	2
che	Нохчийн	2
ibe	Abesabesi	2
pit		2
mis	Uncoded languages	2
osx	Sahsisk	2
hil	Hiligaynon	2
gem	Germanic languages	2
new	नेपाल भाषा	2
xct	Classical Tibetan	2
ton	Lea Faka-Tonga	2
pam	Pampanga; Kapampangan	2
cos	Corsu	2
kur	Kurdí, کوردی, or K’öрди	2
ave	Avestan	2
ido	Ido	2
ilo	Iloko	2
roh	Rumantsch	2
glv	Gaelg	2
bak	Башҡорт	2
bbc		2
got	Gothic	2
sla	Slavic languages	2
tso	Xitsonga	2
txb		2

TEIhub Discover TEI-encoded documents from GitHub public repositories.
