The Proposed Solution – Tonalli Corpus

We propose to create a robust parallel corpus covering the Nahuatl and Totonaco variants, with translations into Spanish and English. The dataset will be built through a combination of fieldwork, community engagement, and aggregation of existing resources. Our goal is to fill the gap by providing a detailed and representative collection of these language variants that supports linguistic research, NLP applications, and the preservation of cultural heritage.

This project aims to create a transformative dataset that captures the rich linguistic diversity of Nahuatl and Totonaco in Puebla. Through meticulous data collection, rigorous quality control, and strong community engagement, we will produce a valuable resource that supports both technological innovation and cultural preservation.

Specifications and Deliverables for Proposed Data and Documentation

Quantity of Data

We will cover the five variants of Nahuatl and the four variants of Totonaco spoken in Puebla State. For each variant we will sample 24 speakers, each contributing 20 minutes of recorded speech, as follows:

SAMPLE	WOMEN	MEN	HOURS OF RECORDED SPEECH
Older than 60 years old	3	3	2
Older than 40 years old	3	3	2
Older than 20 years old	3	3	2
Younger than 20 years old	3	3	2
TOTAL	12	12	8

Indigenous language	Recording (Hours)	Transcription (24 documents each)	Translation to Spanish	Translation to English
Nahuatl Variant 1	8	X	X	X
Nahuatl Variant 2	8	X	X	X
Nahuatl Variant 3	8	X	X	X
Nahuatl Variant 4	8	X	X	X
Nahuatl Variant 5	8	X	X	X
Totonaco Variant 1	8	X	X	X
Totonaco Variant 2	8	X	X	X
Totonaco Variant 3	8	X	X	X
Totonaco Variant 4	8	X	X	X
TOTAL	72 hours	216 docs	216 docs	216 docs

Thus, the target is a total of 72 hours of recorded speech and 216 transcribed texts, plus 432 translations (216 into Spanish and 216 into English), for 648 text documents in all, covering all known regional variants of Nahuatl and Totonaco in Puebla State.
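The totals can be checked directly from the sampling plan. The following is a minimal sketch in Python; the constant and function names are illustrative assumptions, and only the figures come from the tables in this section.

```python
# Figures taken from the sampling plan in this section.
VARIANTS = {"Nahuatl": 5, "Totonaco": 4}
AGE_GROUPS = 4                 # older than 60, 40, 20; younger than 20
SPEAKERS_PER_AGE_GROUP = 6     # 3 women + 3 men
MINUTES_PER_SPEAKER = 20
TARGET_LANGUAGES = ("Spanish", "English")


def corpus_totals():
    """Derive the corpus-wide targets from the per-variant sampling plan."""
    n_variants = sum(VARIANTS.values())
    speakers_per_variant = AGE_GROUPS * SPEAKERS_PER_AGE_GROUP
    hours_per_variant = speakers_per_variant * MINUTES_PER_SPEAKER / 60
    transcriptions = n_variants * speakers_per_variant
    translations = transcriptions * len(TARGET_LANGUAGES)
    return {
        "variants": n_variants,
        "speakers_per_variant": speakers_per_variant,
        "recorded_hours": n_variants * hours_per_variant,
        "transcriptions": transcriptions,
        "translations": translations,
        "text_documents": transcriptions + translations,
    }


if __name__ == "__main__":
    for name, value in corpus_totals().items():
        print(f"{name}: {value}")
```

Under these figures, each transcription is translated into two languages, so the text side of the corpus is three times the number of recordings.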

Types and Format of Data

Audio recordings in WAV format.
Transcriptions and translations in UTF-8 encoded plain text.
Annotated data in XML/JSON format for linguistic features and translations.
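As an illustration of how the three formats could fit together, the sketch below builds one JSON record per speaker that points to the WAV file and carries the transcription and both translations. The field names (`speaker_id`, `variant`, `audio_file`, and so on) are hypothetical, not a finalized schema; the final annotation layout would be agreed with the community and the linguists involved.

```python
import json


def make_record(speaker_id, variant, age_group, gender,
                audio_file, transcription, spanish, english):
    """Bundle one speaker's recording with its transcription and translations.

    All field names here are assumptions made for illustration.
    """
    return {
        "speaker_id": speaker_id,
        "variant": variant,
        "age_group": age_group,
        "gender": gender,
        "audio_file": audio_file,          # path to the WAV recording
        "transcription": transcription,    # text in the indigenous language
        "translations": {"es": spanish, "en": english},
    }


def to_json(record):
    # ensure_ascii=False keeps non-ASCII characters readable in UTF-8 output
    return json.dumps(record, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    record = make_record(
        "S001", "Nahuatl Variant 1", "older than 60", "woman",
        "S001.wav", "<transcription>", "<traducción>", "<translation>",
    )
    print(to_json(record))
```

Keeping the audio as a file reference rather than embedding it lets the plain-text transcriptions and the JSON annotations be versioned separately from the large WAV files.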

Metrics

Fairness metrics: Ensure balanced representation of all nine variants of Nahuatl and Totonaco, with equal numbers of women and men in each age group.
Quality Assurance (QA): Benchmark against existing linguistic standards and cross-validate with native speakers.
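One simple way to monitor representation during fieldwork is to count recorded speakers per variant, age group, and gender and flag any cell below the target of three. The sketch below is one possible check, not a prescribed metric; the labels and the `coverage_gaps` helper are assumptions for illustration.

```python
from collections import Counter
from itertools import product

AGE_GROUPS = ("older than 60", "older than 40", "older than 20", "younger than 20")
GENDERS = ("woman", "man")
TARGET_PER_CELL = 3  # speakers per variant, age group, and gender


def coverage_gaps(speakers, variants, target=TARGET_PER_CELL):
    """Return {(variant, age_group, gender): missing_count} for under-filled cells.

    `speakers` is an iterable of (variant, age_group, gender) tuples,
    one per recorded speaker.
    """
    counts = Counter(speakers)
    gaps = {}
    for cell in product(variants, AGE_GROUPS, GENDERS):
        missing = target - counts[cell]
        if missing > 0:
            gaps[cell] = missing
    return gaps
```

An empty result means every cell of the sampling plan has reached its target; any remaining entries show where further recruitment is needed.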

Intended Beneficiaries and Use Cases

Intended Beneficiaries

Linguists and researchers in indigenous languages.
NLP and AI developers focusing on low-resource languages.
Indigenous communities aiming to preserve and revitalize their language.

Use Cases

Development of multilingual language learning applications.
Enhanced machine translation systems for Nahuatl and Totonaco.
Automatic identification of spoken variants of Nahuatl and Totonaco.
Image captioning for Nahuatl and Totonaco speakers.
Linguistic research and documentation with multilingual support.

Respecting Diversity

Our dataset will reflect the cultural and linguistic diversity of Nahuatl- and Totonaco-speaking communities, ensuring fair representation and consultation with community leaders and native speakers throughout the project. Data gathering will include women and men in equal numbers across the following age categories: older than 60, older than 40, older than 20, and younger than 20. Translations will be validated to ensure accuracy and cultural relevance.