Data set name: AraVeLex
Citation (if available): Beniamine, Sacha (2023). AraVeLex: Modern Standard Arabic Verbal lexicon [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10100678
Data set developer(s): Sacha Beniamine
Data sheet author(s): Sacha Beniamine
Others who contributed to this document: None
Motivation
For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.
This dataset was created to study the inflectional morphology of Modern Standard Arabic. It was first prepared for the comparative work in my PhD dissertation. It is suitable for NLP use and for linguistic investigation.
Who created the dataset (for example, which team, research group) and on behalf of which entity (for example, company, institution, organization)?
The original data was crowdsourced (Wiktionary), then extracted as part of the UniMorph project. This dataset was first adapted from the 2016 UniMorph data by Sacha Beniamine as part of his PhD dissertation at the Laboratoire de Linguistique Formelle (LLF) of Université Paris Cité.
Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.
The initial preparation of this dataset was funded by the Morph 1 operation of strand ("axe") 2 of the Laboratoire d'excellence "Fondements Empiriques de la Linguistique" (Labex EFL). The preparation of this release in Paralex format was funded by a Leverhulme Early Career Fellowship (ECF-2022-286) awarded to Sacha Beniamine.
Composition
Paralex datasets document paradigms of inflected forms.
Are forms given as orthographic, phonetic, and/or phonemic sequences?
Forms are given as orthographic strings, as a romanisation following the Wiktionary scheme, and as phonemic transcriptions.
How many instances are there in total?
- Number of inflected forms: 132656
- Number of lexemes: 1046
- Maximal paradigm size in cells: 134
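The counts above can be reproduced from the forms table. Below is a minimal sketch (not part of the dataset's own tooling), assuming a Paralex-style forms.csv with "lexeme" and "cell" columns; the file path and column names should be checked against the actual release.

```python
# Sketch: reproduce the instance counts from a Paralex-style forms table.
# Assumes columns "lexeme" and "cell"; adjust to the actual release.
import pandas as pd

forms = pd.read_csv("forms.csv")
print("inflected forms:", len(forms))         # expected: 132656
print("lexemes:", forms["lexeme"].nunique())  # expected: 1046
print("max paradigm size (cells):",
      forms.groupby("lexeme")["cell"].nunique().max())  # expected: 134
```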
Language varieties
Languages differ from each other in structural ways that can interact with NLP algorithms. Within a language, regional or social dialects can also show great variation (Chambers and Trudgill, 1998). The language and language variety should be described with a language tag from BCP-47 identifying the language variety (e.g., en-US or yue-Hant-HK), and a prose description of the language variety, glossing the BCP-47 tag and also providing further information (e.g., "English as spoken in Palo Alto, California", or "Cantonese written with traditional characters by speakers in Hong Kong who are bilingual in Mandarin").
- BCP-47 language tag: arb
- Language variety description: Modern Standard Arabic
Does the data pertain to specific dialects, geographical locations, genre, etc.?
No.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (for example, geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (for example, to cover a more diverse range of instances, because instances were withheld or unavailable).
The dataset includes all verbs from the 2016 UniMorph dataset (as given in raw/).
The verbs come from Wiktionary, which is crowdsourced, so the sample is not necessarily coherent with regard to genre, geographical location, etc.
Is any information missing from individual instances?
If so, please provide a description, explaining why this information is missing (for example, because it was unavailable). This does not include intentionally removed information, but might include, for example, redacted text.
- What was the reasoning behind the selection of defective forms?
All forms marked with "—" in Wiktionary are counted as defective. This affects 13860 forms across 231 lexemes.
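For illustration, a hypothetical sketch of counting defective entries is given below. How defectivity is encoded in the forms table (an empty form cell, or a dedicated marker such as "#DEF#") is an assumption here and should be checked against the dataset's metadata.

```python
# Hypothetical sketch: count defective entries, assuming they are encoded
# either as missing values or as a "#DEF#" marker in the "phon_form" column.
import pandas as pd

forms = pd.read_csv("forms.csv")
defective = forms[forms["phon_form"].isna() | (forms["phon_form"] == "#DEF#")]
print(len(defective), "defective forms across",
      defective["lexeme"].nunique(), "lexemes")  # expected: 13860 / 231
```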
Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
Wiktionary data should be taken with a grain of salt: crowdsourcing is an important potential source of noise, which has not been quantified or evaluated. Moreover, this dataset derives from the 2016 parse of Wiktionary by the UniMorph project: both Wiktionary and the UniMorph crawler have since been improved.
- How is certainty expressed (doubtful or unattested forms)?
This is not given by the source, nor in this dataset.
Is the dataset self-contained, or does it link to or otherwise rely on external resources (for example, websites, tweets, other datasets)?
If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (that is, including the external resources as they existed at the time the dataset was created); c) are there any restrictions (for example, licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.
The dataset is self-contained.
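As a Paralex dataset, it can be loaded as a Frictionless data package. A minimal sketch follows; the name of the JSON descriptor ("aravelex.package.json") is an assumption, so use whichever descriptor ships at the root of the release.

```python
# Sketch: load the self-contained release as a Frictionless data package.
# The descriptor file name is assumed; check the release contents.
from frictionless import Package

package = Package("aravelex.package.json")
print([resource.name for resource in package.resources])
```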
If linking to vocabularies from other databases (such as databases of features, cells, sounds, languages, or online dictionaries), were there any complex decisions in the matching of entries from this dataset to those of the vocabularies (e.g. inexact language codes)?
When matching to the new UniMorph cell schema, we repeat any cells which are not specified for gender (e.g. first person, or second person dual) to match them to each of our syncretic, gendered cells; see the sketch below.
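A sketch of how users might apply this mapping is given below. The column names ("cell_id" for the cells table, "unimorph" for the mapping column) are assumptions and may differ in the actual release.

```python
# Sketch: map the dataset's gendered cells to the new UniMorph scheme
# via the cells table. Column names are assumed, not guaranteed.
import pandas as pd

forms = pd.read_csv("forms.csv")
cells = pd.read_csv("cells.csv")
mapping = dict(zip(cells["cell_id"], cells["unimorph"]))
forms["unimorph_cell"] = forms["cell"].map(mapping)
# Several original cells can map to one UniMorph cell; deduplicate if merging:
merged = forms.drop_duplicates(subset=["lexeme", "unimorph_cell", "phon_form"])
```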
Does the dataset contain data that might be considered confidential (for example, data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? If so, please provide a description.
No.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.
No, although no filtering on meaning (obscenities, etc.) has been performed.
Collection process
What is the provenance of each table (lexemes, cells, forms, frequencies, sounds, features), as well as of segmentation marks, if any? Is any information derived from other datasets?
Was any information (forms, lexemes, cells, frequencies) extracted from a corpus or a dictionary, elicited, extracted from field notes, digitized from grammars, or generated? What are the sources?
- Lexemes and forms were computed from the raw file, using the script format_lexicon.py
- Cells and features were elaborated by hand in order to map between annotation schemes
- The sounds table was first generated using Hayes' universal spreadsheet (Hayes 2012), then corrected by hand.
See:
Hayes, Bruce (2012). Spreadsheet with segments and their feature values. Distributed as part of course material for Linguistics 120A: Phonology I at UCLA. URL: http://www.linguistics.ucla.edu/people/hayes/120a/index.htm.
How were paradigms separated between lexemes (e.g. in the case of homonyms or variants)? What theoretical or practical choices were made?
We counted one lexeme per paradigm table within a single Wiktionary page. Distinguishing between tables relies on the order of cells in the raw file.
How was the paradigm structure (set and labels of paradigm cells) decided? What theoretical or practical choices were made?
- In this dataset, the feature values for the two main tense/aspect categories are called Perfect and Imperfect, rather than, respectively, Past and Present, Finished and Unfinished, etc. See Bahloul 2007, p. 37--43 for a summary of the topic.
- The dataset, following the original UniMorph data, comprises fully syncretic series of cells where there is systematic syncretism across gender (e.g. in the first person or the second person dual). The new UniMorph scheme, mapped in the cells table, merges these. Dataset users can thus merge the forms by mapping the cells to this scheme.
See:
Bahloul, M. (2007). Structure and function of the Arabic verb. Routledge.
What is the expertise of the contributors with the documented language? Are they areal experts, language experts, native speakers?
The expertise of Wiktionary contributors is not publicly available. The author is a computational morphologist, neither an areal nor a language expert of Arabic, nor a native speaker. Samples of the generated phonological forms (a few hundred forms in total) were reviewed in 2017 by a native speaker.
How was the data collected (for example, manual human curation, generation by software programs, software APIs, etc)? How were these mechanisms or procedures validated?
Crowdsourced manual curation (Wiktionary), with some automation (Wiktionary inflection templates).
If the dataset is a sample from a larger set, what was the sampling strategy (for example, deterministic, probabilistic with specific sampling probabilities)?
Curation rationale: Which lemmas, forms, and cells were included, and what were the goals in selecting entries, both in the original collection and in any further sub-selection? This can be especially important in datasets too large to thoroughly inspect by hand. An explicit statement of the curation rationale can help dataset users make inferences about what other kinds of texts systems trained with them could conceivably generalize to.
NA
Who was involved in the data collection process (for example, students, crowdworkers, contractors) and how were they compensated (for example, how much were crowdworkers paid)?
- Kenza Ould Hamouda (2017): 60-hour internship, no compensation. Undergraduate student in Linguistics at Université Paris Cité (then Paris 7).
Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (for example, recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.
Prior to 2016.
Were any ethical review processes conducted (for example, by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.
NA
Preprocessing/cleaning/labeling
How were frequencies measured? If this was done directly from a corpus, is the software for frequency extraction available?
NA
How were the inflected forms obtained? If generated, what was the generation process? Is the software for generation available?
Extracted from Wiktionary; see Kirov et al. 2016.
Kirov, Christo, et al. (2016). "Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms". In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Ed. by Nicoletta Calzolari et al. Portorož, Slovenia: European Language Resources Association (ELRA). ISBN: 978-2-9517408-9-1. URL: https://aclanthology.org/L16-1498.pdf
How were the phonological or phonemic transcriptions obtained? If generated, what was the generation process? Is the software for generation available?
The phonemic transcriptions were generated using the Epitran software, starting from the nearly phonemic romanisation found in Wiktionary. The rules used can be found in the folder .epitran/. Random samples of a few hundred forms were manually validated.
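For orientation, the general Epitran API is illustrated below. Note that the dataset itself was built with the custom rule files in .epitran/ applied to the Wiktionary romanisation, not the stock mapping shown here; the availability of the "ara-Arab" code depends on your Epitran version.

```python
# Illustration of the general Epitran API (not the dataset's own pipeline,
# which used custom rule files from the .epitran/ folder).
import epitran

epi = epitran.Epitran("ara-Arab")  # stock Arabic mapping, for illustration only
print(epi.transliterate("كتب"))    # prints an IPA transcription
```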
If relevant, how were the forms segmented?
NA
Was any preprocessing/cleaning/labeling of the data done (for example, discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values, cleaning of labels, mapping between vocabularies, etc)? If so, please provide a description. If not, you may skip the remaining questions in this section. This includes estimation of frequencies.
Mapping between vocabularies is given in the cells table.
Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (for example, to support unanticipated future uses)? If so, please provide a link or other access point to the "raw" data.
Yes, it is available in the raw/ folder.
Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.
See format_lexicon.py and gen-metadata.py.
Uses
Has the dataset been used for any published work already? If so, please provide a description.
Yes, earlier versions of the dataset were used in the following comparative works:
- Sacha Beniamine, Matías Guzmán Naranjo (2021). Multiple alignments of inflectional paradigms. In: Proceedings of the Society for Computation in Linguistics: Vol. 4, Article 21.
- Sacha Beniamine (2021). One lexeme, many classes: inflection class systems as lattices. In: One-to-Many Relations in Morphology, Syntax and Semantics.
- Sacha Beniamine (2018). Classifications flexionnelles: Étude quantitative des structures de paradigmes. PhD thesis, Université Sorbonne Paris Cité - Université Paris Diderot (Paris 7), supervised by Olivier Bonami.
Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.
No
What (other) tasks could the dataset be used for?
Any NLP task concerned with inflection and based on phonemic form; linguistic investigations into inflection, whether quantitative or qualitative.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (for example, stereotyping, quality of service issues) or other risks or harms (for example, legal risks, financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or harms?
Both the crowdsourced nature of the original data and the automated generation of the transcriptions bring limitations: the sample of verbs was selected neither for frequency nor for any sort of coherence, and only a sample of the transcriptions was manually validated.
Are there tasks for which the dataset should not be used? If so, please provide a description.
NA
Distribution
Will the dataset be distributed to third parties outside of the entity (for example, company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.
No.
How will the dataset be distributed (for example, tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?
DOI: https://doi.org/10.5281/zenodo.10100678. The DOI points to a Zenodo deposit. The dataset is also available as a repository on GitLab: https://gitlab.com/sbeniamine/aravelex
When will the dataset be distributed?
It is already distributed.
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/ or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.
License: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), following the license of the original UniMorph data from which this dataset is derived.
Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.
No.
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.
No.
Any other comments?
No.
Maintenance
Who will be supporting/hosting/maintaining the dataset?
Sacha Beniamine
How can the owner/curator/manager of the dataset be contacted (for example, email address)?
Please raise an issue on the GitLab repository, or email at s.
Is there an erratum? If so, please provide a link or other access point.
No.
Will the dataset be updated (for example, to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (for example, mailing list, GitHub)?
Yes, whenever relevant. Updates will be pushed to GitLab and lead to new versions, themselves pushed to Zenodo.
If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (for example, were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.
No.
Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.
Yes, thanks to Zenodo and GitLab.
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.
I welcome merge requests on GitLab.
Any other comments?
No.