AraVeLex: Modern Standard Arabic Verbal lexicon

Cite as: Beniamine, Sacha (2023). AraVeLex: Modern Standard Arabic Verbal lexicon [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10100678

This is an updated version of the lexicon published as part of:

Beniamine, Sacha (2018, July). Classifications flexionnelles: Étude quantitative des structures de paradigmes. PhD thesis, Université Sorbonne Paris Cité - Université Paris Diderot.

That lexicon is itself based on the 2016 version of the Unimorph dataset for Arabic. We keep that original file in raw/, because the newer versions found at https://github.com/unimorph/ara/blob/master/ara carry less information (in particular, they do not include the romanisations). See:

Kirov, Christo et al. (2016). "Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms". In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Edited by Nicoletta Calzolari (Conference Chair) et al. Portorož, Slovenia: European Language Resources Association (ELRA). ISBN: 978-2-9517408-9-1. URL: https://aclanthology.org/L16-1498.pdf

Phonemic transcriptions are obtained automatically by rule, starting from the Wiktionary romanisation, which is relatively transparent, and were validated by a native speaker. A number of changes were made with respect to the dataset from Beniamine (2018):

The dataset is fully Paralex compliant and includes tables for cells, features, lexemes and sounds
Transcription changes:
We use Epitran for grapheme to phoneme conversion, with a custom set of rules (see epitran/)
The letter Sin (<س>) is always transcribed [s], never [sˤ]; Beniamine (2018) wrote [sˤ] in a number of cases where it is followed by voiceless consonants
When there was a single romanisation but multiple orthographic forms, we keep only the first one
Cells and feature changes:
We use the terms imperfect (imperf) and perfect (prf) rather than present and past in feature-values
A mapping for cells is given from our scheme to both the old Unimorph one (in raw/) and the new one (found at https://github.com/unimorph/ara/blob/master/ara)
Lexeme changes: Different verbal paradigms from the same page have been distinguished.
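
The rule for duplicate orthographic forms listed under the transcription changes (one romanisation, several spellings, keep the first) can be sketched as follows. This is only an illustration, not the code in format_lexicon.py: the field names (lexeme, cell, roman, orth), the cell labels and the sample rows are all hypothetical.

```python
# Hypothetical rows: field names and cell labels are illustrative assumptions,
# not the actual column names used in raw/ or by format_lexicon.py.
rows = [
    {"lexeme": "kataba", "cell": "prf.3.m.sg", "roman": "kataba", "orth": "كَتَبَ"},
    {"lexeme": "kataba", "cell": "prf.3.m.sg", "roman": "kataba", "orth": "كتب"},
    {"lexeme": "kataba", "cell": "prf.3.f.sg", "roman": "katabat", "orth": "كَتَبَتْ"},
]


def keep_first_orth(rows):
    """Keep only the first row for each (lexeme, cell, romanisation) key."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["lexeme"], row["cell"], row["roman"])
        if key in seen:
            # Same romanisation in the same cell: a later spelling, dropped.
            continue
        seen.add(key)
        kept.append(row)
    return kept


deduplicated = keep_first_orth(rows)
```

Because the key includes the romanisation, two forms of the same cell that differ in their romanisation are both kept; only additional spellings of one romanisation are discarded.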

Re-generating data

The lexemes and forms tables are generated from the data in raw/.

Install dependencies:

pip install -r requirements.txt

Generate the lexemes and forms tables:

python3 format_lexicon.py

Then, generate metadata:

python3 gen-metadata
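
Once the tables are generated, a quick sanity check is to load the forms table and group it into paradigms. The sketch below assumes the forms table is a CSV using the Paralex column names form_id, lexeme, cell and phon_form, with phon_form holding space-separated segments; check these against the generated file. The sample rows and cell labels are made up for illustration, and the data is read from an in-memory string so that the example runs on its own.

```python
import csv
import io

# Made-up sample standing in for the generated forms table (assumed layout).
SAMPLE = """form_id,lexeme,cell,phon_form
1,kataba,prf.3.m.sg,k a t a b a
2,kataba,prf.3.f.sg,k a t a b a t
3,darasa,prf.3.m.sg,d a r a s a
"""


def load_paradigms(handle):
    """Group forms by lexeme, then by cell; each form is a list of segments."""
    paradigms = {}
    for row in csv.DictReader(handle):
        cells = paradigms.setdefault(row["lexeme"], {})
        # A cell may hold several forms, so forms are collected in a list.
        cells.setdefault(row["cell"], []).append(row["phon_form"].split())
    return paradigms


paradigms = load_paradigms(io.StringIO(SAMPLE))
```

To read the real table, pass an open file handle (opened with encoding="utf-8" and newline="") instead of the io.StringIO object.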