This work is licensed under CC BY-SA 3.0
See the automatically generated site for these paradigms
AraVeLex: Modern Standard Arabic Verbal lexicon
Cite as: Beniamine, Sacha (2023). AraVeLex: Modern Standard Arabic Verbal lexicon [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10100678
This is an updated version of the lexicon published as part of:
Beniamine, Sacha (2018, July). Classifications flexionnelles: Étude quantitative des structures de paradigmes. PhD thesis, Université Sorbonne Paris Cité - Université Paris Diderot.
That lexicon is itself based on the 2016 version of the UniMorph dataset for Arabic. We keep that original file in raw/, as the newer versions found at https://github.com/unimorph/ara/blob/master/ara do not present as much information (in particular, they do not include the romanisations). See:
Kirov, Christo et al. (2016). "Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms". In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Edited by Nicoletta Calzolari (Conference Chair) et al. Portorož, Slovenia: European Language Resources Association (ELRA). ISBN: 978-2-9517408-9-1. URL: https://aclanthology.org/L16-1498.pdf
Phonemic transcriptions are obtained automatically using rules, starting from the Wiktionary romanisation, which is relatively transparent, and were validated by a native speaker. A number of changes were made with respect to the dataset from Beniamine (2018):
- The dataset is fully Paralex compliant, and includes tables for cells, features, lexemes and sounds
- Transcription changes:
  - We use Epitran for grapheme-to-phoneme conversion, with a custom set of rules (see epitran/)
  - The letter Sin (<س>) is always transcribed [s], never [sˤ], whereas Beniamine (2018) wrote [sˤ] in a number of cases where it is followed by voiceless consonants
  - When there was a single romanisation but multiple orthographic forms, we keep only the first one
- Cells and feature changes:
  - We use the terms imperfect (imperf) and perfect (prf) rather than present and past in feature-values
  - Mapping for cells is given from our scheme to both the old UniMorph one (in raw/) and the new one (found at https://github.com/unimorph/ara/blob/master/ara)
- Lexemes changes: Different verbal paradigms from the same page have been distinguished.
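The rule of keeping only the first orthographic form when several share one romanisation can be illustrated with a minimal sketch. It is not the code used in format_lexicon.py: the function name, the dictionary keys (lexeme, cell, romanisation, orth_form) and the placeholder values are all assumptions made for illustration.

```python
def keep_first_orth_form(rows):
    """Keep one row per (lexeme, cell, romanisation): the first encountered.

    `rows` is a list of dicts; the key names are hypothetical and do not
    necessarily match the columns of the raw file.
    """
    seen = set()
    kept = []
    for row in rows:
        key = (row["lexeme"], row["cell"], row["romanisation"])
        if key in seen:
            # Same romanisation already kept: drop this extra orthographic form.
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Placeholder data (not taken from the lexicon).
rows = [
    {"lexeme": "lex1", "cell": "cellA", "romanisation": "rom1", "orth_form": "orth1a"},
    {"lexeme": "lex1", "cell": "cellA", "romanisation": "rom1", "orth_form": "orth1b"},
    {"lexeme": "lex1", "cell": "cellB", "romanisation": "rom2", "orth_form": "orth2"},
]

kept = keep_first_orth_form(rows)
for row in kept:
    print(row["lexeme"], row["cell"], row["romanisation"], row["orth_form"])
```

Rows are compared on the romanisation rather than on the orthography, so two spellings of the same form collapse into a single entry while genuinely distinct forms in the same cell are preserved.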
Re-generating data
The lexemes and forms tables are generated from the data in raw/.
Install dependencies:
pip install -r requirements.txt
Generate the lexemes and forms tables:
python3 format_lexicon.py
Then, generate metadata:
python3 gen-metadata