Restoration of Arabic Diacritics Using a Multilevel Statistical Model

Mohamed Seghir Hadj Ameur; Youcef Moulahoum; Ahmed Guessoum

doi:10.1007/978-3-319-19578-0_15

Conference paper, 2015

Affiliation (all three authors): USTHB - Université des Sciences et de la Technologie Houari Boumediene = University of Sciences and Technology Houari Boumediene [Alger]

Abstract

Arabic texts are generally written without diacritics. This is the case, for instance, in newspapers and contemporary books, and it makes automatic processing of Arabic texts more difficult. When diacritical signs are present, Arabic script provides more information about the meanings of words and their pronunciation. Vocalization of Arabic texts is a complex task which may involve morphological, syntactic and semantic text processing.

In this paper, we present a new approach to restore Arabic diacritics using a statistical language model and dynamic programming. Our system is based on two models: a bi-gram-based model, which is used first for vocalization, and a 4-gram character-based model, which is then used to handle the words that remain unvocalized (out-of-vocabulary, or OOV, words). Moreover, smoothing methods are used to handle the problem of unseen words. The optimal vocalized word sequence is selected using the Viterbi algorithm, a dynamic programming method.

Our approach represents an important contribution to the improvement of the performance of automatic Arabic vocalization. We have compared our results with some of the most efficient up-to-date vocalization systems; the experimental results show the high quality of our approach.
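The first stage described in the abstract (a smoothed bi-gram model decoded with the Viterbi algorithm) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`strip_diacritics`, `train`, `vocalize`), the use of add-one smoothing, and the two-sentence toy corpus are all assumptions made for illustration; the abstract only says that "smoothing methods" are used. The 4-gram character-based fallback for OOV words is not shown, so unknown words are simply returned unvocalized.

```python
import math
from collections import defaultdict

# Arabic combining marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def strip_diacritics(word):
    """Return the bare (unvocalized) form of a word."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def train(sentences):
    """Collect counts from a vocalized corpus (list of lists of words)."""
    candidates = defaultdict(set)  # bare form -> vocalized forms seen
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        unigrams[prev] += 1
        for word in sentence:
            candidates[strip_diacritics(word)].add(word)
            unigrams[word] += 1
            bigrams[(prev, word)] += 1
            prev = word
    return candidates, unigrams, bigrams


def log_prob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one smoothed bi-gram log-probability (smoothing choice is an assumption)."""
    count = bigrams.get((prev, word), 0)
    return math.log((count + 1) / (unigrams.get(prev, 0) + vocab_size))


def vocalize(bare_words, candidates, unigrams, bigrams):
    """Viterbi search for the most probable vocalized word sequence."""
    vocab_size = len(unigrams)
    # best[v] = (log-score, path) of the best path ending in vocalized word v
    best = {"<s>": (0.0, [])}
    for bare in bare_words:
        # Words never seen in training keep their bare form in this sketch.
        options = sorted(candidates.get(bare, {bare}))
        new_best = {}
        for cand in options:
            score, path = max(
                ((s + log_prob(prev, cand, unigrams, bigrams, vocab_size), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy corpus: the bare form "كتب" is ambiguous between "كَتَبَ" and "كُتُب".
corpus = [["الوَلَدُ", "كَتَبَ"], ["فِي", "كُتُب"]]
model = train(corpus)
print(vocalize(["في", "كتب"], *model))
```

In this toy setting the preceding word decides which vocalization of the ambiguous form is chosen, which is the role the bi-gram context plays in the approach the abstract describes.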

Domains

Computer Science [cs]

Main file

339159_1_En_15_Chapter.pdf (611.85 KB)

Origin: Files produced by the author(s)
Licence: CC BY 4.0 - Attribution

https://inria.hal.science/hal-01789976

Submitted on: Friday, May 11, 2018, 3:11:53 PM

Last modification on: Friday, August 5, 2022, 2:54:44 PM

Long-term archiving on: Tuesday, September 25, 2018, 11:18:50 AM

Dates and versions

hal-01789976 , version 1 (11-05-2018)

Licence

CC BY 4.0 - Attribution

Identifiers

HAL Id : hal-01789976 , version 1
DOI : 10.1007/978-3-319-19578-0_15

Cite

Mohamed Seghir Hadj Ameur, Youcef Moulahoum, Ahmed Guessoum. Restoration of Arabic Diacritics Using a Multilevel Statistical Model. 5th International Conference on Computer Science and Its Applications (CIIA), May 2015, Saida, Algeria. pp. 181-192, ⟨10.1007/978-3-319-19578-0_15⟩. ⟨hal-01789976⟩
