camel_tools.utils.dediac
This submodule contains functions for dediacritizing Arabic text, that is, removing diacritical marks such as short vowels, shadda, and sukun, in each of the supported encodings. See Encoding Schemes for more information on encodings.
Functions
- camel_tools.utils.dediac.dediac_ar(s)
Dediacritize a Unicode Arabic string.
- camel_tools.utils.dediac.dediac_bw(s)
Dediacritize a Buckwalter-encoded string.
- camel_tools.utils.dediac.dediac_safebw(s)
Dediacritize a Safe Buckwalter-encoded string.
- camel_tools.utils.dediac.dediac_xmlbw(s)
Dediacritize an XML Buckwalter-encoded string.
Examples
from camel_tools.utils.dediac import dediac_ar, dediac_bw
# Strings to dediacritize
sentence_ar = 'ثابِتُ الدّائِرَةِ هُوَ نِسبَةُ مُحِيطِها لِقُطرِها وَيُعرَفُ بِالثّابِتِ ط'
sentence_bw = 'vAbitu Ald~A}irapi huwa nisbapu muHiyTihA liquTrihA wayuErafu biAlv~Abiti T'
# Dediacritize
sentence_ar_dediac = dediac_ar(sentence_ar)
sentence_bw_dediac = dediac_bw(sentence_bw)
# Print results
print('Diacritized and dediacritized Arabic sentences:\n\t{}\n\t{}'.format(sentence_ar, sentence_ar_dediac))
print('Diacritized and dediacritized Buckwalter sentences:\n\t{}\n\t{}'.format(sentence_bw, sentence_bw_dediac))
This will output:
Diacritized and dediacritized Arabic sentences:
	ثابِتُ الدّائِرَةِ هُوَ نِسبَةُ مُحِيطِها لِقُطرِها وَيُعرَفُ بِالثّابِتِ ط
	ثابت الدائرة هو نسبة محيطها لقطرها ويعرف بالثابت ط
Diacritized and dediacritized Buckwalter sentences:
	vAbitu Ald~A}irapi huwa nisbapu muHiyTihA liquTrihA wayuErafu biAlv~Abiti T
	vAbt AldA}rp hw nsbp mHyThA lqTrhA wyErf bAlvAbt T
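For intuition about what dediacritization does, the following is a minimal standalone sketch that does not use camel_tools at all. The helper names strip_ar_diacritics and strip_bw_diacritics are hypothetical, and the diacritic inventories are assumptions made for this illustration (the Unicode range U+064B to U+0652 plus U+0670 for Arabic, and the usual Buckwalter symbols for the same marks). The character sets that the library functions actually remove may differ.

```python
import re

# Assumed diacritic inventories for this sketch (not taken from camel_tools).
# Arabic: fathatan through sukun (U+064B-U+0652) and dagger alif (U+0670).
_AR_DIAC_RE = re.compile('[\u064B-\u0652\u0670]')
# Buckwalter: F N K (tanwin), a u i (short vowels), ~ (shadda),
# o (sukun), ` (dagger alif).
_BW_DIAC_RE = re.compile('[FNKaui~o`]')


def strip_ar_diacritics(text):
    """Hypothetical helper: drop the assumed Arabic diacritic code points."""
    return _AR_DIAC_RE.sub('', text)


def strip_bw_diacritics(text):
    """Hypothetical helper: drop the assumed Buckwalter diacritic symbols."""
    return _BW_DIAC_RE.sub('', text)


# Buckwalter fragment taken from the example above.
print(strip_bw_diacritics('vAbitu Ald~A}irapi huwa nisbapu'))

# Three consonants, each followed by a fatha (U+064E), written with escapes
# so the combining marks are explicit.
diacritized = '\u0643\u064E\u062A\u064E\u0628\u064E'
print(strip_ar_diacritics(diacritized) == '\u0643\u062A\u0628')
```

Because a Buckwalter string uses plain ASCII letters for diacritics, a filter like this is only meaningful on text that is known to be Buckwalter encoded. Applied to ordinary Latin text it would delete letters such as a, i, and u, which is one reason to choose the function that matches the encoding of the input.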