The camel_diac tool allows you to diacritize Arabic text.
Below is the usage information that can be generated by running
camel_diac --help:
Usage:
    camel_diac [-d DATABASE | --db=DATABASE]
               [-m MARKER | --marker=MARKER]
               [-I | --ignore-markers]
               [-S | --strip-markers]
               [-p | --pretokenized]
               [-o OUTPUT | --output=OUTPUT] [FILE]
    camel_diac (-l | --list-schemes)
    camel_diac (-v | --version)
    camel_diac (-h | --help)

Options:
  -d DATABASE --db=DATABASE
        Morphology database to use. DATABASE could be the name of a builtin
        database or a path to a database file. [default: calima-msa-r13]
  -o OUTPUT --output=OUTPUT
        Output file. If not specified, output will be printed to stdout.
  -m MARKER --marker=MARKER
        Marker used to prefix tokens not to be transliterated.
        [default: @@IGNORE@@]
  -I --ignore-markers
        Transliterate marked words as well.
  -S --strip-markers
        Remove markers in output.
  -p --pretokenized
        Input is already pre-tokenized by punctuation. When this is set,
        camel_diac will not split tokens by punctuation but any tokens that
        do contain punctuation will not be diacritized.
  -l --list
        Show a list of morphological databases.
  -h --help
        Show this screen.
  -v --version
        Show version.
We provide builtin databases so that camel_diac can be run out of the box.
The name of a builtin database can be passed to the --db option.
A list of available databases can be found at Databases.
You can always check which builtin databases are provided in your current
camel_tools installation by running camel_diac -l.
Alternatively, you can pass in a path to a database of your choosing instead of
one of the above listed databases.
If no database is specified, calima-msa-r13 is used.
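As a rough illustration, a camel_diac invocation can be assembled from Python. This is only a sketch: the helper name build_camel_diac_command is hypothetical and not part of camel_tools, it uses only flags listed in the usage text above, and it builds the argument list without running anything.

```python
from typing import List, Optional


def build_camel_diac_command(
    input_path: Optional[str] = None,
    db: Optional[str] = None,
    output_path: Optional[str] = None,
    pretokenized: bool = False,
) -> List[str]:
    """Assemble an argument list for camel_diac (hypothetical helper)."""
    cmd = ["camel_diac"]
    if db is not None:
        # Either the name of a builtin database or a path to a database file.
        # When omitted, camel_diac falls back to its default database.
        cmd.append(f"--db={db}")
    if output_path is not None:
        # Without --output, camel_diac prints to stdout.
        cmd.append(f"--output={output_path}")
    if pretokenized:
        cmd.append("--pretokenized")
    if input_path is not None:
        # The optional FILE argument goes last.
        cmd.append(input_path)
    return cmd


if __name__ == "__main__":
    # "input.txt" is a placeholder file name used only for illustration.
    print(build_camel_diac_command("input.txt", db="calima-msa-r13"))
```

On a machine where camel_tools is installed, the resulting list could be handed to subprocess.run(cmd, check=True) to actually run the tool.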
Notes on markers
A marker is a string with no whitespace characters at the beginning, middle, or
end of it (in other words, it is a single token without padding spaces). As a
rule of thumb, pick a marker that is unlikely to appear in your text. We use
@@IGNORE@@ as the default value, while some Arabic NLP tools use
@@LAT@@ to denote Latin/foreign text.
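The marker constraints can be checked before preparing input text. The following is a minimal sketch, not part of camel_tools: is_valid_marker, mark_tokens, and is_ascii_word are hypothetical helpers, and the sketch assumes, based on the --marker option description, that the marker is attached directly to the front of each token that should be left untouched.

```python
from typing import Callable, List

DEFAULT_MARKER = "@@IGNORE@@"


def is_valid_marker(marker: str) -> bool:
    """A marker must be non-empty and contain no whitespace characters."""
    return len(marker) > 0 and not any(ch.isspace() for ch in marker)


def mark_tokens(
    tokens: List[str],
    should_skip: Callable[[str], bool],
    marker: str = DEFAULT_MARKER,
) -> List[str]:
    """Prefix every token selected by should_skip with the marker."""
    if not is_valid_marker(marker):
        raise ValueError(f"invalid marker: {marker!r}")
    # Assumption: the marker is glued to the token with no space in between.
    return [marker + tok if should_skip(tok) else tok for tok in tokens]


def is_ascii_word(token: str) -> bool:
    """Treat purely ASCII alphabetic tokens as Latin/foreign text."""
    return token.isascii() and token.isalpha()


if __name__ == "__main__":
    tokens = ["ذهب", "Ahmed", "إلى", "Paris"]
    print(" ".join(mark_tokens(tokens, is_ascii_word)))
```

The marked text can then be passed to camel_diac, optionally with --strip-markers so that the markers are removed from the output.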