camel_diac
==========

About
-----

The ``camel_diac`` tool allows you to diacritize Arabic text.

Usage
-----

Below is the usage information that can be generated by running
``camel_diac --help``.

.. code-block:: none

   Usage:
       camel_diac [-d DATABASE | --db=DATABASE]
                  [-m MARKER | --marker=MARKER]
                  [-I | --ignore-markers]
                  [-S | --strip-markers]
                  [-p | --pretokenized]
                  [-o OUTPUT | --output=OUTPUT] [FILE]
       camel_diac (-l | --list-schemes)
       camel_diac (-v | --version)
       camel_diac (-h | --help)

   Options:
     -d DATABASE --db=DATABASE
           Morphology database to use. DATABASE could be the name of a builtin
           database or a path to a database file. [default: calima-msa-r13]
     -o OUTPUT --output=OUTPUT
           Output file. If not specified, output will be printed to stdout.
     -m MARKER --marker=MARKER
           Marker used to prefix tokens not to be transliterated.
           [default: @@IGNORE@@]
     -I --ignore-markers
           Transliterate marked words as well.
     -S --strip-markers
           Remove markers in output.
     -p --pretokenized
           Input is already pre-tokenized by punctuation. When this is set,
           camel_diac will not split tokens by punctuation but any tokens that
           do contain punctuation will not be diacritized.
     -l --list
           Show a list of morphological databases.
     -h --help
           Show this screen.
     -v --version
           Show version.

Databases
---------

We provide builtin databases so that ``camel_diac`` can be run out of the box.
The name of a builtin database can be passed to ``-d`` or ``--db``.
A list of available databases can be found at :ref:`camel_morphology_dbs`.
You can always check which builtin databases are provided in your current
``camel_tools`` installation by running ``camel_diac --list``.
Alternatively, you can pass in a path to a database of your choosing instead of
one of the builtin databases.
If no database is specified, **calima-msa-r13** is used.

Notes on markers
----------------

A marker is a string with no whitespace characters at the beginning, middle, or
end of it (in other words, it is a single token without padding spaces).
As a rule of thumb, pick a marker that is unlikely to appear in your text.
We use ``@@IGNORE@@`` as the default value, while some Arabic NLP tools use
``@@LAT@@`` to denote Latin/foreign text.
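
The marker rules above can be illustrated with a small preprocessing sketch.
The helper names below (``is_valid_marker``, ``marker_is_safe``, and
``mark_latin_tokens``) are hypothetical and are not part of ``camel_tools``.
The sketch also assumes that the marker is attached directly to the front of a
token with no separating space, and that tokens containing ASCII Latin letters
are the ones you want left undiacritized.

.. code-block:: python

   import re

   DEFAULT_MARKER = "@@IGNORE@@"

   # Assumption: a token counts as "foreign" if it has any ASCII Latin letter.
   _LATIN_RE = re.compile(r"[A-Za-z]")


   def is_valid_marker(marker):
       """Return True if marker is a non-empty string with no whitespace."""
       return bool(marker) and not any(ch.isspace() for ch in marker)


   def marker_is_safe(marker, text):
       """Return True if marker does not already occur anywhere in text."""
       return marker not in text


   def mark_latin_tokens(line, marker=DEFAULT_MARKER):
       """Prefix every whitespace-separated token containing Latin letters."""
       if not is_valid_marker(marker):
           raise ValueError("marker must be a single token without whitespace")
       tokens = line.split()
       marked = [marker + tok if _LATIN_RE.search(tok) else tok for tok in tokens]
       return " ".join(marked)

Text prepared this way could then be passed to ``camel_diac`` with the same
marker given to ``-m``, optionally adding ``-S`` to remove the markers from the
output.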