camel_diac

About

The camel_diac tool allows you to diacritize Arabic text.

Usage

Below is the usage information that can be generated by running camel_diac --help.

Usage:
    camel_diac [-d DATABASE | --db=DATABASE]
               [-m MARKER | --marker=MARKER]
               [-I | --ignore-markers]
               [-S | --strip-markers]
               [-p | --pretokenized]
               [-o OUTPUT | --output=OUTPUT] [FILE]
    camel_diac (-l | --list)
    camel_diac (-v | --version)
    camel_diac (-h | --help)

Options:
  -d DATABASE --db=DATABASE
        Morphology database to use. DATABASE could be the name of a builtin
        database or a path to a database file. [default: calima-msa-r13]
  -o OUTPUT --output=OUTPUT
        Output file. If not specified, output will be printed to stdout.
  -m MARKER --marker=MARKER
        Marker used to prefix tokens not to be diacritized.
        [default: @@IGNORE@@]
  -I --ignore-markers
        Diacritize marked words as well.
  -S --strip-markers
        Remove markers in output.
  -p --pretokenized
        Input is already pre-tokenized by punctuation. When this is set,
        camel_diac will not split tokens by punctuation but any tokens that
        do contain punctuation will not be diacritized.
  -l --list
        Show a list of morphological databases.
  -h --help
        Show this screen.
  -v --version
        Show version.

Databases

We provide builtin databases so that camel_diac can be run out of the box. The name of any of them can be passed to -d or --db. A list of available databases can be found at Databases.

You can always check which builtin databases are provided in your current camel_tools installation by running camel_diac --list. Alternatively, you can pass a path to a database file of your choosing instead of the name of a builtin database.

If no database is specified, calima-msa-r13 is used.
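
As a rough illustration of how the database option combines with the other flags, the Python sketch below assembles a camel_diac command line as an argument list without running it. The helper name build_diac_command and the file names input.txt and diac.txt are hypothetical; only flags documented in the usage text above are used.

```python
from typing import List, Optional


def build_diac_command(input_path: Optional[str] = None,
                       db: Optional[str] = None,
                       output: Optional[str] = None,
                       pretokenized: bool = False) -> List[str]:
    """Assemble an argument list for camel_diac (hypothetical helper).

    Any option left as None is omitted, so camel_diac falls back to its
    documented defaults (for example, the calima-msa-r13 database).
    """
    cmd = ["camel_diac"]
    if db is not None:
        # DATABASE may be a builtin name or a path to a database file.
        cmd.append(f"--db={db}")
    if pretokenized:
        cmd.append("--pretokenized")
    if output is not None:
        cmd.append(f"--output={output}")
    if input_path is not None:
        cmd.append(input_path)
    return cmd


# Hypothetical file names, used only for illustration.
print(build_diac_command("input.txt", db="calima-msa-r13", output="diac.txt"))
```

If camel_tools is installed, the resulting list could be handed to subprocess.run(cmd, check=True). Building the command as a list rather than a single string avoids shell quoting problems when a database path contains spaces.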

Notes on markers

A marker is a string that contains no whitespace characters at the beginning, middle, or end (in other words, it is a single token without padding spaces). As a rule of thumb, pick a marker that is unlikely to appear in your text. We use @@IGNORE@@ as the default value, while some Arabic NLP tools use @@LAT@@ to denote Latin/foreign text.
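
The marker rule can be sketched in Python as a small preprocessing step. This is an illustration only, not part of camel_tools: the function names is_valid_marker and mark_latin_tokens are hypothetical, and the sketch assumes that the marker is attached directly to the front of a token with no separating space and that "foreign" text can be approximated by the presence of a basic Latin letter.

```python
import re

DEFAULT_MARKER = "@@IGNORE@@"


def is_valid_marker(marker: str) -> bool:
    """Return True if the marker is non-empty and contains no whitespace."""
    return bool(marker) and not any(ch.isspace() for ch in marker)


def mark_latin_tokens(line: str, marker: str = DEFAULT_MARKER) -> str:
    """Prefix each whitespace-separated token that contains a Latin letter.

    Assumption: the marker is glued directly to the token (no space).
    Note that runs of whitespace in the input are collapsed to single spaces.
    """
    if not is_valid_marker(marker):
        raise ValueError("marker must be a non-empty string without whitespace")
    tokens = line.split()
    marked = [marker + tok if re.search(r"[A-Za-z]", tok) else tok
              for tok in tokens]
    return " ".join(marked)


print(mark_latin_tokens("ذهب إلى Paris"))
```

The marked text could then be passed to camel_diac with the same marker given to -m, and -S could be used to remove the markers from the output.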