The camel_dediac tool dediacritizes (removes diacritics from) Arabic text written in any of several encoding schemes.


The usage information below can be generated by running camel_dediac --help.

    camel_dediac [-s <SCHEME> | --scheme=<SCHEME>]
                 [-m <MARKER> | --marker=<MARKER>]
                 [-I | --ignore-markers]
                 [-S | --strip-markers]
                 [-o OUTPUT | --output=OUTPUT] [FILE]
    camel_dediac (-l | --list)
    camel_dediac (-v | --version)
    camel_dediac (-h | --help)

  -s <SCHEME> --scheme=<SCHEME>
        The encoding scheme of the input text. [default: ar]
  -o OUTPUT --output=OUTPUT
        Output file. If not specified, output will be printed to stdout.
  -m <MARKER> --marker=<MARKER>
        Marker used to prefix tokens not to be de-diacritized.
        [default: @@IGNORE@@]
  -I --ignore-markers
        De-diacritize words prefixed with a marker.
  -S --strip-markers
        Remove prefix markers in output if --ignore-markers is set.
  -l --list
        Show a list of available input encoding schemes.
  -h --help
        Show this screen.
  -v --version
        Show version.
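
If you drive camel_dediac from a Python script, the options above map naturally onto an argument list. The following is a minimal sketch, not part of the tool itself: build_dediac_command and run_dediac are hypothetical helper names, and the file names are placeholders. It uses only the long options documented above and assumes camel_dediac is installed and on your PATH when run_dediac is called.

```python
import shlex
import subprocess


def build_dediac_command(input_path=None, scheme="ar", output_path=None,
                         marker=None, ignore_markers=False,
                         strip_markers=False):
    """Build an argument list for camel_dediac (hypothetical helper).

    Only the long options listed in the usage text are emitted.
    """
    cmd = ["camel_dediac", f"--scheme={scheme}"]
    if marker is not None:
        cmd.append(f"--marker={marker}")
    if ignore_markers:
        cmd.append("--ignore-markers")
    if strip_markers:
        cmd.append("--strip-markers")
    if output_path is not None:
        cmd.append(f"--output={output_path}")
    if input_path is not None:
        cmd.append(input_path)
    return cmd


def run_dediac(**kwargs):
    """Run camel_dediac; assumes the tool is installed and on PATH."""
    return subprocess.run(build_dediac_command(**kwargs), check=True)


# Placeholder file names, for illustration only.
cmd = build_dediac_command("input.txt", scheme="bw", output_path="output.txt")
print(shlex.join(cmd))
# camel_dediac --scheme=bw --output=output.txt input.txt
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems when a marker or file name contains characters the shell treats specially.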

Below is a list of currently available encoding schemes.

ar         Arabic script
bw         Buckwalter encoding
safebw     Safe Buckwalter encoding
xmlbw      XML Buckwalter encoding
hsb        Habash-Soudi-Buckwalter encoding

See Encoding Schemes for more information on encodings.

Notes on markers

A marker is a string that contains no whitespace at its beginning, middle, or end (in other words, it’s a single token without padding spaces). As a rule of thumb, pick a marker that is unlikely to appear in your text. We use @@IGNORE@@ as the default value, while some Arabic NLP tools use @@LAT@@ to denote Latin or foreign text.
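
To make the interaction between the marker options concrete, here is a rough Python sketch of how marker-prefixed tokens might be handled. It is an illustration of the behavior described in the option list, not camel_dediac's actual implementation. Several points are assumptions: the function names are hypothetical, tokens are split on whitespace (so runs of spaces collapse), only the combining marks U+064B to U+0652 are treated as diacritics, and the marker is assumed to be kept in the output when --ignore-markers is set without --strip-markers.

```python
import re

# Assumption: only tanwin, short vowels, shadda, and sukun (U+064B-U+0652)
# count as diacritics here; the real tool may remove more characters.
_AR_DIAC_RE = re.compile("[\u064b-\u0652]")


def dediac_ar_sketch(text):
    """Remove the assumed set of Arabic diacritics from text."""
    return _AR_DIAC_RE.sub("", text)


def dediac_line(line, marker="@@IGNORE@@", ignore_markers=False,
                strip_markers=False):
    """Dediacritize a line token by token, honoring marker prefixes."""
    out = []
    for token in line.split():
        if not token.startswith(marker):
            out.append(dediac_ar_sketch(token))
        elif not ignore_markers:
            # Default behavior: marked tokens are left untouched.
            out.append(token)
        else:
            # With ignore_markers, the text after the marker is processed;
            # the marker itself is dropped only if strip_markers is set.
            body = dediac_ar_sketch(token[len(marker):])
            out.append(body if strip_markers else marker + body)
    return " ".join(out)


# "kataba" with a fatha on each letter, written with escapes for clarity.
word = "\u0643\u064e\u062a\u064e\u0628\u064e"
line = word + " @@IGNORE@@" + word

print(dediac_line(line))
print(dediac_line(line, ignore_markers=True))
print(dediac_line(line, ignore_markers=True, strip_markers=True))
```

In this sketch the marker is split off before the rest of the token is processed, so the marker text itself is never altered even if it happens to contain characters that a given encoding scheme treats as diacritics.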