camel_transliterate
About
The camel_transliterate tool transliterates text from one form to another
using one of the built-in transliteration schemes. Tokens can also be
prefixed with a marker to indicate that they should not be transliterated.
Usage
The usage information below is printed by running
camel_transliterate --help.
Usage:
    camel_transliterate (-s SCHEME | --scheme=SCHEME)
                        [-m MARKER | --marker=MARKER]
                        [-I | --ignore-markers]
                        [-S | --strip-markers]
                        [-o OUTPUT | --output=OUTPUT] [FILE]
    camel_transliterate (-l | --list)
    camel_transliterate (-v | --version)
    camel_transliterate (-h | --help)
Options:
    -s SCHEME --scheme=SCHEME
            Scheme used for transliteration.
    -o OUTPUT --output=OUTPUT
            Output file. If not specified, output will be printed to stdout.
    -m MARKER --marker=MARKER
            Marker used to prefix tokens not to be transliterated.
            [default: @@IGNORE@@]
    -I --ignore-markers
            Transliterate marked words as well.
    -S --strip-markers
            Remove markers in output.
    -l --list
            Show a list of available transliteration schemes.
    -h --help
            Show this screen.
    -v --version
            Show version.
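As an illustration of how the options above combine, the following Python sketch assembles an argument list for camel_transliterate. The helper name build_command, its parameters, and the file names are hypothetical; only the flags themselves are taken from the usage text above.

```python
def build_command(scheme, input_file=None, output_file=None,
                  marker=None, ignore_markers=False, strip_markers=False):
    """Assemble an argv list for camel_transliterate (hypothetical helper)."""
    cmd = ["camel_transliterate", "--scheme=" + scheme]
    if marker is not None:
        cmd.append("--marker=" + marker)
    if ignore_markers:
        cmd.append("--ignore-markers")
    if strip_markers:
        cmd.append("--strip-markers")
    if output_file is not None:
        cmd.append("--output=" + output_file)
    if input_file is not None:
        # FILE is the optional positional argument and goes last.
        cmd.append(input_file)
    return cmd


# Example: Arabic to Buckwalter, removing markers from the output.
# "input.txt" and "output.txt" are placeholder file names.
command = build_command("ar2bw", input_file="input.txt",
                        output_file="output.txt", strip_markers=True)
print(command)
```

If the tool is installed, a list like this could be handed to subprocess.run(command, check=True).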
Below is a list of currently available transliteration schemes.
ar2bw Arabic to Buckwalter
ar2safebw Arabic to Safe Buckwalter
ar2xmlbw Arabic to XML Buckwalter
ar2hsb Arabic to Habash-Soudi-Buckwalter
bw2ar Buckwalter to Arabic
bw2safebw Buckwalter to Safe Buckwalter
bw2xmlbw Buckwalter to XML Buckwalter
bw2hsb Buckwalter to Habash-Soudi-Buckwalter
safebw2ar Safe Buckwalter to Arabic
safebw2bw Safe Buckwalter to Buckwalter
safebw2xmlbw Safe Buckwalter to XML Buckwalter
safebw2hsb Safe Buckwalter to Habash-Soudi-Buckwalter
xmlbw2ar XML Buckwalter to Arabic
xmlbw2bw XML Buckwalter to Buckwalter
xmlbw2safebw XML Buckwalter to Safe Buckwalter
xmlbw2hsb XML Buckwalter to Habash-Soudi-Buckwalter
hsb2ar Habash-Soudi-Buckwalter to Arabic
hsb2bw Habash-Soudi-Buckwalter to Buckwalter
hsb2safebw Habash-Soudi-Buckwalter to Safe Buckwalter
hsb2xmlbw Habash-Soudi-Buckwalter to XML Buckwalter
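The scheme names in this list follow a source2target pattern over five encodings. The sketch below reproduces the names from that pattern; it is only an illustration of the naming convention (the ENCODINGS dictionary and scheme_names function are hypothetical and not part of the tool).

```python
from itertools import permutations

# Short names of the five encodings as they appear in the scheme names.
ENCODINGS = {
    "ar": "Arabic",
    "bw": "Buckwalter",
    "safebw": "Safe Buckwalter",
    "xmlbw": "XML Buckwalter",
    "hsb": "Habash-Soudi-Buckwalter",
}


def scheme_names():
    """Map each '<source>2<target>' name to a readable description."""
    return {
        f"{src}2{dst}": f"{ENCODINGS[src]} to {ENCODINGS[dst]}"
        for src, dst in permutations(ENCODINGS, 2)
    }


schemes = scheme_names()
for name, description in schemes.items():
    print(name, description)
```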
Notes on markers
A marker is a string with no whitespace characters at the beginning, middle, or
end of it (in other words, it is a single token without padding spaces). As a
rule of thumb, pick a marker that is unlikely to appear in your text. We
use @@IGNORE@@ as the default value, while some Arabic NLP tools use
@@LAT@@ to denote Latin/foreign text.
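The marker behaviour can be sketched in a few lines of Python. This is a simplified model, not the tool's implementation: it splits on whitespace and rejoins tokens with single spaces, and the names is_valid_marker and process_tokens are hypothetical. The convert argument stands in for any transliteration function; str.upper is used here only so the effect is easy to see.

```python
DEFAULT_MARKER = "@@IGNORE@@"


def is_valid_marker(marker):
    """A marker must be non-empty and contain no whitespace."""
    return bool(marker) and not any(ch.isspace() for ch in marker)


def process_tokens(text, convert, marker=DEFAULT_MARKER,
                   ignore_markers=False, strip_markers=False):
    """Apply convert() to each token, skipping tokens prefixed by marker."""
    if not is_valid_marker(marker):
        raise ValueError("marker must be a single token without whitespace")
    out = []
    for token in text.split():
        marked = token.startswith(marker)
        body = token[len(marker):] if marked else token
        # Marked tokens are converted only when markers are ignored.
        if not marked or ignore_markers:
            body = convert(body)
        # Keep the marker in the output unless stripping was requested.
        if marked and not strip_markers:
            body = marker + body
        out.append(body)
    return " ".join(out)


sample = "hello @@IGNORE@@world"
print(process_tokens(sample, str.upper))
print(process_tokens(sample, str.upper, strip_markers=True))
print(process_tokens(sample, str.upper, ignore_markers=True))
```

The three calls correspond to running with no marker options, with --strip-markers, and with --ignore-markers respectively.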
Notes on schemes
The transliteration schemes ar2bw, ar2safebw, ar2xmlbw,
ar2hsb, bw2ar, bw2safebw, bw2xmlbw, bw2hsb,
safebw2ar, safebw2bw, safebw2xmlbw, safebw2hsb,
xmlbw2ar, xmlbw2bw, xmlbw2safebw, xmlbw2hsb,
hsb2ar, hsb2bw, hsb2safebw, and hsb2xmlbw
use the conversion table listed in Encoding Schemes.
Input characters not listed in the conversion table are output unchanged.
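The pass-through rule for unlisted characters can be illustrated with a minimal Python sketch. The table below is a four-letter excerpt of an Arabic-to-Buckwalter mapping chosen for illustration, not the full conversion table, and map_chars is a hypothetical helper rather than the tool's own code.

```python
# A tiny excerpt of an Arabic-to-Buckwalter mapping, for illustration only.
SAMPLE_AR2BW = {
    "\u0643": "k",  # ARABIC LETTER KAF
    "\u062A": "t",  # ARABIC LETTER TEH
    "\u0627": "A",  # ARABIC LETTER ALEF
    "\u0628": "b",  # ARABIC LETTER BEH
}


def map_chars(text, table):
    """Replace characters found in table; leave all others as they are."""
    return "".join(table.get(ch, ch) for ch in text)


# The Arabic word for "book" followed by characters missing from the table.
result = map_chars("\u0643\u062A\u0627\u0628 2024!", SAMPLE_AR2BW)
print(result)  # ktAb 2024!
```

The space, digits, and exclamation mark are not keys in the sample table, so they appear in the output exactly as they were in the input.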