camel_tools.utils.transliterate
Contains the Transliterator class for transliterating text using a CharMapper.
Classes
- class camel_tools.utils.transliterate.Transliterator(mapper, marker='@@IGNORE@@')
A class for transliterating text using a CharMapper. This class adds the extra utility of marking individual tokens so that they are not transliterated. It assumes that tokens are whitespace-separated.
- Parameters:
mapper (CharMapper) – The CharMapper instance to be used for transliteration.
marker (str, optional) – A string that is prefixed to every token that should not be transliterated. It must not contain any whitespace characters. Defaults to '@@IGNORE@@'.
- Raises:
TypeError – If mapper is not a CharMapper instance or marker is not a string.
ValueError – If marker contains whitespace or is an empty string.
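The marker rules listed under Raises can be sketched in plain Python. This is only an illustration of the documented conditions, not the library's code: check_marker and outcome are hypothetical helpers introduced here, and the type check on mapper is left out because it needs a real CharMapper instance.

```python
import re

# Matches any whitespace character (space, tab, newline, ...).
_WHITESPACE = re.compile(r'\s')


def check_marker(marker):
    """Hypothetical helper mirroring the documented marker rules."""
    if not isinstance(marker, str):
        raise TypeError('marker must be a string')
    if marker == '':
        raise ValueError('marker must not be empty')
    if _WHITESPACE.search(marker):
        raise ValueError('marker must not contain whitespace')
    return marker


def outcome(marker):
    """Return 'ok' or the name of the exception raised for this marker."""
    try:
        check_marker(marker)
        return 'ok'
    except (TypeError, ValueError) as err:
        return type(err).__name__


for candidate in ['@@IGNORE@@', '', '@@ IGNORE', 42]:
    print(repr(candidate), '->', outcome(candidate))
```

Under these assumptions, only the default-style marker passes; the empty string and the marker containing a space map to ValueError, and the non-string value maps to TypeError.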
- transliterate(s, strip_markers=False, ignore_markers=False)
Transliterate a given string.
- Parameters:
s (str) – The string to transliterate.
strip_markers (bool, optional) – If True, markers are stripped from the output; otherwise markers are kept in the output. Defaults to False.
ignore_markers (bool, optional) – If True, all text is transliterated, including marked tokens, but not the markers themselves. To transliterate the markers as well, use CharMapper directly instead. Defaults to False.
- Returns:
The transliteration of s with the exception of marked words.
- Return type:
str
Examples
from camel_tools.utils.charmap import CharMapper
from camel_tools.utils.transliterate import Transliterator
# Instantiate the builtin bw2ar (Buckwalter to Arabic) CharMapper
bw2ar = CharMapper.builtin_mapper('bw2ar')
# Instantiate Transliterator with the bw2ar CharMapper with '@@IGNORE@@' marker (default)
bw2ar_translit = Transliterator(bw2ar)
# String to transliterate
sentence_bw = 'Al>um~u madrasapN <i*A >aEdadtahA >aEdadta $aEbAF Tay~iba Al>aErAqi @@IGNORE@@#womenInSTEM'
# Generate Arabic transliteration from BW
sentence_ar = bw2ar_translit.transliterate(sentence_bw)
# Generate Arabic transliteration from BW and strip @@IGNORE@@ marker
sentence_ar_stripped = bw2ar_translit.transliterate(sentence_bw, strip_markers=True)
# Print results
print('Original sentence:\n\t', sentence_bw)
print('Arabic transliteration:\n\t', sentence_ar)
print('Arabic transliteration + stripped markers:\n\t', sentence_ar_stripped)
This will output:
Original sentence:
Al>um~u madrasapN <i*A >aEdadtahA >aEdadta $aEbAF Tay~iba Al>aErAqi @@IGNORE@@#womenInSTEM
Arabic transliteration:
الأُمُّ مَدرَسَةٌ إِذا أَعدَدتَها أَعدَدتَ شَعباً طَيِّبَ الأَعراقِ @@IGNORE@@#womenInSTEM
Arabic transliteration + stripped markers:
الأُمُّ مَدرَسَةٌ إِذا أَعدَدتَها أَعدَدتَ شَعباً طَيِّبَ الأَعراقِ #womenInSTEM
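The example above does not exercise ignore_markers. The following pure-Python stand-in may help show how the two flags could interact. It is a sketch under stated assumptions, not the Transliterator implementation: toy_transliterate and toy_map are hypothetical names, a plain dict replaces the CharMapper, tokens are split on single spaces only, and it assumes that with ignore_markers=True the marker itself stays in the output unless strip_markers is also set. Check the real library's behavior before relying on that last point.

```python
def toy_transliterate(s, mapping, marker='@@IGNORE@@',
                      strip_markers=False, ignore_markers=False):
    """Hypothetical stand-in for the marker handling described above."""

    def map_chars(text):
        # Characters without an entry in the mapping pass through unchanged.
        return ''.join(mapping.get(ch, ch) for ch in text)

    out = []
    # Split on single spaces so the original spacing survives the join.
    for token in s.split(' '):
        if token.startswith(marker):
            body = token[len(marker):]
            if ignore_markers:
                # ASSUMPTION: the token body is mapped, the marker is not.
                body = map_chars(body)
            prefix = '' if strip_markers else marker
            out.append(prefix + body)
        else:
            out.append(map_chars(token))
    return ' '.join(out)


toy_map = {'a': 'A', 'b': 'B'}  # toy mapping, not a real CharMapper
text = 'ab @@IGNORE@@ab'

print(toy_transliterate(text, toy_map))
# AB @@IGNORE@@ab
print(toy_transliterate(text, toy_map, strip_markers=True))
# AB ab
print(toy_transliterate(text, toy_map, ignore_markers=True))
# AB @@IGNORE@@AB
print(toy_transliterate(text, toy_map, strip_markers=True, ignore_markers=True))
# AB AB
```

With the real class, the equivalent calls would be bw2ar_translit.transliterate(sentence_bw, ignore_markers=True), optionally combined with strip_markers=True.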