camel_tools.utils.transliterate

Contains the Transliterator class for transliterating text using a CharMapper.

Classes

class camel_tools.utils.transliterate.Transliterator(mapper, marker='@@IGNORE@@')

A class for transliterating text using a CharMapper. In addition to plain character mapping, it lets you mark individual tokens so that they are not transliterated. It assumes that tokens are whitespace-separated.

Parameters:
  • mapper (CharMapper) – The CharMapper instance to be used for transliteration.
  • marker (str, optional) – A string that is prefixed to all tokens that shouldn’t be transliterated. Should not contain any whitespace characters. Defaults to ‘@@IGNORE@@’.
Raises:
  • TypeError – If mapper is not a CharMapper instance or marker is not a string.
  • ValueError – If marker contains whitespace or is an empty string.
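The marker rules listed above can be illustrated with a small standalone sketch. The helpers check_marker and describe are hypothetical names introduced here for illustration. They are not part of camel_tools and only mirror the documented TypeError and ValueError conditions for marker, not the library's actual code.

```python
import re


def check_marker(marker):
    """Hypothetical helper: apply the documented marker rules."""
    # Documented rule: marker must be a string.
    if not isinstance(marker, str):
        raise TypeError('marker must be a string')
    # Documented rule: marker must not be empty.
    if marker == '':
        raise ValueError('marker must not be an empty string')
    # Documented rule: marker must not contain whitespace.
    if re.search(r'\s', marker):
        raise ValueError('marker must not contain whitespace')
    return marker


def describe(candidate):
    """Return 'accepted' or the name of the exception the rules raise."""
    try:
        check_marker(candidate)
    except (TypeError, ValueError) as err:
        return type(err).__name__
    return 'accepted'


if __name__ == '__main__':
    for candidate in ('@@IGNORE@@', '<<KEEP>>', '', 'bad marker', 42):
        print(repr(candidate), '->', describe(candidate))
```

Under these assumptions, a custom marker such as '<<KEEP>>' is acceptable, while an empty string, a string containing a space, or a non-string value is rejected.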

transliterate(s, strip_markers=False, ignore_markers=False)

Transliterate a given string.

Parameters:
  • s (str) – The string to transliterate.
  • strip_markers (bool, optional) – Output is stripped of markers if True, otherwise markers are kept in the output. Defaults to False.
  • ignore_markers (bool, optional) – If True, all text is transliterated, including marked tokens; only the markers themselves are left untransliterated. To transliterate the markers as well, use CharMapper directly instead. Defaults to False.
Returns:
  The transliteration of s, with marked tokens left untransliterated unless ignore_markers is True.
Return type:
  str
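How strip_markers and ignore_markers interact can be sketched without the library. The code below is an illustration of the documented behaviour only, not the camel_tools implementation. TOY_BW2AR, map_chars and transliterate_sketch are hypothetical names, the table covers only four Buckwalter letters, and the sketch assumes a marker counts only when it appears at the start of a whitespace-separated token.

```python
import re

# Toy subset of the Buckwalter-to-Arabic mapping (illustration only).
TOY_BW2AR = {'k': 'ك', 't': 'ت', 'A': 'ا', 'b': 'ب'}


def map_chars(text, table=TOY_BW2AR):
    """Map each character through the table; unknown characters pass through."""
    return ''.join(table.get(ch, ch) for ch in text)


def transliterate_sketch(s, marker='@@IGNORE@@',
                         strip_markers=False, ignore_markers=False):
    """Hypothetical sketch of the documented flag semantics."""
    out = []
    # Split on whitespace but keep the whitespace runs so spacing is preserved.
    for piece in re.split(r'(\s+)', s):
        if piece == '' or piece.isspace():
            out.append(piece)
        elif piece.startswith(marker):
            body = piece[len(marker):]
            # ignore_markers: the marked token's text is mapped as well.
            if ignore_markers:
                body = map_chars(body)
            # strip_markers: drop the marker from the output; it is never mapped.
            prefix = '' if strip_markers else marker
            out.append(prefix + body)
        else:
            out.append(map_chars(piece))
    return ''.join(out)


if __name__ == '__main__':
    text = 'ktAb @@IGNORE@@#ktAb'
    for strip in (False, True):
        for ignore in (False, True):
            result = transliterate_sketch(text, strip_markers=strip,
                                          ignore_markers=ignore)
            print(f'strip_markers={strip}, ignore_markers={ignore}:', result)
```

With this toy table, the unmarked token ktAb is always mapped. The marked token stays as @@IGNORE@@#ktAb by default, becomes #ktAb with strip_markers=True, and has its text after the marker mapped when ignore_markers=True.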

Examples

from camel_tools.utils.charmap import CharMapper
from camel_tools.utils.transliterate import Transliterator

# Instantiate the builtin bw2ar (Buckwalter to Arabic) CharMapper
bw2ar = CharMapper.builtin_mapper('bw2ar')

# Instantiate Transliterator with the bw2ar CharMapper with '@@IGNORE@@' marker (default)
bw2ar_translit = Transliterator(bw2ar)

# String to transliterate
sentence_bw = 'Al>um~u madrasapN <i*A >aEdadtahA >aEdadta $aEbAF Tay~iba Al>aErAqi @@IGNORE@@#womenInSTEM'

# Generate Arabic transliteration from BW
sentence_ar = bw2ar_translit.transliterate(sentence_bw)

# Generate Arabic transliteration from BW and strip @@IGNORE@@ marker
sentence_ar_stripped = bw2ar_translit.transliterate(sentence_bw, strip_markers=True)

# Print results
print('Original sentence:\n\t', sentence_bw)
print('Arabic transliteration:\n\t', sentence_ar)
print('Arabic transliteration + stripped markers:\n\t', sentence_ar_stripped)

This will output:

Original sentence:
         Al>um~u madrasapN <i*A >aEdadtahA >aEdadta $aEbAF Tay~iba Al>aErAqi @@IGNORE@@#womenInSTEM
Arabic transliteration:
         الأُمُّ مَدرَسَةٌ إِذا أَعدَدتَها أَعدَدتَ شَعباً طَيِّبَ الأَعراقِ @@IGNORE@@#womenInSTEM
Arabic transliteration + stripped markers:
         الأُمُّ مَدرَسَةٌ إِذا أَعدَدتَها أَعدَدتَ شَعباً طَيِّبَ الأَعراقِ #womenInSTEM