camel_tools.utils.charmap

Contains the CharMapper class for mapping characters in a Unicode string to other strings.

Classes

class camel_tools.utils.charmap.CharMapper(charmap, default=None)

A class for mapping characters in a Unicode string to other strings.

Parameters:
  • charmap (dict) – A dictionary or any other dictionary-like object (implementing collections.Mapping) mapping characters or ranges of characters to strings. Keys should be Unicode strings of length 1 or 3. A string of length 1 indicates a single character to be mapped, while a string of length 3 indicates a range. Range strings should have the format ‘a-b’, where ‘a’ is the first character in the range and ‘b’ is the last character in the range (inclusive). ‘b’ should have a strictly larger ordinal number than ‘a’. Values should be either strings or None, where None indicates that a character is mapped to itself. Use an empty string to indicate deletion.
  • default (str, optional) – The default value to map characters not in charmap to. None indicates that characters map to themselves. Defaults to None.
Raises:
  • InvalidCharMapKeyError – If a key in charmap is not a Unicode string containing either a single character or a valid character range.
  • TypeError – If default or a value for a key in charmap is neither None nor a Unicode string, or if charmap is not a dictionary-like object.
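The key and value rules above can be illustrated with a minimal pure-Python sketch. This is not the CharMapper implementation; the helper names expand_charmap and map_chars are hypothetical, ValueError stands in for InvalidCharMapKeyError, and the handling of overlapping keys (later keys win) is an assumption, since the rules above do not specify it.

```python
def expand_charmap(charmap):
    """Expand single-character and 'a-b' range keys into a flat dict."""
    expanded = {}
    for key, value in charmap.items():
        if len(key) == 1:
            # A key of length 1 maps a single character.
            expanded[key] = value
        elif len(key) == 3 and key[1] == '-' and ord(key[0]) < ord(key[2]):
            # A key of length 3 maps an inclusive range of characters.
            for code in range(ord(key[0]), ord(key[2]) + 1):
                expanded[chr(code)] = value
        else:
            # Stand-in for InvalidCharMapKeyError in this sketch.
            raise ValueError('invalid charmap key: {!r}'.format(key))
    return expanded


def map_chars(s, charmap, default=None):
    """Apply the documented mapping rules to every character of s."""
    expanded = expand_charmap(charmap)
    out = []
    for ch in s:
        target = expanded[ch] if ch in expanded else default
        # None means the character maps to itself; '' deletes it.
        out.append(ch if target is None else target)
    return ''.join(out)


charmap = {'a': 'z', 'b-g': '', 'x-z': None}
print(map_chars('hello abc xyz', charmap))      # hllo z xyz
print(map_chars('abcxyz!', charmap, default=''))  # zxyz
```

With default left as None, unmapped characters such as ‘h’ and the space pass through unchanged; with an empty-string default they are deleted.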
__call__(s)

Alias for CharMapper.map_string().

static builtin_mapper(map_name)

Creates a CharMapper instance from built-in mappings.

Parameters: map_name (str) – Name of built-in map.
Returns: A new CharMapper instance for the named built-in map.
Return type: CharMapper
Raises: BuiltinCharMapNotFoundError – If map_name is not in the list of built-in maps.
map_string(s)

Maps each character in a given string to its corresponding value in the charmap.

Parameters: s (str) – A Unicode string to be mapped.
Returns: A new Unicode string with the charmap applied.
Return type: str
Raises: TypeError – If s is not a Unicode string.
static mapper_from_json(fpath)

Creates a CharMapper instance from a JSON file.

Parameters: fpath (str) – Path to JSON file.
Returns: A new CharMapper instance generated from the given JSON file.
Return type: CharMapper

Raises:
  • InvalidCharMapKeyError – If a key in charmap is not a Unicode string containing either a single character or a valid character range.
  • TypeError – If default or a value for a key in charmap is neither None nor a Unicode string.
  • FileNotFoundError – If file at fpath doesn’t exist.
  • JSONDecodeError – If fpath is not a valid JSON file.
class camel_tools.utils.charmap.InvalidCharMapKeyError(key, message)

Exception raised when an invalid key is found in a charmap used to initialize CharMapper.

class camel_tools.utils.charmap.BuiltinCharMapNotFoundError(map_name, message)

Exception raised when the map name passed to CharMapper.builtin_mapper() is not in the list of built-in maps.

JSON File Structure

JSON files to be used with CharMapper should have the following format:

{
    "default": "",

    "charmap": {
        "a": "z",
        "b-g": "",
        "x-z": null
    }
}

The root object in the file should be a dictionary with two keys: ‘default’ and ‘charmap’. These correspond to and follow the same restrictions as the respective input parameters to the CharMapper constructor (with null in the JSON file corresponding to None in Python).
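As a rough illustration of how this structure lines up with the constructor parameters, the sketch below parses the example with the standard json module. The function load_charmap_config is a hypothetical helper, not part of camel_tools, and its key check is an assumption about what a reasonable validation might look like.

```python
import json

JSON_TEXT = """
{
    "default": "",
    "charmap": {
        "a": "z",
        "b-g": "",
        "x-z": null
    }
}
"""


def load_charmap_config(text):
    """Parse JSON text and return (charmap, default)."""
    config = json.loads(text)
    # The root object is expected to carry both keys.
    if not isinstance(config, dict) or not {'default', 'charmap'} <= set(config):
        raise ValueError("root object needs 'default' and 'charmap' keys")
    return config['charmap'], config['default']


charmap, default = load_charmap_config(JSON_TEXT)
print(charmap['x-z'] is None)  # JSON null arrives as Python None
print(repr(default))           # empty string: unmapped characters are deleted
```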

Built-in mappings

Below is a listing of built-in mappings:

Arabic Transliteration

  • ar2bw Transliterates Arabic text to Buckwalter scheme.
  • ar2safebw Transliterates Arabic text to Safe Buckwalter scheme.
  • ar2xmlbw Transliterates Arabic text to XML Buckwalter scheme.
  • ar2hsb Transliterates Arabic text to Habash-Soudi-Buckwalter scheme.
  • bw2ar Transliterates Buckwalter scheme text to Arabic.
  • bw2safebw Transliterates Buckwalter scheme text to Safe Buckwalter scheme.
  • bw2xmlbw Transliterates Buckwalter scheme text to XML Buckwalter scheme.
  • bw2hsb Transliterates Buckwalter scheme text to Habash-Soudi-Buckwalter scheme.
  • safebw2ar Transliterates Safe Buckwalter scheme text to Arabic.
  • safebw2bw Transliterates Safe Buckwalter scheme text to Buckwalter scheme.
  • safebw2xmlbw Transliterates Safe Buckwalter scheme text to XML Buckwalter scheme.
  • safebw2hsb Transliterates Safe Buckwalter scheme text to Habash-Soudi-Buckwalter scheme.
  • xmlbw2ar Transliterates XML Buckwalter scheme text to Arabic.
  • xmlbw2bw Transliterates XML Buckwalter scheme text to Buckwalter scheme.
  • xmlbw2safebw Transliterates XML Buckwalter scheme text to Safe Buckwalter scheme.
  • xmlbw2hsb Transliterates XML Buckwalter scheme text to Habash-Soudi-Buckwalter scheme.
  • hsb2ar Transliterates Habash-Soudi-Buckwalter scheme text to Arabic.
  • hsb2bw Transliterates Habash-Soudi-Buckwalter scheme text to Buckwalter scheme.
  • hsb2safebw Transliterates Habash-Soudi-Buckwalter scheme text to Safe Buckwalter scheme.
  • hsb2xmlbw Transliterates Habash-Soudi-Buckwalter scheme text to XML Buckwalter scheme.

See Encoding Schemes for more information on Arabic encoding schemes.
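To give a feel for what a transliteration mapping does, here is a small sketch using four Arabic letters and their Buckwalter equivalents. The table is a hand-picked fragment for illustration, not the contents of the built-in ar2bw map, and transliterate is a hypothetical helper rather than a camel_tools function.

```python
# Partial table: four letters only, chosen for illustration.
AR2BW_SAMPLE = {
    '\u0627': 'A',  # ARABIC LETTER ALEF
    '\u0628': 'b',  # ARABIC LETTER BEH
    '\u062a': 't',  # ARABIC LETTER TEH
    '\u0643': 'k',  # ARABIC LETTER KAF
}

# The reverse direction is the inverted table.
BW2AR_SAMPLE = {bw: ar for ar, bw in AR2BW_SAMPLE.items()}


def transliterate(text, table):
    """Replace each character found in table; leave all others unchanged."""
    return ''.join(table.get(ch, ch) for ch in text)


word = '\u0643\u062a\u0627\u0628'  # the Arabic word for "book"
bw = transliterate(word, AR2BW_SAMPLE)
print(bw)                                       # ktAb
print(transliterate(bw, BW2AR_SAMPLE) == word)  # True
```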

Utility

  • arclean Cleans Arabic text by:

    • Deleting characters that are not in Arabic, ASCII, or Latin-1.
    • Converting all spacing characters to an ASCII space character.
    • Converting Indic digits into Arabic digits.
    • Converting extended Arabic letters into basic Arabic letters.
    • Converting 1-char presentation forms into simple basic forms.
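
Two of the cleaning steps listed above, digit and spacing conversion, can be sketched as follows. This is only an approximation of the idea, not the arclean map itself. It assumes that "Indic digits" means the Arabic-Indic digits U+0660 to U+0669, that "Arabic digits" means ASCII 0 to 9, and that Python's str.isspace() is an acceptable stand-in for "spacing characters". The function name is hypothetical.

```python
# Assumed mapping: ARABIC-INDIC DIGIT ZERO..NINE -> ASCII '0'..'9'.
INDIC_TO_ASCII = {0x0660 + i: str(i) for i in range(10)}


def normalize_digits_and_spaces(text):
    """Convert Arabic-Indic digits to ASCII and any whitespace to ' '."""
    converted = text.translate(INDIC_TO_ASCII)
    return ''.join(' ' if ch.isspace() else ch for ch in converted)


sample = '\u0661\u0662\u0663\t\u2003x'  # digits 1 2 3, a tab, an em space, 'x'
print(repr(normalize_digits_and_spaces(sample)))  # '123  x'
```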