camel_tools.utils.charsets

This module provides a comprehensive list of character sets useful for Arabic text processing.

The character sets available in this module are:

  • UNICODE_PUNCT_CHARSET - A set of all Unicode characters marked as punctuation.
  • UNICODE_SYMBOL_CHARSET - A set of all Unicode characters marked as symbols.
  • UNICODE_PUNCT_SYMBOL_CHARSET - A set of all Unicode characters marked as either punctuation or symbol.
  • AR_CHARSET - A set of all Unicode Arabic letters and diacritics.
  • AR_LETTERS_CHARSET - A set of all Unicode Arabic letters.
  • AR_DIAC_CHARSET - A set of all Unicode Arabic diacritics.
  • BW_CHARSET - A set of all Arabic letters and diacritics in Buckwalter encoding.
  • BW_LETTERS_CHARSET - A set of all Arabic letters in Buckwalter encoding.
  • BW_DIAC_CHARSET - A set of all Arabic diacritics in Buckwalter encoding.
  • SAFEBW_CHARSET - A set of all Arabic letters and diacritics in Safe Buckwalter encoding.
  • SAFEBW_LETTERS_CHARSET - A set of all Arabic letters in Safe Buckwalter encoding.
  • SAFEBW_DIAC_CHARSET - A set of all Arabic diacritics in Safe Buckwalter encoding.
  • XMLBW_CHARSET - A set of all Arabic letters and diacritics in XML Buckwalter encoding.
  • XMLBW_LETTERS_CHARSET - A set of all Arabic letters in XML Buckwalter encoding.
  • XMLBW_DIAC_CHARSET - A set of all Arabic diacritics in XML Buckwalter encoding.
  • HSB_CHARSET - A set of all Arabic letters and diacritics in Habash-Soudi-Buckwalter encoding.
  • HSB_LETTERS_CHARSET - A set of all Arabic letters in Habash-Soudi-Buckwalter encoding.
  • HSB_DIAC_CHARSET - A set of all Arabic diacritics in Habash-Soudi-Buckwalter encoding.

All character sets are implemented as Python frozensets. They are therefore immutable and support all frozenset operations, such as membership tests, union, intersection, and difference.
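As a rough illustration of what frozenset support means in practice, the sketch below runs a few set operations. It uses two tiny stand-in sets (`letters` and `diacritics`, both hypothetical and far smaller than the real sets) so that it runs without the library installed; the same operators should apply to the sets exported by this module.

```python
# Stand-in sets for illustration only (hypothetical and deliberately tiny).
# With CAMeL Tools installed, the real sets would be imported from
# camel_tools.utils.charsets instead.
letters = frozenset('\u0628\u062a\u062b')     # beh, teh, theh
diacritics = frozenset('\u064e\u064f\u0650')  # fatha, damma, kasra

# Union: every character that is either a letter or a diacritic
combined = letters | diacritics

# Subset test: all letters are part of the combined set
print(letters <= combined)
# True

# Intersection: the two stand-in sets share no characters
print(len(letters & diacritics))
# 0

# Difference: removing the diacritics leaves only the letters
print(combined - diacritics == letters)
# True

# frozensets are immutable and hashable, so they can serve as dict keys
labels = {letters: 'letters', diacritics: 'diacritics'}
print(labels[letters])
# letters
```

Because frozensets are immutable, each operation above returns a new set and leaves the operands unchanged.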

Using Character Sets

The simplest use case for a character set is checking whether a given character belongs to it. For example, to check whether a character is an Arabic letter, we can do the following:

from camel_tools.utils.charsets import AR_LETTERS_CHARSET

print('A' in AR_LETTERS_CHARSET)
# False

print('أ' in AR_LETTERS_CHARSET)
# True
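Membership tests like the ones above can also be applied character by character to filter a string. The following is a minimal sketch, not part of this module: `remove_chars` is a hypothetical helper, and `STAND_IN_DIACS` is a small stand-in set used so that the example runs without the library. With CAMeL Tools installed, a set such as AR_DIAC_CHARSET could be passed instead, though its exact contents may differ from the stand-in.

```python
def remove_chars(text, charset):
    """Return text with every character found in charset removed.

    Hypothetical helper for illustration; charset can be any set-like
    object that supports the `in` operator.
    """
    return ''.join(c for c in text if c not in charset)


# Stand-in set of common Arabic diacritics (an assumption for this sketch;
# the real AR_DIAC_CHARSET may contain additional characters).
STAND_IN_DIACS = frozenset('\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652')

# The letters kaf, teh, beh, each followed by a fatha
word = '\u0643\u064e\u062a\u064e\u0628\u064e'

# After filtering, only the three letters remain
print(remove_chars(word, STAND_IN_DIACS) == '\u0643\u062a\u0628')
# True
```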

To check whether an entire word consists only of Arabic characters, we can use a character set to build a regular expression as follows:

import re

from camel_tools.utils.charsets import AR_CHARSET

# Concatenate all Arabic characters into a string
ar_str = u''.join(AR_CHARSET)

# Compile a regular expression that matches one or more of these characters
arabic_re = re.compile(r'^[' + re.escape(ar_str) + r']+$')

print(arabic_re.match(u'Arabic') is not None)
# False

print(arabic_re.match(u'عربي') is not None)
# True
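For a one-off check, the same test can be written without a regular expression by testing each character for membership. This is a sketch under stated assumptions: `is_all_in_charset` is a hypothetical helper, not a function of this module, and `stand_in_charset` is a four-character stand-in used so that the example runs without the library. In practice AR_CHARSET would be passed instead.

```python
def is_all_in_charset(word, charset):
    """Return True if word is non-empty and every character is in charset.

    Hypothetical helper for illustration. Like the `+` quantifier in the
    regular expression above, it rejects the empty string.
    """
    return len(word) > 0 and all(c in charset for c in word)


# Stand-in set holding only the letters ain, reh, beh, yeh (an assumption
# made so the sketch is self-contained).
stand_in_charset = frozenset('\u0639\u0631\u0628\u064a')

print(is_all_in_charset('Arabic', stand_in_charset))
# False

print(is_all_in_charset('\u0639\u0631\u0628\u064a', stand_in_charset))
# True

print(is_all_in_charset('', stand_in_charset))
# False
```

A compiled regular expression is likely the better choice when many words are checked, since the pattern is built once and reused.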