camel_tools.utils.charsets
This module provides a comprehensive list of character sets useful for Arabic text processing.
The character sets available in this module are:
UNICODE_PUNCT_CHARSET - A set of all Unicode characters marked as punctuation.
UNICODE_SYMBOL_CHARSET - A set of all Unicode characters marked as symbols.
UNICODE_PUNCT_SYMBOL_CHARSET - A set of all Unicode characters marked as either punctuation or symbol.
EMOJI_SINGLECHAR_CHARSET - A set of all single-character emojis.
EMOJI_MULTICHAR_CHARSET - A set of all multi-character emojis.
EMOJI_ALL_CHARSET - A set of all emojis (union of EMOJI_SINGLECHAR_CHARSET and EMOJI_MULTICHAR_CHARSET).
AR_CHARSET - A set of all Unicode Arabic letters and diacritics.
AR_LETTERS_CHARSET - A set of all Unicode Arabic letters.
AR_DIAC_CHARSET - A set of all Unicode Arabic diacritics.
BW_CHARSET - A set of all Arabic letters and diacritics in Buckwalter encoding.
BW_LETTERS_CHARSET - A set of all Arabic letters in Buckwalter encoding.
BW_DIAC_CHARSET - A set of all Arabic diacritics in Buckwalter encoding.
SAFEBW_CHARSET - A set of all Arabic letters and diacritics in Safe Buckwalter encoding.
SAFEBW_LETTERS_CHARSET - A set of all Arabic letters in Safe Buckwalter encoding.
SAFEBW_DIAC_CHARSET - A set of all Arabic diacritics in Safe Buckwalter encoding.
XMLBW_CHARSET - A set of all Arabic letters and diacritics in XML Buckwalter encoding.
XMLBW_LETTERS_CHARSET - A set of all Arabic letters in XML Buckwalter encoding.
XMLBW_DIAC_CHARSET - A set of all Arabic diacritics in XML Buckwalter encoding.
HSB_CHARSET - A set of all Arabic letters and diacritics in Habash-Soudi-Buckwalter encoding.
HSB_LETTERS_CHARSET - A set of all Arabic letters in Habash-Soudi-Buckwalter encoding.
HSB_DIAC_CHARSET - A set of all Arabic diacritics in Habash-Soudi-Buckwalter encoding.
All character sets are implemented as Python frozensets and therefore support all frozenset operations.
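Since the sets are frozensets, the usual set algebra applies to them. The following is a minimal sketch using small hand-built stand-ins (`letters` and `diacritics` are hypothetical names, not the module's actual contents) so that it runs without CAMeL Tools installed. With the library available, the same operators should apply to sets such as AR_LETTERS_CHARSET and AR_DIAC_CHARSET.

```python
# Hypothetical stand-ins for two of the module's frozensets; the real sets
# contain many more characters.
letters = frozenset('\u0628\u062a\u062b')     # beh, teh, theh
diacritics = frozenset('\u064e\u064f\u0650')  # fatha, damma, kasra

# Union of the two stand-in sets.
combined = letters | diacritics

# Subset and disjointness checks.
print(letters <= combined)             # True
print(letters.isdisjoint(diacritics))  # True

# Difference: removing the diacritics leaves only the letters.
print((combined - diacritics) == letters)  # True

# frozensets are hashable, so they can serve as dictionary keys.
labels = {letters: 'letter', diacritics: 'diacritic'}
print(labels[letters])  # letter
```

Being immutable, a frozenset cannot be modified in place; operators such as `|` and `-` always return a new frozenset.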
Using Character Sets
The simplest use case for a character set is checking whether a given character belongs to it. For example, to check whether a character is an Arabic letter, we can do the following:
from camel_tools.utils.charsets import AR_LETTERS_CHARSET
print('A' in AR_LETTERS_CHARSET)
# False
print('أ' in AR_LETTERS_CHARSET)
# True
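The same membership test can be applied to each character of a string, for instance to remove every character that belongs to a set. The sketch below is illustrative only: `strip_chars` is a hypothetical helper, and `diac_standin` is an assumed stand-in for AR_DIAC_CHARSET covering just the eight marks from U+064B to U+0652, which may differ from what the module's set actually contains.

```python
# Stand-in for AR_DIAC_CHARSET (an assumption, not the module's actual
# contents): the eight marks from U+064B (fathatan) to U+0652 (sukun).
diac_standin = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def strip_chars(text, charset):
    """Return text with every character found in charset removed."""
    return ''.join(ch for ch in text if ch not in charset)


# A diacritized word: kaf + fatha, teh + fatha, beh + fatha.
word = '\u0643\u064e\u062a\u064e\u0628\u064e'
print(strip_chars(word, diac_standin) == '\u0643\u062a\u0628')
# True
```

With CAMeL Tools installed, passing AR_DIAC_CHARSET in place of the stand-in should work the same way, since the helper only relies on the `in` operator.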
To check whether an entire word consists only of Arabic characters, we can use a character set to build a regular expression as follows:
import re
from camel_tools.utils.charsets import AR_CHARSET
# Concatenate all Arabic characters into a string
ar_str = u''.join(AR_CHARSET)
# Compile a regular expression using above string
arabic_re = re.compile(r'^[' + re.escape(ar_str) + r']+$')
print(arabic_re.match(u'Arabic') is not None)
# False
print(arabic_re.match(u'عربي') is not None)
# True
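For a one-off check, the same whole-word test can be written without a regular expression by testing every character directly. This is a sketch under stated assumptions: `is_all_in` is a hypothetical helper, and `ar_standin` is a rough stand-in for AR_CHARSET built from the code points U+0621 to U+064A rather than the module's real set.

```python
# Rough stand-in for AR_CHARSET (an assumption): code points U+0621..U+064A.
ar_standin = frozenset(chr(cp) for cp in range(0x0621, 0x064B))


def is_all_in(word, charset):
    """Return True if word is non-empty and every character is in charset."""
    return bool(word) and all(ch in charset for ch in word)


print(is_all_in('Arabic', ar_standin))
# False
print(is_all_in('عربي', ar_standin))
# True
print(is_all_in('', ar_standin))
# False
```

The explicit non-empty check mirrors the `+` quantifier in the regular expression above; without it, `all()` would return True for an empty string.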