camel_tools.tokenizers.morphological

This module contains utilities for morphological tokenization.

Classes

class camel_tools.tokenizers.morphological.MorphologicalTokenizer(disambiguator, scheme, split=False, diac=False)

Class for morphologically tokenizing Arabic words.

Parameters:

disambiguator (Disambiguator) – The disambiguator to use for tokenization.
scheme (str) – The tokenization scheme to use. You can use the tok_feats() method of your chosen disambiguator to get a list of tokenization schemes it produces.
split (bool, optional) – If set to True, each morphological token is returned as a separate string; otherwise, the tokens of a word are joined into one string, delimited by underscores. Defaults to False.
diac (bool, optional) – If set to True, output tokens are diacritized; otherwise, they are undiacritized. Defaults to False. Note that when the tokenization scheme is set to ‘bwtok’, the undiacritized output may contain fewer tokens than the diacritized output because the ‘bwtok’ scheme can produce morphemes that are standalone diacritics (e.g. case and mood).

tokenize(words)

Generate morphological tokens for a given list of words.

Parameters:	words (`list` of `str`) – List of words to tokenize.
Returns:	List of morphologically tokenized words.
Return type:	`list` of `str`
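
Because tokenize() takes a list of words rather than a raw string, input text has to be split on whitespace and punctuation first. camel_tools provides camel_tools.tokenizers.word.simple_word_tokenize for this purpose. The sketch below is only a hypothetical stand-in that shows the expected input shape without requiring camel_tools; the function name pre_tokenize and the punctuation set are assumptions made for illustration, not part of the library.

```python
import re

# Assumed punctuation set for illustration (Latin and Arabic marks).
PUNCT = '.,;:!?،؛؟'

# Match either one punctuation character or a run of characters that
# are neither whitespace nor punctuation.
_TOKEN_RE = re.compile(
    '[' + re.escape(PUNCT) + ']|[^\\s' + re.escape(PUNCT) + ']+'
)


def pre_tokenize(text):
    """Split raw text into a list of words and punctuation marks.

    Hypothetical helper, not the camel_tools implementation.
    """
    return _TOKEN_RE.findall(text)


words = pre_tokenize('فتنفست الصعداء.')
print(words)
```

A list produced this way has the form that tokenize() accepts as its words argument.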

Examples

from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.morphological import MorphologicalTokenizer

# Initialize disambiguators
mle_msa = MLEDisambiguator.pretrained('calima-msa-r13')
mle_egy = MLEDisambiguator.pretrained('calima-egy-r13')

# We expect a sentence to be whitespace/punctuation tokenized beforehand.
# We provide a simple whitespace and punctuation tokenizer as part of camel_tools.
# See camel_tools.tokenizers.word.simple_word_tokenize.
sentence_msa = ['فتنفست', 'الصعداء']
sentence_egy = ['وكاتباله', 'مكتوبين']

# Create different morphological tokenizer instances
msa_d3_tokenizer = MorphologicalTokenizer(disambiguator=mle_msa, scheme='d3tok')
msa_atb_tokenizer = MorphologicalTokenizer(disambiguator=mle_msa, scheme='atbtok')
msa_bw_tokenizer = MorphologicalTokenizer(disambiguator=mle_msa, scheme='bwtok')
egy_bw_tokenizer = MorphologicalTokenizer(disambiguator=mle_egy, scheme='bwtok')

# Generate tokenizations
# Note that our Egyptian resources currently provide bwtok tokenization only.
msa_d3_tok = msa_d3_tokenizer.tokenize(sentence_msa)
msa_atb_tok = msa_atb_tokenizer.tokenize(sentence_msa)
msa_bw_tok = msa_bw_tokenizer.tokenize(sentence_msa)
egy_bw_tok = egy_bw_tokenizer.tokenize(sentence_egy)

# Print results
print('D3 tokenization (MSA):', msa_d3_tok)
print('ATB tokenization (MSA):', msa_atb_tok)
print('BW tokenization (MSA):', msa_bw_tok)
print('BW tokenization (EGY):', egy_bw_tok)

This will output:

D3 tokenization (MSA): ['ف+_تنفست', 'ال+_صعداء']
ATB tokenization (MSA): ['ف+_تنفست', 'الصعداء']
BW tokenization (MSA): ['ف+_تنفس_+ت', 'ال+_صعداء']
BW tokenization (EGY): ['و+_كاتب_+ة_+ل_+ه', 'مكتوب_+ين']
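
The outputs above use the default split=False, so the morphological tokens of each word are joined with underscores. As a rough illustration of how the two output forms relate, the sketch below flattens such output into separate strings. It assumes that split=True is equivalent to splitting each tokenized word on the underscore delimiter, which is how the parameter description reads; the helper name split_morph_tokens is hypothetical and not part of camel_tools.

```python
def split_morph_tokens(tokenized_words, delimiter='_'):
    """Flatten delimiter-joined tokenizations into separate strings.

    Hypothetical helper; assumes morphological tokens never contain
    the delimiter themselves.
    """
    flat = []
    for word in tokenized_words:
        flat.extend(word.split(delimiter))
    return flat


# Output of the 'bwtok' MSA tokenizer from the example above.
msa_bw_tok = ['ف+_تنفس_+ت', 'ال+_صعداء']
print(split_morph_tokens(msa_bw_tok))
# ['ف+', 'تنفس', '+ت', 'ال+', 'صعداء']
```

Note that the flattened list no longer records which tokens belonged to which input word, so the underscore-delimited form is preferable when word alignment matters.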