camel_tools.disambig.mle

Contains a disambiguator that uses a Maximum Likelihood Estimation model.

Classes

class camel_tools.disambig.mle.MLEDisambiguator(analyzer, mle_path=None, top=1, cache_size=100000)

A disambiguator using a Maximum Likelihood Estimation (MLE) model. It first looks up each word in a given word-based MLE model. If no word-based model is provided, or a word is not in it, an analyzer is used instead, and the word is disambiguated based on the pos-lex log probabilities of its analyses.
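
The lookup order described above can be sketched in a few lines of plain Python. This is a toy illustration, not the library's implementation: `disambiguate_token`, `word_model`, and `toy_analyze` are hypothetical stand-ins, and each analysis is assumed to be a dictionary carrying a `pos_lex_logprob` value.

```python
def disambiguate_token(word, word_model, analyze, top=1):
    """Return up to `top` analyses for `word` (toy sketch, not camel_tools code)."""
    # Step 1: word-based lookup, if a word model was provided.
    if word_model is not None and word in word_model:
        return word_model[word][:top]
    # Step 2: fall back to the analyzer and rank its analyses by
    # pos-lex log probability, highest first.
    analyses = analyze(word)
    ranked = sorted(analyses, key=lambda a: a['pos_lex_logprob'], reverse=True)
    return ranked[:top]


# Hypothetical data, for illustration only.
word_model = {'kitab': [{'lex': 'kitab', 'pos': 'noun'}]}


def toy_analyze(word):
    # Stand-in for an analyzer: two made-up analyses per word.
    return [
        {'lex': word, 'pos': 'verb', 'pos_lex_logprob': -5.0},
        {'lex': word, 'pos': 'noun', 'pos_lex_logprob': -2.0},
    ]


in_model = disambiguate_token('kitab', word_model, toy_analyze)
fallback = disambiguate_token('qalam', word_model, toy_analyze)
no_model = disambiguate_token('kitab', None, toy_analyze, top=2)
```

Passing `None` as the word model mirrors the `mle_path=None` case, where the word-based lookup is skipped entirely.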

Parameters:

analyzer (Analyzer) – Analyzer to use if a word is not in the word-based MLE model. The analyzer should provide the pos-lex log probabilities of its analyses, which are used to rank them.
mle_path (str, optional) – Path to MLE JSON file. If None, no word-based MLE lookup is performed and disambiguation skips directly to the pos-lex model. Defaults to None.
top (int, optional) – The maximum number of top analyses to return. Defaults to 1.
cache_size (int, optional) – The number of unique word disambiguations to cache. The cache uses a least-frequently-used eviction policy. Defaults to 100000.
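
To make the least-frequently-used eviction policy concrete, here is a minimal sketch of such a cache. `TinyLFUCache` is a hypothetical toy class written for this illustration; it says nothing about which cache implementation the library actually uses, and it assumes `maxsize` is at least 1.

```python
class TinyLFUCache:
    """Toy least-frequently-used cache (illustration only)."""

    def __init__(self, maxsize):
        self.maxsize = maxsize  # assumed to be >= 1
        self._data = {}
        self._hits = {}

    def get(self, key):
        # A successful read counts as one use of the entry.
        if key in self._data:
            self._hits[key] += 1
            return self._data[key]
        return None

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self.maxsize:
            # Evict the entry that has been read the fewest times.
            victim = min(self._data, key=lambda k: self._hits[k])
            del self._data[victim]
            del self._hits[victim]
        self._data[key] = value
        self._hits.setdefault(key, 0)


cache = TinyLFUCache(maxsize=2)
cache.put('a', 1)
cache.put('b', 2)
cache.get('a')     # 'a' has now been used once, 'b' never
cache.put('c', 3)  # cache is full, so the least-used entry ('b') is evicted
```

With such a policy, frequent words tend to stay cached while rare words are evicted first, which suits the skewed word frequencies of natural text.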

all_feats()

Return a set of all features produced by this disambiguator.

Returns:	The set of all features produced by this disambiguator.
Return type:	`frozenset` of `str`

disambiguate(sentence)

Disambiguate all words in a given sentence.

Parameters:	sentence (`list` of `str`) – The list of space- and punctuation-separated tokens comprising a given sentence.
Returns:	The list of disambiguations for each word in the given sentence.
Return type:	`list` of `DisambiguatedWord`

disambiguate_word(sentence, word_ndx)

Disambiguates a single word in a sentence. Note that while MLE disambiguation operates on each word out of context, we maintain this interface to be compatible with disambiguators that work in the context of a sentence.

Parameters:

sentence (`list` of `str`) – The list of space- and punctuation-separated tokens comprising a given sentence.
word_ndx (`int`) – The index of the word token in sentence to disambiguate.
Returns:	The disambiguation of the word token in sentence at word_ndx.
Return type:	`DisambiguatedWord`
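
Because each word is handled out of context, one would expect calling `disambiguate_word` on every index to give the same results as a single `disambiguate` call. The helper below sketches that loop; `disambiguate_each` and the `ToyDisambiguator` stub are hypothetical names used only for illustration, and the stub stands in for any object exposing `disambiguate_word(sentence, word_ndx)`.

```python
def disambiguate_each(disambiguator, sentence):
    """Disambiguate a sentence one word index at a time."""
    return [
        disambiguator.disambiguate_word(sentence, word_ndx)
        for word_ndx in range(len(sentence))
    ]


class ToyDisambiguator:
    """Stub with the same method name; it only tags each token with its index."""

    def disambiguate_word(self, sentence, word_ndx):
        return (word_ndx, sentence[word_ndx])


per_word = disambiguate_each(ToyDisambiguator(), ['a', 'b', 'c'])
```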

static pretrained(model_name=None, analyzer=None, top=1, cache_size=100000)

Load a pre-trained MLE disambiguator provided with CAMeL Tools.

Parameters:

model_name (`str`, optional) – The name of the pretrained model. If None, the default model (‘calima-msa-r13’) is loaded. At the moment, the model names available are the same as those in Databases. Defaults to None.
analyzer (`Analyzer`, optional) – Alternative analyzer to use. If None, an instance of the model’s default analyzer is created. Defaults to None.
top (`int`, optional) – The maximum number of top analyses to return. Defaults to 1.
cache_size (`int`, optional) – The number of unique word disambiguations to cache. The cache uses a least-frequently-used eviction policy. Defaults to 100000.
Returns:	The loaded MLE disambiguator.
Return type:	`MLEDisambiguator`

tok_feats()

Return a set of tokenization features produced by this disambiguator.

Returns:	The set of tokenization features produced by this disambiguator.
Return type:	`frozenset` of `str`
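
One possible use of the sets returned by `all_feats()` and `tok_feats()` is to keep only the features of interest from an analysis. The sketch below assumes an analysis is a dictionary keyed by feature name; `project_feats` is a hypothetical helper, and the feature set and analysis shown are made-up stand-ins rather than actual library output.

```python
def project_feats(analysis, feats):
    """Keep only the entries of `analysis` whose keys are in `feats`."""
    return {name: value for name, value in analysis.items() if name in feats}


# Stand-in values for illustration; a real feature set would come from
# a call such as tok_feats() on a loaded disambiguator.
toy_feats = frozenset({'diac', 'pos'})
toy_analysis = {'diac': 'kitAb', 'pos': 'noun', 'gloss': 'book'}

projected = project_feats(toy_analysis, toy_feats)
```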

Examples

Below is an example of how to load and use the default pre-trained MLE model to disambiguate words in a sentence.

from camel_tools.disambig.mle import MLEDisambiguator

mle = MLEDisambiguator.pretrained()

# We expect a sentence to be whitespace/punctuation tokenized beforehand.
# We provide a simple whitespace and punctuation tokenizer as part of camel_tools.
# See camel_tools.tokenizers.word.simple_word_tokenize.
sentence = ['سوف', 'نقرأ', 'الكتب']

disambig = mle.disambiguate(sentence)

# Let's, for example, use the top disambiguations to generate a diacritized
# version of the above sentence.
# Note that, in practice, you'll need to make sure that each word has a
# non-empty list of analyses.
diacritized = [d.analyses[0].analysis['diac'] for d in disambig]
print(' '.join(diacritized))
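
The empty-analyses case mentioned in the comments above can be handled with a small guard. The sketch below falls back to the original token when a word has no analyses. `diacritize` is a hypothetical helper, and it assumes each disambiguated word exposes the original token as a `word` attribute alongside `analyses`; the `ToyWord` and `ToyScored` records are stand-ins so the snippet runs without loading a model.

```python
from collections import namedtuple


def diacritize(disambig):
    """Join top diacritizations, keeping the raw token if there is no analysis."""
    out = []
    for d in disambig:
        if d.analyses:
            out.append(d.analyses[0].analysis['diac'])
        else:
            # Assumption: the original token is available as `d.word`.
            out.append(d.word)
    return ' '.join(out)


# Stand-in records mimicking the shape used in the example above.
ToyScored = namedtuple('ToyScored', ['score', 'analysis'])
ToyWord = namedtuple('ToyWord', ['word', 'analyses'])

toy = [
    ToyWord('abc', [ToyScored(1.0, {'diac': 'aBc'})]),
    ToyWord('xyz', []),  # no analyses: the raw token is kept
]
result = diacritize(toy)
```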