camel_tools.disambig.mle

Contains a disambiguator that uses a Maximum Likelihood Estimation model.

Classes

class camel_tools.disambig.mle.MLEDisambiguator(analyzer, mle_path=None, top=1, cache_size=100000)

A disambiguator using a Maximum Likelihood Estimation (MLE) model. It first looks up each word in a given word-based MLE model. If no word-based model is provided, or a word is not in it, an analyzer is used instead, and the word is disambiguated based on the pos-lex log probabilities of its analyses.

Parameters:

analyzer (Analyzer) – Analyzer to use if a word is not in the word-based MLE model. The analyzer should provide the pos-lex log probabilities used to rank the analyses of a word.
mle_path (str, optional) – Path to MLE JSON file. If None, no word-based MLE lookup is performed and disambiguation goes directly to the pos-lex model. Defaults to None.
top (int, optional) – The maximum number of top analyses to return. Defaults to 1.
cache_size (int, optional) – The number of unique word disambiguations to cache. The cache uses a least-frequently-used eviction policy. Defaults to 100000.
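
The two-stage lookup described above can be sketched in plain Python. This is only an illustration, not the library's implementation: the function name mle_backoff, the toy word model, the toy analyzer, and the 'pos_lex_logprob' feature key are all assumptions made for the example.

```python
import math


def mle_backoff(word, word_model, analyze, top=1):
    """Toy sketch: word-based lookup first, then pos-lex ranking.

    word_model: dict mapping a word to a list of analyses, or None.
    analyze: callable returning a list of analysis dicts for a word.
    """
    # Stage 1: use the word-based model when it exists and knows the word.
    if word_model is not None and word in word_model:
        return word_model[word][:top]

    # Stage 2: fall back to the analyzer and rank its analyses by
    # pos-lex log probability (the key name is assumed), highest first.
    ranked = sorted(
        analyze(word),
        key=lambda a: a.get('pos_lex_logprob', -math.inf),
        reverse=True,
    )
    return ranked[:top]


# Hypothetical data, for illustration only.
toy_model = {'ktb': [{'diac': 'kutub'}]}


def toy_analyze(word):
    return [
        {'diac': word + '-1', 'pos_lex_logprob': -4.2},
        {'diac': word + '-2', 'pos_lex_logprob': -1.3},
    ]


hit = mle_backoff('ktb', toy_model, toy_analyze)           # found in word model
miss = mle_backoff('qlm', toy_model, toy_analyze)          # falls back to analyzer
no_model = mle_backoff('ktb', None, toy_analyze, top=2)    # mle_path=None case
```

In the sketch, passing None as the word model mirrors the mle_path=None behavior: every word goes straight to the analyzer-based ranking.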

all_feats()

Return a set of all features produced by this disambiguator.

Returns: The set of all features produced by this disambiguator.
Return type: frozenset of str

disambiguate(sentence)

Disambiguate all words in a given sentence.

Parameters: sentence (list of str) – The list of whitespace- and punctuation-separated tokens comprising a given sentence.
Returns: The list of disambiguations for each word in the given sentence.
Return type: list of DisambiguatedWord

disambiguate_word(sentence, word_ndx)

Disambiguates a single word in a sentence. Note that although MLE disambiguation operates on each word out of context, this interface is kept for compatibility with disambiguators that work in the context of a sentence.

Parameters:

sentence (list of str) – The list of whitespace- and punctuation-separated tokens comprising a given sentence.
word_ndx (int) – The index of the word token in sentence to disambiguate.

Returns:

The disambiguation of the word token in sentence at word_ndx.

Return type:

DisambiguatedWord
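
To illustrate what "out of context" means for this interface, here is a minimal sketch of a context-free disambiguator. The class ToyContextFreeDisambiguator and its lexicon are hypothetical stand-ins and do not reflect how MLEDisambiguator is implemented; the point is only that the result for a token depends on the token itself, not on its neighbors.

```python
class ToyContextFreeDisambiguator:
    """Hypothetical stand-in with the same two-method interface."""

    def __init__(self, lexicon):
        # lexicon: dict mapping a word to a list of analyses (toy data).
        self._lexicon = lexicon

    def disambiguate_word(self, sentence, word_ndx):
        # Only the token at word_ndx is consulted; the rest of the
        # sentence is accepted for interface compatibility and ignored.
        word = sentence[word_ndx]
        return (word, self._lexicon.get(word, []))

    def disambiguate(self, sentence):
        # Sentence-level disambiguation is just the per-word call
        # applied at every index.
        return [self.disambiguate_word(sentence, i)
                for i in range(len(sentence))]


toy = ToyContextFreeDisambiguator({'b': ['analysis-of-b']})
in_one_context = toy.disambiguate_word(['a', 'b', 'c'], 1)
in_another = toy.disambiguate_word(['b'], 0)
whole_sentence = toy.disambiguate(['a', 'b', 'c'])
```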

static pretrained(model_name=None, analyzer=None, top=1, cache_size=100000)

Load a pre-trained MLE disambiguator provided with CAMeL Tools.

Parameters:

model_name (str, optional) – The name of the pretrained model. If None, the default model (‘calima-msa-r13’) is loaded. At the moment, the available model names are the same as those in Databases. Defaults to None.
analyzer (Analyzer, optional) – Alternative analyzer to use. If None, an instance of the model’s default analyzer is created. Defaults to None.
top (int, optional) – The maximum number of top analyses to return. Defaults to 1.
cache_size (int, optional) – The number of unique word disambiguations to cache. The cache uses a least-frequently-used eviction policy. Defaults to 100000.

Returns:

The loaded MLE disambiguator.

Return type:

MLEDisambiguator
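
The cache_size parameter refers to a least-frequently-used (LFU) eviction policy. As a rough illustration of that policy only, the sketch below evicts the entry with the fewest accesses once the cache is full. TinyLFUCache is a hypothetical class written for this example, assumes maxsize is at least 1, and is not the cache used by the library.

```python
from collections import Counter


class TinyLFUCache:
    """Minimal LFU cache sketch (assumes maxsize >= 1)."""

    def __init__(self, maxsize):
        self._maxsize = maxsize
        self._data = {}
        self._uses = Counter()  # access count per cached key

    def __contains__(self, key):
        return key in self._data

    def get(self, key, default=None):
        if key in self._data:
            self._uses[key] += 1
            return self._data[key]
        return default

    def put(self, key, value):
        # When inserting a new key into a full cache, evict the key
        # that has been used the fewest times so far.
        if key not in self._data and len(self._data) >= self._maxsize:
            victim = min(self._data, key=lambda k: self._uses[k])
            del self._data[victim]
            del self._uses[victim]
        self._data[key] = value
        self._uses[key] += 1


cache = TinyLFUCache(maxsize=2)
cache.put('word-a', 'disambiguation-a')
cache.put('word-b', 'disambiguation-b')
cache.get('word-a')
cache.get('word-a')                        # 'word-a' is now used most often
cache.put('word-c', 'disambiguation-c')    # evicts 'word-b'
```

Under such a policy, frequent words tend to stay cached while rare words are the first to be evicted.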

tok_feats()

Return a set of tokenization features produced by this disambiguator.

Returns: The set of tokenization features produced by this disambiguator.
Return type: frozenset of str
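
Because all_feats() and tok_feats() return a frozenset of feature names, one plausible use is to restrict an analysis dictionary to the features of interest. The sketch below assumes analyses are plain dicts keyed by feature name; the helper select_feats, the stand-in feature set, and the sample analysis values are illustrative, not output from the library.

```python
def select_feats(analysis, feats):
    """Keep only the entries of `analysis` whose key is in `feats`."""
    return {name: value for name, value in analysis.items() if name in feats}


# Stand-in for the frozenset a call such as tok_feats() would return.
toy_tok_feats = frozenset({'atbtok', 'd3tok'})

# Hypothetical analysis dict with made-up values.
toy_analysis = {
    'diac': 'some-diacritized-form',
    'pos': 'noun',
    'atbtok': 'tok-1',
    'd3tok': 'tok-2',
}

tok_only = select_feats(toy_analysis, toy_tok_feats)
```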

Examples

Below is an example of how to load and use the default pre-trained MLE model to disambiguate words in a sentence.

from camel_tools.disambig.mle import MLEDisambiguator

mle = MLEDisambiguator.pretrained()

# We expect a sentence to be whitespace/punctuation tokenized beforehand.
# We provide a simple whitespace and punctuation tokenizer as part of camel_tools.
# See camel_tools.tokenizers.word.simple_word_tokenize.
sentence = ['سوف', 'نقرأ', 'الكتب']

disambig = mle.disambiguate(sentence)

# Let's, for example, use the top disambiguations to generate a diacritized
# version of the above sentence.
# Note that, in practice, you'll need to make sure that each word has a
# non-empty list of analyses before indexing into it.
diacritized = [d.analyses[0].analysis['diac'] for d in disambig]
print(' '.join(diacritized))
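
As the comment in the example points out, a word may come back with no analyses. One way to guard against that, sketched below, is to fall back to the original token. To keep the snippet runnable without the library or its data, DisambiguatedWord and ScoredAnalysis are replaced here by stand-in namedtuples whose field names (word, analyses, score, analysis) are assumed from the attribute access used in the example above; the helper diacritize and the sample data are hypothetical.

```python
from collections import namedtuple

# Stand-ins for the library's result types (field names are assumptions).
ScoredAnalysis = namedtuple('ScoredAnalysis', ['score', 'analysis'])
DisambiguatedWord = namedtuple('DisambiguatedWord', ['word', 'analyses'])


def diacritize(disambig):
    """Join the top 'diac' of each word, keeping the raw token if none exists."""
    out = []
    for d in disambig:
        if d.analyses:
            # Use the top-ranked analysis; fall back to the token itself
            # if that analysis carries no 'diac' feature.
            out.append(d.analyses[0].analysis.get('diac', d.word))
        else:
            # No analyses at all: keep the original token unchanged.
            out.append(d.word)
    return ' '.join(out)


# Hand-built sample data standing in for the output of disambiguate().
toy_disambig = [
    DisambiguatedWord('سوف', [ScoredAnalysis(1.0, {'diac': 'سَوْفَ'})]),
    DisambiguatedWord('xyz', []),
]

result = diacritize(toy_disambig)
print(result)
```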