camel_tools.disambig.mle
Contains a disambiguator that uses a Maximum Likelihood Estimation model.
Classes
- class camel_tools.disambig.mle.MLEDisambiguator(analyzer, mle_path=None, top=1, cache_size=100000)
A disambiguator using a Maximum Likelihood Estimation (MLE) model. It first looks each word up in a word-based MLE model, if one is given. If no model is provided, or a word is not in the word-based model, an analyzer is used to disambiguate the word based on the pos-lex log probabilities of its analyses.
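The two-stage strategy described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: `disambiguate_token`, `toy_analyze`, the `word_model` dictionary layout, and the `pos_lex_logprob` key are hypothetical stand-ins chosen for the example.

```python
def disambiguate_token(word, word_model, analyze, top=1):
    """Return up to `top` (score, analysis) pairs for `word`.

    `word_model` is a hypothetical dict mapping words to pre-ranked
    (score, analysis) pairs; `analyze` is a hypothetical callable that
    returns candidate analyses as dicts.
    """
    # Stage 1: word-based lookup, skipped entirely when no model is given.
    if word_model is not None and word in word_model:
        return word_model[word][:top]
    # Stage 2: rank the analyzer's candidates by pos-lex log probability.
    scored = [(float(a['pos_lex_logprob']), a) for a in analyze(word)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top]


def toy_analyze(word):
    # Toy analyzer: returns the same two candidates for every word.
    return [
        {'diac': 'kataba', 'pos': 'verb', 'pos_lex_logprob': -3.5},
        {'diac': 'kutub', 'pos': 'noun', 'pos_lex_logprob': -2.0},
    ]


word_model = {'wa': [(1.0, {'diac': 'wa', 'pos': 'conj'})]}

in_model = disambiguate_token('wa', word_model, toy_analyze)
fallback = disambiguate_token('ktb', word_model, toy_analyze, top=2)
no_model = disambiguate_token('wa', None, toy_analyze)
```

Here `in_model` comes straight from the word-based table, while `fallback` and `no_model` are ranked from the analyzer output, with the higher log probability first.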
- Parameters:
analyzer (Analyzer) – Analyzer to use if a word is not in the word-based MLE model. The analyzer should provide the pos-lex log probabilities used to rank a word's analyses.
mle_path (str, optional) – Path to MLE JSON file. If None, no word-based MLE lookup is performed and disambiguation goes directly to the pos-lex model. Defaults to None.
top (int, optional) – The maximum number of top analyses to return. Defaults to 1.
cache_size (int, optional) – The number of unique word disambiguations to cache. The cache uses a least-frequently-used eviction policy. Defaults to 100000.
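To make the least-frequently-used eviction policy mentioned for `cache_size` concrete, here is a toy cache in plain Python. It is a sketch only: `TinyLFUCache` and `expensive` are hypothetical names, and the cache actually used by the library may differ in its details (for example, in how ties are broken).

```python
from collections import Counter


class TinyLFUCache:
    """Toy least-frequently-used cache; assumes max_size >= 1."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._values = {}
        self._hits = Counter()

    def get(self, key, compute):
        if key not in self._values:
            if len(self._values) >= self.max_size:
                # Evict the entry that has been requested the fewest times.
                victim = min(self._values, key=lambda k: self._hits[k])
                del self._values[victim]
                del self._hits[victim]
            self._values[key] = compute(key)
        self._hits[key] += 1
        return self._values[key]


calls = []


def expensive(word):
    # Stand-in for a costly per-word disambiguation.
    calls.append(word)
    return word.upper()


cache = TinyLFUCache(max_size=2)
for w in ['a', 'a', 'b', 'c', 'a']:
    cache.get(w, expensive)
```

In this run `'b'` is evicted when `'c'` arrives because it was requested less often than `'a'`, so the final request for `'a'` is served from the cache without calling `expensive` again.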
- all_feats()
Return a set of all features produced by this disambiguator.
- disambiguate(sentence)
Disambiguate all words in a given sentence.
- disambiguate_word(sentence, word_ndx)
Disambiguates a single word in a sentence. Note that while MLE disambiguation operates on each word out of context, we maintain this interface to be compatible with disambiguators that work in the context of a sentence.
- Parameters:
- Returns:
The disambiguation of the word token in sentence at word_ndx.
- Return type:
- static pretrained(model_name=None, analyzer=None, top=1, cache_size=100000)
Load a pre-trained MLE disambiguator provided with CAMeL Tools.
- Parameters:
model_name (str, optional) – The name of the pretrained model. If None, the default model (‘calima-msa-r13’) is loaded. At the moment, the model names available are the same as those in Databases. Defaults to None.
analyzer (Analyzer, optional) – Alternative analyzer to use. If None, an instance of the model’s default analyzer is created. Defaults to None.
top (int, optional) – The maximum number of top analyses to return. Defaults to 1.
cache_size (int, optional) – The number of unique word disambiguations to cache. The cache uses a least-frequently-used eviction policy. Defaults to 100000.
- Returns:
The loaded MLE disambiguator.
- Return type:
Examples
Below is an example of how to load and use the default pre-trained MLE model to disambiguate words in a sentence.
from camel_tools.disambig.mle import MLEDisambiguator
mle = MLEDisambiguator.pretrained()
# We expect a sentence to be whitespace/punctuation tokenized beforehand.
# We provide a simple whitespace and punctuation tokenizer as part of camel_tools.
# See camel_tools.tokenizers.word.simple_word_tokenize.
sentence = ['سوف', 'نقرأ', 'الكتب']
disambig = mle.disambiguate(sentence)
# Let's, for example, use the top disambiguations to generate a diacritized
# version of the above sentence.
# Note that, in practice, you'll need to make sure that each word has a
# non-empty list of analyses before indexing into it.
diacritized = [d.analyses[0].analysis['diac'] for d in disambig]
print(' '.join(diacritized))
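The caveat about words without analyses can be handled with a small guard. The sketch below is an assumption-laden illustration: `Scored` and `Disambiguated` are hypothetical stand-ins that mimic only the attributes used in the example above (`analyses` and `analysis`), and falling back to the raw token is one possible policy, not something the library prescribes.

```python
from collections import namedtuple

# Hypothetical stand-ins for the objects returned by disambiguate().
Scored = namedtuple('Scored', ['score', 'analysis'])
Disambiguated = namedtuple('Disambiguated', ['analyses'])


def diacritize(sentence, disambig):
    """Use the top 'diac' value per word, or the raw token if none exists."""
    out = []
    for token, d in zip(sentence, disambig):
        if d.analyses:
            out.append(d.analyses[0].analysis['diac'])
        else:
            # No analysis available: keep the original token unchanged.
            out.append(token)
    return ' '.join(out)


sentence = ['ktb', '???']
disambig = [
    Disambiguated([Scored(1.0, {'diac': 'kutub'})]),
    Disambiguated([]),
]
result = diacritize(sentence, disambig)
```

With real output from `mle.disambiguate(sentence)`, the same function can be called as `diacritize(sentence, disambig)`, provided the returned objects expose `analyses` as in the example above.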