camel_tools.disambig.mle
Contains a disambiguator that uses a Maximum Likelihood Estimation model.
Classes
class camel_tools.disambig.mle.MLEDisambiguator(analyzer, mle_path=None, top=1, cache_size=100000)
A disambiguator using a Maximum Likelihood Estimation (MLE) model. It first looks each word up in a word-based MLE model. If no word-based model is provided, or a word is not in it, an analyzer is used to disambiguate the word based on the pos-lex log probabilities of its analyses.
Parameters:
- analyzer (Analyzer) – Analyzer to use when a word is not in the word-based MLE model. It should provide pos-lex log probabilities for its analyses so that they can be ranked.
- mle_path (str, optional) – Path to the MLE JSON file. If None, no word-based MLE lookup is performed and disambiguation goes directly to the pos-lex model. Defaults to None.
- top (int, optional) – The maximum number of top analyses to return. Defaults to 1.
- cache_size (int, optional) – The number of unique word disambiguations to cache. The cache uses a least-frequently-used eviction policy. Defaults to 100000.
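The two-stage lookup described above can be illustrated with a minimal, self-contained sketch. It does not use camel_tools. The tables, the transliterated toy words, and the `pos_lex_logprob` key are all hypothetical stand-ins chosen for illustration, and the real model may store and score analyses differently.

```python
import math

# Hypothetical word-based MLE table: word -> its single most likely analysis.
WORD_MLE = {
    "fy": {"diac": "fiy", "pos": "prep", "lex": "fiy"},
}

# Hypothetical analyzer output: word -> candidate analyses, each carrying a
# pos-lex log probability (the key name 'pos_lex_logprob' is an assumption).
ANALYSES = {
    "ktb": [
        {"diac": "kutub", "pos": "noun", "lex": "kitAb",
         "pos_lex_logprob": math.log(0.3)},
        {"diac": "kataba", "pos": "verb", "lex": "katab",
         "pos_lex_logprob": math.log(0.7)},
    ],
}


def disambiguate_word(word, top=1):
    """Return up to `top` analyses for `word`, best first."""
    # Stage 1: direct lookup in the word-based model, if the word is known.
    if word in WORD_MLE:
        return [WORD_MLE[word]]
    # Stage 2: fall back to the analyzer's candidates and rank them by
    # pos-lex log probability, highest first.
    candidates = ANALYSES.get(word, [])
    ranked = sorted(candidates, key=lambda a: a["pos_lex_logprob"], reverse=True)
    return ranked[:top]
```

In this sketch, `top` only limits how many ranked candidates come back from the fallback stage. A word unknown to both tables yields an empty list.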
all_feats()
Return a set of all features produced by this disambiguator.
Returns: The set of all features produced by this disambiguator.
Return type: frozenset of str
disambiguate(sentence)
Disambiguate all words in a given sentence.
Parameters: sentence (list of str) – The list of whitespace- and punctuation-separated tokens comprising a given sentence.
Returns: The list of disambiguations for each word in the given sentence.
Return type: list of DisambiguatedWord
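disambiguate_word(sentence, word_ndx)
Disambiguate a single word in a sentence. Note that while MLE disambiguation operates on each word out of context, this interface is kept for compatibility with disambiguators that work in the context of a sentence.
Parameters:
- sentence (list of str) – The list of whitespace- and punctuation-separated tokens comprising a given sentence.
- word_ndx (int) – The index of the word token in sentence to disambiguate.
Returns: The disambiguation of the word token in sentence at word_ndx.
Return type: DisambiguatedWord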
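To make the "out of context" remark concrete, here is a small sketch of a context-free disambiguator that still exposes a sentence-level interface. The class and its lookup table are hypothetical and are not the camel_tools implementation. They only show that the result for a token depends on the token alone, not on its neighbours.

```python
class ToyContextFreeDisambiguator:
    """Hypothetical stand-in used to illustrate the interface only."""

    def __init__(self, table):
        self.table = table  # maps a word to some label

    def disambiguate_word(self, sentence, word_ndx):
        # Only the token at word_ndx is consulted; the rest of the
        # sentence is accepted for interface compatibility and ignored.
        word = sentence[word_ndx]
        return (word, self.table.get(word))

    def disambiguate(self, sentence):
        # The sentence-level call is just the word-level call per index.
        return [self.disambiguate_word(sentence, i)
                for i in range(len(sentence))]
```

Because the context is ignored, the same token gives the same result wherever it appears in a sentence.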
static pretrained(model_name=None, analyzer=None, top=1, cache_size=100000)
Load a pre-trained MLE disambiguator provided with CAMeL Tools.
Parameters:
- model_name (str, optional) – The name of the pretrained model. If None, the default model (‘calima-msa-r13’) is loaded. At the moment, the available model names are the same as those in Databases. Defaults to None.
- analyzer (Analyzer, optional) – Alternative analyzer to use. If None, an instance of the model’s default analyzer is created. Defaults to None.
- top (int, optional) – The maximum number of top analyses to return. Defaults to 1.
- cache_size (int, optional) – The number of unique word disambiguations to cache. The cache uses a least-frequently-used eviction policy. Defaults to 100000.
Returns: The loaded MLE disambiguator.
Return type: MLEDisambiguator
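The cache_size parameter refers to a least-frequently-used (LFU) eviction policy. The following toy cache is a sketch of that policy in general and makes no claim about how camel_tools implements its cache. The class name, its `get(key, compute)` method, and the tie-breaking behaviour are assumptions made for illustration.

```python
from collections import Counter


class LFUCache:
    """Toy least-frequently-used cache (illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.values = {}       # key -> cached result
        self.hits = Counter()  # key -> number of requests so far

    def get(self, key, compute):
        if key not in self.values:
            if self.capacity <= 0:
                # Caching disabled: always recompute.
                return compute(key)
            if len(self.values) >= self.capacity:
                # Evict the entry that has been requested the fewest times.
                victim = min(self.values, key=lambda k: self.hits[k])
                del self.values[victim]
                del self.hits[victim]
            self.values[key] = compute(key)
        self.hits[key] += 1
        return self.values[key]
```

Under this policy, frequent words tend to stay cached while rare words are evicted first, which suits the skewed word frequencies of natural text.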
Examples
Below is an example of how to load and use the default pre-trained MLE model to disambiguate words in a sentence.
from camel_tools.disambig.mle import MLEDisambiguator
mle = MLEDisambiguator.pretrained()
# We expect a sentence to be whitespace/punctuation tokenized beforehand.
# We provide a simple whitespace and punctuation tokenizer as part of camel_tools.
# See camel_tools.tokenizers.word.simple_word_tokenize.
sentence = ['سوف', 'نقرأ', 'الكتب']
disambig = mle.disambiguate(sentence)
# Let's, for example, use the top disambiguations to generate a diacritized
# version of the above sentence.
# Note that, in practice, you'll need to make sure that each word has a
# non-empty list of analyses before indexing into it.
diacritized = [d.analyses[0].analysis['diac'] for d in disambig]
print(' '.join(diacritized))
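The example above indexes `d.analyses[0]` directly, which raises an IndexError for any word with no analyses. One possible guard is sketched below. To keep it runnable without camel_tools, it uses namedtuple stand-ins that mirror only the attributes used above (`analyses`, `analysis`). The `word` and `score` fields and the `diacritize` helper are assumptions made for illustration.

```python
from collections import namedtuple

# Stand-ins for the library's result objects; only the attribute names
# used in the example above are relied upon. 'word' and 'score' are assumed.
ScoredAnalysis = namedtuple("ScoredAnalysis", ["score", "analysis"])
DisambiguatedWord = namedtuple("DisambiguatedWord", ["word", "analyses"])


def diacritize(disambig):
    """Join top 'diac' values, keeping the raw word when no analysis exists."""
    out = []
    for d in disambig:
        if d.analyses:
            out.append(d.analyses[0].analysis["diac"])
        else:
            # No analysis available: fall back to the original token.
            out.append(d.word)
    return " ".join(out)
```

Falling back to the undiacritized token is only one choice. Skipping the word or raising an error may be more appropriate, depending on the application.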