camel_tools.disambig.bert

Classes

class camel_tools.disambig.bert.BERTUnfactoredDisambiguator(model_path, analyzer, features=['pos', 'per', 'form_gen', 'form_num', 'asp', 'mod', 'vox', 'stt', 'cas', 'prc0', 'prc1', 'prc2', 'prc3', 'enc0'], top=1, scorer='uniform', tie_breaker='tag', use_gpu=True, batch_size=32, ranking_cache=None, ranking_cache_size=100000)

A disambiguator using an unfactored BERT model. This model is based on Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects by Inoue, Khalifa, and Habash (Findings of ACL 2022; https://arxiv.org/abs/2110.06852).

Parameters:
  • model_path (str) – The path to the fine-tuned model.
  • analyzer (Analyzer) – Analyzer to use for providing full morphological analysis of a word.
  • features (list of str, optional) – A list of morphological features used in the model. Defaults to the 14 features listed above.
  • top (int, optional) – The maximum number of top analyses to return. Defaults to 1.
  • scorer (str, optional) – The scoring function that computes matches between the features predicted by the model and the output of the analyzer. If 'uniform', scoring based on uniform weights is used. Defaults to 'uniform'.
  • tie_breaker (str, optional) – The tie breaker used in the feature-match function. If 'tag', tie breaking based on the unfactored tag MLE and the factored tag MLE is used. Defaults to 'tag'.
  • use_gpu (bool, optional) – Flag indicating whether to use a GPU. Defaults to True.
  • batch_size (int, optional) – The batch size. Defaults to 32.
  • ranking_cache (LFUCache, optional) – The cache of pre-computed scored analyses. Defaults to None.
  • ranking_cache_size (int, optional) – The number of unique word disambiguations to cache. If 0, no ranked analyses will be cached. The cache uses a least-frequently-used eviction policy. Defaults to 100000.
all_feats()

Return a set of all features produced by this disambiguator.

Returns:The set of all features produced by this disambiguator.
Return type:frozenset of str
disambiguate(sentence)

Disambiguate all words of a single sentence.

Parameters:sentence (list of str) – The input sentence.
Returns:The disambiguated analyses for the given sentence.
Return type:list of DisambiguatedWord
disambiguate_sentences(sentences)

Disambiguate all words of a list of sentences.

Parameters:sentences (list of list of str) – The input sentences.
Returns:The disambiguated analyses for the given sentences.
Return type:list of list of DisambiguatedWord
disambiguate_word(sentence, word_ndx)

Disambiguates a single word of a sentence.

Parameters:
  • sentence (list of str) – The input sentence.
  • word_ndx (int) – The index of the word token in sentence to disambiguate.
Returns:

The disambiguation of the word token in sentence at word_ndx.

Return type:

DisambiguatedWord

static pretrained(model_name='msa', top=1, use_gpu=True, batch_size=32, cache_size=10000, pretrained_cache=True, ranking_cache_size=100000)

Load a pre-trained model provided with camel_tools.

Parameters:
  • model_name (str, optional) – Name of pre-trained model to load. Three models are available: 'msa', 'egy', and 'glf'. Defaults to 'msa'.
  • top (int, optional) – The maximum number of top analyses to return. Defaults to 1.
  • use_gpu (bool, optional) – Flag indicating whether to use a GPU. Defaults to True.
  • batch_size (int, optional) – The batch size. Defaults to 32.
  • cache_size (int, optional) – If greater than zero, the analyzer will cache the analyses of the cache_size most frequent words; otherwise no analyses will be cached. Defaults to 10000.
  • pretrained_cache (bool, optional) – The flag to use a pretrained cache that stores ranked analyses. Defaults to True.
  • ranking_cache_size (int, optional) – The number of unique word disambiguations to cache. If 0, no ranked analyses will be cached. The cache uses a least-frequently-used eviction policy. This argument is ignored if pretrained_cache is True. Defaults to 100000.
Returns:

Instance with loaded pre-trained model.

Return type:

BERTUnfactoredDisambiguator

tag_sentence(sentence, use_analyzer=True)

Predict the morphosyntactic labels of a single sentence.

Parameters:
  • sentence (list of str) – The list of space- and punctuation-separated tokens comprising a given sentence.
  • use_analyzer (bool) – Flag indicating whether to use an analyzer. If set to False, the original input is returned as diac and lex. Defaults to True.
Returns:

The list of feature tags for each word in the given sentence

Return type:

list of dict

tag_sentences(sentences, use_analyzer=True)

Predict the morphosyntactic labels of a list of sentences.

Parameters:
  • sentences (list of list of str) – The input sentences.
  • use_analyzer (bool) – Flag indicating whether to use an analyzer. If set to False, the original input is returned as diac and lex. Defaults to True.
Returns:

The list of feature tags for each word in the given sentences.

Return type:

list of list of dict

tok_feats()

Return a set of tokenization features produced by this disambiguator.

Returns:The set of tokenization features produced by this disambiguator.
Return type:frozenset of str

Examples

Below is an example of how to load and use the default pre-trained CAMeLBERT based model to disambiguate words in a sentence.

from camel_tools.disambig.bert import BERTUnfactoredDisambiguator

unfactored = BERTUnfactoredDisambiguator.pretrained()

# We expect a sentence to be whitespace/punctuation tokenized beforehand.
# We provide a simple whitespace and punctuation tokenizer as part of camel_tools.
# See camel_tools.tokenizers.word.simple_word_tokenize.
sentence = ['سوف', 'نقرأ', 'الكتب']

disambig = unfactored.disambiguate(sentence)

# Let's, for example, use the top disambiguations to generate a diacritized
# version of the above sentence.
# Note that, in practice, you'll need to make sure that each word has a
# non-zero list of analyses.
diacritized = [d.analyses[0].analysis['diac'] for d in disambig]
print(' '.join(diacritized))