camel_tools.disambig.bert
Classes
-
class camel_tools.disambig.bert.BERTUnfactoredDisambiguator(model_path, analyzer, features=['pos', 'per', 'form_gen', 'form_num', 'asp', 'mod', 'vox', 'stt', 'cas', 'prc0', 'prc1', 'prc2', 'prc3', 'enc0'], top=1, scorer='uniform', tie_breaker='tag', use_gpu=True, batch_size=32, ranking_cache=None, ranking_cache_size=100000)
A disambiguator using an unfactored BERT model. This model is based on "Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects" by Inoue, Khalifa, and Habash, Findings of ACL 2022 (https://arxiv.org/abs/2110.06852).
Parameters:
- model_path (str) – The path to the fine-tuned model.
- analyzer (Analyzer) – The analyzer used to provide the full morphological analyses of a word.
- features (list, optional) – A list of morphological features used in the model. Defaults to the 14 features listed in the signature.
- top (int, optional) – The maximum number of top analyses to return. Defaults to 1.
- scorer (str, optional) – The scoring function that computes matches between the features predicted by the model and the output of the analyzer. If 'uniform', scoring uses uniform weights. Defaults to 'uniform'.
- tie_breaker (str, optional) – The tie breaker used in the feature match function. If 'tag', ties are broken using the unfactored tag MLE and the factored tag MLE. Defaults to 'tag'.
- use_gpu (bool, optional) – Whether to use a GPU. Defaults to True.
- batch_size (int, optional) – The batch size. Defaults to 32.
- ranking_cache (LFUCache, optional) – The cache of pre-computed scored analyses. Defaults to None.
- ranking_cache_size (int, optional) – The number of unique word disambiguations to cache. If 0, no ranked analyses will be cached. The cache uses a least-frequently-used eviction policy. Defaults to 100000.
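The scorer description above can be illustrated with a small sketch. This is not the library's implementation: the function names `uniform_score` and `rank_analyses` are hypothetical, the feature values are illustrative, and ties are simply left in input order rather than resolved with the 'tag' tie breaker. It only assumes that a uniform scorer counts each matching feature with equal weight.

```python
def uniform_score(predicted, analysis, features):
    """Count the features on which the prediction and one analysis agree.

    Every feature contributes the same weight (1). A feature missing from
    either side is treated as a mismatch.
    """
    score = 0
    for feat in features:
        if feat in predicted and feat in analysis and predicted[feat] == analysis[feat]:
            score += 1
    return score


def rank_analyses(predicted, analyses, features, top=1):
    """Sort candidate analyses by descending score and keep the first `top`."""
    scored = [(uniform_score(predicted, a, features), a) for a in analyses]
    # sort() is stable, so tied candidates keep their input order here.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top]


# Illustrative feature subset and values (hypothetical data).
features = ['pos', 'asp', 'per']
predicted = {'pos': 'verb', 'asp': 'i', 'per': '1'}
candidates = [
    {'pos': 'noun', 'asp': 'na', 'per': 'na'},
    {'pos': 'verb', 'asp': 'i', 'per': '1'},
    {'pos': 'verb', 'asp': 'p', 'per': '3'},
]
best = rank_analyses(predicted, candidates, features, top=2)
```

With `top=2`, the fully matching candidate is ranked first and the candidate that matches only on `pos` is ranked second.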
-
all_feats()
Return a set of all features produced by this disambiguator.
Returns: The set of all features produced by this disambiguator.
Return type: frozenset of str
-
disambiguate(sentence)
Disambiguate all words of a single sentence.
Parameters: sentence (list of str) – The input sentence.
Returns: The disambiguated analyses for the given sentence.
Return type: list of DisambiguatedWord
-
disambiguate_sentences(sentences)
Disambiguate all words of a list of sentences.
Parameters: sentences (list of list of str) – The input sentences.
Returns: The disambiguated analyses for the given sentences.
Return type: list of list of DisambiguatedWord
-
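disambiguate_word(sentence, word_ndx)
Disambiguate a single word of a sentence.
Parameters:
- sentence (list of str) – The input sentence.
- word_ndx (int) – The index of the word token in sentence to disambiguate.
Returns: The disambiguation of the word token in sentence at word_ndx.
Return type: DisambiguatedWord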
-
static pretrained(model_name='msa', top=1, use_gpu=True, batch_size=32, cache_size=10000, pretrained_cache=True, ranking_cache_size=100000)
Load a pre-trained model provided with camel_tools.
Parameters:
- model_name (str, optional) – Name of the pre-trained model to load. Three models are available: 'msa', 'egy', and 'glf'. Defaults to 'msa'.
- top (int, optional) – The maximum number of top analyses to return. Defaults to 1.
- use_gpu (bool, optional) – Whether to use a GPU. Defaults to True.
- batch_size (int, optional) – The batch size. Defaults to 32.
- cache_size (int, optional) – If greater than zero, the analyzer caches the analyses for the cache_size most frequent words; otherwise no analyses are cached. Defaults to 10000.
- pretrained_cache (bool, optional) – Whether to use a pretrained cache that stores ranked analyses. Defaults to True.
- ranking_cache_size (int, optional) – The number of unique word disambiguations to cache. If 0, no ranked analyses will be cached. The cache uses a least-frequently-used eviction policy. This argument is ignored if pretrained_cache is True. Defaults to 100000.
Returns: Instance with the loaded pre-trained model.
Return type: BERTUnfactoredDisambiguator
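To make the least-frequently-used eviction policy mentioned for ranking_cache_size concrete, here is a minimal, self-contained sketch. `TinyLFUCache` is a hypothetical class written for illustration only; it is not the LFUCache type the disambiguator accepts, and the cached values are placeholder strings.

```python
from collections import Counter


class TinyLFUCache:
    """Illustrative least-frequently-used cache (not the library's LFUCache)."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._data = {}
        self._uses = Counter()  # how often each key has been stored or read

    def get(self, key, default=None):
        if key in self._data:
            self._uses[key] += 1
            return self._data[key]
        return default

    def put(self, key, value):
        if self.maxsize <= 0:
            return  # a size of 0 disables caching entirely
        if key not in self._data and len(self._data) >= self.maxsize:
            # Evict the entry that has been used least often so far.
            victim = min(self._data, key=lambda k: self._uses[k])
            del self._data[victim]
            del self._uses[victim]
        self._data[key] = value
        self._uses[key] += 1

    def __contains__(self, key):
        return key in self._data


cache = TinyLFUCache(maxsize=2)
cache.put('كتب', 'analysis-a')
cache.put('قرأ', 'analysis-b')
cache.get('كتب')                 # 'كتب' has now been used more often than 'قرأ'
cache.put('سوف', 'analysis-c')   # the cache is full, so 'قرأ' is evicted
```

Frequently seen words stay cached under this policy, which suits text where a small set of word forms accounts for most tokens.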
-
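tag_sentence(sentence, use_analyzer=True)
Predict the morphosyntactic labels of a single sentence.
Parameters:
- sentence (list of str) – The input sentence.
- use_analyzer (bool, optional) – Whether to use the analyzer. Defaults to True.
Returns: The list of feature tags for each word in the given sentence.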
Return type:
-
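tag_sentences(sentences, use_analyzer=True)
Predict the morphosyntactic labels of a list of sentences.
Parameters:
- sentences (list of list of str) – The input sentences.
- use_analyzer (bool, optional) – Whether to use the analyzer. Defaults to True.
Returns: The list of feature tags for each word in the given sentences.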
Return type:
Examples
Below is an example of how to load and use the default pre-trained CAMeLBERT-based model to disambiguate the words in a sentence.
from camel_tools.disambig.bert import BERTUnfactoredDisambiguator
unfactored = BERTUnfactoredDisambiguator.pretrained()
# We expect a sentence to be whitespace/punctuation tokenized beforehand.
# We provide a simple whitespace and punctuation tokenizer as part of camel_tools.
# See camel_tools.tokenizers.word.simple_word_tokenize.
sentence = ['سوف', 'نقرأ', 'الكتب']
disambig = unfactored.disambiguate(sentence)
# Let's, for example, use the top disambiguations to generate a diacritized
# version of the above sentence.
# Note that, in practice, you'll need to make sure that each word has a
# non-empty list of analyses before indexing into it.
diacritized = [d.analyses[0].analysis['diac'] for d in disambig]
print(' '.join(diacritized))
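The note about empty analysis lists can be handled with a small fallback, sketched below. The namedtuples are hypothetical stand-ins so the snippet runs without a model. They assume, as in the example above, that each result exposes `analyses` and that each scored analysis exposes an `analysis` dict. The `word` attribute holding the original token is an assumption about DisambiguatedWord.

```python
from collections import namedtuple

# Hypothetical test doubles that mimic the shape used in the example above.
# They are not the camel_tools classes.
FakeScored = namedtuple('FakeScored', ['score', 'analysis'])
FakeDisambiguated = namedtuple('FakeDisambiguated', ['word', 'analyses'])


def diacritize(disambiguated):
    """Return the top 'diac' form per word, or the raw word if none exists."""
    out = []
    for d in disambiguated:
        if d.analyses:
            # Use the highest-ranked analysis; fall back if 'diac' is absent.
            out.append(d.analyses[0].analysis.get('diac', d.word))
        else:
            # No analyses were produced for this token.
            out.append(d.word)
    return out


results = [
    FakeDisambiguated('سوف', [FakeScored(1.0, {'diac': 'سَوْفَ'})]),
    FakeDisambiguated('xyz', []),  # no analyses: keep the input token
]
print(' '.join(diacritize(results)))
```

With real output, the same function could be called on the list returned by `disambiguate`, provided the attribute names match the assumptions stated above.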