Morphology Databases Post-installation

This page provides post-installation instructions for specific morphological database packages that require additional installation steps.

Post-installation Steps

calima-msa-s31

Install the database by running camel_data -i morphology-db-msa-s31.
Purchase a copy of SAMA 3.1 from the Linguistic Data Consortium.
Download the SAMA 3.1 archive (should be called LDC2010L01.tgz).
Run camel_data -p morphology-db-msa-s31 /path/to/LDC2010L01.tgz.
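
The two camel_data invocations above can also be scripted. The following is a minimal sketch, not part of CAMeL Tools: the helper names build_commands and run_post_install and the dry_run flag are hypothetical, and only the camel_data commands themselves come from the steps above. By default it only returns the commands so they can be inspected before anything is run.

```python
import shutil
import subprocess
from pathlib import Path

# Package name taken from the steps above.
PACKAGE = 'morphology-db-msa-s31'


def build_commands(archive_path):
    """Return the two camel_data invocations listed in the steps above."""
    return [
        ['camel_data', '-i', PACKAGE],
        ['camel_data', '-p', PACKAGE, str(archive_path)],
    ]


def run_post_install(archive_path, dry_run=True):
    """Hypothetical helper: run (or just return) the post-install commands."""
    commands = build_commands(archive_path)
    if dry_run:
        # Nothing is executed; the caller can inspect the commands.
        return commands
    if not Path(archive_path).is_file():
        raise FileNotFoundError(f'SAMA archive not found: {archive_path}')
    if shutil.which('camel_data') is None:
        raise RuntimeError('camel_data is not on PATH; is camel_tools installed?')
    for cmd in commands:
        # check=True stops at the first failing step.
        subprocess.run(cmd, check=True)
    return commands


# Dry run: print the commands without executing them.
for cmd in run_post_install('/path/to/LDC2010L01.tgz'):
    print(' '.join(cmd))
```

Passing dry_run=False runs both commands in order, which assumes the SAMA 3.1 archive has already been purchased and downloaded.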

Usage

The example below shows how to use calima-msa-s31 once the post-installation steps above are complete. Here, calima-msa-s31 is used to diacritize a sentence.

from camel_tools.morphology.analyzer import Analyzer
from camel_tools.morphology.database import MorphologyDB
from camel_tools.disambig.bert import BERTUnfactoredDisambiguator

# Load the calima-msa-s31 database
db = MorphologyDB.builtin_db('calima-msa-s31')

# Create an analyzer instance using the calima-msa-s31 database
analyzer = Analyzer(db, 'ADD_PROP', cache_size=100000)

# Load the pretrained MSA BERT disambiguator
disambig = BERTUnfactoredDisambiguator.pretrained(model_name='msa', pretrained_cache=False)

# Replace the default analyzer with the calima-msa-s31 analyzer
disambig.set_analyzer(analyzer)

# Disambiguate sentence
sentence = 'سوف نقرأ الكتب'.split()
sentence_disambig = disambig.disambiguate(sentence)

# Extract diacritized words
sentence_diacritized = [d.analyses[0].analysis['diac'] for d in sentence_disambig]
print(' '.join(sentence_diacritized))
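
The extraction step above indexes analyses[0] directly, which assumes every word received at least one analysis. The following is a minimal sketch of a more defensive variant that falls back to the original token. The helper name diacritize_with_fallback is hypothetical, and the FakeWord and FakeScored tuples are stand-ins used only to make the sketch runnable without the database; they mirror just the analyses and analysis attributes used in the example above.

```python
from collections import namedtuple


def diacritize_with_fallback(tokens, disambiguated):
    """Return the top-ranked 'diac' form for each token, or the token
    itself when no analysis (or no 'diac' feature) is available."""
    result = []
    for token, d in zip(tokens, disambiguated):
        if d.analyses:
            # Same access pattern as the example above, with a default.
            result.append(d.analyses[0].analysis.get('diac', token))
        else:
            result.append(token)
    return result


# Stand-in objects for illustration only; these are NOT CAMeL Tools classes.
FakeScored = namedtuple('FakeScored', ['analysis'])
FakeWord = namedtuple('FakeWord', ['analyses'])

demo_tokens = ['abc', 'xyz']
demo_disambig = [FakeWord([FakeScored({'diac': 'ABC'})]), FakeWord([])]
print(' '.join(diacritize_with_fallback(demo_tokens, demo_disambig)))
```

With the real objects, the call would be diacritize_with_fallback(sentence, sentence_disambig), using the variables from the example above.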