camel_tools.morphology.analyzer

The morphological analyzer component of CAMeL Tools.

Globals

camel_tools.morphology.analyzer.DEFAULT_NORMALIZE_MAP

The default character map used for normalization by Analyzer.

Removes the tatweel/kashida character and does the following conversions:

  • ‘إ’ to ‘ا’
  • ‘أ’ to ‘ا’
  • ‘آ’ to ‘ا’
  • ‘ٱ’ to ‘ا’
  • ‘ى’ to ‘ي’
  • ‘ة’ to ‘ه’

Type: CharMapper
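
The conversions listed above can be approximated in plain Python. The following is only a minimal sketch built on `str.translate`. It is not how DEFAULT_NORMALIZE_MAP is implemented (the real object is a CharMapper), and the names `_NORMALIZE_TABLE` and `normalize_sketch` are hypothetical.

```python
# Illustrative stand-in for the documented mappings; the real
# DEFAULT_NORMALIZE_MAP is a CharMapper, not a translation table.
_NORMALIZE_TABLE = str.maketrans({
    '\u0625': '\u0627',  # إ -> ا
    '\u0623': '\u0627',  # أ -> ا
    '\u0622': '\u0627',  # آ -> ا
    '\u0671': '\u0627',  # ٱ -> ا
    '\u0649': '\u064A',  # ى -> ي
    '\u0629': '\u0647',  # ة -> ه
    '\u0640': None,      # tatweel/kashida is removed
})


def normalize_sketch(word):
    """Apply the documented character conversions to a single word."""
    return word.translate(_NORMALIZE_TABLE)
```

Because both forms of a word reduce to the same string, a lookup after this step does not depend on which alef variant or final letter the writer used.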

Classes

class camel_tools.morphology.analyzer.AnalyzedWord

A named tuple containing a word and its analyses.

word

The analyzed word.

Type: str

analyses

List of analyses for word. See CAMeL Morphology Features for more information on features and their values.

Type: list of dict

class camel_tools.morphology.analyzer.Analyzer(db, backoff='NONE', norm_map=None, strict_digit=False, cache_size=0)

Morphological analyzer component.

Parameters:
  • db (MorphologyDB) – Database to use for analysis. Must be opened in analysis or reinflection mode.
  • backoff (str, optional) – Backoff mode. Can be one of the following: ‘NONE’, ‘NOAN_ALL’, ‘NOAN_PROP’, ‘ADD_ALL’, or ‘ADD_PROP’. Defaults to ‘NONE’.
  • norm_map (CharMapper, optional) – Character map for normalizing input words. If set to None, then DEFAULT_NORMALIZE_MAP is used. Defaults to None.
  • strict_digit (bool, optional) – If set to True, only words composed entirely of digits are considered numbers; otherwise, any word containing a digit is considered a number. Defaults to False.
  • cache_size (int, optional) – If greater than zero, the analyzer caches the analyses for the cache_size most frequent words; otherwise, no analyses are cached. Defaults to 0.
Raises:

AnalyzerError – If db is not an instance of MorphologyDB, if db does not support analysis, or if backoff is not a valid backoff mode.
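
The difference between the two strict_digit settings can be sketched as follows. This is a hedged illustration of the documented behavior, not the library's code: the helper `looks_like_number` is hypothetical, and the choice of which characters count as digits (ASCII 0-9 and Arabic-Indic U+0660 to U+0669) is an assumption.

```python
# Assumption: a "digit" is ASCII 0-9 or Arabic-Indic U+0660..U+0669.
_DIGITS = frozenset('0123456789') | frozenset(
    chr(code) for code in range(0x0660, 0x066A)
)


def looks_like_number(word, strict_digit=False):
    """Mimic the documented strict_digit rule for a single word."""
    if not word:
        return False
    flags = [ch in _DIGITS for ch in word]
    # Strict: every character must be a digit.
    # Non-strict: one digit anywhere is enough.
    return all(flags) if strict_digit else any(flags)
```

Under this sketch, a mixed token such as 'covid19' counts as a number only when strict_digit is False.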

all_feats()

Return a set of all features provided by the database used in this analyzer instance.

Returns: The set of all features provided by the database used in this analyzer instance.
Return type: frozenset of str

analyze(word)

Analyze a given word.

Parameters: word (str) – Word to analyze.
Returns: The list of analyses for word. See CAMeL Morphology Features for more information on features and their values.
Return type: list of dict

analyze_words(words)

Analyze a list of words.

Parameters: words (list of str) – List of words to analyze.
Returns: The list of analyses for each word in words.
Return type: list of AnalyzedWord

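
To make the return shape concrete, here is a minimal sketch of how a per-word analyze function could be lifted to a list of words. It assumes one AnalyzedWord per input word, in input order, which the description above suggests but does not state. The named tuple is a local stand-in with the documented fields, and `analyze_words_sketch` and `fake_analyze` are hypothetical names.

```python
from collections import namedtuple

# Local stand-in mirroring the documented fields of AnalyzedWord.
AnalyzedWord = namedtuple('AnalyzedWord', ['word', 'analyses'])


def analyze_words_sketch(analyze, words):
    """Pair each word with the list returned by analyze(word)."""
    return [AnalyzedWord(word, analyze(word)) for word in words]


def fake_analyze(word):
    # Hypothetical stub; this is NOT real analyzer output.
    return [{'placeholder': word}]


results = analyze_words_sketch(fake_analyze, ['a', 'b'])
```

Each element can then be unpacked as `word, analyses = results[0]` or read through the `.word` and `.analyses` fields.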
tok_feats()

Return a set of tokenization features provided by the database used in this analyzer instance.

Returns: The set of tokenization features provided by the database used in this analyzer instance.
Return type: frozenset of str

Examples

from camel_tools.morphology.database import MorphologyDB
from camel_tools.morphology.analyzer import Analyzer

db = MorphologyDB.builtin_db()

# Create analyzer with no backoff
analyzer = Analyzer(db)


# Create analyzer with NOAN_PROP backoff
analyzer = Analyzer(db, 'NOAN_PROP')

# or
analyzer = Analyzer(db, backoff='NOAN_PROP')


# To analyze a word, we can use the analyze() method
analyses = analyzer.analyze('شارع')
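
Since analyze() returns a list of dicts, ordinary Python is enough to inspect the result. The helper below is a hypothetical sketch that collects the distinct values of one feature across all analyses. The toy input uses made-up placeholder values rather than real analyzer output, and the feature name 'pos' is assumed to be one of the keys described in CAMeL Morphology Features.

```python
def feature_values(analyses, feat):
    """Return the sorted distinct values of feat, skipping analyses without it."""
    return sorted({analysis[feat] for analysis in analyses if feat in analysis})


# Placeholder data for illustration only; not produced by the analyzer.
toy_analyses = [
    {'pos': 'noun'},
    {'pos': 'verb'},
    {'pos': 'noun'},
    {'other': 'x'},
]
distinct_pos = feature_values(toy_analyses, 'pos')
```

With a real analyzer, the same call would be `feature_values(analyzer.analyze(word), 'pos')`.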