camel_tools.morphology.analyzer

The morphological analyzer component of CAMeL Tools.

Globals

camel_tools.morphology.analyzer.DEFAULT_NORMALIZE_MAP

The default character map used for normalization by Analyzer.

Removes the tatweel/kashida character and performs the following conversions:

‘إ’ to ‘ا’
‘أ’ to ‘ا’
‘آ’ to ‘ا’
‘ٱ’ to ‘ا’
‘ى’ to ‘ي’
‘ة’ to ‘ه’

Type:	`CharMapper`
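
As a rough illustration, the conversions listed above can be reproduced with Python's built-in `str.translate`. This is only a sketch of the mapping, not the library's implementation: `NORMALIZE_TABLE` and `normalize` are hypothetical names, and the real `DEFAULT_NORMALIZE_MAP` is a `CharMapper` instance.

```python
# Hypothetical stand-in for DEFAULT_NORMALIZE_MAP, built with str.maketrans.
# Code points are written as escapes so the mapping is unambiguous.
TATWEEL = '\u0640'

NORMALIZE_TABLE = str.maketrans({
    '\u0625': '\u0627',  # alef with hamza below -> bare alef
    '\u0623': '\u0627',  # alef with hamza above -> bare alef
    '\u0622': '\u0627',  # alef with madda       -> bare alef
    '\u0671': '\u0627',  # alef wasla            -> bare alef
    '\u0649': '\u064a',  # alef maksura          -> yeh
    '\u0629': '\u0647',  # teh marbuta           -> heh
    TATWEEL: None,       # tatweel/kashida is removed
})


def normalize(word: str) -> str:
    """Apply the conversions listed above to a single word."""
    return word.translate(NORMALIZE_TABLE)
```

Characters that are not in the table pass through unchanged.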

Classes

class camel_tools.morphology.analyzer.AnalyzedWord

A named tuple containing a word and its analyses.

word

The analyzed word.

Type:	`str`

analyses

List of analyses for word. See CAMeL Morphology Features for more information on features and their values.

Type:	`list` of `dict`

class camel_tools.morphology.analyzer.Analyzer(db, backoff='NONE', norm_map=None, strict_digit=False, cache_size=0)

Morphological analyzer component.

Parameters:

db (MorphologyDB) – Database to use for analysis. Must be opened in analysis or reinflection mode.
backoff (str, optional) – Backoff mode. Can be one of the following: ‘NONE’, ‘NOAN_ALL’, ‘NOAN_PROP’, ‘ADD_ALL’, or ‘ADD_PROP’. Defaults to ‘NONE’.
norm_map (CharMapper, optional) – Character map for normalizing input words. If set to None, then DEFAULT_NORMALIZE_MAP is used. Defaults to None.
strict_digit (bool, optional) – If set to True, only words composed entirely of digits are considered numbers; otherwise, any word containing a digit is considered a number. Defaults to False.
cache_size (int, optional) – If greater than zero, the analyzer caches the analyses for the cache_size most frequent words; otherwise, no analyses are cached. Defaults to 0.

Raises:

AnalyzerError – If db is not an instance of MorphologyDB, if db does not support analysis, or if backoff is not a valid backoff mode.
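
The two policies selected by strict_digit can be sketched in plain Python. `is_number` is a hypothetical helper written for illustration; it relies on `str.isdigit`, which is an assumption, and the analyzer's own digit test may treat some characters differently.

```python
def is_number(word: str, strict_digit: bool = False) -> bool:
    """Illustrate the strict and lenient digit policies (hypothetical helper)."""
    if strict_digit:
        # Strict: the word is non-empty and every character is a digit.
        return word != '' and all(ch.isdigit() for ch in word)
    # Lenient (the default): one digit anywhere in the word is enough.
    return any(ch.isdigit() for ch in word)
```

Under this sketch, a mixed token such as 'A4' counts as a number only in the default, lenient mode.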

all_feats()

Return a set of all features provided by the database used in this analyzer instance.

Returns:	The set of all features provided by the database used in this analyzer instance.
Return type:	`frozenset` of `str`

analyze(word)

Analyze a given word.

Parameters:	word (`str`) – Word to analyze.
Returns:	The list of analyses for word. See CAMeL Morphology Features for more information on features and their values.
Return type:	`list` of `dict`

analyze_words(words)

Analyze a list of words.

Parameters:	words (`list` of `str`) – List of words to analyze.
Returns:	The list of analyses for each word in words.
Return type:	`list` of `AnalyzedWord`
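
Since each item returned by analyze_words() is an `AnalyzedWord` with `word` and `analyses` fields, results can be consumed by field name. The sketch below uses only those two documented fields; `analysis_counts` is a hypothetical helper, not part of CAMeL Tools.

```python
def analysis_counts(analyzer, words):
    """Pair each input word with its number of analyses (hypothetical helper).

    `analyzer` is expected to provide analyze_words() as documented above.
    """
    return [(aw.word, len(aw.analyses)) for aw in analyzer.analyze_words(words)]


# Usage, assuming `analyzer` is an Analyzer instance:
# for word, count in analysis_counts(analyzer, ['شارع', 'كتاب']):
#     print(word, count)
```

A count of zero means the analyzer returned no analyses for that word.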

tok_feats()

Return a set of tokenization features provided by the database used in this analyzer instance.

Returns:	The set of tokenization features provided by the database used in this analyzer instance.
Return type:	`frozenset` of `str`
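
One possible use of tok_feats() is to reduce an analysis dict to its tokenization features. In this sketch, `tokenization_view` is a hypothetical helper; it assumes only that tok_feats() returns a set of feature names and that each analysis is a dict keyed by feature name, as documented above.

```python
def tokenization_view(analyzer, analysis):
    """Keep only the tokenization features of one analysis (hypothetical helper)."""
    tok = analyzer.tok_feats()
    return {feat: value for feat, value in analysis.items() if feat in tok}


# Usage, assuming `analyzer` is an Analyzer instance:
# for analysis in analyzer.analyze('شارع'):
#     print(tokenization_view(analyzer, analysis))
```

The same pattern works with all_feats() to check which keys an analysis may contain.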

Examples

from camel_tools.morphology.database import MorphologyDB
from camel_tools.morphology.analyzer import Analyzer

db = MorphologyDB.builtin_db()

# Create analyzer with no backoff
analyzer = Analyzer(db)


# Create analyzer with NOAN_PROP backoff
analyzer = Analyzer(db, 'NOAN_PROP')

# or
analyzer = Analyzer(db, backoff='NOAN_PROP')


# To analyze a word, we can use the analyze() method
analyses = analyzer.analyze('شارع')
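
Each analysis returned by analyze() is a dict of features, so the result can be inspected with ordinary dict operations. The sketch below is illustrative only: `summarize` is a hypothetical helper, and the feature names 'diac', 'lex', and 'pos' are assumed to be available; see CAMeL Morphology Features for the features a given database actually provides.

```python
def summarize(analyses, feats=('diac', 'lex', 'pos')):
    """Extract selected features from each analysis (hypothetical helper).

    Features missing from an analysis are reported as None.
    """
    return [tuple(analysis.get(feat) for feat in feats) for analysis in analyses]


# Usage with the `analyses` list obtained above:
# for row in summarize(analyses):
#     print(row)
```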