camel_tools.dialectid

Danger

This component is not available on Windows.

This module contains the CAMeL Tools dialect identification component. The system distinguishes between 25 Arabic city dialects and Modern Standard Arabic. It is based on the system described by Salameh, Bouamor and Habash.

Classes

class camel_tools.dialectid.DIDPred(top, scores)

A named tuple containing dialect ID prediction results.

top

The dialect label with the highest score. See Labels for a list of output labels.

Type: str

scores

A dictionary mapping each dialect label to its computed score.

Type: dict
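
As a minimal sketch of working with these two fields, the snippet below ranks the entries of scores and keeps the best few. The namedtuple defined here is only a local stand-in for DIDPred so that the snippet runs without camel-tools installed. The top_k helper is hypothetical and the score values are made up for illustration.

```python
from collections import namedtuple

# Local stand-in mirroring the documented fields of DIDPred (assumption:
# in real code the class comes from camel_tools.dialectid instead).
DIDPred = namedtuple('DIDPred', ['top', 'scores'])


def top_k(pred, k=3):
    """Return the k highest-scoring (label, score) pairs, best first."""
    ranked = sorted(pred.scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up scores, for illustration only.
pred = DIDPred(top='CAI',
               scores={'CAI': 0.6, 'ALX': 0.25, 'ASW': 0.1, 'MSA': 0.05})

best = top_k(pred, k=2)
```

Ranking the whole dictionary can be useful when the top label alone hides a close runner-up, for example two cities in the same country.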

camel_tools.dialectid.DialectIdentifier: alias of DIDModel26

class camel_tools.dialectid.DIDModel26(labels=None, labels_extra=None, char_lm_dir=None, word_lm_dir=None)

A class for training, evaluating and running the dialect identification model ‘Model-26’ described by Salameh et al. After initializing an instance, you must run the train method once before using it.

Parameters:

labels (set of str, optional) – The set of dialect labels used in the training data in the main model. If None, the default labels are used. Defaults to None.
labels_extra (set of str, optional) – The set of dialect labels used in the training data in the extra features model. If None, the default labels are used. Defaults to None.
char_lm_dir (str, optional) – Path to the directory containing the character-based language models. If None, use the language models that come with this package. Defaults to None.
word_lm_dir (str, optional) – Path to the directory containing the word-based language models. If None, use the language models that come with this package. Defaults to None.

predict(sentences, output='label')

Predict the dialect probability scores for a given list of sentences.

Parameters:

sentences (list of str) – The list of sentences.
output (str) – The output label type. Possible values are ‘label’, ‘city’, ‘country’, or ‘region’. Defaults to ‘label’.

Returns:

A list of prediction results, each corresponding to its respective sentence.

Return type:

list of DIDPred

static pretrained()

Load the default pre-trained model provided with camel-tools.

Raises: PretrainedModelError – When a pre-trained model compatible with the current Python version isn’t available.
Returns: The loaded model.
Return type: DialectIdentifier

class camel_tools.dialectid.DIDModel6(labels=None, char_lm_dir=None, word_lm_dir=None)

A class for training, evaluating and running the dialect identification model ‘Model-6’ described by Salameh et al. After initializing an instance, you must run the train method once before using it.

Parameters:

labels (set of str, optional) – The set of dialect labels used in the training data in the main model. If None, the default labels are used. Defaults to None.
char_lm_dir (str, optional) – Path to the directory containing the character-based language models. If None, use the language models that come with this package. Defaults to None.
word_lm_dir (str, optional) – Path to the directory containing the word-based language models. If None, use the language models that come with this package. Defaults to None.

predict(sentences, output='label')

Predict the dialect probability scores for a given list of sentences.

Parameters:

sentences (list of str) – The list of sentences.
output (str) – The output label type. Possible values are ‘label’, ‘city’, ‘country’, or ‘region’. Defaults to ‘label’.

Returns:

A list of prediction results, each corresponding to its respective sentence.

Return type:

list of DIDPred

static pretrained()

Load the default pre-trained model provided with camel-tools.

Raises: PretrainedModelError – When a pre-trained model compatible with the current Python version isn’t available.
Returns: The loaded model.
Return type: DialectIdentifier

class camel_tools.dialectid.DialectIdError(msg): Base class for all CAMeL Dialect ID errors.

class camel_tools.dialectid.UntrainedModelError(msg): Error thrown when attempting to use an untrained DialectIdentifier instance.

class camel_tools.dialectid.InvalidDataSetError(dataset): Error thrown when an invalid data set name is given to eval.

class camel_tools.dialectid.PretrainedModelError(msg): Error thrown when attempting to load a pretrained model provided with camel-tools.
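
One possible way to handle these errors when loading the pre-trained model is sketched below. The load_default_identifier helper is hypothetical and not part of camel-tools. It assumes that returning None is an acceptable fallback both when the package cannot be imported (for example on Windows) and when pretrained() raises PretrainedModelError.

```python
def load_default_identifier():
    """Return the pre-trained identifier, or None if it cannot be loaded.

    Hypothetical helper, not part of camel-tools.
    """
    try:
        # Imported lazily so that a missing installation is handled here.
        from camel_tools.dialectid import (DialectIdentifier,
                                           PretrainedModelError)
    except ImportError:
        return None

    try:
        return DialectIdentifier.pretrained()
    except PretrainedModelError:
        # No pre-trained model is available for this Python version.
        return None
```

A caller would then check the result before predicting, for example by comparing it against None. Catching DialectIdError instead would also cover the other error classes listed above, since it is documented as their base class.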

Labels

Below is a table mapping output labels to their respective city, country, and region dialects:

Label	City	Country	Region
ALE	Aleppo	Syria	Levant
ALG	Algiers	Algeria	Maghreb
ALX	Alexandria	Egypt	Nile Basin
AMM	Amman	Jordan	Levant
ASW	Aswan	Egypt	Nile Basin
BAG	Baghdad	Iraq	Iraq
BAS	Basra	Iraq	Iraq
BEI	Beirut	Lebanon	Levant
BEN	Benghazi	Libya	Maghreb
CAI	Cairo	Egypt	Nile Basin
DAM	Damascus	Syria	Levant
DOH	Doha	Qatar	Gulf
FES	Fes	Morocco	Maghreb
JED	Jeddah	Saudi Arabia	Gulf
JER	Jerusalem	Palestine	Levant
KHA	Khartoum	Sudan	Nile Basin
MOS	Mosul	Iraq	Iraq
MSA	Modern Standard Arabic	Modern Standard Arabic	Modern Standard Arabic
MUS	Muscat	Oman	Gulf
RAB	Rabat	Morocco	Maghreb
RIY	Riyadh	Saudi Arabia	Gulf
SAL	Salt	Jordan	Levant
SAN	Sana’a	Yemen	Gulf of Aden
SFX	Sfax	Tunisia	Maghreb
TRI	Tripoli	Libya	Maghreb
TUN	Tunis	Tunisia	Maghreb
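
For programmatic lookups, the table above can be transcribed into a plain dictionary. The sketch below does that and adds a small helper that lists the labels belonging to a region. The names LABELS and labels_in_region are hypothetical and are not provided by camel-tools.

```python
# label -> (city, country, region), transcribed from the table above.
LABELS = {
    'ALE': ('Aleppo', 'Syria', 'Levant'),
    'ALG': ('Algiers', 'Algeria', 'Maghreb'),
    'ALX': ('Alexandria', 'Egypt', 'Nile Basin'),
    'AMM': ('Amman', 'Jordan', 'Levant'),
    'ASW': ('Aswan', 'Egypt', 'Nile Basin'),
    'BAG': ('Baghdad', 'Iraq', 'Iraq'),
    'BAS': ('Basra', 'Iraq', 'Iraq'),
    'BEI': ('Beirut', 'Lebanon', 'Levant'),
    'BEN': ('Benghazi', 'Libya', 'Maghreb'),
    'CAI': ('Cairo', 'Egypt', 'Nile Basin'),
    'DAM': ('Damascus', 'Syria', 'Levant'),
    'DOH': ('Doha', 'Qatar', 'Gulf'),
    'FES': ('Fes', 'Morocco', 'Maghreb'),
    'JED': ('Jeddah', 'Saudi Arabia', 'Gulf'),
    'JER': ('Jerusalem', 'Palestine', 'Levant'),
    'KHA': ('Khartoum', 'Sudan', 'Nile Basin'),
    'MOS': ('Mosul', 'Iraq', 'Iraq'),
    'MSA': ('Modern Standard Arabic', 'Modern Standard Arabic',
            'Modern Standard Arabic'),
    'MUS': ('Muscat', 'Oman', 'Gulf'),
    'RAB': ('Rabat', 'Morocco', 'Maghreb'),
    'RIY': ('Riyadh', 'Saudi Arabia', 'Gulf'),
    'SAL': ('Salt', 'Jordan', 'Levant'),
    'SAN': ("Sana’a", 'Yemen', 'Gulf of Aden'),
    'SFX': ('Sfax', 'Tunisia', 'Maghreb'),
    'TRI': ('Tripoli', 'Libya', 'Maghreb'),
    'TUN': ('Tunis', 'Tunisia', 'Maghreb'),
}


def labels_in_region(region):
    """Return the labels whose region column matches, in sorted order."""
    return sorted(label for label, (_, _, reg) in LABELS.items()
                  if reg == region)


iraqi_labels = labels_in_region('Iraq')
```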

Examples

Below is an example of how to load and use the default pre-trained model.

from camel_tools.dialectid import DialectIdentifier

did = DialectIdentifier.pretrained()

sentences = [
    'مال الهوى و مالي شكون اللي جابني ليك  ما كنت انايا ف حالي بلاو قلبي يانا بيك',
    'بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم'
]

predictions = did.predict(sentences)

# Each prediction is a named tuple (DIDPred) containing both the top
# prediction and the score of every dialect label. To get only the top
# prediction, we can do the following:
top_dialects = [p.top for p in predictions]