camel_tools.dialectid

Danger: This component is not available on Windows.

This module contains the CAMeL Tools dialect identification component. The dialect identification system can distinguish between 25 Arabic city dialects as well as Modern Standard Arabic. It is based on the system described by Salameh, Bouamor and Habash.

Classes

class camel_tools.dialectid.DIDPred

A named tuple containing dialect ID prediction results.

top

The dialect label with the highest score. See Labels for a list of output labels.

Type: str

scores

A dictionary mapping each dialect label to its computed score.

Type: dict
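As a rough illustration of how the two fields relate, the sketch below builds a stand-in named tuple with the same field names (top and scores) and ranks the labels by score. The stand-in type and the score values are made up for this example and are not produced by the library.

```python
from collections import namedtuple

# Stand-in for camel_tools.dialectid.DIDPred, assumed to expose the two
# documented fields: `top` and `scores`.
DIDPred = namedtuple('DIDPred', ['top', 'scores'])

# Made-up scores for illustration only; a real prediction would carry one
# entry per output label.
pred = DIDPred(top='BEI', scores={'BEI': 0.6, 'DAM': 0.3, 'CAI': 0.1})

# Rank the labels from highest to lowest score.
ranked = sorted(pred.scores.items(), key=lambda item: item[1], reverse=True)

# The first ranked entry should agree with the `top` field.
top_label, top_score = ranked[0]
```

Because DIDPred is a named tuple, it can also be unpacked positionally, for example `top, scores = pred`.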
class camel_tools.dialectid.DialectIdentifier(labels=None, labels_extra=None, char_lm_dir=None, word_lm_dir=None)

A class for training, evaluating and running the dialect identification model described by Salameh et al. After initializing an instance, you must run the train method once before using it.

Parameters:
  • labels (set of str, optional) – The set of dialect labels used in the training data in the main model. If None, the default labels are used. Defaults to None.
  • labels_extra (set of str, optional) – The set of dialect labels used in the training data in the extra features model. If None, the default labels are used. Defaults to None.
  • char_lm_dir (str, optional) – Path to the directory containing the character-based language models. If None, use the language models that come with this package. Defaults to None.
  • word_lm_dir (str, optional) – Path to the directory containing the word-based language models. If None, use the language models that come with this package. Defaults to None.
predict(sentences, output='label')

Predict the dialect probability scores for a given list of sentences.

Parameters:
  • sentences (list of str) – The list of sentences.
  • output (str) – The output label type. Possible values are ‘label’, ‘city’, ‘country’, or ‘region’. Defaults to ‘label’.
Returns:

A list of prediction results, each corresponding to its respective sentence.

Return type:

list of DIDPred
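The output parameter can be exercised with a small wrapper such as the one sketched below. The helper name top_predictions and its validation step are assumptions made for this example; only predict, its two parameters, and the four output values come from the description above.

```python
# The four output types listed in the parameter description above.
VALID_OUTPUTS = ('label', 'city', 'country', 'region')


def top_predictions(did, sentences, output='label'):
    """Return only the top prediction for each sentence.

    `did` is expected to be a trained DialectIdentifier (or any object with
    a compatible `predict` method). This helper is hypothetical and not part
    of camel_tools.
    """
    if output not in VALID_OUTPUTS:
        raise ValueError('output must be one of {}'.format(VALID_OUTPUTS))
    predictions = did.predict(sentences, output=output)
    return [p.top for p in predictions]
```

With a trained identifier bound to did, calling `top_predictions(did, sentences, output='country')` would yield one country name per sentence.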

static pretrained()

Load the default pre-trained model provided with camel-tools.

Raises: PretrainedModelError – When a pre-trained model compatible with the current Python version isn’t available.
Returns: The loaded model.
Return type: DialectIdentifier
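Since pretrained() may raise PretrainedModelError, callers may want to guard the load. The following is a minimal sketch, assuming that a missing installation surfaces as an ImportError; the helper name load_pretrained_or_none is hypothetical.

```python
def load_pretrained_or_none():
    """Try to load the default pre-trained model; return None on failure.

    Hypothetical helper, not part of camel_tools.
    """
    try:
        # Imported lazily so that this sketch still runs where camel_tools
        # is not installed (assumed to raise ImportError in that case).
        from camel_tools.dialectid import DialectIdentifier, PretrainedModelError
    except ImportError:
        return None

    try:
        return DialectIdentifier.pretrained()
    except PretrainedModelError as err:
        # Documented failure mode: no pre-trained model is available for
        # the current Python version.
        print('Could not load pre-trained model: {}'.format(err))
        return None
```

Returning None keeps the caller's control flow simple; code that cannot proceed without a model may prefer to let the exception propagate instead.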
class camel_tools.dialectid.DialectIdError(msg)

Base class for all CAMeL Dialect ID errors.

class camel_tools.dialectid.UntrainedModelError(msg)

Error thrown when attempting to use an untrained DialectIdentifier instance.

class camel_tools.dialectid.PretrainedModelError(msg)

Error thrown when attempting to load a pretrained model provided with camel-tools.

Functions

camel_tools.dialectid.label_to_city(prediction)

Converts a dialect prediction using labels to use city names instead.

Parameters: prediction (DIDPred) – The prediction to convert.
Returns: The converted prediction.
Return type: DIDPred
camel_tools.dialectid.label_to_country(prediction)

Converts a dialect prediction using labels to use country names instead.

Parameters: prediction (DIDPred) – The prediction to convert.
Returns: The converted prediction.
Return type: DIDPred
camel_tools.dialectid.label_to_region(prediction)

Converts a dialect prediction using labels to use region names instead.

Parameters: prediction (DIDPred) – The prediction to convert.
Returns: The converted prediction.
Return type: DIDPred
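To make the idea of these conversions concrete, the sketch below re-keys a prediction from labels to city names. It is not the library's implementation: the DIDPred stand-in, the to_city helper, and the scores are assumptions for illustration, and the mapping is a three-row excerpt of the Labels table. Only the city case is shown because it is one-to-one; several labels share a country or region, and how their scores are combined is not stated here.

```python
from collections import namedtuple

# Stand-in for camel_tools.dialectid.DIDPred (fields: top, scores).
DIDPred = namedtuple('DIDPred', ['top', 'scores'])

# Excerpt of the Labels table; the full table has 26 rows.
LABEL_TO_CITY = {'BEI': 'Beirut', 'CAI': 'Cairo', 'DAM': 'Damascus'}


def to_city(prediction, mapping=LABEL_TO_CITY):
    """Re-key a label-based prediction by city name (hypothetical helper)."""
    scores = {mapping[label]: score
              for label, score in prediction.scores.items()}
    return DIDPred(mapping[prediction.top], scores)


# Made-up scores for illustration only.
label_pred = DIDPred('BEI', {'BEI': 0.6, 'DAM': 0.3, 'CAI': 0.1})
city_pred = to_city(label_pred)
```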
camel_tools.dialectid.label_city_pairs()

Returns the set of default label-city pairs.

Returns: The set of default label-city pairs.
Return type: frozenset of tuple
camel_tools.dialectid.label_country_pairs()

Returns the set of default label-country pairs.

Returns: The set of default label-country pairs.
Return type: frozenset of tuple
camel_tools.dialectid.label_region_pairs()

Returns the set of default label-region pairs.

Returns: The set of default label-region pairs.
Return type: frozenset of tuple
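A frozenset of pairs converts readily into lookup tables. The sketch below uses small stand-in sets rather than calling the library, and it assumes each tuple is ordered as (label, name); that ordering is not confirmed by the text above.

```python
from collections import defaultdict

# Stand-ins for the values returned by label_city_pairs() and
# label_country_pairs(); assumed to hold (label, name) tuples.
city_pairs = frozenset({('BEI', 'Beirut'), ('CAI', 'Cairo'), ('ALX', 'Alexandria')})
country_pairs = frozenset({('BEI', 'Lebanon'), ('CAI', 'Egypt'), ('ALX', 'Egypt')})

# Label-to-city is one-to-one, so a plain dict works in both directions.
label_to_city = dict(city_pairs)
city_to_label = {city: label for label, city in city_pairs}

# Several labels can share a country, so the reverse lookup groups labels.
labels_by_country = defaultdict(list)
for label, country in sorted(country_pairs):
    labels_by_country[country].append(label)
```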

Labels

Below is a table mapping output labels to their respective city, country, and region dialects:

Label  City                    Country                 Region
ALE    Aleppo                  Syria                   Levant
ALG    Algiers                 Algeria                 Maghreb
ALX    Alexandria              Egypt                   Nile Basin
AMM    Amman                   Jordan                  Levant
ASW    Aswan                   Egypt                   Nile Basin
BAG    Baghdad                 Iraq                    Iraq
BAS    Basra                   Iraq                    Iraq
BEI    Beirut                  Lebanon                 Levant
BEN    Benghazi                Libya                   Maghreb
CAI    Cairo                   Egypt                   Nile Basin
DAM    Damascus                Syria                   Levant
DOH    Doha                    Qatar                   Gulf
FES    Fes                     Morocco                 Maghreb
JED    Jeddah                  Saudi Arabia            Gulf
JER    Jerusalem               Palestine               Levant
KHA    Khartoum                Sudan                   Nile Basin
MOS    Mosul                   Iraq                    Iraq
MSA    Modern Standard Arabic  Modern Standard Arabic  Modern Standard Arabic
MUS    Muscat                  Oman                    Gulf
RAB    Rabat                   Morocco                 Maghreb
RIY    Riyadh                  Saudi Arabia            Gulf
SAL    Salt                    Jordan                  Levant
SAN    Sana’a                  Yemen                   Gulf of Aden
SFX    Sfax                    Tunisia                 Maghreb
TRI    Tripoli                 Libya                   Maghreb
TUN    Tunis                   Tunisia                 Maghreb

Examples

Below is an example of how to load and use the default pre-trained model.

from camel_tools.dialectid import DialectIdentifier

did = DialectIdentifier.pretrained()

sentences = [
    'مال الهوى و مالي شكون اللي جابني ليك  ما كنت انايا ف حالي بلاو قلبي يانا بيك',
    'بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم'
]

predictions = did.predict(sentences)

# Each prediction is a tuple containing both the top prediction and the
# percentage score of each dialect. To get only the top prediction, we can
# do the following:
top_dialects = [p.top for p in predictions]