camel_tools.dialectid

Danger

Note: This component is not available on Windows.

This module contains the CAMeL Tools dialect identification component. The system distinguishes between 25 Arabic city dialects and Modern Standard Arabic. It is based on the system described by Salameh, Bouamor and Habash.

Classes

class camel_tools.dialectid.DIDPred(top, scores)

A named tuple containing dialect ID prediction results.

top

The dialect label with the highest score. See Labels for a list of output labels.

Type:

str

scores

A dictionary mapping each dialect label to its computed score.

Type:

dict
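As a minimal sketch of how the two fields relate, the snippet below uses a stand-in named tuple with the same field names as DIDPred. The stand-in definition and the score values are assumptions made for illustration. They are not taken from the library or from real model output.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the documented fields of DIDPred.
# The real class is provided by camel_tools.dialectid.
DIDPred = namedtuple('DIDPred', ['top', 'scores'])

# Made-up scores for illustration only (not real model output).
pred = DIDPred(top='CAI', scores={'CAI': 0.7, 'ALX': 0.2, 'MSA': 0.1})

# Fields can be read by name or unpacked positionally.
top, scores = pred

# Look up the score that belongs to the top label.
best_score = pred.scores[pred.top]
```

Because the result is a named tuple, code that only needs the winning label can read `pred.top` and ignore the dictionary entirely.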

camel_tools.dialectid.DialectIdentifier

alias of DIDModel26

class camel_tools.dialectid.DIDModel26(labels=None, labels_extra=None, char_lm_dir=None, word_lm_dir=None)

A class for training, evaluating and running the dialect identification model ‘Model-26’ described by Salameh et al. After initializing an instance, you must run the train method once before using it.

Parameters:
  • labels (set of str, optional) – The set of dialect labels used in the training data in the main model. If None, the default labels are used. Defaults to None.

  • labels_extra (set of str, optional) – The set of dialect labels used in the training data in the extra features model. If None, the default labels are used. Defaults to None.

  • char_lm_dir (str, optional) – Path to the directory containing the character-based language models. If None, use the language models that come with this package. Defaults to None.

  • word_lm_dir (str, optional) – Path to the directory containing the word-based language models. If None, use the language models that come with this package. Defaults to None.

predict(sentences, output='label')

Predict the dialect probability scores for a given list of sentences.

Parameters:
  • sentences (list of str) – The list of sentences.

  • output (str) – The output label type. Possible values are ‘label’, ‘city’, ‘country’, or ‘region’. Defaults to ‘label’.

Returns:

A list of prediction results, each corresponding to its respective sentence.

Return type:

list of DIDPred

static pretrained()

Load the default pre-trained model provided with camel-tools.

Raises:

PretrainedModelError – When a pre-trained model compatible with the current Python version isn’t available.

Returns:

The loaded model.

Return type:

DialectIdentifier

class camel_tools.dialectid.DIDModel6(labels=None, char_lm_dir=None, word_lm_dir=None)

A class for training, evaluating and running the dialect identification model ‘Model-6’ described by Salameh et al. After initializing an instance, you must run the train method once before using it.

Parameters:
  • labels (set of str, optional) – The set of dialect labels used in the training data in the main model. If None, the default labels are used. Defaults to None.

  • char_lm_dir (str, optional) – Path to the directory containing the character-based language models. If None, use the language models that come with this package. Defaults to None.

  • word_lm_dir (str, optional) – Path to the directory containing the word-based language models. If None, use the language models that come with this package. Defaults to None.

predict(sentences, output='label')

Predict the dialect probability scores for a given list of sentences.

Parameters:
  • sentences (list of str) – The list of sentences.

  • output (str) – The output label type. Possible values are ‘label’, ‘city’, ‘country’, or ‘region’. Defaults to ‘label’.

Returns:

A list of prediction results, each corresponding to its respective sentence.

Return type:

list of DIDPred

static pretrained()

Load the default pre-trained model provided with camel-tools.

Raises:

PretrainedModelError – When a pre-trained model compatible with the current Python version isn’t available.

Returns:

The loaded model.

Return type:

DialectIdentifier

class camel_tools.dialectid.DialectIdError(msg)

Base class for all CAMeL Dialect ID errors.

class camel_tools.dialectid.UntrainedModelError(msg)

Error raised when attempting to use an untrained DialectIdentifier instance.

class camel_tools.dialectid.InvalidDataSetError(dataset)

Error raised when an invalid data set name is given to eval.

class camel_tools.dialectid.PretrainedModelError(msg)

Error raised when a pre-trained model provided with camel-tools cannot be loaded.

Labels

Below is a table mapping output labels to their respective city, country, and region dialects:

Label  City                    Country                 Region
-----  ----------------------  ----------------------  ----------------------
ALE    Aleppo                  Syria                   Levant
ALG    Algiers                 Algeria                 Maghreb
ALX    Alexandria              Egypt                   Nile Basin
AMM    Amman                   Jordan                  Levant
ASW    Aswan                   Egypt                   Nile Basin
BAG    Baghdad                 Iraq                    Iraq
BAS    Basra                   Iraq                    Iraq
BEI    Beirut                  Lebanon                 Levant
BEN    Benghazi                Libya                   Maghreb
CAI    Cairo                   Egypt                   Nile Basin
DAM    Damascus                Syria                   Levant
DOH    Doha                    Qatar                   Gulf
FES    Fes                     Morocco                 Maghreb
JED    Jeddah                  Saudi Arabia            Gulf
JER    Jerusalem               Palestine               Levant
KHA    Khartoum                Sudan                   Nile Basin
MOS    Mosul                   Iraq                    Iraq
MSA    Modern Standard Arabic  Modern Standard Arabic  Modern Standard Arabic
MUS    Muscat                  Oman                    Gulf
RAB    Rabat                   Morocco                 Maghreb
RIY    Riyadh                  Saudi Arabia            Gulf
SAL    Salt                    Jordan                  Levant
SAN    Sana’a                  Yemen                   Gulf of Aden
SFX    Sfax                    Tunisia                 Maghreb
TRI    Tripoli                 Libya                   Maghreb
TUN    Tunis                   Tunisia                 Maghreb
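To illustrate how the label table relates to the 'city', 'country' and 'region' output types, here is a rough sketch that rolls per-label scores up to a coarser level by summing them. The names `LABEL_INFO` and `aggregate_scores` are hypothetical, the scores are made up, and summing is only one plausible aggregation. It is not necessarily how `predict` computes its output.

```python
from collections import defaultdict

# Label -> (city, country, region), copied from the table above.
LABEL_INFO = {
    'ALE': ('Aleppo', 'Syria', 'Levant'),
    'ALG': ('Algiers', 'Algeria', 'Maghreb'),
    'ALX': ('Alexandria', 'Egypt', 'Nile Basin'),
    'AMM': ('Amman', 'Jordan', 'Levant'),
    'ASW': ('Aswan', 'Egypt', 'Nile Basin'),
    'BAG': ('Baghdad', 'Iraq', 'Iraq'),
    'BAS': ('Basra', 'Iraq', 'Iraq'),
    'BEI': ('Beirut', 'Lebanon', 'Levant'),
    'BEN': ('Benghazi', 'Libya', 'Maghreb'),
    'CAI': ('Cairo', 'Egypt', 'Nile Basin'),
    'DAM': ('Damascus', 'Syria', 'Levant'),
    'DOH': ('Doha', 'Qatar', 'Gulf'),
    'FES': ('Fes', 'Morocco', 'Maghreb'),
    'JED': ('Jeddah', 'Saudi Arabia', 'Gulf'),
    'JER': ('Jerusalem', 'Palestine', 'Levant'),
    'KHA': ('Khartoum', 'Sudan', 'Nile Basin'),
    'MOS': ('Mosul', 'Iraq', 'Iraq'),
    'MSA': ('Modern Standard Arabic',) * 3,
    'MUS': ('Muscat', 'Oman', 'Gulf'),
    'RAB': ('Rabat', 'Morocco', 'Maghreb'),
    'RIY': ('Riyadh', 'Saudi Arabia', 'Gulf'),
    'SAL': ('Salt', 'Jordan', 'Levant'),
    'SAN': ('Sana’a', 'Yemen', 'Gulf of Aden'),
    'SFX': ('Sfax', 'Tunisia', 'Maghreb'),
    'TRI': ('Tripoli', 'Libya', 'Maghreb'),
    'TUN': ('Tunis', 'Tunisia', 'Maghreb'),
}


def aggregate_scores(scores, level):
    """Sum per-label scores by 'city', 'country' or 'region' (hypothetical helper)."""
    index = {'city': 0, 'country': 1, 'region': 2}[level]
    totals = defaultdict(float)
    for label, score in scores.items():
        totals[LABEL_INFO[label][index]] += score
    return dict(totals)


# Made-up scores for illustration only (not real model output).
example_scores = {'CAI': 0.5, 'ALX': 0.25, 'KHA': 0.125, 'BEI': 0.125}
by_country = aggregate_scores(example_scores, 'country')
by_region = aggregate_scores(example_scores, 'region')
```

In this example Cairo and Alexandria both roll up to Egypt at the country level, and Khartoum joins them under Nile Basin at the region level.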

Examples

Below is an example of how to load and use the default pre-trained model.

from camel_tools.dialectid import DialectIdentifier

did = DialectIdentifier.pretrained()

sentences = [
    'مال الهوى و مالي شكون اللي جابني ليك  ما كنت انايا ف حالي بلاو قلبي يانا بيك',
    'بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم'
]

predictions = did.predict(sentences)

# Each prediction is a named tuple containing both the top prediction and the
# probability score of each dialect. To get only the top prediction, we can
# do the following:
top_dialects = [p.top for p in predictions]
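Beyond the single top label, the scores dictionary can be ranked to see the runner-up dialects. The sketch below assumes a stand-in named tuple with the documented `top` and `scores` fields, so that it runs without the pre-trained model. The helper `top_k` is hypothetical and the scores are made up.

```python
from collections import namedtuple

# Hypothetical stand-in with the same fields as DIDPred, so the sketch runs
# without loading the pre-trained model.
DIDPred = namedtuple('DIDPred', ['top', 'scores'])


def top_k(prediction, k=3):
    """Return the k highest-scoring (label, score) pairs, best first."""
    ranked = sorted(prediction.scores.items(),
                    key=lambda item: item[1],
                    reverse=True)
    return ranked[:k]


# Made-up scores for illustration only (not real model output).
example = DIDPred('BEI', {'BEI': 0.5, 'DAM': 0.25, 'AMM': 0.125, 'JER': 0.125})
best_two = top_k(example, k=2)
```

With a real model, the same helper could be applied to each element of the list returned by `predict`.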