camel_tools.dialectid
Danger
Note: This component is not available on Windows.
This module contains the CAMeL Tools dialect identification component. The system distinguishes among 25 Arabic city dialects as well as Modern Standard Arabic. It is based on the system described by Salameh, Bouamor and Habash.
Classes
- class camel_tools.dialectid.DIDPred(top, scores)
A named tuple containing dialect ID prediction results.
- camel_tools.dialectid.DialectIdentifier
alias of DIDModel26
- class camel_tools.dialectid.DIDModel26(labels=None, labels_extra=None, char_lm_dir=None, word_lm_dir=None)
A class for training, evaluating and running the dialect identification model ‘Model-26’ described by Salameh et al. After initializing an instance, you must run the train method once before using it.
- Parameters:
labels (set of str, optional) – The set of dialect labels used in the training data in the main model. If None, the default labels are used. Defaults to None.
labels_extra (set of str, optional) – The set of dialect labels used in the training data in the extra features model. If None, the default labels are used. Defaults to None.
char_lm_dir (str, optional) – Path to the directory containing the character-based language models. If None, use the language models that come with this package. Defaults to None.
word_lm_dir (str, optional) – Path to the directory containing the word-based language models. If None, use the language models that come with this package. Defaults to None.
- predict(sentences, output='label')
Predict the dialect probability scores for a given list of sentences.
- static pretrained()
Load the default pre-trained model provided with camel-tools.
- Raises:
PretrainedModelError – When a pre-trained model compatible with the current Python version isn’t available.
- Returns:
The loaded model.
- Return type:
- class camel_tools.dialectid.DIDModel6(labels=None, char_lm_dir=None, word_lm_dir=None)
A class for training, evaluating and running the dialect identification model ‘Model-6’ described by Salameh et al. After initializing an instance, you must run the train method once before using it.
- Parameters:
labels (set of str, optional) – The set of dialect labels used in the training data in the main model. If None, the default labels are used. Defaults to None.
char_lm_dir (str, optional) – Path to the directory containing the character-based language models. If None, use the language models that come with this package. Defaults to None.
word_lm_dir (str, optional) – Path to the directory containing the word-based language models. If None, use the language models that come with this package. Defaults to None.
- predict(sentences, output='label')
Predict the dialect probability scores for a given list of sentences.
- static pretrained()
Load the default pre-trained model provided with camel-tools.
- Raises:
PretrainedModelError – When a pre-trained model compatible with the current Python version isn’t available.
- Returns:
The loaded model.
- Return type:
- class camel_tools.dialectid.DialectIdError(msg)
Base class for all CAMeL Dialect ID errors.
- class camel_tools.dialectid.UntrainedModelError(msg)
Error thrown when attempting to use an untrained DialectIdentifier instance.
- class camel_tools.dialectid.InvalidDataSetError(dataset)
Error thrown when an invalid data set name is given to eval.
- class camel_tools.dialectid.PretrainedModelError(msg)
Error thrown when attempting to load a pretrained model provided with camel-tools.
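The error classes above can be used to guard model loading. The following is a minimal sketch, not the library's recommended pattern: `load_dialect_identifier` is a hypothetical helper name, the platform check is an assumption based on the Windows note at the top of this page, and the import is done lazily so the function can be defined even where camel-tools is not installed.

```python
import platform


def load_dialect_identifier():
    """Return a pre-trained DialectIdentifier, or None if it cannot be loaded.

    Hypothetical helper for illustration; not part of camel-tools.
    """
    # This component is documented as unavailable on Windows, so stop early.
    if platform.system() == 'Windows':
        return None

    try:
        # Lazy import: the sketch stays importable without camel-tools.
        from camel_tools.dialectid import DialectIdentifier, DialectIdError
    except ImportError:
        return None

    try:
        return DialectIdentifier.pretrained()
    except DialectIdError as err:
        # DialectIdError is documented as the base class for all dialect ID
        # errors, so this also catches PretrainedModelError.
        print(f'Could not load the pre-trained model: {err}')
        return None
```

Catching the base class keeps the handler short; catch `PretrainedModelError` directly if other dialect ID errors should propagate.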
Labels
Below is a table mapping output labels to their respective city, country, and region dialects:
| Label | City | Country | Region |
|---|---|---|---|
| ALE | Aleppo | Syria | Levant |
| ALG | Algiers | Algeria | Maghreb |
| ALX | Alexandria | Egypt | Nile Basin |
| AMM | Amman | Jordan | Levant |
| ASW | Aswan | Egypt | Nile Basin |
| BAG | Baghdad | Iraq | Iraq |
| BAS | Basra | Iraq | Iraq |
| BEI | Beirut | Lebanon | Levant |
| BEN | Benghazi | Libya | Maghreb |
| CAI | Cairo | Egypt | Nile Basin |
| DAM | Damascus | Syria | Levant |
| DOH | Doha | Qatar | Gulf |
| FES | Fes | Morocco | Maghreb |
| JED | Jeddah | Saudi Arabia | Gulf |
| JER | Jerusalem | Palestine | Levant |
| KHA | Khartoum | Sudan | Nile Basin |
| MOS | Mosul | Iraq | Iraq |
| MSA | Modern Standard Arabic | Modern Standard Arabic | Modern Standard Arabic |
| MUS | Muscat | Oman | Gulf |
| RAB | Rabat | Morocco | Maghreb |
| RIY | Riyadh | Saudi Arabia | Gulf |
| SAL | Salt | Jordan | Levant |
| SAN | Sana’a | Yemen | Gulf of Aden |
| SFX | Sfax | Tunisia | Maghreb |
| TRI | Tripoli | Libya | Maghreb |
| TUN | Tunis | Tunisia | Maghreb |
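For downstream code it can be convenient to have the label table above as a lookup structure. The sketch below transcribes the table into a plain dictionary. `DIALECT_INFO`, `region_of` and `labels_in_region` are hypothetical names introduced here for illustration and are not part of camel-tools.

```python
# Label -> (city, country, region), transcribed from the table above.
DIALECT_INFO = {
    'ALE': ('Aleppo', 'Syria', 'Levant'),
    'ALG': ('Algiers', 'Algeria', 'Maghreb'),
    'ALX': ('Alexandria', 'Egypt', 'Nile Basin'),
    'AMM': ('Amman', 'Jordan', 'Levant'),
    'ASW': ('Aswan', 'Egypt', 'Nile Basin'),
    'BAG': ('Baghdad', 'Iraq', 'Iraq'),
    'BAS': ('Basra', 'Iraq', 'Iraq'),
    'BEI': ('Beirut', 'Lebanon', 'Levant'),
    'BEN': ('Benghazi', 'Libya', 'Maghreb'),
    'CAI': ('Cairo', 'Egypt', 'Nile Basin'),
    'DAM': ('Damascus', 'Syria', 'Levant'),
    'DOH': ('Doha', 'Qatar', 'Gulf'),
    'FES': ('Fes', 'Morocco', 'Maghreb'),
    'JED': ('Jeddah', 'Saudi Arabia', 'Gulf'),
    'JER': ('Jerusalem', 'Palestine', 'Levant'),
    'KHA': ('Khartoum', 'Sudan', 'Nile Basin'),
    'MOS': ('Mosul', 'Iraq', 'Iraq'),
    'MSA': ('Modern Standard Arabic',) * 3,
    'MUS': ('Muscat', 'Oman', 'Gulf'),
    'RAB': ('Rabat', 'Morocco', 'Maghreb'),
    'RIY': ('Riyadh', 'Saudi Arabia', 'Gulf'),
    'SAL': ('Salt', 'Jordan', 'Levant'),
    'SAN': ('Sana’a', 'Yemen', 'Gulf of Aden'),
    'SFX': ('Sfax', 'Tunisia', 'Maghreb'),
    'TRI': ('Tripoli', 'Libya', 'Maghreb'),
    'TUN': ('Tunis', 'Tunisia', 'Maghreb'),
}


def region_of(label):
    """Return the region for a dialect label, e.g. 'CAI' -> 'Nile Basin'."""
    return DIALECT_INFO[label][2]


def labels_in_region(region):
    """Return all labels belonging to a region, sorted alphabetically."""
    return sorted(lbl for lbl, info in DIALECT_INFO.items() if info[2] == region)
```

With this mapping, a list of top labels returned by the model can be collapsed to regions with `[region_of(label) for label in top_dialects]`.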
Examples
Below is an example of how to load and use the default pre-trained model.
from camel_tools.dialectid import DialectIdentifier
did = DialectIdentifier.pretrained()
sentences = [
    'مال الهوى و مالي شكون اللي جابني ليك ما كنت انايا ف حالي بلاو قلبي يانا بيك',
    'بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم'
]
predictions = did.predict(sentences)
# Each prediction is a DIDPred named tuple containing both the top prediction
# (top) and the score of each dialect (scores). To get only the top
# prediction, we can do the following:
top_dialects = [p.top for p in predictions]
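Beyond the single top label, the per-dialect scores can be ranked to inspect the runner-up dialects. The sketch below assumes that `scores` is a dictionary mapping each label to its score. That is an assumption, not something stated on this page. To stay runnable without camel-tools, it uses a stand-in named tuple with the same two fields as `DIDPred`, and the scores shown are made up for illustration. `top_k` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for camel_tools.dialectid.DIDPred so the sketch runs on its own.
DIDPred = namedtuple('DIDPred', ['top', 'scores'])


def top_k(prediction, k=3):
    """Return the k highest-scoring (label, score) pairs, best first.

    Assumes prediction.scores is a dict of label -> score.
    """
    ranked = sorted(prediction.scores.items(),
                    key=lambda item: item[1],
                    reverse=True)
    return ranked[:k]


# Made-up scores for illustration only; this is not real model output.
example = DIDPred(top='RAB',
                  scores={'RAB': 0.6, 'FES': 0.25, 'ALG': 0.1, 'MSA': 0.05})

print(top_k(example, k=2))  # [('RAB', 0.6), ('FES', 0.25)]
```

With real predictions, the same helper could be applied as `[top_k(p) for p in predictions]`, provided the assumption about `scores` holds.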