camel_tools.tokenizers.word

This module contains utilities for word-boundary tokenization.

Functions

camel_tools.tokenizers.word.simple_word_tokenize(sentence, split_digits=False)

Tokenizes a sentence by splitting on whitespace and separating punctuation. The resulting tokens are either alpha-numeric words, single punctuation/symbol/emoji characters, or multi-character emoji sequences. This function is language agnostic and splits on all characters marked as punctuation or symbols in the Unicode specification. For example, tokenizing 'Hello,    world!!!' would yield ['Hello', ',', 'world', '!', '!', '!']. If split_digits is set to True, digit sequences are also split into separate tokens. For example, tokenizing 'Hello,    world123!!!' would yield ['Hello', ',', 'world', '123', '!', '!', '!'].

Parameters:
  • sentence (str) – Sentence to tokenize.
  • split_digits (bool, optional) – If True, also split digit sequences into separate tokens. Defaults to False.
Returns:

The list of tokens.

Return type:

list of str
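
The splitting behavior described above can be approximated with a short regex-based sketch. This is an illustration only, not the library's actual implementation: in particular, it does not reproduce the real function's handling of multi-character emoji sequences, and it treats the underscore as a word character rather than as punctuation.

```python
import re

def sketch_word_tokenize(sentence, split_digits=False):
    # Illustrative approximation of simple_word_tokenize, not the
    # library's implementation.
    if split_digits:
        # Match digit runs, letter runs, or any other single
        # non-whitespace character.
        pattern = r'\d+|[^\W\d_]+|\S'
    else:
        # Match alphanumeric runs or any other single
        # non-whitespace character.
        pattern = r'\w+|\S'
    return re.findall(pattern, sentence)

print(sketch_word_tokenize('Hello,    world!!!'))
# ['Hello', ',', 'world', '!', '!', '!']
print(sketch_word_tokenize('Hello,    world123!!!', split_digits=True))
# ['Hello', ',', 'world', '123', '!', '!', '!']
```

In practice, the library function itself would be imported and called the same way: `simple_word_tokenize('Hello,    world!!!')`.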