camel_tools.tokenizers.word¶
This module contains utilities for word-boundary tokenization.
Functions¶
camel_tools.tokenizers.word.simple_word_tokenize(sentence, split_digits=False)¶

Tokenizes a sentence by splitting on whitespace and separating punctuation. The resulting tokens are either alphanumeric words, single punctuation/symbol/emoji characters, or multi-character emoji sequences. This function is language agnostic and splits on all characters marked as punctuation or symbols in the Unicode specification. For example, tokenizing 'Hello, world!!!' would yield ['Hello', ',', 'world', '!', '!', '!']. If split_digits is set to True, it also splits on digit sequences. For example, tokenizing 'Hello, world123!!!' would yield ['Hello', ',', 'world', '123', '!', '!', '!'].

Parameters:
- sentence (str) – Sentence to tokenize.
- split_digits (bool, optional) – If True, digit sequences are split into separate tokens. Defaults to False.

Returns: The list of tokens.

Return type: list of str
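The splitting behavior described above can be sketched with a small language-agnostic tokenizer built on Python's standard unicodedata module. This is an illustrative approximation, not the library's actual implementation, and it simplifies one point: it treats every symbol character individually, so it does not keep multi-character emoji sequences together.

```python
import unicodedata

def sketch_word_tokenize(sentence, split_digits=False):
    """Illustrative sketch of whitespace + punctuation tokenization.

    Splits on whitespace, emits each Unicode punctuation/symbol
    character as its own token, and groups alphanumeric runs together.
    (Approximation only; not the camel_tools implementation.)
    """
    tokens = []
    for chunk in sentence.split():
        current = ''
        for ch in chunk:
            cat = unicodedata.category(ch)
            if cat[0] in ('P', 'S'):
                # Punctuation/symbol: flush the current run, then emit
                # the character as a single token.
                if current:
                    tokens.append(current)
                    current = ''
                tokens.append(ch)
            elif split_digits and cat[0] == 'N':
                # Digit while split_digits is on: separate it from a
                # preceding letter run.
                if current and not current[-1].isdigit():
                    tokens.append(current)
                    current = ''
                current += ch
            else:
                # Letter while split_digits is on: separate it from a
                # preceding digit run.
                if split_digits and current and current[-1].isdigit():
                    tokens.append(current)
                    current = ''
                current += ch
        if current:
            tokens.append(current)
    return tokens

print(sketch_word_tokenize('Hello, world!!!'))
# ['Hello', ',', 'world', '!', '!', '!']
print(sketch_word_tokenize('Hello, world123!!!', split_digits=True))
# ['Hello', ',', 'world', '123', '!', '!', '!']
```

Both printed results match the examples given in the description above; with split_digits left at its default of False, 'world123' would remain a single token.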