camel_tools.tokenizers.word
This module contains utilities for word-boundary tokenization.
Functions
camel_tools.tokenizers.word.simple_word_tokenize(sentence)

Tokenizes a sentence by splitting on whitespace and separating punctuation. The resulting tokens are either alphanumeric words or single punctuation/symbol characters. This function is language-agnostic and splits off every character marked as punctuation or as a symbol in the Unicode specification. For example, tokenizing 'Hello, world!!!' would yield ['Hello', ',', 'world', '!', '!', '!'].

Parameters: sentence (str) – Sentence to tokenize.
Returns: The list of tokens.
Return type: list of str
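
The behavior described above can be approximated with the standard library alone. The sketch below is not the camel_tools implementation. It is a hypothetical stand-in (the name sketch_word_tokenize is invented for illustration) and assumes that "punctuation or symbols in the Unicode specification" means characters whose Unicode general category starts with P or S.

```python
import unicodedata


def sketch_word_tokenize(sentence):
    """Approximate the documented behavior; NOT the library's own code.

    Assumption: a character counts as punctuation or a symbol when its
    Unicode general category starts with 'P' or 'S'.
    """
    tokens = []
    current = []  # characters of the word being accumulated

    for ch in sentence:
        if ch.isspace():
            # Whitespace ends the current word and is itself discarded.
            if current:
                tokens.append(''.join(current))
                current = []
        elif unicodedata.category(ch)[0] in ('P', 'S'):
            # Punctuation/symbol: end the current word, then emit the
            # character as its own single-character token.
            if current:
                tokens.append(''.join(current))
                current = []
            tokens.append(ch)
        else:
            current.append(ch)

    # Flush a trailing word, if any.
    if current:
        tokens.append(''.join(current))
    return tokens


print(sketch_word_tokenize('Hello, world!!!'))
```

With camel_tools installed, the documented function itself would presumably be imported from camel_tools.tokenizers.word and called the same way, with a single str argument.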