This module contains utilities for word-boundary tokenization.



Tokenizes a sentence by splitting on whitespace and separating punctuation. The resulting tokens are either alphanumeric words or single punctuation/symbol characters. This function is language-agnostic and splits on all characters marked as punctuation or symbols in the Unicode specification. For example, tokenizing 'Hello,    world!!!' yields ['Hello', ',', 'world', '!', '!', '!'].

Parameters: sentence (str) – Sentence to tokenize.
Returns: The list of tokens.
Return type: list of str
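
The behavior described above can be sketched as follows. This is a minimal illustration using Unicode general categories (`P*` for punctuation, `S*` for symbols), not the module's actual implementation:

```python
import unicodedata

def tokenize(sentence):
    """Split on whitespace; emit punctuation/symbol characters as single tokens."""
    tokens = []
    word = []
    for ch in sentence:
        if ch.isspace():
            # Whitespace ends the current word but produces no token itself.
            if word:
                tokens.append(''.join(word))
                word = []
        elif unicodedata.category(ch)[0] in ('P', 'S'):
            # Unicode punctuation (P*) and symbol (S*) categories each
            # become their own single-character token.
            if word:
                tokens.append(''.join(word))
                word = []
            tokens.append(ch)
        else:
            word.append(ch)
    if word:
        tokens.append(''.join(word))
    return tokens

print(tokenize('Hello,    world!!!'))
# ['Hello', ',', 'world', '!', '!', '!']
```

Note that consecutive punctuation marks such as '!!!' are emitted as three separate '!' tokens, matching the example in the description.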