camel_word_tokenize

About

The camel_word_tokenize tool splits words from punctuation while collapsing contiguous segments of spaces into a single whitespace character. It is also language agnostic and splits all characters marked as punctuation or symbols in the Unicode specification.

For example the following sentence:

Hello,     world!!!!
مرحبا يا عالم!!!

becomes:

Hello , world ! ! ! !
مرحبا يا عالم ! ! !

Usage

Below is the usage information that can be generated by running camel_word_tokenize --help.

Usage:
    camel_word_tokenize [-o OUTPUT | --output=OUTPUT] [FILE]
    camel_word_tokenize (-v | --version)
    camel_word_tokenize (-h | --help)

Options:
  -o OUTPUT --output=OUTPUT
        Output file. If not specified, output will be printed to stdout.
  -h --help
        Show this screen.
  -v --version
        Show version.