tokenizers
Fast, Consistent Tokenization of Natural Language Text
Description
Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
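As an illustration of the consistent interface described above, here is a minimal sketch in R. It assumes the package is installed from CRAN and uses the exported functions `tokenize_words()`, `tokenize_ngrams()`, `tokenize_sentences()`, and `count_words()`. The sample text is made up for the example.

```r
# Minimal sketch; assumes install.packages("tokenizers") has been run.
library(tokenizers)

# Made-up sample text, used only for illustration.
text <- "The quick brown fox jumps. It lands softly."

# Each tokenizer takes a character vector and returns a list
# with one element per input document.
words     <- tokenize_words(text)          # lowercased, punctuation stripped by default
bigrams   <- tokenize_ngrams(text, n = 2)  # shingled (overlapping) two-word n-grams
sentences <- tokenize_sentences(text)

# The counting helper returns one count per input document.
n_words <- count_words(text)

print(words[[1]])
print(bigrams[[1]])
print(sentences[[1]])
print(n_words)
```

Because every tokenizer returns a list with one element per document, the same downstream code can usually handle words, n-grams, or sentences without special cases.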
Downloads
| Period | Downloads |
|---|---|
| Last 30 days | 37.3K |
| Last 90 days | 37.3K |
| Last year | 37.3K |

Rank: 575th
CRAN Check Status
| Flavor | Status |
|---|---|
| r-devel-linux-x86_64-debian-clang | NOTE |
| r-devel-linux-x86_64-debian-gcc | NOTE |
| r-devel-linux-x86_64-fedora-clang | OK |
| r-devel-linux-x86_64-fedora-gcc | OK |
| r-devel-macos-arm64 | OK |
| r-devel-windows-x86_64 | OK |
| r-oldrel-macos-arm64 | OK |
| r-oldrel-macos-x86_64 | OK |
| r-oldrel-windows-x86_64 | OK |
| r-patched-linux-x86_64 | OK |
| r-release-linux-x86_64 | OK |
| r-release-macos-arm64 | OK |
| r-release-macos-x86_64 | OK |
| r-release-windows-x86_64 | OK |
Check details (2 non-OK)
CRAN incoming feasibility (NOTE, reported for both r-devel-linux-x86_64-debian-clang and r-devel-linux-x86_64-debian-gcc)
Maintainer: ‘Lincoln Mullen <lincoln@lincolnmullen.com>’
Package CITATION file contains call(s) to old-style personList() or as.personList(). Please use c() on person objects instead.
Package CITATION file contains call(s) to old-style citEntry(). Please use bibentry() instead.
Check History
Mar 9, 2026: NOTE (12 OK · 2 NOTE · 0 WARNING · 0 ERROR · 0 FAILURE)