
tokenizers

Fast, Consistent Tokenization of Natural Language Text

v0.3.0 · Dec 22, 2022 · MIT + file LICENSE

Description

Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
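A minimal usage sketch of the consistent interface described above, assuming the package is installed from CRAN. The sample sentence and variable names are illustrative only. Each tokenizer takes a character vector and returns a list with one element per input document.

```r
# Assumes the 'tokenizers' package is installed: install.packages("tokenizers")
library(tokenizers)

# Illustrative input; any character vector works
text <- "Hello, world! How are you today?"

# Word tokens: lowercased, punctuation stripped by default
words <- tokenize_words(text)

# Shingled n-grams of exactly two words (n_min defaults to n)
bigrams <- tokenize_ngrams(text, n = 2)

# Sentence tokens
sentences <- tokenize_sentences(text)

# Counting helper: one count per input document
n_words <- count_words(text)

print(words[[1]])
print(bigrams[[1]])
print(sentences[[1]])
print(n_words)
```

Because every tokenizer returns a list of the same shape, the results can be passed to `lapply()` or `vapply()` in the same way whichever tokenizer produced them.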

Downloads

37.3K · Last 30 days

575th

37.3K · Last 90 days

37.3K · Last year

CRAN Check Status

2 NOTE
12 OK
Flavor Status
r-devel-linux-x86_64-debian-clang NOTE
r-devel-linux-x86_64-debian-gcc NOTE
r-devel-linux-x86_64-fedora-clang OK
r-devel-linux-x86_64-fedora-gcc OK
r-devel-macos-arm64 OK
r-devel-windows-x86_64 OK
r-oldrel-macos-arm64 OK
r-oldrel-macos-x86_64 OK
r-oldrel-windows-x86_64 OK
r-patched-linux-x86_64 OK
r-release-linux-x86_64 OK
r-release-macos-arm64 OK
r-release-macos-x86_64 OK
r-release-windows-x86_64 OK
Check details (2 non-OK)
NOTE r-devel-linux-x86_64-debian-clang

CRAN incoming feasibility

Maintainer: ‘Lincoln Mullen <lincoln@lincolnmullen.com>’

Package CITATION file contains call(s) to old-style personList() or
as.personList().  Please use c() on person objects instead.
Package CITATION file contains call(s) to old-style citEntry().  Please
use bibentry() instead.
NOTE r-devel-linux-x86_64-debian-gcc

CRAN incoming feasibility

Maintainer: ‘Lincoln Mullen <lincoln@lincolnmullen.com>’

Package CITATION file contains call(s) to old-style personList() or
as.personList().  Please use c() on person objects instead.
Package CITATION file contains call(s) to old-style citEntry().  Please
use bibentry() instead.

Check History

Mar 9, 2026: NOTE (12 OK · 2 NOTE · 0 WARNING · 0 ERROR · 0 FAILURE)

Reverse Dependencies (15)

Dependency Network

Dependencies: stringi, Rcpp, SnowballC

Reverse dependencies: DramaAnalysis, WhatsR, blocking, covfefe, deeplr, pdfsearch, proustr, rslp, textrecipes, tidypmc, tidytext, wactor, edgarWebR, sumup, torchdatasets

Version History

new 0.3.0 Mar 9, 2026