sentencepiece
Text Tokenization using Byte Pair Encoding and Unigram Modelling
Description
Unsupervised text tokenizer that performs byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer that splits text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.
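As a rough illustration of byte pair encoding, one of the two techniques named in the description, the following Python sketch learns merge rules on a toy corpus and applies them to an unseen word. It is not the 'sentencepiece' library's implementation and not this package's API: the function names `learn_bpe`, `merge_pair` and `apply_bpe`, the toy corpus and the merge count are all assumptions made for illustration. The sketch also works on pre-split words, whereas SentencePiece operates on raw text.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Start with every word split into single characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenize `word` by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (an assumption for illustration, not data from the package).
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=4)
print(apply_bpe("lowest", merges))  # ['low', 'est']
```

The word "lowest" never appears in the toy corpus, yet it is split into two subword units that do, which is the property that makes subword tokenizers robust to unseen words.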
Downloads
| Period | Downloads | Rank |
|---|---|---|
| Last 30 days | 521 | 7158th |
| Last 90 days | 521 | |
| Last year | 521 | |
CRAN Check Status
| Flavor | Status |
|---|---|
| r-devel-linux-x86_64-debian-clang | OK |
| r-devel-linux-x86_64-debian-gcc | OK |
| r-devel-linux-x86_64-fedora-clang | OK |
| r-devel-linux-x86_64-fedora-gcc | OK |
| r-devel-macos-arm64 | OK |
| r-devel-windows-x86_64 | OK |
| r-oldrel-macos-arm64 | NOTE |
| r-oldrel-macos-x86_64 | NOTE |
| r-oldrel-windows-x86_64 | NOTE |
| r-patched-linux-x86_64 | OK |
| r-release-linux-x86_64 | OK |
| r-release-macos-arm64 | OK |
| r-release-macos-x86_64 | OK |
| r-release-windows-x86_64 | OK |
Check details (3 non-OK of 14 flavors)
| Flavor | Check | Installed size | libs | models |
|---|---|---|---|---|
| r-oldrel-macos-arm64 | installed package size | 21.6Mb | 19.9Mb | 1.6Mb |
| r-oldrel-macos-x86_64 | installed package size | 23.1Mb | 21.3Mb | 1.6Mb |
| r-oldrel-windows-x86_64 | installed package size | 5.0Mb | 3.3Mb | 1.6Mb |

The libs and models columns list the sub-directories of 1Mb or more.
Check History
Mar 9, 2026: NOTE (11 OK · 3 NOTE · 0 WARNING · 0 ERROR · 0 FAILURE)