tokenizers.bpe

Byte Pair Encoding Text Tokenization

v0.1.4 · Sep 5, 2025 · MPL-2.0

Description

Unsupervised text tokenizer focused on computational efficiency. Wraps the 'YouTokenToMe' library <https://github.com/VKCOM/YouTokenToMe>, a fast implementation of Byte Pair Encoding (BPE) <https://aclanthology.org/P16-1162/>.
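
To make the description concrete, here is a minimal sketch of the BPE idea in Python: repeatedly merge the most frequent adjacent symbol pair, then replay the learned merges to split new words into subwords. This is a toy illustration only. It is not the YouTokenToMe implementation or this package's API, and the names `learn_bpe`, `merge_pair` and `encode` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the lexicographically smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Apply the new merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["low", "low", "lower"], num_merges=2)
print(merges)
print(encode("lowest", merges))
```

A production tokenizer would add things this sketch omits, such as word-boundary markers, character coverage limits, special token ids and a far more efficient pair-counting strategy.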

Downloads

Last 30 days: 751
Rank: 4673rd
Last 90 days: 751
Last year: 751

CRAN Check Status

1 WARNING
2 NOTE
11 OK
Flavor Status
r-devel-linux-x86_64-debian-clang WARNING
r-devel-linux-x86_64-debian-gcc OK
r-devel-linux-x86_64-fedora-clang OK
r-devel-linux-x86_64-fedora-gcc OK
r-devel-macos-arm64 OK
r-devel-windows-x86_64 OK
r-oldrel-macos-arm64 NOTE
r-oldrel-macos-x86_64 NOTE
r-oldrel-windows-x86_64 OK
r-patched-linux-x86_64 OK
r-release-linux-x86_64 OK
r-release-macos-arm64 OK
r-release-macos-x86_64 OK
r-release-windows-x86_64 OK
Check details (3 non-OK)
WARNING r-devel-linux-x86_64-debian-clang

whether package can be installed

Found the following significant warnings:
  ./parallel_hashmap/phmap_base.h:1266:1: warning: 'is_always_equal' is deprecated: use 'std::allocator_traits::is_always_equal' instead [-Wdeprecated-declarations]
See ‘/home/hornik/tmp/R.check/r-devel-clang/Work/PKGS/tokenizers.bpe.Rcheck/00install.out’ for details.
* used C++ compiler: ‘Debian clang version 21.1.8 (3+b1)’
OK r-devel-linux-x86_64-debian-gcc
OK r-devel-linux-x86_64-fedora-clang
OK r-devel-linux-x86_64-fedora-gcc
OK r-devel-macos-arm64
OK r-devel-windows-x86_64
NOTE r-oldrel-macos-arm64

installed package size

  installed size is  6.1Mb
  sub-directories of 1Mb or more:
    libs   5.2Mb
NOTE r-oldrel-macos-x86_64

installed package size

  installed size is  6.2Mb
  sub-directories of 1Mb or more:
    libs   5.3Mb
OK r-oldrel-windows-x86_64
OK r-patched-linux-x86_64
OK r-release-linux-x86_64
OK r-release-macos-arm64
OK r-release-macos-x86_64
OK r-release-windows-x86_64

Check History

WARNING 11 OK · 2 NOTE · 1 WARNING · 0 ERROR · 0 FAILURE Mar 9, 2026

Reverse Dependencies (3)

Dependency Network

Dependencies: Rcpp
Reverse dependencies: doc2vec, sentencepiece, textrecipes

Version History

new 0.1.4 Mar 9, 2026