The GitHub repo author suggested a few things you can try to speed up the tokenizer (you can also find this information here):
1. Remove redundant whitespace tokenization in BasicTokenizer.
2. Convert basic tokenized tokens to UTF32 in one call in FullTokenizer, and modify WordPieceTokenizer to accept UTF32 as input.
3. Only call sub.string() once in WordPieceTokenizer.
4. Remove input validation in WhitespaceTokenizer, which may be called many times.
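To make the substring suggestion in item 3 more concrete, here is a minimal Python sketch of a greedy longest-match WordPiece loop that slices each candidate substring only once per iteration and reuses it for the vocabulary lookup. This is not the repo's code: the function name `wordpiece_tokenize`, the `max_chars` limit, and the toy vocabulary are all assumptions made for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=200):
    """Greedy longest-match-first WordPiece over one whitespace-free word.

    Hypothetical sketch: `vocab` is any set-like container of pieces,
    where non-initial pieces carry the usual "##" prefix.
    """
    if len(word) > max_chars:
        return [unk_token]

    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            # Slice the candidate once per iteration and reuse it below,
            # instead of re-extracting the same substring several times.
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, made up for this example.
VOCAB = frozenset({"un", "##aff", "##able", "token", "##izer"})
```

With this toy vocabulary, `wordpiece_tokenize("unaffable", VOCAB)` returns `["un", "##aff", "##able"]`. The same idea applies in other languages: extract the substring into a local variable once, then use that variable for every check in the loop body.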
If the problem persists, you could also open a new issue on the GitHub repo itself.