Byte pair encoding (BPE) is a tokenization algorithm that allows transformer networks to
handle a wide range of vocabulary without assigning individual tokens for every possible
word. During tokenization, the algorithm replaces out-of-vocabulary (OOV) words with subword
counterparts, which allows models to handle unseen words more effectively. This process
creates a set of subword tokens that can better represent common and rare words.
These steps outline the algorithm for training a BPE tokenizer:
Start with a corpus of text. For example, a corpus that includes phrases like
"use byte pair encoding to tokenize text"
. Split the text
data into words using a specified pretokenization algorithm.
Initialize a vocabulary of bytes. For example, start with a vocabulary of
["a" "b" "c" ... "z"]
. For non-ASCII characters like
emojis that comprise multiple bytes, start with the byte values that make up the
character.
Encode each word in the text data as a sequence of bytes, and represent the
words as sequences of integers that index into the vocabulary. For example,
represent the word "use"
as [21 19 5]
.
When the encoding of a character is more than one byte, the resulting sequence
of bytes can have more elements than the number of characters in the
word.
Count the frequency of all adjacent pairs of bytes in the corpus. For example,
among the words ["use" "byte" "pair" "encoding" "to" "tokenize"
"text"]
, the token pairs ["t" "e"]
,
["e" "n"]
, and ["t" "o"]
appear twice,
and the remaining pairs appear once.
Identify the most frequent pair and add the corresponding merged token to the
vocabulary. In the words represented as sequences of vocabulary indices, replace
the corresponding pairs with the index of the new merged token in the
vocabulary. Then, add this token pair to the merge list. For example, append the
token pair ["t" "e"]
to the merge list. Then, add the
corresponding merged token "te"
to the vocabulary so that it
has the index 27
. Then, in the text data represented as
vocabulary indices, replace the pairs of vocabulary indices [20
5]
(which corresponds to ["t" "e"]
with the
corresponding new vocabulary index:
Repeat the frequency count and merge operations until you reach a specified
number of iterations or vocabulary size. For example, repeating these steps
several times leads to merging the pair ["b" "y"]
to make the
token "by"
, and then subsequently the pair ["by"
"te"]
to make the token "byte"
.
These steps outline how a BPE tokenizer tokenizes new text:
Pretokenization — Split text into individual words.
Byte-encoding — Encode each word into sequences of bytes
Merge — By starting at the top of the merge list and progressing through it,
iteratively apply each merge to pairs of tokens when possible.