Tokens for LLMs: Byte Pair Encoding in Go

A basic unit of currency in modern LLMs is the token; exciting new models
have long context windows of millions of tokens. API pricing for the large
providers is per-token. We’re even seeing the invention of new, derived units
like TPM (tokens per minute).

But what are tokens?

This OpenAI help article
tells us that tokens are pieces of words, and gives some useful rules of thumb
like a token being equivalent to approximately 4 characters or 3/4 of a word
for the English language.

In this post I want to review the most commonly used algorithm for splitting
text into tokens, provide a complete implementation in Go, and show a playground
for experimenting with it. While my implementation isn’t tuned for speed, it
aims to be complete, readable and compatible with OpenAI’s tiktoken library, generating identical results and
working with the same vocabulary files.

Byte pair encoding – introduction

Byte pair encoding (BPE) is
an algorithm originally designed for data compression. A 2016 paper suggested re-purposing it for “word
segmentation” for machine learning tasks. The colloquial term for word
segmentation is tokenization.

  • Input: arbitrary text with words, numbers, whitespace and punctuation.
  • Output: a list of tokens representing the same text. Each token is an
    integer identifier which can be looked up in a vocabulary to reproduce the
    input text (a minimal API sketch in Go follows this list).
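
To make this contract concrete, here's a minimal sketch of what such a
tokenizer's API could look like in Go. It's an illustration only – the names
and shapes here aren't the ones used in my sample repository:

// Tokenizer converts text to token IDs and back. This is an illustrative
// sketch; the sample repository organizes things differently.
type Tokenizer interface {
	// Encode splits text into words, applies the learned vocabulary and
	// returns the resulting token IDs.
	Encode(text string) []int

	// Decode looks up each ID in the vocabulary and concatenates the
	// corresponding byte sequences to reproduce the original text.
	Decode(ids []int) string
}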

The BPE algorithm has an important pre-processing step: splitting the input text
into words. The splitting is customizable, and different models / vocabularies
use different regexps for it (more on this later). The main idea is some sort
of whitespace-based splitting (though whitespace itself is preserved), because
we typically don't want tokens that span across word boundaries.

We’ll be using this line from a catchy 1990s song as an example:

i'm blue dabadee dabadam

A word splitter will produce something like the following list, where spaces are
replaced by underscores _ for the sake of presentation (they remain plain
spaces in the actual implementation of the algorithm and its trained
vocabulary); a simplified Go sketch of such a splitter appears after the notes
below:

i
'm
_blue
_dabadee
_dabadam

A few things to note:

  • The contraction 'm is split from i – this is common for English language
    splitters, which want things like 'm, 'll, 're as separate words.
  • Whitespace is preserved and attached at the start of a word. Whitespace is
    important because tokens at the beginning of words sometimes have different
    semantic meaning from tokens not at the beginning of words.
    The choice of where it’s attached is arbitrary. From this point on, whitespace
    bytes are considered like any other bytes in the BPE algorithm.
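
To get a feel for the splitting step, here's a minimal Go sketch using the
standard regexp package. The pattern here is a toy approximation made up just
for this example – the real pattern used by OpenAI's models is shown later in
this post:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Toy splitter: common contractions become separate words, and an
	// optional leading space stays attached to each word. This is a
	// simplification; the real cl100k_base pattern appears later.
	splitter := regexp.MustCompile(`'m|'re|'ll| ?[a-zA-Z]+`)

	words := splitter.FindAllString("i'm blue dabadee dabadam", -1)
	fmt.Printf("%q\n", words)
	// Prints: ["i" "'m" " blue" " dabadee" " dabadam"]
}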

Now is a good time for some terminology we’ll be using while talking about BPE:

  • Word: produced by the splitter in pre-processing, like the list shown above.
  • Token: typically a sub-word sequence of bytes; the output of the tokenizer
    is a list of tokens, by ID.
  • Token ID: unique numerical identifier for a token.
  • Vocabulary: a mapping of token IDs to token values, learned by the tokenizer
    during the training process.
  • Training: the process in which BPE learns a vocabulary from a corpus of
    text.
  • Splitter regexp: regular expression used to split text into words during
    pre-processing. Given an algorithm (in this case BPE), the pair vocabulary +
    splitter regexp unambiguously defines how a given text will be tokenized.
  • Encoder: given a vocabulary and a splitter regexp, tokenizes any text into
    a list of IDs from the vocabulary.
  • Decoder: given a list of IDs and the vocabulary, reconstructs the
    original text.

Training

BPE training proceeds by first assuming each byte is its own token, and then
successively merging pairs of tokens into longer tokens and adding these to
the vocabulary, until the desired vocabulary size is achieved.

Let’s reuse our example, starting with these words:

i
'm
_blue
_dabadee
_dabadam

The BPE process starts by creating a token for each byte in the inclusive
range [0..255]. So the minimal vocabulary size is 256; this guarantees that
from the very start, there’s a valid encoded representation of any text.

Then, the following process is repeated:

  • Count how many times each ordered pair of tokens appears in the input.
    Ordered pair here means two tokens right next to each other; at the start
    every token is a single byte, so in our example some such pairs are “bl”,
    “da”, “de”, “ee” etc.
  • Find the pair with the highest count and create a new token from it (assign
    a new token ID, mapping it to the concatenation of the most common pair).
  • Replace this most common pair with the combined token in the input set.
    A Go sketch of one such iteration follows this list.
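
Here is that sketch. It's illustrative only: tokens are kept as strings for
readability, ties are broken arbitrarily, and the real implementation in the
sample repository differs in its details:

package main

import "fmt"

// mergeStep performs one BPE training iteration on words, where each word is
// a list of current tokens (strings, for presentation). It counts adjacent
// pairs, merges the most frequent pair everywhere, and returns the merged
// token (ok=false when no pairs remain).
func mergeStep(words [][]string) (merged string, ok bool) {
	type pair struct{ first, second string }
	counts := make(map[pair]int)
	for _, w := range words {
		for i := 0; i+1 < len(w); i++ {
			counts[pair{w[i], w[i+1]}]++
		}
	}
	if len(counts) == 0 {
		return "", false
	}

	// Find the most frequent pair (ties broken arbitrarily by map order).
	var best pair
	bestCount := 0
	for p, c := range counts {
		if c > bestCount {
			best, bestCount = p, c
		}
	}

	// Replace every occurrence of the best pair with the combined token.
	merged = best.first + best.second
	for wi, w := range words {
		var out []string
		for i := 0; i < len(w); i++ {
			if i+1 < len(w) && w[i] == best.first && w[i+1] == best.second {
				out = append(out, merged)
				i++ // skip the second element of the merged pair
			} else {
				out = append(out, w[i])
			}
		}
		words[wi] = out
	}
	return merged, true
}

func main() {
	words := [][]string{
		{"i"},
		{"'", "m"},
		{" ", "b", "l", "u", "e"},
		{" ", "d", "a", "b", "a", "d", "e", "e"},
		{" ", "d", "a", "b", "a", "d", "a", "m"},
	}
	if tok, ok := mergeStep(words); ok {
		fmt.Println("new token:", tok) // "da" for this input
		fmt.Println(words)
	}
}

Running mergeStep repeatedly on this working list reproduces the merges shown
below: “da” first, then one of the count-2 pairs.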

In our example, we start by splitting input words to bytes, so it’s a list of
single-byte token lists. This is our working list:

[i]
[' m]
[_ b l u e]
[_ d a b a d e e]
[_ d a b a d a m]

Next, we count the frequency of appearance of each ordered pair:

[d a] --> 3
[a b] --> 2
[b a] --> 2
[' m] --> 1
[_ b] --> 1
[l u] --> 1
[u e] --> 1
[_ d] --> 2
[a d] --> 2
[d e] --> 1
[e e] --> 1
[b l] --> 1
[a m] --> 1

The pair “da” is the most common one, so we’re creating a new token for it,
and substituting it everywhere in the working list:

[i]
[' m]
[_ b l u e]
[_ da b a d e e]
[_ da b a da m]

As you can see, in every instance “d” followed by “a” was combined into “da”.
Now we repeat the process, counting the most common pairs in this new working list:

[e e] --> 1
[a da] --> 1
[l u] --> 1
[_ da] --> 2
[da b] --> 2
[a d] --> 1
[d e] --> 1
[da m] --> 1
[' m] --> 1
[_ b] --> 1
[b l] --> 1
[u e] --> 1
[b a] --> 2

Several pairs have a count of 2, so we pick one arbitrarily. Let’s say it’s
_da (a space followed by “da”). We add _da as a new token and
make replacements in the working list:

[i]
[' m]
[_ b l u e]
[_da b a d e e]
[_da b a da m]

And so on. When does this process stop? When we either run out of pairs (every
word consists of a single token) or – more realistically for an actual training
corpus – when we reach our desired vocabulary size. For example, the vocabulary
used for GPT-4 has around 100,000 tokens (more on this later).

The output of the training process is a vocabulary; let’s say we’ve only run
two cycles on our input text as described. The vocabulary will have 258 tokens
in it: 256 for the single bytes, one for da and another for _da. Each
of these would have a unique integer ID.
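
As an illustration (not the repository's actual data structures), such a
vocabulary could be represented in Go as a map from token ID to token bytes,
seeded with the 256 single-byte tokens:

package main

import "fmt"

// buildToyVocabulary is an illustrative sketch: IDs 0..255 map to the 256
// single bytes, and every learned merge gets the next consecutive ID.
func buildToyVocabulary(merges []string) map[int][]byte {
	vocab := make(map[int][]byte)
	for b := 0; b < 256; b++ {
		vocab[b] = []byte{byte(b)}
	}
	for i, m := range merges {
		vocab[256+i] = []byte(m)
	}
	return vocab
}

func main() {
	// The two merges learned in the toy training run above.
	vocab := buildToyVocabulary([]string{"da", " da"})
	fmt.Println(len(vocab), string(vocab[256]), string(vocab[257]))
	// Prints 258, then "da" and " da".
}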

In our Go sample code, the training is implemented in this file. You
can set the debugTrain variable to true to follow the process on some
sample text.

Encoding

Having learned a vocabulary, the process of encoding is what happens every time
we feed text into an LLM and it needs to be tokenized. The input is arbitrary
text, a splitting regexp and a vocabulary. For example, let’s take the input
text “yada daba”. Splitting is performed as before, and the input is broken
into individual bytes:

[y a d a]
[_ d a b a]

BPE encoding takes the vocabulary and tries to apply learned tokens to the input
text, word by word. The process is greedy – tokens are applied in the same
order they’ve been learned (this is easy to accomplish by assigning
monotonically increasing integer IDs to new tokens in the vocabulary, and then
prioritizing lower-numbered tokens for encoding).
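
Here's a minimal Go sketch of this greedy loop for a single word. It's an
illustration only – tokens are strings rather than byte sequences, the
encodeWord name is made up, and it returns token values instead of IDs:

package main

import "fmt"

// encodeWord is an illustrative sketch of the greedy encoding step for a
// single word: learned tokens are applied in the order they were learned.
func encodeWord(word string, merges []string) []string {
	// Start with one token per byte.
	tokens := make([]string, 0, len(word))
	for i := 0; i < len(word); i++ {
		tokens = append(tokens, word[i:i+1])
	}

	// Apply each learned merge, oldest (lowest ID) first.
	for _, m := range merges {
		var out []string
		for i := 0; i < len(tokens); i++ {
			if i+1 < len(tokens) && tokens[i]+tokens[i+1] == m {
				out = append(out, m)
				i++ // skip the second half of the merged pair
			} else {
				out = append(out, tokens[i])
			}
		}
		tokens = out
	}
	return tokens
}

func main() {
	merges := []string{"da", " da"}
	fmt.Println(encodeWord("yada", merges))  // [y a da]
	fmt.Println(encodeWord(" daba", merges)) // [ da b a]
}

For the two words of “yada daba” this produces 3 tokens each – 6 in total,
matching the walkthrough below.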

The first token we learned was da, so let’s apply that:

[y a da]
[_ da b a]

The next token we learned was _da:

[y a da]
[_da b a]

This is the final stage; there are no more learned tokens to apply. The result
will consist of 6 tokens.

In our sample code, the encoder is in this file.

Realistic vocabulary and splitting

The examples shown so far have been toys, but the algorithms are real and work
with the actual vocabularies and splitters used in modern models. As a case
study, the tokenizer used for OpenAI’s GPT-4 uses a vocabulary called
cl100k_base, which contains 100k tokens in addition to the 256 byte-sized
ones. This is also the vocabulary (encoding) the
tiktoken library uses. It can be
freely downloaded from OpenAI – a copy is available in my
sample repository.
The file is base64 encoded, which is easy to unravel; after decoding we’ll see
entries like:

" Fritz"  91083
"Initially"  91084
"nodeValue"  91085
"_TRIANGLES"  91086
"-backend"  91087

The token string value is to the left, and the numerical token ID is to the
right. As you can see, the algorithm is not particularly discerning about what
it learns – names, pieces of code – whatever works!
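
For illustration, here's a minimal Go sketch of loading such a vocabulary
file, assuming each line holds a base64-encoded token followed by its integer
ID, separated by a space (the file name in main is just an example):

package main

import (
	"bufio"
	"encoding/base64"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// loadVocabulary reads a tiktoken-style vocabulary file where (as assumed
// here) each line contains a base64-encoded token and its integer ID,
// separated by a space. It returns a map from ID to raw token bytes.
func loadVocabulary(path string) (map[int][]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	vocab := make(map[int][]byte)
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		parts := strings.Fields(line)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed line: %q", line)
		}
		tok, err := base64.StdEncoding.DecodeString(parts[0])
		if err != nil {
			return nil, err
		}
		id, err := strconv.Atoi(parts[1])
		if err != nil {
			return nil, err
		}
		vocab[id] = tok
	}
	return vocab, scanner.Err()
}

func main() {
	vocab, err := loadVocabulary("cl100k_base.tiktoken") // path is illustrative
	if err != nil {
		panic(err)
	}
	fmt.Println(len(vocab), string(vocab[91083])) // 91083 is " Fritz" in the sample above
}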

The other important data needed to reproduce OpenAI’s tokenization is the
splitting regexp, which is this:

(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+

It’s just a combination of several alternatives. You could
use one of the many “regexp explainer” websites out there to study
it, or ask a modern LLM, but the gist of it is: this regexp splits
space-delimited words, leaving spaces in front of the words, with some special
provisions like English contractions (being separate words) and long numbers
being split into groups of 3. For Go programmers, it’s important to note that
this pattern uses ?! – negative lookahead – which the standard regexp
package doesn’t support. Therefore, we’ll have to reach for the third-party
regexp2 package to implement this.
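
Here's a minimal sketch of the splitting step using the
github.com/dlclark/regexp2 package (a common choice; the sample repository may
set this up differently):

package main

import (
	"fmt"

	"github.com/dlclark/regexp2"
)

func main() {
	// The cl100k_base splitting pattern; regexp2 supports the (?!\S)
	// negative lookahead that the standard library's regexp does not.
	pattern := `(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+`
	re := regexp2.MustCompile(pattern, regexp2.None)

	// regexp2 has no FindAllString; iterate matches one at a time.
	var words []string
	m, _ := re.FindStringMatch("i'm blue dabadee dabadam")
	for m != nil {
		words = append(words, m.String())
		m, _ = re.FindNextMatch(m)
	}
	fmt.Printf("%q\n", words) // ["i" "'m" " blue" " dabadee" " dabadam"]
}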

In our sample repository, take a look at this test
that ties everything together – it loads the cl100k_base encoding and uses
it alongside the splitting regexp to tokenize some real text.

Full online demo with a web UI and WebAssembly

My goal with this project wasn’t only to understand the BPE algorithm, but to
also try reproducing the actual tokenizer used by OpenAI for its most modern
models. And this goal was accomplished!

OpenAI has a nice website here
that lets you enter text and see how it’s tokenized. I’ve managed to reproduce
this UI – see the cmd/wasm directory in the repository. I’ve also placed it
online – it can be run in your browser from here.
Here’s a screenshot:

How it works: the Go implementation of BPE is compiled to a WebAssembly binary
that’s loaded from a bit of glue JavaScript embedded in a simple HTML page.
The JavaScript watches the text box as you type and sends the string to a Go
function exported from the WASM, which tokenizes it on the fly. So we get a nice
effect of “tokens updated as we type”. The selection button at the bottom also
lets us see the numerical IDs for these tokens – they should be identical to
what tiktoken produces.
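
For reference, here's a rough sketch of the Go side of such a setup using
syscall/js. The exported name tokenize and the tokenizeString helper are made
up for this illustration – the repository's cmd/wasm code may be organized
differently:

//go:build js && wasm

package main

import "syscall/js"

// tokenizeString stands in for the real tokenizer; in the actual project this
// would call the BPE encoder with the cl100k_base vocabulary.
func tokenizeString(text string) []any {
	// ... run the BPE encoder and return token strings/IDs ...
	return []any{text}
}

func main() {
	// Expose a "tokenize" function to JavaScript. The glue JS can then call
	// window.tokenize(text) whenever the text box changes.
	js.Global().Set("tokenize", js.FuncOf(func(this js.Value, args []js.Value) any {
		return tokenizeString(args[0].String())
	}))

	// Block forever so the exported function stays available.
	select {}
}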

