Tokens

Understanding tokens in LLMs

Tokens are the fundamental units of text that a Large Language Model (LLM) processes. A token can be a whole word, a subword, or even a single character, depending on the tokenisation method used.

How tokenisation works

Most LLMs, including the GPT family, use Byte Pair Encoding (BPE) or a related subword tokenisation method to split text into tokens. Here’s a simple example of tokenisation:

// Example of a tokenised sentence (note the leading space kept on " world"):
Input:  "Hello, world!"
Tokens: ["Hello", ",", " world", "!"]
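
To give a flavour of how BPE builds subword units, here is a minimal toy sketch in Python. It is not tiktoken's actual implementation: the bpe_train function and the single-word input are illustrative assumptions, and real tokenisers learn merges from a large corpus rather than one word. Starting from individual characters, it repeatedly merges the most frequent adjacent pair of symbols:

from collections import Counter

def bpe_train(word: str, num_merges: int) -> list[str]:
    # Toy BPE trainer: begin with single characters and repeatedly
    # merge the most frequent adjacent pair into a new symbol.
    symbols = list(word)
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(symbols):
            # Replace each occurrence of the chosen pair with the merged symbol.
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_train("banana", 2))  # ['ban', 'an', 'a']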

Why tokens matter

Token counts determine how much text fits in a model's context window, how usage is billed by most LLM APIs, and how rate limits are applied, so keeping an eye on them matters in practice.
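
As a rough illustration of the billing point, the sketch below turns a token count into a cost estimate. The price constant is a hypothetical placeholder, not any provider's actual rate:

# Hypothetical price, for illustration only; check your provider's pricing.
PRICE_PER_MILLION_TOKENS = 0.50  # USD, assumed

def estimate_cost(token_count: int) -> float:
    # Linear cost model: tokens consumed times the per-token price.
    return token_count / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(f"${estimate_cost(1200):.4f}")  # 1,200 tokens -> $0.0006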

Example: Counting tokens

Below is a short Python snippet that counts tokens using the tiktoken library:

import tiktoken

# Load the BPE encoding used by cl100k-based OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Understanding tokens in LLMs is essential."
tokens = enc.encode(text)  # encode returns a list of integer token ids
print(f"Token count: {len(tokens)}")