Understanding tokens in LLMs
Tokens are the fundamental units of text that a Large Language Model (LLM) processes. A token can be a word, subword, or even a single character depending on the tokenisation method used.
How tokenisation works
Most LLMs, such as the GPT models, use Byte Pair Encoding (BPE) or a similar subword tokenisation method to split text into tokens. Here's a simple example of tokenisation:
// Example of tokenised sentence:
Input: "Hello, world!"
Tokens: ["Hello", ",", " world", "!"]
(Note that the leading space belongs to the " world" token; BPE-style tokenisers typically attach spaces to the word that follows them.)
Why tokens matter
- Tokens determine the cost of LLM API calls (e.g., OpenAI’s models charge based on token usage).
- They define the model's context length (e.g., GPT-4 has a context window of up to 32K tokens); a simple way to check whether a prompt fits a given window is sketched after this list.
- Understanding tokenisation helps in optimising prompts for better responses.
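As a concrete illustration of the context-length point, the sketch below counts a prompt's tokens and truncates it to fit a budget. The 8,000-token budget and the cl100k_base encoding are assumptions chosen for illustration; real limits depend on the model you call.

import tiktoken

# Hypothetical token budget for illustration; real limits depend on the model.
CONTEXT_BUDGET = 8000

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt: str, budget: int = CONTEXT_BUDGET) -> bool:
    # True if the prompt's token count is within the budget.
    return len(enc.encode(prompt)) <= budget

def truncate_to_budget(prompt: str, budget: int = CONTEXT_BUDGET) -> str:
    # Keep only the first `budget` tokens and decode them back to text.
    tokens = enc.encode(prompt)
    return enc.decode(tokens[:budget])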
Example: Counting tokens
Below is an example Python snippet that counts tokens using the tiktoken library:
import tiktoken

# Load the encoding used by recent OpenAI models such as GPT-4 and GPT-3.5-turbo.
enc = tiktoken.get_encoding("cl100k_base")

text = "Understanding tokens in LLMs is essential."
tokens = enc.encode(text)  # encode the text into a list of token IDs
print(f"Token count: {len(tokens)}")