Text Tokenization and Embeddings in Deep Learning
Why Text Needs to Be Converted to Numbers
Deep learning models are based on processing numbers
Language models are built on mathematics, and math is based on numbers
Optimization, learning algorithms, and probabilities all require numerical representations
Text must be represented as numbers for computational processing
What is a Token?
A token is a piece of text that can be represented as an integer.
Examples of Tokenization:
Character-level: The word "HELLO" represented as five numbers (a vector with one number for each character)
Subword-level: "HELLO" split as "HE-LLO" → two numbers representing two subwords
Word-level: The entire word "HELLO" as a single token (one number)
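A minimal sketch of these three granularities, using made-up integer IDs rather than any real vocabulary:

```python
# Toy illustration of character-, subword-, and word-level tokenization.
# The vocabularies and IDs below are invented purely for illustration.

char_vocab = {"H": 8, "E": 5, "L": 12, "O": 15}
subword_vocab = {"HE": 101, "LLO": 102}
word_vocab = {"HELLO": 7001}

text = "HELLO"

char_tokens = [char_vocab[c] for c in text]                 # [8, 5, 12, 12, 15] -> 5 numbers
subword_tokens = [subword_vocab[s] for s in ["HE", "LLO"]]  # [101, 102]         -> 2 numbers
word_tokens = [word_vocab[text]]                            # [7001]             -> 1 number

print(char_tokens, subword_tokens, word_tokens)
```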
Why Not Use Characters as Tokens?
Simple character mapping (e.g., a→1, b→2, ... z→26) has several problems:
Problem 1: Unicode Complexity
Letters across languages span many different Unicode ranges, so a simple per-letter mapping does not scale
Problem 2: Statistical Inefficiency
Ignores statistical regularities in language
Misses patterns and relationships between character sequences
Problem 3: Memory Limitations
One number per character makes sequences long, which requires significantly more memory and compute
Long character sequences severely limit how much text fits in the context window
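A small sketch of why the naive a→1, …, z→26 scheme breaks down, assuming nothing beyond the Python standard library; the sentence used is arbitrary:

```python
# Naive mapping a->1 ... z->26 (the toy scheme from the notes).
naive = {chr(ord("a") + i): i + 1 for i in range(26)}

print(naive.get("a"), naive.get("z"))   # 1 26
print(naive.get("é"), naive.get("字"))  # None None -> non-Latin letters have no ID

# Unicode defines far more than 26 letters; code points quickly exceed a 26-entry table.
print(ord("字"))  # 23383

# Character-level IDs also make sequences long: one number per character,
# which eats into a fixed-size context window.
sentence = "deep learning models process numbers"
print(len(sentence))          # 36 characters -> 36 tokens at character level
print(len(sentence.split()))  # 5 words       -> 5 tokens at word level
```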
How LLMs Actually Work
Important: LLMs don't work directly with tokens - they work with embeddings!
The Processing Pipeline:
Text → Token ID → Embedding → [LLM Processing] → Unembedding → Token → Text
Text is converted into tokens
Tokens must be converted into embeddings before the LLM can process them
LLMs modify embeddings for classification and generation tasks
It's all about the embeddings
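A minimal NumPy sketch of this pipeline, with a toy four-word vocabulary and a random (untrained) embedding matrix standing in for the learned one; the "[LLM Processing]" step is left as a placeholder:

```python
import numpy as np

# Toy vocabulary and embedding table; real models use 50k+ tokens,
# 1000+ dimensions, and a table learned during training.
vocab = ["<unk>", "the", "dog", "runs"]
token_to_id = {t: i for i, t in enumerate(vocab)}
d_model = 4
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d_model))   # embedding matrix (vocab_size, d_model)

def encode(text):
    return [token_to_id.get(w, 0) for w in text.lower().split()]

token_ids = encode("the dog runs")           # Text -> token IDs
x = E[token_ids]                             # Token IDs -> embeddings (table lookup)

h = x                                        # [LLM Processing] placeholder: a real model
                                             # transforms these vectors with many layers

logits = h @ E.T                             # Unembedding: score every vocabulary token
next_ids = logits.argmax(axis=-1)            # Highest-scoring token ID per position
print([vocab[i] for i in next_ids])          # Token IDs -> text
```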
What Are Embeddings?
Embeddings are dense numeric representations of tokens.
Advantages Over Simple Integers:
Efficiency: Information about a token is packed densely, so more meaning can be represented with fewer numbers
Semantic Relations: Relationships across tokens can be represented
Characteristics of Real Embeddings:
High-dimensional: Often more than 1000 dimensions
Not human-interpretable: Abstract mathematical representations
Dynamic: Modified during model calculations (e.g., the "dog" vector can change based on context)
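A toy sketch of the "semantic relations" point, using hand-invented low-dimensional vectors purely for illustration (real embeddings are learned and far higher-dimensional):

```python
import numpy as np

# Hand-crafted 4-dimensional vectors, invented only to illustrate the idea;
# learned embeddings are high-dimensional and not interpretable like this.
emb = {
    "dog":   np.array([0.9, 0.8, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.9, 0.2, 0.1]),
    "car":   np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["dog"], emb["puppy"]))  # high similarity -> close in the space
print(cosine(emb["dog"], emb["car"]))    # low similarity  -> far apart
```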
Tokenization Challenges
Finding the optimal tokenization strategy is difficult:
Trade-offs:
Fewer token types (a smaller vocabulary, e.g. characters):
✓ Less memory usage (smaller vocabulary and embedding table)
✓ Improved generalization
✗ Longer sequences, so less efficient and effective
More token types (a larger vocabulary, e.g. words):
✓ Better text compression (fewer tokens per text)
✓ Each token can convey more information
✗ Requires more training data
Key Considerations:
Statistical dependencies change across languages
Dependencies vary between corpora (datasets) even within the same language
Tokenization must be learned from text
Different training texts will therefore produce different tokenization schemes
Key Definitions
Encoder: A function that maps text into integers
Decoder: The inverse of the encoder - a function (lookup table) that maps integers back into text
Inverse Property: decoder(encoder(x)) = x
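A minimal sketch of the encoder/decoder pair and the inverse property, assuming a hard-coded word-level vocabulary (real tokenizers learn theirs from data):

```python
# Toy word-level encoder/decoder pair.
vocab = {"text": 0, "must": 1, "become": 2, "numbers": 3}
inverse_vocab = {i: t for t, i in vocab.items()}   # the decoder's lookup table

def encoder(text):
    return [vocab[w] for w in text.split()]

def decoder(ids):
    return " ".join(inverse_vocab[i] for i in ids)

x = "text must become numbers"
assert decoder(encoder(x)) == x    # the inverse property from the notes
print(encoder(x))                  # [0, 1, 2, 3]
```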
Summary
Text must be transformed into numbers before LLM processing
A chunk of text (token) can be a character, subword, or full word
Embeddings are dense representations of tokens
Both tokenization and embeddings are learned from data
There are many valid ways to create these schemes
The notes above cover the fundamental concepts; the sections below add complementary context.
Additional Context on Tokenization Methods
Byte-Pair Encoding (BPE) is one of the most common tokenization algorithms used today (GPT models, for instance). It starts with characters and iteratively merges the most frequent pairs to create subword tokens. This helps balance vocabulary size with semantic meaningfulness.
WordPiece (used by BERT) and SentencePiece (used by T5, LLaMA) are other popular approaches that achieve similar goals through slightly different algorithms.
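A minimal sketch of the core BPE loop: count adjacent-pair frequencies across a corpus and merge the most frequent pair. Real implementations work on bytes, handle word boundaries more carefully, and run tens of thousands of merges over large corpora.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus (word -> frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny "corpus": words split into characters, with frequencies.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 6}

for _ in range(3):                      # run a few merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))
```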
The Embedding Space
Your point about embeddings capturing semantic relations is crucial. In practice, this means:
Similar words cluster together in the embedding space
Vector arithmetic can capture relationships (the classic example: king - man + woman ≈ queen)
Contextualized embeddings (from transformers) can represent the same word differently based on surrounding context
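One way to see this in practice is with pretrained word vectors. The sketch below assumes the gensim library and its downloadable "glove-wiki-gigaword-50" vector set (an assumption about which pretrained set is available; the first run downloads the data).

```python
# Requires `pip install gensim`; downloads tens of MB of pretrained
# GloVe vectors on first run.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # word -> 50-dimensional vector

# king - man + woman: nearest neighbours of the resulting vector
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Similar words cluster together in the embedding space
print(vectors.most_similar("dog", topn=3))
```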
Practical Trade-offs
The tokenization choice significantly impacts:
Computational efficiency: Fewer tokens = faster processing
Multilingual capability: Byte-level tokenization handles any language but needs more tokens
Out-of-vocabulary handling: Subword tokenization can handle novel words by breaking them into known pieces
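A small sketch of the byte-level idea: UTF-8 turns any string into byte values 0–255, so nothing is ever out of vocabulary, but non-Latin text costs more tokens per character.

```python
# Byte-level "tokenization": UTF-8 gives every string a sequence of
# byte values in 0..255, so no text is ever out of vocabulary.
english = "hello"
greek = "γειά"

print(list(english.encode("utf-8")))  # 5 bytes for 5 characters
print(list(greek.encode("utf-8")))    # 8 bytes for 4 characters: more tokens per character

# Novel words are never a problem either; they are just more bytes.
print(len("untokenizable".encode("utf-8")))  # 13
```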
A Concrete Example
For the sentence "The tokenization is interesting":
Character-level: ~30 tokens (including spaces)
Subword (BPE): ~6-8 tokens (might split "tokenization" into "token" + "ization")
Word-level: 4 tokens
Each approach affects how much context fits in the model's window and how well it generalizes.
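To try this concretely, one option is the tiktoken library (an assumption here, installed via pip); the exact subword count depends on the vocabulary used, so it may not match the estimate above exactly.

```python
# Requires `pip install tiktoken`. "cl100k_base" is the BPE encoding
# used by several OpenAI models; other vocabularies give other counts.
import tiktoken

text = "The tokenization is interesting"

print(len(text))                          # 31 characters -> that many character-level tokens
print(len(text.split()))                  # 4 word-level tokens

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode(text)
print(len(ids), ids)                      # a handful of subword tokens
print([enc.decode([i]) for i in ids])     # see how the words were split
```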