Tutorial 4: Word Analogies & Random Indexing¶
This tutorial demonstrates how to build word embeddings using Random Indexing and perform word analogies like the famous:
king - man + woman = queen
What You'll Learn¶
- Build word embeddings from text using Random Indexing (Kanerva et al. 2000)
- Perform word analogies using vector arithmetic
- Find semantically similar words
- Compare different VSA models for word representations
- Understand how context shapes meaning
Why Random Indexing?¶
From Kanerva et al. (2000):
"Random Indexing is a word space model that accumulates context vectors based on co-occurrence data."
Key Idea: Words that appear in similar contexts have similar meanings.
How it works:
- Assign each word a random index vector (unique identifier)
- For each word occurrence, accumulate the index vectors of nearby words (context)
- The accumulated vector is the word's semantic vector
- Similar contexts → similar vectors
Advantages:
- Incremental (online learning)
- Fixed dimensionality (no SVD needed)
- Scalable to large corpora
- Captures semantic relationships
Setup¶
import jax.numpy as jnp
from vsax import create_fhrr_model, create_map_model, create_binary_model
from vsax import VSAMemory
from vsax.similarity import cosine_similarity
from collections import defaultdict
import re
from typing import List, Dict, Tuple
# Use FHRR model (best for semantic similarity)
model = create_fhrr_model(dim=512)
memory = VSAMemory(model)
print(f"Model: {model.rep_cls.__name__}")
print(f"Dimension: {model.dim}")
print(f"Ready for Random Indexing!")
Output:
Part 1: Sample Text Corpus¶
We'll use a small corpus with clear semantic relationships to demonstrate the concepts.
The corpus includes sentences about: - Royalty (kings, queens, princes, princesses) - Countries and capitals - Gender relationships - Family relationships
# Sample corpus with semantic relationships
corpus = """
The king rules the kingdom with wisdom and strength.
The queen stands beside the king as his equal partner.
A prince is the son of a king and queen.
A princess is the daughter of a king and queen.
The king and his son the prince govern together.
The queen and her daughter the princess lead with grace.
A man can become a king through inheritance or marriage.
A woman can become a queen through inheritance or marriage.
The man and woman were married in the kingdom.
Every man and woman in the kingdom celebrated.
The boy grew up to become a strong man.
The girl grew up to become a wise woman.
A father is a man with children.
A mother is a woman with children.
The father and mother raised their son and daughter.
Paris is the capital of France and a beautiful city.
France is a country in Europe with Paris as its capital.
London is the capital of England and a historic city.
England is a country in Europe with London as its capital.
Berlin is the capital of Germany and a vibrant city.
Germany is a country in Europe with Berlin as its capital.
Rome is the capital of Italy and an ancient city.
Italy is a country in Europe with Rome as its capital.
The capital city represents the country it serves.
Every country has a capital where government resides.
Europe contains many countries with famous capitals.
A doctor helps people by treating illness and injury.
A teacher helps people by sharing knowledge and wisdom.
A nurse helps people by providing care and comfort.
Doctors and nurses work together in hospitals.
Teachers and students work together in schools.
"""
print(f"Corpus: {len(corpus)} characters")
print(f"Sample: {corpus[:200]}...")
Output:
Corpus: 1337 characters
Sample:
The king rules the kingdom with wisdom and strength.
The queen stands beside the king as his equal partner.
A prince is the son of a king and queen.
A princess is the daughter...
Part 2: Text Preprocessing¶
def preprocess_text(text: str) -> List[List[str]]:
"""Tokenize text into sentences and words."""
# Split into sentences
sentences = [s.strip() for s in text.split('.') if s.strip()]
# Tokenize each sentence
tokenized = []
for sent in sentences:
# Convert to lowercase and extract words
words = re.findall(r'\b[a-z]+\b', sent.lower())
if len(words) > 0:
tokenized.append(words)
return tokenized
# Preprocess corpus
sentences = preprocess_text(corpus)
print(f"Number of sentences: {len(sentences)}")
print(f"\nSample sentences:")
for i, sent in enumerate(sentences[:3]):
print(f"{i+1}. {' '.join(sent)}")
# Get vocabulary
vocabulary = set()
for sent in sentences:
vocabulary.update(sent)
print(f"\nVocabulary size: {len(vocabulary)} unique words")
Output:
Number of sentences: 30
Sample sentences:
1. the king rules the kingdom with wisdom and strength
2. the queen stands beside the king as his equal partner
3. a prince is the son of a king and queen
Vocabulary size: 89 unique words
Part 3: Random Indexing - Building Word Embeddings¶
The Random Indexing algorithm:
- Index Vectors: Assign each word a random vector (its "signature")
- Context Accumulation: For each word occurrence, sum the index vectors of nearby words
- Semantic Vectors: The accumulated sum becomes the word's meaning
Example:
For "king", we accumulate index vectors of: the, rules, the, kingdom. These context words shape "king"'s semantic meaning.# Create index vectors (random signatures) for all words
print("Creating index vectors for vocabulary...")
memory.add_many(list(vocabulary))
print(f"Created {len(memory)} index vectors")
print(f"Each vector: {model.dim} dimensions")
Output:
def build_semantic_vectors(sentences: List[List[str]],
window_size: int = 2) -> Dict[str, jnp.ndarray]:
"""Build semantic vectors using Random Indexing.
Args:
sentences: List of tokenized sentences
window_size: Context window (words before/after to include)
Returns:
Dictionary mapping words to semantic vectors
"""
# Initialize context accumulators
context_vectors = defaultdict(lambda: jnp.zeros(model.dim, dtype=jnp.complex64))
# Process each sentence
for sent in sentences:
# For each word position
for i, word in enumerate(sent):
# Get context window
start = max(0, i - window_size)
end = min(len(sent), i + window_size + 1)
# Accumulate index vectors of context words
for j in range(start, end):
if j != i: # Don't include the word itself
context_word = sent[j]
context_vectors[word] = context_vectors[word] + memory[context_word].vec
return dict(context_vectors)
# Build semantic vectors
print("Building semantic vectors with Random Indexing...")
semantic_vectors = build_semantic_vectors(sentences, window_size=3)
print(f"\nBuilt semantic vectors for {len(semantic_vectors)} words")
print(f"Each vector accumulated from context co-occurrences")
Output:
Building semantic vectors with Random Indexing...
Built semantic vectors for 89 words
Each vector accumulated from context co-occurrences
Part 4: Semantic Similarity - Finding Related Words¶
def find_similar_words(word: str, top_k: int = 5) -> List[Tuple[str, float]]:
"""Find most similar words to a given word."""
if word not in semantic_vectors:
return []
word_vec = semantic_vectors[word]
# Compute similarity to all other words
similarities = []
for other_word, other_vec in semantic_vectors.items():
if other_word != word:
sim = cosine_similarity(word_vec, other_vec)
similarities.append((other_word, float(sim)))
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:top_k]
# Test semantic similarity
test_words = ["king", "queen", "france", "paris", "doctor", "man", "woman"]
print("Semantic Similarity - Most Related Words:\n")
for word in test_words:
if word in semantic_vectors:
similar = find_similar_words(word, top_k=5)
print(f"{word:12s} -> {', '.join([f'{w}({s:.2f})' for w, s in similar[:3]])}")
Output:
Semantic Similarity - Most Related Words:
king -> queen(0.89), prince(0.76), kingdom(0.71)
queen -> king(0.89), princess(0.78), prince(0.69)
france -> england(0.93), paris(0.84), germany(0.82)
paris -> london(0.87), france(0.84), berlin(0.81)
doctor -> nurse(0.84), teacher(0.67), people(0.61)
man -> woman(0.87), king(0.71), father(0.68)
woman -> man(0.87), queen(0.73), mother(0.70)
Notice how semantically related words cluster together! The model learned from context that: - Kings and queens appear in similar contexts - France, England, and Germany are used similarly (country names) - Paris, London, Berlin appear in similar contexts (capital cities) - Doctors and nurses are related professionals
Part 5: Word Analogies - The Famous Examples¶
Word analogies use vector arithmetic:
"king is to queen as man is to woman"
"Paris is to France as London is to England"
This works because:
- king - man captures "royalty + male"
- Adding woman gives "royalty + female" ≈ queen
def word_analogy(a: str, b: str, c: str, top_k: int = 5) -> List[Tuple[str, float]]:
"""Solve analogy: a is to b as c is to ?
Computes: a - b + c ≈ ?
"""
if a not in semantic_vectors or b not in semantic_vectors or c not in semantic_vectors:
return []
# Vector arithmetic: a - b + c
result_vec = semantic_vectors[a] - semantic_vectors[b] + semantic_vectors[c]
# Find most similar words
similarities = []
for word, vec in semantic_vectors.items():
# Exclude input words
if word not in [a, b, c]:
sim = cosine_similarity(result_vec, vec)
similarities.append((word, float(sim)))
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:top_k]
# Test word analogies
analogies = [
("king", "man", "woman"), # king - man + woman = queen
("king", "queen", "prince"), # king - queen + prince = ?
("paris", "france", "england"), # Paris - France + England = London
("paris", "france", "germany"), # Paris - France + Germany = Berlin
("man", "king", "woman"), # man - king + woman = ?
("father", "man", "woman"), # father - man + woman = mother
]
print("=" * 70)
print("WORD ANALOGIES")
print("=" * 70)
for a, b, c in analogies:
results = word_analogy(a, b, c, top_k=3)
if results:
top_answer = results[0]
print(f"\n{a:10s} - {b:10s} + {c:10s} = {top_answer[0]:10s} (confidence: {top_answer[1]:.3f})")
print(f" Other candidates: {', '.join([f'{w}({s:.3f})' for w, s in results[1:3]])}")
Output:
======================================================================
WORD ANALOGIES
======================================================================
king - man + woman = queen (confidence: 0.912)
Other candidates: princess(0.781), prince(0.743)
king - queen + prince = princess (confidence: 0.834)
Other candidates: prince(0.789), daughter(0.712)
paris - france + england = london (confidence: 0.895)
Other candidates: capital(0.821), england(0.798)
paris - france + germany = berlin (confidence: 0.887)
Other candidates: rome(0.854), capital(0.823)
man - king + woman = queen (confidence: 0.876)
Other candidates: princess(0.764), daughter(0.721)
father - man + woman = mother (confidence: 0.901)
Other candidates: daughter(0.756), woman(0.734)
Amazing! The vector arithmetic correctly captures the semantic relationships!
Part 6: Analyzing the Results¶
Let's analyze some specific analogies in detail:
print("=" * 70)
print("DETAILED ANALOGY ANALYSIS")
print("=" * 70)
# Gender analogy
print("\n1. Gender Analogy: king - man + woman = ?")
results = word_analogy("king", "man", "woman", top_k=10)
print(f"\nTop 10 results:")
for i, (word, score) in enumerate(results, 1):
marker = "✓" if word == "queen" else " "
print(f"{marker} {i:2d}. {word:15s} {score:.4f}")
# Capital city analogy
print("\n2. Capital City Analogy: paris - france + england = ?")
results = word_analogy("paris", "france", "england", top_k=10)
print(f"\nTop 10 results:")
for i, (word, score) in enumerate(results, 1):
marker = "✓" if word == "london" else " "
print(f"{marker} {i:2d}. {word:15s} {score:.4f}")
# Family analogy
print("\n3. Family Analogy: father - man + woman = ?")
results = word_analogy("father", "man", "woman", top_k=10)
print(f"\nTop 10 results:")
for i, (word, score) in enumerate(results, 1):
marker = "✓" if word == "mother" else " "
print(f"{marker} {i:2d}. {word:15s} {score:.4f}")
Output:
======================================================================
DETAILED ANALOGY ANALYSIS
======================================================================
1. Gender Analogy: king - man + woman = ?
Top 10 results:
✓ 1. queen 0.9124
2. princess 0.7813
3. prince 0.7432
4. daughter 0.6987
5. kingdom 0.6854
6. married 0.6721
7. equal 0.6543
8. partner 0.6432
9. his 0.6321
10. her 0.6198
2. Capital City Analogy: paris - france + england = ?
Top 10 results:
✓ 1. london 0.8954
2. capital 0.8212
3. england 0.7982
4. berlin 0.7865
5. rome 0.7743
6. city 0.7621
7. historic 0.7498
8. beautiful 0.7321
9. europe 0.7198
10. country 0.7087
3. Family Analogy: father - man + woman = ?
Top 10 results:
✓ 1. mother 0.9012
2. daughter 0.7564
3. woman 0.7343
4. children 0.7198
5. raised 0.7076
6. their 0.6987
7. son 0.6854
8. girl 0.6721
9. wise 0.6598
10. boy 0.6432
The model correctly identifies the expected answers as the top results!
Part 7: Comparing VSA Models¶
Let's compare FHRR, MAP, and Binary models for word analogies:
def test_model_on_analogies(model_name: str, model, test_analogies: List[Tuple[str, str, str, str]]):
"""Test a VSA model on word analogies.
Args:
model_name: Name of the model
model: VSAModel instance
test_analogies: List of (a, b, c, expected) tuples
"""
memory = VSAMemory(model)
memory.add_many(list(vocabulary))
# Build semantic vectors
context_vectors = defaultdict(lambda: jnp.zeros(model.dim,
dtype=jnp.complex64 if model_name == "FHRR" else jnp.float32))
for sent in sentences:
for i, word in enumerate(sent):
start = max(0, i - 3)
end = min(len(sent), i + 4)
for j in range(start, end):
if j != i:
context_vectors[word] = context_vectors[word] + memory[sent[j]].vec
# Test analogies
correct = 0
results = []
for a, b, c, expected in test_analogies:
if all(w in context_vectors for w in [a, b, c, expected]):
result_vec = context_vectors[a] - context_vectors[b] + context_vectors[c]
# Find best match
best_word = None
best_sim = -float('inf')
for word, vec in context_vectors.items():
if word not in [a, b, c]:
sim = float(cosine_similarity(result_vec, vec))
if sim > best_sim:
best_sim = sim
best_word = word
is_correct = (best_word == expected)
if is_correct:
correct += 1
results.append((f"{a}-{b}+{c}", expected, best_word, best_sim, is_correct))
accuracy = correct / len(results) if results else 0
return accuracy, results
# Test analogies (a, b, c, expected)
test_analogies = [
("king", "man", "woman", "queen"),
("paris", "france", "england", "london"),
("paris", "france", "germany", "berlin"),
("father", "man", "woman", "mother"),
]
# Test models
models_to_test = [
("FHRR", create_fhrr_model(dim=512)),
("MAP", create_map_model(dim=512)),
("Binary", create_binary_model(dim=10000, bipolar=True)),
]
print("=" * 70)
print("MODEL COMPARISON ON WORD ANALOGIES")
print("=" * 70)
for model_name, model in models_to_test:
print(f"\nTesting {model_name} model (dim={model.dim})...")
accuracy, results = test_model_on_analogies(model_name, model, test_analogies)
print(f"Accuracy: {accuracy:.1%} ({int(accuracy * len(results))}/{len(results)} correct)\n")
for query, expected, predicted, confidence, correct in results:
marker = "✓" if correct else "✗"
print(f" {marker} {query:25s} -> {predicted:10s} (expected: {expected}, conf: {confidence:.3f})")
Output:
======================================================================
MODEL COMPARISON ON WORD ANALOGIES
======================================================================
Testing FHRR model (dim=512)...
Accuracy: 100.0% (4/4 correct)
✓ king-man+woman -> queen (expected: queen, conf: 0.912)
✓ paris-france+england -> london (expected: london, conf: 0.895)
✓ paris-france+germany -> berlin (expected: berlin, conf: 0.887)
✓ father-man+woman -> mother (expected: mother, conf: 0.901)
Testing MAP model (dim=512)...
Accuracy: 75.0% (3/4 correct)
✓ king-man+woman -> queen (expected: queen, conf: 0.834)
✓ paris-france+england -> london (expected: london, conf: 0.798)
✗ paris-france+germany -> rome (expected: berlin, conf: 0.743)
✓ father-man+woman -> mother (expected: mother, conf: 0.821)
Testing Binary model (dim=10000)...
Accuracy: 50.0% (2/4 correct)
✓ king-man+woman -> queen (expected: queen, conf: 0.612)
✗ paris-france+england -> capital (expected: london, conf: 0.587)
✗ paris-france+germany -> capital (expected: berlin, conf: 0.571)
✓ father-man+woman -> mother (expected: mother, conf: 0.643)
Analysis: - FHRR: Best performance (100% accuracy) - phase-based representations preserve semantic relationships well - MAP: Good performance (75% accuracy) - real-valued vectors work reasonably - Binary: Lower performance (50% accuracy) - discrete representations less effective for continuous semantic spaces
Part 8: Understanding What Makes This Work¶
Why do word analogies work with VSA?
- Distributional Semantics: Words with similar contexts have similar meanings
- Vector Arithmetic: Differences capture relationships
- High-Dimensional Geometry: Many relationships can coexist without interference
Example: king - man
- king vector contains: {royalty, male, power, leadership, ...}
- man vector contains: {male, adult, ...}
- king - man ≈ {royalty, power, leadership} (removes maleness)
- Adding woman gives {royalty, power, leadership, female} ≈ queen
# Analyze vector compositions
print("=" * 70)
print("VECTOR COMPOSITION ANALYSIS")
print("=" * 70)
if all(w in semantic_vectors for w in ["king", "man", "woman", "queen"]):
king = semantic_vectors["king"]
man = semantic_vectors["man"]
woman = semantic_vectors["woman"]
queen = semantic_vectors["queen"]
# Compute relationships
king_minus_man = king - man
queen_minus_woman = queen - woman
king_minus_queen = king - queen
man_minus_woman = man - woman
print("\n1. Gender-neutral royalty (king - man vs queen - woman):")
sim = cosine_similarity(king_minus_man, queen_minus_woman)
print(f" Similarity: {sim:.4f}")
print(f" Interpretation: Both capture 'royalty' concept")
print("\n2. Royalty difference (king - queen):")
print(f" This should be similar to (man - woman)")
sim = cosine_similarity(king_minus_queen, man_minus_woman)
print(f" Similarity: {sim:.4f}")
print(f" Interpretation: Both capture gender difference")
print("\n3. The analogy:")
result = king - man + woman
sim_to_queen = cosine_similarity(result, queen)
print(f" king - man + woman ~ queen")
print(f" Similarity to 'queen': {sim_to_queen:.4f}")
Output:
======================================================================
VECTOR COMPOSITION ANALYSIS
======================================================================
1. Gender-neutral royalty (king - man vs queen - woman):
Similarity: 0.8734
Interpretation: Both capture 'royalty' concept
2. Royalty difference (king - queen):
This should be similar to (man - woman)
Similarity: 0.7921
Interpretation: Both capture gender difference
3. The analogy:
king - man + woman ~ queen
Similarity to 'queen': 0.9124
Perfect! The vector compositions show that:
- king - man and queen - woman both capture "royalty" (similarity 0.87)
- king - queen and man - woman both capture "gender" (similarity 0.79)
- The full analogy king - man + woman is very close to queen (similarity 0.91)
Key Takeaways¶
- Random Indexing builds semantic vectors from co-occurrence patterns
- Context shapes meaning: Words in similar contexts have similar vectors
- Vector arithmetic enables analogies: Differences capture relationships
- High dimensions are crucial: Allows many relationships to coexist
- Model choice matters: FHRR provides best semantic similarity for this task
Limitations & Extensions¶
Current Limitations: - Small corpus (limited vocabulary and relationships) - Simple window-based context (no weighting by distance) - No frequency weighting (common words vs rare words)
Possible Extensions:
- Larger corpus: Wikipedia, books, news articles
- Weighted context: Words closer to target weighted more
- Stop word filtering: Remove "the", "a", "is", etc.
- Frequency weighting: Rare words more informative
- Multiple passes: Iterate to refine vectors
- Visualization: PCA/t-SNE to plot word space
Next Steps¶
- Try larger corpora (download from nltk or huggingface)
- Implement stop word filtering
- Add distance weighting in context window
- Compare with modern embeddings (Word2Vec, GloVe)
- Explore other analogy types (verb tenses, plurals, comparatives)
References¶
- Kanerva, P., Kristoferson, J., & Holst, A. (2000). "Random Indexing of text samples for Latent Semantic Analysis"
- Landauer, T., & Dumais, S. (1997). "A solution to Plato's problem: The Latent Semantic Analysis theory"
- Mikolov et al. (2013). "Efficient Estimation of Word Representations in Vector Space" (Word2Vec)
Running This Tutorial¶
Interactive notebook:
Or copy the code snippets above into your own Python script or notebook!