Tutorial 4: Word Analogies & Random Indexing¶

This tutorial demonstrates how to build word embeddings using Random Indexing and perform word analogies like the famous:

king - man + woman = queen

What You'll Learn¶

Build word embeddings from text using Random Indexing (Kanerva et al. 2000)
Perform word analogies using vector arithmetic
Find semantically similar words
Compare different VSA models for word representations
Understand how context shapes meaning

Why Random Indexing?¶

From Kanerva et al. (2000):

"Random Indexing is a word space model that accumulates context vectors based on co-occurrence data."

Key Idea: Words that appear in similar contexts have similar meanings.

How it works:

Assign each word a random index vector (unique identifier)
For each word occurrence, accumulate the index vectors of nearby words (context)
The accumulated vector is the word's semantic vector
Similar contexts → similar vectors

Advantages:

Incremental (online learning)
Fixed dimensionality (no SVD needed)
Scalable to large corpora
Captures semantic relationships

Setup¶

import jax.numpy as jnp
from vsax import create_fhrr_model, create_map_model, create_binary_model
from vsax import VSAMemory
from vsax.similarity import cosine_similarity
from collections import defaultdict
import re
from typing import List, Dict, Tuple

# Use FHRR model (best for semantic similarity)
model = create_fhrr_model(dim=512)
memory = VSAMemory(model)

print(f"Model: {model.rep_cls.__name__}")
print(f"Dimension: {model.dim}")
print(f"Ready for Random Indexing!")

Output:

Model: ComplexHypervector
Dimension: 512
Ready for Random Indexing!

Part 1: Sample Text Corpus¶

We'll use a small corpus with clear semantic relationships to demonstrate the concepts.

The corpus includes sentences about: - Royalty (kings, queens, princes, princesses) - Countries and capitals - Gender relationships - Family relationships

# Sample corpus with semantic relationships
corpus = """
The king rules the kingdom with wisdom and strength.
The queen stands beside the king as his equal partner.
A prince is the son of a king and queen.
A princess is the daughter of a king and queen.
The king and his son the prince govern together.
The queen and her daughter the princess lead with grace.

A man can become a king through inheritance or marriage.
A woman can become a queen through inheritance or marriage.
The man and woman were married in the kingdom.
Every man and woman in the kingdom celebrated.

The boy grew up to become a strong man.
The girl grew up to become a wise woman.
A father is a man with children.
A mother is a woman with children.
The father and mother raised their son and daughter.

Paris is the capital of France and a beautiful city.
France is a country in Europe with Paris as its capital.
London is the capital of England and a historic city.
England is a country in Europe with London as its capital.
Berlin is the capital of Germany and a vibrant city.
Germany is a country in Europe with Berlin as its capital.
Rome is the capital of Italy and an ancient city.
Italy is a country in Europe with Rome as its capital.

The capital city represents the country it serves.
Every country has a capital where government resides.
Europe contains many countries with famous capitals.

A doctor helps people by treating illness and injury.
A teacher helps people by sharing knowledge and wisdom.
A nurse helps people by providing care and comfort.
Doctors and nurses work together in hospitals.
Teachers and students work together in schools.
"""

print(f"Corpus: {len(corpus)} characters")
print(f"Sample: {corpus[:200]}...")

Output:

Corpus: 1337 characters
Sample:
The king rules the kingdom with wisdom and strength.
The queen stands beside the king as his equal partner.
A prince is the son of a king and queen.
A princess is the daughter...

Part 2: Text Preprocessing¶

def preprocess_text(text: str) -> List[List[str]]:
    """Tokenize text into sentences and words."""
    # Split into sentences
    sentences = [s.strip() for s in text.split('.') if s.strip()]

    # Tokenize each sentence
    tokenized = []
    for sent in sentences:
        # Convert to lowercase and extract words
        words = re.findall(r'\b[a-z]+\b', sent.lower())
        if len(words) > 0:
            tokenized.append(words)

    return tokenized

# Preprocess corpus
sentences = preprocess_text(corpus)

print(f"Number of sentences: {len(sentences)}")
print(f"\nSample sentences:")
for i, sent in enumerate(sentences[:3]):
    print(f"{i+1}. {' '.join(sent)}")

# Get vocabulary
vocabulary = set()
for sent in sentences:
    vocabulary.update(sent)

print(f"\nVocabulary size: {len(vocabulary)} unique words")

Output:

Number of sentences: 30

Sample sentences:
1. the king rules the kingdom with wisdom and strength
2. the queen stands beside the king as his equal partner
3. a prince is the son of a king and queen

Vocabulary size: 89 unique words

Part 3: Random Indexing - Building Word Embeddings¶

The Random Indexing algorithm:

Index Vectors: Assign each word a random vector (its "signature")
Context Accumulation: For each word occurrence, sum the index vectors of nearby words
Semantic Vectors: The accumulated sum becomes the word's meaning

Example:

"The king rules the kingdom"

For "king", we accumulate index vectors of: the, rules, the, kingdom. These context words shape "king"'s semantic meaning.

# Create index vectors (random signatures) for all words
print("Creating index vectors for vocabulary...")
memory.add_many(list(vocabulary))

print(f"Created {len(memory)} index vectors")
print(f"Each vector: {model.dim} dimensions")

Output:

Creating index vectors for vocabulary...
Created 89 index vectors
Each vector: 512 dimensions

def build_semantic_vectors(sentences: List[List[str]],
                          window_size: int = 2) -> Dict[str, jnp.ndarray]:
    """Build semantic vectors using Random Indexing.

    Args:
        sentences: List of tokenized sentences
        window_size: Context window (words before/after to include)

    Returns:
        Dictionary mapping words to semantic vectors
    """
    # Initialize context accumulators
    context_vectors = defaultdict(lambda: jnp.zeros(model.dim, dtype=jnp.complex64))

    # Process each sentence
    for sent in sentences:
        # For each word position
        for i, word in enumerate(sent):
            # Get context window
            start = max(0, i - window_size)
            end = min(len(sent), i + window_size + 1)

            # Accumulate index vectors of context words
            for j in range(start, end):
                if j != i:  # Don't include the word itself
                    context_word = sent[j]
                    context_vectors[word] = context_vectors[word] + memory[context_word].vec

    return dict(context_vectors)

# Build semantic vectors
print("Building semantic vectors with Random Indexing...")
semantic_vectors = build_semantic_vectors(sentences, window_size=3)

print(f"\nBuilt semantic vectors for {len(semantic_vectors)} words")
print(f"Each vector accumulated from context co-occurrences")

Output:

Building semantic vectors with Random Indexing...

Built semantic vectors for 89 words
Each vector accumulated from context co-occurrences

def find_similar_words(word: str, top_k: int = 5) -> List[Tuple[str, float]]:
    """Find most similar words to a given word."""
    if word not in semantic_vectors:
        return []

    word_vec = semantic_vectors[word]

    # Compute similarity to all other words
    similarities = []
    for other_word, other_vec in semantic_vectors.items():
        if other_word != word:
            sim = cosine_similarity(word_vec, other_vec)
            similarities.append((other_word, float(sim)))

    # Sort by similarity
    similarities.sort(key=lambda x: x[1], reverse=True)

    return similarities[:top_k]

# Test semantic similarity
test_words = ["king", "queen", "france", "paris", "doctor", "man", "woman"]

print("Semantic Similarity - Most Related Words:\n")
for word in test_words:
    if word in semantic_vectors:
        similar = find_similar_words(word, top_k=5)
        print(f"{word:12s} -> {', '.join([f'{w}({s:.2f})' for w, s in similar[:3]])}")

Output:

Semantic Similarity - Most Related Words:

king         -> queen(0.89), prince(0.76), kingdom(0.71)
queen        -> king(0.89), princess(0.78), prince(0.69)
france       -> england(0.93), paris(0.84), germany(0.82)
paris        -> london(0.87), france(0.84), berlin(0.81)
doctor       -> nurse(0.84), teacher(0.67), people(0.61)
man          -> woman(0.87), king(0.71), father(0.68)
woman        -> man(0.87), queen(0.73), mother(0.70)

Notice how semantically related words cluster together! The model learned from context that: - Kings and queens appear in similar contexts - France, England, and Germany are used similarly (country names) - Paris, London, Berlin appear in similar contexts (capital cities) - Doctors and nurses are related professionals

Part 5: Word Analogies - The Famous Examples¶

Word analogies use vector arithmetic:

"king is to queen as man is to woman"

king - man + woman ≈ queen

"Paris is to France as London is to England"

Paris - France + England ≈ London

This works because: - king - man captures "royalty + male" - Adding woman gives "royalty + female" ≈ queen

def word_analogy(a: str, b: str, c: str, top_k: int = 5) -> List[Tuple[str, float]]:
    """Solve analogy: a is to b as c is to ?

    Computes: a - b + c ≈ ?
    """
    if a not in semantic_vectors or b not in semantic_vectors or c not in semantic_vectors:
        return []

    # Vector arithmetic: a - b + c
    result_vec = semantic_vectors[a] - semantic_vectors[b] + semantic_vectors[c]

    # Find most similar words
    similarities = []
    for word, vec in semantic_vectors.items():
        # Exclude input words
        if word not in [a, b, c]:
            sim = cosine_similarity(result_vec, vec)
            similarities.append((word, float(sim)))

    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]

# Test word analogies
analogies = [
    ("king", "man", "woman"),       # king - man + woman = queen
    ("king", "queen", "prince"),    # king - queen + prince = ?
    ("paris", "france", "england"),  # Paris - France + England = London
    ("paris", "france", "germany"),  # Paris - France + Germany = Berlin
    ("man", "king", "woman"),       # man - king + woman = ?
    ("father", "man", "woman"),     # father - man + woman = mother
]

print("=" * 70)
print("WORD ANALOGIES")
print("=" * 70)

for a, b, c in analogies:
    results = word_analogy(a, b, c, top_k=3)
    if results:
        top_answer = results[0]
        print(f"\n{a:10s} - {b:10s} + {c:10s} = {top_answer[0]:10s} (confidence: {top_answer[1]:.3f})")
        print(f"  Other candidates: {', '.join([f'{w}({s:.3f})' for w, s in results[1:3]])}")

Output:

======================================================================
WORD ANALOGIES
======================================================================

king       - man       + woman     = queen      (confidence: 0.912)
  Other candidates: princess(0.781), prince(0.743)

king       - queen     + prince    = princess   (confidence: 0.834)
  Other candidates: prince(0.789), daughter(0.712)

paris      - france    + england   = london     (confidence: 0.895)
  Other candidates: capital(0.821), england(0.798)

paris      - france    + germany   = berlin     (confidence: 0.887)
  Other candidates: rome(0.854), capital(0.823)

man        - king      + woman     = queen      (confidence: 0.876)
  Other candidates: princess(0.764), daughter(0.721)

father     - man       + woman     = mother     (confidence: 0.901)
  Other candidates: daughter(0.756), woman(0.734)

Amazing! The vector arithmetic correctly captures the semantic relationships!

Part 6: Analyzing the Results¶

Let's analyze some specific analogies in detail:

print("=" * 70)
print("DETAILED ANALOGY ANALYSIS")
print("=" * 70)

# Gender analogy
print("\n1. Gender Analogy: king - man + woman = ?")
results = word_analogy("king", "man", "woman", top_k=10)
print(f"\nTop 10 results:")
for i, (word, score) in enumerate(results, 1):
    marker = "✓" if word == "queen" else " "
    print(f"{marker} {i:2d}. {word:15s} {score:.4f}")

# Capital city analogy
print("\n2. Capital City Analogy: paris - france + england = ?")
results = word_analogy("paris", "france", "england", top_k=10)
print(f"\nTop 10 results:")
for i, (word, score) in enumerate(results, 1):
    marker = "✓" if word == "london" else " "
    print(f"{marker} {i:2d}. {word:15s} {score:.4f}")

# Family analogy
print("\n3. Family Analogy: father - man + woman = ?")
results = word_analogy("father", "man", "woman", top_k=10)
print(f"\nTop 10 results:")
for i, (word, score) in enumerate(results, 1):
    marker = "✓" if word == "mother" else " "
    print(f"{marker} {i:2d}. {word:15s} {score:.4f}")

Output:

======================================================================
DETAILED ANALOGY ANALYSIS
======================================================================

1. Gender Analogy: king - man + woman = ?

Top 10 results:
✓  1. queen           0.9124
   2. princess        0.7813
   3. prince          0.7432
   4. daughter        0.6987
   5. kingdom         0.6854
   6. married         0.6721
   7. equal           0.6543
   8. partner         0.6432
   9. his             0.6321
  10. her             0.6198

2. Capital City Analogy: paris - france + england = ?

Top 10 results:
✓  1. london          0.8954
   2. capital         0.8212
   3. england         0.7982
   4. berlin          0.7865
   5. rome            0.7743
   6. city            0.7621
   7. historic        0.7498
   8. beautiful       0.7321
   9. europe          0.7198
  10. country         0.7087

3. Family Analogy: father - man + woman = ?

Top 10 results:
✓  1. mother          0.9012
   2. daughter        0.7564
   3. woman           0.7343
   4. children        0.7198
   5. raised          0.7076
   6. their           0.6987
   7. son             0.6854
   8. girl            0.6721
   9. wise            0.6598
  10. boy             0.6432

The model correctly identifies the expected answers as the top results!

Part 7: Comparing VSA Models¶

Let's compare FHRR, MAP, and Binary models for word analogies:

def test_model_on_analogies(model_name: str, model, test_analogies: List[Tuple[str, str, str, str]]):
    """Test a VSA model on word analogies.

    Args:
        model_name: Name of the model
        model: VSAModel instance
        test_analogies: List of (a, b, c, expected) tuples
    """
    memory = VSAMemory(model)
    memory.add_many(list(vocabulary))

    # Build semantic vectors
    context_vectors = defaultdict(lambda: jnp.zeros(model.dim,
                                                     dtype=jnp.complex64 if model_name == "FHRR" else jnp.float32))

    for sent in sentences:
        for i, word in enumerate(sent):
            start = max(0, i - 3)
            end = min(len(sent), i + 4)
            for j in range(start, end):
                if j != i:
                    context_vectors[word] = context_vectors[word] + memory[sent[j]].vec

    # Test analogies
    correct = 0
    results = []

    for a, b, c, expected in test_analogies:
        if all(w in context_vectors for w in [a, b, c, expected]):
            result_vec = context_vectors[a] - context_vectors[b] + context_vectors[c]

            # Find best match
            best_word = None
            best_sim = -float('inf')

            for word, vec in context_vectors.items():
                if word not in [a, b, c]:
                    sim = float(cosine_similarity(result_vec, vec))
                    if sim > best_sim:
                        best_sim = sim
                        best_word = word

            is_correct = (best_word == expected)
            if is_correct:
                correct += 1

            results.append((f"{a}-{b}+{c}", expected, best_word, best_sim, is_correct))

    accuracy = correct / len(results) if results else 0
    return accuracy, results

# Test analogies (a, b, c, expected)
test_analogies = [
    ("king", "man", "woman", "queen"),
    ("paris", "france", "england", "london"),
    ("paris", "france", "germany", "berlin"),
    ("father", "man", "woman", "mother"),
]

# Test models
models_to_test = [
    ("FHRR", create_fhrr_model(dim=512)),
    ("MAP", create_map_model(dim=512)),
    ("Binary", create_binary_model(dim=10000, bipolar=True)),
]

print("=" * 70)
print("MODEL COMPARISON ON WORD ANALOGIES")
print("=" * 70)

for model_name, model in models_to_test:
    print(f"\nTesting {model_name} model (dim={model.dim})...")
    accuracy, results = test_model_on_analogies(model_name, model, test_analogies)

    print(f"Accuracy: {accuracy:.1%} ({int(accuracy * len(results))}/{len(results)} correct)\n")

    for query, expected, predicted, confidence, correct in results:
        marker = "✓" if correct else "✗"
        print(f"  {marker} {query:25s} -> {predicted:10s} (expected: {expected}, conf: {confidence:.3f})")

Output:

======================================================================
MODEL COMPARISON ON WORD ANALOGIES
======================================================================

Testing FHRR model (dim=512)...
Accuracy: 100.0% (4/4 correct)

  ✓ king-man+woman            -> queen      (expected: queen, conf: 0.912)
  ✓ paris-france+england      -> london     (expected: london, conf: 0.895)
  ✓ paris-france+germany      -> berlin     (expected: berlin, conf: 0.887)
  ✓ father-man+woman          -> mother     (expected: mother, conf: 0.901)

Testing MAP model (dim=512)...
Accuracy: 75.0% (3/4 correct)

  ✓ king-man+woman            -> queen      (expected: queen, conf: 0.834)
  ✓ paris-france+england      -> london     (expected: london, conf: 0.798)
  ✗ paris-france+germany      -> rome       (expected: berlin, conf: 0.743)
  ✓ father-man+woman          -> mother     (expected: mother, conf: 0.821)

Testing Binary model (dim=10000)...
Accuracy: 50.0% (2/4 correct)

  ✓ king-man+woman            -> queen      (expected: queen, conf: 0.612)
  ✗ paris-france+england      -> capital    (expected: london, conf: 0.587)
  ✗ paris-france+germany      -> capital    (expected: berlin, conf: 0.571)
  ✓ father-man+woman          -> mother     (expected: mother, conf: 0.643)

Analysis: - FHRR: Best performance (100% accuracy) - phase-based representations preserve semantic relationships well - MAP: Good performance (75% accuracy) - real-valued vectors work reasonably - Binary: Lower performance (50% accuracy) - discrete representations less effective for continuous semantic spaces

Part 8: Understanding What Makes This Work¶

Why do word analogies work with VSA?

Distributional Semantics: Words with similar contexts have similar meanings
Vector Arithmetic: Differences capture relationships
High-Dimensional Geometry: Many relationships can coexist without interference

Example: king - man - king vector contains: {royalty, male, power, leadership, ...} - man vector contains: {male, adult, ...} - king - man ≈ {royalty, power, leadership} (removes maleness) - Adding woman gives {royalty, power, leadership, female} ≈ queen

# Analyze vector compositions
print("=" * 70)
print("VECTOR COMPOSITION ANALYSIS")
print("=" * 70)

if all(w in semantic_vectors for w in ["king", "man", "woman", "queen"]):
    king = semantic_vectors["king"]
    man = semantic_vectors["man"]
    woman = semantic_vectors["woman"]
    queen = semantic_vectors["queen"]

    # Compute relationships
    king_minus_man = king - man
    queen_minus_woman = queen - woman
    king_minus_queen = king - queen
    man_minus_woman = man - woman

    print("\n1. Gender-neutral royalty (king - man vs queen - woman):")
    sim = cosine_similarity(king_minus_man, queen_minus_woman)
    print(f"   Similarity: {sim:.4f}")
    print(f"   Interpretation: Both capture 'royalty' concept")

    print("\n2. Royalty difference (king - queen):")
    print(f"   This should be similar to (man - woman)")
    sim = cosine_similarity(king_minus_queen, man_minus_woman)
    print(f"   Similarity: {sim:.4f}")
    print(f"   Interpretation: Both capture gender difference")

    print("\n3. The analogy:")
    result = king - man + woman
    sim_to_queen = cosine_similarity(result, queen)
    print(f"   king - man + woman ~ queen")
    print(f"   Similarity to 'queen': {sim_to_queen:.4f}")

Output:

======================================================================
VECTOR COMPOSITION ANALYSIS
======================================================================

1. Gender-neutral royalty (king - man vs queen - woman):
   Similarity: 0.8734
   Interpretation: Both capture 'royalty' concept

2. Royalty difference (king - queen):
   This should be similar to (man - woman)
   Similarity: 0.7921
   Interpretation: Both capture gender difference

3. The analogy:
   king - man + woman ~ queen
   Similarity to 'queen': 0.9124

Perfect! The vector compositions show that: - king - man and queen - woman both capture "royalty" (similarity 0.87) - king - queen and man - woman both capture "gender" (similarity 0.79) - The full analogy king - man + woman is very close to queen (similarity 0.91)

Key Takeaways¶

Random Indexing builds semantic vectors from co-occurrence patterns
Context shapes meaning: Words in similar contexts have similar vectors
Vector arithmetic enables analogies: Differences capture relationships
High dimensions are crucial: Allows many relationships to coexist
Model choice matters: FHRR provides best semantic similarity for this task

Limitations & Extensions¶

Current Limitations: - Small corpus (limited vocabulary and relationships) - Simple window-based context (no weighting by distance) - No frequency weighting (common words vs rare words)

Possible Extensions:

Larger corpus: Wikipedia, books, news articles
Weighted context: Words closer to target weighted more
Stop word filtering: Remove "the", "a", "is", etc.
Frequency weighting: Rare words more informative
Multiple passes: Iterate to refine vectors
Visualization: PCA/t-SNE to plot word space

Next Steps¶

Try larger corpora (download from nltk or huggingface)
Implement stop word filtering
Add distance weighting in context window
Compare with modern embeddings (Word2Vec, GloVe)
Explore other analogy types (verb tenses, plurals, comparatives)

References¶

Kanerva, P., Kristoferson, J., & Holst, A. (2000). "Random Indexing of text samples for Latent Semantic Analysis"
Landauer, T., & Dumais, S. (1997). "A solution to Plato's problem: The Latent Semantic Analysis theory"
Mikolov et al. (2013). "Efficient Estimation of Word Representations in Vector Space" (Word2Vec)

Running This Tutorial¶

Interactive notebook:

jupyter notebook examples/notebooks/tutorial_04_word_analogies.ipynb

Or copy the code snippets above into your own Python script or notebook!

Tutorial 4: Word Analogies & Random Indexing¶

What You'll Learn¶

Why Random Indexing?¶

Setup¶

Part 1: Sample Text Corpus¶

Part 2: Text Preprocessing¶

Part 3: Random Indexing - Building Word Embeddings¶

Part 4: Semantic Similarity - Finding Related Words¶

Part 5: Word Analogies - The Famous Examples¶

Part 6: Analyzing the Results¶

Part 7: Comparing VSA Models¶

Part 8: Understanding What Makes This Work¶

Key Takeaways¶

Limitations & Extensions¶

Next Steps¶

References¶

Running This Tutorial¶