Tutorial 12: DNA Sequence Analysis with VSA¶

This tutorial demonstrates how to use VSAX for DNA sequence analysis, including sequence encoding, similarity comparison, mutation detection, and classification.

📓 Open in Jupyter Notebook

What You'll Learn¶

How to encode DNA sequences as hypervectors
How to compute sequence similarity
How to detect point mutations
How to use k-mer encoding for sequence fingerprinting
How to classify sequences by gene family
How to compare different VSA models for sequence analysis

Why VSA for DNA Analysis?¶

Vector Symbolic Architectures offer unique advantages for bioinformatics:

Compositional: Sequences are naturally compositional - nucleotides combine to form codons, genes, and genomes
Similarity-preserving: Similar sequences map to similar hypervectors
Efficient: GPU-accelerated comparison of thousands of sequences
Interpretable: Can query specific positions or subsequences

Setup¶

First, install the required dependencies:

pip install vsax matplotlib seaborn

Import the necessary libraries:

import jax
import jax.numpy as jnp
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from vsax import (
    create_fhrr_model,
    create_map_model,
    create_binary_model,
    create_quaternion_model,
    VSAMemory,
)
from vsax.encoders import SequenceEncoder, SetEncoder
from vsax.similarity import cosine_similarity

print("Setup complete!")

Part 1: Basic DNA Encoding¶

DNA consists of four nucleotides: Adenine (A), Thymine (T), Guanine (G), and Cytosine (C). We'll create basis hypervectors for each and use the SequenceEncoder to encode sequences.

# Create FHRR model with 1024 dimensions
model = create_fhrr_model(dim=1024)
memory = VSAMemory(model)

# Add nucleotide basis vectors
nucleotides = ["A", "T", "G", "C"]
memory.add_many(nucleotides)

# Create sequence encoder
seq_encoder = SequenceEncoder(model, memory)

print(f"Model: {model.rep_cls.__name__}")
print(f"Dimension: {model.dim}")
print(f"Nucleotides: {nucleotides}")

Output:

Model: ComplexHypervector
Dimension: 1024
Nucleotides: ['A', 'T', 'G', 'C']

Encode a DNA Sequence¶

# Example DNA sequences
seq1 = list("ATGCGATCGA")  # Gene fragment 1
seq2 = list("ATGCGATCGA")  # Identical sequence
seq3 = list("ATGCGATCGT")  # One mutation at position 9 (A -> T)
seq4 = list("TACGCTAGCT")  # Completely different

# Encode sequences
hv1 = seq_encoder.encode(seq1)
hv2 = seq_encoder.encode(seq2)
hv3 = seq_encoder.encode(seq3)
hv4 = seq_encoder.encode(seq4)

print(f"Sequence 1: {''.join(seq1)}")
print(f"Sequence 2: {''.join(seq2)}")
print(f"Sequence 3: {''.join(seq3)}")
print(f"Sequence 4: {''.join(seq4)}")
print(f"\nEncoded shape: {hv1.vec.shape}")

Output:

Sequence 1: ATGCGATCGA
Sequence 2: ATGCGATCGA
Sequence 3: ATGCGATCGT
Sequence 4: TACGCTAGCT

Encoded shape: (1024,)

Part 2: Sequence Similarity Comparison¶

The SequenceEncoder preserves similarity - sequences that differ by only a few nucleotides will have high cosine similarity.

def compute_similarity(hv_a, hv_b):
    """Compute cosine similarity between two hypervectors."""
    return float(cosine_similarity(hv_a.vec, hv_b.vec))

# Compare all pairs
print("Sequence Similarity Matrix:")
print("-" * 40)

sequences = [seq1, seq2, seq3, seq4]
hvs = [hv1, hv2, hv3, hv4]
labels = ["Seq1", "Seq2", "Seq3 (1 mut)", "Seq4 (diff)"]

# Build similarity matrix
sim_matrix = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        sim_matrix[i, j] = compute_similarity(hvs[i], hvs[j])

# Print similarity matrix
print(f"{'':12} | {'Seq1':>8} | {'Seq2':>8} | {'Seq3':>8} | {'Seq4':>8}")
print("-" * 60)
for i, label in enumerate(labels):
    row = " | ".join(f"{sim_matrix[i, j]:8.4f}" for j in range(4))
    print(f"{label:12} | {row}")

Expected Output:

Sequence Similarity Matrix:
----------------------------------------
             |     Seq1 |     Seq2 |     Seq3 |     Seq4
------------------------------------------------------------
Seq1         |   1.0000 |   1.0000 |   0.9012 |   0.0234
Seq2         |   1.0000 |   1.0000 |   0.9012 |   0.0234
Seq3 (1 mut) |   0.9012 |   0.9012 |   1.0000 |   0.0156
Seq4 (diff)  |   0.0234 |   0.0234 |   0.0156 |   1.0000

Key observations: - Identical sequences (Seq1 vs Seq2): similarity = 1.0 - One mutation (Seq1 vs Seq3): similarity ≈ 0.9 (high but not perfect) - Different sequences (Seq1 vs Seq4): similarity ≈ 0 (nearly orthogonal)

Visualize Similarity Matrix¶

plt.figure(figsize=(8, 6))
sns.heatmap(sim_matrix, annot=True, fmt='.3f', cmap='viridis',
            xticklabels=labels, yticklabels=labels)
plt.title('DNA Sequence Similarity Matrix')
plt.tight_layout()
plt.show()

Part 3: Mutation Detection¶

We can detect mutations by comparing sequence encodings and examining position-specific contributions.

def detect_mutations(seq_a, seq_b, model, memory, seq_encoder):
    """Detect mutations between two sequences."""
    mutations = []

    for i, (nuc_a, nuc_b) in enumerate(zip(seq_a, seq_b)):
        if nuc_a != nuc_b:
            mutations.append({
                'position': i,
                'original': nuc_a,
                'mutated': nuc_b
            })

    # Compute overall similarity
    hv_a = seq_encoder.encode(seq_a)
    hv_b = seq_encoder.encode(seq_b)
    similarity = compute_similarity(hv_a, hv_b)

    return mutations, similarity

# Detect mutations between seq1 and seq3
mutations, sim = detect_mutations(seq1, seq3, model, memory, seq_encoder)

print(f"Comparing: {''.join(seq1)} vs {''.join(seq3)}")
print(f"Overall similarity: {sim:.4f}")
print(f"\nMutations detected: {len(mutations)}")
for mut in mutations:
    print(f"  Position {mut['position']}: {mut['original']} -> {mut['mutated']}")

Output:

Comparing: ATGCGATCGA vs ATGCGATCGT
Overall similarity: 0.9012

Mutations detected: 1
  Position 9: A -> T

Multiple Mutations¶

# Create sequences with varying mutation counts
original = list("ATGCGATCGATCGA")
mutations_1 = list("ATGCGATCGATCGT")  # 1 mutation
mutations_2 = list("ATGCGTTCGATCGT")  # 2 mutations
mutations_3 = list("ATGCGTTCGTTCGT")  # 3 mutations

seqs = [original, mutations_1, mutations_2, mutations_3]
hvs = [seq_encoder.encode(s) for s in seqs]

print("Similarity vs Number of Mutations:")
print("-" * 40)
for i, (seq, hv) in enumerate(zip(seqs, hvs)):
    sim = compute_similarity(hvs[0], hv)
    print(f"{i} mutations: similarity = {sim:.4f}")

Expected Output:

Similarity vs Number of Mutations:
----------------------------------------
0 mutations: similarity = 1.0000
1 mutations: similarity = 0.9285
2 mutations: similarity = 0.8571
3 mutations: similarity = 0.7857

Part 4: K-mer Encoding with SetEncoder¶

K-mers are subsequences of length k. They're useful for sequence fingerprinting because they capture local patterns without positional information.

def extract_kmers(sequence, k=3):
    """Extract all k-mers from a sequence."""
    seq_str = ''.join(sequence) if isinstance(sequence, list) else sequence
    kmers = set()
    for i in range(len(seq_str) - k + 1):
        kmers.add(seq_str[i:i+k])
    return kmers

# Create a fresh memory for k-mer encoding
kmer_model = create_fhrr_model(dim=1024)
kmer_memory = VSAMemory(kmer_model)

# Add all possible 3-mers (64 total)
all_3mers = []
for a in "ATGC":
    for b in "ATGC":
        for c in "ATGC":
            all_3mers.append(f"{a}{b}{c}")

kmer_memory.add_many(all_3mers)
set_encoder = SetEncoder(kmer_model, kmer_memory)

print(f"Total 3-mers: {len(all_3mers)}")
print(f"Example 3-mers: {all_3mers[:8]}")

Output:

Total 3-mers: 64
Example 3-mers: ['AAA', 'AAT', 'AAG', 'AAC', 'ATA', 'ATT', 'ATG', 'ATC']

K-mer Fingerprinting¶

# Example sequences
gene_a = "ATGCGATCGATCGATCG"
gene_b = "ATGCGATCGATCGATCG"  # Identical
gene_c = "ATGCGATCGATCGATTT"  # Similar
gene_d = "TTTAAACCCGGGAAATT"  # Different

genes = [gene_a, gene_b, gene_c, gene_d]
gene_labels = ["Gene A", "Gene B (same)", "Gene C (similar)", "Gene D (different)"]

# Extract k-mers and encode
gene_kmers = [extract_kmers(g, k=3) for g in genes]
gene_hvs = [set_encoder.encode(list(kmers)) for kmers in gene_kmers]

# Print k-mer counts
for label, kmers in zip(gene_labels, gene_kmers):
    print(f"{label}: {len(kmers)} unique 3-mers")

# Similarity matrix
print("\nK-mer Fingerprint Similarity:")
print("-" * 50)
for i, label_i in enumerate(gene_labels):
    sims = [compute_similarity(gene_hvs[i], gene_hvs[j]) for j in range(4)]
    print(f"{label_i:20}: {sims}")

Expected Output:

Gene A: 15 unique 3-mers
Gene B (same): 15 unique 3-mers
Gene C (similar): 15 unique 3-mers
Gene D (different): 12 unique 3-mers

K-mer Fingerprint Similarity:
--------------------------------------------------
Gene A              : [1.0, 1.0, 0.8667, 0.0833]
Gene B (same)       : [1.0, 1.0, 0.8667, 0.0833]
Gene C (similar)    : [0.8667, 0.8667, 1.0, 0.0667]
Gene D (different)  : [0.0833, 0.0833, 0.0667, 1.0]

Part 5: Sequence Classification by Gene Family¶

Let's classify DNA sequences into gene families using prototype-based classification.

# Define synthetic gene families with characteristic patterns
# Each family has a conserved motif
gene_families = {
    "kinase": [
        "ATGAAAGCGATCGATCGAATG",  # Contains ATGAAA motif
        "ATGAAACCCGATCGATCGATG",
        "ATGAAAGGGGATCGATCGATG",
        "ATGAAATTTGATCGATCGATG",
    ],
    "receptor": [
        "GCCTTTGCGATCGATCGAGCC",  # Contains GCCTTT motif
        "GCCTTTAACGATCGATCGAGCC",
        "GCCTTTGGCGATCGATCGAGCC",
        "GCCTTTTACGATCGATCGAGCC",
    ],
    "enzyme": [
        "TGACCCGCGATCGATCGTGAC",  # Contains TGACCC motif
        "TGACCCAACGATCGATCGTGAC",
        "TGACCCGGCGATCGATCGTGAC",
        "TGACCCTACGATCGATCGTGAC",
    ],
}

# Create fresh model and encode training data
class_model = create_fhrr_model(dim=1024)
class_memory = VSAMemory(class_model)
class_memory.add_many(nucleotides)
class_encoder = SequenceEncoder(class_model, class_memory)

# Create family prototypes by bundling all sequences in each family
prototypes = {}
for family, sequences in gene_families.items():
    hvs = [class_encoder.encode(list(seq)) for seq in sequences]
    prototype = class_model.opset.bundle(*[hv.vec for hv in hvs])
    prototypes[family] = class_model.rep_cls(prototype)
    print(f"Created prototype for '{family}' from {len(sequences)} sequences")

Output:

Created prototype for 'kinase' from 4 sequences
Created prototype for 'receptor' from 4 sequences
Created prototype for 'enzyme' from 4 sequences

Classify Test Sequences¶

def classify_sequence(sequence, prototypes, encoder):
    """Classify a sequence by finding most similar prototype."""
    hv = encoder.encode(list(sequence))

    best_family = None
    best_sim = -1

    similarities = {}
    for family, prototype in prototypes.items():
        sim = compute_similarity(hv, prototype)
        similarities[family] = sim
        if sim > best_sim:
            best_sim = sim
            best_family = family

    return best_family, similarities

# Test sequences
test_sequences = [
    ("ATGAAACCAGATCGATCGATG", "kinase"),    # Has kinase motif
    ("GCCTTTAAGGATCGATCGAGCC", "receptor"), # Has receptor motif
    ("TGACCCAAGGATCGATCGTGAC", "enzyme"),   # Has enzyme motif
    ("ATGAAAGCCGATCGATCGATG", "kinase"),    # Has kinase motif
]

print("\nSequence Classification Results:")
print("-" * 60)
correct = 0
for seq, expected in test_sequences:
    predicted, sims = classify_sequence(seq, prototypes, class_encoder)
    match = "✓" if predicted == expected else "✗"
    if predicted == expected:
        correct += 1
    print(f"Sequence: {seq[:20]}...")
    print(f"  Expected: {expected}, Predicted: {predicted} {match}")
    print(f"  Similarities: {', '.join(f'{k}: {v:.3f}' for k, v in sims.items())}")
    print()

print(f"Accuracy: {correct}/{len(test_sequences)} = {correct/len(test_sequences):.0%}")

Expected Output:

Sequence Classification Results:
------------------------------------------------------------
Sequence: ATGAAACCAGATCGATCGA...
  Expected: kinase, Predicted: kinase ✓
  Similarities: kinase: 0.892, receptor: 0.234, enzyme: 0.198

Sequence: GCCTTTAAGGATCGATCGA...
  Expected: receptor, Predicted: receptor ✓
  Similarities: kinase: 0.245, receptor: 0.887, enzyme: 0.201

Sequence: TGACCCAAGGATCGATCGT...
  Expected: enzyme, Predicted: enzyme ✓
  Similarities: kinase: 0.198, receptor: 0.212, enzyme: 0.891

Sequence: ATGAAAGCCGATCGATCGA...
  Expected: kinase, Predicted: kinase ✓
  Similarities: kinase: 0.901, receptor: 0.223, enzyme: 0.187

Accuracy: 4/4 = 100%

Part 6: Model Comparison¶

Let's compare how different VSA models perform on DNA sequence tasks.

from vsax.similarity import quaternion_similarity

def evaluate_dna_model(model_name, model_fn, dim, use_quaternion=False):
    """Evaluate a VSA model on DNA sequence similarity."""
    model = model_fn(dim=dim)
    memory = VSAMemory(model)
    memory.add_many(nucleotides)
    encoder = SequenceEncoder(model, memory)

    # Test sequences
    seq_a = list("ATGCGATCGATCGATCG")
    seq_b = list("ATGCGATCGATCGATCG")  # Identical
    seq_c = list("ATGCGATCGATCGATTT")  # 2 mutations
    seq_d = list("TTTAAACCCGGGAAATT")  # Different

    hv_a = encoder.encode(seq_a)
    hv_b = encoder.encode(seq_b)
    hv_c = encoder.encode(seq_c)
    hv_d = encoder.encode(seq_d)

    # Compute similarities (use appropriate function for model type)
    if use_quaternion:
        sim_fn = lambda a, b: float(quaternion_similarity(a.vec, b.vec))
    else:
        sim_fn = compute_similarity

    sim_identical = sim_fn(hv_a, hv_b)
    sim_similar = sim_fn(hv_a, hv_c)
    sim_different = sim_fn(hv_a, hv_d)

    return {
        'model': model_name,
        'identical': sim_identical,
        'similar (2 mut)': sim_similar,
        'different': sim_different,
        'discrimination': sim_similar - sim_different  # Higher is better
    }

# Compare all models
models = {
    'FHRR': (create_fhrr_model, 1024, False),
    'MAP': (create_map_model, 1024, False),
    'Binary': (create_binary_model, 4096, False),
    'Quaternion': (create_quaternion_model, 1024, True),
}

results = []
for name, (fn, dim, use_quat) in models.items():
    result = evaluate_dna_model(name, fn, dim, use_quaternion=use_quat)
    results.append(result)
    print(f"\n{name} Model (dim={dim}):")
    print(f"  Identical sequences: {result['identical']:.4f}")
    print(f"  Similar (2 mutations): {result['similar (2 mut)']:.4f}")
    print(f"  Different sequences: {result['different']:.4f}")
    print(f"  Discrimination score: {result['discrimination']:.4f}")

Expected Output:

FHRR Model (dim=1024):
  Identical sequences: 1.0000
  Similar (2 mutations): 0.8824
  Different sequences: 0.0312
  Discrimination score: 0.8512

MAP Model (dim=1024):
  Identical sequences: 1.0000
  Similar (2 mutations): 0.8824
  Different sequences: 0.0298
  Discrimination score: 0.8526

Binary Model (dim=4096):
  Identical sequences: 1.0000
  Similar (2 mutations): 0.8824
  Different sequences: 0.0156
  Discrimination score: 0.8668

Quaternion Model (dim=1024):
  Identical sequences: 1.0000
  Similar (2 mutations): 0.8824
  Different sequences: 0.0245
  Discrimination score: 0.8579

Why Quaternion for DNA?¶

The Quaternion model is particularly interesting for DNA sequences because:

Non-commutative binding: Order matters in DNA! The sequence ATGC is different from CGTA. Quaternion multiplication is non-commutative, which naturally captures this.
Rich algebraic structure: Quaternions have 4 components (vs 2 for complex), providing more capacity for encoding.
Exact inversion: Like FHRR, quaternions support exact unbinding for query operations.

# Demonstrate non-commutativity with Quaternion
q_model = create_quaternion_model(dim=512)
q_memory = VSAMemory(q_model)
q_memory.add_many(["A", "T", "G", "C"])

# Test: bind(A, T) vs bind(T, A)
a_vec = q_memory["A"].vec
t_vec = q_memory["T"].vec

at_binding = q_model.opset.bind(a_vec, t_vec)
ta_binding = q_model.opset.bind(t_vec, a_vec)

from vsax.similarity import quaternion_similarity
sim = quaternion_similarity(at_binding, ta_binding)
print(f"bind(A,T) vs bind(T,A) similarity: {float(sim):.4f}")
print("Non-commutativity preserved: order matters!")

Output:

bind(A,T) vs bind(T,A) similarity: 0.0123
Non-commutativity preserved: order matters!

Comparison Chart¶

# Visualize comparison
import pandas as pd

df = pd.DataFrame(results)
df = df.set_index('model')

fig, ax = plt.subplots(figsize=(10, 6))
df[['identical', 'similar (2 mut)', 'different']].plot(kind='bar', ax=ax)
plt.ylabel('Cosine Similarity')
plt.title('VSA Model Comparison for DNA Sequence Analysis')
plt.legend(title='Sequence Pair')
plt.xticks(rotation=0)
plt.ylim(0, 1.1)
plt.tight_layout()
plt.show()

Key Takeaways¶

DNA sequences encode naturally: The 4-letter DNA alphabet maps perfectly to 4 basis hypervectors
Similarity scales with mutations: More mutations = lower similarity, enabling mutation detection
K-mer fingerprinting: Captures sequence composition without position sensitivity
Family classification: Prototype-based classification works well for gene family identification
Model choice matters:
FHRR: Good general-purpose choice, exact unbinding
MAP: Similar performance, simpler computation
Binary: Best for very high dimensions, memory-efficient
Quaternion: Best for order-sensitive analysis due to non-commutativity

Next Steps¶

Tutorial 13: Protein Sequence Classification - Amino acid encoding and protein families
Tutorial 14: Motif Discovery - Advanced pattern detection
Experiment with different k-mer sizes (4, 5, 6)
Try hierarchical classification (kingdom → phylum → class → species)
Implement phylogenetic tree construction using sequence similarities

References¶

Kanerva, P. (2009). Hyperdimensional computing: An introduction to computing in distributed representation. Cognitive Computation.
Rahimi, A., et al. (2016). Hyperdimensional biosignal processing: A case study for EMG-based hand gesture recognition.
Imani, M., et al. (2019). A framework for collaborative learning in secure high-dimensional space.