🎲 Galton Lab

An Interactive Journey from Physics to Machine Learning

Stage 1: The Physical Board
Stage 2: Learned Pegs
Stage 3: ML Prediction
Stage 4: Continuous Flow (SDF)
Stage 5: Transformer Integration
Stage 6: Why This Matters

What is a Galton Board?

Imagine dropping a ball from the top. At each row, it hits a peg and randomly bounces left or right. After many bounces, it lands in a bucket at the bottom.

Drop many balls, and you'll see a pattern emerge: usually a bell curve (normal distribution). This is probability made visible through physics!


💡 Key Concept

Each ball's path is random, but the overall pattern is predictable. This is the foundation of probability theory: randomness at the individual level creates order at the population level.
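
To make the bell curve concrete, here is a tiny standalone simulation (plain NumPy, not part of the lab's code) that drops balls and counts where they land:

import numpy as np

def drop_balls(n_balls=10_000, n_rows=12, seed=0):
    # Each ball makes n_rows independent left/right bounces; its bucket index is the
    # number of rightward bounces, so buckets follow Binomial(n_rows, 0.5).
    rng = np.random.default_rng(seed)
    buckets = rng.binomial(n=n_rows, p=0.5, size=n_balls)
    return np.bincount(buckets, minlength=n_rows + 1)

print(drop_balls())  # counts peak in the middle buckets: the bell curve made visible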

What if Pegs Could Be Biased?

Instead of random bounces, what if each peg could push balls left or right with different strengths?

Now we can learn which way each peg should push to create any distribution we want! The pegs become parameters we can train.


💡 Key Concept

Learned bias fields replace random chance. Instead of computing "this bucket has 60% probability," we create a geometry that naturally guides 60% of balls there.

The pegs are like neural network weights: we adjust them to get the output we want!

🔗 ML Connection: This is exactly how neural networks learn! Adjust parameters (pegs) to transform inputs (ball positions) into desired outputs (bucket distributions).
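
As a rough illustration of "pegs as parameters" (a toy sketch assuming one learnable right-bounce probability per row, not the lab's implementation), gradient descent can shape the bucket distribution toward a target such as 70% of balls in bucket 3:

import torch

n_rows = 6                                             # 6 rows of pegs -> 7 buckets
peg_logits = torch.zeros(n_rows, requires_grad=True)   # one learnable "push right" bias per row
target = torch.tensor([0.0, 0.0, 0.0, 0.7, 0.2, 0.1, 0.0])

def bucket_distribution(logits):
    # Exact, differentiable bucket distribution (no ball-dropping needed):
    # each row convolves the current position distribution with (1 - p, p).
    p_right = torch.sigmoid(logits)
    dist = torch.ones(1)
    for p in p_right:
        dist = torch.cat([dist * (1 - p), torch.zeros(1)]) + torch.cat([torch.zeros(1), dist * p])
    return dist

opt = torch.optim.Adam([peg_logits], lr=0.1)
for _ in range(300):
    opt.zero_grad()
    loss = ((bucket_distribution(peg_logits) - target) ** 2).sum()
    loss.backward()
    opt.step()

print(bucket_distribution(peg_logits).detach())  # most of the mass now lands around bucket 3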

Using It for Real Predictions

Now imagine the input is context (like "The capital of France is ___") and buckets are words (Paris, London, Berlin, Madrid).

The context configures the pegs, balls drop, and whichever bucket fills up first is our prediction!


💡 The Magic of Adaptive Compute

Notice how confident predictions need fewer probes? When the model "knows" the answer, balls converge quickly. When uncertain, they spread out and we drop more probes.

This is adaptive compute: the model automatically uses more resources when needed!

🔗 ML Connection: In transformers, every prediction costs the same (full softmax over 50,000 tokens). With Galton samplers, confident predictions are faster and uncertainty is visible.
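
A minimal sketch of that adaptive-compute loop (plain NumPy; sample_bucket is a hypothetical stand-in for simulated probe landings, not the lab's API): drop probes in small batches and stop as soon as one bucket clearly dominates.

import numpy as np

def adaptive_predict(sample_bucket, n_buckets, threshold=0.6, batch=64, max_probes=2048, seed=0):
    # Keep dropping probe batches until the leading bucket holds `threshold` of all probes.
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_buckets, dtype=int)
    used = 0
    while used < max_probes:
        counts += np.bincount(sample_bucket(rng, batch), minlength=n_buckets)
        used += batch
        confidence = counts.max() / counts.sum()
        if confidence >= threshold:          # confident early -> stop (adaptive compute)
            break
    return counts.argmax(), round(float(confidence), 2), used

easy = lambda rng, n: rng.choice(4, size=n, p=[0.85, 0.05, 0.05, 0.05])   # peaked: model "knows"
hard = lambda rng, n: rng.choice(4, size=n, p=[0.30, 0.30, 0.20, 0.20])   # flat: uncertain
print(adaptive_predict(easy, 4))   # stops after a few batches
print(adaptive_predict(hard, 4))   # keeps sampling up to max_probes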

🚀 From Here to Production

Discrete → Continuous: Scale this up by replacing discrete pegs with continuous flow fields (ODEs on a torus)

Hierarchical: Use multiple boards in sequence (coarse → fine) for large vocabularies

Context Encoding: Replace our simple selector with a real transformer that maps text to peg configurations

From Discrete to Continuous: The SDF Sampler

Discrete pegs work great for demos, but they have limits: probes can only move one column at a time, and training on discrete jumps is tricky.

Solution? Continuous flow on a ring. Instead of bouncing off pegs, probes follow smooth trajectories through a learned velocity field.


🌀 What's Happening Here

Ring Topology: The horizontal axis wraps around (like Pac-Man). Position 0 connects to position C: it's a circle!

SDF (Signed Distance Field): A neural network learns a "distance" function. Close to the target? Negative distance. Far away? Positive.

Velocity from Gradient: v = -α · ∂φ/∂x where φ = softplus(-D). This creates a "downhill" flow toward the target.

RK2 Integration: Each probe follows the velocity field with Runge-Kutta 2nd order (more accurate than simple Euler steps).

🔗 Implementation Details:
  • src/galton_lab/ode/field.py - SDFField: MLP that outputs signed distance given (position, time, context)
  • src/galton_lab/ode/integrate.py - integrate_fixed(): RK2 integration with ring wrapping
  • src/galton_lab/ode/buckets.py - soft_bucket_mass(): Gaussian windows convert positions → bucket probabilities
  • Ring encoding: sin(2πx/C), cos(2πx/C) at multiple frequencies
  • Potential function: φ = softplus(-D) creates smooth wells
  • Velocity: Computed via autograd or finite differences (a toy version is sketched below)
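
To make the velocity and RK2 pieces concrete, here is a self-contained toy version of the formulas above (a sketch with a hand-written stand-in for the learned field, not the repo's SDFField or integrate_fixed):

import torch
import torch.nn.functional as F

def velocity(D, x, alpha=1.0):
    # v = -alpha * dphi/dx with phi = softplus(-D(x)); the gradient comes from autograd.
    x = x.detach().requires_grad_(True)
    phi = F.softplus(-D(x)).sum()
    (grad,) = torch.autograd.grad(phi, x)
    return -alpha * grad

def rk2_step(D, x, dt, C=1.0):
    # One midpoint (RK2) step; positions wrap around the ring [0, C).
    k1 = velocity(D, x)
    k2 = velocity(D, (x + 0.5 * dt * k1) % C)
    return (x + dt * k2) % C

D_toy = lambda x: torch.cos(2 * torch.pi * x)   # toy stand-in for the learned field, C = 1
x = torch.rand(8)                               # 8 probes at random starting positions
for _ in range(40):                             # 40 fixed integration steps, as in the demo
    x = rk2_step(D_toy, x, dt=0.05)
print(x)                                        # probes settle where phi = softplus(-D) is lowest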

🎯 Why This Scales

Differentiable: The entire flow is smooth → backpropagation works beautifully

Infinite Reach: Probes can travel across the entire ring in one integration (no column-by-column hops)

Adaptive σ: Start with wide Gaussian windows (σ=0.9) for learning, then sharpen (σ=0.5) for precision (a toy version of these windows is sketched after this list)

Symmetry Breaking: Add directional bias or distillation to prevent mirror solutions
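
The Gaussian bucket windows behind that σ schedule fit in a few lines. This is a toy version (the exact scaling convention used by soft_bucket_mass is an assumption here):

import torch

def toy_bucket_mass(x, n_buckets, C=1.0, sigma=0.5):
    # x: final probe positions in [0, C), shape (..., n_probes).
    width = C / n_buckets
    centers = (torch.arange(n_buckets) + 0.5) * width
    d = (x[..., None] - centers).abs()
    d = torch.minimum(d, C - d)                        # wrap-around (ring) distance
    w = torch.exp(-0.5 * (d / (sigma * width)) ** 2)   # Gaussian window around each bucket center
    mass = w.sum(dim=-2)                               # accumulate probe mass per bucket
    return mass / mass.sum(dim=-1, keepdim=True)

x_final = torch.tensor([[0.18, 0.20, 0.19, 0.60]])     # 4 probes; 3 converged near the same spot
print(toy_bucket_mass(x_final, n_buckets=8))           # bucket 1 collects most of the mass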

📚 Training the SDF

Warm Start Phase:

  • Wide sigma (σ=0.9) for forgiving gradients
  • Directional bias (+0.15 drift) breaks ring symmetry
  • Knowledge distillation from a "teacher" model
  • KL loss (τ=1.5) + velocity alignment loss

Auto-Handoff: System detects when margin ≥ 0.05 and target prob ≥ 0.25, then switches to...

Sharpen Phase:

  • Tighter sigma (σ=0.5) for precise peaks
  • Remove bias and distillation (training wheels off!)
  • Pure cross-entropy optimization
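
Schematically, the warm-start → sharpen handoff amounts to a small rule like the one below (a hypothetical sketch; the real trigger lives in the training code, and reading "margin" as top-1 minus top-2 bucket probability is an assumption):

def phase_hparams(stats, handed_off):
    # stats: evaluation metrics such as {"margin": ..., "target_prob": ...}.
    if not handed_off and stats["margin"] >= 0.05 and stats["target_prob"] >= 0.25:
        handed_off = True                                                       # auto-handoff trigger
    if handed_off:
        return {"sigma": 0.5, "drift": 0.0, "distill_weight": 0.0}, handed_off  # sharpen phase
    return {"sigma": 0.9, "drift": 0.15, "distill_weight": 1.0}, handed_off     # warm-start phase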

See docs/char32_ode_warmstart.md for the full protocol and galton/train.py --warm-start-preset char32 to run it!

🧪 Try It Yourself:
# Train the ODE sampler on character-level task
python galton/train.py --task char32 --device auto --amp \
  --sampler ode --batch 4096 --warm-start-preset char32 \
  --auto-handoff

# Or start with a simple ABCD pattern
python galton/train.py --task abcd --device auto \
  --per-example-fields --batch 8192

How This Fits into Transformers

Transformers are the backbone of modern AI (GPT, Claude, etc.). They have two key components we can replace with Galton samplers:

1. Attention mechanism (builds context) → Keep this!

2. Final softmax layer (picks next token) → Replace with Galton sampler!

🔄 The Architecture

Standard Transformer:

Input text: "The capital of France is"
    ↓
[Transformer Encoder] (attention layers)
    ↓
Context vector (768-dim embedding)
    ↓
[Linear projection to vocab size]
    ↓
Logits: [50,000 numbers]
    ↓
Softmax(logits)  ← Expensive! Compute ALL probabilities
    ↓
Probabilities: [0.7, 0.05, 0.001, ...]
    ↓
Sample: "Paris"

Galton Transformer:

Input text: "The capital of France is"
    ↓
[Transformer Encoder] (attention layers) ← Same!
    ↓
Context vector (768-dim embedding)
    ↓
[Galton SDF Field Generator]  ← New!
    ↓
Velocity field parameters (Ξ±, bias, target wells)
    ↓
Drop probes & integrate ODEs  ← Physics!
    ↓
Bucket masses emerge from flow
    ↓
Sample: "Paris" (+ free uncertainty estimate!)
🔗 Key Insight: The context vector from the transformer configures the probability landscape. Different contexts create different flow fields, just like in Stage 3, but now driven by real learned representations instead of our toy selector!

🎨 How Context Shapes the Field

The context vector (from transformer) goes through a small MLP to produce:

  • SDF parameters: Weights for the distance field network
  • Per-bucket strengths: How strong is each token "attractor"?
  • Field strength (α): Overall sharpness of the flow
  • Initial bias: Directional preference (breaks symmetry)

This is contextual composition: the same idea as Stage 3, but learned end-to-end!
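
One plausible shape for that small MLP (a hypothetical sketch, not the repo's module; the layer sizes and output names are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextToField(nn.Module):
    # Maps a transformer context vector to the four kinds of field parameters listed above.
    def __init__(self, d_model, n_buckets, sdf_param_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_model, hidden), nn.GELU())
        self.sdf_params = nn.Linear(hidden, sdf_param_dim)    # modulates the distance-field network
        self.bucket_strength = nn.Linear(hidden, n_buckets)   # per-token attractor strength
        self.alpha = nn.Linear(hidden, 1)                     # overall field sharpness
        self.bias = nn.Linear(hidden, 1)                      # directional drift (breaks symmetry)

    def forward(self, context):
        h = self.body(context)
        return {
            "sdf_params": self.sdf_params(h),
            "bucket_strength": F.softmax(self.bucket_strength(h), dim=-1),
            "alpha": F.softplus(self.alpha(h)),               # keep sharpness positive
            "bias": self.bias(h),
        }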

⚡ Training the Full Stack

End-to-end learning:

  1. Transformer produces context vector for "The capital of France is ___"
  2. Context → field parameters (learned MLP)
  3. Field parameters → velocity field (SDF network)
  4. Integrate probes → bucket masses
  5. Cross-entropy loss: `CE(bucket_masses, target="Paris")`
  6. Backprop through entire pipeline (differentiable!)

Gradients flow from loss → SDF → context encoder → transformer. The whole system learns to create flow fields that guide probes correctly!
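
In code, that loop might look roughly like this (a sketch: GaltonTransformer is the class from the implementation sketch below, and the dataloader and hyperparameters are placeholders):

import torch
import torch.nn.functional as F

model = GaltonTransformer(vocab_size=50_000, d_model=768)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for input_ids, target_ids in dataloader:                 # placeholder next-token dataset
    bucket_probs = model(input_ids)                       # probes -> bucket masses, (batch, vocab)
    loss = F.nll_loss(torch.log(bucket_probs + 1e-9), target_ids)  # cross-entropy on bucket masses
    opt.zero_grad()
    loss.backward()    # gradients flow: loss -> buckets -> SDF -> context head -> transformer
    opt.step()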

📊 What You Get:
  • Same accuracy as softmax (with proper training)
  • Adaptive compute - confident predictions finish faster
  • Uncertainty quantification - probe spread = confidence
  • Interpretability - visualize the flow field to understand "why"
  • Potential efficiency - hierarchical routing for large vocabs

🧩 Implementation Sketch

import torch
import torch.nn as nn

class GaltonTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, n_probes=256, ring_circumference=1.0):
        super().__init__()
        self.transformer = TransformerEncoder(...)            # standard attention stack
        self.context_to_field = MLP(d_model, sdf_params)      # context -> field parameters
        self.sdf_field = SDFField(...)                        # learned signed-distance field
        self.integrator = ODEIntegrator(...)                  # RK2 integration with ring wrapping
        self.n_probes = n_probes
        self.C = ring_circumference

    def forward(self, input_ids):
        # Standard transformer part
        context = self.transformer(input_ids)  # (batch, d_model)

        # Galton part: context -> flow field
        field_params = self.context_to_field(context)

        # Drop probes at random positions on the ring and integrate
        batch = context.shape[0]
        x0 = torch.rand(batch, self.n_probes, device=context.device) * self.C
        x_final = self.integrator(x0, field_params, steps=40)

        # Soft bucket assignment: Gaussian windows -> probabilities
        bucket_probs = soft_bucket_mass(x_final, sigma=0.5)

        return bucket_probs  # (batch, vocab_size)
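
Hypothetical usage of the sketch above, with a confidence signal read off for free:

model = GaltonTransformer(vocab_size=50_000, d_model=768)
probs = model(input_ids)                      # (batch, vocab_size), sums to 1 per row
next_token = torch.multinomial(probs, 1)      # sample the next token
confidence = probs.max(dim=-1).values         # peak bucket mass ~ how tightly the probes converged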

The Softmax Problem (and Why Galton Helps)

Modern language models use softmax to pick the next token. It works, but it has serious scaling issues...

⚠️ The O(N²) Attention Problem

Attention mechanism: Every token attends to every other token

# For sequence length L and key dimension d
Q, K, V = split(embeddings)          # Each (L × d)
attention = softmax(Q @ K.T / √d)    # O(L²) ← Problem!
output = attention @ V               # (L² operations)

Vocabulary softmax: Computing probabilities over all tokens

logits = linear(hidden)              # (V,) where V=50,000
probs = softmax(logits)              # O(V), every token!
next_token = sample(probs)           # Pick one from 50k

For GPT-3 (vocab=50k, context=2048):

  • Attention: 2048² = 4.2M operations per layer
  • Softmax: a 50k-way normalization for every prediction
  • Memory: Must store full attention matrix
  • Result: Massive compute and memory usage

📊 Scaling Comparison

Key Observations:

  • Softmax: Linear in vocab size - must touch every token
  • Galton (hierarchical): Sub-linear - routes to likely regions first
  • Galton (adaptive): Variable cost - uses less when confident
💡 How Galton Samplers Help:

Problem          | Softmax                  | Galton Sampler
Vocab scaling    | O(V) every time          | O(log V) hierarchical
Adaptive compute | Fixed cost               | Variable (fewer probes when confident)
Uncertainty      | Post-hoc (entropy)       | Built-in (probe spread)
Interpretability | Opaque numbers           | Visible trajectories
Memory           | Store full distribution  | Only probe states

🚀 The Vision: A New Paradigm

What if probability was always a flow, never a lookup?

Near-term (achievable now):

  • Replace final softmax in transformers
  • Train on moderate vocabs (10k-50k tokens)
  • Demonstrate adaptive compute savings
  • Show uncertainty quantification in action

Mid-term (research frontier):

  • Hierarchical routing for massive vocabularies (1M+ tokens)
  • SDE variants for stochastic exploration
  • Integrate with sparse attention mechanisms
  • Prove theoretical convergence guarantees

Long-term (paradigm shift):

  • Replace attention itself with flow-based routing
  • Unified architecture: everything is geometric flow
  • Extend to continuous action spaces (robotics, control)
  • Bridge to physics-inspired AI (energy-based, thermodynamic)

🎯 What You Can Do

Experiment:

  • Clone the repo and run the demos
  • Train your own Galton sampler on toy tasks
  • Visualize the flow fields and understand the geometry
  • Try different SDF architectures and hyperparameters

Contribute:

  • Scale to larger vocabularies
  • Integrate with production transformers (HuggingFace)
  • Benchmark against softmax baselines
  • Explore theoretical connections (optimal transport, diffusion)

Spread the word:

  • Share this demo with ML researchers and practitioners
  • Cite the work if you build on it
  • Open issues and discussions on GitHub
  • Help us make probability flow the new standard!
💬 Final Thought:

"For decades, we've computed probabilities algebraically. Softmax, argmax, probability vectorsβ€”all lookups and calculations. But probability in the physical world flows. Water finds its level. Particles settle into wells. Energy minimizes.

What if our models could do the same? Not calculate probability, but become probability. Let the geometry do the work. Let physics guide the bits.

That's the Galton vision. From a 4am idea to a working sampler toβ€”perhapsβ€”a new way of thinking about uncertainty in AI."

Start exploring: docs/char32_ode_warmstart.md