🎲 Galton Lab

An Interactive Journey from Physics to Machine Learning

Stage 1: The Physical Board
Stage 2: Learned Pegs
Stage 3: ML Prediction
Stage 4: Continuous Flow (SDF)
Stage 5: Transformer Integration
Stage 6: Why This Matters

What is a Galton Board?

Imagine dropping a ball from the top. At each row, it hits a peg and randomly bounces left or right. After many bounces, it lands in a bucket at the bottom.

Drop many balls, and you'll see a pattern emerge: usually a bell curve (normal distribution). This is probability made visible through physics!


💡 Key Concept

Each ball's path is random, but the overall pattern is predictable. This is the foundation of probability theory: randomness at the individual level creates order at the population level.
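
To make the bell curve concrete, here is a tiny standalone simulation (plain NumPy, not part of the lab's code) that drops balls and counts where they land:

import numpy as np

def drop_balls(n_balls=10_000, n_rows=12, seed=0):
    # Each ball makes n_rows independent left/right bounces; its bucket index is the
    # number of rightward bounces, so buckets follow Binomial(n_rows, 0.5).
    rng = np.random.default_rng(seed)
    buckets = rng.binomial(n=n_rows, p=0.5, size=n_balls)
    return np.bincount(buckets, minlength=n_rows + 1)

print(drop_balls())  # counts peak in the middle buckets: the bell curve made visible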

What if Pegs Could Be Biased?

Instead of random bounces, what if each peg could push balls left or right with different strengths?

Now we can learn which way each peg should push to create any distribution we want! The pegs become parameters we can train.


💡 Key Concept

Learned bias fields replace random chance. Instead of computing "this bucket has 60% probability," we create a geometry that naturally guides 60% of balls there.

The pegs are like neural network weights: we adjust them to get the output we want!

🔗 ML Connection: This is exactly how neural networks learn! Adjust parameters (pegs) to transform inputs (ball positions) into desired outputs (bucket distributions).
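
As a rough illustration of "pegs as parameters" (a toy sketch assuming one learnable right-bounce probability per row, not the lab's implementation), gradient descent can shape the bucket distribution toward a target such as 70% of balls in bucket 3:

import torch

n_rows = 6                                             # 6 rows of pegs -> 7 buckets
peg_logits = torch.zeros(n_rows, requires_grad=True)   # one learnable "push right" bias per row
target = torch.tensor([0.0, 0.0, 0.0, 0.7, 0.2, 0.1, 0.0])

def bucket_distribution(logits):
    # Exact, differentiable bucket distribution (no ball-dropping needed):
    # each row convolves the current position distribution with (1 - p, p).
    p_right = torch.sigmoid(logits)
    dist = torch.ones(1)
    for p in p_right:
        dist = torch.cat([dist * (1 - p), torch.zeros(1)]) + torch.cat([torch.zeros(1), dist * p])
    return dist

opt = torch.optim.Adam([peg_logits], lr=0.1)
for _ in range(300):
    opt.zero_grad()
    loss = ((bucket_distribution(peg_logits) - target) ** 2).sum()
    loss.backward()
    opt.step()

print(bucket_distribution(peg_logits).detach())  # most of the mass now lands around bucket 3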

Using It for Real Predictions

Now imagine the input is context (like "The capital of France is ___") and buckets are words (Paris, London, Berlin, Madrid).

The context configures the pegs, balls drop, and whichever bucket fills up first is our prediction!


💡 The Magic of Adaptive Compute

Notice how confident predictions need fewer probes? When the model "knows" the answer, balls converge quickly. When uncertain, they spread out and we drop more probes.

This is adaptive compute: the model automatically uses more resources when needed!

🔗 ML Connection: In transformers, every prediction costs the same (full softmax over 50,000 tokens). With Galton samplers, confident predictions are faster and uncertainty is visible.
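
A minimal sketch of that adaptive-compute loop (plain NumPy; sample_bucket is a hypothetical stand-in for simulated probe landings, not the lab's API): drop probes in small batches and stop as soon as one bucket clearly dominates.

import numpy as np

def adaptive_predict(sample_bucket, n_buckets, threshold=0.6, batch=64, max_probes=2048, seed=0):
    # Keep dropping probe batches until the leading bucket holds `threshold` of all probes.
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_buckets, dtype=int)
    used = 0
    while used < max_probes:
        counts += np.bincount(sample_bucket(rng, batch), minlength=n_buckets)
        used += batch
        confidence = counts.max() / counts.sum()
        if confidence >= threshold:          # confident early -> stop (adaptive compute)
            break
    return counts.argmax(), round(float(confidence), 2), used

easy = lambda rng, n: rng.choice(4, size=n, p=[0.85, 0.05, 0.05, 0.05])   # peaked: model "knows"
hard = lambda rng, n: rng.choice(4, size=n, p=[0.30, 0.30, 0.20, 0.20])   # flat: uncertain
print(adaptive_predict(easy, 4))   # stops after a few batches
print(adaptive_predict(hard, 4))   # keeps sampling up to max_probes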

🚀 From Here to Production

Discrete → Continuous: Scale this up by replacing discrete pegs with continuous flow fields (ODEs on a torus)

Hierarchical: Use multiple boards in sequence (coarse → fine) for large vocabularies

Context Encoding: Replace our simple selector with a real transformer that maps text to peg configurations

From Discrete to Continuous: The SDF Sampler

Discrete pegs work great for demos, but they have limits: probes can only move one column at a time, and training on discrete jumps is tricky.

Solution? Continuous flow on a ring. Instead of bouncing off pegs, probes follow smooth trajectories through a learned velocity field.


🌀 What's Happening Here

Ring Topology: The horizontal axis wraps around (like Pac-Man). Position 0 connects to position C: it's a circle!

SDF (Signed Distance Field): A neural network learns a "distance" function. Close to the target? Negative distance. Far away? Positive.

Velocity from Gradient: v = -α · ∂φ/∂x where φ = softplus(-D). This creates a "downhill" flow toward the target.

RK2 Integration: Each probe follows the velocity field with Runge-Kutta 2nd order (more accurate than simple Euler steps).

🔗 Implementation Details:
  • src/galton_lab/ode/field.py - SDFField: MLP that outputs signed distance given (position, time, context)
  • src/galton_lab/ode/integrate.py - integrate_fixed(): RK2 integration with ring wrapping
  • src/galton_lab/ode/buckets.py - soft_bucket_mass(): Gaussian windows convert positions → bucket probabilities
  • Ring encoding: sin(2πx/C), cos(2πx/C) at multiple frequencies
  • Potential function: φ = softplus(-D) creates smooth wells
  • Velocity: Computed via autograd or finite differences (a toy version is sketched below)
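
To make the velocity and RK2 pieces concrete, here is a self-contained toy version of the formulas above (a sketch with a hand-written stand-in for the learned field, not the repo's SDFField or integrate_fixed):

import torch
import torch.nn.functional as F

def velocity(D, x, alpha=1.0):
    # v = -alpha * dphi/dx with phi = softplus(-D(x)); the gradient comes from autograd.
    x = x.detach().requires_grad_(True)
    phi = F.softplus(-D(x)).sum()
    (grad,) = torch.autograd.grad(phi, x)
    return -alpha * grad

def rk2_step(D, x, dt, C=1.0):
    # One midpoint (RK2) step; positions wrap around the ring [0, C).
    k1 = velocity(D, x)
    k2 = velocity(D, (x + 0.5 * dt * k1) % C)
    return (x + dt * k2) % C

D_toy = lambda x: torch.cos(2 * torch.pi * x)   # toy stand-in for the learned field, C = 1
x = torch.rand(8)                               # 8 probes at random starting positions
for _ in range(40):                             # 40 fixed integration steps, as in the demo
    x = rk2_step(D_toy, x, dt=0.05)
print(x)                                        # probes settle where phi = softplus(-D) is lowest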

🎯 Why This Scales

Differentiable: The entire flow is smooth → backpropagation works beautifully

Infinite Reach: Probes can travel across the entire ring in one integration (no column-by-column hops)

Adaptive σ: Start with wide Gaussian windows (σ=0.9) for learning, then sharpen (σ=0.5) for precision (a toy version of these windows is sketched after this list)

Symmetry Breaking: Add directional bias or distillation to prevent mirror solutions
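
The Gaussian bucket windows behind that σ schedule fit in a few lines. This is a toy version (the exact scaling convention used by soft_bucket_mass is an assumption here):

import torch

def toy_bucket_mass(x, n_buckets, C=1.0, sigma=0.5):
    # x: final probe positions in [0, C), shape (..., n_probes).
    width = C / n_buckets
    centers = (torch.arange(n_buckets) + 0.5) * width
    d = (x[..., None] - centers).abs()
    d = torch.minimum(d, C - d)                        # wrap-around (ring) distance
    w = torch.exp(-0.5 * (d / (sigma * width)) ** 2)   # Gaussian window around each bucket center
    mass = w.sum(dim=-2)                               # accumulate probe mass per bucket
    return mass / mass.sum(dim=-1, keepdim=True)

x_final = torch.tensor([[0.18, 0.20, 0.19, 0.60]])     # 4 probes; 3 converged near the same spot
print(toy_bucket_mass(x_final, n_buckets=8))           # bucket 1 collects most of the mass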

📚 Training the SDF

Warm Start Phase:

  • Wide sigma (σ=0.9) for forgiving gradients
  • Directional bias (+0.15 drift) breaks ring symmetry
  • Knowledge distillation from a "teacher" model
  • KL loss (τ=1.5) + velocity alignment loss

Auto-Handoff: System detects when margin ≥ 0.05 and target prob ≥ 0.25, then switches to...

Sharpen Phase:

  • Tighter sigma (σ=0.5) for precise peaks
  • Remove bias and distillation (training wheels off!)
  • Pure cross-entropy optimization
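
Schematically, the warm-start → sharpen handoff amounts to a small rule like the one below (a hypothetical sketch; the real trigger lives in the training code, and reading "margin" as top-1 minus top-2 bucket probability is an assumption):

def phase_hparams(stats, handed_off):
    # stats: evaluation metrics such as {"margin": ..., "target_prob": ...}.
    if not handed_off and stats["margin"] >= 0.05 and stats["target_prob"] >= 0.25:
        handed_off = True                                                       # auto-handoff trigger
    if handed_off:
        return {"sigma": 0.5, "drift": 0.0, "distill_weight": 0.0}, handed_off  # sharpen phase
    return {"sigma": 0.9, "drift": 0.15, "distill_weight": 1.0}, handed_off     # warm-start phase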

See docs/char32_ode_warmstart.md for the full protocol and galton/train.py --warm-start-preset char32 to run it!

🧪 Try It Yourself:
# Train the ODE sampler on character-level task
python galton/train.py --task char32 --device auto --amp \
  --sampler ode --batch 4096 --warm-start-preset char32 \
  --auto-handoff

# Or start with a simple ABCD pattern
python galton/train.py --task abcd --device auto \
  --per-example-fields --batch 8192

How This Fits into Transformers

Transformers are the backbone of modern AI (GPT, Claude, etc.). They have two key components we can replace with Galton samplers:

1. Attention mechanism (builds context) → Keep this!

2. Final softmax layer (picks next token) → Replace with Galton sampler!

🔄 The Architecture

Standard Transformer:

Input text: "The capital of France is"
    ↓
[Transformer Encoder] (attention layers)
    ↓
Context vector (768-dim embedding)
    ↓
[Linear projection to vocab size]
    ↓
Logits: [50,000 numbers]
    ↓
Softmax(logits)  ← Expensive! Compute ALL probabilities
    ↓
Probabilities: [0.7, 0.05, 0.001, ...]
    ↓
Sample: "Paris"

Galton Transformer:

Input text: "The capital of France is"
    ↓
[Transformer Encoder] (attention layers) ← Same!
    ↓
Context vector (768-dim embedding)
    ↓
[Galton SDF Field Generator]  ← New!
    ↓
Velocity field parameters (Ξ±, bias, target wells)
    ↓
Drop probes & integrate ODEs  ← Physics!
    ↓
Bucket masses emerge from flow
    ↓
Sample: "Paris" (+ free uncertainty estimate!)
🔗 Key Insight: The context vector from the transformer configures the probability landscape. Different contexts create different flow fields, just like in Stage 3, but now driven by real learned representations instead of our toy selector!

🎨 How Context Shapes the Field

The context vector (from transformer) goes through a small MLP to produce:

  • SDF parameters: Weights for the distance field network
  • Per-bucket strengths: How strong is each token "attractor"?
  • Field strength (α): Overall sharpness of the flow
  • Initial bias: Directional preference (breaks symmetry)

This is contextual composition: the same idea as Stage 3, but learned end-to-end!
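
One plausible shape for that small MLP (a hypothetical sketch, not the repo's module; the layer sizes and output names are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextToField(nn.Module):
    # Maps a transformer context vector to the four kinds of field parameters listed above.
    def __init__(self, d_model, n_buckets, sdf_param_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_model, hidden), nn.GELU())
        self.sdf_params = nn.Linear(hidden, sdf_param_dim)    # modulates the distance-field network
        self.bucket_strength = nn.Linear(hidden, n_buckets)   # per-token attractor strength
        self.alpha = nn.Linear(hidden, 1)                     # overall field sharpness
        self.bias = nn.Linear(hidden, 1)                      # directional drift (breaks symmetry)

    def forward(self, context):
        h = self.body(context)
        return {
            "sdf_params": self.sdf_params(h),
            "bucket_strength": F.softmax(self.bucket_strength(h), dim=-1),
            "alpha": F.softplus(self.alpha(h)),               # keep sharpness positive
            "bias": self.bias(h),
        }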

⚡ Training the Full Stack

End-to-end learning:

  1. Transformer produces context vector for "The capital of France is ___"
  2. Context → field parameters (learned MLP)
  3. Field parameters → velocity field (SDF network)
  4. Integrate probes → bucket masses
  5. Cross-entropy loss: `CE(bucket_masses, target="Paris")`
  6. Backprop through entire pipeline (differentiable!)

Gradients flow from loss → SDF → context encoder → transformer. The whole system learns to create flow fields that guide probes correctly!
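
In code, that loop might look roughly like this (a sketch: GaltonTransformer is the class from the implementation sketch below, and the dataloader and hyperparameters are placeholders):

import torch
import torch.nn.functional as F

model = GaltonTransformer(vocab_size=50_000, d_model=768)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for input_ids, target_ids in dataloader:                 # placeholder next-token dataset
    bucket_probs = model(input_ids)                       # probes -> bucket masses, (batch, vocab)
    loss = F.nll_loss(torch.log(bucket_probs + 1e-9), target_ids)  # cross-entropy on bucket masses
    opt.zero_grad()
    loss.backward()    # gradients flow: loss -> buckets -> SDF -> context head -> transformer
    opt.step()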

📊 What You Get:
  • Same accuracy as softmax (with proper training)
  • Adaptive compute - confident predictions finish faster
  • Uncertainty quantification - probe spread = confidence
  • Interpretability - visualize the flow field to understand "why"
  • Potential efficiency - hierarchical routing for large vocabs

🧩 Implementation Sketch

import torch
import torch.nn as nn

class GaltonTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, n_probes=256, ring_circumference=1.0):
        super().__init__()
        self.transformer = TransformerEncoder(...)            # standard attention stack
        self.context_to_field = MLP(d_model, sdf_params)      # context -> field parameters
        self.sdf_field = SDFField(...)                        # learned signed-distance field
        self.integrator = ODEIntegrator(...)                  # RK2 integration with ring wrapping
        self.n_probes = n_probes
        self.C = ring_circumference

    def forward(self, input_ids):
        # Standard transformer part
        context = self.transformer(input_ids)  # (batch, d_model)

        # Galton part: context -> flow field
        field_params = self.context_to_field(context)

        # Drop probes at random positions on the ring and integrate
        batch = context.shape[0]
        x0 = torch.rand(batch, self.n_probes, device=context.device) * self.C
        x_final = self.integrator(x0, field_params, steps=40)

        # Soft bucket assignment: Gaussian windows -> probabilities
        bucket_probs = soft_bucket_mass(x_final, sigma=0.5)

        return bucket_probs  # (batch, vocab_size)
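
Hypothetical usage of the sketch above, with a confidence signal read off for free:

model = GaltonTransformer(vocab_size=50_000, d_model=768)
probs = model(input_ids)                      # (batch, vocab_size), sums to 1 per row
next_token = torch.multinomial(probs, 1)      # sample the next token
confidence = probs.max(dim=-1).values         # peak bucket mass ~ how tightly the probes converged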

The Softmax Problem (and Why Galton Helps)

Modern language models use softmax to pick the next token. It works, but it has serious scaling issues...

⚠️ The O(N²) Attention Problem

Attention mechanism: Every token attends to every other token

# For sequence length L and key dimension d
Q, K, V = split(embeddings)          # Each (L × d)
attention = softmax(Q @ K.T / √d)    # O(L²) ← Problem!
output = attention @ V               # (L² operations)

Vocabulary softmax: Computing probabilities over all tokens

logits = linear(hidden)              # (V,) where V=50,000
probs = softmax(logits)              # O(V), every token!
next_token = sample(probs)           # Pick one from 50k

For GPT-3 (vocab=50k, context=2048):

  • Attention: 2048² = 4.2M operations per layer
  • Softmax: a 50k-way normalization for every prediction
  • Memory: Must store full attention matrix
  • Result: Massive compute and memory usage

📊 Scaling Comparison

Key Observations:

  • Softmax: Linear in vocab size - must touch every token
  • Galton (hierarchical): Sub-linear - routes to likely regions first
  • Galton (adaptive): Variable cost - uses less when confident
💡 How Galton Samplers Help:

Problem          | Softmax                  | Galton Sampler
Vocab scaling    | O(V) every time          | O(log V) hierarchical
Adaptive compute | Fixed cost               | Variable (fewer probes when confident)
Uncertainty      | Post-hoc (entropy)       | Built-in (probe spread)
Interpretability | Opaque numbers           | Visible trajectories
Memory           | Store full distribution  | Only probe states

🚀 The Vision: A New Paradigm

What if probability was always a flow, never a lookup?

Near-term (achievable now):

  • Replace final softmax in transformers
  • Train on moderate vocabs (10k-50k tokens)
  • Demonstrate adaptive compute savings
  • Show uncertainty quantification in action

Mid-term (research frontier):

  • Hierarchical routing for massive vocabularies (1M+ tokens)
  • SDE variants for stochastic exploration
  • Integrate with sparse attention mechanisms
  • Prove theoretical convergence guarantees

Long-term (paradigm shift):

  • Replace attention itself with flow-based routing
  • Unified architecture: everything is geometric flow
  • Extend to continuous action spaces (robotics, control)
  • Bridge to physics-inspired AI (energy-based, thermodynamic)

🎯 What You Can Do

Experiment:

  • Clone the repo and run the demos
  • Train your own Galton sampler on toy tasks
  • Visualize the flow fields and understand the geometry
  • Try different SDF architectures and hyperparameters

Contribute:

  • Scale to larger vocabularies
  • Integrate with production transformers (HuggingFace)
  • Benchmark against softmax baselines
  • Explore theoretical connections (optimal transport, diffusion)

Spread the word:

  • Share this demo with ML researchers and practitioners
  • Cite the work if you build on it
  • Open issues and discussions on GitHub
  • Help us make probability flow the new standard!
💬 Final Thought:

"For decades, we've computed probabilities algebraically. Softmax, argmax, probability vectorsβ€”all lookups and calculations. But probability in the physical world flows. Water finds its level. Particles settle into wells. Energy minimizes.

What if our models could do the same? Not calculate probability, but become probability. Let the geometry do the work. Let physics guide the bits.

That's the Galton vision. From a 4am idea to a working sampler toβ€”perhapsβ€”a new way of thinking about uncertainty in AI."

Start exploring: docs/char32_ode_warmstart.md