An Interactive Journey from Physics to Machine Learning
The Physical Board
Learned Pegs
ML Prediction
Continuous Flow (SDF)
Transformer Integration
Why This Matters
Imagine dropping a ball from the top. At each row, it hits a peg and randomly bounces left or right. After many bounces, it lands in a bucket at the bottom.
Drop many balls, and you'll see a pattern emerge: usually a bell curve (normal distribution). This is probability made visible through physics!
Each ball's path is random, but the overall pattern is predictable. This is the foundation of probability theory: randomness at the individual level creates order at the population level.
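To see that in numbers, here is a minimal NumPy sketch (not part of the repo) that drops balls through rows of fair pegs: each ball's bucket index follows a Binomial(n_rows, 0.5) law, which is where the bell shape comes from.

```python
import numpy as np

def drop_balls(n_balls=10_000, n_rows=12, seed=0):
    """Classical Galton board with fair pegs: each ball makes n_rows
    independent left/right bounces, so its bucket index is Binomial(n_rows, 0.5)."""
    rng = np.random.default_rng(seed)
    bounces = rng.integers(0, 2, size=(n_balls, n_rows))   # 0 = left, 1 = right
    buckets = bounces.sum(axis=1)                           # final bucket per ball
    return np.bincount(buckets, minlength=n_rows + 1) / n_balls

print(drop_balls().round(3))   # peaks in the middle buckets, thin tails at the edges
```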
Instead of random bounces, what if each peg could push balls left or right with different strengths?
Now we can learn which way each peg should push to create any distribution we want! The pegs become parameters we can train.
Learned bias fields replace random chance. Instead of computing "this bucket has 60% probability," we create a geometry that naturally guides 60% of balls there.
The pegs are like neural network weights: we adjust them to get the output we want!
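A minimal PyTorch sketch of that idea (illustrative only; the names and board layout are not the repo's API): each peg holds one learnable "push right" logit, probability mass is propagated row by row, and gradient descent shapes the pegs until the buckets match a chosen target distribution.

```python
import torch

n_rows = 6
n_buckets = n_rows + 1                                       # bucket index = number of right bounces
pegs = torch.zeros(n_rows, n_buckets, requires_grad=True)    # one "push right" logit per peg
target = torch.tensor([0.0, 0.0, 0.1, 0.2, 0.6, 0.1, 0.0])   # e.g. "guide 60% of balls into bucket 4"

opt = torch.optim.Adam([pegs], lr=0.1)
for step in range(500):
    mass = torch.zeros(n_buckets)
    mass[0] = 1.0                                            # every ball enters at the top
    for r in range(n_rows):
        p_right = torch.sigmoid(pegs[r])                     # learned push at (row, position)
        go_right = mass * p_right
        mass = mass * (1 - p_right)                          # left bounce: bucket index stays
        mass = mass + torch.cat([torch.zeros(1), go_right[:-1]])  # right bounce: index + 1
    loss = -(target * mass.clamp_min(1e-9).log()).sum()      # cross-entropy to the target shape
    opt.zero_grad(); loss.backward(); opt.step()

print(mass.detach())                                         # approaches the target distribution
```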
Now imagine the input is context (like "The capital of France is ___") and buckets are words (Paris, London, Berlin, Madrid).
The context configures the pegs, balls drop, and whichever bucket fills up first is our prediction!
Notice how confident predictions need fewer probes? When the model "knows" the answer, balls converge quickly. When uncertain, they spread out and we drop more probes.
This is adaptive compute: the model automatically uses more resources when needed!
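One way such an adaptive loop could look, as a hedged NumPy sketch: probes arrive in small batches from whatever distribution the configured board induces, and sampling stops as soon as the leading bucket is clearly ahead. Batch size, margin, and the function name are illustrative choices, not the repo's.

```python
import numpy as np

def adaptive_sample(bucket_probs, batch=32, max_probes=256, margin=0.2, seed=0):
    """Drop probes in batches; stop early once one bucket clearly leads."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(len(bucket_probs))
    used = 0
    while used < max_probes:
        hits = rng.choice(len(bucket_probs), size=batch, p=bucket_probs)
        counts += np.bincount(hits, minlength=len(bucket_probs))
        used += batch
        share = np.sort(counts / counts.sum())
        if share[-1] - share[-2] >= margin:          # clear winner -> stop early
            break
    return int(counts.argmax()), used

print(adaptive_sample(np.array([0.85, 0.05, 0.05, 0.05])))  # confident: stops after one batch
print(adaptive_sample(np.array([0.28, 0.26, 0.24, 0.22])))  # uncertain: keeps dropping probes
```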
Discrete → Continuous: Scale this up by replacing discrete pegs with continuous flow fields (ODEs on a torus)
Hierarchical: Use multiple boards in sequence (coarse → fine) for large vocabularies (a sketch follows this list)
Context Encoding: Replace our simple selector with a real transformer that maps text to peg configurations
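The hierarchical item above is easy to prototype. A hedged sketch assuming a two-level split into groups of 256 words (sizes and names are illustrative):

```python
import numpy as np

def hierarchical_sample(vocab_probs, group_size=256, seed=0):
    """Two-level coarse -> fine sampling.

    Instead of one board with V buckets, a first board picks one of
    V/group_size groups and a second board picks a word inside that group,
    so each level only needs ~sqrt(V) buckets."""
    rng = np.random.default_rng(seed)
    V = len(vocab_probs)
    groups = vocab_probs.reshape(V // group_size, group_size)
    group_mass = groups.sum(axis=1)                      # coarse distribution over groups
    g = rng.choice(len(group_mass), p=group_mass)        # coarse board
    within = groups[g] / groups[g].sum()                 # renormalised fine distribution
    return g * group_size + rng.choice(group_size, p=within)  # fine board -> vocab index

vocab_probs = np.random.default_rng(1).dirichlet(np.ones(65_536))
print(hierarchical_sample(vocab_probs))
```

With more levels, the per-level bucket count keeps shrinking, which is where the O(log V) claim later in this piece comes from.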
Discrete pegs work great for demos, but they have limits: probes can only move one column at a time, and training through discrete, non-differentiable jumps is tricky.
Solution? Continuous flow on a ring. Instead of bouncing off pegs, probes follow smooth trajectories through a learned velocity field.
Ring Topology: The horizontal axis wraps around (like Pac-Man). Position 0 connects to position C: it's a circle!
SDF (Signed Distance Field): A neural network learns a "distance" function. Close to the target? Negative distance. Far away? Positive.
Velocity from Gradient: v = -α · ∂φ/∂x where φ = softplus(-D). This creates a "downhill" flow toward the target.
RK2 Integration: Each probe follows the velocity field with a second-order Runge-Kutta integrator (more accurate than simple Euler steps); a standalone sketch of these pieces follows this list.
src/galton_lab/ode/field.py → SDFField: MLP that outputs signed distance given (position, time, context)
src/galton_lab/ode/integrate.py → integrate_fixed(): RK2 integration with ring wrapping
src/galton_lab/ode/buckets.py → soft_bucket_mass(): Gaussian windows convert positions → bucket probabilities
sin(2πx/C), cos(2πx/C) at multiple frequencies (periodic features of position on the ring)
φ = softplus(-D) creates smooth wells
Differentiable: The entire flow is smooth → backpropagation works beautifully
Infinite Reach: Probes can travel across the entire ring in one integration (no column-by-column hops)
Adaptive σ: Start with wide Gaussian windows (σ=0.9) for learning, then sharpen (σ=0.5) for precision
Symmetry Breaking: Add directional bias or distillation to prevent mirror solutions
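Here is the standalone sketch promised above: a hand-written potential stands in for the learned SDFField, the velocity comes from its gradient via autograd, probes take RK2 steps with ring wrapping, and Gaussian windows turn final positions into bucket masses. The well location, α, dt, and C are all illustrative; the real implementations live in src/galton_lab/ode/.

```python
import math
import torch

C = 8.0  # ring circumference = number of buckets (illustrative)

def phi(x):
    # Hand-written potential with one smooth well at x = 3; in the repo this
    # comes from the learned SDF instead.
    return -torch.cos(2 * math.pi * (x - 3.0) / C)

def velocity(x, alpha=1.0):
    """v = -alpha * d(phi)/dx, computed with autograd."""
    x = x.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(phi(x).sum(), x)
    return -alpha * grad

def rk2_step(x, dt=0.2):
    """One Runge-Kutta 2 (midpoint) step, wrapped back onto the ring [0, C)."""
    k1 = velocity(x)
    k2 = velocity((x + 0.5 * dt * k1) % C)
    return (x + dt * k2) % C

def soft_bucket_mass(x, sigma=0.5):
    """Gaussian windows around integer bucket centres -> bucket probabilities."""
    centres = torch.arange(int(C), dtype=x.dtype)
    d = x.unsqueeze(-1) - centres                  # (probes, buckets)
    d = torch.remainder(d + C / 2, C) - C / 2      # shortest distance on the ring
    w = torch.exp(-0.5 * (d / sigma) ** 2)
    return (w / w.sum(-1, keepdim=True)).mean(0)

x = torch.rand(512) * C                            # random probe starts on the ring
for _ in range(40):                                # 40 RK2 steps
    x = rk2_step(x)
print(soft_bucket_mass(x))                         # most of the mass ends up in bucket 3
```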
Warm Start Phase:
Auto-Handoff: System detects when margin ≥ 0.05 and target prob ≥ 0.25, then switches to...
Sharpen Phase:
See docs/char32_ode_warmstart.md for the full protocol, and run galton/train.py --warm-start-preset char32 to try it!
```bash
# Train the ODE sampler on the character-level task
python galton/train.py --task char32 --device auto --amp \
    --sampler ode --batch 4096 --warm-start-preset char32 \
    --auto-handoff

# Or start with a simple ABCD pattern
python galton/train.py --task abcd --device auto \
    --per-example-fields --batch 8192
```
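As a hedged illustration of the auto-handoff rule, assuming "margin" means the target bucket's probability minus its best competitor (the repo's exact definition may differ):

```python
def should_hand_off(bucket_probs, target_idx, margin_thresh=0.05, prob_thresh=0.25):
    """Warm-start -> sharpen handoff: switch once the target bucket leads its
    best competitor by >= 0.05 and holds >= 0.25 of the mass."""
    target_prob = bucket_probs[target_idx]
    best_other = max(p for i, p in enumerate(bucket_probs) if i != target_idx)
    return (target_prob - best_other) >= margin_thresh and target_prob >= prob_thresh

print(should_hand_off([0.10, 0.32, 0.25, 0.33], target_idx=3))  # False: margin is only 0.01
print(should_hand_off([0.10, 0.20, 0.25, 0.45], target_idx=3))  # True: margin 0.20, prob 0.45
```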
Transformers are the backbone of modern AI (GPT, Claude, etc.). They have two key components, and we only need to swap out one of them:
1. Attention mechanism (builds context) → Keep this!
2. Final softmax layer (picks next token) → Replace with a Galton sampler!
Standard Transformer:

```
Input text: "The capital of France is"
    ↓
[Transformer Encoder] (attention layers)
    ↓
Context vector (768-dim embedding)
    ↓
[Linear projection to vocab size]
    ↓
Logits: [50,000 numbers]
    ↓
Softmax(logits)                 ← Expensive! Compute ALL probabilities
    ↓
Probabilities: [0.7, 0.05, 0.001, ...]
    ↓
Sample: "Paris"
```
Galton Transformer:

```
Input text: "The capital of France is"
    ↓
[Transformer Encoder] (attention layers)   ← Same!
    ↓
Context vector (768-dim embedding)
    ↓
[Galton SDF Field Generator]               ← New!
    ↓
Velocity field parameters (α, bias, target wells)
    ↓
Drop probes & integrate ODEs               ← Physics!
    ↓
Bucket masses emerge from flow
    ↓
Sample: "Paris" (+ free uncertainty estimate!)
```
The context vector (from the transformer) goes through a small MLP to produce the velocity field parameters shown above (α, bias, target wells).
This is contextual composition: the same idea as Stage 3, but learned end-to-end!
End-to-end learning:
Gradients flow from loss → SDF → context encoder → transformer. The whole system learns to create flow fields that guide probes correctly!
```python
class GaltonTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, n_probes=256):
        super().__init__()
        self.transformer = TransformerEncoder(...)
        self.context_to_field = MLP(d_model, sdf_params)  # sdf_params = size of the field parameter vector
        self.sdf_field = SDFField(...)
        self.integrator = ODEIntegrator(...)
        self.n_probes = n_probes
        self.C = vocab_size                               # ring circumference = number of buckets

    def forward(self, input_ids):
        # Standard transformer part
        context = self.transformer(input_ids)             # (batch, d_model)

        # Galton part: context → flow field
        field_params = self.context_to_field(context)

        # Drop probes and integrate
        batch = input_ids.shape[0]
        x0 = torch.rand(batch, self.n_probes) * self.C    # Random starts on the ring
        x_final = self.integrator(x0, field_params, steps=40)

        # Soft bucket assignment
        bucket_probs = soft_bucket_mass(x_final, sigma=0.5)
        return bucket_probs                               # (batch, vocab_size)
```
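A hedged sketch of what a training step might look like with this module: the bucket masses are already probabilities, so a plain negative log-likelihood stands in for softmax cross-entropy, and the backward pass reaches the SDF, the context-to-field MLP, and the transformer encoder alike. The batch tensors here are dummies.

```python
# Hypothetical training step (shapes and names assumed, not from the repo)
model = GaltonTransformer(vocab_size=50_000, d_model=768)
input_ids = torch.randint(0, 50_000, (8, 128))       # dummy batch of token ids
next_token_ids = torch.randint(0, 50_000, (8,))      # dummy next-token targets
bucket_probs = model(input_ids)                      # (batch, vocab_size)
loss = torch.nn.functional.nll_loss(bucket_probs.clamp_min(1e-9).log(), next_token_ids)
loss.backward()   # gradients flow: loss → soft buckets → ODE → SDF → context encoder → transformer
```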
Modern language models use softmax to pick the next token. It works, but it has serious scaling issues...
Attention mechanism: Every token attends to every other token
```python
# For sequence length L, vocab size V
Q, K, V = split(embeddings)               # Each (L × d)
attention = softmax(Q @ K.T / sqrt(d))    # O(L²) ← Problem!
output = attention @ V                    # (L² operations)
```
Vocabulary softmax: Computing probabilities over all tokens
```python
logits = linear(hidden)       # (V,) where V = 50,000
probs = softmax(logits)       # O(V) ← every token!
next_token = sample(probs)    # Pick one from 50k
```
For GPT-3 (vocab=50k, context=2048):
2048² ≈ 4.2M attention operations per layer
A 50,000-entry softmax normalization for every single prediction

Key Observations:
| Aspect | Softmax | Galton Sampler |
|---|---|---|
| Vocab scaling | O(V) every time | O(log V) hierarchical |
| Adaptive compute | Fixed cost | Variable (fewer probes when confident) |
| Uncertainty | Post-hoc (entropy) | Built-in (probe spread) |
| Interpretability | Opaque numbers | Visible trajectories |
| Memory | Store full distribution | Only probe states |
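One concrete reading of the "built-in uncertainty" row: the circular spread of the final probe positions is itself an uncertainty score, with no extra entropy computation. A hedged NumPy sketch (the statistic and names are my choice, not the repo's):

```python
import numpy as np

def probe_uncertainty(x_final, C=8.0):
    """Circular spread of final probe positions on a ring of circumference C:
    ~0 when all probes agree (confident), -> 1 when they smear around the ring."""
    angles = 2 * np.pi * np.asarray(x_final) / C
    resultant = np.hypot(np.cos(angles).mean(), np.sin(angles).mean())
    return 1.0 - resultant                            # circular variance

print(probe_uncertainty([3.0, 3.1, 2.9, 3.05]))       # tight cluster -> near 0
print(probe_uncertainty([0.5, 2.5, 4.5, 6.5]))        # evenly spread -> near 1
```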
What if probability were always a flow, never a lookup?
Near-term (achievable now):
Mid-term (research frontier):
Long-term (paradigm shift):
Experiment:
Contribute:
Spread the word:
"For decades, we've computed probabilities algebraically. Softmax, argmax, probability vectorsβall lookups and calculations. But probability in the physical world flows. Water finds its level. Particles settle into wells. Energy minimizes.
What if our models could do the same? Not calculate probability, but become probability. Let the geometry do the work. Let physics guide the bits.
That's the Galton vision. From a 4am idea to a working sampler to, perhaps, a new way of thinking about uncertainty in AI."
→ Start exploring: docs/char32_ode_warmstart.md