
The Multiplicative "Cheat Code": How Dynamic Weights Power Transformers and the Brain

Giulio Ruffini, PhD

Co-Founder & CTO, Neuroelectrics


A recent, brilliant post by Dimitris Papailiopoulos on X (formerly Twitter) [1] sparked a fascinating train of thought for me. He was discussing the recent trend of training Transformers to act as general-purpose computers. But rather than just marveling at the results, he pointed out how we should properly view the mechanics under the hood: it all comes down to dynamic, programmable weights.


This observation brought an old puzzle of mine back to the surface. If you look closely at architectures like LSTMs, GRUs, and Transformers, they all use a mathematical trick that departs from classical neural network design: the direct multiplication of signals.


For a long time, this felt almost like cheating. Classical neural networks don't do this. So, why do modern architectures rely on it? As Dimitris pointed out, and as I recently explored in my own biophysical research, this multiplicative trick isn't a cheat at all—it's the foundation of programmable intelligence, and it mirrors exactly how the human brain evaluates prediction errors.


The Limitation of Fixed Weights


To understand why multiplication is so powerful, we have to look at traditional Artificial Neural Networks (ANNs).


In a classic feed-forward network, or even a vanilla Recurrent Neural Network (RNN), the math is highly standardized: you take an input $x$, multiply it by a weight matrix $W$, add a bias $b$, and pass it through a non-linear activation function.


$$y = f(Wx + b)$$


Once the network is trained, $W$ is permanently fixed. The network is essentially a static piece of hardware. While vanilla RNNs are theoretically Turing complete, pushing a static circuit to perform complex, algorithmic tasks in practice is incredibly difficult.
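To make the "static hardware" point concrete, here is a minimal sketch of a classical fixed-weight layer (illustrative sizes and ReLU activation are my choices, not taken from any particular model). Once $W$ and $b$ are set, the same mapping is applied to every input:

```python
import numpy as np

# A minimal sketch of a classical fixed-weight layer: y = f(Wx + b).
# After training, W and b are frozen; the mapping never changes at inference.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))  # fixed weight matrix (3 outputs, 4 inputs)
b = rng.standard_normal(3)       # fixed bias

def relu(z):
    return np.maximum(0.0, z)

def static_layer(x):
    # The same W is applied to every input -- the "hardware" is static.
    return relu(W @ x + b)

x = rng.standard_normal(4)
y = static_layer(x)
print(y.shape)  # (3,)
```

No matter what $x$ arrives, the connectivity $W$ is identical; the input can only flow through the circuit, never reconfigure it.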


The "Cheat Code": Programmable Connections


This is where Papailiopoulos's insight shines. Architectures like Transformers introduce a radical departure from the static $f(Wx + b)$ formula: they let inputs multiply other inputs.


In a Transformer, this happens via Self-Attention. The network derives Queries and Keys from the input, takes their scaled dot product (followed by a softmax) to create an attention matrix, and then multiplies that matrix by the Values.


$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$


Why is this so profound? Because multiplication allows the weights of the neural network to become dynamic. When signals multiply signals, the input itself dictates how the network routes information. The network is no longer just static hardware; it becomes programmable. The data acts as "software" that temporarily rewires the network's connections on the fly. This is exactly why a Transformer can be trained to emulate a computer—its weights are dynamically programmed by the context.
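The "data as software" idea can be sketched in a few lines. In this toy single-head self-attention (hypothetical dimensions, random projections), the attention matrix $A$ is computed from the input itself, so the effective token-mixing weights change whenever the input changes:

```python
import numpy as np

# A minimal sketch of single-head self-attention (illustrative sizes).
# The attention matrix A is computed FROM the input X, so the effective
# token-mixing weights are dynamic: change X and A changes with it.
rng = np.random.default_rng(0)
d, n = 8, 5  # embedding dimension, sequence length
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(d))  # input-dependent "weights"
    return A @ V, A

X1, X2 = rng.standard_normal((n, d)), rng.standard_normal((n, d))
_, A1 = self_attention(X1)
_, A2 = self_attention(X2)
print(np.allclose(A1, A2))  # False: the routing was reprogrammed by the data
```

The projection matrices $W_Q$, $W_K$, $W_V$ are fixed after training, yet the routing matrix $A$ is freshly computed for every input: signals multiplying signals.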


[Figure: Fixed and Dynamic Weights. Left: a classical neural network with fixed weights. Right: a Transformer with dynamic, brain-inspired weights.]

The Biological Mirror: Precision Weighting and Gating


If dynamic weighting is the secret to Turing-complete flexibility in AI, does the brain use the same trick?


This question brings us directly to the concept of predictive coding, and specifically to my recently accepted paper in Neural Computation: Decoding Prediction Errors in the Brain: A Laminar Neural Mass Model Approach [2].


In predictive coding, the brain is an active inferential machine. It constantly generates top-down predictions and compares them to bottom-up sensory inputs. When there is a mismatch, a "prediction error" is generated. However, the brain must weight these errors based on their reliability—a process known as precision weighting.


How does the biological hardware of the cortex physically implement this? It turns out, the brain uses the exact same multiplicative trick.


Cross-Frequency Coupling as a Multiplicative Gate


In our Laminar Neural Mass Model (LaNMM), we demonstrate how this happens via Cross-Frequency Coupling (CFC) across cortical layers. Information in the brain isn't just carried by spikes; it's carried by the amplitude and phase of oscillations.


  • Top-down predictions are generally carried by slower brain rhythms (like Alpha, 8–12 Hz).

  • Bottom-up sensory inputs/errors are carried by fast rhythms (like Gamma, 30–100 Hz).


Our model shows that the brain evaluates prediction errors and applies precision weighting through two coupled mechanisms:


  1. Signal-Envelope Coupling (SEC): Slow oscillations modulate the envelope of fast oscillations to compute the fast-time mismatch (the error itself).

  2. Envelope-Envelope Coupling (EEC): The envelope of slow oscillations modulates the envelope of faster rhythms to govern slow-time gating and precision.


Mathematically and biophysically, amplitude modulation is multiplication. When a slow alpha wave scales the amplitude of a fast gamma wave, the brain is executing a dynamic, multiplicative gate. Top-down control acts as a "gain control" knob, scaling the prediction error signal. Just like the self-attention matrix in a Transformer, the brain uses the current context to dynamically multiply and re-weight the flow of information.
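The point that amplitude modulation is multiplication can be illustrated directly (this is an illustrative signal-processing sketch with arbitrary frequencies and a simple rectified envelope, not the LaNMM implementation):

```python
import numpy as np

# A minimal sketch of multiplicative gating via amplitude modulation.
# A slow "top-down" rhythm sets the instantaneous gain on a fast
# "bottom-up" rhythm -- pure multiplication, as in precision weighting.
fs = 1000.0                           # sampling rate (Hz)
t = np.arange(0, 1.0, 1.0 / fs)

alpha = np.sin(2 * np.pi * 10 * t)    # slow rhythm (10 Hz, alpha band)
gamma = np.sin(2 * np.pi * 40 * t)    # fast rhythm (40 Hz, gamma band)

# Precision weighting as a gain knob: map the slow wave to a [0, 1]
# envelope and multiply it into the fast signal.
gain = 0.5 * (1.0 + alpha)
gated_gamma = gain * gamma

# Where the alpha phase drives the gain low, gamma is suppressed.
print(gated_gamma.max() <= gamma.max())  # True
```

The fast oscillation still carries its content, but its moment-to-moment amplitude, and hence its downstream influence, is dictated multiplicatively by the slow rhythm, exactly the gain-control role ascribed to top-down signals above.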


[Fig. 1 from [2]: panels (a)–(c) showing SEC, EEC, and the combined coupling mechanism.]

The Roads to Turing Completeness


The parallel between state-of-the-art AI and biological intelligence is striking, but it raises a deeper theoretical question.


We know from foundational work in computer science that even standard, fixed-weight RNNs are theoretically Turing complete [3]. That is, given enough neurons, precision, and time, a static RNN can simulate any arbitrary algorithm. They don't strictly need dynamic weights to be universal computers.


So, why did both biological evolution and modern AI research converge on architectures that utilize multiplicative, dynamic weights?


The answer likely lies not in theoretical possibility, but in practical implementation. While fixed-weight systems can compute anything, training them to perform complex, algorithmic tasks robustly is notoriously difficult. Multiplicative gating—seen in LSTMs, Transformers, and biological circuits—may represent an evolutionarily superior implementation path.


Dynamic weights allow for systems that are likely more stable dynamically, easier to train (or evolve), and more robust to noise. The multiplication of signals isn't the only road to powerful computation, but it appears to be the most efficient "cheat code" for building flexible, programmable intelligence in the real world.


References

[1] Papailiopoulos, D. (2024). Commentary on Transformer training. X (formerly Twitter). URL: https://x.com/dimitrispapail/status/2028669695344148946?s=12

[2] Ruffini, G., et al. (2024). Decoding Prediction Errors in the Brain: A Laminar Neural Mass Model Approach. Neural Computation (Accepted). URL: https://www.neuroelectrics.com/blog/decoding-prediction-errors-in-the-brain-a-laminar-neural-mass-model-approach

[3] Siegelmann, H. T., & Sontag, E. D. (1995). On the computational power of neural networks. Journal of Computer and System Sciences, 50(1), 132–150.
