Benchmarking Predictive Coding Networks Made Simple
Tommaso Salvatori, Mar 6, 2025

Artificial intelligence (AI), as exemplified by large models such as ChatGPT, is one of the key technologies of this century. Most AI breakthroughs rely on deep neural networks trained via backpropagation on GPUs, but this approach has several drawbacks, the main one being its enormous demand for computational power. This makes the technology hard to deploy locally on low-power edge devices, less democratic, and responsible for a substantial carbon footprint, and it stores information very differently from how biological systems do. How can we get past this computational wall and develop systems that reason with an efficiency comparable to that of the human brain? A promising solution is to move from GPUs, which are general-purpose machines not specifically designed for AI, towards specialized hardware designed to perform tensor operations efficiently and to circumvent the von Neumann bottleneck, that is, the separation between memory and processing units. While it is still unclear which kind of hardware will eventually address this problem, there is broad consensus that transitioning to new hardware will also require transitioning to new training algorithms, because of two central issues with backpropagation: it requires sequential forward and backward passes, and it needs to analytically compute gradients of a global cost to perform each synaptic update. These features demand digital hardware architectures that can precisely match forward and backward passes in a low-noise environment, as even minor fluctuations can propagate numerical errors that alter the final performance of the model. There is therefore an urgent need for alternative, hardware-friendly, neuroscience-inspired learning algorithms that would allow us to reach brain-like efficiency.
A promising research direction is the study of energy-based models and of a technique called equilibrium propagation [1], in which inference, followed by a parameter update, is performed by bringing a physical system to an equilibrium. If the physical system brought to equilibrium is analog hardware able to simulate hierarchical tensor multiplications, this technique can serve as an extremely efficient training algorithm for deep neural networks [2]. One such energy-based learning algorithm is inspired by the neuroscientific theory of predictive coding (PC) [3,4], a computational framework that models hierarchical information processing in the brain and that was a source of inspiration for the development of the free energy principle and active inference [5,6].
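For readers unfamiliar with PC, it may help to see the objective involved. In its simplest hierarchical Gaussian form (a standard textbook formulation, not notation taken from our paper), a predictive coding network minimises a sum of squared prediction errors, first relaxing its neural activities and then updating its weights on the same quantity:

```latex
% Hierarchical predictive coding energy (standard Gaussian form; the notation
% here is illustrative and not copied from the paper).
F(x, W) \;=\; \sum_{l=1}^{L} \tfrac{1}{2}\,\lVert \varepsilon_l \rVert^{2},
\qquad
\varepsilon_l \;=\; x_l - f\!\left(W_l\, x_{l-1}\right).

% Inference: gradient descent on F with respect to the activities x_l,
% with x_0 clamped to the input (and x_L clamped to the label when supervised):
\Delta x_l \;\propto\; -\frac{\partial F}{\partial x_l}
  \;=\; -\varepsilon_l \;+\; W_{l+1}^{\top}\!\left(f'(W_{l+1} x_l) \odot \varepsilon_{l+1}\right).

% Learning: once the activities have relaxed, a local update of the weights
% on the same energy:
\Delta W_l \;\propto\; -\frac{\partial F}{\partial W_l}
  \;=\; \left(f'(W_l x_{l-1}) \odot \varepsilon_l\right) x_{l-1}^{\top}.
```

Both updates are local: each activity and each weight only needs the prediction errors of adjacent layers, which is what makes the scheme attractive for analog and neuromorphic hardware.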
State of the Art
While a large amount of research has revealed interesting properties of this class of methods, little effort has gone into making them work at scale, a key feature that made backpropagation so powerful. The scalability issue has been neglected for several reasons. Firstly, it is a complex problem: why PC matches backpropagation (BP) on small-scale tasks, such as convolutional neural networks on CIFAR-10, but fails to do so for larger models remains unclear. Understanding and addressing this problem would be extremely beneficial to the field, as it would show that neuroscience-inspired learning algorithms can work on large-scale tasks. Secondly, the absence of specialized libraries makes the training of PC models very slow: a complete hyperparameter search on a small convolutional network can take hours. Thirdly, the lack of a shared framework makes results hard to reproduce.
Contributions
To address this, in a collaboration with the University of Oxford and the Vienna University of Technology, we tackle the problem of scale with a call to arms to invest more research time in the field, and we lay down the foundations for future work with three main contributions, which we call tool, benchmarks, and analysis. As a tool, we release PCX, an open-source JAX library for accelerated predictive coding training, whose development was led by Luca Pinchetti (Oxford). The library offers a user-friendly interface and extensive tutorials, ensuring efficiency and ease of use. As benchmarks, we design a large set of tasks and training methods that can (and should!) be used to evaluate future PC variants and to analyze PC performance. By taking advantage of the efficiency of PCX, which allowed us to try a large number of hyperparameter combinations, we achieve state-of-the-art results on all of the proposed benchmarks. Finally, our analysis includes a discussion of the results, highlighting areas for improvement and providing insights into how PC performs learning and inference, thus paving the way for future research.
Tool
PCX is built on JAX, with a focus on performance and versatility.
- Compatibility: It uses a functional approach and is fully compatible with many JAX libraries and tools. It offers both a functional and an imperative object-oriented interface for building PCNs.
- Modularity: It has an object-oriented abstraction that provides modular primitives (a module class, vectorised nodes, optimizers, and layers) that can be combined to create a PCN.
- Efficiency: It extensively relies on just-in-time compilation, which can lead to significant speed-ups.
PCX offers a unified interface to test multiple variations of PC on several tasks. Our modular codebase can easily be expanded in the future.
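To give a concrete sense of the computations PCX accelerates, here is a minimal sketch of predictive coding training written in plain JAX. It deliberately does not use the PCX API: the layer sizes, step sizes, and function names are invented for illustration, and it implements the relax-the-states-then-update-the-weights scheme described above.

```python
# A minimal predictive coding sketch in plain JAX. This is NOT the PCX API;
# it only illustrates the inference/learning loop that the library accelerates.
# Layer sizes, step sizes, and names are invented for the example.
import jax
import jax.numpy as jnp

def energy(states, weights, x, y):
    """Sum of squared prediction errors, with input and output layers clamped."""
    activities = [x] + list(states) + [y]            # x_0 = input, x_L = label
    F = 0.0
    for l, W in enumerate(weights):
        pred = jnp.tanh(activities[l] @ W)           # layer l predicts layer l+1
        F = F + 0.5 * jnp.sum((activities[l + 1] - pred) ** 2)
    return F

@jax.jit
def train_step(weights, x, y, n_inference=20, lr_x=0.05, lr_w=1e-3):
    # Initialise the hidden states with a feed-forward sweep.
    states, h = [], x
    for W in weights[:-1]:
        h = jnp.tanh(h @ W)
        states.append(h)
    # Inference: relax the hidden states by gradient descent on the energy.
    for _ in range(n_inference):
        g = jax.grad(energy, argnums=0)(states, weights, x, y)
        states = [s - lr_x * gs for s, gs in zip(states, g)]
    # Learning: a single local update of the weights on the same energy.
    gw = jax.grad(energy, argnums=1)(states, weights, x, y)
    return [W - lr_w * gW for W, gW in zip(weights, gw)]

# Toy usage on a dummy batch.
key = jax.random.PRNGKey(0)
sizes = [784, 256, 128, 10]
weights = [0.05 * jax.random.normal(k, (m, n))
           for k, m, n in zip(jax.random.split(key, 3), sizes[:-1], sizes[1:])]
x = jnp.ones((32, 784))
y = jax.nn.one_hot(jnp.zeros(32, dtype=jnp.int32), 10)
weights = train_step(weights, x, y)
```

In PCX the same loop is expressed through the vectorised node and layer primitives listed above and compiled end to end, which is where the library's speed-ups come from.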
Benchmark
The benchmarks we propose are taken from computer vision and cover image classification (supervised learning) and image generation (unsupervised learning), chosen for their simplicity and popularity in the field. We compare PC against standard deep learning models trained with backprop in a controlled setting, meaning that we always use models of the same complexity. On challenging datasets like CIFAR-100 and Tiny ImageNet, the results show that PC reaches the same performance as standard models trained with backprop on convolutional models with 5 and 7 layers. However, on deeper models, such as convolutional networks with 9 layers or ResNets, the performance of our PC models decreases while that of models trained with backprop increases, as can be seen from the table below.
This leaves us with two conclusions. First, the results obtained in these experiments are promising, given the complexity of the models and the data and the fact that we are not using backpropagation but brain-inspired computations. Second, while promising, these methods still do not work at the scale we would like, and more research is needed to understand the underlying reason and address it. The image below shows the test accuracy on CIFAR-10 using iPC, the best-performing variant of predictive coding, on a ResNet18. The results show that the more layers we add, the worse the performance becomes. We believe future effort should be dedicated to inverting this trend, that is, to obtaining an energy-based model trained with a neuroscience-inspired algorithm that has the same scaling properties as backprop.
Analysis
We then tried to understand which problems prevent PC from scaling to larger tasks by running multiple experiments that study the internal neuronal dynamics occurring during learning. The main problem we identify is that the energy of the model concentrates in the last layers, making it hard for the inference process to propagate this energy back to the first layers. In detail, we observe that the energy in the last layer is orders of magnitude larger than the one in the input layer, even after performing several inference steps. To better understand how energy propagation relates to the performance of the model, we analyzed both the test accuracy and the ratio of the energies of subsequent layers as a function of multiple parameters, such as the learning rate of the states γ. The results, reported in Fig. (a,b), show that small learning rates lead to better performance but also to large energy imbalances among layers, a problem that leads to exponentially small gradients as the depth of the model increases. In detail, (a) shows the decay in test accuracy when increasing the learning rate of the states γ, tested with both SGD and Adam; (b) plots the imbalance between the energies of the layers. The figures are obtained using a three-layer model on FashionMNIST.
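As a purely illustrative example of this diagnostic (not the exact code used to produce the paper's figures), the per-layer energies and their ratios can be read directly off the prediction errors; the snippet below reuses the conventions of the plain-JAX sketch above.

```python
# Illustrative per-layer energy diagnostic (not the paper's exact code).
# E_l is the prediction-error energy of layer l; the ratio E_{l+1} / E_l
# quantifies how unevenly the total energy is distributed across layers.
import jax.numpy as jnp

def layer_energies(states, weights, x, y):
    activities = [x] + list(states) + [y]
    return jnp.stack([
        0.5 * jnp.sum((activities[l + 1] - jnp.tanh(activities[l] @ W)) ** 2)
        for l, W in enumerate(weights)
    ])

def energy_ratios(states, weights, x, y):
    E = layer_energies(states, weights, x, y)
    return E[1:] / (E[:-1] + 1e-12)   # values >> 1: energy piles up in later layers
```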
Conclusion
Our work introduces PCX, an open-source library that leverages JAX's efficiency for deep learning tasks with PCNs. Its user-friendly setup and comprehensive tutorials make PCN training accessible to those familiar with frameworks like PyTorch. Our extensive comparative study reveals that predictive coding networks, when using small and medium architectures like VGG-7, rival standard deep learning networks trained with BP. However, as model size increases, the performance of predictive coding falls short of BP's scalability. With this work we have laid the foundation for performing equilibrium propagation (with a predictive coding energy) at scale, which clearly highlights the next steps and concrete milestones: Can we train deep predictive coding models on complex datasets, such as ResNets on ImageNet? Can we train large models on different modalities, such as graph neural networks and small transformer models, using learning algorithms that can be implemented on some kind of low-energy neuromorphic hardware? Can we do machine learning at scale using a learning algorithm initially developed to model hierarchical information processing in the brain?
The paper may be found at https://arxiv.org/abs/2407.01163
[1] Scellier, Benjamin, and Yoshua Bengio. "Equilibrium propagation: Bridging the gap between energy-based models and backpropagation." Frontiers in computational neuroscience 11 (2017): 24.
[2] Kendall, Jack, et al. "Training end-to-end analog neural networks with equilibrium propagation." arXiv preprint arXiv:2006.01981 (2020).
[3] Rao, Rajesh PN, and Dana H. Ballard. "Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects." Nature neuroscience 2.1 (1999): 79-87.
[4] Friston, Karl. "Does predictive coding have a future?" Nature neuroscience 21.8 (2018): 1019-102.
[5] Friston, Karl. "The free-energy principle: a unified brain theory?" Nature reviews neuroscience 11.2 (2010): 127-138.
[6] Parr, Thomas, Giovanni Pezzulo, and Karl J. Friston. Active inference: the free energy principle in mind, brain, and behavior. MIT Press, 2022.