Toward Loss Functions that Better Track Intelligence

by Nick Alonso

The standard way of measuring performance in deep learning is through a test loss: a neural network is trained to perform some task, and the goal is to get the network to reduce the loss as much as possible on a subset of the data that is not presented during training (i.e., the test set). A low test loss means the model performs the task well when applied to new data points that were not previously observed. The test loss, however, does not account for how quickly the model learns. It provides a measure of how well the model is performing at the time the test is performed but ignores the time and effort it took the model to get to that performance.

This way of measuring performance is fine if our only interest is in having some model eventually reach a desired state of competence. However, we often care about the computation, energy, and time costs associated with training the model. Further, the test loss seems to me largely indifferent to an essential aspect of intelligence: learning efficiency. Simple examples and existing theories of intelligence make clear that intelligent systems are not just those that eventually reach some state of high competence on some set of tasks. They are those that also do so efficiently, where ‘efficiency’ roughly corresponds to the amount of data the model needs to learn a task, which in turn correlates with training speed, energy costs, etc. If this is right, it has interesting implications, e.g., recent versions of GPT are not as intelligent as they may seem at first impression. These models may be incredibly competent at generating human-sounding language in response to prompts and incredibly knowledgeable, but they are too data hungry to come near the intelligence of more efficient learners, like humans and some complex animals.

Why does learning efficiency matter for intelligence? Here’s one example. Most people think child prodigies are highly intelligent. Consider the chess prodigy who plays competitively with professionals three times his age, or the child who graduates high school at age nine. Why is it so obvious these children are smarter than most others? It is not the knowledge and skills they have. That is, it is not their competence at certain tasks that is so impressive. Though the chess prodigy may be very good, many professionals have equal or better skill, and it is unlikely the nine-year-old high school graduate has more factual knowledge or skills than much older high school and college graduates. The reason we universally acknowledge the intelligence of these children is how efficiently they learn compared to most other people. They acquire more knowledge and skill from less data and less practice than just about everyone else. We admire not the end result, but the ease and speed with which they achieved it. Thus, any theory of intelligence must account not just for knowledge/competence/skill, but also for the efficiency with which that knowledge and skill is acquired. Examples like these have motivated some researchers to develop formal theories of intelligence that explicitly take this intuitive notion of efficiency into account (e.g., [1]).

The test loss measure of performance, therefore, misses a crucial ingredient of intelligence: it is not sensitive to how efficiently the model learns. How might we better track intelligence? We could begin with a formal theory of intelligence and derive a measure from it. In practice, however, some of these theories may be cumbersome, since many require expensive computations, which limits how frequently we can track a model’s progress. Further, these theories make very specific claims and do not all agree, which raises the question of which theory, if any, gets the details right.

Alternatively, I suggest the cumulative loss may provide a simpler, easy-to-compute measure that tracks intelligence better than the test loss does. I do not claim the cumulative loss is a measure of intelligence, but I think it may correlate with intelligence better than the test loss. The cumulative loss is simple. Let \mathcal{L}^t be some loss measure computed at training iteration t, given the data at t and the parameters \theta^{t-1} computed at the previous iteration. The cumulative loss is just

\mathcal{L}_{cumulative} = \frac{1}{T} \sum^T_{t=1} \mathcal{L}^t

That is, the cumulative loss is just the losses computed at each training iteration averaged together. Notice that this loss is very easy to compute and is task general. Unlike the test loss, however, it is sensitive to how the model performs early in training. Consider two models/neural networks. Model 1 learns slowly, so it has high loss early in training, while model 2 trains quickly. Both converge to the same loss by the end of training. As desired, the second model (the more efficient learner) would have a lower (better) cumulative loss, since it has lower losses early in training and those losses are incorporated into the cumulative loss.
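
To make this concrete, here is a minimal Python sketch that compares the two measures on made-up learning curves. The curves, decay rates, and final losses are purely illustrative stand-ins for the two hypothetical models above.

import numpy as np

# Hypothetical per-iteration losses for two models that converge to (nearly)
# the same final loss, but model 2 reduces its loss much faster.
T = 100
t = np.arange(1, T + 1)
loss_model_1 = 0.1 + 2.0 * np.exp(-t / 15.0)  # slow learner
loss_model_2 = 0.1 + 2.0 * np.exp(-t / 4.0)   # fast learner

# A test-loss-style measure only looks at performance at the end of training,
# where the two models are nearly indistinguishable.
print("final loss, model 1:", loss_model_1[-1].round(3))
print("final loss, model 2:", loss_model_2[-1].round(3))

# The cumulative loss averages the per-iteration losses over all of training,
# so the faster learner (model 2) gets the lower (better) value.
print("cumulative loss, model 1:", loss_model_1.mean().round(3))
print("cumulative loss, model 2:", loss_model_2.mean().round(3))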

Now consider the case where both models learn with equal efficiency (e.g., both converge or nearly converge after the same number of updates) but model 1 converges to a better loss. Model 1 will have a lower cumulative loss in this case, which seems to correctly track the model we would say is more intelligent.

Finally, consider the case where model 1 converges much more quickly than model 2, but model 2 converges to a better loss by the end of training. Which model is more intelligent? My intuition is that it is not clear, and the same is true of the cumulative loss: from the details given, it is not clear which model would have the better cumulative loss. More details are needed: how much more quickly model 1 reduced the loss than model 2, how much better model 2’s performance was by the end of training, and how long the two trained for. I suspect most people will have mixed feelings in these cases about which model is more intelligent, which makes this a theoretical gray area. In such cases, a loss measure that tracks intelligence should assign the two models similar values, and this is just what the cumulative loss does: the two models will have similar cumulative losses, which is why we cannot tell from the details given which is higher. It would be interesting to see empirically how well this measure tracks common judgments and various theories of intelligence, but from this bit of analysis it seems a good start.

References

Chollet, F. (2019). On the measure of intelligence. arXiv preprint arXiv:1911.01547.

Online Machine Learning: What it is and Why it Matters

Introduction

In the field of deep learning, the standard training paradigm is offline learning. During offline learning 1) parameter updates are mini-batched (i.e., updates from multiple datapoints are averaged together each iteration) and 2) training is performed for multiple epochs (i.e., multiple passes over the training dataset). The goal is to get the neural network to minimize some loss function on a test/hold out set of data as much as possible by the end of training. The tacit assumption behind offline training is that neural networks are trained first on a stored/locally generated dataset, then deployed to perform the task on some new data (e.g., only after AlphaGo was trained did it then compete against professional human Go players in publicly viewed matches).

Though offline learning is currently standard, I’ve recently come to appreciate the importance of online learning, a learning scenario which is distinct in important ways from offline learning. Online learning better describes the learning scenario humans and animals face and it has an increasing number of applications in machine learning. However, it seems to me most researchers in deep learning and computational neuroscience are not working directly or thinking deeply about online learning. Part of the issue may be that the term ‘online learning’ is often not well defined in the literature. Additionally, I suspect many believe the differences between offline and online learning are not mathematically interesting, or they believe that the best performing algorithms and architectures for offline scenarios will also be the best in online scenarios. However, there are reasons to believe this is not quite right. In what follows, I describe what online learning is, how it is different in interesting ways from offline learning, and why neuroscientists and machine learning researchers should care about it.

What is Online Machine Learning?

First, it should be said that ‘online learning’ is not synonymous with ‘continual learning’. Continual learning is the learning scenario where a model must learn to reduce some loss across multiple tasks whose data are not independent and identically distributed (i.i.d.) (Hadsell et al., 2020). For example, it is common in continual learning scenarios to present tasks sequentially in blocks, such that the model is first presented with data from one task, then data from a second task, and so on. The difficulty is in preventing the model from forgetting previously learned tasks when presented with new ones. Continual learning is a popular topic in deep learning, and many high-performing solutions have been proposed.

Formal work on online learning often assumes the data are i.i.d., although combined online-continual learning scenarios have also been studied (e.g., Hayes et al., 2022). Informally, online learning can be described as having at least the following properties: 1) at each training iteration a single datapoint is presented to the model, 2) each datapoint is presented to the model only once (one epoch of training), and 3) the model’s goal is to minimize the loss averaged over every training iteration (called the cumulative loss, defined below). For examples of classic/widely cited papers that use this description, see Crammer et al. (2006), Daniely et al. (2015), and Shalev-Shwartz (2012). This is opposed to offline scenarios, where multiple datapoints (i.e., mini-batches or batches) are presented each iteration, each datapoint is presented multiple times (i.e., multiple epochs of training), and the model’s goal is to minimize the loss on some hold-out/test data as much as possible by the end of training.

Formally, the online learning objective is the cumulative loss. Consider the scenario where a model is given a single datapoint x^t and prediction target y^t at each iteration t. The model, parameterized by \theta^{t-1}, must try to predict y^t given x^t as input. After the prediction is output, feedback is provided in the form of a loss \mathcal{L}(y^t, \hat{y}^t, \theta^{t-1}), where \hat{y}^t is the prediction generated by \theta^{t-1} given x^t. In this case the cumulative loss is:

\mathcal{L}_{cumulative} = \frac{1}{T}\sum_{t=1}^T \mathcal{L}(y^t, \hat{y}^t , \theta^{t-1}).

The cumulative loss is the average loss produced by the model during training on each datapoint as it was received in the sequence. In order to achieve a good cumulative loss the model must not only perform well eventually but must also improve its performance quickly, since the losses produced at early iterations are factored into the final cumulative loss. A related quantity called the ‘regret’ is also often used in online scenarios: the regret is the cumulative loss minus the cumulative loss that would have been achieved by the optimal parameters (e.g., parameters pretrained offline).
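
Here is a minimal sketch of this predict-then-update protocol in Python. The linear model, squared-error loss, learning rate, and synthetic data stream are assumptions chosen purely for illustration, not a prescribed setup.

import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 1000
w_true = rng.normal(size=d)      # unknown process generating the targets
theta = np.zeros(d)              # model parameters, theta^0
lr = 0.01
cumulative = 0.0
for t in range(1, T + 1):
    x_t = rng.normal(size=d)                   # a single datapoint arrives
    y_t = w_true @ x_t + 0.1 * rng.normal()    # its prediction target
    y_hat = theta @ x_t                        # predict using theta^{t-1}
    loss_t = 0.5 * (y_hat - y_t) ** 2          # feedback L(y^t, y_hat^t, theta^{t-1})
    cumulative += loss_t
    theta = theta - lr * (y_hat - y_t) * x_t   # one update; each datapoint is seen once
print("cumulative loss:", cumulative / T)      # (1/T) * sum over t of L^t

The regret would subtract from this the cumulative loss that a fixed set of optimal parameters (here, w_true) would have produced on the same sequence.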

Compare this to offline learning, where the loss is averaged/summed over a hold-out/test dataset that is not used to train the model:

\mathcal{L}_{test} = \frac{1}{N}\sum_{n=1}^N \mathcal{L}(x_{test}^n, y_{test}^n, \theta^T),

where n indexes the datapoints in the test set and \theta^T are the parameters at iteration T. The test loss describes how well the model is doing right now, at the current training iteration T. Thus, minimizing the test loss only requires that the model perform well eventually, by the end of training. It does not factor in how the model performed at earlier training iterations. For the same reasons, the cumulative loss is distinct from the training loss, which describes how the current parameters \theta^T perform on the entire training dataset.

Some examples of online learners include spam filters, surveillance devices, robots, autonomous vehicles, animal brains, and human brains. In all of these cases, the models are often assessed by how well they perform/learn while they are being deployed, and in all of these scenarios input data is generated one datapoint at a time, and no two datapoints are exactly the same (due to noise and the infinite variety of inputs these models could receive from the real world).

In addition to the properties listed above, online scenarios often face other problems that are not standard in offline scenarios, such as concept drift and imbalanced data. Concept drift refers to the event where the underlying processes generating the data change (e.g., a robot moves to a new environment) (for a review see Lu et al. (2018)). Imbalanced data refers to datasets where there are uneven numbers of training instances across classes/tasks (e.g., a robot in the desert may see 1,000 rocks for every one cactus). Dealing with these two scenarios is a common topic in the online learning literature (e.g., see Hayes et al. (2022)).
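
As a toy illustration, here is a Python sketch of a data stream that exhibits both problems; the drift point, class ratio, and distributions are made up for the example.

import numpy as np

rng = np.random.default_rng(0)

def stream(T=2000, drift_at=1000, p_rare=0.001):
    """Yield (x, y) one datapoint at a time. The classes are heavily
    imbalanced, and the input distribution shifts (concept drift) at
    iteration drift_at."""
    for t in range(T):
        y = int(rng.random() < p_rare)        # roughly one rare example per 1,000
        shift = 0.0 if t < drift_at else 3.0  # the generating process changes here
        x = rng.normal(loc=shift + 2.0 * y, size=4)
        yield x, y

# An online learner consumes this stream one (x, y) pair at a time, updating
# after each, without ever revisiting earlier datapoints.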

The Mathematical Distinction between Online and Offline Learning Matters

I suspect some machine learning researchers assume that any learning algorithm that performs well in offline learning scenarios will also perform well in online learning scenarios. This assumption is not necessarily true, and the mathematical distinction between the two learning scenarios makes clear why. Here are two important differences:

1. The online learning objective places more emphasis on reducing the loss quickly than offline learning objectives do. Minimizing the cumulative loss requires that a learning algorithm update the model not only to perform well eventually, but also to perform well early in training. Thus, the cumulative loss is affected by the rate at which the learning algorithm reduces the loss. The test loss is not directly affected by the speed of the algorithm: it only reflects how well the model is performing at the end of training (or the iteration at which it is measured), and usually one trains long enough for the model to converge. Thus, a slow algorithm that reduces the test loss well in offline scenarios may not reduce the cumulative loss well in online scenarios.

2. Convergence guarantees are less important in online learning scenarios than in offline learning scenarios. In offline scenarios, it is typically desired that a learning algorithm be guaranteed to converge to a minimum of the test loss after some number of training iterations (and epochs), since models can often be trained to convergence. In online scenarios, however, convergence is not always achievable, since the model makes only a single pass through the data, and in some cases there is not enough data for the model to converge. The same holds in infinite/streaming data scenarios when concept drift occurs and the model only trains for a finite period before the data distribution changes. In online scenarios, it may therefore be desirable to sacrifice convergence guarantees for gains in training speed and low computational overhead. An example of an algorithm that is not guaranteed to converge but performs well in online scenarios is winner-take-all clustering (a sketch of such an update appears after this list; see Hayes et al. (2020, 2022) and Zhong (2005) for examples of online winner-take-all clustering, and Neal and Hinton (1998) for discussion of why winner-take-all clustering lacks the convergence guarantees of expectation maximization).
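
Below is a minimal sketch of an online winner-take-all clustering update (in the spirit of online k-means): only the closest centroid moves toward each new datapoint. It is an illustrative toy, not a reproduction of the models in the cited papers.

import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 8
centroids = rng.normal(size=(k, d))  # randomly initialized cluster centers
counts = np.zeros(k)

def wta_update(x):
    """Move only the closest ('winning') centroid toward the new datapoint x."""
    j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    counts[j] += 1
    eta = 1.0 / counts[j]                     # per-cluster decaying step size
    centroids[j] += eta * (x - centroids[j])  # winner-take-all: only unit j updates
    return j

# A single pass over a stream: updates are cheap and fast, but there is no
# convergence guarantee of the kind expectation maximization provides.
for _ in range(1000):
    x = rng.normal(size=d) + rng.choice([-2.0, 0.0, 2.0])
    wta_update(x)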

These differences imply that not all good solutions to the offline learning problem are good solutions to the online problem. We can, for instance, imagine an algorithm that reduces the loss slowly but consistently converges to very good local minima when trained with large mini-batches for many epochs. This algorithm would be great for offline learning but non-ideal for online learning. Interestingly, the standard learning algorithm used in deep learning is backpropagation, which implements stochastic gradient descent (SGD). It is well known that SGD finds very good local minima in deep neural networks during offline training but trains slowly.

Understanding the Brain Requires Understanding Online Learning

Here’s an argument for why computational neuroscientists should care about online learning:

  1. Humans and animals learn online during most of their waking lives. It therefore seems safe to assume our brains evolved learning algorithms that deal well with the online scenario specifically.
  2. Solutions to the offline learning scenario will not always be the best solutions to the online scenario, for reasons noted above.
  3. Therefore, computational neuroscientists interested in understanding learning in the brain should focus on developing theories of biological learning based on good solutions to the online scenario, without assuming that solutions to the offline scenario will also work well.

Again, backpropagation, which implements SGD, is not clearly the best solution to online learning in deep neural networks. It is currently, in practice, the best solution we have for offline learning, but again SGD is slow and therefore may be non-ideal for online scenarios. There is thus some reason to look to other optimization procedures and algorithms as possible explanations for what the brain is doing.

Online Learning as a Standard Approach in Neuromorphic Computing

Neuromorphic chips are brain-inspired hardware that are highly energy efficient. They typically implement spiking neural networks and have properties similar to the brain’s. These chips are ideal for robots and embedded sensors that interact with the real world under tight energy constraints (e.g., finite battery power). Notice these are the same scenarios where online learning would be especially useful, i.e., an embedded system interacting with the real world in real time. Thus, there is motivation from a neuromorphic computing point of view to think hard about the online learning problem. Much work on spiking networks, it seems to me, is focused on adapting backprop to spiking networks. However, for reasons discussed above, it is not obvious that backpropagation is the ideal/gold-standard solution for online learning here. It may therefore be useful to look to other optimization methods or to develop new ones specifically for online learning on neuromorphic chips.

References

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms.

Daniely, A., Gonen, A., & Shalev-Shwartz, S. (2015, June). Strongly adaptive online learning. In International Conference on Machine Learning (pp. 1405-1411). PMLR.

Hadsell, R., Rao, D., Rusu, A. A., & Pascanu, R. (2020). Embracing change: Continual learning in deep neural networks. Trends in Cognitive Sciences, 24(12), 1028-1040.

Hayes, T. L., & Kanan, C. (2020). Lifelong machine learning with deep streaming linear discriminant analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 220-221).

Hayes, T. L., & Kanan, C. (2022). Online Continual Learning for Embedded Devices. arXiv preprint arXiv:2203.10681.

Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., & Zhang, G. (2018). Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12), 2346-2363.

Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models (pp. 355-368). Springer, Dordrecht.

Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2), 107-194.

Zhong, S. (2005, July). Efficient online spherical k-means clustering. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005. (Vol. 5, pp. 3180-3185). IEEE.