A VeryGood[tm] name for a VeryGood[tm] Algorithm.
In a previous document I described how Bayesian models can recursively update, which makes them an ideal starting point for designing streaming machine learning models. In this document I will describe a different method, proposed by Crammer et al., which takes a passive-aggressive approach to model updates. I will focus on intuition first before moving on to the mathy bits. At the end I will demo some scikit-learn code with implementations of these models.
Let’s say you’re doing a regression with a single data point \(d_i\). If you only have one datapoint you don’t know which line is best, but we can still come up with lines that fit the data point perfectly. For example, the yellow line passes through the point perfectly and so does the blue line. There are many lines we could come up with, but we’ll keep these two in mind.
We can describe all the lines that go through the point perfectly by describing them in terms of their slope and intercept. If we make a plot of the weight space of the linear regression (\(w_0\) is the constant and \(w_1\) is the slope) we can describe all the possible perfect fits with a line. You’ll note that our original blue line is now denoted with a blue dot, but we are referring to the same thing. Same for the yellow line and yellow dot.
Any point on that line is just as good as far as \(d_i\) is concerned, so how might we go about selecting the optimal one? How about we compare it to the weights that the regression had before it came across this one point? Let’s call these weights \(w_{\text{orig}}\); in the case of this being the first datapoint we might imagine that \(w_{\text{orig}}\) sits at the origin. In that case the blue regression seems better than the yellow one, but is it the best choice?
This is where we can use maths to find the coordinate on the line that is as close as possible to our original weights \(w_{\text{orig}}\). In a simple linear system you can get away with a bit of linear algebra, but depending on what you are trying to do you may need more maths to keep the update rule consistent.
To prevent the system from becoming numerically unstable we may also choose to introduce a limit on how large the step size may be (no larger than \(C\)). This way, we don’t massively overfit to outliers. Also, we probably only want to update our model when the algorithm makes a sufficiently large mistake. We then make a somewhat aggressive update and remain passive at other times. Hence the name! Note that this approach works slightly differently for systems that do linear classification, but the idea of passive-aggressive updating can still be applied.
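To make the geometry concrete, here is a minimal numpy sketch of that capped projection for a single data point. The function name pa_step and the two-weight setup (\(w_0\), \(w_1\)) are illustrative choices of mine, not scikit-learn’s API.

import numpy as np

# Sketch of the intuition above: for one data point (x_i, y_i) and a line
# y = w0 + w1 * x, every weight vector w with w . a == y_i fits the point
# perfectly, where a = (1, x_i). We project the current weights onto that
# set of perfect fits, but never step further than C.
def pa_step(w_orig, x_i, y_i, C=1.0):
    a = np.array([1.0, x_i])               # direction set by the data point
    residual = y_i - w_orig @ a            # how far off the current line is
    tau = min(C, abs(residual) / (a @ a))  # step size, capped at C
    return w_orig + np.sign(residual) * tau * a

w_orig = np.zeros(2)                       # start at the origin, as in the story above
print(pa_step(w_orig, x_i=2.0, y_i=3.0))   # closest weights that fit this single point

With a large enough \(C\) the new weights land exactly on the line of perfect fits; with a small \(C\) they only move part of the way there.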
For those who are interested in the formal maths: check the appendix.
It should be relatively easy to implement this algorithm, but you don’t need to because scikit-learn has support for it. Scikit-learn is great; be sure to thank the people who contribute to the project.
For the regression task, let’s see how well the algorithm performs on some simulated data, compared to a normal, batch-oriented linear regression.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, PassiveAggressiveRegressor

# Simulated regression data, plus the true coefficients.
X, y, w = make_regression(n_features=3, n_samples=2000, random_state=42, coef=True, noise=1.0)

# Batch baseline: fit once on all the data and record the summed absolute error.
mod_lm = LinearRegression()
mod_lm.fit(X, y)
batch_acc = np.abs(mod_lm.predict(X) - y).sum()

start_c = <value>  # maximum step size during the first iterations
warm_c = <value>   # maximum step size once the model is "warm"

mod_pa = PassiveAggressiveRegressor(C=start_c, warm_start=True)
acc = []
coefs = []
for i, x in enumerate(X):
    # Feed the points one by one, as if they arrived in a stream.
    mod_pa.partial_fit([x], [y[i]])
    acc.append(np.abs(mod_pa.predict(X) - y).sum())
    coefs.append(mod_pa.coef_.flatten())
    if i == 30:
        mod_pa.C = warm_c
You’ll notice in the code that I’ve added a starting value for \(C\) (start_c) and a value for when the model has partly converged (warm_c). This code is for illustration; it is a rather arbitrary assumption that the algorithm is “warm” after 30 iterations.
You can see in the plots below what the effect of this is.
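If you want to recreate the plots yourself, something like the following matplotlib snippet will do; the layout choices here are mine, not part of the original setup.

import matplotlib.pyplot as plt

# Left: total absolute error after each streamed point, with the batch baseline.
# Right: how the three coefficients evolve while streaming.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(acc)
ax1.axhline(batch_acc, color="orange")
ax1.set_title("error while streaming")
ax2.plot(np.array(coefs))
ax2.set_title("weights while streaming")
plt.show()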
\[c_{\text{start}} = 1, c_{\text{warm}} = 1\]
The first plot shows the summed absolute error over the entire set (the acc value computed above) after the regression has seen more data. The orange line demonstrates the baseline performance of the batch algorithm. The second plot shows how the weights change over time. In this setting you can confirm that the error fluctuates quite a bit.
\[c_{\text{start}} = 0.1, c_{\text{warm}} = 0.1\]
The fluctuations are small, but the algorithm seems to need a lot of data before the regression starts to become sensible.
\[c_{\text{start}} = 3, c_{\text{warm}} = 0.1\]
You can now see that the fluctuations are still very small, but the large steps that are allowed in the first few iterations let the algorithm make big corrections globally before it limits itself to small, local changes.
We can repeat this exercise for classification too.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier

# Simulated two-dimensional classification data.
X, y = make_classification(n_samples=4000, n_features=2, n_redundant=0,
                           random_state=42, n_clusters_per_class=1)

# Batch baseline: number of correctly classified points for a logistic regression.
mod_lmc = LogisticRegression()
normal_acc = np.sum(mod_lmc.fit(X, y).predict(X) == y)

start_c = <value>  # maximum step size during the first iterations
warm_c = <value>   # maximum step size once the model is "warm"

mod_pac = PassiveAggressiveClassifier(C=start_c)
acc = []
coefs = []
for i, x in enumerate(X):
    # Again feed the points one by one, telling partial_fit which classes exist.
    mod_pac.partial_fit([x], [y[i]], classes=[0, 1])
    coefs.append(mod_pac.coef_.flatten())
    acc.append(np.sum(mod_pac.predict(X) == y))
    if i == 30:
        mod_pac.C = warm_c
The code is very similar; the only difference is that we are now working on a classification problem.
\[c_{\text{start}} = 1, c_{\text{warm}} = 1\]
The first plot shows performance again, but this time it is measured by accuracy. The second plot shows the weights again. In this setting you see a lot of noise and the performance is all over the place. The effect is stronger than in the regression case with \(c_{\text{start}} = 1, c_{\text{warm}} = 1\). This is because the optimal weights are a lot smaller: in the regression case they were around 30-40 while here they are around 2.5 and 1, which means that a maximum step size of 1 is relatively large.
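If you want to check that claim yourself, you can inspect the fitted coefficients of the models from the snippets above.

# The regression weights are an order of magnitude larger than the classifier
# weights, so a cap of C = 1 means something very different in the two experiments.
print(mod_lm.coef_)   # batch regression weights, roughly 30-40 as quoted above
print(mod_pac.coef_)  # classifier weights after streaming, roughly 2.5 and 1 as quoted above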
\[c_{\text{start}} = 0.1, c_{\text{warm}} = 0.1\]
The performance is much better, especially when you consider that the y-axes on the charts are different.
It seems like this \(C\) hyperparameter is something to keep an eye on if these sorts of algorithms interest you.
It is nice to see that we’re able to have a model work in a streaming setting. I do wonder how often this is useful though, mainly because you need a label to be available in the stream to actually learn from it. And if the label is already available in the stream, why would you need to predict it?
Then again, all passive-aggressive linear models have a nice and simple update rule, and this can be extended to include matrix factorisation models. For large webshops this is very interesting because it suggests a method where you may be able to update per click. For a nice hint at this, check out this presentation from Flink Forward.
The Bayesian in me can’t help but see an analogy to something Bayesian happening here too. We update our prior belief (the weights at time \(n\)) if we see something that we did not expect. Doing this repeatedly gives a somewhat recursive model that feels very similar to Bayesian updating.
\[ p(w_{n+1}| D, d_{n+1}) \propto p(d_{n+1} | w_n) p(w_n)\]
Bayesian updating seems more appealing because we get to learn from every datapoint and it comes with built-in regularisation (it removes both the passive and the aggressive aspect from the update rule). The main downside of the Bayesian route is the datastructure you need to maintain, which is far simpler with the passive-aggressive approach.
Hopefully the intuition makes sense by now which means we can discuss some of the formal maths. Feel free to skip if it does not appeal to you. Again, you can find more details in the original paper found here.
We will now discuss linear regression; I’ll leave the classification case to the paper, but the proofs are very similar. In this task we are looking for new weights \(\mathbf{w}\) while we currently have weights \(\mathbf{w}_t\). We will assume that at the current timestep we have a true label \(y_t\) that belongs to the data \(\mathbf{x}_t\) we observe.
We have just observed this new point \((\mathbf{x}_t, y_t)\) and we’re interested in making the smallest step in the correct direction such that we fit it perfectly. In maths, we’d like to find the \(\mathbf{w}_{t+1}\) that is defined via:
\[ \mathbf{w}_{t+1} = \text{argmin}_{\mathbf{w}} \frac{1}{2} ||\mathbf{w} - \mathbf{w}_t||^2 \]
We do want to make sure that we adhere to our constraint: the new datapoint needs to be fitted perfectly, such that \(l(\mathbf{w}; (\mathbf{x}_t, y_t)) = 0\).
Here \(l(\mathbf{w}; (\mathbf{x}_t, y_t))\) is the loss function of the regression. Let’s define that more properly. The idea is that we only update our system when our prediction makes an error that is too big.
If the error is greater than the margin \(\epsilon\), the loss is
\[ l_t^R = l(\mathbf{w}; (\mathbf{x}_t, y_t)) = |\mathbf{w} \mathbf{x}_t - y_t| - \epsilon \]
and if the prediction is within the margin \(\epsilon\) the loss is simply zero, so the weights stay put (the passive part).
To optimise this system we’re going to use a Lagrangian trick. It means that we are going to introduce a parameter \(\tau\) which acts as a ‘punishment variable’. This relieves us of our constraint because we pull it into the function that we are optimising: if we search in a region that violates the constraint, we are punished by a value of \(\tau\) for every unit of constraint violation.
With this trick, we can turn our problem into this;
\[ L^R(\mathbf{w}, \tau) = \frac{1}{2} ||\mathbf{w} - \mathbf{w}_t||^2 + \tau l_t^R \]
Now, in order to optimise, we differentiate with respect to \(\mathbf{w}\) and set the result to zero.
\[ \frac{\partial L^R}{\partial \mathbf{w}} = \mathbf{w} - \mathbf{w}_t + \tau \mathbf{x}_t \frac{\mathbf{w}\mathbf{x}_t - y_t}{|\mathbf{w}\mathbf{x}_t - y_t|} = 0 \]
\[ \mathbf{w} = \mathbf{w}_t - \tau \mathbf{x}_t \frac{\mathbf{w}\mathbf{x}_t - y_t}{|\mathbf{w}\mathbf{x}_t - y_t|} \]
We now have an optimum for \(\mathbf{w}\) but we still have the variable \(\tau\) hanging around. Let’s substitute our newfound \(\mathbf{w}\) back into \(L^R(\mathbf{w}, \tau)\). The first term becomes \(\frac{1}{2}\tau^2||\mathbf{x}_t||^2\) because the sign factor squares away. In the loss term, the substitution moves the prediction a distance \(\tau ||\mathbf{x}_t||^2\) towards \(y_t\), so \(|\mathbf{w}\mathbf{x}_t - y_t| = |\mathbf{w}_t\mathbf{x}_t - y_t| - \tau||\mathbf{x}_t||^2\). Putting this together:
\[\begin{equation} \label{eq1} \begin{split} L^R(\tau) & = \frac{1}{2}\tau^2||\mathbf{x}_t||^2 + \tau \left( |\mathbf{w}_t \mathbf{x}_t - y_t| - \tau||\mathbf{x}_t||^2 - \epsilon \right) \\ & = -\frac{1}{2}\tau^2||\mathbf{x}_t||^2 + \tau\, l_t^R \end{split} \end{equation}\]
Once again, we have something we can differentiate. Differentiating with respect to \(\tau\) and setting the result to zero gives the optimal value for \(\tau\).
\[ \frac{\partial L^R}{\partial \tau} = -\tau ||\mathbf{x}_t||^2 + l_t^R = 0 \quad \Rightarrow \quad \tau = \frac{l_t^R}{||\mathbf{x}_t||^2}\]
You can even introduce a maximum step size \(C\) if you want.
\[\mathbf{w}^* = \mathbf{w}_t - \tau \mathbf{x}_t \frac{\mathbf{w}\mathbf{x}_t - y_t}{|\mathbf{w}\mathbf{x}_t - y_t|}\] \[ \tau_t^* = \text{min} \left(C, \frac{l_t^R}{||\mathbf{x}_t||^2} \right) \]
We have closed-form solutions! For a more formal deep dive I’ll gladly refer you to the original work.
I think the proof might’ve been simpler with mere linear algebra but this is the proof the paper supplied.
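To make the closed-form update concrete, here is a small numpy sketch of the regression update derived above. The function name and the toy data are mine; sklearn’s PassiveAggressiveRegressor is the implementation you would actually use.

import numpy as np

# One passive-aggressive regression update: stay passive while the error is
# within epsilon, otherwise take a step of size tau = min(C, loss / ||x||^2)
# towards a perfect fit, as in the closed-form solution above.
def pa_regression_update(w, x, y, eps=0.1, C=1.0):
    error = w @ x - y
    loss = max(0.0, abs(error) - eps)     # epsilon-insensitive loss l_t^R
    if loss == 0.0:                       # passive: the prediction is good enough
        return w
    tau = min(C, loss / (x @ x))          # capped step size tau_t^*
    return w - tau * np.sign(error) * x   # aggressive: move towards the new point

# Tiny noiseless check: the weights should drift towards the true ones.
rng = np.random.default_rng(42)
true_w = np.array([2.0, -1.0, 0.5])
w = np.zeros(3)
for _ in range(200):
    x = rng.normal(size=3)
    w = pa_regression_update(w, x, true_w @ x)
print(w)  # close to true_w, up to the epsilon tolerance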