We give a demonstration of the capabilities of liquidSVM from a Python viewpoint.
More information can be found in the help (e.g. ?mcSVM
).
Disclaimer: liquidSVM and the Python-bindings are in general quite stable and well tested by several people. However, use in production is at your own risk.
If you run into problems please check first the documentation for more details, or report the bug to the maintainer.
easy_install --user --upgrade liquidSVM
pip install --user --upgrade liquidSVM
from liquidSVM import *
Some stuff we need for this notebook
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Load test and training data
reg = liquidData('reg-1d')
Now reg.train
contains the training data and reg.test
the testing data.
Both have the labels in its .target
and the feature is in .data
.
To train on the data and select the best hyperparameters do
(display=1
gives some information as the training progresses, but only on the command line, not in this jupyter notebook.)
model = lsSVM(reg.train,display=1)
Now you can test with any test set:
result, err = model.test(reg.test)
err[0,0]
We also can plot the regression:
plt.plot(reg.test.data, reg.test.target, '.')
x = np.linspace(-.2,1.4)
y = model.predict(x)
plt.plot(x, y, 'r-', linewidth=2)
plt.ylim(-.2,.8);
As a convenience, since reg
already contains .train
and .test
you can do the whole experiment in one line.
Then the result is stored in model.lastResult
:
model = lsSVM(reg, display=1)
result, err = model.lastResult
err[0,0]
banana = liquidData('banana-mc')
The following performs multi-class classification
model = mcSVM(banana.train)
print(banana.train.data.T.shape)
plt.scatter(banana.train.data[:,0],banana.train.data[:,1], c=banana.train.target)
x = np.arange(-1.1,1.1,.05)
X,Y = np.meshgrid(x, x)
z = np.array(np.meshgrid(x,x)).reshape(2,-1).T
print(x.shape,X.shape, z.shape)
Z = model.predict(z).reshape(len(x),len(x))
#contour(x,x,z, add=T, levels=1:4,col=1,lwd=4)
CS = plt.contour(X, Y, Z, 4, linewidth=4)
In this case err[:,0]
shows both the global miss-classification error as well the errors
of the underlying binary tasks, for more details see [Multiclass classification]:
result,err = model.test(banana.test)
err[:,0]
If data gets too big for the memory on your machine:
covtype = liquidData('covtype.5000')
model = mcSVM(covtype, display=1, useCells=True)
result, err = model.lastResult
err[0,0]
A major issue with SVMs is that for larger sample sizes the kernel matrix does not fit into the memory any more. Classically this gives an upper limit for the class of problems that traditional SVMs can handle without significant runtime increase. The concept of cells makes it possible to circumvent these issues.
If you specify useCells=True
then the sample space $X$ gets partitioned into
a number of cells.
The training is done first for cell 1 then for cell 2 and so on.
Now, to predict the label for a value $x\in X$ liquidSVM first finds out
to which cell this $x$ belongs and then uses the SVM of that cell to predict
a label for it.
We first consider a medium size sample of the covtype
data set.
liquidData
will download this from http://pnp.mathematik.uni-stuttgart.de/isa/steinwart/liquidData/:
co = liquidData('covtype.10000')
%time mcSVM(co.train);
%time mcSVM(co.train, useCells=True);
This is about 5 times faster! (The user time is about three times the elapsed time since we are using 2 threads.)
By using the partitioning facility of liquidSVM you can even bigger problems:
co = liquidData('covtype.50000')
%time mcSVM(co.train,useCells=True);
Note that with this data set useCells=F
here only works if your system has enough free memory (~26GB).
Even the full covtype
data set with over 460'000 rows
(about 110'000 samples retained for testing) is now treatable in under 9 minutes from within python:
co = liquidData('covtype-full')
%time mcSVM(co.train,useCells=True);
If you run into memory issues turn cells on:
useCells=True
. If you have less than 10GB of RAM usestore_solutions_internally=False
for the latter.
liquidSVM organizes its work into tasks: E.g. in multiclass classification the problem has to be reduced into several binary classification problems. Or in Quantile regression, the SVM is learned simultaneously for different weights and then the selection of hyperparameters produces different tasks.
Behind the scenes svm(formula, data, ...)
does the following:
model = SVM(data)
model.train(...)
model.select(...)
The following learning scenarios hide these in higher level functions.
Multiclass classification has to be reduced to binary classification There are two strategies for this:
Then for any point in the test set, the winning label is chosen. A second choice to make is whether the hinge or the least-squares loss should be used for the binary classification problems.
Let us look at the example dataset banana-mc
which has 4 labels.
Since there are 6 pairings, AvA
trains 6 tasks, whereas
OvA
trains 4 tasks:
banana = liquidData('banana-mc')
for mcType in ["AvA_hinge", "OvA_hinge", "AvA_ls", "OvA_hinge"]:
print("\n======", mcType, "======")
model = mcSVM(banana.train, mcType=mcType)
result, err = model.test(banana.test)
print("global err:", err[0,0])
print("task errs:", err[1:,0])
print(result[:3,])
The first element in the errors gives the overall test error. The other errors correspond to the tasks. Also the result displays in the first column the final decision for a test sample, and in the other columns the results of the binary classifications. One can see nicely how the final prediction vote for any sample is based on the 4 or 6 binary tasks.
NOTE
AvA
is usually faster, since every binary SVM just trains on the data belonging to only two labels. On the other handOvA_ls
can give better results at the cost of longer training time.OvA_hinge should not be used as it is not universally consistent.
If labels have values -1 or 1, then using the least-squares will estimate the conditional expectation. Hence, this can be used to estimate probabilities:
banana_bc = liquidData('banana-bc')
m = mcSVM(banana_bc.train, mcType="OvA_ls",display=1)
result, err = m.test(banana_bc.test)
probs = (result+1) / 2.0
print(probs[:5,:])
plt.hist(probs, 100);
And for multi-class classification it is similar:
banana = liquidData('banana-mc')
m = mcSVM(banana.train, mcType="OvA_ls",display=1)
result, err = m.test(banana.test)
probs = (result[:,1:]+1) / 2.0
print(result.shape, probs.shape)
print(np.hstack((result,probs))[:5,:].round(2))
This uses the quantile solver with pinball loss and performs selection for every quantile provided.
reg = liquidData('reg-1d')
quantiles_list = [ 0.05, 0.1, 0.5, 0.9, 0.95 ]
model = qtSVM(reg.train, weights=quantiles_list)
result, err = model.test(reg.test)
err[:,0]
plt.plot(reg.test.data,reg.test.target,'.')
plt.ylim(-.2,.8)
x = np.arange(-0.2,1.4,0.05).reshape((-1,1))
lines = model.predict(x)
for i in range(len(quantiles_list)):
plt.plot(x, lines[:,i], '-', linewidth=2)
In this plot you see estimations for two lower and upper quantiles as well as the median of the distribution of the label $y$ given $x$.
This uses the expectile solver with weighted least squares loss and performs selection for every weight. The 0.5-expectile in fact is just the ordinary least squares regression and hence estimates the mean of $y$ given $x$. And in the same way as quantiles generalize the median, expectiles generalize the mean.
reg = liquidData('reg-1d')
expectiles_list = [ .05, 0.1, 0.5, 0.9, 0.95 ]
model = exSVM(reg.train, weights=expectiles_list)
result, err = model.test(reg.test)
err[:,0]
plt.plot(reg.test.data,reg.test.target,'.')
plt.ylim(-.2,.8)
x = np.arange(-0.2,1.4,0.05).reshape((-1,1))
lines = model.predict(x)
for i in range(len(expectiles_list)):
plt.plot(x, lines[:,i], '-', linewidth=2)
Neyman-Pearson-Learning attempts classification under the constraint that the probability of false positives (Type-I error) is bound by a significance level alpha, which is called here the NPL-constraint.
banana = liquidData('banana-bc')
constraint = 0.08
constraintFactors = np.array([1/2,2/3,1,3/2,2])
# class=-1 specifies the normal class
model = nplSVM(banana.train, nplClass=-1, constraintFactors=constraintFactors, constraint=constraint)
result, err = model.test(banana.test)
false_alarm_rate = (result[banana.test.target==-1,]==1).mean(0)
detection_rate = (result[banana.test.target==1,]==1).mean(0)
np.vstack( (constraint * constraintFactors,false_alarm_rate,detection_rate) ).round(3)
You can see that the false alarm rate in the test set meet the NPL-constraints quite nicely, and on the other hand the the detection rate is increasing.
Receiver Operating Characteristic curve (ROC curve) plots trade-off between the false alarm rate and the detection rate for different weights (default is 9 weights).
banana = liquidData('banana-bc')
model = rocSVM(banana.train,display=1)
result, err = model.test(banana.test)
false_positive_rate = (result[banana.test.target==-1,:]==1).mean(0)
detection_rate = (result[banana.test.target==1,]==1).mean(0)
print(err.T.round(3))
print("1-DR:", 1-detection_rate)
print("FPR:",false_positive_rate)
plt.plot(false_positive_rate, detection_rate, 'x-')
plt.xlim(0,1); plt.ylim(0,1)
plt.plot([0,1],[0,1], '--');
This shows nice learning, since the ROC curve is near the north-west corner.
As a convenience we provide several datasets prepared for training and testing.
http://pnp.mathematik.uni-stuttgart.de/isa/steinwart/liquidData
They can be imported by name e.g. using:
liquidData('reg-1d');
This loads both reg-1d.train.csv
as well as reg-1d.test.csv
into
reg.train
and reg.test
respectively.
liquidData sets have a strict format, they are comma-separated values and no header.
The first column is the target.
The other columns are the features and are put in to .data
.
Before getting these data sets from our website, liquidData
first tries some directories in the filesystem:
getwd()
"~/liquidData"
.
In Windows, ~
typically is C:\Users\username\Documents
http://pnp.mathematik.uni-stuttgart.de/isa/steinwart/liquidData
The data sets can be gzip-ped, which is recognized by the additional extension .gz
, e.g.
reg-1d.train.csv.gz
and reg-1d.test.csv.gz