We will work with the KDD outlier detection dataset. The dataset can be downloaded here. The data was processed as follows:
The data transformation corresponds to the description in this article.
The following code loads the data (4 variables, 5858 observations).
%matplotlib inline
from matplotlib import pyplot as plt
import math
import numpy as np
import scipy.misc
import scipy as sp
import numpy.random as npr
import sklearn as sk
from sklearn import neighbors
from sklearn import model_selection
from sklearn import decomposition
## Whole data (4 variables, 5858 observations)
KDD = np.loadtxt('http://www.math.univ-toulouse.fr/~epauwels/LearningM2SID/KDDNetworkIntrusion.txt', delimiter=',')
## Input
X = KDD[:,(0,1,2)]
## Binary output (error or not)
y = KDD[:,3]
Give basic descriptive statistics of the data: means, variances, correlations, graphical representations, etc.
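A minimal sketch of such an exploration (the choice of summaries and plots is ours):
print(np.mean(X, axis=0))
print(np.std(X, axis=0))
## Correlation matrix of the three input variables
print(np.corrcoef(X, rowvar=False))
## Proportion of attacks in the data
print(np.mean(y))
## Pairwise scatter plots, colored by the binary label
for (a, b) in ((0, 1), (0, 2), (1, 2)):
    plt.figure()
    plt.scatter(X[:, a], X[:, b], c=y, s=5)
    plt.xlabel('Variable %d' % a)
    plt.ylabel('Variable %d' % b)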
For $k = 1, 6, 11, 16, 21, \ldots, 81$, perform outlier detection on the input X using LOF. Compute the area under the precision-recall curve (AUPR) for each value of $k$. Use scikit-learn's built-in function to compute the area under the curve.
from sklearn.metrics import precision_recall_curve
from sklearn.neighbors import LocalOutlierFactor
## k = 1, 6, 11, ..., 81 (the arange endpoint is exclusive, hence 82)
Ks = np.arange(1, 82, 5)
AUPR = np.zeros(len(Ks))
for i in np.arange(len(Ks)):
    ## Insert your code here
plt.plot(Ks,AUPR)
plt.title('AUPR as a function of k')
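A possible completion of the loop above, assuming attacks are labeled $y = 1$ (LOF's negative_outlier_factor_ is low for abnormal points, so we flip its sign to obtain an outlier score):
from sklearn.metrics import auc

for i in np.arange(len(Ks)):
    lof = LocalOutlierFactor(n_neighbors=Ks[i])
    lof.fit(X)
    ## Flip the sign: higher score = more abnormal
    scores = -lof.negative_outlier_factor_
    precision, recall, _ = precision_recall_curve(y, scores)
    AUPR[i] = auc(recall, precision)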
Pick the best value of $k$ and plot the corresponding precision-recall curve. Compare with neighboring values of $k$.
for k in (30, 40, 50):
    ## Insert your code here
plt.legend()
plt.title('Precision recall curve')
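A possible completion, in the same unsupervised setting:
for k in (30, 40, 50):
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)
    scores = -lof.negative_outlier_factor_
    precision, recall, _ = precision_recall_curve(y, scores)
    plt.plot(recall, precision, label='k = %d' % k)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.title('Precision recall curve')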
We split the dataset in two: $X_{train}$, which contains only normal traffic, and $X_{test}$, which contains both normal traffic and attacks. Perform the same experiment as before, but this time fit the LOF model only on normal traffic and compute the performances on $X_{test}$. Use $\texttt{novelty=True}$ so that the LOF model can compute scores on data different from the training sample. What do you observe regarding the best value of $k$?
indexNormal = np.where(y == 0)[0]
npr.seed(654321)
trainIndex = npr.choice(indexNormal, size=len(indexNormal) // 2, replace=False)
testIndex = np.setdiff1d(range(len(y)), trainIndex)
Xtrain = X[trainIndex,:]
ytrain = y[trainIndex]
Xtest = X[testIndex,:]
ytest = y[testIndex]
Ks = np.arange(1,41,2)
AUPR = np.zeros(len(Ks))
for i in np.arange(len(Ks)):
    ## Insert your code here
plt.plot(Ks,AUPR)
plt.title('AUPR as a function of k')
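A possible completion, fitting on normal traffic only and scoring $X_{test}$ with score_samples (which, like negative_outlier_factor_, is low for abnormal points):
for i in np.arange(len(Ks)):
    lof = LocalOutlierFactor(n_neighbors=Ks[i], novelty=True)
    lof.fit(Xtrain)
    ## Flip the sign: higher score = more abnormal
    scores = -lof.score_samples(Xtest)
    precision, recall, _ = precision_recall_curve(ytest, scores)
    AUPR[i] = auc(recall, precision)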
Pick the best value of $k$ and plot the corresponding precision-recall curve. Compare with neighboring values of $k$.
for k in (11, 13, 15, 17, 19):
    ## Insert your code here
plt.legend()
plt.title('Precision recall curve')
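A possible completion in the semi-supervised setting:
for k in (11, 13, 15, 17, 19):
    lof = LocalOutlierFactor(n_neighbors=k, novelty=True)
    lof.fit(Xtrain)
    scores = -lof.score_samples(Xtest)
    precision, recall, _ = precision_recall_curve(ytest, scores)
    plt.plot(recall, precision, label='k = %d' % k)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.title('Precision recall curve')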
Comment on semi-supervised novelty detection versus unsupervised outlier detection.
Same questions using kernel density estimation with a Gaussian kernel and various bandwidths (see the solution plot for details).
from sklearn.metrics import precision_recall_curve
from sklearn.neighbors import KernelDensity
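A minimal sketch of the unsupervised variant; the bandwidth grid below is our assumption, the exact values appear in the solution plot:
for h in (0.1, 0.5, 1., 2.):  ## hypothetical bandwidth grid
    kde = KernelDensity(kernel='gaussian', bandwidth=h)
    kde.fit(X)
    ## score_samples returns log-densities; low density = outlier, so flip the sign
    scores = -kde.score_samples(X)
    precision, recall, _ = precision_recall_curve(y, scores)
    plt.plot(recall, precision, label='h = %g' % h)
plt.legend()
plt.title('Precision recall curve, Gaussian KDE')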
Semi-supervised novelty detection
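A corresponding sketch, fitting the density on normal traffic only and scoring $X_{test}$ (same hypothetical bandwidth grid):
for h in (0.1, 0.5, 1., 2.):  ## hypothetical bandwidth grid
    kde = KernelDensity(kernel='gaussian', bandwidth=h)
    kde.fit(Xtrain)
    scores = -kde.score_samples(Xtest)
    precision, recall, _ = precision_recall_curve(ytest, scores)
    plt.plot(recall, precision, label='h = %g' % h)
plt.legend()
plt.title('Precision recall curve, Gaussian KDE (novelty detection)')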