%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import numpy.random as npr
import sklearn as sk
The following code generates $3$ vectors of size $1000$, representing a dataset of size $1000$. `outlier` has entries in $\{0,1\}$ and marks the observations that are true outliers (value $1$). `scoresPredictor` contains the outlier scores produced by an unknown anomaly detector, and `scoresRandom` contains the scores of a random predictor.
Use scikit-learn functions to plot the precision-recall curves and to compute the area under the curve (AUC) for both the anomaly detector and the random detector.
npr.seed(12)
N = 1000           # number of observations
delta = 0.1        # expected proportion of true outliers
deltaScore = 0.1   # score shift given to true outliers
temp2 = npr.rand(N)
outlier = (temp2 <= delta) * 1.   # ground-truth labels: 1 for outliers, 0 otherwise
# Outliers get a uniform score shifted by deltaScore, inliers a plain uniform score;
# an extra uniform noise term on [0, 2*deltaScore] is added to every observation.
scoresPredictor = (npr.rand(N)+deltaScore) * outlier + (npr.rand(N)) * (1-outlier) + npr.rand(N) * 2 * deltaScore
scoresRandom = npr.rand(N)        # scores from a purely random predictor
from sklearn.metrics import precision_recall_curve
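A minimal sketch of one possible solution; the helper name `plot_pr` is our own:
from sklearn.metrics import auc
def plot_pr(scores, label):
    # Precision-recall curve of one detector, with its area under the curve in the legend.
    precision, recall, _ = precision_recall_curve(outlier, scores)
    plt.plot(recall, precision, label=f"{label} (AUC = {auc(recall, precision):.3f})")
plot_pr(scoresPredictor, "anomaly detector")
plot_pr(scoresRandom, "random predictor")
plt.xlabel("recall")
plt.ylabel("precision")
plt.legend()
plt.ylim(0.0, 1.0)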
Plot, as a function of the score threshold, the precision and the recall of both the anomaly detector and the random predictor. Read the documentation carefully and make sure you understand precisely what `precision_recall_curve` returns. What do you observe?
Since `scoresRandom` and `scoresPredictor` are not on the same scale, we will use score quantiles instead of the raw score values: $1/n$ for the smallest score, $2/n$ for the second smallest, ..., $1$ for the largest score. The quantile can be interpreted as the proportion of predicted outliers if the score were thresholded at the corresponding value. You can use `np.argsort` to compute the quantiles.
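A possible sketch; `score_to_quantile` is a helper of our own, built on `np.argsort`:
def score_to_quantile(scores):
    # Rank transform: smallest score -> 1/n, second smallest -> 2/n, ..., largest -> 1.
    n = len(scores)
    quantiles = np.empty(n)
    quantiles[np.argsort(scores)] = np.arange(1, n + 1) / n
    return quantiles
for scores, label in [(scoresPredictor, "anomaly detector"), (scoresRandom, "random predictor")]:
    q = score_to_quantile(scores)
    precision, recall, thresholds = precision_recall_curve(outlier, q)
    # precision and recall have one more entry than thresholds (the final (recall=0, precision=1) point).
    plt.plot(thresholds, precision[:-1], label=f"precision ({label})")
    plt.plot(thresholds, recall[:-1], label=f"recall ({label})")
plt.xlabel("score quantile used as threshold")
plt.legend()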
What is the last value returned by `precision_recall_curve`? Why? Which of these values (precision or recall) is an arbitrary choice? Draw the PR curve after modifying this last precision value. Comment on the overall shape of the curve you obtain.
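A sketch of one possible modification (our choice, not necessarily the intended one): overwrite the conventional final precision value with the last precision actually computed from the data.
for scores, label in [(scoresPredictor, "anomaly detector"), (scoresRandom, "random predictor")]:
    precision, recall, _ = precision_recall_curve(outlier, scores)
    precision[-1] = precision[-2]  # assumption: replace the arbitrary endpoint precision
    plt.plot(recall, precision, label=label)
plt.xlabel("recall")
plt.ylabel("precision")
plt.legend()
plt.ylim(0.0, 1.0)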
Consider $N$ observations and $p \leq N/4$ outliers. We will compare formal detectors with known behavior, each represented by a shuffled sequence of the numbers from $1$ to $N$.
We will compare three detectors: a random detector, a "perfect then random" detector, and a "random then perfect" detector.
Write the code to simulate these three detectors (an example is shown below for $N = 20$ and $p = 4$; a sketch of one possible implementation follows the table). You can use `npr.choice` with `replace=False` to shuffle lists or arrays, and `np.concatenate` to concatenate arrays.
| | Outliers ground truth | Random detector | Perfect then random | Random then perfect |
|---|---|---|---|---|
0 | 1 | 1 | 20 | 17 |
1 | 1 | 7 | 19 | 18 |
2 | 1 | 11 | 3 | 13 |
3 | 1 | 13 | 7 | 15 |
4 | 0 | 6 | 8 | 10 |
5 | 0 | 4 | 12 | 16 |
6 | 0 | 10 | 13 | 5 |
7 | 0 | 12 | 4 | 19 |
8 | 0 | 15 | 18 | 14 |
9 | 0 | 16 | 1 | 12 |
10 | 0 | 19 | 11 | 6 |
11 | 0 | 2 | 17 | 8 |
12 | 0 | 20 | 2 | 20 |
13 | 0 | 8 | 14 | 11 |
14 | 0 | 5 | 15 | 7 |
15 | 0 | 17 | 9 | 9 |
16 | 0 | 18 | 16 | 4 |
17 | 0 | 3 | 5 | 3 |
18 | 0 | 14 | 6 | 2 |
19 | 0 | 9 | 10 | 1 |
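A sketch of one possible implementation. The exact construction of the last two detectors is our assumption, chosen to match the example qualitatively: "perfect then random" gives the top ranks to half of the outliers and random ranks to everything else, while "random then perfect" keeps all outliers (mixed with some inliers) in the top half of the ranking and only inliers in the bottom half.
def simulate_detectors(N, p):
    # Ground truth: the first p observations are the outliers.
    truth = np.concatenate([np.ones(p), np.zeros(N - p)])
    # Random detector: a uniformly random permutation of the ranks 1..N.
    random_det = npr.choice(np.arange(1, N + 1), size=N, replace=False)
    # "Perfect then random" (assumed): ranks N, N-1, ... for the first p//2 outliers,
    # the remaining ranks shuffled over everything else.
    top = np.arange(N, N - p // 2, -1)
    rest = npr.choice(np.arange(1, N - p // 2 + 1), size=N - p // 2, replace=False)
    perfect_then_random = np.concatenate([top, rest])
    # "Random then perfect" (assumed): the p outliers get random ranks from the top half,
    # the inliers share the remaining top-half ranks and the whole bottom half.
    top_half = npr.choice(np.arange(N // 2 + 1, N + 1), size=N // 2, replace=False)
    inlier_ranks = npr.permutation(np.concatenate([top_half[p:], np.arange(1, N // 2 + 1)]))
    random_then_perfect = np.concatenate([top_half[:p], inlier_ranks])
    return truth, random_det, perfect_then_random, random_then_perfect
# Reproduce a table like the example above.
truth20, rand20, pr20, rp20 = simulate_detectors(20, 4)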
For $N = 10000$ and $p = 100$, using the formal detectors above, compare the ROC curves and the precision-recall curves. Comment on the results you obtain, and try to describe practical situations in which one detector would be preferable to the other.
from sklearn.metrics import roc_curve
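A sketch of the ROC comparison, using the `simulate_detectors` helper defined above (our own construction):
from sklearn.metrics import auc
truth, rand_det, perf_then_rand, rand_then_perf = simulate_detectors(10000, 100)
for scores, label in [(rand_det, "random"),
                      (perf_then_rand, "perfect then random"),
                      (rand_then_perf, "random then perfect")]:
    fpr, tpr, _ = roc_curve(truth, scores)
    plt.plot(fpr, tpr, label=f"{label} (ROC AUC = {auc(fpr, tpr):.3f})")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()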
from sklearn.metrics import precision_recall_curve
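The same comparison with precision-recall curves (a sketch, reusing the detectors simulated above):
for scores, label in [(rand_det, "random"),
                      (perf_then_rand, "perfect then random"),
                      (rand_then_perf, "random then perfect")]:
    precision, recall, _ = precision_recall_curve(truth, scores)
    plt.plot(recall, precision, label=f"{label} (PR AUC = {auc(recall, precision):.3f})")
plt.xlabel("recall")
plt.ylabel("precision")
plt.legend()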
Same question as above, but this time the detectors are not represented by scores (a list of numbers between $1$ and $N$) but by their actual predictions (a list of numbers in $\{0,1\}$). An example is given below. Comment on the results you get. Why do you observe such outputs? Would you recommend using scores or predictions to evaluate models with AUC metrics?
| | Outliers ground truth | Random detector | Perfect then random | Random then perfect |
|---|---|---|---|---|
0 | 1 | 0 | 1 | 1 |
1 | 1 | 0 | 1 | 1 |
2 | 1 | 0 | 0 | 0 |
3 | 1 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 1 |
8 | 0 | 0 | 1 | 0 |
9 | 0 | 0 | 0 | 0 |
10 | 0 | 1 | 0 | 0 |
11 | 0 | 0 | 1 | 0 |
12 | 0 | 1 | 0 | 1 |
13 | 0 | 0 | 0 | 0 |
14 | 0 | 0 | 0 | 0 |
15 | 0 | 1 | 0 | 0 |
16 | 0 | 1 | 0 | 0 |
17 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | 0 |
19 | 0 | 0 | 0 | 0 |
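A sketch under the assumption, suggested by the example table, that each detector predicts $1$ for its $p$ highest-ranked observations and $0$ otherwise:
def ranks_to_predictions(scores, p):
    # The scores are a permutation of 1..N, so the p top-ranked observations have score > N - p.
    return (scores > len(scores) - p).astype(float)
for scores, label in [(rand_det, "random"),
                      (perf_then_rand, "perfect then random"),
                      (rand_then_perf, "random then perfect")]:
    pred = ranks_to_predictions(scores, 100)
    fpr, tpr, _ = roc_curve(truth, pred)
    precision, recall, _ = precision_recall_curve(truth, pred)
    plt.plot(fpr, tpr, label=f"{label} (ROC AUC = {auc(fpr, tpr):.2f}, PR AUC = {auc(recall, precision):.2f})")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()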