MATLAB Codes : inference in
logistic regression models
[Updated : January 31, 2005]
Download the package PLS-logistic.tar.gz, developed by G. Fort.
Contact about the codes : gfort at tsi dot enst dot fr
Demo :
- download the package
- call the function ExampleBinary in the MATLAB command window for a
demonstration (binary classification).
- call the function ExampleMulti in the MATLAB command window for a
demonstration (multi-class classification).
The package contains :
- Codes for the 2-class case :
- IRPLS : a MATLAB function that codes the algorithm proposed in Marx, 1996,
Technometrics.
- IRRLS : a MATLAB function called by RPLS.
- ExampleBinary : an example of MATLAB instructions that use the above
functions. The different procedures are applied to the estimation of the
regression coefficient in a logistic regression model. The data set (from the
microarray literature) has far more covariates than observations and a
highly collinear data matrix. In this example, the estimate of the regression
coefficient arising from the extensions of PLS to GLM is plugged into a
logistic discrimination procedure, and (binary) classification is performed.
The data set is the Colon data (see below); the publicly available data have
to be pre-processed, which is done by calling the function PreProcess.
- PreProcess : a MATLAB function called by ExampleBinary that codes the
pre-processing steps (these can be applied to the Leukemia data, Colon data,
Prostate data, ...).
- Codes for the multi-class case :
- MIRRLS : a MATLAB function called by MRPLS.
- ExampleMulti : an example of MATLAB instructions that use the above
functions. The different procedures are applied to the estimation of the
regression coefficient in a logistic regression model. The data set (from the
microarray literature) has far more covariates than observations and a
highly collinear data matrix. In this example, the estimate of the regression
coefficient arising from the extensions of PLS to GLM is plugged into a
polychotomous discrimination procedure, and (multi-class) classification is
performed. The data set is the Leukemia data (see below); the publicly
available data have to be pre-processed, which is done by calling the
function PreProcess (same as above).
- Help
- Each function has a help text. For help on the function RPLS, for example,
type help RPLS in the MATLAB command window.
- We now briefly present the input and output variables of the different
functions :
INPUT arguments of the functions
NR, IRPLS, IRPLSF, RPLS, RIDGE, MNR, MIRPLSF, MRPLS, MRIDGE :
- Y contains the {0, ..., c}-valued response variable from the learning set
(c = 1 in the 2-class case, c = NbrClass-1 in the multi-class case) :
matrix nLearn x 1.
- X contains the data matrix : matrix nLearn x p (do NOT include the intercept
term; it is added in the function).
- flag is {0,1}-valued; it is set to 1 if the data matrix has to be
standardized (each column is centered and scaled to norm 1).
- All of these methods have a dimension reduction step based on PLS;
Kappa contains the number of PLS components.
- All of these methods contain at least one iterative part; Parameters is a
vector that contains the maximal number of iterations and a tolerance for the
stopping criterion. A suggested setting for this variable is given in the
help of each function.
- In the multi-class case, NbrClass is the number of classes. For example,
with 5 classes, NbrClass=5 and Y is {0, ..., 4}-valued.
- RPLS and MRPLS also depend on a shrinkage parameter; Lambdarange is a range
of candidate values; the computation of the 'optimal' one is part of the RPLS
and MRPLS functions.
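The standardization triggered by flag=1 can be illustrated by hand. The lines below are only a sketch on made-up toy data, and they read "norm 1" as unit Euclidean norm, which is an assumption; the package performs this step internally, so the sketch is for understanding and not a required preparation.

```matlab
% Hypothetical toy data: 4 samples (rows) x 3 covariates (columns).
X = [1 2 0; 3 5 1; 2 4 4; 6 1 3];   % nLearn x p, no intercept column
Y = [0; 1; 1; 0];                   % {0,1}-valued response, nLearn x 1

nLearn = size(X, 1);

% Center each column.
Xc = X - repmat(mean(X, 1), nLearn, 1);

% Scale each centered column to unit Euclidean norm (ASSUMPTION: this is
% what "norm 1" means in the package).
colNorm = sqrt(sum(Xc.^2, 1));
colNorm(colNorm == 0) = 1;          % a constant column stays at zero
Xs = Xc ./ repmat(colNorm, nLearn, 1);
```

After these lines, every column of Xs has zero mean and unit norm, and Y has the nLearn x 1 shape described above.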
OUTPUT structure of the functions
NR, IRPLS, IRPLSF, RPLS, RIDGE, MNR, MIRPLSF, MRPLS, MRIDGE :
- The structure contains a field Gamma that collects the estimates of the
regression coefficients.
- 2-class case : the regression coefficient is a (p+1) x 1 matrix.
Gamma contains one estimate for each value of the hyper-parameter Kappa :
(p+1) x length(Kappa) matrix.
- Multi-class case : the regression coefficient is a (p+1) x (NbrClass-1)
matrix. Gamma contains one estimate for each value of the hyper-parameter
Kappa : length(Kappa) x (p+1) x (NbrClass-1) array.
- The structure also collects information on the convergence of the
different iterative schemes.
- The functions RPLS and MRPLS also return the optimal value of the
hyper-parameter Lambda determined over the range Lambdarange.
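To make the shapes of Gamma concrete, here is a minimal sketch of how an estimate could be plugged into a logistic discrimination rule. The coefficient values are made up and are not output of the package. Two points are assumptions that the description above does not settle: that the intercept is the first of the p+1 coefficients, and that class 0 is the reference class in the multi-class model. Check the help of each function before relying on either.

```matlab
% Hypothetical toy sizes: p = 2 covariates, 2 values of Kappa, 3 classes.
Xtest = [0.5 -1.0; 1.5 0.2];            % nTest x p, no intercept column
nTest = size(Xtest, 1);
p = size(Xtest, 2);
Z = [ones(nTest, 1) Xtest];             % ASSUMPTION: intercept comes first
k = 2;                                  % index of the chosen value in Kappa

% 2-class case: Gamma is (p+1) x length(Kappa).
GammaBin = [0.1 0.2; 1.0 0.8; -0.5 -0.4];    % made-up values
eta = Z * GammaBin(:, k);                    % linear predictor
probBin = 1 ./ (1 + exp(-eta));              % estimated P(Y = 1 | x)
labelBin = double(probBin > 0.5);            % {0,1}-valued prediction

% Multi-class case: Gamma is length(Kappa) x (p+1) x (NbrClass-1).
NbrClass = 3;
GammaMulti = zeros(2, p + 1, NbrClass - 1);  % made-up values
GammaMulti(k, :, 1) = [0.2 1.0 0.0];
GammaMulti(k, :, 2) = [-0.3 0.0 1.0];
G = reshape(GammaMulti(k, :, :), [p + 1, NbrClass - 1]);

% ASSUMPTION: class 0 is the reference class (linear predictor fixed at 0).
expEta = [ones(nTest, 1) exp(Z * G)];
probMulti = expEta ./ repmat(sum(expEta, 2), 1, NbrClass);
[~, idx] = max(probMulti, [], 2);
labelMulti = idx - 1;                        % back to {0, ..., NbrClass-1}
```

The reshape call extracts the (p+1) x (NbrClass-1) coefficient matrix for one value of Kappa from the three-dimensional array.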
DATA SETS (publicly available) from the microarray literature.
In our convention, the expression levels of a microarray are collected in a
row of the data matrix :
** column #j contains all the values relative to gene #j.
** row #i contains all the values relative to sample #i.
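As an illustration of this convention, the sketch below transposes a made-up matrix stored with one gene per row. Whether a given downloaded file needs this step is an assumption to check against the file itself.

```matlab
% Hypothetical raw matrix stored as genes x samples (3 genes, 2 samples).
raw = [10 20; 30 40; 50 60];

% Transpose so that row #i is sample #i and column #j is gene #j.
X = raw';
[nSamples, p] = size(X);
```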
Some data sets have to be pre-processed before use. The classical
pre-processing steps for the first three data sets are coded in the MATLAB
function PreProcess (see above).
Colon data set :
- download the data in .txt format from Princeton : data matrix 62 x 2000.
- see Alon et al. (1999) for a complete description of the data set.
- 2-class example. To be pre-processed.
Prostate data set :
- download the data from MIT : data matrix 102 x 12600.
- see Singh et al. (2002) for a complete description of the data set.
- 2-class example. To be pre-processed.
Leukemia data set :
- download the data in .txt format from MIT : training set (38 x 7129) and
test set (34 x 7129).
- see Golub et al. (1999) for a complete description of the data set.
- 2- or 3-class example. To be pre-processed.
NCI60 data set :
- download the data from nci : data matrix 60 x 1415.
Ovarian data set :
- download the data from the Latent Process Decomposition home page : data
matrix 54 x 1536.
Other well-known data sets (breast cancer, diffuse large B-cell
lymphoma, ...) :