Title: | Intrinsic Dimension for Data Mining |
---|---|
Description: | Contains techniques for mining large and high-dimensional data sets by using the concept of Intrinsic Dimension (ID). Here the ID is not necessarily an integer. It is extended to fractal dimensions. And the Morisita estimator is used for the ID estimation, but other tools are included as well. |
Authors: | Jean Golay [aut, cre], Mohamed Laib [aut] |
Maintainer: | Jean Golay <[email protected]> |
License: | CC BY-NC-SA 4.0 |
Version: | 1.0.7 |
Built: | 2024-11-05 02:40:14 UTC |
Source: | https://github.com/jeangolay/idmining |
J. Golay and M. Kanevski (2015). A new estimator of intrinsic dimension based on the multipoint Morisita index, Pattern Recognition 48 (12):4070–4081.
J. Golay, M. Leuenberger and M. Kanevski (2017). Feature selection for regression problems based on the Morisita estimator of intrinsic dimension, Pattern Recognition 70:126–138.
J. Golay and M. Kanevski (2017). Unsupervised feature selection based on the Morisita estimator of intrinsic dimension, Knowledge-Based Systems 135:125-134.
J. Golay, M. Leuenberger and M. Kanevski (2015). Morisita-based feature selection for regression problems. Proceedings of the 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges (Belgium).
Generates a random simulation of the butterfly data set with a given number of points.
Butterfly(N=10000)
N | The number of points to be generated (by default: `N = 10000`). |
A data.frame with N rows and nine columns. The first eight columns are the input variables, and the last one is the output (or target) variable Y.
Jean Golay [email protected]
J. Golay, M. Leuenberger and M. Kanevski (2017). Feature selection for regression problems based on the Morisita estimator of intrinsic dimension, Pattern Recognition 70:126–138.
bf <- Butterfly(1000)

## Not run:
require(colorRamps)
require(rgl)
y_bins <- cut(bf$Y,breaks=64)  # renamed from `c` to avoid masking base::c
cols <- matlab.like(64)[as.numeric(y_bins)]
plot3d(bf$X1,bf$X2,bf$Y,col=cols,radius=0.10,type="s",
       xlab="",ylab="",zlab="",box=FALSE)
axes3d(lwd=3,cex.axis=3)
grid3d(c("x+","y-","z"),col="black",lwd=1)
## End(Not run)
Computes the ln values of the multipoint Morisita index in 1, 2 or higher dimensional spaces.
logMINDEX(X, scaleQ=1:5, mMin=2, mMax=2)
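X | A data set (e.g. a `data.frame`) with $N$ rows (data points) and $E$ columns (variables or features). |
scaleQ | Either a single value or a vector. It contains the value(s) of $\ell^{-1}$ (by default: `scaleQ = 1:5`). |
mMin | The minimum value of $m$ (by default: `mMin = 2`). |
mMax | The maximum value of $m$ (by default: `mMax = 2`). |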
- $\ell$ is the edge length of the grid cells (or quadrats). Since the variables (and consequently the grid) are rescaled to the $[0,1]$ interval, $\ell$ is equal to $1$ for a grid consisting of only one cell.
- $\ell^{-1}$ is the number of grid cells (or quadrats) along each axis of the Euclidean space in which the data points are embedded.
- $\ell^{-1}$ is equal to $Q^{1/E}$, where $Q$ is the number of grid cells and $E$ is the number of variables (or features).
- $\ell^{-1}$ is directly related to $\delta$ (see References).
- $\delta$ is the diagonal length of the grid cells.
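The grid quantities used here can be collected in one line, writing $\ell$ for the edge length of a grid cell, $Q$ for the number of cells, $E$ for the number of variables and $\delta$ for the cell diagonal. The first identity restates the text. The expression for the diagonal is ordinary Euclidean geometry for an $E$-dimensional hypercube of edge $\ell$; it is added here as an aid and is not quoted from the package documentation.

```latex
\ell^{-1} = Q^{1/E}, \qquad
\delta = \ell\,\sqrt{E}
\quad\Longrightarrow\quad
\ln(\delta) = -\ln\!\left(\ell^{-1}\right) + \tfrac{1}{2}\ln(E)
```

Under this assumption, a plot against $\ln(\delta)$ and a plot against $\ln(\ell^{-1})$ differ only by a constant shift and a sign change of the slope.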
A data.frame containing the ln value of the $m$-Morisita index for each value of $\ln(\delta)$ and $m$. Notice also that the values of $\ln(\delta)$ are provided with regard to the $[0,1]$ interval.
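As a rough illustration of the quantity described here, the sketch below computes the ln of the $m$-Morisita index on a regular grid in base R. It assumes the usual definition $I_{m,\delta} = Q^{m-1}\,\sum_i n_i(n_i-1)\cdots(n_i-m+1) \,/\, \big(N(N-1)\cdots(N-m+1)\big)$, where $n_i$ is the number of points in cell $i$. The helper name `log_morisita` and its arguments are hypothetical; this is not the package's implementation.

```r
# Hypothetical helper, not part of the package: ln of the m-Morisita index
# on a regular grid with `q` cells along each axis (q plays the role of 1/l).
log_morisita <- function(X, q, m = 2) {
  X <- as.matrix(X)
  N <- nrow(X)
  E <- ncol(X)
  # Rescale every variable to the [0,1] interval.
  rng <- apply(X, 2, range)
  U <- sweep(X, 2, rng[1, ], "-")
  U <- sweep(U, 2, rng[2, ] - rng[1, ], "/")
  # Index (0 .. q-1) of the cell containing each point along each axis;
  # points lying exactly on the upper boundary go into the last cell.
  idx <- pmin(floor(U * q), q - 1)
  # Collapse the E per-axis indices into a single cell identifier (1 .. q^E).
  cell <- as.vector(idx %*% q^(seq_len(E) - 1)) + 1
  n <- tabulate(cell, nbins = q^E)
  # Falling factorial x(x-1)...(x-m+1), applied element-wise.
  ff <- function(x) Reduce(`*`, lapply(0:(m - 1), function(k) x - k))
  Q <- q^E
  log(Q^(m - 1) * sum(ff(n)) / ff(N))
}

# Usage: uniformly scattered points versus points lying on a line.
set.seed(1)
X_unif <- cbind(runif(5000), runif(5000))
u <- runif(5000)
X_line <- cbind(u, u)
ln_I_unif <- sapply(1:10, function(q) log_morisita(X_unif, q))
ln_I_line <- sapply(1:10, function(q) log_morisita(X_line, q))
```

For uniformly scattered points the values stay close to zero at every scale, whereas for the line they grow roughly like the ln of the number of cells per axis. A constant column would cause a division by zero in the rescaling step, which this sketch does not guard against.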
Jean Golay [email protected]
J. Golay and M. Kanevski (2015). A new estimator of intrinsic dimension based on the multipoint Morisita index, Pattern Recognition 48 (12):4070–4081.
sim_dat <- SwissRoll(1000)

m <- 2
scaleQ <- 1:15 # It starts with a grid of 1^E cell (or quadrat).
               # It ends with a grid of 15^E cells (or quadrats).
lnmMI <- logMINDEX(sim_dat, scaleQ, m, m)

dev.new(width=5, height=4)
plot(exp(lnmMI[,1]),exp(lnmMI[,2]),pch=19,col="black",xlab="",ylab="")
title(xlab = expression(delta), cex.lab = 1.5,line = 2.5)
title(ylab = expression(I['2,'*delta]), cex.lab = 1.5,line = 2.5)

dev.new(width=5, height=4)
plot(lnmMI[,1],lnmMI[,2],pch=19,col="black",xlab="",ylab="")
title(xlab = expression(paste("log(",delta,")")), cex.lab = 1.5,line = 2.5)
title(ylab = expression(paste("log(",I['2,'*delta],")")), cex.lab = 1.5,line = 2.5)
Executes the MBFR algorithm for supervised feature selection.
MBFR(XY, scaleQ, m=2, C=NULL)
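XY | A data set (e.g. a `data.frame`) with $N$ rows (data points) whose last column contains the output (or target) variable and whose other columns contain the input variables (or features). |
scaleQ | A vector containing the values of $\ell^{-1}$. |
m | The value of the parameter m (by default: `m = 2`). |
C | The number of steps of the SFS procedure (by default: `C = NULL`). |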
- $\ell$ is the edge length of the grid cells (or quadrats). Since the data (and consequently the grid) are rescaled to the $[0,1]$ interval, $\ell$ is equal to $1$ for a grid consisting of only one cell.
- $\ell^{-1}$ is the number of grid cells (or quadrats) along each axis of the Euclidean space in which the data points are embedded.
- $\ell^{-1}$ is equal to $Q^{1/E}$, where $Q$ is the number of grid cells and $E$ is the number of variables (or features).
- $\ell^{-1}$ is directly related to $\delta$ (see References).
- $\delta$ is the diagonal length of the grid cells.
- The values of $\ell^{-1}$ in `scaleQ` must be chosen according to the linear part of the log-log plot relating the log values of the multipoint Morisita index to the log values of $\delta$ (or, equivalently, to the log values of $\ell^{-1}$) (see `logMINDEX`).
A list of five elements:
- a vector containing the identifier numbers of the original features in the order they are selected through the Sequential Forward Selection (SFS) search procedure.
- the names of the corresponding features.
- the corresponding values of $Diss$ (the estimated dissimilarity).
- the ID estimate of the output variable.
- a matrix containing: (column 1) the ID estimates of the subsets retained by the SFS procedure with the target variable; (column 2) the ID estimates of the subsets retained by the SFS procedure without the output variable; (column 3) the values of $Diss$ of the subsets retained by the SFS procedure.
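The Sequential Forward Selection (SFS) search mentioned in this return value can be sketched generically in base R. The sketch below is only an illustration of the search strategy: the helper `sfs`, its `score` argument and the toy data are hypothetical, and the criterion used (residual sum of squares of a least-squares fit) is a stand-in chosen for brevity, not the Morisita-based dissimilarity that MBFR relies on.

```r
# Generic Sequential Forward Selection (hypothetical helper, not the package's code).
# `score(idx)` must return a number to be minimised for the feature subset `idx`.
sfs <- function(n_features, score, n_steps = n_features) {
  selected  <- integer(0)
  scores    <- numeric(0)
  remaining <- seq_len(n_features)
  for (step in seq_len(n_steps)) {
    # Score every candidate subset "already selected + one new feature".
    cand <- vapply(remaining, function(j) score(c(selected, j)), numeric(1))
    best <- which.min(cand)
    selected  <- c(selected, remaining[best])
    scores    <- c(scores, cand[best])
    remaining <- remaining[-best]
  }
  list(order = selected, score = scores)
}

# Toy data: x1 matters most, x2 a little, x3 is irrelevant.
set.seed(42)
X <- cbind(x1 = runif(200), x2 = runif(200), x3 = runif(200))
y <- 3 * X[, "x1"] + X[, "x2"] + rnorm(200, sd = 0.05)

# Placeholder criterion (an assumption for illustration only):
# residual sum of squares of a least-squares fit on the chosen columns.
rss <- function(idx) {
  sum(lm.fit(cbind(1, X[, idx, drop = FALSE]), y)$residuals^2)
}

res <- sfs(ncol(X), rss)
res$order   # features in the order they were added
res$score   # criterion value after each addition
```

Any function of a feature subset can be plugged in as `score`, which is the point of keeping the search and the criterion separate.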
Jean Golay [email protected]
J. Golay, M. Leuenberger and M. Kanevski (2017). Feature selection for regression problems based on the Morisita estimator of intrinsic dimension, Pattern Recognition 70:126–138.
J. Golay, M. Leuenberger and M. Kanevski (2015). Morisita-based feature selection for regression problems. Proceedings of the 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges (Belgium).
## Not run:
bf <- Butterfly(10000)

fly_select <- MBFR(bf, 5:25)
var_order <- fly_select[[2]]
var_perf <- fly_select[[3]]

dev.new(width=5, height=4)
plot(var_perf,type="b",pch=16,lwd=2,xaxt="n",xlab="",ylab="",
     ylim=c(0,1),col="red",panel.first={grid(lwd=1.5)})
axis(1,1:length(var_order),labels=var_order)
mtext(1,text = "Added Features (from left to right)",line = 2.5,cex=1)
mtext(2,text = "Estimated Dissimilarity",line = 2.5,cex=1)
## End(Not run)
Executes the MBFR algorithm on a chosen number of workers (CPU parallel computing).
MBFR_parallel(XY, scaleQ, m=2, C=NULL, ncores=4)
XY | A data set (e.g. a `data.frame`) with $N$ rows (data points) whose last column contains the output (or target) variable and whose other columns contain the input variables (or features). |
scaleQ | A vector containing the values of $\ell^{-1}$. |
m | The value of the parameter m (by default: `m = 2`). |
C | The number of steps of the SFS procedure (by default: `C = NULL`). |
ncores | Number of workers (by default: `ncores = 4`). |

- $\ell$ is the edge length of the grid cells (or quadrats). Since the data (and consequently the grid) are rescaled to the $[0,1]$ interval, $\ell$ is equal to $1$ for a grid consisting of only one cell.
- $\ell^{-1}$ is the number of grid cells (or quadrats) along each axis of the Euclidean space in which the data points are embedded.
- $\ell^{-1}$ is equal to $Q^{1/E}$, where $Q$ is the number of grid cells and $E$ is the number of variables (or features).
- $\ell^{-1}$ is directly related to $\delta$ (see References).
- $\delta$ is the diagonal length of the grid cells.
- The values of $\ell^{-1}$ in `scaleQ` must be chosen according to the linear part of the log-log plot relating the log values of the multipoint Morisita index to the log values of $\delta$ (or, equivalently, to the log values of $\ell^{-1}$) (see `logMINDEX`).

A list of five elements:
- a vector containing the identifier numbers of the original features in the order they are selected through the Sequential Forward Selection (SFS) search procedure.
- the names of the corresponding features.
- the corresponding values of $Diss$ (the estimated dissimilarity).
- the ID estimate of the output variable.
- a matrix containing: (column 1) the ID estimates of the subsets retained by the SFS procedure with the target variable; (column 2) the ID estimates of the subsets retained by the SFS procedure without the output variable; (column 3) the values of $Diss$ of the subsets retained by the SFS procedure.
Jean Golay [email protected]
J. Golay, M. Leuenberger and M. Kanevski (2017). Feature selection for regression problems based on the Morisita estimator of intrinsic dimension, Pattern Recognition 70:126–138.
J. Golay, M. Leuenberger and M. Kanevski (2015). Morisita-based feature selection for regression problems. Proceedings of the 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges (Belgium).
## Not run:
bf <- Butterfly(10000)

fly_select <- MBFR_parallel(bf, 5:25, ncores=2)
var_order <- fly_select[[2]]
var_perf <- fly_select[[3]]

dev.new(width=5, height=4)
plot(var_perf,type="b",pch=16,lwd=2,xaxt="n",xlab="",ylab="",
     ylim=c(0,1),col="red",panel.first={grid(lwd=1.5)})
axis(1,1:length(var_order),labels=var_order)
mtext(1,text = "Added Features (from left to right)",line = 2.5,cex=1)
mtext(2,text = "Estimated Dissimilarity",line = 2.5,cex=1)

bf_large <- Butterfly(10^5)
system.time(MBFR(bf_large, 5:25))
system.time(MBFR_parallel(bf_large, 5:25))
## End(Not run)
Executes the MBRM algorithm for unsupervised feature selection.
MBRM(X, scaleQ, m=2, C=NULL, ID_tot=NULL)
X | A data set (e.g. a `data.frame`) with $N$ rows (data points) and $E$ columns (variables or features). |
scaleQ | A vector containing the values of $\ell^{-1}$. |
m | The value of the parameter m (by default: `m = 2`). |
C | The number of steps of the SFS procedure (by default: `C = NULL`). |
ID_tot | The value of the full data ID if it is known a priori (by default: the value of ID_tot is estimated using the Morisita estimator of ID within the function). |

- $\ell$ is the edge length of the grid cells (or quadrats). Since the variables (and consequently the grid) are rescaled to the $[0,1]$ interval, $\ell$ is equal to $1$ for a grid consisting of only one cell.
- $\ell^{-1}$ is the number of grid cells (or quadrats) along each axis of the Euclidean space in which the data points are embedded.
- $\ell^{-1}$ is equal to $Q^{1/E}$, where $Q$ is the number of grid cells and $E$ is the number of variables (or features).
- $\ell^{-1}$ is directly related to $\delta$ (see References).
- $\delta$ is the diagonal length of the grid cells.
- The values of $\ell^{-1}$ in `scaleQ` must be chosen according to the linear part of the log-log plot relating the log values of the multipoint Morisita index to the log values of $\delta$ (or, equivalently, to the log values of $\ell^{-1}$) (see `logMINDEX`).

A list of four elements:
- a vector containing the identifier numbers of the original features in the order they are selected through the Sequential Forward Selection (SFS) search procedure.
- the names of the corresponding features.
- the corresponding ID estimates.
- the ID estimate of the full data set.
Jean Golay [email protected]
J. Golay and M. Kanevski (2017). Unsupervised feature selection based on the Morisita estimator of intrinsic dimension, Knowledge-Based Systems 135:125-134.
## Not run:
bf <- Butterfly(10000)

bf_select <- MBRM(bf[,-9], 5:25)
var_order <- bf_select[[2]]
var_perf <- bf_select[[3]]

dev.new(width=5, height=4)
plot(var_perf,type="b",pch=16,lwd=2,xaxt="n",xlab="", ylab="",
     col="red",ylim=c(0,max(var_perf)),panel.first={grid(lwd=1.5)})
axis(1,1:length(var_order),labels=var_order)
mtext(1,text="Added Features (from left to right)",line=2.5,cex=1)
mtext(2,text="Estimated ID",line=2.5,cex=1)
## End(Not run)
Executes the MBRM algorithm for unsupervised feature selection (CPU parallel computing).
MBRM_parallel(X, scaleQ, m=2, C=NULL, ID_tot=NULL, ncores=4)
X | A data set (e.g. a `data.frame`) with $N$ rows (data points) and $E$ columns (variables or features). |
scaleQ | A vector containing the values of $\ell^{-1}$. |
m | The value of the parameter m (by default: `m = 2`). |
C | The number of steps of the SFS procedure (by default: `C = NULL`). |
ID_tot | The value of the full data ID if it is known a priori (by default: the value of ID_tot is estimated using the Morisita estimator of ID within the function). |
ncores | Number of workers (by default: `ncores = 4`). |

- $\ell$ is the edge length of the grid cells (or quadrats). Since the variables (and consequently the grid) are rescaled to the $[0,1]$ interval, $\ell$ is equal to $1$ for a grid consisting of only one cell.
- $\ell^{-1}$ is the number of grid cells (or quadrats) along each axis of the Euclidean space in which the data points are embedded.
- $\ell^{-1}$ is equal to $Q^{1/E}$, where $Q$ is the number of grid cells and $E$ is the number of variables (or features).
- $\ell^{-1}$ is directly related to $\delta$ (see References).
- $\delta$ is the diagonal length of the grid cells.
- The values of $\ell^{-1}$ in `scaleQ` must be chosen according to the linear part of the log-log plot relating the log values of the multipoint Morisita index to the log values of $\delta$ (or, equivalently, to the log values of $\ell^{-1}$) (see `logMINDEX`).

A list of four elements:
- a vector containing the identifier numbers of the original features in the order they are selected through the Sequential Forward Selection (SFS) search procedure.
- the names of the corresponding features.
- the corresponding ID estimates.
- the ID estimate of the full data set.
Jean Golay [email protected]
J. Golay and M. Kanevski (2017). Unsupervised feature selection based on the Morisita estimator of intrinsic dimension, Knowledge-Based Systems 135:125-134.
bf <- Butterfly(10000)

bf_select <- MBRM_parallel(bf[,-9], 5:25, ncores=2)
var_order <- bf_select[[2]]
var_perf <- bf_select[[3]]

## Not run:
dev.new(width=5, height=4)
plot(var_perf,type="b",pch=16,lwd=2,xaxt="n",xlab="", ylab="",
     col="red",ylim=c(0,max(var_perf)),panel.first={grid(lwd=1.5)})
axis(1,1:length(var_order),labels=var_order)
mtext(1,text="Added Features (from left to right)",line=2.5,cex=1)
mtext(2,text="Estimated ID",line=2.5,cex=1)

bf_large <- Butterfly(10^5)
system.time(MBRM(bf_large[,-9], 5:25))
system.time(MBRM_parallel(bf_large[,-9], 5:25))
## End(Not run)
Computes the multipoint Morisita index for spatial patterns (i.e. 2-dimensional patterns).
MINDEX_SP(X, scaleQ=1:5, mMin=2, mMax=5, Wlim_x=NULL, Wlim_y=NULL)
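X | A two-column data set (e.g. a `data.frame`) containing the X and Y coordinates of the $N$ data points. |
scaleQ | Either a single value or a vector. It contains the value(s) of $\ell^{-1}$ (by default: `scaleQ = 1:5`). |
mMin | The minimum value of $m$ (by default: `mMin = 2`). |
mMax | The maximum value of $m$ (by default: `mMax = 5`). |
Wlim_x | A vector controlling the spatial extent of the grid along the X axis (by default: `Wlim_x = NULL`, in which case the grid covers the spatial extent of the data). |
Wlim_y | A vector controlling the spatial extent of the grid along the Y axis (by default: `Wlim_y = NULL`, in which case the grid covers the spatial extent of the data). |

- $\ell^{-1}$ is the number of grid cells (or quadrats) along each of the two axes.
- $\ell^{-1}$ is directly related to $\delta$ (see References).
- $\delta$ is the diagonal length of the grid cells.

A data.frame containing the value of the $m$-Morisita index for each value of $\delta$ and $m$.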
Jean Golay [email protected]
J. Golay, M. Kanevski, C. D. Vega Orozco and M. Leuenberger (2014). The multipoint Morisita index for the analysis of spatial patterns, Physica A 406:191–202.
L. Telesca, J. Golay and M. Kanevski (2015). Morisita-based space-clustering analysis of Swiss seismicity, Physica A 419:40–47.
L. Telesca, M. Lovallo, J. Golay and M. Kanevski (2016). Comparing seismicity declustering techniques by means of the joint use of Allan Factor and Morisita index, Stochastic Environmental Research and Risk Assessment 30(1):77-90.
sim_dat <- SwissRoll(1000)

m <- 2
scaleQ <- 1:15 # It starts with a grid of 1^2 cell (or quadrat).
               # It ends with a grid of 15^2 cells (or quadrats).
mMI <- MINDEX_SP(sim_dat[,c(1,2)], scaleQ, m, 5)

plot(mMI[,1],mMI[,2],pch=19,col="black",xlab="",ylab="")
title(xlab=expression(delta),cex.lab=1.5,line=2.5)
title(ylab=expression(I['2,'*delta]),cex.lab=1.5,line=2.5)

## Not run:
require(colorRamps)
colfunc <- colorRampPalette(c("blue","red"))
color <- colfunc(4)
dev.new(width=5,height=4)
plot(mMI[5:15,1],mMI[5:15,2],pch=19,col=color[1],xlab="",ylab="",
     ylim=c(1,max(mMI[,5])))
title(xlab=expression(delta),cex.lab=1.5,line=2.5)
title(ylab=expression(I['2,'*delta]),cex.lab=1.5,line=2.5)
for(i in 3:5){
  points(mMI[5:15,1],mMI[5:15,i],pch=19,col=color[i-1])
}
legend.text<-c("m=2","m=3","m=4","m=5")
legend.pch=c(19,19,19,19)
legend.lwd=c(NA,NA,NA,NA)
legend.col=c(color[1],color[2],color[3],color[4])
legend("topright",legend=legend.text,pch=legend.pch,lwd=legend.lwd,
       col=legend.col,ncol=1,text.col="black",cex=0.9,box.lwd=1,bg="white")

xlim_l <- c(-5,5)     # By default, the spatial extent of the grid is set so
ylim_l <- c(-6,6)     # that it is the same as the spatial extent of the data.
xlim_s <- c(-0.6,0.2) # But it can be modified to cover either a larger (l)
ylim_s <- c(-1,0.5)   # or a smaller (s) study area (or validity domain).

mMI_l <- MINDEX_SP(sim_dat[,c(1,2)], scaleQ, m, 5, xlim_l, ylim_l)
mMI_s <- MINDEX_SP(sim_dat[,c(1,2)], scaleQ, m, 5, xlim_s, ylim_s)
## End(Not run)
Estimates the intrinsic dimension of data using the Morisita estimator of intrinsic dimension.
MINDID(X, scaleQ=1:5, mMin=2, mMax=2)
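X | A data set (e.g. a `data.frame`) with $N$ rows (data points) and $E$ columns (variables or features). |
scaleQ | A vector (at least two values). It contains the values of $\ell^{-1}$ (by default: `scaleQ = 1:5`). |
mMin | The minimum value of $m$ (by default: `mMin = 2`). |
mMax | The maximum value of $m$ (by default: `mMax = 2`). |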
- $\ell$ is the edge length of the grid cells (or quadrats). Since the variables (and consequently the grid) are rescaled to the $[0,1]$ interval, $\ell$ is equal to $1$ for a grid consisting of only one cell.
- $\ell^{-1}$ is the number of grid cells (or quadrats) along each axis of the Euclidean space in which the data points are embedded.
- $\ell^{-1}$ is equal to $Q^{1/E}$, where $Q$ is the number of grid cells and $E$ is the number of variables (or features).
- $\ell^{-1}$ is directly related to $\delta$ (see References).
- $\delta$ is the diagonal length of the grid cells.
A list of two elements:
- a data.frame containing the ln value of the $m$-Morisita index for each value of $\ln(\delta)$ and $m$. The values of $\ln(\delta)$ are provided with regard to the $[0,1]$ interval.
- a data.frame containing the values of $S_m$ and $M_m$ for each value of $m$.
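To make the link between the index and the ID estimate concrete, the sketch below fits a straight line to the ln of the $m$-Morisita index against $\ln(1/\delta)$ and converts the slope $S_m$ into an estimate $M_m$. It assumes the relation $M_m = E - S_m/(m-1)$ as recalled from the 2015 reference cited in this manual, and $\delta = \ell\sqrt{E}$; both should be checked against that paper before relying on them. The function names are hypothetical and the code is not the package's implementation.

```r
# Hypothetical helpers (not the package's implementation).
# ln of the m-Morisita index for data U already rescaled to [0,1],
# on a grid with `q` cells along each axis.
log_morisita <- function(U, q, m = 2) {
  E <- ncol(U)
  N <- nrow(U)
  idx  <- pmin(floor(U * q), q - 1)                   # per-axis cell index
  cell <- as.vector(idx %*% q^(seq_len(E) - 1)) + 1   # single cell identifier
  n    <- tabulate(cell, nbins = q^E)                 # points per cell
  ff   <- function(x) Reduce(`*`, lapply(0:(m - 1), function(k) x - k))
  log((q^E)^(m - 1) * sum(ff(n)) / ff(N))
}

# Slope-based ID estimate, assuming M_m = E - S_m / (m - 1).
morisita_id <- function(X, scaleQ, m = 2) {
  X <- as.matrix(X)
  E <- ncol(X)
  U <- apply(X, 2, function(v) (v - min(v)) / (max(v) - min(v)))
  ln_I <- vapply(scaleQ, function(q) log_morisita(U, q, m), numeric(1))
  ln_inv_delta <- log(scaleQ / sqrt(E))   # 1/delta = (1/l) / sqrt(E)
  S_m <- unname(coef(lm(ln_I ~ ln_inv_delta))[2])
  E - S_m / (m - 1)
}

# Usage: a line embedded in 2-D versus points filling the unit square.
set.seed(1)
u <- runif(2000)
id_line  <- morisita_id(cbind(u, u), 5:15)
id_plane <- morisita_id(cbind(runif(2000), runif(2000)), 5:15)
```

With these toy inputs the estimate is close to 1 for the line and close to 2 for the filled square, which matches their intrinsic dimensions.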
Jean Golay [email protected]
J. Golay and M. Kanevski (2015). A new estimator of intrinsic dimension based on the multipoint Morisita index, Pattern Recognition 48 (12):4070–4081.
J. Golay, M. Leuenberger and M. Kanevski (2017). Feature selection for regression problems based on the Morisita estimator of intrinsic dimension, Pattern Recognition 70:126–138.
J. Golay and M. Kanevski (2017). Unsupervised feature selection based on the Morisita estimator of intrinsic dimension, Knowledge-Based Systems 135:125-134.
J. Golay, M. Leuenberger and M. Kanevski (2015). Morisita-based feature selection for regression problems. Proceedings of the 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges (Belgium).
sim_dat <- SwissRoll(1000)

scaleQ <- 1:15 # It starts with a grid of 1^E cell (or quadrat).
               # It ends with a grid of 15^E cells (or quadrats).
mMI_ID <- MINDID(sim_dat, scaleQ[5:15])

print(paste("The ID estimate is equal to",round(mMI_ID[[1]][1,3],2)))
Computes the functional m-Morisita index for a given set of threshold values.
MINDID_FMC(XY, scaleQ, m=2, thd)
XY | A data set (e.g. a `data.frame`) containing the input variables and, in the last column, the output variable to which the thresholds are applied. |
scaleQ | A vector containing the values of $\ell^{-1}$. |
m | The value of the parameter m (by default: `m = 2`). |
thd | Either a single value or a vector. It contains the value(s) of the threshold(s). |

- $\ell$ is the edge length of the grid cells (or quadrats). Since the input variables (and consequently the grid) are rescaled to the $[0,1]$ interval, $\ell$ is equal to $1$ for a grid consisting of only one cell.
- $\ell^{-1}$ is the number of grid cells (or quadrats) along each axis of the Euclidean space in which the data points are embedded.
- $\ell^{-1}$ is equal to $Q^{1/E}$, where $Q$ is the number of grid cells and $E$ is the number of variables (or features).
- $\ell^{-1}$ is directly related to $\delta$ (see References).
- $\delta$ is the diagonal length of the grid cells.

A vector containing the value(s) of the $m$-Morisita slope, $S_m$, for each threshold value.
Jean Golay [email protected]
J. Golay, M. Kanevski, C. D. Vega Orozco and M. Leuenberger (2014). The multipoint Morisita index for the analysis of spatial patterns, Physica A 406:191–202.
J. Golay and M. Kanevski (2015). A new estimator of intrinsic dimension based on the multipoint Morisita index, Pattern Recognition 48 (12):4070–4081.
L. Telesca, J. Golay and M. Kanevski (2015). Morisita-based space-clustering analysis of Swiss seismicity, Physica A 419:40–47.
## Not run:
bf <- Butterfly(10000)
bf_SP <- bf[,c(1,2,9)]

m <- 2
scaleQ <- 5:25
thd <- quantile(bf_SP$Y,probs=c(0,0.1,0.2,0.3,
                                0.4,0.5,0.6,
                                0.7,0.8,0.9))

nbr_shuf <- 100
Sm_thd_shuf <- matrix(0,length(thd),nbr_shuf)
for (i in 1:nbr_shuf){
  bf_SP_shuf <- cbind(bf_SP[,1:2],sample(bf_SP$Y,length(bf_SP$Y)))
  Sm_thd_shuf[,i] <- MINDID_FMC(bf_SP_shuf, scaleQ, m, thd)
}
mean_shuf <- apply(Sm_thd_shuf,1,mean)

dev.new(width=6, height=4)
matplot(1:10,Sm_thd_shuf,type="l",lty=1,col=rgb(1,0,0,0.25),
        ylim=c(-0.05,0.05),ylab=bquote(S[.(m)]),xaxt="n",
        xlab="",cex.lab=1.2)
axis(1,1:10,labels = FALSE)
text(1:10,par("usr")[3]-0.01,srt=45,adj=1,
     labels=c("0_100", "10_100","20_100","30_100",
              "40_100","50_100","60_100",
              "70_100","80_100","90_100"),xpd=TRUE,font=2,cex=1)
mtext("Thresholds",side=1,line=3.5,cex=1.2)
lines(1:10,mean_shuf,type="b",col="blue",pch=19)

legend.text<-c("Shuffled","mean")
legend.pch=c(NA,19)
legend.lwd=c(2,2)
legend.col=c("red","blue")
legend("topleft",legend=legend.text,pch=legend.pch,lwd=legend.lwd,
       col=legend.col,ncol=1,text.col="black",cex=1,box.lwd=1,bg="white")
## End(Not run)
Estimates Rényi's generalized dimensions (or Rényi's dimensions of order $q$). It is mainly for $q=2$ that the result is used as an estimate of the intrinsic dimension of data.
RenDim(X, scaleQ=1:5, qMin=2, qMax=2)
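X | A data set (e.g. a `data.frame`) with $N$ rows (data points) and $E$ columns (variables or features). |
scaleQ | A vector (at least two values). It contains the values of $\ell^{-1}$ (by default: `scaleQ = 1:5`). |
qMin | The minimum value of $q$ (by default: `qMin = 2`). |
qMax | The maximum value of $q$ (by default: `qMax = 2`). |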
- $\ell$ is the edge length of the grid cells (or quadrats). Since the variables (and consequently the grid) are rescaled to the $[0,1]$ interval, $\ell$ is equal to $1$ for a grid consisting of only one cell.
- $\ell^{-1}$ is the number of grid cells (or quadrats) along each axis of the Euclidean space in which the data points are embedded.
- $\ell^{-1}$ is equal to $Q^{1/E}$, where $Q$ is the number of grid cells and $E$ is the number of variables (or features).
- $\ell^{-1}$ is directly related to $\delta$ (see References).
- $\delta$ is the diagonal length of the grid cells.
A list of two elements:
- a data.frame containing the value of Rényi's information of order $q$ (computed using the natural logarithm) for each value of $\ln(\delta)$ and $q$. The values of $\ln(\delta)$ are provided with regard to the $[0,1]$ interval.
- a data.frame containing the value of $D_q$ for each value of $q$.
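The box-counting idea behind these quantities can be sketched in base R. The code below assumes the standard definitions, namely Rényi's information $I_q(\delta) = \frac{1}{1-q}\ln\sum_i p_i^q$ (with $p_i$ the fraction of points in cell $i$) and $D_q$ taken as the slope of $I_q(\delta)$ against $\ln(1/\delta)$, together with $\delta = \ell\sqrt{E}$. The function name `renyi_dim` and its arguments are hypothetical, and the case $q = 1$ is deliberately excluded; this is not the package's implementation.

```r
# Hypothetical helper (not the package's implementation): box-counting
# estimate of Renyi's dimension of order q_order, for q_order != 1.
renyi_dim <- function(X, scaleQ, q_order = 2) {
  stopifnot(q_order != 1)   # q = 1 would need the Shannon-entropy limit
  X <- as.matrix(X)
  E <- ncol(X)
  N <- nrow(X)
  # Rescale every variable to the [0,1] interval.
  U <- apply(X, 2, function(v) (v - min(v)) / (max(v) - min(v)))
  info <- vapply(scaleQ, function(k) {
    idx  <- pmin(floor(U * k), k - 1)                   # per-axis cell index
    cell <- as.vector(idx %*% k^(seq_len(E) - 1)) + 1   # single cell identifier
    p    <- tabulate(cell, nbins = k^E) / N             # cell probabilities
    p    <- p[p > 0]
    log(sum(p^q_order)) / (1 - q_order)                 # Renyi information
  }, numeric(1))
  ln_inv_delta <- log(scaleQ / sqrt(E))                 # 1/delta = (1/l)/sqrt(E)
  unname(coef(lm(info ~ ln_inv_delta))[2])
}

# Usage: a line embedded in 2-D versus points filling the unit square.
set.seed(1)
u <- runif(20000)
d2_line  <- renyi_dim(cbind(u, u), 5:15)
d2_plane <- renyi_dim(cbind(runif(20000), runif(20000)), 5:15)
```

The sample size is kept fairly large on purpose: with few points per cell, the plug-in probabilities inflate $\sum_i p_i^2$ and pull the estimated slope below the true dimension.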
Jean Golay [email protected]
C. Traina Jr., A. J. M. Traina, L. Wu and C. Faloutsos (2000). Fast feature selection using fractal dimension. Proceedings of the 15th Brazilian Symposium on Databases (SBBD 2000), João Pessoa (Brazil).
E. P. M. De Sousa, C. Traina Jr., A. J. M. Traina, L. Wu and C. Faloutsos (2007). A fast and effective method to find correlations among attributes in databases, Data Mining and Knowledge Discovery 14(3):367-407.
J. Golay and M. Kanevski (2015). A new estimator of intrinsic dimension based on the multipoint Morisita index, Pattern Recognition 48 (12):4070–4081.
H. Hentschel and I. Procaccia (1983). The infinite number of generalized dimensions of fractals and strange attractors, Physica D 8(3):435-444.
sim_dat <- SwissRoll(1000)

scaleQ <- 1:15 # It starts with a grid of 1^E cell (or quadrat).
               # It ends with a grid of 15^E cells (or quadrats).
qRI_ID <- RenDim(sim_dat[,c(1,2)], scaleQ[5:15])

print(paste("The ID estimate is equal to",round(qRI_ID[[1]][1,2],2)))
Generates random points on the Swiss Roll manifold.
SwissRoll(N=10000)
N | The number of points to be generated (by default: `N = 10000`). |
A data.frame containing the coordinates of the Swiss roll data points embedded in $\mathbb{R}^3$.
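For readers who want to see what sampling from such a manifold involves, here is a minimal base-R sketch. It uses a common textbook parametrisation (a spiral in two coordinates, extruded along a third); the constants, the column names and the function name `swiss_roll_sketch` are assumptions made for illustration and may well differ from what `SwissRoll` does internally.

```r
# Hypothetical generator using a common textbook parametrisation;
# the package's own constants and scaling may differ.
swiss_roll_sketch <- function(N = 10000) {
  t <- 1.5 * pi * (1 + 2 * runif(N))   # angle along the spiral
  h <- 21 * runif(N)                   # position along the roll's axis
  data.frame(x = t * cos(t), y = h, z = t * sin(t))
}

set.seed(1)
roll <- swiss_roll_sketch(500)
head(roll)
```

Each point is determined by two free parameters (`t` and `h`), which is why the surface is two-dimensional even though it lives in a three-dimensional space.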
J. A. Lee and M. Verleysen (2007). Nonlinear Dimensionality Reduction, Springer, New York.
sim_dat <- SwissRoll(1000)