This vignette details the simulation tools provided with the SelectBoost package through five examples of use.
If you are a Linux/Unix or a macOS user, you can install a version of SelectBoost with support for doMC from GitHub with:
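A minimal sketch (the repository path fbertran/SelectBoost and the branch name doMC are assumptions, not taken from this vignette; adjust them if they differ):
# install.packages("devtools") # if devtools is not already installed
# Repository path and branch name below are assumptions; adjust if needed.
devtools::install_github("fbertran/SelectBoost", ref = "doMC")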
We want to create \(NDatasets=200\) datasets with \(\textrm{length}(group)=10\) variables and \(N=10\) observations. In this example we want \(9\) groups:
The correlation structure of the explanatory variables of the dataset is provided by the group vector and the intra-group Pearson correlation value for each of the groups by the cor_group vector. A value must be provided even for single-variable groups, and the number of variables is the length of the group vector. Use the simulation_cor function to create the correlation matrix CM.
library(SelectBoost)
group<-c(1:9,1) #10 variables
cor_group<-rep(0.95,9)
CM<-simulation_cor(group,cor_group)
CM
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 1.00 0 0 0 0 0 0 0 0 0.95
#> [2,] 0.00 1 0 0 0 0 0 0 0 0.00
#> [3,] 0.00 0 1 0 0 0 0 0 0 0.00
#> [4,] 0.00 0 0 1 0 0 0 0 0 0.00
#> [5,] 0.00 0 0 0 1 0 0 0 0 0.00
#> [6,] 0.00 0 0 0 0 1 0 0 0 0.00
#> [7,] 0.00 0 0 0 0 0 1 0 0 0.00
#> [8,] 0.00 0 0 0 0 0 0 1 0 0.00
#> [9,] 0.00 0 0 0 0 0 0 0 1 0.00
#> [10,] 0.95 0 0 0 0 0 0 0 0 1.00
The simulation_X function then generates an explanatory dataset with \(N=10\) observations.
set.seed(3141)
N<-10
X<-simulation_X(N,CM)
X
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.69029599 -1.7343187 0.38993973 0.7530345 0.73462394
#> [2,] -0.55429733 1.5236359 -0.44435298 -0.8293970 0.02137105
#> [3,] -0.65277340 -0.2365804 -0.33365748 1.6144115 -1.38882044
#> [4,] 0.06979050 -0.1798988 -0.01576611 -1.6377538 -0.85286458
#> [5,] -0.58450206 -1.7411024 -0.13801223 -0.4512545 -0.14449068
#> [6,] -1.26045088 0.2087882 -0.72491869 0.6856348 -0.43163916
#> [7,] 0.03836372 0.4497529 -0.14266131 -0.6019886 1.36552523
#> [8,] -0.52083678 -0.7626927 0.12068723 -1.6146444 0.79777307
#> [9,] 0.30631735 0.9848562 -0.73193720 -1.2236526 -0.59785470
#> [10,] -1.67366702 0.8709858 -0.80617294 -0.7559627 1.76655800
#> [,6] [,7] [,8] [,9] [,10]
#> [1,] 1.91645527 0.23481727 -0.2019611 -1.0254989 0.97524648
#> [2,] 1.62014297 -0.11230652 1.7600720 1.4330532 -0.46696929
#> [3,] 1.18979400 1.80315380 -0.7423216 -0.3679811 -0.83315697
#> [4,] 0.07352177 -0.30160066 0.6676371 2.0313025 -0.07749897
#> [5,] -0.77910581 0.04835639 0.8040776 -0.2039855 -0.75413152
#> [6,] -1.62517684 0.56503309 -0.1387350 -0.4091602 -1.09688456
#> [7,] 0.14761767 1.17868940 0.5279960 -0.5626160 -0.09706896
#> [8,] 2.41326136 -1.64916614 -0.6481176 1.7608488 -0.69320924
#> [9,] -0.13138414 -0.44649730 0.4507879 1.4486604 0.60032266
#> [10,] 0.58235105 -1.48612450 0.1245139 -0.9288625 -1.10028291
A response can now be added to the dataset by the simulation_DATA function. We have to specify the support of the response, i.e. the explanatory variables that will be used in the linear model created to compute the response. The support is given by the supp vector, whose entries are either \(0\) or \(1\) and whose length must be equal to the number of explanatory variables. If the \(i\)th entry is equal to \(1\), the \(i\)th variable is used to derive the response value, whereas if the \(i\)th entry is equal to \(0\), that variable is not used (the coefficient vector is initialized to zero with beta<-rep(0,length(supp))). The values of the coefficients for the explanatory variables that belong to the support of the response are random (in both absolute value and sign) and set by beta[which(supp==1)]<-runif(sum(supp),minB,maxB)*(rbinom(sum(supp),1,.5)*2-1). Hence, the user can specify their minimal absolute value with the minB option and their maximal absolute value with the maxB option. The stn option is a scaling factor for the noise added to the response vector: its variance is (t(beta)%*%var(X)%*%beta)/stn, with X the matrix of explanatory variables. The higher the stn value, the smaller the noise: for instance, for a given X dataset, an stn value \(\alpha\) times larger results in a noise exactly \(\sqrt{\alpha}\) times smaller.
set.seed(3141)
supp<-c(1,1,1,0,0,0,0,0,0,0)
minB<-1
maxB<-2
stn<-50
firstdataset=simulation_DATA(X,supp,minB,maxB,stn)
firstdataset
#> $X
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.69029599 -1.7343187 0.38993973 0.7530345 0.73462394
#> [2,] -0.55429733 1.5236359 -0.44435298 -0.8293970 0.02137105
#> [3,] -0.65277340 -0.2365804 -0.33365748 1.6144115 -1.38882044
#> [4,] 0.06979050 -0.1798988 -0.01576611 -1.6377538 -0.85286458
#> [5,] -0.58450206 -1.7411024 -0.13801223 -0.4512545 -0.14449068
#> [6,] -1.26045088 0.2087882 -0.72491869 0.6856348 -0.43163916
#> [7,] 0.03836372 0.4497529 -0.14266131 -0.6019886 1.36552523
#> [8,] -0.52083678 -0.7626927 0.12068723 -1.6146444 0.79777307
#> [9,] 0.30631735 0.9848562 -0.73193720 -1.2236526 -0.59785470
#> [10,] -1.67366702 0.8709858 -0.80617294 -0.7559627 1.76655800
#> [,6] [,7] [,8] [,9] [,10]
#> [1,] 1.91645527 0.23481727 -0.2019611 -1.0254989 0.97524648
#> [2,] 1.62014297 -0.11230652 1.7600720 1.4330532 -0.46696929
#> [3,] 1.18979400 1.80315380 -0.7423216 -0.3679811 -0.83315697
#> [4,] 0.07352177 -0.30160066 0.6676371 2.0313025 -0.07749897
#> [5,] -0.77910581 0.04835639 0.8040776 -0.2039855 -0.75413152
#> [6,] -1.62517684 0.56503309 -0.1387350 -0.4091602 -1.09688456
#> [7,] 0.14761767 1.17868940 0.5279960 -0.5626160 -0.09706896
#> [8,] 2.41326136 -1.64916614 -0.6481176 1.7608488 -0.69320924
#> [9,] -0.13138414 -0.44649730 0.4507879 1.4486604 0.60032266
#> [10,] 0.58235105 -1.48612450 0.1245139 -0.9288625 -1.10028291
#>
#> $Y
#> [1] -4.2132936 3.5039588 0.3332549 -0.4924011 -2.5391834 1.8674007
#> [7] 0.6678607 -0.4589311 0.6353867 3.8091855
#>
#> $support
#> [1] 1 1 1 0 0 0 0 0 0 0
#>
#> $beta
#> [1] -1.754996 1.964992 1.041431 0.000000 0.000000 0.000000 0.000000
#> [8] 0.000000 0.000000 0.000000
#>
#> $stn
#> [1] 50
#>
#> $sigma
#> [,1]
#> [1,] 0.3493447
#>
#> attr(,"class")
#> [1] "simuls"
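As a quick sanity check of the noise scaling described above, we can recompute \(\sigma\) directly; this is a sketch under the assumption that simulation_DATA defines \(\sigma\) as the square root of (t(beta)%*%var(X)%*%beta)/stn:
beta_true <- firstdataset$beta
# recompute the noise standard deviation from the stated scaling formula
# (assumption: simulation_DATA derives sigma exactly this way)
sigma_check <- sqrt((t(beta_true) %*% var(X) %*% beta_true) / stn)
sigma_check # compare with firstdataset$sigma below
firstdataset$sigma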
To generate multiple datasets, we repeat steps 2 and 3, for instance with a for loop. We create \(NDatasets=200\) datasets and assign them to the objects DATA_exemple1_nb_1 to DATA_exemple1_nb_200.
set.seed(3141)
NDatasets=200
for(i in 1:NDatasets){
  X<-simulation_X(N,CM)
  assign(paste("DATA_exemple1_nb_",i,sep=""),simulation_DATA(X,supp,minB,maxB,stn))
}
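As a side note, the assign-based loop creates 200 separate objects in the global environment. The same datasets could be kept in a single list, which is easier to iterate over and to clean up; a minimal sketch (the list name is ours, and the remainder of this vignette keeps the assign-based objects):
datasets_exemple1 <- vector("list", NDatasets)
for(i in 1:NDatasets){
  X <- simulation_X(N, CM)
  datasets_exemple1[[i]] <- simulation_DATA(X, supp, minB, maxB, stn)
}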
We now check the correlation structure of the explanatory variables. First we compute the mean correlation matrix.
corr_sum=matrix(0,length(group),length(group))
for(i in 1:NDatasets){
  corr_sum=corr_sum+cor(get(paste("DATA_exemple1_nb_",i,sep=""))$X)
}
corr_mean=corr_sum/NDatasets
Then we display and plot the mean correlation matrix.
corr_mean
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1.0000000000 -0.0008611262 0.0193872629 0.0192496952 -0.012147407
#> [2,] -0.0008611262 1.0000000000 -0.0520800766 0.0144798781 0.006237499
#> [3,] 0.0193872629 -0.0520800766 1.0000000000 0.0008693002 -0.021373842
#> [4,] 0.0192496952 0.0144798781 0.0008693002 1.0000000000 0.007753693
#> [5,] -0.0121474071 0.0062374994 -0.0213738420 0.0077536931 1.000000000
#> [6,] -0.0089756967 -0.0404111300 0.0344817040 0.0081889675 0.018018674
#> [7,] -0.0082911544 0.0072612885 -0.0233188445 -0.0380192689 0.023833224
#> [8,] 0.0272233550 -0.0066654749 -0.0487035643 0.0172624295 0.043181249
#> [9,] -0.0145986545 0.0071146338 0.0364868095 -0.0020153080 -0.027733046
#> [10,] 0.9422544272 -0.0071281448 0.0264886880 0.0221950354 -0.003811061
#> [,6] [,7] [,8] [,9] [,10]
#> [1,] -0.008975697 -0.008291154 0.027223355 -0.014598655 0.942254427
#> [2,] -0.040411130 0.007261289 -0.006665475 0.007114634 -0.007128145
#> [3,] 0.034481704 -0.023318845 -0.048703564 0.036486809 0.026488688
#> [4,] 0.008188968 -0.038019269 0.017262430 -0.002015308 0.022195035
#> [5,] 0.018018674 0.023833224 0.043181249 -0.027733046 -0.003811061
#> [6,] 1.000000000 -0.015449494 -0.004054573 0.006159349 0.003444504
#> [7,] -0.015449494 1.000000000 -0.002105066 0.005052182 -0.018230902
#> [8,] -0.004054573 -0.002105066 1.000000000 -0.003169857 0.021688766
#> [9,] 0.006159349 0.005052182 -0.003169857 1.000000000 -0.013388952
#> [10,] 0.003444504 -0.018230902 0.021688766 -0.013388952 1.000000000
plot(abs(corr_mean))
We now fit a linear model to each dataset with the lm function, summing the absolute values of the estimated coefficients and counting any failed fits.
coef_sum=rep(0,length(group))
names(coef_sum)<-paste("x",1:length(group),sep="")
error_counter=0
for(i in 1:NDatasets){
  tempdf=data.frame(cbind(Y=get(paste("DATA_exemple1_nb_",i,sep=""))$Y,
                          get(paste("DATA_exemple1_nb_",i,sep=""))$X))
  tempcoef=coef(lm(Y~.-1,data=tempdf))
  if(is.null(tempcoef)){
    cat("Error in lm fit, skip coefficients\n")
    error_counter=error_counter+1
  } else{
    coef_sum=coef_sum+abs(tempcoef)
  }
}
error_counter
#> [1] 0
coef_mean=coef_sum/NDatasets
All fits were successful. Then we display and plot the mean coefficient values.
coef_mean
#> x1 x2 x3 x4 x5
#> 1.508327e+00 1.518967e+00 1.491694e+00 1.437883e-15 1.857978e-15
#> x6 x7 x8 x9 x10
#> 2.524863e-15 1.848380e-15 2.314167e-15 2.601342e-15 6.269818e-15
barplot(coef_mean)
abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue")
We now reduce the noise in the new responses by a factor \(\sqrt{5000/50}=10\): the variance of the noise added to the response is \(\frac{1}{stn}\,\beta_{support}^t\,\mathrm{Var}(X)\,\beta_{support}\), where \(\beta_{support}\) is the vector of coefficients whose entries outside the support are zero, so multiplying stn by \(100\) divides \(\sigma\) by \(10\).
set.seed(3141)
stn <- 5000
for(i in 1:NDatasets){
  X<-simulation_X(N,CM)
  assign(paste("DATA_exemple1_bis_nb_",i,sep=""),simulation_DATA(X,supp,minB,maxB,stn))
}
Since the same seed is used, the explanatory datasets are identical and only the noise level changes, so we can compare the \(\sigma\) values between those \(NDatasets=200\) pairs of datasets.
stn_ratios=rep(0,NDatasets)
for(i in 1:NDatasets){
  stn_ratios[i]<-get(paste("DATA_exemple1_nb_",i,sep=""))$sigma/
    get(paste("DATA_exemple1_bis_nb_",i,sep=""))$sigma
}
all(sapply(stn_ratios,all.equal,10))
#> [1] TRUE
All the ratios are equal to 10 as anticipated.
Since the correlation structure is the same as before, we do not need to check it again. As before, we infer the coefficient values of a linear model using the lm function.
coef_sum_bis=rep(0,length(group))
names(coef_sum_bis)<-paste("x",1:length(group),sep="")
error_counter_bis=0
for(i in 1:NDatasets){
  tempdf=data.frame(cbind(Y=get(paste("DATA_exemple1_bis_nb_",i,sep=""))$Y,
                          get(paste("DATA_exemple1_bis_nb_",i,sep=""))$X))
  tempcoef=coef(lm(Y~.-1,data=tempdf))
  if(is.null(tempcoef)){
    cat("Error in lm fit, skip coefficients\n")
    error_counter_bis=error_counter_bis+1
  } else{
    coef_sum_bis=coef_sum_bis+abs(tempcoef)
  }
}
error_counter_bis
#> [1] 0
coef_mean_bis=coef_sum_bis/NDatasets
All fits were successful. Then we display and plot the mean coefficient values. As expected, the noise reduction enhances the retrieval of the true mean coefficient absolute values by the models.
coef_mean_bis
#> x1 x2 x3 x4 x5
#> 1.508327e+00 1.518967e+00 1.491694e+00 1.437883e-15 1.857978e-15
#> x6 x7 x8 x9 x10
#> 2.524863e-15 1.848380e-15 2.314167e-15 2.601342e-15 6.269818e-15
barplot(coef_mean_bis)
abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue")
The simulation process looks successful. What are the confidence indices for those variables?
We want to create \(NDatasets=200\) datasets with \(\textrm{length}(group)=50\) variables and \(N=20\) observations. In this example we want a single group:
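The setup code for this second example is not reproduced here. A sketch consistent with the outputs below (the intra-group correlation, the support of size 5 and the minB, maxB and stn values are inferred from those outputs or assumed, not quoted from the source):
set.seed(3141)
NDatasets <- 200
N <- 20
group <- rep(1, 50)              # a single group containing all 50 variables
cor_group <- rep(0.5, 1)         # inferred from the mean correlation matrix below
supp <- c(1,1,1,1,1, rep(0, 45)) # support of size 5, inferred from the mean coefficients below
minB <- 1                        # assumed, as in the first example
maxB <- 2                        # assumed, as in the first example
stn <- 50                        # assumed (stn was changed to 5000 above)
CM <- simulation_cor(group, cor_group)
for(i in 1:NDatasets){
  X <- simulation_X(N, CM)
  assign(paste("DATA_exemple2_nb_", i, sep=""), simulation_DATA(X, supp, minB, maxB, stn))
}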
We now check the correlation structure of the explanatory variables. First we compute the mean correlation matrix.
corr_sum=matrix(0,length(group),length(group))
for(i in 1:NDatasets){
  corr_sum=corr_sum+cor(get(paste("DATA_exemple2_nb_",i,sep=""))$X)
}
corr_mean=corr_sum/NDatasets
Then we display and plot the mean correlation matrix.
corr_mean[1:10,1:10]
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 1.0000000 0.4875195 0.4774265 0.4739869 0.4796578 0.4927521
#> [2,] 0.4875195 1.0000000 0.4750680 0.4940743 0.4842295 0.4988293
#> [3,] 0.4774265 0.4750680 1.0000000 0.4799584 0.4690139 0.4831884
#> [4,] 0.4739869 0.4940743 0.4799584 1.0000000 0.4990798 0.4995625
#> [5,] 0.4796578 0.4842295 0.4690139 0.4990798 1.0000000 0.4890548
#> [6,] 0.4927521 0.4988293 0.4831884 0.4995625 0.4890548 1.0000000
#> [7,] 0.4898313 0.4937678 0.4765315 0.4928739 0.4839137 0.4980120
#> [8,] 0.4786241 0.4904904 0.4871110 0.4799335 0.4807772 0.5015775
#> [9,] 0.4821186 0.4801270 0.4648431 0.4774361 0.4657147 0.4942747
#> [10,] 0.4875410 0.4690954 0.4669033 0.4832185 0.4781301 0.4873754
#> [,7] [,8] [,9] [,10]
#> [1,] 0.4898313 0.4786241 0.4821186 0.4875410
#> [2,] 0.4937678 0.4904904 0.4801270 0.4690954
#> [3,] 0.4765315 0.4871110 0.4648431 0.4669033
#> [4,] 0.4928739 0.4799335 0.4774361 0.4832185
#> [5,] 0.4839137 0.4807772 0.4657147 0.4781301
#> [6,] 0.4980120 0.5015775 0.4942747 0.4873754
#> [7,] 1.0000000 0.4766888 0.4786140 0.4825714
#> [8,] 0.4766888 1.0000000 0.4783224 0.4832658
#> [9,] 0.4786140 0.4783224 1.0000000 0.4769654
#> [10,] 0.4825714 0.4832658 0.4769654 1.0000000
plot(abs(corr_mean))
We now fit, for each dataset, a linear model with the lm function and a lasso with the lars package, selecting the penalty by cross-validation with the "+1SE" rule.
coef_sum=rep(0,length(group))
coef_lasso_sum=rep(0,length(group))
names(coef_sum)<-paste("x",1:length(group),sep="")
names(coef_lasso_sum)<-paste("x",1:length(group),sep="")
error_counter=0
require(lars)
for(i in 1:NDatasets){
  tempdf=data.frame(cbind(Y=get(paste("DATA_exemple2_nb_",i,sep=""))$Y,
                          get(paste("DATA_exemple2_nb_",i,sep=""))$X))
  tempcoef=coef(lm(Y~.-1,data=tempdf))
  lasso.1 <- lars::lars(x=get(paste("DATA_exemple2_nb_",i,sep=""))$X,
                        y=get(paste("DATA_exemple2_nb_",i,sep=""))$Y, type="lasso",
                        trace=FALSE, normalize=FALSE, intercept = FALSE)
  # cv.lars() uses cross-validation to estimate the optimal position in the path
  cv.lasso.1 <- lars::cv.lars(x=get(paste("DATA_exemple2_nb_",i,sep=""))$X,
                              y=get(paste("DATA_exemple2_nb_",i,sep=""))$Y,
                              plot.it=FALSE, type="lasso")
  # Use the "+1SE rule" to find the best model:
  # take the minimal CV error and add its standard error ("limit"),
  # then pick the smallest model whose CV error lies within this limit (at "s.cv.1")
  limit <- min(cv.lasso.1$cv) + cv.lasso.1$cv.error[which.min(cv.lasso.1$cv)]
  s.cv.1 <- cv.lasso.1$index[min(which(cv.lasso.1$cv < limit))]
  # accumulate the absolute values of the lasso coefficients at the optimal s
  coef_lasso_sum=coef_lasso_sum+abs(coef(lasso.1, s=s.cv.1, mode="fraction"))
  if(is.null(tempcoef)){
    cat("Error in lm fit, skip coefficients\n")
    error_counter=error_counter+1
  } else{
    coef_sum=coef_sum+abs(tempcoef)
  }
}
error_counter
#> [1] 0
coef_mean=coef_sum/NDatasets
coef_lasso_mean=coef_lasso_sum/NDatasets
With both regular least squares and lasso estimators all fits were successful, yet with only \(N=20\) observations at most 20 of the 50 coefficients can be estimated by regular least squares; the remaining ones are returned as NA. Then we display and plot the mean coefficient values for the least squares estimates.
coef_mean
#> x1 x2 x3 x4 x5
#> 1.500084e+00 1.492942e+00 1.496050e+00 1.528043e+00 1.497537e+00
#> x6 x7 x8 x9 x10
#> 2.577168e-15 6.820231e-15 3.386509e-15 4.297997e-15 7.216331e-15
#> x11 x12 x13 x14 x15
#> 3.528058e-15 1.001056e-14 6.068207e-15 9.640427e-15 5.351262e-15
#> x16 x17 x18 x19 x20
#> 9.276245e-15 5.295417e-15 8.227904e-15 5.860229e-15 4.724224e-15
#> x21 x22 x23 x24 x25
#> NA NA NA NA NA
#> x26 x27 x28 x29 x30
#> NA NA NA NA NA
#> x31 x32 x33 x34 x35
#> NA NA NA NA NA
#> x36 x37 x38 x39 x40
#> NA NA NA NA NA
#> x41 x42 x43 x44 x45
#> NA NA NA NA NA
#> x46 x47 x48 x49 x50
#> NA NA NA NA NA
barplot(coef_mean)
abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue")
coef_lasso_mean
#> x1 x2 x3 x4 x5 x6
#> 1.040812405 1.024444248 1.065905971 1.058866701 0.996624348 0.007811962
#> x7 x8 x9 x10 x11 x12
#> 0.015952857 0.010284528 0.025145014 0.019259508 0.014825470 0.023340488
#> x13 x14 x15 x16 x17 x18
#> 0.015909217 0.013040568 0.012039290 0.007597800 0.009091856 0.016135853
#> x19 x20 x21 x22 x23 x24
#> 0.006510554 0.017700262 0.024024125 0.026401587 0.016510921 0.006350005
#> x25 x26 x27 x28 x29 x30
#> 0.015389373 0.013108940 0.022619351 0.008526745 0.003851395 0.026688493
#> x31 x32 x33 x34 x35 x36
#> 0.013608283 0.010260901 0.019494725 0.008485431 0.008627058 0.019809449
#> x37 x38 x39 x40 x41 x42
#> 0.015047270 0.007562486 0.005826465 0.008270178 0.009596887 0.016556957
#> x43 x44 x45 x46 x47 x48
#> 0.014522021 0.016432803 0.003504785 0.008892051 0.012878671 0.011623352
#> x49 x50
#> 0.004653053 0.010384327
barplot(coef_lasso_mean,ylim=c(0,1.5))
abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue")
The simulation process looks successful: the lasso estimates mostly retrieve the correct variables, yet the other ones are also sometimes selected. What are the confidence indices for those variables?
We want to create \(NDatasets=200\) datasets with \(\textrm{length}(supp)=100\) variables and \(N=24\) observations. In this example we use real data for the X variables: we sample them from the \(1650\) probesets that are differentially expressed between the two conditions US and S. The main interest of that simulation is that the correlation structure of the X dataset will be a real one.
First retrieve the datasets and get the differentially expressed probesets. Run the code to get additional plots.
require(CascadeData)
data(micro_S)
data(micro_US)
require(Cascade)
micro_US<-as.micro_array(micro_US,c(60,90,240,390),6)
micro_S<-as.micro_array(micro_S,c(60,90,240,390),6)
S<-geneSelection(list(micro_S,micro_US),list("condition",c(1,2),1),-1)
Sel<-micro_S@microarray[S@name,]
summary(S)
#> N1_US_T60 N1_US_T90 N1_US_T210
#> Min. :-3.24588 Min. :-2.740840 Min. :-4.056378
#> 1st Qu.:-0.13240 1st Qu.:-0.116239 1st Qu.:-0.298263
#> Median :-0.04406 Median : 0.007188 Median :-0.002423
#> Mean : 0.02723 Mean : 0.008828 Mean :-0.062663
#> 3rd Qu.: 0.07049 3rd Qu.: 0.143812 3rd Qu.: 0.218037
#> Max. : 3.00403 Max. : 2.581014 Max. : 3.068053
#> N1_US_T390 N2_US_T60 N2_US_T90 N2_US_T210
#> Min. :-4.6843 Min. :-3.1490 Min. :-3.26105 Min. :-3.69272
#> 1st Qu.:-0.4123 1st Qu.: 0.1957 1st Qu.:-0.20630 1st Qu.:-0.18419
#> Median :-0.1167 Median : 0.3917 Median :-0.10981 Median : 0.03299
#> Mean :-0.1144 Mean : 0.4016 Mean :-0.13583 Mean : 0.01174
#> 3rd Qu.: 0.1270 3rd Qu.: 0.5908 3rd Qu.:-0.02678 3rd Qu.: 0.22894
#> Max. : 3.1434 Max. : 3.7878 Max. : 3.54337 Max. : 2.46752
#> N2_US_T390 N3_US_T60 N3_US_T90
#> Min. :-4.61130 Min. :-3.66185 Min. :-3.44746
#> 1st Qu.:-0.21195 1st Qu.:-0.16978 1st Qu.:-0.07454
#> Median :-0.01994 Median :-0.03868 Median : 0.04864
#> Mean :-0.04728 Mean :-0.05892 Mean : 0.01881
#> 3rd Qu.: 0.16975 3rd Qu.: 0.07801 3rd Qu.: 0.15696
#> Max. : 2.36837 Max. : 3.17208 Max. : 2.22462
#> N3_US_T210 N3_US_T390 N4_US_T60
#> Min. :-4.2709 Min. :-4.73460 Min. :-1.80952
#> 1st Qu.:-0.6869 1st Qu.:-0.21676 1st Qu.:-0.08733
#> Median :-0.2507 Median : 0.04391 Median :-0.03024
#> Mean :-0.2593 Mean : 0.02288 Mean :-0.02583
#> 3rd Qu.: 0.1073 3rd Qu.: 0.28173 3rd Qu.: 0.02099
#> Max. : 3.2871 Max. : 3.67975 Max. : 2.20727
#> N4_US_T90 N4_US_T210 N4_US_T390
#> Min. :-3.555348 Min. :-2.95869 Min. :-3.21158
#> 1st Qu.: 0.001517 1st Qu.:-0.23332 1st Qu.:-0.17235
#> Median : 0.076508 Median :-0.09978 Median :-0.05177
#> Mean : 0.055968 Mean :-0.11058 Mean :-0.06072
#> 3rd Qu.: 0.147247 3rd Qu.: 0.02792 3rd Qu.: 0.05336
#> Max. : 1.719786 Max. : 2.91531 Max. : 2.15722
#> N5_US_T60 N5_US_T90 N5_US_T210
#> Min. :-2.04861 Min. :-2.14086 Min. :-3.61826
#> 1st Qu.: 0.01025 1st Qu.:-0.10201 1st Qu.:-0.22499
#> Median : 0.09644 Median :-0.01021 Median :-0.06266
#> Mean : 0.06829 Mean :-0.02281 Mean :-0.08268
#> 3rd Qu.: 0.17637 3rd Qu.: 0.06101 3rd Qu.: 0.08834
#> Max. : 2.23614 Max. : 2.62104 Max. : 2.81091
#> N5_US_T390 N6_US_T60 N6_US_T90
#> Min. :-3.42995 Min. :-2.814716 Min. :-2.46213
#> 1st Qu.:-0.11192 1st Qu.:-0.054682 1st Qu.:-0.10794
#> Median : 0.05497 Median : 0.008866 Median :-0.01223
#> Mean : 0.02813 Mean : 0.008599 Mean : 0.01019
#> 3rd Qu.: 0.20603 3rd Qu.: 0.077924 3rd Qu.: 0.09869
#> Max. : 3.38922 Max. : 2.789490 Max. : 2.58776
#> N6_US_T210 N6_US_T390
#> Min. :-2.89696 Min. :-4.15575
#> 1st Qu.:-0.17750 1st Qu.:-0.24111
#> Median : 0.02146 Median :-0.03612
#> Mean :-0.01274 Mean :-0.06744
#> 3rd Qu.: 0.19801 3rd Qu.: 0.14960
#> Max. : 2.08709 Max. : 2.97212
We then generate the datasets, sampling for each of them 100 probeset expressions among the 1650 that were selected and linking the response to the expressions of the first five sampled probesets.
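The generation code itself is not reproduced here. A sketch consistent with that description (the sampling call, the column normalization mirroring example 4 below, and the reuse of the first example's supp, minB, maxB and stn values are assumptions):
set.seed(3141)
NDatasets <- 200
supp <- c(1,1,1,1,1, rep(0, 95)) # response linked to the first five sampled probesets
minB <- 1                        # assumed, as in the first example
maxB <- 2                        # assumed, as in the first example
stn <- 50                        # assumed, as in the first example
for(i in 1:NDatasets){
  # sample 100 of the 1650 selected probesets; the 24 arrays are the observations
  X <- t(as.matrix(Sel[sample(1:nrow(Sel), 100), ]))
  # column normalization, mirroring example 4 below (an assumption here)
  Xnorm <- t(t(X)/sqrt(colSums(X*X)))
  assign(paste("DATA_exemple3_nb_", i, sep=""), simulation_DATA(Xnorm, supp, minB, maxB, stn))
}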
Here are the plots of an example of correlation structure, namely for DATA_exemple3_nb_200$X. Run the code to get the graphics.
As in the previous example, we fit a linear model and a cross-validated lasso to each dataset.
coef_sum=rep(0,length(supp))
coef_lasso_sum=rep(0,length(supp))
names(coef_sum)<-paste("x",1:length(supp),sep="")
names(coef_lasso_sum)<-paste("x",1:length(supp),sep="")
error_counter=0
require(lars)
for(i in 1:NDatasets){
  tempdf=data.frame(cbind(Y=get(paste("DATA_exemple3_nb_",i,sep=""))$Y,
                          get(paste("DATA_exemple3_nb_",i,sep=""))$X))
  tempcoef=coef(lm(Y~.-1,data=tempdf))
  lasso.1 <- lars::lars(x=get(paste("DATA_exemple3_nb_",i,sep=""))$X,
                        y=get(paste("DATA_exemple3_nb_",i,sep=""))$Y, type="lasso",
                        trace=FALSE, normalize=FALSE, intercept = FALSE)
  # cv.lars() uses cross-validation to estimate the optimal position in the path
  cv.lasso.1 <- lars::cv.lars(x=get(paste("DATA_exemple3_nb_",i,sep=""))$X,
                              y=get(paste("DATA_exemple3_nb_",i,sep=""))$Y,
                              plot.it=FALSE, normalize=FALSE, intercept = FALSE, type="lasso")
  # Use the "+1SE rule": take the minimal CV error plus its standard error ("limit"),
  # then pick the smallest model whose CV error lies within this limit (at "s.cv.1")
  limit <- min(cv.lasso.1$cv) + cv.lasso.1$cv.error[which.min(cv.lasso.1$cv)]
  s.cv.1 <- cv.lasso.1$index[min(which(cv.lasso.1$cv < limit))]
  # accumulate the absolute values of the lasso coefficients at the optimal s
  coef_lasso_sum=coef_lasso_sum+abs(coef(lasso.1, s=s.cv.1, mode="fraction"))
  if(is.null(tempcoef)){
    cat("Error in lm fit, skip coefficients\n")
    error_counter=error_counter+1
  } else{
    coef_sum=coef_sum+abs(tempcoef)
  }
}
error_counter
#> [1] 0
coef_mean=coef_sum/NDatasets
coef_lasso_mean=coef_lasso_sum/NDatasets
With both regular least squares and lasso estimators all fits were successful, yet with only \(N=24\) observations at most 24 of the 100 coefficients can be estimated by regular least squares; the remaining ones are returned as NA. Then we display and plot the mean coefficient values for the least squares estimates.
coef_mean
#> x1 x2 x3 x4 x5
#> 1.457741e+00 1.488627e+00 1.503719e+00 1.506939e+00 1.481887e+00
#> x6 x7 x8 x9 x10
#> 4.783287e-14 3.253540e-14 3.033880e-14 3.386953e-14 5.129063e-14
#> x11 x12 x13 x14 x15
#> 2.171987e-14 3.593990e-14 6.635802e-14 3.740085e-14 4.467427e-14
#> x16 x17 x18 x19 x20
#> 2.678085e-14 3.441995e-14 2.796034e-14 2.978652e-14 2.492476e-14
#> x21 x22 x23 x24 x25
#> 2.808596e-14 2.206820e-14 3.824975e-14 3.346627e-14 NA
#> x26 x27 x28 x29 x30
#> NA NA NA NA NA
#> x31 x32 x33 x34 x35
#> NA NA NA NA NA
#> x36 x37 x38 x39 x40
#> NA NA NA NA NA
#> x41 x42 x43 x44 x45
#> NA NA NA NA NA
#> x46 x47 x48 x49 x50
#> NA NA NA NA NA
#> x51 x52 x53 x54 x55
#> NA NA NA NA NA
#> x56 x57 x58 x59 x60
#> NA NA NA NA NA
#> x61 x62 x63 x64 x65
#> NA NA NA NA NA
#> x66 x67 x68 x69 x70
#> NA NA NA NA NA
#> x71 x72 x73 x74 x75
#> NA NA NA NA NA
#> x76 x77 x78 x79 x80
#> NA NA NA NA NA
#> x81 x82 x83 x84 x85
#> NA NA NA NA NA
#> x86 x87 x88 x89 x90
#> NA NA NA NA NA
#> x91 x92 x93 x94 x95
#> NA NA NA NA NA
#> x96 x97 x98 x99 x100
#> NA NA NA NA NA
barplot(coef_mean)
abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue")
coef_lasso_mean
#> x1 x2 x3 x4 x5 x6
#> 0.637209930 0.664930735 0.704791558 0.699788446 0.708957352 0.026974567
#> x7 x8 x9 x10 x11 x12
#> 0.023487857 0.009919040 0.011058633 0.010031720 0.016771602 0.010042126
#> x13 x14 x15 x16 x17 x18
#> 0.021205099 0.017549985 0.012900827 0.018584630 0.023346476 0.010608447
#> x19 x20 x21 x22 x23 x24
#> 0.019021097 0.014329658 0.038015434 0.018848081 0.024050393 0.023265893
#> x25 x26 x27 x28 x29 x30
#> 0.021527789 0.025728325 0.018939560 0.016298912 0.017970262 0.012682511
#> x31 x32 x33 x34 x35 x36
#> 0.025418463 0.022593725 0.013102577 0.022873574 0.009968251 0.019946689
#> x37 x38 x39 x40 x41 x42
#> 0.017837080 0.014291701 0.038368829 0.012130424 0.014915569 0.019767104
#> x43 x44 x45 x46 x47 x48
#> 0.014077032 0.019515928 0.018308034 0.023001590 0.023608520 0.023254216
#> x49 x50 x51 x52 x53 x54
#> 0.027810565 0.020526328 0.021684967 0.017601886 0.015140221 0.014786294
#> x55 x56 x57 x58 x59 x60
#> 0.015782671 0.021593288 0.017002037 0.010170505 0.014032097 0.018771402
#> x61 x62 x63 x64 x65 x66
#> 0.021602370 0.015063636 0.017964267 0.008458907 0.025593764 0.015068537
#> x67 x68 x69 x70 x71 x72
#> 0.026420849 0.027444423 0.012679975 0.016202908 0.023457459 0.013357638
#> x73 x74 x75 x76 x77 x78
#> 0.019508482 0.019488113 0.020121936 0.038249554 0.020394668 0.022021328
#> x79 x80 x81 x82 x83 x84
#> 0.017333107 0.011867270 0.013594937 0.009380064 0.011832266 0.015219820
#> x85 x86 x87 x88 x89 x90
#> 0.020711292 0.015979236 0.014798893 0.008586843 0.015322780 0.012577586
#> x91 x92 x93 x94 x95 x96
#> 0.015768535 0.018415016 0.029014530 0.019916314 0.012457481 0.028478157
#> x97 x98 x99 x100
#> 0.009748370 0.009669022 0.032595465 0.012964933
barplot(coef_lasso_mean,ylim=c(0,1.5))
abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue")
The simulation process looks successful: the lasso estimates mostly retrieve the correct variables, yet the other ones are also sometimes selected. What are the confidence indices for those variables?
We want to create \(NDatasets=101\) datasets with \(\textrm{length}(supp)=100\) variables and \(N=18\) observations. In this example we use real data for the variables: the \(101\) probesets that are the most differentially expressed between the two conditions US and S. We create \(101\) datasets by leaving each of the variables out in turn and using it as the response to be predicted. For the explanatory variables we only use the measurements for the 1st, 2nd and 3rd timepoints, and for the responses the measurements of the same variables for the 2nd, 3rd and 4th timepoints. The main interest of that simulation is that the correlation structure of the X dataset will be a real one and that it is a typical setting for cascade network reverse-engineering in genomics or proteomics; see the Cascade package for more details.
First retrieve the datasets and get the differentially expressed probesets. Run the code to get additional plots.
require(CascadeData)
data(micro_S)
data(micro_US)
require(Cascade)
micro_US<-as.micro_array(micro_US,c(60,90,240,390),6)
micro_S<-as.micro_array(micro_S,c(60,90,240,390),6)
S<-geneSelection(list(micro_S,micro_US),list("condition",c(1,2),1),101)
Sel<-micro_S@microarray[S@name,]
summary(S)
#> N1_US_T60 N1_US_T90 N1_US_T210
#> Min. :-0.89276 Min. :-0.773190 Min. :-2.079442
#> 1st Qu.:-0.11132 1st Qu.:-0.082888 1st Qu.:-0.283990
#> Median :-0.03030 Median :-0.002073 Median : 0.006005
#> Mean : 0.04941 Mean : 0.015356 Mean :-0.074991
#> 3rd Qu.: 0.08370 3rd Qu.: 0.127301 3rd Qu.: 0.204382
#> Max. : 2.53897 Max. : 0.938891 Max. : 1.155634
#> N1_US_T390 N2_US_T60 N2_US_T90
#> Min. :-1.30745 Min. :-1.6422 Min. :-3.26105
#> 1st Qu.:-0.44631 1st Qu.: 0.1499 1st Qu.:-0.21748
#> Median :-0.10464 Median : 0.3358 Median :-0.12623
#> Mean :-0.03396 Mean : 0.3331 Mean :-0.13730
#> 3rd Qu.: 0.23266 3rd Qu.: 0.5210 3rd Qu.:-0.04215
#> Max. : 2.96588 Max. : 2.7005 Max. : 1.56541
#> N2_US_T210 N2_US_T390 N3_US_T60
#> Min. :-2.55852 Min. :-2.71910 Min. :-2.37749
#> 1st Qu.:-0.16896 1st Qu.:-0.31993 1st Qu.:-0.18448
#> Median : 0.01873 Median :-0.02216 Median :-0.06051
#> Mean :-0.01445 Mean :-0.08521 Mean :-0.06439
#> 3rd Qu.: 0.16535 3rd Qu.: 0.18161 3rd Qu.: 0.08545
#> Max. : 1.16299 Max. : 0.95946 Max. : 1.00636
#> N3_US_T90 N3_US_T210 N3_US_T390
#> Min. :-0.76029 Min. :-1.9716 Min. :-3.225327
#> 1st Qu.:-0.03858 1st Qu.:-0.6713 1st Qu.:-0.258856
#> Median : 0.03977 Median :-0.1636 Median :-0.017611
#> Mean : 0.03449 Mean :-0.2324 Mean :-0.004301
#> 3rd Qu.: 0.14177 3rd Qu.: 0.2057 3rd Qu.: 0.287566
#> Max. : 0.52193 Max. : 1.3350 Max. : 3.679747
#> N4_US_T60 N4_US_T90 N4_US_T210
#> Min. :-0.376767 Min. :-0.52425 Min. :-1.38629
#> 1st Qu.:-0.095907 1st Qu.:-0.01274 1st Qu.:-0.23321
#> Median :-0.032177 Median : 0.07226 Median :-0.12920
#> Mean :-0.007576 Mean : 0.06410 Mean :-0.12071
#> 3rd Qu.: 0.024612 3rd Qu.: 0.15332 3rd Qu.: 0.02794
#> Max. : 0.734820 Max. : 0.52910 Max. : 0.57303
#> N4_US_T390 N5_US_T60 N5_US_T90
#> Min. :-0.65678 Min. :-0.92734 Min. :-2.025374
#> 1st Qu.:-0.19730 1st Qu.:-0.03818 1st Qu.:-0.080528
#> Median :-0.05158 Median : 0.05680 Median : 0.004812
#> Mean :-0.07142 Mean : 0.01675 Mean :-0.025334
#> 3rd Qu.: 0.07118 3rd Qu.: 0.14034 3rd Qu.: 0.055367
#> Max. : 0.74194 Max. : 0.35555 Max. : 1.193922
#> N5_US_T210 N5_US_T390 N6_US_T60
#> Min. :-0.87925 Min. :-1.791759 Min. :-1.30077
#> 1st Qu.:-0.23615 1st Qu.:-0.167992 1st Qu.:-0.03809
#> Median :-0.08791 Median : 0.001692 Median : 0.01202
#> Mean :-0.04003 Mean :-0.008130 Mean : 0.02745
#> 3rd Qu.: 0.11900 3rd Qu.: 0.165985 3rd Qu.: 0.08639
#> Max. : 2.80103 Max. : 2.280781 Max. : 1.29689
#> N6_US_T90 N6_US_T210 N6_US_T390
#> Min. :-0.606136 Min. :-1.96944 Min. :-0.835295
#> 1st Qu.:-0.096129 1st Qu.:-0.17553 1st Qu.:-0.282883
#> Median : 0.003016 Median :-0.00207 Median : 0.002766
#> Mean : 0.023406 Mean :-0.01794 Mean :-0.046176
#> 3rd Qu.: 0.072245 3rd Qu.: 0.20799 3rd Qu.: 0.186650
#> Max. : 0.971099 Max. : 0.96286 Max. : 1.098612
suppt<-rep(1:4,6)
supp<-c(1,1,1,1,1,rep(0,95)) #not used since we use one of the probeset expressions as response
minB<-1 #not used since we use one of the probeset expressions as response
maxB<-2 #not used since we use one of the probeset expressions as response
stn<-50 #not used since we use one of the probeset expressions as response
NDatasets<-101
set.seed(3141)
for(i in 1:NDatasets){
  #the explanatory variables are the values for the 1st, 2nd and 3rd timepoints
  X<-t(as.matrix(Sel[-i,suppt!=4]))
  Xnorm<-t(t(X)/sqrt(colSums(X*X)))
  DATA<-simulation_DATA(Xnorm,supp,minB,maxB,stn)
  #the responses are the values for the 2nd, 3rd and 4th timepoints
  DATA$Y<-as.vector(t(Sel[i,suppt!=1]))
  assign(paste("DATA_exemple4_nb_",i,sep=""),DATA)
  rm(DATA)
}
Here are the plots of an example of correlation structure, namely for the X matrix of one of the example 4 datasets. Run the code to get the graphics.
As before, we fit a linear model and a cross-validated lasso to each dataset; since probeset \(i\) is used as the response of dataset \(i\), its coefficient slot is skipped with the [-i] indexing.
coef_sum=rep(0,length(supp)+1)
coef_lasso_sum=rep(0,length(supp)+1)
names(coef_sum)<-paste("x",1:(length(supp)+1),sep="")
names(coef_lasso_sum)<-paste("x",1:(length(supp)+1),sep="")
error_counter=0
require(lars)
for(i in 1:NDatasets){
  tempdf=data.frame(cbind(Y=get(paste("DATA_exemple4_nb_",i,sep=""))$Y,
                          get(paste("DATA_exemple4_nb_",i,sep=""))$X))
  tempcoef=coef(lm(Y~.-1,data=tempdf))
  lasso.1 <- lars::lars(x=get(paste("DATA_exemple4_nb_",i,sep=""))$X,
                        y=get(paste("DATA_exemple4_nb_",i,sep=""))$Y, type="lasso",
                        trace=FALSE, normalize=FALSE, intercept = FALSE)
  # cv.lars() uses cross-validation to estimate the optimal position in the path
  cv.lasso.1 <- lars::cv.lars(x=get(paste("DATA_exemple4_nb_",i,sep=""))$X,
                              y=get(paste("DATA_exemple4_nb_",i,sep=""))$Y,
                              plot.it=FALSE, normalize=FALSE, intercept = FALSE, type="lasso")
  # Use the "+1SE rule": take the minimal CV error plus its standard error ("limit"),
  # then pick the smallest model whose CV error lies within this limit (at "s.cv.1")
  limit <- min(cv.lasso.1$cv) + cv.lasso.1$cv.error[which.min(cv.lasso.1$cv)]
  s.cv.1 <- cv.lasso.1$index[min(which(cv.lasso.1$cv < limit))]
  # accumulate the absolute coefficients, skipping the slot of the left-out probeset i
  coef_lasso_sum[-i]=coef_lasso_sum[-i]+abs(coef(lasso.1, s=s.cv.1, mode="fraction"))
  if(is.null(tempcoef)){
    cat("Error in lm fit, skip coefficients\n")
    error_counter=error_counter+1
  } else{
    coef_sum[-i]=coef_sum[-i]+abs(tempcoef)
  }
}
error_counter
#> [1] 0
coef_mean=coef_sum/NDatasets
coef_lasso_mean=coef_lasso_sum/NDatasets
With both regular least squares and lasso estimators all fits were successful, yet with only \(N=18\) observations at most 18 of the 100 coefficients can be estimated by regular least squares; the remaining ones are returned as NA. Then we display and plot the mean coefficient values for the least squares estimates.
head(coef_mean, 40)
#> x1 x2 x3 x4 x5 x6
#> 14349.4269 8154.5511 14420.0396 5840.7360 15688.8442 12563.4724
#> x7 x8 x9 x10 x11 x12
#> 603.4098 8310.8567 13063.4258 9127.8669 6589.8195 6108.8444
#> x13 x14 x15 x16 x17 x18
#> 17045.1586 1160.3101 13911.5637 1626.6271 7104.9765 22361.0928
#> x19 x20 x21 x22 x23 x24
#> NA NA NA NA NA NA
#> x25 x26 x27 x28 x29 x30
#> NA NA NA NA NA NA
#> x31 x32 x33 x34 x35 x36
#> NA NA NA NA NA NA
#> x37 x38 x39 x40
#> NA NA NA NA
barplot(coef_mean)
abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue")
head(coef_lasso_mean, 40)
#> x1 x2 x3 x4 x5 x6
#> 29.0504131 5.2642895 20.5041967 22.5116563 0.0000000 5.2894815
#> x7 x8 x9 x10 x11 x12
#> 71.0918236 2.2726129 3.9801932 4.7184860 15.9162259 1.5542485
#> x13 x14 x15 x16 x17 x18
#> 0.1028558 49.3976044 0.0000000 59.8909719 0.5831222 5.8827624
#> x19 x20 x21 x22 x23 x24
#> 6.9380044 3.6694840 0.0000000 11.0675899 0.2983584 16.7356084
#> x25 x26 x27 x28 x29 x30
#> 8.1478577 0.6178652 49.3027867 0.6186457 35.3589748 1.2841470
#> x31 x32 x33 x34 x35 x36
#> 164.0056926 39.4302759 0.0000000 1.7432612 35.4478409 8.4377981
#> x37 x38 x39 x40
#> 138.3060762 12.1775167 5.5399093 0.0000000
barplot(coef_lasso_mean)
Some probesets seem to be explanatory for many other ones (i.e. hubs). What are the confidence indices for those variables?
We want to create \(NDatasets=200\) datasets with \(\textrm{length}(group)=500\) variables and \(N=25\) observations. In this example we want a single group:
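As for example 2, the setup code is not reproduced here. A sketch consistent with the outputs below (the intra-group correlation, the support of size 5 and the minB, maxB and stn values are inferred from those outputs or assumed):
set.seed(3141)
NDatasets <- 200
N <- 25
group <- rep(1, 500)              # a single group containing all 500 variables
cor_group <- rep(0.5, 1)          # inferred from the mean correlation matrix below
supp <- c(1,1,1,1,1, rep(0, 495)) # support of size 5, inferred from the mean coefficients below
minB <- 1                         # assumed, as in the first example
maxB <- 2                         # assumed, as in the first example
stn <- 50                         # assumed, as in the first example
CM <- simulation_cor(group, cor_group)
for(i in 1:NDatasets){
  X <- simulation_X(N, CM)
  assign(paste("DATA_exemple5_nb_", i, sep=""), simulation_DATA(X, supp, minB, maxB, stn))
}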
We now check the correlation structure of the explanatory variables. First we compute the mean correlation matrix.
corr_sum=matrix(0,length(group),length(group))
for(i in 1:NDatasets){
  corr_sum=corr_sum+cor(get(paste("DATA_exemple5_nb_",i,sep=""))$X)
}
corr_mean=corr_sum/NDatasets
Then we display and plot the mean correlation matrix.
corr_mean[1:10,1:10]
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 1.0000000 0.4684444 0.4944331 0.4628645 0.4961487 0.4956996
#> [2,] 0.4684444 1.0000000 0.5102506 0.5026134 0.4960774 0.4948349
#> [3,] 0.4944331 0.5102506 1.0000000 0.4853991 0.4907725 0.5089649
#> [4,] 0.4628645 0.5026134 0.4853991 1.0000000 0.4670240 0.4837113
#> [5,] 0.4961487 0.4960774 0.4907725 0.4670240 1.0000000 0.4932811
#> [6,] 0.4956996 0.4948349 0.5089649 0.4837113 0.4932811 1.0000000
#> [7,] 0.4654665 0.4813972 0.4925889 0.4948130 0.4723250 0.4759705
#> [8,] 0.4890355 0.5001703 0.4838451 0.4782784 0.4867136 0.4750897
#> [9,] 0.4815475 0.4979561 0.4917091 0.4906115 0.4726156 0.4944805
#> [10,] 0.4834349 0.4830943 0.4945958 0.4778387 0.4763692 0.5051952
#> [,7] [,8] [,9] [,10]
#> [1,] 0.4654665 0.4890355 0.4815475 0.4834349
#> [2,] 0.4813972 0.5001703 0.4979561 0.4830943
#> [3,] 0.4925889 0.4838451 0.4917091 0.4945958
#> [4,] 0.4948130 0.4782784 0.4906115 0.4778387
#> [5,] 0.4723250 0.4867136 0.4726156 0.4763692
#> [6,] 0.4759705 0.4750897 0.4944805 0.5051952
#> [7,] 1.0000000 0.4680615 0.4906298 0.4770014
#> [8,] 0.4680615 1.0000000 0.4959996 0.4743232
#> [9,] 0.4906298 0.4959996 1.0000000 0.4932802
#> [10,] 0.4770014 0.4743232 0.4932802 1.0000000
plot(abs(corr_mean))
As in example 2, we fit a linear model and a cross-validated lasso to each dataset.
coef_sum=rep(0,length(group))
coef_lasso_sum=rep(0,length(group))
names(coef_sum)<-paste("x",1:length(group),sep="")
names(coef_lasso_sum)<-paste("x",1:length(group),sep="")
error_counter=0
require(lars)
for(i in 1:NDatasets){
  tempdf=data.frame(cbind(Y=get(paste("DATA_exemple5_nb_",i,sep=""))$Y,
                          get(paste("DATA_exemple5_nb_",i,sep=""))$X))
  tempcoef=coef(lm(Y~.-1,data=tempdf))
  lasso.1 <- lars::lars(x=get(paste("DATA_exemple5_nb_",i,sep=""))$X,
                        y=get(paste("DATA_exemple5_nb_",i,sep=""))$Y, type="lasso",
                        trace=FALSE, normalize=FALSE, intercept = FALSE)
  # cv.lars() uses cross-validation to estimate the optimal position in the path
  cv.lasso.1 <- lars::cv.lars(x=get(paste("DATA_exemple5_nb_",i,sep=""))$X,
                              y=get(paste("DATA_exemple5_nb_",i,sep=""))$Y,
                              plot.it=FALSE, type="lasso")
  # Use the "+1SE rule": take the minimal CV error plus its standard error ("limit"),
  # then pick the smallest model whose CV error lies within this limit (at "s.cv.1")
  limit <- min(cv.lasso.1$cv) + cv.lasso.1$cv.error[which.min(cv.lasso.1$cv)]
  s.cv.1 <- cv.lasso.1$index[min(which(cv.lasso.1$cv < limit))]
  # accumulate the absolute values of the lasso coefficients at the optimal s
  coef_lasso_sum=coef_lasso_sum+abs(coef(lasso.1, s=s.cv.1, mode="fraction"))
  if(is.null(tempcoef)){
    cat("Error in lm fit, skip coefficients\n")
    error_counter=error_counter+1
  } else{
    coef_sum=coef_sum+abs(tempcoef)
  }
}
error_counter
#> [1] 0
coef_mean=coef_sum/NDatasets
coef_lasso_mean=coef_lasso_sum/NDatasets
With both regular least squares and lasso estimators all fits were successful, yet with only \(N=25\) observations at most 25 of the 500 coefficients can be estimated by regular least squares; the remaining ones are returned as NA. Then we display and plot the mean coefficient values for the least squares estimates.
head(coef_mean, 40)
#> x1 x2 x3 x4 x5
#> 1.503770e+00 1.535116e+00 1.484148e+00 1.509255e+00 1.514028e+00
#> x6 x7 x8 x9 x10
#> 3.173485e-15 3.037102e-15 2.889551e-15 3.154865e-15 7.183429e-15
#> x11 x12 x13 x14 x15
#> 3.394267e-15 2.841336e-15 3.716311e-15 3.182421e-15 3.427576e-15
#> x16 x17 x18 x19 x20
#> 3.919400e-15 4.673144e-15 3.955783e-15 3.303872e-15 5.303571e-15
#> x21 x22 x23 x24 x25
#> 5.550495e-15 3.081008e-15 2.254520e-15 3.381582e-15 3.730642e-15
#> x26 x27 x28 x29 x30
#> NA NA NA NA NA
#> x31 x32 x33 x34 x35
#> NA NA NA NA NA
#> x36 x37 x38 x39 x40
#> NA NA NA NA NA
barplot(coef_mean)
abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue")
head(coef_lasso_mean, 40)
#> x1 x2 x3 x4 x5
#> 3.163481e-01 2.992011e-01 2.460792e-01 2.619038e-01 3.138343e-01
#> x6 x7 x8 x9 x10
#> 1.633413e-03 8.285933e-03 2.664676e-03 1.286248e-03 2.328200e-03
#> x11 x12 x13 x14 x15
#> 8.799246e-04 8.072028e-03 4.276199e-03 6.158855e-03 2.491027e-03
#> x16 x17 x18 x19 x20
#> 0.000000e+00 6.908800e-04 1.025494e-03 5.153866e-04 1.810606e-04
#> x21 x22 x23 x24 x25
#> 2.887157e-03 1.164730e-02 3.420346e-03 0.000000e+00 0.000000e+00
#> x26 x27 x28 x29 x30
#> 1.058844e-02 2.764924e-03 0.000000e+00 0.000000e+00 0.000000e+00
#> x31 x32 x33 x34 x35
#> 2.711486e-03 3.698186e-03 3.852792e-05 3.575717e-03 0.000000e+00
#> x36 x37 x38 x39 x40
#> 1.036557e-02 1.824428e-03 6.567690e-03 1.557148e-03 8.436719e-03
barplot(coef_lasso_mean,ylim=c(0,1.5))
abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue")
The simulation process looks successful: the lasso estimates mostly retrieve the correct variables, yet the other ones are also sometimes selected. What are the confidence indices for those variables?