Coefficient and confidence interval of lasso selection

I performed feature selection with the lasso, together with a covariance test using covTest::covTest to obtain the p-values. I borrow an example from covTest:
require(lars)
require(covTest)
set.seed(1234)
x=matrix(rnorm(100*10),ncol=10)
x=scale(x,TRUE,TRUE)/sqrt(99)
beta=c(4,rep(0,9))
y=x%*%beta+.4*rnorm(100)
a=lars(x,y)
covTest(a,x,y)
$results
Predictor_Number Drop_in_covariance P-value
1 105.7307 0.0000
6 0.9377 0.3953
10 0.2270 0.7974
3 0.0689 0.9334
7 0.1144 0.8921
2 0.0509 0.9504
9 0.0508 0.9505
8 0.0006 0.9994
4 0.1190 0.8880
5 0.0013 0.9987
$sigma
[1] 0.3705
$null.dist
[1] "F(2,90)
The covTest results show the p-values of the top-hit features. My question is how to retrieve the coefficients of these features, such as that of predictor 1, as well as its std. error and 95% CI. I'd like to compare these estimates with their counterparts from glm.
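covTest itself only reports the drop in covariance and the p-values; the coefficients live in the lars fit. The thread is in R, but purely to illustrate the idea, here is a Python/scikit-learn sketch of one common (if imperfect) way to get lasso coefficients plus a bootstrapped std. error and percentile 95% CI. The alpha value and the bootstrap are my assumptions, not part of the original example, and naive bootstrap intervals after lasso selection are exactly the kind of post-selection inference covTest was designed to replace, so treat this as a rough comparison only:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1234)
n, p = 100, 10
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0) / np.sqrt(n - 1)  # mimic scale(x, TRUE, TRUE)/sqrt(99)
beta = np.zeros(p)
beta[0] = 4.0
y = X @ beta + 0.4 * rng.standard_normal(n)

model = Lasso(alpha=0.01).fit(X, y)  # alpha is a hypothetical choice; lars traces the whole path
print(model.coef_)                   # predictor 1's coefficient is model.coef_[0]

# naive bootstrap for a std. error and percentile 95% CI of predictor 1
B = 1000
boot = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = Lasso(alpha=0.01).fit(X[idx], y[idx]).coef_
print("std.err:", boot[:, 0].std())
print("95% CI:", np.percentile(boot[:, 0], [2.5, 97.5]))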

Efficient element-wise vector times matrix multiplication in MKL

I have a vector
[2 3 4]
That I need to multiply with a matrix
1 1 1
2 2 2
3 3 3
to get
2 3 4
4 6 8
6 9 12
Now, I can make the vector into a matrix and do an element-wise multiplication, but is there also an efficient way to do this in MKL / CBLAS?
Yes, there is a function in oneMKL called cblas_?gemv which computes the product of a matrix and a vector.
You can refer to the link below for more details on the usage of the function:
https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/blas-and-sparse-blas-routines/blas-routines/blas-level-2-routines/cblas-gemv.html
If you have installed oneMKL on your system, you can also take a look at the bundled examples, which help you better understand the usage of the functions available in the library.
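Note that gemv computes the full matrix-vector product y = A*x, whereas the operation in the question is a column-wise scaling, equivalent to M * diag(v). For reference, here is a small numpy sketch of the result any MKL/CBLAS call would have to reproduce (the data are the question's toy example):

import numpy as np

v = np.array([2.0, 3.0, 4.0])
M = np.array([[1.0, 1.0, 1.0],
              [2.0, 2.0, 2.0],
              [3.0, 3.0, 3.0]])

# element-wise column scaling: result[i, j] = M[i, j] * v[j]
result = M * v  # broadcasting; equivalent to M @ np.diag(v)
print(result)
# [[ 2.  3.  4.]
#  [ 4.  6.  8.]
#  [ 6.  9. 12.]]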

Chi-square test gives wrong result. Should I reject the proposed distribution?

I want to fit a Poisson distribution to my data points and decide, based on a chi-square test, whether I should accept or reject this proposed distribution. I only used 10 observations. Here is my code:
#Fitting function:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
from scipy.stats import chisquare

def Poisson_fit(x, a):
    return a * np.exp(-x)

#Code (x holds the raw data points, defined elsewhere)
hist, bins = np.histogram(x, bins=10, density=True)
print("hist: ", hist)
#hist: [5.62657158e-01, 5.14254073e-01, 2.03161280e-01, 5.84898068e-02,
# 1.35995217e-02, 2.67094169e-03, 4.39345778e-04, 6.59603327e-05,
# 1.01518320e-05, 1.06301906e-06]
XX = np.arange(len(hist))
print("XX: ", XX)
#XX: [0 1 2 3 4 5 6 7 8 9]
plt.scatter(XX, hist, marker='.', color='red')
popt, pcov = optimize.curve_fit(Poisson_fit, XX, hist)
x_data = np.linspace(0, 9, 100)  # fine grid for a smooth fitted curve
plt.plot(x_data, Poisson_fit(x_data, *popt), linestyle='--', color='red',
         label='Fit')
plt.xlabel('s')
plt.ylabel('P(s)')

#Chisquare test:
f_obs = hist
#f_obs: [5.62657158e-01, 5.14254073e-01, 2.03161280e-01, 5.84898068e-02,
# 1.35995217e-02, 2.67094169e-03, 4.39345778e-04, 6.59603327e-05,
# 1.01518320e-05, 1.06301906e-06]
f_exp = Poisson_fit(XX, *popt)
#f_exp: [6.76613820e-01, 2.48912314e-01, 9.15697229e-02, 3.36866185e-02,
# 1.23926144e-02, 4.55898806e-03, 1.67715798e-03, 6.16991940e-04,
# 2.26978650e-04, 8.35007789e-05]
chi, p_value = chisquare(f_obs, f_exp)
print("chi: ", chi)
print("p_value: ", p_value)
#chi: 0.4588956658201067
#p_value: 0.9999789643475111
I am using 10 observations, so the degrees of freedom would be 9. For these degrees of freedom I can't find my p-value and chi-square value in a chi-square distribution table. Is there anything wrong in my code, or are my input values too small so that the test fails? If the p-value is > 0.05, the distribution is accepted. Although the p-value is large (0.999), I can't find the chi-square value 0.4588 in the table, so I think there is something wrong in my code. How do I fix this error?
Is the returned chi value the critical value of the tails? How do I check the proposed hypothesis?
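Two points worth checking (my suggestions, not from the original thread): scipy.stats.chisquare expects frequencies (counts) whose totals match, not normalized densities, and ddof should reflect any parameters estimated from the data; also, a*exp(-x) is an exponential decay rather than a Poisson pmf. A sketch with hypothetical counts:

import numpy as np
from scipy.stats import poisson, chisquare

# hypothetical raw counts per value 0..9 (chisquare wants counts, not densities)
observed = np.array([56, 51, 20, 6, 2, 1, 1, 1, 1, 1])
n = observed.sum()

# ML estimate of the Poisson mean from the binned data
k = np.arange(len(observed))
lam = (k * observed).sum() / n

# expected counts, rescaled so both arrays sum to the same total
expected = poisson.pmf(k, lam) * n
expected *= n / expected.sum()

# ddof=1 because one parameter (lam) was estimated from the data;
# in practice, bins with expected counts < 5 are usually merged first
stat, p = chisquare(observed, expected, ddof=1)
print(stat, p)  # reject the proposed distribution if p < 0.05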

Plotting data from two sets with different shapes in the same plot

I am using data collected from two different instruments, which have different resolutions because of each instrument's sampling rate. For a given time interval, one of the sets has >10k entries while the other has ~2.5k. They capture data over the same time interval, however, and I want to plot them on top of each other even though they have different resolutions. The minimum and maximum x of both sets are the same, but one of them has more entries.
Simplified it could look like this:
1st set from instrument with higher sampling rate:
time(s) value
0.0 10
0.2 11
0.4 12
0.6 13
0.8 14
... ..
100 50
2nd set from instrument with lower sampling rate:
time(s) value
0 100
1 120
2 125
3 128
4 130
. ...
100 430
They are measuring different things, but I would like to display them in the same plot. How can I accomplish this?
I found the mistake: I was trying to plot both datasets using the time data from the first instrument. Of course each set needs to be plotted against its own time data; I had put the first instrument's time data in the second plot by mistake.
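For reference, a minimal matplotlib sketch of the fix, with synthetic stand-in data; the second y-axis via twinx is my assumption, since the two instruments measure different quantities:

import numpy as np
import matplotlib.pyplot as plt

# synthetic stand-ins: each instrument keeps its own time axis
t_fast = np.arange(0, 100, 0.2)   # higher sampling rate
v_fast = 10 + 0.4 * t_fast
t_slow = np.arange(0, 101, 1.0)   # lower sampling rate
v_slow = 100 + 3.3 * t_slow

fig, ax1 = plt.subplots()
ax1.plot(t_fast, v_fast, color='tab:blue')
ax1.set_xlabel('time (s)')
ax1.set_ylabel('instrument 1 value', color='tab:blue')

# separate y-axis because the two instruments measure different things
ax2 = ax1.twinx()
ax2.plot(t_slow, v_slow, color='tab:red')
ax2.set_ylabel('instrument 2 value', color='tab:red')
plt.show()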

Splitting data frame in to test and train data sets

Use pandas to create two data frames: train_df and test_df, where
train_df has 80% of the data chosen uniformly at random without
replacement.
Here, what does "data chosen uniformly at random without replacement" mean?
Also, how can I do it?
Thanks
"chosen uniformly at random" means that each row has an equal probability of being selected into the 80%
"without replacement" means that each row is only considered once. Once it is assigned to a training or test set it is not
For example, consider the data below:
A B
0 5
1 6
2 7
3 8
4 9
If this dataset is being split into an 80% training set and 20% test set, then we will end up with a training set of 4 rows (80% of the data) and a test set of 1 row (20% of the data).
Without Replacement
Assume the first row is assigned to the training set. Now the training set is:
A B
0 5
When the next row is assigned to training or test, it will be selected from the remaining rows:
A B
1 6
2 7
3 8
4 9
With Replacement
Assume the first row is assigned to the training set. Now the training set is:
A B
0 5
But the next row will be assigned using the entire dataset (i.e. the first row has been placed back in the original dataset):
A B
0 5
1 6
2 7
3 8
4 9
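In pandas, "uniformly at random without replacement" maps directly onto DataFrame.sample, which draws without replacement by default. A small sketch on the example frame above (random_state is just for reproducibility):

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9]})

# sample 80% of the rows uniformly at random, without replacement
train_df = df.sample(frac=0.8, random_state=0)
test_df = df.drop(train_df.index)  # the remaining 20%
print(train_df)
print(test_df)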
How you can do this:
You can use the train_test_split function from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Or you could do this using pandas and NumPy:
df['random_number'] = np.random.rand(len(df))  # uniform draws in [0, 1); randn would be normal, not uniform
train = df[df['random_number'] <= 0.8]  # ~80% of rows, in expectation
test = df[df['random_number'] > 0.8]
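For completeness, the scikit-learn route from the link above might look like this on the same toy frame:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9]})

# exact 80/20 split, chosen uniformly at random without replacement
train_df, test_df = train_test_split(df, train_size=0.8, random_state=0)
print(len(train_df), len(test_df))  # 4 1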

What's the difference between summation and concatenation in neural networks like CNNs?

What's the difference between summation and concatenation in neural networks like CNNs?
For example, GoogLeNet's Inception module uses concatenation, while ResNet's residual learning uses summation.
Please explain.
Concatenation joins two blobs, so after the concat we have a bigger blob that contains the previous blobs in contiguous memory. For example:
blob1:
1
2
3
blob2:
4
5
6
blob_res:
1
2
3
4
5
6
Summation means element-wise summation: blob1 and blob2 must have exactly the same shape, and the resulting blob has that same shape, with elements a1+b1, a2+b2, ..., ai+bi, ..., an+bn.
For the example above,
blob_res:
(1+4) 5
(2+5) 7
(3+6) 9