VIF vs. Mutual Information - data-science

I was searching for the best ways to do feature selection in a regression problem and came across a post suggesting mutual information for regression, so I tried it on the Boston housing dataset. The results were as follows:
# feature selection
from sklearn.feature_selection import SelectKBest, mutual_info_regression

f_selector = SelectKBest(score_func=mutual_info_regression, k='all')
# learn the relationship from the training data
f_selector.fit(X_train, y_train)
# transform the train input data
X_train_fs = f_selector.transform(X_train)
# transform the test input data
X_test_fs = f_selector.transform(X_test)
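For completeness, a minimal sketch of how the score table below can be assembled from the fitted selector (this step isn't shown in the original post, and it assumes X_train is a pandas DataFrame):

import pandas as pd

# pair each column name with its mutual information score and sort
scores_df = pd.DataFrame({'Features': X_train.columns,
                          'Scores': f_selector.scores_})
scores_df = scores_df.sort_values('Scores', ascending=False)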
The scores were as follows:
Features Scores
12 LSTAT 0.651934
5 RM 0.591762
2 INDUS 0.532980
10 PTRATIO 0.490199
4 NOX 0.444421
9 TAX 0.362777
0 CRIM 0.335882
6 AGE 0.334989
7 DIS 0.308023
8 RAD 0.206662
1 ZN 0.197742
11 B 0.172348
3 CHAS 0.027097
Out of curiosity, I also mapped the VIF alongside the scores, and I see that many of the features/variables with high scores also have a very high VIF.
Features Scores VIF_Factor
12 LSTAT 0.651934 11.102025
5 RM 0.591762 77.948283
2 INDUS 0.532980 14.485758
10 PTRATIO 0.490199 85.029547
4 NOX 0.444421 73.894947
9 TAX 0.362777 61.227274
0 CRIM 0.335882 2.100373
6 AGE 0.334989 21.386850
7 DIS 0.308023 14.699652
8 RAD 0.206662 15.167725
1 ZN 0.197742 2.844013
11 B 0.172348 20.104943
3 CHAS 0.027097 1.152952
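For reference, a minimal sketch of how a VIF column like the one above can be computed with statsmodels (the original code for this step is not shown, so the exact approach is an assumption):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each column: how well that column is explained by the others
vif_df = pd.DataFrame({
    'Features': X_train.columns,
    'VIF_Factor': [variance_inflation_factor(X_train.values, i)
                   for i in range(X_train.shape[1])]
})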
Could you please help me understand how to select the best features from this list?
Thanks in advance!

Related

Python altair - facet line plot with multiple variables

I have the following kind of DataFrame:
Marque Annee Modele PVFP PM
0 A 1 Python 70783.066836 2.067821e+07
1 A 2 Python 75504.270716 1.957717e+07
2 A 3 Python 66383.237169 1.848982e+07
3 A 4 Python 61966.851675 1.755261e+07
4 A 5 Python 54516.367597 1.671907e+07
5 A 1 Sol 66400.686091 2.067821e+07
6 A 2 Sol 74953.770294 1.955218e+07
7 A 3 Sol 66500.916446 1.844078e+07
8 A 4 Sol 62016.941237 1.748098e+07
9 A 5 Sol 54356.008414 1.662684e+07
10 B 1 Python 43152.461787 1.340989e+07
11 B 2 Python 62397.794144 1.494418e+07
12 B 3 Python 1871.135251 2.178552e+06
I tried to build a faceted graph, but without really succeeding. I am only able to concatenate the two generated charts vertically. I would be grateful for any idea on how to do it properly in one operation.
My current code:
import altair as alt

chart = alt.Chart(euro).mark_line().encode(
    x='Annee',
    y='PVFP',
    color='Modele'
).properties(
    width=150,
    height=150
).facet(
    facet='Marque',
    columns=3
)

chart2 = alt.Chart(euro).mark_line().encode(
    x='Annee',
    y='PM',
    color='Modele'
).properties(
    width=150,
    height=150
).facet(
    facet='Marque',
    columns=3
)

chart & chart2
One good way to do this is to use a Fold Transform to fold your two columns into one, and then you can use row and column facets to facet by both variables at once. For example:
alt.Chart(euro).transform_fold(
    ['PVFP', 'PM'], as_=['key', 'value']
).mark_line().encode(
    x='Annee:Q',
    y='value:Q',
    color='Modele:N'
).properties(
    width=150,
    height=150
).facet(
    column='Marque:N',
    row='key:N'
)
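One additional note (an addition, not part of the original answer): since PVFP (around 1e4) and PM (around 1e7) live on very different scales, it can help to give each facet row its own y-axis with resolve_scale:

# same chart as above, but with a separate y-axis scale per row
alt.Chart(euro).transform_fold(
    ['PVFP', 'PM'], as_=['key', 'value']
).mark_line().encode(
    x='Annee:Q',
    y='value:Q',
    color='Modele:N'
).facet(
    column='Marque:N',
    row='key:N'
).resolve_scale(y='independent')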

pandas create Cross-Validation based on specific columns

I have a dataframe of a few hundred rows that can be grouped by IDs as follows:
df = Val1 Val2 Val3 Id
2 2 8 b
1 2 3 a
5 7 8 z
5 1 4 a
0 9 0 c
3 1 3 b
2 7 5 z
7 2 8 c
6 5 5 d
...
5 1 8 a
4 9 0 z
1 8 2 z
I want to use GridSearchCV, but with a custom CV that will ensure that all the rows from the same ID always end up in the same set.
So either all the rows of ID a are in the test set, or all of them are in the train set, and likewise for all the other IDs.
I want to have 5 folds, so 80% of the IDs will go to the train set and 20% to the test set.
I understand that it can't guarantee that all folds will have exactly the same number of rows, since one ID might have more rows than another.
What is the best way to do so?
As stated, you can provide cv with an iterator. You can use GroupShuffleSplit(). For example, once you use it to split your dataset, you can pass the result to GridSearchCV() via the cv parameter.
As mentioned in the sklearn documentation, there is a parameter called cv where you can provide "An iterable yielding (train, test) splits as arrays of indices."
It is worth checking the documentation first.
As mentioned previously, GroupShuffleSplit() splits data based on group labels. However, the test sets aren't necessarily disjoint (i.e. across multiple splits, an ID may appear in multiple test sets). If you want each ID to appear in exactly one test fold, you can use GroupKFold(). This is also available in sklearn.model_selection, and directly extends KFold to take group labels into account. A minimal sketch of this setup follows.
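Here is a minimal sketch (the estimator, parameter grid, and choice of target column are placeholders, not from the question):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, GroupKFold

# placeholder feature/target split; the question doesn't name a target
X = df[['Val1', 'Val2']]
y = df['Val3']
groups = df['Id']

cv = GroupKFold(n_splits=5)  # every ID appears in exactly one test fold
gs = GridSearchCV(RandomForestRegressor(),
                  param_grid={'n_estimators': [50, 100]},
                  cv=cv)
gs.fit(X, y, groups=groups)  # groups is forwarded to the splitter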

Splitting data frame in to test and train data sets

Use pandas to create two data frames: train_df and test_df, where
train_df has 80% of the data chosen uniformly at random without
replacement.
Here, what does "data chosen uniformly at random without replacement" mean?
Also, how can I do it?
Thanks
"chosen uniformly at random" means that each row has an equal probability of being selected into the 80%
"without replacement" means that each row is only considered once. Once it is assigned to a training or test set it is not
For example, consider the data below:
A B
0 5
1 6
2 7
3 8
4 9
If this dataset is being split into an 80% training set and 20% test set, then we will end up with a training set of 4 rows (80% of the data) and a test set of 1 row (20% of the data)
Without Replacement
Assume the first row is assigned to the training set. Now the training set is:
A B
0 5
When the next row is assigned to training or test, it will be selected from the remaining rows:
A B
1 6
2 7
3 8
4 9
With Replacement
Assume the first row is assigned to the training set. Now the training set is:
A B
0 5
But the next row will be assigned using the entire dataset (i.e. the first row has been placed back in the original dataset):
A B
0 5
1 6
2 7
3 8
4 9
How you can do this:
You can use the train_test_split function from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
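For example, a minimal sketch (assuming df is your full DataFrame; the random_state is only for reproducibility):

from sklearn.model_selection import train_test_split

# 80/20 split, chosen uniformly at random without replacement
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)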
Or you could do this using pandas and NumPy:

import numpy as np

# uniform values in [0, 1): each row independently has an 80% chance
# of landing in the training set (an approximate, not exact, 80/20 split)
df['random_number'] = np.random.rand(len(df))
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]
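Another pandas-only option (an addition, not part of the original answer): DataFrame.sample draws uniformly at random without replacement by default, which gives an exact 80/20 row split:

train_df = df.sample(frac=0.8, random_state=42)  # sample 80% without replacement
test_df = df.drop(train_df.index)                # the remaining 20%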

Score bic may be used with discrete data only

I have a data frame with all columns in discrete format. I apply the following code to generate a Bayesian network using the bnlearn package. However, I get an error saying "score 'bic' may be used with discrete data only", even though my data are essentially discrete! Here is a sample of my data:
A B C
3 2 0
0 0 5
5 1 7
0 0 2
4 6 1
And this is what I run:
> test=hc(dat, score="bic")
Error in check.score(score, x) :
score 'bic' may be used with discrete data only.
I don't get why my data are not seen as discrete.

Coefficient and confidence interval of lasso selection

I conducted feature selection using the lasso method, as well as a covariance test using covTest::covTest to retrieve the p-values. I borrow an example from covTest:
require(lars)
require(covTest)
set.seed(1234)
x=matrix(rnorm(100*10),ncol=10)
x=scale(x,TRUE,TRUE)/sqrt(99)
beta=c(4,rep(0,9))
y=x%*%beta+.4*rnorm(100)
a=lars(x,y)
covTest(a,x,y)
$results
Predictor_Number Drop_in_covariance P-value
1 105.7307 0.0000
6 0.9377 0.3953
10 0.2270 0.7974
3 0.0689 0.9334
7 0.1144 0.8921
2 0.0509 0.9504
9 0.0508 0.9505
8 0.0006 0.9994
4 0.1190 0.8880
5 0.0013 0.9987
$sigma
[1] 0.3705
$null.dist
[1] "F(2,90)
The covTest results show the p-values of the top-hit features. My question is how to retrieve the coefficients of these features, such as that of predictor 1, as well as their standard errors and 95% CIs. I'd like to compare these estimates with their counterparts from glm.