Best Subset regression on test sample (after k-fold) - testing

I am running a best subset regression analysis in RStudio, using the following libraries:
library(foreign)
library(glmnet)
library(caTools)
library(leaps)
library(ISLR)
library(knitr)
library(ggvis)
I want to split my sample into three samples: training, cross-validation, and test (maybe 50%, 30%, 20%).
I have successfully run best subset selection on my training sample and cross-validated those results with the following script:
k = 10
set.seed(1)
folds = sample(1:k,nrow(best_demo_train),replace=TRUE)
table(folds)
cv.errors=matrix(NA,k,5, dimnames=list(NULL, paste(1:5)))
# regsubsets has no predict() method, so define one:
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}
for (j in 1:k) {
  best.fit = regsubsets(selfaware ~ ., data = best_demo_train[folds != j, ])
  for (i in 1:5) {
    pred = predict(best.fit, best_demo_train[folds == j, ], id = i)
    cv.errors[j, i] = mean((best_demo_train$selfaware[folds == j] - pred)^2)
  }
}
mean.cv.errors=apply(cv.errors,2,mean)
mean.cv.errors
which.min(mean.cv.errors)
par(mfrow=c(1,1))
plot(mean.cv.errors,type='b')
points(which.min(mean.cv.errors),mean.cv.errors[which.min(mean.cv.errors)],
col="red",cex=2,pch=20)
reg.best = regsubsets(selfaware ~ ., data = best_demo_train)
coef(reg.best, 3)
reg.summary = summary(reg.best)  # summary() takes no model-size argument
reg.summary$adjr2
So, once I have the "best" variables, I would like to "test" that model on 20% of the data. Can someone help me out with this? I do not know what the script would be to test this model and have been unsuccessful searching online.
Thank you, I appreciate it.
Sarah
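A minimal sketch of that test step, reusing the predict.regsubsets() helper defined in the question; it assumes the 20% hold-out sits in a data frame called best_demo_test and that cross-validation selected the 3-variable model:
# Fit on the full training sample, then evaluate once on the hold-out
reg.best = regsubsets(selfaware ~ ., data = best_demo_train)
test.pred = predict(reg.best, best_demo_test, id = 3)  # id = chosen model size
test.mse = mean((best_demo_test$selfaware - test.pred)^2)
test.mse
Because best_demo_test played no part in selecting the model, this single evaluation gives an honest estimate of the out-of-sample error.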

Related

How to use the 'sphereize data' option with PCA in TensorFlow

I have used PCA with the 'Sphereize data' option on the following page successfully: https://projector.tensorflow.org/
I wonder how to run the same computation locally using the TensorFlow API. I found the PCA documentation in the API, but I am not sure whether sphereizing the data is also available somewhere in the API.
The "sphereize data" option normalizes the data by shifting each point by the centroid and making unit norm.
Here is the code used in Tensorboard (in typescript):
normalize() {
  // Compute the centroid of all data points.
  let centroid = vector.centroid(this.points, (a) => a.vector);
  if (centroid == null) {
    throw Error('centroid should not be null');
  }
  // Shift all points by the centroid and make them unit norm.
  for (let id = 0; id < this.points.length; ++id) {
    let dataPoint = this.points[id];
    dataPoint.vector = vector.sub(dataPoint.vector, centroid);
    if (vector.norm2(dataPoint.vector) > 0) {
      // If we take the unit norm of a vector of all 0s, we get a vector of
      // all NaNs. We prevent that with a guard.
      vector.unit(dataPoint.vector);
    }
  }
}
You can reproduce that normalization using the following Python function:
def sphereize_data(x):
    """x is a 2D Tensor of shape (num_vectors, dim_vectors)."""
    centroid = tf.reduce_mean(x, axis=0, keepdims=True)
    # Norm along axis=1 so each point (row) is scaled to unit length;
    # div_no_nan guards against all-zero rows, like the TypeScript code.
    return tf.math.div_no_nan(x - centroid, tf.norm(x - centroid, axis=1, keepdims=True))
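A quick sanity check of that function (a sketch; the shapes are illustrative, not from the question):
import tensorflow as tf

x = tf.random.normal((100, 8))     # 100 points in 8 dimensions
x_sph = sphereize_data(x)
print(tf.norm(x_sph, axis=1)[:5])  # each row should now have norm ~1.0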

LSTM from scratch in tensorflow 2

I'm trying to build an LSTM in TensorFlow 2.1 from scratch, without using the one already supplied with Keras (tf.keras.layers.LSTM), just to learn and code something. To do so, I've defined a class "Model" that, when called (as in model(input)), computes the matrix multiplications of the LSTM. I'm pasting part of my code here; the other parts are on GitHub (link)
class Model(object):
    [...]
    def __call__(self, inputs):
        assert inputs.shape == (vocab_size, T_steps)
        outputs = []
        for time_step in range(T_steps):
            x = inputs[:, time_step]
            x = tf.expand_dims(x, axis=1)
            z = tf.concat([self.h_prev, x], axis=0)
            f = tf.matmul(self.W_f, z) + self.b_f
            f = tf.sigmoid(f)
            i = tf.matmul(self.W_i, z) + self.b_i
            i = tf.sigmoid(i)
            o = tf.matmul(self.W_o, z) + self.b_o
            o = tf.sigmoid(o)
            C_bar = tf.matmul(self.W_C, z) + self.b_C
            C_bar = tf.tanh(C_bar)
            C = (f * self.C_prev) + (i * C_bar)
            h = o * tf.tanh(C)
            v = tf.matmul(self.W_v, h) + self.b_v
            v = tf.sigmoid(v)
            y = tf.math.softmax(v, axis=0)
            self.h_prev = h
            self.C_prev = C
            outputs.append(y)
        outputs = tf.squeeze(tf.stack(outputs, axis=1))
        return outputs
But this neural network has three problems:
1) It is very slow during training. In comparison, a model that uses tf.keras.layers.LSTM() trains more than 10 times faster. Why is this? Maybe because I didn't use minibatch training, but stochastic training?
2) The NN seems not to learn anything at all. After just a few (very few!) training examples, the loss settles and won't decrease any more; it just oscillates around the reached value. After training, I tested the NN by making it generate some text, but it just outputs nonsensical gibberish. Why isn't it learning anything?
3) The loss function outputs very high values. I've coded a categorical cross-entropy loss function, but with a 100-character sequence the value of the function is over 370 per training example. Shouldn't it be much lower than this?
I wrote the loss function like this:
def compute_loss(predictions, desired_outputs):
    l = 0
    for i in range(T_steps):
        l -= tf.math.log(predictions[desired_outputs[i], i])
    return l
I know these are open questions, but unfortunately I can't make it work. So any answer, even a short one that helps me solve the problem myself, is fine :)
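On question 3, a back-of-the-envelope check (a sketch; the vocabulary size of 65 is an assumed example, not taken from the post): an untrained network predicts roughly uniformly, so the summed cross-entropy over a sequence is about T_steps * ln(vocab_size).
import numpy as np

# Hypothetical numbers: 100 time steps, 65-character vocabulary
vocab_size, T_steps = 65, 100
chance_loss = T_steps * np.log(vocab_size)
print(chance_loss)  # ~417 nats, i.e. ~4.2 per character
A summed loss around 370 for a 100-character sequence is therefore close to chance level per character, which is consistent with problem 2: the network is barely learning; the loss is not necessarily miscomputed.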

How to select multiple columns from DataFrame efficiently?

I'm writing something like Q-Learning on a DataFrame which consists of transitions. The data has the following columns:
date, begin_time, begin_grid, end_time, end_grid, reward, [feature_columns]
I'm using the code below to do minibatch training:
for i in xrange(num_iter):
    idx = np.random.choice(N, batch_size)
    now_data = train_data.loc[idx]
    predicted_value = []
    for index, row in now_data.iterrows():
        action_state = row[["date", "end_time", "end_grid"]]
        # Can the next line be quicker?
        end_state_frame = train_data[(train_data["date"] == action_state["date"]) &
                                     (train_data["begin_grid"] == action_state["end_grid"]) &
                                     (train_data["begin_time"] == action_state["end_time"])]
        if len(end_state_frame) == 0:
            predicted_value.append(0.0)
        else:
            end_pred_values = model.predict(end_state_frame[feature_columns]).flatten()
            predicted_value.append(np.mean(end_pred_values))
In this code, I'm looking up "date", "begin_time" and "begin_grid" repeatedly. This code is currently too slow to actually train the model. I'm wondering if I can do something to speed up the process (perhaps by setting an index or grouping)?
Thank you!
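One possible speed-up (a hedged sketch, not a tested solution): build the lookup key once as a pandas MultiIndex, so each transition's successor rows come from an indexed .loc lookup instead of three boolean scans over the whole frame. Column names follow the question; model and feature_columns are assumed to exist as in the original code.
import numpy as np

# Index once, outside the training loop; sorting makes lookups fast
keyed = train_data.set_index(["date", "begin_time", "begin_grid"]).sort_index()

def predicted_value_for(row):
    key = (row["date"], row["end_time"], row["end_grid"])
    if key not in keyed.index:
        return 0.0
    end_state_frame = keyed.loc[[key]]  # all rows starting at this state
    return float(np.mean(model.predict(end_state_frame[feature_columns]).flatten()))
Inside the minibatch loop this becomes predicted_value = [predicted_value_for(row) for _, row in now_data.iterrows()]. Grouping the batch by the same three columns and calling model.predict once per group would amortize the prediction cost further.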

How to get the importance of a class using random forest?

I am using the randomForest package on my dataset to do a classification, but with the importance command I only get the importance of variables. What if I want the variable importance by specific categories of a variable? For example, for a specific location in a region variable, how much does that region impact the total? I thought of transforming every class into a dummy, but I don't know if this is really a good idea.
I think you mean "variable importance by specific categories of variables". That has not been implemented, but I guess it would be possible, meaningful and perhaps useful. Of course it would not be meaningful for variables with only two categories.
I would implement it something like:
Train model -> compute out-of-bag prediction performance (OOB-cv1) -> permute a specific category of a specific variable (reassign this category randomly to other categories, weighted by other-category prevalence) -> re-compute out-of-bag prediction performance (OOB-cv2) -> subtract OOB-cv1 from OOB-cv2
I then wrote a function implementing category-specific variable importance.
library(randomForest)
# Create some classification problem, with mixed categorical and numeric vars.
# Cat A of var 1, cat B of var 2 and cat C of var 3 influence class the most.
X.cat = replicate(3, sample(c("A", "B", "C"), 600, rep = T))
X.val = replicate(2, rnorm(600))
y.cat = 3 * (X.cat[, 1] == "A") + 3 * (X.cat[, 2] == "B") + 3 * (X.cat[, 3] == "C")
y.cat.err = y.cat + rnorm(600)
y.lim = quantile(y.cat.err, c(1/3, 2/3))
y.class = apply(replicate(2, y.cat.err), 1, function(x) sum(x > y.lim) + 1)
y.class = factor(y.class, labels = c("ann", "bob", "chris"))
X.full = data.frame(X.cat, X.val)
X.full[1:3] = lapply(X.full[1:3], as.factor)

# Train forest, keeping in-bag info so OOB error can be recomputed
rf = randomForest(X.full, y.class, keep.inbag = T, replace = T)

# Function to compute cross-validated (out-of-bag) classification error
oobErr = function(rf, X) {
  preds = predict(rf, X, type = "vote", predict.all = T)$individual
  preds[rf$inbag != 0] = NA  # mask in-bag predictions
  oob.pred = apply(preds, 1, function(x) {
    tabx = sort(table(x), dec = T)
    majority.vote = names(tabx)[1]
  })
  return(mean(as.character(rf$y) != oob.pred))
}

# Iterate over all categories of the categorical variables and compute
# the change of OOB class error due to permutation of each category
catVar = function(rf, X, nPerm = 2) {
  ref = oobErr(rf, X)
  catVars = which(rf$forest$ncat > 1)
  lapply(catVars, function(iVar) {
    catImp = replicate(nPerm, {
      sapply(levels(X[[iVar]]), function(thisCat) {
        thisCat.ind = which(thisCat == X[[iVar]])
        X[thisCat.ind, iVar] = head(sample(X[[iVar]]), length(thisCat.ind))
        varImp = oobErr(rf, X) - ref
      })
    })
    if (nPerm == 1) catImp else apply(catImp, 1, mean)
  })
}

# Try it out
out = catVar(rf, X.full, nPerm = 4)
print(out)  # seems like it works as it should
$X1
A B C
0.14000 0.07125 0.06875
$X2
A B C
0.07458333 0.16083333 0.07666667
$X3
A B C
0.05333333 0.08083333 0.15375000
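For comparison, the stock per-variable permutation importance could be computed like this (a minimal sketch; importance = TRUE must be set at training time):
rf2 = randomForest(X.full, y.class, importance = TRUE)
importance(rf2, type = 1)  # mean decrease in accuracy, one row per variable
The category-level output above refines that single number per variable: within each of X1, X2 and X3, the one influential category carries roughly twice the importance of the other two.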

QGIS - Retrieve start and end features from a line

I'm creating lines by selecting two features from various layers. When I create a line, a form pops up. In this form I want to display data from the start and end features of the line.
What I'm currently doing is retrieving the vertices as points:
geom = feature.geometry()
line = geom.asPolyline()
pointFather = line[0]    # first vertex
pointChild = line[-1]    # last vertex
then I get the coordinates of each point:
xf = pointFather.x()
yf = pointFather.y()
and then I look into each possible layer to find the features with the same coordinates, just to retrieve the features I just clicked on!
for layer in layerList:
    provider = layer.dataProvider()
    iter = provider.getFeatures()
    for feature in iter:
        geom = feature.geometry().asPoint()
        if geom.x() == xf and geom.y() == yf:
            # ... (found the start feature)
There must be an easier way to directly retrieve the start and end features, isn't there?
EDIT 1:
Here is my try after PCamargo's first answer:
def retrieve_feature_from_xy(geom, point, layerList):
    for layer in layerList:
        index = QgsSpatialIndex()
        iter = layer.getFeatures()
        for feat in iter:
            index.insertFeature(feat)
        ids = index.intersects(geom.boundingBox())
        request = QgsFeatureRequest()
        request.setFilterFids(ids)
        iter = layer.getFeatures(request)
        for feat in iter:
            geom2 = feat.geometry().asPoint()
            if geom2.x() == point.x() and geom2.y() == point.y():
                return feat
EDIT 2:
Here is my try after PCamargo's second comment:
def retrieve_feature_from_xy2(geom, point, layerList):
    allfeatures = {}
    indexes = []
    ids = []
    for layer in layerList:
        index = QgsSpatialIndex()
        iter = layer.getFeatures()
        for feat in iter:
            index.insertFeature(feat)
            allfeatures[feat.id()] = feat
        indexes.append(index)
    for index in indexes:
        intersect_ids = index.intersects(geom.boundingBox())
        ids.append(intersect_ids)
    for id in ids:
        for i in id:
            feat = allfeatures[i]
            geom2 = feat.geometry().asPoint()
            if geom2.x() == point.x() and geom2.y() == point.y():
                return feat
EDIT 3:
Here is my try after PCamargo's third comment:
def retrieve_feature_from_xy3(geom, point, layerList):
    allfeatures = {}
    indexes = []
    indexDict = {}
    intersectsIdsDict = {}
    for layer in layerList:
        index = QgsSpatialIndex()
        iter = layer.getFeatures()
        for feat in iter:
            index.insertFeature(feat)
            allfeatures[layer, feat.id()] = feat
        indexes.append(index)
        indexDict[layer] = index
    for layer, index in indexDict.items():
        intersectsIds = index.intersects(geom.boundingBox())
        intersectsIdsDict[layer] = intersectsIds
    for layer, intersectsIds in intersectsIdsDict.items():
        for id in intersectsIds:
            feat = allfeatures[layer, id]
            geom2 = feat.geometry().asPoint()
            if geom2.x() == point.x() and geom2.y() == point.y():
                return feat
Chris,
You can definitely improve the lookup for similar coordinates (the third part of your code).
Instead of looping through all features in each layer, create a spatial index (https://docs.qgis.org/2.2/en/docs/pyqgis_developer_cookbook/vector.html#using-spatial-index) for each layer and use nearestNeighbor.
It would be something like this:
# You only need to create these indexes once
indexes = []
for layer in layerList:
    index = QgsSpatialIndex()
    for feat in layer.getFeatures():
        index.insertFeature(feat)
    indexes.append(index)
Now that we have the indexes, we can use a faster geographic search.
geom = feature.geometry()
for index in indexes:
    intersect_ids = index.intersects(geom.boundingBox())
intersect_ids is a smaller list of features that are candidates to be equivalent, so you can compare only these features with the feature you selected.
You need to organize this a bit more, but that is the idea
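Since the answer suggests nearestNeighbor(), a hedged alternative to the boundingBox()/intersects() pair: QgsSpatialIndex can return the ids of the features closest to a point directly, which avoids the exact coordinate comparison entirely.
nearest_ids = index.nearestNeighbor(point, 1)  # id of the single closest feature
Here point is the line vertex (pointFather or pointChild) extracted in the question; the returned id still has to be resolved to a feature through the layer its index came from, as in EDIT 3.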