Spline binary logistic regression in R - spline

I am a beginner in R. I want to run a binary logistic regression, and I suspect that some variables have nonlinear effects. So I want to use splines to understand the effect of each range of a continuous variable, but I am confused about how to do it.
How can I do this in R?
Are the knots (cut points) determined automatically, or should I set them manually?
How can I get the overall p-value of a variable?
library(rms)  # provides lrm() and rcs()
model <- lrm(formula = PA ~ rcs(HEIGHT) + SLOPE + ASPECT, data = data, x = TRUE, y = TRUE)
I do not know whether "rcs" is helpful or not; I found it by searching here. I would be grateful for any guidance.

Related

How to perform dynamic optimization for a nonlinear discrete optimization problem with nonlinear constraints, using non-linear solvers like SNOPT?

I am new to the field of optimization and I need help with the following optimization problem. I have tried to solve it using ordinary coding to make sure that I get the correct results. However, the results I got are different, and I am not sure whether my analysis is correct. This is a short description of the problem:
The objective function shown in the picture is used to find the optimal temperature of the insulating system that minimizes the total cost over a given horizon.
[This image provides the mathematical description of the objective function and the constraints] (https://i.stack.imgur.com/yidrO.png)
The data of the problem are as follows:
1- Problem data:
A=1.07×10^8
h=1
T_ref=87.5
N=20
p1=0.001;
p2=0.0037;
This is the curve I want to obtain
2- Optimization variable:
u_t
3- Model type:
The model is a nonlinear cost function with non-linear constraints and it is solved using non-linear solver SNOPT.
4- The meaning of the symbols in the objective and constraint functions
The optimization is performed over a prediction horizon of N years.
T_ref is the reference temperature.
X_DP represents the degree of polymerization in the kth year.
u_t represents the temperature of the insulating system in the kth year.
h is the time step (1 year) of the discrete-time model.
R is the ratio of the load loss at the rated load to the no-load loss.
E is the activation energy.
A is the pre-exponential constant.
beta is a linear coefficient representing the cost due to the decrement of the temperature.
I have developed the source code in MATLAB. This code is used to check whether my analysis is correct.
I have tried to initialize the Ut value in its increasing or decreasing states so that I can have the curves similar to the original one. [This is the curve I obtained] (https://i.stack.imgur.com/KVv2q.png)
I have tried to simulate the problem using conventional coding without optimization and I got the figure shown above.
close all; clear all;
h=1;
N=20;
a=250;
R=8.314;
A=1.07*10^8;
E=111000;
Tref=87.5;
p1=0.0019;
p2=0.0037;
p3=0.0037;
Utt=[80,80.7894736842105,81.5789473684211,82.3684210526316,83.1578947368421,... % The values of Utt represent the temperature increment over the prediction horizon.
83.9473684210526,84.7368421052632,85.5263157894737,86.3157894736842,...
87.1052631578947,87.8947368421053,88.6842105263158,89.4736842105263,...
90.2631578947369,91.0526315789474,91.8421052631579,92.6315789473684,...
93.4210526315790,94.2105263157895,95];
Utt1 = [95,94.2105263157895,93.4210526315790,92.6315789473684,91.8421052631579,... % The values of Utt1 represent the temperature decrement over the prediction horizon.
91.0526315789474,90.2631578947369,89.4736842105263,88.6842105263158,...
87.8947368421053,87.1052631578947,86.3157894736842,85.5263157894737,...
84.7368421052632,83.9473684210526,83.1578947368421,82.3684210526316,...
81.5789473684211,80.7894736842105,80];
Ut1=zeros(1,N);
Ut2=zeros(1,N);
Xdp =zeros(N,N);
Xdp(1,1)=1000;
Xdp1 =zeros(N,N);
Xdp1(1,1)=1000;
for L=1:N-1
for k=1:N-1
%vt(k+L)=Ut(k-L+1);
Xdq(k+1,L) =(1/Xdp(k,L))+A*exp((-1*E)/(R*(Utt(k)+273)))*24*365*h;
Xdp(k+1,L)=1/(Xdq(k+1,L));
Xdp(k,L+1)=1/(Xdq(k+1,L));
Xdq1(k+1,L) =(1/Xdp1(k,L))+A*exp((-1*E)/(R*(Utt1(k)+273)))*24*365*h;
Xdp1(k+1,L)=1/(Xdq1(k+1,L));
Xdp1(k,L+1)=1/(Xdq1(k+1,L));
end
end
% MATLAB code
for j =1:N-1
Ut1(j)= -p1*(Utt(j)-Tref);
Ut2(j)= -p2*(Utt1(j)-Tref);
end
sum00=sum(Ut1);
sum01=sum(Ut2);
X1=1./Xdp(:,1);
Xf=1./Xdp(:,20);
Total= table(X1,Xf);
Tdiff =a*(Total.Xf-Total.X1);
X22=1./Xdp1(:,1);
X2f=1./Xdp1(:,20);
Total22= table(X22,X2f);
Tdiff22 =a*(Total22.X2f-Total22.X22);
obj=(sum00+(Tdiff));
ob1 = min(obj);
obj2=sum01+Tdiff22;
ob2 = min(obj2);
plot(Utt,obj,'-o');
hold on
plot(Utt1,obj)
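For readers who want to check the recursion outside MATLAB, the degradation update inside the nested loop above can be sketched in Python. This is a minimal sketch of that single update only, not of the optimization or of the objective shown in the image. The function name `simulate_dp` and its argument names are hypothetical, the constants are copied from the MATLAB script, and the two constant-temperature trajectories are illustrative values, not data from the question.

```python
import math

# Constants copied from the MATLAB script above.
A = 1.07e8                 # pre-exponential constant
E = 111000.0               # activation energy
R_GAS = 8.314              # value assigned to R in the script
HOURS_PER_YEAR = 24 * 365


def simulate_dp(temps_c, dp0=1000.0, h=1.0):
    """Apply 1/X[k+1] = 1/X[k] + A*exp(-E/(R*(T[k]+273)))*24*365*h for each T[k].

    Hypothetical helper: returns the list [X[0], X[1], ..., X[len(temps_c)]].
    """
    dp = [dp0]
    for temp_c in temps_c:
        rate = A * math.exp(-E / (R_GAS * (temp_c + 273.0))) * HOURS_PER_YEAR * h
        dp.append(1.0 / (1.0 / dp[-1] + rate))
    return dp


# Two constant-temperature trajectories (illustrative values only).
dp_cool = simulate_dp([80.0] * 19)
dp_hot = simulate_dp([95.0] * 19)
print(round(dp_cool[-1], 1), round(dp_hot[-1], 1))
```

Comparing a single trajectory like this against one column of Xdp may help separate indexing issues in the nested loop from issues in the model itself.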

Generating a Plot of CV vs. Degrees of Freedom

I have a dataset (n=298), and I am currently working on a generalized additive model (GAM) for it. There are three predictor variables and one response variable. I used this code to fit the GAM and perform leave-one-out cross-validation:
library(caret)  # provides trainControl() and train()
ctrl <- trainControl(method = "LOOCV")
model <- train(response ~ predictor1 + predictor2 + predictor3, data = data[2:5], method = "gam", trControl = ctrl)
While I think this worked in generating the model and performing cross validation, I'd like to graph the CV value over the degrees of freedom, similar to what is shown in the book image below. I'm not really sure how to go about this with my model as I am pretty new to using R.
[Image: Graph Example]
I tried to use plot(model), but it just outputs the graph below, which isn't very helpful and certainly isn't what I'm looking for. Any advice on how to approach this would be greatly appreciated. Thanks.
[Image: plot(model) Graph]

Linear regression graph interpretation

I have a histogram showing frequency of some data.
I have two types of files: Pdbs and Uniprots. Each Uniprot file is associated with a certain number of Pdbs. So this histogram shows how many Uniprot files are associated with 0 Pdb files, 1 Pdb file, 2 Pdb files ... 80 Pdb files.
Y-axis is in a log scale.
I did a regression on the same dataset and this is the result.
Here is the code I'm using for the regression graph:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Fitting Simple Linear Regression to the Training set
regressor = LinearRegression()
x = np.array(x).reshape((-1, 1))
y = np.array(y)
regressor.fit(x, y)
# Predicting the Test set results
y = regressor.predict(x)
# Visualizing the Training set results
plt.scatter(x, y, color = 'red')
plt.plot(x, regressor.predict(x), color = 'blue')
plt.title('Uniprot vs Pdb')
plt.xlabel('Pdbs')
plt.ylabel('Uniprot')
plt.savefig('regression_test.png')
plt.show()
Can you help me interpret the regression graph?
I can understand that as the number of Pdbs increases, there will be less Uniprots associated with them.
But why is it going negative on the y-axis? Is this normal?
The correct way to interpret this linear regression is "this linear regression is 90% meaningless." In fact, some of that 90% is worse than meaningless, it's downright misleading, as you have pointed out with the negative y values. OTOH, there is about 10% of it that we can interpret to good effect, but you have to know what you're looking for.
The Why: Amongst other often less apparent things, one of the assumptions of a linear regression model is that the data are more-or-less linear. If your data aren't linear with some very regular "noise" added in, then all bets are off. Your data aren't linear. They're not even close. So all bets are off.
Since all bets are off, it is helpful to examine the sort of things that we might have otherwise wanted to do with a linear regression model. The hardest thing is extrapolation, which is predicting y outside of the original x range. Your model's abilities at extrapolation are pretty well illustrated by its behavior at the endpoints. This is where you noticed "hey, my graph is all negative!". This is, in a very simplistic sense, because you took a linear model, fit it to data that did not satisfy the "linear" assumption, and then tried to make it do the hardest thing for a model to do. The second hardest thing for a model to do is interpolation which is making predictions inside the original x range. This linear regression isn't very good at that either. Further down the list is, if we simply look at the slope of the linear regression line, we can get a general idea of whether our data are increasing or decreasing. Note that even this bet is off if your data aren't linear. However, it generally works out in a not-entirely-useless sort of way for large classes of even non-linear real-world data. So, this one thing, your linear regression model gets kind of right. Your data are decreasing, and the linear model is also decreasing. That's the 10% I spoke of previously.
What to do: Try to fit a better model. You say that you log-transformed your original data, but it doesn't look like that helped much. In general, the whole point of "transforming" data is to make it look linear. The log transform is helpful for exponential data. If your starting data didn't look exponential-like, then the log transform probably isn't going to help. Since you are trying to do density estimation, you almost certainly want to fit a probability distribution to this stuff, for which you don't even need to do a transform to make the data linear. Here is another Stack Overflow answer with details about how to fit a beta distribution to data. However, there are many options.
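To make the point about transforms concrete, here is a minimal sketch in plain Python. It assumes synthetic, made-up data that decay exactly exponentially, which is not the asker's data. The helper `fit_line` is a hypothetical hand-written least-squares fit, used here instead of scikit-learn so the example is self-contained. It fits a straight line to the raw values and to their logarithms, then compares the two predictions at the right-hand end of the x range.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


# Synthetic counts that decay exponentially (made-up values).
xs = list(range(0, 81))
ys = [1000.0 * math.exp(-0.1 * x) for x in xs]

# Straight line through the raw data: poor fit, dips below zero at the end.
m_raw, b_raw = fit_line(xs, ys)
raw_pred_at_80 = m_raw * 80 + b_raw

# Straight line through log(y): the data are linear in this space.
m_log, b_log = fit_line(xs, [math.log(y) for y in ys])
log_pred_at_80 = math.exp(m_log * 80 + b_log)

print(raw_pred_at_80 < 0, log_pred_at_80 > 0)
```

The raw fit goes negative near the end of the range because a line cannot follow the curvature. The log-space fit cannot go negative once it is transformed back. This only helps when the data really are close to exponential, which, as noted above, may not be the case here.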
Can you help me interpret the regression graph?
Linear regression tries to fit a line between the x-variables and a target y-variable that approximates the 'real' values as closely as possible (for a graph, see https://en.wikipedia.org/wiki/Linear_regression):
The line here is the blue line, and the original points are the black dots. The goal is to minimize the error (the distance from each black dot to the blue line) over all black dots.
The regression line is the blue line. That means you can describe the Uniprot count with a linear equation y = m*x + b, with constants such as m=0.1 and b=0.2 (example values) and x = Pdbs.
I can understand that as the number of Pdbs increases, there will be less Uniprots associated with them. But why is it going negative on the y-axis?
This is normal. You could extend this line to -10000000 Pdbs or wherever; it is just an equation, not a real line.
But there is one mistake in your plot: you need to plot the original points (the black dots) as well, don't you?
y = regressor.predict(x)
plt.scatter(x, y, color = 'red')
This is wrong. You should plot the original values instead, to get a plot like the one in my graphic, something like:
y = df['Uniprot']
plt.scatter(x, y, color = 'red')
This should help to understand it.
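The effect of overwriting y with the predictions can be sketched without any plotting. This is a minimal illustration in plain Python with made-up data and a hypothetical hand-written `fit_line` helper, not the asker's dataset or scikit-learn. If the observed values are kept in their own variable, the residuals are non-zero. If they are replaced by the predictions, every "data point" sits exactly on the line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


# Made-up observations, kept in their own variable.
xs = [0, 1, 2, 3, 4, 5]
ys = [50.0, 20.0, 9.0, 4.0, 2.0, 1.0]

m, b = fit_line(xs, ys)
y_pred = [m * x + b for x in xs]  # predictions go in a separate name

# Residuals against the real observations: generally non-zero.
residuals = [y - p for y, p in zip(ys, y_pred)]

# What happens if the observations are overwritten by the predictions.
overwritten = y_pred
residuals_overwritten = [y - p for y, p in zip(overwritten, y_pred)]

print(residuals)
print(residuals_overwritten)
```

With the observations preserved, a scatter of (xs, ys) next to a line through (xs, y_pred) shows how far the data are from the fit.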

Why is tf.transpose so important in a RNN?

I've been reading the docs to learn TensorFlow and have been struggling on when to use the following functions and their purpose.
tf.split()
tf.reshape()
tf.transpose()
My guess so far is that:
tf.split() is used because inputs must be a sequence.
tf.reshape() is used to make the shapes compatible (incorrect shapes tend to be a common problem / mistake for me). I used numpy for this before. I'll probably stick to tf.reshape() now. I am not sure if there is a difference between the two.
tf.transpose() swaps the rows and columns from my understanding. If I don't use tf.transpose() my loss doesn't go down. If the parameter values are incorrect the loss doesn't go down. So the purpose of me using tf.transpose() is so that my loss goes down and my predictions become more accurate.
This bothers me tremendously because I'm using tf.transpose() because I have to, and I have no understanding of why it's such an important factor. I'm assuming that if it's not used correctly, the inputs and labels can end up in the wrong position, making it impossible for the model to learn. If this is true, how can I go about using tf.transpose() so that I am not so reliant on figuring out the parameter values via trial and error?
Question
Why do I need tf.transpose()?
What is the purpose of tf.transpose()?
Answer
Why do I need tf.transpose()? I can't imagine why you would need it unless you coded your solution from the beginning to require it. For example, suppose I have 120 student records with 50 stats per student and I want to use that to try and make a linear association with their chance of taking 3 classes. I'd state it like so
c = r x m
r = records, a matrix with a shape of [120x50]
m = the induction matrix. it has a shape of [50x3]
c = the chance of all students taking one of three courses, a matrix with a shape of [120x3]
Now if instead of making m [50x3], we goofed and made m [3x50], then we'd have to transpose it before multiplication.
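The shape bookkeeping above can be sketched in plain Python. This is a minimal illustration using nested lists rather than TensorFlow tensors. The helper names `transpose`, `matmul`, and `shape` are hypothetical, and the matrix entries are made-up placeholders.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Multiply a [n x k] matrix by a [k x m] matrix; reject mismatched shapes."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def shape(matrix):
    return (len(matrix), len(matrix[0]))


r = [[1.0] * 50 for _ in range(120)]     # records: [120 x 50]
m_goofed = [[0.5] * 50 for _ in range(3)]  # m built the wrong way round: [3 x 50]

try:
    matmul(r, m_goofed)                  # [120 x 50] times [3 x 50] cannot work
    failed = False
except ValueError:
    failed = True

c = matmul(r, transpose(m_goofed))       # transpose gives [50 x 3]
print(failed, shape(c))
```

The transpose is only needed here because of how `m_goofed` was laid out. Building it as [50 x 3] from the start would remove the call entirely.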
What is the purpose of tf.transpose()?
Sometimes you just need to swap rows and columns, like above. Wikipedia has a fantastic page on it. The transpose has some useful properties for matrix math, such as reversing the order of a product, (AB)^T = B^T A^T, and commuting with the inverse, (A^-1)^T = (A^T)^-1.
Summary
I don't think I've ever used tf.transpose in any CNN I've written.

How to get scikit learn to find simple non-linear relationship

I have some data in a pandas dataframe (although pandas is not the point of this question). As an experiment I made column ZR as column Z divided by column R. As a first step using scikit learn I wanted to see if I could predict ZR from the other columns (which should be possible as I just made it from R and Z). My steps have been.
import numpy as np
from sklearn import linear_model, preprocessing
columns = ['R', 'T', 'V', 'X', 'Z']
# Standardize each feature column and the target
for c in columns:
    results[c] = preprocessing.scale(results[c])
results['ZR'] = preprocessing.scale(results['ZR'])
labels = results["ZR"].values
features = results[columns].values
#print labels
#print features
regr = linear_model.LinearRegression()
regr.fit(features, labels)
print(regr.coef_)
print(np.mean((regr.predict(features)-labels)**2))
This gives
[ 0.36472515 -0.79579885 -0.16316067 0.67995378 0.59256197]
0.458552051342
The preprocessing seems wrong as it destroys the Z/R relationship I think. What's the right way to preprocess in this situation?
Is there some way to get near 100% accuracy? Linear regression is the wrong tool, as the relationship is non-linear.
The five features are highly correlated in my data. Is non-negative least squares implemented in scikit-learn? (I can see it mentioned in the mailing list but not in the docs.) My aim would be to get as many coefficients set to zero as possible.
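The preprocessing concern raised in the question can be checked on a toy example. This is a minimal sketch in plain Python with made-up values for R and Z. The `scale` helper is a hypothetical stand-in that standardizes to zero mean and unit (population) standard deviation, which is what `preprocessing.scale` does by default. It shows that the ratio of the standardized columns is not the standardized ratio.

```python
from statistics import mean, pstdev


def scale(values):
    """Standardize to zero mean and unit population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up positive values, for illustration only.
R = [1.0, 2.0, 4.0, 5.0]
Z = [2.0, 3.0, 8.0, 20.0]
ZR = [z / r for z, r in zip(Z, R)]

scaled_ratio = scale(ZR)                                         # what the target becomes
ratio_of_scaled = [z / r for z, r in zip(scale(Z), scale(R))]    # what the features imply

for a, b in zip(scaled_ratio, ratio_of_scaled):
    print(round(a, 3), round(b, 3))
```

Centering shifts each column by a different constant, so the simple Z/R relationship between the scaled columns is lost even though the information is still present in the data.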
You should easily be able to get a decent fit using random forest regression, without any preprocessing, since it is a nonlinear method:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=10, max_features=2)
model.fit(features, labels)
You can play with the parameters to get better performance.
The solution is not that easy and can depend heavily on your data.
If your variables R and Z are bounded (for example 0<R<1, -3<Z<2), then you should be able to get a good estimate of the output variable using a neural network.
Using a neural network, you should be able to estimate your output even without preprocessing the data, using all the variables as input.
(Of course, here you will have to solve a minimization problem.)
Older versions of scikit-learn did not implement neural networks, so you would use pybrain or fann; recent versions provide sklearn.neural_network.MLPRegressor.
If you want to preprocess the data in order to make the minimization problem easier, you can try to extract the right features from the predictor matrix.
I do not think there are a lot of tools for non-linear feature selection. I would try to estimate the important variables from your dataset using, in this order:
1- lasso
2- sparse PCA
3- decision trees (you can actually use them for feature selection), but I would avoid this as much as possible
If this is a toy problem, I would suggest moving towards something more standard.
You can find a lot of examples on Google.