I am learning about spline variables for logit/Cox models, and I generally notice that the confidence bands are narrow where the odds/hazard ratio is near 1. What is the possible reason for this?
Attaching a sample chart for reference.
Edit: The code used to generate the above graph is based on a conditional logit model.
proc phreg data=have;
effect v1Spline=spline(v1 / naturalcubic basis=tpf(noint) knotmethod=percentilelist(5 25 50 75 95));
model time*outcome(0)=v1Spline;
strata id;
run;
Thank you for reading. I'm not good at English.
I am wondering how to predict future time series values after training a model; I would like to get the values N steps ahead.
I also wonder whether the time series has been properly learned and predicted.
How do I correctly get the following (next) values, using something like model.predict?
I have x_test, and x_test[-1] == t, so by "the next values" I mean t+1, t+2, ..., t+n.
First
I tried using stock index data
inputs = total_data[len(total_data) - forecast - look_back:]
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back, inputs.shape[0]):
    X_test.append(inputs[i - look_back:i])
X_test = np.array(X_test)
predicted = model.predict(X_test)
but the result is like below:
The results from X_test[-20:] and the following 20 predictions look the same.
I'm wondering whether the training and the predicted values are correct.
full source
The method I tried first did not work correctly.
Second
I realized something was wrong, so I tried another official dataset: I used the time series from the TensorFlow tutorial to practice predicting with the model.
a = y_val[-look_back:]
for i in range(n_steps):  # predict a new value n times
    tmp = model.predict(a.reshape(-1, look_back, num_feature))  # predicted value
    a = a[1:]  # remove the first (oldest) value
    a = np.append(a, tmp)  # append the predicted value
The predictions came out in a straight-line shape, very different from the real data.
Output: a straight line that is independent of the real data:
full source (After the 25th line is my code.)
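The recursive loop above can be sketched in a self-contained, framework-free form; `toy_model` below is a made-up stand-in for `model.predict`, used only to exercise the sliding-window logic:

```python
import numpy as np

look_back = 4

def toy_model(window):
    """Hypothetical stand-in for model.predict: predicts the next value
    as the mean of the window. Input shape mimics a Keras LSTM: (1, look_back, 1)."""
    return np.array([[window.mean()]])

def recursive_forecast(history, n_steps):
    """Feed each prediction back in as input for the next step."""
    a = np.asarray(history[-look_back:], dtype=float)
    preds = []
    for _ in range(n_steps):
        tmp = toy_model(a.reshape(1, look_back, 1))  # shape (1, 1)
        preds.append(float(tmp[0, 0]))
        a = np.append(a[1:], tmp)  # slide the window forward
    return preds

preds = recursive_forecast([1.0, 2.0, 3.0, 4.0], n_steps=3)
print(preds)  # [2.5, 2.875, 3.09375]
```

Note that because each predicted value is fed back in, errors compound: with a smoothing stand-in model the forecast quickly flattens out, which is exactly the straight-line behavior described above.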
I'm really very curious what a standard method of predicting the next values of a stock market would be.
Thank you for reading this long question. I would appreciate your valuable opinion.
Q : "How can I find a standard method of predicting next values of a stock market...?"
First - salutes to C64 practitioner!
Next, let me say: there is no standard method; there cannot be one.
Principally, let me draw from a field of shared experience: one can easily predict the near-future flow of laminar fluids. A technically "working" market instrument is such a model, for which one can derive a better or worse predictive tool.
That will never work, however, for turbulent states of the fluids. Just read up on the complexity of the attempts to formulate the many-dimensional, high-order PDEs for turbulence (and those still only approximate it). And that is the fundamentally "working" market, after some expected fundamental factor is released (read: NFP or CPI) or some flash news is announced (read: the Swiss move on the CHF currency peg, or Cyprus's one-time state tax on speculative deposits; the financial Big Bangs follow).
So please do not expect any model, much less a simple one, to deliver reasonably precise predictions for both the laminar and the turbulent regimes; the real world is for sure way more complex than this :o)
I am playing around with DeepExplainer to get SHAP values for deep learning models. By following some tutorials I can get some results, i.e., which variables are pushing the model prediction away from the base value, which is the average model output over the training set.
I have around 5,000 observations along with 70 features. The performance of DeepExplainer is quite satisfactory. And my code is:
model0 = load_model(model_p+'health0.h5')
background = healthScaler.transform(train[healthFeatures])
e = shap.DeepExplainer(model0, background)
shap_values = e.shap_values(healthScaler.transform(test[healthFeatures]))
test2 = test[healthFeatures].copy()
test2[healthFeatures] = healthScaler.transform(test[healthFeatures])
shap.force_plot(e.expected_value[0], shap_values[0][947,:], test2.iloc[947,:])
And the plot is the following:
Here the base value is 0.012 (can also be seen through e.expected_value[0]) and very close to the output value which is 0.01.
At this point I have some questions:
1) The output value is not identical to the prediction obtained through model0.predict(test[healthFeatures])[947] = -0.103. How should I interpret the output value?
2) As can be seen, I am using the whole training set as the background to approximate the conditional expectations of the SHAP values. What is the difference between using random samples from the training set and using the entire set? Is it only a performance issue?
Many thanks in advance!
Probably too late, but still a most common question that will benefit other beginners. To answer (1): the expected and output values will be different. The expected value is, as the name suggests, the average over the scores predicted by your model; e.g., if it is a probability, then it is the average of the probabilities that your model outputs. For (2): as long as there are fewer than about 5k background values, it won't change much, but with more than 5k your calculations can take days to finish.
See this (lines 21-25) for more comprehensive answers.
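To make point (1) concrete, here is a framework-free sketch (the linear `toy_model` and its weights are made up, standing in for the network): the base value is simply the average model output over the background set, and for a linear model with independent features the exact SHAP values are w_i * (x_i - E[background_i]), so base value plus SHAP values recovers the model's output for the explained observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in model: a linear model with known weights.
w = np.array([2.0, -1.0, 0.5])
def toy_model(X):
    return X @ w

background = rng.normal(size=(5000, 3))   # "training set" background
x = np.array([1.0, 2.0, 3.0])             # one observation to explain

# The base value is the average model output over the background set.
base_value = toy_model(background).mean()

# Exact SHAP values for a linear model: w_i * (x_i - E[background_i]).
shap_values = w * (x - background.mean(axis=0))

# Additivity: base value + sum of SHAP values == model output for x.
print(np.isclose(base_value + shap_values.sum(), toy_model(x)))  # True
```

The same additivity holds for DeepExplainer's approximations, which is why the force plot's "output value" equals the base value plus the bar contributions rather than the raw model prediction on unscaled inputs.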
I am a bit confused by the NumPy function random.randn(), which returns random values from the standard normal distribution in an array of the size of your choosing.
My question is that I have no idea when this would ever be useful in applied practice.
For reference about me I am a complete programming noob but studied math (mostly stats related courses) as an undergraduate.
The NumPy function randn is incredibly useful for adding a random noise element into a dataset that you create for initial testing of a machine learning model. Say, for example, that you want to create a million-point dataset that is roughly linear, for testing a regression algorithm. You create a million data points using
x_data = np.linspace(0.0,10.0,1000000)
You generate a million random noise values using randn
noise = np.random.randn(len(x_data))
To create your linear data set you follow the formula
y = mx + b + noise_levels with the following code (setting b = 5, m = 0.5 in this example)
y_data = (0.5 * x_data ) + 5 + noise
Finally the dataset is created with
my_data = pd.concat([pd.DataFrame(data=x_data,columns=['X Data']),pd.DataFrame(data=y_data,columns=['Y'])],axis=1)
This could also be used in 3D programming to generate random jitter, which can be useful for optimizing graphical effects.
Another possible statistical application would be applying a formula with added noise in order to test how some factor, such as the span of time being measured, affects a given quantity. This would return a statistic indicating, for example, whether your formula is more effective over shorter intervals or longer ones.
np.random.randn(d0, d1, ..., dn) returns a sample (or samples) from the "standard normal" distribution (mu = 0, sigma = 1).
For random samples from N(mu, sigma^2), use:
sigma * np.random.randn(...) + mu
This is because if Z is a standard normal deviate, then sigma * Z + mu will have a normal distribution with expected value mu and standard deviation sigma.
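A quick sketch of that shift-and-scale rule (mu = 10 and sigma = 2.5 are arbitrary example values):

```python
import numpy as np

np.random.seed(42)  # reproducibility

mu, sigma = 10.0, 2.5
samples = sigma * np.random.randn(1_000_000) + mu  # samples from N(mu, sigma^2)

# With a million samples, the empirical mean and standard deviation
# should land very close to mu and sigma.
print(round(samples.mean(), 1), round(samples.std(), 1))  # 10.0 2.5
```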
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randn.html
https://en.wikipedia.org/wiki/Normal_distribution
I've got a game with only 10x2 pixels as input, and after one hour of training it learns to play by itself. Now I want to use one float output from the model instead of three classifier outputs. The three classifier outputs were stop, 1 step right, and 1 step left. Now I want to produce one output value that tells me, e.g., -4 => 4 steps left, +2 => 2 steps right, and so on.
But after training for 1-2 hours, it only produces numbers around 0.001, although it should produce numbers between -10.0 and +10.0.
Do I need to do it in a completely different way, or can I use a classifier model to output a real value without changing much code?
thanks for help
game code link
Training a classifier is much simpler than coming up with a good loss function that will give you scalar values that make sense. Much (!) simpler.
Make it a classifier with 21 classes (0 = 10 left, 1 = 9 left, 2 = 8 left, ..., 10 = stay, 11 = 1 right, ..., 20 = 10 right).
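A minimal sketch of that mapping (the helper names are made up): with 21 classes, the predicted class index converts to a signed step count by subtracting 10.

```python
import numpy as np

N_CLASSES = 21
OFFSET = 10  # class 10 means "stay"

def class_to_steps(class_index: int) -> int:
    """Map a class index 0..20 to a step count -10..+10."""
    return class_index - OFFSET

def steps_to_class(steps: int) -> int:
    """Inverse mapping, e.g. for building one-hot training labels."""
    return steps + OFFSET

# e.g. the argmax over the 21 softmax outputs of a hypothetical model
probs = np.zeros(N_CLASSES)
probs[6] = 1.0  # pretend the model is most confident in class 6
print(class_to_steps(int(np.argmax(probs))))  # -4, i.e. 4 steps left
```

This keeps the softmax/cross-entropy training loop unchanged; only the label encoding and the final decoding step differ from the three-class version.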
I've trained a simple logistic regression model in SSAS, using Gender and NIC as discrete input nodes (NIC is 0 for non-smoker, 1 for smoker) with Score (0-100) as a continuous output node.
I want to predict the score based on a new participant's values for Gender and NIC. Of course, I can run a singleton query in DMX; for example, the following produces a value of 49.51....
SELECT Predict(Score)
FROM [MyModel]
NATURAL PREDICTION JOIN
(SELECT 'M' AS Gender, '1' AS NIC) as t
But instead of using DMX, I want to create a formula from the model in order to calculate scores while "disconnected" from SSAS.
Investigating the model, I have the following information in the NODE_DISTRIBUTION of the output node:
ATTRIBUTE_NAME  ATTRIBUTE_VALUE  SUPPORT  PROBABILITY  VARIANCE     VALUETYPE
Gender:F         0.459923854     0        0            0            7 (Coefficient)
Gender:M         0.273306289     0        0            0            7 (Coefficient)
Nic:0           -0.282281195     0        0            0            7 (Coefficient)
Nic:1           -0.802106901     0        0            0            7 (Coefficient)
                 0.013983007     0        0            0.647513829  7 (Coefficient)
Score           75.03691517      0        0            0            3 (Continuous)
Plugging these coefficients into a logistic regression formula -- that I am being disallowed from uploading as a new user :) -- for the smoking male example above,
f(...) = 1 / (1 + exp(0 - (0.0139830071136734 -- Constant(?)
+ 0 * 0.459923853918008 -- Gender:F = 0
+ 1 * 0.273306289390897 -- Gender:M = 1
+ 1 * -0.802106900621717 -- Nic:1 = 1
+ 0 * -0.282281195489355))) -- Nic:0 = 0
results in a value of 0.374.... But how do I "map" this value back to the score distribution of 0-100? In other words, how do I extend the equation above to produce the same value that the DMX singleton query does? I'm assuming it will require the stdev and mean of my Score distribution, but I'm stuck on exactly how to use those values. I'm also unsure whether I'm using the ATTRIBUTE_VALUE in the fifth row correctly as the constant.
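For reference, the arithmetic above can be checked with a short script (the coefficients are copied from the NODE_DISTRIBUTION table; mapping the result back to the 0-100 score is exactly the open question, so no attempt is made here):

```python
import math

# Coefficients from the model's NODE_DISTRIBUTION (see the table above)
CONSTANT = 0.0139830071136734
GENDER_M = 0.273306289390897
NIC_1 = -0.802106900621717

def logistic(gender_m: int, nic_1: int) -> float:
    """Logistic response for the smoking-male example (the other indicator
    terms, Gender:F and Nic:0, are zero here and omitted)."""
    z = CONSTANT + gender_m * GENDER_M + nic_1 * NIC_1
    return 1.0 / (1.0 + math.exp(-z))

print(round(logistic(gender_m=1, nic_1=1), 3))  # 0.374
```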
Any help you can provide will be appreciated!
I'm no expert, but it sounds to me like you don't want to use logistic regression at all; you want to train a linear regression. You currently have a logistic regression model, and these are typically used for binary classification, not continuous values like 0-100.
How to do linear regression in SAS
Wikipedia: linear regression
More details: the question really depends, like most data mining / machine learning problems, on your data. If your data is bimodal, with more than 90% of the training set very close to either 0 or 100, then a logistic regression MIGHT be used. The equation used in logistic regression is specifically designed to render YES/NO answers. It is technically a continuous function, therefore results such as .34 are possible, but they are statistically very unlikely (in typical usage you would round down to 0).
However, if your data is normally distributed (most of nature is), the better method is linear regression. The only problem is that it CAN predict outside of your 0-100 range if given a particularly bad data point. In this case you would be best off rounding (clipping the result to 0-100) or ignoring the data point as an outlier.
In the case of gender, a quick hack would be to map male to 0 and female to 1, then treat gender as an input for the model.
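A NumPy-only sketch of that suggestion with a made-up toy dataset (the encoding and the numbers are hypothetical): fit ordinary least squares, then clip predictions into [0, 100].

```python
import numpy as np

# Hypothetical training data: gender (M=0, F=1), nic (0/1), and a 0-100 score
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 0]], dtype=float)
y = np.array([40.0, 70.0, 35.0, 65.0, 45.0, 60.0])

# Fit ordinary least squares with an intercept column
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_score(gender_f: int, nic: int) -> float:
    """Predict a score and clip it into the valid [0, 100] interval."""
    raw = coef[0] + coef[1] * gender_f + coef[2] * nic
    return float(np.clip(raw, 0.0, 100.0))

print(0.0 <= predict_score(1, 1) <= 100.0)  # True: always in range after clipping
```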
SSAS linear regression
You do not want to be using logistic regression if you are trying to model a score restricted to an interval [0,100]. Logistic regression is used to model either binary data or proportions based on a binomial distribution. Assuming a logit link function what you are actually modelling with logistic regression is a function of probability (log of odds) and as such the entire process is geared to give you values in the interval [0,1]. To try to use this to map to a score does not seem to be the right type of analysis at all.
In addition, I cannot see how regular linear regression will help you either, as your fitted model will be capable of generating values way outside of your target interval [0,100]; and if you are having to perform ad hoc truncation of values to this range, then can you really be sure that your results have any effective meaning?
I would like to be able to point you to the type of analysis that you require but I have not encountered this type of analysis. My advice to you would be to abandon the logistic regression approach and consider joining the ALLSTAT mailing list used by professional statisticians and mathematicians and asking for advice there. Or something similar.