How to make predictions using my date-time (Random Forest model) - pandas

I have made this Random Forest model to predict whether a stock market day will be an up day or a down day.
My goal is to get a 1 for up days and a 0 for down days by passing in a date-time like this:
2020-05-12 00:00:00-04:00
I was thinking it would work with this line of code, but obviously I'm misunderstanding something, since it does not work:
rf_random.predict(2020-05-12 00:00:00-04:00)
Here is my dataframe
time close high low open volume c_in_p down_days up_days RSI
2016-06-27 00:00:00-04:00 57.61 58.76 57.05 58.76 31954614 -1.97 1.97 0.00 19.832891
2016-06-28 00:00:00-04:00 59.50 59.55 58.26 59.19 24884353 1.89 0.00 1.89 35.990316
2016-06-29 00:00:00-04:00 61.20 61.21 60.00 60.33 18107419 1.70 0.00 1.70 47.063366
Here is the code for my model:
# imports for the pieces used below
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# New Random Forest Classifier to house optimal parameters
rf = RandomForestClassifier()
# Specify the details of our Randomized Search
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=5, verbose=5, random_state=42, n_jobs=-1)
# Fit the random search model
rf_random.fit(X_train, y_train)
Fitting 5 folds for each of 100 candidates, totalling 500 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 25.4s
[Parallel(n_jobs=-1)]: Done 68 tasks | elapsed: 1.5min
[Parallel(n_jobs=-1)]: Done 158 tasks | elapsed: 3.7min
[Parallel(n_jobs=-1)]: Done 284 tasks | elapsed: 8.2min
[Parallel(n_jobs=-1)]: Done 446 tasks | elapsed: 12.6min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed: 14.3min finished
RandomizedSearchCV(cv=5, error_score=nan,
estimator=RandomForestClassifier(bootstrap=True,
ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100,
n_jobs...
param_distributions={'bootstrap': [True, False],
'max_depth': [10, 20, 30, 40, 50, 60,
70, 80, 90, 100, None],
'max_features': ['auto', 'sqrt', None,
'log2'],
'min_samples_leaf': [1, 2, 7, 12, 14,
16, 20],
'min_samples_split': [2, 5, 10, 20, 30,
40],
'n_estimators': [200, 400, 600, 800,
1000, 1200, 1400, 1600,
1800]},
pre_dispatch='2*n_jobs', random_state=42, refit=True,
return_train_score=False, scoring=None, verbose=5)
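For reference, random_grid is not shown in the post; judging from the printed param_distributions above, it was presumably defined along these lines:
random_grid = {'bootstrap': [True, False],
               'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
               'max_features': ['auto', 'sqrt', None, 'log2'],
               'min_samples_leaf': [1, 2, 7, 12, 14, 16, 20],
               'min_samples_split': [2, 5, 10, 20, 30, 40],
               'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800]}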
'''
ACCURACY
'''
# imports for the metrics used below
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

# Grab the test-set predictions once, then score them.
y_pred = rf_random.predict(X_test)
print('Correct Prediction (%): ', accuracy_score(y_test, y_pred, normalize=True) * 100.0)
'''
CLASSIFICATION REPORT
'''
# Define the target names
target_names = ['Down Day', 'Up Day']
# Build a classification report
report = classification_report(y_true=y_test, y_pred=y_pred, target_names=target_names, output_dict=True)
# Add it to a data frame, transpose it for readability.
report_df = pd.DataFrame(report).transpose()
display(report_df)
print('\n')
'''
FEATURE IMPORTANCE
'''
# Calculate feature importance and store in pandas series
feature_imp = pd.Series(rand_frst_clf.feature_importances_, index=X_Cols.columns).sort_values(ascending=False)
display(feature_imp)
Correct Prediction (%): 66.80327868852459
precision recall f1-score support
Down Day 0.623932 0.629310 0.626609 116.000000
Up Day 0.661417 0.656250 0.658824 128.000000
accuracy 0.643443 0.643443 0.643443 0.643443
macro avg 0.642674 0.642780 0.642716 244.000000
weighted avg 0.643596 0.643443 0.643509 244.000000
MACD 0.213449
k_percent 0.183975
r_percent 0.181395
Price_Rate_Of_Change 0.156800
RSI 0.150577
On Balance Volume 0.113804
dtype: float64
rf_random.best_estimator_
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=20, max_features=None,
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=12, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=800,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
Here is where I would have asked for my prediction, but it doesn't work:
rf_random.predict(2020-05-12 00:00:00-04:00)
File "<ipython-input-51-788cba99b288>", line 1
rf_random.predict(2020-05-12 00:00:00-04:00)
^
SyntaxError: invalid token

So why wouldn't rf_random.predict("2020-05-12 00:00:00-04:00") work? Because the model doesn't actually learn from the date. Your model needs values for close, high, low, open, volume, c_in_p, down_days, up_days, and RSI to make a prediction, and they have to be in the same format as X_train.
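For example, here is a minimal sketch (assuming the dataframe shown above is named df and is indexed by the time column, and that these nine columns are exactly the features X_train was built from):
feature_cols = ['close', 'high', 'low', 'open', 'volume', 'c_in_p', 'down_days', 'up_days', 'RSI']
# Pull the feature row for the date you care about...
row = df.loc['2020-05-12 00:00:00-04:00', feature_cols]
# ...and hand it to predict() as a single-row frame with the same column order as X_train
print(rf_random.predict(row.to_frame().T))   # -> array([1]) for an up day, array([0]) for a down day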
I know it's always easy to be negative, but just a couple of big-picture issues for the record:
You're using the future to predict the past with random CV partitioning (a time-ordered alternative is sketched after this list).
A useful model must be evaluated on how it performs in the future.
The dataset is an equally spaced series, so you could use time-series approaches and calendar-based features to improve predictions. That also addresses the issue that you would otherwise need to know tomorrow's close/high/low/volume to predict whether tomorrow goes up or down.
You're predicting something with a lot of difficult-to-explain volatility, so it is a poor use case for any machine learning model. If I had a model that was truly even 60% accurate at predicting whether the market goes up or down tomorrow, I'd be a billionaire!
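One way to keep the validation time-ordered is scikit-learn's TimeSeriesSplit; a rough sketch with the names from the question (it assumes the rows of X_train are in chronological order):
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV

# every fold trains only on rows that come before the rows it is validated on
tscv = TimeSeriesSplit(n_splits=5)
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, cv=tscv, random_state=42, n_jobs=-1)
rf_random.fit(X_train, y_train)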

Related

Recurrent neural network LSTM problem solved using 1 epoch

Looking at a solved problem whose goal is to predict stock prices, I found that only 1 epoch is used to solve it. The data consists of a little under 1500 points, each corresponding to a daily closing price. So we have a dataset of dates (days) and prices.
Using the LSTM approach, the X_train training set is generated as:
Original dataset:
Date Price
1-1-2010 100
2-1-2010 80
3-1-2010 50
4-1-2010 40
5-1-2010 70
...
30-10-2012 130
...
X_train:
[[100, 80, 50, 40, 70, ...],
[80, 50, 40, 70, 90, ...],
[50, 40, 70, 90, 95, ...],
...
[..., 78, 85, 72, 60, 105],
[..., 85, 72, 60, 105, 130]]
Each training window is 60 prices long and is shifted forward by one day each time, until a chosen fraction of the total dataframe (the training set) is covered. Please don't consider things like normalization, etc.; this is just an example.
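For illustration only, a sliding-window builder along those lines might look like this (prices is assumed to be the list of closing prices; this is not code from the original problem):
import numpy as np

def make_windows(prices, window=60):
    X, y = [], []
    for i in range(len(prices) - window):
        X.append(prices[i:i + window])   # 60 consecutive closing prices
        y.append(prices[i + window])     # the next day's price as the target
    return np.array(X), np.array(y)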
The thing is that in the training part of the problem the epochs are set to 1. This is the first time I have seen this approach of making just one pass through the data to train the model. I've searched for information about it, but to no avail.
Does anyone know what this technique is called (if it has a name), so I can search for more about it?

LSTM model for time series forecasting does not train properly for some data

CONTEXT
I have a dataframe of monthly historical prices of market indices like so (all data comes from Bloomberg):
MSCI World S&P 500 ... HFRX Event Driven Gold Spot
1969-12-31 100 92.06 ... NaN NaN
1970-01-30 94.25 85.02 ... NaN NaN
... ... ... ... ... ...
2021-07-31 3141.35 4395.26 ... 20459.292 143.77
2021-08-31 3006.6 4522.68 ... 20614.276 134.06
I want to predict the value of each index for the next month with an LSTM NN (each index has its own specially trained NN).
So a new LSTM model is initialized and trained on each of these time series (which all have from 300 to 1200 samples). This (Pytorch) LSTM model is the following:
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    def __init__(self, input_size, hidden_size, sequence_size, num_layers, dropout):
        super(LSTMRegressor, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.sequence_size = sequence_size
        self.num_layers = num_layers
        self.dropout = dropout
        self.lstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout)
        self.linear = nn.Linear(in_features=hidden_size, out_features=1)

    def forward(self, x):
        lstm_out, self.hidden = self.lstm(x)
        # use the output of the last time step as input to the regression head
        y_pred = self.linear(lstm_out[:, -1, :])
        return y_pred
Loss function and optimizer:
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(),lr=learning_rate)
My parameters are the following:
input_size = 1
hidden_size=150
num_layers=2
dropout=0
batch_size = 16
learning_rate = 0.001
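For reference, a minimal sketch of how this model would be exercised with those parameters (the window length of 12 monthly values is an assumption, since the post does not state sequence_size):
model = LSTMRegressor(input_size=1, hidden_size=150, sequence_size=12, num_layers=2, dropout=0)
x = torch.randn(batch_size, 12, 1)   # (batch, sequence length, input_size) because batch_first=True
y_pred = model(x)                    # shape (batch_size, 1): one next-month prediction per sequence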
RESULTS
For most of the indexes, the training seems to work well, as there is only about a 0.5% mean error on the test set (see an example in the first graph below). However, for some of the indexes, the training does not work (about 100% error) (see an example in the second graph below).
The graphs show training/validation loss and MAPE (mean absolute percentage error). The vertical red line is simply the best epoch as calculated by an "early stopping" algorithm.
Model that trained successfully (test == validation):
Model that trained unsuccessfully (test == validation):
QUESTIONS
Why do none of the LSTM models seem to overfit (I've tested with tens of thousands of epochs)?
Why do some LSTM models not train properly (they are not the ones with the least data)?
Why do the models that do not train properly have such smooth curves?
Thank you very much for your help!

GEKKO dynamic optimization negative degrees of freedom

I'm trying to use GEKKO to minimize the combined power load from charging vehicle batteries in discrete time.
Each vehicle has an energy demand ('dem' in the vehicles_info dict) which should be met within its available time frame (from 'start' to 'end' in the vehicles_info dict).
There is also a constraint on the maximum power supply (Crate) to the battery based on the SoC level in each time step. Thus SoC and Crate are continuously calculated as intermediates for each vehicle battery in every time step.
A solution is found with the vehicles in the vehicles_info dict below, but the degrees of freedom are -1255. I guess this could become an issue for convergence with bigger systems (= more vehicles and longer time periods)? I can't really tell how to fix this.
Full code:
import numpy as np
from gekko import GEKKO
#################
# Vehicles info #
#################
# start = starting timestep for charging of vehicle
# end = ending timestep for charging of vehicle
# batt = vehicle battery size
# dem = vehicle energy demand
# start_soc = vehicle battery starting state-of-charge
vehicles_info = {1: {'start': 5,  'end': 50,  'batt': 700.0, 'dem': 290.0, 'start_soc': 0.2},
                 2: {'start': 20, 'end': 80,  'batt': 650.0, 'dem': 255.0, 'start_soc': 0.2},
                 3: {'start': 40, 'end': 90,  'batt': 600.0, 'dem': 278.0, 'start_soc': 0.27},
                 4: {'start': 50, 'end': 350, 'batt': 600.0, 'dem': 450.0, 'start_soc': 0.15},
                 5: {'start': 90, 'end': 390, 'batt': 600.0, 'dem': 450.0, 'start_soc': 0.15}}
##############################
# Charging curve (max Crate) #
##############################
## Charging curve parameters
C_high=2.0
C_med=1.0
C_low=0.5
SoC_med=0.5
SoC_high=0.8
n1 = 100 # slope exponential functions
# Exponential function: Crate = C_high - C_med/(1 + m.exp(-n1*(SoC-SoC_med))) - C_low/(1 + m.exp(-n1*(SoC-SoC_high)))
###################
# Time parameters #
###################
time_stepsize_min = 1 # minute
time_stepsize_h = time_stepsize_min/60 # hour
start_timestep = 0
end_timestep = 400
m = GEKKO()
# overall time frame
m.time = np.linspace(start_timestep,end_timestep,end_timestep+1)
# variables for optimization (charging power)
P = m.Array(m.Var,len(vehicles_info))
# add initial guess and lower bound for the variables
for i in range(len(P)):
    P[i].value = 0
    P[i].lower = 0
# "block" time intervals outside each vehicle's time frame
for i in range(len(P)):
    for j1 in range(1, vehicles_info[i+1]['start']):
        m.fix(P[i], val=0, pos=j1)
    for j2 in range(vehicles_info[i+1]['end'], end_timestep+1):
        m.fix(P[i], val=0, pos=j2)
# Intermediates
SoC = [m.Intermediate(m.integral(P[i]*time_stepsize_h)/vehicles_info[i+1]['batt']+vehicles_info[i+1]['start_soc']) for i in range(len(P))]
Crate = [m.Intermediate(C_high - C_med/(1 + m.exp(-n1*(SoC[i]-SoC_med))) - C_low/(1 + m.exp(-n1*(SoC[i]-SoC_high)))) for i in range(len(P))]
# fix energy demand at ending time for each vehicle
E_fin = [m.integral(P[i]*time_stepsize_h) for i in range(len(P))]
for i in range(len(P)):
    m.fix(E_fin[i], vehicles_info[i+1]['dem'], pos=vehicles_info[i+1]['end'])
## Equations
m.Equations(P[i]<=Crate[i]*vehicles_info[i+1]['batt'] for i in range(len(P)))
m.Minimize(np.sum(P,axis=0)**2)
m.options.IMODE = 6
m.solve(disp=True)
And some result plots:
from matplotlib import pyplot as plt
fig, ax = plt.subplots(3,1,figsize=(10,15))
# plot power, soc and crate curves
for i in range(len(P)):
    ax[0].plot(m.time, P[i])
    ax[1].plot(m.time, SoC[i])
    ax[2].plot(m.time, Crate[i])
ax[0].set_title('Power curves')
ax[1].set_title('SoC curves')
ax[2].set_title('Crate curve')
The degrees of freedom are calculated at the beginning of the problem, before the solver computes a solution and knows which constraints are active. As long as this constraint:
m.Equations(P[i]<=Crate[i]*vehicles_info[i+1]['batt'] for i in range(len(P)))
is not at the boundary (P[i]==Crate[i]*vehicles_...) for every i and time point, the actual degrees of freedom may be positive. If the problem becomes infeasible due to too few degrees of freedom, then an alternative form is to use a slack variable that minimizes the infeasibility.
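A minimal sketch of that slack-variable reformulation, reusing the names from the question (the penalty weight 1e3 is an arbitrary choice):
# non-negative slack lets the power limit be exceeded, but only at a heavy cost
slack = m.Array(m.Var, len(P), value=0, lb=0)
m.Equations([P[i] <= Crate[i]*vehicles_info[i+1]['batt'] + slack[i] for i in range(len(P))])
m.Minimize(1e3 * sum(slack[i] for i in range(len(P))))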

Random sampling from a data frame in PySpark

My data set has 73 billion rows. I want to apply a classification algorithm to it. I need a sample of the original data so that I can test my model.
I want to do a train-test split.
The dataframe looks like this:
id age gender salary bonus area churn
1 38 m 37654 765 bb 1
2 48 f 3654 365 bb 0
3 33 f 55443 87 uu 0
4 27 m 26354 875 jh 0
5 58 m 87643 354 vb 1
How can I take a random sample using PySpark so that the ratio of my dependent (churn) variable does not change?
Any suggestions?
Spark supports stratified sampling; you will find examples in the linked documentation.
# an RDD of any key value pairs
data = sc.parallelize([(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')])
# specify the exact fraction desired from each key as a dictionary
fractions = {1: 0.1, 2: 0.6, 3: 0.3}
approxSample = data.sampleByKey(False, fractions)
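Since the question works with a DataFrame rather than an RDD, the same idea is also available as DataFrame.sampleBy; a sketch keyed on the churn column from the question (the 10% fractions are an arbitrary choice):
# keep roughly 10% of each churn class, so the 0/1 ratio of the sample stays close to the original
fractions = {0: 0.1, 1: 0.1}
sample_df = df.sampleBy('churn', fractions, seed=42)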
You can also use the TrainValidationSplit
For example:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
# Prepare training and test data.
data = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.9, 0.1], seed=12345)
lr = LinearRegression(maxIter=10)
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine best model using
# the evaluator.
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()
# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)
# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)
# Make predictions on test data. model is the model with combination of parameters
# that performed best.
model.transform(test)\
    .select("features", "label", "prediction")\
    .show()
To see a sample of the original data, we can use sample in Spark:
df.sample(fraction).show()
The fraction should be between 0.0 and 1.0.
example:
# run this command repeatedly, it will show different samples of your original data.
df.sample(0.2).show(10)

Tensorflow - loss starts high and does not decrease

I started writing neural networks with TensorFlow and there is one problem I seem to face in each of my example projects.
My loss always starts at something like 50 or higher and does not decrease, or if it does, it does so slowly that after all my epochs I do not even get near an acceptable loss rate.
Things I already tried (which did not affect the result very much):
tested for overfitting, but in the following example you can see that I have 15000 training and 15000 testing datasets and something like 900 neurons
tested different optimizers and optimizer values
tried increasing the training data by using the test data as training data as well
tried increasing and decreasing the batch size
I created the network based on the knowledge from https://youtu.be/vq2nnJ4g6N0
But let us have a look at one of my test projects:
I have a list of names and wanted to predict the gender, so my raw data looks like this:
names=["Maria","Paul","Emilia",...]
genders=["f","m","f",...]
To feed it into the network, I transform the names into an array of character codes (expecting a max length of 30) and the gender into a bit array:
names=[[77.,97. ,114.,105.,97. ,0. ,0.,...]
[80.,97. ,117.,108.,0. ,0. ,0.,...]
[69.,109.,105.,108.,105.,97.,0.,...]]
genders=[[1.,0.]
[0.,1.]
[1.,0.]]
I built the network with 3 hidden layers of shape [30, 20], [20, 10], [10, 10], and [10, 2] for the output layer. All hidden layers use ReLU as the activation function. The output layer has a softmax.
import tensorflow as tf

# Input Layer
x = tf.placeholder(tf.float32, shape=[None, 30])
y_ = tf.placeholder(tf.float32, shape=[None, 2])
# Hidden Layers
# H1
W1 = tf.Variable(tf.truncated_normal([30, 20], stddev=0.1))
b1 = tf.Variable(tf.zeros([20]))
y1 = tf.nn.relu(tf.matmul(x, W1) + b1)
# H2
W2 = tf.Variable(tf.truncated_normal([20, 10], stddev=0.1))
b2 = tf.Variable(tf.zeros([10]))
y2 = tf.nn.relu(tf.matmul(y1, W2) + b2)
# H3
W3 = tf.Variable(tf.truncated_normal([10, 10], stddev=0.1))
b3 = tf.Variable(tf.zeros([10]))
y3 = tf.nn.relu(tf.matmul(y2, W3) + b3)
# Output Layer
W = tf.Variable(tf.truncated_normal([10, 2], stddev=0.1))
b = tf.Variable(tf.zeros([2]))
y = tf.nn.softmax(tf.matmul(y3, W) + b)
Now the calculation for the loss, accuracy and the training operation:
# Loss
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
# Accuracy
is_correct = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
# Training
train_operation = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
I train the network in batches of 100
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(150):
    bs = 100
    index = i*bs
    inputBatch = inputData[index:index+bs]
    outputBatch = outputData[index:index+bs]
    sess.run(train_operation, feed_dict={x: inputBatch, y_: outputBatch})
    accuracyTrain, lossTrain = sess.run([accuracy, cross_entropy], feed_dict={x: inputBatch, y_: outputBatch})
    if i%(bs/10) == 0:
        print("step %d loss %.2f accuracy %.2f" % (i, lossTrain, accuracyTrain))
And I get the following result:
step 0 loss 68.96 accuracy 0.55
step 10 loss 69.32 accuracy 0.50
step 20 loss 69.31 accuracy 0.50
step 30 loss 69.31 accuracy 0.50
step 40 loss 69.29 accuracy 0.51
step 50 loss 69.90 accuracy 0.53
step 60 loss 68.92 accuracy 0.55
step 70 loss 68.99 accuracy 0.55
step 80 loss 69.49 accuracy 0.49
step 90 loss 69.25 accuracy 0.52
step 100 loss 69.39 accuracy 0.49
step 110 loss 69.32 accuracy 0.47
step 120 loss 67.17 accuracy 0.61
step 130 loss 69.34 accuracy 0.50
step 140 loss 69.33 accuracy 0.47
What am I doing wrong?
Why does it start at ~69 in my project and not lower?
Thank you very much, guys!
There's nothing wrong with 0.69 nats of entropy per sample as a starting point for a binary classification.
If you convert to base 2 (0.69/log(2)), you'll see that it's almost exactly 1 bit per sample, which is exactly what you would expect if you're unsure about a binary classification.
I usually use the mean loss instead of the sum so that things are less sensitive to batch size.
You should also not calculate the cross-entropy directly yourself, because that method breaks easily; you probably want tf.nn.sigmoid_cross_entropy_with_logits.
I also like starting with the Adam optimizer instead of pure gradient descent.
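A rough sketch of those suggestions applied to the code above, using the softmax variant of the built-in cross-entropy since y_ is one-hot over two classes (this replaces the hand-written loss and the plain gradient-descent step):
logits = tf.matmul(y3, W) + b                      # pre-softmax scores from the output layer
cross_entropy = tf.reduce_mean(                    # mean instead of sum: less sensitive to batch size
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
train_operation = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)
# y = tf.nn.softmax(logits) can still be used for the accuracy calculation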
Here are two reasons you might be having some trouble with this problem:
1) Character codes are ordered, but the order doesn't mean anything. Your inputs would be easier for the network to take as input if they were input as one-hot vectors. So your input would be a 26x30 = 780 element vector. Without that the network has to waste a bunch of capacity learning the boundaries between letters.
2) You've only got fully connected layers. This makes it impossible for it to learn a fact independent of it's absolute position in the name. 6 of the top 10 girls names in 2015 ended in 'a', while 0 of the top 10 boys names did. As currently written your network needs to re-learn "Usually it's a girl's name if it ends in 'a'" independently for each name length. Using some convolution layers would allow it to learn facts once across all name lengths.