Looking at a solved problem in which the goal is to predict stock prices, I found that only 1 epoch is used to train the model. The data consists of a little fewer than 1500 points, each corresponding to a daily closing price, so we have a dataset of dates (days) and prices.
Using an LSTM approach, the X_train training set is generated as follows:
Original dataset:
Date Price
1-1-2010 100
2-1-2010 80
3-1-2010 50
4-1-2010 40
5-1-2010 70
...
30-10-2012 130
...
X_train:
[[100, 80, 50, 40, 70, ...],
[80, 50, 40, 70, 90, ...],
[50, 40, 70, 90, 95, ...],
...
[..., 78, 85, 72, 60, 105],
[..., 85, 72, 60, 105, 130]]
Each training window is 60 points long and is shifted by one day every time, until a fraction of the total dataframe (the training set) is covered. Please don't consider things like normalization, etc.; this is just an example.
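For concreteness, here is a minimal sketch of that windowing (the function name and toy data are mine, not from the original problem):

import numpy as np

def make_windows(prices, window=60):
    # Each row holds `window` consecutive closing prices, shifted by
    # one day relative to the previous row.
    return np.array([prices[i:i + window]
                     for i in range(len(prices) - window + 1)])

# Toy data from the table above, with a window of 5 instead of 60:
prices = np.array([100, 80, 50, 40, 70, 90, 95])
X_train = make_windows(prices, window=5)
# X_train -> [[100 80 50 40 70], [80 50 40 70 90], [50 40 70 90 95]]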
The thing is that in the training part of the problem the epochs are set to 1. This is the first time I have seen the approach of training a model with just a single pass through the data, and I've searched for it to no avail.
Does anyone know what this technique is called (if it has a name) so I can search more about it?
I'm working on my first TensorFlow model, and when I trained it on my dataset, my accuracy dropped to 25% from around 60% with scikit-learn. A friend told me it might have to do with some of the data, for example "781C376B-E380-C052-448B-B4AB6F3D". How do I deal with symbols (dashes here), numbers, and letters in my data when running my models?
Currently I am looking into text vectorization so it can read my data more easily.
You can use tf.strings.unicode_decode(), which converts an encoded string scalar to a vector of code points. It provides a unique number for each character in the string.
For example:
import tensorflow as tf

# A batch of Unicode strings, each represented as a UTF-8-encoded string.
batch_utf8 = [s.encode('UTF-8') for s in
              [u'781C376B-E380-C052-448B-B4AB6F3D']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,
                                               input_encoding='UTF-8')
for sentence_chars in batch_chars_ragged.to_list():
    print(sentence_chars)
Output: [55, 56, 49, 67, 51, 55, 54, 66, 45, 69, 51, 56, 48, 45, 67, 48, 53, 50, 45, 52, 52, 56, 66, 45, 66, 52, 65, 66, 54, 70, 51, 68]
For more details please refer to this document. Thank you.
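If you then want to feed these variable-length code-point sequences into a model, one option (a follow-on sketch, not part of the answer above) is to pad the ragged tensor into a dense one:

# Pad the ragged code-point batch into a dense tensor; 0 marks padding.
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=0)
print(batch_chars_padded.shape)  # (batch_size, longest_sequence_length)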
I have made this Random Forest model to predict whether a stock market day will be an up day or a down day.
My goal is to get a 1 for up days and a 0 for down days by passing in a date-time like this:
2020-05-12 00:00:00-04:00
I was thinking that it would work with this line of code, but obviously I'm not understanding something, since it does not work:
rf_random.predict(2020-05-12 00:00:00-04:00)
Here is my dataframe:
time close high low open volume c_in_p down_days up_days RSI
2016-06-27 00:00:00-04:00 57.61 58.76 57.05 58.76 31954614 -1.97 1.97 0.00 19.832891
2016-06-28 00:00:00-04:00 59.50 59.55 58.26 59.19 24884353 1.89 0.00 1.89 35.990316
2016-06-29 00:00:00-04:00 61.20 61.21 60.00 60.33 18107419 1.70 0.00 1.70 47.063366
Here is the code for my model:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# New Random Forest classifier to house the optimal parameters
rf = RandomForestClassifier()
# Specify the details of our randomized search
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 5, verbose=5, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
Fitting 5 folds for each of 100 candidates, totalling 500 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 25.4s
[Parallel(n_jobs=-1)]: Done 68 tasks | elapsed: 1.5min
[Parallel(n_jobs=-1)]: Done 158 tasks | elapsed: 3.7min
[Parallel(n_jobs=-1)]: Done 284 tasks | elapsed: 8.2min
[Parallel(n_jobs=-1)]: Done 446 tasks | elapsed: 12.6min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed: 14.3min finished
RandomizedSearchCV(cv=5, error_score=nan,
estimator=RandomForestClassifier(bootstrap=True,
ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100,
n_jobs...
param_distributions={'bootstrap': [True, False],
'max_depth': [10, 20, 30, 40, 50, 60,
70, 80, 90, 100, None],
'max_features': ['auto', 'sqrt', None,
'log2'],
'min_samples_leaf': [1, 2, 7, 12, 14,
16, 20],
'min_samples_split': [2, 5, 10, 20, 30,
40],
'n_estimators': [200, 400, 600, 800,
1000, 1200, 1400, 1600,
1800]},
pre_dispatch='2*n_jobs', random_state=42, refit=True,
return_train_score=False, scoring=None, verbose=5)
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

'''
ACCURACY
'''
# Once the predictions have been made, grab the accuracy score.
y_pred = rf_random.predict(X_test)
print('Correct Prediction (%): ', accuracy_score(y_test, y_pred, normalize = True) * 100.0)
'''
CLASSIFICATION REPORT
'''
# Define the target names
target_names = ['Down Day', 'Up Day']
# Build a classification report
report = classification_report(y_true = y_test, y_pred = y_pred, target_names = target_names, output_dict = True)
# Add it to a data frame and transpose it for readability.
report_df = pd.DataFrame(report).transpose()
display(report_df)
print('\n')
'''
FEATURE IMPORTANCE
'''
# Calculate feature importances and store them in a pandas Series.
# (rand_frst_clf was undefined in the original; use the fitted best estimator.)
feature_imp = pd.Series(rf_random.best_estimator_.feature_importances_, index=X_Cols.columns).sort_values(ascending=False)
display(feature_imp)
Correct Prediction (%): 66.80327868852459
precision recall f1-score support
Down Day 0.623932 0.629310 0.626609 116.000000
Up Day 0.661417 0.656250 0.658824 128.000000
accuracy 0.643443 0.643443 0.643443 0.643443
macro avg 0.642674 0.642780 0.642716 244.000000
weighted avg 0.643596 0.643443 0.643509 244.000000
MACD 0.213449
k_percent 0.183975
r_percent 0.181395
Price_Rate_Of_Change 0.156800
RSI 0.150577
On Balance Volume 0.113804
dtype: float64
rf_random.best_estimator_
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=20, max_features=None,
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=12, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=800,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
Here is where I would ask for my prediction, but it doesn't work:
rf_random.predict(2020-05-12 00:00:00-04:00)
File "<ipython-input-51-788cba99b288>", line 1
rf_random.predict(2020-05-12 00:00:00-04:00)
^
SyntaxError: invalid token
So why wouldn't rf_random.predict("2020-05-12 00:00:00-04:00") work? Because the model doesn't actually learn from the date. Your model needs values for close, high, low, open, volume, c_in_p, down_days, up_days, and RSI to make a prediction, and they have to be in the same format as X_train, as in the sketch below.
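For example, a working call would select the feature row for that date rather than passing the date itself (a sketch; `df` and the column lookup are my assumptions):

# Assumes df is the feature dataframe indexed by timestamp and X_train is
# the exact feature matrix (same columns, same order) the model was fitted
# on; the timestamp is only used as a row label, never as a feature.
features = df.loc[["2020-05-12 00:00:00-04:00"], X_train.columns]
print(rf_random.predict(features))  # array([1]) = up day, array([0]) = down day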
I know it's always easy to be negative, but just a couple of big-picture issues for the record:
- You're using the future to predict the past: random CV partitioning mixes time periods, but a useful model must be evaluated on how it performs in the future (see the TimeSeriesSplit sketch below).
- The dataset is an equally spaced series, so you could use time-series approaches and calendar-based events to improve predictions. This also addresses the issue that you would otherwise need to know tomorrow's close/high/low/volume to predict whether tomorrow goes up or down.
- You're predicting something with a lot of hard-to-explain volatility, which makes it a poor use case for any machine learning model to learn from. If I had a model that was truly even 60% accurate at predicting whether the market goes up or down tomorrow, I'd be a billionaire!
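For the first point, a minimal sketch (not the asker's code) of keeping every validation fold after its training fold in time, using scikit-learn's TimeSeriesSplit:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Each split's validation indices come strictly after its training indices,
# so the search never scores a model on data older than what it trained on.
tscv = TimeSeriesSplit(n_splits=5)
rf_random = RandomizedSearchCV(estimator=RandomForestClassifier(),
                               param_distributions=random_grid,
                               n_iter=100, cv=tscv,
                               random_state=42, n_jobs=-1)
rf_random.fit(X_train, y_train)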
I am trying to embed the positional information 'index' into some vector and use it in Keras, for instance:
inputs = Input(shape=(23,))
where 23 represents the number of features. I want to embed the positions of the features into a one-dimensional vector, from position 0 to position 22.
But I don't know how to get the position index of the features (I want something like an 'enumerate' function for a Keras layer), so I added:
pos = K.constant([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]])
embedding_pos = Embedding(23, 1)(pos)
And I actually want this embedding_pos to be multiplied with inputs, with the rest of the algorithm carrying on from there.
But I get this message
AttributeError: 'NoneType' object has no attribute '_inbound_nodes'
If I get rid of that embedding layer and multiply layer, the algorithm works fine. How am I supposed to get the embedding vectors using the position index of the features of inputs?
======
Adding more information: I moved the layers around to inspect model.summary(), and it seems embedding_pos has shape [None, 1], which is missing the batch size.
I don't think it is good to use a constant here. I'd like to know if there is some kind of 'enumerate' function for a Keras layer.
======
By request, an example input looks like this:
batch_size x number_of_features = 1 x 10
[[1.0, 4719.0, 0.0001, 472818.44, 958, 6402818., 1.828, 24.321, 55.0, 127.44]]
and so on...
I want to get the indices of the features,
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
to use as the input to the Embedding.
But if I build them with a constant, it doesn't know the batch size.
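One common workaround (a sketch under my own assumptions, not a tested fix for this exact code) is to build the index tensor inside the graph with a Lambda layer, tiled to the dynamic batch size, instead of using a raw constant:

import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Lambda, Multiply

inputs = Input(shape=(23,))
# Build [[0, 1, ..., 22]] inside the graph and tile it to the dynamic batch
# size, so Keras tracks the op (avoiding the '_inbound_nodes' error).
pos = Lambda(lambda x: tf.tile(tf.expand_dims(tf.range(tf.shape(x)[1]), 0),
                               [tf.shape(x)[0], 1]))(inputs)
embedding_pos = Embedding(23, 1)(pos)                                    # (batch, 23, 1)
embedding_pos = Lambda(lambda e: tf.squeeze(e, axis=-1))(embedding_pos)  # (batch, 23)
outputs = Multiply()([inputs, embedding_pos])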
I have data that I just need to transpose. It seems simple, but I can't make heads or tails of the transpose function.
The data looks like this:
name, requirement_1, requirement_2, requirement_3
label, 1.1, 1.2, 1.3
threshold, 10, 20, 30
objective, 100, 200, 300
floor, 0, .5, .5
I need:
name, label, threshold, objective, floor
requirement_1, 1.1, 10, 100, 0
requirement_2, 1.2, 20, 200, 0.5
requirement_3, 1.3, 30, 300, 0.5
In Power Query this is simply clicking the transpose button.
Thanks
This is a bit more complicated in OpenRefine, since you have to perform two operations: transpose cells across columns into rows, then columnize by key value.
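If you can drop into Python instead of OpenRefine, this is nearly a one-liner with pandas (a swapped-in technique, not OpenRefine; "data.csv" is a hypothetical filename):

import pandas as pd

df = pd.read_csv("data.csv")  # columns: name, requirement_1, requirement_2, ...
# Transpose with the 'name' column becoming the new header row.
out = df.set_index("name").T.rename_axis("name").reset_index()
out.to_csv("transposed.csv", index=False)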
I'm surprised how few posts relate to this problem. Anyway, here it is:
I have CSV data files containing X values in the first column and several Y-value columns thereafter. But for a given X value, not all Y series have a corresponding value. Here is an example:
0, 16, 96, 99
10, 88, 45, 85
20, 85, 61, 10
30, 30, --, 45
40, 82, 28, 82
50, 23, 9, 61
60, 40, 77, 0
70, 26, 21, --
80, --, 58, 99
90, 1, 14, 30
When this CSV data is loaded with numpy.genfromtxt, the '--' strings are read as nan, which is good. But when plotting, the lines are interrupted with gaps wherever there is a nan. Is there an option to make pyplot.plot() ignore both the nan and the corresponding X value?
Not sure if matplotlib has such functionality built in, but you could home-brew it by doing the following:
import numpy
from matplotlib import pyplot

idx = ~numpy.isnan(Y)        # True wherever Y holds a real value
pyplot.plot(X[idx], Y[idx])  # plot only the finite points
Look at this post.
As proposed in my answer there, I'd recommend using np.isfinite instead of np.isnan; there might be other reasons for your plot to have discontinuities, e.g., inf.
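Concretely, the np.isfinite variant of the masking above (a sketch; X and Y being the arrays read by numpy.genfromtxt):

import numpy as np
from matplotlib import pyplot as plt

mask = np.isfinite(Y)  # False for both nan and +/-inf
plt.plot(X[mask], Y[mask])
plt.show()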