Predicting values in time series for future periods with windowed dataset - tensorflow

I have data for 3650 time steps, and I want to predict future values, i.e. values after time step 3650. I am new to machine learning and can't figure it out. How can I do this?
For reference,
Colab Notebook

The general approach for adapting tabular (or cross-sectional) regression algorithms to forecasting problems is described here. In short: you train your model on windows of lagged observations. To generate forecasts you have several options, the most common being the recursive strategy: use the last available window to forecast the first value, then update the window with that forecast to predict the next value, and so on.
If you're interested, we're developing a toolbox that extends scikit-learn for exactly these use cases. So with sktime, you could simply write:
import numpy as np
from sktime.datasets import load_airline
from sktime.forecasting.compose import RecursiveTabularRegressionForecaster
from sklearn.ensemble import RandomForestRegressor
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error
y = load_airline() # load 1-dimensional time series
y_train, y_test = temporal_train_test_split(y)
fh = np.arange(1, len(y_test) + 1) # forecasting horizon
regressor = RandomForestRegressor(random_state=3)
forecaster = RecursiveTabularRegressionForecaster(regressor, window_length=10)
forecaster.fit(y_train)
y_pred = forecaster.predict(fh)
print(mean_absolute_percentage_error(y_test, y_pred, symmetric=True))
>>> 0.1440354514063762
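For readers who want to see the recursive strategy itself, here is a minimal sketch without sktime. It assumes a toy sine series standing in for the 3650 observations and a plain least-squares linear model on the lagged window; the helper names (make_windows, fit_linear, recursive_forecast) are made up for illustration and are not part of any library.

```python
import numpy as np

def make_windows(series, window_length):
    # Each row holds `window_length` lagged values; the target is the value that follows.
    X = np.array([series[i:i + window_length]
                  for i in range(len(series) - window_length)])
    y = series[window_length:]
    return X, y

def fit_linear(X, y):
    # Ordinary least squares with an intercept column appended to the lags.
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def recursive_forecast(series, coef, window_length, n_steps):
    # Start from the last observed window and feed each forecast back in.
    window = list(series[-window_length:])
    preds = []
    for _ in range(n_steps):
        next_val = float(np.append(window, 1.0) @ coef)
        preds.append(next_val)
        window = window[1:] + [next_val]  # drop the oldest lag, append the forecast
    return np.array(preds)

# Toy data: 3650 observed steps of a sine wave (an assumption, not the asker's data).
series = np.sin(np.arange(3650) * 0.1)
window_length = 10

X, y = make_windows(series, window_length)
coef = fit_linear(X, y)
forecast = recursive_forecast(series, coef, window_length, n_steps=50)
print(forecast[:5])
```

Any regressor with fit and predict could replace the least-squares step; the loop that slides the window forward is the part that makes the strategy recursive.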

Related

Facebook NeuralProphet - Loading model from pickle for prediction

I have a weekly job that reads data from a CSV file, creates a model with NeuralProphet, and dumps it to a pickle file for later use.
from neuralprophet import NeuralProphet
from matplotlib import pyplot as plt
import pandas as pd
import pickle
data_location = '/input_data/'
df = pd.read_csv(data_location + 'input.csv')
np = NeuralProphet()
model = np.fit(df, freq="5min")
with open('model/neuralprophet_model.pkl', "wb") as f:
    # dump information to that file
    pickle.dump(model, f)
The above Python code runs weekly and dumps the model to a file.
Now, I have a different Python file which loads the pickle file and makes predictions for future dates.
Let's say I have the last 2 years of data in a CSV file and created the model from that. Now I would like to predict the future based on that model.
from neuralprophet import NeuralProphet
import pandas as pd
import pickle
with open('model/neuralprophet_model.pkl', "rb") as f:
    model = pickle.load(f)
# To get a next 1 hour prediction by 5mins interval
future = model.make_future_dataframe(periods=12, freq='5min')
forecast = model.predict(future)
Is this correct? Here, I don't pass the data to make_future_dataframe, but all the examples on the internet pass the data as well. Since the data was already used to train the model, I am just using the model here. Why do we need to pass the data again when we call predict (for some unknown future date) based on the model?
The NeuralProphet model (pickle file) is just a trained neural network. The simplest analogy is a trained linear regression model (from scikit-learn etc.): y = Ax + b, where A and b are the trained parameters. These parameters alone cannot produce y without x. Your model in this example is just A and b. NeuralProphet uses auto-regressive feed-forward neural networks, so there are more terms and they are not all linear.
That's why NeuralProphet needs historic data in addition to the fitted model: the historic data is x. x can come from the same dataset that you used for training A and b, or from a different but statistically similar dataset (you can use d-bar testing and confidence intervals to determine similarity here).
This is how we use models across most supervised learning applications... train on one sample dataset and apply to predict outcomes on similar datasets.
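The linear regression analogy can be made concrete with a small sketch. It uses synthetic data and NumPy least squares rather than NeuralProphet, so every name and number in it is an assumption chosen for illustration: training yields A and b, and a prediction only exists once some x is supplied.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data generated from known parameters (illustrative values).
true_A = np.array([1.5, -2.0, 0.5])
true_b = 4.0
x_train = rng.normal(size=(200, 3))
y_train = x_train @ true_A + true_b

# "Training": estimate A and b by least squares. This is all a saved model contains.
design = np.column_stack([x_train, np.ones(len(x_train))])
params, *_ = np.linalg.lstsq(design, y_train, rcond=None)
A, b = params[:-1], params[-1]

def predict(x_new):
    # A and b alone produce nothing; a prediction needs inputs x.
    return x_new @ A + b

# New inputs from a statistically similar source, as the answer describes.
x_new = rng.normal(size=(5, 3))
y_new = predict(x_new)
print(y_new)
```

In an auto-regressive model the role of x is played by the most recent observations, which is why the forecasting step asks for a history as well as the fitted model.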

Activation function with exponential distributed data

I am a beginner with neural networks. I have a number of targets to predict in a regression model. I have noticed that the model works well with targets that are already normally distributed, but not with exponentially distributed targets. I understand this is the role of the activation function, but I have tried many functions (relu, linear, selu, elu, etc.) and still haven't gotten a good result.
Please check the images below.
[Image: Normally distributed]
[Image: Exponentially distributed]
That sort of makes sense, but riddle me this: are you taking the right approach? You don't need to assume normal distributions to do regression. It is a common misunderstanding that OLS somehow assumes normally distributed data. It does not; it is far more general. OLS regression makes no assumptions about the data; it makes assumptions about the errors, as estimated by the residuals. Also, transforming data to make it fit a model is, in my opinion, the wrong approach. You want your model to fit your problem, not the other way round. That said, there are a few ways to deal with skewed data sets.
1. Normalize Data
2. Standardize Data
Let's see an example.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
##iris.keys()
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
# Normalize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
import seaborn as sns
# load the iris dataset
iris = load_iris()
print(iris.data.shape)
# separate the data from the target attributes
X = df[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']]
y = df['species']
sns.displot(X)
# normalize the data attributes
normalized_X = preprocessing.normalize(X)
sns.displot(normalized_X)
# Standardize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
# load the Iris dataset
iris = load_iris()
print(iris.data.shape)
# separate the data and target attributes
# reuse X and y from the normalization example above
sns.displot(X)
# standardize the data attributes
standardized_X = preprocessing.scale(X)
sns.displot(standardized_X)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[iris.feature_names], iris.target, test_size=0.5, stratify=iris.target, random_state=123456)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')
Result:
Out-of-bag score estimate: 0.96
Mean accuracy score: 0.933
See the link below for specific info on these concepts.
https://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/
Again, though, maybe this is not the right approach. For one thing, you can try applying a different model to your specific data set. Support Vector Machine algorithms care only about the boundaries of the separating hyperplane and do not assume an exact shape for the distributions. One of my favorite models is in the Decision Tree family: specifically, the Random Forest model.
Also, see this link,
https://www.blopig.com/blog/2017/07/using-random-forests-in-python-with-scikit-learn/
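To illustrate the earlier point that OLS makes assumptions about the errors rather than about the data, here is a small sketch on synthetic data. The exponential feature, the coefficients 2 and 1, and the noise level are all assumptions chosen for the demonstration. Both the feature and the target are heavily skewed, yet a plain least-squares line recovers the relationship and leaves roughly symmetric residuals.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Exponentially distributed feature, linear signal, normally distributed errors.
x = rng.exponential(scale=1.0, size=n)
noise = rng.normal(loc=0.0, scale=0.5, size=n)
y = 2.0 * x + 1.0 + noise  # the target inherits the skew of x

# Ordinary least squares fit of a straight line (degree-1 polynomial).
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

def skewness(values):
    # Sample skewness: third central moment divided by variance ** 1.5.
    centered = values - values.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"skewness of target:    {skewness(y):.2f}")
print(f"skewness of residuals: {skewness(residuals):.2f}")
```

If the residuals themselves came out strongly skewed, that would be the signal to revisit the model, not the shape of the raw target.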

Classify intent of random utterance of chat bot from training data and give different graphical visualization using random forest?

I am creating an NLP model to detect the intent of a given utterance. For training I use an Excel file with 2 columns, as shown below:
Utterence                               Intent
hi can I have an Apple Watch            service
how much I will be paying monthly       service
you still around                        YOU_THERE
are you still there                     YOU_THERE
you there                               YOU_THERE
Speak to me if you are there.           YOU_THERE
you around                              YOU_THERE
There are around 3000 utterances in the training file and many intents.
I trained my model using scikit-learn, and my code looks like this.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import re
def preprocessing(userQuery):
    # replace every character that is not a letter or digit with a space
    letters_only = re.sub("[^a-zA-Z\\d]", " ", userQuery)
    words = letters_only.lower().split()
    return " ".join(words)
#read utterance data from a xlsx file
train = pd.read_excel('training.xlsx')
query_features = train['Utterence']
#create tfidf
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
new_query = [preprocessing(query) for query in query_features]
features = tfidf_vectorizer.fit_transform(new_query).toarray()
#create random forest classification model
model = RandomForestClassifier()
model.fit(features, train['Intent'])
#intent prediction on user query
userQuery = "I want apple watch"
userQueryList=[]
userQueryList.append(preprocessing(userQuery))
utfidf = tfidf_vectorizer.transform(userQueryList)
print(" prediction: ", model.predict(utfidf))
One problem for me is this: when I run it on the utterance "I want apple watch", the predicted intent is YOU_THERE instead of service, as shown below (compare with the training snapshot above):
C:\U\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
prediction: ['YOU_THERE']
Please help me: how should I train my model, what changes should I make to fix such issues, and how can I check accuracy? I would also like to see a graphical visualization and a ROC curve; how can that be achieved using random forest? I am not well versed in NLP, so any help would be appreciated.
You are using a bag-of-words approach, which does not perform well on sequence data.
For your problem, word order is material to classification.
I would suggest using an LSTM, which performs better on sequence data.
Let's address your first issue:
how should i train my model and what changes should I make to fix such issues
Below I'm using a word2vec approach. Rather than just converting the utterances to vectors with TF-IDF (which loses the semantic information contained in the sentence), it maintains the semantic info.
To understand more about word2vec, refer to this blog:
https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
Below is the code for predicting the intent using the word2vec approach. (Note: it's the same as your code, but instead of TfidfVectorizer I'm using word2vec to obtain the vectors. The code is also divided into functions whose names indicate what they do.)
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
def preprocess_lower(token):
    # utility for preprocessing
    return token.lower()
def load_data(file_name):
    # load a csv file in memory
    return pd.read_csv(file_name)
def process_training_data(training_data):
    # process the training data and split it between independent and dependent variables
    training_sentences = [list(map(preprocess_lower, sentence.split(" "))) for sentence in list(training_data.Utterence.values)]
    target_class = training_data.Intent.values
    label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(target_class))
    return target_class, training_sentences, label_encoded_Y
def process_user_query(user_queries):
    # tokenize and lowercase the user queries the same way as the training sentences
    return [list(map(preprocess_lower, sentence.split(" "))) for sentence in user_queries]
def train_word2vec_model(train_sentences_list):
    # training word2vec on sentences list (inputted by user)
    # gensim >= 4 names this parameter vector_size; older versions call it size
    model = Word2Vec(train_sentences_list, vector_size=100, window=4, min_count=1, workers=4)
    return model
def convert_training_data_vectors(model, train_sentences_list):
    # get the average vector of each sentence
    training_sentences_vector = list()
    for sentence in train_sentences_list:
        # gensim >= 4 exposes the vocabulary as wv.key_to_index; older versions use wv.vocab
        sentence_vector = [list(model.wv[token]) for token in sentence if token in model.wv.key_to_index]
        training_sentences_vector.append(list(np.mean(sentence_vector, axis=0)))
    return training_sentences_vector
def training_rf_prediction_model(training_data_vectors, label_encoded_Y):
    # training model on user inputted data
    rf_model = RandomForestClassifier()
    # here use the split function and divide the data into training and testing
    x_train, x_test, y_train, y_test = train_test_split(training_data_vectors, label_encoded_Y,
                                                        train_size=0.8, test_size=0.2)
    rf_model.fit(x_train, y_train)
    y_pred = rf_model.predict(x_test)
    print(accuracy_score(y_test, y_pred))
    return rf_model
def training_svm_prediction_model(training_data_vectors, label_encoded_Y):
    svm_model = SVC(gamma='auto')
    svm_model.fit(training_data_vectors, label_encoded_Y)
    return svm_model
def process_data_flow(file_name):
    training_data = load_data(file_name)
    target_class, training_sentences, label_encoded_Y = process_training_data(training_data)
    word2vec_model = train_word2vec_model(train_sentences_list=training_sentences)
    training_data_vectors = convert_training_data_vectors(word2vec_model, train_sentences_list=training_sentences)
    prediction_model = training_rf_prediction_model(training_data_vectors, label_encoded_Y)
    # intent prediction on user query
    userQuery = ["i want apple watch"]
    user_query_vectors = convert_training_data_vectors(word2vec_model, process_user_query(userQuery))
    predicted_class = prediction_model.predict(user_query_vectors)[0]
    # map the encoded class back to its original intent label
    predicted_intent = target_class[list(label_encoded_Y).index(predicted_class)]
    return predicted_intent
print("Predicted class: ", process_data_flow("sample_intent_data.csv"))
The sample data file is in CSV format; you just need to format and paste the data as shown below:
#sample_intent_data.csv
Utterence,Intent
hi can I have an Apple Watch,service
how much I will be paying monthly,service
you still around,YOU_THERE
are you still there,YOU_THERE
you there,YOU_THERE
Speak to me if you are there,YOU_THERE
you around,YOU_THERE
Also note that your training data should contain a good number of training utterances for each intent for this approach to work.
For accuracy, you can use the approach below:
Divide the data into training and testing sets (specify the split ratio):
x_train, x_test, y_train, y_test = train_test_split(training_vectors, label_encoded_Y,
                                                    train_size=0.8,
                                                    test_size=0.2)
And after training the model, use predict function on x_test to get the predictions. Now just match the prediction for testing data from the model and actual from the data set and you will be able to easily determine the accuracy.
Edit: Added the accuracy score calculation while predicting.
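Matching the predictions against the actual labels, as described above, can be sketched in a few lines of plain Python. The two label lists are hypothetical stand-ins for y_test and the output of predict on x_test, and the helper names are made up for illustration; in practice sklearn.metrics.accuracy_score, already used in the code above, does the same job.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Fraction of test utterances whose predicted intent equals the actual intent.
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    # Count (actual, predicted) pairs to see which intents get confused.
    return Counter(zip(y_true, y_pred))

# Hypothetical held-out labels and model predictions.
y_test = ["service", "service", "YOU_THERE", "YOU_THERE", "YOU_THERE"]
y_pred = ["service", "YOU_THERE", "YOU_THERE", "YOU_THERE", "YOU_THERE"]

acc = accuracy(y_test, y_pred)
confusion = confusion_counts(y_test, y_pred)
print(f"accuracy: {acc:.2f}")
for (actual, predicted), count in sorted(confusion.items()):
    print(f"actual={actual:<10} predicted={predicted:<10} count={count}")
```

The pair counts are worth a look alongside the single accuracy number, because they show directly when "service" utterances are being labelled YOU_THERE.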

How to read the top of a Queue multiple times before dequeueing in tensorflow

In the following example, every time I run sess.run([image, label]), a different sample from the queue is returned, and thus a different np_image is returned.
Is there a way to let slim.queues.QueueRunners know that I want to use (run) the same sample multiple times before a dequeue operation takes place?
The reason I ask is that I have a large op that doesn't fit in my VRAM. I have to break the large op into several small ops and pass a different feed_dict every time a small op is run. However, when I run a small op, image changes, which breaks the code. Putting all the small ops in a list and running the list in one call doesn't work for me, because VRAM size is the limitation.
Thanks!
import tensorflow as tf
import numpy as np
slim = tf.contrib.slim
from datasets import dataset_utils
from tensorflow.python.ops import control_flow_ops
from datasets import dataset_factory
from deployment import model_deploy
from nets import nets_factory
from preprocessing import preprocessing_factory
with tf.Graph().as_default():
    dataset = dataset_factory.get_dataset('cifar10', 'train', '/home/user/dataset/cifar10')
    provider = slim.dataset_data_provider.DatasetDataProvider(
        dataset,
        num_readers=1,
        common_queue_capacity=256,
        common_queue_min=128)
    [image, label] = provider.get(['image', 'label'])
    image_preprocessing_fn = preprocessing_factory.get_preprocessing(
        'cifarnet',
        is_training=True)
    images, labels = tf.train.batch([image, label],
                                    batch_size=32,
                                    num_threads=1,
                                    capacity=64)
    with tf.Session() as sess:
        with slim.queues.QueueRunners(sess):
            for i in range(3):
                # in every iteration, the tensor 'image' will be different
                # the np_image value will be different as well
                np_image, np_label = sess.run([image, label])
A peek operation for queues is currently not supported; for discussion see
https://github.com/tensorflow/tensorflow/issues/7880
A workaround is to restructure your code to take values from tf.Variable objects rather than directly from the dequeue op, i.e. something like this:
x = tf.Variable(queue.dequeue())
y = x+2
sess.run(x.initializer)
sess.run(y)
sess.run(y) # same value
sess.run(x.initializer)
sess.run(y) # new value
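The idea behind this workaround, dequeue once into a holder and read the holder as often as needed, can be shown outside TensorFlow. The sketch below is a plain-Python analogy using the standard library queue module; the CachedHead class is a made-up name for illustration, not a TensorFlow or slim API.

```python
import queue

class CachedHead:
    """Holds one dequeued item so it can be read repeatedly before moving on."""

    def __init__(self, source):
        self._source = source
        self._value = None
        self._loaded = False

    def refresh(self):
        # Plays the role of sess.run(x.initializer): pull the next item from the queue.
        self._value = self._source.get_nowait()
        self._loaded = True
        return self._value

    def read(self):
        # Plays the role of reading the variable: no dequeue happens here.
        if not self._loaded:
            raise RuntimeError("call refresh() before read()")
        return self._value

q = queue.Queue()
for item in (10, 20, 30):
    q.put(item)

head = CachedHead(q)
head.refresh()
first = head.read() + 2   # computed from the first item
second = head.read() + 2  # same item again
head.refresh()
third = head.read() + 2   # computed from the next item
print(first, second, third)
```

Each of the small ops would read from the holder, and the holder is refreshed only after all of them have run on the current sample.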

How do I plot a non-linear model using matplotlib?

I'm a bit lost as to how to proceed to achieve this. Normally with a linear model, when I perform linear regression, I simply take my training data (x) and my output data (y) and plot them using matplotlib. Now I have 3 features along with my output/observation (y). Can anyone guide me on how to graph this kind of model using matplotlib? My goal is to fit a polynomial model and graph the polynomial using matplotlib.
%matplotlib inline
import sframe as frame
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
# Initialize SFrame
sales = frame.SFrame('kc_house_data.gl/')
# Separate data into test and training data
train_data,test_data = sales.random_split(.8,seed=0)
# Organize data into training and testing data
train_x = train_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
train_y = train_data[['price']].to_dataframe().values
test_x = test_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
test_y = test_data[['price']].to_dataframe().values
# Create a model using sklearn with multiple features
regr = linear_model.LinearRegression(fit_intercept=True, n_jobs=2)
# fit the model on the training data before predicting
regr.fit(train_x, train_y)
# test predictions
regr.predict(train_x)
# Prepare to plot the data
Note:
The train_x variable contains my 3 features, and my train_y contains the output data. I use SFrame to contain the data. SFrame has the ability to convert itself into a dataframe (used in Pandas). Using the conversion I am able to grab the values.
Rather than plotting a non-linear model with multiple discrete features at once, I have found that simply observing each and every feature against my observation/output was better and easier for my research.
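Plotting each feature against the output, as described above, might look like the sketch below. It uses synthetic stand-ins for sqft_living, bedrooms, bathrooms and price (the real kc_house_data values are not reproduced here), fits a separate degree-2 polynomial per feature with np.polyfit, and draws one panel per feature. The degree and the data-generating formula are assumptions for illustration only.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the three features and the observed price.
sqft_living = rng.uniform(500, 4000, n)
bedrooms = rng.integers(1, 6, n).astype(float)
bathrooms = rng.integers(1, 4, n).astype(float)
price = (50_000 + 0.02 * sqft_living ** 2 + 10_000 * bedrooms
         + 5_000 * bathrooms + rng.normal(0, 20_000, n))

features = {"sqft_living": sqft_living, "bedrooms": bedrooms, "bathrooms": bathrooms}
degree = 2  # polynomial degree, chosen arbitrarily for the sketch

fig, axes = plt.subplots(1, len(features), figsize=(15, 4), sharey=True)
fitted = {}
for ax, (name, values) in zip(axes, features.items()):
    # Fit price as a polynomial in this single feature.
    coeffs = np.polyfit(values, price, deg=degree)
    # Evaluate on an evenly spaced grid so the curve is drawn smoothly.
    grid = np.linspace(values.min(), values.max(), 100)
    fitted[name] = np.polyval(coeffs, grid)
    ax.scatter(values, price, s=8, alpha=0.5)
    ax.plot(grid, fitted[name], color="red")
    ax.set_xlabel(name)
axes[0].set_ylabel("price")
fig.tight_layout()
```

Each panel shows a one-feature fit, so the curves describe marginal relationships and are not slices of a single model fitted on all three features together.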