Facebook NeuralProphet - Loading model from pickle for prediction - pandas

I have a weekly job which reads data from a CSV file, creates a model based on NeuralProphet, and dumps a pickle file for later use.
from neuralprophet import NeuralProphet
from matplotlib import pyplot as plt
import pandas as pd
import pickle
data_location = '/input_data/'
df = pd.read_csv(data_location + 'input.csv')
np = NeuralProphet()
model = np.fit(df, freq="5min")
with open('model/neuralprophet_model.pkl', "wb") as f:
    # dump information to that file
    pickle.dump(model, f)
The above Python code runs on a weekly basis and dumps the model to a pickle file.
Now, I have a different Python script which loads the pickle file and does the prediction for a future date.
Let's say I have the last 2 years of data in a CSV file and created a model from that. Now, I would like to predict the future based on the above model.
from neuralprophet import NeuralProphet
import pandas as pd
import pickle
with open('model/neuralprophet_model.pkl', "rb") as f:
    model = pickle.load(f)
# To get the next 1 hour of predictions at 5-min intervals
future = model.make_future_dataframe(periods=12, freq='5min')
forecast = model.predict(future)
Is this correct? Here, I don't pass the data to make_future_dataframe. But all the internet examples pass the data as well. Since the data was used to train the model, I am just using the model here. Why do we need to pass the data here as well when we predict (for some unknown future date) based on the model?

The NeuralProphet model (pickle file) is just a trained neural network... the simplest analogy would be a trained linear regression model (from scikit-learn etc.)... y = Ax + b, where you have trained the A and b vectors. These vectors alone cannot produce y without x. Your model in this example is just the A and b vectors. Now, NeuralProphet uses auto-regressive feed-forward neural networks, so there are more vector terms and they are not all linear.
That's why NeuralProphet requires historic data when you build the future dataframe... the historic data is x. x can be from the same dataset that you used for training A and b, or x can be from a different but statistically similar dataset (you can use d-bar testing and confidence intervals to determine similarity here).
This is how we use models across most supervised learning applications... train on one sample dataset and apply to predict outcomes on similar datasets.
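To make that concrete, here is a minimal sketch of the two scripts with the history passed at prediction time. Hedged: in recent NeuralProphet versions fit() returns training metrics rather than the model, so the NeuralProphet object itself is what gets pickled, and make_future_dataframe() takes the recent history as its df argument; check the API of your installed version.
from neuralprophet import NeuralProphet
import pandas as pd
import pickle

# --- weekly training job ---
df = pd.read_csv('/input_data/input.csv')        # columns: ds, y
m = NeuralProphet()
metrics = m.fit(df, freq="5min")                 # fit() returns metrics, not the model
with open('model/neuralprophet_model.pkl', "wb") as f:
    pickle.dump(m, f)                            # pickle the fitted NeuralProphet object

# --- prediction script ---
with open('model/neuralprophet_model.pkl', "rb") as f:
    m = pickle.load(f)
recent_history = pd.read_csv('/input_data/input.csv')   # this is the x the answer talks about
future = m.make_future_dataframe(recent_history, periods=12)  # next hour in 5-min steps
forecast = m.predict(future)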

Related

Activation function with exponentially distributed data

I am a beginner with neural networks. I have a bunch of targets to predict in a regression model. What I have noticed is that the model works perfectly with targets that are already normally distributed, but it does not work well with exponentially distributed targets. I understand this relates to the activation function, but I have tried many functions (relu, linear, selu, elu, etc.) and still didn't get a great result.
Please check the images below.
[Image: normally distributed targets]
[Image: exponentially distributed targets]
That sort of makes sense, but riddle me this: are you taking the right approach? You don't need to assume normal distributions to do regression. It is a common misunderstanding that OLS somehow assumes normally distributed data. It does not; it is far more general. OLS regression makes no assumptions about the data, it makes assumptions about the errors, as estimated by residuals. Also, transforming data to make it fit a model is, in my opinion, the wrong approach. You want your model to fit your problem, not the other way round. There are a few ways to deal with skewed data sets:
1. Normalize Data
2. Standardize Data
Let's see an example.
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import preprocessing
from sklearn.datasets import load_iris

# load the iris dataset into a DataFrame
iris = load_iris()
print(iris.data.shape)
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# separate the data from the target attribute
X = df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = df['species']
sns.displot(X)

# Normalize the data attributes for the Iris dataset.
normalized_X = preprocessing.normalize(X)
sns.displot(normalized_X)

# Standardize the data attributes for the Iris dataset.
standardized_X = preprocessing.scale(X)
sns.displot(standardized_X)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# load iris into a DataFrame and split into train/test sets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    df[iris.feature_names], iris.target, test_size=0.5, stratify=iris.target, random_state=123456)

# fit a random forest and evaluate it
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)
predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')
Result:
Out-of-bag score estimate: 0.96
Mean accuracy score: 0.933
See the link below for specific info on these concepts.
https://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/
Again, though, maybe this is not the right approach. For one thing, you can try to apply a different model to your specific data set. Support Vector Machine algorithms just care about the boundaries of the separating hyperplane and do not assume the exact shape of the distributions. One of my favorite models is in the Decision Tree family, specifically the Random Forest model.
Also, see this link,
https://www.blopig.com/blog/2017/07/using-random-forests-in-python-with-scikit-learn/
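To illustrate the SVM option mentioned above, here is a minimal sketch on the same iris split; the hyperparameters are illustrative defaults, not tuned values.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# same iris split as the random forest example above
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, stratify=iris.target, random_state=123456)

# boundary-based classifier, makes no distributional assumption about the features
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
print(f'SVM accuracy: {accuracy_score(y_test, svm.predict(X_test)):.3}')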

Predicting values in time series for future periods with windowed dataset

I have data up to 3650 time steps, but I want to make future predictions, i.e. for the time steps after 3650. I am new to machine learning and apparently can't figure it out. How can I do it?
For reference,
Colab Notebook
The general approach for adapting tabular (or cross-sectional) regression algorithms to forecasting problems is described here. In short, you train your model on windows of lagged observations. To generate forecasts, you have different options, with the recursive strategy most commonly used: you use the last available window to forecast the first value, then update the window with that forecasted value to forecast the next value, and so on; a hand-rolled sketch of that loop is shown below.
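For intuition, here is a minimal, hedged sketch of the recursive strategy done by hand with a plain scikit-learn regressor; the toy series, window length, and horizon are made up for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

y = np.sin(np.arange(200) / 10.0)            # toy series standing in for your 3650 observations
window = 10

# build a table of lagged windows -> next value
X_tab = np.array([y[i:i + window] for i in range(len(y) - window)])
y_tab = y[window:]
reg = RandomForestRegressor(random_state=0).fit(X_tab, y_tab)

# recursive forecasting: feed each prediction back into the window
last_window = list(y[-window:])
forecasts = []
for _ in range(20):                           # forecast 20 steps past the end of the data
    y_hat = reg.predict([last_window])[0]
    forecasts.append(y_hat)
    last_window = last_window[1:] + [y_hat]
print(forecasts[:5])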
If you're interested, we're developing a toolbox that extends scikit-learn for exactly these use cases. So with sktime, you could simply write:
import numpy as np
from sktime.datasets import load_airline
from sktime.forecasting.compose import RecursiveTabularRegressionForecaster
from sklearn.ensemble import RandomForestRegressor
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error
y = load_airline() # load 1-dimensional time series
y_train, y_test = temporal_train_test_split(y)
fh = np.arange(1, len(y_test) + 1) # forecasting horizon
regressor = RandomForestRegressor(random_state=3)
forecaster = RecursiveTabularRegressionForecaster(regressor, window_length=10)
forecaster.fit(y_train)
y_pred = forecaster.predict(fh)
print(mean_absolute_percentage_error(y_test, y_pred, symmetric=True))
>>> 0.1440354514063762

Classify intent of random utterance of chat bot from training data and give different graphical visualization using random forest?

I am creating an NLP model to detect the intent of utterances provided in an Excel file which I am using for training. It has 2 columns, as shown below:
Utterence                           Intent
hi can I have an Apple Watch       service
how much I will be paying monthly  service
you still around                    YOU_THERE
are you still there                 YOU_THERE
you there                           YOU_THERE
Speak to me if you are there.       YOU_THERE
you around                          YOU_THERE
There are around 3000 utterances in the training file and many intents.
I trained my model using the scikit-learn module and my code looks like this:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import re

def preprocessing(userQuery):
    letters_only = re.sub("[^a-zA-Z\\d]", " ", userQuery)
    words = letters_only.lower().split()
    return " ".join(words)

#read utterance data from a xlsx file
train = pd.read_excel('training.xlsx')
query_features = train['Utterence']

#create tfidf
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
new_query = [preprocessing(query) for query in query_features]
features = tfidf_vectorizer.fit_transform(new_query).toarray()

#create random forest classification model
model = RandomForestClassifier()
model.fit(features, train['Intent'])

#intent prediction on user query
userQuery = "I want apple watch"
userQueryList = []
userQueryList.append(preprocessing(userQuery))
utfidf = tfidf_vectorizer.transform(userQueryList)
print(" prediction: ", model.predict(utfidf))
One of the problems here, for example: when I run it for the utterance "I want apple watch", it gives the predicted intent as YOU_THERE instead of service, as shown below (see the training snapshot above for confirmation):
C:\U\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
prediction: ['YOU_THERE']
Please help me: how should I train my model and what changes should I make to fix such issues, and how can I check accuracy? Also, I want to see a graphical visualization and an ROC curve; how can that be achieved using random forest? I am not very well versed in NLP, so any help would be appreciated.
You are using a bag-of-words approach, which does not perform well on sequence data.
For your problem, the word sequence is material to classification.
I would suggest that you use an LSTM, which performs better on sequence data; a minimal sketch is shown below.
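Here is a minimal, hedged sketch of that suggestion using Keras. The tokenizer settings, layer sizes, epoch count, and the sample_intent_data.csv file name (borrowed from the next answer) are illustrative assumptions, not tuned values.
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.preprocessing import LabelEncoder

# assumed CSV with Utterence,Intent columns as in the next answer
data = pd.read_csv('sample_intent_data.csv')

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(data['Utterence'])
X = pad_sequences(tokenizer.texts_to_sequences(data['Utterence']), maxlen=20)
y = LabelEncoder().fit_transform(data['Intent'])

model = Sequential([
    Embedding(input_dim=5000, output_dim=64, input_length=20),  # word embeddings keep word order
    LSTM(64),                                                   # reads the utterance as a sequence
    Dense(len(set(y)), activation='softmax')                    # one output per intent
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=16, validation_split=0.2)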
Let's address your first issue:
how should I train my model and what changes should I make to fix such issues
Below I'm using a word2vec approach which, rather than just converting the utterances to vectors with a TF-IDF approach (losing the semantic information contained in each sentence), maintains the semantic information.
To understand more about word2vec, refer to this blog:
https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
Below is the code for predicting the intent using the word2vec approach. (Note: it's the same as your code, just using word2vec instead of TfidfVectorizer to obtain the vectors. Also, the code is divided into different functions to give a good overview of the logic, which will be evident from their names.)
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def preprocess_lower(token):
    #utility for preprocessing
    return token.lower()

def load_data(file_name):
    # load a csv file in memory
    return pd.read_csv(file_name)

def process_training_data(training_data):
    # process the training data and split it between independent and dependent variables
    training_sentences = [list(map(preprocess_lower, sentence.split(" "))) for sentence in list(training_data.Utterence.values)]
    target_class = training_data.Intent.values
    label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(target_class))
    return target_class, training_sentences, label_encoded_Y

def process_user_query(training_data):
    # process the user query into lower-cased token lists
    training_sentences = [list(map(preprocess_lower, sentence.split(" "))) for sentence in training_data]
    return training_sentences

def train_word2vec_model(train_sentences_list):
    # training word2vec on sentences list (inputted by user)
    model = Word2Vec(train_sentences_list, size=100, window=4, min_count=1, workers=4)
    return model

def convert_training_data_vectors(model, train_sentences_list):
    #get the sentences average vector
    training_sectences_vector = list()
    for sentence in train_sentences_list:
        sentence_vetor = [list(model.wv[token]) for token in sentence if token in model.wv.vocab]
        training_sectences_vector.append(list(np.mean(sentence_vetor, axis=0)))
    return training_sectences_vector

def training_rf_prediction_model(training_data_vectors, label_encoded_Y):
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    # training model on user inputted data
    rf_model = RandomForestClassifier()
    # here use the split function and divide the data into training and testing
    x_train, x_test, y_train, y_test = train_test_split(training_data_vectors, label_encoded_Y,
                                                        train_size=0.8, test_size=0.2)
    rf_model.fit(x_train, y_train)
    y_pred = rf_model.predict(x_test)
    print(accuracy_score(y_test, y_pred))
    return rf_model

def training_svm_prediction_model(training_data_vectors, label_encoded_Y):
    svm_model = SVC(gamma='auto')
    svm_model.fit(training_data_vectors, label_encoded_Y)
    return svm_model

def process_data_flow(file_name):
    training_data = load_data(file_name)
    target_class, training_sentences, label_encoded_Y = process_training_data(training_data)
    word2vec_model = train_word2vec_model(train_sentences_list=training_sentences)
    training_data_vectors = convert_training_data_vectors(word2vec_model, train_sentences_list=training_sentences)
    prediction_model = training_rf_prediction_model(training_data_vectors, label_encoded_Y)
    #intent prediction on user query
    userQuery = ["i want apple watch"]
    user_query_vectors = convert_training_data_vectors(word2vec_model, process_user_query(userQuery))
    predicted_class = prediction_model.predict(user_query_vectors)[0]
    predicted_intent = target_class[list(label_encoded_Y).index(predicted_class)]
    return predicted_intent

print("Predicted class: ", process_data_flow("sample_intent_data.csv"))
The sample data file is in CSV format; you just need to format and paste the data as below:
#sample_intent_data.csv
Utterence,Intent
hi can I have an Apple Watch,service
how much I will be paying monthly,service
you still around,YOU_THERE
are you still there,YOU_THERE
you there,YOU_THERE
Speak to me if you are there,YOU_THERE
you around,YOU_THERE
Also note, your training data should contain a good amount of training utterances for each intent for the approach to work.
For accuracy, you can use the below approach.
Divide the data into training and testing sets (specifying the split ratio):
x_train, x_test, y_train, y_test = train_test_split(training_vectors, label_encoded_Y,
                                                    train_size=0.8, test_size=0.2)
After training the model, use the predict function on x_test to get the predictions. Then just match the model's predictions for the test data against the actual values from the data set and you will easily be able to determine the accuracy.
Edit: Added the accuracy score calculation while predicting.

Understanding data augmentation in the object detection API

I am using the object detection API to train with a different dataset, and I would like to know if it is possible to see sample images of what is reaching the network during training.
I ask this because I am trying to find a good combination of data augmentation options (here the options), but the results from adding them have been worse. Seeing what reaches the network during training would be very helpful.
Another question is whether it is possible to get the API to help with balancing the classes, in case the dataset passed to it is unbalanced.
Thank you!
Yes, it is possible. In short, you need to get an instance of tf.data.Dataset. Then you can iterate over it and get the network input data as NumPy arrays. Saving it to image files using PIL or OpenCV is trivial then.
Assuming you use TF2, the pseudo-code is like this:
import numpy as np
import cv2
from object_detection.core import standard_fields as fields  # common import for InputDataFields

MAX_SAMPLES = 100  # assumed limit on how many images to dump
ds = ...  # get dataset object somehow
sample_num = 0
for features, _ in ds:
    images = features[fields.InputDataFields.image]  # a [batch_size, H, W, C] float32 tensor with preprocessed images
    batch_size = images.shape[0]
    for i in range(batch_size):
        image = np.array(images[i] * 255).astype(np.uint8)  # assuming input data is only scaled to [0..1]
        output_path = f'augmented_sample_{sample_num}.jpg'   # hypothetical output file name
        cv2.imwrite(output_path, image)
        sample_num += 1
    if sample_num >= MAX_SAMPLES:
        break
The trick here is to get the Dataset instance. The Google object detection API is very sophisticated, but I guess you should start by calling the train_input function here: https://github.com/tensorflow/models/blob/3c8b6f1e17e230b68519fd8d58c4dd9e9570d789/research/object_detection/inputs.py#L763
It requires pipeline config sub-parts describing training, train_input and the model.
You can find some code snippets on how to work with pipeline here: Dynamically Editing Pipeline Config for Tensorflow Object Detection
import argparse
import tensorflow as tf
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

def parse_arguments():
    parser = argparse.ArgumentParser(description='')
    parser.add_argument('pipeline')
    parser.add_argument('output')
    return parser.parse_args()

def main():
    args = parse_arguments()
    pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
    with tf.gfile.GFile(args.pipeline, "r") as f:  # in TF2, tf.io.gfile.GFile is the equivalent
        proto_str = f.read()
        text_format.Merge(proto_str, pipeline_config)

if __name__ == '__main__':
    main()
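Putting the two pieces together, here is a hedged sketch of how the parsed config might be handed to train_input to obtain the dataset. The keyword names follow the inputs.py file linked above, but treat them as assumptions and check against the version you have checked out.
from object_detection import inputs

# the TrainEvalPipelineConfig proto holds the sub-configs train_input needs
train_ds = inputs.train_input(
    train_config=pipeline_config.train_config,
    train_input_config=pipeline_config.train_input_reader,
    model_config=pipeline_config.model)

# train_ds should now be the tf.data.Dataset iterated over in the pseudo-code above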

How do I plot a non-linear model using matplotlib?

I'm a bit lost as to how to proceed to achieve this. Normally with a linear model, when I perform linear regressions, I simply take my training data (x) and my output data (y) and plot them using matplotlib. Now I have 3 features and my output/observation (y). Can anyone guide me as to how to graph this kind of model using matplotlib? My goal is to fit a polynomial model and graph that polynomial using matplotlib.
%matplotlib inline
import sframe as frame
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
# Initialize SFrame
sales = frame.SFrame('kc_house_data.gl/')
# Separate data into test and training data
train_data,test_data = sales.random_split(.8,seed=0)
# Organize data into training and testing data
train_x = train_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
train_y = train_data[['price']].to_dataframe().values
test_x = test_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
test_y = test_data[['price']].to_dataframe().values
# Create a model using sklearn with multiple features
regr = linear_model.LinearRegression(fit_intercept=True, n_jobs=2)
# fit the model before predicting
regr.fit(train_x, train_y)
# test predictions
regr.predict(train_x)
# Prepare to plot the data
Note:
The train_x variable contains my 3 features, and my train_y contains the output data. I use SFrame to contain the data. SFrame has the ability to convert itself into a dataframe (used in Pandas). Using the conversion I am able to grab the values.
Rather than plotting a non-linear model with multiple discrete features at once, I have found that simply plotting each feature against my observation/output was better and easier for my research; a sketch of that is below.
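A minimal sketch of that per-feature view with matplotlib, assuming train_x and train_y are the NumPy arrays produced by the question's code and the column order follows the question's feature list:
import matplotlib.pyplot as plt

# assumes train_x (n_samples x 3) and train_y come from the question's code above
feature_names = ['sqft_living', 'bedrooms', 'bathrooms']
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, (ax, name) in enumerate(zip(axes, feature_names)):
    ax.scatter(train_x[:, i], train_y, s=4, alpha=0.4)  # one feature at a time vs. price
    ax.set_xlabel(name)
    ax.set_ylabel('price')
plt.tight_layout()
plt.show()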