How do I plot a non-linear model using matplotlib? - matplotlib

I'm a bit lost as to how to proceed to achieve this. Normally with a linear model, when I perform linear regressions, I simply take my training data (x) and and my output data (y) and plot them using matplotlib. Now I have 3 features with and my output/observation (y). Can anyone guide me as to how to graph this kind of model using matplotlib? My goal is to fit a polynomial model and graph a polynomial using matplotlib.
%matplotlib inline
import sframe as frame
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
# Initalize SFrame
sales = frame.SFrame('kc_house_data.gl/')
# Separate data into test and training data
train_data,test_data = sales.random_split(.8,seed=0)
# Organize data into training and testing data
train_x = train_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
train_y = train_data[['price']].to_dataframe().values
test_x = test_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
test_y = test_data[['price']].to_dataframe().values
# Create a model using sklearn with multiple features
regr = linear_model.LinearRegression(fit_intercept=True, n_jobs=2)
# test predictions
regr.predict(train_x)
# Prepare to plot the data
Note:
The train_x variable contains my 3 features, and my train_y contains the output data. I use SFrame to contain the data. SFrame has the ability to convert itself into a dataframe (used in Pandas). Using the conversion I am able to grab the values.

Rather than plotting a non-linear model with multiple discrete features at once, I have found that simply observing each and every feature against my observation/output was better and easier for my research.

Related

How to change xtick of Yellowbrick's Learning Curve visualizer?

I'm trying to change the xtick of Yellowbrick's learning curve figure from number of samples to normalized number(%) of samples. I googled a lot but couldn't find the way.
You need to change the xticks so that they are normalized to the number of training instances thus you need to specify in the percentformatter in my example the number of training instances(55000). I provide the before and after images.
from yellowbrick.model_selection import LearningCurve
from sklearn.naive_bayes import MultinomialNB
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from yellowbrick.datasets import load_game
import matplotlib.pyplot as plt
# Create subplot
fig,ax = plt.subplots()
# Create the learning curve visualizer
sizes = np.linspace(0.3, 1.0, 10)
# Load a classification dataset
X, y = load_game()
# Encode the categorical data
X = OneHotEncoder().fit_transform(X)
y = LabelEncoder().fit_transform(y)
# Instantiate the classification model and visualizer
model = MultinomialNB()
visualizer = LearningCurve(
model, scoring='f1_weighted', ax=ax, train_sizes=sizes)
xticks = mtick.PercentFormatter(55000)
ax.xaxis.set_major_formatter(xticks)
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.show()

Linear regression with one feature from Pandas dataframe

I have tried the code below
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
# Assign the dataframe to this variable.
# TODO: Load the data
bmi_life_data = pd.read_csv("bmi_and_life_expectancy.csv")
X= bmi_life_data['BMI'].values.reshape(-1,1)
y = bmi_life_data['Life expectancy'].values.reshape(-1,1)
# Make and fit the linear regression model
#TODO: Fit the model and Assign it to bmi_life_model
bmi_life_model = LinearRegression()
bmi_life_model.fit(X,y)
# Mak a prediction using the model
# TODO: Predict life expectancy for a BMI value of 21.07931
laos_life_exp = bmi_life_model.predict(21.07931)
but it gives me the error
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Even after reshaping it. I have tried to not reshape it but it still gives me the same error.
The error was in the prediction line
laos_life_exp = bmi_life_model.predict(21.07931)
should be
laos_life_exp = bmi_life_model.predict([[21.07931]])
to be of appropriate dimension
Thanks to #onyambu

Facebook NeuralProphet - Loading model from pickle for prediction

I have a weekly Job which reads data from a csv file and create model based on NeuralProphet and dump the pickle file for the later use.
from neuralprophet import NeuralProphet
from matplotlib import pyplot as plt
import pandas as pd
import pickle
data_location = /input_data/
df = pd.read_csv(data_location + 'input.csv')
np = NeuralProphet()
model = np.fit(df, freq="5min")
with open('model/neuralprophet_model.pkl', "wb") as f:
# dump information to that file
pickle.dump(model, f)
The above python code runs on a weekly basis and it dumps the model file in a file.
Now, i have a different python file which loads the pickle file and does the prediction for the future date.
Lets say, I have last 2 years data in a csv file and created model from that. Now, I would like to predict the future based on the above model.
from neuralprophet import NeuralProphet
import pandas as pd
import pickle
with open('model/neuralprophet_model.pkl', "rb") as f:
model = pickle.load(file)
# To get a next 1 hour prediction by 5mins interval
future = model.make_future_dataframe(periods=12, freq='5min')
forecast = model.predict(future)
Is this correct? Here, I dont pass the data to make_future_dataframe. But, all the internet example passes the data as well. Since, the data was used to train the model, I am just using the model here. Why do we need to pass data also here as we use predict(For some unknown future date) based on the model?
The NeuralProphet model (pickle file) is just a trained neural network... the most simple analogy would be a training linear regression model (from sci-kit learn etc)... y = Ax + b where you have trained A and b vectors. These vectors alone cannot produce y without x. Your model in this example is just the A and b vectors. Now, neuralprophet uses auto-regressive feed forward neural networks, so there are more vector terms and they are not all linear.
That's why NeuralProhpet requires historic data in model.fit... the historic data is x. x can be from the same dataset that you used for training A and b, or x can be from a different but statistically similar dataset (You can use d-bar testing to determine and confidence intervals to determine similarity here).
This is how we use models across most supervised learning applications... train on one sample dataset and apply to predict outcomes on similar datasets.

Activation function with exponential distributed data

I am in beginning of neural networks, I have a bunch of targets in a regression model to predict, what I have noticed is the model works perfectly with targets were already normally distributed, but it does not work well with exponentially distributed targets, I understand this is the activation function rule, but I have been trying many functions (relu, linear, selu,elu, etc) and still didn't get a great result.
Please check the images below
Normally distributed
Exponentially distributed
That sort of makes sense, but riddle me this. Are you taking the right approach? You don't need to assume Normal distributions to do regression. It is a common misunderstanding that OLS somehow assumes normally distributed data. It does not. It is far more general. So, OLS regression makes no assumptions about the data, it makes assumptions about the errors, as estimated by residuals. Also, transforming data to make in fit a model is, in my opinion, the wrong approach. You want your model to fit your problem, not the other way round. There are a few ways to deal with skewed data sets.
1. Normalize Data
2. Standardize Data
Let's see an example.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
##iris.keys()
df= pd.DataFrame(data= np.c_[iris['data'], iris['target']],
columns= iris['feature_names'] + ['target'])
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
# Normalize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
import seaborn as sns
# load the iris dataset
iris = load_iris()
print(iris.data.shape)
# separate the data from the target attributes
X = df[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']]
y = df['species']
sns.displot(X)
# normalize the data attributes
normalized_X = preprocessing.normalize(X)
sns.displot(normalized_X)
# Standardize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
# load the Iris dataset
iris = load_iris()
print(iris.data.shape)
# separate the data and target attributes
X = X
y = y
sns.displot(X)
# standardize the data attributes
standardized_X = preprocessing.scale(X)
sns.displot(standardized_X)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[iris.feature_names], iris.target, test_size=0.5, stratify=iris.target, random_state=123456)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')
Result:
Out-of-bag score estimate: 0.96
Mean accuracy score: 0.933
See the link below for specific info on these concepts.
https://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/
Again, though, maybe this is not the right approach. For one thing, you can try to apply a different model to your specific data set. Support Vector Machine algos just cares about the boundaries of the separating hyperplane and do not assume the exact shape of the distributions. One of my favorite models in the the Decision Tree family, specifically, the Random Forest model.
Also, see this link,
https://www.blopig.com/blog/2017/07/using-random-forests-in-python-with-scikit-learn/

Classify intent of random utterance of chat bot from training data and give different graphical visualization using random forest?

I am creating a nlp model to detect the intent from the provided utterance from a excel file which I am using for training having 2 columns like shown below:
Utterence Intent
hi can I have an Apple Watch service
how much I will be paying monthly service
you still around YOU_THERE
are you still there YOU_THERE
you there YOU_THERE
Speak to me if you are there. YOU_THERE
you around YOU_THERE
There are like around 3000 utterances in the training files and many intents.
I trained my model using scikit learn module and my code looks like this.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import re
def preprocessing(userQuery):
letters_only = re.sub("[^a-zA-Z\\d]", " ", userQuery)
words = letters_only.lower().split()
return( " ".join(words ))
#read utterance data from a xlsx file
train = pd.read_excel('training.xlsx')
query_features = train['Utterence']
#create tfidf
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
new_query = [preprocessing(query) for query in query_features]
features = tfidf_vectorizer.fit_transform(new_query).toarray()
#create random forest classification model
model = RandomForestClassifier()
model.fit(features, train['Intent'])
#intent prediction on user query
userQuery = "I want apple watch"
userQueryList=[]
userQueryList.append(preprocessing(userQuery))
utfidf = tfidf_vectorizer.transform(userQueryList)
print(" prediction: ", model.predict(utfidf))
The one of problem for me here is for example: when i run for utterance I want apple watch it gives predicted intent as you_there instead of service as shown below(confirmation on training snapshot above):
C:\U\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
prediction: ['YOU_THERE']
Please help me how should i train my model and what changes should I make to fix such issues and how i can check accuracy? Also I want to see graphical visualization and ROC curve how it can achieved using random forest. I am not very verse in NLP any help would be appreciated.
You are using word bags approach which does not perform well on sequence data.
For your problem, sequential is material to classification.
I would suggest to you that use LSTM (performs better on sequence data)
Let's address your first issue:
how should i train my model and what changes should I make to fix such issues
Below I'm using word2vec approach which rather than just converting the utterances to vectors using TFIDF approach (losing the semantic information contained in that particular sentence), it maintains the semantic info.
To understand more about word2vec, refer this blog :
[1]https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
Below is the code for predicted the intent using word2vec approach (Note - It's same as your code, just instead of using TFIDFVectorizer, I'm using word2vec to obtain the vectors. Also the code is divided into different functions to get a good overview of logics that will be evident by there names).
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
def preprocess_lower(token):
#utility for preprocessing
return token.lower()
def load_data(file_name):
# load a csv file in memory
return pd.read_csv(file_name)
def process_training_data(training_data):
# process the training data and split it between independent and dependent variables
training_sentences = [list(map(preprocess_lower,sentence.split(" "))) for sentence in list(training_data.Utterence.values)]
target_class = training_data.Intent.values
label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(target_class))
return target_class, training_sentences, label_encoded_Y
def process_user_query(training_data):
# process the training data and split it between independent and dependent variables
training_sentences = [list(map(preprocess_lower,sentence.split(" "))) for sentence in training_data]
return training_sentences
def train_word2vec_model(train_sentences_list):
# training word2vec on sentences list (inputted by user)
model = Word2Vec(train_sentences_list, size=100, window=4, min_count=1, workers=4)
return model
def convert_training_data_vectors(model, train_sentences_list):
#get the sentences average vector
training_sectences_vector = list()
for sentence in train_sentences_list:
sentence_vetor = [list(model.wv[token]) for token in sentence if token in model.wv.vocab ]
training_sectences_vector.append(list(np.mean(sentence_vetor, axis=0)))
return training_sectences_vector
def training_rf_prediction_model(training_data_vectors, label_encoded_Y):
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# training model on user inputted data
rf_model = RandomForestClassifier()
# here use the split function and divide the data into training and testing
x_train,x_test,y_train,y_test=train_test_split(training_data_vectors,label_encoded_Y,
train_size=0.8,test_size=0.2)
rf_model.fit(x_train, y_train)
y_pred = rf_model.predict(x_test)
print(accuracy_score(y_test, y_pred))
return rf_model
def training_svm_prediction_model(training_data_vectors, label_encoded_Y):
svm_model = SVC(gamma='auto')
svm_model.fit(training_data_vectors, label_encoded_Y)
return svm_model
def process_data_flow(file_name):
training_data = load_data(file_name)
target_class, training_sentences, label_encoded_Y = process_training_data(training_data)
word2vec_model = train_word2vec_model(train_sentences_list=training_sentences)
training_data_vectors = convert_training_data_vectors(word2vec_model, train_sentences_list=training_sentences)
prediction_model = training_rf_prediction_model(training_data_vectors, label_encoded_Y)
#intent prediction on user query
userQuery = ["i want apple watch"]
user_query_vectors = convert_training_data_vectors(word2vec_model, process_user_query(userQuery))
predicted_class = prediction_model.predict(user_query_vectors)[0]
predicted_intent = target_class[list(label_encoded_Y).index(predicted_class)]
return predicted_intent
print("Predicted class: ", process_data_flow("sample_intent_data.csv"))
sample data file is in csv format, you just need to format and paste the data in below format :
#sample_input_data.csv
Utterence,Intent
hi can I have an Apple Watch,service
how much I will be paying monthly,service
you still around,YOU_THERE
are you still there,YOU_THERE
you there,YOU_THERE
Speak to me if you are there,YOU_THERE
you around,YOU_THERE
Also note, your training data should contain good amount of training utterances for each intents for the approach to work.
For accuracy, you can use below approach:
Divide the data into training and testing (mention the split ratio) :
x_train,x_test,y_train,y_test=train_test_split(training_vectors,label_encoded_Y,
train_size=0.8,
test_size=0.2)
And after training the model, use predict function on x_test to get the predictions. Now just match the prediction for testing data from the model and actual from the data set and you will be able to easily determine the accuracy.
Edit: Added the accuracy score calculation while predicting.