I am just getting started with neural networks. I have a bunch of targets to predict in a regression model, and I have noticed that the model works well when the targets are already normally distributed, but it does not work well when the targets are exponentially distributed. I understand the activation function plays a role here, but I have tried many functions (relu, linear, selu, elu, etc.) and still didn't get a great result.
Please check the images below.
[Image: histogram of the normally distributed target]
[Image: histogram of the exponentially distributed target]
That sort of makes sense, but riddle me this: are you taking the right approach? You don't need to assume normal distributions to do regression. It is a common misunderstanding that OLS somehow assumes normally distributed data. It does not; it is far more general than that. OLS regression makes no assumptions about the data itself, it makes assumptions about the errors, as estimated by the residuals. Also, transforming data to make it fit a model is, in my opinion, the wrong approach. You want your model to fit your problem, not the other way round. There are a few ways to deal with skewed data sets (a quick residual check is also sketched right after this list).
1. Normalize Data
2. Standardize Data
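Before rescaling anything, it can be worth looking at the residuals of a plain fit, since that is where the OLS assumptions actually live. A minimal sketch on synthetic data (all names and numbers here are illustrative, not from the question):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(500, 1))         # heavily skewed feature, for illustration
y = 3.0 * X.ravel() + rng.normal(0, 1.0, size=500)    # but well-behaved (normal) errors

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# It is these residuals, not the raw data, that should look roughly normal
plt.hist(residuals, bins=30)
plt.title("Residuals of a plain linear fit")
plt.show()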
Let's see an example of each, using the Iris dataset.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
# iris.keys()  # inspect the available keys if needed
# Build a DataFrame with the features, the numeric target, and the species name
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
# Normalize the data attributes for the Iris dataset.
from sklearn import preprocessing
import seaborn as sns

print(iris.data.shape)
# separate the features from the target
X = df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = df['species']
sns.displot(X)
# normalize the feature values
# (preprocessing.normalize rescales each sample/row to unit norm)
normalized_X = preprocessing.normalize(X)
sns.displot(normalized_X)
# Standardize the data attributes for the Iris dataset.
from sklearn import preprocessing

# reuse the same features (X) and target (y) as above
sns.displot(X)
# standardize the feature values
# (preprocessing.scale centers each column on zero mean and unit variance)
standardized_X = preprocessing.scale(X)
sns.displot(standardized_X)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[iris.feature_names], iris.target, test_size=0.5, stratify=iris.target, random_state=123456)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')
Result:
Out-of-bag score estimate: 0.96
Mean accuracy score: 0.933
See the link below for specific info on these concepts.
https://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/
Again, though, maybe this is not the right approach. For one thing, you can try applying a different model to your specific data set. Support Vector Machine algorithms only care about the boundaries of the separating hyperplane and do not assume the exact shape of the distributions. One of my favorite models is the Decision Tree family, specifically the Random Forest model; a rough regression sketch is below.
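As a rough sketch of trying a tree-based regressor on a heavily skewed target (the data below is synthetic, since the original dataset is not available; RandomForestRegressor is simply the regression counterpart of the classifier used above):
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))                       # made-up features
y = np.exp(X[:, 0] + 0.5 * rng.normal(size=1000))    # heavily right-skewed target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print(f'Held-out R^2: {r2_score(y_test, rf.predict(X_test)):.3f}')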
Also, see this link,
https://www.blopig.com/blog/2017/07/using-random-forests-in-python-with-scikit-learn/
I'm trying to change the xticks of Yellowbrick's learning curve figure from the number of samples to the normalized number (%) of samples. I googled a lot but couldn't find a way.
You need to change the xticks so that they are normalized to the number of training instances; thus, you need to pass the total number of training instances (55000 in my example) to PercentFormatter. I provide the before and after images.
from yellowbrick.model_selection import LearningCurve
from sklearn.naive_bayes import MultinomialNB
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from yellowbrick.datasets import load_game
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

# Create subplot
fig, ax = plt.subplots()

# Create the learning curve visualizer
sizes = np.linspace(0.3, 1.0, 10)

# Load a classification dataset
X, y = load_game()

# Encode the categorical data
X = OneHotEncoder().fit_transform(X)
y = LabelEncoder().fit_transform(y)

# Instantiate the classification model and visualizer
model = MultinomialNB()
visualizer = LearningCurve(
    model, scoring='f1_weighted', ax=ax, train_sizes=sizes)

# Format the x axis as a percentage of the 55000 training instances
xticks = mtick.PercentFormatter(55000)
ax.xaxis.set_major_formatter(xticks)

visualizer.fit(X, y)  # Fit the data to the visualizer
visualizer.show()
I have tried the code below
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
# Assign the dataframe to this variable.
# TODO: Load the data
bmi_life_data = pd.read_csv("bmi_and_life_expectancy.csv")
X= bmi_life_data['BMI'].values.reshape(-1,1)
y = bmi_life_data['Life expectancy'].values.reshape(-1,1)
# Make and fit the linear regression model
#TODO: Fit the model and Assign it to bmi_life_model
bmi_life_model = LinearRegression()
bmi_life_model.fit(X,y)
# Make a prediction using the model
# TODO: Predict life expectancy for a BMI value of 21.07931
laos_life_exp = bmi_life_model.predict(21.07931)
but it gives me the error
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Even after reshaping it, I get this. I have also tried not reshaping it, but it still gives me the same error.
The error was in the prediction line
laos_life_exp = bmi_life_model.predict(21.07931)
should be
laos_life_exp = bmi_life_model.predict([[21.07931]])
to be of appropriate dimension
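The double brackets just make the input a 2-D array of shape (1, 1): one sample with one feature, which is what scikit-learn's predict expects. An equivalent, more explicit form (reusing the fitted model from above):
import numpy as np

# one sample, one feature -> shape (1, 1)
laos_life_exp = bmi_life_model.predict(np.array([[21.07931]]))
# or, reshaping explicitly as the error message suggests
laos_life_exp = bmi_life_model.predict(np.array([21.07931]).reshape(1, -1))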
Thanks to #onyambu
I have a weekly job which reads data from a csv file, creates a model based on NeuralProphet, and dumps the pickle file for later use.
from neuralprophet import NeuralProphet
from matplotlib import pyplot as plt
import pandas as pd
import pickle
data_location = '/input_data/'
df = pd.read_csv(data_location + 'input.csv')
np = NeuralProphet()
model = np.fit(df, freq="5min")
with open('model/neuralprophet_model.pkl', "wb") as f:
    # dump information to that file
    pickle.dump(model, f)
The above Python code runs on a weekly basis and dumps the model into a file.
Now, I have a different Python file which loads the pickle file and does the prediction for a future date.
Let's say I have the last 2 years of data in a csv file and created the model from that. Now, I would like to predict the future based on the above model.
from neuralprophet import NeuralProphet
import pandas as pd
import pickle
with open('model/neuralprophet_model.pkl', "rb") as f:
model = pickle.load(file)
# To get a next 1 hour prediction by 5mins interval
future = model.make_future_dataframe(periods=12, freq='5min')
forecast = model.predict(future)
Is this correct? Here, I don't pass the data to make_future_dataframe, but all the examples on the internet pass the data as well. Since the data was already used to train the model, I am just using the model here. Why do we need to pass the data here too, when we predict (for some unknown future date) based on the model?
The NeuralProphet model (pickle file) is just a trained neural network. The simplest analogy would be a trained linear regression model (from scikit-learn etc.): y = Ax + b, where you have trained the A and b vectors. These vectors alone cannot produce y without x. Your model in this example is just the A and b vectors. Now, NeuralProphet uses auto-regressive feed-forward neural networks, so there are more vector terms and they are not all linear.
That's why NeuralProphet requires historic data in model.fit... the historic data is x. x can come from the same dataset that you used to train A and b, or from a different but statistically similar dataset (you can use d-bar testing and confidence intervals to determine similarity here).
This is how we use models across most supervised learning applications: train on one sample dataset and apply the model to predict outcomes on similar datasets.
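So in the prediction script the recent history has to be handed over as well. A minimal sketch, assuming the NeuralProphet API where make_future_dataframe takes the historic dataframe as its first argument (file paths reused from the question):
from neuralprophet import NeuralProphet
import pandas as pd
import pickle

# the historic data (the "x" above) that the trained weights are applied to
df = pd.read_csv('/input_data/input.csv')

with open('model/neuralprophet_model.pkl', 'rb') as f:
    model = pickle.load(f)

# next 1 hour in 5-minute steps, computed from the tail of the history
future = model.make_future_dataframe(df, periods=12)
forecast = model.predict(future)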
I want to build a handwritten digit recognizer on the MNIST dataset using sklearn, and I wanted to shuffle my train set for both the features (x) and the labels (y). But it throws a KeyError. Let me know the correct way to do it.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
x,y=mnist['data'],mnist['target']
x.shape
y.shape
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
digit = np.array(x.iloc[45])
digit_img = digit.reshape(28,28)
plt.imshow(digit_img,cmap=matplotlib.cm.binary , interpolation="nearest")
plt.axis("off")
y.iloc[45]
x_train, x_test = x[:60000],x[60000:]
y_train, y_test=y[:60000],y[60000:]
import numpy as np
shuffled = np.random.permutation(60000)
x_train = x_train[shuffled]   # <-- these two lines are throwing the error
y_train = y_train[shuffled]   # <--
Please check whether type(x_train) is numpy.ndarray or DataFrame.
Since scikit-learn 0.24, fetch_openml() returns a pandas DataFrame by default.
If it is a DataFrame, you cannot use x_train[shuffled], which is meant for arrays.
Use x_train.iloc[shuffled] instead.
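For example, a minimal sketch of the shuffling step with positional indexing (or, alternatively, asking fetch_openml for plain arrays):
import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784')            # DataFrame/Series by default since 0.24
x, y = mnist['data'], mnist['target']

x_train, x_test = x[:60000], x[60000:]
y_train, y_test = y[:60000], y[60000:]

shuffled = np.random.permutation(60000)
x_train = x_train.iloc[shuffled]             # positional indexing on a DataFrame
y_train = y_train.iloc[shuffled]             # and on a Series

# alternative: keep the original array-style indexing
# mnist = fetch_openml('mnist_784', as_frame=False)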
I'm a bit lost as to how to proceed to achieve this. Normally, with a linear model, when I perform linear regression I simply take my training data (x) and my output data (y) and plot them using matplotlib. Now I have 3 features along with my output/observation (y). Can anyone guide me as to how to graph this kind of model using matplotlib? My goal is to fit a polynomial model and graph a polynomial using matplotlib.
%matplotlib inline
import sframe as frame
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
# Initialize SFrame
sales = frame.SFrame('kc_house_data.gl/')
# Separate data into test and training data
train_data,test_data = sales.random_split(.8,seed=0)
# Organize data into training and testing data
train_x = train_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
train_y = train_data[['price']].to_dataframe().values
test_x = test_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
test_y = test_data[['price']].to_dataframe().values
# Create a model using sklearn with multiple features
regr = linear_model.LinearRegression(fit_intercept=True, n_jobs=2)
# Fit the model on the training data before predicting
regr.fit(train_x, train_y)
# test predictions
regr.predict(train_x)
# Prepare to plot the data
Note:
The train_x variable contains my 3 features, and train_y contains the output data. I use SFrame to hold the data; SFrame can convert itself into a DataFrame (as used in pandas), and using that conversion I am able to grab the underlying values.
Rather than plotting a non-linear model with multiple discrete features at once, I have found that simply plotting each feature against my observation/output was better and easier for my research; a minimal sketch of that per-feature view is below.
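A minimal sketch of that per-feature view, assuming train_x and train_y exist as the NumPy arrays built in the snippet above:
import matplotlib.pyplot as plt

feature_names = ['sqft_living', 'bedrooms', 'bathrooms']

# one scatter per feature against the observed price
fig, axes = plt.subplots(1, len(feature_names), figsize=(15, 4), sharey=True)
for i, (ax, name) in enumerate(zip(axes, feature_names)):
    ax.scatter(train_x[:, i], train_y, s=5, alpha=0.3)
    ax.set_xlabel(name)
axes[0].set_ylabel('price')
plt.tight_layout()
plt.show()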