Linear regression with one feature from Pandas dataframe - pandas

I have tried the code below
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
# Assign the dataframe to this variable.
# TODO: Load the data
bmi_life_data = pd.read_csv("bmi_and_life_expectancy.csv")
X= bmi_life_data['BMI'].values.reshape(-1,1)
y = bmi_life_data['Life expectancy'].values.reshape(-1,1)
# Make and fit the linear regression model
#TODO: Fit the model and Assign it to bmi_life_model
bmi_life_model = LinearRegression()
bmi_life_model.fit(X,y)
# Make a prediction using the model
# TODO: Predict life expectancy for a BMI value of 21.07931
laos_life_exp = bmi_life_model.predict(21.07931)
but it gives me the error
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I get this error even after reshaping the data. I have also tried not reshaping it, but it still gives me the same error.

The error was in the prediction line
laos_life_exp = bmi_life_model.predict(21.07931)
which should be
laos_life_exp = bmi_life_model.predict([[21.07931]])
because predict expects a 2-D array of shape (n_samples, n_features), not a bare scalar.
Thanks to #onyambu
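As a minimal sketch of the shape requirement, the snippet below fits the same kind of one-feature model on synthetic data. The numbers are made up and only stand in for the BMI and life expectancy columns of the CSV file, which is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the BMI / life expectancy columns (made-up values).
rng = np.random.default_rng(0)
bmi = rng.uniform(18.0, 30.0, size=50)
life_expectancy = 50.0 + 1.0 * bmi + rng.normal(0.0, 1.0, size=50)

X = bmi.reshape(-1, 1)   # 2-D: 50 samples, 1 feature
y = life_expectancy      # a 1-D target is accepted

model = LinearRegression().fit(X, y)

# A single sample must still be 2-D: one row, one column.
single_bmi = np.array([[21.07931]])
prediction = model.predict(single_bmi)
print(prediction.shape)  # (1,)
```

The same 2-D input can be produced with np.array([21.07931]).reshape(1, -1), which is the second option named in the error message.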

Related

How to get samples per class for TensorFlow Dataset

I am using a dataset from TensorFlow Datasets.
Is there an easy way to access the number of samples for each class in the dataset? I searched through the Keras API and did not find any ready-to-use function.
Ultimately I would like to draw a bar plot with the number of samples on the Y axis and an integer class id on the X axis. The goal is to show how evenly the data is distributed across classes.
With np.fromiter you can create a 1-D array from an iterable object.
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
dataset = tfds.load('cifar10', split='train', as_supervised=True)
# Keep only the labels, then count how often each class id occurs.
labels, counts = np.unique(np.fromiter(dataset.map(lambda x, y: y), np.int32),
                           return_counts=True)
sns.barplot(x = labels, y = counts)
plt.ylabel('Counts')
plt.xlabel('Labels')
Update: You can also count the labels like below:
labels = []
for x, y in dataset:
    # Not one hot encoded
    labels.append(y.numpy())
    # If one hot encoded, then apply argmax
    # labels.append(np.argmax(y, axis = -1))
labels = np.concatenate(labels, axis = 0) # Assuming dataset was batched.
Then you can plot them using the labels array.
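Once the labels are in a NumPy array, the per-class counts can be computed without TensorFlow. The sketch below uses a small hypothetical label array in place of the one collected from the dataset.

```python
import numpy as np

# Hypothetical label array standing in for the collected `labels`.
labels = np.array([0, 2, 1, 2, 2, 0, 1, 2])

# np.unique returns the sorted class ids and how often each one occurs.
class_ids, counts = np.unique(labels, return_counts=True)
samples_per_class = dict(zip(class_ids.tolist(), counts.tolist()))
print(samples_per_class)  # {0: 2, 1: 2, 2: 4}
```

The class_ids and counts arrays can be passed directly as the x and y arguments of a bar plot.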

Passing a dict of tensors to a Keras model

I am trying to preprocess the infamous Titanic data (from Kaggle) by following this tutorial.
Everything was okay until I ran the titanic_preprocessing model on the data (titanic_features) and got this error:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
In the tutorial it is mentioned that one should transform the data into a dict of tensors, but:
1. I don't see how the code (see the HERE1 tag in my code below) makes a dict of tensors (there is no tf.convert_to_tensor, for example).
2. I don't understand why one should retransform all the data, as the previous code was supposed to do just that (when one creates preprocessed_inputs etc.).
Here is my code, but you can also execute it on Google Colab here.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
url = "https://raw.githubusercontent.com/aymeric75/IA/master/train.csv"
titanic = pd.read_csv(url)
titanic_features = titanic.copy()
titanic_labels = titanic_features.pop('Survived')
inputs = {}
for name, column in titanic_features.items():
    dtype = column.dtype
    if dtype == object:
        dtype = tf.string
    else:
        dtype = tf.float32
    inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)
numeric_inputs = {name:input for name,input in inputs.items()
                  if input.dtype==tf.float32}
x = layers.Concatenate()(list(numeric_inputs.values()))
norm = preprocessing.Normalization()
norm.adapt(np.array(titanic[numeric_inputs.keys()]))
all_numeric_inputs = norm(x)
preprocessed_inputs = [all_numeric_inputs]
for name, input in inputs.items():
    if input.dtype == tf.float32:
        continue
    lookup = preprocessing.StringLookup(vocabulary=np.unique(titanic_features[name].dropna()))
    one_hot = preprocessing.CategoryEncoding(max_tokens=lookup.vocab_size())
    x = lookup(input)
    x = one_hot(x)
    preprocessed_inputs.append(x)
preprocessed_inputs_cat = layers.Concatenate()(preprocessed_inputs)
titanic_preprocessing = tf.keras.Model(inputs, preprocessed_inputs_cat)
titanic_features_dict = {}
# This model just contains the input preprocessing. You can run it to see what it does to your data.
# Keras models don't automatically convert Pandas DataFrames because
# it's not clear if it should be converted to one tensor or to a dictionary of tensors. So convert it to a dictionary of tensors:
# HERE1
titanic_features_dict = {name: np.array(value)
                         for name, value in titanic_features.items()}
features_dict = {name:values[:1] for name, values in titanic_features_dict.items()}
titanic_preprocessing(features_dict)
Thanks a lot for your support!
Aymeric
[UPDATE] If you can answer question 2 ("I don't understand why one should retransform all the data as the previous code was supposed to do just that (when one creates preprocessed_inputs etc.)") then I will accept your answer, because I think I do need to reformat the input (but I don't see the point of all the code before it...)
In your case, the problem is caused by the fact that your feature "Cabin" contains some NaN (Not a Number) values. TensorFlow is fine with NaN in floating point and integer data types, but not in strings.
You can replace all those NaN values with empty strings in your pandas dataframe:
titanic_features["Cabin"] = titanic_features["Cabin"].fillna("")
As for question 2: the previous code simply declares a preprocessing function as a Keras model. No data is actually preprocessed until you call the titanic_preprocessing model.
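To see why the missing values in a string column cause the "Unsupported object type float" error, it helps to look at the Python types inside the array. This is a small sketch with a hypothetical column standing in for titanic_features["Cabin"]; the cabin values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical column standing in for titanic_features["Cabin"].
cabin = pd.Series(["C85", np.nan, "E46", np.nan], name="Cabin", dtype=object)

# Missing entries are Python floats mixed in with the strings.
types_before = [type(v).__name__ for v in np.array(cabin)]
print(types_before)  # ['str', 'float', 'str', 'float']

# After fillna("") every element is a string, so the array is homogeneous.
cabin_filled = cabin.fillna("")
types_after = [type(v).__name__ for v in np.array(cabin_filled)]
print(types_after)  # ['str', 'str', 'str', 'str']
```

An object array that mixes str and float cannot be converted to a single string tensor, which is consistent with the error message in the question.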

Activation function with exponential distributed data

I am a beginner with neural networks. I have a number of targets to predict in a regression model. I have noticed that the model works perfectly when the targets are already normally distributed, but it does not work well with exponentially distributed targets. I understand this is the role of the activation function, but I have tried many functions (relu, linear, selu, elu, etc.) and still did not get a great result.
Please check the images below:
[Image: normally distributed targets]
[Image: exponentially distributed targets]
That sort of makes sense, but riddle me this: are you taking the right approach? You don't need to assume Normal distributions to do regression. It is a common misunderstanding that OLS somehow assumes normally distributed data. It does not. It is far more general. OLS regression makes no assumptions about the data; it makes assumptions about the errors, as estimated by the residuals. Also, transforming data to make it fit a model is, in my opinion, the wrong approach. You want your model to fit your problem, not the other way round. There are a few ways to deal with skewed data sets.
1. Normalize Data
2. Standardize Data
Let's see an example.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
##iris.keys()
df= pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                 columns= iris['feature_names'] + ['target'])
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
# Normalize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
import seaborn as sns
# load the iris dataset
iris = load_iris()
print(iris.data.shape)
# separate the data from the target attributes
X = df[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']]
y = df['species']
sns.displot(X)
# normalize the data (by default each sample, i.e. each row, is rescaled to unit L2 norm)
normalized_X = preprocessing.normalize(X)
sns.displot(normalized_X)
# Standardize the data attributes for the Iris dataset.
# X and y are the same feature and target columns selected above.
sns.displot(X)
# standardize the data attributes
standardized_X = preprocessing.scale(X)
sns.displot(standardized_X)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[iris.feature_names], iris.target, test_size=0.5, stratify=iris.target, random_state=123456)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')
Result:
Out-of-bag score estimate: 0.96
Mean accuracy score: 0.933
See the link below for specific info on these concepts.
https://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/
Again, though, maybe this is not the right approach. For one thing, you can try to apply a different model to your specific data set. Support Vector Machine algorithms care only about the boundaries of the separating hyperplane and do not assume an exact shape for the distributions. One of my favorite models is in the Decision Tree family: specifically, the Random Forest model.
Also, see this link,
https://www.blopig.com/blog/2017/07/using-random-forests-in-python-with-scikit-learn/
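The earlier point that OLS makes assumptions about the errors rather than about the data can be checked numerically. The sketch below is only an illustration on synthetic data: the exponential feature, the coefficients, and the skewness helper are all assumptions made for this example, not part of the original question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def skewness(values):
    """Sample skewness: the third standardized moment."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


rng = np.random.default_rng(42)
x = rng.exponential(scale=1.0, size=2000)             # strongly skewed feature
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=2000)   # skewed target, Gaussian noise

model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - model.predict(x.reshape(-1, 1))

print(f"estimated slope:       {model.coef_[0]:.2f}")
print(f"skewness of y:         {skewness(y):.2f}")
print(f"skewness of residuals: {skewness(residuals):.2f}")
```

Here the target is far from normally distributed, yet the fitted slope should land close to the true value of 2 and the residuals should be roughly symmetric, because only the noise term was drawn from a normal distribution.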

My TensorFlow Graph is abnormally large using Edward

I have code here that I've modified from this website. Basically what I have written is this:
#import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
#from tensorflow.examples.tutorials.mnist import input_data
from edward.models import Categorical, Normal
import edward as ed
#ed.set_seed(39)
import pandas as pd
import csv
# Load the data from a local CSV file.
with open("data_final.csv", "r") as csvfile:
    reader1 = csv.reader(csvfile)
    data1 = np.array(list(reader1)).astype(float)
#mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
N = data1.shape[0] -1 # number of images in a minibatch.
D = 4 # number of features.
K = 4 # number of classes.
# Create a placeholder to hold the data (in minibatches) in a TensorFlow graph.
x = tf.placeholder(tf.float32, [N, D])
# Normal(0,1) priors for the variables. Note that the syntax assumes TensorFlow 1.1.
w = Normal(loc=tf.zeros([D, K]), scale=tf.ones([D, K]))
b = Normal(loc=tf.zeros(K), scale=tf.ones(K))
# Categorical likelihood for classification.
y =tf.matmul(x,w)+b
# Construct q(w) and q(b). In this case we assume Normal distributions.
qw = Normal(loc=tf.Variable(tf.random_normal([D, K])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([D, K]))))
qb = Normal(loc=tf.Variable(tf.random_normal([K])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([K]))))
# We use a placeholder for the labels in anticipation of the training data.
y_ph = tf.placeholder(tf.float32, [N, K])
# Define the VI inference technique, i.e. minimise the KL divergence between q and p.
inference = ed.KLqp({w: qw, b: qb}, data={y:y_ph})
# Initialise the inference variables
inference.initialize(n_iter=5000, n_print=100, scale={y: 1})
# We will use an interactive session.
sess = tf.InteractiveSession()
# Initialise all the variables in the session.
tf.global_variables_initializer().run()
I use the data linked here to run the code. I get an error less than a second after starting the code (so I have a hard time believing this actually happened), which says:
ValueError: GraphDef cannot be larger than 2GB.
I think there were other topics with the same error as mine, but those people had instantiated something like 1 million parameters. I have on the order of 20 parameters, so I am unsure why I'm getting this error.
In my case, there were still variables (and likely graphs) from previous Edward runs that had not been garbage collected. Garbage collecting/resetting the console fixed the problem.

How do I plot a non-linear model using matplotlib?

I'm a bit lost as to how to proceed to achieve this. Normally with a linear model, when I perform linear regressions, I simply take my training data (x) and my output data (y) and plot them using matplotlib. Now I have 3 features and my output/observation (y). Can anyone guide me as to how to graph this kind of model using matplotlib? My goal is to fit a polynomial model and graph a polynomial using matplotlib.
%matplotlib inline
import sframe as frame
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
# Initialize SFrame
sales = frame.SFrame('kc_house_data.gl/')
# Separate data into test and training data
train_data,test_data = sales.random_split(.8,seed=0)
# Organize data into training and testing data
train_x = train_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
train_y = train_data[['price']].to_dataframe().values
test_x = test_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
test_y = test_data[['price']].to_dataframe().values
# Create a model using sklearn with multiple features
regr = linear_model.LinearRegression(fit_intercept=True, n_jobs=2)
# Fit the model on the training data before predicting
regr.fit(train_x, train_y)
# test predictions
regr.predict(train_x)
# Prepare to plot the data
Note:
The train_x variable contains my 3 features, and my train_y contains the output data. I use SFrame to contain the data. SFrame has the ability to convert itself into a dataframe (used in Pandas). Using the conversion I am able to grab the values.
Rather than plotting a non-linear model with multiple discrete features at once, I have found that simply observing each and every feature against my observation/output was better and easier for my research.
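A rough sketch of that per-feature approach follows. The data is synthetic and only mimics the three feature names from the question, the price formula is invented, and the degree-2 curve drawn with np.polyfit is one possible choice of polynomial rather than the model used in the original research.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the three features and the price (made-up values).
rng = np.random.default_rng(0)
feature_names = ["sqft_living", "bedrooms", "bathrooms"]
train_x = np.column_stack([
    rng.uniform(500, 4000, size=200),
    rng.integers(1, 6, size=200),
    rng.integers(1, 5, size=200),
])
train_y = 150.0 * train_x[:, 0] + 10000.0 * train_x[:, 1] + rng.normal(0, 20000, size=200)

# One panel per feature, all sharing the price axis.
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, name, column in zip(axes, feature_names, train_x.T):
    ax.scatter(column, train_y, s=8, alpha=0.5)
    # Overlay a degree-2 polynomial fitted to this single feature.
    coeffs = np.polyfit(column, train_y, deg=2)
    grid = np.linspace(column.min(), column.max(), 100)
    ax.plot(grid, np.polyval(coeffs, grid), color="red")
    ax.set_xlabel(name)
axes[0].set_ylabel("price")
fig.tight_layout()
```

In a notebook, the Agg line can be dropped and the figure displayed inline; fig.savefig can be used to write it to a file instead.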