Linear Regression - matplotlib

My problem statement is:
The following data set shows the results of a recently conducted study on the correlation between the number of hours spent driving and the risk of developing acute back pain. Find the equation of the best-fit line for this data.
The data set is as follows:
x y
10 95
9 80
2 10
15 50
10 45
16 98
11 38
16 93
Machine spec : Linux Ubuntu 18.10 64bit
I am getting the following error:
python LR.py
Accuracy :
43.70948145101002
[6.01607946]
Enter the no of hours10
y :
0.095271*10.000000+5.063367
Risk Score : 6.016079463451905
Traceback (most recent call last):
File "LR.py", line 30, in <module>
plt.plot(X,y,'o')
File "/home/sumeet/anaconda3/lib/python3.6/site-packages/matplotlib/pyplot.py", line 3358, in plot
ret = ax.plot(*args, **kwargs)
File "/home/sumeet/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 1855, in inner
return func(ax, *args, **kwargs)
File "/home/sumeet/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py", line 1527, in plot
for line in self._get_lines(*args, **kwargs):
File "/home/sumeet/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_base.py", line 406, in _grab_next_args
for seg in self._plot_args(this, kwargs):
File "/home/sumeet/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_base.py", line 383, in _plot_args
x, y = self._xy_from_xy(x, y)
File "/home/sumeet/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_base.py", line 242, in _xy_from_xy
"have shapes {} and {}".format(x.shape, y.shape))
ValueError: x and y must have same first dimension, but have shapes (8, 1) and (1,)
The code is as below:
import matplotlib.pyplot as plt
import pandas as pd
# Read Dataset
dataset=pd.read_csv("hours.csv")
X=dataset.iloc[:,:-1].values
y=dataset.iloc[:,1].values
# Import the Linear Regression and Create object of it
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X,y)
Accuracy=regressor.score(X, y)*100
print("Accuracy :")
print(Accuracy)
# Predict the value using Regressor Object
y_pred=regressor.predict([[10]])
print(y_pred)
# Take user input
hours=int(input('Enter the no of hours'))
#calculate the value of y
eq=regressor.coef_*hours+regressor.intercept_
y='%f*%f+%f' %(regressor.coef_,hours,regressor.intercept_)
print("y :")
print(y)
print("Risk Score : ", eq[0])
plt.plot(X,y,'o')
plt.plot(X,regressor.predict(X));
plt.show()

At the beginning of your code, you define the y that you presumably want to plot:
y=dataset.iloc[:,1].values
but further down you redefine it, and thus overwrite it, as
y='%f*%f+%f' %(regressor.coef_,hours,regressor.intercept_)
This causes the error: the last y is a string, not an array with 8 elements like X (and like your initial y).
Rename it to something else, e.g. Y, in the relevant lines at the end:
Y='%f*%f+%f' %(regressor.coef_,hours,regressor.intercept_)
print("Y :")
print(Y)
This keeps y as initially defined, and the plot should then work.
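The same pitfall can be reproduced without matplotlib or scikit-learn. The following is a minimal sketch in plain Python, assuming the eight (x, y) pairs listed in the question; the helper fit_line and the names equation and risk are hypothetical and are not part of the original script. It keeps the numeric values under one name and stores the formatted equation under another, so the data stays available for plotting.

```python
# Data from the question: hours driven (xs) and risk score (ys).
xs = [10, 9, 2, 15, 10, 16, 11, 16]
ys = [95, 80, 10, 50, 45, 98, 38, 93]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(xs, ys)
hours = 10

# Use a separate name for the text, so ys is still the list of 8 numbers.
equation = "%f*%f+%f" % (slope, hours, intercept)
risk = slope * hours + intercept

print("Equation:", equation)
print("Risk score:", risk)

# Both sequences still have the same length, which plt.plot(xs, ys, 'o') requires.
assert len(xs) == len(ys)
```

Because equation is a separate variable, a later plt.plot(xs, ys, 'o') would receive two sequences of equal length rather than a list and a string.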

Related

Python Sklearn "ValueError: Classification metrics can't handle a mix of multiclass-multioutput and binary targets" error

I have already read this answer but didn't understand it.
I don't get this error when I use the train_test_split function to split the same dataset into training and test sets.
But when I use different CSV files for testing and training, I get this error.
Link to the Titanic Kaggle competition
Can someone please explain why I am getting this error?
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(df,survived_df)
predictions=logreg.predict(test)
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(test_survived,predictions)  # error here: ValueError: Classification metrics can't handle a mix of multiclass-multioutput and binary targets
print(accuracy)
Full Error
ValueError Traceback (most recent call last)
<ipython-input-243-89c8ae1a928d> in <module>
----> 1 logreg.score(test,test_survived)
2
~/mldl/kaggle_practice/titanic_pilot/venv/lib64/python3.8/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
497 """
498 from .metrics import accuracy_score
--> 499 return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
500
501 def _more_tags(self):
~/mldl/kaggle_practice/titanic_pilot/venv/lib64/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
~/mldl/kaggle_practice/titanic_pilot/venv/lib64/python3.8/site-packages/sklearn/metrics/_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
185
186 # Compute accuracy for each possible representation
--> 187 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
188 check_consistent_length(y_true, y_pred, sample_weight)
189 if y_type.startswith('multilabel'):
~/mldl/kaggle_practice/titanic_pilot/venv/lib64/python3.8/site-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
88
89 if len(y_type) > 1:
---> 90 raise ValueError("Classification metrics can't handle a mix of {0} "
91 "and {1} targets".format(type_true, type_pred))
92
ValueError: Classification metrics can't handle a mix of multiclass-multioutput and binary targets
Full Code
df=pd.read_csv('data/train.csv')
test=pd.read_csv('data/test.csv')
test_survived=pd.read_csv('data/gender_submission.csv')
plt.figure(5)
df=df.drop(columns=['Name','SibSp','Ticket','Cabin','Parch','Embarked'])
test=test.drop(columns=['Name','SibSp','Ticket','Cabin','Parch','Embarked'])
sns.heatmap(df.isnull())
plt.figure(2)
sns.boxplot(data=df,y='Age')
# From the boxplot, the 75th percentile seems to be 38 and the 25th percentile seems to be 20.
# Multiplying by 1.5 at both ends, Age in (10, 57) seems good; any value outside this range is considered an outlier.
# This age range is also used to calculate the mean that replaces the NA values of Age.
df=df.loc[df['Age'].between(9,58),]
# test=test.loc[test['Age'].between(9,58),]
# test=test.loc[test['Age'].between(9,58),]
df=df.reset_index(drop=True,)
class_3_age=df.loc[df['Pclass']==3].Age.mean()
class_2_age=df.loc[df['Pclass']==2].Age.mean()
class_1_age=df.loc[df['Pclass']==1].Age.mean()
def remove_null_age(data):
    agee=data[0]
    pclasss=data[1]
    if pd.isnull(agee):
        if pclasss==1:
            return class_1_age
        elif pclasss==2:
            return class_2_age
        else:
            return class_3_age
    return agee
df['Age']=df[["Age","Pclass"]].apply(remove_null_age,axis=1)
test['Age']=test[["Age","Pclass"]].apply(remove_null_age,axis=1)
sex=pd.get_dummies(df['Sex'],drop_first=True)
test_sex=pd.get_dummies(test['Sex'],drop_first=True)
sex=sex.reset_index(drop=True)
test_sex=test_sex.reset_index(drop=True)
df=df.drop(columns=['Sex'])
test=test.drop(columns=['Sex'])
df=pd.concat([df,sex],axis=1)
test=test.reset_index(drop=True)
df=df.reset_index(drop=True)
test=pd.concat([test,test_sex],axis=1)
survived_df=df["Survived"]
df=df.drop(columns='Survived')
test["Age"]=test['Age'].round(1)
test.at[152,'Fare']=30
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(df,survived_df)
predictions=logreg.predict(test)
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(test_survived,predictions)
print(accuracy)
You probably want to get the accuracy for the predictions together with the column Survived of the test_survived dataframe:
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(test_survived['Survived'],predictions)
print(accuracy)
Your error occurred because accuracy_score() expects two 1-dimensional arrays here, one with the ground-truth labels and one with the predicted labels. You passed a 2-dimensional "array" (the whole dataframe) together with the 1-dimensional predictions, so the first input was treated as a multiclass-multioutput target.
The documentation is also very helpful on this point.
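To see the shape mismatch without scikit-learn, here is a small plain-Python sketch. It assumes that the ground-truth file has two columns, PassengerId and Survived, and uses illustrative rows rather than the real data; the accuracy function is a hypothetical stand-in for accuracy_score, not its actual implementation.

```python
# Illustrative rows only. The two-column layout (PassengerId, Survived) is an assumption.
test_survived = [
    {"PassengerId": 892, "Survived": 0},
    {"PassengerId": 893, "Survived": 1},
    {"PassengerId": 894, "Survived": 0},
    {"PassengerId": 895, "Survived": 0},
]
predictions = [0, 1, 1, 0]  # one predicted label per row


def accuracy(y_true, y_pred):
    """Fraction of matching labels; both inputs must be flat sequences of labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Select the single label column first, as test_survived['Survived'] does in pandas.
y_true = [row["Survived"] for row in test_survived]

score = accuracy(y_true, predictions)
print(score)
```

Passing the whole table would compare complete rows (ID plus label) against single labels, which is the kind of mismatch the error message describes.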

TypeError: scatter() got multiple values for argument 'c'

I am trying to do hierarchical clustering on my MFCC array 'signal_mfcc', which is an ndarray with shape (198, 12), i.e. 198 audio frames/observations and 12 coefficients/dimensions (I think).
I am using an arbitrary threshold of 250 with 'distance' as the criterion, as shown below:
thresh = 250
print(signal_mfcc.shape)
clusters = hcluster.fclusterdata(signal_mfcc, thresh, criterion="distance")
With the specified threshold, the output variable 'clusters' is a sequence [1 1 1 ... 1] of length 198, i.e. shape (198,), which I assume means that all the data points are assigned to a single cluster.
Then, I am using pyplot to plot scatter() with the following code:
# plotting
print(*(signal_mfcc.T).shape)
plt.scatter(*np.transpose(signal_mfcc), c=clusters)
plt.axis("equal")
title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
plt.title(title)
plt.show()
The output is:
plt.scatter(*np.transpose(signal_mfcc), c=clusters)
TypeError: scatter() got multiple values for argument 'c'
The scatter plot does not show. Any clue what may have gone wrong?
Thanks in advance!
This SO thread explains why you get this error.
According to the scatter documentation, c is the second optional argument and the fourth argument overall. The error means that unpacking np.transpose(signal_mfcc) yields four or more items (here 12, one per coefficient), so one of them is already bound to c positionally. When you then also pass c=clusters as a keyword, c receives two values and Python raises the TypeError.
Example :
def temp(n, c=0):
    pass
temp(*[1, 2], c=1)
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# TypeError: temp() got multiple values for argument 'c'
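One possible way around this, sketched under the assumption that only two of the coefficients should be plotted, is to pick two rows of the transposed data explicitly instead of unpacking all of them. The function scatter_like below is a hypothetical stand-in whose first four parameters follow the x, y, s, c order of scatter; the data values are illustrative.

```python
def scatter_like(x, y, s=None, c=None):
    """Hypothetical stand-in with the same leading parameter order as scatter()."""
    return {"x": list(x), "y": list(y), "s": s, "c": c}


# Three observations with four coefficients each (illustrative values).
signal = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]
clusters = [1, 1, 2]

# Transpose: one tuple per coefficient instead of one row per observation.
columns = list(zip(*signal))

# Unpacking all four columns fills x, y, s and c positionally,
# so the keyword c=clusters collides with the fourth column.
try:
    scatter_like(*columns, c=clusters)
    collided = False
except TypeError:
    collided = True

# Passing exactly two columns leaves c free for the cluster labels.
result = scatter_like(columns[0], columns[1], c=clusters)
print(collided, result["c"])
```

With the real data, the equivalent call would pass two chosen columns of signal_mfcc as x and y; which two coefficients are meaningful to plot is a separate decision.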

Why does this error appear during fit while creating a Decision Tree Classifier

Hi, I am trying out a Decision Tree Classifier by following the video "Hello World - Machine Learning Recipes #1" from Google Developers.
Here is my code.
#Import the Pandas library
import pandas as pd
#Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
#Print the head of the train and test dataframes
train.head()
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)
#Print the head of the train and test dataframes
test.head()
#from sklearn import tree
from sklearn import tree
#find the best feature to predict Survival rate
#define X_features and Y_labels
col_names=['Pclass','Age','SibSp','Parch']
X_features= train[col_names]
#assign survial to label
Y_labels= train.Survived
#create a decision tree classifier
clf=tree.DecisionTreeClassifier()
#fit (find patterns in Data)
clf=clf.fit(X_features, Y_labels)
clf.predict(test[col_names])
Getting Error
ValueError Traceback (most recent call last) in ()
13 #Y_train_sparse=Y_labels.to_sparse()
14 # fit (find patterns in Data)
---> 15 clf=clf.fit(X_features, Y_labels)
16 #clf.predict(test[col_names])
C:\Users\nitinahu\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
152 random_state = check_random_state(self.random_state)
153 if check_input:
--> 154 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
155 if issparse(X):
156 X.sort_indices()
C:\Users\nitinahu\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
396 % (array.ndim, estimator_name))
397 if force_all_finite:
--> 398 _assert_all_finite(array)
399
400 shape_repr = _shape_repr(array.shape)
C:\Users\nitinahu\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X)
52 and not np.isfinite(X).all()):
53 raise ValueError("Input contains NaN, infinity"
---> 54 " or a value too large for %r." % X.dtype)
55
56
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Check all the values in your input data.
One or two of them are out of bounds (NaN, infinite, or too large for float32), and that is what triggers the error.
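A quick way to locate such values is to count the non-finite entries per column before calling fit. The sketch below does this in plain Python on illustrative rows; the column names follow the question, while the helper count_non_finite is hypothetical. With a pandas DataFrame, X_features.isnull().sum() gives a similar per-column count of missing values.

```python
import math

# Illustrative rows; None and NaN stand in for missing entries.
rows = [
    {"Pclass": 3, "Age": 22.0, "SibSp": 1, "Parch": 0},
    {"Pclass": 1, "Age": None, "SibSp": 0, "Parch": 0},
    {"Pclass": 2, "Age": float("nan"), "SibSp": 0, "Parch": 1},
]
col_names = ["Pclass", "Age", "SibSp", "Parch"]


def count_non_finite(rows, columns):
    """Count entries per column that are missing, NaN or infinite (hypothetical helper)."""
    counts = {name: 0 for name in columns}
    for row in rows:
        for name in columns:
            value = row[name]
            if value is None or not math.isfinite(value):
                counts[name] += 1
    return counts


bad = count_non_finite(rows, col_names)
print(bad)

# Keep only complete rows; filling the gaps (e.g. with a mean) is another option.
clean_rows = [
    row for row in rows
    if all(row[n] is not None and math.isfinite(row[n]) for n in col_names)
]
```

Whether to drop or fill the affected rows depends on the data; the sketch only shows how to find them.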

Can't fit scikit-neuralnetwork classifier because of tuple index out of range

I am trying to get this classifier working. It is an extension for scikit-learn that depends on Theano.
My goal is to fit a neural network on a list of years and teach it whether a year is a leap year or not (later I would increase the range). But I run into an error when I try this example.
My code looks like this:
leapyear.py
import numpy as np
import calendar
from sknn.mlp import Classifier, Layer
from sklearn.cross_validation import train_test_split
# create years in range
years = np.arange(1970, 2001)
pre_is_leap = []
# test if year is a leapyear
for x in years:
    pre_is_leap.append(calendar.isleap(x))
# convert true, false list to 0,1 list
is_leap = np.array(pre_is_leap, dtype=bool).astype(int)
# split
years_train, years_test, is_leap_train, is_leap_test = train_test_split(years, is_leap, test_size=0.33, random_state=42)
# test output
print(len(years_train))
print(len(is_leap_train))
print(years_train)
print(is_leap_train)
#neural network
nn = Classifier(
    layers=[
        Layer("Maxout", units=100, pieces=2),
        Layer("Softmax")],
    learning_rate=0.001,
    n_iter=25)
# fit
nn.fit(years_train, is_leap_train)
#nn.fit(np.array(years_train), np.array(is_leap_train))
requirements.txt
numpy==1.9.2
PyYAML==3.11
scikit-learn==0.16.1
scikit-neuralnetwork==0.3
scipy==0.16.0
Theano==0.7.0
my output with error:
20
20
[1986 1975 1983 1981 1992 1971 1972 1995 1973 1991 1996 1988 2000 1990 1977
1980 1984 1998 1989 1976]
[0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 0 0 1]
/home/devnull/master/scikit/env/lib/python3.4/site-packages/sklearn/utils/validation.py:498: UserWarning: MinMaxScaler assumes floating point values as input, got int64
"got %s" % (estimator, X.dtype))
/home/devnull/master/scikit/env/lib/python3.4/site-packages/sklearn/preprocessing/data.py:256: DeprecationWarning: Implicitly casting between incompatible kinds. In a future numpy release, this will raise an error. Use casting="unsafe" if this is intentional.
X *= self.scale_
/home/devnull/master/scikit/env/lib/python3.4/site-packages/sklearn/preprocessing/data.py:257: DeprecationWarning: Implicitly casting between incompatible kinds. In a future numpy release, this will raise an error. Use casting="unsafe" if this is intentional.
X += self.min_
Traceback (most recent call last):
File "/home/devnull/master/scikit/leapyear.py", line 47, in <module>
pipeline.fit(years_train, is_leap_train)
File "/home/devnull/master/scikit/env/lib/python3.4/site-packages/sklearn/pipeline.py", line 141, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/home/devnull/master/scikit/env/lib/python3.4/site-packages/sknn/mlp.py", line 283, in fit
return super(Classifier, self)._fit(X, yp)
File "/home/devnull/master/scikit/env/lib/python3.4/site-packages/sknn/mlp.py", line 127, in _fit
X, y = self._initialize(X, y)
File "/home/devnull/master/scikit/env/lib/python3.4/site-packages/sknn/mlp.py", line 37, in _initialize
self._create_specs(X, y)
File "/home/devnull/master/scikit/env/lib/python3.4/site-packages/sknn/mlp.py", line 67, in _create_specs
self.unit_counts = [numpy.product(X.shape[1:]) if self.is_convolution else X.shape[1]]
IndexError: tuple index out of range
I looked into the source of mlp.py, but I don't know how to fix it. What has to be changed so that I can fit my network?
Update (not related to the question):
I just wanted to add that I need to convert the year to a binary representation; after that, the neural network works.
The problem is that the classifier requires the data as a 2-dimensional NumPy array, with the samples along the first axis and the features along the second.
In your case you have only one "feature" (the year), so you need to turn the years data into an N×1 2-D NumPy array. This can be achieved by adding the following line just before the data split statement:
years = np.array([[year] for year in years])
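To illustrate the shape change, here is a plain-Python sketch that uses nested lists in place of NumPy arrays; the helper shape is hypothetical and only handles non-empty rectangular lists. With a NumPy array, years.reshape(-1, 1) would produce the same N×1 layout.

```python
years = list(range(1970, 2001))  # same range as np.arange(1970, 2001)


def shape(data):
    """Return the dimensions of a non-empty rectangular nested list (hypothetical helper)."""
    dims = []
    while isinstance(data, list):
        dims.append(len(data))
        data = data[0]
    return tuple(dims)


# One sample per row, one feature (the year) per column.
years_2d = [[year] for year in years]

print(shape(years))     # 1-D: there is no second axis, so indexing shape[1] would fail
print(shape(years_2d))  # 2-D: samples on the first axis, features on the second
```

The IndexError in the traceback comes from reading X.shape[1] on the 1-D input, which has only one axis.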

read three columns from a text file using matplotlib

My input file (text.txt) has three columns. The first column belongs on the x-axis; the second and third columns are both plotted on the y-axis. When I run my code, I get "x.append(float(line.split()[0])) IndexError: list index out of range". How can I fix that error?
My code:
#!/usr/bin/python
import numpy as np
import matplotlib.pyplot as plt
with open("text.txt", "r") as data_file:
    lines=data_file.readlines()
x=[]
y1=[]
y2=[]
counter=0
for line in lines:
    if((line[0]!='#') and (line[0]!='#')):
        x.append(float(line.split()[0]))
        y1.append(float(line.split()[0]))
        y2.append(float(line.split()[1]))
        counter+=1
plt.plot(x, y1, y2)
plt.savefig("text.png", dpi=300)
My input file text.txt:
# Carbon
# Gallium
#
# title
# xaxis
1.00 2.12 14.51
2.00 4.54 18.14
3.00 6.12 45.11
4.00 9.02 89.15
5.00 6.48 49.99
6.00 8.01 92.33
7.00 7.56 95.14
8.00 5.89 96.01
You are getting the error
IndexError: list index out of range
because your data file contains empty lines. They may be at the end of the file.
You could fix it by including
for line in lines:
    if not line.strip(): continue
but instead of the code you posted, I would use NumPy's genfromtxt to parse the file this way:
import numpy as np
import matplotlib.pyplot as plt
with open("text.txt", "r") as f:
    lines = (line for line in f if not any(line.startswith(c) for c in '##'))
    x, y1, y2 = np.genfromtxt(lines, dtype=None, unpack=True)
plt.plot(x, y1, x, y2)
plt.savefig("text2.png", dpi=300)
If you want to fix your original code with minimal changes, it might look something like this:
with open("text.txt", "r") as f:
    x = []
    y1 = []
    y2 = []
    for line in f:
        if not line.strip() or line.startswith('#') or line.startswith('#'):
            continue
        row = line.split()
        x.append(float(row[0]))
        y1.append(float(row[1]))
        y2.append(float(row[2]))
plt.plot(x, y1, x, y2)
plt.savefig("text.png", dpi=300)
Tip: readlines() reads the entire file and returns a list of strings. This can require a lot of memory if the file is large. Therefore, avoid lines=data_file.readlines() unless you really need the entire file as a list of strings. Otherwise, it takes less memory to process the file one line at a time using
for line in f:
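As a sketch of that line-by-line approach, the generator below yields one parsed row at a time instead of building a list of all lines first. It reads from an in-memory io.StringIO so that it runs on its own; with a real file, open("text.txt") would take its place. The function name parse_rows and the shortened sample are assumptions for illustration.

```python
import io

# Shortened sample in the same layout as text.txt, including a blank line.
SAMPLE = """# Carbon
# title
1.00 2.12 14.51
2.00 4.54 18.14

3.00 6.12 45.11
"""


def parse_rows(handle):
    """Yield one list of floats per data line, skipping blank and '#' lines."""
    for line in handle:  # the file object is iterated lazily, one line at a time
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        yield [float(field) for field in stripped.split()]


x, y1, y2 = [], [], []
with io.StringIO(SAMPLE) as f:
    for row in parse_rows(f):
        x.append(row[0])
        y1.append(row[1])
        y2.append(row[2])

print(x, y1, y2)
```

Only the current line and the accumulated numeric columns are held in memory, and the blank-line check also avoids the IndexError from the question.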