'tuple' object has no attribute 'reshape' - numpy

I am using the dataset "ex1data1.txt", but when I run the code to convert it, it shows the following error:
AttributeError Traceback (most recent call last)
<ipython-input-52-7c523f7ba9e1> in <module>()
1 # Converting loaded dataset into numpy array
2
----> 3 X = np.concatenate((np.ones(len(population)).reshape(len(population), 1), population.reshape(len(population),1)), axis=1)
4
5
AttributeError: 'tuple' object has no attribute 'reshape'
The code is given below:
import csv
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas as pd
import numpy as np
# Loading Dataset
with open('ex1data1.txt') as csvfile:
    population, profit = zip(*[(float(row['Population']), float(row['Profit'])) for row in csv.DictReader(csvfile)])
# Creating DataFrame
df = pd.DataFrame()
df['Population'] = population
df['Profit'] = profit
# Plotting using Seaborn
sns.lmplot(x="Population", y="Profit", data=df, fit_reg=False, scatter_kws={'s':45})
# Converting loaded dataset into numpy array
X = np.concatenate((np.ones(len(population)).reshape(len(population), 1), population.reshape(len(population),1)), axis=1)
y = np.array(profit).reshape(len(profit), 1)
# Creating theta matrix , theta = [[0], [0]]
theta = np.zeros((2, 1))
# Learning rate
alpha = 0.1
# Iterations to be taken
iterations = 1500
# Updated theta and calculated cost
theta, cost = gradientDescent(X, y, theta, alpha, iterations)
I don't know how to solve this reshape problem. Can anyone tell me how to solve it?

From your definition, population is a tuple. I'd suggest two options: the first is converting it to an array, i.e.
population = np.asarray(population)
Alternatively, you can use the DataFrame column .values attribute, which is essentially a numpy array:
X = np.concatenate((np.ones(len(population)).reshape(len(population), 1), df['Population'].values.reshape(len(population),1)), axis=1)
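For example, a minimal sketch of the first option, using the question's own variables:
# Convert the tuple returned by zip() into a NumPy array so reshape() works
population = np.asarray(population)
X = np.concatenate((np.ones(len(population)).reshape(len(population), 1),
                    population.reshape(len(population), 1)), axis=1)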

Related

MatplotLib/Pandas Using Time as X Axis

I am working on a project where I would like to read sensor data from a CSV file and do a live graph.
I am using Matplotlib for the graphing and Pandas for the data handling.
For the CSV I am using:
Column 0= Pass/Fail Boolean
Column 1= Time in %h:%m:%s format.
Column 3= Error Code (int64)
When I run the script I get a "ValueError: values must be a 1D array".
I believe it's coming from the time data, but when I check the dtype it is datetime64 as expected. My program is below:
from itertools import count
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

plt.style.use('fivethirtyeight')

x_vals = []
y_vals = []
index = count()

def animate(i):
    # Use Pandas to read the CSV and create a DataFrame. Use kwargs to choose
    # columns, then specify the name and type of the data.
    data = pd.read_csv('C:/Python/20220124.csv', usecols=[0, 1, 3],
                       names=["Pass", "Time", "Error Code"], header=None,
                       parse_dates=[1],
                       dtype={"Pass": 'boolean', "Error Code": 'Int64'})
    pd.to_datetime(data['Time'])
    y1 = data['Pass']
    x1 = data['Time']
    y2 = data['Error Code']

    # Pyplot: clear axes
    plt.cla()

    # Pyplot: plot data as line graphs
    plt.plot(x1, y1, label='Pass/Fail', lw=3, c='c', marker='o', markersize=4, mfc='k')
    plt.plot(x1, y2, label='Error Code', lw=2, ls='--', c='k')
    plt.legend(loc='upper left')
    plt.tight_layout()

    # Pyplot: get current axes
    ax = plt.gca()

ani = FuncAnimation(plt.gcf(), animate, interval=1000)
plt.tight_layout()
plt.show()
You can do this using only pandas, which has a built-in plotting function:
df = data[['Error Code', 'Time']]    # Create a DataFrame with the data to be plotted
df.set_index('Time', inplace=True)   # Set the time as the index (it will serve as the x-axis)
df.plot()                            # Plot the graph
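If you still want the live update, here is a hedged sketch that combines this pandas approach with FuncAnimation, reusing the question's file path and column layout:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

fig, ax = plt.subplots()

def animate(i):
    data = pd.read_csv('C:/Python/20220124.csv', usecols=[0, 1, 3],
                       names=["Pass", "Time", "Error Code"], header=None,
                       parse_dates=[1],
                       dtype={"Pass": 'boolean', "Error Code": 'Int64'})
    ax.clear()
    # Time becomes the index and therefore the x-axis; cast the nullable
    # integer column to float so pandas can plot it
    data.set_index('Time')['Error Code'].astype(float).plot(ax=ax, legend=True)

ani = FuncAnimation(fig, animate, interval=1000)
plt.show()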

Simple logistic regression with Statsmodels: Adding an intercept and visualizing the logistic regression equation

Using Statsmodels, I am trying to generate a simple logistic regression model to predict whether a person smokes or not (Smoke) based on their height (Hgt).
I have a feeling that an intercept needs to be included in the logistic regression model, but I am not sure how to add one using the add_constant() function. I am also unsure why the error below is generated.
This is the dataset, Pulse.CSV: https://drive.google.com/file/d/1FdUK9p4Dub4NXsc-zHrYI-AGEEBkX98V/view?usp=sharing
The full code and output are in this PDF file: https://drive.google.com/file/d/1kHlrAjiU7QvFXF2a7tlTSFPgfpq9bOXJ/view?usp=sharing
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
raw_data = pd.read_csv('Pulse.csv')
raw_data
x1 = raw_data['Hgt']
y = raw_data['Smoke']
reg_log = sm.Logit(y,x1,missing='Drop')
results_log = reg_log.fit()
def f(x,b0,b1):
    return np.array(np.exp(b0+x*b1) / (1 + np.exp(b0+x*b1)))
f_sorted = np.sort(f(x1,results_log.params[0],results_log.params[1]))
x_sorted = np.sort(np.array(x1))
plt.scatter(x1,y,color='C0')
plt.xlabel('Hgt', fontsize = 20)
plt.ylabel('Smoked', fontsize = 20)
plt.plot(x_sorted,f_sorted,color='C8')
plt.show()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
4729 try:
-> 4730 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4731 except KeyError as e1:
((( Truncated for brevity )))
IndexError: index out of bounds
An intercept is not added by default in Statsmodels regressions, but you can include one manually if you need it.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
raw_data = pd.read_csv('Pulse.csv')
raw_data
x1 = raw_data['Hgt']
y = raw_data['Smoke']
x1 = sm.add_constant(x1)
reg_log = sm.Logit(y,x1,missing='Drop')
results_log = reg_log.fit()
results_log.summary()
def f(x,b0,b1):
    return np.array(np.exp(b0+x*b1) / (1 + np.exp(b0+x*b1)))
# Use only the Hgt column here; after add_constant, x1 also holds the constant column
f_sorted = np.sort(f(x1['Hgt'],results_log.params[0],results_log.params[1]))
x_sorted = np.sort(np.array(x1['Hgt']))
plt.scatter(x1['Hgt'],y,color='C0')
plt.xlabel('Hgt', fontsize = 20)
plt.ylabel('Smoked', fontsize = 20)
plt.plot(x_sorted,f_sorted,color='C8')
plt.show()
This will also resolve the error, since there was no intercept in your initial code.
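As a hedged alternative (assuming the same Pulse.csv columns), statsmodels' formula API adds the intercept automatically, so add_constant() is not needed:
import pandas as pd
import statsmodels.formula.api as smf

raw_data = pd.read_csv('Pulse.csv')
# The formula interface includes an intercept by default
results_log = smf.logit('Smoke ~ Hgt', data=raw_data).fit()
print(results_log.summary())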

Couldn't load pyspark data frame to decision tree algorithm. It says can't work with pyspark data frame

I was working on IBM's data platform. I was able to load data into a PySpark data frame and make a Spark SQL table. After splitting the dataset and feeding it into the classification algorithm, it raises an error saying that the Spark SQL data can't be loaded and ndarrays are required.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import numpy as np
X_train,y_train,X_test,y_test = train_test_split(x,y,test_size = 0.1,random_state = 42)
RM = RandomForestRegressor()
RM.fit(X_train.reshape(1,-1),y_train)
Error:
TypeError: Expected sequence or array-like, got <class 'pyspark.sql.dataframe.DataFrame'>
After this error, I did something like this:
x = spark.sql('select Id,YearBuilt,MoSold,YrSold,Fireplaces FROM Train').toPandas()
y = spark.sql('Select SalePrice FROM Train where SalePrice is not null').toPandas()
Error:
AttributeError                            Traceback (most recent call last)
in <module>()
      5 X_train,y_train,X_test,y_test = train_test_split(x,y,test_size = 0.1,random_state = 42)
      6 RM = RandomForestRegressor()
----> 7 RM.fit(X_train.reshape(1,-1),y_train)

/opt/ibm/conda/miniconda3.6/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5065         if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5066             return self[name]
-> 5067         return object.__getattribute__(self, name)
   5068
   5069     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'reshape'
As the sklearn documentation says:
"""
X : array-like or sparse matrix, shape = [n_samples, n_features]
"""
regr = RandomForestRegressor()
regr.fit(X, y)
So firstly, you are passing a pandas.DataFrame as the X argument instead of an array.
Secondly, reshape() is a method of NumPy arrays, not an attribute of the DataFrame object.
import numpy as np
x = np.array([[2,3,4], [5,6,7]])
np.reshape(x, (3, -1))
Hope this helps.
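A minimal sketch under the question's assumptions (spark is the already available Spark session and Train is the table from the question): convert the Spark SQL results to pandas and then to NumPy before handing them to scikit-learn, applying the same filter to both queries so x and y stay aligned.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Same SalePrice filter on both queries so the rows line up
x = spark.sql('SELECT Id, YearBuilt, MoSold, YrSold, Fireplaces '
              'FROM Train WHERE SalePrice IS NOT NULL').toPandas()
y = spark.sql('SELECT SalePrice FROM Train WHERE SalePrice IS NOT NULL').toPandas()

# train_test_split returns X_train, X_test, y_train, y_test in that order
X_train, X_test, y_train, y_test = train_test_split(
    x.values, y.values.ravel(), test_size=0.1, random_state=42)

RM = RandomForestRegressor()
RM.fit(X_train, y_train)   # 2-D X and 1-D y, so no reshape is needed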

Getting a score of zero using cross val score

I am trying to use cross_val_score on my dataset, but I keep getting zeros as the score:
This is my code:
df = pd.read_csv("Flaveria.csv")
df = pd.get_dummies(df, columns=["N level", "species"], drop_first=True)
# Extracting the target value from the dataset
X = df.iloc[:, df.columns != "Plant Weight(g)"]
y = np.array(df.iloc[:, 0], dtype="S6")
logreg = LogisticRegression()
loo = LeaveOneOut()
scores = cross_val_score(logreg, X, y, cv=loo)
print(scores)
The features are categorical values, while the target value is a float value. I am not exactly sure why I am ONLY getting zeros.
The data looks like this before creating dummy variables
N level,species,Plant Weight(g)
L,brownii,0.3008
L,brownii,0.3288
M,brownii,0.3304
M,brownii,0.388
M,brownii,0.406
H,brownii,0.3955
H,brownii,0.3797
H,brownii,0.2962
Updated code where I am still getting zeros:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
import numpy as np
import pandas as pd
# Creating dummies for the non numerical features in the dataset
df = pd.read_csv("Flaveria.csv")
df = pd.get_dummies(df, columns=["N level", "species"], drop_first=True)
# Extracting the target value from the dataset
X = df.iloc[:, df.columns != "Plant Weight(g)"]
y = df.iloc[:, 0]
forest = RandomForestRegressor()
loo = LeaveOneOut()
scores = cross_val_score(forest, X, y, cv=loo)
print(scores)
cross_val_score splits the data into train and test folds with the given iterator, fits the model on the train folds, and scores on the test fold. For regressors, r2_score is the default scorer in scikit-learn.
You have specified LeaveOneOut() as your cv iterator, so each test fold contains a single sample. In this case, R-squared will always be 0.
Looking at the formula for R^2 on Wikipedia:
R^2 = 1 - (SS_res / SS_tot)
where
SS_tot = sum((y_i - y_mean)^2)
For a single test case, y_mean equals that y value itself, so the denominator is 0 and R^2 is undefined (NaN). In this case, scikit-learn sets the score to 0 instead of NaN.
Changing LeaveOneOut() to another CV iterator, such as KFold, will give you non-zero results, as shown in the sketch below.
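A minimal sketch, assuming the same X and y from the updated code above; with KFold each test fold has several samples, so R^2 is well defined:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestRegressor(random_state=42), X, y, cv=kf)
print(scores)         # one R^2 value per fold
print(scores.mean())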

Visualizing class labels in self-organizing map plot or iris dataset

I am trying to produce a visualization of the SOM mapping for the Iris dataset ( https://archive.ics.uci.edu/ml/datasets/Iris).
My code so far:
from sklearn.datasets import load_iris
from mvpa2.suite import *
import pandas as pd
import numpy as np
df = pd.read_csv(filepath_or_buffer='data/iris.data', header=None, sep=',')
df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end
# split the data table into feature data x and class labels y
x = df.ix[:,0:4].values # the first 4 columns are the features
y = df.ix[:,4].values # the last column is the class label
t = np.zeros(len(y), dtype=int)
t[y == 'Iris-setosa'] = 0
t[y == 'Iris-versicolor'] = 1
t[y == 'Iris-virginica'] = 2
som = SimpleSOMMapper((240, 320), 100, learning_rate=0.05)
som.train(x)
pl.imshow(som.K, origin='lower')
mapped = som(x)
for i, m in enumerate(mapped):
    pl.text(m[1], m[0], t[i], ha='center', va='center',
            bbox=dict(facecolor='white', alpha=0.5, lw=0))
pl.show()
which produces this mapping:
Is there any way to customize the palette so it looks nicer, like this one (taken from https://github.com/JustGlowing/minisom)?
Basically I am trying to use a nicer palette (perhaps with fewer colors) and mark the class labels in a nicer way.
Thank you.
I will answer my own question: it turns out that I forgot to slice my data:
pl.imshow(som.K[:,:,0], origin='lower')
Everything looks fine now:
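For anyone also after a different palette (the original goal), a hedged tweak is to pass a Matplotlib colormap to imshow; the colormap name here is just an example:
# Any Matplotlib colormap can be used for the sliced SOM weights
pl.imshow(som.K[:, :, 0], origin='lower', cmap='viridis')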