Based on some rules, how to expand data in Pandas? - pandas

Please forgive my English; I hope I can explain this clearly.
Assume we have this data:
>>> data = {'Span':[3,3.5], 'Low':[6.2,5.16], 'Medium':[4.93,4.1], 'High':[3.68,3.07], 'VeryHigh':[2.94,2.45], 'ExtraHigh':[2.48,2.06], '0.9':[4.9,3.61], '1.5':[3.23,2.38], '2':[2.51,1.85]}
>>> df = pd.DataFrame(data)
>>> df
Span Low Medium High VeryHigh ExtraHigh 0.9 1.5 2
0 3.0 6.20 4.93 3.68 2.94 2.48 4.90 3.23 2.51
1 3.5 5.16 4.10 3.07 2.45 2.06 3.61 2.38 1.85
I want to get this data:
Span Wind Snow MaxSpacing
0 3.0 Low 0.0 6.20
1 3.0 Medium 0.0 4.93
2 3.0 High 0.0 3.68
3 3.0 VeryHigh 0.0 2.94
4 3.0 ExtraHigh 0.0 2.48
5 3.0 0 0.9 4.90
6 3.0 0 1.5 3.23
7 3.0 0 2.0 2.51
8 3.5 Low 0.0 5.16
9 3.5 Medium 0.0 4.10
10 3.5 High 0.0 3.07
11 3.5 VeryHigh 0.0 2.45
12 3.5 ExtraHigh 0.0 2.06
13 3.5 0 0.9 3.61
14 3.5 0 1.5 2.38
15 3.5 0 2.0 1.85
The rules that apply to df:
Each Span expands over the combinations of Wind and Snow to give the MaxSpacing.
Wind and Snow are mutually exclusive: when Wind is one of 'Low', 'Medium', 'High', 'VeryHigh', 'ExtraHigh', Snow is zero; when Snow is one of 0.9, 1.5, 2, Wind is zero.
Please help. Thank you.

Use DataFrame.melt to unpivot, then sort by the original index. Create the Snow column with to_numeric plus Series.fillna inside DataFrame.insert, and finally set Wind to 0 on the rows whose melted name was numeric:
df = (df.melt('Span', ignore_index=False, var_name='Wind', value_name='MaxSpacing')
        .sort_index(ignore_index=True))
# Column names that parse as numbers are Snow levels; the rest are Wind levels.
s = pd.to_numeric(df['Wind'], errors='coerce')
df.insert(2, 'Snow', s.fillna(0))
df.loc[s.notna(), 'Wind'] = 0
print(df)
Span Wind Snow MaxSpacing
0 3.0 Low 0.0 6.20
1 3.0 Medium 0.0 4.93
2 3.0 High 0.0 3.68
3 3.0 VeryHigh 0.0 2.94
4 3.0 ExtraHigh 0.0 2.48
5 3.0 0 0.9 4.90
6 3.0 0 1.5 3.23
7 3.0 0 2.0 2.51
8 3.5 Low 0.0 5.16
9 3.5 Medium 0.0 4.10
10 3.5 High 0.0 3.07
11 3.5 VeryHigh 0.0 2.45
12 3.5 ExtraHigh 0.0 2.06
13 3.5 0 0.9 3.61
14 3.5 0 1.5 2.38
15 3.5 0 2.0 1.85
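The ignore_index=False is what makes the later sort work: melt keeps each row's original index (0 or 1 here), so the stable sort_index groups all variables of the same Span together. A quick check, re-running melt on the original wide df (a sketch):
melted = df.melt('Span', ignore_index=False, var_name='Wind', value_name='MaxSpacing')
print(melted.index.tolist())  # [0, 1, 0, 1, ...], one pair per melted column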
Alternative solution with DataFrame.set_index and DataFrame.stack:
df = df.set_index('Span').rename_axis('Wind', axis=1).stack().reset_index(name='MaxSpacing')
s = pd.to_numeric(df['Wind'], errors='coerce')
df.insert(2, 'Snow', s.fillna(0))
df.loc[s.notna(), 'Wind'] = 0
print(df)
Span Wind Snow MaxSpacing
0 3.0 Low 0.0 6.20
1 3.0 Medium 0.0 4.93
2 3.0 High 0.0 3.68
3 3.0 VeryHigh 0.0 2.94
4 3.0 ExtraHigh 0.0 2.48
5 3.0 0 0.9 4.90
6 3.0 0 1.5 3.23
7 3.0 0 2.0 2.51
8 3.5 Low 0.0 5.16
9 3.5 Medium 0.0 4.10
10 3.5 High 0.0 3.07
11 3.5 VeryHigh 0.0 2.45
12 3.5 ExtraHigh 0.0 2.06
13 3.5 0 0.9 3.61
14 3.5 0 1.5 2.38
15 3.5 0 2.0 1.85
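For completeness, another way to get the same table is to melt the Wind and Snow column groups separately and concatenate; a sketch, starting from the original wide df and hard-coding the two column groups from the question:
wind_cols = ['Low', 'Medium', 'High', 'VeryHigh', 'ExtraHigh']
snow_cols = ['0.9', '1.5', '2']

# Each half carries a constant 0 for the other variable.
wind = df.melt('Span', wind_cols, var_name='Wind', value_name='MaxSpacing').assign(Snow=0.0)
snow = df.melt('Span', snow_cols, var_name='Snow', value_name='MaxSpacing').assign(Wind=0)
snow['Snow'] = snow['Snow'].astype(float)

out = (pd.concat([wind, snow], ignore_index=True)
         .loc[:, ['Span', 'Wind', 'Snow', 'MaxSpacing']]
         .sort_values('Span', kind='mergesort', ignore_index=True))  # stable sort keeps Wind rows before Snow rows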

Related

Make bins coarser on pandas dataframe, and sum in counting columns

I have a dataframe with a variable (E), where the value in the dataframe is the left edge of the bin, and a set of occupancies for each bin (n) (and the uncertainty squared (v)). At the moment, these are binned from 200 to 2000 in steps of 100 (usually), then binned 2000 to +inf. However these bins are very fine for the plotting I need to perform, and I need to rebin these into 200, 300, 400, 600, 1000, +inf.
Key Point: Because I am reading several sets of data like this from a source, not all my dataframes have entries e.g. for bin 600-700, i.e. some rows will be missing from one dataframe, while another may have entries for them. I need to rebin and sum n and v based on the new bins, while accounting for the fact that my dataframes aren't "regular".
Here's an example dataframe:
E n v
0 200.0 26.0 1.3
1 300.0 56.0 2.2
2 400.0 62.0 2.5
3 500.0 55.0 2.2
4 600.0 24.0 1.7
5 800.0 12.0 1.3
6 900.0 8.0 0.9
7 1000.0 4.0 0.6
8 1100.0 1.0 0.2
And here is my desired output:
E n v
0 200.0 26.0 1.3
1 300.0 56.0 2.2
2 400.0 117.0 4.7
3 600.0 44.0 3.9
4 1000.0 5.0 0.8
Any help or guidance is much appreciated.
You can use pd.cut with agg:
import numpy as np

s = (df.groupby(pd.cut(df.E, [200, 300, 400, 600, 1000, np.inf], right=False))
       .agg({'E': 'first', 'n': 'sum', 'v': 'sum'}))
s.E = s.index.map(lambda x: x.left)  # use each interval's left edge as the new E
s.reset_index(drop=True, inplace=True)
s
E n v
0 200.0 26.0 1.3
1 300.0 56.0 2.2
2 400.0 117.0 4.7
3 600.0 44.0 3.9
4 1000.0 5.0 0.8
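Because the grouper is categorical, empty target bins still show up in the groupby result, and sum() returns 0.0 for them, so the "irregular" dataframes from the question are handled naturally; a sketch making that explicit:
import numpy as np

bins = [200, 300, 400, 600, 1000, np.inf]
out = df.groupby(pd.cut(df.E, bins, right=False), observed=False)[['n', 'v']].sum()
out.insert(0, 'E', out.index.map(lambda iv: iv.left))  # the left edge survives even for empty bins
out = out.reset_index(drop=True)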

Creating variables and calculating the difference between these variables and selected variable - Pandas

I've got this data frame:
ID Date X 123_P 456_P 789_P choice
A 07/16/2019 . 1.5 1.8 1.6 123
A 07/17/2019 . 2.0 2.1 4.5 789
A 07/18/2019 . 3.0 3.2 NaN 0
A 07/19/2019 . 2.1 2.2 4.5 456
B 07/16/2019 . 1.5 1.8 1.6 789
B 07/17/2019 . 2.0 2.1 4.5 0
B 07/18/2019 . 3.0 3.2 NaN 123
I want to create new variables: 123_PD, 456_PD, 789_PD (I have much more variables than this example, so it shouldn't be done manually).
The new variables will hold the differences between the current 123_P, 456_P, 789_P values and the previously chosen value.
I mean, if the choice in the previous row was "123", the differences are taken against the "123_P" value from that previous row.
Notes:
A value of 0 means there is no new choice, so the differences refer to the last choice made for this ID.
It should be done for each ID separately.
Expected result:
ID Date X 123_P 456_P 789_P choice 123_PD 456_PD 789_PD
A 07/16/2019 . 1.5 1.8 1.6 123 0 0 0
A 07/17/2019 . 2.0 2.1 4.5 789 0.5 0.6 3.0
A 07/18/2019 . 3.0 3.2 NaN 0 -1.5 -1.3 NaN
A 07/19/2019 . 2.1 2.2 4.5 456 -2.4 -2.3 0
B 07/16/2019 . 1.5 1.8 1.6 789 0 0 0
B 07/17/2019 . 2.0 2.1 4.5 0 0.4 0.5 2.9
B 07/18/2019 . 3.0 3.2 NaN 123 1.4 1.6 NaN
First create a helper DataFrame with a new 0_P column of missing values, and change the choice values so they match the column names:
df1 = (df.join(pd.DataFrame({'0_P': np.nan}, index=df.index))
         .assign(choice=df['choice'].astype(str) + '_P'))
print(df1)
ID Date X 123_P 456_P 789_P choice 0_P
0 A 07/16/2019 . 1.5 1.8 1.6 123_P NaN
1 A 07/17/2019 . 2.0 2.1 4.5 789_P NaN
2 A 07/18/2019 . 3.0 3.2 NaN 0_P NaN
3 A 07/19/2019 . 2.1 2.2 4.5 456_P NaN
4 B 07/16/2019 . 1.5 1.8 1.6 789_P NaN
5 B 07/17/2019 . 2.0 2.1 4.5 0_P NaN
6 B 07/18/2019 . 3.0 3.2 NaN 123_P NaN
Then use DataFrame.lookup to pick each row's chosen value into an array, convert it to a Series, shift it, and forward fill missing values per ID group in a lambda function:
s = (pd.Series(df1.lookup(df1.index, df1['choice']), index=df.index)
       .groupby(df['ID'])
       .apply(lambda x: x.shift().ffill()))
print(s)
0 NaN
1 1.5
2 4.5
3 4.5
4 NaN
5 1.6
6 1.6
dtype: float64
Then select the value columns, subtract s with DataFrame.sub, append a suffix with DataFrame.add_suffix, and finally set the first row of each ID (found via the inverted Series.duplicated mask) to 0:
df2 = df.iloc[:, -4:-1].sub(s, axis=0).add_suffix('D')  # the 123_P, 456_P, 789_P columns
df2.loc[~df1['ID'].duplicated(), :] = 0
print(df2)
123_PD 456_PD 789_PD
0 0.0 0.0 0.0
1 0.5 0.6 3.0
2 -1.5 -1.3 NaN
3 -2.4 -2.3 0.0
4 0.0 0.0 0.0
5 0.4 0.5 2.9
6 1.4 1.6 NaN
df = df.join(df2)
print(df)
ID Date X 123_P 456_P 789_P choice 123_PD 456_PD 789_PD
0 A 07/16/2019 . 1.5 1.8 1.6 123 0.0 0.0 0.0
1 A 07/17/2019 . 2.0 2.1 4.5 789 0.5 0.6 3.0
2 A 07/18/2019 . 3.0 3.2 NaN 0 -1.5 -1.3 NaN
3 A 07/19/2019 . 2.1 2.2 4.5 456 -2.4 -2.3 0.0
4 B 07/16/2019 . 1.5 1.8 1.6 789 0.0 0.0 0.0
5 B 07/17/2019 . 2.0 2.1 4.5 0 0.4 0.5 2.9
6 B 07/18/2019 . 3.0 3.2 NaN 123 1.4 1.6 NaN
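Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; an equivalent using numpy fancy indexing (a sketch against the df1 built above):
import numpy as np

cols = df1.columns.get_indexer(df1['choice'])     # chosen column position per row
vals = df1.to_numpy()[np.arange(len(df1)), cols]  # one value per row
s = (pd.Series(vals, index=df1.index, dtype=float)
       .groupby(df1['ID'])
       .apply(lambda x: x.shift().ffill()))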
This should do the needful:
df[['123_PD', '456_PD', '789_PD']] = df[['123_P', '456_P', '789_P']] - df[['123_P', '456_P', '789_P']].shift(1)
df.loc[0, ['123_PD', '456_PD', '789_PD']] = 0  # .loc avoids chained assignment on the first row
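This ignores the per-ID grouping and the choice logic from the question, though. A per-ID variant of the same plain-diff idea (a sketch starting from the original df, still not the choice-based calculation):
value_cols = ['123_P', '456_P', '789_P']
diffs = df.groupby('ID')[value_cols].diff().add_suffix('D')
diffs.loc[~df['ID'].duplicated()] = 0  # the first row of each ID starts at 0
df = df.join(diffs)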

Python: group by with sum special columns and keep the initial rows too

I have a df:
ID Car Jan17 Jun18 Dec18 Apr19
0 Nissan 0.0 1.7 3.7 0.0
1 Porsche 10.0 0.0 2.8 3.5
2 Golf 0.0 1.7 3.0 2.0
3 Tiguan 1.0 0.0 3.0 5.2
4 Touareg 0.0 0.0 3.0 4.2
5 Mercedes 0.0 0.0 0.0 7.2
6 Passat 0.0 3.0 0.0 0.0
I would like to change the values in row #6 (Passat) by adding the values from rows #2, #3 and #4 (Golf, Tiguan, Touareg), while keeping rows #2, #3 and #4 as they are.
Passat includes Golf, Touareg and Tiguan, which is why I need to add the Golf, Touareg and Tiguan rows to the Passat row.
I tried to do it with the following code:
car_list = ['Golf', 'Tiguan', 'Touareg']
for car in car_list:
    df['Car'][df['Car'] == car] = 'Passat'
and after that I used groupby on Car with the sum() function:
df1 = df.groupby(['Car'])['Jan17', 'Jun18', 'Dec18', 'Apr19'].sum().reset_index()
As a result, df1 no longer has the initial (Golf, Tiguan, Touareg) rows, so this way is wrong.
Expected result is df1:
ID Car Jan17 Jun18 Dec18 Apr19
0 Nissan 0.0 1.7 3.7 0.0
1 Porsche 10.0 0.0 2.8 3.5
2 Golf 0.0 1.7 3.0 2.0
3 Tiguan 1.0 0.0 3.0 5.2
4 Touareg 0.0 0.0 3.0 4.2
5 Mercedes 0.0 0.0 0.0 7.2
6 Passat 1.0 4.7 9.0 11.4
I'd appreciate any idea. Thanks!
First we use .isin to select the correct cars, then .filter to select the month value columns, and finally we sum those values into the variable sums.
Then we select the Passat row and add the sums to that row:
sums = df[df['Car'].isin(car_list)].filter(regex=r'\w{3}\d{2}').sum()
df.loc[df['Car'].eq('Passat'), 'Jan17':] += sums
Output
ID Car Jan17 Jun18 Dec18 Apr19
0 0 Nissan 0.0 1.7 3.7 0.0
1 1 Porsche 10.0 0.0 2.8 3.5
2 2 Golf 0.0 1.7 3.0 2.0
3 3 Tiguan 1.0 0.0 3.0 5.2
4 4 Touareg 0.0 0.0 3.0 4.2
5 5 Mercedes 0.0 0.0 0.0 7.2
6 6 Passat 1.0 4.7 9.0 11.4
An alternative is to wrap the logic in a function:
car_list = ['Golf', 'Tiguan', 'Touareg', 'Passat']

def updateCarInfoBySum(df, car_list, name, id):
    req = df[df['Car'].isin(car_list)].copy()
    req.set_index(['Car', 'ID'], inplace=True)
    req.loc[('new_value', '000'), :] = req.sum(axis=0)
    req.reset_index(inplace=True)
    req = req[req.Car != name]
    req.loc[req['Car'] == 'new_value', 'Car'] = name
    req.loc[req['ID'] == '000', 'ID'] = id
    req.set_index(['Car', 'ID'], inplace=True)
    df_final = df.copy()
    df_final.set_index(['Car', 'ID'], inplace=True)
    df_final.update(req)
    return df_final
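A usage sketch for the function above (the name and id arguments are the target row's Car and ID values; the temporary '000' placeholder assumes it does not collide with a real ID):
df1 = updateCarInfoBySum(df, car_list, 'Passat', 6)
print(df1)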

Pandas calculate difference on two DataFrames with column and multi-indices

I have 2 DataFrames, df1 is:
Jan17 Jun18 Dec18 Apr19
ID Name
0 Nick 10.0 1.7 3.7 0.0
1 Jack 10.0 0.0 2.8 3.5
2 Fox 10.0 1.7 0.0 0.0
3 Rex 1.0 0.0 3.0 4.2
the second DataFrame - df2 is:
Jan17 Jun18 Dec18 Apr19
ID Name
0 Nick 5.0 1.7 2.0 0.0
1 Jack 6.0 0.0 0.8 3.5
2 Fox 8.0 5.0 0.0 0.0
3 Rex 1.0 0.0 1.0 4.2
4 Snack 3.1 9.0 2.8 4.4
5 Yosee 4.3 0.0 0.0 4.3
6 Petty 0.5 1.3 2.8 3.5
7 Lind 3.6 7.5 2.8 4.3
8 Korr 0.6 1.5 1.8 2.3
Result is df3:
ID Name Jan17 Jun18 Dec18 Apr19
0 Nick 5.0 0 1.7 0
1 Jack 4.0 0 2.0 0
2 Fox 2.0 -3.3 0 0
3 Rex 0 0 2.0 0
How can I calculate the differences between the columns of df1 and df2, aligned on the [ID, Name] multi-index of df1, and save the result to df3?
I'd appreciate any idea. Thanks!
Just subtract; subtraction aligns on the index. You can reindex df2 to df1's index first so df2's extra rows do not turn into all-NaN rows:
# df1 - df2.reindex(df1.index)
df1.sub(df2.reindex(df1.index))
Jan17 Jun18 Dec18 Apr19
ID Name
0 Nick 5.0 0.0 1.7 0.0
1 Jack 4.0 0.0 2.0 0.0
2 Fox 2.0 -3.3 0.0 0.0
3 Rex 0.0 0.0 2.0 0.0
Note that the reason I went for reindex over loc was to avoid KeyErrors if df2 is missing some of df1's index values.
In that case a plain reindex produces NaNs, so you can pass fill_value=0 to reindex to ensure df1's value is returned unchanged (rather than NaN):
df2.reindex(df1.index, fill_value=0)
Jan17 Jun18 Dec18 Apr19
ID Name
0 Nick 5.0 1.7 2.0 0.0
1 Jack 6.0 0.0 0.8 3.5
2 Fox 8.0 5.0 0.0 0.0
3 Rex 1.0 0.0 1.0 4.2
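With the fill in place, the full difference becomes (a sketch):
df3 = df1.sub(df2.reindex(df1.index, fill_value=0))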
You can simply do
df1 - df2.loc[df1.index]
Output:
Jan17 Jun18 Dec18 Apr19
ID Name
0 Nick 5.0 0.0 1.7 0.0
1 Jack 4.0 0.0 2.0 0.0
2 Fox 2.0 -3.3 0.0 0.0
3 Rex 0.0 0.0 2.0 0.0
Try something new:
sum(df1.align(0 - df2, join='left'))
Out[282]:
Jan17 Jun18 Dec18 Apr19
ID Name
0 Nick 5.0 0.0 1.7 0.0
1 Jack 4.0 0.0 2.0 0.0
2 Fox 2.0 -3.3 0.0 0.0
3 Rex 0.0 0.0 2.0 0.0
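This works because DataFrame.align returns a 2-tuple of aligned frames and the built-in sum adds the tuple's two elements; spelled out (a sketch):
left, right = df1.align(-df2, join='left')  # right is -df2 restricted to df1's index
df3 = left + right                          # df1 + (-df2) == df1 - df2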

Logistic regression with pandas and sklearn: Input contains NaN, infinity or a value too large for dtype('float64')

I want to run a logistic regression on the pandas data frame I read in.
However, when the predict method is called, it says: "Input contains NaN, infinity or a value too large for dtype('float64')"
My code is below (note that there should be 10 numerical and 4 categorical variables):
import pandas as pd
import io
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
x = pd.to_numeric(heart['diagnosis'])
heart['diagnosis'] = (x > 1).astype(int)
heart_train, heart_test, goal_train, goal_test = train_test_split(heart.loc[:,'age':'thal'], heart.loc[:,'diagnosis'], test_size=0.3, random_state=0)
clf = LogisticRegression()
clf.fit(heart_train, goal_train)
heart_test_results = clf.predict(heart_test) #From here it is broken
print(clf.get_params(clf))
print(clf.score(heart_train,goal_train))
The data frame info is as follows (print(heart.info())):
RangeIndex: 271 entries, 0 to 270
Data columns (total 14 columns):
age 270 non-null object
sex 270 non-null object
chestpain 270 non-null category
restBP 270 non-null object
chol 270 non-null object
sugar 270 non-null object
ecg 270 non-null category
maxhr 270 non-null object
angina 270 non-null object
dep 270 non-null object
exercise 270 non-null category
fluor 270 non-null object
thal 270 non-null category
diagnosis 271 non-null int32
dtypes: category(4), int32(1), object(9)
memory usage: 21.4+ KB
None
Does anyone know what I am missing here?
Thanks in advance!!
I guess the reason for this error is how you parse the data:
In [116]: %paste
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
## -- End pasted text --
In [117]: heart
Out[117]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
270 None None NaN None None None NaN None None None NaN None NaN None
[271 rows x 14 columns]
NOTE: pay attention to the very last row, which consists entirely of NaNs and Nones.
Try this simplified way instead:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
In [118]: df = pd.read_csv(url, sep=r'\s+', header=None, names=header_row)
In [119]: df
Out[119]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
260 58.0 0.0 3.0 120.0 340.0 0.0 0.0 172.0 0.0 0.0 1.0 0.0 3.0 1
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
[270 rows x 14 columns]
Also pay attention to the automatically parsed (guessed) dtypes - pd.read_csv() will do all the necessary conversions for you:
In [120]: df.dtypes
Out[120]:
age float64
sex float64
chestpain float64
restBP float64
chol float64
sugar float64
ecg float64
maxhr float64
angina float64
dep float64
exercise float64
fluor float64
thal float64
diagnosis int64
dtype: object
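If you prefer to keep the manual parsing, dropping the trailing all-NaN row and assigning the to_numeric results back (the original code discards them) would also fix the error; a sketch:
heart = heart.dropna(how='all')  # the trailing '\n' in the download produced one empty row
num_cols = ['age', 'restBP', 'chol', 'sugar', 'maxhr', 'angina', 'dep', 'fluor']
heart[num_cols] = heart[num_cols].apply(pd.to_numeric)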
I would suspect the train_test_split step.
I would suggest turning your X and y into numpy arrays to avoid this problem; that usually solves it:
X = heart.loc[:, 'age':'thal'].as_matrix()
y = heart.loc[:, 'diagnosis'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
and then fit with
clf.fit(X_train, y_train)
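On current library versions the same idea needs updated imports and methods; a sketch:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed

X = heart.loc[:, 'age':'thal'].to_numpy()  # .as_matrix() was removed in pandas 1.0
y = heart['diagnosis'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)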