Given a file with the extension .data, I have read it with pd.read_fwf("./input.data", sep=",", header=None):
Out:
0
0 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3...
1 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5...
2 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6...
3 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5...
4 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4...
... ...
292 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2...
293 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2...
294 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4...
295 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2...
296 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0...
How can I add the following column names to it? Thanks.
col_names = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
Update:
pd.read_fwf("./input.data", names=col_names)
Out:
age sex cp restbp chol fbs restecg thalach exang oldpeak slope ca thal num
0 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
292 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
293 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
294 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
295 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
296 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0... NaN NaN NaN NaN NaN NaN
If you check read_fwf:
Read a table of fixed-width formatted lines into DataFrame.
So if there is a separator, use read_csv instead:
col_names = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv("input.data", names=col_names)
print (df)
age sex cp restbp chol fbs restecg thalach exang oldpeak \
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3
1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5
2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6
3 37.0 1.0 3.0 130.0 250.0 0.0 0.0 187.0 0.0 3.5
4 41.0 0.0 2.0 130.0 204.0 0.0 2.0 172.0 0.0 1.4
.. ... ... ... ... ... ... ... ... ... ...
292 57.0 0.0 4.0 140.0 241.0 0.0 0.0 123.0 1.0 0.2
293 45.0 1.0 1.0 110.0 264.0 0.0 0.0 132.0 0.0 1.2
294 68.0 1.0 4.0 144.0 193.0 1.0 0.0 141.0 0.0 3.4
295 57.0 1.0 4.0 130.0 131.0 0.0 0.0 115.0 1.0 1.2
296 57.0 0.0 2.0 130.0 236.0 0.0 2.0 174.0 0.0 0.0
slope ca thal num
0 3.0 0.0 6.0 0
1 2.0 3.0 3.0 1
2 2.0 2.0 7.0 1
3 3.0 0.0 3.0 0
4 1.0 0.0 3.0 0
.. ... ... ... ...
292 2.0 0.0 7.0 1
293 2.0 0.0 7.0 1
294 2.0 2.0 7.0 1
295 2.0 1.0 7.0 1
296 2.0 1.0 3.0 1
[297 rows x 14 columns]
Just do a read_csv without header and pass col_names:
df = pd.read_csv('input.data', header=None, names=col_names)
Output (head):
age sex cp restbp chol fbs restecg thalach exang oldpeak slope ca thal num
-- ----- ----- ---- -------- ------ ----- --------- --------- ------- --------- ------- ---- ------ -----
0 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
1 67 1 4 160 286 0 2 108 1 1.5 2 3 3 1
2 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
3 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
4 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
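If the file has already been loaded into a single string column via read_fwf, as in the question, you can also recover the columns by splitting on commas; a minimal sketch, assuming the question's file layout:
import pandas as pd

col_names = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# the fwf read yields one string column; split it on commas into 14 columns
df = pd.read_fwf("./input.data", header=None)
df = df[0].str.split(",", expand=True)
df.columns = col_names
df = df.apply(pd.to_numeric)  # the split produces strings, so convert back to numbers
read_csv remains the cleaner fix, since it parses and types the fields in one step.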
I have the following code I'm working with in a Jupyter Notebook:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/dothemathonthatone/maps/master/fruchtbarkeit.csv'
fruchtdf = pd.read_csv(url)
fruchtdf = fruchtdf.set_axis(['year', 'regional_schlüssel', 'kreis_frei', 'gender', 'nationality', 'u15', '15', '16', '17', '18', '19', '20',
'21', '22', '23', '24', '25', '26', '27', '28', '29', '30','31', '32', '33', '34', '35', '47', '48', '49', 'Ü5',
'38', '39', '36', '37', '40', '41', '42', '43', '44', '45', '46', 'uknwn'], axis=1, inplace=False)
fruchtdf['15']= fruchtdf['u15']+fruchtdf['15']
fruchtdf.drop(['u15'], axis=1, inplace=True)
The current output looks like this:
year regional_schlüssel kreis_frei gender nationality 15 16 17 18 19 ... 36 37 40 41 42 43 44 45 46 uknwn
0 2000 5111000 Düsseldorf, krfr. Stadt man Deutsche --1 7 9 20 24 ... 13 1 - 1 1 - - - - -
1 2000 5111000 Düsseldorf, krfr. Stadt man Ausländerin --- 3 3 7 17 ... 4 3 1 - - - - - 1 -
2 2000 5111000 Düsseldorf, krfr. Stadt woman Deutsche --1 4 7 14 20 ... 9 4 3 2 1 - - - - -
3 2000 5111000 Düsseldorf, krfr. Stadt woman Ausländerin --- 1 5 10 17 ... 2 4 1 1 1 - - - - -
4 2000 5111000 Düsseldorf, krfr. Stadt man Deutsche --1 9 14 30 45 ... 3 1 - - - - - - - -
I am trying to aggregate columns 15 through uknwn, grouping by nationality, year, and regional_schlüssel:
year regional_schlüssel nationality gender 15 16 17 ... unknown
2000 5111000 Deutsche man 1 4 4 7
2000 5111000 Deutsche woman 1 4 4 3
2000 5111000 Auslande man 1 4 4 7
2000 5111000 Auslande woman 1 4 4 3
desired output:
year regional_schlüssel nationality gender 15 16 17 ... unknown
2000 5111000 Deutsche man 2 8 8 10
2000 5111000 Auslande man 2 8 8 10
Then I would like to make 2 new sets of columns, one for each nationality: De15, De16, ..., Deunknown, and Aus15, Aus16, ..., Ausunknown
year regional_schlüssel nationality gender De15 De16 De17 ... Deunknown Aus15 Aus16 Aus17 Ausunknown
2000 5111000 Deutsche man 2 8 8 10 2 8 8 10
Is this possible?
First convert all columns from the 6th column (position 5) onward to numeric; values that cannot be parsed become missing values (NaN):
fruchtdf.iloc[:, 5:] = fruchtdf.iloc[:, 5:].apply(pd.to_numeric, errors='coerce')
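To see why errors='coerce' matters here: the raw columns contain placeholders such as '-' that cannot be parsed as numbers, and coercion turns them into NaN, which the later sum simply skips. A small self-contained demonstration:
import pandas as pd

s = pd.Series(['7', '-', '--1', '13'])
print(pd.to_numeric(s, errors='coerce'))
# 0     7.0
# 1     NaN
# 2     NaN
# 3    13.0
# dtype: float64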
Then aggregate by sum:
sumcol = ['year', 'regional_schlüssel','nationality', 'gender']
df = fruchtdf.groupby(sumcol).sum()
For the first DataFrame, convert the MultiIndex to columns with reset_index():
df1 = df.reset_index()
print (df1)
year regional_schlüssel nationality gender 15 16 17 18 \
0 2000 5111000 Ausländerin man 0.0 4.0 10.0 26.0
1 2000 5111000 Ausländerin woman 0.0 4.0 10.0 30.0
2 2000 5111000 Deutsche man -2.0 16.0 23.0 50.0
3 2000 5111000 Deutsche woman -2.0 9.0 22.0 39.0
4 2000 5113000 Ausländerin man 0.0 1.0 7.0 11.0
... ... ... ... ... ... ... ...
29155 2017 5978036 Deutsche woman 0.0 0.0 0.0 1.0
29156 2017 5978040 Ausländerin man 0.0 0.0 0.0 0.0
29157 2017 5978040 Ausländerin woman 0.0 0.0 0.0 0.0
29158 2017 5978040 Deutsche man -1.0 1.0 1.0 1.0
29159 2017 5978040 Deutsche woman 0.0 0.0 1.0 2.0
19 20 ... 36 37 40 41 42 43 44 45 46 uknwn
0 44.0 68.0 ... 5.0 3.0 2.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0
1 45.0 66.0 ... 5.0 5.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
2 69.0 75.0 ... 16.0 2.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
3 54.0 82.0 ... 15.0 5.0 5.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0
4 20.0 22.0 ... 2.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
29155 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
29156 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
29157 0.0 2.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
29158 2.0 3.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
29159 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[29160 rows x 41 columns]
And for the second, reshape with DataFrame.unstack and build the column names from the first three letters of the nationality using f-strings:
df2 = df.unstack(2)
df2.columns = df2.columns.map(lambda x: f'{x[1][:3]}{x[0]}')
df2 = df2.reset_index()
print (df2)
year regional_schlüssel gender Aus15 Deu15 Aus16 Deu16 Aus17 \
0 2000 5111000 man 0.0 -2.0 4.0 16.0 10.0
1 2000 5111000 woman 0.0 -2.0 4.0 9.0 10.0
2 2000 5113000 man 0.0 -1.0 1.0 8.0 7.0
3 2000 5113000 woman -1.0 0.0 3.0 6.0 6.0
4 2000 5114000 man 0.0 0.0 0.0 2.0 0.0
... ... ... ... ... ... ... ...
14575 2017 5978032 woman 0.0 0.0 0.0 0.0 0.0
14576 2017 5978036 man 0.0 -2.0 0.0 0.0 1.0
14577 2017 5978036 woman 0.0 0.0 0.0 0.0 0.0
14578 2017 5978040 man 0.0 -1.0 0.0 1.0 0.0
14579 2017 5978040 woman 0.0 0.0 0.0 0.0 0.0
Deu17 Aus18 ... Aus43 Deu43 Aus44 Deu44 Aus45 Deu45 Aus46 \
0 23.0 26.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0
1 22.0 30.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 16.0 11.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 17.0 8.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 4.0 5.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ...
14575 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14576 1.0 2.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14577 0.0 2.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14578 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0
14579 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Deu46 Ausuknwn Deuuknwn
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
... ... ...
14575 0.0 0.0 0.0
14576 0.0 0.0 0.0
14577 0.0 0.0 0.0
14578 0.0 0.0 0.0
14579 0.0 0.0 0.0
[14580 rows x 77 columns]
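The renaming step works because after unstack the columns form a MultiIndex of (original column, nationality) tuples, and the lambda prepends the first three letters of the nationality to the column name. A tiny illustration with made-up tuples:
import pandas as pd

cols = pd.MultiIndex.from_tuples([('15', 'Deutsche'),
                                  ('15', 'Ausländerin'),
                                  ('uknwn', 'Deutsche')])
print(cols.map(lambda x: f'{x[1][:3]}{x[0]}').tolist())
# ['Deu15', 'Aus15', 'Deuuknwn']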
I have a Pandas Series of solar radiation values with the index being timestamps with a one minute resolution. E.g.:
index solar_radiation
2019-01-01 08:01 0
2019-01-01 08:02 10
2019-01-01 08:03 15
...
2019-01-10 23:59 0
I would like to convert this to a table (DataFrame) where each hour is averaged into one column, e.g.:
index 00 01 02 03 04 05 06 ... 23
2019-01-01 0 0 0 0 0 3 10 ... 0
2019-01-02 0 0 0 0 0 4 12 ... 0
....
2019-01-10 0 0 0 0 0 6 24... 0
I have tried to look into groupby, but there I am only able to group the hours into one combined bin, not one per day... Any hints or suggestions on how I can achieve this with groupby, or should I just brute-force it and iterate over each hour?
If I understand you correctly, you want to resample hourly. Then we can make a MultiIndex with date and time, and unstack the time level into columns:
df = df.resample('H').mean()
df.set_index([df.index.date, df.index.time], inplace=True)
df = df.unstack(level=[1])
Which gives us the following output:
print(df)
solar_radiation \
00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00
2019-01-01 NaN NaN NaN NaN NaN NaN
2019-01-02 NaN NaN NaN NaN NaN NaN
2019-01-03 NaN NaN NaN NaN NaN NaN
2019-01-04 NaN NaN NaN NaN NaN NaN
2019-01-05 NaN NaN NaN NaN NaN NaN
2019-01-06 NaN NaN NaN NaN NaN NaN
2019-01-07 NaN NaN NaN NaN NaN NaN
2019-01-08 NaN NaN NaN NaN NaN NaN
2019-01-09 NaN NaN NaN NaN NaN NaN
2019-01-10 NaN NaN NaN NaN NaN NaN
... \
06:00:00 07:00:00 08:00:00 09:00:00 ... 14:00:00 15:00:00
2019-01-01 NaN NaN 8.333333 NaN ... NaN NaN
2019-01-02 NaN NaN NaN NaN ... NaN NaN
2019-01-03 NaN NaN NaN NaN ... NaN NaN
2019-01-04 NaN NaN NaN NaN ... NaN NaN
2019-01-05 NaN NaN NaN NaN ... NaN NaN
2019-01-06 NaN NaN NaN NaN ... NaN NaN
2019-01-07 NaN NaN NaN NaN ... NaN NaN
2019-01-08 NaN NaN NaN NaN ... NaN NaN
2019-01-09 NaN NaN NaN NaN ... NaN NaN
2019-01-10 NaN NaN NaN NaN ... NaN NaN
\
16:00:00 17:00:00 18:00:00 19:00:00 20:00:00 21:00:00 22:00:00
2019-01-01 NaN NaN NaN NaN NaN NaN NaN
2019-01-02 NaN NaN NaN NaN NaN NaN NaN
2019-01-03 NaN NaN NaN NaN NaN NaN NaN
2019-01-04 NaN NaN NaN NaN NaN NaN NaN
2019-01-05 NaN NaN NaN NaN NaN NaN NaN
2019-01-06 NaN NaN NaN NaN NaN NaN NaN
2019-01-07 NaN NaN NaN NaN NaN NaN NaN
2019-01-08 NaN NaN NaN NaN NaN NaN NaN
2019-01-09 NaN NaN NaN NaN NaN NaN NaN
2019-01-10 NaN NaN NaN NaN NaN NaN NaN
23:00:00
2019-01-01 NaN
2019-01-02 NaN
2019-01-03 NaN
2019-01-04 NaN
2019-01-05 NaN
2019-01-06 NaN
2019-01-07 NaN
2019-01-08 NaN
2019-01-09 NaN
2019-01-10 0.0
[10 rows x 24 columns]
Note that I got a lot of NaN values since you provided only a couple of rows of data.
Solutions for one column DataFrame:
Aggregate the mean by the DatetimeIndex, using DatetimeIndex.floor to remove the time component and DatetimeIndex.hour to get the hours; reshape with Series.unstack and add missing values with DataFrame.reindex:
#if necessary
#df.index = pd.to_datetime(df.index)
rng = pd.date_range(df.index.min().floor('D'), df.index.max().floor('D'))
df1 = (df.groupby([df.index.floor('D'), df.index.hour])['solar_radiation']
.mean()
.unstack(fill_value=0)
.reindex(columns=range(0, 24), fill_value=0, index=rng))
Another solution with Grouper by hour: replace missing values with 0 and reshape with Series.unstack:
#if necessary
#df.index = pd.to_datetime(df.index)
df1 = df.groupby(pd.Grouper(freq='H'))[['solar_radiation']].mean().fillna(0)
df1 = df1.set_index([df1.index.date, df1.index.hour])['solar_radiation'].unstack(fill_value=0)
print (df1)
0 1 2 3 4 5 6 7 8 9 ... 14 \
2019-01-01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.333333 0.0 ... 0.0
2019-01-02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-08 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-09 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
2019-01-10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0
15 16 17 18 19 20 21 22 23
2019-01-01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-08 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-09 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2019-01-10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[10 rows x 24 columns]
Solutions for Series with DatetimeIndex:
rng = pd.date_range(df.index.min().floor('D'), df.index.max().floor('D'))
df1 = (df.groupby([df.index.floor('D'), df.index.hour])
.mean()
.unstack(fill_value=0)
.reindex(columns=range(0, 24), fill_value=0, index=rng))
df1 = df.groupby(pd.Grouper(freq='H')).mean().to_frame('new').fillna(0)
df1 = df1.set_index([df1.index.date, df1.index.hour])['new'].unstack(fill_value=0)
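To make the Series variants concrete, here is a minimal end-to-end sketch with synthetic minute data (the values are invented for illustration):
import pandas as pd

idx = pd.date_range('2019-01-01 08:01', periods=5, freq='min')
s = pd.Series([0, 10, 15, 20, 5], index=idx, name='solar_radiation')

out = (s.groupby([s.index.floor('D'), s.index.hour])
        .mean()
        .unstack(fill_value=0)
        .reindex(columns=range(24), fill_value=0))
print(out)  # one row per day, one column per hour; hour 8 holds the mean 10.0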
I want to run the following model (logistic regression) on the pandas data frame I read.
However, when the predict method is called, it fails with: "Input contains NaN, infinity or a value too large for dtype('float64')"
My code is (note that there must be 10 numerical and 4 categorical variables):
import pandas as pd
import io
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split  # in modern scikit-learn: sklearn.model_selection
from sklearn.metrics import accuracy_score
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
x = pd.to_numeric(heart['diagnosis'])
heart['diagnosis'] = (x > 1).astype(int)
heart_train, heart_test, goal_train, goal_test = train_test_split(heart.loc[:,'age':'thal'], heart.loc[:,'diagnosis'], test_size=0.3, random_state=0)
clf = LogisticRegression()
clf.fit(heart_train, goal_train)
heart_test_results = clf.predict(heart_test) #From here it is broken
print(clf.get_params(clf))
print(clf.score(heart_train,goal_train))
The data frame info is as follows (print(heart.info())):
RangeIndex: 271 entries, 0 to 270
Data columns (total 14 columns):
age 270 non-null object
sex 270 non-null object
chestpain 270 non-null category
restBP 270 non-null object
chol 270 non-null object
sugar 270 non-null object
ecg 270 non-null category
maxhr 270 non-null object
angina 270 non-null object
dep 270 non-null object
exercise 270 non-null category
fluor 270 non-null object
thal 270 non-null category
diagnosis 271 non-null int32
dtypes: category(4), int32(1), object(9)
memory usage: 21.4+ KB
None
Does anyone know what I am missing here?
Thanks in advance!
I guess the reason for this error is how you parse this data:
In [116]: %paste
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
## -- End pasted text --
In [117]: heart
Out[117]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
270 None None NaN None None None NaN None None None NaN None NaN None
[271 rows x 14 columns]
NOTE: pay attention to the very last row, which is all Nones and NaNs.
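The stray row comes from the trailing newline in the downloaded text: splitting on '\n' leaves a final empty string, which .split() turns into an empty list and the DataFrame constructor into a row of Nones. A quick demonstration, plus one way to filter such rows if you keep the manual parsing:
text = "1.0 2.0\n3.0 4.0\n"                       # note the trailing newline
rows = [line.split() for line in text.split('\n')]
print(rows)                                       # [['1.0', '2.0'], ['3.0', '4.0'], []]
rows = [r for r in rows if r]                     # drop empty rows before building the DataFrame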
Try this simplified way instead:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
In [118]: df = pd.read_csv(url, sep='\s+', header=None, names=header_row)
In [119]: df
Out[119]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
260 58.0 0.0 3.0 120.0 340.0 0.0 0.0 172.0 0.0 0.0 1.0 0.0 3.0 1
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
[270 rows x 14 columns]
Also pay attention to the automatically parsed (guessed) dtypes - pd.read_csv() will do all the necessary conversions for you:
In [120]: df.dtypes
Out[120]:
age float64
sex float64
chestpain float64
restBP float64
chol float64
sugar float64
ecg float64
maxhr float64
angina float64
dep float64
exercise float64
fluor float64
thal float64
diagnosis int64
dtype: object
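With the clean frame from read_csv, the rest of the pipeline should run without the NaN error; a sketch assuming a modern scikit-learn, where train_test_split lives in sklearn.model_selection:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
header_row = ['age', 'sex', 'chestpain', 'restBP', 'chol', 'sugar', 'ecg', 'maxhr',
              'angina', 'dep', 'exercise', 'fluor', 'thal', 'diagnosis']
df = pd.read_csv(url, sep=r'\s+', header=None, names=header_row)
df['diagnosis'] = (df['diagnosis'] > 1).astype(int)  # binarize the target as in the question

X_train, X_test, y_train, y_test = train_test_split(
    df.loc[:, 'age':'thal'], df['diagnosis'], test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges on unscaled data
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))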
I would suspect the train_test_split step.
I would suggest turning your X and y into NumPy arrays to avoid this problem; that usually solves it.
X = heart.loc[:, 'age':'thal'].to_numpy()  # .as_matrix() was removed in pandas 1.0
y = heart.loc[:, 'diagnosis'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
and then fit with
clf.fit(X_train, y_train)