Pandas - create new column with the sum of last N values of another column with groupby

I have this df:
round_id team opponent home_dummy GC GP P
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0 0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0 1
2 5.0 Flamengo Botafogo 1.0 1.0 1.0 1
3 6.0 Flamengo Santos 0.0 0.0 1.0 3
4 7.0 Flamengo Bahia 0.0 3.0 5.0 3
5 8.0 Flamengo Fortaleza 1.0 1.0 2.0 3
6 9.0 Flamengo Fluminense 0.0 1.0 2.0 3
7 10.0 Flamengo Ceará 0.0 2.0 0.0 0
8 3.0 Flamengo Coritiba 0.0 0.0 1.0 3
9 11.0 Flamengo Goiás 1.0 1.0 2.0 3
10 13.0 Flamengo Athlético-PR 1.0 1.0 3.0 3
11 14.0 Flamengo Sport 1.0 0.0 3.0 3
12 15.0 Flamengo Vasco 0.0 1.0 2.0 3
13 16.0 Flamengo Bragantino 1.0 1.0 1.0 1
14 17.0 Flamengo Corinthians 0.0 1.0 5.0 3
15 18.0 Flamengo Internacional 0.0 2.0 2.0 1
16 19.0 Flamengo São Paulo 1.0 4.0 1.0 0
17 12.0 Flamengo Palmeiras 0.0 1.0 1.0 1
18 2.0 Flamengo Atlético-GO 0.0 3.0 0.0 0
19 20.0 Flamengo Atlético-MG 0.0 4.0 0.0 0
and I've added a new column that holds the sum of the last N values of another column, like so:
df['Last_5'] = df.P.rolling(5,min_periods=1).sum().shift().fillna(0)
which gives me:
round_id team opponent home_dummy GC GP P last_5
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0 0 0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0 1 0
2 5.0 Flamengo Botafogo 1.0 1.0 1.0 1 1
3 6.0 Flamengo Santos 0.0 0.0 1.0 3 2
4 7.0 Flamengo Bahia 0.0 3.0 5.0 3 5
5 8.0 Flamengo Fortaleza 1.0 1.0 2.0 3 8
6 9.0 Flamengo Fluminense 0.0 1.0 2.0 3 11
7 10.0 Flamengo Ceará 0.0 2.0 0.0 0 13
8 3.0 Flamengo Coritiba 0.0 0.0 1.0 3 12
9 11.0 Flamengo Goiás 1.0 1.0 2.0 3 12
10 13.0 Flamengo Athlético-PR 1.0 1.0 3.0 3 12
11 14.0 Flamengo Sport 1.0 0.0 3.0 3 12
12 15.0 Flamengo Vasco 0.0 1.0 2.0 3 12
13 16.0 Flamengo Bragantino 1.0 1.0 1.0 1 15
14 17.0 Flamengo Corinthians 0.0 1.0 5.0 3 13
15 18.0 Flamengo Internacional 0.0 2.0 2.0 1 13
16 19.0 Flamengo São Paulo 1.0 4.0 1.0 0 11
17 12.0 Flamengo Palmeiras 0.0 1.0 1.0 1 8
18 2.0 Flamengo Atlético-GO 0.0 3.0 0.0 0 6
19 20.0 Flamengo Atlético-MG 0.0 4.0 0.0 0 5
But let's say I have many teams in the same dataframe:
round_id team opponent home_dummy GC GP \
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0
2 5.0 Flamengo Botafogo 1.0 1.0 1.0
3 6.0 Flamengo Santos 0.0 0.0 1.0
4 7.0 Flamengo Bahia 0.0 3.0 5.0
.. ... ... ... ... ... ...
395 15.0 Atlético-GO Bragantino 1.0 1.0 2.0
396 16.0 Atlético-GO Santos 0.0 0.0 1.0
397 17.0 Atlético-GO Athlético-PR 1.0 1.0 1.0
398 9.0 Atlético-GO Vasco 0.0 1.0 2.0
399 20.0 Atlético-GO Corinthians 1.0 1.0 1.0
How do I apply the same calculation and achieve the same result per team, without the last N rows overlapping between teams?

Add the groupby, so the rolling window is computed separately for each team:
df['Last_5'] = df.groupby('team').P.apply(lambda x: x.rolling(5, min_periods=1).sum().shift().fillna(0))
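A slightly more idiomatic variant (a sketch; it should give the same result as the apply above) uses transform, which returns a result already aligned with the original index:
# per-team rolling sum of the previous 5 values of P; transform keeps
# the original row order, so the result can be assigned directly
df['Last_5'] = df.groupby('team')['P'].transform(
    lambda s: s.rolling(5, min_periods=1).sum().shift().fillna(0)
)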

Related

Pandas shows NaN values after fillna with mode

I read data from a csv and fillna with mode like this:
df = pd.read_csv(r'C:\\Users\PC\Downloads\File.csv')
df.fillna(df.mode(), inplace=True)
It still shows NaN values, like this:
0 0.0 0.0 4.7 0.0 138.0 0.15 0.15
1 0.0 1.0 3.5 0.0 132.0 0.38 0.18
2 0.0 0.0 4.0 0.0 132.0 0.30 0.11
3 0.0 1.0 3.9 0.0 146.0 0.75 0.37
4 0.0 1.0 3.5 0.0 132.0 0.45 0.22
5 0.0 NaN NaN NaN NaN 0.45 0.22
6 0.0 NaN NaN NaN NaN 0.30 0.11
7 0.0 0.0 4.5 0.0 136.0 NaN NaN
8 0.0 NaN NaN NaN NaN 0.30 0.37
9 0.0 NaN NaN NaN NaN 0.38 0.11
If I fillna with the mean there is no problem. How do I fillna with the mode?
Because DataFrame.mode can return multiple rows when several values tie for the maximum count, select the first row:
print (df.mode())
1 2 3 4 5 6 7
0 0.0 0.0 3.5 0.0 132.0 0.3 0.11
1 NaN 1.0 NaN NaN NaN NaN NaN
df.fillna(df.mode().iloc[0], inplace=True)
print (df)
1 2 3 4 5 6 7
0
0 0.0 0.0 4.7 0.0 138.0 0.15 0.15
1 0.0 1.0 3.5 0.0 132.0 0.38 0.18
2 0.0 0.0 4.0 0.0 132.0 0.30 0.11
3 0.0 1.0 3.9 0.0 146.0 0.75 0.37
4 0.0 1.0 3.5 0.0 132.0 0.45 0.22
5 0.0 0.0 3.5 0.0 132.0 0.45 0.22
6 0.0 0.0 3.5 0.0 132.0 0.30 0.11
7 0.0 0.0 4.5 0.0 136.0 0.30 0.11
8 0.0 0.0 3.5 0.0 132.0 0.30 0.37
9 0.0 0.0 3.5 0.0 132.0 0.38 0.11
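To see why the original call leaves NaNs behind: fillna with a DataFrame aligns on both index and columns, and df.mode() has its own 0..n-1 index, with NaN wherever a column has fewer tied modes. A minimal sketch (made-up values):
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, np.nan, np.nan],
                   'b': [1, 2, np.nan, np.nan]})
print(df.mode())
#      a    b
# 0  1.0  1.0
# 1  NaN  2.0
# fillna with the whole frame only aligns on row labels 0 and 1,
# so the NaNs in rows 2 and 3 survive:
print(df.fillna(df.mode()))
# iloc[0] gives one value per column, which fills every row:
print(df.fillna(df.mode().iloc[0]))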

Pandas - create new column with the sum of last N values of another column

I have this df:
round_id team opponent home_dummy GC GP P
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0 0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0 1
2 5.0 Flamengo Botafogo 1.0 1.0 1.0 1
3 6.0 Flamengo Santos 0.0 0.0 1.0 3
4 7.0 Flamengo Bahia 0.0 3.0 5.0 3
5 8.0 Flamengo Fortaleza 1.0 1.0 2.0 3
6 9.0 Flamengo Fluminense 0.0 1.0 2.0 3
7 10.0 Flamengo Ceará 0.0 2.0 0.0 0
8 3.0 Flamengo Coritiba 0.0 0.0 1.0 3
9 11.0 Flamengo Goiás 1.0 1.0 2.0 3
10 13.0 Flamengo Athlético-PR 1.0 1.0 3.0 3
11 14.0 Flamengo Sport 1.0 0.0 3.0 3
12 15.0 Flamengo Vasco 0.0 1.0 2.0 3
13 16.0 Flamengo Bragantino 1.0 1.0 1.0 1
14 17.0 Flamengo Corinthians 0.0 1.0 5.0 3
15 18.0 Flamengo Internacional 0.0 2.0 2.0 1
16 19.0 Flamengo São Paulo 1.0 4.0 1.0 0
17 12.0 Flamengo Palmeiras 0.0 1.0 1.0 1
18 2.0 Flamengo Atlético-GO 0.0 3.0 0.0 0
19 20.0 Flamengo Atlético-MG 0.0 4.0 0.0 0
Now I'd like to add a column 'last_5', which consists of the sum of the last 5 'P' values, ending up with:
round_id team opponent home_dummy GC GP P last_5
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0 0 0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0 1 0
2 5.0 Flamengo Botafogo 1.0 1.0 1.0 1 1
3 6.0 Flamengo Santos 0.0 0.0 1.0 3 2
4 7.0 Flamengo Bahia 0.0 3.0 5.0 3 5
5 8.0 Flamengo Fortaleza 1.0 1.0 2.0 3 8
6 9.0 Flamengo Fluminense 0.0 1.0 2.0 3 11
7 10.0 Flamengo Ceará 0.0 2.0 0.0 0 13
8 3.0 Flamengo Coritiba 0.0 0.0 1.0 3 12
9 11.0 Flamengo Goiás 1.0 1.0 2.0 3 12
10 13.0 Flamengo Athlético-PR 1.0 1.0 3.0 3 12
11 14.0 Flamengo Sport 1.0 0.0 3.0 3 12
12 15.0 Flamengo Vasco 0.0 1.0 2.0 3 12
13 16.0 Flamengo Bragantino 1.0 1.0 1.0 1 15
14 17.0 Flamengo Corinthians 0.0 1.0 5.0 3 13
15 18.0 Flamengo Internacional 0.0 2.0 2.0 1 13
16 19.0 Flamengo São Paulo 1.0 4.0 1.0 0 11
17 12.0 Flamengo Palmeiras 0.0 1.0 1.0 1 8
18 2.0 Flamengo Atlético-GO 0.0 3.0 0.0 0 6
19 20.0 Flamengo Atlético-MG 0.0 4.0 0.0 0 5
Please note that up to index 4 (N=5), the sum has to use however many previous rows are available (1, 2, 3 or 4).
I have tried:
N = 5
df = df.groupby(df.P // N).sum()
But this does not work.
Let us try
df['Last_5'] = df.P.rolling(5,min_periods=1).sum().shift().fillna(0)
Out[9]:
0 0.0
1 0.0
2 1.0
3 2.0
4 5.0
5 8.0
6 11.0
7 13.0
8 12.0
9 12.0
10 12.0
11 12.0
12 12.0
13 15.0
14 13.0
15 13.0
16 11.0
17 8.0
18 6.0
19 5.0
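Step by step, the chain works like this (a commented sketch of the same one-liner):
# rolling(5, min_periods=1): sum of up to the last 5 values of P,
#   allowing shorter windows at the start instead of producing NaN
# shift(): push everything down one row, so each row only sees
#   the games played before it
# fillna(0): the first row has no previous games, so it gets 0
df['Last_5'] = (
    df['P']
    .rolling(5, min_periods=1)
    .sum()
    .shift()
    .fillna(0)
)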

Groupby aggregate and create new columns from row cells

I have the following code I'm working with in a Jupyter Notebook:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/dothemathonthatone/maps/master/fruchtbarkeit.csv'
fruchtdf = pd.read_csv(url)
fruchtdf = fruchtdf.set_axis(['year', 'regional_schlüssel', 'kreis_frei', 'gender', 'nationality', 'u15', '15', '16', '17', '18', '19', '20',
'21', '22', '23', '24', '25', '26', '27', '28', '29', '30','31', '32', '33', '34', '35', '47', '48', '49', 'Ü5',
'38', '39', '36', '37', '40', '41', '42', '43', '44', '45', '46', 'uknwn'], axis=1, inplace=False)
fruchtdf['15']= fruchtdf['u15']+fruchtdf['15']
fruchtdf.drop(['u15'], axis=1, inplace=True)
year regional_schlüssel kreis_frei gender nationality 15 16 17 18 19 ... 36 37 40 41 42 43 44 45 46 uknwn
0 2000 5111000 Düsseldorf, krfr. Stadt man Deutsche --1 7 9 20 24 ... 13 1 - 1 1 - - - - -
1 2000 5111000 Düsseldorf, krfr. Stadt man Ausländerin --- 3 3 7 17 ... 4 3 1 - - - - - 1 -
2 2000 5111000 Düsseldorf, krfr. Stadt woman Deutsche --1 4 7 14 20 ... 9 4 3 2 1 - - - - -
3 2000 5111000 Düsseldorf, krfr. Stadt woman Ausländerin --- 1 5 10 17 ... 2 4 1 1 1 - - - - -
4 2000 5111000 Düsseldorf, krfr. Stadt man Deutsche --1 9 14 30 45 ... 3 1 - - - - - - - -
I am trying to aggregate columns 15 through uknwn, grouping by nationality, year, and regional_schlüssel:
year regional_schlüssel nationality gender 15 16 17 ... unknown
2000 5111000 Deutsche man 1 4 4 7
2000 5111000 Deutsche woman 1 4 4 3
2000 5111000 Auslande man 1 4 4 7
2000 5111000 Auslande woman 1 4 4 3
desired output:
year regional_schlüssel nationality gender 15 16 17 ... unknown
2000 5111000 Deutsche man 2 8 8 10
2000 5111000 Auslande man 2 8 8 10
Then I would like to make 2 new sets of columns, one for each nationality: De15, De16, ..., Deunknown, and Aus15, Aus16, ..., Ausunknown
year regional_schlüssel nationality gender De15 De16 De17 ... Deunknown Aus15 Aus16 Aus17 Ausunknown
2000 5111000 Deutsche man 2 8 8 10 2 8 8 10
Is this possible?
Convert all columns from the 5th column on to numeric; where no numeric value can be parsed, a missing value is created:
fruchtdf.iloc[:, 5:] = fruchtdf.iloc[:, 5:].apply(pd.to_numeric, errors='coerce')
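For example, the dash placeholders in the raw table cannot be parsed, so they become NaN (which the later sum simply skips):
import pandas as pd
# '-' is not numeric, so errors='coerce' turns it into NaN
print(pd.to_numeric(pd.Series(['7', '-', '1']), errors='coerce'))
# 0    7.0
# 1    NaN
# 2    1.0
# dtype: float64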
Then aggregate sum:
sumcol = ['year', 'regional_schlüssel','nationality', 'gender']
df = fruchtdf.groupby(sumcol).sum()
For the first DataFrame, convert the MultiIndex back into columns with reset_index():
df1 = df.reset_index()
print (df1)
year regional_schlüssel nationality gender 15 16 17 18 \
0 2000 5111000 Ausländerin man 0.0 4.0 10.0 26.0
1 2000 5111000 Ausländerin woman 0.0 4.0 10.0 30.0
2 2000 5111000 Deutsche man -2.0 16.0 23.0 50.0
3 2000 5111000 Deutsche woman -2.0 9.0 22.0 39.0
4 2000 5113000 Ausländerin man 0.0 1.0 7.0 11.0
... ... ... ... ... ... ... ...
29155 2017 5978036 Deutsche woman 0.0 0.0 0.0 1.0
29156 2017 5978040 Ausländerin man 0.0 0.0 0.0 0.0
29157 2017 5978040 Ausländerin woman 0.0 0.0 0.0 0.0
29158 2017 5978040 Deutsche man -1.0 1.0 1.0 1.0
29159 2017 5978040 Deutsche woman 0.0 0.0 1.0 2.0
19 20 ... 36 37 40 41 42 43 44 45 46 uknwn
0 44.0 68.0 ... 5.0 3.0 2.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0
1 45.0 66.0 ... 5.0 5.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
2 69.0 75.0 ... 16.0 2.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
3 54.0 82.0 ... 15.0 5.0 5.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0
4 20.0 22.0 ... 2.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
29155 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
29156 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
29157 0.0 2.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
29158 2.0 3.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
29159 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[29160 rows x 41 columns]
And for the second, reshape with DataFrame.unstack and build the column names from the first three letters of the nationality with f-strings:
df2 = df.unstack(2)
df2.columns = df2.columns.map(lambda x: f'{x[1][:3]}{x[0]}')
df2 = df2.reset_index()
print (df2)
year regional_schlüssel gender Aus15 Deu15 Aus16 Deu16 Aus17 \
0 2000 5111000 man 0.0 -2.0 4.0 16.0 10.0
1 2000 5111000 woman 0.0 -2.0 4.0 9.0 10.0
2 2000 5113000 man 0.0 -1.0 1.0 8.0 7.0
3 2000 5113000 woman -1.0 0.0 3.0 6.0 6.0
4 2000 5114000 man 0.0 0.0 0.0 2.0 0.0
... ... ... ... ... ... ... ...
14575 2017 5978032 woman 0.0 0.0 0.0 0.0 0.0
14576 2017 5978036 man 0.0 -2.0 0.0 0.0 1.0
14577 2017 5978036 woman 0.0 0.0 0.0 0.0 0.0
14578 2017 5978040 man 0.0 -1.0 0.0 1.0 0.0
14579 2017 5978040 woman 0.0 0.0 0.0 0.0 0.0
Deu17 Aus18 ... Aus43 Deu43 Aus44 Deu44 Aus45 Deu45 Aus46 \
0 23.0 26.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0
1 22.0 30.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 16.0 11.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 17.0 8.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 4.0 5.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ...
14575 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14576 1.0 2.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14577 0.0 2.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14578 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0
14579 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Deu46 Ausuknwn Deuuknwn
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
... ... ...
14575 0.0 0.0 0.0
14576 0.0 0.0 0.0
14577 0.0 0.0 0.0
14578 0.0 0.0 0.0
14579 0.0 0.0 0.0
[14580 rows x 77 columns]
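A minimal self-contained sketch of the same reshape on a toy frame (the data is made up for illustration):
import pandas as pd

toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2000],
    'gender': ['man', 'man', 'woman', 'woman'],
    'nationality': ['Deutsche', 'Ausländerin'] * 2,
    '15': [1, 2, 3, 4],
})
g = toy.groupby(['year', 'gender', 'nationality']).sum()
# nationality moves from the row index into the columns
wide = g.unstack('nationality')
# e.g. ('15', 'Deutsche') -> 'Deu15'
wide.columns = wide.columns.map(lambda x: f'{x[1][:3]}{x[0]}')
print(wide.reset_index())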

How do I 'merge' information on a user during different periods in a dataset?

So I'm working with a dataset as an assignment / personal project right now. Basically, I have about 15k entries on about 5k unique IDs and I need to make a simple YES/NO prediction for each ID. Each row is some info on an ID during a certain period (1, 2 or 3) and has 43 attributes.
My question is, what's the best approach in this situation? Should I just merge the 3 periods for each ID into 1 row and have 129 attributes? Is there a better approach? (One way to do that merge is sketched after the example below.) Thanks in advance.
Here's an example of my dataset:
PERIOD ID V_1 V_2 V_3 V_4 V_5 V_6 V_7 V_8 V_9 V_10 V_11 V_12 V_13 V_14 V_15 V_16 V_17 V_18 V_19 V_20 V_21 V_22 V_23 V_24 V_25 V_26 V_27 V_28 V_29 V_30 V_31 V_32 V_33 V_34 V_35 V_36 V_37 V_38 V_39 V_40 V_41 V_42 V_43
0 1 1 27.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 NaN 27.0 2.0 63.48 230.43 226.18 3.92 0.0 0.0 0.33 0.0 0.0 0.0 0.0 92.77 82.12 10.65 0.0 0.0 117.0 112.0 2.0 NaN 35.0 30.0 NaN 0.0 0.0 45.53 1.0550 0.0 0.0 45.53 0.0 0.0
1 2 1 19.0 0.0 NaN 1.0 1.0 0.0 1.0 0.0 NaN 19.0 2.0 NaN 134.75 132.03 2.03 0.0 0.0 0.69 1.0 0.0 0.0 0.0 162.48 162.48 0.00 0.0 NaN 54.0 48.0 2.0 0.0 44.0 44.0 0.0 0.0 0.0 48.00 NaN NaN 0.0 48.00 0.0 0.0
2 3 1 22.0 0.0 0.0 NaN 1.0 0.0 0.0 0.0 0.0 22.0 1.0 21.98 159.08 158.08 1.00 0.0 0.0 0.00 0.0 NaN 0.0 0.0 180.90 180.90 0.00 0.0 0.0 39.0 38.0 1.0 0.0 33.0 33.0 0.0 0.0 NaN 46.59 0.0000 0.0 0.0 46.59 0.0 0.0
3 1 2 NaN NaN 0.0 1.0 1.0 NaN 0.0 NaN 0.0 NaN 4.0 2.20 175.97 164.92 11.00 0.0 0.0 0.05 NaN 0.0 0.0 0.0 281.68 259.63 22.05 NaN 0.0 109.0 103.0 4.0 0.0 152.0 143.0 9.0 0.0 0.0 157.50 3.3075 0.0 0.0 157.50 0.0 0.0
4 2 2 28.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 28.0 8.0 73.93 367.20 339.73 27.47 0.0 0.0 NaN 0.0 0.0 0.0 0.0 504.13 479.53 24.60 0.0 0.0 233.0 222.0 11.0 0.0 288.0 279.0 NaN 0.0 0.0 157.50 3.6400 0.0 0.0 157.50 0.0 0.0
Here's an example of the output:
ID OUTPUT
1 1.0
2 0.0
3 0.0
4 0.0
5 1.0
6 1.0
...
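For the simple merge described above, a sketch (assuming the frame is named df and each ID appears at most once per period; otherwise pivot_table with an aggregation would be needed):
import pandas as pd

# one row per ID; the remaining columns become (variable, period) pairs
wide = df.pivot(index='ID', columns='PERIOD')
# flatten e.g. ('V_1', 1) into 'V_1_p1', giving 43 * 3 = 129 feature columns
wide.columns = [f'{var}_p{period}' for var, period in wide.columns]
wide = wide.reset_index()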

Logistic regression with pandas and sklearn: Input contains NaN, infinity or a value too large for dtype('float64')

I want to run the following model (logistic regression) on the pandas data frame I read.
However, when it reaches the predict method, it fails with: "Input contains NaN, infinity or a value too large for dtype('float64')"
My code is below (note that there should be 10 numerical and 4 categorical variables):
import pandas as pd
import io
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
x = pd.to_numeric(heart['diagnosis'])
heart['diagnosis'] = (x > 1).astype(int)
heart_train, heart_test, goal_train, goal_test = train_test_split(heart.loc[:,'age':'thal'], heart.loc[:,'diagnosis'], test_size=0.3, random_state=0)
clf = LogisticRegression()
clf.fit(heart_train, goal_train)
heart_test_results = clf.predict(heart_test) #From here it is broken
print(clf.get_params(clf))
print(clf.score(heart_train,goal_train))
The data frame info is as follows (print(heart.info())):
RangeIndex: 271 entries, 0 to 270
Data columns (total 14 columns):
age 270 non-null object
sex 270 non-null object
chestpain 270 non-null category
restBP 270 non-null object
chol 270 non-null object
sugar 270 non-null object
ecg 270 non-null category
maxhr 270 non-null object
angina 270 non-null object
dep 270 non-null object
exercise 270 non-null category
fluor 270 non-null object
thal 270 non-null category
diagnosis 271 non-null int32
dtypes: category(4), int32(1), object(9)
memory usage: 21.4+ KB
None
Does anyone know what I am missing here?
Thanks in advance!!
I guess the reason for this error is how you parse the data:
In [116]: %paste
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
## -- End pasted text --
In [117]: heart
Out[117]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
270 None None NaN None None None NaN None None None NaN None NaN None
[271 rows x 14 columns]
NOTE: pay attention to the very last row, which is full of NaNs.
Try doing it this simplified way instead:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
In [118]: df = pd.read_csv(url, sep='\s+', header=None, names=header_row)
In [119]: df
Out[119]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
260 58.0 0.0 3.0 120.0 340.0 0.0 0.0 172.0 0.0 0.0 1.0 0.0 3.0 1
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
[270 rows x 14 columns]
Also pay attention to the automatically parsed (guessed) dtypes: pd.read_csv() will do all the necessary conversions for you:
In [120]: df.dtypes
Out[120]:
age float64
sex float64
chestpain float64
restBP float64
chol float64
sugar float64
ecg float64
maxhr float64
angina float64
dep float64
exercise float64
fluor float64
thal float64
diagnosis int64
dtype: object
I suspect it is the train_test_split step.
I would suggest turning your X and y into numpy arrays to avoid this problem. That usually solves it.
X = heart.loc[:, 'age':'thal'].values
y = heart.loc[:, 'diagnosis'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
and then fit with
clf.fit(X_train, y_train)
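As a side note, sklearn.cross_validation from the question's imports has since been removed from scikit-learn; on current versions the import is:
from sklearn.model_selection import train_test_split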