pandas DataFrame column manipulation using previous row value

I have the pandas DataFrame below:

   color  direction  Total
0   -1.0        1.0    NaN
1    1.0        1.0      0
2    1.0        1.0      0
3    1.0        1.0      0
4   -1.0        1.0    NaN
5    1.0       -1.0    NaN
6    1.0        1.0      0
7    1.0        1.0      0
I am trying to update the Total column based on the following logic:
if df['color'] == 1.0 and df['direction'] == 1.0, then Total should be the previous row's Total + 1; if the previous row's Total is NaN, then 0 + 1.
Note: I tried reading the previous row's Total with df['Total'].shift() + 1, but it didn't work.
Expected DataFrame:

   color  direction  Total
0   -1.0        1.0    NaN
1    1.0        1.0      1
2    1.0        1.0      2
3    1.0        1.0      3
4   -1.0        1.0    NaN
5    1.0       -1.0    NaN
6    1.0        1.0      1
7    1.0        1.0      2

You can create a sub-group key with cumsum over the NaN mask, then group by that key together with color and direction and use cumcount:
df.loc[df.Total.notnull(),'Total'] = df.groupby([df['Total'].isna().cumsum(),df['color'],df['direction']]).cumcount()+1
df
Out[618]:
color direction Total
0 -1.0 1.0 NaN
1 1.0 1.0 1.0
2 1.0 1.0 2.0
3 1.0 1.0 3.0
4 -1.0 1.0 NaN
5 1.0 -1.0 NaN
6 1.0 1.0 1.0
7 1.0 1.0 2.0
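For reference, the answer above can be reproduced end-to-end (the frame is rebuilt here from the question's data):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'color':     [-1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0, 1.0],
    'direction': [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0],
    'Total':     [np.nan, 0, 0, 0, np.nan, np.nan, 0, 0],
})

# The NaN rows delimit the runs: cumsum over the NaN mask labels each run,
# then cumcount numbers the rows within each (run, color, direction) group
df.loc[df.Total.notnull(), 'Total'] = (
    df.groupby([df['Total'].isna().cumsum(), df['color'], df['direction']])
      .cumcount() + 1
)
print(df)
```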


How to get proportions of different observation types per total and per year in pandas

I am not entirely new to data science, but rather novice with pandas.
My data looks like this:
Date Obser_Type
0 2001-01-05 A
1 2002-02-06 A
2 2002-02-06 B
3 2004-03-07 C
4 2005-04-08 B
5 2006-05-09 A
6 2007-06-10 C
7 2007-07-11 B
I would like to get the following output with the proportions for the different kinds of observations as of total (i.e. accumulated from the beginning up to and including the specified year) and within each year:
Year A_%_total B_%_total C_%_total A_%_Year B_%_Year C_%_Year
0 2001 100 0 0 100 0 0
1 2002 67 33 0 50 50 0
2 2004 50 25 25 0 0 100
3 2005 40 40 20 0 100 0
4 2006 50 33 17 100 0 0
5 2007 37.5 37.5 25 0 50 50
I tried various approaches involving groupby, multi-indexing, count, etc., but to no avail; I got either errors or unsatisfying results.
After extensively digging Stack Overflow and the rest of the internet for days, I am stumped.
The medieval way would be a bucket of loops and ifs, but what is the proper way to do this?
I have filled in made-up values for the numbers. I don't know the intended aggregation logic, so I computed a composition ratio by 'Obser_Type' and a composition ratio by year.
1. Add a new column for the year data
2. Aggregate and create a DataFrame
3. Create the composition ratio
4. Aggregate and create a second DataFrame
5. Create the composition ratio
6. Combine the two DataFrames
import pandas as pd
import numpy as np
import io
data = '''
Date Obser_Type Value
0 2001-01-05 A 34
1 2002-02-06 A 39
2 2002-02-06 B 67
3 2004-03-07 C 20
4 2005-04-08 B 29
5 2006-05-09 A 10
6 2007-06-10 C 59
7 2007-07-11 B 43
'''
df = pd.read_csv(io.StringIO(data), sep=' ')
df['Date'] = pd.to_datetime(df['Date'])
df['yyyy'] = df['Date'].dt.year
df1 = df.groupby(['yyyy','Obser_Type'])['Value'].agg(sum).unstack().fillna(0)
df1 = df1.apply(lambda x: x/sum(x), axis=0).rename(columns={'A':'A_%_total','B':'B_%_total','C':'C_%_total'})
df2 = df.groupby(['Obser_Type','yyyy'])['Value'].agg(sum).unstack().fillna(0)
df2 = df2.apply(lambda x: x/sum(x), axis=0)
df2 = df2.unstack().unstack().rename(columns={'A':'A_%_Year','B':'B_%_Year','C':'C_%_Year'})
pd.merge(df1, df2, on='yyyy')
Obser_Type A_%_total B_%_total C_%_total A_%_Year B_%_Year C_%_Year
yyyy
2001 0.409639 0.000000 0.000000 1.000000 0.000000 0.000000
2002 0.469880 0.482014 0.000000 0.367925 0.632075 0.000000
2004 0.000000 0.000000 0.253165 0.000000 0.000000 1.000000
2005 0.000000 0.208633 0.000000 0.000000 1.000000 0.000000
2006 0.120482 0.000000 0.000000 1.000000 0.000000 0.000000
2007 0.000000 0.309353 0.746835 0.000000 0.421569 0.578431
Thank you very much for your answer. However, I probably should have made it more clear that the actual dataframe is much bigger and has many more types of observations than A, B, C, so listing them manually would be inconvenient. My scope here is just the statistics for the different types of observations, not their associated numerical values.
I was able to build something and would like to share:
# convert dates to datetimes
#
df['Date'] = pd.to_datetime(df['Date'])
# get years from the dates
#
df['Year'] = df.Date.dt.year
# get total number of observations per type of observation and year in tabular form
#
grouped = df.groupby(['Year', 'Obser_Type']).count().unstack(1)
Date
Obser_Type A B C
Year
2001 1.0 NaN NaN
2002 1.0 1.0 NaN
2004 NaN NaN 1.0
2005 NaN 1.0 NaN
2006 1.0 NaN NaN
2007 NaN 1.0 1.0
# sum total number of observations per type over all years
#
grouped.loc['Total_Obs_per_Type',:] = grouped.sum(axis=0)
Date
Obser_Type A B C
Year
2001 1.0 NaN NaN
2002 1.0 1.0 NaN
2004 NaN NaN 1.0
2005 NaN 1.0 NaN
2006 1.0 NaN NaN
2007 NaN 1.0 1.0
Total_Obs_per_Type 3.0 3.0 2.0
# at this point the columns have a multiindex
#
grouped.columns
MultiIndex([('Date', 'A'),
('Date', 'B'),
('Date', 'C')],
names=[None, 'Obser_Type'])
# i only needed the second layer which looks like this
#
grouped.columns.get_level_values(1)
Index(['A', 'B', 'C'], dtype='object', name='Obser_Type')
# so i flattened the index
#
grouped.columns = grouped.columns.get_level_values(1)
# now i can easily address the columns
#
grouped.columns
Index(['A', 'B', 'C'], dtype='object', name='Obser_Type')
# create list of columns with observation types
# this refers to columns "A B C"
#
types_list = grouped.columns.values.tolist()
# create list to later access the columns with the cumulative sum of observations per type
# this refers to columns "A_cum B_cum C_cum"
#
types_cum_list = []
# calculate cumulative sum for the different kinds of observations
#
for columnName in types_list:
    # create a new column with a modified name and calculate the cumulative sum of observations for this type
    #
    grouped[columnName+'_cum'] = grouped[columnName].cumsum()
    # put the new column name in the list of columns with the cumulative sum of observations per type
    #
    types_cum_list.append(columnName+'_cum')
# this gives
Obser_Type A B C A_cum B_cum C_cum
Year
2001 1.0 NaN NaN 1.0 NaN NaN
2002 1.0 1.0 NaN 2.0 1.0 NaN
2004 NaN NaN 1.0 NaN NaN 1.0
2005 NaN 1.0 NaN NaN 2.0 NaN
2006 1.0 NaN NaN 3.0 NaN NaN
2007 NaN 1.0 1.0 NaN 3.0 2.0
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0
# create new column with total number of observations for all types of observation within a single year
#
grouped['All_Obs_Y'] = grouped.loc[:,types_list].sum(axis=1)
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y
Year
2001 1.0 NaN NaN 1.0 NaN NaN 1.0
2002 1.0 1.0 NaN 2.0 1.0 NaN 2.0
2004 NaN NaN 1.0 NaN NaN 1.0 1.0
2005 NaN 1.0 NaN NaN 2.0 NaN 1.0
2006 1.0 NaN NaN 3.0 NaN NaN 1.0
2007 NaN 1.0 1.0 NaN 3.0 2.0 2.0
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0
# create new columns with cumulative sum of all kinds observations up to each year
#
grouped['All_Obs_Cum'] = grouped['All_Obs_Y'].cumsum()
# this gives
# sorry i could not work out the formatting and i am not allowed yet to include screenshots
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum
Year
2001 1.0 NaN NaN 1.0 NaN NaN 1.0 1.0
2002 1.0 1.0 NaN 2.0 1.0 NaN 2.0 3.0
2004 NaN NaN 1.0 NaN NaN 1.0 1.0 4.0
2005 NaN 1.0 NaN NaN 2.0 NaN 1.0 5.0
2006 1.0 NaN NaN 3.0 NaN NaN 1.0 6.0
2007 NaN 1.0 1.0 NaN 3.0 2.0 2.0 8.0
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0
# create list of columns with the percentages each type of observation has within the observations of each year
# this refers to columns "A_%_Y B_%_Y C_%_Y"
#
types_percent_Y_list = []
# calculate the percentages each type of observation has within each year
#
for columnName in types_list:
    # calculate percentages
    #
    grouped[columnName+'_%_Y'] = grouped[columnName] / grouped['All_Obs_Y']
    # put the new column name in the list of per-year percentage columns for later access
    #
    types_percent_Y_list.append(columnName+'_%_Y')
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum A_%_Y B_%_Y C_%_Y
Year
2001 1.0 NaN NaN 1.0 NaN NaN 1.0 1.0 1.000 NaN NaN
2002 1.0 1.0 NaN 2.0 1.0 NaN 2.0 3.0 0.500 0.500 NaN
2004 NaN NaN 1.0 NaN NaN 1.0 1.0 4.0 NaN NaN 1.00
2005 NaN 1.0 NaN NaN 2.0 NaN 1.0 5.0 NaN 1.000 NaN
2006 1.0 NaN NaN 3.0 NaN NaN 1.0 6.0 1.000 NaN NaN
2007 NaN 1.0 1.0 NaN 3.0 2.0 2.0 8.0 NaN 0.500 0.50
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0 0.375 0.375 0.25
# replace the NaNs in the types_cum columns, otherwise the calculation of the cumulative percentages in the next step would not work
#
# types_cum_list :
# if there is no observation for e.g. type B in the first year (2001) we put a count of 0 for that year,
# that is, in the first row.
# If there is no observation for type B in a later year (e.g. 2004) the cumulative count of Bs
# from the beginning up to that year does not change in that year, so we replace the NaN there with
# the last non-NaN value preceding it
#
# replace NaNs in first row by 0
#
for columnName in types_cum_list:
    grouped.update(grouped.iloc[:1][columnName].fillna(value=0))
# replace NaNs in later rows with preceding non-NaN value
#
for columnName in types_cum_list:
    grouped[columnName].fillna(method='ffill', inplace=True)
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum A_%_Y B_%_Y C_%_Y
Year
2001 1.0 NaN NaN 1.0 0.0 0.0 1.0 1.0 1.000 NaN NaN
2002 1.0 1.0 NaN 2.0 1.0 0.0 2.0 3.0 0.500 0.500 NaN
2004 NaN NaN 1.0 2.0 1.0 1.0 1.0 4.0 NaN NaN 1.00
2005 NaN 1.0 NaN 2.0 2.0 1.0 1.0 5.0 NaN 1.000 NaN
2006 1.0 NaN NaN 3.0 2.0 1.0 1.0 6.0 1.000 NaN NaN
2007 NaN 1.0 1.0 3.0 3.0 2.0 2.0 8.0 NaN 0.500 0.50
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0 0.375 0.375 0.25
# create list of the columns with the cumulative percentages of the different observation types from the beginning up to that year
# this refers to columns "A_cum_% B_cum_% C_cum_%"
#
types_cum_percent_list = []
# calculate cumulative proportions of different types of observations from beginning up to each year
#
for columnName in types_cum_list:
    # if we had not taken care of the NaNs in the types_cum columns this would produce incorrect numbers
    #
    grouped[columnName+'_%'] = grouped[columnName] / grouped['All_Obs_Cum']
    # put the new column name in its respective list so we can access it conveniently later
    #
    types_cum_percent_list.append(columnName+'_%')
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum A_%_Y B_%_Y C_%_Y A_cum_% B_cum_% C_cum_%
Year
2001 1.0 NaN NaN 1.0 0.0 0.0 1.0 1.0 1.000 NaN NaN 1.000000 0.000000 0.000000
2002 1.0 1.0 NaN 2.0 1.0 0.0 2.0 3.0 0.500 0.500 NaN 0.666667 0.333333 0.000000
2004 NaN NaN 1.0 2.0 1.0 1.0 1.0 4.0 NaN NaN 1.00 0.500000 0.250000 0.250000
2005 NaN 1.0 NaN 2.0 2.0 1.0 1.0 5.0 NaN 1.000 NaN 0.400000 0.400000 0.200000
2006 1.0 NaN NaN 3.0 2.0 1.0 1.0 6.0 1.000 NaN NaN 0.500000 0.333333 0.166667
2007 NaN 1.0 1.0 3.0 3.0 2.0 2.0 8.0 NaN 0.500 0.50 0.375000 0.375000 0.250000
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0 0.375 0.375 0.25 0.375000 0.375000 0.250000
# to conclude i replace the remaining NaNs to make plotting easier
# replace NaNs in columns in types_list
#
# if there is no observation for a type of observation in a year we put a count of 0 for that year
#
for columnName in types_list:
    grouped[columnName].fillna(value=0, inplace=True)
# replace NaNs in columns in types_percent_Y_list
#
# if there is no observation for a type of observation in a year we put a percentage of 0 for that year
#
for columnName in types_percent_Y_list:
    grouped[columnName].fillna(value=0, inplace=True)
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum A_%_Y B_%_Y C_%_Y A_cum_% B_cum_% C_cum_%
Year
2001 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.000 0.000 0.00 1.000000 0.000000 0.000000
2002 1.0 1.0 0.0 2.0 1.0 0.0 2.0 3.0 0.500 0.500 0.00 0.666667 0.333333 0.000000
2004 0.0 0.0 1.0 2.0 1.0 1.0 1.0 4.0 0.000 0.000 1.00 0.500000 0.250000 0.250000
2005 0.0 1.0 0.0 2.0 2.0 1.0 1.0 5.0 0.000 1.000 0.00 0.400000 0.400000 0.200000
2006 1.0 0.0 0.0 3.0 2.0 1.0 1.0 6.0 1.000 0.000 0.00 0.500000 0.333333 0.166667
2007 0.0 1.0 1.0 3.0 3.0 2.0 2.0 8.0 0.000 0.500 0.50 0.375000 0.375000 0.250000
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0 0.375 0.375 0.25 0.375000 0.375000 0.250000
This has the functionality and flexibility I was looking for. But as I am still learning pandas, suggestions for improvement are appreciated.
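As one possible improvement, the manual walkthrough above can be condensed with pd.crosstab, which discovers the observation-type columns automatically and so needs no hand-written column lists. This is a sketch, not the original poster's code; the column-suffix naming is illustrative:

```python
import io
import pandas as pd

data = '''Date,Obser_Type
2001-01-05,A
2002-02-06,A
2002-02-06,B
2004-03-07,C
2005-04-08,B
2006-05-09,A
2007-06-10,C
2007-07-11,B
'''
df = pd.read_csv(io.StringIO(data), parse_dates=['Date'])
df['Year'] = df['Date'].dt.year

# counts per year x type; crosstab discovers the type columns automatically
counts = pd.crosstab(df['Year'], df['Obser_Type'])

# within-year proportions: divide each row by its row sum
per_year = counts.div(counts.sum(axis=1), axis=0)

# cumulative proportions: running counts over the running grand total
cum = counts.cumsum()
per_total = cum.div(cum.sum(axis=1), axis=0)

result = per_total.add_suffix('_%_total').join(per_year.add_suffix('_%_Year'))
print(result.round(2))
```

Because crosstab counts occurrences rather than summing a value column, this matches the "statistics only, no associated numerical values" scope mentioned above.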

pandas - how to select rows based on a conjunction of a non-indexed column?

Consider the following DataFrame -
In [47]: dati
Out[47]:
x y
frame face lmark
1 NaN NaN NaN NaN
300 0.0 1.0 745.0 367.0
2.0 753.0 411.0
3.0 759.0 455.0
2201 0.0 1.0 634.0 395.0
2.0 629.0 439.0
3.0 630.0 486.0
How can we select the rows where dati['x'] > 629.5 for all rows sharing the same value in the 'frame' column? For this example, I would expect the result to be
x y
frame face lmark
300 0.0 1.0 745.0 367.0
2.0 753.0 411.0
3.0 759.0 455.0
because column 'x' of 'frame' 2201, 'lmark' 2.0 is not greater than 629.5.
Use GroupBy.transform with GroupBy.all to test whether all values per group are True, then filter with boolean indexing:
df = dati[(dati['x'] > 629.5).groupby(level=0).transform('all')]
print (df)
x y
frame face lmark
300 0.0 1.0 745.0 367.0
2.0 753.0 411.0
3.0 759.0 455.0

How do I 'merge' information on a user during different periods in a dataset?

So I'm working with a dataset as an assignment / personal project right now. Basically, I have about 15k entries on about 5k unique IDs and I need to make a simple YES/NO prediction on each ID. Each row is some info on an ID during a certain period(1,2 or 3) and has 43 attributes.
My question is, what's the best approach in this situation? Should I just merge the 3 periods for each ID into 1 and have 129 attributes in a row? Is there a better approach? Thanks in advance.
Here's an example of my dataset
PERIOD ID V_1 V_2 V_3 V_4 V_5 V_6 V_7 V_8 V_9 V_10 V_11 V_12 V_13 V_14 V_15 V_16 V_17 V_18 V_19 V_20 V_21 V_22 V_23 V_24 V_25 V_26 V_27 V_28 V_29 V_30 V_31 V_32 V_33 V_34 V_35 V_36 V_37 V_38 V_39 V_40 V_41 V_42 V_43
0 1 1 27.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 NaN 27.0 2.0 63.48 230.43 226.18 3.92 0.0 0.0 0.33 0.0 0.0 0.0 0.0 92.77 82.12 10.65 0.0 0.0 117.0 112.0 2.0 NaN 35.0 30.0 NaN 0.0 0.0 45.53 1.0550 0.0 0.0 45.53 0.0 0.0
1 2 1 19.0 0.0 NaN 1.0 1.0 0.0 1.0 0.0 NaN 19.0 2.0 NaN 134.75 132.03 2.03 0.0 0.0 0.69 1.0 0.0 0.0 0.0 162.48 162.48 0.00 0.0 NaN 54.0 48.0 2.0 0.0 44.0 44.0 0.0 0.0 0.0 48.00 NaN NaN 0.0 48.00 0.0 0.0
2 3 1 22.0 0.0 0.0 NaN 1.0 0.0 0.0 0.0 0.0 22.0 1.0 21.98 159.08 158.08 1.00 0.0 0.0 0.00 0.0 NaN 0.0 0.0 180.90 180.90 0.00 0.0 0.0 39.0 38.0 1.0 0.0 33.0 33.0 0.0 0.0 NaN 46.59 0.0000 0.0 0.0 46.59 0.0 0.0
3 1 2 NaN NaN 0.0 1.0 1.0 NaN 0.0 NaN 0.0 NaN 4.0 2.20 175.97 164.92 11.00 0.0 0.0 0.05 NaN 0.0 0.0 0.0 281.68 259.63 22.05 NaN 0.0 109.0 103.0 4.0 0.0 152.0 143.0 9.0 0.0 0.0 157.50 3.3075 0.0 0.0 157.50 0.0 0.0
4 2 2 28.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 28.0 8.0 73.93 367.20 339.73 27.47 0.0 0.0 NaN 0.0 0.0 0.0 0.0 504.13 479.53 24.60 0.0 0.0 233.0 222.0 11.0 0.0 288.0 279.0 NaN 0.0 0.0 157.50 3.6400 0.0 0.0 157.50 0.0 0.0
Here's an example of an output
ID OUTPUT
1 1.0
2 0.0
3 0.0
4 0.0
5 1.0
6 1.0
...
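The "merge the 3 periods into 1 row" idea from the question can be sketched with a pivot on (ID, PERIOD), which turns each attribute into one column per period (43 attributes x 3 periods = 129 wide columns). This is a toy illustration with two made-up attributes, not the actual dataset:

```python
import pandas as pd

# toy frame: 2 IDs observed in up to 3 periods, 2 attributes each
df = pd.DataFrame({
    'PERIOD': [1, 2, 3, 1, 2],
    'ID':     [1, 1, 1, 2, 2],
    'V_1':    [27.0, 19.0, 22.0, 4.0, 28.0],
    'V_2':    [0.0, 0.0, 0.0, 2.2, 73.9],
})

# one row per ID, one column per (attribute, period) pair
wide = df.pivot(index='ID', columns='PERIOD')

# flatten the MultiIndex columns to names like V_1_p1, V_1_p2, ...
wide.columns = [f'{v}_p{p}' for v, p in wide.columns]
print(wide)
```

IDs missing a period end up with NaN in that period's columns, which the downstream model then has to handle (imputation, or a model that tolerates missing values).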

Sum of NaNs to equal NaN (not zero)

I can add a TOTAL column to this DF using df['TOTAL'] = df.sum(axis=1), and it adds the row elements like this:
col1 col2 TOTAL
0 1.0 5.0 6.0
1 2.0 6.0 8.0
2 0.0 NaN 0.0
3 NaN NaN 0.0
However, I would like the total of the bottom row to be NaN, not zero, like this:
col1 col2 TOTAL
0 1.0 5.0 6.0
1 2.0 6.0 8.0
2 0.0 NaN 0.0
3 NaN NaN NaN
Is there a way I can achieve this in a performant way?
Add parameter min_count=1 to DataFrame.sum:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
df['TOTAL'] = df.sum(axis=1, min_count=1)
print (df)
col1 col2 TOTAL
0 1.0 5.0 6.0
1 2.0 6.0 8.0
2 0.0 NaN 0.0
3 NaN NaN NaN
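A self-contained check of the min_count behavior, with the frame built to match the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 2.0, 0.0, np.nan],
                   'col2': [5.0, 6.0, np.nan, np.nan]})

# min_count=1 requires at least one non-NA value per row,
# so an all-NaN row sums to NaN instead of 0
df['TOTAL'] = df.sum(axis=1, min_count=1)
print(df)
```

Note that row 2 still sums to 0.0: it has one valid value, so min_count=1 is satisfied there.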

Logistic regression with pandas and sklearn: Input contains NaN, infinity or a value too large for dtype('float64')

I want to run the following model (logistic regression) for the pandas data frame I read.
However, when the predict method comes, it says: "Input contains NaN, infinity or a value too large for dtype('float64')"
My code is below. (Note that there should be 10 numerical and 4 categorical variables.)
import pandas as pd
import io
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
x = pd.to_numeric(heart['diagnosis'])
heart['diagnosis'] = (x > 1).astype(int)
heart_train, heart_test, goal_train, goal_test = train_test_split(heart.loc[:,'age':'thal'], heart.loc[:,'diagnosis'], test_size=0.3, random_state=0)
clf = LogisticRegression()
clf.fit(heart_train, goal_train)
heart_test_results = clf.predict(heart_test) #From here it is broken
print(clf.get_params(clf))
print(clf.score(heart_train,goal_train))
The data frame info is as follows (from print(heart.info())):
RangeIndex: 271 entries, 0 to 270
Data columns (total 14 columns):
age 270 non-null object
sex 270 non-null object
chestpain 270 non-null category
restBP 270 non-null object
chol 270 non-null object
sugar 270 non-null object
ecg 270 non-null category
maxhr 270 non-null object
angina 270 non-null object
dep 270 non-null object
exercise 270 non-null category
fluor 270 non-null object
thal 270 non-null category
diagnosis 271 non-null int32
dtypes: category(4), int32(1), object(9)
memory usage: 21.4+ KB
None
Does anyone know what I am missing here?
Thanks in advance!!
I guess the reason for this error is how you parse this data:
In [116]: %paste
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
## -- End pasted text --
In [117]: heart
Out[117]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
270 None None NaN None None None NaN None None None NaN None NaN None
[271 rows x 14 columns]
NOTE: pay attention to the very last row, which is all NaNs.
try to do it this simplified way instead:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
In [118]: df = pd.read_csv(url, sep='\s+', header=None, names=header_row)
In [119]: df
Out[119]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
260 58.0 0.0 3.0 120.0 340.0 0.0 0.0 172.0 0.0 0.0 1.0 0.0 3.0 1
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
[270 rows x 14 columns]
Also pay attention to the automatically parsed (guessed) dtypes: pd.read_csv() will do all the necessary conversions for you:
In [120]: df.dtypes
Out[120]:
age float64
sex float64
chestpain float64
restBP float64
chol float64
sugar float64
ecg float64
maxhr float64
angina float64
dep float64
exercise float64
fluor float64
thal float64
diagnosis int64
dtype: object
I suspect it is the train_test_split step.
I would suggest turning your X and y into NumPy arrays to avoid this problem; that usually solves it.
X = heart.loc[:,'age':'thal'].values
y = heart.loc[:,'diagnosis'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
and then fit:
clf.fit(X_train, y_train)
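Putting the suggestions together, here is a minimal end-to-end sketch. It uses synthetic stand-in data instead of the UCI download (so it runs offline) and the modern sklearn.model_selection import, since sklearn.cross_validation was removed in later scikit-learn versions; the shapes mimic the heart data (13 features, binary target):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation

# stand-in for the parsed heart data: 200 samples, 13 numeric features,
# binary target correlated with the first feature
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 13))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# keyword arguments matter here: passing test_size positionally would make
# train_test_split treat it as another input array and raise an error
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```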