This question already has answers here:
How to Pandas fillna() with mode of column?
(7 answers)
Closed 4 months ago.
I read data from a CSV and fillna with the mode like this:
df = pd.read_csv(r'C:\\Users\PC\Downloads\File.csv')
df.fillna(df.mode(), inplace=True)
It still shows NaN values, like this:
0 0.0 0.0 4.7 0.0 138.0 0.15 0.15
1 0.0 1.0 3.5 0.0 132.0 0.38 0.18
2 0.0 0.0 4.0 0.0 132.0 0.30 0.11
3 0.0 1.0 3.9 0.0 146.0 0.75 0.37
4 0.0 1.0 3.5 0.0 132.0 0.45 0.22
5 0.0 NaN NaN NaN NaN 0.45 0.22
6 0.0 NaN NaN NaN NaN 0.30 0.11
7 0.0 0.0 4.5 0.0 136.0 NaN NaN
8 0.0 NaN NaN NaN NaN 0.30 0.37
9 0.0 NaN NaN NaN NaN 0.38 0.11
If I fillna with the mean there is no problem. How do I fillna with the mode?
Because DataFrame.mode can return multiple rows when several values tie for the maximum count, select the first row:
print (df.mode())
1 2 3 4 5 6 7
0 0.0 0.0 3.5 0.0 132.0 0.3 0.11
1 NaN 1.0 NaN NaN NaN NaN NaN
df.fillna(df.mode().iloc[0], inplace=True)
print (df)
1 2 3 4 5 6 7
0 0.0 0.0 4.7 0.0 138.0 0.15 0.15
1 0.0 1.0 3.5 0.0 132.0 0.38 0.18
2 0.0 0.0 4.0 0.0 132.0 0.30 0.11
3 0.0 1.0 3.9 0.0 146.0 0.75 0.37
4 0.0 1.0 3.5 0.0 132.0 0.45 0.22
5 0.0 0.0 3.5 0.0 132.0 0.45 0.22
6 0.0 0.0 3.5 0.0 132.0 0.30 0.11
7 0.0 0.0 4.5 0.0 136.0 0.30 0.11
8 0.0 0.0 3.5 0.0 132.0 0.30 0.37
9 0.0 0.0 3.5 0.0 132.0 0.38 0.11
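The alignment issue is easy to see on a small, made-up frame (hypothetical column names): `df.mode()` is itself a DataFrame, so `fillna(df.mode())` aligns by row label and only fills NaNs that happen to share an index with a mode row, while `.iloc[0]` produces one mode per column:

```python
import pandas as pd
import numpy as np

# Hypothetical small frame standing in for the CSV data
df = pd.DataFrame({'a': [1.0, 1.0, np.nan, 2.0],
                   'b': [np.nan, 3.0, 3.0, 4.0]})

# df.mode().iloc[0] is a Series with one mode per column, so fillna
# fills every NaN in that column rather than aligning row-by-row.
filled = df.fillna(df.mode().iloc[0])
print(filled)
```

With `fillna(df.mode())` instead, only NaNs sitting on index 0 (the mode rows' labels) would have been filled.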
I have this df:
round_id team opponent home_dummy GC GP P
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0 0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0 1
2 5.0 Flamengo Botafogo 1.0 1.0 1.0 1
3 6.0 Flamengo Santos 0.0 0.0 1.0 3
4 7.0 Flamengo Bahia 0.0 3.0 5.0 3
5 8.0 Flamengo Fortaleza 1.0 1.0 2.0 3
6 9.0 Flamengo Fluminense 0.0 1.0 2.0 3
7 10.0 Flamengo Ceará 0.0 2.0 0.0 0
8 3.0 Flamengo Coritiba 0.0 0.0 1.0 3
9 11.0 Flamengo Goiás 1.0 1.0 2.0 3
10 13.0 Flamengo Athlético-PR 1.0 1.0 3.0 3
11 14.0 Flamengo Sport 1.0 0.0 3.0 3
12 15.0 Flamengo Vasco 0.0 1.0 2.0 3
13 16.0 Flamengo Bragantino 1.0 1.0 1.0 1
14 17.0 Flamengo Corinthians 0.0 1.0 5.0 3
15 18.0 Flamengo Internacional 0.0 2.0 2.0 1
16 19.0 Flamengo São Paulo 1.0 4.0 1.0 0
17 12.0 Flamengo Palmeiras 0.0 1.0 1.0 1
18 2.0 Flamengo Atlético-GO 0.0 3.0 0.0 0
19 20.0 Flamengo Atlético-MG 0.0 4.0 0.0 0
Now I'd like to add a column 'last_5', holding the sum of the previous 5 'P' values, ending up with:
round_id team opponent home_dummy GC GP P last_5
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0 0 0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0 1 0
2 5.0 Flamengo Botafogo 1.0 1.0 1.0 1 1
3 6.0 Flamengo Santos 0.0 0.0 1.0 3 2
4 7.0 Flamengo Bahia 0.0 3.0 5.0 3 5
5 8.0 Flamengo Fortaleza 1.0 1.0 2.0 3 8
6 9.0 Flamengo Fluminense 0.0 1.0 2.0 3 11
7 10.0 Flamengo Ceará 0.0 2.0 0.0 0 13
8 3.0 Flamengo Coritiba 0.0 0.0 1.0 3 12
9 11.0 Flamengo Goiás 1.0 1.0 2.0 3 12
10 13.0 Flamengo Athlético-PR 1.0 1.0 3.0 3 12
11 14.0 Flamengo Sport 1.0 0.0 3.0 3 12
12 15.0 Flamengo Vasco 0.0 1.0 2.0 3 12
13 16.0 Flamengo Bragantino 1.0 1.0 1.0 1 15
14 17.0 Flamengo Corinthians 0.0 1.0 5.0 3 13
15 18.0 Flamengo Internacional 0.0 2.0 2.0 1 11
16 19.0 Flamengo São Paulo 1.0 4.0 1.0 0 8
17 12.0 Flamengo Palmeiras 0.0 1.0 1.0 1 8
18 2.0 Flamengo Atlético-GO 0.0 3.0 0.0 0 6
19 20.0 Flamengo Atlético-MG 0.0 4.0 0.0 0 5
Please note that up to index 4 (n=5), the sum has to cover only the last 1, 2, 3 or 4 rows.
I have tried:
N = 5
df = df.groupby(df.P // N).sum()
But this does not work.
Let us try
df['Last_5'] = df.P.rolling(5,min_periods=1).sum().shift().fillna(0)
Out[9]:
0 0.0
1 0.0
2 1.0
3 2.0
4 5.0
5 8.0
6 11.0
7 13.0
8 12.0
9 12.0
10 12.0
11 12.0
12 12.0
13 15.0
14 13.0
15 13.0
16 11.0
17 8.0
18 6.0
19 5.0
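Step by step, on a shorter hypothetical series: `rolling(5, min_periods=1).sum()` sums up to five values ending at the current row, `shift()` pushes the result down one row so each row only sees earlier values, and `fillna(0)` covers the first row:

```python
import pandas as pd

# Hypothetical points column, shorter than the real one
p = pd.Series([0, 1, 1, 3, 3, 3])

# rolling sum including the current row: 0, 1, 2, 5, 8, 11
# after shift():                       NaN, 0, 1, 2, 5, 8
# after fillna(0):                       0, 0, 1, 2, 5, 8
last_5 = p.rolling(5, min_periods=1).sum().shift().fillna(0)
print(last_5.tolist())
```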
I have this df:
round_id team opponent home_dummy GC GP P
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0 0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0 1
2 5.0 Flamengo Botafogo 1.0 1.0 1.0 1
3 6.0 Flamengo Santos 0.0 0.0 1.0 3
4 7.0 Flamengo Bahia 0.0 3.0 5.0 3
5 8.0 Flamengo Fortaleza 1.0 1.0 2.0 3
6 9.0 Flamengo Fluminense 0.0 1.0 2.0 3
7 10.0 Flamengo Ceará 0.0 2.0 0.0 0
8 3.0 Flamengo Coritiba 0.0 0.0 1.0 3
9 11.0 Flamengo Goiás 1.0 1.0 2.0 3
10 13.0 Flamengo Athlético-PR 1.0 1.0 3.0 3
11 14.0 Flamengo Sport 1.0 0.0 3.0 3
12 15.0 Flamengo Vasco 0.0 1.0 2.0 3
13 16.0 Flamengo Bragantino 1.0 1.0 1.0 1
14 17.0 Flamengo Corinthians 0.0 1.0 5.0 3
15 18.0 Flamengo Internacional 0.0 2.0 2.0 1
16 19.0 Flamengo São Paulo 1.0 4.0 1.0 0
17 12.0 Flamengo Palmeiras 0.0 1.0 1.0 1
18 2.0 Flamengo Atlético-GO 0.0 3.0 0.0 0
19 20.0 Flamengo Atlético-MG 0.0 4.0 0.0 0
and I've added a new column to it with the sum of the last N values of another column, like so:
df['Last_5'] = df.P.rolling(5,min_periods=1).sum().shift().fillna(0)
which gives me:
round_id team opponent home_dummy GC GP P last_5
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0 0 0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0 1 0
2 5.0 Flamengo Botafogo 1.0 1.0 1.0 1 1
3 6.0 Flamengo Santos 0.0 0.0 1.0 3 2
4 7.0 Flamengo Bahia 0.0 3.0 5.0 3 5
5 8.0 Flamengo Fortaleza 1.0 1.0 2.0 3 8
6 9.0 Flamengo Fluminense 0.0 1.0 2.0 3 11
7 10.0 Flamengo Ceará 0.0 2.0 0.0 0 13
8 3.0 Flamengo Coritiba 0.0 0.0 1.0 3 12
9 11.0 Flamengo Goiás 1.0 1.0 2.0 3 12
10 13.0 Flamengo Athlético-PR 1.0 1.0 3.0 3 12
11 14.0 Flamengo Sport 1.0 0.0 3.0 3 12
12 15.0 Flamengo Vasco 0.0 1.0 2.0 3 12
13 16.0 Flamengo Bragantino 1.0 1.0 1.0 1 15
14 17.0 Flamengo Corinthians 0.0 1.0 5.0 3 13
15 18.0 Flamengo Internacional 0.0 2.0 2.0 1 11
16 19.0 Flamengo São Paulo 1.0 4.0 1.0 0 8
17 12.0 Flamengo Palmeiras 0.0 1.0 1.0 1 8
18 2.0 Flamengo Atlético-GO 0.0 3.0 0.0 0 6
19 20.0 Flamengo Atlético-MG 0.0 4.0 0.0 0 5
But let's say I have many teams in the same dataframe:
round_id team opponent home_dummy GC GP /
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0
2 5.0 Flamengo Botafogo 1.0 1.0 1.0
3 6.0 Flamengo Santos 0.0 0.0 1.0
4 7.0 Flamengo Bahia 0.0 3.0 5.0
.. ... ... ... ... ... ...
395 15.0 Atlético-GO Bragantino 1.0 1.0 2.0
396 16.0 Atlético-GO Santos 0.0 0.0 1.0
397 17.0 Atlético-GO Athlético-PR 1.0 1.0 1.0
398 9.0 Atlético-GO Vasco 0.0 1.0 2.0
399 20.0 Atlético-GO Corinthians 1.0 1.0 1.0
How do I apply the same calculation and achieve the same result per Team, without overlapping last N rows between teams?
Add the groupby:
df['Last_5'] = df.groupby('team').P.apply(lambda x : x.rolling(5,min_periods=1).sum().shift().fillna(0))
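A minimal sketch with two toy teams shows the window restarting at the team boundary; this variant uses `transform` instead of `apply`, since `transform` guarantees the result keeps the original row index and aligns straight back onto the frame:

```python
import pandas as pd

# Two hypothetical teams so the window restart is visible
df = pd.DataFrame({'team': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'P':    [3,   1,   0,   0,   3,   3]})

# The groupby keeps the rolling window inside each team, so team B's
# first row starts from 0 instead of seeing team A's points.
df['Last_5'] = df.groupby('team')['P'].transform(
    lambda x: x.rolling(5, min_periods=1).sum().shift().fillna(0))
print(df['Last_5'].tolist())
```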
So I'm working with a dataset as an assignment / personal project right now. Basically, I have about 15k entries on about 5k unique IDs, and I need to make a simple YES/NO prediction for each ID. Each row is some info on an ID during a certain period (1, 2 or 3) and has 43 attributes.
My question is, what's the best approach in this situation? Should I just merge the 3 periods for each ID into 1 and have 129 attributes in a row? Is there a better approach? Thanks in advance.
Here's an example of my dataset:
PERIOD ID V_1 V_2 V_3 V_4 V_5 V_6 V_7 V_8 V_9 V_10 V_11 V_12 V_13 V_14 V_15 V_16 V_17 V_18 V_19 V_20 V_21 V_22 V_23 V_24 V_25 V_26 V_27 V_28 V_29 V_30 V_31 V_32 V_33 V_34 V_35 V_36 V_37 V_38 V_39 V_40 V_41 V_42 V_43
0 1 1 27.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 NaN 27.0 2.0 63.48 230.43 226.18 3.92 0.0 0.0 0.33 0.0 0.0 0.0 0.0 92.77 82.12 10.65 0.0 0.0 117.0 112.0 2.0 NaN 35.0 30.0 NaN 0.0 0.0 45.53 1.0550 0.0 0.0 45.53 0.0 0.0
1 2 1 19.0 0.0 NaN 1.0 1.0 0.0 1.0 0.0 NaN 19.0 2.0 NaN 134.75 132.03 2.03 0.0 0.0 0.69 1.0 0.0 0.0 0.0 162.48 162.48 0.00 0.0 NaN 54.0 48.0 2.0 0.0 44.0 44.0 0.0 0.0 0.0 48.00 NaN NaN 0.0 48.00 0.0 0.0
2 3 1 22.0 0.0 0.0 NaN 1.0 0.0 0.0 0.0 0.0 22.0 1.0 21.98 159.08 158.08 1.00 0.0 0.0 0.00 0.0 NaN 0.0 0.0 180.90 180.90 0.00 0.0 0.0 39.0 38.0 1.0 0.0 33.0 33.0 0.0 0.0 NaN 46.59 0.0000 0.0 0.0 46.59 0.0 0.0
3 1 2 NaN NaN 0.0 1.0 1.0 NaN 0.0 NaN 0.0 NaN 4.0 2.20 175.97 164.92 11.00 0.0 0.0 0.05 NaN 0.0 0.0 0.0 281.68 259.63 22.05 NaN 0.0 109.0 103.0 4.0 0.0 152.0 143.0 9.0 0.0 0.0 157.50 3.3075 0.0 0.0 157.50 0.0 0.0
4 2 2 28.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 28.0 8.0 73.93 367.20 339.73 27.47 0.0 0.0 NaN 0.0 0.0 0.0 0.0 504.13 479.53 24.60 0.0 0.0 233.0 222.0 11.0 0.0 288.0 279.0 NaN 0.0 0.0 157.50 3.6400 0.0 0.0 157.50 0.0 0.0
Here's an example of an output
ID OUTPUT
1 1.0
2 0.0
3 0.0
4 0.0
5 1.0
6 1.0
...
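For what it's worth, one common way to "merge the 3 periods into 1 row" is a pivot: each ID becomes one row and each (variable, period) pair becomes a column. A minimal sketch on made-up data with a single V column (the real frame has 43):

```python
import pandas as pd
import numpy as np

# Toy stand-in: one row per (ID, PERIOD)
df = pd.DataFrame({'PERIOD': [1, 2, 3, 1, 2],
                   'ID':     [1, 1, 1, 2, 2],
                   'V_1':    [27.0, 19.0, 22.0, np.nan, 28.0]})

# One row per ID, one column per (variable, period) combination;
# IDs missing a period simply get NaN in those columns.
wide = df.pivot(index='ID', columns='PERIOD')
wide.columns = [f'{var}_p{per}' for var, per in wide.columns]
print(wide)
```

Whether the wide layout beats, say, aggregating the periods (mean/last per ID) depends on the model and how many IDs are missing periods.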
I want to run the following model (logistic regression) for the pandas data frame I read.
However, when the predict method comes, it says: "Input contains NaN, infinity or a value too large for dtype('float64')"
My code is: (Note that there must exist 10 numerical and 4 categorical variables)
import pandas as pd
import io
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
from sklearn.metrics import accuracy_score
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
x = pd.to_numeric(heart['diagnosis'])
heart['diagnosis'] = (x > 1).astype(int)
heart_train, heart_test, goal_train, goal_test = train_test_split(heart.loc[:,'age':'thal'], heart.loc[:,'diagnosis'], test_size=0.3, random_state=0)
clf = LogisticRegression()
clf.fit(heart_train, goal_train)
heart_test_results = clf.predict(heart_test) #From here it is broken
print(clf.get_params(clf))
print(clf.score(heart_train,goal_train))
The data frame info is as follows (print(heart.info()):
RangeIndex: 271 entries, 0 to 270
Data columns (total 14 columns):
age 270 non-null object
sex 270 non-null object
chestpain 270 non-null category
restBP 270 non-null object
chol 270 non-null object
sugar 270 non-null object
ecg 270 non-null category
maxhr 270 non-null object
angina 270 non-null object
dep 270 non-null object
exercise 270 non-null category
fluor 270 non-null object
thal 270 non-null category
diagnosis 271 non-null int32
dtypes: category(4), int32(1), object(9)
memory usage: 21.4+ KB
None
Do anyone know what I am missing here?
Thanks in advance!!
I guess the reason for this error is how you parse the data:
In [116]: %paste
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
## -- End pasted text --
In [117]: heart
Out[117]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
270 None None NaN None None None NaN None None None NaN None NaN None
[271 rows x 14 columns]
NOTE: pay attention to the very last row, full of NaNs.
Try this simplified approach instead:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
In [118]: df = pd.read_csv(url, sep='\s+', header=None, names=header_row)
In [119]: df
Out[119]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
260 58.0 0.0 3.0 120.0 340.0 0.0 0.0 172.0 0.0 0.0 1.0 0.0 3.0 1
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
[270 rows x 14 columns]
Also note the automatically parsed (guessed) dtypes: pd.read_csv() will do all necessary conversions for you:
In [120]: df.dtypes
Out[120]:
age float64
sex float64
chestpain float64
restBP float64
chol float64
sugar float64
ecg float64
maxhr float64
angina float64
dep float64
exercise float64
fluor float64
thal float64
diagnosis int64
dtype: object
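If you do want to keep the manual split-based parsing, filtering out blank lines before building the DataFrame removes that all-NaN trailing row. A sketch on a simulated response body (the real one comes from requests.get):

```python
import pandas as pd

# Simulated download: the trailing newline is what produced the
# empty final row in the original parse.
s = "70.0 1.0 2\n67.0 0.0 1\n"
s_rows_cols = [row.split() for row in s.split('\n') if row.strip()]
heart = pd.DataFrame(s_rows_cols, columns=['age', 'sex', 'diagnosis'])
print(len(heart))
```

Note the columns are still strings here; you would still need pd.to_numeric (assigned back!) or astype, which is why read_csv is the simpler route.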
I suspect it is the train_test_split step.
I would suggest turning your X and y into NumPy arrays to avoid this problem; that usually solves it.
X = heart.loc[:,'age':'thal'].to_numpy()  # as_matrix() was removed in newer pandas
y = heart.loc[:,'diagnosis'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
and then fit:
clf.fit(X_train, y_train)
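Putting that flow together on synthetic data (the real X and y would come from the heart frame; train_test_split lives in sklearn.model_selection in current scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart features and binary diagnosis
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)  # no NaNs in X, so predict runs cleanly
print(clf.score(X_test, y_test))
```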