I've got a dataframe with some NaNs that I'd like to fill with the column mean values. It works, but after applying the code below the result seems to have been changed to a NumPy array: all values suddenly have many places after the decimal point, and the original column names have been lost and replaced with 0, 1, 2. I know I can recreate/reset all of this, but is it possible to use SimpleImputer without changing the underlying structure/type of the data?
from sklearn.impute import SimpleImputer
import numpy as np

impute = SimpleImputer(missing_values=np.nan, strategy='mean')
impute.fit(dfn)
dfn_mean = impute.transform(dfn)  # returns a plain NumPy array, not a DataFrame
I think you can use a pandas-only solution with DataFrame.fillna and DataFrame.mean, where non-numeric columns are omitted from the means:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, np.nan, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, np.nan, 4],
    'F': list('aaabbb')
})
df = df.fillna(df.mean(numeric_only=True))  # numeric_only is needed on recent pandas with object columns
print(df)
A B C D E F
0 a 4 7.0 1 5.0 a
1 b 5 8.0 3 3.0 a
2 c 4 9.0 5 6.0 a
3 d 5 4.0 7 9.0 b
4 e 5 6.2 1 5.4 b
5 f 4 3.0 0 4.0 b
Your solution can be changed to process only the float columns, selected with DataFrame.select_dtypes:
from sklearn.impute import SimpleImputer

impute = SimpleImputer(missing_values=np.nan, strategy='mean')
c = df.select_dtypes(np.floating).columns
df[c] = impute.fit_transform(df[c])
print(df)
A B C D E F
0 a 4 7.0 1 5.0 a
1 b 5 8.0 3 3.0 a
2 c 4 9.0 5 6.0 a
3 d 5 4.0 7 9.0 b
4 e 5 6.2 1 5.4 b
5 f 4 3.0 0 4.0 b
Or select all numeric columns, but then integer columns are converted to floats:
from sklearn.impute import SimpleImputer

impute = SimpleImputer(missing_values=np.nan, strategy='mean')
c = df.select_dtypes(np.number).columns
df[c] = impute.fit_transform(df[c])
print(df)
A B C D E F
0 a 4.0 7.0 1.0 5.0 a
1 b 5.0 8.0 3.0 3.0 a
2 c 4.0 9.0 5.0 6.0 a
3 d 5.0 4.0 7.0 9.0 b
4 e 5.0 6.2 1.0 5.4 b
5 f 4.0 3.0 0.0 4.0 b
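As for keeping the DataFrame structure end to end: the cast to float is inherent to the mean strategy, but the lost column names are avoidable. If your scikit-learn is 1.2 or newer, set_output(transform='pandas') makes the imputer return a DataFrame with the original index and columns; on older versions you can rebuild the frame by hand. A sketch, assuming dfn holds only numeric columns:
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

impute = SimpleImputer(missing_values=np.nan, strategy='mean')

# Option 1 (scikit-learn >= 1.2): ask for pandas output directly.
dfn_mean = impute.set_output(transform='pandas').fit_transform(dfn)

# Option 2 (any version): wrap the returned NumPy array back into a DataFrame.
arr = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(dfn)
dfn_mean = pd.DataFrame(arr, index=dfn.index, columns=dfn.columns)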
I'm looking for a fast pandas way of labeling sections within a dataframe.
Suppose I have a dataframe column A with some strings in it. I'd like to create a new column B that tags the sections between occurrences of the keyword 'hi' with an incrementing label, like so:
A B
hi
a 1
b 1
hi
d 2
f 2
g 2
hi
Count the 'hi' markers with a cumulative sum, then blank out the marker rows themselves (here identified via the NaNs already present in B):
df.assign(C=df['A'].eq('hi').cumsum().mask(df['B'].isna()))
Out:
A B C
0 hi NaN NaN
1 a 1.0 1.0
2 b 1.0 1.0
3 hi NaN NaN
4 d 2.0 2.0
5 f 2.0 2.0
6 g 2.0 2.0
7 hi NaN NaN
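A minimal self-contained version of the same idea, assuming df starts with only column A, masks on the marker rows directly instead of relying on an existing B:
import pandas as pd

df = pd.DataFrame({'A': ['hi', 'a', 'b', 'hi', 'd', 'f', 'g', 'hi']})
is_marker = df['A'].eq('hi')
df['B'] = is_marker.cumsum().mask(is_marker)  # sections 1, 2, ...; NaN on the 'hi' rows
print(df)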
I can show the per-column counts with df.isnull().sum() and get the max value with df.isnull().sum().max(), but can someone tell me how to get the name of the column with the most NaNs?
Thank you all!
Use Series.idxmax with DataFrame.loc to select the column with the most missing values:
df.loc[:, df.isnull().sum().idxmax()]
If you need to select multiple columns when several share the maximum, compare the Series with its max value:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, np.nan, 5, np.nan, 4],
    'C': [7, 8, 9, np.nan, 2, np.nan],
    'D': [1, np.nan, 5, 7, 1, 0]
})
print(df)
A B C D
0 a 4.0 7.0 1.0
1 b 5.0 8.0 NaN
2 c NaN 9.0 5.0
3 d 5.0 NaN 7.0
4 e NaN 2.0 1.0
5 f 4.0 NaN 0.0
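With this sample data, Series.idxmax alone would return 'B': both B and C have two NaNs, and idxmax breaks the tie by returning the first.
print(df.isnull().sum().idxmax())  # 'B'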
s = df.isnull().sum()
df = df.loc[:, s.eq(s.max())]
print(df)
B C
0 4.0 7.0
1 5.0 8.0
2 NaN 9.0
3 5.0 NaN
4 NaN 2.0
5 4.0 NaN
When unstacking a multi-indexed pandas dataframe, the fillna method does not work.
Here is an example.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 10, (5, 4)),
                  columns=['c1', 'c2', 'c3', 'c4'])
df.iloc[1,2] = np.nan
df.iloc[0,0] = np.nan
df['ind1'] = ['a','a','b','b','c']
df['ind2'] = [1,2,1,2,1]
df = df.set_index(['ind1','ind2'])
print(df)
Now we have df with some NaN values.
c1 c2 c3 c4
ind1 ind2
a 1 NaN 2 1.0 9
2 9.0 5 NaN 7
b 1 1.0 7 2.0 8
2 3.0 6 0.0 2
c 1 9.0 6 6.0 6
Then fillna on the unstacked df does not work:
print(df.unstack().fillna(0))
c1 c2 c3 c4
ind2 1 2 1 2 1 2 1 2
ind1
a 0.0 9.0 2.0 5.0 1.0 0.0 9.0 7.0
b 1.0 3.0 7.0 6.0 2.0 0.0 8.0 2.0
c 9.0 0.0 6.0 NaN 6.0 0.0 6.0 NaN
So is this a pandas bug, or is it intended behavior?
Here is a temporary workaround.
df2 = df.unstack()
df2 = pd.DataFrame(np.nan_to_num(df2.values), index=df2.index, columns=df2.columns)
print(df2)
c1 c2 c3 c4
ind2 1 2 1 2 1 2 1 2
ind1
a 0.0 9.0 2.0 5.0 1.0 0.0 9.0 7.0
b 1.0 3.0 7.0 6.0 2.0 0.0 8.0 2.0
c 9.0 0.0 6.0 0.0 6.0 0.0 6.0 0.0
However, this solution is quite dirty.
Note
The pandas version is 1.1.3; there is no issue with version 1.2.1.
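On the affected version, a less invasive workaround might be to fill the pre-existing NaNs before unstacking and let unstack's own fill_value cover the holes the reshape creates; a sketch, not verified against 1.1.3 specifically:
df2 = df.fillna(0).unstack(fill_value=0)  # fill original NaNs, then fill reshape gaps
print(df2)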
Is it possible for pandas to do something like:
df.groupby("A").transform(pd.rolling_mean,10)
You can do this without transform or apply:
import pandas as pd

df = pd.DataFrame({'grp': ['A']*5 + ['B']*5,
                   'data': [1, 2, 3, 4, 5, 2, 4, 6, 8, 10]})
df.groupby('grp')['data'].rolling(2, min_periods=1).mean()
Output:
grp
A 0 1.0
1 1.5
2 2.5
3 3.5
4 4.5
B 5 2.0
6 3.0
7 5.0
8 7.0
9 9.0
Name: data, dtype: float64
Update per comment:
df = pd.DataFrame({'grp': ['A']*5 + ['B']*5,
                   'data': [1, 2, 3, 4, 5, 2, 4, 6, 8, 10]},
                  index=[*'ABCDEFGHIJ'])
df['avg_2'] = df.groupby('grp')['data'].rolling(2, min_periods=1).mean()\
                .reset_index(level=0, drop=True)
Output:
grp data avg_2
A A 1 1.0
B A 2 1.5
C A 3 2.5
D A 4 3.5
E A 5 4.5
F B 2 2.0
G B 4 3.0
H B 6 5.0
I B 8 7.0
J B 10 9.0
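If you specifically want the transform spelling from the question (pd.rolling_mean itself was removed from pandas long ago), the modern equivalent wraps the rolling call in a lambda and gets index alignment for free; avg_2_t is just a hypothetical column name:
df['avg_2_t'] = df.groupby('grp')['data'].transform(
    lambda s: s.rolling(2, min_periods=1).mean())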
I run Python 2.7.12 (Anaconda 4.1.1, 64-bit), pandas 0.18.1, and IPython 4.2.0 on Windows 7 64-bit.
What would be a quick way of getting a dataframe like
pd.DataFrame([[1, 'a', 1, 'b', 2, 'c', 3, 'd', 4],
              [2, 'e', 5, 'f', 6, 'g', 7],
              [3, 'h', 8, 'i', 9],
              [4, 'j', 10]],
             columns=['ID', 'var1', 'var2', 'newVar1_1', 'newVar1_2',
                      'newVar2_1', 'newVar2_2', 'newVar3_1', 'newVar3_2'])
from
pd.DataFrame([[1, 'a', 1],
              [1, 'b', 2],
              [1, 'c', 3],
              [1, 'd', 4],
              [2, 'e', 5],
              [2, 'f', 6],
              [2, 'g', 7],
              [3, 'h', 8],
              [3, 'i', 9],
              [4, 'j', 10]], columns=['ID', 'var1', 'var2'])
What I would do is group by ID and then iterate over the groupby object, making a new row from each group and appending it to an initially empty dataframe, but this is slow, since in the real case the starting dataframe has several thousand rows.
Any suggestions?
df.set_index(['ID', df.groupby('ID').cumcount()]).unstack().sort_index(axis=1, level=1)
var1 var2 var1 var2 var1 var2 var1 var2
0 0 1 1 2 2 3 3
ID
1 a 1.0 b 2.0 c 3.0 d 4.0
2 e 5.0 f 6.0 g 7.0 None NaN
3 h 8.0 i 9.0 None NaN None NaN
4 j 10.0 None NaN None NaN None NaN
Or, more completely:
d1 = df.set_index(['ID', df.groupby('ID').cumcount()]).unstack().sort_index(axis=1, level=1)
d1.columns = d1.columns.to_series().map('new{0[0]}_{0[1]}'.format)
d1.reset_index()
ID newvar1_0 newvar2_0 newvar1_1 newvar2_1 newvar1_2 newvar2_2 newvar1_3 newvar2_3
0 1 a 1.0 b 2.0 c 3.0 d 4.0
1 2 e 5.0 f 6.0 g 7.0 None NaN
2 3 h 8.0 i 9.0 None NaN None NaN
3 4 j 10.0 None NaN None NaN None NaN
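An equivalent spelling with DataFrame.pivot, in case the set_index/unstack pair feels opaque; cc is just a hypothetical name for the per-ID counter column:
d2 = (df.assign(cc=df.groupby('ID').cumcount())
        .pivot(index='ID', columns='cc')
        .sort_index(axis=1, level=1))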