When unstacking a multi-indexed pandas DataFrame, the fillna method does not work.
Here is an example.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 10, (5, 4)),
                  columns=['c1', 'c2', 'c3', 'c4'])
df.iloc[1,2] = np.nan
df.iloc[0,0] = np.nan
df['ind1'] = ['a','a','b','b','c']
df['ind2'] = [1,2,1,2,1]
df = df.set_index(['ind1','ind2'])
print(df)
Now we have df with some NaN values.
c1 c2 c3 c4
ind1 ind2
a 1 NaN 2 1.0 9
2 9.0 5 NaN 7
b 1 1.0 7 2.0 8
2 3.0 6 0.0 2
c 1 9.0 6 6.0 6
Then applying fillna to the unstacked df does not work:
print(df.unstack().fillna(0))
c1 c2 c3 c4
ind2 1 2 1 2 1 2 1 2
ind1
a 0.0 9.0 2.0 5.0 1.0 0.0 9.0 7.0
b 1.0 3.0 7.0 6.0 2.0 0.0 8.0 2.0
c 9.0 0.0 6.0 NaN 6.0 0.0 6.0 NaN
So is this a bug in pandas, or is this behavior intended?
Here is a temporary workaround.
df2 = df.unstack()
df2 = pd.DataFrame(np.nan_to_num(df2.values), index=df2.index, columns=df2.columns)
print(df2)
c1 c2 c3 c4
ind2 1 2 1 2 1 2 1 2
ind1
a 0.0 9.0 2.0 5.0 1.0 0.0 9.0 7.0
b 1.0 3.0 7.0 6.0 2.0 0.0 8.0 2.0
c 9.0 0.0 6.0 0.0 6.0 0.0 6.0 0.0
However, this solution is quite dirty.
Note
The pandas version is 1.1.3; the issue does not occur with version 1.2.1.
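An alternative workaround, sketched here without having verified it against 1.1.3, is to fill the NaNs before unstacking and let unstack's own fill_value handle the holes the reshape creates (both calls are standard pandas API):
# Pre-existing NaNs are filled first; fill_value then covers the
# positions that only become missing through the reshape, such as
# ('c', 2), which does not exist in the original index.
df2 = df.fillna(0).unstack(fill_value=0)
print(df2)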
Related
I am having trouble appending the later values from column c onto column a within the same DataFrame using pandas. I have tried .append and .concat with ignore_index=True, but neither gives the result I want.
import pandas as pd
d = {'a': [1, 2, 3, None, None], 'b': [7, 8, 9, None, None], 'c': [None, None, None, 5, 6]}
df = pd.DataFrame(d)
df['a'] = df['a'].append(df['c'], ignore_index=True)
print(df)
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 NaN NaN 5.0
4 NaN NaN 6.0
Desired:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
Thank you for updating that; this is what I would do:
df['a'] = df['a'].fillna(df['c'])
print(df)
Output:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
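For what it's worth, Series.combine_first is an equivalent standard-API one-liner that also takes values from c only where a is NaN, and gives the same result on this data:
df['a'] = df['a'].combine_first(df['c'])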
I can count missing values with df.isnull().sum() and get the maximum count with df.isnull().sum().max(),
but can someone tell me how to get the name of the column with the most NaNs?
Thank you all!
Use Series.idxmax with DataFrame.loc to select the column with the most missing values:
df.loc[:, df.isnull().sum().idxmax()]
If you need to select multiple columns that tie for the maximum, compare the Series with its max value:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, np.nan, 5, np.nan, 4],
    'C': [7, 8, 9, np.nan, 2, np.nan],
    'D': [1, np.nan, 5, 7, 1, 0]
})
print(df)
A B C D
0 a 4.0 7.0 1.0
1 b 5.0 8.0 NaN
2 c NaN 9.0 5.0
3 d 5.0 NaN 7.0
4 e NaN 2.0 1.0
5 f 4.0 NaN 0.0
s = df.isnull().sum()
df = df.loc[:, s.eq(s.max())]
print(df)
B C
0 4.0 7.0
1 5.0 8.0
2 NaN 9.0
3 5.0 NaN
4 NaN 2.0
5 4.0 NaN
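As a usage note for the example above: if you only need the column names rather than the column data, the same counts Series answers directly (standard pandas API):
s = df.isnull().sum()
print(s.idxmax())                        # first column with the most NaNs: 'B'
print(s[s.eq(s.max())].index.tolist())   # all tied columns: ['B', 'C']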
I have the following dataframe:
df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
                   [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
                   [0, 2, 2, np.nan, 2, np.nan, 1, 1]])
With output:
0 1 2 3 4 5 6 7
0 0 1 2 4 NaN NaN NaN 1
1 0 1 2 NaN NaN NaN NaN 1
2 0 2 2 NaN 2 NaN 1 1
with dtypes:
df.dtypes
0 int64
1 int64
2 int64
3 float64
4 float64
5 float64
6 float64
7 int64
Then the following rolling summation is applied:
df.rolling(window=7, min_periods=1, axis='columns').sum()
And the output is as follows:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 4.0 4.0 4.0 4.0 4.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 2.0 2.0 3.0 5.0
I notice that the rolling window stops and starts again whenever the dtype of the next column is different.
However, my actual dataframe has all columns of the same object dtype:
df = df.astype('object')
Applying the same rolling sum then gives:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 7.0 7.0 7.0 8.0
1 0.0 1.0 3.0 3.0 3.0 3.0 3.0 4.0
2 0.0 2.0 4.0 4.0 6.0 6.0 7.0 8.0
My desired output, however, stops and starts again whenever a NaN value appears. It would look like:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 NaN NaN NaN 8.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 6.0 NaN 7.0 8.0
I figure there must be a way for NaN values to be skipped by the sum but not overwritten with values obtained from the rolling window.
Anything would help!
A workaround is:
First, record where the NaN values are located:
nan = df.isnull()
Then apply the rolling window:
df = df.rolling(window=7, min_periods=1, axis='columns').sum()
Finally, keep only the values whose original positions were not NaN:
df[~nan]
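Putting the three steps together, here is a self-contained sketch of this workaround (the axis='columns' argument to rolling is deprecated in pandas 2.x, so this assumes an older version; on recent versions the numeric frame already rolls across all columns uniformly, making the astype('object') step unnecessary):
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
                   [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
                   [0, 2, 2, np.nan, 2, np.nan, 1, 1]])

nan = df.isnull()                                                  # remember where the NaNs were
rolled = df.rolling(window=7, min_periods=1, axis='columns').sum()
print(rolled[~nan])                                                # put NaN back at the original positions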
There are three columns of data: the first column is a category id, and the second and third columns have some missing values. I want to group by the id in the first column and then fill the missing values in the third column of each group with the 'ffill' method.
I found a promising idea here: Pandas: filling missing values by weighted average in each group!, but it didn't solve my problem because the output it produced was not what I wanted.
Run the following code to build an example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'sss': [1, np.nan, 3, np.nan, np.nan, np.nan, 2, np.nan, np.nan]})
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN NaN
2 B NaN 3.0
3 B 2.0 NaN
4 B 3.0 NaN
5 B 1.0 NaN
6 C 3.0 2.0
7 C NaN NaN
8 C 3.0 NaN
I want to fill the missing values with the previous value within each group.
Then I ran the following code, but it outputs strange results:
df["sss"] = df.groupby("name").transform(lambda x: x.fillna(axis = 0,method = 'ffill'))
df
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN NaN
3 B 2.0 2.0
4 B 3.0 3.0
5 B 1.0 1.0
6 C 3.0 3.0
7 C NaN 3.0
8 C 3.0 3.0
The result I want is this:
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN 3.0
3 B 2.0 3.0
4 B 3.0 3.0
5 B 1.0 3.0
6 C 3.0 2.0
7 C NaN 2.0
8 C 3.0 2.0
Can someone point out where I am wrong?
I run
Python 2.7.12 (Anaconda 4.1.1, 64-bit), pandas 0.18.1, IPython 4.2.0
on Windows 7 64-bit.
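A likely cause, offered as a sketch rather than a definitive answer: the transform runs over both non-grouping columns, and what ends up assigned to sss is the filled value column rather than the filled sss column, which is why the output looks like value forward-filled. Selecting the column before filling avoids this; on the example data the following reproduces the desired output (groupby transform and Series.fillna both exist in pandas 0.18.1):
# Group by name, then forward-fill only the sss column within each group.
df["sss"] = df.groupby("name")["sss"].transform(lambda x: x.fillna(method='ffill'))
print(df)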
What would be a quick way of getting a dataframe like
pd.DataFrame([[1, 'a', 1, 'b', 2, 'c', 3, 'd', 4],
              [2, 'e', 5, 'f', 6, 'g', 7],
              [3, 'h', 8, 'i', 9],
              [4, 'j', 10]],
             columns=['ID', 'var1', 'var2', 'newVar1_1', 'newVar1_2',
                      'newVar2_1', 'newVar2_2', 'newVar3_1', 'newVar3_2'])
from
pd.DataFrame([[1, 'a', 1],
              [1, 'b', 2],
              [1, 'c', 3],
              [1, 'd', 4],
              [2, 'e', 5],
              [2, 'f', 6],
              [2, 'g', 7],
              [3, 'h', 8],
              [3, 'i', 9],
              [4, 'j', 10]], columns=['ID', 'var1', 'var2'])
What I would do is group by ID and then iterate over the groupby object, making a new row from each group and appending it to an initially empty dataframe, but this is slow, since in the real case the starting dataframe has several thousand rows.
Any suggestions?
df.set_index(['ID', df.groupby('ID').cumcount()]).unstack().sort_index(axis=1, level=1)
var1 var2 var1 var2 var1 var2 var1 var2
0 0 1 1 2 2 3 3
ID
1 a 1.0 b 2.0 c 3.0 d 4.0
2 e 5.0 f 6.0 g 7.0 None NaN
3 h 8.0 i 9.0 None NaN None NaN
4 j 10.0 None NaN None NaN None NaN
Or, more completely:
d1 = df.set_index(['ID', df.groupby('ID').cumcount()]).unstack().sort_index(axis=1, level=1)
d1.columns = d1.columns.to_series().map('new{0[0]}_{0[1]}'.format)
d1.reset_index()
ID newvar1_0 newvar2_0 newvar1_1 newvar2_1 newvar1_2 newvar2_2 newvar1_3 newvar2_3
0 1 a 1.0 b 2.0 c 3.0 d 4.0
1 2 e 5.0 f 6.0 g 7.0 None NaN
2 3 h 8.0 i 9.0 None NaN None NaN
3 4 j 10.0 None NaN None NaN None NaN
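For reference, the same reshape can also be written with pivot by first materializing the cumcount as a column; here is a sketch using a hypothetical helper column named k (assign and pivot are both standard pandas API):
# k numbers the rows within each ID group; pivot then spreads
# var1/var2 across one column pair per value of k.
d2 = df.assign(k=df.groupby('ID').cumcount()).pivot(index='ID', columns='k')
This produces the same (var, k) MultiIndex columns as the unstack above, which can then be flattened and renamed the same way.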