How to get the column with the most NaN values in a pandas DataFrame?

I can show the NaN count per column with df.isnull().sum() and get the maximum count with df.isnull().sum().max(),
but can someone tell me how to get the name of the column with the most NaNs?
Thank you all!

Use Series.idxmax to get the name of the column with the most missing values, and DataFrame.loc to select that column:
df.loc[:, df.isnull().sum().idxmax()]
If several columns tie for the maximum and you need to select all of them, compare the Series of counts with its max value:
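import numpy as np
import pandas as pd

# Sample frame: B and C each contain two NaN values, D contains one
df = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,5,np.nan,5,np.nan,4],
    'C':[7,8,9,np.nan,2,np.nan],
    'D':[1,np.nan,5,7,1,0]
})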
print (df)
A B C D
0 a 4.0 7.0 1.0
1 b 5.0 8.0 NaN
2 c NaN 9.0 5.0
3 d 5.0 NaN 7.0
4 e NaN 2.0 1.0
5 f 4.0 NaN 0.0
s = df.isnull().sum()
df = df.loc[:, s.eq(s.max())]
print (df)
B C
0 4.0 7.0
1 5.0 8.0
2 NaN 9.0
3 5.0 NaN
4 NaN 2.0
5 4.0 NaN
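Since the question asks for the column name rather than the column's data, here is a minimal sketch along the same lines. It reuses the sample frame from above; the variable names first_name and tied_names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, np.nan, 5, np.nan, 4],
    'C': [7, 8, 9, np.nan, 2, np.nan],
    'D': [1, np.nan, 5, 7, 1, 0],
})

# Count of missing values per column
s = df.isnull().sum()

# Label of the first column that reaches the maximum count
first_name = s.idxmax()

# Labels of all columns tied for the maximum count
tied_names = s.index[s.eq(s.max())].tolist()

print(first_name)   # B
print(tied_names)   # ['B', 'C']
```

Note that idxmax returns only the first label when there is a tie, which is why the second form may be preferable.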

Related

How do I append an uneven column to an existing one?

I am having trouble appending the later values from column c to column a within the same df using pandas. I have tried .append and pd.concat with ignore_index=True, but it is still not working.
import pandas as pd
d = {'a':[1,2,3,None, None], 'b':[7,8,9, None, None], 'c':[None, None, None, 5, 6]}
df = pd.DataFrame(d)
df['a'] = df['a'].append(df['c'], ignore_index=True)
print(df)
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 NaN NaN 5.0
4 NaN NaN 6.0
Desired:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
Thank you for updating that, this is what I would do:
df['a'] = df['a'].fillna(df['c'])
print(df)
Output:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
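To illustrate why the append attempt left column a unchanged, here is a small sketch. It assumes the same data as in the question and uses pd.concat for the stacking step, because Series.append was removed in pandas 2.0; the names stacked and kept are only illustrative.

```python
import pandas as pd

d = {'a': [1, 2, 3, None, None],
     'b': [7, 8, 9, None, None],
     'c': [None, None, None, 5, 6]}
df = pd.DataFrame(d)
original_a = df['a'].copy()

# Stacking the two columns end to end gives 10 values labelled 0..9
stacked = pd.concat([df['a'], df['c']], ignore_index=True)
print(len(stacked))  # 10

# Assigning a Series to a column aligns on the index, so only labels 0..4
# are kept, and those are exactly the original values of 'a'
kept = stacked.reindex(df.index)
print(kept.equals(original_a))  # True

# fillna instead takes the value from 'c' wherever 'a' is missing
df['a'] = df['a'].fillna(df['c'])
print(df['a'].tolist())  # [1.0, 2.0, 3.0, 5.0, 6.0]
```

df['a'].combine_first(df['c']) should give the same result here, if that reads more naturally.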

How to apply a rolling window to a pandas DataFrame where each row contains NaN values that should not be replaced?

I have the following dataframe:
df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan,1],
[0, 1, 2 ,np.nan, np.nan, np.nan,np.nan,1],
[0, 2, 2 ,np.nan, 2, np.nan,1,1]])
With output:
0 1 2 3 4 5 6 7
0 0 1 2 4 NaN NaN NaN 1
1 0 1 2 NaN NaN NaN NaN 1
2 0 2 2 NaN 2 NaN 1 1
with dtypes:
df.dtypes
0 int64
1 int64
2 int64
3 float64
4 float64
5 float64
6 float64
7 int64
Then the following rolling sum is applied:
df.rolling(window = 7, min_periods =1, axis = 'columns').sum()
And the output is as follows:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 4.0 4.0 4.0 4.0 4.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 2.0 2.0 3.0 5.0
I notice that the rolling window stops and starts again whenever the dtype of the next column is different.
However, in my actual DataFrame all columns have the same object dtype:
df = df.astype('object')
for which the same rolling sum gives this output:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 7.0 7.0 7.0 8.0
1 0.0 1.0 3.0 3.0 3.0 3.0 3.0 4.0
2 0.0 2.0 4.0 4.0 6.0 6.0 7.0 8.0
My desired output however, stops and starts again after a nan value appears. This would look like:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 NaN NaN NaN 8.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 6.0 NaN 7.0 8.0
I figured there must be a way that NaN values are not considered but also not filled in with values obtained from the rolling window.
Anything would help!
A workaround is:
Record where the NaN values are located:
nan = df.isnull()
Apply the rolling window:
df = df.rolling(window = 7, min_periods =1, axis = 'columns').sum()
Keep only the values at positions that were not NaN originally (where the mask is False):
df[~nan]
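The same workaround can be sketched as a self-contained example. Two assumptions are made here that are not in the original: the frame is built with dtype=float so every column shares one dtype, and the rolling sum is applied to the transposed frame, because recent pandas releases deprecate the axis argument of rolling. The names rolled and result are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
                   [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
                   [0, 2, 2, np.nan, 2, np.nan, 1, 1]], dtype=float)

# Roll along the columns by transposing, rolling down the rows,
# and transposing back
rolled = df.T.rolling(window=7, min_periods=1).sum().T

# Keep the rolled value only where the original cell was not NaN;
# every other position becomes NaN again
result = rolled.where(df.notna())
print(result)
```

Using where with df.notna() is equivalent to indexing with the negated isnull mask, and it leaves the original df untouched.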

Fill missing values in each group with the previous value

I have three columns of data: the first is a category id, and the second and third have some missing values. I want to group by the id in the first column and then fill the missing values in the third column of each group using the 'ffill' method.
I found a good idea here: Pandas: filling missing values by weighted average in each group!, but it didn't solve my problem because the output it produced was not what I wanted.
The following code builds an example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['A','A', 'B','B','B','B', 'C','C','C'],'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
'sss':[1, np.nan, 3, np.nan, np.nan, np.nan, 2, np.nan, np.nan]})
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN NaN
2 B NaN 3.0
3 B 2.0 NaN
4 B 3.0 NaN
5 B 1.0 NaN
6 C 3.0 2.0
7 C NaN NaN
8 C 3.0 NaN
To fill in the missing values with the previous value after grouping, I ran the following code, but it outputs strange results:
df["sss"] = df.groupby("name").transform(lambda x: x.fillna(axis = 0,method = 'ffill'))
df
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN NaN
3 B 2.0 2.0
4 B 3.0 3.0
5 B 1.0 1.0
6 C 3.0 3.0
7 C NaN 3.0
8 C 3.0 3.0
The result I want is this:
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN 3.0
3 B 2.0 3.0
4 B 3.0 3.0
5 B 1.0 3.0
6 C 3.0 2.0
7 C NaN 2.0
8 C 3.0 2.0
Can someone point out where I am wrong?
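No answer was recorded for this question, so the following is only a sketch of one likely fix. In the code above, transform runs on every non-key column, so it returns both value and sss. The strange output equals the forward-filled value column, which suggests that column is what ended up in sss. Selecting the sss column before filling avoids that:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    'sss': [1, np.nan, 3, np.nan, np.nan, np.nan, 2, np.nan, np.nan],
})

# Select the single column first, then forward-fill within each group
df['sss'] = df.groupby('name')['sss'].ffill()

print(df['sss'].tolist())  # [1.0, 1.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 2.0]
```

The value column is left as it was, which matches the desired result shown in the question.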

Compare two columns and replace the value with None if they are equal

The following command replaces every value in the matching rows with NaN.
ndf.iloc[np.where(ndf.path3=='sys_bck_20190101.tar.gz')] = np.nan
What I really need to do is to replace the value of a single column called path4 if it matches with column path3. This does not work:
ndf.iloc[np.where(ndf.path3==ndf.path4), ndf.path3] = np.nan
Update:
There is a pandas method "fillna" that can be used with axis = 'columns'.
Is there a similar method to write "NA" values to the duplicate columns?
I can do this, but it does not look Pythonic.
ndf.loc[ndf.path1==ndf.path2, 'path1'] = np.nan
ndf.loc[ndf.path2==ndf.path3, 'path2'] = np.nan
ndf.loc[ndf.path3==ndf.path4, 'path3'] = np.nan
ndf.loc[ndf.path4==ndf.filename, 'path4'] = np.nan
Update 2
Let me explain the issue:
Assuming this dataframe:
ndf = pd.DataFrame({
'path1':[4,5,4,5,5,4],
'path2':[4,5,4,5,5,4],
'path3':list('abcdef'),
'path4':list('aaabef'),
'col':list('aaabef')
})
The expected results:
path1 path2 path3 path4 col
0 NaN 4.0 NaN NaN a
1 NaN 5.0 b NaN a
2 NaN 4.0 c NaN a
3 NaN 5.0 d NaN b
4 NaN 5.0 NaN NaN e
5 NaN 4.0 NaN NaN f
As you can see, this is the reverse of fillna, and I guess there is no easy way to do this in pandas. I have already mentioned the commands I can use. I would like to know if there is a better way to achieve this.
Use:
for c1, c2 in zip(ndf.columns, ndf.columns[1:]):
    # Blank out c1 wherever it equals the column to its right
    ndf.loc[ndf[c1]==ndf[c2], c1] = np.nan
print (ndf)
path1 path2 path3 path4 col
0 NaN 4.0 NaN NaN a
1 NaN 5.0 b NaN a
2 NaN 4.0 c NaN a
3 NaN 5.0 d NaN b
4 NaN 5.0 NaN NaN e
5 NaN 4.0 NaN NaN f
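For reference, a self-contained variant of the loop is sketched below. It swaps the .loc assignment for Series.mask, which is not what the answer uses; the assumption behind the swap is that mask returns a new, upcast Series, so it avoids writing NaN in place into the integer column path1.

```python
import pandas as pd

ndf = pd.DataFrame({
    'path1': [4, 5, 4, 5, 5, 4],
    'path2': [4, 5, 4, 5, 5, 4],
    'path3': list('abcdef'),
    'path4': list('aaabef'),
    'col': list('aaabef'),
})

# Walk over each column together with its right-hand neighbour
for c1, c2 in zip(ndf.columns, ndf.columns[1:]):
    # mask() replaces entries with a missing value where the condition is True
    ndf[c1] = ndf[c1].mask(ndf[c1] == ndf[c2])

print(ndf)
```

Because each column is compared with its neighbour before that neighbour is modified, the left-to-right order of the loop matters.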

Pandas Pivot Table Sort Index Level 1 Not "Sticking"

I know this is a lot, but I really cannot pinpoint what is causing the problem.
Most of this code is just to demonstrate what I'm doing, but the short end of it is:
After reordering columns in a multi-indexed data frame (via
transposing and other methods), calling columns.levels returns the
original sorted levels instead of the new ones.
Given the following:
#Original data frame
import pandas as pd
df = pd.DataFrame(
{'Year':[2012,2012,2012,2012,2012,2012,2013,2013,2013,2013,2013,2013,2014,2014,2014,2014,2014,2014],
'Type':['A','A','B','B','C','C','A','A','B','B','C','C','A','A','B','B','C','C'],
'Org':['a','c','a','b','a','c','a','b','a','c','a','c','a','b','a','c','a','b'],
'Enr':[3,5,3,6,6,4,7,89,5,3,7,34,4,64,3,6,7,44]
})
df.head()
Enr Org Type Year
0 3 a A 2012
1 5 c A 2012
2 3 a B 2012
3 6 b B 2012
4 6 a C 2012
#Pivoted
# sortlevel() has been removed from pandas; sort_index() is used in its place
dfp=df.pivot_table(values=['Enr'],index=['Year'],columns=['Type','Org'],aggfunc='sum')\
.sort_index().sort_index(axis=1)
dfp
Enr
Type A B C
Org a b c a b c a b c
Year
2012 3.0 NaN 5.0 3.0 6.0 NaN 6.0 NaN 4.0
2013 7.0 89.0 NaN 5.0 NaN 3.0 7.0 NaN 34.0
2014 4.0 64.0 NaN 3.0 NaN 6.0 7.0 44.0 NaN
#Transposed
f=dfp.T
Year 2012 2013 2014
Type Org
Enr A a 3.0 7.0 4.0
b NaN 89.0 64.0
c 5.0 NaN NaN
B a 3.0 5.0 3.0
b 6.0 NaN NaN
c NaN 3.0 6.0
C a 6.0 7.0 7.0
b NaN NaN 44.0
c 4.0 34.0 NaN
#Sort level 2 by last column and transpose back
ab2=f.groupby(level=1)[f.columns[-1]].transform(sum)
ab3=pd.concat([f,ab2],axis=1)
ab4=ab3.sort_values([ab3.columns[-1]],ascending=[0])
ab4=ab4.drop(ab4.columns[-1],axis=1,inplace=False)
g=ab4.T
g
Enr
Type A C B
Org a b c a b c a b c
Year
2012 3.0 NaN 5.0 6.0 NaN 4.0 3.0 6.0 NaN
2013 7.0 89.0 NaN 7.0 NaN 34.0 5.0 NaN 3.0
2014 4.0 64.0 NaN 7.0 44.0 NaN 3.0 NaN 6.0
I know this was a lot, but I really cannot pinpoint what is causing the problem.
If you do:
g.Enr.columns.levels
The result is:
FrozenList([['A', 'B', 'C'], ['a', 'b', 'c']])
My question is: Why is it not:
FrozenList([['A', 'C', 'B'], ['a', 'b', 'c']]) ?
I really need it to be the second one.
Thanks in advance!
A MultiIndex stores itself as a set of levels, which are the distinct possible values, and codes (called labels in older pandas versions), which are integer positions into those levels for the labels actually used. Changing the column order just reshuffles the codes; it does not change the levels themselves.
If you want the levels in the order in which they first appear, you could do something like this:
In [61]: c = g.Enr.columns
In [62]: [c.levels[i].take(pd.unique(c.codes[i]))
...: for i in range(len(c.levels))]
Out[62]:
[Index([u'A', u'C', u'B'], dtype='object', name=u'Type'),
Index([u'a', u'b', u'c'], dtype='object', name=u'Org')]
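The split between levels and codes can be seen on a small standalone index. This is only a sketch: the MultiIndex below is built directly with from_product rather than through the pivot and sort steps from the question, and the names cols and ordered are illustrative.

```python
import pandas as pd

# Column order A, C, B, as in the reordered frame g
cols = pd.MultiIndex.from_product(
    [['A', 'C', 'B'], ['a', 'b', 'c']], names=['Type', 'Org'])

# The levels hold the distinct values in sorted order,
# regardless of the order in which the labels appear
print(cols.levels[0].tolist())  # ['A', 'B', 'C']

# The codes record the order of appearance, so they can be used
# to reorder each level
ordered = [cols.levels[i].take(pd.unique(cols.codes[i]))
           for i in range(cols.nlevels)]
print([lvl.tolist() for lvl in ordered])  # [['A', 'C', 'B'], ['a', 'b', 'c']]

# get_level_values(i).unique() gives the same order of appearance
same = [cols.get_level_values(i).unique().tolist()
        for i in range(cols.nlevels)]
print(same)  # [['A', 'C', 'B'], ['a', 'b', 'c']]
```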