How to do pd.fillna() with condition - pandas

I am trying to do a fillna with an if condition:
import pandas as pd
df = pd.DataFrame(data={'a':[1,None,3,None],'b':[4,None,None,None]})
print(df)
# Pseudocode for what I want: df['b'].fillna(value=0, inplace=True) only where df['a'] is None
print(df)
a b
0 1 4
1 NaN NaN
2 3 NaN
3 NaN NaN
## What I want to achieve
a b
0 1 4
1 NaN 0
2 3 NaN
3 NaN 0
Please help

You can chain both missing-value tests with & (bitwise AND) and then set the matching values to 0:
df.loc[df.a.isna() & df.b.isna(), 'b'] = 0
#alternative
df.loc[df[['a', 'b']].isna().all(axis=1), 'b'] = 0
print (df)
a b
0 1.0 4.0
1 NaN 0.0
2 3.0 NaN
3 NaN 0.0
Or you can use fillna with one condition:
df.loc[df.a.isna(), 'b'] = df.b.fillna(0)
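If you would rather not assign through .loc, the same condition can be passed to Series.mask, which returns a new Series and leaves the original column untouched. This is only a sketch on the question's sample data; the names cond and filled are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, None, 3, None], 'b': [4, None, None, None]})

# True only on rows where both 'a' and 'b' are missing.
cond = df['a'].isna() & df['b'].isna()

# mask() replaces values where cond is True and keeps the rest as they are.
filled = df['b'].mask(cond, 0)
print(filled.tolist())
```

Assign the result back with df['b'] = filled if you want to keep it.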

Related

How to keep all values from a dataframe except where NaN is present in another dataframe?

I am new to Pandas and I am stuck at this specific problem where I have 2 DataFrames in Pandas, e.g.
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
>>> df2
A B
0 NaN 0.05
1 NaN 0.05
2 0.16 NaN
3 0.16 NaN
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2 i.e.
>>> df3
A B
0 NaN 9
1 NaN 6
2 3 NaN
3 4 NaN
I am talking about dfs with 10,000 rows each so I can't do this manually. Also indices and columns are the exact same in each case. I also have no NaN values in df1.
As far as I understand df.update() will either overwrite all values including NaN or update only those that are NaN.
You can use boolean masking with DataFrame.notna.
# df2 = df2.astype(float) # This is needed if your dtypes are not floats.
m = df2.notna()
df1[m]
A B
0 NaN 9.0
1 NaN 6.0
2 3.0 NaN
3 4.0 NaN
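The same masking can also be written with DataFrame.where, which may read more explicitly. Below is a self-contained sketch that rebuilds the two frames from the question; df3 is just the name the question uses for the desired result.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [9, 6, 11, 8]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, 0.16, 0.16],
                    'B': [0.05, 0.05, np.nan, np.nan]})

# Keep df1's value wherever df2 has a value; every other cell becomes NaN.
df3 = df1.where(df2.notna())
print(df3)
```

This relies on both frames sharing the same index and columns, as stated in the question.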

In pandas replace consecutive 0s with NaN

I want to clean some data by replacing only CONSECUTIVE 0s in a data frame
Given:
import pandas as pd
import numpy as np
d = [[1,np.nan,3,4],[2,0,0,np.nan],[3,np.nan,0,0],[4,np.nan,0,0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])
df
a b c d
0 1 NaN 3 4.0
1 2 0.0 0 NaN
2 3 NaN 0 0.0
3 4 NaN 0 0.0
The desired result should be:
a b c d
0 1 NaN 3 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
where column c & d are affected but column b is NOT affected as it only has 1 zero (and not consecutive 0s).
I have experimented with this answer:
Replacing more than n consecutive values in Pandas DataFrame column
which is along the right lines but the solution keeps the first 0 in a given column which is not desired in my case.
Use shift with mask: a zero is replaced when the value directly above or below it in the same column is equal to it.
df=df.mask((df.shift().eq(df)|df.eq(df.shift(-1)))&(df==0))
Out[469]:
a b c d
0 1 NaN 3.0 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
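If the one-liner is hard to read, here is a step-by-step sketch of the same idea on the question's data. The intermediate names (is_zero, zero_above, zero_below, in_run, cleaned) are only illustrative.

```python
import numpy as np
import pandas as pd

d = [[1, np.nan, 3, 4], [2, 0, 0, np.nan], [3, np.nan, 0, 0], [4, np.nan, 0, 0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])

is_zero = df.eq(0)

# Shift the boolean frame down and up by one row; the edges are filled with False.
zero_above = is_zero.shift(1, fill_value=False)
zero_below = is_zero.shift(-1, fill_value=False)

# A zero is part of a run if the cell above or below it is also zero.
in_run = is_zero & (zero_above | zero_below)

# mask() turns the flagged cells into NaN and keeps everything else.
cleaned = df.mask(in_run)
print(cleaned)
```

The lone zero in column b is kept because neither of its neighbours is zero.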

Merge columns based on values in multiple columns pandas

I have a DataFrame as follows:
Name Col2 Col3
0 A 16-1-2000 NaN
1 B 13-2-2001 NaN
2 C NaN NaN
3 D NaN 23-4-2014
4 X NaN NaN
5 Q NaN 4-5-2009
I want to make a combined column from whichever of Col2 and Col3 has data, so that it gives me the following output.
Name Col2 Col3 Result
0 A 16-1-2000 NaN 16-1-2000
1 B 13-2-2001 NaN 13-2-2001
2 C NaN NaN NaN
3 D NaN 23-4-2014 23-4-2014
4 X NaN NaN NaN
5 Q NaN 4-5-2009 4-5-2009
I have tried following:
df['Result'] = np.where(df["Col2"].isnull() & df["Col3"].isnull(), np.nan, df["Col2"] if dfCrisiltemp["Col2"].notnull() else df["Col3"])
but no success.
Use combine_first or fillna:
df['new'] = df["Col2"].combine_first(df["Col3"])
#alternative
#df['new'] = df["Col2"].fillna(df["Col3"])
print (df)
Name Col2 Col3 new
0 A 16-1-2000 NaN 16-1-2000
1 B 13-2-2001 NaN 13-2-2001
2 C NaN NaN NaN
3 D NaN 23-4-2014 23-4-2014
4 X NaN NaN NaN
5 Q NaN 4-5-2009 4-5-2009
Your np.where attempt needs a second, nested np.where:
df['new'] = np.where(df["Col2"].notnull() & df["Col3"].isnull(), df["Col2"],
            np.where(df["Col2"].isnull() & df["Col3"].notnull(), df["Col3"], np.nan))
Or numpy.select:
m1 = df["Col2"].notnull() & df["Col3"].isnull()
m2 = df["Col2"].isnull() & df["Col3"].notnull()
df['new'] = np.select([m1, m2], [df["Col2"], df["Col3"]], np.nan)
For a general solution, select all columns except the first with iloc, forward fill NaNs along each row, and then select the last column:
df['new'] = df.iloc[:, 1:].ffill(axis=1).iloc[:, -1]
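One thing to be aware of: forward fill plus last column picks the last non-missing value in a row, while combine_first picks the first. On the question's data no row has both columns filled, so the results agree. The sketch below uses made-up data (the last row, with two dates, is hypothetical) to show where they differ, and a bfill variant that matches combine_first.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row has a value in both columns.
df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Col2': ['16-1-2000', np.nan, np.nan, '1-1-1999'],
    'Col3': [np.nan, '23-4-2014', np.nan, '2-2-2002'],
})

# Forward fill along each row, then take the last column: last non-missing value.
last_valid = df.iloc[:, 1:].ffill(axis=1).iloc[:, -1]

# Backward fill along each row, then take the first column: first non-missing value.
first_valid = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0]

print(last_valid.tolist())
print(first_valid.tolist())
```

Which one you want depends on which column should take priority when both are present.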

How to work with 'NA' in pandas?

I am merging two data frames in pandas. When the joining fields contain 'NA', pandas automatically excludes those records. How can I keep the records having the value 'NA'?
For me it works fine:
df1 = pd.DataFrame({'A':[np.nan,2,1],
'B':[5,7,8]})
print (df1)
A B
0 NaN 5
1 2.0 7
2 1.0 8
df2 = pd.DataFrame({'A':[np.nan,2,3],
'C':[4,5,6]})
print (df2)
A C
0 NaN 4
1 2.0 5
2 3.0 6
print (pd.merge(df1, df2, on=['A']))
A B C
0 NaN 5 4
1 2.0 7 5
print (pd.__version__)
0.19.2
EDIT:
It seems there is another problem: your NA values are converted to NaN when the file is read.
With pandas.read_excel you can control which values are converted to NaN through the parameters keep_default_na and na_values:
df = pd.read_excel('test.xlsx',keep_default_na=False,na_values=['NaN'])
print (df)
a b
0 NaN NA
1 20.0 40
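To see the effect without an Excel file, here is a small sketch that uses pandas.read_csv on an in-memory string instead; read_csv accepts the same keep_default_na and na_values parameters. The column names, the sample text and the lookup frame are all made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data where 'NA' is a real key, not a missing value.
csv_text = "key,val\nNA,1\nEU,2\n"

# Default parsing turns the string 'NA' into NaN.
default = pd.read_csv(io.StringIO(csv_text))
print(default['key'].isna().tolist())  # [True, False]

# Disable the default NaN markers and treat only 'NaN' as missing.
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=['NaN'])
print(kept['key'].tolist())  # ['NA', 'EU']

other = pd.DataFrame({'key': ['NA', 'EU'], 'name': ['North America', 'Europe']})

# With the default parsing the 'NA' row no longer matches; with kept it does.
lost = default.merge(other, on='key')
merged = kept.merge(other, on='key')
print(len(lost), len(merged))  # 1 2
```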

Using scalar values in series as variables in user defined function

I want to define a function that is applied element wise for each row in a dataframe, comparing each element to a scalar value in a separate series. I started with the function below.
def greater_than(array, value):
    g = array[array >= value].count(axis=1)
    return g
But it is applying the mask along axis 0 and I need it to apply it along axis 1. What can I do?
e.g.
In [3]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [4]: df
Out[4]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [26]: s
Out[26]: array([ 1, 1000, 1000, 1000])
In [25]: greater_than(df,s)
Out[25]:
0 0
1 1
2 1
3 1
dtype: int64
In [27]: g = df[df >= s]
In [28]: g
Out[28]:
0 1 2 3
0 NaN NaN NaN NaN
1 4.0 NaN NaN NaN
2 8.0 NaN NaN NaN
3 12.0 NaN NaN NaN
The result should look like:
In [29]: greater_than(df,s)
Out[29]:
0 3
1 0
2 0
3 0
dtype: int64
as 1,2, & 3 are all >= 1 and none of the remaining values are greater than or equal to 1000.
Your best bet may be to do some transposes (no copies are made, if that's a concern)
In [164]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [165]: s = np.array([ 1, 1000, 1000, 1000])
In [171]: df.T[(df.T>=s)].T
Out[171]:
0 1 2 3
0 NaN 1.0 2.0 3.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In [172]: df.T[(df.T>=s)].T.count(axis=1)
Out[172]:
0 3
1 0
2 0
3 0
dtype: int64
You can also just sum the mask directly, if the count is all you're after.
In [173]: (df.T>=s).sum(axis=0)
Out[173]:
0 3
1 0
2 0
3 0
dtype: int64
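If you want to avoid the transposes, one alternative (not the approach above, just a sketch) is to compare with DataFrame.ge and axis=0, which lines the thresholds up with the row index. The version of greater_than below is a hypothetical rewrite of the question's function, and it assumes values has one entry per row.

```python
import numpy as np
import pandas as pd

def greater_than(frame, values):
    # Align one threshold per row, then count the True cells in each row.
    thresholds = pd.Series(values, index=frame.index)
    return frame.ge(thresholds, axis=0).sum(axis=1)

df = pd.DataFrame(np.arange(16).reshape(4, 4))
s = np.array([1, 1000, 1000, 1000])

counts = greater_than(df, s)
print(counts.tolist())  # [3, 0, 0, 0]

# The same comparison with plain NumPy broadcasting: s as a column vector.
counts_np = (df.to_numpy() >= s[:, None]).sum(axis=1)
print(counts_np.tolist())  # [3, 0, 0, 0]
```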