Forward-fill dataframe based on mask. Fill with last valid value - pandas

I have a dataframe like the following:
index,col1,col2
1,NaN,NaN
2,NaN,NaN
3,NaN,20
4,NaN,21
5,10,22
6,11,23
7,12,24
8,13,NaN
9,NaN,NaN
And a boolean mask dataframe like the following:
index,col1,col2
1,False,False
2,False,False
3,False,False
4,False,True
5,False,False
6,False,False
7,True,True
8,True,False
9,False,False
I would like to convert them to this final dataframe:
index,col1,col2
1,NaN,NaN
2,NaN,NaN
3,NaN,20
4,NaN,20
5,10,22
6,11,23
7,11,23
8,11,NaN
9,NaN,NaN
That is: forward-filling the values where the mask is True with the last value in that column where the mask is False.
How can I get this?

Let's try:
df.mask(mask).ffill().where(df.notna())
Output:
       col1  col2
index
1       NaN   NaN
2       NaN   NaN
3       NaN  20.0
4       NaN  20.0
5      10.0  22.0
6      11.0  23.0
7      11.0  23.0
8      11.0   NaN
9       NaN   NaN
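For reference, here is a self-contained version of that solution; the frame and mask below are just the example data from the question typed out:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"col1": [np.nan, np.nan, np.nan, np.nan, 10, 11, 12, 13, np.nan],
     "col2": [np.nan, np.nan, 20, 21, 22, 23, 24, np.nan, np.nan]},
    index=pd.Index(range(1, 10), name="index"),
)
mask = pd.DataFrame(
    {"col1": [False, False, False, False, False, False, True, True, False],
     "col2": [False, False, False, True, False, False, True, False, False]},
    index=pd.Index(range(1, 10), name="index"),
)

# mask(mask) blanks the True cells, ffill() carries the last kept value
# down each column, and where(df.notna()) restores the original NaNs.
out = df.mask(mask).ffill().where(df.notna())
print(out)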

Related

Adding columns with null values in pandas dataframe [duplicate]

When summing two pandas columns, I want to ignore nan-values when one of the two columns is a float. However when nan appears in both columns, I want to keep nan in the output (instead of 0.0).
Initial dataframe:
Surf1  Surf2
0      0
NaN    8
8      15
NaN    NaN
16     14
15     7
Desired output:
Surf1  Surf2  Sum
0      0      0
NaN    8      8
8      15     23
NaN    NaN    NaN
16     14     30
15     7      22
Tried code:
The code below ignores NaN values, but when summing two NaN values it gives 0.0 in the output. I want to keep NaN in that particular case, to keep truly empty values separate from values that are actually 0 after summing.
import pandas as pd
import numpy as np
data = pd.DataFrame({"Surf1": [10,np.nan,8,np.nan,16,15], "Surf2": [22,8,15,np.nan,14,7]})
print(data)
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1)
print(data)
From the documentation pandas.DataFrame.sum
By default, the sum of an empty or all-NA Series is 0.
>>> pd.Series([]).sum()  # min_count=0 is the default
0.0
This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.
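A quick sanity check of that documented behaviour (a minimal sketch):
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
print(s.sum())             # 0.0 -- an all-NaN sum defaults to 0
print(s.sum(min_count=1))  # nan -- fewer than 1 valid value present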
Change your code to
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1, min_count=1)
output
   Surf1  Surf2
0   10.0   22.0
1    NaN    8.0
2    8.0   15.0
3    NaN    NaN
4   16.0   14.0
5   15.0    7.0
   Surf1  Surf2   Sum
0   10.0   22.0  32.0
1    NaN    8.0   8.0
2    8.0   15.0  23.0
3    NaN    NaN   NaN
4   16.0   14.0  30.0
5   15.0    7.0  22.0
You could mask the result by doing:
df.sum(1).mask(df.isna().all(1))
0     0.0
1     8.0
2    23.0
3     NaN
4    30.0
5    22.0
dtype: float64
You can do:
df['Sum'] = df.dropna(how='all').sum(1)
Output:
   Surf1  Surf2   Sum
0   10.0   22.0  32.0
1    NaN    8.0   8.0
2    8.0   15.0  23.0
3    NaN    NaN   NaN
4   16.0   14.0  30.0
5   15.0    7.0  22.0
You can use min_count: this will sum the row when there is at least one non-null value, and return null if all values are null.
df['SUM'] = df.sum(min_count=1, axis=1)
# df.sum(min_count=1, axis=1)
Out[199]:
0     0.0
1     8.0
2    23.0
3     NaN
4    30.0
5    22.0
dtype: float64
I think all the solutions listed above work only for the cases when it is the first column value that is missing. If you have cases when the first column value is non-missing but the second column value is missing, try:
df['sum'] = df['Surf1']
df.loc[(df['Surf2'].notnull()), 'sum'] = df['Surf1'].fillna(0) + df['Surf2']
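As a hedged aside: for exactly two columns, Series.add with fill_value covers both directions at once; a missing value on either side is treated as 0, while a row that is missing on both sides stays NaN. A minimal sketch on the question's data:
import numpy as np
import pandas as pd

data = pd.DataFrame({"Surf1": [10, np.nan, 8, np.nan, 16, 15],
                     "Surf2": [22, 8, 15, np.nan, 14, 7]})

# NaN on one side is replaced by fill_value (0) before adding;
# NaN on both sides is kept as NaN.
data["Sum"] = data["Surf1"].add(data["Surf2"], fill_value=0)
print(data)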

Converting a list of list of tuples to a DataFrame (First argument: column, Second argument: value)

So I need a DataFrame out of a list_of_list_of_tuples:
My data looks like this:
tuples = [[(5,0.45),(6,0.56)],[(1,0.23),(2,0.54),(6,0.63)],[(3,0.86),(6,0.36)]]
What I need is this:
index     1     2     3    4     5     6
1       nan   nan   nan  nan  0.45  0.56
2      0.23  0.54   nan  nan   nan  0.63
3       nan   nan  0.86  nan   nan  0.36
So that the first argument in the tuple is the column, and the second is the value.
An index would be nice also.
Can anyone help me?
I have no idea how to formulate the code.
Convert each list of tuples to a dictionary, pass the result to the DataFrame constructor, and finally use DataFrame.reindex to fix the column order and add the missing columns:
df = pd.DataFrame([dict(x) for x in tuples])
df = df.reindex(range(df.columns.min(), df.columns.max() + 1), axis=1)
print(df)
      1     2     3   4     5     6
0   NaN   NaN   NaN NaN  0.45  0.56
1  0.23  0.54   NaN NaN   NaN  0.63
2   NaN   NaN  0.86 NaN   NaN  0.36
import pandas as pd

tuples = [[(5, 0.45), (6, 0.56)], [(1, 0.23), (2, 0.54), (6, 0.63)], [(3, 0.86), (6, 0.36)]]

rows = []
for x in tuples:
    index = []
    values = []
    for col, val in x:  # first element is the column label, second the value
        index.append(col)
        values.append(val)
    rows.append(pd.Series(values, index=index))

df = pd.DataFrame(rows)
# Fill in any absent column labels (e.g. 4), as in the answer above.
df = df.reindex(range(df.columns.min(), df.columns.max() + 1), axis=1)
print(df)

Error while setting a column equal to another in pandas (ValueError: Must have equal len keys and value when setting with an iterable)

I have the following dataframe in pandas
datadate fyear ebit glp ibc ... ind status year month a_date
gvkey ...
7767 20130831 NaN NaN NaN NaN ... 0 1 2013.0 8.0 0
10871 20110930 NaN NaN NaN NaN ... 0 1 2011.0 9.0 0
15481 20110930 NaN NaN NaN NaN ... 0 1 2011.0 9.0 0
15582 19821031 NaN NaN NaN NaN ... 1 1 1982.0 10.0 0
15582 19831031 NaN NaN NaN NaN ... 1 1 1983.0 10.0 0
... ... ... ... ... ... ... ... ... ... ...
282553 20071231 NaN NaN NaN NaN ... 0 1 2007.0 12.0 0
282553 20081231 NaN NaN NaN NaN ... 0 1 2008.0 12.0 0
282553 20091231 NaN NaN NaN NaN ... 0 1 2009.0 12.0 0
294911 20150930 NaN NaN NaN NaN ... 0 1 2015.0 9.0 0
321467 20161231 NaN NaN NaN NaN ... 0 1 2016.0 12.0 0
I want to run the following command to assign the year value to the column a_date if month is at least 6. (Please do not consider that there are NaNs in the dataframe):
df.iloc[(df['month']>=6).values,-1]=df.iloc[(df['month']>=6).values,-3]
but I get the error
ValueError: Must have equal len keys and value when setting with an iterable
How should I proceed? I really cannot see why I get this error. I googled and found some solutions to the same ValueError, but they do not apply to my case. I would like to avoid using dictionaries and keep everything in one line if possible. I know I could solve it with a loop, but I am looking for a more efficient solution.
I think the error comes from the iloc call on the right-hand side of your line (after the =), because it returns a Series rather than a single value. So you are assigning a Series to dataframe cells, which for me is the source of the error. Using pandas, I would write:
df.loc[df['month'] >= 6, 'a_date'] = df['year']
The loc function selects a group of rows according to a condition (here df['month'] >= 6), a column to change (here 'a_date'), and the value you want to assign (here another column of the dataframe: df['year']).
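To illustrate, a minimal sketch on a hypothetical three-row miniature of the frame (only the columns the assignment touches):
import pandas as pd

df = pd.DataFrame({"year":   [2013.0, 2011.0, 1982.0],
                   "month":  [8.0, 9.0, 3.0],
                   "a_date": [0.0, 0.0, 0.0]})

# Rows where month >= 6 receive the index-aligned values from 'year';
# the month < 6 row keeps its original a_date.
df.loc[df["month"] >= 6, "a_date"] = df["year"]
print(df)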
I found an efficient solution myself using np.where:
df['a_date'] = np.where(df['month'] >= 6, df['year'], df['year'] - 1)
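Note that, unlike the loc version above, this np.where line also rewrites the else branch with year - 1 rather than leaving those rows untouched. A sketch on the same hypothetical miniature:
import numpy as np
import pandas as pd

df = pd.DataFrame({"year":  [2013.0, 2011.0, 1982.0],
                   "month": [8.0, 9.0, 3.0]})

# month >= 6 -> year; otherwise year - 1.
df["a_date"] = np.where(df["month"] >= 6, df["year"], df["year"] - 1)
print(df)  # a_date: 2013.0, 2011.0, 1981.0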

concatenate 3 pandas dataframes on index and one column

I'd like to concatenate 3 dataframes on the index and on the 'type' column, where some index values are missing (dfb and dfc have an incomplete index, while dfa has a complete index). When I do concat, some columns disappear, as shown below. (I'd like the final dataframe to have a MultiIndex so I can pick out parts of the concatenated dataframe by type, and df['type2'] should have a sorted index.)
I tried concat with various parameters but it did not work.
dfa=pd.DataFrame({'type':['type1','type1','type2'],'a':[10,20,30]},index=[1,2,3])
dfb=pd.DataFrame({'type':['type1','type2'],'b':[11,21]},index=[2,3])
dfc=pd.DataFrame({'type':['type3'],'c':[33]},index=[3])
dfa
dfb
dfc
pd.concat([dfa, dfb, dfc], axis=0, keys=['type'])  # wrong: columns b and c disappear!
I'd like an efficient solution, as I have 5 dataframes with 2000 "types" and the index size of each is around 10K.
Example of the desired dataframe:
pd.DataFrame({'a': [10, 20, 30, np.nan],
              'b': [np.nan, 11, 21, np.nan],
              'c': [np.nan, np.nan, np.nan, 33],
              'type': ['type1', 'type1', 'type2', 'type3']},
             index=[1, 2, 3, 3])
After creating df:
dfa=pd.DataFrame({'type':['type1','type1','type2'],'a':[10,20,30]},index=[1,2,3])
dfb=pd.DataFrame({'type':['type1','type2'],'b':[11,21]},index=[2,3])
dfc=pd.DataFrame({'type':['type3'],'c':[33]},index=[3])
You can use merge and reset_index like this:
dfs = [dfa, dfb, dfc]  # ... add as many df as you wish
res = dfs[0].reset_index()
for i in range(1, len(dfs)):
    res = res.merge(dfs[i].reset_index(), how='outer',
                    left_on=['index', 'type'], right_on=['index', 'type'])
res = res.set_index('index')
print(res)
The result will be:
        type     a     b     c
index
1      type1  10.0   NaN   NaN
2      type1  20.0  11.0   NaN
3      type2  30.0  21.0   NaN
3      type3   NaN   NaN  33.0
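As a usage note, the same chain of pairwise outer merges can be folded with functools.reduce instead of an explicit loop (a sketch equivalent to the code above):
from functools import reduce

import pandas as pd

dfa = pd.DataFrame({'type': ['type1', 'type1', 'type2'], 'a': [10, 20, 30]}, index=[1, 2, 3])
dfb = pd.DataFrame({'type': ['type1', 'type2'], 'b': [11, 21]}, index=[2, 3])
dfc = pd.DataFrame({'type': ['type3'], 'c': [33]}, index=[3])

dfs = [dfa, dfb, dfc]
res = reduce(
    lambda left, right: left.merge(right, how='outer', on=['index', 'type']),
    (d.reset_index() for d in dfs),
).set_index('index')
print(res)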
The problem is that you aren't defining enough keys to match the number of dataframes being concatenated.
Try this:
pd.concat([dfa, dfb, dfc], axis=0, keys=['type_a', 'type_b', 'type_c'])
Output:
             a     b     c   type
type_a 1  10.0   NaN   NaN  type1
       2  20.0   NaN   NaN  type1
       3  30.0   NaN   NaN  type2
type_b 2   NaN  11.0   NaN  type1
       3   NaN  21.0   NaN  type2
type_c 3   NaN   NaN  33.0  type3
Or leave the keys parameter out altogether:
pd.concat([dfa, dfb, dfc], axis=0)
Output:
      a     b     c   type
1  10.0   NaN   NaN  type1
2  20.0   NaN   NaN  type1
3  30.0   NaN   NaN  type2
2   NaN  11.0   NaN  type1
3   NaN  21.0   NaN  type2
3   NaN   NaN  33.0  type3
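If the MultiIndex-by-type layout the question asks for is the goal, one possible sketch building on the plain concat is to move 'type' into the index, promote it to the outer level, and sort:
import pandas as pd

dfa = pd.DataFrame({'type': ['type1', 'type1', 'type2'], 'a': [10, 20, 30]}, index=[1, 2, 3])
dfb = pd.DataFrame({'type': ['type1', 'type2'], 'b': [11, 21]}, index=[2, 3])
dfc = pd.DataFrame({'type': ['type3'], 'c': [33]}, index=[3])

# Each type's rows become contiguous and selectable with .loc.
res = (pd.concat([dfa, dfb, dfc])
         .set_index('type', append=True)
         .swaplevel()
         .sort_index())
print(res.loc['type2'])  # just the type2 rows, with a sorted index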

For every row in pandas, do until the sample ID changes

How can I iterate over rows in a dataframe until the sample ID changes?
my_df:
ID       loc_start
sample1  10
sample1  15
sample2  10
sample2  20
sample3  5
Something like:
samples = ["sample1", "sample2", "sample3"]
out = pd.DataFrame()
for sample in samples:
    if my_df["ID"] == sample:
        my_list = []
        for index, row in my_df.iterrows():
            other_list = [row.loc_start]
            my_list.append(other_list)
        my_list = pd.DataFrame(my_list)
        out = pd.merge(out, my_list)
Expected output:
sample1  sample2  sample3
10       10       5
15       20
I realize of course that this could be done more easily if my_df really looked like this. However, what I'm after is the principle of iterating over rows until a certain column value changes.
Based on the input & output provided, this should work.
You need to provide more info if you are looking for something else.
df.pivot(columns='ID', values = 'loc_start').rename_axis(None, axis=1).apply(lambda x: pd.Series(x.dropna().values))
output
   sample1  sample2  sample3
0     10.0     10.0      5.0
1     15.0     20.0      NaN
Ben.T is correct that a pivot works here. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0, 5, (10, 2)), columns=list("AB"))
# what does the df look like? Here, I consider column A to be analogous to your "ID" column
In [5]: df
Out[5]:
   A  B
0  3  1
1  2  1
2  4  2
3  4  1
4  0  4
5  4  2
6  4  1
7  3  1
8  1  1
9  4  0
# now do a pivot and see what it looks like
df2 = df.pivot(columns="A", values="B")
In [8]: df2
Out[8]:
A    0    1    2    3    4
0  NaN  NaN  NaN  1.0  NaN
1  NaN  NaN  1.0  NaN  NaN
2  NaN  NaN  NaN  NaN  2.0
3  NaN  NaN  NaN  NaN  1.0
4  4.0  NaN  NaN  NaN  NaN
5  NaN  NaN  NaN  NaN  2.0
6  NaN  NaN  NaN  NaN  1.0
7  NaN  NaN  NaN  1.0  NaN
8  NaN  1.0  NaN  NaN  NaN
9  NaN  NaN  NaN  NaN  0.0
Not quite what you wanted. With a little help from Jezreal's answer:
df3 = df2.apply(lambda x: pd.Series(x.dropna().values))
In [20]: df3
Out[20]:
A    0    1    2    3    4
0  4.0  1.0  1.0  1.0  2.0
1  NaN  NaN  NaN  1.0  1.0
2  NaN  NaN  NaN  NaN  2.0
3  NaN  NaN  NaN  NaN  1.0
4  NaN  NaN  NaN  NaN  0.0
The empty spots in the dataframe have to be filled with something, and NaN is used by default. Is this what you wanted?
If, on the other hand, you wanted to perform an operation on your data, you would use groupby instead.
df2 = df.groupby(by="A", as_index=False).mean()