Dynamic sum of one column based on NA values of another column in Pandas

I've got an ordered dataframe, df. It's grouped by 'ID' and ordered by 'order':
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
            'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
     'order': [1, 3, 4, 6, 7, 9, 11, 12, 13, 14, 15, 16, 19, 25,
               8, 10, 15, 17, 20, 25, 29, 31],
     'col1': [1, 2, np.nan, 1, 2, 3, 4, 5, np.nan, np.nan, 6, 7, 8, 9,
              np.nan, np.nan, np.nan, 10, 11, 12, np.nan, 13],
     'col2': [1, 5, 6, np.nan, 1, 2, 3, np.nan, 2, 3, np.nan, np.nan, 3, 1,
              5, np.nan, np.nan, np.nan, 2, 3, np.nan, np.nan],
     }
)
In each ID group, I need to sum col1 over every run of rows where col2 is NA, together with the first following row where col2 is present:
I'd prefer a vectorised solution to make it fast, but that could be difficult.
I need to use this in a groupby (as col1_dynamic_sum should be grouped by ID).
What I have done so far is define a function that counts the number of previous consecutive NAs for each row:
def count_prev_consec_na(input_col):
    """
    Take a Series (column) and return, for each row, the number of
    consecutive missing values in the previous rows.
    """
    try:
        a1 = input_col.isna() + 0    # 1 where missing
        a2 = ~input_col.isna() + 0   # 1 where not missing
        b1 = a1.shift().fillna(0)    # 1 where the previous row was missing
        d = a1.cumsum()              # running count of missing rows
        e = b1 * a2                  # 1 at the first present row after a gap
        f = d * e                    # cumulative NA count at those rows
        g = f.replace(0, np.nan)
        h = g.ffill().fillna(0)
        i = h.shift()
        result = (h - i).fillna(0)   # length of the gap that just ended
        return result
    except Exception as e:
        print(e)
        return None
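A quick sanity check on a made-up Series (not from the question): two NAs precede the value at position 3, so that row gets 2:
s = pd.Series([1.0, np.nan, np.nan, 3.0, 4.0])
print(count_prev_consec_na(s).tolist())  # [0.0, 0.0, 0.0, 2.0, 0.0]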
I think one solution is to use this to get a dynamic number of rows that need to be rolled back for the sum:
df['roll_back_count'] = df.groupby(['ID'], as_index = False).col2.transform(count_prev_consec_na)
ID order col1 col2 roll_back_count
A 1 1.0 1.0 0.0
A 3 2.0 5.0 0.0
A 4 NaN 6.0 0.0
A 6 1.0 NaN 0.0
A 7 2.0 1.0 1.0 ## I want to sum col1 of order 6 and 7 and remove order 6 row
A 9 3.0 2.0 0.0
A 11 4.0 3.0 0.0
A 12 5.0 NaN 0.0
A 13 NaN 2.0 1.0 ## I want to sum col1 of order 12 and 13 and remove order 12 row
A 14 NaN 3.0 0.0
A 15 6.0 NaN 0.0
A 16 7.0 NaN 0.0
A 19 8.0 3.0 2.0 ## I want to sum col1 of order 15,16,19 and remove order 15 and 16 rows
A 25 9.0 1.0 0.0
B 8 NaN 5.0 0.0
B 10 NaN NaN 0.0
B 15 NaN NaN 0.0
B 17 10.0 NaN 0.0
B 20 11.0 2.0 3.0 ## I want to sum col1 of order 10,15,17,20 and remove order 10,15,17 rows
B 25 12.0 3.0 0.0
B 29 NaN NaN 0.0
B 31 13.0 NaN 0.0
this is my desired output:
desired_output:
ID order col1_dynamic_sum col2
A 1 1.0 1
A 3 2.0 5
A 4 NaN 6
A 7 3.0 1
A 9 3.0 2
A 11 4.0 3
A 13 5.0 2
A 14 NaN 3
A 19 21.0 3
A 25 9.0 1
B 8 NaN 5
B 20 21.0 2
B 25 12.0 3
Note: the sums should ignore NAs.
Again, I'd prefer a vectorised solution, but it might not be possible due to the rolling effect.

Gah, I think I found a solution that doesn't involve rolling at all!
I created a new grouping ID from the NA pattern of col2: each row gets the index of the next row where col2 has a value (back-filled). I can then use this grouping ID to aggregate!
def create_na_group(rollback_col):
    a = ~rollback_col.isna() + 0   # 1 where col2 is present
    b = a.replace(0, np.nan)       # NaN where col2 is missing
    c = rollback_col.index
    d = c * b                      # index where col2 is present
    d = d.bfill()                  # back-fill: index of the next present row
    return d
df['na_group'] = df.groupby(['ID'], as_index = False).col2.transform(create_na_group)
df = df.loc[~df.na_group.isna()]
desired_output = df.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    col1_dyn_sum=('col1', 'sum'),
    col2=('col2', 'sum'),
)
I just have to find a way to make sure NaN doesn't become 0, as in rows 2, 7 and 10.
ID na_group order col1_dyn_sum col2
0 A 0.0 1 1.0 1.0
1 A 1.0 3 2.0 5.0
2 A 2.0 4 0.0 6.0
3 A 4.0 7 3.0 1.0
4 A 5.0 9 3.0 2.0
5 A 6.0 11 4.0 3.0
6 A 8.0 13 5.0 2.0
7 A 9.0 14 0.0 3.0
8 A 12.0 19 21.0 3.0
9 A 13.0 25 9.0 1.0
10 B 14.0 8 0.0 5.0
11 B 18.0 20 21.0 2.0
12 B 19.0 25 12.0 3.0
I'll just create two separate sum columns, one with lambda x: x.sum(skipna=False) and one with lambda x: x.sum(skipna=True). Then, if the skipna=True sum is 0 and the skipna=False sum is NA, I'll leave the final sum as NA; otherwise I'll use the skipna=True sum as the final desired output.
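A sketch of that combination (col1_skip and col1_noskip are helper names I made up; it assumes the na_group frame built above):
desired_output = df.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    col1_skip=('col1', lambda x: x.sum(skipna=True)),
    col1_noskip=('col1', lambda x: x.sum(skipna=False)),
    col2=('col2', 'sum'),
)
# keep NaN where every col1 value in the window was NaN
all_na = desired_output['col1_skip'].eq(0) & desired_output['col1_noskip'].isna()
desired_output['col1_dynamic_sum'] = desired_output['col1_skip'].mask(all_na)
desired_output = desired_output.drop(columns=['col1_skip', 'col1_noskip'])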

Related

group dataframe if the column has the same value in consecutive order

let's say I have a dataframe that looks like below:
I want to assign my assets to one group if they have consecutive treatments. If there are up to two consecutive assets without treatment between them, they can still be assigned to the same group. However, if there are more than two assets without treatment, those untreated assets get an empty group, and the next treated assets are assigned to a new group. (The example frame is reconstructed just below.)
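The example frame itself is missing from this page; a minimal reconstruction from the Treatment column of the outputs below:
import pandas as pd
df = pd.DataFrame({'Treatment': list('YYYNNYYYNNNYYYYN')})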
You can use a rolling check of whether there was at least one Y in the last N+1 rows.
I am providing two options depending on whether or not it's important not to label the leading/trailing Ns:
# maximal number of days without treatment
# to remain in same group
N = 2

m = df['Treatment'].eq('Y')
# True on rows where none of the last N+1 rows had a treatment
group = m.rolling(N + 1, min_periods=1).max().eq(0)
# increment the group id at the start of each such break
group = (group & ~group.shift(fill_value=False)).cumsum().add(1)
df['group'] = group
# don't label leading/trailing N
m1 = m.groupby(group).cummax()        # False before the first Y of each group
m2 = m[::-1].groupby(group).cummax()  # False after the last Y of each group
df['group2'] = group.where(m1 & m2)
print(df)
To handle the last NaNs separately:
m3 = ~m[::-1].cummax()   # rows after the very last Y of the frame
df['group3'] = group.where(m1 & m2 | m3)
Output:
Treatment group group2 group3
0 Y 1 1.0 1.0
1 Y 1 1.0 1.0
2 Y 1 1.0 1.0
3 N 1 1.0 1.0
4 N 1 1.0 1.0
5 Y 1 1.0 1.0
6 Y 1 1.0 1.0
7 Y 1 1.0 1.0
8 N 1 NaN NaN
9 N 1 NaN NaN
10 N 2 NaN NaN
11 Y 2 2.0 2.0
12 Y 2 2.0 2.0
13 Y 2 2.0 2.0
14 Y 2 2.0 2.0
15 N 2 NaN 2.0
Another example, for N=1:
Treatment group group2 group3
0 Y 1 1.0 1.0
1 Y 1 1.0 1.0
2 Y 1 1.0 1.0
3 N 1 NaN NaN
4 N 2 NaN NaN
5 Y 2 2.0 2.0
6 Y 2 2.0 2.0
7 Y 2 2.0 2.0
8 N 2 NaN NaN
9 N 3 NaN NaN
10 N 3 NaN NaN
11 Y 3 3.0 3.0
12 Y 3 3.0 3.0
13 Y 3 3.0 3.0
14 Y 3 3.0 3.0
15 N 3 NaN 3.0

How to fill nans with multiple if-else conditions?

I have a dataset:
value score
0 0.0 8
1 0.0 7
2 NaN 4
3 1.0 11
4 2.0 22
5 NaN 12
6 0.0 4
7 NaN 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 NaN 28
There are some NaNs in it. I want to fill those NaNs with these conditions:
If 'score' is less than 10, then fill nan with 0.0
If 'score' is between 10 and 20, then fill nan with 1.0
If 'score' is greater than 20, then fill nan with 2.0
How do I do this in pandas?
Here is an example dataframe:
import numpy as np
import pandas as pd

value = [0, 0, np.nan, 1, 2, np.nan, 0, np.nan, 0, 2, 1, 1, 0, 2, np.nan]
score = [8, 7, 4, 11, 22, 12, 4, 15, 5, 24, 12, 15, 5, 26, 28]
df = pd.DataFrame({'value': value, 'score': score})
Use pd.cut, then fillna:
df.value.fillna(pd.cut(df.score, [-np.inf, 10, 20, np.inf], labels=[0, 1, 2]).astype(int), inplace=True)
df
Out[6]:
value score
0 0.0 8
1 0.0 7
2 0.0 4
3 1.0 11
4 2.0 22
5 1.0 12
6 0.0 4
7 1.0 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 2.0 28
You could use numpy.select with conditions on <10, 10≤score<20, etc., but a more efficient option is a floor division: values below 10 become 0, values from 10 to below 20 become 1, and so on.
df['value'] = df['value'].fillna(df['score'].floordiv(10))
with numpy.select:
df['value'] = df['value'].fillna(np.select([df['score'].lt(10),
                                            df['score'].between(10, 20),
                                            df['score'].ge(20)],
                                           [0, 1, 2]))
output:
value score
0 0.0 8
1 0.0 7
2 0.0 4
3 1.0 11
4 2.0 22
5 1.0 12
6 0.0 4
7 1.0 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 2.0 28
use np.select or pd.cut to map the intervals to values, then fillna:
mapping = np.select((df['score'] < 10, df['score'] > 20),
(0, 2), 1)
df['value'] = df['value'].fillna(mapping)

How to represent the column with the most NaN values in a pandas df?

I can show the counts with df.isnull().sum() and get the max value with df.isnull().sum().max(),
but can someone tell me how to get the name of the column with the most NaNs?
Thank you all!
Use Series.idxmax with DataFrame.loc to select the column with the most missing values:
df.loc[:, df.isnull().sum().idxmax()]
If you need to select multiple columns that tie for the maximum, compare the Series with its max value:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, np.nan, 5, np.nan, 4],
    'C': [7, 8, 9, np.nan, 2, np.nan],
    'D': [1, np.nan, 5, 7, 1, 0]
})
print(df)
A B C D
0 a 4.0 7.0 1.0
1 b 5.0 8.0 NaN
2 c NaN 9.0 5.0
3 d 5.0 NaN 7.0
4 e NaN 2.0 1.0
5 f 4.0 NaN 0.0
s = df.isnull().sum()
df = df.loc[:, s.eq(s.max())]
print(df)
B C
0 4.0 7.0
1 5.0 8.0
2 NaN 9.0
3 5.0 NaN
4 NaN 2.0
5 4.0 NaN

How to perform a rolling window on a pandas DataFrame whose rows contain NaN values that should not be replaced?

I have the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
                   [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
                   [0, 2, 2, np.nan, 2, np.nan, 1, 1]])
With output:
0 1 2 3 4 5 6 7
0 0 1 2 4 NaN NaN NaN 1
1 0 1 2 NaN NaN NaN NaN 1
2 0 2 2 NaN 2 NaN 1 1
with dtypes:
df.dtypes
0 int64
1 int64
2 int64
3 float64
4 float64
5 float64
6 float64
7 int64
Then the following rolling summation is applied:
df.rolling(window = 7, min_periods =1, axis = 'columns').sum()
And the output is as follows:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 4.0 4.0 4.0 4.0 4.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 2.0 2.0 3.0 5.0
I notice that the rolling window stops and starts again whenever the dtype of the next column is different.
I, however, have a dataframe in which all columns are of the same object type:
df = df.astype('object')
for which the same rolling sum gives:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 7.0 7.0 7.0 8.0
1 0.0 1.0 3.0 3.0 3.0 3.0 3.0 4.0
2 0.0 2.0 4.0 4.0 6.0 6.0 7.0 8.0
My desired output, however, stops and starts again after a NaN value appears. It would look like:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 NaN NaN NaN 8.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 6.0 NaN 7.0 8.0
I figured there must be a way that NaN values are not considered but also not filled in with values obtained from the rolling window.
Anything would help!
A workaround is:
Record where the NaN values are located:
nan = df.isnull()
Apply the rolling window:
df = df.rolling(window=7, min_periods=1, axis='columns').sum()
Then only show the values where the mask is False:
df[~nan]
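Put together as one runnable sketch (the axis argument of rolling is deprecated in recent pandas, so this version transposes instead):
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
                   [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
                   [0, 2, 2, np.nan, 2, np.nan, 1, 1]])

nan = df.isnull()                                        # remember the NaN positions
rolled = df.T.rolling(window=7, min_periods=1).sum().T   # row-wise rolling sum via a transpose
print(rolled[~nan])                                      # mask the original NaN positions back in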

Padding the missing values in each group with the previous value

There are three columns of data: the first column is a category id, and the second and third columns have some missing values. I want to group by the id of the first column and, within each group, fill the missing values of the third column using the 'ffill' method.
I found a good idea here: Pandas: filling missing values by weighted average in each group!, but it didn't solve my problem because the output it produced was not what I wanted.
The following code builds the example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'sss': [1, np.nan, 3, np.nan, np.nan, np.nan, 2, np.nan, np.nan]})
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN NaN
2 B NaN 3.0
3 B 2.0 NaN
4 B 3.0 NaN
5 B 1.0 NaN
6 C 3.0 2.0
7 C NaN NaN
8 C 3.0 NaN
Fill in missing values with a previous value after grouping
Then I ran the following code, but it outputs strange results:
df["sss"] = df.groupby("name").transform(lambda x: x.fillna(axis = 0,method = 'ffill'))
df
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN NaN
3 B 2.0 2.0
4 B 3.0 3.0
5 B 1.0 1.0
6 C 3.0 3.0
7 C NaN 3.0
8 C 3.0 3.0
The result I want is this:
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN 3.0
3 B 2.0 3.0
4 B 3.0 3.0
5 B 1.0 3.0
6 C 3.0 2.0
7 C NaN 2.0
8 C 3.0 2.0
Can someone point out where I am wrong?
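No answer survives on this page, but a likely explanation (my reading, not the original answer): groupby(...).transform(...) runs the fill over every non-grouping column, and the forward-filled 'value' column is apparently what ends up in sss, which is why sss mirrors value in the output above. Selecting the column first gives the desired result:
df['sss'] = df.groupby('name')['sss'].ffill()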