I've got an ordered DataFrame, df, grouped by 'ID' and ordered by 'order':
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
     'order': [1, 3, 4, 6, 7, 9, 11, 12, 13, 14, 15, 16, 19, 25, 8, 10, 15, 17, 20, 25, 29, 31],
     'col1': [1, 2, np.nan, 1, 2, 3, 4, 5, np.nan, np.nan, 6, 7, 8, 9, np.nan, np.nan, np.nan, 10, 11, 12, np.nan, 13],
     'col2': [1, 5, 6, np.nan, 1, 2, 3, np.nan, 2, 3, np.nan, np.nan, 3, 1, 5, np.nan, np.nan, np.nan, 2, 3, np.nan, np.nan],
     }
)
In each ID group, I need to sum col1 over every run of rows whose col2 is NA, together with the first following row where col2 is present:
I'd prefer a vectorised solution to make it fast, but it could be difficult.
I need to use this in a groupby (col1_dynamic_sum should be computed per ID).
What I have done so far is define a function that counts the number of consecutive NAs in the preceding rows:
def count_prev_consec_na(input_col):
    """
    Take a DataFrame Series (column) and return, per row, the number of
    consecutive missing values in the previous rows.
    """
    try:
        a1 = input_col.isna() + 0    ## 1 where missing
        a2 = ~input_col.isna() + 0   ## 1 where not missing
        b1 = a1.shift().fillna(0)    ## 1 where the previous row was missing
        d = a1.cumsum()              ## running count of missing values
        e = b1 * a2                  ## 1 only on rows that end a missing run
        f = d * e                    ## cumulative NA count at the run ends
        g = f.replace(0, np.nan)
        h = g.ffill().fillna(0)
        i = h.shift()
        result = (h - i).fillna(0)   ## length of the run that just ended
        return result
    except Exception as e:
        print(e)  ## e.message does not exist in Python 3
        return None
I think one solution is to use this to get a dynamic number of rows that need to be rolled back for the sum:
df['roll_back_count'] = df.groupby('ID', as_index=False).col2.transform(count_prev_consec_na)
ID order col1 col2 roll_back_count
A 1 1.0 1.0 0.0
A 3 2.0 5.0 0.0
A 4 NaN 6.0 0.0
A 6 1.0 NaN 0.0
A 7 2.0 1.0 1.0 ## I want to sum col1 of order 6 and 7 and remove order 6 row
A 9 3.0 2.0 0.0
A 11 4.0 3.0 0.0
A 12 5.0 NaN 0.0
A 13 NaN 2.0 1.0 ## I want to sum col1 of order 12 and 13 and remove order 12 row
A 14 NaN 3.0 0.0
A 15 6.0 NaN 0.0
A 16 7.0 NaN 0.0
A 19 8.0 3.0 2.0 ## I want to sum col1 of order 15,16,19 and remove order 15 and 16 rows
A 25 9.0 1.0 0.0
B 8 NaN 5.0 0.0
B 10 NaN NaN 0.0
B 15 NaN NaN 0.0
B 17 10.0 NaN 0.0
B 20 11.0 2.0 3.0 ## I want to sum col1 of order 10,15,17,20 and remove order 10,15,17 rows
B 25 12.0 3.0 0.0
B 29 NaN NaN 0.0
B 31 13.0 NaN 0.0
This is my desired output:
ID order col1_dynamic_sum col2
A 1 1.0 1
A 3 2.0 5
A 4 NaN 6
A 7 3.0 1
A 9 3.0 2
A 11 4.0 3
A 13 5.0 2
A 14 NaN 3
A 19 21.0 3
A 25 9.0 1
B 8 NaN 5
B 20 21.0 2
B 25 12.0 3
Note: the sums should ignore NAs.
Again, I'd prefer a vectorised solution, but it might not be possible due to the rolling effect.
Gah, I think I found a solution that doesn't involve rolling at all!
I created a new grouping ID based on the NA values of col2, using the index of the rows that do have values (backfilled over the NA rows). I then use this grouping ID to aggregate!
def create_na_group(rollback_col):
    a = ~rollback_col.isna() + 0  ## 1 where a value is present
    b = a.replace(0, np.nan)      ## NaN where the value is missing
    c = rollback_col.index
    d = (c * b).bfill()           ## each NA row gets the index of the next non-NA row
    return d
df['na_group'] = df.groupby('ID', as_index=False).col2.transform(create_na_group)
df = df.loc[~df.na_group.isna()]
desired_output = df.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    col1_dyn_sum=('col1', 'sum'),
    col2=('col2', 'sum'),
)
I just have to find a way to make sure all-NaN groups stay NaN instead of becoming 0, as in rows 2, 7 and 10 below.
ID na_group order col1_dyn_sum col2
0 A 0.0 1 1.0 1.0
1 A 1.0 3 2.0 5.0
2 A 2.0 4 0.0 6.0
3 A 4.0 7 3.0 1.0
4 A 5.0 9 3.0 2.0
5 A 6.0 11 4.0 3.0
6 A 8.0 13 5.0 2.0
7 A 9.0 14 0.0 3.0
8 A 12.0 19 21.0 3.0
9 A 13.0 25 9.0 1.0
10 B 14.0 8 0.0 5.0
11 B 18.0 20 21.0 2.0
12 B 19.0 25 12.0 3.0
I'll just create two separate sum columns with lambda x: x.sum(skipna=False) and lambda x: x.sum(skipna=True). If the skipna=True sum is 0 and the skipna=False sum is NaN, I'll leave the final sum as NaN; otherwise I'll use the skipna=True sum as the final desired output.
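A minimal sketch of that two-sum idea (sum_skipna, sum_keepna and col1_dynamic_sum are hypothetical names, not from the code above):
agg = df.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    sum_skipna=('col1', 'sum'),                          ## sum defaults to skipna=True
    sum_keepna=('col1', lambda x: x.sum(skipna=False)),  ## NaN if any value is NaN
    col2=('col2', 'sum'),
)
## NaN only where the group summed to 0 with skipna and to NaN without it
all_na = agg['sum_skipna'].eq(0) & agg['sum_keepna'].isna()
agg['col1_dynamic_sum'] = agg['sum_skipna'].mask(all_na)
agg = agg.drop(columns=['sum_skipna', 'sum_keepna'])
Alternatively, ('col1', lambda x: x.sum(min_count=1)) does this in one step: with min_count=1, sum returns NaN whenever a group has no non-NA values.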
For this DataFrame:
values ii
0 3.0 4
1 0.0 1
2 3.0 8
3 2.0 5
4 2.0 1
5 3.0 5
6 2.0 4
7 1.0 8
8 0.0 5
9 1.0 1
This line raises "Must produce aggregated value":
bii2=df.groupby(['ii'])['values'].agg(pd.Series.mode)
while this line works:
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x)[0])
Could you explain why that is?
The problem is that mode sometimes returns 2 or more values; check the result with GroupBy.apply:
bii2=df.groupby(['ii'])['values'].apply(pd.Series.mode)
print (bii2)
ii
1 0 0.0
1 1.0
2 2.0
4 0 2.0
1 3.0
5 0 0.0
1 2.0
2 3.0
8 0 1.0
1 3.0
Name: values, dtype: float64
And agg needs a scalar output, so it raises that error. If you select just the first value, it works nicely:
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x).iat[0])
print (bii3)
ii
1 0.0
4 2.0
5 0.0
8 1.0
Name: values, dtype: float64
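If you want all the modes per group, one possible variant (a sketch) is to return a plain Python list, which agg accepts because a list counts as a single object value:
bii4 = df.groupby('ii')['values'].agg(lambda x: x.mode().tolist())
print (bii4)
ii
1    [0.0, 1.0, 2.0]
4         [2.0, 3.0]
5    [0.0, 2.0, 3.0]
8         [1.0, 3.0]
Name: values, dtype: object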
I am trying to make a new column 'Id' that assigns a unique ID to each run of non-NaN values in the 'Data' column; if the non-null values come right after each other, the ID stays the same, and NaN rows get 0. I have provided how my final Id column should look below as a reference. Could anyone guide me on this?
Id Data
0 NaN
0 NaN
0 NaN
1 54
1 55
0 NaN
0 NaN
2 67
0 NaN
0 NaN
3 33
3 44
3 22
0 NaN
groupby the cumsum to get consecutive groups, using where to mask the NaN rows. ngroup gets the consecutive IDs; it is also possible with rank:
s = df.Data.isnull().cumsum().where(df.Data.notnull())
df['ID'] = df.groupby(s).ngroup()+1
# df['ID'] = s.rank(method='dense').fillna(0).astype(int)
Output:
Data ID
0 NaN 0
1 NaN 0
2 NaN 0
3 54.0 1
4 55.0 1
5 NaN 0
6 NaN 0
7 67.0 2
8 NaN 0
9 NaN 0
10 33.0 3
11 44.0 3
12 22.0 3
13 NaN 0
Using factorize
v = pd.factorize(df.Data.isnull().cumsum()[df.Data.notnull()])[0] + 1
df.loc[df.Data.notnull(), 'Newid'] = v
df['Newid'] = df['Newid'].fillna(0)
df
Id Data Newid
0 0 NaN 0.0
1 0 NaN 0.0
2 0 NaN 0.0
3 1 54.0 1.0
4 1 55.0 1.0
5 0 NaN 0.0
6 0 NaN 0.0
7 2 67.0 2.0
8 0 NaN 0.0
9 0 NaN 0.0
10 3 33.0 3.0
11 3 44.0 3.0
12 3 22.0 3.0
13 0 NaN 0.0
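For context, factorize assigns integer codes in order of first appearance, which is why the non-null runs get consecutive IDs. A small illustration:
codes, uniques = pd.factorize(pd.Series([10, 10, 20, 10, 30]))
print (codes)    # [0 0 1 0 2]
print (uniques)  # Index([10, 20, 30], dtype='int64')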
What is the easiest way to calculate how many value changes there are in a specific DataFrame column? For example, I have the following DF:
a b
0 1
1 1
2 1
3 2
4 1
5 2
6 2
7 3
8 3
9 3
In this DataFrame the value in column b changes 4 times (at rows 4, 5, 6 and 8).
My very simple solution is:
a = 0
for i in range(df.shape[0] - 1):
if df['b'].iloc[i] != df['b'].iloc[i+1]:
a+=1
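For the count alone, a one-line vectorized equivalent of this loop might be the following sketch: diff is non-zero exactly where a value differs from the previous row, and fillna(0) skips the first row's NaN.
a = int(df['b'].diff().fillna(0).ne(0).sum())
print (a)
4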
I think you need boolean indexing with the index:
idx = df.index[df['b'].diff().shift().fillna(0).ne(0)]
print (idx)
Int64Index([4, 5, 6, 8], dtype='int64')
For a more general solution it is possible to index by arange:
a = np.arange(len(df))[df['b'].diff().shift().bfill().ne(0)].tolist()
print (a)
[4, 5, 6, 8]
Explanation:
First get difference by Series.diff:
print (df['b'].diff())
0 NaN
1 0.0
2 0.0
3 1.0
4 -1.0
5 1.0
6 0.0
7 1.0
8 0.0
9 0.0
Name: b, dtype: float64
Then shift by one value:
print (df['b'].diff().shift())
0 NaN
1 NaN
2 0.0
3 0.0
4 1.0
5 -1.0
6 1.0
7 0.0
8 1.0
9 0.0
Name: b, dtype: float64
Replace the first NaNs by fillna:
print (df['b'].diff().shift().fillna(0))
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 -1.0
6 1.0
7 0.0
8 1.0
9 0.0
Name: b, dtype: float64
And compare for not equal to 0:
print (df['b'].diff().shift().fillna(0).ne(0))
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
9 False
Name: b, dtype: bool
If a is a column and not the index:
idx = df['a'].loc[df['b'].diff().shift().fillna(0) != 0]
I am running the code below and getting this:
import pandas as pd
pf=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x=pf[pf['fuv1'] == 0].count()*100/1892
x
id 0.528541
date 0.528541
count 0.528541
idade 0.528541
site 0.528541
baseline 0.528541
fuv1 0.528541
fuv2 0.475687
fuv3 0.528541
fuv4 0.475687
dtype: float64
What I want is just to get the single result 0.528541 and drop all the other rows.
What should I do?
Thanks.
If you want to count the number of 0 values in column fuv1, use sum to count the Trues, which are treated as 1s:
print ((pf['fuv1'] == 0).sum())
10
x = (pf['fuv1'] == 0).sum()*100/1892
print (x)
0.528541226216
Explanation of why the outputs differ - count excludes NaNs:
pf=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x=pf[pf['fuv1'] == 0]
print (x)
id date count idade site baseline fuv1 fuv2 fuv3 fuv4
0 0 4/1/2016 10 13 A 1 0.0 1.0 0.0 1.0
2 2 4/3/2016 9 5 C 1 0.0 NaN 0.0 1.0
3 3 4/4/2016 108 96 D 1 0.0 1.0 0.0 NaN
11 11 4/12/2016 6 13 C 1 0.0 1.0 1.0 0.0
13 13 4/14/2016 12 4 C 1 0.0 1.0 1.0 0.0
40 40 5/11/2016 14 7 C 1 0.0 1.0 1.0 1.0
41 41 5/12/2016 0 26 C 1 0.0 1.0 1.0 1.0
42 42 5/13/2016 10 15 C 1 0.0 1.0 1.0 1.0
60 60 5/31/2016 13 3 D 1 0.0 1.0 1.0 1.0
74 74 6/14/2016 15 7 B 1 0.0 1.0 1.0 1.0
print (x.count())
id 10
date 10
count 10
idade 10
site 10
baseline 10
fuv1 10
fuv2 9
fuv3 10
fuv4 9
dtype: int64
In [282]: pf.loc[pf['fuv1'] == 0, 'id'].count()*100/1892
Out[282]: 0.5285412262156448
import pandas as pd

pf = pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x = (pf['fuv1'] == 0).sum() * 100 / 1892
y = pf["idade"].mean()
l = "Performance"
k = "LTFU"

def test(l1, k1):
    return pd.DataFrame({'a': [l1, k1], 'b': [x, y]})

df1 = test(l, k)
df1.columns = [''] * len(df1.columns)
df1.index = [''] * len(df1.index)
print(round(df1, 2))
Performance 0.53
LTFU 14.13
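As an aside, a sketch of an alternative that avoids blanking the labels in the DataFrame itself: to_string can suppress the header and index at print time.
df1 = test(l, k)
print(df1.round(2).to_string(header=False, index=False))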