Count NaN Single Column with Multiple Conditions in Other Columns - pandas

I can't seem to figure this out despite trying many different things, and I haven't found an answer anywhere on the web. I have values in a single column "data", and I need to count occurrences of NaN in this column grouped by conditions in two other columns. My data resembles this:
   site  data  day  month  year
0   Red   NaN   20      1  2020
1   Red   5.6   31      1  2020
2   Red   NaN    6      1  2020
3   Red   NaN    9      2  2020
4  Blue   4.5   14      1  2020
5  Blue   6.2   19      2  2020
6  Blue   NaN   11      2  2020
The outcome should look like this:
   site  month  count  sumNaN
0   Red      1      3       2
1   Red      2      1       1
2  Blue      1      1       0
3  Blue      2      2       1
Thank you very much.

Try:
(df.assign(data=df['data'].isna())
   .groupby(['site', 'month'])['data']
   .agg(['count', 'sum'])
   .reset_index()
)
Output:
   site  month  count  sum
0  Blue      1      1    0
1  Blue      2      2    1
2   Red      1      3    2
3   Red      2      1    1

You can use named aggregation within agg:
(df.groupby(['site', 'month'], as_index=False)
   .agg(count=('data', 'size'),
        sumNaN=('data', lambda s: s.isna().sum()))
)
   site  month  count  sumNaN
0  Blue      1      1     0.0
1  Blue      2      2     1.0
2   Red      1      3     2.0
3   Red      2      1     1.0
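Both answers can be checked end to end with a small sketch (the frame below just rebuilds the example data from the question):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'site':  ['Red', 'Red', 'Red', 'Red', 'Blue', 'Blue', 'Blue'],
    'data':  [np.nan, 5.6, np.nan, np.nan, 4.5, 6.2, np.nan],
    'day':   [20, 31, 6, 9, 14, 19, 11],
    'month': [1, 1, 1, 2, 1, 2, 2],
    'year':  [2020] * 7,
})

# count = rows per group ('size' includes NaN); sumNaN = NaNs per group
out = (df.groupby(['site', 'month'], as_index=False)
         .agg(count=('data', 'size'),
              sumNaN=('data', lambda s: s.isna().sum())))
print(out)
```

Note the difference between 'size' (counts all rows) and 'count' (counts non-null rows) when choosing the aggregation name.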

Related

Pandas: Get rolling mean with a add operation in between

My Pandas df is like:
ID  delta  price
 1   -2.0    4.0
 2    2.0    5.0
 3   -3.0    3.0
 4    0.8    NaN
 5    0.9    NaN
 6   -2.3    NaN
 7    2.8    NaN
 8    1.0    NaN
 9    1.0    NaN
10    1.0    NaN
11    1.0    NaN
12    1.0    NaN
Pandas already has a robust built-in mean calculation; I need to use it slightly differently.
In my df, the price at row 4 should be the sum of (a) the rolling mean of the prices in rows 1, 2, 3 and (b) the delta at row 4.
Once that is computed, I move to row 5: (a) the rolling mean of the prices in rows 2, 3, 4 plus (b) the delta at row 5 gives the price at row 5, and so on.
I can iterate over the rows to get this, but my actual dataframe is quite big, and iterating row by row would slow things down. Is there a better way to achieve this?
I don't think pandas has a method that can feed a previously calculated value into the next calculation, so we fill in the missing prices with a loop:
n = 3
for x in df.index[df['price'].isna()]:
    # .loc slicing is inclusive (rows x-n..x); the NaN at row x is skipped by sum()
    df.loc[x, 'price'] = (df.loc[x - n:x, 'price'].sum() + df.loc[x, 'delta']) / 4
df
Output:
    ID  delta     price
0    1   -2.0  4.000000
1    2    2.0  5.000000
2    3   -3.0  3.000000
3    4    0.8  3.200000
4    5    0.9  3.025000
5    6   -2.3  1.731250
6    7    2.8  2.689062
7    8    1.0  2.111328
8    9    1.0  1.882910
9   10    1.0  1.920825
10  11    1.0  1.728766
11  12    1.0  1.633125
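For completeness, a runnable sketch of the loop above that rebuilds the question's frame. Note the answer averages the three previous prices together with the delta (dividing by 4), which is what produces the posted output:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': range(1, 13),
    'delta': [-2, 2, -3, 0.8, 0.9, -2.3, 2.8, 1, 1, 1, 1, 1],
    'price': [4, 5, 3] + [np.nan] * 9,
})

n = 3
for x in df.index[df['price'].isna()]:
    # .loc slicing is inclusive (rows x-n..x); the NaN at row x is skipped by sum()
    df.loc[x, 'price'] = (df.loc[x - n:x, 'price'].sum() + df.loc[x, 'delta']) / 4
print(df)
```

The loop is inherently sequential because each filled price feeds into the next window, which is why a vectorized one-liner does not apply here.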

Is there a way to get the previously received message by id in pandas?

I have a dataframe like this:
ID  Message  week
10        A     1
11        A     1
12        C     1
10        B     2
12        B     2
How can I get one like this?
ID  Message  week  previous
10        A     1       NaN
11        A     1       NaN
12        C     1       NaN
10        B     2         A
12        B     2         A
Use an asof merge to bring in the closest message from the past. allow_exact_matches=False prevents matching on the same week.
df = df.sort_values('week')  # merge_asof requires sorted input
res = pd.merge_asof(df, df.rename(columns={'Message': 'previous'}),
                    on='week', by='ID',
                    direction='backward', allow_exact_matches=False)
   ID Message  week previous
0  10       A     1      NaN
1  11       A     1      NaN
2  12       C     1      NaN
3  10       B     2        A
4  12       B     2        C
We can use groupby with Series.shift here:
df["previous"] = df.groupby("ID")["Message"].shift()
   ID Message  week previous
0  10       A     1      NaN
1  11       A     1      NaN
2  12       C     1      NaN
3  10       B     2        A
4  12       B     2        C
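A minimal runnable sketch of the groupby/shift approach, rebuilding the question's frame (note it relies on the rows already being in week order):

```python
import pandas as pd

df = pd.DataFrame({'ID': [10, 11, 12, 10, 12],
                   'Message': ['A', 'A', 'C', 'B', 'B'],
                   'week': [1, 1, 1, 2, 2]})

# Previous message per ID: shift() moves each group's Message down one row
df['previous'] = df.groupby('ID')['Message'].shift()
print(df)
```

If the rows are not guaranteed to be sorted by week, sort_values('week') first, or use the merge_asof answer, which handles ordering explicitly.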

Compute lagged means per name and round in pandas

I need to compute lagged means per group in my dataframe. This is what my df looks like:
  name  value  round
0    a      5      3
1    b      4      3
2    c      3      2
3    d      1      2
4    a      2      1
5    c      1      1
0    c      1      3
1    d      4      3
2    b      3      2
3    a      1      2
4    b      5      1
5    d      2      1
I would like to compute lagged means for column value per name and round. That is, for name a in round 3, value_mean should be 1.5 (because (1+2)/2). And of course there will be NaN values where round = 1.
I tried this:
df['value_mean'] = df.groupby('name').expanding().mean().groupby('name').shift(1)['value'].values
but it gives nonsense:
  name  value  round  value_mean
0    a      5      3         NaN
1    b      4      3         5.0
2    c      3      2         3.5
3    d      1      2         NaN
4    a      2      1         4.0
5    c      1      1         3.5
0    c      1      3         NaN
1    d      4      3         3.0
2    b      3      2         2.0
3    a      1      2         NaN
4    b      5      1         1.0
5    d      2      1         2.5
Any idea how I can do this, please? I found this, but it seems not relevant to my problem: Calculate the mean value using two columns in pandas
You can do that as follows:
import numpy as np

# sort the values so the running totals accumulate in round order
df.sort_values(['name', 'round'], inplace=True)
df.reset_index(drop=True, inplace=True)

# running sum and running count per name as the basis of the running mean
grouper = df.groupby('name')
ser_sum = grouper['value'].cumsum()
ser_count = grouper['value'].cumcount() + 1
ser_mean = ser_sum.div(ser_count)

# shift by one row and blank out the first entry of each name group
# (this sets the entries for round = 1 to NaN)
ser_same_name = df['name'] == df['name'].shift(1)
df['value_mean'] = ser_mean.shift(1).where(ser_same_name, np.nan)

# uncomment the following lines to see the intermediate products
# df['sum'] = ser_sum
# df['count'] = ser_count
df
Output:
   name  value  round  value_mean
0     a      2      1         NaN
1     a      1      2         2.0
2     a      5      3         1.5
3     b      5      1         NaN
4     b      3      2         5.0
5     b      4      3         4.0
6     c      1      1         NaN
7     c      3      2         1.0
8     c      1      3         2.0
9     d      2      1         NaN
10    d      1      2         2.0
11    d      4      3         1.5
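An equivalent, more compact route (a sketch, not taken from the answer above): after sorting by name and round, take the expanding mean per group and shift it one row so the current round is excluded:

```python
import pandas as pd

# Rebuild the question's frame
df = pd.DataFrame({
    'name':  ['a', 'b', 'c', 'd', 'a', 'c', 'c', 'd', 'b', 'a', 'b', 'd'],
    'value': [5, 4, 3, 1, 2, 1, 1, 4, 3, 1, 5, 2],
    'round': [3, 3, 2, 2, 1, 1, 3, 3, 2, 2, 1, 1],
})

df = df.sort_values(['name', 'round']).reset_index(drop=True)
# Expanding mean of the earlier rounds only; shift() excludes the current row
df['value_mean'] = (df.groupby('name')['value']
                      .transform(lambda s: s.expanding().mean().shift()))
print(df)
```

This is essentially what the question's one-liner attempted; doing the expanding mean and the shift inside a single per-group transform keeps the row alignment intact.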

Pandas get order of column value grouped by other column value

I have the following dataframe:
srch_id  price
      1     30
      1     20
      1     25
      3     15
      3    102
      3     39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id  price  price_position
      1     30               3
      1     20               1
      1     25               2
      3     15               1
      3    102               3
      3     39               2
I think I need to use the transform function, but I can't figure out how to handle the argument that .transform() passes to my function:
def k(r):
    return min(r)

tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Is r a list or a single element?
You can use Series.rank() with df.groupby():
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
   srch_id  price  price_position
0        1     30             3.0
1        1     20             1.0
2        1     25             2.0
3        3     15             1.0
4        3    102             3.0
5        3     39             2.0
Or, using sort_values with cumcount:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Output:
   srch_id  price  price_position
0        1     30               3
1        1     20               1
2        1     25               2
3        3     15               1
4        3    102               3
5        3     39               2
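A quick end-to-end sketch of the rank() approach (the cast to int is safe here only because there are no tied prices in this example):

```python
import pandas as pd

df = pd.DataFrame({'srch_id': [1, 1, 1, 3, 3, 3],
                   'price': [30, 20, 25, 15, 102, 39]})

# rank() gives 1 to the lowest price within each srch_id group
df['price_position'] = df.groupby('srch_id')['price'].rank().astype(int)
print(df)
```

With ties, rank() averages positions by default (method='average'); pass method='first' or method='dense' to control how tied prices are numbered.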

Pandas Dynamic Index Referencing during Calculation

I have the following data frame
   val  sum
0    1    0
1    2    0
2    3    0
3    4    0
4    5    0
5    6    0
6    7    0
I would like to calculate, for each row, the sum of that row's value and the next two rows' values. I need to do this for very big files. What is the most efficient way? The expected result is:
   val  sum
0    1    6
1    2    9
2    3   12
3    4   15
4    5   18
5    6   13
6    7    7
In general, how can I dynamically reference other rows (via boolean operations) while making assignments?
Use a rolling sum shifted back two rows (pd.rolling_sum was removed from pandas long ago; the .rolling() method is the current spelling):
df['val'].rolling(window=3).sum().shift(-2)

0     6.0
1     9.0
2    12.0
3    15.0
4    18.0
5     NaN
6     NaN
Name: val, dtype: float64
If you want the last values to be "filled in", you'll need to tack NaNs onto the end of your dataframe.
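An alternative sketch (not the answer's method) that fills the last values without padding: run the rolling sum over the reversed series with min_periods=1, then reverse back:

```python
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 4, 5, 6, 7]})

# Sum of the current row and the next two rows; min_periods=1 lets the
# trailing windows shrink instead of producing NaN
df['sum'] = df['val'][::-1].rolling(window=3, min_periods=1).sum()[::-1].astype(int)
print(df)
```

Reversing turns a forward-looking window into a backward-looking one, which is what rolling() natively computes; the second [::-1] restores the original order and the assignment aligns on the index.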