Count NaN Single Column with Multiple Conditions in Other Columns - pandas

I can't seem to figure this out despite trying many different things, and I haven't found an answer anywhere on the web. I have values in a single column "data", and I need to count occurrences of NaN in this column grouped by conditions in two other columns. My data resembles this:
   site  data  day  month  year
0   Red   NaN   20      1  2020
1   Red   5.6   31      1  2020
2   Red   NaN    6      1  2020
3   Red   NaN    9      2  2020
4  Blue   4.5   14      1  2020
5  Blue   6.2   19      2  2020
6  Blue   NaN   11      2  2020
The outcome should look like this:
   site  month  count  sumNaN
0   Red      1      3       2
1   Red      2      1       1
2  Blue      1      1       0
3  Blue      2      2       1
Thank you very much.

Try:
(df.assign(data=df['data'].isna())
   .groupby(['site', 'month'])['data']
   .agg(['count', 'sum'])
   .reset_index()
)
Output:
   site  month  count  sum
0  Blue      1      1    0
1  Blue      2      2    1
2   Red      1      3    2
3   Red      2      1    1

You can use named aggregation within agg:
(df.groupby(['site', 'month'], as_index=False)
   .agg(count=('data', 'size'),
        sumNaN=('data', lambda s: s.isna().sum()))
)
   site  month  count  sumNaN
0  Blue      1      1     0.0
1  Blue      2      2     1.0
2   Red      1      3     2.0
3   Red      2      1     1.0
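Both answers can be checked end to end with a small sketch (the frame below just rebuilds the example data from the question):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'site':  ['Red', 'Red', 'Red', 'Red', 'Blue', 'Blue', 'Blue'],
    'data':  [np.nan, 5.6, np.nan, np.nan, 4.5, 6.2, np.nan],
    'day':   [20, 31, 6, 9, 14, 19, 11],
    'month': [1, 1, 1, 2, 1, 2, 2],
    'year':  [2020] * 7,
})

# count = rows per group ('size' includes NaN); sumNaN = NaNs per group
out = (df.groupby(['site', 'month'], as_index=False)
         .agg(count=('data', 'size'),
              sumNaN=('data', lambda s: s.isna().sum())))
print(out)
```

Note the difference between 'size' (counts all rows) and 'count' (counts non-null rows) when choosing the aggregation name.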

Related

Pandas: Get rolling mean with a add operation in between

My Pandas df is like:
ID  delta  price
 1   -2.0    4.0
 2    2.0    5.0
 3   -3.0    3.0
 4    0.8    NaN
 5    0.9    NaN
 6   -2.3    NaN
 7    2.8    NaN
 8    1.0    NaN
 9    1.0    NaN
10    1.0    NaN
11    1.0    NaN
12    1.0    NaN
Pandas already has a robust built-in mean calculation; I need to use it slightly differently.
In my df, the price at row 4 should be the sum of (a) the rolling mean of the prices in rows 1, 2, 3 and (b) the delta at row 4.
Once that is computed, I move to row 5: (a) the rolling mean of the prices in rows 2, 3, 4 plus (b) the delta at row 5 gives the price at row 5, and so on.
I can iterate over the rows to get this, but my actual dataframe is quite big, and iterating row by row would slow things down. Is there a better way to achieve this?
I don't think pandas has a method that can feed a previously calculated value into the next calculation, so we fill in the missing prices with a loop:
n = 3
for x in df.index[df['price'].isna()]:
    # .loc slicing is inclusive (rows x-n..x); the NaN at row x is skipped by sum()
    df.loc[x, 'price'] = (df.loc[x - n:x, 'price'].sum() + df.loc[x, 'delta']) / 4
df
Output:
    ID  delta     price
0    1   -2.0  4.000000
1    2    2.0  5.000000
2    3   -3.0  3.000000
3    4    0.8  3.200000
4    5    0.9  3.025000
5    6   -2.3  1.731250
6    7    2.8  2.689062
7    8    1.0  2.111328
8    9    1.0  1.882910
9   10    1.0  1.920825
10  11    1.0  1.728766
11  12    1.0  1.633125
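For completeness, a runnable sketch of the loop above that rebuilds the question's frame. Note the answer averages the three previous prices together with the delta (dividing by 4), which is what produces the posted output:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': range(1, 13),
    'delta': [-2, 2, -3, 0.8, 0.9, -2.3, 2.8, 1, 1, 1, 1, 1],
    'price': [4, 5, 3] + [np.nan] * 9,
})

n = 3
for x in df.index[df['price'].isna()]:
    # .loc slicing is inclusive (rows x-n..x); the NaN at row x is skipped by sum()
    df.loc[x, 'price'] = (df.loc[x - n:x, 'price'].sum() + df.loc[x, 'delta']) / 4
print(df)
```

The loop is inherently sequential because each filled price feeds into the next window, which is why a vectorized one-liner does not apply here.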

Is there a way to get the previously received message by id in pandas?

I have a dataframe like this:
ID  Message  week
10        A     1
11        A     1
12        C     1
10        B     2
12        B     2
How can I get one like this?
ID  Message  week  previous
10        A     1       NaN
11        A     1       NaN
12        C     1       NaN
10        B     2         A
12        B     2         A
Use an asof merge to bring in the closest message from the past. allow_exact_matches=False prevents matching on the same week.
df = df.sort_values('week')  # merge_asof requires sorted input
res = pd.merge_asof(df, df.rename(columns={'Message': 'previous'}),
                    on='week', by='ID',
                    direction='backward', allow_exact_matches=False)
   ID Message  week previous
0  10       A     1      NaN
1  11       A     1      NaN
2  12       C     1      NaN
3  10       B     2        A
4  12       B     2        C
We can use groupby with Series.shift here:
df["previous"] = df.groupby("ID")["Message"].shift()
   ID Message  week previous
0  10       A     1      NaN
1  11       A     1      NaN
2  12       C     1      NaN
3  10       B     2        A
4  12       B     2        C
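A minimal runnable sketch of the groupby/shift approach, rebuilding the question's frame (note it relies on the rows already being in week order):

```python
import pandas as pd

df = pd.DataFrame({'ID': [10, 11, 12, 10, 12],
                   'Message': ['A', 'A', 'C', 'B', 'B'],
                   'week': [1, 1, 1, 2, 2]})

# Previous message per ID: shift() moves each group's Message down one row
df['previous'] = df.groupby('ID')['Message'].shift()
print(df)
```

If the rows are not guaranteed to be sorted by week, sort_values('week') first, or use the merge_asof answer, which handles ordering explicitly.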

Compute lagged means per name and round in pandas

I need to compute lagged means per group in my dataframe. This is what my df looks like:
  name  value  round
0    a      5      3
1    b      4      3
2    c      3      2
3    d      1      2
4    a      2      1
5    c      1      1
0    c      1      3
1    d      4      3
2    b      3      2
3    a      1      2
4    b      5      1
5    d      2      1
I would like to compute lagged means for column value per name and round. That is, for name a in round 3, value_mean should be 1.5 (because (1+2)/2). And of course there will be NaN values where round = 1.
I tried this:
df['value_mean'] = df.groupby('name').expanding().mean().groupby('name').shift(1)['value'].values
but it gives nonsense:
  name  value  round  value_mean
0    a      5      3         NaN
1    b      4      3         5.0
2    c      3      2         3.5
3    d      1      2         NaN
4    a      2      1         4.0
5    c      1      1         3.5
0    c      1      3         NaN
1    d      4      3         3.0
2    b      3      2         2.0
3    a      1      2         NaN
4    b      5      1         1.0
5    d      2      1         2.5
Any idea how I can do this, please? I found this, but it seems not relevant to my problem: Calculate the mean value using two columns in pandas
You can do that as follows:
import numpy as np

# sort the values so the running totals accumulate in round order
df.sort_values(['name', 'round'], inplace=True)
df.reset_index(drop=True, inplace=True)

# running sum and running count per name as the basis of the running mean
grouper = df.groupby('name')
ser_sum = grouper['value'].cumsum()
ser_count = grouper['value'].cumcount() + 1
ser_mean = ser_sum.div(ser_count)

# shift by one row and blank out the first entry of each name group
# (this sets the entries for round = 1 to NaN)
ser_same_name = df['name'] == df['name'].shift(1)
df['value_mean'] = ser_mean.shift(1).where(ser_same_name, np.nan)

# uncomment the following lines to see the intermediate products
# df['sum'] = ser_sum
# df['count'] = ser_count
df
Output:
   name  value  round  value_mean
0     a      2      1         NaN
1     a      1      2         2.0
2     a      5      3         1.5
3     b      5      1         NaN
4     b      3      2         5.0
5     b      4      3         4.0
6     c      1      1         NaN
7     c      3      2         1.0
8     c      1      3         2.0
9     d      2      1         NaN
10    d      1      2         2.0
11    d      4      3         1.5
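An equivalent, more compact route (a sketch, not taken from the answer above): after sorting by name and round, take the expanding mean per group and shift it one row so the current round is excluded:

```python
import pandas as pd

# Rebuild the question's frame
df = pd.DataFrame({
    'name':  ['a', 'b', 'c', 'd', 'a', 'c', 'c', 'd', 'b', 'a', 'b', 'd'],
    'value': [5, 4, 3, 1, 2, 1, 1, 4, 3, 1, 5, 2],
    'round': [3, 3, 2, 2, 1, 1, 3, 3, 2, 2, 1, 1],
})

df = df.sort_values(['name', 'round']).reset_index(drop=True)
# Expanding mean of the earlier rounds only; shift() excludes the current row
df['value_mean'] = (df.groupby('name')['value']
                      .transform(lambda s: s.expanding().mean().shift()))
print(df)
```

This is essentially what the question's one-liner attempted; doing the expanding mean and the shift inside a single per-group transform keeps the row alignment intact.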

Pandas get order of column value grouped by other column value

I have the following dataframe:
srch_id  price
      1     30
      1     20
      1     25
      3     15
      3    102
      3     39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id  price  price_position
      1     30               3
      1     20               1
      1     25               2
      3     15               1
      3    102               3
      3     39               2
I think I need to use the transform function, but I can't figure out how to handle the argument that .transform() passes to my function:
def k(r):
    return min(r)

tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Is r a list or a single element?
You can use Series.rank() with df.groupby():
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
   srch_id  price  price_position
0        1     30             3.0
1        1     20             1.0
2        1     25             2.0
3        3     15             1.0
4        3    102             3.0
5        3     39             2.0
Or, using sort_values with cumcount:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Output:
   srch_id  price  price_position
0        1     30               3
1        1     20               1
2        1     25               2
3        3     15               1
4        3    102               3
5        3     39               2
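A quick end-to-end sketch of the rank() approach (the cast to int is safe here only because there are no tied prices in this example):

```python
import pandas as pd

df = pd.DataFrame({'srch_id': [1, 1, 1, 3, 3, 3],
                   'price': [30, 20, 25, 15, 102, 39]})

# rank() gives 1 to the lowest price within each srch_id group
df['price_position'] = df.groupby('srch_id')['price'].rank().astype(int)
print(df)
```

With ties, rank() averages positions by default (method='average'); pass method='first' or method='dense' to control how tied prices are numbered.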

Pandas Dynamic Index Referencing during Calculation

I have the following data frame
   val  sum
0    1    0
1    2    0
2    3    0
3    4    0
4    5    0
5    6    0
6    7    0
I would like to calculate, for each row, the sum of that row's value and the next two rows' values. I need to do this for very big files. What is the most efficient way? The expected result is:
   val  sum
0    1    6
1    2    9
2    3   12
3    4   15
4    5   18
5    6   13
6    7    7
In general, how can I dynamically reference other rows (via boolean operations) while making assignments?
Use a rolling sum shifted back two rows (pd.rolling_sum was removed from pandas long ago; the .rolling() method is the current spelling):
df['val'].rolling(window=3).sum().shift(-2)

0     6.0
1     9.0
2    12.0
3    15.0
4    18.0
5     NaN
6     NaN
Name: val, dtype: float64
If you want the last values to be "filled in", you'll need to tack NaNs onto the end of your dataframe.
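An alternative sketch (not the answer's method) that fills the last values without padding: run the rolling sum over the reversed series with min_periods=1, then reverse back:

```python
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 4, 5, 6, 7]})

# Sum of the current row and the next two rows; min_periods=1 lets the
# trailing windows shrink instead of producing NaN
df['sum'] = df['val'][::-1].rolling(window=3, min_periods=1).sum()[::-1].astype(int)
print(df)
```

Reversing turns a forward-looking window into a backward-looking one, which is what rolling() natively computes; the second [::-1] restores the original order and the assignment aligns on the index.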