Grouping and deleting whole groups on a condition - pandas

In a pandas DataFrame I first want to group the data by the 'batch_id' column, then check the 'result' column: if all values in a group are 'negative', delete that group.
I tried the following code:
df.groupby('batch_id').filter(lambda g: (g.result != 'negative').all())
but this keeps only the groups in which every value differs from 'negative', so mixed groups get dropped as well. The condition has to be negated instead: keep a group unless all of its results are 'negative'.

Create the DataFrame:
import pandas as pd

df = pd.DataFrame({'batch_id': [5, 1, 2, 3, 4, 1, 2, 1, 4, 3],
                   'result': ['good', 'negative', '2,000', 'negative', '66',
                              'negative', 'negative', 'negative', '22', 'clean']})
batch_id result
0 5 good
1 1 negative
2 2 2,000
3 3 negative
4 4 66
5 1 negative
6 2 negative
7 1 negative
8 4 22
9 3 clean
Apply the negated filter:
df.groupby('batch_id').filter(lambda g: ~(g.result == 'negative').all())
Output:
batch_id result
0 5 good
2 2 2,000
3 3 negative
4 4 66
6 2 negative
8 4 22
9 3 clean
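On large frames, GroupBy.filter with a Python lambda can be slow because the callable runs once per group. A vectorized sketch of the same logic builds a boolean mask with transform instead (standard pandas API, just a different route):
# Mark rows whose entire batch is 'negative', then keep the rest
all_negative = df['result'].eq('negative').groupby(df['batch_id']).transform('all')
df[~all_negative]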

Related

Viewing frequency of multiple values in grouped Pandas data frame

I have a data frame with three columns A, B, C, taking numeric values in {1,2}, {6,7}, and {11,12} respectively. I would like to know the following: for what fraction of the observed pairs (A,B) do we have both observations with C=11 and observations with C=12?
I start by entering the dataframe:
df = pd.DataFrame({"A": [1, 2, 1, 1, 2, 1, 1, 2], "B": [6,7,7,6,7,6,6,6], "C": [11,12,11,11,12,12,11,12]})
--------
A B C
0 1 6 11
1 2 7 12
2 1 7 11
3 1 6 11
4 2 7 12
5 1 6 12
6 1 6 11
7 2 6 12
Then I think I need to use groupby. I run:
g = df.groupby(["A", "B"])
g.C.value_counts()
-----------
A B C
1 6 11 3
12 1
7 11 1
2 6 12 1
7 12 2
Name: C, dtype: int64
This shows that there is one (A,B) pair for which we have both C=11 and C=12, and 3 pairs of (A,B) for which we only have either C=11 or C=12. So I would like pandas to tell me that C takes both values for 25% of the (A,B) pairs and only one value for 75% of them.
How can I accomplish this? I would like to do so for a big data frame where I can't just eyeball it from the value_counts output; this small dataframe is just to illustrate.
Thanks!
Pass normalize=True
out = df.groupby(["A", "B"]).C.value_counts(normalize=True)
Output:
A B C
1 6 11 0.75
12 0.25
7 11 1.00
2 6 12 1.00
7 12 1.00
Name: C, dtype: float64
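normalize=True gives the share of each C value within its own (A,B) group. To get the single fraction the question actually asks for (what share of pairs have both C values), one small follow-up sketch counts distinct C values per pair with nunique:
# Number of distinct C values observed for each (A, B) pair
n_values = df.groupby(["A", "B"])["C"].nunique()
(n_values == 2).mean()  # fraction of pairs with both C=11 and C=12 -> 0.25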

Converting classes to nearest group with maximum vote

I have time series data with a multiclass label column. I would like to convert any class occurring in fewer than two instances to the nearest bigger group. Here is an example of the data frame; I wish to convert the 'No' that appears in the 4th row to 'Yes' and, similarly, the 'Yes' in the 16th row to 'No'.
df = pd.DataFrame(data={'A': ['Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Probable',
                              'Probable', 'Probable', 'Yes', 'Yes', 'Yes',
                              'No', 'No', 'No', 'Yes', 'No', 'No'],
                        'Counter': [1, 2, 3, 1, 1, 2, 1, 2, 3, 1, 2, 3, 1, 2,
                                    3, 1, 1, 2]})
Could anyone help me to define a loop or function? Thank you in advance.
The idea is to identify runs of repeated categories and calculate the total count per run; then, for runs whose total count is below 2, change their category to the nearest one (which we take to be the preceding run with a total count >= 2).
This identifies the consecutive runs and calculates the sum of Counter per run, putting the result in gc (for 'group counter'):
group_id = df['A'].ne(df['A'].shift()).cumsum()
df['gc'] = df.groupby(group_id)['Counter'].transform('sum')
df
output:
A Counter gc
-- -------- --------- ----
0 Yes 1 6
1 Yes 2 6
2 Yes 3 6
3 No 1 1
4 Yes 1 3
5 Yes 2 3
6 Probable 1 6
7 Probable 2 6
8 Probable 3 6
9 Yes 1 6
10 Yes 2 6
11 Yes 3 6
12 No 1 6
13 No 2 6
14 No 3 6
15 Yes 1 1
16 No 1 3
17 No 2 3
Now we replace categories with NaNs for those where gc is < 2, and fill forward:
import numpy as np

df.loc[df['gc'] < 2, 'A'] = np.nan
df.ffill().drop(columns='gc')
output:
A Counter
-- -------- ---------
0 Yes 1
1 Yes 2
2 Yes 3
3 Yes 1
4 Yes 1
5 Yes 2
6 Probable 1
7 Probable 2
8 Probable 3
9 Yes 1
10 Yes 2
11 Yes 3
12 No 1
13 No 2
14 No 3
15 No 1
16 No 1
17 No 2
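Starting again from the original df, the two steps can also be wrapped into a small reusable helper (a sketch; merge_small_runs and min_count are names invented here, not part of the original answer):
def merge_small_runs(labels, counter, min_count=2):
    # Each run of identical consecutive labels gets one id
    run_id = labels.ne(labels.shift()).cumsum()
    # Total counter per run, broadcast back to the rows
    total = counter.groupby(run_id).transform('sum')
    # Blank out the small runs, then inherit the preceding run's label
    return labels.mask(total < min_count).ffill()

df['A'] = merge_small_runs(df['A'], df['Counter'])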

create column based on column values - merge integers

I would like to create a new column "Group". The integer values from column "Step_ID" should be converted into 1 and 2: the first two values should be converted to 1, the next two values to 2, the following two values to 1, and so on, as in the output below.
import pandas as pd
data = {'Step_ID': [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 11]}
df1 = pd.DataFrame(data)
You can try:
# Round odd Step_IDs up to the next even number, so each consecutive
# pair (1, 2), (3, 4), ... collapses to a single value
m = (df1.Step_ID % 2) + df1.Step_ID
# Number the runs of equal m, then alternate 1, 2, 1, 2, ...
df1['new_group'] = (m.ne(m.shift()).cumsum() % 2).replace(0, 2)
OUTPUT:
Step_ID new_group
0 1 1
1 1 1
2 2 1
3 2 1
4 3 2
5 4 2
6 5 1
7 6 1
8 6 1
9 7 2
10 8 2
11 8 2
12 9 1
13 10 1
14 11 2
15 11 2
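Because the Step_ID pairs are consecutive integers, plain arithmetic gives the same column with no run numbering at all (a sketch under that assumption):
# (Step_ID - 1) // 2 numbers the pairs 0, 0, 1, 1, 2, 2, ...;
# mod 2 alternates 0/1, and +1 maps that to 1/2
df1['new_group'] = (df1.Step_ID - 1) // 2 % 2 + 1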

Pandas Dataframe get trend in column

I have a dataframe:
np.random.seed(1)
df1 = pd.DataFrame({'day': [3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6],
                    'item': [1, 1, 2, 2, 1, 2, 3, 3, 4, 3, 4],
                    'price': np.random.randint(1, 30, 11)})
day item price
0 3 1 6
1 4 1 12
2 4 2 13
3 4 2 9
4 5 1 10
5 5 2 12
6 5 3 6
7 5 3 16
8 5 4 1
9 6 3 17
10 6 4 2
After the groupby code gb = df1.groupby(['day','item'])['price'].mean(), I get:
gb
day item
3 1 6
4 1 12
2 11
5 1 10
2 12
3 11
4 1
6 3 17
4 2
Name: price, dtype: int64
I want to get the trend from the groupby series and put it back into the dataframe column price. The trend is the variation of the item's price with respect to the previous day's mean price for that item:
day item price
0 3 1 nan
1 4 1 6
2 4 2 nan
3 4 2 nan
4 5 1 -2
5 5 2 1
6 5 3 nan
7 5 3 nan
8 5 4 nan
9 6 3 6
10 6 4 1
Please help me to code the last step. A single or double line of code would be most helpful. As the actual dataframe is huge, I would like to avoid iteration.
Hope this helps!
# get the average price per (day, item)
mean_df = df1.groupby(['day', 'item'])['price'].mean().reset_index()
# rename columns
mean_df.columns = ['day', 'item', 'average_price']
# sort by day and item in ascending order
mean_df = mean_df.sort_values(by=['day', 'item'])
# shift each item's average price down to its next observed day
mean_df['shifted_average_price'] = mean_df.groupby(['item'])['average_price'].shift(1)
# combine with the original df
df1 = pd.merge(df1, mean_df, on=['day', 'item'])
# replace the price by the difference from the previous day's average
df1['price'] = df1['price'] - df1['shifted_average_price']
# drop the helper columns
df1.drop(['average_price', 'shifted_average_price'], axis=1, inplace=True)
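Starting again from the original df1, a more compact route (a sketch, not from the original answer; 'prev' is a column name invented here) computes the shifted means directly on the grouped series and joins them back:
gb = df1.groupby(['day', 'item'])['price'].mean()
# previous observed day's mean for each item (the index is sorted by day within item)
prev = gb.groupby(level='item').shift().rename('prev')
df1 = df1.join(prev, on=['day', 'item'])
df1['price'] = df1['price'] - df1.pop('prev')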

Conditional filter of entire group for DataFrameGroupBy

If I have the following data
>>> data = pd.DataFrame({'day': [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
'hour':[4, 5, 6, 7, 4, 5, 6, 7, 4, 7]})
>>> data
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
8 3 4
9 4 7
And I would like to keep only the days where hour has 4 unique values, so I would think to do something like this:
>>> data.groupby('day').apply(lambda x: x[x['hour'].nunique() == 4])
But this returns KeyError: True
I am hoping to get this
>>> data
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
Here we see that the rows for day == 3 and day == 4 have been filtered out, because when grouped by day they don't have 4 unique values of hour. I'm doing this at scale, so simply filtering out day == 3 and day == 4 by hand is not an option. I think grouping would be a good way to do this but I can't get it to work. Does anyone have experience applying functions to a DataFrameGroupBy?
The lambda returns a scalar boolean, so x[x['hour'].nunique() == 4] tries to look up a column labelled True, which raises the KeyError. What you actually need is to filter the data:
>>> data.groupby('day').filter(lambda x: x['hour'].nunique() == 4)
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
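On large frames a boolean mask built with transform usually beats the per-group Python call in filter (a sketch using the same standard pandas API):
# number of unique hours per day, broadcast to every row, then mask
data[data.groupby('day')['hour'].transform('nunique') == 4]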