Converting classes to nearest group with maximum vote - pandas

I have time series data with a multiclass categorical column. I would like to convert categories that occur in runs totalling fewer than two instances to the nearest bigger group. Here is an example of the data frame; I wish to convert the 'No' that appears in the 4th row to 'Yes' and, similarly, the 'Yes' in the 16th row to 'No'.
import pandas as pd

df = pd.DataFrame(data={'A': ['Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Probable',
                              'Probable', 'Probable', 'Yes', 'Yes', 'Yes',
                              'No', 'No', 'No', 'Yes', 'No', 'No'],
                        'Counter': [1, 2, 3, 1, 1, 2, 1, 2, 3, 1, 2, 3, 1, 2,
                                    3, 1, 1, 2]})
Could anyone help me define a loop or function for this? Thank you in advance.

The idea is to identify runs of repeated categories and calculate the total count per run; then, for runs with a total count below 2, change their category to the nearest larger group (which we take to be the preceding run with a total count >= 2).
This identifies the consecutive runs and calculates the sum of Counter per run, putting the result in gc (for 'group counter'):
group_id = df['A'].ne(df['A'].shift()).cumsum()
df['gc'] = df.groupby(group_id)['Counter'].transform('sum')
df
output:
A Counter gc
-- -------- --------- ----
0 Yes 1 6
1 Yes 2 6
2 Yes 3 6
3 No 1 1
4 Yes 1 3
5 Yes 2 3
6 Probable 1 6
7 Probable 2 6
8 Probable 3 6
9 Yes 1 6
10 Yes 2 6
11 Yes 3 6
12 No 1 6
13 No 2 6
14 No 3 6
15 Yes 1 1
16 No 1 3
17 No 2 3
Now we replace the category with NaN in rows where gc < 2, and forward-fill:
import numpy as np

df.loc[df['gc'] < 2, 'A'] = np.nan
df.ffill().drop(columns='gc')
output:
A Counter
-- -------- ---------
0 Yes 1
1 Yes 2
2 Yes 3
3 Yes 1
4 Yes 1
5 Yes 2
6 Probable 1
7 Probable 2
8 Probable 3
9 Yes 1
10 Yes 2
11 Yes 3
12 No 1
13 No 2
14 No 3
15 No 1
16 No 1
17 No 2
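For reference, the two steps can also be chained into a single short expression; a minimal sketch, assuming the df defined in the question (out is just a local name used here):
# runs of identical categories share a group id
group_id = df['A'].ne(df['A'].shift()).cumsum()
# total Counter per run
gc = df.groupby(group_id)['Counter'].transform('sum')
# blank out categories of runs totalling fewer than 2, then forward-fill
out = df.assign(A=df['A'].mask(gc < 2)).ffill()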

Related

find when a value is above threshold and store the result in a list in pandas

Here is my problem.
I have a df like this one:
user_id profile item level amount cumulative_amount
1 1 1 1 10 10
1 1 1 2 30 40
1 1 2 1 10 10
1 1 2 2 10 20
1 1 2 3 20 40
1 1 3 1 40 40
1 1 4 1 20 20
1 1 4 2 20 40
2 1 1 1 10 10
2 1 1 5 30 40
2 1 2 1 10 10
2 1 2 2 10 20
2 1 2 6 20 40
2 1 3 6 40 40
2 1 4 1 20 20
2 1 4 3 20 40
For each user_id, profile and item, I need to know the level at which the cumulative amount reaches a certain threshold (e.g. 40), and store the result in a list of lists.
For example, I should have something like:
[[2, 3, 1, 2], [5, 6, 6, 3]]
Thanks everyone for the help!
IIUC, you can filter for values above or equal to threshold (40), then get the first matching level per group:
(df
 .loc[df['cumulative_amount'].ge(40)]
 .groupby(['user_id', 'profile', 'item'])
 ['level'].first()
)
output Series:
user_id profile item
1 1 1 2
2 3
3 1
4 2
2 1 1 5
2 6
3 6
4 3
Name: level, dtype: int64
Then to get a list per user_id:
out_lst = (df
           .loc[df['cumulative_amount'].ge(40)]
           .groupby(['user_id', 'profile', 'item'])
           ['level'].first()
           .groupby(level='user_id').agg(list).to_list()
           )
output: [[2, 3, 1, 2], [5, 6, 6, 3]]
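For completeness, here is a self-contained sketch that reproduces the result end to end. The frame is transcribed from the table in the question, cumulative_amount is rebuilt with a per-group cumsum (it matches the values shown), and the threshold is kept in a variable:
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    'profile': [1] * 16,
    'item':    [1, 1, 2, 2, 2, 3, 4, 4, 1, 1, 2, 2, 2, 3, 4, 4],
    'level':   [1, 2, 1, 2, 3, 1, 1, 2, 1, 5, 1, 2, 6, 6, 1, 3],
    'amount':  [10, 30, 10, 10, 20, 40, 20, 20, 10, 30, 10, 10, 20, 40, 20, 20],
})
# rebuild the running total per (user_id, profile, item)
df['cumulative_amount'] = df.groupby(['user_id', 'profile', 'item'])['amount'].cumsum()

threshold = 40
out_lst = (df.loc[df['cumulative_amount'].ge(threshold)]
             .groupby(['user_id', 'profile', 'item'])['level'].first()
             .groupby(level='user_id').agg(list)
             .to_list())
print(out_lst)  # [[2, 3, 1, 2], [5, 6, 6, 3]]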

create column based on column values - merge integers

I would like to create a new column "Group". The integer values from column "Step_ID" should be converted into 1 and 2: the first two distinct values should be converted to 1, the next two to 2, the next two to 1, and so on. See the expected output below.
import pandas as pd
data = {'Step_ID': [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 11]}
df1 = pd.DataFrame(data)
You can try:
m = (df1.Step_ID % 2) + df1.Step_ID
df1['new_group'] = (m.ne(m.shift()).cumsum() % 2).replace(0, 2)
OUTPUT:
Step_ID new_group
0 1 1
1 1 1
2 2 1
3 2 1
4 3 2
5 4 2
6 5 1
7 6 1
8 6 1
9 7 2
10 8 2
11 8 2
12 9 1
13 10 1
14 11 2
15 11 2
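Since the mapping only depends on which pair of Step_ID values a row falls into, a purely arithmetic sketch gives the same result (it assumes, as in the example, that pairs (1, 2), (3, 4), (5, 6), ... should map alternately to 1 and 2):
# pair index: 0 for Step_ID 1-2, 1 for 3-4, 2 for 5-6, ...
pair = (df1['Step_ID'] - 1) // 2
# even pairs map to 1, odd pairs to 2
df1['new_group'] = pair % 2 + 1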

Grouping and delete whole group with condition

In a pandas dataframe, I first want to group the data by the 'batch_id' column and then check the 'result' column: if all values in a group are 'negative', delete that group.
I am using the following code:
df.groupby('batch_id').filter(lambda g: (g.result != 'negative').all())
Create DF:
df = pd.DataFrame({'batch_id': [5, 1, 2, 3, 4, 1, 2, 1, 4, 3],
                   'result': ['good', 'negative', '2,000', 'negative', '66',
                              'negative', 'negative', 'negative', '22', 'clean']})
batch_id result
0 5 good
1 1 negative
2 2 2,000
3 3 negative
4 4 66
5 1 negative
6 2 negative
7 1 negative
8 4 22
9 3 clean
df.groupby('batch_id').filter(lambda g: ~ (g.result == 'negative').all())
Output:
batch_id result
0 5 good
2 2 2,000
3 3 negative
4 4 66
6 2 negative
8 4 22
9 3 clean
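An equivalent formulation builds a boolean row mask with a grouped transform instead of filter, which avoids calling a Python lambda per group; a sketch over the same df:
# True for every row of a batch that has at least one non-'negative' result
keep = df['result'].ne('negative').groupby(df['batch_id']).transform('any')
df[keep]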

Pandas Dataframe get trend in column

I have a dataframe:
np.random.seed(1)
df1 = pd.DataFrame({'day':[3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6],
'item': [1, 1, 2, 2, 1, 2, 3, 3, 4, 3, 4],
'price':np.random.randint(1,30,11)})
day item price
0 3 1 6
1 4 1 12
2 4 2 13
3 4 2 9
4 5 1 10
5 5 2 12
6 5 3 6
7 5 3 16
8 5 4 1
9 6 3 17
10 6 4 2
After the groupby code gb = df1.groupby(['day','item'])['price'].mean(), I get:
gb
day item
3 1 6
4 1 12
2 11
5 1 10
2 12
3 11
4 1
6 3 17
4 2
Name: price, dtype: int64
I want to get the trend from the groupby series and put it back into the dataframe's price column. The trend is the variation of the item's price with respect to the previous day's mean price for that item:
day item price
0 3 1 nan
1 4 1 6
2 4 2 nan
3 4 2 nan
4 5 1 -2
5 5 2 1
6 5 3 nan
7 5 3 nan
8 5 4 nan
9 6 3 6
10 6 4 1
Please help me code the last step. A one- or two-line solution would be most helpful; as the actual dataframe is huge, I would like to avoid iteration.
Hope this helps!
# get the average price per (day, item)
mean_df = df1.groupby(['day', 'item'])['price'].mean().reset_index()
# rename columns
mean_df.columns = ['day', 'item', 'average_price']
# sort by day and item in ascending order
mean_df = mean_df.sort_values(by=['day', 'item'])
# shift the average price within each item to get the previous day's average
mean_df['shifted_average_price'] = mean_df.groupby(['item'])['average_price'].shift(1)
# combine with the original df
df1 = pd.merge(df1, mean_df, on=['day', 'item'])
# replace the price by its difference from the previous day's average
df1['price'] = df1['price'] - df1['shifted_average_price']
# drop the helper columns
df1.drop(['average_price', 'shifted_average_price'], axis=1, inplace=True)
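A slightly more compact variant of the same idea, starting again from the original df1: shift the per-item daily means and merge them back (prev_day_mean is just a temporary column name used here):
# previous day's mean price per item (the outer groupby sorts by day and item)
prev = (df1.groupby(['day', 'item'])['price'].mean()
           .groupby(level='item').shift()
           .rename('prev_day_mean')
           .reset_index())
df1 = df1.merge(prev, on=['day', 'item'], how='left')
df1['price'] = df1['price'] - df1.pop('prev_day_mean')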

Conditional filter of entire group for DataFrameGroupBy

If I have the following data
>>> data = pd.DataFrame({'day': [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
'hour':[4, 5, 6, 7, 4, 5, 6, 7, 4, 7]})
>>> data
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
8 3 4
9 4 7
And I would like to keep only the days where hour has 4 unique values, so I thought to do something like this:
>>> data.groupby('day').apply(lambda x: x[x['hour'].nunique() == 4])
But this returns KeyError: True
I am hoping to get this
>>> data
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
Here day == 3 and day == 4 have been filtered out because, when grouped by day, they don't have 4 unique values of hour. I'm doing this at scale, so hard-coding a filter like (day == 3) | (day == 4) is not an option. I think grouping would be a good way to do it, but I can't get it to work. Does anyone have experience with applying functions to a DataFrameGroupBy?
I think you actually need to filter the data (your apply fails because x['hour'].nunique() == 4 evaluates to a single boolean, so x[True] is treated as a column lookup, hence the KeyError: True):
>>> data.groupby('day').filter(lambda x: x['hour'].nunique() == 4)
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
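As an alternative to filter, the same rows can be selected with a boolean mask built from a grouped transform, which sidesteps the per-group Python lambda; a minimal sketch:
# keep rows belonging to days that have exactly 4 unique hours
data[data.groupby('day')['hour'].transform('nunique').eq(4)]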