Conditional filter of entire group for DataFrameGroupBy - pandas

If I have the following data
>>> data = pd.DataFrame({'day': [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
...                      'hour': [4, 5, 6, 7, 4, 5, 6, 7, 4, 7]})
>>> data
   day  hour
0    1     4
1    1     5
2    1     6
3    1     7
4    2     4
5    2     5
6    2     6
7    2     7
8    3     4
9    4     7
And I would like to keep only the days where hour has 4 unique values, so I thought to do something like this:
>>> data.groupby('day').apply(lambda x: x[x['hour'].nunique() == 4])
But this raises KeyError: True.
I am hoping to get this
>>> data
   day  hour
0    1     4
1    1     5
2    1     6
3    1     7
4    2     4
5    2     5
6    2     6
7    2     7
Here the rows with day == 3 and day == 4 have been filtered out because, when grouped by day, those days don't have 4 unique values of hour. I'm doing this at scale, so simply filtering out day == 3 and day == 4 by hand is not an option. I think grouping would be a good way to do this but I can't get it to work. Does anyone have experience applying functions to a DataFrameGroupBy?

I think you actually need GroupBy.filter, which keeps or drops whole groups. (Your apply version fails because x['hour'].nunique() == 4 evaluates to a single boolean, so x[True] is treated as a column lookup, hence the KeyError: True.)
>>> data.groupby('day').filter(lambda x: x['hour'].nunique() == 4)
   day  hour
0    1     4
1    1     5
2    1     6
3    1     7
4    2     4
5    2     5
6    2     6
7    2     7
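An equivalent sketch that can be faster on large frames, since it avoids calling a Python lambda once per group, builds a boolean mask with transform:

```python
import pandas as pd

data = pd.DataFrame({'day': [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
                     'hour': [4, 5, 6, 7, 4, 5, 6, 7, 4, 7]})

# For every row, compute the number of unique hours within its day-group
mask = data.groupby('day')['hour'].transform('nunique') == 4
result = data[mask]
```

transform broadcasts the per-group count back to every row, so the mask lines up with the original index.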

Related

create column based on column values - merge integers

I would like to create a new column "Group". The integer values from the "Step_ID" column should be converted to 1s and 2s: the first pair of values should map to 1, the next pair to 2, the next pair to 1, and so on.
import pandas as pd
data = {'Step_ID': [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 11]}
df1 = pd.DataFrame(data)
You can try (note the sample frame is named df1):
m = (df1.Step_ID % 2) + df1.Step_ID
df1['new_group'] = (m.ne(m.shift()).cumsum() % 2).replace(0, 2)
Output:
    Step_ID  new_group
0         1          1
1         1          1
2         2          1
3         2          1
4         3          2
5         4          2
6         5          1
7         6          1
8         6          1
9         7          2
10        8          2
11        8          2
12        9          1
13       10          1
14       11          2
15       11          2
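An alternative sketch that produces the same labels arithmetically, assuming (as in the sample) each pair of consecutive integers 1-2, 3-4, 5-6, ... forms one block:

```python
import pandas as pd

df1 = pd.DataFrame({'Step_ID': [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 11]})

# (Step_ID - 1) // 2 numbers the pair-blocks 0, 1, 2, ...;
# taking that modulo 2 and adding 1 alternates the labels 1, 2, 1, 2, ...
df1['new_group'] = ((df1['Step_ID'] - 1) // 2) % 2 + 1
```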

Concatenate all combinations of sub-level columns in a pandas DataFrame

Given the following DataFrame:
cols = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']])
example = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]], columns=cols)
example
   A     B
   a  b  a   b
0  0  1  2   3
1  4  5  6   7
2  8  9  10  11
I would like to end up with the following one:
    A   B
0   0   2
1   4   6
2   8  10
3   0   3
4   4   7
5   8  11
6   1   2
7   5   6
8   9  10
9   1   3
10  5   7
11  9  11
I used this code:
concatenated = pd.DataFrame([])
for A_sub_col in ('a', 'b'):
    for B_sub_col in ('a', 'b'):
        new_frame = example[[('A', A_sub_col), ('B', B_sub_col)]]
        new_frame.columns = ['A', 'B']
        concatenated = pd.concat([concatenated, new_frame])
However, I strongly suspect that there is a more straight-forward, idiomatic way to do that with Pandas. How would one go about it?
Here's an option using list comprehension:
pd.concat([
    example[[('A', i), ('B', j)]].droplevel(level=1, axis=1)
    for i in example['A'].columns
    for j in example['B'].columns
]).reset_index(drop=True)
Output:
A B
0 0 2
1 4 6
2 8 10
3 0 3
4 4 7
5 8 11
6 1 2
7 5 6
8 9 10
9 1 3
10 5 7
11 9 11
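If there are more than two top-level groups, the comprehension above can be generalized, as a sketch, with itertools.product over the sub-columns of each top-level group:

```python
import itertools

import pandas as pd

cols = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']])
example = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]], columns=cols)

top = example.columns.get_level_values(0).unique()
# Every combination picking one sub-column per top-level group
combos = itertools.product(*(example[t].columns for t in top))
result = pd.concat(
    [example[[(t, s) for t, s in zip(top, c)]].droplevel(1, axis=1) for c in combos]
).reset_index(drop=True)
```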
Here is one way. Not sure how more pythonic this is. It is definitely less readable :-) but on the other hand does not use explicit loops:
(example
.apply(lambda c: [list(c)])
.stack(level=1)
.apply(lambda c:[list(c)])
.explode('A')
.explode('B')
.apply(pd.Series.explode)
.reset_index(drop = True)
)
To understand what's going on, it helps to run this one step at a time, but the end result is
    A   B
0   0   2
1   4   6
2   8  10
3   0   3
4   4   7
5   8  11
6   1   2
7   5   6
8   9  10
9   1   3
10  5   7
11  9  11

Passing Tuple to a function via apply

I am trying to run the function below, which takes two points:
pointA = (2, 3)
pointB = (4, 5)

def Somefunc(pointA, pointB):
    x = pointA[0] + pointB[1]
    return x
Now, when I try to create a separate column based on this function, it throws errors like cannot convert the series to <class 'float'>, so I tried this
df['T']=df.apply(Somefunc((df['A'].apply(lambda x: float(x)),df['B'].apply(lambda x: float(x))),\
(df['C'].apply(lambda x: float(x)),df['D'].apply(lambda x: float(x)))),axis=0))
Sample dataframe below:
A  B  C  D
1  2  3  5
2  4  7  8
4  7  9  0
Any help will be appreciated.
This is the best guess I can make as to what you're trying to do:
df['T']=df.apply(lambda row: [(row['A'],row['B']),(row['C'],row['D'])],axis=1)
Edit: to apply your function:
df['T'] = df.apply(lambda row: Somefunc((row['A'], row['B']), (row['C'], row['D'])), axis=1)
that being said, the same result can be achieved much quicker and idiomatically like so:
>>> df
   A  B  C  D
0  2  7  3  3
1  3  1  5  7
2  2  0  6  2
3  3  9  5  9
4  0  2  3  7
>>> df['T'] = df.apply(tuple, axis=1)
>>> df
   A  B  C  D             T
0  2  7  3  3  (2, 7, 3, 3)
1  3  1  5  7  (3, 1, 5, 7)
2  2  0  6  2  (2, 0, 6, 2)
3  3  9  5  9  (3, 9, 5, 9)
4  0  2  3  7  (0, 2, 3, 7)
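For large frames, a sketch that builds the tuples with zip avoids the per-row apply overhead entirely (using a small hand-made sample for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 4], 'B': [2, 4, 7], 'C': [3, 7, 9], 'D': [5, 8, 0]})

# zip iterates over the columns directly, producing one tuple per row
df['T'] = list(zip(df['A'], df['B'], df['C'], df['D']))
```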

Grouping and delete whole group with condition

In a pandas dataframe, I first want to group the data by the 'batch_id' column and then check the 'result' column: if all values in a group are 'negative', delete that group.
I am using the following code:
df.groupby('batch_id').filter(lambda g: (g.result != 'negative').all())
Create DF:
df = pd.DataFrame({'batch_id': [5, 1, 2, 3, 4, 1, 2, 1, 4, 3],
                   'result': ['good', 'negative', '2,000', 'negative', '66',
                              'negative', 'negative', 'negative', '22', 'clean']})
   batch_id    result
0         5      good
1         1  negative
2         2     2,000
3         3  negative
4         4        66
5         1  negative
6         2  negative
7         1  negative
8         4        22
9         3     clean
Your filter keeps only the groups containing no 'negative' at all; to delete just the groups where every result is 'negative', negate the all():
df.groupby('batch_id').filter(lambda g: ~(g.result == 'negative').all())
Output:
   batch_id    result
0         5      good
2         2     2,000
3         3  negative
4         4        66
6         2  negative
8         4        22
9         3     clean
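The same filter can also be expressed without GroupBy.filter, as a sketch using a transformed boolean mask, which tends to scale better when there are many small groups:

```python
import pandas as pd

df = pd.DataFrame({'batch_id': [5, 1, 2, 3, 4, 1, 2, 1, 4, 3],
                   'result': ['good', 'negative', '2,000', 'negative', '66',
                              'negative', 'negative', 'negative', '22', 'clean']})

# Keep a group if at least one of its results is not 'negative'
keep = df['result'].ne('negative').groupby(df['batch_id']).transform('any')
out = df[keep]
```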

Pandas Dataframe get trend in column

I have a dataframe:
np.random.seed(1)
df1 = pd.DataFrame({'day': [3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6],
                    'item': [1, 1, 2, 2, 1, 2, 3, 3, 4, 3, 4],
                    'price': np.random.randint(1, 30, 11)})
    day  item  price
0     3     1      6
1     4     1     12
2     4     2     13
3     4     2      9
4     5     1     10
5     5     2     12
6     5     3      6
7     5     3     16
8     5     4      1
9     6     3     17
10    6     4      2
After the groupby code gb = df1.groupby(['day','item'])['price'].mean(), I get:
gb
day  item
3    1        6.0
4    1       12.0
     2       11.0
5    1       10.0
     2       12.0
     3       11.0
     4        1.0
6    3       17.0
     4        2.0
Name: price, dtype: float64
I want to map the trend from the groupby series back into the dataframe's price column. The trend is the variation of each item's price with respect to the previous day's mean price for that item:
    day  item  price
0     3     1    NaN
1     4     1    6.0
2     4     2    NaN
3     4     2    NaN
4     5     1   -2.0
5     5     2    1.0
6     5     3    NaN
7     5     3    NaN
8     5     4    NaN
9     6     3    6.0
10    6     4    1.0
Please help me code the last step; a one- or two-line solution would be most helpful. As the actual dataframe is huge, I would like to avoid iteration.
Hope this helps!
# get the mean price per (day, item)
mean_df = df1.groupby(['day', 'item'])['price'].mean().reset_index()
# rename columns
mean_df.columns = ['day', 'item', 'average_price']
# sort by day and item in ascending order
mean_df = mean_df.sort_values(by=['day', 'item'])
# within each item, shift the average price down to the next day it appears
mean_df['shifted_average_price'] = mean_df.groupby(['item'])['average_price'].shift(1)
# combine with the original df
df1 = pd.merge(df1, mean_df, on=['day', 'item'])
# replace the price by its difference from the previous day's average
df1['price'] = df1['price'] - df1['shifted_average_price']
# drop the helper columns
df1.drop(['average_price', 'shifted_average_price'], axis=1, inplace=True)
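The same result can be sketched more compactly by shifting the per-item means and joining them back onto the original rows (the printed prices from the question are hardcoded here so the example is reproducible):

```python
import pandas as pd

df1 = pd.DataFrame({'day': [3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6],
                    'item': [1, 1, 2, 2, 1, 2, 3, 3, 4, 3, 4],
                    'price': [6, 12, 13, 9, 10, 12, 6, 16, 1, 17, 2]})

# Mean price per (day, item), then each item's mean shifted down to the next day
prev = df1.groupby(['day', 'item'])['price'].mean().groupby('item').shift().rename('prev')

# Align the shifted means back onto the original rows and take the difference
df1 = df1.join(prev, on=['day', 'item'])
df1['price'] = df1['price'] - df1.pop('prev')
```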