Pandas: return True if a condition is True in any of the previous n rows

example df:
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9],
                             [1, 2, 3], [4, 5, 6], [7, 8, 9],
                             [1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 1 2 3
4 4 5 6
5 7 8 9
6 1 2 3
7 4 5 6
8 7 8 9
The goal is to get a new column, 'd', that is True when a certain condition is true anywhere within a rolling window of size n.
For example, desired column 'd' for condition "column c == 2 within rolling window of 2":
a b c d
0 1 2 3 nan
1 4 5 6 True
2 7 8 9 False
3 1 2 3 True
4 4 5 6 True
5 7 8 9 False
6 1 2 3 True
7 4 5 6 True
8 7 8 9 False
I hope my question is clear; thank you for taking the time.
To be clear, I am trying to return True if ANY of the rows in the rolling window return True.

I imagine you meant column b:
s = df2['b'].eq(2).rolling(2).max()
df2['d'] = s.astype(bool).mask(s.isna())
NB: you need to use max, as rolling only works with numeric data.
Output:
a b c d
0 1 2 3 NaN
1 4 5 6 True
2 7 8 9 False
3 1 2 3 True
4 4 5 6 True
5 7 8 9 False
6 1 2 3 True
7 4 5 6 True
8 7 8 9 False
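If the same check is needed for other conditions or window sizes, the max trick generalizes; here is a small helper as a sketch (rolling_any is a hypothetical name, not a pandas function):
def rolling_any(cond, window):
    # True where `cond` was True anywhere in the trailing window of `window` rows;
    # max() is used because rolling() only aggregates numeric data, and the first
    # window-1 rows stay NaN, matching the output above
    s = cond.astype(float).rolling(window).max()
    return s.astype(bool).mask(s.isna())

# e.g. df2['d'] = rolling_any(df2['b'].eq(2), 2)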


groupby and count on multiple columns of dataframe

I have a df:
df = pd.DataFrame([[1, 1, 'A', 10],
                   [4, 1, 'A', 0],
                   [7, 2, 'A', 3],
                   [2, 2, 'A', 4],
                   [6, 2, 'B', 9],
                   [5, 2, 'B', 7],
                   [5, 1, 'B', 12],
                   [5, 1, 'B', 4],
                   [5, 2, 'C', 9],
                   [5, 1, 'C', 3],
                   [5, 1, 'C', 4],
                   [5, 2, 'C', 7]],
                  index=['A'] * 12,
                  columns=['A', 'B', 'C', 'D'])
I can count the number of non-zero values in column D grouped by column A using:
df['countTrans'] = df['D'].ne(0).groupby(df['A']).transform('sum')
where the output is:
df:
A B C D countTrans
A 1 1 A 10 1.0
A 4 1 A 0 0.0
A 7 2 A 3 1.0
A 2 2 A 4 1.0
A 6 2 B 9 1.0
A 5 2 B 7 7.0
A 5 1 B 12 7.0
A 5 1 B 4 7.0
A 5 2 C 9 7.0
A 5 1 C 3 7.0
A 5 1 C 4 7.0
A 5 2 C 7 7.0
However, I would like to group not only by column A but also by column B.
I have tried variants of:
df['countTrans'] = df['D'].ne(0).groupby(df['A'], df['B']).transform('sum')
df['countTrans'] = df['D'].ne(0).groupby(df['A','B']).transform('sum')
without success.
My desired output would look like:
df:
A B C D countTrans
A 1 1 A 10 1.0
A 4 1 A 0 0.0
A 7 2 A 3 1.0
A 2 2 A 4 1.0
A 6 2 B 9 1.0
A 5 2 B 7 3.0
A 5 1 B 12 4.0
A 5 1 B 4 4.0
A 5 2 C 9 3.0
A 5 1 C 3 4.0
A 5 1 C 4 4.0
A 5 2 C 7 3.0
A possible solution is to pass the Series in a list:
df['countTrans'] = df['D'].ne(0).groupby([df['A'], df['B']]).transform('sum')
print (df)
A B C D countTrans
A 1 1 A 10 1
A 4 1 A 0 0
A 7 2 A 3 1
A 2 2 A 4 1
A 6 2 B 9 1
A 5 2 B 7 3
A 5 1 B 12 4
A 5 1 B 4 4
A 5 2 C 9 3
A 5 1 C 3 4
A 5 1 C 4 4
A 5 2 C 7 3
Or create a helper column with DataFrame.assign (cleaner, in my opinion):
df['countTrans'] = df.assign(E = df['D'].ne(0)).groupby(['A','B'])['E'].transform('sum')
# similar solution, overwriting D instead:
# df['countTrans'] = df.assign(D = df['D'].ne(0)).groupby(['A','B'])['D'].transform('sum')
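Both spellings give the same counts; as a quick sanity check (a sketch reusing the df defined above):
via_series = df['D'].ne(0).groupby([df['A'], df['B']]).transform('sum')
via_assign = df.assign(E=df['D'].ne(0)).groupby(['A', 'B'])['E'].transform('sum')
assert (via_series.to_numpy() == via_assign.to_numpy()).all()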

Concatenate all combinations of sub-level columns in a pandas DataFrame

Given the following DataFrame:
cols = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']])
example = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]], columns=cols)
example
A B
a b a b
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
I would like to end up with the following one:
A B
0 0 2
1 4 6
2 8 10
3 0 3
4 4 7
5 8 11
6 1 2
7 5 6
8 9 10
9 1 3
10 5 7
11 9 11
I used this code:
concatenated = pd.DataFrame([])
for A_sub_col in ('a', 'b'):
    for B_sub_col in ('a', 'b'):
        new_frame = example[[('A', A_sub_col), ('B', B_sub_col)]]
        new_frame.columns = ['A', 'B']
        concatenated = pd.concat([concatenated, new_frame])
However, I strongly suspect that there is a more straightforward, idiomatic way to do this with pandas. How would one go about it?
Here's an option using list comprehension:
pd.concat([
    example[[('A', i), ('B', j)]].droplevel(level=1, axis=1)
    for i in example['A'].columns
    for j in example['B'].columns
]).reset_index(drop=True)
Output:
A B
0 0 2
1 4 6
2 8 10
3 0 3
4 4 7
5 8 11
6 1 2
7 5 6
8 9 10
9 1 3
10 5 7
11 9 11
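Essentially the same selection can be spelled with itertools.product, which may read a little more clearly if there are more sub-columns (a variant sketch, not part of the original answer):
from itertools import product

pd.concat([
    example[[('A', i), ('B', j)]].droplevel(level=1, axis=1)
    for i, j in product(example['A'].columns, example['B'].columns)
]).reset_index(drop=True)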
Here is another way. I'm not sure it is more pythonic, and it is definitely less readable :-) but on the other hand it does not use explicit loops:
(example
 .apply(lambda c: [list(c)])   # collapse each column into a single list-valued cell
 .stack(level=1)               # move the 'a'/'b' level into the row index
 .apply(lambda c: [list(c)])   # collapse again: A and B each hold a list of lists
 .explode('A')                 # one row per 'A' sub-column
 .explode('B')                 # crossed with each 'B' sub-column
 .apply(pd.Series.explode)     # expand the remaining per-cell lists into rows
 .reset_index(drop=True)
)
To understand what's going on, it helps to run this one step at a time, but the end result is:
A B
0 0 2
1 4 6
2 8 10
3 0 3
4 4 7
5 8 11
6 1 2
7 5 6
8 9 10
9 1 3
10 5 7
11 9 11

Pandas DataFrame: get trend in column

I have a dataframe:
np.random.seed(1)
df1 = pd.DataFrame({'day': [3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6],
                    'item': [1, 1, 2, 2, 1, 2, 3, 3, 4, 3, 4],
                    'price': np.random.randint(1, 30, 11)})
day item price
0 3 1 6
1 4 1 12
2 4 2 13
3 4 2 9
4 5 1 10
5 5 2 12
6 5 3 6
7 5 3 16
8 5 4 1
9 6 3 17
10 6 4 2
After the groupby code gb = df1.groupby(['day','item'])['price'].mean(), I get:
gb
day item
3 1 6
4 1 12
2 11
5 1 10
2 12
3 11
4 1
6 3 17
4 2
Name: price, dtype: int64
I want to take the trend from the groupby series and put it back into the dataframe's price column. The new price is the variation of the item's price with respect to the previous day's mean price:
day item price
0 3 1 nan
1 4 1 6
2 4 2 nan
3 4 2 nan
4 5 1 -2
5 5 2 1
6 5 3 nan
7 5 3 nan
8 5 4 nan
9 6 3 6
10 6 4 1
Please help me code the last step. A one- or two-line solution would be most helpful. As the actual dataframe is huge, I would like to avoid iteration.
Hope this helps!
# get the average values
mean_df = df1.groupby(['day', 'item'])['price'].mean().reset_index()
# rename columns
mean_df.columns = ['day', 'item', 'average_price']
# sort by day and item in ascending order
mean_df = mean_df.sort_values(by=['day', 'item'])
# within each item, shift the average price so each row holds the previous day's mean
mean_df['shifted_average_price'] = mean_df.groupby(['item'])['average_price'].shift(1)
# combine with the original df
df1 = pd.merge(df1, mean_df, on=['day', 'item'])
# replace the price by its difference from the previous day's average
df1['price'] = df1['price'] - df1['shifted_average_price']
# drop unwanted columns
df1.drop(['average_price', 'shifted_average_price'], axis=1, inplace=True)
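The same idea can be written a little more compactly by shifting the per-(day, item) means within each item and joining them back on ['day', 'item'] (a sketch of an equivalent starting from the original df1, not the answer's exact code):
# per-(day, item) mean prices, shifted to the previous day within each item
mean_price = df1.groupby(['day', 'item'])['price'].mean()
prev = mean_price.groupby(level='item').shift(1).rename('prev_day_mean')

# align back onto the original rows and take the difference
df1 = df1.join(prev, on=['day', 'item'])
df1['price'] = df1['price'] - df1.pop('prev_day_mean')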

Conditional filter of entire group for DataFrameGroupBy

If I have the following data
>>> data = pd.DataFrame({'day': [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
...                      'hour': [4, 5, 6, 7, 4, 5, 6, 7, 4, 7]})
>>> data
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
8 3 4
9 4 7
I would like to keep only days where hour has 4 unique values, so I would think to do something like this:
>>> data.groupby('day').apply(lambda x: x[x['hour'].nunique() == 4])
But this returns KeyError: True
I am hoping to get this
>>> data
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
Here the rows with day == 3 and day == 4 have been filtered out because, when grouped by day, they don't have 4 unique values of hour. I'm doing this at scale, so simply hard-coding a filter for those particular days is not an option. I think grouping would be a good way to do this, but I can't get it to work. Does anyone have experience with applying functions to a DataFrameGroupBy?
I think you actually need to filter the data:
>>> data.groupby('day').filter(lambda x: x['hour'].nunique() == 4)
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
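If the frame is very large, a transformed nunique count used as a boolean mask is an alternative spelling that avoids calling a Python lambda per group (a sketch, equivalent on recent pandas versions):
# count distinct hours per day, broadcast back to each row, then mask
mask = data.groupby('day')['hour'].transform('nunique') == 4
data[mask]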

Merging dataframes in pandas

I am new to pandas and I am facing the following problem:
I have 2 data frames:
df1 :
x y
1 3 4
2 nan
3 6
4 nan
5 9 2
6 1 4 9
df2:
x y
1 2 3 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 5 3 7
The two dataframes have the same size.
I want to merge the two dataframes such that the resulting dataframe is the following:
result :
x y
1 3 4 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 5 6 7
So in the result, priority is given to df2. If there is a value in df2, it is put first and the remaining values are taken from df1 (they keep the same position as in df1). There should be no repeated values in the result (i.e. if a value is in position 1 in df1 and position 3 in df2, then that value should appear only in position 1 in the result and not repeat).
Any kind of help will be appreciated.
Thanks!
IIUC
Setup
df1 = pd.DataFrame(dict(x=range(1, 7),
                        y=[[3, 4], None, [6], None, [9, 2], [1, 4, 9]]))
df2 = pd.DataFrame(dict(x=range(1, 7),
                        y=[[2, 3, 6, 1, 5], [4, 1, 8, 7, 5],
                           [6, 3, 1, 4, 5], [2, 1, 3, 5, 4],
                           [9, 2, 3, 8, 7], [1, 4, 5, 3, 7]]))
print(df1)
print()
print(df2)
x y
0 1 [3, 4]
1 2 None
2 3 [6]
3 4 None
4 5 [9, 2]
5 6 [1, 4, 9]
x y
0 1 [2, 3, 6, 1, 5]
1 2 [4, 1, 8, 7, 5]
2 3 [6, 3, 1, 4, 5]
3 4 [2, 1, 3, 5, 4]
4 5 [9, 2, 3, 8, 7]
5 6 [1, 4, 5, 3, 7]
convert to something more usable:
df1_ = df1.set_index('x').y.apply(pd.Series)
df2_ = df2.set_index('x').y.apply(pd.Series)
print(df1_)
print()
print(df2_)
0 1 2
x
1 3.0 4.0 NaN
2 NaN NaN NaN
3 6.0 NaN NaN
4 NaN NaN NaN
5 9.0 2.0 NaN
6 1.0 4.0 9.0
0 1 2 3 4
x
1 2 3 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 5 3 7
Combine with priority given to df1 (I think you meant df1, as that is what is consistent with my interpretation of your question and the expected output you provided), then reduce to eliminate duplicates:
print(df1_.combine_first(df2_).apply(lambda x: x.unique(), axis=1))
0 1 2 3 4
x
1 3 4 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 9 3 7
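If the result needs to go back to the original one-list-per-row layout, the combined rows can be packed into lists (a sketch reusing df1_ and df2_ from above; pd.unique keeps first-appearance order):
combined = df1_.combine_first(df2_)
result = pd.DataFrame({'x': combined.index,
                       'y': [list(pd.unique(row.dropna().astype(int)))
                             for _, row in combined.iterrows()]})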