Making pandas.plot legend and stacks appear in the same specified order

I have a stacked bar plot whose legend and stacks I want displayed in the same specified order.
To sort the legend classes, I prefixed 1. and 2. to True and False (the classes I want to sort). Not the most elegant approach, but it works for the legend. The problem is that it doesn't sort the stacks.
Example data:
d = {'sel_date': ['2020-01', '2020-01', '2020-01', '2021-02', '2021-03',
                  '2020-01', '2020-01', '2020-01', '2021-02', '2021-03'],
     'id': list('yyzzz' * 2),
     'is_new': ['1. True', '1. True', '2. False', '1. True', '2. False',
                '1. True', '1. True', '2. False', '2. False', '2. False'],
     'Short Name': list('ababa' * 2)}
df = pd.DataFrame(data=d)
df
sel_date id is_new Short Name
0 2020-01 y 1. True a
1 2020-01 y 1. True b
2 2020-01 z 2. False a
3 2021-02 z 1. True b
4 2021-03 z 2. False a
5 2020-01 y 1. True a
6 2020-01 y 1. True b
7 2020-01 z 2. False a
8 2021-02 z 2. False b
9 2021-03 z 2. False a
I plot it with this function:
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd
import seaborn as sns

def plot_stacked_barplot_example(feature, suptitle=None, df=None, groupby_column='Short Name'):
    sns.set_theme(style='white', font_scale=1.4)
    fig, ax = plt.subplots(figsize=(30, 20))
    for i, (group, data) in enumerate(df.groupby(groupby_column)):
        ax = plt.subplot(2, 2, i + 1)
        pivot = (data.groupby('sel_date')[feature].value_counts(normalize=True)
                     .mul(100).unstack(feature)
                     .plot(kind='bar', stacked=True, ax=ax))
        ax.yaxis.set_major_formatter(mtick.PercentFormatter())
        ax.xaxis.set_ticks([])  # hide labels on the x ticks
        ax.get_legend().remove()
        handles, labels = ax.get_legend_handles_labels()
        # sort both labels and handles by labels
        labels, handles = zip(*sorted(zip(labels, handles), key=lambda t: t[0]))
        ax.legend(handles, labels, bbox_to_anchor=(1, 1), loc='upper left')

plot_stacked_barplot_example(df=df, feature='is_new')
I'd like the stacks in the same order as the legend.
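For what it's worth, pandas draws the stacks in the DataFrame's column order (first column at the bottom) and builds the legend in that same order, so sorting the columns of the unstacked pivot before plotting should align stacks and legend at once. A minimal sketch of the idea, meant to replace the pivot lines inside the loop above (my assumption, not part of the original post):

counts = (data.groupby('sel_date')[feature].value_counts(normalize=True)
              .mul(100).unstack(feature))
counts = counts[sorted(counts.columns)]  # column order controls the stack order
counts.plot(kind='bar', stacked=True, ax=ax)

With the columns already sorted, the manual sorting of legend handles becomes unnecessary.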

Related

Why this iteration doesn't correctly change the global DataFrame variables

I have code at my job similar to the example below, and I don't understand why iterating over a nested array doesn't correctly change the global DataFrame variables.
df1 = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': ['a', 'b', 'c', 'd', 'e']
})
df2 = df1
for array in [[df1, 9], [df2, 'z']]:
    array[0]['x'] = array[1]
    array[0]['y'] = array[1]
    print(array[0])
x y
0 9 9
1 9 9
2 9 9
3 9 9
4 9 9
x y
0 z z
1 z z
2 z z
3 z z
4 z z
print(df1)
x y
0 z z
1 z z
2 z z
3 z z
4 z z
print(df2)
x y
0 z z
1 z z
2 z z
3 z z
4 z z
So in the first iteration we see the expected changes: df1 with 9 in both columns and df2 with z in both columns.
But when we then check the global variables, everything is z, even df1, and I don't know why.
In Python, assignment never copies an object; it only binds a name to it. For immutable types such as int and str this distinction rarely matters, but mutable types such as list, dict and pandas.DataFrame can be changed in place through any name bound to them. The example below shows what this means for int and list:
a = 1
b = a
b += 1       # rebinds b to a new int; a is untouched
print(a)
# >> 1

x = [1, 2, 3]
y = x
y.append(4)  # mutates the single list both names refer to
print(x)
# >> [1, 2, 3, 4]
So, when you assigned df2, you bound it to the exact same object that df1 was referring to. That means that when you change df2, you also change the object referred to by df1, because it is physically the same object. You can check this with the built-in id() function:
df1 = pd.DataFrame({'x': [1,2,3,4,5], 'y': ['a', 'b', 'c', 'd', 'e']})
df2 = df1
print(id(df1), id(df2))
# >> 4695746416 4695746416
To get an independent copy of the same dataframe, you need to use copy():
df1 = pd.DataFrame({'x': [1,2,3,4,5], 'y': ['a', 'b', 'c', 'd', 'e']})
df2 = df1.copy()
print(id(df1), id(df2))
# >> 4695749728 4695742816
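Applied to the loop from the question, a minimal sketch of the fix (assuming the goal is two independent frames):

import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                    'y': ['a', 'b', 'c', 'd', 'e']})
df2 = df1.copy()  # an independent copy, not a second name for df1

for frame, value in [(df1, 9), (df2, 'z')]:
    frame['x'] = value
    frame['y'] = value

print(df1['x'].tolist())  # [9, 9, 9, 9, 9]
print(df2['x'].tolist())  # ['z', 'z', 'z', 'z', 'z']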

Subplots with counter-like legends

I have written plot_dataframe() to create two subplots (one for a line chart and another for a histogram bar chart) from a dataframe passed as an argument.
I then call this function from plot_kernels() with multiple dataframes.
def plot_dataframe(df, cnt):
    row = df.iloc[0].astype(int)  # first row of the dataframe
    plt.subplot(2, 1, 1)
    row.plot(legend=cnt)  # line chart
    plt.subplot(2, 1, 2)
    df2 = row.value_counts()
    df2.reindex().plot(kind='bar', legend=cnt)  # histogram

def plot_kernels(my_dict2):
    plt.figure(figsize=(20, 15))
    cnt = 1
    for key in my_dict2:
        df = my_dict2[key]
        plot_dataframe(df, cnt)
        cnt = cnt + 1
    plt.show()
The dictionary looks like this (two dataframes, each with two columns both named Value):
{'K1::foo(bar::z(x,u))':    Value  Value
                         0     10      2
                         1      5      2
                         2     10      2,
 'K3::foo(bar::y(z,u))':    Value  Value
                         0      6     12
                         1      7     13
                         2      8     14}
Based on the values in row 0, [10, 2] is drawn as the blue line and [6, 12] as the orange line; the histograms behave similarly. As you can see, the legends in the subplots show 0 in the figure, but I expect to see 1 and 2. How can I fix that?
Change legend to label, then force the legend after you plot everything:
def plot_dataframe(df, cnt, axes):
    row = df.iloc[0].astype(int)  # first row of the dataframe
    row.plot(label=cnt, ax=axes[0])  # line chart -- use label, not legend
    df2 = row.value_counts()
    df2.plot(kind='bar', ax=axes[1], label=cnt)  # histogram

def plot_kernels(d):
    # I'd create the axes first and pass them to the plot function
    fig, axes = plt.subplots(2, 1, figsize=(20, 15))
    cnt = 1
    for key in d:
        df = d[key]
        plot_dataframe(df, cnt, axes=axes)
        cnt = cnt + 1
    # render the legends
    for ax in axes:
        ax.legend()
    plt.show()
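For completeness, a hypothetical call reconstructing the dictionary shown in the question (the duplicated Value column names are kept exactly as displayed there):

import pandas as pd

my_dict2 = {
    'K1::foo(bar::z(x,u))': pd.DataFrame([[10, 2], [5, 2], [10, 2]],
                                         columns=['Value', 'Value']),
    'K3::foo(bar::y(z,u))': pd.DataFrame([[6, 12], [7, 13], [8, 14]],
                                         columns=['Value', 'Value']),
}
plot_kernels(my_dict2)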
Output: (figure with both subplots, the legends now reading 1 and 2)

In R, I want to pull out the minimum value of z when x and y are duplicates

My data looks like this:
line x y z
3559 1284283 10315402 20995
3560 1284283 10315402 20982
3596 1284341 10315344 20995
3597 1284341 10315344 20982
3633 1284399 10315286 20995
3634 1284399 10315286 20982
Here is what I have so far. I identify duplicate x,y locations in the zone (this creates a logical vector) and add that vector to the dataframe as a fourth column:
z1_repeat_xy <- duplicated(df_Zone1[, 1:2])
df_Zone1$repeat_xy <- z1_repeat_xy  # writes TRUE/FALSE into a new column called repeat_xy
line# x y z repeat_xy
135161 1283668 10314903 19994 FALSE
135164 1283726 10314845 19994 FALSE
135167 1283784 10314787 19994 FALSE
135170 1283842 10314729 19994 FALSE
135171 1283842 10314729 19981 TRUE
135172 1283842 10314729 19968 TRUE
Now I want the minimum z value corresponding to the TRUE values in the fourth column, along with the rows containing FALSE.
I create a new df holding only the TRUE values of the repeat column:
df_repeats <- filter(df_Zone1, repeat_xy == TRUE)
Then I created the test_max and test_min data frames using ave(), which seems to do what I want: it gives the min or max z value for repeated x,y values.
test_max <- df_repeats[df_repeats$z == ave(df_repeats$z, df_repeats$x, FUN = max), ]
test_min <- df_repeats[df_repeats$z == ave(df_repeats$z, df_repeats$x, FUN = min), ]
write.table(test_max, file = "test_max.txt", sep = " ", row.names = FALSE)
write.table(test_min, file = "test_min.txt", sep = " ", row.names = FALSE)
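For what it's worth, a shorter route to the same goal is to take the per-(x, y) minimum directly, which also leaves non-duplicated rows unchanged; a minimal base-R sketch (my assumption about the intended output, not from the original post):

# one row per (x, y) pair, keeping the smallest z in each group
df_min <- aggregate(z ~ x + y, data = df_Zone1, FUN = min)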

geom_violin using the weight aesthetic unexpectedly drops levels

library(tidyverse)
set.seed(12345)

dat <- data.frame(year = c(rep(1990, 100), rep(1991, 100), rep(1992, 100)),
                  fish_length = sample(x = seq(from = 10, 131, by = 0.1), 300, replace = F),
                  nb_caught = sample(x = seq(from = 1, 200, by = 0.1), 300, replace = T),
                  stringsAsFactors = F) %>%
  mutate(age = ifelse(fish_length < 20, 1,
               ifelse(fish_length >= 20 & fish_length < 100, 2,
               ifelse(fish_length >= 100 & fish_length < 130, 3, 4)))) %>%
  arrange(year, fish_length)

head(dat)
head(dat)
year fish_length nb_caught age
1 1990 10.1 45.2 1
2 1990 10.7 170.0 1
3 1990 10.9 62.0 1
4 1990 12.1 136.0 1
5 1990 14.1 80.8 1
6 1990 15.0 188.9 1
dat %>% group_by(year) %>% summarise(ages = n_distinct(age)) # Only 1992 has age 4 fish
# A tibble: 3 x 2
year ages
<dbl> <int>
1 1990 3
2 1991 3
3 1992 4
dat %>% filter(age == 4) # only 1 row for age 4
year fish_length nb_caught age
1 1992 130.8 89.2 4
Here:
year = year of sampling
fish_length = length of the fish in cm
nb_caught = number of fish caught following the use of an age-length key, hence explaining the presence of decimals
age = age of the fish
graph1: geom_violin without the weight aesthetic.
Here, I copy each line of dat according to the value found in nb_caught.
dim(dat) # 300 rows
dat_graph1 <- dat[rep(1:nrow(dat), floor(dat$nb_caught)), ]
dim(dat_graph1) # 30932 rows
dat_graph1$nb_caught <- NULL # useless now
sum(dat$nb_caught) - nrow(dat_graph1) # 128.2 rows lost here
Since I have decimal values of nb_caught, I took the integer value to create dat_graph1. I lost 128.2 "rows" in the process.
Now for the graph:
dat_tile <- data.frame(year = sort(unique(dat$year))[sort(unique(dat$year)) %% 2 == 0])
# for the figure's background

graph1 <- ggplot(data = dat_graph1,
                 aes(x = as.factor(year), y = fish_length, fill = as.factor(age),
                     color = as.factor(age), .drop = F)) +
  geom_tile(data = dat_tile, aes(x = factor(year), y = 1, height = Inf, width = 1),
            fill = "grey80", inherit.aes = F) +
  geom_violin(draw_quantiles = c(0.05, 0.5, 0.95), color = "black",
              scale = "width", position = "dodge") +
  scale_x_discrete(expand = c(0, 0)) +
  labs(x = "Year", y = "Fish length", fill = "Age", color = "Age", title = "graph1") +
  scale_fill_brewer(palette = "Paired", drop = F) +  # drop = F to avoid losing levels
  scale_color_brewer(palette = "Paired", drop = F) + # drop = F to avoid losing levels
  scale_y_continuous(expand = expand_scale(mult = 0.01)) +
  theme_bw()

graph1
Note here that I have a flat bar for age 4 in year 1992.
dat_graph1 %>% filter(year == 1992, age == 4) %>% pull(fish_length) %>% unique
[1] 130.8
That is because I only have one length for that particular year-age combination.
graph2: geom_violin using the weight aesthetic.
Now, instead of copying each row of dat nb_caught times, let's use the weight aesthetic. Let's calculate the weight wt that each line of dat will carry in the density curve of its year-age combination.
dat_graph2 <- dat %>%
  group_by(year, age) %>%
  mutate(wt = nb_caught / sum(nb_caught)) %>%
  as.data.frame()
head(dat_graph2)
year fish_length nb_caught age wt
1 1990 10.1 45.2 1 0.03573123
2 1990 10.7 170.0 1 0.13438735
3 1990 10.9 62.0 1 0.04901186
4 1990 12.1 136.0 1 0.10750988
5 1990 14.1 80.8 1 0.06387352
6 1990 15.0 188.9 1 0.14932806
graph2 <- ggplot(data = dat_graph2,
                 aes(x = as.factor(year), y = fish_length, fill = as.factor(age),
                     color = as.factor(age), .drop = F)) +
  geom_tile(data = dat_tile, aes(x = factor(year), y = 1, height = Inf, width = 1),
            fill = "grey80", inherit.aes = F) +
  geom_violin(aes(weight = wt), draw_quantiles = c(0.05, 0.5, 0.95), color = "black",
              scale = "width", position = "dodge") +
  scale_x_discrete(expand = c(0, 0)) +
  labs(x = "Year", y = "Fish length", fill = "Age", color = "Age", title = "graph2") +
  scale_fill_brewer(palette = "Paired", drop = F) +  # drop = F to avoid losing levels
  scale_color_brewer(palette = "Paired", drop = F) + # drop = F to avoid losing levels
  scale_y_continuous(expand = expand_scale(mult = 0.01)) +
  theme_bw()

graph2
dat_graph2 %>% filter(year == 1992, age == 4)
year fish_length nb_caught age wt
1 1992 130.8 89.2 4 1
Note here that the flat bar for age 4 in year 1992 seen on graph1 has been dropped here even though the line exists in dat_graph2.
My questions
Why is the age-4 level in 1992 dropped when using the weight aesthetic, and how can I overcome this?
Why are the two graphs not visually alike even though they use the same data?
Thanks in advance for your help!
1.
Problem 1 is not related to using the weight aesthetic. You can check this by dropping the weight aesthetic from the code for your second graph. The problem is that the algorithm for computing the density fails when there are too few observations.
That is why group 4 shows up in graph1, which uses the expanded dataset: there you increase the number of observations by replicating rows.
Unfortunately, geom_violin gives no warning in your specific case. However, if you filter dat_graph2 for age == 4, geom_violin gives you the warning:
Warning message:
Computation failed in `stat_ydensity()`:
replacement has 1 row, data has 0
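A minimal sketch of that check, assuming the dat_graph2 built above:

library(dplyr)
library(ggplot2)

# plotting only the single age-4 observation triggers the stat_ydensity warning
dat_graph2 %>%
  filter(age == 4) %>%
  ggplot(aes(x = as.factor(year), y = fish_length)) +
  geom_violin(aes(weight = wt))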
geom_density is much clearer on this issue: it warns that groups with fewer than two observations have been dropped.
Unfortunately, I have no solution to overcome this besides working with the expanded dataset.
2.
Concerning problem 2, I have no convincing answer, except that I guess it is related to the details of the kernel density estimator used by geom_violin, geom_density, etc., and perhaps also to the number of data points.

Pandas - Find and index rows that match row sequence pattern

I would like to find a pattern going down the rows of a categorical variable in a dataframe. I can see how to use Series.shift() to look up and down and boolean logic to find the pattern; however, I want to do this within a grouping variable, and also label all rows that are part of the pattern, not just the starting row.
Code:
import pandas as pd
from numpy.random import choice, randn
import string

# df constructor
n_rows = 1000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
                   'group_var': choice(list(string.ascii_uppercase), n_rows),
                   'row_pat': choice([0, 1, 2, 3], n_rows),
                   'values': randn(n_rows)})

# sorting
df.sort_values(by=['group_var', 'date_time'], inplace=True)
df.head(10)
I can find the start of the pattern (with no grouping though) like this:
# the row ordinal pattern to detect
p0, p1, p2, p3 = 1, 2, 2, 0

# flag the row at the start of the pattern
df['pat_flag'] = \
    df['row_pat'].eq(p0) & \
    df['row_pat'].shift(-1).eq(p1) & \
    df['row_pat'].shift(-2).eq(p2) & \
    df['row_pat'].shift(-3).eq(p3)
df.head(10)
What I can't figure out is how to do this only within the "group_var" groups, and, instead of returning True just for the start of the pattern, return True for all rows that are part of the pattern.
I'd appreciate any tips on how to solve this!
Thanks...
I think you have two options: a simpler but slower solution, or a faster, more complicated one. The simpler way:
use Rolling.apply to test the pattern
replace 0s with NaNs by mask
use bfill with limit=N-1 (same as fillna with method='bfill') to repeat the 1s backwards
then fillna the remaining NaNs with 0
last, cast to bool by astype
import numpy as np

pat = np.asarray([1, 2, 2, 0])
N = len(pat)

df['rm0'] = (df['row_pat'].rolling(window=N, min_periods=N)
                          .apply(lambda x: (x == pat).all())
                          .mask(lambda x: x == 0)
                          .bfill(limit=N - 1)
                          .fillna(0)
                          .astype(bool))
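The question also asks for the check to stay inside each group_var; one way to do that (a sketch of my own, not from the original answer) is to run the same chain per group with groupby.transform:

def flag_pattern(s):
    # the same chain as above, applied to one group's row_pat values
    return (s.rolling(window=N, min_periods=N)
             .apply(lambda x: (x == pat).all())
             .mask(lambda x: x == 0)
             .bfill(limit=N - 1)
             .fillna(0)
             .astype(bool))

df['rm0_grouped'] = df.groupby('group_var')['row_pat'].transform(flag_pattern)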
If performance is important, use strides; the solution from the link was modified to:
use a rolling-window approach
compare with the pattern and return True for matches via all
get the indices of the first occurrences by np.mgrid and boolean indexing
build all pattern indices with a list comprehension
compare with numpy.in1d and create the new column
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x + N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)
Another solution, thanks to @Divakar:
from scipy.ndimage.morphology import binary_dilation

arr = df['row_pat'].values
m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m, np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1] * N, origin=-(N // 2))
Timings:
import numpy as np
import pandas as pd
from numpy.random import choice, randn
from scipy.ndimage.morphology import binary_dilation
import string

np.random.seed(456)

# df constructor
n_rows = 100000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
                   'group_var': choice(list(string.ascii_uppercase), n_rows),
                   'row_pat': choice([0, 1, 2, 3], n_rows),
                   'values': randn(n_rows)})

# sorting
df.sort_values(by=['group_var', 'date_time'], inplace=True)

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

arr = df['row_pat'].values
m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m, np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1] * N, origin=-(N // 2))

arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x + N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)
print (df.iloc[460:480])
date_time group_var row_pat values rm0 rm1 rm2
12045 2019-06-25 21:00:00 A 3 -0.081152 False False False
12094 2019-06-27 22:00:00 A 1 -0.818167 False False False
12125 2019-06-29 05:00:00 A 0 -0.051088 False False False
12143 2019-06-29 23:00:00 A 0 -0.937589 False False False
12145 2019-06-30 01:00:00 A 3 0.298460 False False False
12158 2019-06-30 14:00:00 A 1 0.647161 False False False
12164 2019-06-30 20:00:00 A 3 -0.735538 False False False
12210 2019-07-02 18:00:00 A 1 -0.881740 False False False
12341 2019-07-08 05:00:00 A 3 0.525652 False False False
12343 2019-07-08 07:00:00 A 1 0.311598 False False False
12358 2019-07-08 22:00:00 A 1 -0.710150 True True True
12360 2019-07-09 00:00:00 A 2 -0.752216 True True True
12400 2019-07-10 16:00:00 A 2 -0.205122 True True True
12404 2019-07-10 20:00:00 A 0 1.342591 True True True
12413 2019-07-11 05:00:00 A 1 1.707748 False False False
12506 2019-07-15 02:00:00 A 2 0.319227 False False False
12527 2019-07-15 23:00:00 A 3 2.130917 False False False
12600 2019-07-19 00:00:00 A 1 -1.314070 False False False
12604 2019-07-19 04:00:00 A 0 0.869059 False False False
12613 2019-07-19 13:00:00 A 2 1.342101 False False False
In [225]: %%timeit
...: df['rm0'] = (df['row_pat'].rolling(window=N , min_periods=N)
...: .apply(lambda x: (x==pat).all())
...: .mask(lambda x: x == 0)
...: .bfill(limit=N-1)
...: .fillna(0)
...: .astype(bool)
...: )
...:
1 loop, best of 3: 356 ms per loop
In [226]: %%timeit
...: arr = df['row_pat'].values
...: b = np.all(rolling_window(arr, N) == pat, axis=1)
...: c = np.mgrid[0:len(b)][b]
...: d = [i for x in c for i in range(x, x+N)]
...: df['rm2'] = np.in1d(np.arange(len(arr)), d)
...:
100 loops, best of 3: 7.63 ms per loop
In [227]: %%timeit
...: arr = df['row_pat'].values
...: b = np.all(rolling_window(arr, N) == pat, axis=1)
...:
...: m = (rolling_window(arr, len(pat)) == pat).all(1)
...: m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
...: df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
...:
100 loops, best of 3: 7.25 ms per loop
You could make use of the pd.rolling() methods and then simply compare the arrays that it returns with the array that contains the pattern that you are attempting to match on.
pattern = np.asarray([1.0, 2.0, 2.0, 0.0])
n_obs = len(pattern)

df['rolling_match'] = (df['row_pat']
                       .rolling(window=n_obs, min_periods=n_obs)
                       .apply(lambda x: (x == pattern).all())
                       .astype(bool)             # all as bools
                       .shift(-1 * (n_obs - 1))  # shift back
                       .fillna(False)            # convert NaNs to False
                       )
It is important to specify min_periods here to ensure you only find exact matches (so the equality check won't fail when the shapes are misaligned). The apply function does a pairwise check between the two arrays, and .all() ensures every element matches. We convert to bool, then shift the result so it becomes a 'forward-looking' indicator instead of one that flags only after the fact.
Help on the rolling functionality is available here:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
This works. It works like this:
a) For every group, it takes a window of size 4 and scans down the column until it finds the combination (1, 2, 2, 0) in exact sequence. As soon as it finds the sequence, it populates the corresponding index values of the new column 'pat_flag' with 1.
b) If it doesn't find the combination, it populates the column with 0.
pattern = [1, 2, 2, 0]

def get_pattern(df):
    df = df.reset_index(drop=True)
    df['pat_flag'] = 0
    get_indexes = []
    temp = []
    for index, row in df.iterrows():
        mindex = index + 1
        # get the next 4 values
        for j in range(mindex, mindex + 4):
            if j == df.shape[0]:
                break
            else:
                get_indexes.append(j)
                temp.append(df.loc[j, 'row_pat'])
        # check if the sequence is matched
        if temp == pattern:
            df.loc[get_indexes, 'pat_flag'] = 1
        else:
            # reset if the pattern is not found in the given window
            temp = []
            get_indexes = []
    return df

# apply the function to the groups
df = df.groupby('group_var').apply(get_pattern)
## snippet of output
date_time group_var row_pat values pat_flag
41 2018-03-13 21:00:00 C 3 0.731114 0
42 2018-03-14 05:00:00 C 0 1.350164 0
43 2018-03-14 11:00:00 C 1 -0.429754 1
44 2018-03-14 12:00:00 C 2 1.238879 1
45 2018-03-15 17:00:00 C 2 -0.739192 1
46 2018-03-18 06:00:00 C 0 0.806509 1
47 2018-03-20 06:00:00 C 1 0.065105 0
48 2018-03-20 08:00:00 C 1 0.004336 0
Expanding on Emmet02's answer: use the rolling function for all groups and set the match column to True for all indices of a matching pattern:
pattern = np.asarray([1, 2, 2, 0])

# Create a match column in the main dataframe
# (note: DataFrame.assign has no inplace argument, so re-assign the result)
df = df.assign(match=False)

for group_var, group in df.groupby("group_var"):
    # Per group, do rolling-window matching; the last value
    # of each matching pattern in 'match' will be True
    match = (
        group['row_pat']
        .rolling(window=len(pattern), min_periods=len(pattern))
        .apply(lambda x: (x == pattern).all())
    )
    # Get positional indices of matches in the current group
    idx = np.arange(len(group))[match == True]
    # Include all indices of the matching pattern,
    # counting back from the last index in the pattern
    idx = idx.repeat(len(pattern)) - np.tile(np.arange(len(pattern)), len(idx))
    # Update matches
    match.values[idx] = True
    df.loc[group.index, 'match'] = match

df[df.match == True]
edit: Without a for loop
# Do the rolling matching within the group clause
match = (
    df.groupby("group_var")
    .rolling(len(pattern))
    .row_pat.apply(lambda x: (x == pattern).all())
)
# Convert NaNs
match = (~match.isnull() & match)
# Get positional indices of matches
idx = np.arange(len(df))[match]
# Include all indices of the matching pattern
idx = idx.repeat(len(pattern)) - np.tile(np.arange(len(pattern)), len(idx))
# Mark all indices that are selected by "idx" in the match column
df = df.assign(match=df.index.isin(df.index[idx]))
You can do this by defining a custom aggregation function, using it in a groupby, and finally merging the result back into the original dataframe. Something like this:
Aggregate function:
def pattern_detect(column):
    # define any other pattern to detect here
    p0, p1, p2, p3 = 1, 2, 2, 0
    found = column.eq(p0) & \
            column.shift(-1).eq(p1) & \
            column.shift(-2).eq(p2) & \
            column.shift(-3).eq(p3)
    return found.any()
Next, use groupby:
grp = df.groupby('group_var').agg([pattern_detect])['row_pat']
Now merge it back to the original dataframe:
df = df.merge(grp, left_on='group_var',right_index=True, how='left')