Pandas: fetch rows that continuously have a similar value

I have a dataframe like so..
id time status
-- ---- ------
a 1 T
a 2 F
b 1 T
b 2 T
a 3 T
a 4 T
b 3 F
b 4 T
b 5 T
I would like to fetch the ids that continuously have the status 'T' for a certain threshold number of times (say 2 in this case).
Thus the fetched rows would be...
id time status
-- ---- ------
b 1 T
b 2 T
a 3 T
a 4 T
b 4 T
b 5 T
I can think of an iterative solution. What I am looking for is something more pandas/SQL-like. I think an order by id and then time, followed by a group by on id and then status, should work, but I'd like to be sure.

Compare the values to 'T' with Series.eq, then label runs of consecutive equal values with Series.shift and Series.cumsum. Count the size of each run with Series.value_counts and map the counts back to the rows with Series.map. Finally, compare the counts against the threshold with Series.ge and filter by boolean indexing, chaining both masks with bitwise AND:
N = 2
# rows whose status is 'T'
m1 = df['status'].eq('T')
# label each run of consecutive equal status values
g = df['status'].ne(df['status'].shift()).cumsum()
# keep rows whose run is at least N long
m2 = g.map(g.value_counts()).ge(N)
df = df[m1 & m2]
print(df)
id time status
2 b 1 T
3 b 2 T
4 a 3 T
5 a 4 T
7 b 4 T
8 b 5 T
Details:
print(df.assign(m1=m1, g=g, counts=g.map(g.value_counts()), m2=m2))
id time status m1 g counts m2
0 a 1 T True 1 1 False
1 a 2 F False 2 1 False
2 b 1 T True 3 4 True
3 b 2 T True 3 4 True
4 a 3 T True 3 4 True
5 a 4 T True 3 4 True
6 b 3 F False 4 1 False
7 b 4 T True 5 2 True
8 b 5 T True 5 2 True
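Note that this labels runs of 'T' in frame order regardless of id, which matches the sample output above. If runs should instead be counted within each id after sorting (the order-by-id-then-time idea from the question), a minimal sketch, assuming the same df, is to start a new run whenever either the id or the status changes:
N = 2
df = df.sort_values(['id', 'time'])
m1 = df['status'].eq('T')
# a new run starts when the id or the status changes
g = (df['status'].ne(df['status'].shift())
     | df['id'].ne(df['id'].shift())).cumsum()
m2 = g.map(g.value_counts()).ge(N)
out = df[m1 & m2]
On this data it selects the same six rows, just ordered by id and time.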


Label the first element in each groupby

I have a data frame that looks like the following:
df = pd.DataFrame({'group':[1,1,2,2,2],'time':[1,2,3,4,5],'C':[6,7,8,9,10]})
group time C
0 1 1 6
1 1 2 7
2 2 3 8
3 2 4 9
4 2 5 10
and I'm looking to label the first element (in terms of time) in each group as True, i.e.:
group time C first_in_group
0 1 1 6 True
1 1 2 7 False
2 2 3 8 True
3 2 4 9 False
4 2 5 10 False
I tried several combinations of groupby and first, but did not manage to achieve what I wanted.
Is there an elegant way to do it in Pandas?
Use duplicated:
df['first_in_group'] = ~df.group.duplicated()
OUTPUT:
group time C first_in_group
0 1 1 6 True
1 1 2 7 False
2 2 3 8 True
3 2 4 9 False
4 2 5 10 False
NOTE: Do the sorting first (if required).
df = df.sort_values(['group', 'time'])
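An alternative that reads the same way, as a hedged sketch using a group-wise row number (equivalent on data already sorted by group and time):
df['first_in_group'] = df.groupby('group').cumcount().eq(0)
cumcount numbers the rows 0, 1, 2, ... within each group, so comparing to 0 flags exactly the first row per group.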

Fill in Data Frame based on Previous Data

I am working on a project with a retailer where we want to clean some data for reporting purposes.
The retailer has multiple stores, and every week the staff in the stores scan the different items on different displays (they scan the display first to let us know which display they are talking about). They only scan displays that changed in that week; if a display was not changed, we assume that it stayed the same.
Right now we are working with 2 dataframes:
Hierarchy Data Frame Example:
This table basically has weeks 1 to 52 for every end cap (display) in every store. Let's assume the company only has 2 stores and 3 end caps in each store. Also, different stores could have different End Cap codes, but that shouldn't matter for our purposes (I don't think).
Week Store End Cap
0 1 1 A
1 1 1 B
2 1 1 C
3 1 2 A
4 1 2 B
5 1 2 D
6 2 1 A
7 2 1 B
8 2 1 C
9 2 2 A
10 2 2 B
11 2 2 D
Next, we have the historical file with the actual changes to be used to update the End Caps:
Week Store End Cap UPC
0 1 1 A 123456
1 1 1 B 789456
2 1 1 B 546879
3 1 1 C 423156
4 1 2 A 231567
5 1 2 B 456123
6 1 2 D 689741
7 2 1 A 321654
8 2 1 C 852634
9 2 1 C 979541
10 2 2 A 132645
11 2 2 B 787878
12 2 2 D 615432
To merge the two dataframes I used:
merged_df = pd.merge(hierarchy, hist, how='left', on=['Week', 'Store', 'End Cap'])
Which gave me:
Week Store End Cap UPC
0 1 1 A 123456.0
1 1 1 B 789456.0
2 1 1 B 546879.0
3 1 1 C 423156.0
4 1 2 A 231567.0
5 1 2 B 456123.0
6 1 2 D 689741.0
7 2 1 A 321654.0
8 2 1 B NaN
9 2 1 C 852634.0
10 2 1 C 979541.0
11 2 2 A 132645.0
12 2 2 B 787878.0
13 2 2 D 615432.0
Except for the one instance where it shows NaN: End Cap B in Store 1 did not change in Week 2 and hence was not scanned, so it did not show up in the historical dataframe. In this case I would want to see the latest items that were scanned for that end cap at that store (see rows 2 & 3 of the historical dataframe). Technically that could also have been scanned in Week 52 of last year, but I just want to fill the NaN with the latest information to show that it did not change. How do I go about doing that?
The desired output would look like:
Week Store End Cap UPC
0 1 1 A 123456.0
1 1 1 B 789456.0
2 1 1 B 546879.0
3 1 1 C 423156.0
4 1 2 A 231567.0
5 1 2 B 456123.0
6 1 2 D 689741.0
7 2 1 A 321654.0
8 2 1 B 789456.0
9 2 1 B 546879.0
10 2 1 C 852634.0
11 2 1 C 979541.0
12 2 2 A 132645.0
13 2 2 B 787878.0
14 2 2 D 615432.0
Thank you!
EDIT:
Further to the above, I tried to sort the data and then forward fill, which only partially fixed the issue I am having:
sorted_df = merged_df.sort_values(['End Cap', 'Store'], ascending=[True, True])
Week Store End Cap UPC
0 1 1 A 123456.0
7 2 1 A 321654.0
4 1 2 A 231567.0
11 2 2 A 132645.0
1 1 1 B 789456.0
2 1 1 B 546879.0
8 2 1 B NaN
5 1 2 B 456123.0
12 2 2 B 787878.0
3 1 1 C 423156.0
9 2 1 C 852634.0
10 2 1 C 979541.0
6 1 2 D 689741.0
13 2 2 D 615432.0
sorted_filled = sorted_df.ffill()
Gives me:
Week Store End Cap UPC
0 1 1 A 123456.0
7 2 1 A 321654.0
4 1 2 A 231567.0
11 2 2 A 132645.0
1 1 1 B 789456.0
2 1 1 B 546879.0
8 2 1 B 546879.0
5 1 2 B 456123.0
12 2 2 B 787878.0
3 1 1 C 423156.0
9 2 1 C 852634.0
10 2 1 C 979541.0
6 1 2 D 689741.0
13 2 2 D 615432.0
This output did add 546879 to Week 2, Store 1, End Cap B, but it did not add 789456, which I also need; I need it to add another row with that value as well.
You can also do it like this, creating a helper index level (a within-group row number) to handle duplicate UPCs per Week/Store/End Cap:
idxcols = ['Week', 'Store', 'End Cap']
hist_idx = hist.set_index(idxcols + [hist.groupby(idxcols).cumcount()])
hier_idx = hierarchy.set_index(idxcols + [hierarchy.groupby(idxcols).cumcount()])
hier_idx.join(hist_idx, how='right')\
        .unstack('Week')\
        .ffill(axis=1)\
        .stack('Week')\
        .reorder_levels([3, 0, 1, 2])\
        .sort_index()\
        .reset_index()\
        .drop('level_3', axis=1)
Output:
Week Store End Cap UPC
0 1 1 A 123456.0
1 1 1 B 789456.0
2 1 1 B 546879.0
3 1 1 C 423156.0
4 1 2 A 231567.0
5 1 2 B 456123.0
6 1 2 D 689741.0
7 2 1 A 321654.0
8 2 1 B 789456.0
9 2 1 B 546879.0
10 2 1 C 852634.0
11 2 1 C 979541.0
12 2 2 A 132645.0
13 2 2 B 787878.0
14 2 2 D 615432.0
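The cumcount index level acts as a within-group row number, so duplicate UPCs per Week/Store/End Cap survive the reshape; unstacking Week turns the weeks into columns, ffill(axis=1) carries the last scanned display forward across weeks, and stacking back restores the original layout.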
You could try something like this:
# Rows that already have scanned data vs. rows that need filling
df1 = merged_df[~merged_df["UPC"].isna()].reset_index(drop=True)
df2 = merged_df[merged_df["UPC"].isna()].copy()
# Look in the previous week for the missing scans
df2["Week"] = df2["Week"] - 1
# For each missing Week/Store/End Cap, grab the corresponding UPC rows
# in df1 and append new rows (shifted back to the current week) to df1
for week, store, cap in df2[["Week", "Store", "End Cap"]].itertuples(index=False):
    mask = (
        (df1["Week"] == week)
        & (df1["Store"] == store)
        & (df1["End Cap"] == cap)
    )
    for upc in df1.loc[mask, "UPC"]:
        df1.loc[len(df1)] = [week + 1, store, cap, upc]
sorted_df = df1.sort_values(by=["Week", "Store", "End Cap"])

How to plot values of my columns being above a certain threshold?

I've been stuck with this problem for a while. I have a dataset which looks more or less like this:
Students Subject Mark
1 M F 7 4 3 7
2 I 5 6
3 M F I S 2 3 0
4 M 2 2
5 F M I 5 1
6 I M F 6 2 3
7 I M 7
Now, I want to create a barplot using pandas and seaborn showing how many students:
Have 3 or more letters in the column "Subject"
Have at least one 3 in the column "Mark"
Have both things
I tried with:
n_subject = dataset['Subject'].str.count('\w+')
dataset['NumberSubjects']= n_subject
n_over = dataset[dataset.n_subject >= 3.0]
But it does not work and I'm stuck. I'm sure it is a very basic problem but I don't know what to do.
3 or more subjects:
df["Subject"].str.count(r"\w+") >= 3
Has one or more marks of 3:
df["Mark"].str.count("3") >= 1
Both:
(df["Subject"].str.count(r"\w+") >= 3) & (df["Mark"].str.count("3") >= 1)
Boolean representation:
Students Subject Mark one two three
0 1 M F 7 4 3 7 False True False
1 2 I 5 6 False False False
2 3 M F I S 2 3 0 True True True
3 4 M 2 2 False False False
4 5 F M I 5 1 True False False
5 6 I M F 6 2 3 True True True
6 7 I M 7 False False False
I am not really sure what the barplot should represent (a summary of Mark?), but here is what you need for filtering purposes. Also, a plain string count would count empty spaces too, but there are multiple ways of handling this; I am just giving you an idea of what to do and how.
>>> m1 = df.Subject.apply(lambda x: len(x.split()) >= 3)
>>> m2 = df.Mark.str.contains('3')
>>> m3 = m1|m2
>>> df[m1]
Students Subject Mark
2 3 M F I S 2 3 0
4 5 F M I 5 1
5 6 I M F 6 2 3
>>> df[m2]
Students Subject Mark
0 1 M F 7 4 3 7
2 3 M F I S 2 3 0
5 6 I M F 6 2 3
>>> df[m3]
Students Subject Mark
0 1 M F 7 4 3 7
2 3 M F I S 2 3 0
4 5 F M I 5 1
5 6 I M F 6 2 3
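For the plot itself, a minimal sketch using the masks from the first answer and pandas' built-in plotting (seaborn's barplot would work similarly); df is the dataset from the question:
import matplotlib.pyplot as plt
import pandas as pd

# masks from the answers above
m1 = df["Subject"].str.count(r"\w+") >= 3
m2 = df["Mark"].str.count("3") >= 1

counts = pd.Series({"3+ subjects": m1.sum(),
                    "has a 3": m2.sum(),
                    "both": (m1 & m2).sum()})
counts.plot.bar(rot=0)
plt.ylabel("number of students")
plt.show()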

Replace values of duplicated rows with first record in pandas?

Input
df
id label
a 1
b 2
a 3
a 4
b 2
b 3
c 1
c 2
d 2
d 3
Expected
df
id label
a 1
b 2
a 1
a 1
b 2
b 2
c 1
c 1
d 2
d 2
For id a the label value is 1 and for id b it is 2, because 1 and 2 are the first records for a and b.
Try
I referred to this post, but it still does not solve it.
Use transform with 'first':
df['lb2']=df.groupby('id').label.transform('first')
df
Out[87]:
id label lb2
0 a 1 1
1 b 2 2
2 a 3 1
3 a 4 1
4 b 2 2
5 b 3 2
6 c 1 1
7 c 2 1
8 d 2 2
9 d 3 2
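Since the expected output overwrites label itself rather than keeping a helper column, the same technique can assign back in place; a minimal follow-up, assuming the same df:
df['label'] = df.groupby('id')['label'].transform('first')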

pandas groupby apply optimizing a loop

For the following data:
index bond stock investor_bond investor_stock
0 1 2 A B
1 1 2 A E
2 1 2 A F
3 1 2 B B
4 1 2 B E
5 1 2 B F
6 1 3 A A
7 1 3 A E
8 1 3 A G
9 1 3 B A
10 1 3 B E
11 1 3 B G
12 2 4 C F
13 2 4 C A
14 2 4 C C
15 2 5 B E
16 2 5 B B
17 2 5 B H
Bond 1 has two investors, A and B. Stock 2 has three investors: B, E, and F. For each (investor_bond, investor_stock) pair, we want to filter it out if the two investors have ever invested in the same bond/stock.
For example, the pair (B, F) at index 5 should be filtered out because both of them invested in stock 2.
Sample output should be like:
index bond stock investor_bond investor_stock
11 1 3 B G
So far I have tried using two loops.
A1 = A1.groupby('bond').apply(lambda x: x[~x.investor_stock.isin(x.bond)]).reset_index(drop=True)
stock_list = A1.groupby(['bond', 'stock']).apply(lambda x: x.investor_stock.unique()).reset_index()
stock_list = stock_list.rename(columns={0: 's'})
stock_list = stock_list.groupby('bond').apply(lambda x: list(x.s)).reset_index()
stock_list = stock_list.rename(columns={0: 's'})
A1 = pd.merge(A1, stock_list, on='bond', how='left')
A1['in_out'] = False
for j in range(0, len(A1)):
    for i in range(0, len(A1.s[j])):
        A1['in_out'] = A1.in_out | (
            A1.investor_bond.isin(A1.s[j][i]) & A1.investor_stock.isin(A1.s[j][i]))
    print(j)
The loop is running forever due to the data size, and I am seeking a faster way.
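One hedged way to avoid the row-by-row loop is to precompute, per instrument, every investor pair that co-occurs, and then test each row against that set. A sketch, assuming the dataframe shown above is named df:
from itertools import combinations
import pandas as pd

# investors of each bond and of each stock
bond_investors = df.groupby('bond')['investor_bond'].agg(set)
stock_investors = df.groupby('stock')['investor_stock'].agg(set)

# every unordered investor pair that co-occurs in some instrument
co_invested = set()
for investors in pd.concat([bond_investors, stock_investors]):
    co_invested.update(combinations(sorted(investors), 2))

def is_clean(row):
    a, b = row['investor_bond'], row['investor_stock']
    if a == b:  # the same investor on both sides trivially co-invests
        return False
    return tuple(sorted((a, b))) not in co_invested

result = df[df.apply(is_clean, axis=1)]
On the sample data this keeps only index 11 (bond 1, stock 3, B, G), matching the expected output, and it replaces the nested row loops with one pass over the instrument groups.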