pandas groupby apply optimizing a loop - pandas

For the following data:
index bond stock investor_bond investor_stock
0 1 2 A B
1 1 2 A E
2 1 2 A F
3 1 2 B B
4 1 2 B E
5 1 2 B F
6 1 3 A A
7 1 3 A E
8 1 3 A G
9 1 3 B A
10 1 3 B E
11 1 3 B G
12 2 4 C F
13 2 4 C A
14 2 4 C C
15 2 5 B E
16 2 5 B B
17 2 5 B H
Bond 1 has two investors: A and B. Stock 2 has three investors: B, E and F. For each investor pair (investor_bond, investor_stock), we want to filter the pair out if the two investors have ever invested in the same bond or the same stock.
For example, the pair (B, F) at index=5 should be filtered out because both B and F invested in stock 2.
The expected output is:
index bond stock investor_bond investor_stock
11 1 3 B G
So far I have tried using two loops.
A1 = A1.groupby('bond').apply(lambda x: x[~x.investor_stock.isin(x.bond)]).reset_index(drop=True)
stock_list=A1.groupby(['bond','stock']).apply(lambda x: x.investor_stock.unique()).reset_index()
stock_list=stock_list.rename(columns={0:'s'})
stock_list=stock_list.groupby('bond').apply(lambda x: list(x.s)).reset_index()
stock_list=stock_list.rename(columns={0:'s'})
A1=pd.merge(A1,stock_list,on='bond',how='left')
A1['in_out']=False
for j in range(0,len(A1)):
    for i in range (0,len(A1.s[j])):
        A1['in_out'] = A1.in_out | (
            A1.investor_bond.isin(A1.s[j][i]) & A1.investor_stock.isin(A1.s[j][i]))
    print(j)
The loop is running forever due to the data size, and I am seeking a faster way.
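One possible loop-free approach, offered as a sketch rather than a tested answer for the full data set: build a table of (asset, investor) memberships, self-join it to list every pair of investors that share a bond or a stock, and then anti-join the original rows against those pairs. The names `members`, `co_pairs`, `flagged` and `result`, and the "b"/"s" asset prefixes, are made up for this illustration.

```python
import pandas as pd

# Sample data from the question
rows = [
    (1, 2, "A", "B"), (1, 2, "A", "E"), (1, 2, "A", "F"),
    (1, 2, "B", "B"), (1, 2, "B", "E"), (1, 2, "B", "F"),
    (1, 3, "A", "A"), (1, 3, "A", "E"), (1, 3, "A", "G"),
    (1, 3, "B", "A"), (1, 3, "B", "E"), (1, 3, "B", "G"),
    (2, 4, "C", "F"), (2, 4, "C", "A"), (2, 4, "C", "C"),
    (2, 5, "B", "E"), (2, 5, "B", "B"), (2, 5, "B", "H"),
]
A1 = pd.DataFrame(rows, columns=["bond", "stock", "investor_bond", "investor_stock"])

# One row per (asset, investor). The "b"/"s" prefix (an assumption of this
# sketch) keeps bond 2 and stock 2 from being treated as the same asset.
bonds = pd.DataFrame({"asset": "b" + A1["bond"].astype(str),
                      "investor": A1["investor_bond"]})
stocks = pd.DataFrame({"asset": "s" + A1["stock"].astype(str),
                       "investor": A1["investor_stock"]})
members = pd.concat([bonds, stocks]).drop_duplicates()

# Self-join on asset: every ordered pair of investors sharing at least one asset
co_pairs = (
    members.merge(members, on="asset", suffixes=("_bond", "_stock"))
    [["investor_bond", "investor_stock"]]
    .drop_duplicates()
)

# Anti-join: keep only the rows whose investor pair never shares an asset
flagged = A1.merge(co_pairs, on=["investor_bond", "investor_stock"],
                   how="left", indicator=True)
result = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
print(result)
```

The self-join grows with the square of the number of investors per asset, so on data with very widely held bonds or stocks it may need to be done in chunks.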

Related

How to create a rolling unique count by group using pandas

I have a dataframe like the following:
group value
1 a
1 a
1 b
1 b
1 b
1 b
1 c
2 d
2 d
2 d
2 d
2 e
I want to create a column with how many unique values there have been so far for the group. Like below:
group value group_value_id
1 a 1
1 a 1
1 b 2
1 b 2
1 b 2
1 b 2
1 c 3
2 d 1
2 d 1
2 d 1
2 d 1
2 e 2
Use custom lambda function with GroupBy.transform and factorize:
df['group_value_id']=df.groupby('group')['value'].transform(lambda x:pd.factorize(x)[0]) + 1
print (df)
group value group_value_id
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 b 2
5 1 b 2
6 1 c 3
7 2 d 1
8 2 d 1
9 2 d 1
10 2 d 1
11 2 e 2
rank cannot be used here, because it fails on non-numeric values:
df['group_value_id'] = df.groupby('group')['value'].rank('dense')
print (df)
DataError: No numeric types to aggregate
It can also be solved as:
df['group_val_id'] = (df.groupby('group', group_keys=False)['value']
                        .apply(lambda x: x.astype('category').cat.codes + 1))
df
group value group_val_id
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 b 2
5 1 b 2
6 1 c 3
7 2 d 1
8 2 d 1
9 2 d 1
10 2 d 1
11 2 e 2
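The two approaches agree on the sample data only because the values already appear in alphabetical order within each group. A small sketch of where they differ, using made-up data (the frame and the names `by_appearance` and `by_sorted` are hypothetical): factorize numbers values in order of first appearance, while category codes follow the sorted order of the categories.

```python
import pandas as pd

# Hypothetical data where 'b' appears before 'a' inside group 1
df = pd.DataFrame({"group": [1, 1, 1, 2, 2],
                   "value": ["b", "a", "b", "d", "d"]})

# factorize: ids follow the order of first appearance within each group
by_appearance = df.groupby("group")["value"].transform(
    lambda s: pd.factorize(s)[0] + 1
)

# category codes: ids follow the sorted order of the distinct values
by_sorted = df.groupby("group")["value"].transform(
    lambda s: s.astype("category").cat.codes + 1
)

print(by_appearance.tolist())  # [1, 2, 1, 1, 1]
print(by_sorted.tolist())      # [2, 1, 2, 1, 1]
```

For a count of unique values seen so far, the factorize version is probably the one that matches the intent.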

Fill in Data Frame based on Previous Data

I am working on a project with a retailer where we want to clean some data for reporting purposes.
The retailer has multiple stores and every week the staff in the stores would scan the different items on different displays (They scan the display first to let us know which display they are talking about). Also, they only scan displays that changed in that week, if a display was not changed then we assume that it stayed the same.
Right now we are working with 2 dataframes:
Hierarchy Data Frame Example:
This table basically has weeks 1 to 52 for every end cap (display) in every store. Let's assume the company only has 2 stores and 3 end caps in each store. Also different stores could have different End Cap codes but that shouldn't matter for our purposes (I don't think).
Week Store End Cap
0 1 1 A
1 1 1 B
2 1 1 C
3 1 2 A
4 1 2 B
5 1 2 D
6 2 1 A
7 2 1 B
8 2 1 C
9 2 2 A
10 2 2 B
11 2 2 D
Next we have the historical file with actual changes to be used to update the End Caps.
Week Store End Cap UPC
0 1 1 A 123456
1 1 1 B 789456
2 1 1 B 546879
3 1 1 C 423156
4 1 2 A 231567
5 1 2 B 456123
6 1 2 D 689741
7 2 1 A 321654
8 2 1 C 852634
9 2 1 C 979541
10 2 2 A 132645
11 2 2 B 787878
12 2 2 D 615432
To merge the two dataframes I used:
merged_df = pd.merge(hierarchy, hist, how='left', left_on=['Week','Store', 'End Cap'], right_on = ['Week','Store', 'End Cap'])
Which gave me:
Week Store End Cap UPC
0 1 1 A 123456.0
1 1 1 B 789456.0
2 1 1 B 546879.0
3 1 1 C 423156.0
4 1 2 A 231567.0
5 1 2 B 456123.0
6 1 2 D 689741.0
7 2 1 A 321654.0
8 2 1 B NaN
9 2 1 C 852634.0
10 2 1 C 979541.0
11 2 2 A 132645.0
12 2 2 B 787878.0
13 2 2 D 615432.0
This is what I want, except for the one row that shows NaN. End Cap B in Store 1 did not change in Week 2 and hence was not scanned, so it does not show up in the historical dataframe. In this case I want to see the latest items that were scanned for that end cap at that store (see the second and third rows of the historical dataframe). Technically the latest scan could also have been in Week 52 of last year, but I just want to fill the NaN with the latest information to show that the display did not change. How do I go about doing that?
The desired output would look like:
Week Store End Cap UPC
0 1 1 A 123456.0
1 1 1 B 789456.0
2 1 1 B 546879.0
3 1 1 C 423156.0
4 1 2 A 231567.0
5 1 2 B 456123.0
6 1 2 D 689741.0
7 2 1 A 321654.0
8 2 1 B 789456.0
9 2 1 B 546879.0
10 2 1 C 852634.0
11 2 1 C 979541.0
12 2 2 A 132645.0
13 2 2 B 787878.0
14 2 2 D 615432.0
Thank you!
EDIT:
Further to the above, I tried to sort the data and then forward fill which only partially fixed the issue I am having:
sorted_df = merged_df.sort_values(['End Cap', 'Store'], ascending=[True, True])
Week Store End Cap UPC
0 1 1 A 123456.0
7 2 1 A 321654.0
4 1 2 A 231567.0
11 2 2 A 132645.0
1 1 1 B 789456.0
2 1 1 B 546879.0
8 2 1 B NaN
5 1 2 B 456123.0
12 2 2 B 787878.0
3 1 1 C 423156.0
9 2 1 C 852634.0
10 2 1 C 979541.0
6 1 2 D 689741.0
13 2 2 D 615432.0
sorted_filled = sorted_df.fillna(method='ffill')
Gives me:
Week Store End Cap UPC
0 1 1 A 123456.0
7 2 1 A 321654.0
4 1 2 A 231567.0
11 2 2 A 132645.0
1 1 1 B 789456.0
2 1 1 B 546879.0
8 2 1 B 546879.0
5 1 2 B 456123.0
12 2 2 B 787878.0
3 1 1 C 423156.0
9 2 1 C 852634.0
10 2 1 C 979541.0
6 1 2 D 689741.0
13 2 2 D 615432.0
This output did add the 546879 to week 2 store1 End Cap B but it did not add the 789456 which I also need. I need it to add another row with that value as well.
You can also do it like this, creating a helper column to handle duplicate UPCs per store/week/end cap:
idxcols=['Week', 'Store', 'End Cap']
hist_idx = hist.set_index(idxcols + [hist.groupby(idxcols).cumcount()])
hier_idx = hierarchy.set_index(idxcols+[hierarchy.groupby(idxcols).cumcount()])
hier_idx.join(hist_idx, how='right')\
.unstack('Week')\
.ffill(axis=1)\
.stack('Week')\
.reorder_levels([3,0,1,2])\
.sort_index()\
.reset_index()\
.drop('level_3', axis=1)
Output:
Week Store End Cap UPC
0 1 1 A 123456.0
1 1 1 B 789456.0
2 1 1 B 546879.0
3 1 1 C 423156.0
4 1 2 A 231567.0
5 1 2 B 456123.0
6 1 2 D 689741.0
7 2 1 A 321654.0
8 2 1 B 789456.0
9 2 1 B 546879.0
10 2 1 C 852634.0
11 2 1 C 979541.0
12 2 2 A 132645.0
13 2 2 B 787878.0
14 2 2 D 615432.0
You could try something like this:
# Rows that already have a UPC (index reset so that new labels never collide)
df1 = merged_df[~merged_df["UPC"].isna()].reset_index(drop=True)
# Rows that are missing a UPC
df2 = merged_df[merged_df["UPC"].isna()]
# For each Week/Store/End Cap in df2, grab the UPC values that df1 holds
# for the previous week and append them to df1 under the current week
for week, store, cap in df2[["Week", "Store", "End Cap"]].itertuples(index=False, name=None):
    mask = (
        (df1["Week"] == week - 1)
        & (df1["Store"] == store)
        & (df1["End Cap"] == cap)
    )
    # An end cap can hold several UPCs, so copy every match
    for upc in df1.loc[mask, "UPC"].tolist():
        df1.loc[len(df1)] = [week, store, cap, upc]
sorted_df = df1.sort_values(by=["Week", "Store", "End Cap"])
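The same "copy last week's items" idea can probably be expressed without explicit loops. The sketch below rebuilds the two sample frames from the question and, like the loop above, assumes the previous week always has a scan for the missing end cap; the names `keys`, `missing`, `prev` and `filled` are made up for this illustration.

```python
import pandas as pd

# Sample frames from the question
hierarchy = pd.DataFrame({
    "Week": [1] * 6 + [2] * 6,
    "Store": [1, 1, 1, 2, 2, 2] * 2,
    "End Cap": list("ABCABD") * 2,
})
hist = pd.DataFrame({
    "Week": [1] * 7 + [2] * 6,
    "Store": [1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
    "End Cap": list("ABBCABD") + list("ACCABD"),
    "UPC": [123456, 789456, 546879, 423156, 231567, 456123, 689741,
            321654, 852634, 979541, 132645, 787878, 615432],
})

keys = ["Week", "Store", "End Cap"]
merged = hierarchy.merge(hist, on=keys, how="left")

# Keys that were not scanned this week
missing = merged.loc[merged["UPC"].isna(), keys]

# Look up the same Store/End Cap one week earlier (assumption: a scan exists
# there), then relabel the matches with the current week
prev = missing.assign(Week=missing["Week"] - 1).merge(hist, on=keys, how="inner")
prev["Week"] += 1

filled = (
    pd.concat([merged.dropna(subset=["UPC"]), prev])
    .sort_values(keys)
    .reset_index(drop=True)
)
print(filled)
```

If a display can go unscanned for several weeks in a row, this one-week lookback would have to be repeated or replaced by a forward fill such as the unstack/ffill answer.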

Replace values of duplicated rows with first record in pandas?

Input
df
id label
a 1
b 2
a 3
a 4
b 2
b 3
c 1
c 2
d 2
d 3
Expected
df
id label
a 1
b 2
a 1
a 1
b 2
b 2
c 1
c 1
d 2
d 2
For id a the label should be 1 and for id b it should be 2, because 1 and 2 are the labels of the first records for a and b.
Try
I referred to this post, but it still did not solve the problem.
Use transform with 'first':
df['lb2']=df.groupby('id').label.transform('first')
df
Out[87]:
id label lb2
0 a 1 1
1 b 2 2
2 a 3 1
3 a 4 1
4 b 2 2
5 b 3 2
6 c 1 1
7 c 2 1
8 d 2 2
9 d 3 2
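If the original column should be overwritten rather than kept next to a new one, the same transform can be assigned back to label. A minimal sketch on the question's data:

```python
import pandas as pd

# Input data from the question
df = pd.DataFrame({
    "id": list("abaabbccdd"),
    "label": [1, 2, 3, 4, 2, 3, 1, 2, 2, 3],
})

# Broadcast each id's first label to all of its rows, in place
df["label"] = df.groupby("id")["label"].transform("first")
print(df)
```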

compare two column of two dataframe pandas

I have two data frames like this:
df_out:
a b c d
1 1 2 1
2 1 2 3
3 1 3 5
df_fin:
a e f g
1 0 2 1
2 5 2 3
3 1 3 5
5 2 4 6
7 3 2 5
I want to get this result:
a b c d a e f g
1 1 2 1 1 0 2 1
2 1 2 3 2 5 2 3
3 1 3 5 3 1 3 5
In other words, I have two different data frames that share one column (a). I want to compare these two columns (df_fin.a and df_out.a), select the rows from df_fin that have the same value in column a, and create a new dataframe that has the selected rows from df_fin plus the columns from df_out.
I think you need merge with left join:
df = pd.merge(df_out, df_fin, on='a', how='left')
print (df)
a b c d e f g
0 1 1 2 1 0 2 1
1 2 1 2 3 5 2 3
2 3 1 3 5 1 3 5
EDIT:
df1 = df_fin[df_fin['a'].isin(df_out['a'])]
df2 = df_out.join(df1.set_index('a'), on='a')
print (df2)
a b c d e f g
0 1 1 2 1 0 2 1
1 2 1 2 3 5 2 3
2 3 1 3 5 1 3 5
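On the sample data a left join and an inner join give the same rows, because every value of a in df_out also exists in df_fin. The sketch below adds one hypothetical row (a=9, not in the question) to show the difference: the left join keeps it with NaN in e, f, g, while the inner join drops it.

```python
import pandas as pd

# df_out from the question plus one hypothetical row (a=9) with no match
df_out = pd.DataFrame({"a": [1, 2, 3, 9], "b": [1, 1, 1, 0],
                       "c": [2, 2, 3, 0], "d": [1, 3, 5, 0]})
df_fin = pd.DataFrame({"a": [1, 2, 3, 5, 7], "e": [0, 5, 1, 2, 3],
                       "f": [2, 2, 3, 4, 2], "g": [1, 3, 5, 6, 5]})

# Left join: all rows of df_out survive, unmatched ones get NaN
left = pd.merge(df_out, df_fin, on="a", how="left")

# Inner join: only values of a present in both frames survive
inner = pd.merge(df_out, df_fin, on="a", how="inner")

print(left)
print(inner)
```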

Group by with a pandas dataframe using different aggregation for different columns

I have a pandas dataframe df with columns [a, b, c, d, e, f]. I want to perform a group by on df. I can best describe what it's supposed to do in SQL:
SELECT a, b, min(c), min(d), max(e), sum(f)
FROM df
GROUP BY a, b
How do I do this group by using pandas on my dataframe df?
consider df:
a b c d e f
1 1 2 5 9 3
1 1 3 3 4 5
2 2 4 7 4 4
2 2 5 3 8 8
I expect the result to be:
a b c d e f
1 1 2 3 9 8
2 2 4 3 8 12
Use agg:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    dict(
        a=list('aaaabbbb'),
        b=list('ccddccdd'),
        c=np.arange(8),
        d=np.arange(8),
        e=np.arange(8),
        f=np.arange(8),
    )
)
# Map each column to the aggregation it needs
funcs = dict(c='min', d='min', e='max', f='sum')
df.groupby(['a', 'b']).agg(funcs).reset_index()
a b c e f d
0 a c 0 1 1 0
1 a d 2 3 5 2
2 b c 4 5 9 4
3 b d 6 7 13 6
With your data:
a b c e f d
0 1 1 2 9 8 3
1 2 2 4 8 12 3
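The same SQL can also be written with named aggregation (available since pandas 0.25), which fixes the output column order through the keyword order. A sketch on the question's data:

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "a": [1, 1, 2, 2],
    "b": [1, 1, 2, 2],
    "c": [2, 3, 4, 5],
    "d": [5, 3, 7, 3],
    "e": [9, 4, 4, 8],
    "f": [3, 5, 4, 8],
})

# SELECT a, b, min(c), min(d), max(e), sum(f) FROM df GROUP BY a, b
result = df.groupby(["a", "b"], as_index=False).agg(
    c=("c", "min"),
    d=("d", "min"),
    e=("e", "max"),
    f=("f", "sum"),
)
print(result)
```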