Clean the data based on a condition in pandas

I have a data frame as shown below
ID Unit_ID Price Duration
1 A 200 2
2 B 1000 3
2 C 1000 3
2 D 1000 3
2 F 1000 3
2 G 200 1
3 A 500 2
3 B 200 2
In the above data frame, if several rows share the same ID, Price and Duration, I want to replace the Price with the average, that is, Price divided by the number of rows with that combination.
For example, the second to fifth rows (Unit_ID B, C, D and F) have the same ID, Price and Duration, so the count is 4 and the new Price = 1000/4 = 250.
Expected Output:
ID Unit_ID Price Duration
1 A 200 2
2 B 250 3
2 C 250 3
2 D 250 3
2 F 250 3
2 G 200 1
3 A 500 2
3 B 200 2

Use GroupBy.transform with 'size' to get a Series of group counts that has the same length and index as the original data frame, then divide Price by it with Series.div:
df['Price'] = df['Price'].div(df.groupby(['ID','Price','Duration'])['Price'].transform('size'))
print (df)
ID Unit_ID Price Duration
0 1 A 200.0 2
1 2 B 250.0 3
2 2 C 250.0 3
3 2 D 250.0 3
4 2 F 250.0 3
5 2 G 200.0 1
6 3 A 500.0 2
7 3 B 200.0 2
Detail:
print (df.groupby(['ID','Price','Duration'])['Price'].transform('size'))
0 1
1 4
2 4
3 4
4 4
5 1
6 1
7 1
Name: Price, dtype: int64
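For reference, here is a minimal self-contained sketch of the same idea that rebuilds the sample data from the question. The intermediate name `counts` is only illustrative; it is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 2, 2, 2, 2, 2, 3, 3],
    "Unit_ID": list("ABCDFGAB"),
    "Price": [200, 1000, 1000, 1000, 1000, 200, 500, 200],
    "Duration": [2, 3, 3, 3, 3, 1, 2, 2],
})

# Number of rows sharing each (ID, Price, Duration) combination,
# broadcast back to the shape of the original frame
counts = df.groupby(["ID", "Price", "Duration"])["Price"].transform("size")

# Split each price evenly across the rows of its group
df["Price"] = df["Price"] / counts
print(df)
```

Because `transform` keeps the original index, the division lines up row by row without any merge step.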

Related

Fill in Data Frame based on Previous Data

I am working on a project with a retailer where we are wanting to clean some data for reporting purposes.
The retailer has multiple stores and every week the staff in the stores would scan the different items on different displays (They scan the display first to let us know which display they are talking about). Also, they only scan displays that changed in that week, if a display was not changed then we assume that it stayed the same.
Right now we are working with 2 dataframes:
Hierarchy Data Frame Example:
This table basically has weeks 1 to 52 for every end cap (display) in every store. Let's assume the company only has 2 stores and 3 end caps in each store. Also different stores could have different End Cap codes but that shouldn't matter for our purposes (I don't think).
Week Store End Cap
0 1 1 A
1 1 1 B
2 1 1 C
3 1 2 A
4 1 2 B
5 1 2 D
6 2 1 A
7 2 1 B
8 2 1 C
9 2 2 A
10 2 2 B
11 2 2 D
Next we have the historical file with actual changes to be used to update the End Caps.
Week Store End Cap UPC
0 1 1 A 123456
1 1 1 B 789456
2 1 1 B 546879
3 1 1 C 423156
4 1 2 A 231567
5 1 2 B 456123
6 1 2 D 689741
7 2 1 A 321654
8 2 1 C 852634
9 2 1 C 979541
10 2 2 A 132645
11 2 2 B 787878
12 2 2 D 615432
To merge the two dataframes I used:
merged_df = pd.merge(hierarchy, hist, how='left', left_on=['Week','Store', 'End Cap'], right_on = ['Week','Store', 'End Cap'])
Which gave me:
Week Store End Cap UPC
0 1 1 A 123456.0
1 1 1 B 789456.0
2 1 1 B 546879.0
3 1 1 C 423156.0
4 1 2 A 231567.0
5 1 2 B 456123.0
6 1 2 D 689741.0
7 2 1 A 321654.0
8 2 1 B NaN
9 2 1 C 852634.0
10 2 1 C 979541.0
11 2 2 A 132645.0
12 2 2 B 787878.0
13 2 2 D 615432.0
This is what I expected, except for the one row that shows NaN. Store 1, End Cap B did not change in week 2 and hence was not scanned, so it does not show up in the historical dataframe. In this case I want to see the latest items that were scanned for that end cap at that store (see index rows 1 and 2 of the historical dataframe). Technically, the latest scan could even have been in week 52 of last year; I just want to fill the NaN with the latest information to show that the display did not change. How do I go about doing that?
The desired output would look like:
Week Store End Cap UPC
0 1 1 A 123456.0
1 1 1 B 789456.0
2 1 1 B 546879.0
3 1 1 C 423156.0
4 1 2 A 231567.0
5 1 2 B 456123.0
6 1 2 D 689741.0
7 2 1 A 321654.0
8 2 1 B 789456.0
9 2 1 B 546879.0
10 2 1 C 852634.0
11 2 1 C 979541.0
12 2 2 A 132645.0
13 2 2 B 787878.0
14 2 2 D 615432.0
Thank you!
EDIT:
Further to the above, I tried to sort the data and then forward fill which only partially fixed the issue I am having:
sorted_df = merged_df.sort_values(['End Cap', 'Store'], ascending=[True, True])
Week Store End Cap UPC
0 1 1 A 123456.0
7 2 1 A 321654.0
4 1 2 A 231567.0
11 2 2 A 132645.0
1 1 1 B 789456.0
2 1 1 B 546879.0
8 2 1 B NaN
5 1 2 B 456123.0
12 2 2 B 787878.0
3 1 1 C 423156.0
9 2 1 C 852634.0
10 2 1 C 979541.0
6 1 2 D 689741.0
13 2 2 D 615432.0
sorted_filled = sorted_df.ffill()
Gives me:
Week Store End Cap UPC
0 1 1 A 123456.0
7 2 1 A 321654.0
4 1 2 A 231567.0
11 2 2 A 132645.0
1 1 1 B 789456.0
2 1 1 B 546879.0
8 2 1 B 546879.0
5 1 2 B 456123.0
12 2 2 B 787878.0
3 1 1 C 423156.0
9 2 1 C 852634.0
10 2 1 C 979541.0
6 1 2 D 689741.0
13 2 2 D 615432.0
This output did add 546879 to Week 2, Store 1, End Cap B, but it did not add 789456, which I also need. I need it to add another row with that value as well.
You can also do it like this, adding a helper index level (built with cumcount) to handle duplicate UPCs per Week/Store/End Cap:
idxcols = ['Week', 'Store', 'End Cap']
hist_idx = hist.set_index(idxcols + [hist.groupby(idxcols).cumcount()])
hier_idx = hierarchy.set_index(idxcols + [hierarchy.groupby(idxcols).cumcount()])
hier_idx.join(hist_idx, how='right')\
    .unstack('Week')\
    .ffill(axis=1)\
    .stack('Week')\
    .reorder_levels([3,0,1,2])\
    .sort_index()\
    .reset_index()\
    .drop('level_3', axis=1)
Output:
Week Store End Cap UPC
0 1 1 A 123456.0
1 1 1 B 789456.0
2 1 1 B 546879.0
3 1 1 C 423156.0
4 1 2 A 231567.0
5 1 2 B 456123.0
6 1 2 D 689741.0
7 2 1 A 321654.0
8 2 1 B 789456.0
9 2 1 B 546879.0
10 2 1 C 852634.0
11 2 1 C 979541.0
12 2 2 A 132645.0
13 2 2 B 787878.0
14 2 2 D 615432.0
You could try something like this (note that it only looks one week back):
# New df without NaN values (fresh 0..n-1 index so appended rows get new labels)
df1 = merged_df[merged_df["UPC"].notna()].reset_index(drop=True)
# New df with NaN values only
df2 = merged_df[merged_df["UPC"].isna()].copy()
# Set previous week
df2["Week"] = df2["Week"] - 1
# For each Week/Store/End Cap in df2, grab the corresponding UPC values in df1
# and append new rows (shifted back to the current week) to df1
for week, store, cap in zip(df2["Week"], df2["Store"], df2["End Cap"]):
    mask = (
        (df1["Week"] == week)
        & (df1["Store"] == store)
        & (df1["End Cap"] == cap)
    )
    # A display can hold several UPCs, so copy every match
    for upc in df1.loc[mask, "UPC"].tolist():
        df1.loc[len(df1)] = [week + 1, store, cap, upc]
sorted_df = df1.sort_values(by=["Week", "Store", "End Cap"]).reset_index(drop=True)
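The idea of "take the latest scan for each display" can also be expressed with two merges. The following is only a sketch of a different technique, not either answerer's method: it assumes week numbers are comparable integers within one year, and the names `scanned`, `pairs`, `latest` and the helper column `ScanWeek` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
hierarchy = pd.DataFrame({
    "Week": [1] * 6 + [2] * 6,
    "Store": [1, 1, 1, 2, 2, 2] * 2,
    "End Cap": ["A", "B", "C", "A", "B", "D"] * 2,
})
hist = pd.DataFrame({
    "Week": [1] * 7 + [2] * 6,
    "Store": [1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
    "End Cap": ["A", "B", "B", "C", "A", "B", "D", "A", "C", "C", "A", "B", "D"],
    "UPC": [123456, 789456, 546879, 423156, 231567, 456123, 689741,
            321654, 852634, 979541, 132645, 787878, 615432],
})

keys = ["Store", "End Cap"]

# Weeks in which each display was actually scanned
scanned = (hist[["Week"] + keys]
           .drop_duplicates()
           .rename(columns={"Week": "ScanWeek"}))

# Pair every hierarchy row with the scans of that display up to that week
# (displays that were never scanned drop out of the inner merge)
pairs = hierarchy.merge(scanned, on=keys, how="inner")
pairs = pairs[pairs["ScanWeek"] <= pairs["Week"]]

# Keep only the most recent scan week per Week/Store/End Cap
latest = pairs.groupby(["Week"] + keys, as_index=False)["ScanWeek"].max()

# Pull in every UPC recorded in that scan week
result = (latest
          .merge(hist.rename(columns={"Week": "ScanWeek"}),
                 on=["ScanWeek"] + keys, how="left")
          .drop(columns="ScanWeek")
          .sort_values(["Week", "Store", "End Cap"])
          .reset_index(drop=True))
print(result)
```

Unlike a one-week look-back, this picks up the most recent scan no matter how many weeks ago it happened, at the cost of a larger intermediate frame.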

Pandas pivot? pivot_table? melt? stack or unstack?

I have a dataframe that looks like this:
id Revenue Cost qty time
0 A 400 50 2 1
1 A 900 200 8 2
2 A 800 100 8 3
3 B 300 20 1 1
4 B 600 150 4 2
5 B 650 155 4 3
And I'm trying to get to this:
id Type 1 2 3
0 A Revenue 400 900 800
1 A Cost 50 200 100
2 A qty 2 8 8
3 B Revenue 300 600 650
4 B Cost 20 150 155
5 B qty 1 4 4
Where time will always just be repeated 1-3, so I need to transpose or pivot on just time, with the column for 1-3
Here is what I have tried so far:
pd.pivot_table(df, values = ['Revenue', 'qty', 'Cost'] , index=['id'], columns='time').reset_index()
But that just makes one really long table that puts everything side by side vs stacked like this:
Revenue qty Cost
1 2 3 1 2 3 1 2 3
In that situation I would need to convert the Revenue, qty and Cost to a row and just use the 1, 2, 3 as the column names. So the ID would be duplicated for each 'type' but list it out based on time 1-3.
We can still do unstack and stack
df.set_index(['id','time']).stack().unstack(level=1).reset_index()
Out[24]:
time id level_1 1 2 3
0 A Revenue 400 900 800
1 A Cost 50 200 100
2 A qty 2 8 8
3 B Revenue 300 600 650
4 B Cost 20 150 155
5 B qty 1 4 4
An alternative, using melt and pivot (pandas 1.1.0 or later; keyword arguments are required by pivot in pandas 2.0+):
(df
 .melt(["id", "time"])
 .pivot(index=["id", "variable"], columns="time", values="value")
 .reset_index()
 .rename_axis(columns=None)
)
id variable 1 2 3
0 A Cost 50 200 100
1 A Revenue 400 900 800
2 A qty 2 8 8
3 B Cost 20 150 155
4 B Revenue 300 600 650
5 B qty 1 4 4
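Both answers return the rows in a slightly different shape from the requested one (a `level_1` or `variable` column name, and alphabetical row order in the second case). As a hedged sketch, the melt and pivot approach can be extended to name the column `Type` and restore the Revenue, Cost, qty order with an ordered categorical. The names `long_df` and `wide` are illustrative only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": list("AAABBB"),
    "Revenue": [400, 900, 800, 300, 600, 650],
    "Cost": [50, 200, 100, 20, 150, 155],
    "qty": [2, 8, 8, 1, 4, 4],
    "time": [1, 2, 3, 1, 2, 3],
})

# Long format: one row per (id, time, Type)
long_df = df.melt(id_vars=["id", "time"], var_name="Type", value_name="value")

# Wide format: one column per time value
wide = (long_df
        .pivot(index=["id", "Type"], columns="time", values="value")
        .reset_index()
        .rename_axis(columns=None))

# Restore the original column order as the row order within each id
wide["Type"] = pd.Categorical(wide["Type"],
                              categories=["Revenue", "Cost", "qty"],
                              ordered=True)
wide = wide.sort_values(["id", "Type"]).reset_index(drop=True)
print(wide)
```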

Pandas sum with groupby on condition

I have this dataframe:
id priority quantity
0 A 1 2
1 A 2 4
2 A 3 4
3 A 4 2
4 B 1 5
5 B 2 7
6 B 3 2
7 B 4 3
that I want to turn into this one:
id priority quantity cumulativeQuantity
0 A 1 2 2
1 A 2 4 6
2 A 3 4 10
3 A 4 2 12
4 B 1 5 5
5 B 2 7 12
6 B 3 2 14
7 B 4 3 17
Columns id, priority and quantity haven't changed.
cumulativeQuantity is the sum, by id, of all quantity from 1 to n where n is the priority of the current row.
priority can take any value; only the order matters. A row's quantity is included in the sum if its priority is lower than or equal to the priority of the current row.
ANSWER:
df.groupby(['id','priority']).sum().groupby(level=0).cumsum().reset_index()
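Note that the one-liner above returns the running total in the quantity column itself. If the original quantity should be kept and the total added as a new column, as in the desired output, a minimal sketch (assuming each priority appears at most once per id) is to sort and use a grouped cumsum:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": list("AAAABBBB"),
    "priority": [1, 2, 3, 4, 1, 2, 3, 4],
    "quantity": [2, 4, 4, 2, 5, 7, 2, 3],
})

# Make sure rows are in priority order within each id
df = df.sort_values(["id", "priority"]).reset_index(drop=True)

# Running total of quantity within each id
df["cumulativeQuantity"] = df.groupby("id")["quantity"].cumsum()
print(df)
```

If the same priority can occur several times for one id, the rows sharing it would get different running totals here, so the groupby-sum step from the answer would be needed first.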

Pandas get order of column value grouped by other column value

I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However I can't seem to figure out how I should handle the argument I get using .transform():
def k(r):
    return min(r)
tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Because r is either a list or an element?
You can use series.rank() with df.groupby():
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
An alternative is to sort by price and number the rows within each group with cumcount:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Out[1907]:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2
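On the question of what transform passes in: for a grouped column, the function receives one pandas Series per group (here, all prices of one srch_id), not a single element. A small sketch under that reading, which also returns integer positions; `method="first"` is my choice for breaking ties by order of appearance and is not taken from the answers above:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "srch_id": [1, 1, 1, 3, 3, 3],
    "price": [30, 20, 25, 15, 102, 39],
})

grouped = df.groupby("srch_id")["price"]

# Built-in grouped rank, cast to int because rank returns floats
df["price_position"] = grouped.rank(method="first").astype(int)

# Same result through transform: `s` is the Series of prices for one group
df["via_transform"] = grouped.transform(lambda s: s.rank(method="first")).astype(int)
print(df)
```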

which rows are duplicates to each other

I have got a database with a lot of columns. Some of the rows are duplicates (on a certain subset).
Now I want to find out which row duplicates which row and put them together.
For instance, let's suppose that the data frame is
id A B C
0 0 1 2 0
1 1 2 3 4
2 2 1 4 8
3 3 1 2 3
4 4 2 3 5
5 5 5 6 2
and subset is
['A','B']
I expect something like this:
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
Is there any function that can help me do this?
Thanks :)
Use DataFrame.duplicated with keep=False to get a mask that marks all duplicates, then filter by boolean indexing, sort the duplicated rows with DataFrame.sort_values and join both parts together with concat:
L = ['A','B']
m = df.duplicated(L, keep=False)
df = pd.concat([df[m].sort_values(L), df[~m]], ignore_index=True)
print (df)
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
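To also see which rows belong together, the duplicated rows can be given a group number. This is a sketch on the sample data; the `group` column and the use of `ngroup` are additions for illustration and not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "A": [1, 2, 1, 1, 2, 5],
    "B": [2, 3, 4, 2, 3, 6],
    "C": [0, 4, 8, 3, 5, 2],
})
L = ["A", "B"]

# True for every row whose (A, B) pair occurs more than once
m = df.duplicated(L, keep=False)

# Duplicated rows, ordered so that matching rows sit next to each other
dupes = df[m].sort_values(L + ["id"]).copy()

# Number each distinct (A, B) pair among the duplicates
dupes["group"] = dupes.groupby(L).ngroup()

# Duplicates first, unique rows after
out = pd.concat([dupes.drop(columns="group"), df[~m]], ignore_index=True)
print(dupes)
print(out)
```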