Expanding group by window to count nunique - pandas

I have the following df:
import pandas as pd

df = pd.DataFrame(data={'month': [1]*4 + [2]*4 + [3]*4,
                        'customer': [1, 2, 3, 4, 1, 5, 6, 7, 2, 3, 10, 7]})
I want to create an expanding window to count the number of unique customers at any point.
The output for this df should be:
{1:4,2:7,3:8}
because in the first month we had 4 different customers; in the 2nd, 3 were added (the other one was already in the first month); and in the last month only one was added (number 10).
Thanks

You can first drop the duplicated customers (only keep the first ones that appeared) and then cumulatively sum the number of (now unique) customers per month:
counts = df.drop_duplicates("customer").groupby("month").size().cumsum().to_dict()
to get
>>> counts
{1: 4, 2: 7, 3: 8}
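Equivalently (a small variation of the same idea, not in the original answer), you can mark each customer's first appearance with duplicated() and sum those marks per month:

import pandas as pd

df = pd.DataFrame(data={'month': [1]*4 + [2]*4 + [3]*4,
                        'customer': [1, 2, 3, 4, 1, 5, 6, 7, 2, 3, 10, 7]})

counts = (
    (~df['customer'].duplicated())   # True only on a customer's first row
    .groupby(df['month']).sum()      # new customers per month: 4, 3, 1
    .cumsum()                        # running total: 4, 7, 8
    .to_dict()
)
print(counts)  # {1: 4, 2: 7, 3: 8}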

Since there are repeated customers, you can drop those repeated customers using
df.drop_duplicates(subset='customer',ignore_index=True,inplace=True)
By default it will keep the first occurrence of each customer number and drop subsequent occurrences. To count the number of unique customers each month:
df['customer'] = df.groupby('month')['customer'].transform('count')
df = df.drop_duplicates(ignore_index=True)
To roll the window over the customer column, calculate cumulative sum of that column
df['customer'] = df['customer'].cumsum()
It will give the desired output
month  customer
    1         4
    2         7
    3         8
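Putting those steps together on the question's data (only the import and the df construction are added here):

import pandas as pd

df = pd.DataFrame(data={'month': [1]*4 + [2]*4 + [3]*4,
                        'customer': [1, 2, 3, 4, 1, 5, 6, 7, 2, 3, 10, 7]})

# 1. keep only each customer's first occurrence
df.drop_duplicates(subset='customer', ignore_index=True, inplace=True)
# 2. count the now-unique customers per month
df['customer'] = df.groupby('month')['customer'].transform('count')
df = df.drop_duplicates(ignore_index=True)
# 3. expand the window: running total of unique customers
df['customer'] = df['customer'].cumsum()
print(df)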

Related

updating the next several row values based on the value of a row in another column

I'm trying to figure out how to add the values of one column (the amount column) to the next few rows based on the condition of another column (the days column). If the days value is greater than 1, I add the amount to that many following rows, counting the current row as day one; so if days is three, I add the amount to the next two rows. I actually think this is easier if I make a copy of the amount column, so I made a copy called backlog.
So let's say the amount column represents the number of support tickets that need to be resolved each day, and each amount takes a given number of days to resolve. I need the total for a day to be the sum of that day's value and the outstanding tickets from earlier days. So if I have an amount of 1 for 2 days, I have 1 ticket today and I add that same 1 to tomorrow's ticket amount. If this doesn't make sense, the examples below will. I have a solution as well, but my main issue is doing this efficiently.
Here is a sample dataframe to use:
import random
import numpy as np
import pandas as pd

# 10 zero-amount days plus 15 days with a random amount of 1-3 tickets
amount = list(np.zeros(10)) + [random.randint(1, 3) for val in range(15)]
random.shuffle(amount)
ex = pd.DataFrame({
    'Amount': amount
})
# random resolution time for the 15 non-zero rows, zero for the rest
ex.loc[ex['Amount'] > 0, 'Days'] = [random.randint(0, 4) for val in range(15)]
ex.loc[ex['Amount'] == 0, 'Days'] = 0
ex['Days'] = ex['Days'].astype(int)
ex['Backlog'] = ex['Amount']
ex.head(10)
Input Dataframe:
Amount  Days  Backlog
     2     0        2
     1     3        1
     2     2        2
     3     0        3
Desired Output Dataframe:
Amount  Days  Backlog
     2     0        2
     1     3        1
     2     2        3
     3     0        6
In the last two values of the backlog column, I have a value of 3 (2 from the current day amount plus 1 from the prior day amount) and a value of 6 (3 for the current day + 2 from the previous day amount + 1 from two days ago).
I have made code for this below, which I think achieves the outcome:
for i in range(0, len(ex['Amount'])):
    Days = ex['Days'].iloc[i]
    if Days >= 2:
        for j in range(1, Days):
            if (i + j) >= len(ex['Amount']):
                break
            # .loc with the default RangeIndex avoids the chained
            # assignment that ex['Backlog'].iloc[i+j] += ... would do
            ex.loc[i + j, 'Backlog'] += ex['Amount'].iloc[i]
The problem is that I'm already using two for loops to slice the data frame for two features first, so when this code is used as a function for a very large data frame it runs far too slowly, and my main goal has been to implement a faster way to do this. Is there a more efficient pandas method to achieve the same outcome? Possibly without having to use slow iteration or a nested for loop? I'm at a loss.
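One way to cut this down (a sketch I am adding, assuming ex keeps its default RangeIndex): initialise the backlog from the amounts and spread each row's amount over its following Days - 1 rows with a NumPy slice addition, so the inner loop disappears and the frame is traversed once:

import numpy as np

backlog = ex['Amount'].to_numpy(dtype=float).copy()
n = len(backlog)
for i, (amt, d) in enumerate(zip(ex['Amount'], ex['Days'])):
    if d >= 2:
        end = min(i + d, n)           # clip at the end of the frame
        backlog[i + 1:end] += amt     # slice add replaces the inner loop
ex['Backlog'] = backlog

On the four-row example above this reproduces the desired Backlog of 2, 1, 3, 6.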

Pandas Sequential Count of members within a group and the sum

If I want to have a sequential count within a group I can do something like
df['GID'] = df.groupby(['G_COL1','G_COL2']).cumcount()
I cannot, however, figure out how to generate a column that contains the total number of values within the group. So if the group had three members, df['GID'] would contain 0, 1 & 2 and df['COUNT'] would contain the value 3 for each of the three members.
df["count_zeros"] = pd.DataFrame((df["GID"]==0)).cumsum()
df["COUNT"] = df.groupby("count_zeros").transform(lambda x: len(x))["GID"]
I think the above gives what you want. The GID column restarts from zero whenever a new group begins, so we count how many zeros, i.e. how many group "starts", we have, and measure each group's length with len.
As Scott Boston commented,
df["COUNT"] = df.groupby("count_zeros")['GID'].transform('count')
works and looks great :)
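For reference (a minimal sketch I am adding, with made-up data under the question's column names), the same result can come straight from the original grouping columns, without the count_zeros helper:

import pandas as pd

# hypothetical sample using the question's column names
df = pd.DataFrame({'G_COL1': ['a', 'a', 'a', 'b', 'b'],
                   'G_COL2': [1, 1, 1, 2, 2]})

grouped = df.groupby(['G_COL1', 'G_COL2'])
df['GID'] = grouped.cumcount()                      # 0, 1, 2 within each group
df['COUNT'] = grouped['G_COL1'].transform('size')   # 3, 3, 3, 2, 2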

Pandas matching algorithm with itself

I'm trying to create a matching algo in pandas that does the following with a given table:
A table contains purchases and sales of products by date, item, quantity (+ for purchases and - for sales) and price.
Conditions:
Create an algorithm that matches purchases and sales per item and the corresponding average profit for each item in total.
Matches can only be on the same date, otherwise they are not matched at all.
Remaining positive or negative inventories per day are ignored
Negative inventories are allowed.
Example with a single product:
date product quantity price
1 X +2 1
1 X -1 2
1 X -2 4
2 X +1 1
2 X +1 2
3 X -1 4
Answer:
The result would be that only on day 1 are trades matched: buy 2 at 1, sell 1 at 2, and sell 1 (of the 2) at 4, for a profit of -2+2+4 = 4. The inventory sequence is +2, -1, and then again -1, and the remaining inventory of -1 is ignored. Days 2 and 3 have no matches because the trades are not closed on the same day.
Correct output:
product Profit
X +4
Is there any elegant way to get this result without having to loop over the table multiple times with iterrows()?
For reproducing the df:
df = pd.DataFrame({'date':[1,1,1,2,2,3],'product': ['X']*6,'quantity':[2,-1,-2,1,1,-1],'price':[1,2,4,1,2,4]})
The process that you're describing could use groupby and aggregation, something like this:
df.groupby('date').sum()
But I don't fully understand your rules for matching, because for day 1 I get a different total profit: price * quantity sums to (+2*1) + (-1*2) + (-2*4) = -8, so the profit seems to be 8.
Using iterrows() is rather bad practice. Not only are you writing excessive code, it is also likely much slower (check a comparison here).
Most jobs of this type can be accomplished by combining groupby(), aggregate() and apply(). Check out this great tutorial.
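For what it's worth, here is a rough sketch of one reading of the matching rules (match buys against sells within each product and date in row order, clipping the unmatched remainder); the helper name matched_profit is mine, and this is an illustration rather than a verified solution:

import pandas as pd

df = pd.DataFrame({'date': [1, 1, 1, 2, 2, 3], 'product': ['X']*6,
                   'quantity': [2, -1, -2, 1, 1, -1], 'price': [1, 2, 4, 1, 2, 4]})

def matched_profit(g):
    # cash flow of the matched portion of one (product, date) group
    bought = g.loc[g['quantity'] > 0, 'quantity'].sum()
    sold = -g.loc[g['quantity'] < 0, 'quantity'].sum()
    m = min(bought, sold)                 # units matched on this day
    remaining = {'buy': m, 'sell': m}
    profit = 0.0
    for _, row in g.iterrows():           # each day's group is small
        side = 'buy' if row['quantity'] > 0 else 'sell'
        q = min(abs(row['quantity']), remaining[side])
        remaining[side] -= q
        profit += q * row['price'] if side == 'sell' else -q * row['price']
    return profit

result = (df.groupby(['product', 'date']).apply(matched_profit)
            .groupby('product').sum().rename('Profit').reset_index())
print(result)  # product X, Profit 4.0

It still iterates, but only within each (product, date) group rather than over the whole table repeatedly.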
I hope this helps you or future readers :)

Understanding Correlation Between Columns Pandas DataFrame

I have a dataset with daily sales of two products for the first 10 days after their release. The dataframe below shows the single items and the dozens sold for each product. It's believed that no dozens of a product were sold before a single item of that product had been sold. Each of the two products (Period_ID) has an expected number of dozen sales.
d = {'Period_ID':['A12']*10, 'Prod_A_Doz':[1.2]*10, 'Prod_B_Doz':[2.4]*10, 'A_Singles':[0,0,0,1,1,2,2,3,3,4], 'B_Singles':[0,0,1,1,2,2,3,3,4,4],
'A_Dozens':[0,0,0,0,0,0,0,1,1,1], 'B_Dozens':[0,0,0,0,0,0,1,1,2,2]}
df = pd.DataFrame(data=d)
QUESTION
I want to construct a descriptive analysis, and one of my questions is to figure out how many single items of each product were sold on average before a dozen was sold for the 1st time, 2nd time, ..., 10th time.
Given that df.Period_ID.nunique() = 1568
Modifying the dataset to sales per day (as opposed to the cumulative sales above) and using Pankaj Joshi's solution with a small alteration,
print(f'Average number of single items before {index + 1} dozen = {df1.A_Singles[:val+1].mean():0.2f}')
d = {'Period_ID':['A12']*10, 'Prob_A_Doz':[1.2]*10, 'Prod_B_Doz':[2.4]*10, 'A_Singles':[0,0,0,1,0,1,0,1,0,1], 'B_Singles':[0,0,1,0,1,0,1,0,1,0],
     'A_Dozens':[0,0,0,0,0,0,0,1,0,0], 'B_Dozens':[0,0,0,0,0,0,1,0,1,0]}
df1 = pd.DataFrame(data=d)
# For product A
Average number of single items before 1 dozen = 0.38
# For product B
6
Average number of single items before 1 dozen = 0.43
8
Average number of single items before 2 dozen = 0.44
However, I want that last average counted from the previous dozen sale onwards, so rather than 0.44 it should be 0.5.
The aim is that once I have this information for each Period_ID, I will take the average over all df.Period_ID.nunique() (= 1568) periods and try to optimise the expected number of 'Dozens' sales for each product, given under the columns Prod_A_Doz and Prod_B_Doz.
I would appreciate all the help.
Here is how I would go about it:
d = {'Period_ID':['A12']*10, 'Prob_A_Doz':[1.2]*10, 'Prod_B_Doz':[2.4]*10, 'A_Singles':[0,0,0,1,1,2,2,3,3,4], 'B_Singles':[0,0,1,1,2,2,3,3,4,4],
     'A_Dozens':[0,0,0,0,0,0,0,1,1,1], 'B_Dozens':[0,0,0,0,0,0,1,1,2,2]}
df1 = pd.DataFrame(data=d)
for per_id in set(df1.Period_ID):
    print(per_id)
    df_temp = df1[df1.Period_ID == per_id]
    for index, val in enumerate(df_temp.index[df_temp.A_Dozens > 0]):
        print(val)
        print(f'Average number of single items before {index} dozen = {df_temp.A_Singles[:val].mean():0.2f}')
        print(f'Average number of single items before {index} dozen = {df_temp.B_Dozens[:val].mean():0.2f}')
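For the asker's follow-up (counting singles from the previous dozen sale rather than from the start of the frame), one small variation is to remember the index of the previous dozen sale. A sketch against the per-day df1 from the question, shown for product B, where it reproduces the reported 0.43 and the desired 0.50:

import pandas as pd

# per-day data from the question (sales per day, not cumulative)
d = {'Period_ID':['A12']*10, 'Prob_A_Doz':[1.2]*10, 'Prod_B_Doz':[2.4]*10,
     'A_Singles':[0,0,0,1,0,1,0,1,0,1], 'B_Singles':[0,0,1,0,1,0,1,0,1,0],
     'A_Dozens':[0,0,0,0,0,0,0,1,0,0], 'B_Dozens':[0,0,0,0,0,0,1,0,1,0]}
df1 = pd.DataFrame(data=d)

prev = -1
for n, idx in enumerate(df1.index[df1.B_Dozens > 0], start=1):
    # average only over the days since the previous dozen sale
    window = df1.B_Singles.iloc[prev + 1: idx + 1]
    print(f'Average number of single items before {n} dozen = {window.mean():0.2f}')
    prev = idx
# prints 0.43, then 0.50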

Picking one of many identical rows with certain condition

To set the scene, what I define as "identical" rows is when the combination of destination and vehicle_brand is the same. For instance, in my table (SQL table name: cardriven), rows 2 and 3 are "identical" because of the Dallas-Toyota combination. Now I want to display only the row with the higher request_id: between rows 2 and 3, row 3 would be displayed and row 2 would be hidden/removed because 169 > 100. So in the end, only rows 3, 4, 5, 7, and 8 will show, and rows 1, 2, 6, and 9 would get hidden/removed.
Hopefully you understand what I am going for here, but if you have any questions, please let me know. This will be written in SQL.
Another problem: I added a new column for dates and entered some random ones for rows 2-4. Row 2 is 12/1/17, row 3 is 11/5/2016, and row 4 is 7/6/2017. Note that row 3 has the highest request_id of the Dallas-Toyota combination. I then entered a new row with request_id = 501 and Dallas, Toyota, 12/22/2017. After running the query, for Dallas-Toyota I get back row 3 but with request_id = 501! It SHOULD return the entry I just entered.
You can use GROUP BY and the MAX() function to get the highest value:
SELECT MAX(request_id), destination, vehicle_brand
FROM cardriven
GROUP BY destination, vehicle_brand