aggregate data between two dates with two dataframes - pandas

Given I have the following DF. Assume this table has all the sales reps and all the Q end dates for the last 20 years.
Q End date   Rep    Var1
03/31/2010   Bob    11
03/31/2010   Alice  12
03/31/2010   Jack   13
06/30/2010   Bob    14
06/30/2010   Alice  15
06/30/2010   Jack   16
I also have a table of transaction events:
Sell Date    Rep
04/01/2009   Bob
03/01/2010   Bob
02/01/2010   Jack
02/01/2010   Jack
I am trying to modify the first DF to add a column that counts the number of transactions that happened in the 12 months prior to the Q end date, per Q end per Rep.
The result should look like this:
Q End date   Rep    Var1   Trailing 12M transactions
03/31/2010   Bob    11     2
03/31/2010   Alice  12     0
03/31/2010   Jack   13     2
06/30/2010   Bob    14     1
06/30/2010   Alice  15     0
06/30/2010   Jack   16     2
My table has 2,000-3,000 sales reps per quarter for ~20 years, and the number of transactions per trailing 12 months can range from 0 to roughly 7,000.
Any help here would be appreciated. Thanks!

Try:
df1["Q End date"] = pd.to_datetime(df1["Q End date"])
df2["Sell Date"] = pd.to_datetime(df2["Sell Date"])
df2 = df2.sort_values(by="Sell Date").set_index("Sell Date")
df1["Trailing 12M transactions"] = df1.apply(
lambda x: df2.loc[
x["Q End date"] - pd.DateOffset(years=1) : x["Q End date"]
]
.eq(x["Rep"])
.sum(),
axis=1,
)
print(df1)
Prints:
Q End date Rep Var1 Trailing 12M transactions
0 2010-03-31 Bob 11 2
1 2010-03-31 Alice 12 0
2 2010-03-31 Jack 13 2
3 2010-06-30 Bob 14 1
4 2010-06-30 Alice 15 0
5 2010-06-30 Jack 16 2
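Note that the row-wise apply above does a date slice per row; with ~80 quarter ends × 2,000-3,000 reps that is a couple of hundred thousand lookups, which can get slow. A possible vectorized alternative is sketched below, reusing the df1/df2 names from the question: join the two frames on Rep, keep only the sells inside each trailing window, and count. Whether it is actually faster depends on how large the intermediate per-Rep merge gets.
import pandas as pd

df1["Q End date"] = pd.to_datetime(df1["Q End date"])
df2["Sell Date"] = pd.to_datetime(df2["Sell Date"])

# pair every quarter-end row with every sell by the same Rep
merged = df1[["Q End date", "Rep"]].merge(df2, on="Rep", how="left")

# keep only sells that fall inside the trailing 12-month window (inclusive on both ends)
in_window = (
    (merged["Sell Date"] >= merged["Q End date"] - pd.DateOffset(years=1))
    & (merged["Sell Date"] <= merged["Q End date"])
)

# count qualifying sells per (quarter end, rep) and map the counts back onto df1
counts = (
    merged[in_window]
    .groupby(["Q End date", "Rep"])
    .size()
    .rename("Trailing 12M transactions")
    .reset_index()
)
df1 = df1.merge(counts, on=["Q End date", "Rep"], how="left")
df1["Trailing 12M transactions"] = df1["Trailing 12M transactions"].fillna(0).astype(int)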

Pandas: to get mean for each data category daily [duplicate]

I am a fairly new programmer learning Python (+pandas) and hope I can explain this well enough. I have a large time-series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many similar questions, like counting records per hour per day and getting the average per hour over several years. However, I run into trouble including the 'Id' variable.
I'm looking to get the mean number of people taking a ticket for each hour, for each day of the week (Mon-Fri), per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to those groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either code-wise or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question on a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
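For instance, on toy data shaped like the question's (one row per ticket with a Start_date timestamp; the frame and timestamps below are made up purely for illustration), the whole pipeline would look like this:
import pandas as pd

# hypothetical raw data: one row per ticket taken
df = pd.DataFrame({
    "Id": [1234, 1234, 1234, 1234, 1234, 1234],
    "Start_date": pd.to_datetime([
        "2014-12-12 09:05", "2014-12-12 09:20", "2014-12-12 09:45",
        "2014-12-19 09:10", "2014-12-19 09:30", "2014-12-27 11:15",
    ]),
})

# helper columns with date, day of week and hour
df["date"] = df["Start_date"].dt.date
df["dow"] = df["Start_date"].dt.dayofweek
df["hour"] = df["Start_date"].dt.hour

# tickets per Id per calendar date per hour...
daily = df.groupby(["Id", "date", "dow", "hour"]).size().rename("Count").reset_index()

# ...then the mean of those daily counts per Id / day of week / hour
mean_counts = daily.groupby(["Id", "dow", "hour"])["Count"].mean().reset_index()
print(mean_counts)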
You can also group by the 'Id' column and then use the resample function followed by .sum().

R - get a vector that tells me if a value of another vector is the first appearence or not

I have a data frame of sales with three columns: the code of the customer, the month the customer bought that item, and the year.
A customer can buy something in September and then make another purchase in December, so they appear twice. But I'm interested in knowing the absolutely new customers by month and year.
So I have thought of iterating with some checks, using the %in% function to build a boolean vector that tells me whether a customer is new or not, and then counting by month and year with SQL using this new vector.
But I'm wondering if there's a specific function or a better way to do that.
This is an example of the data I would like to have:
date cust month new_customer
1 14975 25 1 TRUE
2 14976 30 1 TRUE
3 14977 22 1 TRUE
4 14978 4 1 TRUE
5 14979 25 1 FALSE
6 14980 11 1 TRUE
7 14981 17 1 TRUE
8 14982 17 1 FALSE
9 14983 18 1 TRUE
10 14984 7 1 TRUE
11 14985 24 1 TRUE
12 14986 22 1 FALSE
To put it more simply: the data frame is sorted by date, and I'm interested in a vector (new_customer) that tells me whether the customer purchased something for the first time or not. For example, customer 25 bought something on the first day and then bought something again four days later, so is not a new customer at that point. The same can be seen with customers 17 and 22.
I created dummy data myself, with id, month (numeric), and year:
dat <- data.frame(
  id    = c(1, 2, 3, 4, 5, 6, 7, 8, 1, 3, 4, 5, 1, 2, 2),
  month = c(1, 6, 7, 8, 2, 3, 4, 8, 11, 1, 10, 9, 1, 12, 2),
  year  = c(2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021, 2021)
)
id month year
1 1 1 2019
2 2 6 2019
3 3 7 2019
4 4 8 2019
5 5 2 2019
6 6 3 2020
7 7 4 2020
8 8 8 2020
9 1 11 2020
10 3 1 2020
11 4 10 2021
12 5 9 2021
13 1 1 2021
14 2 12 2021
15 2 2 2021
Then, group by id and arrange by year and month (order is meaningful). Then use filter and row_number().
dat %>%
  group_by(id) %>%
  arrange(year, month) %>%
  filter(row_number() == 1)
id month year
<dbl> <dbl> <dbl>
1 1 1 2019
2 5 2 2019
3 2 6 2019
4 3 7 2019
5 4 8 2019
6 6 3 2020
7 7 4 2020
8 8 8 2020
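If you want the row-level new_customer flag the question asks for, rather than just each customer's first row, one possible sketch (assuming dat is already built and sorted as above) is to use duplicated(), which marks every appearance of an id after its first:
library(dplyr)

dat <- dat %>%
  arrange(year, month) %>%
  mutate(new_customer = !duplicated(id))

# number of new customers per year and month
dat %>%
  filter(new_customer) %>%
  count(year, month)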
Sample Code
You can adapt your code according to this logic.
Create table:
CREATE TABLE PURCHASE(Posting_Date DATE, Customer_Id VARCHAR(10), Customer_Name VARCHAR(15));
Insert data into the table:
Posting_Date Customer_Id Customer_Name
2018-01-01 C_01 Jack
2018-02-01 C_01 Jack
2018-03-01 C_01 Jack
2018-04-01 C_02 James
2019-04-01 C_01 Jack
2019-05-01 C_01 Jack
2019-05-01 C_03 Gill
2020-01-01 C_02 James
2020-01-01 C_04 Jones
Code
WITH Date_CTE (PostingDate,CustomerID,FirstYear)
AS
(
SELECT MIN(Posting_Date) as [Date],
Customer_Id,
YEAR(MIN(Posting_Date)) as [F_Purchase_Year]
FROM PURCHASE
GROUP BY Customer_Id
)
SELECT T.[ActualYear],(CASE WHEN T.[Customer Status] = 'new' THEN COUNT(T.[Customer Status]) END) AS [New Customer]
FROM (
SELECT DISTINCT YEAR(T2.Posting_Date) AS [ActualYear],
T2.Customer_Id,
(CASE WHEN T1.FirstYear = YEAR(T2.Posting_Date) THEN 'new' ELSE 'old' END) AS [Customer Status]
FROM Date_CTE AS T1
left outer join PURCHASE AS T2 ON T1.CustomerID = T2.Customer_Id
) AS T
GROUP BY T.[ActualYear],T.[Customer Status]
Final Result
ActualYear New Customer
2018 2
2019 1
2020 1
2019 NULL
2020 NULL
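If you only need the count of new customers per year, a shorter sketch of the same idea (same PURCHASE table; SQL Server syntax assumed) is to compute each customer's first purchase date and then count customers by the year of that date:
SELECT YEAR(first_purchase) AS [Year],
       COUNT(*)             AS [New Customers]
FROM (
    SELECT Customer_Id, MIN(Posting_Date) AS first_purchase
    FROM PURCHASE
    GROUP BY Customer_Id
) AS f
GROUP BY YEAR(first_purchase)
ORDER BY [Year];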

How to map columns From one DataFrame onto another based on a column values between the two?

Basically, I just want to map values from one dataframe to another based on a common column ('ID' / 'Key').
df1:
ID Name
1 Sam
2 Ryan
4 Sam
16 Brian
7 Tom
8 Gemma
9 Steve
11 Sarah
df2:
Key PPID M
1 22 MM
2 23 R
4 25 MM
16 27 RR
7 21 RR
8 11 R
0 13 SS
11 14 RR
new df:
ID PPID M
Sam 22 MM
Ryan 23 R
Sam 25 MM
Brian 27 RR
Tom 21 RR
Gemma 11 R
0 13 SS
Sarah 14 RR
IIUC
df2["Key"] = df2["Key"].replace(dict(zip(df1["ID"], df1["Name"])))
df2
Key PPID M
0 Sam 22 MM
1 Ryan 23 R
2 Sam 25 MM
3 Brian 27 RR
4 Tom 21 RR
5 Gemma 11 R
6 0 13 SS
7 Sarah 14 RR
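A possible alternative, if you would rather not overwrite df2 in place, is to build the mapping once and use Series.map, falling back to the original Key where there is no match. A sketch using the example frames above:
import pandas as pd

# the two example frames from the question
df1 = pd.DataFrame({"ID": [1, 2, 4, 16, 7, 8, 9, 11],
                    "Name": ["Sam", "Ryan", "Sam", "Brian", "Tom", "Gemma", "Steve", "Sarah"]})
df2 = pd.DataFrame({"Key": [1, 2, 4, 16, 7, 8, 0, 11],
                    "PPID": [22, 23, 25, 27, 21, 11, 13, 14],
                    "M": ["MM", "R", "MM", "RR", "RR", "R", "SS", "RR"]})

mapping = dict(zip(df1["ID"], df1["Name"]))
new_df = df2.rename(columns={"Key": "ID"})
# map() returns NaN where there is no match (Key 0 here), so keep the original value there
new_df["ID"] = new_df["ID"].map(mapping).fillna(new_df["ID"])
print(new_df)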

Pandas groupby multiple keys selecting unique values and transforming

I have a data frame df=
Owner Manager Date Hours City
John Jerry 1/2/16 10 LA
John Jerry 1/2/16 10 SF
Mary Jerry 1/2/16 9 LA
Zach Joe 1/3/16 5 SD
Wendy Joe 1/3/16 4 SF
Hal Joe 1/4/16 2 SD
... 100,000 entries
I would like to group by 'Manager' and 'Date', then select unique values of 'Owner' and sum 'Hours' of that selection, finally transforming the sum to a new column 'Hours_by_Manager'.
My desired output is:
Owner Manager Date Hours City Hours_by_Manager
John Jerry 1/2/16 10 LA 19
John Jerry 1/2/16 10 SF 19
Mary Jerry 1/2/16 9 LA 19
Zach Joe 1/3/16 5 SD 9
Wendy Joe 1/3/16 4 SF 9
Hal Joe 1/4/16 2 SD 2
I tried using pandas 'groupby' like this:
df['Hours_by_Manager']=df.groupby(['Manager','Date'])['Hours'].transform(lambda x: sum(x.unique()))
This gives me what I want, but only because the Hours values happen to differ between Owners. What I'm looking for is something like this:
df['Hours_by_Manager']=df.groupby(['Manager','Date'])['Owner'].unique()['Hours']transform(lambda x: sum(x))
Which obviously is not syntactically correct. I know I could use for loops, but I would like to keep things vectorized. Any suggestions?
import pandas as pd
df = pd.DataFrame({'City': ['LA', 'SF', 'LA', 'SD', 'SF', 'SD'],
                   'Date': ['1/2/16', '1/2/16', '1/2/16', '1/3/16', '1/3/16', '1/4/16'],
                   'Hours': [10, 10, 9, 5, 4, 2],
                   'Manager': ['Jerry', 'Jerry', 'Jerry', 'Joe', 'Joe', 'Joe'],
                   'Owner': ['John', 'John', 'Mary', 'Zach', 'Wendy', 'Hal']})
uniques = df.drop_duplicates(subset=['Hours','Owner','Date'])
hours = uniques.groupby(['Manager', 'Date'])['Hours'].sum().reset_index()
hours = hours.rename(columns={'Hours':'Hours_by_Manager'})
result = pd.merge(df, hours, how='left')
print(result)
yields
City Date Hours Manager Owner Hours_by_Manager
0 LA 1/2/16 10 Jerry John 19
1 SF 1/2/16 10 Jerry John 19
2 LA 1/2/16 9 Jerry Mary 19
3 SD 1/3/16 5 Joe Zach 9
4 SF 1/3/16 4 Joe Wendy 9
5 SD 1/4/16 2 Joe Hal 2
Explanation:
An Owner on a given Date works a unique number of Hours. So let's first create a table of unique ['Hours','Owner','Date'] rows:
uniques = df.drop_duplicates(subset=['Hours','Owner','Date'])
# alternatively, uniques = df.groupby(['Hours','Owner','Date']).first().reset_index()
# City Date Hours Manager Owner
# 0 LA 1/2/16 10 Jerry John
# 2 LA 1/2/16 9 Jerry Mary
# 3 SD 1/3/16 5 Joe Zach
# 4 SF 1/3/16 4 Joe Wendy
# 5 SD 1/4/16 2 Joe Hal
Now we can group by ['Manager', 'Date'] and sum the Hours:
hours = uniques.groupby(['Manager', 'Date'])['Hours'].sum().reset_index()
Manager Date Hours
0 Jerry 1/2/16 19
1 Joe 1/3/16 9
2 Joe 1/4/16 2
The hours['Hours'] column contains the values we want in df['Hours_by_Manager'].
hours = hours.rename(columns={'Hours':'Hours_by_Manager'})
So now we can merge df and hours to obtain the desired result:
result = pd.merge(df, hours, how='left')
# City Date Hours Manager Owner Hours_by_Manager
# 0 LA 1/2/16 10 Jerry John 19
# 1 SF 1/2/16 10 Jerry John 19
# 2 LA 1/2/16 9 Jerry Mary 19
# 3 SD 1/3/16 5 Joe Zach 9
# 4 SF 1/3/16 4 Joe Wendy 9
# 5 SD 1/4/16 2 Joe Hal 2

Retrieve top 48 unique records from database based on a sorted Field

I have a database table that I am after some SQL for (which is defeating me so far!).
Imagine there are 192 Athletic Clubs who all take part in 12 Track Meets per season.
So that is 2,304 individual performances per season (for example, in the 100 metres).
I would like to find the top 48 (unique) individual performances from the table; these 48 athletes will then take part in the end-of-season World Championships.
So imagine the 2 fastest times are both set by "John Smith", but he can only be entered once in the world champs. So I would then look for the next fastest time not set by "John Smith"... and so on until I have 48 unique athletes.
hope that makes sense.
thanks in advance if anyone can help
PS
I did have a nice screenshot created that would explain it much better, but as a newish user I cannot post images.
I'll try a copy and paste version instead...
ID AthleteName AthleteID Time
1 Josh Lewis 3 11.99
2 Joe Dundee 4 11.31
3 Mark Danes 5 13.44
4 Josh Lewis 3 13.12
5 John Smith 1 11.12
6 John Smith 1 12.18
7 John Smith 1 11.22
8 Adam Bennett 6 11.33
9 Ronny Bower 7 12.88
10 John Smith 1 13.49
11 Adam Bennett 6 12.55
12 Mark Danes 5 12.12
13 Carl Tompkins 2 13.11
14 Joe Dundee 4 11.28
15 Ronny Bower 7 12.14
16 Carl Tompkins 2 11.88
17 Nigel Downs 8 14.14
18 Nigel Downs 8 12.19
Top 4 unique individual performances
1 John Smith 1 11.12
3 Joe Dundee 4 11.28
5 Adam Bennett 6 11.33
6 Carl Tompkins 2 11.88
Basically something like this:
select top 48 *
from (
    select athleteId, min(time) as bestTime
    from theRaces
    where raceId = '123'  -- e.g., 123 = 100 meters
    group by athleteId
) x
order by bestTime
try this --
select x.ID, x.AthleteName, x.AthleteID, x.Time
from
(
    select rownum tr_count, v.AthleteID AthleteID, v.AthleteName AthleteName, v.Time Time, v.id id
    from
    (
        select tr1.AthleteName AthleteName, tr1.Time time, min(tr1.id) id, tr1.AthleteID AthleteID
        from theRaces tr1
        where time = (select min(time) from theRaces tr2 where tr2.athleteId = tr1.athleteId)
        group by tr1.AthleteName, tr1.AthleteID, tr1.Time
        having tr1.Time = (select min(tr2.time) from theRaces tr2 where tr1.AthleteID = tr2.AthleteID)
        order by tr1.time
    ) v
) x
where x.tr_count <= 48
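An alternative sketch using a window function (SQL Server syntax; same theRaces columns as above): keep each athlete's single best time, then take the 48 fastest of those.
SELECT TOP 48 ID, AthleteName, AthleteID, Time
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY AthleteID ORDER BY Time ASC) AS rn
    FROM theRaces
) AS best
WHERE rn = 1          -- one row per athlete: their fastest time
ORDER BY Time ASC;    -- then the 48 fastest athletes overall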