Pandas: aggregating across time windows with user logs

I am trying to take a dataframe of logs and aggregate counts across time windows, specifically before a Purchase. The goal is to create features that can be used to predict a future purchase.
Here is my original df
user_id  activity_date  activity_type
0        2013-07-11     EmailOpen
0        2013-07-11     FormSubmit
0        2013-07-15     EmailOpen
0        2013-07-17     Purchase
0        2013-07-18     EmailOpen
and I would like my result to look like:
user_id  EmailOpen_count  FormSubmit_count  Days_since_start  Purchase
0        2                1                 6                 1
0        1                0                 1                 0
The idea is that the first row aggregates everything before the purchase, and since that user had only one purchase, the next row aggregates everything after it.
I tried to extract the Purchase dates first and then take an iterative approach, but it ran all night with no success. Here's how I was going to extract the dates; even this took way too long, and I am sure that building the new dataframe would have taken millennia:
purchase_dict = {}
for user in list_of_users:
    # Stores the list of days on which this user made a purchase.
    days_bought = list(df[(df['user_id'] == user) & (df['activity_type'] == 'Purchase')]['activity_date'])
    purchase_dict[user] = days_bought
I'm wondering if there is a semi-efficient way with groupbys, agg, time_between, etc. Thanks!

Perhaps a bit clunky, and needing some column renaming at the end, but this appears to work for me (with new testing data):
user_id  activity_date  activity_type
0        2013-07-11     EmailOpen
0        2013-07-11     FormSubmit
0        2013-07-15     EmailOpen
0        2013-07-17     Purchase
0        2013-07-18     EmailOpen
1        2013-07-12     Purchase
1        2013-07-12     FormSubmit
1        2013-07-15     EmailOpen
1        2013-07-18     Purchase
1        2013-07-18     EmailOpen
2        2013-07-09     EmailOpen
2        2013-07-10     Purchase
2        2013-07-15     EmailOpen
2        2013-07-22     Purchase
2        2013-07-23     EmailOpen
# Convert to datetime
df['activity_date'] = pd.to_datetime(df['activity_date'])
# Create shifted flag marking rows that follow a Purchase
df['x'] = (df['activity_type'] == 'Purchase').astype(int).shift().bfill()
# Calculate time window as the cumsum of this shifted flag within each user
df['time_window'] = df.groupby('user_id')['x'].cumsum()
# Pivot to count activities by user ID and time window
df2 = df.pivot_table(values='activity_date', index=['user_id', 'time_window'],
                     columns='activity_type', aggfunc=len, fill_value=0)
# Create separate table of days elapsed by user ID & time window
time_elapsed = (df.groupby(['user_id', 'time_window'])['activity_date'].max()
                - df.groupby(['user_id', 'time_window'])['activity_date'].min())
# Merge dataframes
df3 = df2.join(time_elapsed)
yields
                     EmailOpen  FormSubmit  Purchase  activity_date
user_id time_window
0       0.0                  2           1         1         6 days
        1.0                  1           0         0         0 days
1       0.0                  0           0         1         0 days
        1.0                  1           1         1         6 days
        2.0                  1           0         0         0 days
2       0.0                  1           0         1         1 days
        1.0                  1           0         1         7 days
        2.0                  1           0         0         0 days
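To get from there to the column layout in the question, a small follow-up sketch (Days_since_start is just the elapsed time converted to whole days):

df3 = df3.rename(columns={'EmailOpen': 'EmailOpen_count',
                          'FormSubmit': 'FormSubmit_count',
                          'activity_date': 'Days_since_start'})
df3['Days_since_start'] = df3['Days_since_start'].dt.days  # timedelta -> integer days
df3 = df3.reset_index()  # user_id and time_window back as columns
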
Edit per comments:
To add in time elapsed by type of activity:
time_since_activity = (df.groupby(['user_id', 'time_window'])['activity_date'].max()
                       - df.groupby(['user_id', 'time_window', 'activity_type'])['activity_date'].max())
df4 = df3.join(time_since_activity.unstack('activity_type'), rsuffix='_time')
yielding
                     EmailOpen  FormSubmit  ...  FormSubmit_time  Purchase_time
user_id time_window                         ...
0       0.0                  2           1  ...           6 days         0 days
        1.0                  1           0  ...              NaT            NaT
1       0.0                  0           0  ...              NaT         0 days
        1.0                  1           1  ...           6 days         0 days
        2.0                  1           0  ...              NaT            NaT
2       0.0                  1           0  ...              NaT         0 days
        1.0                  1           0  ...              NaT         0 days
        2.0                  1           0  ...              NaT            NaT
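If these are to be used as model features, the timedelta columns can be converted to plain numbers as well; a sketch (selecting the columns by dtype is just one way to do it):

# Convert every timedelta column (elapsed time and the *_time columns) to whole days
for col in df4.columns:
    if pd.api.types.is_timedelta64_dtype(df4[col]):
        df4[col] = df4[col].dt.days  # NaT becomes NaN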

Related

Creating 2 additional columns based on past dates - PostgreSQL

Seeking some help after spending a lot of time searching to no avail; I'm rather new to SQL, so any help is greatly appreciated. I've tried a few functions (e.g. GROUP BY, BETWEEN, etc.) but can't seem to get it right.
On the PrestoSQL server, I have a table as shown below starting with columns Date, ID and COVID. Using GROUP BY ID, I would like to create a column EverCOVIDBefore which looks back at all past dates of the COVID column to see if there was ever COVID = 1 or not, as well as another column called COVID_last_2_mth which checks if there was ever COVID = 1 within the past 2 months
(Highlighted columns are my expected outcomes)
Link to dataset: https://drive.google.com/file/d/1Sc5Olrx9g2A36WnLcCFMU0YTQ3-qWROU/view?usp=sharing
You can do:
select *,
       max(covid) over (partition by id order by date) as ever_covid_before,
       max(covid) over (partition by id order by date
                        range between interval '2 month' preceding and current row)
         as covid_last_two_months
from t
Result:
date id covid ever_covid_before covid_last_two_months
----------- --- ------ ------------------ ---------------------
2020-01-15 1 0 0 0
2020-02-15 1 0 0 0
2020-03-15 1 1 1 1
2020-04-15 1 0 1 1
2020-05-15 1 0 1 1
2020-06-15 1 0 1 0
2020-01-15 2 0 0 0
2020-02-15 2 1 1 1
2020-03-15 2 0 1 1
2020-04-15 2 0 1 1
2020-05-15 2 0 1 0
2020-06-15 2 1 1 1
See running example at db<>fiddle.
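For reference, roughly the same logic in pandas (a sketch only; it assumes df has a datetime date column plus id and covid, and approximates the two-month window with a fixed 62-day rolling window):

df = df.sort_values(['id', 'date'])
# ever had covid up to and including this row
df['ever_covid_before'] = df.groupby('id')['covid'].cummax()
# any covid flag within the trailing ~2-month window
df['covid_last_two_months'] = (df.set_index('date')
                                 .groupby('id')['covid']
                                 .rolling('62D')
                                 .max()
                                 .astype(int)
                                 .values)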

How to produce a monthly count when given a date range in pandas?

I have a dataframe that records users, a label, and the start and end date of them being labelled as such
e.g.
user  label  start_date  end_date
1     x      2018-01-01  2018-10-01
2     x      2019-05-10  2020-01-01
3     y      2019-04-01  2022-04-20
1     b      2018-10-01  2020-05-08
etc
where each row is for a given user and a label; a user appears multiple times for different labels
I want to get a count of users for every month for each label, such as this:
date     count_label_x  count_label_y  count_label_b  count_label_
2018-01  10             0              20             5
2018-02  2              5              15             3
2018-03  20             6              8              3
etc
where, for instance, the user in the first entry of the previous table should be counted once for every month between their start and end date. Since I only have a few labels, I can filter by label one at a time and produce one output per label, but how do I check and count users given an interval?
Thanks
You can use date_range combined with to_period to generate the active months, then pivot_table with aggfunc='nunique' to aggregate the unique user (if you want to count the duplicated users use aggfunc='count'):
out = (df
 .assign(period=[pd.date_range(a, b, freq='M').to_period('M')
                 for a, b in zip(df['start_date'], df['end_date'])])
 .explode('period')
 .pivot_table(index='period', columns='label', values='user',
              aggfunc='nunique', fill_value=0)
)
output:
label    b  x  y
period
2018-01  0  1  0
2018-02  0  1  0
2018-03  0  1  0
2018-04  0  1  0
2018-05  0  1  0
...
2021-12  0  0  1
2022-01  0  0  1
2022-02  0  0  1
2022-03  0  0  1
handling NaT
If the start and end dates are the same (so date_range yields no periods and explode leaves NaT) and you still want to count the row:
out = (df
 .assign(period=[pd.date_range(a, b, freq='M').to_period('M')
                 for a, b in zip(df['start_date'], df['end_date'])])
 .explode('period')
 .assign(period=lambda d: d['period'].fillna(d['start_date'].dt.to_period('M')))
 .pivot_table(index='period', columns='label', values='user',
              aggfunc='nunique', fill_value=0)
)
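To match the column names asked for in the question (count_label_x, count_label_y, ...), the pivoted result can simply be renamed afterwards, e.g.:

out = out.add_prefix('count_label_')                              # label -> count_label_<label>
out = out.rename_axis(index='date', columns=None).reset_index()   # period index -> date column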

How to show the closest date to the selected one

I'm trying to extract the stock on a specific date. To do so, I'm doing a cumulative sum of stock movements by date, product and warehouse.
select m.codart AS REF,
       m.descart AS 'DESCRIPTION',
       m.codalm AS WAREHOUSE,
       m.descalm AS WAREHOUSEDESCRIP,
       m.unidades AS UNITS,
       m.entran AS 'IN',
       m.salen AS 'OUT',
       m.entran*1 + m.salen*-1 as MOVEMENT,
       (select sum(m1.entran*1 + m1.salen*-1)
        from MOVSTOCKS m1
        where m1.codart = m.codart and m1.codalm = m.codalm and m.fecdoc >= m1.fecdoc) as 'CUMULATIVE',
       m.PRCMEDIO as 'VALUE',
       m.FECDOC as 'DATE',
       m.REFERENCIA as 'REF',
       m.tipdoc as 'DOCUMENT'
from MOVSTOCKS m
where (m.entran <> 0 or m.salen <> 0)
  and (select max(m2.fecdoc) from MOVSTOCKS m2) < '2020-11-30T00:00:00.000'
order by m.fecdoc
Without the and (select max(m2.fecdoc) from MOVSTOCKS m2) < '2020-11-30T00:00:00.000' it shows data like this, which is ok.
REF WAREHOUSE UNITS IN OUT MOVEMENT CUMULATIVE DATE
1 0 2 0 2 -2 -7 2020-11-25
1 1 3 0 3 -3 -3 2020-11-25
1 0 5 0 5 -5 -7 2020-11-25
1 0 9 9 0 9 2 2020-11-26
2 0 2 2 0 2 2 2020-11-26
1 0 1 1 0 1 3 2020-12-01
The problem is, with the subselect in the where clause it returns no results (I think it is because it just looks for the max date and says it is bigger than 2020-11-30). I would like it to show the closest dates (all of them, for each product and warehouse) to the selected one, in this case 2020-11-30.
It should look like this:
REF WAREHOUSE UNITS IN OUT MOVEMENT CUMULATIVE DATE
1 1 3 0 3 -3 -3 2020-11-25
1 0 9 9 0 9 2 2020-11-26
2 0 2 2 0 2 2 2020-11-26
Sorry if I'm not clear. Ask me if I have to clarify anything
Thank you
I am guessing that you want something like this:
select t.*
from (select m.*,
             sum(m.entran - m.salen) over (partition by m.codart, m.codalm order by fecdoc) as cumulative,
             max(fecdoc) over (partition by m.codart, m.codalm) as max_fecdoc
      from MOVSTOCKS m
      where fecdoc < '2020-11-30'
     ) t
where fecdoc = max_fecdoc;
The subquery calculates the cumulative amount of stock using window functions and filters to records before the cutoff date. The outer query selects the most recent record for each combination of codart/codalm, which seems to be how you are identifying a product.
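For comparison, roughly the same idea in pandas (a sketch, assuming a dataframe with the MOVSTOCKS columns and fecdoc already parsed as datetime):

cutoff = pd.Timestamp('2020-11-30')
m = df[df['fecdoc'] < cutoff].sort_values('fecdoc').copy()
m['movement'] = m['entran'] - m['salen']
# running stock per product/warehouse
m['cumulative'] = m.groupby(['codart', 'codalm'])['movement'].cumsum()
# keep the latest row before the cutoff for each product/warehouse
latest = m.groupby(['codart', 'codalm']).tail(1)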

Nested loop for pandas

I have a database which consists of multiple dates corresponding to each ID. Now I want to iterate over each ID and find the difference between the i-th and (i+1)-th dates to flag the data based on certain values.
For example:
ID date
0 12.01.2012
0 14.02.2012
0 15.09.2013
1 13.01.2011
1 15.08.2012
For ID 0 I want to find the difference of consecutive dates and compare them with a condition to flag the database based on that.
Are you looking for something like this:
res = df['date'].apply(lambda x: df['date'] - x)   # pairwise differences between every pair of dates
res.columns = df.id.tolist()
res.index = df.id.tolist()
for input:
date id
0 2019-01-01 0
1 2019-01-06 0
2 2019-01-01 0
3 2019-01-04 1
Output will be:
0 0 0 1
0 0 days 5 days 0 days 3 days
0 -5 days 0 days -5 days -2 days
0 0 days 5 days 0 days 3 days
1 -3 days 2 days -3 days 0 days
for the consecutive difference you can use:
df["diff"] = df.groupby("id")["date"].diff()
input:
date id
0 1984-10-18 0
1 1980-07-19 0
2 1972-04-16 0
3 1969-04-05 1
4 1967-05-29 1
5 1985-07-13 2
output:
date id diff
0 1984-10-18 0 NaT
1 1980-07-19 0 -1552 days
2 1972-04-16 0 -3016 days
3 1969-04-05 1 NaT
4 1967-05-29 1 -677 days
5 1985-07-13 2 NaT
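The question mentions flagging rows based on a condition on that difference; with the diff column in place this is a vectorised comparison (the 365-day threshold below is only an example):

df["flag"] = df["diff"].abs() > pd.Timedelta(days=365)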

How to perform a Distinct Sum using MDX?

So I have data like this:
Date EMPLOYEE_ID HEADCOUNT TERMINATIONS
1/31/2011 1 1 0
2/28/2011 1 1 0
3/31/2011 1 1 0
4/30/2011 1 1 0
...
1/31/2012 1 1 0
2/28/2012 1 1 0
3/31/2012 1 1 0
1/31/2012 2 1 0
2/28/2011 2 1 0
3/31/2011 2 1 0
4/30/2011 2 0 1
1/31/2012 3 1 0
2/28/2011 3 1 0
3/31/2011 3 1 0
4/30/2011 3 1 0
...
1/31/2012 3 1 0
2/28/2012 3 1 0
3/31/2012 3 1 0
And I want to sum up the headcount, but I need to remove the duplicate entries from the sum by the employee_id. From the data you can see employee_id 1 occurs many times in the table, but I only want to add its headcount column once. For example if I rolled up on year I might get a report using this query:
with member [Measures].[Distinct HeadCount] as
  ??? how do I define this ???
select { [Date].[YEAR].children } on ROWS,
       { [Measures].[Distinct HeadCount] } on COLUMNS
from [someCube]
It would produce this output:
YEAR Distinct HeadCount
2011 3
2012 2
Any ideas how to do this with MDX? Is there a way to control which row is used in the sum for each employee?
You can use an expression like this:
WITH MEMBER [Measures].[Distinct HeadCount] AS
  Sum(NonEmpty('the set of the employee ids',
               'all the dates of the current year (i.e. [Date].[YEAR].CurrentMember)'),
      [Measures].[HeadCount])
If you want a more generic expression you can use this:
WITH MEMBER [Measures].[Distinct HeadCount] AS
  Sum(NonEmpty('the set of the employee ids',
               Descendants(Axis(0).Item(0).Item(0).Hierarchy.CurrentMember,
                           Axis(0).Item(0).Item(0).Hierarchy.CurrentMember.Level, LEAVES)),
      IIf(IsLeaf(Axis(0).Item(0).Item(0).Hierarchy.CurrentMember),
          [Measures].[HeadCount],
          NULL))
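Outside MDX, the same "count each employee at most once per period" idea can also be expressed in pandas, for reference (a sketch only; it ignores how terminated employees should be treated):

df['Date'] = pd.to_datetime(df['Date'])
df['year'] = df['Date'].dt.year
# keep one row per employee per year, then sum the headcount
distinct_headcount = (df.sort_values('Date')
                        .drop_duplicates(['year', 'EMPLOYEE_ID'])
                        .groupby('year')['HEADCOUNT'].sum())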