I have a database which consists of multiple dates corresponding to each ID. Now I want to iterate over each ID, find the difference between the i-th and (i+1)-th dates, and flag the data based on certain values.
For example:
ID date
0 12.01.2012
0 14.02.2012
0 15.09.2013
1 13.01.2011
1 15.08.2012
For ID 0 I want to find the differences between consecutive dates and compare them against a condition in order to flag the data.
Are you looking for something like this:
res = df['date'].apply(lambda x : df['date']-x)
res.columns = df.id.tolist()
res.index=df.id.tolist()
for input:
date id
0 2019-01-01 0
1 2019-01-06 0
2 2019-01-01 0
3 2019-01-04 1
Output will be:
0 0 0 1
0 0 days 5 days 0 days 3 days
0 -5 days 0 days -5 days -2 days
0 0 days 5 days 0 days 3 days
1 -3 days 2 days -3 days 0 days
For the consecutive difference you can use:
df["diff"] = df.groupby("id")["date"].diff()
input:
date id
0 1984-10-18 0
1 1980-07-19 0
2 1972-04-16 0
3 1969-04-05 1
4 1967-05-29 1
5 1985-07-13 2
output:
date id diff
0 1984-10-18 0 NaT
1 1980-07-19 0 -1552 days
2 1972-04-16 0 -3016 days
3 1969-04-05 1 NaT
4 1967-05-29 1 -677 days
5 1985-07-13 2 NaT
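To then flag rows based on a condition on that difference, as the question asks, a minimal sketch (the 365-day threshold here is made up; substitute your own condition):
import pandas as pd

df["date"] = pd.to_datetime(df["date"])
df["diff"] = df.groupby("id")["date"].diff()
df["flag"] = (df["diff"].abs() > pd.Timedelta(days=365)).astype(int)  # 1 where the gap exceeds a year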
I have a dataframe that has columns like these:
Date earnings workingday length_week first_wday_week last_wdayweek
01.01.2000 10000 1 1
02.01.2000 0 0 1
03.01.2000 0 0 2
04.01.2000 0 0 2
05.01.2000 0 0 2
06.01.2000 23000 1 2
07.01.2000 1000 1 2
08.01.2000 0 0 2
09.01.2000 0 0 2
..
..
..
30.01.2000 0 0 0
31.01.2000 0 1 3
01.02.2000 0 1 3
02.02.2000 2500 1 3
workingday indicates whether earnings are present on that particular day. I am trying to generate the last three columns from the date:
length_week : the number of working days in that week
first_working_day_of_week : 1 if it is the first working day of a week
last_working_day_of_week : 1 if it is the last working day of a week
Can anyone help me with this?
I first changed the format of your date column as pd.to_datetime couldn't infer the right date format:
df.Date = df.Date.str.replace('.', '-', regex=False)
df.Date = pd.to_datetime(df.Date, format='%d-%m-%Y')
Then use isocalendar so that we can work with weeks and days more easily:
df[['year', 'week', 'weekday']] = df.Date.dt.isocalendar()
Now length_week is just the sum of workingday within each separate week:
df['length_week'] = df.groupby(['year', 'week']).workingday.transform('sum')
and we can get frst_worday_week with idxmax:
import numpy as np

min_indexes = df.groupby(['year', 'week']).workingday.transform('idxmax')
df['frst_worday_week'] = np.where(df.index == min_indexes, 1, 0)
Lastly, last_workdayweek is similar but a bit tricky. We need the last occurrence of the maximum, so we reverse each week inside the groupby:
max_indexes = df.groupby(['year', 'week']).workingday.transform(lambda x: x[::-1].idxmax())
df['last_workdayweek'] = np.where(df.index == max_indexes, 1, 0)
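Putting the pieces together, a condensed runnable sketch using the first few sample rows (the Date strings, earnings, and workingday flags are copied from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['01.01.2000', '02.01.2000', '03.01.2000', '06.01.2000', '07.01.2000'],
    'earnings': [10000, 0, 0, 23000, 1000],
    'workingday': [1, 0, 0, 1, 1],
})
df.Date = pd.to_datetime(df.Date.str.replace('.', '-', regex=False), format='%d-%m-%Y')
df[['year', 'week', 'weekday']] = df.Date.dt.isocalendar()
df['length_week'] = df.groupby(['year', 'week']).workingday.transform('sum')
df['frst_worday_week'] = np.where(
    df.index == df.groupby(['year', 'week']).workingday.transform('idxmax'), 1, 0)
df['last_workdayweek'] = np.where(
    df.index == df.groupby(['year', 'week']).workingday.transform(lambda x: x[::-1].idxmax()), 1, 0)
print(df)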
I have a dataframe with ID and date (and a calculated day difference between rows with the same ID):
ID date day_difference
1 27/06/2019 0
1 28/06/2019 1
1 29/06/2019 1
1 01/07/2019 2
1 02/07/2019 1
1 03/07/2019 1
1 05/07/2019 2
2 27/06/2019 0
2 28/06/2019 1
2 29/06/2019 1
2 01/08/2019 33
2 02/08/2019 1
2 03/08/2019 1
2 04/08/2019 1
which I would like to group by ID and calculate the total duration for, with one condition: if a day difference is bigger than 30 days, reuse that ID and start a new group, counting the duration from the first day after the 30-day gap.
Desired result
ID Duration
1 8
2 3
2 4
Thanks.
You can do it by grouping on ID together with a helper key that increments each time day_difference exceeds 30, so each 30-day gap starts a new group:
(df.groupby(['ID', df.day_difference.gt(30).cumsum()])
.agg(ID=('ID','first'), Duration=('ID','count'))
.reset_index(drop=True)
)
Output:
ID Duration
0 1 7
1 2 3
2 2 4
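The helper key is what splits ID 2 into two groups; a tiny sketch of what gt(30).cumsum() produces on its own:
import pandas as pd

s = pd.Series([0, 1, 1, 33, 1, 1])
print(s.gt(30).cumsum().tolist())  # [0, 0, 0, 1, 1, 1] -- the 33 opens a new group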
I am trying to take a dataframe of logs and aggregate counts across time windows, specifically before a Purchase. The goal is to create features that can be used to predict a future purchase.
Here is my original df
user_id activity_date activity_type
0 2013-07-11 EmailOpen
0 2013-07-11 FormSubmit
0 2013-07-15 EmailOpen
0 2013-07-17 Purchase
0 2013-07-18 EmailOpen
and I would like my result to look like:
user_id EmailOpen_count FormSubmit_count Days_since_start Purchase
0 2 1 6 1
0 1 0 1 0
The idea above is that I aggregate everything before the purchase; since that user had only one purchase, the next row aggregates everything after that last purchase.
I tried to extract the Purchase dates first and then do an iterative approach but ran it all night with no success. Here's how I was going to extract the dates, but even this took way too long and I am sure that building the new dataframe would have taken millennia:
purchase_dict = {}
for user in list_of_users:
    # Store the list of days on which this user made a purchase.
    mask = (df['user_id'] == user) & (df['activity_type'] == 'Purchase')
    purchase_dict[user] = list(df.loc[mask, 'activity_date'])
I'm wondering if there is a semi-efficient way with groupbys, agg, time_between, etc. Thanks!
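For the date-extraction step alone, a single groupby avoids the per-user loop entirely (a sketch, assuming the same df and column names as in the question):
purchase_dict = (df[df['activity_type'] == 'Purchase']
                 .groupby('user_id')['activity_date']
                 .apply(list)
                 .to_dict())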
Perhaps a bit clunky, and needing some column renaming at the end, but this appears to work for me (with new testing data):
user_id activity_date activity_type
0 2013-07-11 EmailOpen
0 2013-07-11 FormSubmit
0 2013-07-15 EmailOpen
0 2013-07-17 Purchase
0 2013-07-18 EmailOpen
1 2013-07-12 Purchase
1 2013-07-12 FormSubmit
1 2013-07-15 EmailOpen
1 2013-07-18 Purchase
1 2013-07-18 EmailOpen
2 2013-07-09 EmailOpen
2 2013-07-10 Purchase
2 2013-07-15 EmailOpen
2 2013-07-22 Purchase
2 2013-07-23 EmailOpen
# Convert to datetime
df['activity_date'] = pd.to_datetime(df['activity_date'])
# Create a shifted flag marking the row right after each Purchase
df['x'] = (df['activity_type'] == 'Purchase').shift(fill_value=False).astype(int)
# Calculate time window as cumsum of this shifted flag
df['time_window'] = df.groupby('user_id')['x'].cumsum()
# Pivot to count activities by user ID and time window
df2 = df.pivot_table(values='activity_date', index=['user_id', 'time_window'],
columns='activity_type', aggfunc=len, fill_value=0)
# Create separate table of days elapsed by user ID & time window
time_elapsed = ( df.groupby(['user_id', 'time_window'])['activity_date'].max()
- df.groupby(['user_id', 'time_window'])['activity_date'].min() )
# Merge dataframes
df3 = df2.join(time_elapsed)
yields
EmailOpen FormSubmit Purchase activity_date
user_id time_window
0 0.0 2 1 1 6 days
1.0 1 0 0 0 days
1 0.0 0 0 1 0 days
1.0 1 1 1 6 days
2.0 1 0 0 0 days
2 0.0 1 0 1 1 days
1.0 1 0 1 7 days
2.0 1 0 0 0 days
Edit per comments:
To add in time elapsed by type of activity:
time_since_activity = ( df.groupby(['user_id', 'time_window'])['activity_date'].max()
- df.groupby(['user_id', 'time_window', 'activity_type'])['activity_date'].max() )
df4 = df3.join(time_since_activity.unstack('activity_type'), rsuffix='_time')
yielding
EmailOpen FormSubmit ... FormSubmit_time Purchase_time
user_id time_window ...
0 0.0 2 1 ... 6 days 0 days
1.0 1 0 ... NaT NaT
1 0.0 0 0 ... NaT 0 days
1.0 1 1 ... 6 days 0 days
2.0 1 0 ... NaT NaT
2 0.0 1 0 ... NaT 0 days
1.0 1 0 ... NaT 0 days
2.0 1 0 ... NaT NaT
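The renaming mentioned at the top might look like this (a sketch; df5 is just a new name here, and the target column names are taken from the desired output in the question):
df5 = (df3.rename(columns={'EmailOpen': 'EmailOpen_count',
                           'FormSubmit': 'FormSubmit_count',
                           'activity_date': 'Days_since_start'})
          .reset_index())
df5['Days_since_start'] = df5['Days_since_start'].dt.days  # timedelta -> integer days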
For each row, I need the integer part of dividing val by 4. For each subsequent row, we add the current row's remainder to the remainders carried over from previous rows, and again take the whole part and the remainder of dividing that running total by 4. Consider the example below:
id val
1 22
2 1
3 1
4 2
5 1
6 6
7 1
After dividing by 4 we look at the whole part and the remainder. For each id we add the remainder to the remainders accumulated so far and see how many whole 4s that total now contains:
id val wh1 rem1 wh2 rem2 RESULT(wh1+wh2)
1 22 5 2 0 2 5
2 1 0 1 (3/4=0) 3%4=3 0
3 1 0 1 (4/4=1) 4%4=0 1
4 2 0 2 (2/4=0) 2%4=2 0
5 1 0 1 (3/4=0) 3%4=3 0
6 6 1 2 (5/4=1) 5%4=1 2
7 1 0 1 (2/4=0) 2%4=2 0
How can I compute this RESULT column in SQL?
Data of project:
http://sqlfiddle.com/#!18/9e18f/2
The whole part of the division by 4 is easy; the problem is calculating the accumulated remainders for each id and working out when they, in turn, add up to another whole 4.
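To pin down the intended arithmetic (in SQL this would typically be done with a recursive CTE or a windowed running sum over the remainders), here is the carry logic as a short Python sketch over the sample values:
vals = [22, 1, 1, 2, 1, 6, 1]
carry = 0
for i, v in enumerate(vals, start=1):
    wh1, rem1 = divmod(v, 4)       # whole part and remainder of the row itself
    carry += rem1                  # add this row's remainder to the carry
    wh2, carry = divmod(carry, 4)  # whole fours contained in the running remainder
    print(i, v, wh1 + wh2)         # RESULT: 5, 0, 1, 0, 0, 2, 0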
So I have data like this:
Date EMPLOYEE_ID HEADCOUNT TERMINATIONS
1/31/2011 1 1 0
2/28/2011 1 1 0
3/31/2011 1 1 0
4/30/2011 1 1 0
...
1/31/2012 1 1 0
2/28/2012 1 1 0
3/31/2012 1 1 0
1/31/2011 2 1 0
2/28/2011 2 1 0
3/31/2011 2 1 0
4/30/2011 2 0 1
1/31/2011 3 1 0
2/28/2011 3 1 0
3/31/2011 3 1 0
4/30/2011 3 1 0
...
1/31/2012 3 1 0
2/28/2012 3 1 0
3/31/2012 3 1 0
And I want to sum up the headcount, but I need to remove duplicate entries from the sum, keyed by employee_id. From the data you can see that employee_id 1 occurs many times in the table, but I only want to add its headcount column once. For example, if I rolled up by year I might get a report using this query:
with member [Measures].[Distinct HeadCount] as
??? how do I define this???
select { [Date].[YEAR].children } on ROWS,
{ [Measures].[Distinct HeadCount] } on COLUMNS
from [someCube]
It would produce this output:
YEAR Distinct HeadCount
2011 3
2012 2
Any ideas how to do this with MDX? Is there a way to control which row is used in the sum for each employee?
You can use an expression like this:
WITH MEMBER [Measures].[Distinct HeadCount] AS
Sum(NonEmpty('the set of the employee ids', 'all the dates of the current year (ie [Date].[YEAR].CurrentMember)'), [Measures].[HeadCount])
If you want a more generic expression you can use this:
WITH MEMBER [Measures].[Distinct HeadCount] AS
Sum(NonEmpty('the set of the employee ids',
Descendants(Axis(0).Item(0).Item(0).Hierarchy.CurrentMember, Axis(0).Item(0).Item(0).Hierarchy.CurrentMember.Level, LEAVES)),
IIf(IsLeaf(Axis(0).Item(0).Item(0).Hierarchy.CurrentMember),
[Measures].[HeadCount],
NULL))