summary of converted and churned customers from time series dataframe - pandas

I have a dataframe similar to the one below, and I would like to create a few summary stats around the behaviour of customers over time:
import pandas as pd

df = pd.DataFrame([
['id1','23/5/2019','not_emailed']
,['id1','24/5/2019','not_emailed']
,['id1','25/5/2019','emailed']
,['id1','26/5/2019','emailed']
,['id1','27/5/2019','emailed']
,['id1','28/5/2019','emailed']
,['id1','29/5/2019','emailed']
,['id1','30/5/2019','emailed']
,['id1','31/5/2019','emailed']
,['id1','1/6/2019','emailed']
,['id1','2/6/2019','emailed']
,['id2','23/5/2019','not_emailed']
,['id2','24/5/2019','not_emailed']
,['id2','25/5/2019','emailed']
,['id2','26/5/2019','emailed']
,['id2','27/5/2019','emailed']
,['id3','29/5/2019','not_emailed']
,['id3','30/5/2019','emailed']
,['id3','31/5/2019','emailed']
,['id3','1/6/2019','emailed']
,['id3','2/6/2019','emailed']
,['id4','29/5/2019','not_emailed']
,['id4','30/5/2019','emailed']
,['id4','31/5/2019','emailed']
,['id4','1/6/2019','emailed']
,['id4','2/6/2019','emailed']
,['id4','2/7/2019','emailed']
,['id4','3/7/2019','emailed']
,['id4','4/7/2019','emailed']
],columns=['id','date','status'])
The main scenarios that could be observed in this data set are:
id1 was emailed on the 25th but did not convert
id2 was emailed on the 27th and converted on the 28th, because we don't see any more logs for this id
id3 was emailed on the 30th and converted on the 3rd, because we don't see any more logs for this id
id4 was emailed on the 30th and converted on the 3rd, but churned again on the 2nd (of July), when its logs reappear
I would like to get a summary of that information per day:
how many were emailed, how many converted, and how many churned after having previously converted.
A desired potential output could be:
pd.DataFrame([
['29/5/2019',10,3,1] ,
['30/5/2019',10,2,1]
],columns=['date','emailed_total','converted_total','churned_total']
)
Note that the numbers above are random and don't reflect the stats of the first dataset shared.
My approaches so far:
1)
Partially solves the problem:
find the first day each id was emailed
calculate the days elapsed since that first day
group by the elapsed days and aggregate
This works, but not for churned customers.
2)
loop through the dates
filter out the unique ids emailed on each date
loop through the dates in the future and calculate the differences between the sets
This does the job, but it is not very clean or Pythonic.

I have written code to answer your question as I understand it at the moment. But as I commented, the churn status is not handled yet, so there are only two different totals, and the column names are not the ones you want either.
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df2 = df.groupby(['date','status']).agg('count').unstack().fillna(0)
df2.columns = df2.columns.droplevel()
df2 = df2.rename_axis(columns=None).reset_index()
df2.sort_index(ascending=True, inplace=True)
df2
date emailed not_emailed
0 2019-05-23 0.0 2.0
1 2019-05-24 0.0 2.0
2 2019-05-25 2.0 0.0
3 2019-05-26 2.0 0.0
4 2019-05-27 2.0 0.0
5 2019-05-28 1.0 0.0
6 2019-05-29 1.0 2.0
7 2019-05-30 3.0 0.0
8 2019-05-31 3.0 0.0
9 2019-06-01 3.0 0.0
10 2019-06-02 3.0 0.0
11 2019-07-02 1.0 0.0
12 2019-07-03 1.0 0.0
13 2019-07-04 1.0 0.0
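To get closer to the emailed/converted/churned totals per day that you asked for, here is a minimal sketch building on df from above. It assumes a conversion happens the day after an id's last consecutive 'emailed' log, and a churn happens on the day an id's logs reappear after such a gap (note that under this rule id4 would also count as converted again after its July logs end):
# assumes df['date'] has already been parsed with pd.to_datetime as above
emailed = df[df['status'] == 'emailed'].sort_values(['id', 'date'])

# True where an id's logs reappear after a gap of more than one day
gap = emailed.groupby('id')['date'].diff() > pd.Timedelta(days=1)

# churn events: the day the logs resume (id4 on 2019-07-02)
churned = emailed.loc[gap, 'date'].value_counts()

# conversion events: the day after the last log before each gap,
# plus the day after each id's final log
prev_date = emailed.groupby('id')['date'].shift(1)
converted = pd.concat([
    prev_date[gap] + pd.Timedelta(days=1),
    emailed.groupby('id')['date'].max() + pd.Timedelta(days=1),
]).value_counts()

summary = pd.DataFrame({
    'emailed_total': emailed.groupby('date')['id'].nunique(),
    'converted_total': converted,
    'churned_total': churned,
}).fillna(0).astype(int).sort_index()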

As per your request, I put together some insights about your data: time to conversion, and time elapsed for ids that have not converted.
I hope it helps.
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df.sort_values(by='date',inplace=True)
dates=df['date'].unique()
ids=df['id'].unique()
df=df.set_index(['id','date'])
out=pd.DataFrame(index=dates)
for i, new_df in df.groupby(level=0):
    new_df = new_df.droplevel(0)
    new_df = new_df.rename(columns={'status': i})
    out = out.merge(new_df, how='outer', left_index=True, right_index=True)
not_converted=out[out.columns[out.iloc[-1,:]=='emailed']]
converted=out[out.columns[out.iloc[-1,:].isnull()]]
start_mailing_date_NC=(not_converted=='emailed').cumsum().idxmin() #not converted id metrics
delta_NC=(dates[-1]-start_mailing_date_NC) #dates[-1] could be changed to actual date
print("Days from first mail unconverted by id: ")
print(delta_NC.to_string())
print(' Mean Days not converted: %s'%(delta_NC.mean()))
print( '\n')
start_mailing_date=(converted=='emailed').cumsum().idxmin() #converted id metrics
conversion_mailing_date=(converted=='emailed').cumsum().idxmax()#converted id metrics
delta=(conversion_mailing_date-start_mailing_date)
print("Days to conversion by id: ")
print(delta.to_string())
print(' Mean Days to conversion: %s'%(delta.mean()))
output:
Days from first mail unconverted by id:
id4 42 days
Mean Days not converted: 42 days 00:00:00
Days to conversion by id:
id1 10 days
id2 4 days
id3 10 days
Mean Days to conversion: 8 days 00:00:00
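As a side note, the wide table that the merge loop builds above can also be obtained in one step from the original long-format df (a sketch, run before the set_index call):
out = df.pivot(index='date', columns='id', values='status')
This gives one column per id, with NaN where that id has no log on a given date, which is what the converted/not-converted selection above relies on.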

Related

How to get the groupby nth row directly in the row as an item?

I have minute-level Date, Time, Open, High, Low, Close data for a stock, arranged in ascending order by date. I want to make a new column and, for every row of each day, insert yesterday's price taken from the second row of the previous date. So, for instance, I show a price of 18812.3 against 11th Jan, since the previous date was 10th Jan and its second row has a price of 18812.3. Similarly, I have done it for the day before yesterday too. I tried using nth on a groupby object, but for that I have to create a groupby object. The code below gives me a new DataFrame, but I would like to create a column directly holding the desired values.
test = bn_futures.groupby('Date')['Open','High','Low','Close'].nth(1).reset_index()
Try: (check comments)
# Convert Date to datetime64 and set it as index
df = df.assign(Date=pd.to_datetime(df['Date'], dayfirst=True)).set_index('Date')
# Find second value for each day
prices = df.groupby(level=0)['Open'].nth(1).squeeze()
# Find last row for each day
mask = ~df.index.duplicated(keep='last')
# Create new columns
df.loc[mask, 'price at yesterday'] = prices.shift(1)
df.loc[mask, 'price 2d ago'] = prices.shift(2)
Output:
>>> df
            Open  price at yesterday  price 2d ago
Date
2015-01-09     1                 NaN           NaN
2015-01-09     2                 NaN           NaN
2015-01-09     3                 NaN           NaN
2015-01-10     4                 NaN           NaN
2015-01-10     5                 NaN           NaN
2015-01-10     6                 2.0           NaN
2015-01-11     7                 NaN           NaN
2015-01-11     8                 NaN           NaN
2015-01-11     9                 5.0           2.0
Set up an MRE (minimal reproducible example):
df = pd.DataFrame({'Date': ['09-01-2015', '09-01-2015', '09-01-2015',
'10-01-2015', '10-01-2015', '10-01-2015',
'11-01-2015', '11-01-2015', '11-01-2015'],
'Open': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
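If you instead want the value repeated on every row of a day rather than only on the last row, one possible variant (a sketch under the same setup; the column names here are only illustrative) maps the per-date series back onto the full index:
second_open = df.groupby(level=0)['Open'].nth(1)  # second Open of each day, keyed by date
df['yesterday (all rows)'] = df.index.map(second_open.shift(1))
df['2d ago (all rows)'] = df.index.map(second_open.shift(2))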

Rolling median of date-indexed data with duplicate dates

My date-indexed data can have multiple observations for a given date.
I want to get the rolling median of a value but am not getting the result that I am looking for:
df = pd.DataFrame({
'date': ['2020-06-22', '2020-06-23','2020-06-24','2020-06-24', '2020-06-25', '2020-06-26'],
'value': [2,8,5,1,3,7]
})
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Attempt to get the 3-day rolling median of 'value':
df['value'].rolling('3D').median()
# This yields the following, i.e. one median value
# per **observation**
# (two values for 6/24 in this example):
date
2020-06-22 2.0
2020-06-23 5.0
2020-06-24 5.0
2020-06-24 3.5
2020-06-25 4.0
2020-06-26 4.0
Name: value, dtype: float64
# I was hoping to get one median value
# per **distinct date** in the index
# The median for 6/24, for example, would be computed
# from **all** observations on 6/22, 6/23 and 6/24(2 observations)
date
2020-06-22 NaN
2020-06-23 NaN
2020-06-24 3.5
2020-06-25 4.0
2020-06-26 4.0
Name: value, dtype: float64
How do I need to change my code?
As far as I can tell, your code produces the right answer for the second occurrence of 2020-06-24: 3.5 is the median of the four numbers 2, 8, 5, 1. The first occurrence of 2020-06-24 only uses its own value and the ones from the two prior days. Presumably (and I am speculating here) the '3D' window looks at the rows preceding each element in the time series, not the ones following it.
So I think your code only needs a small modification to satisfy your requirement: if there are multiple rows with the same date, we should just pick the last one, which we do below with groupby. You also want the first two values to be NaN rather than medians of shorter windows; this can be achieved by passing min_periods=3 to rolling. Here is all the code; I put the median into its own column:
df['median'] = df['value'].rolling('3D', min_periods=3).median()
df.groupby(level=0).last()
prints
value median
date
2020-06-22 2 NaN
2020-06-23 8 NaN
2020-06-24 1 3.5
2020-06-25 3 4.0
2020-06-26 7 4.0
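Equivalently, if you prefer not to keep the helper column, the same result can be produced in one chain (a sketch that should match the table above):
df['value'].rolling('3D', min_periods=3).median().groupby(level=0).last()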

See if a customer had a purchase across every quarter and then graph

I have a dataframe that looks like this:
customer_id|date |sales_amount
479485 |20190120 | 500
479485 |20180320 | 200
472848 |20191020 | 100
This data has transaction information from 2016-2019. For each business quarter (grouped by 3 months) I want to see if a unique customer had a transaction. Basically I want the y-axis for the table to be each unique customer_id and then the x-axis of the table to be the 12 quarters in the time period of the data with a Boolean of whether or not a customer had a transaction in that quarter.
Ultimately I want to visualize this data to see the distribution of the transactions for each quarter across all the unique customers.
Expected output:
customer_id|2017- Q1 |2017- Q2|.. |2019- Q4
479485 |20190120 | 0 |.. | 1
469488 |20180320 | 0 |.. | 0
452848 |20191020 | 1 |.. | 1
I have changed the date column to datetime but am unsure how to group and proceed to the next step.
Solution:
df.groupby([df['customer_id'], df['date'].apply(lambda _: pd.Period(_, 'Q'))])['sales_amount'].count().unstack().fillna(0)
Output:
date 2017Q1 2018Q1 2019Q1 2019Q4
customer_id
469471 1.0 0.0 0.0 0.0
469488 0.0 1.0 1.0 1.0
472848 0.0 0.0 0.0 1.0
479485 1.0 1.0 1.0 0.0
Notes
Assumptions: (1) all of the year-quarters appear in your data set, and (2) there is at most a single transaction per customer per quarter.
To get around (1), set the index to date and reindex with the missing quarters, filling NaNs with zeros. The above output is based on a sample of dummy data, hence only four quarters are shown.
To get around (2), apply np.sign to your output, as in the sketch below.
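Putting both workarounds together, here is a hedged sketch of the full pipeline; it assumes df['date'] is already datetime and that the quarters of interest run from 2016Q1 to 2019Q4, as stated in the question:
import numpy as np

pivot = (df.groupby([df['customer_id'],
                     df['date'].apply(lambda d: pd.Period(d, 'Q'))])['sales_amount']
           .count()
           .unstack()
           .fillna(0))

# (1) make sure every quarter appears as a column, even if nobody transacted in it
all_quarters = pd.period_range('2016Q1', '2019Q4', freq='Q')
pivot = pivot.reindex(columns=all_quarters, fill_value=0)

# (2) collapse transaction counts to a 0/1 flag per customer per quarter
pivot = np.sign(pivot).astype(int)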

Pandas - calculate rolling average of group excluding current row

For an example:
data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
'Date' : [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5],
'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
This works to calculate the rolling average, inclusive of the current row:
df['avg'] = df.groupby(['Platoon'])['Casualties'].transform(lambda x: x.rolling(2, 1).mean())
Which gives:
Platoon Date Casualties Avg
A 1 1 1.0
A 2 4 2.5
A 3 5 4.5
A 4 7 6.0
......
What I want to get is:
Platoon Date Casualties Avg
A 1 1 1.0
A 2 4 1.0
A 3 5 2.5
A 4 7 4.5
......
I suspect I can use shift here but I can't figure it out!
You need shift with bfill
df.groupby(['Platoon'])['Casualties'].apply(lambda x: x.rolling(2, 1).mean().shift().bfill())
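To write the result back as a column aligned with the original rows, the OP's transform pattern can be kept (a sketch):
df['avg'] = df.groupby('Platoon')['Casualties'].transform(
    lambda x: x.rolling(2, 1).mean().shift().bfill())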

Sort an alphanumeric column in pandas and replace it with the original column of the dataset [duplicate]

I have a data frame like this:
print(df)
0 1 2
0 354.7 April 4.0
1 55.4 August 8.0
2 176.5 December 12.0
3 95.5 February 2.0
4 85.6 January 1.0
5 152 July 7.0
6 238.7 June 6.0
7 104.8 March 3.0
8 283.5 May 5.0
9 278.8 November 11.0
10 249.6 October 10.0
11 212.7 September 9.0
As you can see, months are not in calendar order. So I created a second column to get the month number corresponding to each month (1-12). From there, how can I sort this data frame according to calendar months' order?
Use sort_values to sort the df by a specific column's values:
In [18]:
df.sort_values('2')
Out[18]:
0 1 2
4 85.6 January 1.0
3 95.5 February 2.0
7 104.8 March 3.0
0 354.7 April 4.0
8 283.5 May 5.0
6 238.7 June 6.0
5 152.0 July 7.0
1 55.4 August 8.0
11 212.7 September 9.0
10 249.6 October 10.0
9 278.8 November 11.0
2 176.5 December 12.0
If you want to sort by two columns, pass a list of column labels to sort_values with the column labels ordered according to sort priority. If you use df.sort_values(['2', '0']), the result would be sorted by column 2 then column 0. Granted, this does not really make sense for this example because each value in df['2'] is unique.
I tried the solutions above but did not get the results I wanted, so I found a different solution that works for me. The ascending=False argument orders the dataframe in descending order; by default it is True. I am using Python 3.6.6 and pandas 0.23.4.
final_df = df.sort_values(by=['2'], ascending=False)
You can see more details in pandas documentation here.
Using the column name worked for me:
sorted_df = df.sort_values(by=['Column_name'], ascending=True)
Pandas' sort_values does the work.
There are various parameters one can pass, such as ascending (bool or list of bool):
Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
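For instance, a small illustrative example of a list of bools matching by, sorting ascending on column 2 and descending on column 0:
df.sort_values(by=['2', '0'], ascending=[True, False])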
As the default is ascending, and the OP's goal is to sort ascending, one doesn't need to specify that parameter (see the last note below for how to sort descending), so one can use one of the following ways:
Performing the operation in-place, and keeping the same variable name. This requires one to pass inplace=True as follows:
df.sort_values(by=['2'], inplace=True)
# or
df.sort_values(by = '2', inplace = True)
# or
df.sort_values('2', inplace = True)
If doing the operation in-place is not a requirement, one can assign the change (sort) to a variable:
With the same name of the original dataframe, df as
df = df.sort_values(by=['2'])
With a different name, such as df_new, as
df_new = df.sort_values(by=['2'])
All this previous operations would give the following output
0 1 2
4 85.6 January 1.0
3 95.5 February 2.0
7 104.8 March 3.0
0 354.7 April 4.0
8 283.5 May 5.0
6 238.7 June 6.0
5 152 July 7.0
1 55.4 August 8.0
11 212.7 September 9.0
10 249.6 October 10.0
9 278.8 November 11.0
2 176.5 December 12.0
Finally, one can reset the index with pandas.DataFrame.reset_index, to get the following
df.reset_index(drop = True, inplace = True)
# or
df = df.reset_index(drop = True)
[Out]:
0 1 2
0 85.6 January 1.0
1 95.5 February 2.0
2 104.8 March 3.0
3 354.7 April 4.0
4 283.5 May 5.0
5 238.7 June 6.0
6 152 July 7.0
7 55.4 August 8.0
8 212.7 September 9.0
9 249.6 October 10.0
10 278.8 November 11.0
11 176.5 December 12.0
A one-liner that sorts ascending, and resets the index would be as follows
df = df.sort_values(by=['2']).reset_index(drop = True)
[Out]:
0 1 2
0 85.6 January 1.0
1 95.5 February 2.0
2 104.8 March 3.0
3 354.7 April 4.0
4 283.5 May 5.0
5 238.7 June 6.0
6 152 July 7.0
7 55.4 August 8.0
8 212.7 September 9.0
9 249.6 October 10.0
10 278.8 November 11.0
11 176.5 December 12.0
Notes:
If one is not doing the operation in-place, forgetting to assign the result back (as this user did) may prevent one from getting the expected result.
There are strong opinions on using inplace. For that, one might want to read this.
This assumes that column 2 is not a string. If it is, one will have to convert it:
Using pandas.to_numeric
df['2'] = pd.to_numeric(df['2'])
Using pandas.Series.astype
df['2'] = df['2'].astype(float)
If one wants in descending order, one needs to pass ascending=False as
df = df.sort_values(by=['2'], ascending=False)
# or
df.sort_values(by = '2', ascending=False, inplace=True)
[Out]:
0 1 2
2 176.5 December 12.0
9 278.8 November 11.0
10 249.6 October 10.0
11 212.7 September 9.0
1 55.4 August 8.0
5 152 July 7.0
6 238.7 June 6.0
8 283.5 May 5.0
0 354.7 April 4.0
7 104.8 March 3.0
3 95.5 February 2.0
4 85.6 January 1.0
Just as another solution: instead of creating the second column, you can make your string data (the month names) categorical and sort by that, like this:
df.rename(columns={1:'month'},inplace=True)
df['month'] = pd.Categorical(df['month'],categories=['December','November','October','September','August','July','June','May','April','March','February','January'],ordered=True)
df = df.sort_values('month',ascending=False)
This gives you the data ordered by month name, as specified when creating the Categorical object.
Just adding some more operations on the data. Suppose we have a dataframe df; we can do several operations to get the desired outputs:
ID cost tax label
1 216590 1600 test
2 523213 1800 test
3 250 1500 experiment
(df['label'].value_counts().to_frame().reset_index()).sort_values('label', ascending=False)
will give sorted output of labels as a dataframe
index label
0 test 2
1 experiment 1
This worked for me
df.sort_values(by='Column_name', inplace=True, ascending=False)
You probably need to reset the index after sorting:
df = df.sort_values('2')
df = df.reset_index(drop=True)
Here is the signature of sort_values according to the pandas documentation:
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False,
                      kind='quicksort', na_position='last',
                      ignore_index=False, key=None)
In this case, it will be like this:
df.sort_values(by=['2'])
API Reference pandas.DataFrame.sort_values
Just adding a few more insights:
df = raw_df['2'].sort_values()  # sorts only column 2 and returns it as a Series
but,
df = raw_df.sort_values(by=['2'], ascending=False)  # sorts the whole df in descending order on the basis of column "2"
If you want to sort a column dynamically in a custom (non-alphabetical) order, and don't want to use pd.sort_values() directly on it, you can try the solution below.
Problem: sort column "col1" in the sequence ['A', 'C', 'D', 'B']
import pandas as pd
import numpy as np
## Sample DataFrame ##
df = pd.DataFrame({'col1': ['A', 'B', 'D', 'C', 'A']})
>>> df
col1
0 A
1 B
2 D
3 C
4 A
## Solution ##
conditions = []
values = []
for i, j in enumerate(['A', 'C', 'D', 'B']):
    conditions.append(df['col1'] == j)
    values.append(i)
df['col1_Num'] = np.select(conditions, values)
df.sort_values(by='col1_Num', inplace=True)
>>> df
col1 col1_Num
0 A 0
4 A 0
3 C 1
2 D 2
1 B 3
This one worked for me:
df = df.sort_values(by=[2])
whereas:
df = df.sort_values(by=['2'])
did not work, presumably because the column labels are integers rather than strings.
Example:
Assume you have a column with values 1 and 0 and you want to separate out and use only one value; then:
# furniture is one of the columns in the csv file
allrooms = data.groupby('furniture')['furniture'].agg('count')
allrooms
myrooms1 = pd.DataFrame(allrooms, columns=['furniture'], index=[1])
myrooms2 = pd.DataFrame(allrooms, columns=['furniture'], index=[0])
print(myrooms1); print(myrooms2)