Pandas - Mapping two DataFrames based on date ranges

I am trying to categorise users based on their lifecycle. The pandas dataframe given below shows the number of times a customer raised a ticket, depending on how long they have used the product.
master dataframe
cust_id,start_date,end_date
101,02/01/2019,12/01/2019
101,14/02/2019,24/04/2019
101,27/04/2019,02/05/2019
102,25/01/2019,02/02/2019
103,02/01/2019,22/01/2019
Master lookup table
start_date,end_date,project_name
01/01/2019,13/01/2019,project_a
14/01/2019,13/02/2019,project_b
15/02/2019,13/03/2019,project_c
14/03/2019,13/06/2019,project_d
I am trying to map the above two dataframes so that I can add project_name to the master dataframe.
Expected output:
cust_id,start_date,end_date,project_name
101,02/01/2019,12/01/2019,project_a
101,14/02/2019,24/04/2019,project_c
101,14/02/2019,24/04/2019,project_d
101,27/04/2019,02/05/2019,project_d
102,25/01/2019,02/02/2019,project_b
103,02/01/2019,22/01/2019,project_a
103,02/01/2019,22/01/2019,project_b
I do expect duplicate rows in the final output, as a single row in the master dataframe can fall under multiple rows of the master lookup table.

I think you need a cross join via a helper key, then filter by the date ranges:
# assumes start_date/end_date in both frames have already been converted with pd.to_datetime
df = df1.assign(a=1).merge(df2.assign(a=1), on='a')
m1 = df['start_date_y'].between(df['start_date_x'], df['end_date_x'])
m2 = df['end_date_y'].between(df['start_date_x'], df['end_date_x'])
df = df[m1 | m2]
print (df)
cust_id start_date_x end_date_x a start_date_y end_date_y project_name
1 101 2019-02-01 2019-12-01 1 2019-01-14 2019-02-13 project_b
2 101 2019-02-01 2019-12-01 1 2019-02-15 2019-03-13 project_c
3 101 2019-02-01 2019-12-01 1 2019-03-14 2019-06-13 project_d
6 101 2019-02-14 2019-04-24 1 2019-02-15 2019-03-13 project_c
7 101 2019-02-14 2019-04-24 1 2019-03-14 2019-06-13 project_d
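Note that m1 | m2 only keeps lookup rows whose start or end falls inside the customer period, so a customer period lying entirely inside a lookup period (27/04/2019-02/05/2019 inside project_d) is dropped. A full interval-overlap test reproduces the expected output; a minimal sketch, assuming the sample data above and day-first dates:
import pandas as pd

df1 = pd.DataFrame({
    'cust_id': [101, 101, 101, 102, 103],
    'start_date': ['02/01/2019', '14/02/2019', '27/04/2019', '25/01/2019', '02/01/2019'],
    'end_date': ['12/01/2019', '24/04/2019', '02/05/2019', '02/02/2019', '22/01/2019'],
})
df2 = pd.DataFrame({
    'start_date': ['01/01/2019', '14/01/2019', '15/02/2019', '14/03/2019'],
    'end_date': ['13/01/2019', '13/02/2019', '13/03/2019', '13/06/2019'],
    'project_name': ['project_a', 'project_b', 'project_c', 'project_d'],
})
# the sample dates are day-first, so parse them explicitly
for frame in (df1, df2):
    frame['start_date'] = pd.to_datetime(frame['start_date'], dayfirst=True)
    frame['end_date'] = pd.to_datetime(frame['end_date'], dayfirst=True)
# cross join, then keep rows where the two periods overlap at all:
# overlap <=> start_x <= end_y and start_y <= end_x
m = df1.assign(a=1).merge(df2.assign(a=1), on='a').drop(columns='a')
overlap = (m['start_date_x'] <= m['end_date_y']) & (m['start_date_y'] <= m['end_date_x'])
out = m[overlap][['cust_id', 'start_date_x', 'end_date_x', 'project_name']]
print(out)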

Related

Get a random sample from dataframe with grouped columns?

I have a dataframe of time series data, called dates_count, that looks like this:
DATE_T Date HN NOAA_AR TJH
0 2014-01-01 00:11:25 2014-01-01 3520 11931 769.198
1 2014-01-01 00:11:25 2014-01-01 3560 11942 338.143
2 2014-01-01 00:11:25 2014-01-01 3542 11937 665.481
3 2014-01-01 00:11:25 2014-01-01 3563 11944 529.058
4 2014-01-01 00:11:25 2014-01-01 3535 11936 2883.945
I want to get 60 random rows per Date + NOAA_AR. This is what I did:
np.random.seed(987)
columns = ['DATE_T', 'HN', 'TJH']
new = dates_count.groupby(['Date', 'NOAA_AR'])[columns].apply(pd.Series.sample, n=60, replace=False).reset_index()
I keep getting this error:
ValueError: Key 2014-01-01 00:00:00 not in level Index([2014-01-01, 2014-01-02, 2014-01-03, 2014-01-04, 2014-01-05, 2014-01-06,
2014-01-07, 2014-01-08, 2014-01-09, 2014-01-10,
...
2014-12-22, 2014-12-23, 2014-12-24, 2014-12-25, 2014-12-26, 2014-12-27,
2014-12-28, 2014-12-29, 2014-12-30, 2014-12-31],
dtype='object', name='Date', length=320)
Here you need replace=True since some groups may not have enough data points for n=60:
out = dates_count.groupby(['Date', 'NOAA_AR']).apply(lambda x: x.sample(n=60, replace=True))
Try:
dates_count.groupby(['Date', 'NOAA_AR'])[columns].sample(60).reset_index()
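Combining the two answers, GroupBy.sample (available since pandas 1.1) also accepts replace directly; a minimal sketch, assuming the dates_count frame and the columns list defined in the question:
# 60 rows per (Date, NOAA_AR) group; replace=True lets groups smaller than 60 be sampled
sampled = (dates_count.groupby(['Date', 'NOAA_AR'])[columns]
                      .sample(n=60, replace=True, random_state=987)
                      .reset_index())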

Pandas - Find difference based on two subsequent rows of Dataframe

I have a dataframe that captures the date a ticket was raised by a customer, in a column labelled date. If the ref_column of the current row is the same as that of the following row, then the aging is the difference between the date of the current row and the date of the following row for the same cust_id. If the ref_column is not the same as the following row's, then the aging is the difference between the date and the ref_date of the same row.
Given below is how my data is:
cust_id,date,ref_column,ref_date
101,15/01/19,abc,31/01/19
101,17/01/19,abc,31/01/19
101,19/01/19,xyz,31/01/19
102,15/01/19,abc,31/01/19
102,21/01/19,klm,31/01/19
102,25/01/19,xyz,31/01/19
103,15/01/19,xyz,31/01/19
Expected output:
cust_id,date,ref_column,ref_date,aging(in days)
101,15/01/19,abc,31/01/19,2
101,17/01/19,abc,31/01/19,14
101,19/01/19,xyz,31/01/19,0
102,15/01/19,abc,31/01/19,16
102,21/01/19,klm,31/01/19,10
102,25/01/19,xyz,31/01/19,0
103,15/01/19,xyz,31/01/19,0
Aging(in days) is 0 for the last entry for a given cust_id
Here's my approach:
import numpy as np
import pandas as pd

# convert dates to datetime type (skip if they already are); the sample dates are day-first
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['ref_date'] = pd.to_datetime(df['ref_date'], dayfirst=True)
# customer group
groups = df.groupby('cust_id')
# where ref_column is the same as in the next row:
same_ = df['ref_column'].eq(groups['ref_column'].shift(-1))
# update these ones
df['aging'] = np.where(same_,
                       -groups['date'].diff(-1).dt.days,        # same ref as next row
                       df['ref_date'].sub(df['date']).dt.days)  # different ref than next row
# update the last element in each group (assumes the latest date is the last row per customer):
last_idx = groups['date'].idxmax()
df.loc[last_idx, 'aging'] = 0
Output:
cust_id date ref_column ref_date aging
0 101 2019-01-15 abc 2019-01-31 2.0
1 101 2019-01-17 abc 2019-01-31 14.0
2 101 2019-01-19 xyz 2019-01-31 0.0
3 102 2019-01-15 abc 2019-01-31 16.0
4 102 2019-01-21 klm 2019-01-31 10.0
5 102 2019-01-25 xyz 2019-01-31 0.0
6 103 2019-01-15 xyz 2019-01-31 0.0

Adding the most recent data to a pandas data frame

I am trying to build, and keep up to date, a data frame/time series where I scrape the data from a website table. I want to take the most recent data and add it to the data I've already got. A sample of what the data frame looks like is:
Date Price
0 10/01/19 100
1 09/01/19 95
2 08/01/19 96
3 07/01/19 97
What I then want to do is run my little program and have it identify that I am missing data for the 11th and 12th of Jan, and then add it to the top of the data frame. I am quite comfortable with compiling a data frame using .read_html, and generally building a data frame, but this is a bit beyond my talents currently.
I know the done thing is usually to show you what I have so far attempted but to be honest I actually don't know where to begin with this one.
Many thanks
Let's say the old dataframe is df, which looks like:
Date Price
0 2019-01-10 100
1 2019-01-09 95
2 2019-01-08 96
3 2019-01-07 97
After 2 days you download data which gives you 2 rows, for 2019-01-11 and 2019-01-12; let's name it new_df (the values are just examples):
Date Price
0 2019-01-12 67
1 2019-01-11 89
2 2019-01-10 100
3 2019-01-09 95
Note: there are a few values in the new df which are present in the old df.
Using df.append(), df.drop_duplicates() and df.sort_values():
>>df.append(new_df,ignore_index=True).drop_duplicates().sort_values(by='Date',ascending=False)
Date Price
4 2019-01-12 67
5 2019-01-11 89
0 2019-01-10 100
1 2019-01-09 95
2 2019-01-08 96
3 2019-01-07 97
This will append the new values and sort them in descending order based on the Date column, keeping the latest date at the top.
If you want the index sorted as well, just add .sort_index() at the end: df.append(new_df, ignore_index=True).drop_duplicates().sort_values(by='Date', ascending=False).sort_index()
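Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the same result can be built with pd.concat. A minimal sketch, assuming the same df and new_df as above:
import pandas as pd
# concatenate old and new data, drop the rows that appear in both,
# and keep the latest date at the top
updated = (pd.concat([df, new_df], ignore_index=True)
             .drop_duplicates()
             .sort_values(by='Date', ascending=False)
             .reset_index(drop=True))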

Filtering Pandas column with specific conditions?

I have a pandas dataframe that looks like
Start Time
0 2017-06-23 15:09:32
1 2017-05-25 18:19:03
2 2017-01-04 08:27:49
3 2017-03-06 13:49:38
4 2017-01-17 14:53:07
5 2017-06-26 09:01:20
6 2017-05-26 09:41:44
7 2017-01-21 14:28:38
8 2017-04-20 16:08:51
I want to filter out the ones with month == 06, so it would be rows 0 and 5.
I know how to filter a column that has only a few categories, but in this case, since it's a date, I need to parse the date and check the month. I am not sure how to do it with pandas. Please help.
Use the .dt.month accessor:
# convert first if the column is not already datetime
# df['Start Time'] = pd.to_datetime(df['Start Time'])
df1 = df[df['Start Time'].dt.month == 6].copy()
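If more than one month is needed, the same accessor works with .isin; a small sketch on the df above:
# keep rows whose month is June or July
df2 = df[df['Start Time'].dt.month.isin([6, 7])].copy()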

groupby pandas dataframe, take difference between value of latest and earliest date

I have a Cumulative column and I want to groupby index and take the values corresponding to the latest date minus the values corresponding to the earliest date.
Very similar to this: group by pandas dataframe and select latest in each group
But take the difference between latest and earliest in each group.
I'm a python rookie, and here is my solution:
import pandas as pd
from io import StringIO
csv = StringIO("""index id product date
0 220 6647 2014-09-01
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01""")
df = pd.read_table(csv, sep=r'\s+', index_col='index')
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_sort = df.sort_values('date')
df_sort.drop(['product'], axis=1, inplace=True)
df_sort.groupby('id').tail(1).set_index('id') - df_sort.groupby('id').head(1).set_index('id')
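The same latest-minus-earliest span per id can also be computed in a single aggregation, without sorting or dropping columns; a sketch on the df loaded above:
# timedelta between the latest and earliest date for each id
df.groupby('id')['date'].agg(lambda s: s.max() - s.min())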