Get a random sample from dataframe with grouped columns? - pandas

I have a dataframe of time series data, called dates_c that looks like this:
DATE_T Da HN NAR TJH
0 2014-01-01 00:11:25 2014-01-01 3520 11931 769.198
1 2014-01-01 00:11:25 2014-01-01 3560 11942 338.143
2 2014-01-01 00:11:25 2014-01-01 3542 11937 665.481
3 2014-01-01 00:11:25 2014-01-01 3563 11944 529.058
4 2014-01-01 00:11:25 2014-01-01 3535 11936 2883.945
I want to get 60 random rows per Da + NAR. This is what I did:
np.random.seed(987)
columns = ['DATE_T', 'HN', 'TJH']
new = dates_c.groupby(['Da', 'NAR'])[columns].apply(pd.Series.sample, n=60, replace=False).reset_index()
I keep getting this error:
ValueError: Key 2014-01-01 00:00:00 not in level Index([2014-01-01, 2014-01-02, 2014-01-03, 2014-01-04, 2014-01-05, 2014-01-06,
2014-01-07, 2014-01-08, 2014-01-09, 2014-01-10,
...
2014-12-22, 2014-12-23, 2014-12-24, 2014-12-25, 2014-12-26, 2014-12-27,
2014-12-28, 2014-12-29, 2014-12-30, 2014-12-31],
dtype='object', name='Date', length=320)

You need replace=True here, since some groups may not have enough data points for n=60:
out = df.groupby(['Date', 'NOAA_AR']).apply(lambda x : x.sample(n=60,replace=True))

Try:
dates_count.groupby(['Date', 'NOAA_AR'])[COLS].sample(60).reset_index()
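
If sampling with replacement is not acceptable, one alternative is to shuffle the whole frame once and keep the first rows of each group. That caps every group at 60 rows and does not fail on small groups. A minimal sketch on made-up data (the `value` column, the group sizes, and the seed are illustrative assumptions; the `Date` and `NOAA_AR` names follow the answers above):

```python
import numpy as np
import pandas as pd

# Made-up data: one group with 100 rows and one with only 40.
rng = np.random.default_rng(987)
df = pd.DataFrame({
    "Date": np.repeat(["2014-01-01", "2014-01-02"], [100, 40]),
    "NOAA_AR": np.repeat([11931, 11942], [100, 40]),
    "value": rng.random(140),
})

n = 60
# Shuffle all rows once, then keep the first n rows of each group.
# Groups with fewer than n rows are returned whole, without duplicates.
sampled = (
    df.sample(frac=1, random_state=987)
      .groupby(["Date", "NOAA_AR"])
      .head(n)
)

sizes = sampled.groupby(["Date", "NOAA_AR"]).size()
print(sizes)
```

Whether small groups should be kept whole, dropped, or oversampled depends on what the sample is used for; this sketch only covers the first option.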

Related

Merge when date is between two dates Pandas

I'm looking for a way to merge a table on multiple conditions, one of which is that a date falls between two dates in the other table.
Below are the two data sets:
DATA SET 1
Code 1  Code 2  Date      Number
001     192     02.02.22  10
002     192     05.03.22  12
002     192     09.05.22  8
003     193     14.06.22  14
003     193     16.08.22  18
DATA SET 2
Code 1  Code 2  Date Start  Date End
005     192     15.01.22    5.02.22
002     192     01.05.22    01.06.22
003     193     10.08.22    10.09.22
003     192     01.03.22    15.03.22
007     192     10.06.22    18.06.22
I basically need to end up with Data Set 2 with the Number column attached, merged on Code 1 and Code 2, keeping only the rows where the date in Data Set 1 falls between the two dates in Data Set 2.
In the example above, the outcome would look like this:
Code 1  Code 2  Date Start  Date End  Number
002     192     01.05.22    01.06.22  8
003     193     10.08.22    10.09.22  18
Thanks
Try:
# Convert to datetime
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date Start'] = pd.to_datetime(df2['Date Start'], dayfirst=True)
df2['Date End'] = pd.to_datetime(df2['Date End'], dayfirst=True)
# Merge on Code 1 and Code 2, then keep only rows where Date Start <= Date <= Date End
out = df2.merge(df1, how='left', on=['Code 1', 'Code 2']) \
         .query('Date.between(`Date Start`, `Date End`)')
Output:
Code 1  Code 2  Date Start           Date End             Date                 Number
2       192     2022-05-01 00:00:00  2022-06-01 00:00:00  2022-05-09 00:00:00  8
3       193     2022-08-10 00:00:00  2022-09-10 00:00:00  2022-08-16 00:00:00  18
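
For reference, here is a self-contained sketch of the same idea on the question's sample data, using a plain boolean mask instead of `query`. It rests on some assumptions: the codes are kept as strings so the leading zeros survive, the dates are parsed with an explicit day-first format, and `5.02.22` is written zero-padded as `05.02.22`.

```python
import pandas as pd

# Sample data from the question; codes stay strings to keep leading zeros.
df1 = pd.DataFrame({
    "Code 1": ["001", "002", "002", "003", "003"],
    "Code 2": ["192", "192", "192", "193", "193"],
    "Date": ["02.02.22", "05.03.22", "09.05.22", "14.06.22", "16.08.22"],
    "Number": [10, 12, 8, 14, 18],
})
df2 = pd.DataFrame({
    "Code 1": ["005", "002", "003", "003", "007"],
    "Code 2": ["192", "192", "193", "192", "192"],
    "Date Start": ["15.01.22", "01.05.22", "10.08.22", "01.03.22", "10.06.22"],
    # "5.02.22" in the question, zero-padded here (assumption).
    "Date End": ["05.02.22", "01.06.22", "10.09.22", "15.03.22", "18.06.22"],
})

# Parse all date columns as day.month.two-digit-year.
df1["Date"] = pd.to_datetime(df1["Date"], format="%d.%m.%y")
for col in ("Date Start", "Date End"):
    df2[col] = pd.to_datetime(df2[col], format="%d.%m.%y")

# Pair rows that share both codes, then keep pairs whose Date is in range.
merged = df2.merge(df1, on=["Code 1", "Code 2"])
in_range = merged["Date"].between(merged["Date Start"], merged["Date End"])
out = merged.loc[in_range].drop(columns="Date").reset_index(drop=True)
print(out)
```

Note that `between` is inclusive on both ends by default, so a Date equal to Date Start or Date End counts as a match.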

Difference between two date columns in Pandas

I am trying to get the difference between two date columns. The script and the data it uses are below, but I am getting the same result for all three rows.
df = pd.read_csv(r'Book1.csv',encoding='cp1252')
df
Out[36]:
Start End DifferenceinDays DifferenceinHrs
0 10/26/2013 12:43 12/15/2014 0:04 409 9816
1 2/3/2014 12:43 3/25/2015 0:04 412 9888
2 5/14/2014 12:43 7/3/2015 0:04 409 9816
I expect results like those in the DifferenceinDays column, which was calculated in Excel, but in Python I get the same value for all three rows. The code I used is below. Can anyone tell me how to calculate the difference between two date columns? I am trying to get the number of hours between them.
df["Start"] = pd.to_datetime(df['Start'])
df["End"] = pd.to_datetime(df['End'])
df['hrs']=(df.End-df.Start)
df['hrs']
Out[38]:
0 414 days 11:21:00
1 414 days 11:21:00
2 414 days 11:21:00
Name: hrs, dtype: timedelta64[ns]
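IIUC, divide the timedelta by np.timedelta64(1,'h') to get the difference in hours.
Note that the identical values are correct: all three rows span exactly 414 days 11:21:00. It looks like Excel calculates the hours differently, unsure why.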
import numpy as np
df['hrs'] = (df['End'] - df['Start']) / np.timedelta64(1,'h')
print(df)
Start End DifferenceinHrs hrs
0 2013-10-26 12:43:00 2014-12-15 00:04:00 9816 9947.35
1 2014-02-03 12:43:00 2015-03-25 00:04:00 9888 9947.35
2 2014-05-14 12:43:00 2015-07-03 00:04:00 9816 9947.35
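
An equivalent way to get the hours, without NumPy, is to go through `dt.total_seconds()`. A small sketch on the question's three rows, assuming the dates are month/day/year as in the parsed output above:

```python
import pandas as pd

df = pd.DataFrame({
    "Start": ["10/26/2013 12:43", "2/3/2014 12:43", "5/14/2014 12:43"],
    "End": ["12/15/2014 0:04", "3/25/2015 0:04", "7/3/2015 0:04"],
})

# Assumed format: month/day/year hour:minute.
fmt = "%m/%d/%Y %H:%M"
df["Start"] = pd.to_datetime(df["Start"], format=fmt)
df["End"] = pd.to_datetime(df["End"], format=fmt)

delta = df["End"] - df["Start"]               # timedelta64 column
df["days"] = delta.dt.days                    # whole days only
df["hrs"] = delta.dt.total_seconds() / 3600   # fractional hours
print(df)
```

Here `days` drops the leftover 11 hours 21 minutes, while `hrs` keeps it as a fraction, so the two columns are not simple multiples of each other.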

How to sum certain values using pandas datetime operations

Headline is not clear. Let me explain.
I have a dataframe like this:
Order Quantity Date Accepted Date Delivered
20 01-05-2010 01-02-2011
10 01-11-2010 01-03-2011
300 01-12-2010 01-04-2011
5 01-03-2011 01-03-2012
20 01-04-2012 01-11-2013
10 01-07-2013 01-12-2014
I want to basically create another column that contains the total undelivered items for each row.
Expected output:
Order Quantity Date Accepted Date Delivered Pending Order
20 01-05-2010 01-02-2011 20
10 01-11-2010 01-03-2011 30
300 01-12-2010 01-04-2011 330
5 01-03-2011 01-03-2012 305
20 01-04-2012 01-11-2013 20
10 01-07-2013 01-12-2014 30
Here I have taken part of your dataframe and tried to get the result.
import pandas as pd

df = pd.DataFrame({'order': [20, 10, 300, 200],
                   'Date_aceepted': ['01-05-2010', '01-11-2010', '01-12-2010', '01-12-2010'],
                   'Date_delever': ['01-02-2011', '01-03-2011', '01-04-2011', '01-12-2010']})
order Date_aceepted Date_delever
0 20 01-05-2010 01-02-2011
1 10 01-11-2010 01-03-2011
2 300 01-12-2010 01-04-2011
3 200 01-12-2010 01-12-2010
Then I convert Date_aceepted and Date_delever to datetime with pd.to_datetime:
df['date1'] = pd.to_datetime(df['Date_aceepted'])
df['date2'] = pd.to_datetime(df['Date_delever'])
Then I make a new data frame that keeps only the rows where the accepted and delivered dates differ. I assume that is what you need in your final result.
dff = df[df['date1'] != df['date2']].copy()  # copy so that adding a column later does not raise SettingWithCopyWarning
The last row, in which the accepted and delivered dates are the same, has been removed in dff.
order Date_aceepted Date_delever date1 date2
0 20 01-05-2010 01-02-2011 2010-01-05 2011-01-02
1 10 01-11-2010 01-03-2011 2010-01-11 2011-01-03
2 300 01-12-2010 01-04-2011 2010-01-12 2011-01-04
Then I use cumsum to get the running total of the orders:
dff['pending'] = dff['order'].cumsum()
and it gives
order Date_aceepted Date_delever date1 date2 pending
0 20 01-05-2010 01-02-2011 2010-01-05 2011-01-02 20
1 10 01-11-2010 01-03-2011 2010-01-11 2011-01-03 30
2 300 01-12-2010 01-04-2011 2010-01-12 2011-01-04 330
The final data frame has two extra columns (date1 and date2) that can be dropped if you don't want them in your result.
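
A plain cumulative sum never subtracts orders that have already been delivered, so it cannot produce the 305 in the question's expected output. One reading that does reproduce that column is: pending = everything accepted up to this row, minus everything delivered on or before this row's acceptance date. A sketch under that assumption, also assuming the dates are day-month-year and the rows are sorted by acceptance date:

```python
import pandas as pd

df = pd.DataFrame({
    "Order Quantity": [20, 10, 300, 5, 20, 10],
    "Date Accepted": ["01-05-2010", "01-11-2010", "01-12-2010",
                      "01-03-2011", "01-04-2012", "01-07-2013"],
    "Date Delivered": ["01-02-2011", "01-03-2011", "01-04-2011",
                       "01-03-2012", "01-11-2013", "01-12-2014"],
})

# Assumed format: day-month-year.
accepted = pd.to_datetime(df["Date Accepted"], format="%d-%m-%Y")
delivered = pd.to_datetime(df["Date Delivered"], format="%d-%m-%Y")

# Total quantity accepted up to and including each row.
accepted_so_far = df["Order Quantity"].cumsum()

# Total quantity already delivered by each row's acceptance date.
delivered_so_far = pd.Series(
    [df.loc[delivered <= day, "Order Quantity"].sum() for day in accepted],
    index=df.index,
)

df["Pending Order"] = accepted_so_far - delivered_so_far
print(df)
```

The list comprehension scans the delivery dates once per row, which is fine for small frames; a larger one would probably call for a sorted merge instead.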

Pandas - Mapping two Dataframe based on date ranges

I am trying to categorise users based on their lifecycle. The pandas dataframe below shows the number of times a customer raised a ticket depending on how long they have used the product.
master dataframe
cust_id,start_date,end_date
101,02/01/2019,12/01/2019
101,14/02/2019,24/04/2019
101,27/04/2019,02/05/2019
102,25/01/2019,02/02/2019
103,02/01/2019,22/01/2019
Master lookup table
start_date,end_date,project_name
01/01/2019,13/01/2019,project_a
14/01/2019,13/02/2019,project_b
15/02/2019,13/03/2019,project_c
14/03/2019,13/06/2019,project_d
I am trying to map the above two data frames such that I am able to add project_name to the master dataframe
Expected output:
cust_id,start_date,end_date,project_name
101,02/01/2019,12/01/2019,project_a
101,14/02/2019,24/04/2019,project_c
101,14/02/2019,24/04/2019,project_d
101,27/04/2019,02/05/2019,project_d
102,25/01/2019,02/02/2019,project_b
103,02/01/2019,22/01/2019,project_a
103,02/01/2019,22/01/2019,project_b
I do expect duplicate rows in the final output, as a single row in the master dataframe can fall under multiple rows of the master lookup table.
I think you need a cross join followed by a filter on the date ranges:
df = df1.assign(a=1).merge(df2.assign(a=1), on='a')
m1 = df['start_date_y'].between(df['start_date_x'], df['end_date_x'])
m2 = df['end_date_y'].between(df['start_date_x'], df['end_date_x'])
df = df[m1 | m2]
print (df)
cust_id start_date_x end_date_x a start_date_y end_date_y project_name
1 101 2019-02-01 2019-12-01 1 2019-01-14 2019-02-13 project_b
2 101 2019-02-01 2019-12-01 1 2019-02-15 2019-03-13 project_c
3 101 2019-02-01 2019-12-01 1 2019-03-14 2019-06-13 project_d
6 101 2019-02-14 2019-04-24 1 2019-02-15 2019-03-13 project_c
7 101 2019-02-14 2019-04-24 1 2019-03-14 2019-06-13 project_d
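
Two things are worth checking in the snippet above. The mask keeps a pair only when a lookup start or end date falls inside the master range, so it misses a master range that lies entirely inside a lookup range (for example 02/01/2019 to 12/01/2019 inside project_a). The printed dates also suggest 02/01/2019 was parsed month-first. Below is a sketch of a symmetric overlap test, assuming day-first dates; the names `master`, `lookup`, `proj_start`, `proj_end` and `key` are made up for this example.

```python
import pandas as pd

master = pd.DataFrame({
    "cust_id": [101, 101, 101, 102, 103],
    "start_date": ["02/01/2019", "14/02/2019", "27/04/2019", "25/01/2019", "02/01/2019"],
    "end_date": ["12/01/2019", "24/04/2019", "02/05/2019", "02/02/2019", "22/01/2019"],
})
lookup = pd.DataFrame({
    "proj_start": ["01/01/2019", "14/01/2019", "15/02/2019", "14/03/2019"],
    "proj_end": ["13/01/2019", "13/02/2019", "13/03/2019", "13/06/2019"],
    "project_name": ["project_a", "project_b", "project_c", "project_d"],
})

# Assumed format: day/month/year.
for col in ("start_date", "end_date"):
    master[col] = pd.to_datetime(master[col], format="%d/%m/%Y")
for col in ("proj_start", "proj_end"):
    lookup[col] = pd.to_datetime(lookup[col], format="%d/%m/%Y")

# Cross join through a constant key, as in the answer above.
pairs = master.assign(key=1).merge(lookup.assign(key=1), on="key")

# Two closed ranges overlap when each one starts before the other ends.
overlaps = (pairs["start_date"] <= pairs["proj_end"]) & (
    pairs["proj_start"] <= pairs["end_date"]
)
result = pairs.loc[
    overlaps, ["cust_id", "start_date", "end_date", "project_name"]
].reset_index(drop=True)
print(result)
```

On the sample data this keeps seven rows, the same cust_id and project_name pairs as the expected output in the question.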

Pandas - Find difference based on two subsequent rows of Dataframe

I have a DataFrame that records, in the column labelled date, the date on which a customer raised a ticket. If ref_column for the current row is the same as in the following row, I need the aging to be the difference between the date of the current row and the date of the following row for the same cust_id. If ref_column is not the same, I need the difference between date and ref_date of the same row.
Given below is how my data is:
cust_id,date,ref_column,ref_date
101,15/01/19,abc,31/01/19
101,17/01/19,abc,31/01/19
101,19/01/19,xyz,31/01/19
102,15/01/19,abc,31/01/19
102,21/01/19,klm,31/01/19
102,25/01/19,xyz,31/01/19
103,15/01/19,xyz,31/01/19
Expected output:
cust_id,date,ref_column,ref_date,aging(in days)
101,15/01/19,abc,31/01/19,2
101,17/01/19,abc,31/01/19,14
101,19/01/19,xyz,31/01/19,0
102,15/01/19,abc,31/01/19,16
102,21/01/19,klm,31/01/19,10
102,25/01/19,xyz,31/01/19,0
103,15/01/19,xyz,31/01/19,0
Aging(in days) is 0 for the last entry for a given cust_id
Here's my approach:
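import numpy as np
import pandas as pd
# convert dates to datetime type (the sample dates are day-first);
# ignore if they already are
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['ref_date'] = pd.to_datetime(df['ref_date'], dayfirst=True)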
# customer group
groups = df.groupby('cust_id')
# where ref_column is the same with the next:
same_ = df['ref_column'].eq(groups['ref_column'].shift(-1))
# update these ones
df['aging'] = np.where(same_,
-groups['date'].diff(-1).dt.days, # same ref as next row
df['ref_date'].sub(df['date']).dt.days) # diff ref than next row
# update last elements in groups:
last_idx = groups['date'].idxmax()
df.loc[last_idx, 'aging'] = 0
Output:
cust_id date ref_column ref_date aging
0 101 2019-01-15 abc 2019-01-31 2.0
1 101 2019-01-17 abc 2019-01-31 14.0
2 101 2019-01-19 xyz 2019-01-31 0.0
3 102 2019-01-15 abc 2019-01-31 16.0
4 102 2019-01-21 klm 2019-01-31 10.0
5 102 2019-01-25 xyz 2019-01-31 0.0
6 103 2019-01-15 xyz 2019-01-31 0.0