Create an extra column based condition in pandas - pandas

I have a data frame as shown below
Tenancy_ID Start_Date Cancelled_Date
1 2011-10-02 07:18:16 2011-12-02 08:15:16
2 2012-10-22 07:18:17 NaT
1 2013-06-02 07:14:12 NaT
3 2016-10-02 07:18:16 2017-03-02 08:18:15
From the above I would like to create new column named as Cancelled_Status based on the availability of cancelled date in Cancelled_Date.
Expected Output:
Tenancy_ID Start_Date Cancelled_Date Cancelled_status
1 2011-10-02 07:18:16 2011-12-02 08:15:16 Cancelled
2 2012-10-22 07:18:17 NaT Not_Cancelled
1 2013-06-02 07:14:12 NaT Not_Cancelled
3 2016-10-02 07:18:16 2017-03-02 08:18:15 Cancelled

Use numpy.where with Series.isna:
df['Cancelled_status'] = np.where(df['Cancelled_Date'].isna(), 'Not_Cancelled', 'Cancelled')
Alternative with
Series.notna:
df['Cancelled_status'] = np.where(df['Cancelled_Date'].notna(), 'Cancelled', 'Not_Cancelled')

Related

sql compute time differ duration and skip rows value equal null

I'm new to sql ,I use pandas a lot ,but my boss asks me to use sql replace lots of pandas code.
I have a table my_table:
case_id first_created last_paid submitted_time
3456 2021-01-27 2021-01-29 2021-01-26 21:34:36.566023+00:00
7891 2021-08-02 2021-09-16 2022-10-26 19:49:14.135585+00:00
1245 2021-09-13 None 2022-10-31 02:03:59.620348+00:00
9073 None None 2021-09-12 10:25:30.845687+00:00
6891 2021-08-03 2021-09-17 None
First I need create 2 variables:
create_duration = first_created-submitted_time
paid_duration= last_paid-submitted_time
And if submitted_time is none just ignore that row ,else if create_duration or paid_duration
is negative ,convert it to 0,the unit should be days.
The ideal output should something similliar :
case_id first_created last_paid submitted_time create_duration paid_duration
3456 2021-01-27 2021-01-29 2021-01-26 21:34:36.566023+00:00 1 3
7891 2021-08-02 2021-09-16 2022-10-26 19:49:14.135585+00:00 0 0
1245 2021-09-13 None 2022-10-31 02:03:59.620348+00:00 0 null
9073 None None 2021-09-12 10:25:30.845687+00:00 null null
6891 2021-08-03 2021-09-17 null null null
My code:
select * from my_table
first_created-submitted_time as create_duration
last_paid-submitted_time as paid_duration
I have to say I'm too bad at sql,I have no idea how to continue,any friend can help ?

Pandas - Mapping two Dataframe based on date ranges

I am trying to categorise users based on their lifecycle. Given below Pandas dataframe shows the number of times a customer raised a ticket depending on how long they have used the product.
master dataframe
cust_id,start_date,end_date
101,02/01/2019,12/01/2019
101,14/02/2019,24/04/2019
101,27/04/2019,02/05/2019
102,25/01/2019,02/02/2019
103,02/01/2019,22/01/2019
Master lookup table
start_date,end_date,project_name
01/01/2019,13/01/2019,project_a
14/01/2019,13/02/2019,project_b
15/02/2019,13/03/2019,project_c
14/03/2019,13/06/2019,project_d
I am trying to map the above two data frames such that I am able to add project_name to the master dataframe
Expected output:
cust_id,start_date,end_date,project_name
101,02/01/2019,12/01/2019,project_a
101,14/02/2019,24/04/2019,project_c
101,14/02/2019,24/04/2019,project_d
101,27/04/2019,02/05/2019,project_d
102,25/01/2019,02/02/2019,project_b
103,02/01/2019,22/01/2019,project_a
103,02/01/2019,22/01/2019,project_b
I do expect duplicate rows in the final output as a single row in the master dataframe would fall under multiple rows of master lookup table
I think you need:
df = df1.assign(a=1).merge(df2.assign(a=1), on='a')
m1 = df['start_date_y'].between(df['start_date_x'], df['end_date_x'])
m2 = df['end_date_y'].between(df['start_date_x'], df['end_date_x'])
df = df[m1 | m2]
print (df)
cust_id start_date_x end_date_x a start_date_y end_date_y project_name
1 101 2019-02-01 2019-12-01 1 2019-01-14 2019-02-13 project_b
2 101 2019-02-01 2019-12-01 1 2019-02-15 2019-03-13 project_c
3 101 2019-02-01 2019-12-01 1 2019-03-14 2019-06-13 project_d
6 101 2019-02-14 2019-04-24 1 2019-02-15 2019-03-13 project_c
7 101 2019-02-14 2019-04-24 1 2019-03-14 2019-06-13 project_d

Pandas - group by to return first occurence and there on every third occurence of a value

I am trying to filter records from a Dataframe based on their occurence. I am trying to filter out the first occurence and then on every third occurence based on emp_id. Given below is how my Dataframe is.
emp_id,date,value
101,2018-12-01,10001
101,2018-12-03,10002
101,2018-12-05,10003
101,2018-12-13,10004
In the above sample, expected output is :
emp_id,date,value
101,2018-12-01,10001
101,2018-12-13,10004
Given below is the code I have built this far:
df['emp_id'] = df.groupby('emp_id').cumcount()+1
df['emp_id'] = np.where((df['emp_id']%3)==0,1,0)
This however returns back 2nd occurence and every third occurrence after that. How could I modify such that it returns back the first occurence and then on every third occurence based on emp_id
I think you need boolean indexing with check 0 or 1, assign to column is not necessary, is possible create helper Series s:
print (df)
emp_id date value
0 101 2018-12-01 10001
1 101 2018-12-03 10002
2 101 2018-12-05 10003
3 101 2018-12-13 10004
4 101 2018-12-01 10005
5 101 2018-12-03 10006
6 101 2018-12-05 10007
7 101 2018-12-13 10008
s = df.groupby('emp_id').cumcount()
df['check'] = (s % 3) == 0
Alternative:
s = df.groupby('emp_id').cumcount() + 1
df['check'] = (s % 3) == 1
print (df)
emp_id date value check
0 101 2018-12-01 10001 True
1 101 2018-12-03 10002 False
2 101 2018-12-05 10003 False
3 101 2018-12-13 10004 True
4 101 2018-12-01 10005 False
5 101 2018-12-03 10006 False
6 101 2018-12-05 10007 True
7 101 2018-12-13 10008 False

Pandas/SQL join

I would like to add some data (event_date) from table B to table A, as described below. It looks like a join on event_id, however this column contains duplicate values in both tables. There are more columns in both tables but I'm omitting them for clarity.
How to achieve the desired effect in Pandas and in SQL in the most direct way?
Table A:
id,event_id
1,123
2,123
3,456
4,456
5,456
Table B:
id,event_id,event_date
11,123,2017-02-06
12,456,2017-02-07
13,123,2017-02-06
14,456,2017-02-07
15,123,2017-02-06
16,123,2017-02-06
Desired outcome (table A + event_date):
id,event_id,event_date
1,123,2017-02-06
2,123,2017-02-06
3,456,2017-02-07
4,456,2017-02-07
5,456,2017-02-07
Using merge, first drop duplicates from B
In [662]: A.merge(B[['event_id', 'event_date']].drop_duplicates())
Out[662]:
id event_id event_date
0 1 123 2017-02-06
1 2 123 2017-02-06
2 3 456 2017-02-07
3 4 456 2017-02-07
4 5 456 2017-02-07
SQL part:
select distinct a.*, b.event_date
from table_a a
join table_b b
on a.event_id = b.event_id;
You can use Pandas Merge to get the desired result. Finally get only the columns that you are interested from DataFrame
df_Final = pd.merge(df1,df2,on='event_id',how='left')
print df_Final[['id_y','event_id','event_date']]
output
id_y event_id event_date
0 1 123 2017-02-06
1 2 123 2017-02-06
2 3 456 2017-02-07
3 4 456 2017-02-07
4 5 456 2017-02-07
5 1 123 2017-02-06
6 2 123 2017-02-06
7 3 456 2017-02-07
8 4 456 2017-02-07
9 5 456 2017-02-07
10 1 123 2017-02-06
11 2 123 2017-02-06
12 1 123 2017-02-06
13 2 123 2017-02-06

Find mean of difference between dates -Sql Select

I have a table with below details
Repid | buildDate | BuildVersion
---------------------------------
1 2013-11-15 10:41:00 1683
1 2013-11-15 11:10:00 1684
1 2013-11-15 12:14:00 1685
2 2013-11-15 10:41:00 1688
2 2013-11-15 11:10:00 1689
2 2013-11-15 12:14:00 1690
for each Repid, i need to find the average of difference in hours between successive build versions.
select b1.RepId
, avg(abs(datediff(hour, b1.buildDate, b2.buildDate)))
from builds b1
join builds b2
on b1.BuildVersion = b2.BuildVersion + 1
and b1.Repid = b2.Repid
group by
b1.RepId
Live example at SQL Fiddle.