pandas: retain values when merging dataframes with different frequencies

I need to merge two dataframes with different frequencies (daily and weekly), but I would like to retain the weekly values when merging them into the daily dataframe.
There is a grouping variable, group, in the data.
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta

daily = {'date': [datetime.date(2022, 1, 1) + relativedelta(day=i) for i in range(1, 10)] * 2,
         'group': ['A' for x in range(1, 10)] + ['B' for x in range(1, 10)],
         'daily_value': [x for x in range(1, 10)] * 2}
weekly = {'date': [datetime.date(2022, 1, 1), datetime.date(2022, 1, 7)] * 2,
          'group': ['A', 'A'] + ['B', 'B'],
          'weekly_value': [100, 200, 300, 400]}
daily_data = pd.DataFrame(daily)
weekly_data = pd.DataFrame(weekly)
daily_data output:
date group daily_value
0 2022-01-01 A 1
1 2022-01-02 A 2
2 2022-01-03 A 3
3 2022-01-04 A 4
4 2022-01-05 A 5
5 2022-01-06 A 6
6 2022-01-07 A 7
7 2022-01-08 A 8
8 2022-01-09 A 9
9 2022-01-01 B 1
10 2022-01-02 B 2
11 2022-01-03 B 3
12 2022-01-04 B 4
13 2022-01-05 B 5
14 2022-01-06 B 6
15 2022-01-07 B 7
16 2022-01-08 B 8
17 2022-01-09 B 9
weekly_data output:
date group weekly_value
0 2022-01-01 A 100
1 2022-01-07 A 200
2 2022-01-01 B 300
3 2022-01-07 B 400
The desired output:
desired = {'date': [datetime.date(2022, 1, 1) + relativedelta(day=i) for i in range(1, 10)] * 2,
           'group': ['A' for x in range(1, 10)] + ['B' for x in range(1, 10)],
           'daily_value': [x for x in range(1, 10)] * 2,
           'weekly_value': [100] * 6 + [200] * 3 + [300] * 6 + [400] * 3}
desired_data = pd.DataFrame(desired)
desired_data output:
date group daily_value weekly_value
0 2022-01-01 A 1 100
1 2022-01-02 A 2 100
2 2022-01-03 A 3 100
3 2022-01-04 A 4 100
4 2022-01-05 A 5 100
5 2022-01-06 A 6 100
6 2022-01-07 A 7 200
7 2022-01-08 A 8 200
8 2022-01-09 A 9 200
9 2022-01-01 B 1 300
10 2022-01-02 B 2 300
11 2022-01-03 B 3 300
12 2022-01-04 B 4 300
13 2022-01-05 B 5 300
14 2022-01-06 B 6 300
15 2022-01-07 B 7 400
16 2022-01-08 B 8 400
17 2022-01-09 B 9 400

Use merge_asof: convert the dates to datetimes, sort both frames by date for the merge, then sort by both columns to restore the original order. By default merge_asof matches each daily row to the most recent weekly date at or before it (direction='backward'), which is exactly the carry-forward behaviour you want:
daily_data['date'] = pd.to_datetime(daily_data['date'])
weekly_data['date'] = pd.to_datetime(weekly_data['date'])

df = (pd.merge_asof(daily_data.sort_values('date'),
                    weekly_data.sort_values('date'),
                    on='date',
                    by='group')
        .sort_values(['group', 'date'], ignore_index=True))
print(df)
date group daily_value weekly_value
0 2022-01-01 A 1 100
1 2022-01-02 A 2 100
2 2022-01-03 A 3 100
3 2022-01-04 A 4 100
4 2022-01-05 A 5 100
5 2022-01-06 A 6 100
6 2022-01-07 A 7 200
7 2022-01-08 A 8 200
8 2022-01-09 A 9 200
9 2022-01-01 B 1 300
10 2022-01-02 B 2 300
11 2022-01-03 B 3 300
12 2022-01-04 B 4 300
13 2022-01-05 B 5 300
14 2022-01-06 B 6 300
15 2022-01-07 B 7 400
16 2022-01-08 B 8 400
17 2022-01-09 B 9 400
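As a side note, not part of the original answer: because every weekly date also appears in the daily data here, a plain left merge followed by a per-group forward fill gives the same result. A minimal sketch reusing the daily_data and weekly_data frames from above:

df2 = daily_data.merge(weekly_data, on=['group', 'date'], how='left')
df2 = df2.sort_values(['group', 'date'], ignore_index=True)
# carry each weekly value forward within its group until the next weekly date appears
df2['weekly_value'] = df2.groupby('group')['weekly_value'].ffill()
print(df2)   # same values as above; weekly_value becomes float because of the intermediate NaNs

merge_asof avoids that assumption, since it matches on the nearest earlier date even when the weekly dates are missing from the daily frame.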

Related

standard sql: Get customer count and first purchase date per customer and store_id

I use standard SQL and need a query that gets the total count of purchases per customer for each store_id, and also the first purchase date per customer for each store_id.
I have a table with this structure:
customer_id  store_id  product_no  customer_no  purchase_date  price
1            10        100         200          2022-01-01     50
1            10        110         200          2022-01-02     70
1            20        120         200          2022-01-02     60
1            20        130         200          2022-01-02     40
1            30        140         200          2022-01-02     60
Current query:
SELECT
  customer_id,
  store_id,
  product_no,
  customer_no,
  purchase_date,
  price,
  FIRST_VALUE(purchase_date) OVER (PARTITION BY customer_no ORDER BY purchase_date) AS first_purchase_date,
  COUNT(customer_no) OVER (PARTITION BY customer_id, store_id, customer_no) AS customer_purchase_count
FROM my_table
This gives me this type of output:
customer_id  store_id  product_no  customer_no  purchase_date  price  first_purchase_date  customer_purchase_count
1            10        100         200          2022-01-01     50     2022-01-01           2
1            10        110         200          2022-01-02     70     2022-01-01           2
1            20        120         210          2022-01-02     60     2022-01-02           2
1            20        130         210          2022-01-02     40     2022-01-02           2
1            30        140         220          2022-01-10     60     2022-01-10           3
1            10        140         220          2022-01-10     60     2022-01-10           3
1            10        140         220          2022-01-10     60     2022-01-10           3
1            10        150         220          2022-01-10     60     2022-01-10           1
However, I want it to look like the table below in its final form. How can I achieve that? If possible, I would also like to add four columns called "only_in_store_10", "only_in_store_20", "only_in_store_30" and "only_in_store_40" for all customer_no that only shopped at that store. It should mark a ○ on each row of each customer_no that satisfies the condition.
customer_id  store_id  product_no  customer_no  purchase_date  price  first_purchase_date  customer_purchase_count  first_purchase_date_per_store  first_purchase_date_per_store  store_row_nr
1            10        100         200          2022-01-01     50     2022-01-01           2                        2022-01-01                     1                              1
1            10        110         200          2022-01-02     70     2022-01-01           2                        2022-01-02                     1                              1
1            20        120         210          2022-01-02     60     2022-01-02           2                        2022-01-02                     2                              1
1            20        130         210          2022-01-03     40     2022-01-02           2                        2022-01-02                     2                              1
1            30        140         220          2022-01-10     60     2022-01-10           3                        2022-01-10                     1                              1
1            10        140         220          2022-01-11     50     2022-01-11           3                        2022-01-11                     2                              1
1            10        140         220          2022-01-12     40     2022-01-11           3                        2022-01-11                     2                              2
1            10        150         220          2022-01-13     60     2022-01-13           1                        2022-01-13                     1                              1

Get Data in a row with specific values

I have a series of data like the example below:
Customer  Date        Value
a         2022-01-02  100
a         2022-01-03  100
a         2022-01-04  100
a         2022-01-05  100
a         2022-01-06  100
b         2022-01-02  100
b         2022-01-03  100
b         2022-01-04  100
b         2022-01-05  100
b         2022-01-06  090
b         2022-01-07  100
c         2022-02-03  100
c         2022-02-04  100
c         2022-02-05  100
c         2022-02-06  100
c         2022-02-07  100
d         2022-04-10  100
d         2022-04-11  100
d         2022-04-12  100
d         2022-04-13  100
d         2022-04-14  100
d         2022-04-15  090
e         2022-04-10  100
e         2022-04-11  100
e         2022-04-12  080
e         2022-04-13  070
e         2022-04-14  100
e         2022-04-15  100
The result I want is customers a, c and d only, because a, c and d have value 100 for 5 days in a row.
The start date of each customer is different.
What is the query I need to write in BigQuery for the case above?
Thank you so much.
Would you consider the query below?
SELECT DISTINCT Customer
FROM sample_table
QUALIFY 5 = COUNTIF(Value = 100) OVER (
PARTITION BY Customer ORDER BY UNIX_DATE(Date) RANGE BETWEEN 4 PRECEDING AND CURRENT ROW
);
+-----+----------+
| Row | Customer |
+-----+----------+
| 1 | a |
| 2 | c |
| 3 | d |
+-----+----------+
Note that it assumes the Date column has DATE type.
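As a rough cross-reference to the pandas questions on this page (not part of the original answer), the same trailing 5-calendar-day check can be sketched with a date-based rolling window. The column names come from the question's table, and the data below is only a small hand-built subset so the sketch runs on its own:

import pandas as pd

# small subset of the question's data, just enough to make the sketch runnable
df = pd.DataFrame({
    'Customer': ['a'] * 5 + ['b'] * 6,
    'Date': ['2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06',
             '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07'],
    'Value': [100] * 5 + [100, 100, 100, 100, 90, 100],
})
df['Date'] = pd.to_datetime(df['Date'])
df['is_100'] = df['Value'].eq(100).astype(int)

def has_five_day_streak(g):
    # count the Value == 100 rows inside each trailing 5-calendar-day window
    return g.rolling('5D', on='Date')['is_100'].sum().ge(5).any()

streaks = (df.sort_values(['Customer', 'Date'])
             .groupby('Customer')
             .apply(has_five_day_streak))
print(streaks[streaks].index.tolist())   # ['a'] -- 'b' is broken by the 90 on 2022-01-06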

How to write the Pivot clause for this specific table?

I am using SQL Server 2014. Below is an extract of my table (t1):
Name RoomType Los RO BB HB FB StartDate EndDate CaptureDate
A DLX 7 0 0 154 200 2022-01-01 2022-01-07 2021-12-31
B SUP 7 110 0 0 0 2022-01-01 2022-01-07 2021-12-31
C COS 7 0 0 200 139 2022-01-01 2022-01-07 2021-12-31
D STD 7 0 75 0 500 2022-01-01 2022-01-07 2021-12-31
I need a Pivot query to convert the above table into the following output:
Name RoomType Los MealPlan Price StartDate EndDate CaptureDate
A DLX 7 RO 0 2022-01-01 2022-01-07 2021-12-31
A DLX 7 BB 0 2022-01-01 2022-01-07 2021-12-31
A DLX 7 HB 154 2022-01-01 2022-01-07 2021-12-31
A DLX 7 FB 200 2022-01-01 2022-01-07 2021-12-31
B SUP 7 RO 110 2022-01-01 2022-01-07 2021-12-31
B SUP 7 BB 0 2022-01-01 2022-01-07 2021-12-31
B SUP 7 HB 0 2022-01-01 2022-01-07 2021-12-31
B SUP 7 FB 0 2022-01-01 2022-01-07 2021-12-31
C COS 7 RO 0 2022-01-01 2022-01-07 2021-12-31
C COS 7 BB 0 2022-01-01 2022-01-07 2021-12-31
C COS 7 HB 200 2022-01-01 2022-01-07 2021-12-31
C COS 7 FB 139 2022-01-01 2022-01-07 2021-12-31
D STD 7 RO 0 2022-01-01 2022-01-07 2021-12-31
D STD 7 BB 75 2022-01-01 2022-01-07 2021-12-31
D STD 7 HB 0 2022-01-01 2022-01-07 2021-12-31
D STD 7 FB 500 2022-01-01 2022-01-07 2021-12-31
I had a look at the following article but it does not seem to address my problem:
SQL Server Pivot Clause
I did some further research but did not land on any site that provided a solution to this problem.
Any help would be highly appreciated.
You actually want an UNPIVOT here (comparison docs).
SELECT Name, RoomType, Los, MealPlan, Price,
       StartDate, EndDate, CaptureDate
FROM dbo.t1
UNPIVOT (Price FOR MealPlan IN ([RO],[BB],[HB],[FB])) AS u;
Example db<>fiddle
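Purely as an aside for readers doing the same reshape in pandas rather than SQL Server (not part of the original answer), UNPIVOT corresponds roughly to melt. The DataFrame below is a hypothetical recreation of two rows of t1, just to make the sketch self-contained:

import pandas as pd

# hypothetical recreation of two of the question's rows
t1 = pd.DataFrame({
    'Name': ['A', 'B'], 'RoomType': ['DLX', 'SUP'], 'Los': [7, 7],
    'RO': [0, 110], 'BB': [0, 0], 'HB': [154, 0], 'FB': [200, 0],
    'StartDate': ['2022-01-01'] * 2, 'EndDate': ['2022-01-07'] * 2,
    'CaptureDate': ['2021-12-31'] * 2,
})

# melt is pandas' unpivot: the four meal-plan columns collapse into MealPlan/Price pairs
out = t1.melt(id_vars=['Name', 'RoomType', 'Los', 'StartDate', 'EndDate', 'CaptureDate'],
              value_vars=['RO', 'BB', 'HB', 'FB'],
              var_name='MealPlan', value_name='Price')
print(out.sort_values('Name', ignore_index=True))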

Groupby filter based on count, calculate duration, penultimate status

I have a dataframe as shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
21 3 M 2019-05-20 200
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
28 5 M 2018-10-10 200
29 5 F 2019-06-10 500
30 6 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
where
F = Failure
M = Maintenance
P = Planned
Step 1 - Select the data of IDs which have at least two statuses (F, M or P) before the last Failure.
Step 2 - Ignore the rows if the last row per ID is not F. The expected output after this is shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
Now, for each ID the last status is Failure.
Then, from the above df, I would like to prepare the DataFrame below:
ID No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
1 3 2 2 P 487 151
2 3 3 2 M 487 61
3 3 2 2 P 640 90
4 3 1 1 M 518 151
7 2 1 1 M 518 151
SLS = Second Last Status
LS = Last Status
I tried the following code to calculate the duration.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
We can create a mask with groupby + bfill that allows us to perform both selections.
import numpy as np

# m is True for every row up to and including each ID's last 'F', and NaN afterwards
m = df.Status.eq('F').replace(False, np.nan).groupby(df.ID).bfill()
# keep IDs with more than two such rows, and drop the rows after the last failure
df = df.loc[m.groupby(df.ID).transform('sum').gt(2) & m]
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
The second part is a bit more annoying. There's almost certainly a smarter way to do this, but here's the straightforward way:
s = df.Date.diff().dt.days
res = pd.concat([df.groupby('ID').Status.value_counts().unstack().add_prefix('No_of_'),
                 df.groupby('ID').Status.apply(lambda x: x.iloc[-2]).to_frame('SLS'),
                 (s.where(s.gt(0)).groupby(df.ID).apply(lambda x: x.cumsum().iloc[-2])
                   .to_frame('NoDays_to_SLS')),
                 s.groupby(df.ID).apply(lambda x: x.iloc[-1]).to_frame('NoDays_SLS_to_LS')],
                axis=1)
Output:
No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
ID
1 3 2 2 P 487.0 151.0
2 3 3 1 M 487.0 61.0
3 3 2 2 P 640.0 90.0
4 3 2 1 M 518.0 151.0
7 2 2 1 M 518.0 151.0
Here's my attempt (note: I am using pandas 0.25):
df = pd.read_clipboard()          # read the question's table from the clipboard
df['Date'] = pd.to_datetime(df['Date'])

# keep only the rows up to and including each ID's last failure, then require more than two of them
df_1 = df.groupby('ID', group_keys=False)\
         .apply(lambda x: x[(x['Status'] == 'F')[::-1].cumsum().astype(bool)])
df_2 = df_1[df_1.groupby('ID')['Status'].transform('count') > 2]

g = df_2.groupby('ID')
df_Counts = g['Status'].value_counts().unstack().add_prefix('No_of_')
df_SLS = g['Status'].agg(lambda x: x.iloc[-2]).rename('SLS')
df_dates = g['Date'].agg(NoDays_to_SLS=lambda x: x.iloc[-2] - x.iloc[0],
                         NoDays_to_SLS_LS=lambda x: x.iloc[-1] - x.iloc[-2])
pd.concat([df_Counts, df_SLS, df_dates], axis=1).reset_index()
Output:
ID No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_to_SLS_LS
0 1 3 2 2 P 487 days 151 days
1 2 3 3 1 M 487 days 61 days
2 3 3 2 2 P 640 days 90 days
3 4 3 2 1 M 518 days 151 days
4 7 2 2 1 M 518 days 151 days
There are some enhancements in 0.25 that this code uses (in particular, named aggregation in .agg).

Pandas df row count

Date Ct
0 2015-04-01 1
1 2015-04-01 2
2 2015-04-01 3
3 2015-04-01 4
4 2015-04-02 1
5 2015-04-02 2
6 2015-04-02 3
7 2015-04-02 4
8 2015-04-03 1
9 2015-04-03 2
10 2015-04-03 3
11 2015-04-03 4
12 2015-04-04 1
13 2015-04-04 2
14 2015-04-04 3
15 2015-04-04 4
I have a string column 'Date' and I would like to create the 'Ct' column, as represented above, to maintain a count of the rows for a certain date. Date needs to be a string in my application, there will not always be an equal number of rows for each date, and 'Ct' should always count in the order of the numerical index. An answer or a nudge in the right direction would be greatly appreciated.
OK, this is a little weird, but you can add a temp column and set its value to 1:
df['temp'] = 1
You can then perform a groupby on 'Date' and call transform on the 'temp' column to perform the count:
In [80]:
df['Ct'] = df.groupby('Date')['temp'].transform(pd.Series.cumsum)
df
Out[80]:
Date temp Ct
0 2015-04-01 1 1
1 2015-04-01 1 2
2 2015-04-01 1 3
3 2015-04-01 1 4
4 2015-04-02 1 1
5 2015-04-02 1 2
6 2015-04-02 1 3
7 2015-04-02 1 4
8 2015-04-03 1 1
9 2015-04-03 1 2
10 2015-04-03 1 3
11 2015-04-03 1 4
12 2015-04-04 1 1
13 2015-04-04 1 2
14 2015-04-04 1 3
15 2015-04-04 1 4
In [81]:
df.drop('temp',axis=1,inplace=True)
df
Out[81]:
Date Ct
0 2015-04-01 1
1 2015-04-01 2
2 2015-04-01 3
3 2015-04-01 4
4 2015-04-02 1
5 2015-04-02 2
6 2015-04-02 3
7 2015-04-02 4
8 2015-04-03 1
9 2015-04-03 2
10 2015-04-03 3
11 2015-04-03 4
12 2015-04-04 1
13 2015-04-04 2
14 2015-04-04 3
15 2015-04-04 4
The reason we can't just perform the cumsum on the 'Date' column itself is that if it's a string, your date strings would be concatenated with each other, which is not what you want.
EDIT
Thanks to the master @Jeff for pointing out that the temp column is unnecessary and you can just use cumcount:
In [150]:
df['Ct'] = df.groupby('Date').cumcount() + 1
df
Out[150]:
Date Ct
0 2015-04-01 1
1 2015-04-01 2
2 2015-04-01 3
3 2015-04-01 4
4 2015-04-02 1
5 2015-04-02 2
6 2015-04-02 3
7 2015-04-02 4
8 2015-04-03 1
9 2015-04-03 2
10 2015-04-03 3
11 2015-04-03 4
12 2015-04-04 1
13 2015-04-04 2
14 2015-04-04 3
15 2015-04-04 4
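For completeness, a minimal self-contained version of the cumcount approach; the data below is just a small hand-built sample in the question's layout, with unequal group sizes to show the count restarting per date:

import pandas as pd

# 'Date' stays a plain string, as the question requires, and the groups need not be equal in size
df = pd.DataFrame({'Date': ['2015-04-01'] * 4 + ['2015-04-02'] * 3 + ['2015-04-03'] * 2})

# cumcount numbers the rows within each date starting at 0, so add 1
df['Ct'] = df.groupby('Date').cumcount() + 1
print(df)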