Get Data in a row with specific values - google-bigquery

I Have a series of data like example below:
Customer
Date
Value
a
2022-01-02
100
a
2022-01-03
100
a
2022-01-04
100
a
2022-01-05
100
a
2022-01-06
100
b
2022-01-02
100
b
2022-01-03
100
b
2022-01-04
100
b
2022-01-05
100
b
2022-01-06
090
b
2022-01-07
100
c
2022-02-03
100
c
2022-02-04
100
c
2022-02-05
100
c
2022-02-06
100
c
2022-02-07
100
d
2022-04-10
100
d
2022-04-11
100
d
2022-04-12
100
d
2022-04-13
100
d
2022-04-14
100
d
2022-04-15
090
e
2022-04-10
100
e
2022-04-11
100
e
2022-04-12
080
e
2022-04-13
070
e
2022-04-14
100
e
2022-04-15
100
The result I want are customer A,C and D only. Because A, C and D have value 100 for 5 days in a row.
The start date of each customer is different.
What is the query in BigQuery I need to write for that case above?
Thank you so much

Would you consider below query ?
SELECT DISTINCT Customer
FROM sample_table
QUALIFY 5 = COUNTIF(Value = 100) OVER (
PARTITION BY Customer ORDER BY UNIX_DATE(Date) RANGE BETWEEN 4 PRECEDING AND CURRENT ROW
);
+-----+----------+
| Row | Customer |
+-----+----------+
| 1 | a |
| 2 | c |
| 3 | d |
+-----+----------+
Note that it assumes Date column has DATE type.

Related

standard sql: Get customer count and first purchase date per customer and store_id

I use standard sql and I need a query that gets the total count of purchases per customer, for each store_id. And also the first purchase date per customer, for each store_id.
I have a table with this structure:
customer_id
store_id
product_no
customer_no
purchase_date
price
1
10
100
200
2022-01-01
50
1
10
110
200
2022-01-02
70
1
20
120
200
2022-01-02
60
1
20
130
200
2022-01-02
40
1
30
140
200
2022-01-02
60
Current query:
Select
customer_id,
store_id,
product_id,
product_no,
customer_no,
purchase_date,
Price,
first_value(purchase_date) over (partition_by customer_no order by purchase_date) as first_purhcase_date,
count(customer_no) over (partition by customer_id, store_id, customer_no) as customer_purchase_count)
From my_table
This gives me this type of output:
customer_id
store_id
product_no
customer_no
purchase_date
price
first_purchase_date
customer_purchase_count
1
10
100
200
2022-01-01
50
2022-01-01
2
1
10
110
200
2022-01-02
70
2022-01-01
2
1
20
120
210
2022-01-02
60
2022-01-02
2
1
20
130
210
2022-01-02
40
2022-01-02
2
1
30
140
220
2022-01-10
60
2022-01-10
3
1
10
140
220
2022-01-10
60
2022-01-10
3
1
10
140
220
2022-01-10
60
2022-01-10
3
1
10
150
220
2022-01-10
60
2022-01-10
1
However, I want it to look like the table below in its final form. How can I achieve that? If possible I would also like to add 4 colums called "only_in_store_10","only_in_store_20","only_in_store_30","only_in_store_40" for all customer_no that only shopped at that store. It should mark with at ○ on each row of each customer_no that satisfies the condition.
customer_id
store_id
product_no
customer_no
purchase_date
price
first_purchase_date
customer_purchase_count
first_purchase_date_per_store
first_purchase_date_per_store
store_row_nr
1
10
100
200
2022-01-01
50
2022-01-01
2
2022-01-01
1
1
1
10
110
200
2022-01-02
70
2022-01-01
2
2022-01-02
1
1
1
20
120
210
2022-01-02
60
2022-01-02
2
2022-01-02
2
1
1
20
130
210
2022-01-03
40
2022-01-02
2
2022-01-02
2
1
1
30
140
220
2022-01-10
60
2022-01-10
3
2022-01-10
1
1
1
10
140
220
2022-01-11
50
2022-01-11
3
2022-01-11
2
1
1
10
140
220
2022-01-12
40
2022-01-11
3
2022-01-11
2
2
1
10
150
220
2022-01-13
60
2022-01-13
1
2022-01-13
1
1

How to write the Pivot clause for this specific table?

I am using SQL Server 2014. Below is an extract of my table (t1):
Name RoomType Los RO BB HB FB StartDate EndDate CaptureDate
A DLX 7 0 0 154 200 2022-01-01 2022-01-07 2021-12-31
B SUP 7 110 0 0 0 2022-01-01 2022-01-07 2021-12-31
C COS 7 0 0 200 139 2022-01-01 2022-01-07 2021-12-31
D STD 7 0 75 0 500 2022-01-01 2022-01-07 2021-12-31
I need a Pivot query to convert the above table into the following output:
Name RoomType Los MealPlan Price StartDate EndDate CaptureDate
A DLX 7 RO 0 2022-01-01 2022-01-07 2021-12-31
A DLX 7 BB 0 2022-01-01 2022-01-07 2021-12-31
A DLX 7 HB 154 2022-01-01 2022-01-07 2021-12-31
A DLX 7 FB 200 2022-01-01 2022-01-07 2021-12-31
B SUP 7 RO 110 2022-01-01 2022-01-07 2021-12-31
B SUP 7 BB 0 2022-01-01 2022-01-07 2021-12-31
B SUP 7 HB 0 2022-01-01 2022-01-07 2021-12-31
B SUP 7 FB 0 2022-01-01 2022-01-07 2021-12-31
C COS 7 RO 0 2022-01-01 2022-01-07 2021-12-31
C COS 7 BB 0 2022-01-01 2022-01-07 2021-12-31
C COS 7 HB 200 2022-01-01 2022-01-07 2021-12-31
C COS 7 FB 139 2022-01-01 2022-01-07 2021-12-31
D STD 7 RO 0 2022-01-01 2022-01-07 2021-12-31
D STD 7 BB 75 2022-01-01 2022-01-07 2021-12-31
D STD 7 HB 0 2022-01-01 2022-01-07 2021-12-31
D STD 7 FB 500 2022-01-01 2022-01-07 2021-12-31
I had a look at the following article but it does not seem to address my problem:
SQL Server Pivot Clause
I am did some further research but I did not land on any site that provided a solution to this problem.
Any help would be highly appreciated.
You actually want an UNPIVOT here (comparison docs).
SELECT Name, RoomType, Los, MealPlan, Price,
StartDate, EndDate, CaptureDate
FROM dbo.t1
UNPIVOT (Price FOR MealPlan IN ([RO],[BB],[HB],[FB])) AS u;
Example db<>fiddle

pandas retain values on different index dataframes

I need to merge two dataframes with different frequencies (daily to weekly). However, would like to retain the weekly values when merging to the daily dataframe.
There is a grouping variable in the data, group.
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
daily={'date':[datetime.date(2022,1,1)+relativedelta(day=i) for i in range(1,10)]*2,
'group':['A' for x in range(1,10)]+['B' for x in range(1,10)],
'daily_value':[x for x in range(1,10)]*2}
weekly={'date':[datetime.date(2022,1,1),datetime.date(2022,1,7)]*2,
'group':['A','A']+['B','B'],
'weekly_value':[100,200,300,400]}
daily_data=pd.DataFrame(daily)
weekly_data=pd.DataFrame(weekly)
daily_data output:
date group daily_value
0 2022-01-01 A 1
1 2022-01-02 A 2
2 2022-01-03 A 3
3 2022-01-04 A 4
4 2022-01-05 A 5
5 2022-01-06 A 6
6 2022-01-07 A 7
7 2022-01-08 A 8
8 2022-01-09 A 9
9 2022-01-01 B 1
10 2022-01-02 B 2
11 2022-01-03 B 3
12 2022-01-04 B 4
13 2022-01-05 B 5
14 2022-01-06 B 6
15 2022-01-07 B 7
16 2022-01-08 B 8
17 2022-01-09 B 9
weekly_data output:
date group weekly_value
0 2022-01-01 A 100
1 2022-01-07 A 200
2 2022-01-01 B 300
3 2022-01-07 B 400
The desired output
desired={'date':[datetime.date(2022,1,1)+relativedelta(day=i) for i in range(1,10)]*2,
'group':['A' for x in range(1,10)]+['B' for x in range(1,10)],
'daily_value':[x for x in range(1,10)]*2,
'weekly_value':[100]*6+[200]*3+[300]*6+[400]*3}
desired_data=pd.DataFrame(desired)
desired_data output:
date group daily_value weekly_value
0 2022-01-01 A 1 100
1 2022-01-02 A 2 100
2 2022-01-03 A 3 100
3 2022-01-04 A 4 100
4 2022-01-05 A 5 100
5 2022-01-06 A 6 100
6 2022-01-07 A 7 200
7 2022-01-08 A 8 200
8 2022-01-09 A 9 200
9 2022-01-01 B 1 300
10 2022-01-02 B 2 300
11 2022-01-03 B 3 300
12 2022-01-04 B 4 300
13 2022-01-05 B 5 300
14 2022-01-06 B 6 300
15 2022-01-07 B 7 400
16 2022-01-08 B 8 400
17 2022-01-09 B 9 400
Use merge_asof with sorting values by datetimes, last sorting like original by both columns:
daily_data['date'] = pd.to_datetime(daily_data['date'])
weekly_data['date'] = pd.to_datetime(weekly_data['date'])
df = (pd.merge_asof(daily_data.sort_values('date'),
weekly_data.sort_values('date'),
on='date',
by='group').sort_values(['group','date'], ignore_index=True))
print (df)
date group daily_value weekly_value
0 2022-01-01 A 1 100
1 2022-01-02 A 2 100
2 2022-01-03 A 3 100
3 2022-01-04 A 4 100
4 2022-01-05 A 5 100
5 2022-01-06 A 6 100
6 2022-01-07 A 7 200
7 2022-01-08 A 8 200
8 2022-01-09 A 9 200
9 2022-01-01 B 1 300
10 2022-01-02 B 2 300
11 2022-01-03 B 3 300
12 2022-01-04 B 4 300
13 2022-01-05 B 5 300
14 2022-01-06 B 6 300
15 2022-01-07 B 7 400
16 2022-01-08 B 8 400
17 2022-01-09 B 9 400

Pandas: Groupby multiple columns, finding the max value and keep other columns in dataframe

I am looking to group multiple columns in a dataframe, keep only the max Value, and keeping the corresponding date column
Below is how the dataframe looks like:
Index
Site
Device Type
Value
Time
0
AAA
A
10
2021-02-02 01:30:00
1
AAA
A
5
2021-02-02 01:35:00
2
AAA
B
2
2021-02-02 01:40:00
3
BBB
C
3
2021-02-02 02:00:00
4
BBB
C
11
2021-02-02 02:05:00
5
BBB
C
20
2021-02-02 02:10:00
6
BBB
D
30
2021-02-02 04:00:00
I am trying to get the following output:
Index
Site
Device Type
Value
Time
0
AAA
A
10
2021-02-02 01:30:00
1
AAA
B
2
2021-02-02 01:40:00
2
BBB
C
20
2021-02-02 02:10:00
3
BBB
D
30
2021-02-02 04:00:00
When I try the following groupby, the Time column drops:
df_max = df.groupby(['Site','Device Type'],as_index=False)['Value'].max()
I am looking to Keep the Time value corresponding to the maximum value found.
Thank you
You are very close. try using idxmax and show rows at that location:
df.loc[df.groupby(['Site','Device Type'])['Value'].idxmax()].reset_index(drop=True)
Index Site Device Type Value Time
0 0 AAA A 10 2021-02-02 01:30:00
1 2 AAA B 2 2021-02-02 01:40:00
2 5 BBB C 20 2021-02-02 02:10:00
3 6 BBB D 30 2021-02-02 04:00:00

How to restrict the upper limit of rows while doing join in SQL?

I have two tables: balance and calendar.
Balance :
Account Date Balance
1111 01/01/2014 100
1111 02/01/2014 156
1111 03/01/2014 300
1111 04/01/2014 300
1111 07/01/2014 468
1112 02/01/2014 300
1112 03/01/2014 300
1112 06/01/2014 300
1112 07/01/2014 350
1112 08/01/2014 400
1112 09/01/2014 450
1113 01/01/2014 30
1113 02/01/2014 40
1113 03/01/2014 45
1113 06/01/2014 45
1113 07/01/2014 60
1113 08/01/2014 50
1113 09/01/2014 20
1113 10/01/2014 10
Calendar
date business_day_ind
01/01/2014 N
02/01/2014 Y
03/01/2014 Y
04/01/2014 N
05/01/2014 N
06/01/2014 Y
07/01/2014 Y
08/01/2014 Y
09/01/2014 Y
10/01/2014 Y
I need to do the following:
I need to fill in the missing days for all the accounts up to the maximum day for which it has value. Say for account 1111, it has value only till 07/01/2014, so the dates need to be filled only till that. But when I join with the calendar table (plain left join), I am not able restrict the maximum day to the day available for an account.
1111 01/01/2014 100 N
1111 02/01/2014 156 Y
1111 03/01/2014 300 Y
1111 04/01/2014 300 Y
1111 05/01/2014 N
1111 06/01/2014 N
1111 07/01/2014 468 Y
1111 08/01/2014 Y
1111 09/01/2014 Y
1111 10/01/2014 Y
1112 01/01/2014 N
1112 02/01/2014 300 Y
1112 03/01/2014 300 Y
1112 04/01/2014 N
1112 05/01/2014 N
1112 06/01/2014 300 Y
1112 07/01/2014 350 Y
1112 08/01/2014 400 Y
1112 09/01/2014 450 Y
1112 10/01/2014 Y
I need an efficient way (preferably not involving multiple steps) to restrict the dates up to an account's maximum balance available date (07/01/2014 in case of 1111,09/01/2014 in case 1112)
Desired output:
1111 01/01/2014 100 N
1111 02/01/2014 156 Y
1111 03/01/2014 300 Y
1111 04/01/2014 300 Y
1111 05/01/2014 N
1111 06/01/2014 N
1111 07/01/2014 468 Y
1112 01/01/2014 N
1112 02/01/2014 300 Y
1112 03/01/2014 300 Y
1112 04/01/2014 N
1112 05/01/2014 N
1112 06/01/2014 300 Y
1112 07/01/2014 350 Y
1112 08/01/2014 400 Y
1112 09/01/2014 450 Y
After filling the missing days, I am planning to impute the balance of previous business day to the missing days. I am planning to get previous business day for every date and do an update to missing rows by joining the original balance table with acct and previous business day as key.
Thanks.
I am Greenplum database.
A possible way would be put a second select in a subquery. For instance:
select ... from calendar a left outer join balance b on a.date = b.date
where a.date <= (select max(date) from balance c where b.Account = c.Account )
I suppose that you have third table, accounts:
select
accounts.account,
calendar.date,
balance.balance,
calendar.business_day_ind
from
accounts cross join lateral (
select *
from calendar
where calendar.date <= (
select max(date)
from balance
where balance.account = accounts.account)) as calendar left join
balance on (balance.account = accounts.account and balance.date = calendar.date)
order by
accounts.account, calendar.date;
About lateral joins
That was a fun challenge!
CREATE TABLE balance
(account int, balance_date timestamp, balance int)
DISTRIBUTED BY (account, balance_date);
INSERT INTO balance
values (1111,'01/01/2014', 100),
(1111, '02/01/2014', 156),
(1111, '03/01/2014', 300),
(1111, '04/01/2014', 300),
(1111, '07/01/2014', 468),
(1112, '02/01/2014', 300),
(1112, '03/01/2014', 300),
(1112, '06/01/2014', 300),
(1112, '07/01/2014', 350),
(1112, '08/01/2014', 400),
(1112, '09/01/2014', 450),
(1113, '01/01/2014', 30),
(1113, '02/01/2014', 40),
(1113, '03/01/2014', 45),
(1113, '06/01/2014', 45),
(1113, '07/01/2014', 60),
(1113, '08/01/2014', 50),
(1113, '09/01/2014', 20),
(1113, '10/01/2014', 10);
CREATE TABLE calendar
(calendar_date timestamp, business_day_ind boolean)
DISTRIBUTED BY (calendar_date);
INSERT INTO calendar
values ('01/01/2014', false),
('02/01/2014', true),
('03/01/2014', true),
('04/01/2014', false),
('05/01/2014', false),
('06/01/2014', true),
('07/01/2014', true),
('08/01/2014', true),
('09/01/2014', true),
('10/01/2014', true);
analyze balance;
analyze calendar;
And now the query.
select d.account, d.my_date, b.balance, c.business_day_ind
from (
select account, start_date + interval '1 month' * (generate_series(0, duration)) AS my_date
from (
select account, start_date, (date_part('year', duration) * 12 + date_part('month', duration))::int as duration
from (
select start_date, age(end_date, start_date) as duration, account
from (
select account, min(balance_date) as start_date, max(balance_date) as end_date
from balance
group by account
) as sub1
) as sub2
) sub3
) as d
left outer join balance b on d.account = b.account and d.my_date = b.balance_date
join calendar c on c.calendar_date = d.my_date
order by d.account, d.my_date;
Results:
account | my_date | balance | business_day_ind
---------+---------------------+---------+------------------
1111 | 2014-01-01 00:00:00 | 100 | f
1111 | 2014-02-01 00:00:00 | 156 | t
1111 | 2014-03-01 00:00:00 | 300 | t
1111 | 2014-04-01 00:00:00 | 300 | f
1111 | 2014-05-01 00:00:00 | | f
1111 | 2014-06-01 00:00:00 | | t
1111 | 2014-07-01 00:00:00 | 468 | t
1112 | 2014-02-01 00:00:00 | 300 | t
1112 | 2014-03-01 00:00:00 | 300 | t
1112 | 2014-04-01 00:00:00 | | f
1112 | 2014-05-01 00:00:00 | | f
1112 | 2014-06-01 00:00:00 | 300 | t
1112 | 2014-07-01 00:00:00 | 350 | t
1112 | 2014-08-01 00:00:00 | 400 | t
1112 | 2014-09-01 00:00:00 | 450 | t
1113 | 2014-01-01 00:00:00 | 30 | f
1113 | 2014-02-01 00:00:00 | 40 | t
1113 | 2014-03-01 00:00:00 | 45 | t
1113 | 2014-04-01 00:00:00 | | f
1113 | 2014-05-01 00:00:00 | | f
1113 | 2014-06-01 00:00:00 | 45 | t
1113 | 2014-07-01 00:00:00 | 60 | t
1113 | 2014-08-01 00:00:00 | 50 | t
1113 | 2014-09-01 00:00:00 | 20 | t
1113 | 2014-10-01 00:00:00 | 10 | t
(25 rows)
I had to get the min and max dates for each account and then use generate_series to generate the months between the two dates. It would have been a bit cleaner query if you wanted a record for each day but I had to use another subquery to get the results at a monthly level.