Computing window functions for multiple dates - SQL

I have a table sales consisting of user ids, products those users have purchased, and the date of purchase:
date        user_id  product
----------  -------  -------
2021-01-01  1        apple
2021-01-02  1        orange
2021-01-02  2        apple
2021-01-02  3        apple
2021-01-03  3        orange
2021-01-04  4        apple
If I wanted to see product counts based on every user's most recent purchase, I would do something like this:
WITH latest_sales AS (
    SELECT
        date
        , user_id
        , product
        , row_number() OVER (PARTITION BY user_id ORDER BY date DESC) AS rn
    FROM
        sales
)
SELECT
    product
    , count(1) AS count
FROM
    latest_sales
WHERE
    rn = 1
GROUP BY
    product
Producing:
product  count
-------  -----
apple    2
orange   2
However, this only produces results as of the most recent date. If I had run this on 2021-01-02, the results would have been:
product  count
-------  -----
apple    2
orange   1
How could I code this so I could see counts of the most recent products purchased by user, but for multiple dates?
So the output would be something like this:
date        product  count
----------  -------  -----
2021-01-01  apple    1
2021-01-01  orange   0
2021-01-02  apple    2
2021-01-02  orange   1
2021-01-03  apple    1
2021-01-03  orange   2
2021-01-04  apple    2
2021-01-04  orange   2
Appreciate any help on this.

I'm afraid the window function row_number() with the PARTITION BY user_id clause is not relevant in your case, because it only focuses on the user_id of the current row, whereas you want a consolidated view across all users.
I don't have a better idea than doing a self-join on the sales table:
WITH list AS (
    SELECT DISTINCT ON (s2.date, user_id)
        s2.date
        , product
    FROM sales AS s1
    INNER JOIN (SELECT DISTINCT date FROM sales) AS s2
        ON s1.date <= s2.date
    ORDER BY s2.date, user_id, s1.date DESC
)
SELECT date, product, count(*)
FROM list
GROUP BY date, product
ORDER BY date
See the test result in dbfiddle.
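Note that DISTINCT ON is a PostgreSQL extension. If you are on another engine, a minimal sketch of the same self-join idea using row_number() instead might look like this, assuming the same sales table:

-- Sketch of the same self-join for engines without DISTINCT ON:
-- for each reference date, keep each user's latest purchase via row_number()
WITH latest_per_day AS (
    SELECT
        s2.date
        , s1.user_id
        , s1.product
        , row_number() OVER (PARTITION BY s2.date, s1.user_id
                             ORDER BY s1.date DESC) AS rn
    FROM sales AS s1
    INNER JOIN (SELECT DISTINCT date FROM sales) AS s2
        ON s1.date <= s2.date
)
SELECT date, product, count(*) AS count
FROM latest_per_day
WHERE rn = 1
GROUP BY date, product
ORDER BY date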


Handling duplicates when rolling totals using OVER Partition by

I'm trying to get rolling totals of the amount column for each date, from the 1st day of the month up to whatever the date column value is, as shown in the Input Table below.
Output Requirements:
- Partition by the 'team' column
- Restart rolling totals on the 1st of each month
Question 1
Is the query below correct for the output requirements shown in the Desired Output Table? It seems to work, but I need to confirm.
SELECT
    *,
    SUM(amount) OVER (
        PARTITION BY
            team,
            month_id
        ORDER BY
            date ASC
    ) rolling_amount_total
FROM input_table;
Question 2
How can I handle duplicate dates, shown in the first 2 rows of the Input Table? Whenever there is a duplicate date, the amount is a duplicate as well. I saw a solution here: https://stackoverflow.com/a/60115061/6388651 but had no luck getting it to remove the duplicates. My non-working attempt is below.
SELECT
    *,
    SUM(amount) OVER (
        PARTITION BY
            team,
            month_id
        ORDER BY
            date ASC
    ) rolling_amount_total
FROM (
    SELECT DISTINCT
        date,
        amount,
        team,
        month_id
    FROM input_table
) t
Input Table

date        amount  team  month_id
----------  ------  ----  --------
2022-04-01  1       A     2022-04
2022-04-01  1       A     2022-04
2022-04-02  2       A     2022-04
2022-05-01  4       B     2022-05
2022-05-02  4       B     2022-05
Desired Output Table

date        amount  team  month_id  Rolling_Amount_Total
----------  ------  ----  --------  --------------------
2022-04-01  1       A     2022-04   1
2022-04-02  2       A     2022-04   3
2022-05-01  4       B     2022-05   4
2022-05-02  4       B     2022-05   8
Q1: Your sum() over () is correct.
Q2: Replace from input_table in your first query with:
from (select date, sum(amount) as amount, team, month_id
      from input_table
      group by date, team, month_id
     ) as t
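Putting the two pieces together, a sketch of the full query with that replacement applied, using the question's own table and column names:

-- Sketch: the Q1 query with the Q2 replacement applied, so duplicate
-- dates are reduced to one row per date (amounts summed) before the
-- window sum runs
SELECT
    *,
    SUM(amount) OVER (
        PARTITION BY team, month_id
        ORDER BY date ASC
    ) rolling_amount_total
FROM (
    SELECT date, SUM(amount) AS amount, team, month_id
    FROM input_table
    GROUP BY date, team, month_id
) AS t;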

Average on the most recent date for which data is available

I have a table (on BigQuery) that looks like the following:
Date        Type  Score
----------  ----  -----
2021-01-04  A     5
2021-01-04  A     4
2021-01-04  A     5
2021-01-02  A     1
2021-01-02  A     1
2021-01-01  A     3
2021-01-04  B     NULL
2021-01-04  B     NULL
2021-01-02  B     NULL
2021-01-02  B     NULL
2021-01-01  B     2
2021-01-01  B     5
2021-01-04  C     NULL
2021-01-04  C     4
2021-01-04  C     NULL
2021-01-01  C     1
2021-01-01  C     2
2021-01-01  C     3
What I would like to get is the average score for each type, but the average should be taken only over the most recent date for which at least one score is available for that type. From the example above, the aim is to obtain the following table:
Type  AVG Score
----  ---------
A     (5+4+5)/3
B     (2+5)/2
C     (4)/1
I need a solution that could be adapted if I want the average score, not for each type, but for each combination of two columns (type/color), still on the most recent date for which at least one score is available for the combination.
An alternative solution is given below; you can try it (date1 here stands in for the Date column):
SELECT type, AVG(score)
FROM mytable
WHERE score IS NOT NULL
  AND (type, date1) IN (
      SELECT (type, max(cast(date1 as date)))
      FROM mytable
      WHERE score IS NOT NULL
      GROUP BY type
  )
GROUP BY type
GROUP BY type
This answers the original question.
One method uses window functions:
select type, avg(score)
from (select t.*,
             dense_rank() over (partition by type order by date desc) as seqnum
      from t
      where score is not null
     ) t
where seqnum = 1
group by type;
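For the two-column variant mentioned in the question, the same pattern extends by adding the second column to both the partition and the grouping. A sketch, assuming a hypothetical color column alongside type:

-- Sketch: most-recent-date average per (type, color) combination;
-- the color column is an assumption, not shown in the sample data
select type, color, avg(score)
from (select t.*,
             dense_rank() over (partition by type, color order by date desc) as seqnum
      from t
      where score is not null
     ) t
where seqnum = 1
group by type, color;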

Count the number of transactions per month for an individual, grouped by date, in Hive

I have a table of customer transactions where each item purchased by a customer is stored as one row. So, for a single transaction there can be multiple rows in the table. There is also a column called visit_date.
There is a category column called cal_month_nbr, which ranges from 1 to 12 based on which month the transaction occurred.
The data looks like this:
Id  visit_date  Cal_month_nbr
--  ----------  -------------
1   01/01/2020  1
1   01/02/2020  1
1   01/01/2020  1
2   02/01/2020  2
1   02/01/2020  2
1   03/01/2020  3
3   03/01/2020  3
First, I want to know how many times each customer visits per month, using their visit_date, i.e. I want the output below:
id  cal_month_nbr  visit_per_month
--  -------------  ---------------
1   1              2
1   2              1
1   3              1
2   2              1
3   3              1
Second, what is the average frequency of visits per month for each id? I.e.:

id  Avg_freq_per_month
--  ------------------
1   1.33
2   1
3   1
I tried the query below, but it counts each item as one transaction:
select avg(count_e) as num_visits_per_month, individual_id
from
(
    select r.individual_id, cal_month_nbr, count(*) as count_e
    from
        ww_customer_dl_secure.cust_scan r
    group by
        r.individual_id, cal_month_nbr
    order by count_e desc
) as t
group by individual_id
I would appreciate any help, guidance, or suggestions.
You can divide the total visits by the number of months:
select individual_id,
       count(*) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
If you want the average number of days per month, then:
select individual_id,
       count(distinct visit_date) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
Actually, Hive may not be efficient at calculating count(distinct), so multiple levels of aggregation might be faster:
select individual_id, avg(num_visit_days)
from (select individual_id, cal_month_nbr, count(*) as num_visit_days
      from (select distinct individual_id, visit_date, cal_month_nbr
            from ww_customer_dl_secure.cust_scan c
           ) iv
      group by individual_id, cal_month_nbr
     ) ic
group by individual_id;
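For the first requested output (visits per month per id), a sketch under the same schema; counting distinct visit dates means several items bought during one visit count as a single visit:

-- Sketch: visits per month per id, deduplicating multiple item rows
-- from the same visit by counting distinct visit dates
select individual_id, cal_month_nbr,
       count(distinct visit_date) as visit_per_month
from ww_customer_dl_secure.cust_scan
group by individual_id, cal_month_nbr;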

Include only transition states in SQL query

I have a table with customers and their purchase behaviour that looks as follows:
customer  shop  time
--------  ----  -----
1         5     13.30
1         5     14.33
1         10    22.17
2         3     12.15
2         1     13.30
2         1     15.55
2         3     17.29
Since I want the shifts in shop, I need the following output:
customer  shop  time
--------  ----  -----
1         5     13.30
1         10    22.17
2         3     12.15
2         1     13.30
2         3     17.29
I have tried using
ROW_NUMBER() OVER (PARTITION BY customer, shop ORDER BY time ASC) AS counter
and then keeping only rows with counter = 1. However, this breaks down when the customer visits the same shop again later on, as with customer = 2 and shop = 3 in my example.
I came up with this:
WITH a AS
(
    SELECT
        customer, shop, time,
        ROW_NUMBER() OVER (PARTITION BY customer ORDER BY time ASC) AS counter
    FROM
        db
)
SELECT a1.*
FROM a a1
JOIN a AS a2
    ON a1.customer = a2.customer
    AND a2.counter + 1 = a1.counter
    AND a2.shop <> a1.shop
UNION
SELECT a.*
FROM a
WHERE counter = 1
However, this is very inefficient, and running it in AWS, where my data is located, results in an error telling me that
Query exhausted resources at this scale factor
Is there any way to make this query more efficient?
This is a gaps-and-islands problem. But the simplest solution uses lag():
select customer, shop, time
from (select t.*,
             lag(shop) over (partition by customer order by time) as prev_shop
      from t
     ) t
where prev_shop is null or prev_shop <> shop;
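If you also need to number each consecutive stay (the islands in the gaps-and-islands framing), a sketch building on the same lag() flag, assuming the same table t:

-- Sketch: number each consecutive stay per customer by counting
-- shop changes cumulatively over time
select customer, shop, time,
       sum(case when prev_shop is null or prev_shop <> shop then 1 else 0 end)
           over (partition by customer order by time) as stay_number
from (select t.*,
             lag(shop) over (partition by customer order by time) as prev_shop
      from t
     ) t;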

Find how many attempts it took to achieve a particular outcome in a SQL table

I need to find out how many attempts it takes to achieve an outcome from a SQL table. For example, my table contains CustomerID, Outcome, and OutcomeType. The outcome I am looking for is Sale.
So if I had this record:
CID  Outcome      OutcomeID  Date
---  -----------  ---------  -------------------
1    No Answer    0          01/01/2015 08:00:00
1    No Interest  0          02/01/2015 09:00:00
1    Sale         1          02/02/2015 10:00:00
1    Follow up    2          03/02/2015 10:00:00
I can see it took 2 attempts to get a sale. I need to do this for all the customers in a table which contains thousands of entries. They may have entries after the sale, and I need to exclude these; they may also have additional sales after the first, but I am only interested in the first sale.
I hope this is enough info; many thanks in advance.
Edit: as requested, the output I am looking for would be:

CID  CountToOutcome
---  --------------
1    2
2    3
3    5
etc
You can do this with window functions and aggregation:
select cid,
       min(case when Outcome = 'Sale' then seqnum end) - 1 as AttemptsBeforeSale
from (select t.*,
             row_number() over (partition by cid order by date) as seqnum
      from t
     ) t
group by cid;
Note: This provides the value for the first sale for each cid.
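One possible refinement: customers who never reach a sale get a NULL from the min(case ...), and a HAVING clause can drop them. A sketch using the question's requested column name, assuming the same table t:

-- Sketch: same query, renamed to the question's CountToOutcome column
-- and dropping customers who never had a 'Sale' (their min(...) is null)
select cid,
       min(case when Outcome = 'Sale' then seqnum end) - 1 as CountToOutcome
from (select t.*,
             row_number() over (partition by cid order by date) as seqnum
      from t
     ) t
group by cid
having min(case when Outcome = 'Sale' then seqnum end) is not null;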