Redshift separate partitions with identical data by time - sql

I have data in a Redshift table with columns like product_id, price, and time_of_purchase. I want to create partitions for every time the price changed since the previous purchase. The price of an item may go back to a previous price, but I need that to be a separate partition, e.g.:
| product_id | price | time_of_purchase    |
|------------|-------|---------------------|
| 1          | 2.00  | 2020-09-14 09:00:00 |
| 1          | 2.00  | 2020-09-14 10:00:00 |
| 1          | 3.00  | 2020-09-14 11:00:00 |
| 1          | 3.00  | 2020-09-14 12:00:00 |
| 1          | 2.00  | 2020-09-14 13:00:00 |
Note the price was $2, then went up to $3, then went back to $2. If I do something like (partition by product_id, price order by time_of_purchase), the last row gets partitioned with the top two, which I don't want. How can I do this correctly so I get three separate partitions?

Use lag() to get the previous value and then a cumulative sum over the change flags:
select t.*,
       -- a new group starts whenever the price differs from the previous row;
       -- the first row per product has a NULL prev_price, so it also counts as a change
       sum(case when prev_price = price then 0 else 1 end) over
           (partition by product_id order by time_of_purchase) as partition_id
from (select t.*,
             lag(price) over (partition by product_id order by time_of_purchase) as prev_price
      from t
     ) t
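If you want one row per price regime rather than a per-row partition number, you can wrap the query above and aggregate on the derived partition_id. A minimal sketch reusing that query (t is the table name assumed in the answer):
select product_id, price, partition_id,
       min(time_of_purchase) as first_purchase,
       max(time_of_purchase) as last_purchase
from (select t.*,
             sum(case when prev_price = price then 0 else 1 end) over
                 (partition by product_id order by time_of_purchase) as partition_id
      from (select t.*,
                   lag(price) over (partition by product_id order by time_of_purchase) as prev_price
            from t
           ) t
     ) t
group by product_id, price, partition_id
order by product_id, first_purchase;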

As opposed to @Gordon Linoff, I prefer to do it step by step, using WITH clauses ...
And, as I have stated several times in other posts - please add your sample data in a copy-paste ready format, so we don't have to re-type your examples.
I like to add my examples in a self-contained micro demo format, with the input data already in my post, so everyone can play with it; that's why:
WITH
-- your input, typed manually ....
indata(product_id, price, tm_of_p) AS (
          SELECT 1, 2.00, TIMESTAMP '2020-09-14 09:00'
UNION ALL SELECT 1, 2.00, TIMESTAMP '2020-09-14 10:00'
UNION ALL SELECT 1, 3.00, TIMESTAMP '2020-09-14 11:00'
UNION ALL SELECT 1, 3.00, TIMESTAMP '2020-09-14 12:00'
UNION ALL SELECT 1, 2.00, TIMESTAMP '2020-09-14 13:00'
)
,
with_change_counter AS (
  SELECT
    *
  , CASE WHEN LAG(price) OVER(PARTITION BY product_id ORDER BY tm_of_p) <> price
         THEN 1
         ELSE 0
    END AS chg_count
  FROM indata
)
SELECT
  product_id
, price
, tm_of_p
, SUM(chg_count) OVER(PARTITION BY product_id ORDER BY tm_of_p) AS session_id
FROM with_change_counter;
-- out  product_id | price |       tm_of_p       | session_id
-- out ------------+-------+---------------------+------------
-- out           1 |  2.00 | 2020-09-14 09:00:00 |          0
-- out           1 |  2.00 | 2020-09-14 10:00:00 |          0
-- out           1 |  3.00 | 2020-09-14 11:00:00 |          1
-- out           1 |  3.00 | 2020-09-14 12:00:00 |          1
-- out           1 |  2.00 | 2020-09-14 13:00:00 |          2

Related

How to aggregate over date including all prior dates

I am working with a table in Databricks Delta lake. It gets new records appended every month. The field insert_dt indicates when the records are inserted.
| ID | Mrc | insert_dt |
|----|-----|------------|
| 1 | 40 | 2022-01-01 |
| 2 | 30 | 2022-01-01 |
| 3 | 50 | 2022-01-01 |
| 4 | 20 | 2022-02-01 |
| 5 | 45 | 2022-02-01 |
| 6 | 55 | 2022-03-01 |
Now I want to aggregate by insert_dt and calculate the average of Mrc. For each date, the average is taken not just over the records of that date but over all records with dates up to and including it. In this example, there are 3 rows for 2022-01-01, 5 rows for 2022-02-01 and 6 rows for 2022-03-01. The expected results would look like this:
| Mrc | insert_dt |
|-----|------------|
| 40 | 2022-01-01 |
| 37 | 2022-02-01 |
| 40 | 2022-03-01 |
How do I write a query to do that?
I checked the documentation for Databricks Delta Lake (https://docs.databricks.com/sql/language-manual/sql-ref-window-functions.html) and its window functions look like T-SQL, so I think this will work for you, though you may need to tweak it slightly.
The approach is to condense each day to a single point and then use window functions to get the running totals. Note that any given day may have a different count, so you can't just average the averages.
--Enter the sample data you gave as a CTE for testing
;with cteSample as (
    SELECT * FROM ( VALUES
          (1, 40, CONVERT(date, '2022-01-01'))
        , (2, 30, CONVERT(date, '2022-01-01'))
        , (3, 50, CONVERT(date, '2022-01-01'))
        , (4, 20, CONVERT(date, '2022-02-01'))
        , (5, 45, CONVERT(date, '2022-02-01'))
        , (6, 55, CONVERT(date, '2022-03-01'))
    ) as TabA(ID, Mrc, insert_dt)
) --Solution begins here: find the total and count for each date,
  --because the window can only handle a single "last row"
, cteGrouped as (
    SELECT insert_dt, SUM(Mrc) as MRCSum, COUNT(*) as MRCCount
    FROM cteSample
    GROUP BY insert_dt
) --Now use the window function to get the totals "up to today"
, cteTotals as (
    SELECT insert_dt
        , SUM(MRCSum) OVER (ORDER BY insert_dt RANGE
              BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcSum
        , SUM(MRCCount) OVER (ORDER BY insert_dt RANGE
              BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcCount
    FROM cteGrouped as G
) --Now divide out to get the average to date
  --(integer division here reproduces the whole-number expected output)
SELECT insert_dt, MrcSum/MrcCount as MRCAverage
FROM cteTotals as T
This gives the following output:
| insert_dt  | MRCAverage |
|------------|------------|
| 2022-01-01 | 40         |
| 2022-02-01 | 37         |
| 2022-03-01 | 40         |
Calculate a running average using a window function (the inner subquery) and then pick only one row per insert_dt - the one with the highest id. I only tested this on PostgreSQL 13, so I am not sure how much of the SQL standard Delta Lake supports and whether it will work there or not.
select mrc, insert_dt from
(
    select avg(mrc) over (order by insert_dt, id) mrc, insert_dt,
           row_number() over (partition by insert_dt order by id desc) rn
    from the_table
) t
where rn = 1
order by insert_dt;
DB-fiddle demo
Update: If the_table has no id column, use a CTE to add one.
with t_id as (select *, row_number() over (order by insert_dt) id from the_table)
select mrc, insert_dt from
(
    select avg(mrc) over (order by insert_dt, id) mrc, insert_dt,
           row_number() over (partition by insert_dt order by id desc) rn
    from t_id
) t
where rn = 1
order by insert_dt;

How to combine Cross Join and String Agg in Bigquery with date time difference

I am trying to go from the following table
| user_id | touch      | Date       | Purchase Amount |
|---------|------------|------------|-----------------|
| 1       | Impression | 2020-09-12 | 0               |
| 1       | Impression | 2020-10-12 | 0               |
| 1       | Purchase   | 2020-10-13 | 125$            |
| 1       | Email      | 2020-10-14 | 0               |
| 1       | Impression | 2020-10-15 | 0               |
| 1       | Purchase   | 2020-10-30 | 122             |
| 2       | Impression | 2020-10-15 | 0               |
| 2       | Impression | 2020-10-16 | 0               |
| 2       | Email      | 2020-10-17 | 0               |
to
| user_id | path                           | Number of days between First Touch and Purchase   | Purchase Amount |
|---------|--------------------------------|----------------------------------------------------|-----------------|
| 1       | Impression,Impression,Purchase | 2020-10-13 (Purchase) - 2020-09-12 (Impression)    | 125$            |
| 1       | Email,Impression,Purchase      | 2020-10-30 (Purchase) - 2020-10-14 (Email)         | 122$            |
| 2       | Impression,Impression,Email    | 2020-12-31 (Fixed date) - 2020-10-15 (Impression)  | 0$              |
In essence, I am trying to create a new row for each unique user in the table every time a 'Purchase' is encountered in the comma-separated string.
Also, I want the difference between the first touch and the first purchase for each unique user. When a new row is created, we do the same for the same user, as shown in the example above.
From the little I have gathered, I need to use a mixture of cross join and string_agg, but I tried using a case statement within string_agg and was not able to get to the required result.
Is there a better way to do it in SQL (BigQuery)?
Thank you
Below is for BigQuery Standard SQL
#standardSQL
select user_id,
       string_agg(touch order by date) path,
       date_diff(max(date), min(date), day) days,
       sum(amount) amount
from (
    select user_id, touch, date, amount,
           countif(touch = 'Purchase') over win grp
    from `project.dataset.table`
    window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
Applied to the sample data from your question, this produces the expected output.
Another change: in case there is no Purchase among a user's touches, we calculate the number of days using a fixed date we have set. How can I add this to the query above?
select user_id,
       string_agg(touch order by date) path,
       date_diff(if(countif(touch = 'Purchase') = 0, date '2020-12-31', max(date)), min(date), day) days,
       sum(amount) amount
from (
    select user_id, touch, date, amount,
           countif(touch = 'Purchase') over win grp
    from `project.dataset.table`
    window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
with the corresponding output.
This means you need a solution which starts a new group of rows whenever there is a purchase in touch.
Use the following query:
select user_id,
       -- aggregation function according to your requirement,
       sum(purchase_amount)
from (select t.*,
             sum(case when touch = 'Purchase' then 1 else 0 end)
                 over (partition by user_id order by date) as sm
      from t) t
group by user_id, sm
We could approach this as a gaps-and-islands problem, where every island ends with a purchase. How do we define the groups? By counting how many purchases we have ahead (current row included) - hence the descending sort in the query.
select user_id, string_agg(touch order by date),
       min(date) as first_date, max(date) as max_date,
       date_diff(max(date), min(date), day) as cnt_days
from (
    select t.*,
           countif(touch = 'Purchase') over(partition by user_id order by date desc) as grp
    from mytable t
) t
group by user_id, grp
You can create a value for each row that corresponds to the number of instances where table.touch = 'Purchase', which can then be used to group on:
with r as (select row_number() over(order by t1.user_id) rid, t1.* from table t1)
select t3.user_id, group_concat(t3.touch), sum(t3.amount), date_diff(max(t3.date), min(t3.date))
from (select
          (select sum(r1.touch = 'Purchase' AND r1.rid < r2.rid) from r r1) c1, r2.*
      from r r2
     ) t3
group by t3.c1;

Database Design Historical Data Model

I am thinking about a good design to capture the history of product changes. Suppose a user can have different products to trade each day.
| User | Product | Day |
|------|---------|-----|
| 1    | A       | 1   |
| 1    | B       | 1   |
| 1    | A       | 2   |
| 1    | B       | 2   |
| 1    | C       | 3   |
As we can see above, on day 3 product C is added and products A and B are removed.
I am thinking of the two designs below:
#1 Capture the product changes and store them as start and end dates
| User | Product | Start | End |
|------|---------|-------|-----|
| 1    | A       | 1     | 3   |
| 1    | B       | 1     | 3   |
| 1    | C       | 3     | -   |
#2 Capture each product change as one record
| User | Product | Action  | Day |
|------|---------|---------|-----|
| 1    | A       | Added   | 1   |
| 1    | B       | Added   | 1   |
| 1    | C       | Added   | 3   |
| 1    | A       | Removed | 3   |
| 1    | B       | Removed | 3   |
My question is: can these two models be converted into each other? For example, we can use LEAD/LAG to convert #2 into #1.
Which design is better? Our system is using #2 to store the historical data.
Updated:
The intention is to use the data to show the product change history.
For example, for a given date range, what are the product changes for a particular user?
The second model seems better, at least if your main interest is in queries like "find all changes for all users and products, which occurred between DATE_1 and DATE_2".
With the second model, the query is trivial:
select * from (table) where (date) between DATE_1 and DATE_2;
How would you write the query for the first model?
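For comparison, a sketch of what it might look like against the first model, using the same placeholder style as above: every change is either an interval start or an interval end, so you need a disjunction, which is exactly what makes the query awkward:
select * from (table)
where (start) between DATE_1 and DATE_2
   or (end) between DATE_1 and DATE_2;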
Moreover, with the second model you could create an index on (user, date) - or even just on (date) - which will make quick work of the query. Even if you had indexes on the table in the first model, they wouldn't be used due to the complicated nature of the query.
While integrity constraints would be relatively difficult in both cases (as they are cross-row), they would be much easier to implement - either with materialized views or with triggers - with the second model. In the first model you have to make sure there are no overlaps between the intervals. With the second model, if you partition by user and product and order by date, the condition is simply that the action alternates from row to row. Still not trivial to implement, but much simpler than the "non-overlapping intervals" condition for the first model; a sketch of the check follows.
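A minimal sketch of that alternation check for the second model, assuming a hypothetical table changes(usr, product, day, action); any row it returns is a violation (two consecutive Added or two consecutive Removed events):
select *
from (select c.*,
             -- previous action for the same user and product
             lag(action) over (partition by usr, product order by day) as prev_action
      from changes c
     ) t
where action = prev_action;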
To your other question: It is, indeed, trivial to go from either model to the other, using PIVOT and UNPIVOT. You do need an analytic function (ROW_NUMBER) before you PIVOT to go from model #2 to #1. You don't need any preparation to go from #1 to #2.
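And since the question mentions LEAD/LAG: the #2-to-#1 direction can also be sketched without PIVOT, relying on the alternation property above (same hypothetical changes table). Pair each Added row with the next event for that user and product; a still-active product has no next event and gets a NULL end:
select usr, product, start_day, end_day
from (select usr, product, action,
             day as start_day,
             -- the next event for this user/product is the matching removal, if any
             lead(day) over (partition by usr, product order by day) as end_day
      from changes
     ) t
where action = 'Added';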
Personally, I think the first option is better. I'm assuming you have so many rows that the raw structure of a row per user, product and date is too heavy? Because for visualisations I think the raw table would work fine as is.
However, if you have to aggregate due to size, and do not need to know the amounts of the product nor how many users are selling them on any given day, then the first option would be easier to work with in my opinion just in terms of SQL. On the other hand, you will have a problem in case a product can have several start and end dates, since I am assuming a new entry would just overwrite the previous date stamp.
So, with that in mind, I would personally create a table with a row per day (or per month, if you want to minimise the size of the table and monthly is granular enough for your use case). Then add a column for each product and whether or not it was sold that day. You could even do it with a count of the number of users selling that product, which would give you a little more detail; see the sketch below. The only problem with this model is that I would only use it for truly static, historical data with no need to add new products.
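A minimal sketch of that snapshot, assuming the raw rows live in a hypothetical history(usr, product, day) table and the product list is fixed at A, B and C:
select day,
       -- one column per product: number of users selling it that day
       count(case when product = 'A' then 1 end) as users_selling_a,
       count(case when product = 'B' then 1 end) as users_selling_b,
       count(case when product = 'C' then 1 end) as users_selling_c
from history
group by day
order by day;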
You can convert from any one format to the others.
Data in the first format:
CREATE TABLE table1 ( Usr, Product, Day ) AS
SELECT 1, 'A', 1 FROM DUAL UNION ALL
SELECT 1, 'B', 1 FROM DUAL UNION ALL
SELECT 1, 'A', 2 FROM DUAL UNION ALL
SELECT 1, 'B', 2 FROM DUAL UNION ALL
SELECT 1, 'C', 3 FROM DUAL;
Then:
SELECT usr,
       product,
       day + DECODE( action, 'Removed', 1, 0 ) AS day,
       action
FROM   (
  SELECT Usr,
         Product,
         Day,
         CASE
           WHEN LAG( Day ) OVER ( PARTITION BY Usr, Product ORDER BY Day ) = Day - 1
           THEN NULL
           ELSE 'Added'
         END AS Added,
         CASE
           WHEN LEAD( Day ) OVER ( PARTITION BY Usr, Product ORDER BY Day ) = Day + 1
           THEN NULL
           WHEN Day = MAX( Day ) OVER ()
           THEN NULL
           ELSE 'Removed'
         END AS Removed
  FROM   table1
)
UNPIVOT ( action FOR value IN ( Added, Removed ) )
Outputs that data in the second form:
USR | PRODUCT | DAY | ACTION
--: | :------ | --: | :------
1 | A | 1 | Added
1 | A | 3 | Removed
1 | B | 1 | Added
1 | B | 3 | Removed
1 | C | 3 | Added
and:
SELECT Usr,
       Product,
       MIN( Day ) AS "Start",
       CASE MAX( Day )
         WHEN Last_Day
         THEN NULL
         ELSE MAX( Day ) + 1
       END AS "End"
FROM   (
  SELECT Usr,
         Product,
         Day,
         Day - ROW_NUMBER() OVER ( PARTITION BY Usr, Product ORDER BY Day ) AS grp,
         MAX( Day ) OVER () AS last_day
  FROM   table1
)
GROUP BY Usr, Product, Grp, Last_Day
ORDER BY Usr, Product, "Start"
Outputs the data in the third format:
USR | PRODUCT | Start | End
--: | :------ | :---- | ---:
1 | A | 1 | 3
1 | B | 1 | 3
1 | C | 3 | null
Data in the second format:
CREATE TABLE table2 ( Usr, Product, Day, Action ) AS
SELECT 1, 'A', 1, 'Added' FROM DUAL UNION ALL
SELECT 1, 'A', 3, 'Removed' FROM DUAL UNION ALL
SELECT 1, 'B', 1, 'Added' FROM DUAL UNION ALL
SELECT 1, 'B', 3, 'Removed' FROM DUAL UNION ALL
SELECT 1, 'C', 3, 'Added' FROM DUAL;
Then you can convert it to the third format using:
SELECT Usr,
       Product,
       "Start",
       "End"
FROM   (
  SELECT t.*,
         ROW_NUMBER() OVER ( PARTITION BY Usr, Product, Action ORDER BY Day ) AS rn
  FROM   table2 t
)
PIVOT (
  MAX( Day )
  FOR Action IN (
    'Added'   AS "Start",
    'Removed' AS "End"
  )
)
Which outputs:
USR | PRODUCT | Start | End
--: | :------ | ----: | ---:
1 | A | 1 | 3
1 | B | 1 | 3
1 | C | 3 | null
Data in the third format:
CREATE TABLE table3 ( Usr, Product, "Start", "End" ) AS
SELECT 1, 'A', 1, 3 FROM DUAL UNION ALL
SELECT 1, 'B', 1, 3 FROM DUAL UNION ALL
SELECT 1, 'C', 3, NULL FROM DUAL;
Then to get the data in the first format you can use:
WITH unrolled_data ( Usr, Product, Day, "End" ) AS (
  SELECT Usr, Product, "Start", "End"
  FROM   table3
UNION ALL
  SELECT Usr, Product, Day + 1, "End"
  FROM   unrolled_data
  WHERE  Day + 1 < COALESCE( "End", 4 /* The last day + 1 */ )
)
SELECT Usr, Product, Day
FROM   unrolled_data
ORDER BY Usr, Day, Product
Outputs:
USR | PRODUCT | DAY
--: | :------ | --:
1 | A | 1
1 | B | 1
1 | A | 2
1 | B | 2
1 | C | 3
And you can convert to the second format using:
SELECT *
FROM table3
UNPIVOT ( Day FOR Action IN ( "Start" AS 'Added', "End" AS 'Removed' ) )
Which outputs:
USR | PRODUCT | ACTION | DAY
--: | :------ | :------ | --:
1 | A | Added | 1
1 | A | Removed | 3
1 | B | Added | 1
1 | B | Removed | 3
1 | C | Added | 3
(and you can combine queries to convert from 2-to-1.)
db<>fiddle here

Get Max And Min dates for consecutive values in T-SQL

I have a log table like the one below and want to simplify it by getting the min start date and max end date for consecutive Status values for each Id. I tried many window function combinations but no luck.
This is what I have:
This is what I want to see:
This is a typical gaps-and-islands problem. You want to aggregate groups of consecutive records that have the same Id and Status.
No need for recursion, here is one way to solve it using window functions:
select
    Id,
    Status,
    min(StartDate) StartDate,
    max(EndDate) EndDate
from (
    select
        t.*,
        row_number() over(partition by id order by StartDate) rn1,
        row_number() over(partition by id, status order by StartDate) rn2
    from mytable t
) t
group by
    Id,
    Status,
    rn1 - rn2
order by Id, min(StartDate)
The query works by ranking records over two different partitions (by Id, and by Id and Status). The difference between the ranks gives you the group each record belongs to. You can run the subquery independently to see what it returns and understand the logic.
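For instance, you can materialise the two ranks and their difference like this (same mytable assumed); within each Id and Status, the difference rn1 - rn2 is constant across an island of consecutive rows, which is why grouping on it works:
select t.*,
       row_number() over(partition by id order by StartDate) rn1,
       row_number() over(partition by id, status order by StartDate) rn2,
       row_number() over(partition by id order by StartDate)
           - row_number() over(partition by id, status order by StartDate) diff
from mytable t
order by id, StartDate;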
Demo on DB Fiddle:
Id | Status | StartDate | EndDate
-: | :----- | :------------------ | :------------------
1 | B | 07/02/2019 00:00:00 | 18/02/2019 00:00:00
1 | C | 18/02/2019 00:00:00 | 10/03/2019 00:00:00
1 | B | 10/03/2019 00:00:00 | 01/04/2019 00:00:00
2 | A | 05/02/2019 00:00:00 | 22/04/2019 00:00:00
2 | D | 22/04/2019 00:00:00 | 05/05/2019 00:00:00
2 | A | 05/05/2019 00:00:00 | 30/06/2019 00:00:00
Try the following query. First order the data by StartDate and generate a sequence (rid). Then use the recursive CTE to get the first row (rid = 1) for each group (id, status), and recursively fetch the next row, comparing the start/end dates.
;WITH cte_r(id,[Status],StartDate,EndDate,rid)
AS
(
    SELECT id,[Status],StartDate,EndDate,
           ROW_NUMBER() OVER(PARTITION BY Id,[Status] ORDER BY StartDate) AS rid
    FROM log_table
),
cte_range(id,[Status],StartDate,EndDate,rid)
AS
(
    SELECT id,[Status],StartDate,EndDate,rid
    FROM cte_r
    WHERE rid=1
    UNION ALL
    SELECT p.id, p.[Status],
           CASE WHEN c.StartDate<p.EndDate THEN p.StartDate ELSE c.StartDate END AS StartDate,
           c.EndDate, c.rid
    FROM cte_range p
    INNER JOIN cte_r c
        ON p.id=c.id
        AND p.[Status]=c.[Status]
        AND p.rid+1=c.rid
)
SELECT id,[Status],StartDate,MAX(EndDate) AS EndDate
FROM cte_range
GROUP BY id,[Status],StartDate;

Select only 1 payment from a table with customers with multiple payments

I have a table called "payments" where I store all the payments of my customers, and I need to write a select to calculate the non-payment rate for a given month.
A customer can have multiple payments in that month, but I should count them only once: 1 if any of the payments was made and 0 if no payment was made.
Example:
+----+------------+--------+
| ID | DATEDUE | AMOUNT |
+----+------------+--------+
| 1 | 2016-11-01 | 0 |
| 1 | 2016-11-15 | 20.00 |
| 2 | 2016-11-10 | 0 |
+----+------------+--------+
The result I expect is from the rate of november:
+----+------------+--------+
| ID | DATEDUE | AMOUNT |
+----+------------+--------+
| 1 | 2016-11-15 | 20.00 |
| 2 | 2016-11-10 | 0 |
+----+------------+--------+
So the rate will be 50%.
But if the select is:
SELECT * FROM payment WHERE DATEDUE BETWEEN '2016-11-01' AND '2016-11-30'
It will return 3 rows and the rate will be 66%, which is wrong. Ideas?
PS: This is a simpler example of the real table. The real query has a lot of columns, subselects, etc.
It sounds like you need to partition your results per customer.
SELECT TOP 1 WITH TIES
       ID,
       DATEDUE,
       AMOUNT
FROM payment
WHERE DATEDUE BETWEEN '2016-11-01' AND '2016-11-30'
ORDER BY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY AMOUNT DESC)
PS: The BETWEEN operator is frowned upon by some people. For clarity it might be better to avoid it:
What do BETWEEN and the devil have in common?
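For example, the same month as a half-open range, which also behaves correctly if DATEDUE ever carries a time component (a sketch against the same payment table):
SELECT *
FROM payment
WHERE DATEDUE >= '2016-11-01'
  AND DATEDUE < '2016-12-01'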
Try this
SELECT
    id
  , SUM(AMOUNT) AS AMOUNT
FROM
    Payment
GROUP BY
    id;
This might help if you want other columns.
WITH cte AS (
    SELECT
        id
      , ROW_NUMBER() OVER (PARTITION BY ID ORDER BY AMOUNT DESC) AS RowNum
      -- other columns
    FROM
        Payment
)
SELECT *
FROM
    cte
WHERE
    RowNum = 1;
To calculate the rate, you can use explicit division (counting distinct customers in the denominator and multiplying by 1.0 to avoid integer division):
select 1 - count(distinct case when amount > 0 then id end) * 1.0 / count(distinct id)
from payment
where . . .;
Or, in a way that is perhaps easier to follow:
select avg(flag * 1.0)
from (select id, (case when max(amount) > 0 then 0 else 1 end) as flag
      from payment
      where . . .
      group by id
     ) i