Speeding up a recursive query / looping through values - SQL

Suppose a table with a structure like this:
create table tab1
(
id int,
valid_from timestamp
)
I need to build a query such that, when duplicates exist over the pair (id, valid_from), e.g.
id valid_from
1 2000-01-01 12:00:00
1 2000-01-01 12:00:00
then one second is added to the valid_from of each subsequent duplicate row.
For example, if there are three duplicate rows, the result should be as follows:
id valid_from
1 2000-01-01 12:00:00
1 2000-01-01 12:00:01
1 2000-01-01 12:00:02
I tried a recursive CTE, but some (id, valid_from) pairs have a large number of duplicates (about 160 in the current data set), so it is really slow.
Thanks

If "next second" is not occupied, then:
WITH TAB (id, valid_from) AS
(
VALUES
(1, TIMESTAMP('2000-01-01 12:00:00'))
, (1, TIMESTAMP('2000-01-01 12:00:00'))
, (1, TIMESTAMP('2000-01-01 12:00:00'))
, (2, TIMESTAMP('2000-01-01 12:00:00'))
, (2, TIMESTAMP('2000-01-01 12:00:01'))
, (2, TIMESTAMP('2000-01-01 12:00:01'))
)
SELECT ID, VALID_FROM
, VALID_FROM + (ROWNUMBER() OVER (PARTITION BY ID, VALID_FROM) - 1) SECOND AS VALID_FROM2
FROM TAB
ORDER BY ID, VALID_FROM2;
The result is:
|ID |VALID_FROM |VALID_FROM2 |
|-----------|--------------------------|--------------------------|
|1 |2000-01-01-12.00.00.000000|2000-01-01-12.00.00.000000|
|1 |2000-01-01-12.00.00.000000|2000-01-01-12.00.01.000000|
|1 |2000-01-01-12.00.00.000000|2000-01-01-12.00.02.000000|
|2 |2000-01-01-12.00.00.000000|2000-01-01-12.00.00.000000|
|2 |2000-01-01-12.00.01.000000|2000-01-01-12.00.01.000000|
|2 |2000-01-01-12.00.01.000000|2000-01-01-12.00.02.000000|
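To see why the "next second is not occupied" condition matters, here is a minimal sketch in the same VALUES style (data invented for illustration) where a shifted row collides with an already occupied second:
WITH TAB (id, valid_from) AS
(
VALUES
(1, TIMESTAMP('2000-01-01 12:00:00'))
, (1, TIMESTAMP('2000-01-01 12:00:00'))
, (1, TIMESTAMP('2000-01-01 12:00:01'))
)
SELECT ID, VALID_FROM
, VALID_FROM + (ROWNUMBER() OVER (PARTITION BY ID, VALID_FROM) - 1) SECOND AS VALID_FROM2
FROM TAB
ORDER BY ID, VALID_FROM2;
-- The second 12:00:00 row shifts to 12:00:01, colliding with the existing
-- 12:00:01 row, so VALID_FROM2 is no longer unique.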

You can use window functions:
select id,
valid_from + (row_number() over (partition by id, valid_from order by valid_from) - 1) second
from tab1;
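The + ... SECOND arithmetic above is Db2 syntax. If you are on Postgres instead (the question does not say which DBMS), an equivalent sketch uses interval arithmetic:
select id,
valid_from + (row_number() over (partition by id, valid_from order by valid_from) - 1) * interval '1 second' as valid_from2
from tab1;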

Related

Postgres - group rows by user, return one row per user in each group

I have a purchases table:
-----------------
user_id | amount
-----------------
1 | 12
1 | 4
1 | 8
2 | 23
2 | 45
2 | 7
I want a query that returns one row per user_id, and the row I want for each user_id is the one where the amount is smallest. So I should get as my result set:
-----------------
user_id | amount
-----------------
1 | 4
2 | 7
Using DISTINCT on the user_id column ensures I don't get duplicate users, but I don't know how to make it return, for each user, the row with the smallest amount.
You can use distinct on:
select distinct on (user_id) t.*
from t
order by user_id, amount;
Note: If you just want the smallest amount, then group by would be the typical solution:
select user_id, min(amount)
from t
group by user_id;
Distinct on is a convenient Postgres extension that makes it easy to get one row per group -- and it often performs better than other methods.
If you need to output a whole row that corresponds to the smallest amount, e.g. the table includes a transaction date and you need this in the output too, then a convenient method is to use row_number() over () to select the wanted rows, e.g.:
CREATE TABLE mytable(
user_id INTEGER NOT NULL
,amount INTEGER NOT NULL
,trandate DATE NOT NULL
);
INSERT INTO mytable(user_id,amount,trandate) VALUES (1,12,'2020-09-12');
INSERT INTO mytable(user_id,amount,trandate) VALUES (1,4,'2020-10-02');
INSERT INTO mytable(user_id,amount,trandate) VALUES (1,8,'2020-11-12');
INSERT INTO mytable(user_id,amount,trandate) VALUES (2,23,'2020-12-02');
INSERT INTO mytable(user_id,amount,trandate) VALUES (2,45,'2021-01-12');
INSERT INTO mytable(user_id,amount,trandate) VALUES (2,7,'2021-02-02');
select
user_id, amount, trandate
from (
select user_id, amount, trandate
, row_number() over(partition by user_id order by amount) as rn
from mytable
) t
where rn = 1;
result:
+---------+--------+------------+
| user_id | amount | trandate |
+---------+--------+------------+
| 1 | 4 | 2020-10-02 |
| 2 | 7 | 2021-02-02 |
+---------+--------+------------+
A demonstration of this is at db<>fiddle here

Redshift separate partitions with identical data by time

I have data in a Redshift table with product_id, price, and time_of_purchase. I want to create partitions for every time the price changed since the previous purchase. The price of an item may go back to a previous price, but I need this to be a separate partition, e.g.:
| product_id | price | time_of_purchase |
|------------|-------|---------------------|
| 1 | 2.00 | 2020-09-14 09:00 |
| 1 | 2.00 | 2020-09-14 10:00 |
| 1 | 3.00 | 2020-09-14 11:00 |
| 1 | 3.00 | 2020-09-14 12:00 |
| 1 | 2.00 | 2020-09-14 13:00 |
Note the price was $2, then went up to $3, then went back to $2. If I do something like (partition by product_id, price order by time_of_purchase) then the last row gets partitioned with the top two, which I don't want. How can I do this correctly so I get three separate partitions?
Use lag() to get the previous value and then a cumulative sum:
select t.*,
sum(case when prev_price = price then 0 else 1 end) over
(partition by product_id order by time_of_purchase) as partition_id
from (select t.*,
lag(price) over (partition by product_id order by time_of_purchase) as prev_price
from t
) t
As opposed to @Gordon Linoff, I prefer to do it step by step, using WITH clauses ...
And, as I have stated several times in other posts: please add your example data in a copy-paste-ready format, so we don't have to type it in ourselves.
I like to add my examples in a self-contained micro-demo format, with the input data already in the post, so everyone can play with it; that's why:
WITH
-- your input, typed manually ....
indata(product_id,price,tm_of_p) AS (
SELECT 1,2.00,TIMESTAMP '2020-09-14 09:00'
UNION ALL SELECT 1,2.00,TIMESTAMP '2020-09-14 10:00'
UNION ALL SELECT 1,3.00,TIMESTAMP '2020-09-14 11:00'
UNION ALL SELECT 1,3.00,TIMESTAMP '2020-09-14 12:00'
UNION ALL SELECT 1,2.00,TIMESTAMP '2020-09-14 13:00'
)
,
with_change_counter AS (
SELECT
*
, CASE WHEN LAG(price) OVER(PARTITION BY product_id ORDER BY tm_of_p) <> price
THEN 1
ELSE 0
END AS chg_count
FROM indata
)
SELECT
product_id
, price
, tm_of_p
, SUM(chg_count) OVER(PARTITION BY product_id ORDER BY tm_of_p) AS session_id
FROM with_change_counter;
-- out product_id | price | tm_of_p | session_id
-- out ------------+-------+---------------------+------------
-- out 1 | 2.00 | 2020-09-14 09:00:00 | 0
-- out 1 | 2.00 | 2020-09-14 10:00:00 | 0
-- out 1 | 3.00 | 2020-09-14 11:00:00 | 1
-- out 1 | 3.00 | 2020-09-14 12:00:00 | 1
-- out 1 | 2.00 | 2020-09-14 13:00:00 | 2

How to find the earliest data that is closest to the specified date?

Some business data has a create_on column to indicate the creation date, and I want to find the earliest data that is closest to a specified date. How do I write the SQL? I'm using a Postgres database.
drop table if exists t;
create table t (
id int primary key,
create_on date not null
-- ignore other columns
);
insert into t(id, create_on) values
(1, '2018-01-10'::date),
(2, '2018-01-20'::date);
-- there may be many other rows
| sn | specified-date | expected-result |
|----|----------------|-------------------------|
| 1 | 2018-01-09 | (1, '2018-01-10'::date) |
| 2 | 2018-01-10 | (1, '2018-01-10'::date) |
| 3 | 2018-01-11 | (1, '2018-01-10'::date) |
| 4 | 2018-01-19 | (1, '2018-01-10'::date) |
| 5 | 2018-01-20 | (2, '2018-01-20'::date) |
| 6 | 2018-01-21 | (2, '2018-01-20'::date) |
This is tricky, because you seem to want the most recent row on or before the date. But if no such row exists, you want the earliest date in the table:
with tt as (
select t.*
from t
where t.create_on <= :specified_date
order by t.create_on desc
fetch first 1 row only
)
select tt.* -- select most recent row on or before the date
from tt
union all
select t.* -- otherwise select the oldest row
from t
where not exists (select 1 from tt)
order by create_on
fetch first 1 row only;
EDIT:
You can also handle this with a single query:
select t.*
from t
order by (t.create_on <= :specified_date) desc, -- rows on or before the date first
(case when t.create_on <= :specified_date then create_on end) desc,
create_on asc
fetch first 1 row only;
Although this looks simpler, it might actually be more expensive, because the query cannot make use of an index on (create_on). And there is no where clause reducing the number of rows before sorting.
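For reference, the index in question would be just this (a sketch against the question's DDL):
create index t_create_on_idx on t (create_on);
With it in place, each branch of the union all version can be answered by reading a single row from one end of the index.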

Combining column values from a historical ledger into a daily summary in SQL Server?

I'm a little new to SQL and had a question. I am working with a view that contains a historical ledger/record of price changes for all products. Here is an example of what this view looks like:
+-----+-----------------+----------+----------+-----+
| SKU | PriceChangeDate | NewPrice | OldPrice | RN |
+-----+-----------------+----------+----------+-----+
| ABC | 1/1/2017 1:00 | $7.00 | $6.50 | 1 |
| ABC | 1/1/2017 1:30 | $6.75 | $7.00 | 2 |
| ABC | 1/1/2017 1:45 | $7.25 | $6.75 | 3 |
| DEF | 1/1/2017 1:05 | $8.75 | $8.00 | 1 |
| DEF | 1/1/2017 1:25 | $10.00 | $8.75 | 2 |
+-----+-----------------+----------+----------+-----+
I created the RN column myself, using row_number() partitioned over SKU and ordered by PriceChangeDate.
What I'm trying to do is create a query that will return each distinct SKU, its most recent NewPrice, and its oldest OldPrice for a single day, to essentially show the starting price and ending price for the day. It would look something like this:
+-----+-----------------+----------+----------+-----+
| SKU | PriceChangeDate | NewPrice | OldPrice | RN |
+-----+-----------------+----------+----------+-----+
| ABC | 1/1/2017 1:45 | $7.25 | $6.50 | 3 |
| DEF | 1/1/2017 1:25 | $10.00 | $8.00 | 2 |
+-----+-----------------+----------+----------+-----+
I know that I have to group by SKU, but I'm not sure how I could make this happen. Any tips/ideas?
Thank you in advance!
Just take your query and add desc to the order by in the over clause.
Then, use a subquery or CTE, and add:
where rn = 1
to the outer query.
That would be something like:
select . . .
from (select t.*,
row_number() over (partition by sku order by PriceChangeDate desc) as rn
from . . .
) t
where rn = 1;
If you actually want the number of price changes (what you call rn), then add count(*) over (partition by sku).
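Put together, that might look like the sketch below; price_ledger is a hypothetical name, since the question does not name the view:
select sku, PriceChangeDate, NewPrice, num_changes
from (select pl.*,
row_number() over (partition by sku order by PriceChangeDate desc) as seqnum,
count(*) over (partition by sku) as num_changes -- the question's "rn": total changes per SKU
from price_ledger pl -- hypothetical view name
) pl
where seqnum = 1;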
I should note that this is often more efficiently accomplished using:
select t.*
from t
where t.PriceChangeDate = (select max(t2.PriceChangeDate) from t t2 where t2.sku = t.sku);
In particular, this can take advantage of an index on (sku, PriceChangeDate).
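Since the source here is a view, the index would go on the underlying base table; a sketch, with dbo.PriceHistory as an assumed table name:
create index IX_PriceHistory_Sku_ChangeDate on dbo.PriceHistory (sku, PriceChangeDate);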
DECLARE #MyTable TABLE
(
SKU INT NOT NULL--use an appropriate data type here
,PriceChangeDate DATETIME NOT NULL
,NewPrice MONEY NOT NULL
,OldPrice MONEY NOT NULL
,RN INT
)
INSERT INTO #MyTable
(
SKU
,PriceChangeDate
,NewPrice
,OldPrice
,rn
)
VALUES
(1, '2017-01-01 01:00', 7.00, 6.50,1)
,(1, '2017-01-01 01:30', 6.75, 7.00,2)
,(1, '2017-01-01 01:45', 7.25, 6.75,3)
,(2, '2017-01-01 01:05', 8.75, 8.00,1)
,(2, '2017-01-01 01:25', 10.00, 8.75,2)
SELECT mx.SKU,
mx.PriceChangeDate,
mx.NewPrice AS NewPrice,
mn.oldPrice AS OldPrice,
mx.rn AS RN
FROM
(
SELECT *,
ROW_NUMBER() OVER( PARTITION BY SKU ORDER BY PriceChangeDate DESC ) AS maxval
FROM #MyTable
) mx
INNER JOIN
(
SELECT *,
ROW_NUMBER() OVER( PARTITION BY SKU ORDER BY PriceChangeDate ) AS minval
FROM #MyTable
) mn
ON mx.SKU = mn.SKU
AND mx.maxval = mn.minval
WHERE mx.maxval = 1
AND mn.minval = 1;
First, to not murder query performance, you will want to add a column to hold the date of the timestamps (in the proper time zone) and a table index (SKU, PriceChangeActualDate).
Then the solution will involve writing a window query:
DECLARE #MyTable TABLE
(
SKU INT NOT NULL--use an appropriate data type here
,[Timestamp] DATETIME NOT NULL
,[Date] AS CONVERT(DATE, [Timestamp]) PERSISTED
,NewPrice MONEY NOT NULL
,OldPrice MONEY NOT NULL
,PRIMARY KEY(SKU, [Timestamp])
)
--create an index on (SKU, [Date]) to help speed up query performance on large record sets
INSERT INTO #MyTable
(
SKU
,[Timestamp]
,NewPrice
,OldPrice
)
VALUES
(1, '2017-01-01 01:00', 7.00, 6.50)
,(1, '2017-01-01 01:30', 6.75, 7.00)
,(1, '2017-01-01 01:45', 7.25, 6.75)
,(2, '2017-01-01 01:05', 8.75, 8.00)
,(2, '2017-01-01 01:25', 10.00, 8.75)
SELECT DISTINCT
SKU
,[Date]
,[FirstUpdate] = MIN([Timestamp]) OVER(PARTITION BY SKU, [Date])
,[LastUpdate] = MAX([Timestamp]) OVER(PARTITION BY SKU, [Date])
,StartingPrice = FIRST_VALUE(OldPrice) OVER(PARTITION BY SKU, [Date] ORDER BY [Timestamp] ASC)
,EndingPrice = FIRST_VALUE(NewPrice) OVER(PARTITION BY SKU, [Date] ORDER BY [Timestamp] DESC)
FROM #MyTable
ORDER BY
SKU
,[Date]
I have added a column [Date] as a PERSISTED, computed column so it can be indexed (I didn't index it in the above code; see the comment), assuming the timestamps are in the proper time zone (another thing to avoid like the plague: time values without a clear time zone).
Note that FIRST_VALUE requires SQL Server 2012 or newer.
And a few (personal) stylistic guidelines, since you are new:
Put column aliases first, as this is much easier to read (everything is left-justified this way instead of randomly to the right somewhere, sometimes overflowing to a new line depending on your editor)
Be very conspicuous with time zones. You should also check out the DATETIMEOFFSET type (see the sketch after this list)
Avoid Hungarian naming if you can, or at least do it accurately: don't call timestamps dates
Until you are formatting time data for users, stick to ISO 8601
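For example, a minimal sketch of the DATETIMEOFFSET suggestion, reusing the table shape from above; each value carries its own UTC offset, so there is no ambiguity about the time zone:
DECLARE @Example TABLE
(
SKU INT NOT NULL
,[Timestamp] DATETIMEOFFSET(0) NOT NULL -- stores the UTC offset with each value
);
INSERT INTO @Example (SKU, [Timestamp])
VALUES (1, '2017-01-01 01:00:00 -05:00'); -- unambiguous: 01:00 at UTC-5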

Sequencing data using Hive functions

I have a Hive table:
create table abc ( id int, channel string, time int );
insert into table abc values
(1,'a', 12),
(1,'c', 10),
(1,'b', 15),
(2,'a', 15),
(2,'c', 12),
(2,'c', 7);
I want the resultant table to look something like this:
id , journey
1, c->a->b
2, c->c->a
The journey column is arranged in ascending order of time, per id.
I have tried
select id , concat_ws("->", collect_list(channel)) as journey
from abc
group by id
but it does not preserve order.
Use a subquery with order by time (to preserve order), then in the outer query use collect_list with a group by clause.
hive> select id , concat_ws("->", collect_list(channel)) as journey from
(
select * from abc order by time
)t
group by id;
+-----+----------------+--+
| id | journey |
+-----+----------------+--+
| 1 | 'c'->'a'->'b' |
| 2 | 'c'->'c'->'a' |
+-----+----------------+--+
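Note that the global order by in the subquery funnels all rows through a single reducer. For larger tables, a common variant (a sketch; verify the ordering guarantee on your Hive version) distributes rows by id and sorts within each reducer, keeping each id's rows in time order while scaling out:
select id , concat_ws("->", collect_list(channel)) as journey from
(
select * from abc distribute by id sort by id, time
)t
group by id;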