Redshift Alternative for Correlated Sub-Query - sql

I am using Redshift and need an alternative to a correlated subquery; I am getting the "correlated subquery is not supported" error. However, for this particular exercise, identifying all sales transactions made by the same customer within a given hour of the originating transaction, I am not sure a traditional left join would work either, because the query depends on context, i.e. the current value from the parent select. I have also tried something similar using the row_number() window function, but again I need a way to window/partition on a date range, not just customer_id.
The overall goal is to find the first sales transaction for a given customer id, then find all subsequent transactions made within 60 minutes of that first transaction. This logic continues for the remainder of the transactions for the same customer (and ultimately for all customers in the database). That is, once the initial 60 minute window has been established from the time of the first transaction, a second 60 minute window begins at the end of the first, all transactions within that second window are identified and combined, and the process repeats for the remaining transactions.
The output would list the first transaction id that started the 60 minute window, followed by the subsequent transaction ids made within that window. The 2nd row would show the first transaction id made by the same customer in the next 60 minute window (again, the first transaction after the first 60 minute window starts the second window), followed by the transactions made within that second window.
The query example in its most basic form looks like the query below:
select
    s1.customer_id,
    s1.transaction_id,
    s1.order_time,
    (
        select s2.transaction_id
        from sales s2
        where s2.order_time > s1.order_time
          and s2.order_time <= dateadd(minute, 60, s1.order_time)
          and s2.customer_id = s1.customer_id
        order by s2.order_time asc
        limit 1
    ) as sales_transaction_id_1,
    (
        select s3.transaction_id
        from sales s3
        where s3.order_time > s1.order_time
          and s3.order_time <= dateadd(minute, 60, s1.order_time)
          and s3.customer_id = s1.customer_id
        order by s3.order_time asc
        limit 1 offset 1
    ) as sales_transaction_id_2,
    (
        select s4.transaction_id
        from sales s4
        where s4.order_time > s1.order_time
          and s4.order_time <= dateadd(minute, 60, s1.order_time)
          and s4.customer_id = s1.customer_id
        order by s4.order_time asc
        limit 1 offset 2
    ) as sales_transaction_id_3
from
    (
        select
            sales.customer_id,
            sales.transaction_id,
            sales.order_time
        from sales
        order by sales.order_time desc
    ) s1;
For example, if a customer made the following transactions:
customer_id  transaction_id  order_time
1234         33453           2017-06-05 13:30
1234         88472           2017-06-05 13:45
1234         88477           2017-06-05 14:10
1234         99321           2017-06-07 8:30
1234         99345           2017-06-07 8:45
The expected output would be:
customer_id  transaction_id  sales_transaction_id_1  sales_transaction_id_2  sales_transaction_id_3
1234         33453           88472                   88477                   NULL
1234         99321           99345                   NULL                    NULL
Also, it appears Redshift does not support lateral joins, which seems to further restrict the options at my disposal. Any help would be greatly appreciated.

You can use window functions to get the subsequent transactions for every transaction. The window will be customer / hour, and you can rank records to find the first "anchor" transaction and then pick up the subsequent transactions you need:
with
transaction_chains as (
    select
        customer_id
        ,transaction_id
        ,order_time
        -- rank transactions within the customer/hour window to find the first "anchor" transaction
        ,row_number() over (partition by customer_id, date_trunc('hour', order_time) order by order_time) as rn
        -- 1st next order
        ,lead(transaction_id, 1) over (partition by customer_id, date_trunc('hour', order_time) order by order_time) as transaction_id_1
        ,lead(order_time, 1) over (partition by customer_id, date_trunc('hour', order_time) order by order_time) as order_time_1
        -- 2nd next order
        ,lead(transaction_id, 2) over (partition by customer_id, date_trunc('hour', order_time) order by order_time) as transaction_id_2
        ,lead(order_time, 2) over (partition by customer_id, date_trunc('hour', order_time) order by order_time) as order_time_2
        -- 3rd next order
        ,lead(transaction_id, 3) over (partition by customer_id, date_trunc('hour', order_time) order by order_time) as transaction_id_3
        ,lead(order_time, 3) over (partition by customer_id, date_trunc('hour', order_time) order by order_time) as order_time_3
    from sales
)
select
    customer_id
    ,transaction_id
    ,transaction_id_1
    ,transaction_id_2
    ,transaction_id_3
from transaction_chains
where rn = 1;
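The order_time_N columns in the CTE also let you enforce the 60-minute cutoff explicitly, since the hour partition by itself only groups by calendar hour. A minimal sketch of a variation of the final select, reusing the transaction_chains CTE above (the case expressions are an assumption, not part of the original answer):
select
    customer_id
    ,transaction_id
    ,case when order_time_1 <= dateadd(minute, 60, order_time) then transaction_id_1 end as transaction_id_1
    ,case when order_time_2 <= dateadd(minute, 60, order_time) then transaction_id_2 end as transaction_id_2
    ,case when order_time_3 <= dateadd(minute, 60, order_time) then transaction_id_3 end as transaction_id_3
from transaction_chains
where rn = 1;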

From your description, you just want group by and some sort of date difference. I'm not sure how you want to combine the rows, but here is the basic idea:
select s.customer_id,
       min(order_time) as first_order_in_hour,
       max(order_time) as last_order_in_hour,
       count(*) as num_orders
from (select s.*,
             min(order_time) over (partition by customer_id) as min_ot
      from sales s
     ) s
group by customer_id, floor(datediff(second, min_ot, order_time) / (60 * 60));
This formulation (or something similar, because Postgres does not have datediff()) would also be much faster in Postgres.
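For reference, a hedged sketch of the Postgres equivalent of that hour bucket (using interval-to-epoch extraction instead of datediff()) would swap the grouping expression for:
floor(extract(epoch from (order_time - min_ot)) / 3600)
with the rest of the query unchanged.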

Related

How to return date of reaching a certain threshold

Using SQL Server Management Studio.
Let's say I have a table with transactions that contains User, Date, Transaction amount. I want a query that will return the date when a certain amount is reached - let's say 100.
For example, the same user performs 10 transactions of 10 EUR each. I want the query to select the date of the last transaction, because that's when his volume reached 100. Of course, once 100 is reached, the query shouldn't keep moving the date to later transactions; it should stay at the date when 100 was first reached.
I wrote this in pgAdmin but I think the syntax should be the same.
with cumulative as
(
    select customer_id,
           sum(amount) over (partition by customer_id
                             order by payment_date
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cum_amt,
           payment_date
    from payment
)
select customer_id,
       min(payment_date) as threshold_reached
from cumulative
where cum_amt >= 100
group by customer_id;
An alternative is to flag, with a case expression, the exact row on which the running total crosses 100:
case when sum(amt) over (partition by user order by date) - amt < 100
      and sum(amt) over (partition by user order by date) >= 100
     then 1 else 0 end
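A minimal sketch of how that flag could be wired into a full query, assuming a transactions table with the user, date, and amt columns used in the fragment (in practice user and date are reserved words and would need quoting):
select user, date as threshold_reached
from (
    select user, date, amt,
           case when sum(amt) over (partition by user order by date) - amt < 100
                 and sum(amt) over (partition by user order by date) >= 100
                then 1 else 0 end as crossed
    from transactions
) flagged
where crossed = 1;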

Past 7 days running amounts average as progress per each date

So, the query is simple, but I am facing issues in implementing the SQL logic. Here's the setup: suppose I have records like
Phoneno  Company  Date      Amount
83838    xyz      20210901  100
87337    abc      20210902  500
47473    cde      20210903  600
The expected output is the past-7-days progress as a running average of the amount for each date (the current date and the 6 days before it):
Date      Amount  Avg
20210901  100     100
20210902  500     300
20210903  600     400
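(Those expected figures are plain running averages of the amounts seen so far: 100 / 1 = 100, (100 + 500) / 2 = 300, and (100 + 500 + 600) / 3 = 400.)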
I tried
Select date, amount, select
avg(lg) from (
Select case when lag(amount)
Over (order by NULL) IS NULL
THEN AMOUNT
ELSE
lag(amount)
Over (order by NULL) END AS LG)
From table
WHERE DATE>=t.date-7) as avg
From table t;
But I am getting wrong avg values. Could anyone please help?
Note: I've tried without lag too; it produces wrong averages as well.
You could use a self join to group the dates
select distinct
a.dt,
b.dt as preceding_dt, --just for QA purpose
a.amt,
b.amt as preceding_amt,--just for QA purpose
avg(b.amt) over (partition by a.dt) as avg_amt
from t a
join t b on a.dt-b.dt between 0 and 6
group by a.dt, b.dt, a.amt, b.amt; --to dedupe the data after the join
If you want to make your correlated subquery approach work, you don't really need the lag.
select dt,
amt,
(select avg(b.amt) from t b where a.dt-b.dt between 0 and 6) as avg_lg
from t a;
If you don't have multiple rows per date, this gets even simpler
select dt,
amt,
avg(amt) over (order by dt rows between 6 preceding and current row) as avg_lg
from t;
Also, the condition DATE >= t.date - 7 that you used is bounded on only one side, so it will qualify many dates that should not have been included.
DEMO
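For completeness, a two-sided version of that predicate (a sketch assuming the same Oracle-style date arithmetic used in the queries above) would be:
where b.dt between a.dt - 6 and a.dt
which restricts the window to the current date and the 6 days before it.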
You can use an analytic function with a windowing clause to get your results:
SELECT DISTINCT BillingDate,
AVG(amount) OVER (ORDER BY BillingDate
RANGE BETWEEN TO_DSINTERVAL('7 00:00:00') PRECEDING
AND TO_DSINTERVAL('0 00:00:00') FOLLOWING) AS RUNNING_AVG
FROM accounts
ORDER BY BillingDate;
Here is a DBFiddle showing the query in action (LINK)

Window functions and calculating averages with tricky data manipulation

I have a SQL Server programming challenge involving some manipulations of healthcare patient pulse readings.
The goal is to do an average of readings within a certain time period and to only include the latest pulse reading of the day.
As an example, times are appt_time:
PATIENT 1                PATIENT 2
1/1/2019        80       1/3/2019  90
1/2/2019 10 am  78       1/4/2019  85
1/2/2019 1 pm   85
1/3/2019        90
A patient may or may not have a second reading in a day. Only the 3 latest chronological readings are used for the average. If fewer than 3 readings are available, the average is computed over 2 readings, or the single reading is taken as the average.
Can this be done with the SQL window functions? This is a little more efficient than using a subquery.
I have used first_value with descending sorts successfully to pick the last pulse in a day. I have then tried various row_number approaches to exclude the marked-off row (the first pulse of the day when 2 readings are present). I cannot seem to correctly calculate the average. I have used row_number in both select and from clauses.
with CTEBPI3
AS (
SELECT pat_id
,appt_time
,bp_pulse
,first_VALUE (bp_pulse) over(partition by appt_time order by appt_time desc ) fv
,ROW_NUMBER() OVER (PARTITION BY appt_time ORDER BY APPT_time DESC)RN1
,Round(Sum(bp_pulse) OVER (PARTITION BY Pat_id) / COUNT (appt_time) OVER (PARTITION BY Pat_id), 0) AS adJAVGSYS3
FROM
pat_enc
WHERE appt_time > '07/15/2018'
)
select *,
WHEN rn=1
Average for pat1 should be 85
Average for pat2 should be 87.5
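(Working that out from the sample data: patient 1's latest reading on each of the last three days is 80, 85, and 90, so (80 + 85 + 90) / 3 = 85; patient 2 has only two days of readings, so (90 + 85) / 2 = 87.5.)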
You can do this with two window functions:
MAX(appt_time) OVER ... to get the latest time per day
DENSE_RANK() OVER ... to get the last three days
You get the date part from your datetime with CONVERT(DATE, appt_time). The average function AVG is already built in :-)
The complete query:
select pat_id, avg(bp_pulse) as average_pulse
from
(
select
pat_id, appt_time, bp_pulse,
max(appt_time) over (partition by pat_id, convert(date, appt_time)) as max_time,
dense_rank() over (partition by pat_id order by convert(date, appt_time) desc) as rn
from pat_enc
) evaluated
where appt_time = max_time -- last row per day
and rn <= 3 -- last three days
group by pat_id
order by pat_id;
If the column bp_pulse is defined as an integer, you must convert it to a decimal to avoid integer arithmetic:
select pat_id, avg(convert(decimal, bp_pulse)) as average_pulse
Demo: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=3df744fcf2af89cdfd8b3cd8b6546d89
Actually, window functions are not necessarily more efficient. It is worth comparing:
select p.pat_id, avg(p.bp_pulse)
from pat_enc p
where -- appt_time > '2018-07-15' and -- don't know if this is necessary
p.appt_time >= (select distinct convert(date, appt_time)
from pat_enc p2
where p2.pat_id = p.pat_id
order by convert(date, appt_time) desc
offset 2 row fetch first 1 row only
) and
p.appt_time = (select max(p2.appt_time)
from pat_enc p2
where p2.pat_id = p.pat_id and
convert(date, p2.appt_time) = convert(date, p.appt_time)
);
This wants an index on pat_enc(pat_id, appt_time).
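A minimal sketch of that index (the index name is just illustrative):
CREATE INDEX ix_pat_enc_pat_appt
    ON pat_enc (pat_id, appt_time);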
In fact, there are a variety of ways to write this logic, with different mixes of subqueries and window functions (this is one extreme).
Which performs the best will depend on the nature of your data. In particular:
The number of appointments on the same day -- is this normally 1 or a large number?
The overall number of days with appointments -- is this right around three or are there hundreds?
You need to test on your data, but I think window functions will work best when relatively few rows are filtered out (~1 appointment/day, ~3 days with appointments). Subqueries will be helpful when more rows are being filtered.

SQL Rolling Summary Statistics For Set Timeframe

I have a table that contains information about log-in events. Every time a user logs in, a record is added containing the user and the date. I want to calculate a new column in that table that holds the number of times that user has logged in in the past 31 days (including the current attempt). This is a simplified version of what my table looks like, including the column I want to add:
UserID Date LoginsInPast31Days
-------- ------------- --------------------
1 01-01-2012 1
2 02-01-2012 1
2 10-01-2012 2
1 25-01-2012 2
2 03-02-2012 2
2 22-03-2012 1
I know how to calculate a total amount of login attempts: I'd use COUNT(*) OVER (PARTITION BY UserId ORDER BY Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). However, I want to limit the timeframe to the last 31 days. My guess is that I have to change the UNBOUNDED PRECEDING, but how do I alter it so that it selects the right set of rows?
One pretty efficient way is to add a cancelling record 31 days after each login date. It looks like this:
select userid, dte,
       sum(inc) over (partition by userid order by dte) as LoginsInPast31Days
from ((select distinct userid, logindate as dte, 1 as inc from logins)
      union all
      (select distinct userid, dateadd(day, 31, logindate) as dte, -1 as inc from logins)
     ) l;
You're almost there, 2 adjustments:
First make sure to group by user and date so you know how many rows to select
Secondly, you'll need to use 'ROWS BETWEEN CURRENT ROW AND 31 FOLLOWING' since you cannot limit the number of preceding records to use. By using descending sort order, you'll get the required result.
Combine these tips and you'll get:
SELECT t.userid,
       CAST(t.login_ts AS DATE) AS login_date,
       SUM(COUNT(*)) OVER (
           PARTITION BY t.userid
           ORDER BY CAST(t.login_ts AS DATE) DESC
           ROWS BETWEEN CURRENT ROW AND 31 FOLLOWING
       ) AS LoginsInPast31Days
FROM table AS t
GROUP BY t.userid, CAST(t.login_ts AS DATE)

query to display additional column based on aggregate value

I've been mulling on this problem for a couple of hours now with no luck, so I thought people on SO might be able to help :)
I have a table with data regarding processing volumes at stores. The first three columns shown below can be queried from that table. What I'm trying to do is to add a 4th column that is basically a flag indicating whether a store has processed >= $150, and if so, displays the corresponding date. The way this works is that the first instance where the store surpasses $150 is the date that gets displayed. Subsequent processing volumes don't count after the first time the activation threshold is hit. For example, for store 4, there's just one instance of the activated date.
store_id  sales_volume  date        activated_date
---------------------------------------------------
2         5             03/14/2012
2         125           05/21/2012
2         30            11/01/2012  11/01/2012
3         100           02/06/2012
3         140           12/22/2012  12/22/2012
4         300           10/15/2012  10/15/2012
4         450           11/25/2012
5         100           12/03/2012
Any insights as to how to build out this fourth column? Thanks in advance!
The solution starts by calculating the cumulative sales. Then, you want the activation date only when the cumulative sales first pass through the $150 level. This happens when adding the current sales amount pushes the cumulative amount over the threshold. The following case expression handles this:
select t.store_id, t.sales_volume, t.date,
(case when 150 > cumesales - t.sales_volume and 150 <= cumesales
then date
end) as ActivationDate
from (select t.*,
sum(sales_volume) over (partition by store_id order by date) as cumesales
from t
) t
If you have an older version of Postgres that does not support cumulative sum, you can get the cumulative sales with a subquery like:
(select sum(sales_volume) from t t2 where t2.store_id = t.store_id and t2.date <= t.date) as cumesales
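Substituting that subquery in place of the window function, a sketch of the full query for an older Postgres would look like:
select t.store_id, t.sales_volume, t.date,
       (case when 150 > cumesales - t.sales_volume and 150 <= cumesales
             then t.date
        end) as ActivationDate
from (select t.*,
             (select sum(t2.sales_volume)
              from t t2
              where t2.store_id = t.store_id and t2.date <= t.date
             ) as cumesales
      from t
     ) t;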
Variant 1
You can LEFT JOIN to a table that calculates the first date surpassing the $150 limit per store:
SELECT t.*, b.activated_date
FROM tbl t
LEFT JOIN (
SELECT store_id, min(thedate) AS activated_date
FROM (
SELECT store_id, thedate
,sum(sales_volume) OVER (PARTITION BY store_id
ORDER BY thedate) AS running_sum
FROM tbl
) a
WHERE running_sum >= 150
GROUP BY 1
) b ON t.store_id = b.store_id AND t.thedate = b.activated_date
ORDER BY t.store_id, t.thedate;
The calculation of the first day has to be done in two steps, since the window function accumulating the running sum has to be applied in a separate SELECT.
Variant 2
Another window function instead of the LEFT JOIN. May or may not be faster. Test with EXPLAIN ANALYZE.
SELECT *
,CASE WHEN running_sum >= 150 AND thedate = first_value(thedate)
OVER (PARTITION BY store_id, running_sum >= 150 ORDER BY thedate)
THEN thedate END AS activated_date
FROM (
SELECT *
,sum(sales_volume)
OVER (PARTITION BY store_id ORDER BY thedate) AS running_sum
FROM tbl
) b
ORDER BY store_id, thedate;
sqlfiddle demonstrating both.