Using a where statement with rank and a subquery in SQL - sql

so I have a table that's sort of like this:
DELIVERY_AREA_ID  DELIVERY_RADIUS_METERS  EVENT_STARTED_TIMESTAMP
234sfd            4000                    2020-01-01 12:19:29.719
234sfd            6500                    2020-01-01 12:31:40.325
234sfd            3500                    2020-01-01 12:53:10.538
234sfd            6500                    2020-01-01 13:11:36.094
234sfd            3500                    2020-01-01 13:32:26.754
234sfd            6500                    2020-01-01 13:59:11.104
234sfd            6500                    2020-01-02 07:44:16.792
234sfd            3500                    2020-01-02 08:07:36.284
234sfd            6500                    2020-01-02 08:54:08.014
234sfd            3500                    2020-01-02 09:53:05.853
234sfd            6500                    2020-01-02 10:04:39.443
234sfd            10000                   2020-07-01 08:29:20.194
234sfd            3500                    2020-07-03 07:50:41.782
234sfd            10000                   2020-07-03 08:33:14.695
234sfd            3500                    2020-07-05 07:47:05.539
234sfd            10000                   2020-07-05 07:53:13.930
234sfd            3500                    2020-07-05 09:18:57.688
234sfd            10000                   2020-07-05 09:51:07.547
234sfd            3500                    2020-07-19 18:02:14.099
The real data is much more varied, but it follows that format.
I am trying to get, in a single query on a Snowflake database, the top-ranked radius by duration. I currently have this:
SELECT DELIVERY_AREA_ID,
MAX(DELIVERY_RADIUS_METERS) AS default_delivery_radius,
MONTH_YEAR,
DELIVERY_RADIUS_METERS,
SUM(DURATION_SECONDS) AS total_duration,
MAX(EVENT_STARTED_TIMESTAMP) AS MAX_TIMESTAMP,
RANK() OVER (PARTITION BY DELIVERY_AREA_ID, MONTH_YEAR
ORDER BY SUM(DURATION_SECONDS) DESC) AS RADIUS_RANK
FROM (
-- Add the MONTH_YEAR column to the delivery_radius_log table
SELECT DELIVERY_AREA_ID,
DELIVERY_RADIUS_METERS,
EVENT_STARTED_TIMESTAMP,
CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/',
YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
DATEADD(second, DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_seconds
FROM delivery_radius_log
) t -- added alias here
GROUP BY DELIVERY_AREA_ID, MONTH_YEAR, DELIVERY_RADIUS_METERS
I want to get the first rank for each month_year but when I use
where RADIUS_RANK = 1
I get an error: Syntax error: unexpected 'where'. (line 21)
I'm not sure how to resolve this.
I have tried this link which appears to have the same question but the solution is already what I am trying.

In standard SQL, you cannot solve this without querying the output of your query, in other words, using the output of that query as the input for another top-level query.
You cannot use a column produced at the projection level in the WHERE clause.
You cannot use analytic functions in the WHERE clause.
You cannot use analytic functions in a HAVING clause.
So the portable solution is to query the output of that query and keep only the rows where the rank is 1.
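A minimal sketch of that wrap-and-filter pattern, run through Python's built-in sqlite3 module rather than Snowflake so it can be tried locally. The table radius_durations, its columns, and the sample durations are hypothetical stand-ins for the per-row durations that the question computes with LEAD; the rank is computed in an inner query and filtered in the outer one.

```python
import sqlite3

# Hypothetical pre-computed durations: (area_id, month_year, radius_meters, duration_seconds)
SAMPLE_ROWS = [
    ("234sfd", "1/2020", 6500, 1000),
    ("234sfd", "1/2020", 3500, 400),
    ("234sfd", "1/2020", 6500, 500),
    ("234sfd", "7/2020", 10000, 300),
    ("234sfd", "7/2020", 3500, 900),
]

QUERY = """
WITH totals AS (
    -- total duration per area, month and radius
    SELECT area_id, month_year, radius_meters,
           SUM(duration_seconds) AS total_duration
    FROM radius_durations
    GROUP BY area_id, month_year, radius_meters
),
ranked AS (
    -- the window function lives in an inner query ...
    SELECT totals.*,
           RANK() OVER (PARTITION BY area_id, month_year
                        ORDER BY total_duration DESC) AS radius_rank
    FROM totals
)
-- ... so the outer query can filter on its result with a plain WHERE
SELECT area_id, month_year, radius_meters, total_duration
FROM ranked
WHERE radius_rank = 1
ORDER BY area_id, month_year
"""


def top_radius_per_month(rows):
    """Return the radius with the largest total duration per area and month."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE radius_durations ("
        "area_id TEXT, month_year TEXT, radius_meters INTEGER, duration_seconds INTEGER)"
    )
    con.executemany("INSERT INTO radius_durations VALUES (?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in top_radius_per_month(SAMPLE_ROWS):
        print(row)
```

Because RANK() gives tied rows the same rank, a month with two radii of equal total duration would return both rows.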

To filter on a window function at the same query level, use the QUALIFY clause:
SELECT DELIVERY_AREA_ID,
MAX(DELIVERY_RADIUS_METERS) AS default_delivery_radius,
MONTH_YEAR,
DELIVERY_RADIUS_METERS,
SUM(DURATION_SECONDS) AS total_duration,
MAX(EVENT_STARTED_TIMESTAMP) AS MAX_TIMESTAMP,
RANK() OVER (PARTITION BY DELIVERY_AREA_ID, MONTH_YEAR
ORDER BY SUM(DURATION_SECONDS) DESC) AS RADIUS_RANK
FROM (
-- Add the MONTH_YEAR column to the delivery_radius_log table
SELECT DELIVERY_AREA_ID,
DELIVERY_RADIUS_METERS,
EVENT_STARTED_TIMESTAMP,
CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/',
YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
DATEADD(second, DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_seconds
FROM delivery_radius_log
) t -- added alias here
GROUP BY DELIVERY_AREA_ID, MONTH_YEAR, DELIVERY_RADIUS_METERS
QUALIFY RADIUS_RANK = 1;
If the rank column is not required in the output, the entire expression can be moved into QUALIFY:
SELECT DELIVERY_AREA_ID,
MAX(DELIVERY_RADIUS_METERS) AS default_delivery_radius,
MONTH_YEAR,
DELIVERY_RADIUS_METERS,
SUM(DURATION_SECONDS) AS total_duration,
MAX(EVENT_STARTED_TIMESTAMP) AS MAX_TIMESTAMP
FROM (
-- Add the MONTH_YEAR column to the delivery_radius_log table
SELECT DELIVERY_AREA_ID,
DELIVERY_RADIUS_METERS,
EVENT_STARTED_TIMESTAMP,
CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/',
YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
DATEADD(second, DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_seconds
FROM delivery_radius_log
) t -- added alias here
GROUP BY DELIVERY_AREA_ID, MONTH_YEAR, DELIVERY_RADIUS_METERS
QUALIFY RANK() OVER (PARTITION BY DELIVERY_AREA_ID, MONTH_YEAR
ORDER BY SUM(DURATION_SECONDS) DESC) = 1;

Related

SQL for identifying % of orders placed within 20 minutes of each other

I have a dataset like the one below and would like to know various ways to answer the question: what % of orders were within 20 minutes of each other?
CustomerId  Order_#  Order_Date
123         000112   12/25/2011 10:30
123         000113   12/25/2011 10:35
123         000114   12/25/2011 10:45
123         000115   12/25/2011 10:55
456         000113   12/25/2011 10:35
456         000113   1/25/2011 10:30
789         000117   9/25/2011 2:00
Result set should look like this:
3/7 ≈ 0.43 (about 43%)
My approach was to first do a self join on the table to get a count of rows which fall within the 20 minutes, but I am struggling to take out the duplicate rows.
Anyways, look forward to seeing some crafty answers.
Thank you.
You can use lead() and lag():
select avg( case when prev_order_date > order_date - interval '20 minute' or
next_order_date < order_date + interval '20 minute'
then 1.0 else 0
end) as ratio_within_20_minutes
from (select t.*,
lag(order_date) over (partition by customer_id order by order_date) as prev_order_date,
lead(order_date) over (partition by customer_id order by order_date) as next_order_date
from t
) t;
Note that date/time functions vary a lot among databases. This uses Standard SQL syntax for the comparisons. The exact syntax probably varies, depending on your database.
If you want this per customer then add group by customer_id to the query and customer_id to the select.
EDIT:
In SQL Server, this would be:
select avg( case when prev_order_date > dateadd(minute, -20, order_date) or
next_order_date < dateadd(minute, 20, order_date)
then 1.0 else 0
end) as ratio_within_20_minutes
from (select t.*,
lag(order_date) over (partition by customer_id order by order_date) as prev_order_date,
lead(order_date) over (partition by customer_id order by order_date) as next_order_date
from t
) t;
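For anyone who wants to try the lag()/lead() idea locally, here is a small sketch using Python's built-in sqlite3 module. It assumes the dates are stored as ISO text (the sample rows are rewritten that way), and the table name orders and column order_no are placeholders. Timestamps are converted to epoch seconds so "within 20 minutes" becomes a plain integer comparison.

```python
import sqlite3

# Sample rows from the question, rewritten as ISO timestamps (an assumption).
SAMPLE_ORDERS = [
    (123, "000112", "2011-12-25 10:30:00"),
    (123, "000113", "2011-12-25 10:35:00"),
    (123, "000114", "2011-12-25 10:45:00"),
    (123, "000115", "2011-12-25 10:55:00"),
    (456, "000113", "2011-12-25 10:35:00"),
    (456, "000113", "2011-01-25 10:30:00"),
    (789, "000117", "2011-09-25 02:00:00"),
]

QUERY = """
SELECT AVG(CASE WHEN order_ts - prev_ts < 20 * 60
                  OR next_ts - order_ts < 20 * 60
                THEN 1.0 ELSE 0.0 END) AS ratio_within_20_minutes
FROM (
    SELECT order_ts,
           LAG(order_ts)  OVER (PARTITION BY customer_id ORDER BY order_ts) AS prev_ts,
           LEAD(order_ts) OVER (PARTITION BY customer_id ORDER BY order_ts) AS next_ts
    FROM (
        -- epoch seconds make the 20-minute check an integer comparison
        SELECT customer_id,
               CAST(strftime('%s', order_date) AS INTEGER) AS order_ts
        FROM orders
    ) AS o
) AS t
"""


def ratio_within_20_minutes(orders):
    """Fraction of orders with a same-customer neighbour less than 20 minutes away."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (customer_id INTEGER, order_no TEXT, order_date TEXT)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    (ratio,) = con.execute(QUERY).fetchone()
    con.close()
    return ratio


if __name__ == "__main__":
    print(ratio_within_20_minutes(SAMPLE_ORDERS))
```

A missing previous or next order yields NULL, which the CASE treats as "not within 20 minutes". On these seven rows the logic flags the four orders of customer 123, so it returns 4/7 rather than the 3/7 quoted in the question; the intended counting rule may therefore differ.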

select first in and last out time - different date - from data finger

Here is my data finger table, [dbo].[tFPLog]
CardID Date Time TransactionCode
100 2020-09-01 08:00 IN
100 2020-09-01 17:00 OUT
100 2020-09-01 17:10 OUT
200 2020-09-01 16:00 IN
200 2020-09-02 02:00 OUT
200 2020-09-02 02:15 OUT
100 2020-09-02 07:00 IN
100 2020-09-02 16:00 OUT
200 2020-09-02 09:55 IN
200 2020-09-02 10:00 IN
200 2020-09-02 21:00 OUT
Conditions
Assume employees will clock IN and OUT on the same day or the next day.
Assume there can be multiple IN and OUT records for an employee on the same day/next day, so I need the first IN and the last OUT.
Duration = (LastOutTime - FirstInTime)
The current result I get using this query:
WITH CTE AS(
SELECT CardID,
[Date] AS DateIn,
MIN(CASE TransactionCode WHEN 'In' THEN [time] ELSE '23:59:59.999' END) AS TimeIn, --'23:59:59.999' as we are after the MIN, and NULL is the lowest value
[Date] AS DateOut,
MAX(CASE TransactionCode WHEN 'Out' THEN [time] END) AS TimeOut
FROM YourTable
GROUP BY CardID, [Date])
SELECT C.DateIn,
C.TimeIn,
C.DateOut,
C.TimeOut,
DATEADD(MINUTE,DATEDIFF(MINUTE,C.TimeIn,C.TimeOut),CONVERT(time(0),'00:00:00')) AS Duration
FROM CTE C;
=====The Current Result======
CardID DateIN TimeIN DateOUT TimeOUT Duration
100 2020-09-01 08:00 2020-09-01 17:10 09:10
200 2020-09-01 16:00 ? ? ?
100 2020-09-02 07:00 2020-09-02 16:00 09:00
200 2020-09-02 09:55 2020-09-02 21:00 11:05
=====The Result Needed=====
I want this result.
CardID DateIN TimeIN DateOUT TimeOUT Duration
100 2020-09-01 08:00 2020-09-01 17:10 09:10
200 2020-09-01 16:00 2020-09-02 02:15 10:15
100 2020-09-02 07:00 2020-09-02 16:00 09:00
200 2020-09-02 09:55 2020-09-02 21:00 11:05
How can I get the DateOUT and TimeOUT on the next day, with the condition first IN and last OUT? Please help, thank you in advance.
This seems like you were really overly complicating the problem. Just use some conditional aggregation, and then get the difference in minutes:
WITH CTE AS(
SELECT CardID,
[Date] AS DateIn,
MIN(CASE TransactionCode WHEN 'In' THEN [time] ELSE '23:59:59.999' END) AS TimeIn, --'23:59:59.999' as we are after the MIN, and NULL is the lowest value
[Date] AS DateOut,
MAX(CASE TransactionCode WHEN 'Out' THEN [time] END) AS TimeOut
FROM YourTable
GROUP BY CardID, [Date])
SELECT C.DateIn,
C.TimeIn,
C.DateOut,
C.TimeOut,
DATEADD(MINUTE,DATEDIFF(MINUTE,C.TimeIn,C.TimeOut),CONVERT(time(0),'00:00:00')) AS Duration
FROM CTE C;
This assumes that [date] is a date and [time] is a time (because, after all, that is what they are called...).
Side note: it seems somewhat redundant to have a DateIn and a DateOut column when they will always have the same value. You might as well just have a [Date] column.
Or perhaps, you are actually after this?
WITH CTE AS(
SELECT CardID,
[Date] AS DateIn,
[Time] AS TimeIn,
LEAD([Date]) OVER (PARTITION BY CardID ORDER BY [Date], [Time]) AS DateOut,
LEAD([Time]) OVER (PARTITION BY CardID ORDER BY [Date], [Time]) AS TimeOut,
TransactionCode
FROM dbo.YourTable)
SELECT C.DateIn,
C.TimeIn,
C.DateOut,
C.TimeOut
FROM CTE C
WHERE TransactionCode = 'IN';
Note that if that is the case, you would actually be better off storing the values [date] and [time] in a single column as a datetime/datetime2, not separate ones; as the values are clearly not distinct from each other.
Based on the (hopefully) final goal posts:
WITH VTE AS(
SELECT *
FROM (VALUES(100,CONVERT(date,'20200901'),CONVERT(time(0),'08:00:00'),'IN'),
(100,CONVERT(date,'20200901'),CONVERT(time(0),'17:00:00'),'OUT'),
(100,CONVERT(date,'20200901'),CONVERT(time(0),'17:10:00'),'OUT'),
(200,CONVERT(date,'20200901'),CONVERT(time(0),'16:00:00'),'IN'),
(200,CONVERT(date,'20200902'),CONVERT(time(0),'02:00:00'),'OUT'),
(200,CONVERT(date,'20200902'),CONVERT(time(0),'02:15:00'),'OUT'),
(100,CONVERT(date,'20200902'),CONVERT(time(0),'07:00:00'),'IN'),
(100,CONVERT(date,'20200902'),CONVERT(time(0),'16:00:00'),'OUT'),
(200,CONVERT(date,'20200902'),CONVERT(time(0),'09:55:00'),'IN'),
(200,CONVERT(date,'20200902'),CONVERT(time(0),'10:00:00'),'IN'),
(200,CONVERT(date,'20200902'),CONVERT(time(0),'21:00:00'),'OUT'))V(CardID,[Date],[Time],TransactionCode)),
Changes AS(
SELECT CardID,
DATEADD(MINUTE,DATEDIFF(MINUTE, '00:00:00',[time]),CONVERT(datetime2(0),[date])) AS Dt2, --Way easier to work with later
TransactionCode,
CASE TransactionCode WHEN LEAD(TransactionCode) OVER (PARTITION BY CardID ORDER BY [Date],[Time]) THEN 0 ELSE 1 END AS CodeChange
FROM VTE V),
Groups AS(
SELECT CardID,
dt2,
TransactionCode,
ISNULL(SUM(CodeChange) OVER (PARTITION BY CardID ORDER BY dt2 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),0) AS Grp
FROM Changes),
MinMax AS(
SELECT CardID,
TransactionCode,
CASE TransactionCode WHEN 'IN' THEN MIN(dt2) WHEN 'Out' THEN MAX(dt2) END AS GrpDt2
FROM Groups
GROUP BY CardID,
TransactionCode,
Grp),
--And now original Logic
CTE AS(
SELECT CardID,
GrpDt2 AS DatetimeIn,
LEAD([GrpDt2]) OVER (PARTITION BY CardID ORDER BY GrpDt2) AS DateTimeOut,
TransactionCode
FROM MinMax)
SELECT C.CardID,
CONVERT(date,DatetimeIn) AS DateIn,
CONVERT(time(0),DatetimeIn) AS TimeIn,
CONVERT(date,DatetimeOut) AS DateOut,
CONVERT(time(0),DatetimeOut) AS TimeOut,
DATEADD(MINUTE, DATEDIFF(MINUTE,DatetimeIn, DateTimeOut), CONVERT(time(0),'00:00:00')) AS Duration
FROM CTE C
WHERE TransactionCode = 'IN';

Find the start and end date of stock difference

Please suggest a good SQL query to find the start and end date of each stock difference.
Imagine I have data in a table like the one below.
Sample_table
transaction_date stock
2018-12-01 10
2018-12-02 10
2018-12-03 20
2018-12-04 20
2018-12-05 20
2018-12-06 20
2018-12-07 20
2018-12-08 10
2018-12-09 10
2018-12-10 30
Expected result should be
Start_date end_date stock
2018-12-01 2018-12-02 10
2018-12-03 2018-12-07 20
2018-12-08 2018-12-09 10
2018-12-10 null 30
This is a gaps-and-islands problem. You can use row_number() and group by for this.
select t.stock, min(transaction_date), max(transaction_date)
from (
select row_number() over (order by transaction_date) -
row_number() over (partition by stock order by transaction_date) grp,
transaction_date,
stock
from data
) t
group by t.grp, t.stock
In the following DBFIDDLE DEMO I also solve the null value of the last group, but the main idea of finding consecutive rows is built on the above query.
You may check this for an explanation of this solution.
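To see why the difference of the two row numbers identifies each run, here is a small sketch on the question's sample data using Python's built-in sqlite3 module (dates stored as ISO text, which is an assumption). Within a run of equal stock values both row numbers advance together, so their difference stays constant; when the stock changes, the difference changes too.

```python
import sqlite3

# Sample rows from the question: (transaction_date, stock)
SAMPLE_STOCK = [
    ("2018-12-01", 10), ("2018-12-02", 10),
    ("2018-12-03", 20), ("2018-12-04", 20), ("2018-12-05", 20),
    ("2018-12-06", 20), ("2018-12-07", 20),
    ("2018-12-08", 10), ("2018-12-09", 10),
    ("2018-12-10", 30),
]

QUERY = """
SELECT MIN(transaction_date) AS start_date,
       MAX(transaction_date) AS end_date,
       stock
FROM (
    SELECT transaction_date, stock,
           -- overall position minus position within the same stock value:
           -- constant inside one consecutive run
           ROW_NUMBER() OVER (ORDER BY transaction_date)
         - ROW_NUMBER() OVER (PARTITION BY stock ORDER BY transaction_date) AS grp
    FROM sample_table
) AS t
GROUP BY grp, stock
ORDER BY start_date
"""


def stock_runs(rows):
    """Return (start_date, end_date, stock) for each consecutive run of a stock value."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sample_table (transaction_date TEXT, stock INTEGER)")
    con.executemany("INSERT INTO sample_table VALUES (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for run in stock_runs(SAMPLE_STOCK):
        print(run)
```

Note that this plain version reports 2018-12-10 as both start and end of the last run; turning that end date into null, as in the expected result, needs the extra handling mentioned above.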
You can try below using row_number()
select stock,min(transaction_date) as start_date,
case when min(transaction_date)=max(transaction_date) then null else max(transaction_date) end as end_date
from
(
select *,row_number() over(order by transaction_date)-
row_number() over(partition by stock order by transaction_date) as rn
from t1
)A group by stock,rn
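If each stock value occurs in only one consecutive run, GROUP BY with MIN and MAX is enough (in the sample data, stock 10 occurs in two separate runs, which this query would merge into one row):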
SELECT
stock,
MIN(transaction_date) Start_date,
CASE WHEN COUNT(*)>1 THEN MAX(transaction_date) END end_date
FROM Sample_table
GROUP BY stock
ORDER BY stock
You can try with LEAD, LAG functions as below:
select currentStockDate as startDate,
LEAD(currentStockDate,1) over(order by currentStockDate) as EndDate, -- start date of the next run (null for the last run)
currentStock
from
(select *
from
(select
LAG(transaction_date,1) over(order by transaction_date) as prevStockDate,
transaction_date as CurrentstockDate,
LAG(stock,1) over(order by transaction_date) as prevStock,
stock as currentStock
from sample_table) as t
where (prevStock <> currentStock) or (prevStock is null)
) as t2

Retrieve records that are in the date ranges in PostgreSQL

For each customer, I am trying to retrieve the records that are within 45 days of the most recent submit_date.
customer submit_date salary
A 2019-12-31 10000
B 2019-01-01 12000
A 2017-11-02 11000
A 2019-03-03 3000
B 2019-03-04 5500
C 2019-01-05 6750
D 2019-02-06 12256
E 2019-01-07 11345
F 2019-01-08 12345
Window functions come to the rescue:
SELECT customer, submit_date, salary
FROM (SELECT customer, submit_date, salary,
max(submit_date) OVER (PARTITION BY customer) AS latest_date
FROM thetable) AS q
WHERE submit_date >= latest_date - 45;
I am inclined to try:
select t.*
from t
where t.submit_date >= (select max(t2.submit_date) - interval '45 day'
from t t2
);
I think this can very much take advantage of an index on (submit_date).
If you want this relative to each customer, use a correlation clause:
select t.*
from t
where t.submit_date >= (select max(t2.submit_date) - interval '45 day'
from t t2
where t2.customer = t.customer
);
This wants an index on (customer, submit_date).
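Here is a small sketch of the per-customer window-function version, run through Python's built-in sqlite3 module instead of PostgreSQL. Since SQLite has no date arithmetic operator, the 45-day offset uses its date() function with a modifier. The rows for customers A and B come from the question; the row ('B', '2019-02-20', 7000) is hypothetical and only added so that one customer has more than one row inside the window.

```python
import sqlite3

SAMPLE_ROWS = [
    ("A", "2019-12-31", 10000),
    ("B", "2019-01-01", 12000),
    ("A", "2017-11-02", 11000),
    ("A", "2019-03-03", 3000),
    ("B", "2019-03-04", 5500),
    ("B", "2019-02-20", 7000),  # hypothetical row, not in the question
]

QUERY = """
SELECT customer, submit_date, salary
FROM (
    SELECT customer, submit_date, salary,
           -- most recent submit_date of the same customer, repeated on every row
           MAX(submit_date) OVER (PARTITION BY customer) AS latest_date
    FROM thetable
) AS q
WHERE submit_date >= date(latest_date, ?)
ORDER BY customer, submit_date
"""


def recent_per_customer(rows, days=45):
    """Rows within `days` days of each customer's most recent submit_date."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE thetable (customer TEXT, submit_date TEXT, salary INTEGER)")
    con.executemany("INSERT INTO thetable VALUES (?, ?, ?)", rows)
    # ISO date strings compare correctly as text, e.g. '-45 days' before 2019-03-04 is 2019-01-18
    result = con.execute(QUERY, (f"-{days} days",)).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in recent_per_customer(SAMPLE_ROWS):
        print(row)
```

Customer A keeps only the 2019-12-31 row, while customer B keeps 2019-02-20 and 2019-03-04 and drops 2019-01-01, which is more than 45 days before B's latest date.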

Adding additional group by in running average

I have the following code, which works:
select format_date('%Y%m', date) as yyyymm,
(sum(sum(val)) over (order by min(date)) /
sum(count(*)) over (order by min(date))
) as running_avg
from t
group by yyyymm
order by yyyymm;
Returns
yyyymm Score
201712 25.57931742
201801 24.69794466
201802 24.23110781
201803 23.85651947
201804 23.66164799
201805 23.43029053
201806 23.17074628
201807 23.09766588
201808 23.08902284
I am now trying to add an additional group by column, department. The query runs; however, the results are inaccurate. Can anyone recognize what I am doing incorrectly?
select format_date('%Y%m', date) as yyyymm, department,
(sum(sum(val)) over (order by min(date)) /
sum(count(*)) over (order by min(date))
) as running_avg
from t
group by yyyymm, department
order by yyyymm;
Returns
yyyymm department Score
201712 HR 6.704365079
201712 F&B 8.550338502
201712 Marketing 8.550338502
201712 I.T. 9.857502908
201712 Security 9.551491994
201712 Contractors 9.411654456
201712 Executive Office 9.637075283
201712 Property Services 9.45905826
201712 Corporate 9.57458477
201712 Legal 9.700320268
You need to add department to the partition by:
select department, format_date('%Y%m', date) as yyyymm,
(sum(sum(val)) over (partition by department order by min(date)) /
sum(count(*)) over (partition by department order by min(date))
) as running_avg
from t
group by yyyymm, department
order by department, yyyymm;
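To check the effect of partition by department on made-up numbers, here is a sketch using Python's built-in sqlite3 module. It is not the BigQuery query itself: format_date is replaced by SQLite's strftime, the monthly aggregation is split into a CTE instead of nesting sum(sum(val)), and the sample rows are hypothetical. The running sums restart for each department because both window functions are partitioned by it.

```python
import sqlite3

# Hypothetical rows: (date, department, val)
SAMPLE_ROWS = [
    ("2017-12-05", "HR", 10),
    ("2017-12-20", "HR", 20),
    ("2018-01-10", "HR", 30),
    ("2017-12-07", "IT", 4),
    ("2018-01-03", "IT", 8),
    ("2018-01-25", "IT", 12),
]

QUERY = """
WITH monthly AS (
    -- one row per department and month
    SELECT department,
           strftime('%Y%m', date) AS yyyymm,
           SUM(val) AS month_sum,
           COUNT(*) AS month_count
    FROM t
    GROUP BY department, yyyymm
)
SELECT department, yyyymm,
       -- running total divided by running count, restarted per department
       SUM(month_sum)   OVER (PARTITION BY department ORDER BY yyyymm)
     / SUM(month_count) OVER (PARTITION BY department ORDER BY yyyymm) AS running_avg
FROM monthly
ORDER BY department, yyyymm
"""


def running_avg_by_department(rows):
    """Running average of val per department, accumulated month by month."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (date TEXT, department TEXT, val REAL)")
    con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in running_avg_by_department(SAMPLE_ROWS):
        print(row)
```

Without the PARTITION BY, the running sums would accumulate across all departments in date order, which is the kind of mixed-up result shown in the question.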