Find most recent record in a subquery (SQL Server) - sql

I'm converting some code from Oracle to SQL Server (2012) and have run into an issue where this subquery is using a PARTITION/ORDER BY to retrieve the most recent record. The subquery runs fine on its own, but as it is a subquery, I'm getting the error:
SQL Server Database Error: The ORDER BY clause is invalid in views,
inline functions, derived tables, subqueries, and common table
expressions, unless TOP, OFFSET or FOR XML is also specified.
Here's the section of SQL:
FROM (
    SELECT DISTINCT enr.MemberNum,
           (ISNULL(enr.MemberFirstName, '') + ' ' + ISNULL(enr.MemberLastName, '')) AS MEMBER_NAME,
           enr.MemberBirthDate AS DOB,
           enr.MemberGender AS Gender,
           LAST_VALUE(enr.MemberCurrentAge) OVER (PARTITION BY MemberNum ORDER BY StaticDate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS AGE,
           LAST_VALUE(enr.EligStateAidCategory) OVER (PARTITION BY MemberNum ORDER BY StaticDate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS EligStateAidCategory,
           LAST_VALUE(enr.EligStateAidCategory) OVER (PARTITION BY MemberNum ORDER BY StaticDate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS AID_CAT_ROLL_UP,
           LAST_VALUE(enr.EligFinanceAidCategoryRollup) OVER (PARTITION BY MemberNum ORDER BY StaticDate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS EligFinanceAidCategoryRollup,
           SUM(enr.MemberMonth) OVER (PARTITION BY MemberNum) AS TOTAL_MEMBER_MONTHS
    FROM dv_Enrollment enr
    WHERE enr.StaticDate BETWEEN '01-JUN-2016' AND '30-JUN-2016'
) A
So, I've looked around and found that you can use the TOP (2147483647) hack, so I tried changing the first line to:
SELECT DISTINCT TOP (2147483647) enr.MemberNum,
But I'm still getting the same error. All the other ways I've thought of also require an ORDER BY (DENSE_RANK(), etc.).

In both databases, I would write this like:
FROM (SELECT enr.MemberNum,
             (ISNULL(enr.MemberFirstName, '') + ' ' + ISNULL(enr.MemberLastName, '')) AS MEMBER_NAME,
             enr.MemberBirthDate AS DOB,
             enr.MemberGender AS Gender,
             MAX(CASE WHEN seqnum = 1 THEN enr.MemberCurrentAge END) AS AGE,
             MAX(CASE WHEN seqnum = 1 THEN enr.EligStateAidCategory END) AS EligStateAidCategory,
             MAX(CASE WHEN seqnum = 1 THEN enr.EligStateAidCategory END) AS AID_CAT_ROLL_UP,
             MAX(CASE WHEN seqnum = 1 THEN enr.EligFinanceAidCategoryRollup END) AS EligFinanceAidCategoryRollup,
             SUM(enr.MemberMonth) AS TOTAL_MEMBER_MONTHS
      FROM (SELECT enr.*,
                   ROW_NUMBER() OVER (PARTITION BY MemberNum ORDER BY StaticDate DESC) AS seqnum
            FROM dv_Enrollment enr
            -- filter before ranking so seqnum = 1 is the latest row in June
            WHERE enr.StaticDate >= DATE '2016-06-01' AND -- DATE not needed in SQL Server
                  enr.StaticDate < DATE '2016-07-01'      -- DATE not needed in SQL Server
           ) enr
      GROUP BY enr.MemberNum, enr.MemberFirstName, enr.MemberLastName,
               enr.MemberBirthDate, enr.MemberGender
     ) A
Why the changes?
The date changes are just to be careful about time components on the date. BETWEEN with date/times is a bad habit, because it can produce incorrect results and hard-to-debug errors.
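To make the date hazard concrete, here is a minimal sketch (it assumes, hypothetically, that StaticDate is a DATETIME rather than a DATE):
SELECT CASE WHEN CAST('2016-06-30T14:00:00' AS DATETIME)
                 BETWEEN '2016-06-01' AND '2016-06-30'
            THEN 'matched' ELSE 'missed' END AS between_check,   -- 'missed': BETWEEN stops at midnight on June 30
       CASE WHEN CAST('2016-06-30T14:00:00' AS DATETIME) >= '2016-06-01'
             AND CAST('2016-06-30T14:00:00' AS DATETIME) <  '2016-07-01'
            THEN 'matched' ELSE 'missed' END AS half_open_check; -- 'matched': the half-open range covers the whole day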
I simply do not like using SELECT DISTINCT to mean GROUP BY. It is clever to use it with window functions (and necessary with LAST_VALUE()); but I think the code ends up being misleading.
I find that the subquery with seqnum makes it clear that the four "last value" variables are all pulling data from the last row.
In addition, if the sort is not stable (that is, the key is not unique), seqnum guarantees that the values are all from the same row; LAST_VALUE() does not.
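Here is a small sketch of the tie hazard on hypothetical toy data (the duplicated StaticDate is contrived):
-- Two rows tie on the sort key, but both aggregates still read the same row:
WITH enr(MemberNum, StaticDate, MemberCurrentAge, EligStateAidCategory) AS (
    SELECT 1, '2016-06-30', 41, 'A' UNION ALL
    SELECT 1, '2016-06-30', 42, 'B'   -- tie on StaticDate
)
SELECT MemberNum,
       MAX(CASE WHEN seqnum = 1 THEN MemberCurrentAge END) AS AGE,
       MAX(CASE WHEN seqnum = 1 THEN EligStateAidCategory END) AS AID_CAT_ROLL_UP
FROM (SELECT enr.*,
             ROW_NUMBER() OVER (PARTITION BY MemberNum ORDER BY StaticDate DESC) AS seqnum
      FROM enr
     ) enr
GROUP BY MemberNum;
-- AGE and AID_CAT_ROLL_UP both come from whichever single row won seqnum = 1;
-- two separate LAST_VALUE() calls over the same tied sort could each read a different row.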

Switch this over to an aggregate subquery and cross apply() and see what happens.
select
e.MemberNum
, e.MemberName
, e.DOB
, e.Gender
, x.MemberCurrentAge
, x.EligStateAidCategory
, x.EligFinanceAidCategoryRollup
, x.MemberMonth
, e.Total_Member_Months
from (
select
enr.MemberNum
, MemberName = isnull(enr.MemberFirstName+' ', '') + isnull(enr.MemberLastName, '')
, DOB = enr.MemberBirthDate
, Gender = enr.MemberGender
/* This sounds like a weird thing to sum */
, Total_Member_Months = sum(enr.MemberMonth)
from dv_Enrollment enr
group by
enr.MemberNum
, isnull(enr.MemberFirstName+' ', '') + isnull(enr.MemberLastName, '')
, enr.MemberBirthDate
, enr.MemberGender
) as e
/* cross apply() is like an inner join
, use outer apply() for something like a left join */
cross apply (
select top 1
i.MemberCurrentAge
, i.EligStateAidCategory
, i.EligFinanceAidCategoryRollup
, i.MemberMonth
from dv_Enrollment as i
where i.MemberNum = e.MemberNum
and i.StaticDate >= '20160601'
and i.StaticDate <= '20160630'
order by i.StaticDate desc -- descending for most recent
) as x
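As a side note to the comment above, here is a tiny hypothetical demo of the cross apply/outer apply difference (toy tables p and c):
with p(id) as (select 1 union all select 2),
     c(pid, v) as (select 1, 'x')
select p.id, x.v
from p
outer apply (select top 1 c.v from c where c.pid = p.id) as x;
-- returns (1, 'x') and (2, NULL); with cross apply, row 2 would drop out entirely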

Related

How to distribute ranks when prior rank is zero (part 2)

This is an extension to my prior question How to distribute values when prior rank is zero. The solution worked great in the Postgres environment, but now I need to replicate it in a Databricks environment (Spark SQL).
The question is the same as before, but now I'm trying to determine how to convert this Postgres query to Spark SQL. Basically, it's summing up allocation amounts if there are gaps in the data (i.e., no micro_geos when grouping by location and geo3). The "imputed allocation" will equal 1 for all location and geo3 groups.
This is the postgres query, which works great:
select location_code, geo3, distance_group, has_micro_geo, imputed_allocation from
(
select ia.*,
(case when has_micro_geo > 0
then sum(allocation) over (partition by location_code, geo3, grp)
else 0
end) as imputed_allocation
from (select s.*,
count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
) ia
)z
But it doesn't translate well and produces this error in Databricks:
Error in SQL statement: ParseException:
mismatched input 'from' expecting <EOF>(line 1, pos 78)
== SQL ==
select location_code, geo3, distance_group, has_micro_geo, imputed_allocation from
------------------------------------------------------------------------------^^^
(
select ia.*,
(case when has_micro_geo > 0
then sum(allocation) over (partition by location_code, geo3, grp)
else 0
end) as imputed_allocation
from (select s.*,
count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
) ia
)z
Or, at a minimum, how would I convert just the part of the inner query that creates grp? Then perhaps the rest will work. I have been trying to replace the filter-where logic with something else, but my attempts have not worked as desired.
select s.*,
count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
Here's a db-fiddle with data https://www.db-fiddle.com/f/wisvDZJL9BkWxNFkfLXdEu/0 which is currently set to Postgres, but again, I need to run this in a Spark SQL environment. I've tried breaking this down and creating different tables, but my groups are not working as desired.
You need to rewrite this subquery:
select s.*,
count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
Although the filter() clause to window and aggregate functions is standard SQL, few databases support it so far. Instead, consider a conditional window sum(), which produces the same result:
select s.*,
sum(case when has_micro_geo <> 0 then 1 else 0 end) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
I think that the rest of the query should run fine in Spark SQL.
As has_micro_geo is already a 0/1 flag, you can rewrite the count(filter) to
sum(has_micro_geo)
over (partition by location_code, geo3
order by distance_group desc
rows unbounded preceding) as grp
Adding rows unbounded preceding avoids the default range unbounded preceding, which might be less performant.
Btw, I wrote that already in my comment to Gordon's solution to your prior question :-)
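Putting the pieces together, here is a sketch of the full query as it might run in Spark SQL (untested; it assumes the staging_groups columns from the db-fiddle):
select location_code, geo3, distance_group, has_micro_geo, imputed_allocation
from (
    select ia.*,
           case when has_micro_geo > 0
                then sum(allocation) over (partition by location_code, geo3, grp)
                else 0
           end as imputed_allocation
    from (select s.*,
                 -- conditional cumulative sum replaces count(*) filter (where ...)
                 sum(has_micro_geo) over (partition by location_code, geo3
                                          order by distance_group desc
                                          rows unbounded preceding) as grp
          from staging_groups s
         ) ia
) z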

Can Db2 LAG function refer to itself?

I'm trying to derive a GROUP ID by replicating this Excel formula:
IF(OR(A2<>A1,AND(B2<>"000",B1="000")),D1+1,D1)
This formula is written with the cursor in "D2", meaning it refers to the newly added column's value in the previous row to generate the current value.
I'd like to do this with Db2 SQL, but I'm not sure how, because I would need to apply the LAG function to the very column I'm adding and refer to its values.
Kindly advise if there is a better way to do this.
Thanks.
You need nested OLAP functions, assuming ORDER BY SERIAL_NUMBER, EVENT_TIMESTAMP returns the order shown in Excel:
with cte as
(
select ...
case --IF(OR(A2<>A1,AND(B2<>"000",B1="000"))
when (lag(OPERATION)
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP) = '000'
and OPERATION <> '000')
or lag(SERIAL_NUMBER,1,'')
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP) <> SERIAL_NUMBER
then 1
else 0
end as flag -- start of new group
from tab
)
select ...
sum(flag)
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP
rows unbounded preceding) as GROUP_ID
from cte
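To see the two steps on toy data (hypothetical serial numbers, timestamps and operations):
with tab(SERIAL_NUMBER, EVENT_TIMESTAMP, OPERATION) as (
    values ('A', 1, '000'), ('A', 2, '010'), ('A', 3, '020'),
           ('B', 4, '000'), ('B', 5, '000'), ('B', 6, '030')
),
cte as (
    select tab.*,
           case when (lag(OPERATION) over (order by SERIAL_NUMBER, EVENT_TIMESTAMP) = '000'
                      and OPERATION <> '000')
                  or lag(SERIAL_NUMBER,1,'') over (order by SERIAL_NUMBER, EVENT_TIMESTAMP)
                     <> SERIAL_NUMBER
                then 1 else 0
           end as flag
    from tab
)
select cte.*,
       sum(flag) over (order by SERIAL_NUMBER, EVENT_TIMESTAMP
                       rows unbounded preceding) as GROUP_ID
from cte;
-- GROUP_ID comes out 1, 2, 2, 3, 3, 4: a new group starts on each serial change
-- and on each 000 -> non-000 transition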
Your code is counting the number of "breaks" in your data, where a "break" is the serial number changing or the operation switching from '000' to something else.
In SQL, you can do this as a cumulative sum:
select t.*,
       sum(case when prev_serial_number <> serial_number
                  or (operation <> '000' and prev_operation = '000')
                then 1 else 0
           end) over (order by event_timestamp rows between unbounded preceding and current row) as column_d
from (select t.*,
             lag(serial_number) over (order by event_timestamp) as prev_serial_number,
             lag(operation) over (order by event_timestamp) as prev_operation
      from t
     ) t

How to join multiple sub queries as a single one without 'with'?

I have a query consisting of multiple subqueries. I used JOIN, as I'm not allowed to use WITH. The subqueries have FROM clauses, which is creating an issue.
I have to display two columns, each computed with its own logic. To produce the two columns I need subqueries, which require a FROM clause, and I'm not sure how to write the FROM clause so the whole query fits together and runs. I have checked the individual queries and they all work fine.
select lead(dt) over
(partition by t1.id_user order by f.topup_date desc rows between 0
preceding and unbounded following )
from
(select *,
(max(case when f.topup_value >= 20 then f.topup_date end) over (partition
by f.id_user order by f.topup_date desc rows between 0 preceding and
unbounded following )) as dt
from topups f) as f, //(<-I think this is incorrect)
CAST(f.topup_value as float)/CAST(t1.topup_value as float) from
(SELECT t1.seq,t1.id_user,t1.topup_value,row_number()
over (partition by t1.id_user order by t1.topup_date )
as rowrank from topups t1) as t1
inner join topups f on f.id_user=t1.id_user
inner join topups t2 on t1.seq=t2.seq
You're getting a syntax error because a query can only have a single FROM clause. It's difficult to tell the outcome you're trying to achieve, but turning the first query into a non-correlated subquery and using it for column f might be what you're looking for:
select
(select lead(dt) over (partition by t1.id_user order by f.topup_date desc rows between 0 preceding and unbounded following )
from (
select *,
(max(case when f.topup_value >= 20 then f.topup_date end) over (partition by f.id_user order by f.topup_date desc rows between 0 preceding and unbounded following )) as dt
from topups f
) x) as f,
CAST(f.topup_value as float)/CAST(t1.topup_value as float)
from (
SELECT t1.seq, t1.id_user, t1.topup_value, row_number() over (partition by t1.id_user order by t1.topup_date ) as rowrank
from topups t1
) as t1
inner join topups f on f.id_user=t1.id_user
inner join topups t2 on t1.seq=t2.seq
Really hard to read that query. What you marked as possible incorrectness is wrong because you're trying to add what looks like another SELECT after your original FROM clause. That's incorrect syntax. Think of your FROM subquery as a temporary table. You couldn't say something like:
SELECT some_column
FROM a_table, some_other_column
That's cross-join syntax. some_other_column would need to be a table for that to even be valid.
Consider adding a CREATE TABLE and sample data so we can test.
You might be looking for something along the lines of this:
SELECT LEAD(temp.dt) OVER(PARTITION BY temp.id_user ORDER BY temp.topup_date DESC ROWS BETWEEN 0 PRECEDING AND UNBOUNDED FOLLOWING)
, temp.division
FROM
(
SELECT (max(CASE WHEN f.topup_value >= 20 THEN f.topup_date END) OVER(PARTITION BY f.id_user ORDER BY f.topup_date DESC ROWS BETWEEN 0 PRECEDING AND UNBOUNDED FOLLOWING )) AS dt
, f.topup_value::float / t1.topup_value::float AS division
, t1.id_user
, f.topup_date
FROM topups t1
JOIN topups f USING (id_user)
) temp
;
Just an opinion, but it's less noisy to use the :: operator to cast values. Instead of CAST(f.topup_value as float), just use f.topup_value::float.

Oracle LEAD - return next matching column value

I have the below data in one table.
I want to get the next OUT value from the OUT column, so I used the LEAD function in the query below:
SELECT "ROW_NUMBER", "TIMESTAMP", "IN", "OUT",
       LEAD("OUT") OVER (PARTITION BY NULL ORDER BY "TIMESTAMP") AS NEXT_OUT
FROM MYTABLE;
It gives the data shown in the NEXT_OUT column below.
But I need the next matching column value in a sequential way, like the DESIRED column. Please let me know how I can achieve this with the Oracle LEAD function.
Thanks.
Assign row numbers to all the INs and OUTs separately, sort the results by combining them into a single ordering, and calculate LEADs:
WITH cte AS (
SELECT t.*
, CASE WHEN "IN" IS NOT NULL THEN COUNT("IN") OVER (ORDER BY "TIMESTAMP") END AS rn1
, CASE WHEN "OUT" IS NOT NULL THEN COUNT("OUT") OVER (ORDER BY "TIMESTAMP") END AS rn2
FROM t
)
SELECT cte.*
, LEAD("OUT") OVER (ORDER BY COALESCE(rn1, rn2), rn1 NULLS LAST) AS NEXT_OUT
FROM cte
ORDER BY COALESCE(rn1, rn2), rn1 NULLS LAST
Demo on db<>fiddle
Enumerate the "in"s and the "out"s and use that information for matching.
select tin.*, tout."OUT" as next_out
from (select t.*,
             count("IN") over (order by "TIMESTAMP") as seqnum_in
      from t
     ) tin left join
     (select t.*,
             count("OUT") over (order by "TIMESTAMP") as seqnum_out
      from t
     ) tout
     on tin."IN" is not null and
        tout."OUT" is not null and
        tin.seqnum_in = tout.seqnum_out;
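On hypothetical toy data (the table images from the original post are not reproduced here), the enumerate-and-match idea plays out like this:
with t (ts, "IN", "OUT") as (
    select 1, 'in1', null   from dual union all
    select 2, null,  'out1' from dual union all
    select 3, 'in2', null   from dual union all
    select 4, null,  'out2' from dual
)
select tin.*, tout."OUT" as next_out
from (select t.*, count("IN")  over (order by ts) as seqnum_in  from t) tin
left join
     (select t.*, count("OUT") over (order by ts) as seqnum_out from t) tout
  on tin."IN" is not null
 and tout."OUT" is not null
 and tin.seqnum_in = tout.seqnum_out;
-- each 'in' row picks up the next 'out' in timestamp order; rows without a match keep NULL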

SQL - Window function to get values from previous row where value is not null

I am using Exasol. In other DBMSs it is possible to use analytic functions such as LAST_VALUE() and specify a frame for the ORDER BY clause within the OVER() function, like:
select ...
LAST_VALUE(customer)
OVER (PARTITION BY ID ORDER BY date_x DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) as the_last
Unfortunately I get the following error:
ERROR: [0A000] Feature not supported: windowing clause (Session:
1606983630649130920)
The same does not happen if, instead of AND 1 PRECEDING, I use CURRENT ROW.
Basically, what I want is to get the last value according to the ORDER BY that is NOT the current row. In this example it would be the customer of the previous row.
I know that I could use LAG(customer, 1) OVER (...), but the problem is that I want the previous customer that is NOT null, so the offset is not always 1...
How can I do that?
Many thanks!
Does this work?
select lag(customer) over (partition by id
order by (case when customer is not null then 1 else 0 end),
date
)
You can do this with two steps:
select t.*,
max(customer) over (partition by id, max_date) as max_customer
from (select t.*,
max(case when customer is not null then date end) over (partition by id order by date) as max_date
from t
) t;
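A toy run of the two-step approach (hypothetical table and data; the date column is named dte here to dodge reserved words, and the syntax may need small tweaks for Exasol):
with t (id, dte, customer) as (
    select 1, 1, 'A' union all
    select 1, 2, null union all
    select 1, 3, null union all
    select 1, 4, 'B'
)
select t.*,
       max(customer) over (partition by id, max_date) as max_customer
from (select t.*,
             max(case when customer is not null then dte end)
                 over (partition by id order by dte) as max_date
      from t
     ) t;
-- max_customer comes out A, A, A, B: each row sees the most recent non-NULL
-- customer up to and including itself; fill forward lag(customer) instead if
-- the current row must be excluded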