LAG within CASE giving false negative offset - sql

TL;DR: scroll down to TASK 2.
I am dealing with the following data set:
email,createdby,createdon
a#b.c,jsmith,2016-10-10
a#b.c,nsmythe,2016-09-09
a#b.c,vstark,2016-11-11
b#x.y,ajohnson,2015-02-03
b#x.y,elear,2015-01-01
...
and so on. Each email is guaranteed to have at least one duplicate in the data set.
Now, there are two tasks to resolve; I resolved one of them but am struggling with the other one. I will now present both tasks for completeness.
TASK 1 (resolved):
For each row, for each email, return an additional column with the name of the user that created the first record with this email.
Expected result for the above sample data set:
email,createdby,createdon,original_createdby
a#b.c,jsmith,2016-10-10,nsmythe
a#b.c,nsmythe,2016-09-09,nsmythe
a#b.c,vstark,2016-11-11,nsmythe
b#x.y,ajohnson,2015-02-03,elear
b#x.y,elear,2015-01-01,elear
Code to get the above:
;WITH q0 -- this is just a security measure in case there are unique emails in the data set
AS ( SELECT t.email
FROM t
GROUP BY t.email
HAVING COUNT(*) > 1) ,
q1
AS ( SELECT q0.email
, createdon
, createdby
, ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
FROM t
JOIN q0
ON t.email = q0.email)
SELECT q1.email
, q1.createdon
, q1.createdby
, LAG(q1.createdby, q1.rn - 1) OVER ( ORDER BY q1.email, q1.createdon ) original_createdby
FROM q1
ORDER BY q1.email
, q1.rn
Brief explanation: I partition data set by email, then I number rows in each partition ordered by creation date, finally I return [createdby] value from (rn-1)th record. Works exactly as expected.
Now, similar to the above, there is TASK 2:
TASK 2:
For each row, for each email, return name of the user that created the first duplicate. I.e. name of a user where rn=2.
Expected result:
email,createdby,createdon,first_dupl_createdby
a#b.c,jsmith,2016-10-10,jsmith
a#b.c,nsmythe,2016-09-09,jsmith
a#b.c,vstark,2016-11-11,jsmith
b#x.y,ajohnson,2015-02-03,ajohnson
b#x.y,elear,2015-01-01,ajohnson
I want to keep things performant so trying to employ LEAD-LAG functions:
WITH q0
AS ( SELECT t.email
FROM t
GROUP BY t.email
HAVING COUNT(*) > 1) ,
q1
AS ( SELECT q0.email
, createdon
, createdby
, ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
FROM t
JOIN q0
ON t.email = q0.email)
SELECT q1.email
, q1.createdon
, q1.createdby
, q1.rn
, CASE q1.rn
WHEN 1 THEN LEAD(q1.createdby, 1) OVER ( ORDER BY q1.email, q1.createdon )
ELSE LAG(q1.createdby, q1.rn - 2) OVER ( ORDER BY q1.email, q1.createdon )
END AS first_dupl_createdby
FROM q1
ORDER BY q1.email
, q1.rn
Explanation: for the first record in each partition, return [createdby] from the following record (i.e. from the record containing the first duplicate). For all other records in the same partition return [createdby] from (rn-2) records ago (i.e. for rn = 2 we're staying on the same record, for rn = 3 we're going 1 record back, for rn = 4 - 2 records back and so on).
An issue comes up on the
ELSE LAG(q1.createdby, q1.rn - 2)
operation. Apparently, against any logic, despite the existence of the preceding line (WHEN 1 THEN...), the ELSE block is also evaluated for rn = 1, resulting in a negative offset value passed to the LAG function:
Msg 8730, Level 16, State 2, Line 37
Offset parameter for Lag and Lead functions cannot be a negative value.
When I comment out that ELSE line, the whole thing works fine but obviously I am not getting any results in the first_dupl_createdby column for rn > 1.
QUESTION:
Is there any way of re-writing the above CASE statement (in TASK #2) so that it always returns the value from a record where rn = 2 within each partition but - and this is important bit - without doing a self-JOIN operation (I know I could prepare rows where rn = 2 in a separate sub-query but this would mean extra scans on the whole table and also running an unnecessary self-JOIN).

I think you can simply use the max window function as you are trying to get the value from rownumber = 2 for each partition.
SELECT q1.email
, q1.createdon
, q1.createdby
, q1.rn
, max(case when rn=2 then q1.createdby end) over(partition by q1.email) first_dup_created_by
FROM q1
ORDER BY q1.email, q1.rn
You can use a similar query to get the results for rownumber=1 for the 1st scenario as well.

You can get the information for each email using row_number() and conditional aggregation:
select email,
max(case when seqnum = 1 then createdby end) as createdby_first,
max(case when seqnum = 2 then createdby end) as createdby_second
from (select t.*,
row_number() over (partition by email order by createdon) as seqnum
from t
) t
group by email;
You can join this information back to the original data to get the information you want. I don't see how lag() naturally would be used to solve this problem.

/shrug
; WITH duplicate_email_addresses AS (
SELECT email
FROM t
GROUP
BY email
HAVING Count(*) > 1
)
, records_with_duplicate_email_addresses AS (
SELECT email
, createdon
, createdby
, Row_Number() OVER (PARTITION BY email ORDER BY createdon) AS sequencer
FROM t
WHERE EXISTS (
SELECT *
FROM duplicate_email_addresses
WHERE email = t.email
)
)
, second_duplicate_record AS ( -- Why do you need any more than this?
SELECT email
, createdon
, createdby
FROM records_with_duplicate_email_addresses
WHERE sequencer = 2
)
SELECT records_with_duplicate_email_addresses.email
, records_with_duplicate_email_addresses.createdon
, records_with_duplicate_email_addresses.createdby
, second_duplicate_record.createdby AS first_duplicate_createdby
FROM records_with_duplicate_email_addresses
INNER
JOIN second_duplicate_record
ON second_duplicate_record.email = records_with_duplicate_email_addresses.email
;

Related

Update failing while using dense_rank and row_number

Here is the sample data of the two tables , just put together for easy reference
I want the upper part of the table [Outbound].[dbo].[Encounter_Out_P] with column "277CA_FILENAME","277CA_FILENAME2","277CA_FILENAME3","277CA_FILENAME4" as NULLS which are sorted by File_Submitted_DT ascending order to be updated with "277FileId" values of the lower table [Outbound].[dbo].[Encounter_Out_277_P] which are sorted by EDIFECSProcessDate in ascending order. Thanks in advance
Here is my code
WITH
cte_2771 AS (
SELECT
"277CA_FILENAME",
File_Submitted_DT,
TRN02_PatientControlNumber
FROM (
SELECT
I."277CA_FILENAME",
I.File_Submitted_DT,
#cte_277.TRN02_PatientControlNumber
,dense_rank() OVER(PARTITION BY #cte_277.TRN02_PatientControlNumber ORDER BY ABS(DATEDIFF(MINUTE, i.File_Submitted_DT, #cte_277.EDIFECSProcessDate)) ASC) rw1
FROM
[Outbound].[dbo].[Encounter_Out_P] I
INNER JOIN #cte_277 ON I.EncounterID = #cte_277.TRN02_PatientControlNumber
--WHERE EncounterID = 'AP230120920712808806'
)t
WHERE
t.rw1 = 1
)
,
cte_2772 AS (
SELECT
"277FileId"
,1 + ((ROW_NUMBER() OVER(PARTITION BY TRN02_PatientControlNumber ORDER BY EDIFECSProcessDate,File_Submitted_DT ASC ) - 1) % 4)rw2
,TRN02_PatientControlNumber
FROM (
SELECT DISTINCT
p."277FileId",
p.EDIFECSProcessDate,
p.TRN02_PatientControlNumber
,p.ID
,cte_2771.File_Submitted_DT
FROM [Outbound].[dbo].[Encounter_Out_277_P] p
INNER JOIN cte_2771 ON cte_2771.File_Submitted_DT < p.EDIFECSProcessDate
WHERE
p.TRN02_PatientControlNumber = cte_2771.TRN02_PatientControlNumber
) t
)
UPDATE cte_2771
SET "277CA_FILENAME" =
COALESCE(cte_2771."277CA_FILENAME", cte_2772."277FileId" )
FROM cte_2771 INNER JOIN cte_2772
ON cte_2772.TRN02_PatientControlNumber = cte_2771.TRN02_PatientControlNumber
WHERE cte_2772.rw2 = 1
I want the output to be like below, (the upperpart) just put together for easy reference
Notes
I have posted the code for "277CA_FILENAME" only, since it is the same for the rest by changing the WHERE condition changes as "WHERE cte_2772.rw2 = 2,3,4"
if I Uncomment the --WHERE EncounterID = 'AP230120920712808806' in the cte_2771 , it is working perfectly, but if I comment it and run for the entire load, one row gets correct and the other one gets NULL

Combine multiple boolean columns into a single column

I am generation reports from an ERP system where users are provided with a check box which return a boolean value for each item selected. The database is hosted on SQL Server.
However, users can select Contracts with other values as well, as shown below.
I would like to capture the Categories as a single column and I don't mind having duplicate rows in the view. I would like the first row to return Contract and the second the other value selected, for the same Reference ID.
You can use apply :
select distinct t.*, tt.category
from t cross apply
( values ('Contracts', t.Contracts),
('Tender', t.Tender),
('Waiver', t.Waiver),
('Quotation', t.Quotation)
) tt(category, flag)
where flag = 1;
I guess a straightforward way is:
select *, 'Contract' as [Category] from [TableOne] where [Contract] = 1
union all select *, 'Tender' as [Category] from [TableOne] where [Tender] = 1
union all select *, 'Waiver' as [Category] from [TableOne] where [Waiver] = 1
union all select *, 'Quotation' as [Category] from [TableOne] where [Quotation] = 1
union all select *, '(none)' as [Category] from [TableOne] where [Contract]+[Tender]+[Waiver]+[Quotation] = 0
order by [Reference ID]
Note that the last line is put there just in case you need to handle the all-zero case.

Use of MAX function in SQL query to filter data

The code below joins two tables and I need to extract only the latest date per account, though it holds multiple accounts and history records. I wanted to use the MAX function, but not sure how to incorporate it for this case. I am using My SQL server.
Appreciate any help !
select
PROP.FileName,PROP.InsName, PROP.Status,
PROP.FileTime, PROP.SubmissionNo, PROP.PolNo,
PROP.EffDate,PROP.ExpDate, PROP.Region,
PROP.Underwriter, PROP_DATA.Data , PROP_DATA.Label
from
Property.dbo.PROP
inner join
Property.dbo.PROP_DATA on Property.dbo.PROP.FileID = Actuarial.dbo.PROP_DATA.FileID
where
(PROP_DATA.Label in ('Occupancy' , 'OccupancyTIV'))
and (PROP.EffDate >= '42278' and PROP.EffDate <= '42643')
and (PROP.Status = 'Bound')
and (Prop.FileTime = Max(Prop.FileTime))
order by
PROP.EffDate DESC
Assuming your DBMS supports windowing functions and the with clause, a max windowing function would work:
with all_data as (
select
PROP.FileName,PROP.InsName, PROP.Status,
PROP.FileTime, PROP.SubmissionNo, PROP.PolNo,
PROP.EffDate,PROP.ExpDate, PROP.Region,
PROP.Underwriter, PROP_DATA.Data , PROP_DATA.Label,
max (PROP.EffDate) over (partition by PROP.PolNo) as max_date
from Actuarial.dbo.PROP
inner join Actuarial.dbo.PROP_DATA
on Actuarial.dbo.PROP.FileID = Actuarial.dbo.PROP_DATA.FileID
where (PROP_DATA.Label in ('Occupancy' , 'OccupancyTIV'))
and (PROP.EffDate >= '42278' and PROP.EffDate <= '42643')
and (PROP.Status = 'Bound')
and (Prop.FileTime = Max(Prop.FileTime))
)
select
FileName, InsName, Status, FileTime, SubmissionNo,
PolNo, EffDate, ExpDate, Region, UnderWriter, Data, Label
from all_data
where EffDate = max_date
ORDER BY EffDate DESC
This also presupposes than any given account would not have two records on the same EffDate. If that's the case, and there is no other objective means to determine the latest account, you could also use row_numer to pick a somewhat arbitrary record in the case of a tie.
Using straight SQL, you can use a self-join in a subquery in your where clause to eliminate values smaller than the max, or smaller than the top n largest, and so on. Just set the number in <= 1 to the number of top values you want per group.
Something like the following might do the trick, for example:
select
p.FileName
, p.InsName
, p.Status
, p.FileTime
, p.SubmissionNo
, p.PolNo
, p.EffDate
, p.ExpDate
, p.Region
, p.Underwriter
, pd.Data
, pd.Label
from Actuarial.dbo.PROP p
inner join Actuarial.dbo.PROP_DATA pd
on p.FileID = pd.FileID
where (
select count(*)
from Actuarial.dbo.PROP p2
where p2.FileID = p.FileID
and p2.EffDate <= p.EffDate
) <= 1
and (
pd.Label in ('Occupancy' , 'OccupancyTIV')
and p.Status = 'Bound'
)
ORDER BY p.EffDate DESC
Have a look at this stackoverflow question for a full working example.
Not tested
with temp1 as
(
select foo
from bar
whre xy = MAX(xy)
)
select PROP.FileName,PROP.InsName, PROP.Status,
PROP.FileTime, PROP.SubmissionNo, PROP.PolNo,
PROP.EffDate,PROP.ExpDate, PROP.Region,
PROP.Underwriter, PROP_DATA.Data , PROP_DATA.Label
from Actuarial.dbo.PROP
inner join temp1 t
on Actuarial.dbo.PROP.FileID = t.dbo.PROP_DATA.FileID
ORDER BY PROP.EffDate DESC

Remove Duplicates while Merging values

How can I remove duplicates and merge Account Types?
I have a call log that reports duplicate phones based on Account Type.
For example:
Telephone | Account Type
304-555-6666 | R
304-555-6666 | C
I know how to remove duplicate Telephones using RANK\MAXCOUNT
But before removing duplicates I need to reset the Account Type to “B” is the duplicates have multiple account types.
In the example the surviving duplicate would be:
Telephone | Account Type
304-555-6666 | B
Warning, it is not guaranteed that duplicate phones have multiple Account Types.
Example:
Telephone | Account Type
999-888-6666 | R
999-888-6666 | R
Therefore the surviving duplicate should be:
Telephone | Account Type
999-888-6666 | R
How can I remove duplicates and reset the account type at the same time?
--
-- Remove Duplicate Recordings
--
SELECT * FROM (
SELECT i.dateofcall ,
i.recordingfile ,
i.telephone ,
s.accounttype ,
ROW_NUMBER() OVER (PARTITION BY i.telephone ORDER BY i.dateofcall DESC) AS 'RANK' ,
COUNT(i.telephone) OVER (PARTITION BY i.telephone) AS 'MAXCOUNT'
FROM #myactions i
LEFT JOIN #myphone s ON s.interactionID = i.Interactionid
) x
WHERE [RANK] = [MAXCOUNT]
SELECT * FROM (
SELECT i.dateofcall ,
i.recordingfile ,
i.telephone ,
s.accounttype ,
ROW_NUMBER() OVER (PARTITION BY i.telephone ORDER BY i.dateofcall DESC) AS 'RANK' ,
COUNT(i.telephone) OVER (PARTITION BY i.telephone) AS 'MAXCOUNT',
DENSE_RANK() OVER ( PARTITION BY i.telephone ORDER BY s.accounttype DESC ) AS 'ContPhone'
FROM #myactions i
LEFT JOIN #myphone s ON s.interactionID = i.Interactionid
) x
WHERE [RANK] = [MAXCOUNT]
Try this?
select
x.dateofcall
, x.recordingfile
, x.telephone
, case when count(*) > 2 then 'B' else max(x.accounttype) end accounttype
(
select
i.dateofcall
, i.recordingfile
, i.telephone
, s.accounttype
from
#myactions i
LEFT JOIN #myphone s ON s.interactionID = i.Interactionid
group by
i.dateofcall
, i.recordingfile
, i.telephone
, s.accounttype
) x
group by
x.dateofcall
, x.recordingfile
, x.telephone
Basically you need to put your business check in a case statement outside.
EDIT: I've also added the logic for B, R and C. Also done a sql fiddle- link to fiddle -http://sqlfiddle.com/#!6/b5ef5/7
SELECT
x.dateofcall,
x.recordingfile,
x.telephone,
COALESCE(
CASE WHEN x.maxcount>1 AND value>x.maxcount AND value<(2*x.maxcount) THEN 'B' ELSE NULL END,
CASE WHEN x.maxcount>1 AND value= (2*x.maxcount) THEN 'C' ELSE NULL END,
CASE WHEN x.maxcount>1 AND value= x.maxcount THEN 'R' ELSE NULL END,
x.accounttype ) as accounttype,
x.rank,
x.maxcount
FROM (
SELECT i.dateofcall ,
i.recordingfile ,
i.telephone ,
s.accounttype ,
ROW_NUMBER() OVER (PARTITION BY i.telephone ORDER BY i.dateofcall DESC) AS 'RANK' ,
COUNT(i.telephone) OVER (PARTITION BY i.telephone) AS 'MAXCOUNT',
SUM(CASE WHEN s.accounttype LIKE 'R' THEN 1 ELSE 2 END) OVER (PARTITION BY i.telephone) as Value
FROM
myactions i LEFT JOIN myphone s
ON s.interactionID = i.Interactionid
) x
WHERE [RANK] = [MAXCOUNT]

speed up SQL Query

I have a query which is taking some serious time to execute on anything older than the past, say, hours worth of data. This is going to create a view which will be used for datamining, so the expectations are that it would be able to search back weeks or months of data and return in a reasonable amount of time (even a couple minutes is fine... I ran for a date range of 10/3/2011 12:00pm to 10/3/2011 1:00pm and it took 44 minutes!)
The problem is with the two LEFT OUTER JOINs in the bottom. When I take those out, it can run in about 10 seconds. However, those are the bread and butter of this query.
This is all coming from one table. The ONLY thing this query returns differently than the original table is the column xweb_range. xweb_range is a calculated field column (range) which will only use the values from [LO,LC,RO,RC]_Avg where their corresponding [LO,LC,RO,RC]_Sensor_Alarm = 0 (do not include in range calculation if sensor alarm = 1)
WITH Alarm (sub_id,
LO_Avg, LO_Sensor_Alarm, LC_Avg, LC_Sensor_Alarm, RO_Avg, RO_Sensor_Alarm, RC_Avg, RC_Sensor_Alarm) AS (
SELECT sub_id, LO_Avg, LO_Sensor_Alarm, LC_Avg, LC_Sensor_Alarm, RO_Avg, RO_Sensor_Alarm, RC_Avg, RC_Sensor_Alarm
FROM dbo.some_table
where sub_id <> '0'
)
, AddRowNumbers AS (
SELECT rowNumber = ROW_NUMBER() OVER (ORDER BY LO_Avg)
, sub_id
, LO_Avg, LO_Sensor_Alarm
, LC_Avg, LC_Sensor_Alarm
, RO_Avg, RO_Sensor_Alarm
, RC_Avg, RC_Sensor_Alarm
FROM Alarm
)
, UnPivotColumns AS (
SELECT rowNumber, value = LO_Avg FROM AddRowNumbers WHERE LO_Sensor_Alarm = 0
UNION ALL SELECT rowNumber, LC_Avg FROM AddRowNumbers WHERE LC_Sensor_Alarm = 0
UNION ALL SELECT rowNumber, RO_Avg FROM AddRowNumbers WHERE RO_Sensor_Alarm = 0
UNION ALL SELECT rowNumber, RC_Avg FROM AddRowNumbers WHERE RC_Sensor_Alarm = 0
)
SELECT rowNumber.sub_id
, cds.equipment_id
, cds.read_time
, cds.LC_Avg
, cds.LC_Dev
, cds.LC_Ref_Gap
, cds.LC_Sensor_Alarm
, cds.LO_Avg
, cds.LO_Dev
, cds.LO_Ref_Gap
, cds.LO_Sensor_Alarm
, cds.RC_Avg
, cds.RC_Dev
, cds.RC_Ref_Gap
, cds.RC_Sensor_Alarm
, cds.RO_Avg
, cds.RO_Dev
, cds.RO_Ref_Gap
, cds.RO_Sensor_Alarm
, COALESCE(range1.range, range2.range) AS xweb_range
FROM AddRowNumbers rowNumber
LEFT OUTER JOIN (SELECT rowNumber, range = MAX(value) - MIN(value) FROM UnPivotColumns GROUP BY rowNumber HAVING COUNT(*) > 1) range1 ON range1.rowNumber = rowNumber.rowNumber
LEFT OUTER JOIN (SELECT rowNumber, range = AVG(value) FROM UnPivotColumns GROUP BY rowNumber HAVING COUNT(*) = 1) range2 ON range2.rowNumber = rowNumber.rowNumber
INNER JOIN dbo.some_table cds
ON rowNumber.sub_id = cds.sub_id
It's difficult to understand exactly what your query is trying to do without knowing the domain. However, it seems to me like your query is simply trying to find, for each row in dbo.some_table where sub_id is not 0, the range of the following columns in the record (or, if only one matches, that single value):
LO_AVG when LO_SENSOR_ALARM=0
LC_AVG when LC_SENSOR_ALARM=0
RO_AVG when RO_SENSOR_ALARM=0
RC_AVG when RC_SENSOR_ALARM=0
You constructed this query assigning each row a sequential row number, unpivoted the _AVG columns along with their row number, computed the range aggregate grouping by row number and then joining back to the original records by row number. CTEs don't materialize results (nor are they indexed, as discussed in the comments). So each reference to AddRowNumbers is expensive, because ROW_NUMBER() OVER (ORDER BY LO_Avg) is a sort.
Instead of cutting this table up just to join it back together by row number, why not do something like:
SELECT cds.sub_id
, cds.equipment_id
, cds.read_time
, cds.LC_Avg
, cds.LC_Dev
, cds.LC_Ref_Gap
, cds.LC_Sensor_Alarm
, cds.LO_Avg
, cds.LO_Dev
, cds.LO_Ref_Gap
, cds.LO_Sensor_Alarm
, cds.RC_Avg
, cds.RC_Dev
, cds.RC_Ref_Gap
, cds.RC_Sensor_Alarm
, cds.RO_Avg
, cds.RO_Dev
, cds.RO_Ref_Gap
, cds.RO_Sensor_Alarm
--if the COUNT is 0, xweb_range will be null (since MAX will be null), if it's 1, then use MAX, else use MAX - MIN (as per your example)
, (CASE WHEN stats.[Count] < 2 THEN stats.[MAX] ELSE stats.[MAX] - stats.[MIN] END) xweb_range
FROM dbo.some_table cds
--cross join on the following table derived from values in cds - it will always contain 1 record per row of cds
CROSS APPLY
(
SELECT COUNT(*), MIN(Value), MAX(Value)
FROM
(
--construct a table using the column values from cds we wish to aggregate
VALUES (LO_AVG, LO_SENSOR_ALARM),
(LC_AVG, LC_SENSOR_ALARM),
(RO_AVG, RO_SENSORALARM),
(RC_AVG, RC_SENSOR_ALARM)
) x (Value, Sensor_Alarm) --give a name to the columns for _AVG and _ALARM
WHERE Sensor_Alarm = 0 --filter our constructed table where _ALARM=0
) stats([Count], [Min], [Max]) --give our derived table and its columns some names
WHERE cds.sub_id <> '0' --this is a filter carried over from the first CTE in your example