speed up SQL Query - sql

I have a query which is taking some serious time to execute on anything older than the past, say, hours worth of data. This is going to create a view which will be used for datamining, so the expectations are that it would be able to search back weeks or months of data and return in a reasonable amount of time (even a couple minutes is fine... I ran for a date range of 10/3/2011 12:00pm to 10/3/2011 1:00pm and it took 44 minutes!)
The problem is with the two LEFT OUTER JOINs in the bottom. When I take those out, it can run in about 10 seconds. However, those are the bread and butter of this query.
This is all coming from one table. The ONLY thing this query returns differently than the original table is the column xweb_range. xweb_range is a calculated field column (range) which will only use the values from [LO,LC,RO,RC]_Avg where their corresponding [LO,LC,RO,RC]_Sensor_Alarm = 0 (do not include in range calculation if sensor alarm = 1)
WITH Alarm (sub_id,
LO_Avg, LO_Sensor_Alarm, LC_Avg, LC_Sensor_Alarm, RO_Avg, RO_Sensor_Alarm, RC_Avg, RC_Sensor_Alarm) AS (
SELECT sub_id, LO_Avg, LO_Sensor_Alarm, LC_Avg, LC_Sensor_Alarm, RO_Avg, RO_Sensor_Alarm, RC_Avg, RC_Sensor_Alarm
FROM dbo.some_table
where sub_id <> '0'
)
, AddRowNumbers AS (
SELECT rowNumber = ROW_NUMBER() OVER (ORDER BY LO_Avg)
, sub_id
, LO_Avg, LO_Sensor_Alarm
, LC_Avg, LC_Sensor_Alarm
, RO_Avg, RO_Sensor_Alarm
, RC_Avg, RC_Sensor_Alarm
FROM Alarm
)
, UnPivotColumns AS (
SELECT rowNumber, value = LO_Avg FROM AddRowNumbers WHERE LO_Sensor_Alarm = 0
UNION ALL SELECT rowNumber, LC_Avg FROM AddRowNumbers WHERE LC_Sensor_Alarm = 0
UNION ALL SELECT rowNumber, RO_Avg FROM AddRowNumbers WHERE RO_Sensor_Alarm = 0
UNION ALL SELECT rowNumber, RC_Avg FROM AddRowNumbers WHERE RC_Sensor_Alarm = 0
)
SELECT rowNumber.sub_id
, cds.equipment_id
, cds.read_time
, cds.LC_Avg
, cds.LC_Dev
, cds.LC_Ref_Gap
, cds.LC_Sensor_Alarm
, cds.LO_Avg
, cds.LO_Dev
, cds.LO_Ref_Gap
, cds.LO_Sensor_Alarm
, cds.RC_Avg
, cds.RC_Dev
, cds.RC_Ref_Gap
, cds.RC_Sensor_Alarm
, cds.RO_Avg
, cds.RO_Dev
, cds.RO_Ref_Gap
, cds.RO_Sensor_Alarm
, COALESCE(range1.range, range2.range) AS xweb_range
FROM AddRowNumbers rowNumber
LEFT OUTER JOIN (SELECT rowNumber, range = MAX(value) - MIN(value) FROM UnPivotColumns GROUP BY rowNumber HAVING COUNT(*) > 1) range1 ON range1.rowNumber = rowNumber.rowNumber
LEFT OUTER JOIN (SELECT rowNumber, range = AVG(value) FROM UnPivotColumns GROUP BY rowNumber HAVING COUNT(*) = 1) range2 ON range2.rowNumber = rowNumber.rowNumber
INNER JOIN dbo.some_table cds
ON rowNumber.sub_id = cds.sub_id

It's difficult to understand exactly what your query is trying to do without knowing the domain. However, it seems to me like your query is simply trying to find, for each row in dbo.some_table where sub_id is not 0, the range of the following columns in the record (or, if only one matches, that single value):
LO_AVG when LO_SENSOR_ALARM=0
LC_AVG when LC_SENSOR_ALARM=0
RO_AVG when RO_SENSOR_ALARM=0
RC_AVG when RC_SENSOR_ALARM=0
You constructed this query assigning each row a sequential row number, unpivoted the _AVG columns along with their row number, computed the range aggregate grouping by row number and then joining back to the original records by row number. CTEs don't materialize results (nor are they indexed, as discussed in the comments). So each reference to AddRowNumbers is expensive, because ROW_NUMBER() OVER (ORDER BY LO_Avg) is a sort.
Instead of cutting this table up just to join it back together by row number, why not do something like:
SELECT cds.sub_id
, cds.equipment_id
, cds.read_time
, cds.LC_Avg
, cds.LC_Dev
, cds.LC_Ref_Gap
, cds.LC_Sensor_Alarm
, cds.LO_Avg
, cds.LO_Dev
, cds.LO_Ref_Gap
, cds.LO_Sensor_Alarm
, cds.RC_Avg
, cds.RC_Dev
, cds.RC_Ref_Gap
, cds.RC_Sensor_Alarm
, cds.RO_Avg
, cds.RO_Dev
, cds.RO_Ref_Gap
, cds.RO_Sensor_Alarm
--if the COUNT is 0, xweb_range will be null (since MAX will be null), if it's 1, then use MAX, else use MAX - MIN (as per your example)
, (CASE WHEN stats.[Count] < 2 THEN stats.[MAX] ELSE stats.[MAX] - stats.[MIN] END) xweb_range
FROM dbo.some_table cds
--cross join on the following table derived from values in cds - it will always contain 1 record per row of cds
CROSS APPLY
(
SELECT COUNT(*), MIN(Value), MAX(Value)
FROM
(
--construct a table using the column values from cds we wish to aggregate
VALUES (LO_AVG, LO_SENSOR_ALARM),
(LC_AVG, LC_SENSOR_ALARM),
(RO_AVG, RO_SENSORALARM),
(RC_AVG, RC_SENSOR_ALARM)
) x (Value, Sensor_Alarm) --give a name to the columns for _AVG and _ALARM
WHERE Sensor_Alarm = 0 --filter our constructed table where _ALARM=0
) stats([Count], [Min], [Max]) --give our derived table and its columns some names
WHERE cds.sub_id <> '0' --this is a filter carried over from the first CTE in your example

Related

Update failing while using dense_rank and row_number

Here is the sample data of the two tables , just put together for easy reference
I want the upper part of the table [Outbound].[dbo].[Encounter_Out_P] with column "277CA_FILENAME","277CA_FILENAME2","277CA_FILENAME3","277CA_FILENAME4" as NULLS which are sorted by File_Submitted_DT ascending order to be updated with "277FileId" values of the lower table [Outbound].[dbo].[Encounter_Out_277_P] which are sorted by EDIFECSProcessDate in ascending order. Thanks in advance
Here is my code
WITH
cte_2771 AS (
SELECT
"277CA_FILENAME",
File_Submitted_DT,
TRN02_PatientControlNumber
FROM (
SELECT
I."277CA_FILENAME",
I.File_Submitted_DT,
#cte_277.TRN02_PatientControlNumber
,dense_rank() OVER(PARTITION BY #cte_277.TRN02_PatientControlNumber ORDER BY ABS(DATEDIFF(MINUTE, i.File_Submitted_DT, #cte_277.EDIFECSProcessDate)) ASC) rw1
FROM
[Outbound].[dbo].[Encounter_Out_P] I
INNER JOIN #cte_277 ON I.EncounterID = #cte_277.TRN02_PatientControlNumber
--WHERE EncounterID = 'AP230120920712808806'
)t
WHERE
t.rw1 = 1
)
,
cte_2772 AS (
SELECT
"277FileId"
,1 + ((ROW_NUMBER() OVER(PARTITION BY TRN02_PatientControlNumber ORDER BY EDIFECSProcessDate,File_Submitted_DT ASC ) - 1) % 4)rw2
,TRN02_PatientControlNumber
FROM (
SELECT DISTINCT
p."277FileId",
p.EDIFECSProcessDate,
p.TRN02_PatientControlNumber
,p.ID
,cte_2771.File_Submitted_DT
FROM [Outbound].[dbo].[Encounter_Out_277_P] p
INNER JOIN cte_2771 ON cte_2771.File_Submitted_DT < p.EDIFECSProcessDate
WHERE
p.TRN02_PatientControlNumber = cte_2771.TRN02_PatientControlNumber
) t
)
UPDATE cte_2771
SET "277CA_FILENAME" =
COALESCE(cte_2771."277CA_FILENAME", cte_2772."277FileId" )
FROM cte_2771 INNER JOIN cte_2772
ON cte_2772.TRN02_PatientControlNumber = cte_2771.TRN02_PatientControlNumber
WHERE cte_2772.rw2 = 1
I want the output to be like below, (the upperpart) just put together for easy reference
Notes
I have posted the code for "277CA_FILENAME" only, since it is the same for the rest by changing the WHERE condition changes as "WHERE cte_2772.rw2 = 2,3,4"
if I Uncomment the --WHERE EncounterID = 'AP230120920712808806' in the cte_2771 , it is working perfectly, but if I comment it and run for the entire load, one row gets correct and the other one gets NULL

LAG within CASE giving false negative offset

TL;DR: scroll down to TASK 2.
I am dealing with the following data set:
email,createdby,createdon
a#b.c,jsmith,2016-10-10
a#b.c,nsmythe,2016-09-09
a#b.c,vstark,2016-11-11
b#x.y,ajohnson,2015-02-03
b#x.y,elear,2015-01-01
...
and so on. Each email is guaranteed to have at least one duplicate in the data set.
Now, there are two tasks to resolve; I resolved one of them but am struggling with the other one. I will now present both tasks for completeness.
TASK 1 (resolved):
For each row, for each email, return an additional column with the name of the user that created the first record with this email.
Expected result for the above sample data set:
email,createdby,createdon,original_createdby
a#b.c,jsmith,2016-10-10,nsmythe
a#b.c,nsmythe,2016-09-09,nsmythe
a#b.c,vstark,2016-11-11,nsmythe
b#x.y,ajohnson,2015-02-03,elear
b#x.y,elear,2015-01-01,elear
Code to get the above:
;WITH q0 -- this is just a security measure in case there are unique emails in the data set
AS ( SELECT t.email
FROM t
GROUP BY t.email
HAVING COUNT(*) > 1) ,
q1
AS ( SELECT q0.email
, createdon
, createdby
, ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
FROM t
JOIN q0
ON t.email = q0.email)
SELECT q1.email
, q1.createdon
, q1.createdby
, LAG(q1.createdby, q1.rn - 1) OVER ( ORDER BY q1.email, q1.createdon ) original_createdby
FROM q1
ORDER BY q1.email
, q1.rn
Brief explanation: I partition data set by email, then I number rows in each partition ordered by creation date, finally I return [createdby] value from (rn-1)th record. Works exactly as expected.
Now, similar to the above, there is TASK 2:
TASK 2:
For each row, for each email, return name of the user that created the first duplicate. I.e. name of a user where rn=2.
Expected result:
email,createdby,createdon,first_dupl_createdby
a#b.c,jsmith,2016-10-10,jsmith
a#b.c,nsmythe,2016-09-09,jsmith
a#b.c,vstark,2016-11-11,jsmith
b#x.y,ajohnson,2015-02-03,ajohnson
b#x.y,elear,2015-01-01,ajohnson
I want to keep things performant so trying to employ LEAD-LAG functions:
WITH q0
AS ( SELECT t.email
FROM t
GROUP BY t.email
HAVING COUNT(*) > 1) ,
q1
AS ( SELECT q0.email
, createdon
, createdby
, ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
FROM t
JOIN q0
ON t.email = q0.email)
SELECT q1.email
, q1.createdon
, q1.createdby
, q1.rn
, CASE q1.rn
WHEN 1 THEN LEAD(q1.createdby, 1) OVER ( ORDER BY q1.email, q1.createdon )
ELSE LAG(q1.createdby, q1.rn - 2) OVER ( ORDER BY q1.email, q1.createdon )
END AS first_dupl_createdby
FROM q1
ORDER BY q1.email
, q1.rn
Explanation: for the first record in each partition, return [createdby] from the following record (i.e. from the record containing the first duplicate). For all other records in the same partition return [createdby] from (rn-2) records ago (i.e. for rn = 2 we're staying on the same record, for rn = 3 we're going 1 record back, for rn = 4 - 2 records back and so on).
An issue comes up on the
ELSE LAG(q1.createdby, q1.rn - 2)
operation. Apparently, against any logic, despite the existence of the preceding line (WHEN 1 THEN...), the ELSE block is also evaluated for rn = 1, resulting in a negative offset value passed to the LAG function:
Msg 8730, Level 16, State 2, Line 37
Offset parameter for Lag and Lead functions cannot be a negative value.
When I comment out that ELSE line, the whole thing works fine but obviously I am not getting any results in the first_dupl_createdby column for rn > 1.
QUESTION:
Is there any way of re-writing the above CASE statement (in TASK #2) so that it always returns the value from a record where rn = 2 within each partition but - and this is important bit - without doing a self-JOIN operation (I know I could prepare rows where rn = 2 in a separate sub-query but this would mean extra scans on the whole table and also running an unnecessary self-JOIN).
I think you can simply use the max window function as you are trying to get the value from rownumber = 2 for each partition.
SELECT q1.email
, q1.createdon
, q1.createdby
, q1.rn
, max(case when rn=2 then q1.createdby end) over(partition by q1.email) first_dup_created_by
FROM q1
ORDER BY q1.email, q1.rn
You can use a similar query to get the results for rownumber=1 for the 1st scenario as well.
You can get the information for each email using row_number() and conditional aggregation:
select email,
max(case when seqnum = 1 then createdby end) as createdby_first,
max(case when seqnum = 2 then createdby end) as createdby_second
from (select t.*,
row_number() over (partition by email order by createdon) as seqnum
from t
) t
group by email;
You can join this information back to the original data to get the information you want. I don't see how lag() naturally would be used to solve this problem.
/shrug
; WITH duplicate_email_addresses AS (
SELECT email
FROM t
GROUP
BY email
HAVING Count(*) > 1
)
, records_with_duplicate_email_addresses AS (
SELECT email
, createdon
, createdby
, Row_Number() OVER (PARTITION BY email ORDER BY createdon) AS sequencer
FROM t
WHERE EXISTS (
SELECT *
FROM duplicate_email_addresses
WHERE email = t.email
)
)
, second_duplicate_record AS ( -- Why do you need any more than this?
SELECT email
, createdon
, createdby
FROM records_with_duplicate_email_addresses
WHERE sequencer = 2
)
SELECT records_with_duplicate_email_addresses.email
, records_with_duplicate_email_addresses.createdon
, records_with_duplicate_email_addresses.createdby
, second_duplicate_record.createdby AS first_duplicate_createdby
FROM records_with_duplicate_email_addresses
INNER
JOIN second_duplicate_record
ON second_duplicate_record.email = records_with_duplicate_email_addresses.email
;

Use of MAX function in SQL query to filter data

The code below joins two tables and I need to extract only the latest date per account, though it holds multiple accounts and history records. I wanted to use the MAX function, but not sure how to incorporate it for this case. I am using My SQL server.
Appreciate any help !
select
PROP.FileName,PROP.InsName, PROP.Status,
PROP.FileTime, PROP.SubmissionNo, PROP.PolNo,
PROP.EffDate,PROP.ExpDate, PROP.Region,
PROP.Underwriter, PROP_DATA.Data , PROP_DATA.Label
from
Property.dbo.PROP
inner join
Property.dbo.PROP_DATA on Property.dbo.PROP.FileID = Actuarial.dbo.PROP_DATA.FileID
where
(PROP_DATA.Label in ('Occupancy' , 'OccupancyTIV'))
and (PROP.EffDate >= '42278' and PROP.EffDate <= '42643')
and (PROP.Status = 'Bound')
and (Prop.FileTime = Max(Prop.FileTime))
order by
PROP.EffDate DESC
Assuming your DBMS supports windowing functions and the with clause, a max windowing function would work:
with all_data as (
select
PROP.FileName,PROP.InsName, PROP.Status,
PROP.FileTime, PROP.SubmissionNo, PROP.PolNo,
PROP.EffDate,PROP.ExpDate, PROP.Region,
PROP.Underwriter, PROP_DATA.Data , PROP_DATA.Label,
max (PROP.EffDate) over (partition by PROP.PolNo) as max_date
from Actuarial.dbo.PROP
inner join Actuarial.dbo.PROP_DATA
on Actuarial.dbo.PROP.FileID = Actuarial.dbo.PROP_DATA.FileID
where (PROP_DATA.Label in ('Occupancy' , 'OccupancyTIV'))
and (PROP.EffDate >= '42278' and PROP.EffDate <= '42643')
and (PROP.Status = 'Bound')
and (Prop.FileTime = Max(Prop.FileTime))
)
select
FileName, InsName, Status, FileTime, SubmissionNo,
PolNo, EffDate, ExpDate, Region, UnderWriter, Data, Label
from all_data
where EffDate = max_date
ORDER BY EffDate DESC
This also presupposes than any given account would not have two records on the same EffDate. If that's the case, and there is no other objective means to determine the latest account, you could also use row_numer to pick a somewhat arbitrary record in the case of a tie.
Using straight SQL, you can use a self-join in a subquery in your where clause to eliminate values smaller than the max, or smaller than the top n largest, and so on. Just set the number in <= 1 to the number of top values you want per group.
Something like the following might do the trick, for example:
select
p.FileName
, p.InsName
, p.Status
, p.FileTime
, p.SubmissionNo
, p.PolNo
, p.EffDate
, p.ExpDate
, p.Region
, p.Underwriter
, pd.Data
, pd.Label
from Actuarial.dbo.PROP p
inner join Actuarial.dbo.PROP_DATA pd
on p.FileID = pd.FileID
where (
select count(*)
from Actuarial.dbo.PROP p2
where p2.FileID = p.FileID
and p2.EffDate <= p.EffDate
) <= 1
and (
pd.Label in ('Occupancy' , 'OccupancyTIV')
and p.Status = 'Bound'
)
ORDER BY p.EffDate DESC
Have a look at this stackoverflow question for a full working example.
Not tested
with temp1 as
(
select foo
from bar
whre xy = MAX(xy)
)
select PROP.FileName,PROP.InsName, PROP.Status,
PROP.FileTime, PROP.SubmissionNo, PROP.PolNo,
PROP.EffDate,PROP.ExpDate, PROP.Region,
PROP.Underwriter, PROP_DATA.Data , PROP_DATA.Label
from Actuarial.dbo.PROP
inner join temp1 t
on Actuarial.dbo.PROP.FileID = t.dbo.PROP_DATA.FileID
ORDER BY PROP.EffDate DESC

SQL Server UDF array inputs and outputs

I have a set of columns CODE_1-10, which contain diagnostic codes. I want to create a set of variables CODE_GROUP_1-17, which indicate whether or not one of some particular set of diagnostic codes matches any of the CODE_1-10 variables. For example, CODE_GROUP_1 = 1 if any of CODE_1-10 match either '123' or '456', and CODE_GROUP_2 = 1 if any of CODE_1-10 match '789','111','333','444' or 'foo'.
Here's an example of how you could do this using values constructors.
CASE WHEN (SELECT count(value.val)
FROM (VALUES (CODE_1)
, (CODE_2)
, (CODE_3)
, (CODE_4)
, (CODE_5)
, (CODE_6)
, (CODE_7)
, (CODE_8)
, (CODE_9)
, (CODE_10)
) AS value(val)
WHERE value.val in ('123', '456')
) > 0 THEN 1 ELSE 0 END AS CODE_GROUP_1,
CASE WHEN (SELECT count(value.val)
FROM (VALUES (CODE_1)
, (CODE_2)
, (CODE_3)
, (CODE_4)
, (CODE_5)
, (CODE_6)
, (CODE_7)
, (CODE_8)
, (CODE_9)
, (CODE_10)
) AS value(val)
WHERE value.val in ('789','111','333','444','foo')
) > 0 THEN 1 ELSE 0 END AS CODE_GROUP_2
I am wondering if there is another way to do this that is more efficient. Is there a way to make a CLR UDF that takes an array of CODE_1-10, and outputs a set of columns CODE_GROUP_1-17?
You could at least avoid the repetition of FROM (VALUES ...) like this:
SELECT
CODE_GROUP_1 = COUNT(DISTINCT CASE WHEN val IN ('123', '456') THEN 1 END),
CODE_GROUP_2 = COUNT(DISTINCT CASE WHEN val IN ('789','111','333','444','foo') THEN 1 END),
...
FROM
(
VALUES
(CODE_1),
(CODE_2),
(CODE_3),
(CODE_4),
(CODE_5),
(CODE_6),
(CODE_7),
(CODE_8),
(CODE_9),
(CODE_10)
) AS value(val)
If CODE_1, CODE_2 etc. are column names, you can use the above query as a derived table in CROSS APPLY:
SELECT
...
FROM
dbo.atable -- table containing CODE_1, CODE_2 etc.
CROSS APPLY
(
SELECT ... -- the above query
) AS x
;
Can you create 2 new tables with the columns appended as rows? So one table would be dxCode with a source column if you need to retain the 1-10 value and the dx code and whatever key field(s) you need, the other table would be dxGroup with your 17 groups, the source groupID if you need it, and your target dx values.
Then to determine which codes are in which groups, you can join on your dx fields.

Using Order By with Distinct on a Join (PLSQL)

I have written a join on some tables and I have ordered the data using two levels of ordering - one of which is the primary key of one table.
Now, with this data sorted I want to then exclude any duplicates from my data using an in-line view and the DISTINCT clause - and this is where I am coming unstuck.
I seem to be able to either sort the data OR distinct it, but never both at the same time. Is there a way around this or have I stumbled upon the SQL equivalent of the uncertainty principle?
This code returns the data sorted, but with duplicates
SELECT
ada.source_tab source_tab
, ada.source_col source_col
, ada.source_value source_value
, ada.ada_id ada_id
FROM
are_aud_data ada
, are_aud_exec_checks aec
, are_audit_elements ael
WHERE
aec.aec_id = ada.aec_id
AND ael.ano_id = aec.ano_id
AND aec.acn_id = 123456
AND ael.ael_type = 1
ORDER BY
CASE
WHEN source_tab = 'Tab type 1' THEN 1
WHEN source_tab = 'Tab type 2' THEN 2
ELSE 3
END
,ada.ada_id ASC;
This code removes the duplicates, but I lose the order...
SELECT DISTINCT source_tab, source_col, source_value FROM (
SELECT
ada.source_tab
, ada.source_col source_col
, ada.source_value source_value
, ada.ada_id ada_id
FROM
are_aud_data ada
, are_aud_exec_checks aec
, are_audit_elements ael
WHERE
aec.aec_id = ada.aec_id
AND ael.ano_id = aec.ano_id
AND aec.acn_id = 123456
AND ael.ael_type = 1
ORDER BY
CASE
WHEN source_tab = 'Tab type 1' THEN 1
WHEN source_tab = 'Tab type 2' THEN 2
ELSE 3
END
,ada.ada_id ASC
)
;
If I try and include 'ORDER BY ada_id' at the end of the outer select, I get the error message 'ORA-01791: not a SELECTed expression' which is infuriating me!!
Why don't you include ada_id at the selected fields of the outer query?
;WITH CTE AS
(
SELECT
ada.source_tab source_tab
, ada.source_col source_col
, ada.source_value source_value
, ada.ada_id ada_id
, ROW_NUMBER() OVER (PARTITION BY [COLUMNS_YOU_WANT TO BE DISTINCT]
ORDER BY [your_columns]) rn
FROM
are_aud_data ada
, are_aud_exec_checks aec
, are_audit_elements ael
WHERE
aec.aec_id = ada.aec_id
AND ael.ano_id = aec.ano_id
AND aec.acn_id = 356441
AND ael.ael_type = 1
ORDER BY
CASE
WHEN source_tab = 'Licensed Inventory' THEN 1
WHEN source_tab = 'CMDB' THEN 2
ELSE 3
END
,ada.ada_id ASC
)
select * from CTE WHERE rn<2
it seems that the ada_id is meaningless in the outer query.
you have removed all those values to boil it down to the distinct source_tab and source_col...
what would you expect the order to be?
you want maybe the minimum ada_id for each table and column set to be the driver for the order - (although the table name seems appropriate to me)
include the minimum ada_id in the inner query (you'll need a group by clause)
then reference that in the outer query and sort on it.