Multiple joins to get the same lookup column for different values - sql

We have a rather large SQL query that performs poorly. One of the problems (from analysing the query plan) is the number of joins we have.
Essentially, we have values in our data that we need to look up in another table to get the value to display to the user. The problem is that we have to join to the same table 4 times, because there are 4 different columns that all need the same lookup.
Hopefully this diagram might make it clearer
Raw_Event_data
event_id, datetime_id, lookup_1, lookup_2, lookup_3, lookup_4
1, 2013-01-01_12:00, 1, 5, 3, 9
2, 2013-01-01_12:00, 121, 5, 8, 19
3, 2013-01-01_12:00, 11, 2, 3, 32
4, 2013-01-01_12:00, 15, 2, 1, 0
Lookup_table
lookup_id, lookup_desc
1, desc1
2, desc2
3, desc3
...
Our query then looks something like this
Select
raw.event_id,
raw.datetime_id,
lookup1.lookup_desc,
lookup2.lookup_desc,
lookup3.lookup_desc,
lookup4.lookup_desc
FROM
Raw_Event_data raw, Lookup_table lookup1,Lookup_table lookup2,Lookup_table lookup3,Lookup_table lookup4
WHERE raw.event_id = 1 AND
raw.lookup_1 *= lookup1.lookup_id AND
raw.lookup_2 *= lookup2.lookup_id AND
raw.lookup_3 *= lookup3.lookup_id AND
raw.lookup_4 *= lookup4.lookup_id
So I get as an output
1, 2013-01-01_12:00, desc1, desc5, desc3, desc9
As I said the query works, but the joins are killing the performance.
That is a simplistic example I give there, in reality there will be 12 joins like above and we won't be selecting a specific event, but rather a range of events.
The question is: is there a better way of doing those joins?

Correlated subqueries might be the way to go:
SELECT r.event_id
, r.datetime_id
, (select lookup1.lookup_desc from lookup_table lookup1 where lookup1.lookup_id = r.lookup_1) as desc_1
, (select lookup2.lookup_desc from lookup_table lookup2 where lookup2.lookup_id = r.lookup_2) as desc_2
, (select lookup3.lookup_desc from lookup_table lookup3 where lookup3.lookup_id = r.lookup_3) as desc_3
, (select lookup4.lookup_desc from lookup_table lookup4 where lookup4.lookup_id = r.lookup_4) as desc_4
FROM Raw_Event_data r
WHERE r.event_id = 1
;
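For comparison, the outer joins in the question can also be written with ANSI LEFT JOIN syntax in place of `*=`. This is only a sketch using the question's table and column names (the aliases r and l1..l4 are illustrative). Nothing in the thread establishes that it changes the plan, so comparing the query plans of both forms is the way to find out.

```sql
-- Same lookups as the question, one LEFT JOIN per lookup column.
-- Aliases r and l1..l4 are illustrative, not from the original query.
SELECT r.event_id,
       r.datetime_id,
       l1.lookup_desc AS desc_1,
       l2.lookup_desc AS desc_2,
       l3.lookup_desc AS desc_3,
       l4.lookup_desc AS desc_4
FROM Raw_Event_data AS r
LEFT JOIN Lookup_table AS l1 ON l1.lookup_id = r.lookup_1
LEFT JOIN Lookup_table AS l2 ON l2.lookup_id = r.lookup_2
LEFT JOIN Lookup_table AS l3 ON l3.lookup_id = r.lookup_3
LEFT JOIN Lookup_table AS l4 ON l4.lookup_id = r.lookup_4
WHERE r.event_id = 1;
```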

My first attempt would be to handle the indexing myself, if the DBAs refused to do it.
declare @start_range bigint, @end_range bigint
select
@start_range = 5
,@end_range = 500
create local temporary table raw_event_subset
( --going to assume some schema based on your comments...obviously you will change these to whatever the base schema is.
event_id bigint
,datetime_id timestamp
,lookup_1 smallint
,lookup_2 smallint
--etc
) on commit preserve rows
create HG index HG_temp_raw_event_subset_event_id on raw_event_subset (event_id)
create LF index LF_temp_raw_event_subset_lookup_1 on raw_event_subset (lookup_1)
create LF index LF_temp_raw_event_subset_lookup_2 on raw_event_subset (lookup_2)
--etc
insert into raw_event_subset
select
event_id
,datetime_id
,lookup_1
,lookup_2
--,etc
from raw_event_data
where event_id >= @start_range --event_id *must* have an HG index on it for this to be worthwhile.
and event_id <= @end_range
--then run your normal query, except replace raw_event_data with raw_event_subset
select
event_id
,datetime_id
,l1.lookup_desc
,l2.lookup_desc
--etc
from raw_event_subset r
left join lookup_table l1
on l1.lookup_id = r.lookup_1
left join lookup_table l2
on l2.lookup_id = r.lookup_2
--etc
drop table raw_event_subset
hope this helps...

Related

SQL Query taking WAY too long

I have a query that's taking way too long.
There's no index on any column, and I'm pretty sure the way the ORs are acting in this is making it too hard on the server.
This is a view I have and I'm making a SELECT * on this view that is taking 4 minutes to complete.
After revision, the query that I'm doing on this view is taking the most time.
SELECT * FROM Penny_Assoc_PCB WHERE PRODUCT_ID=68 ORDER BY RECORD_DT, ASSOCIATION_TYPE
/***** Here is the execution plan *******/
https://www.brentozar.com/pastetheplan/?id=Bki03eIHK
SELECT dbo.synfact_record.RECORD_ID
,dbo.synfact_record.PART_ID
,dbo.synfact_record.RECORD_DT
,dbo.synfact_association.ASSOCIATION_PART_A
,dbo.synfact_association.ASSOCIATION_PART_B
,dbo.synfact_association.ASSOCIATION_TYPE
,dbo.synfact_association.ASSOCIATION_ID
,dbo.synfact_record.PRODUCT_ID
FROM dbo.synfact_association
INNER JOIN dbo.synfact_record ON dbo.synfact_association.RECORD_ID = dbo.synfact_record.RECORD_ID
WHERE (
dbo.synfact_record.PART_ID IN (
SELECT PART_ID
FROM dbo.synfact_record AS synfact_record_1
WHERE (RECORD_STATUS = 1)
AND (RECORD_TYPE = 0)
)
)
AND dbo.synfact_record.PRODUCT_ID IN(
8,
9,
10,
15,
27,
31,
34,
56,
60,
61,
62,
66,
67,
68)
AND (dbo.synfact_record.RECORD_ID > 499)
AND (dbo.synfact_record.RECORD_STATUS = 1)
GROUP BY dbo.synfact_record.RECORD_ID
,dbo.synfact_record.PART_ID
,dbo.synfact_record.RECORD_DT
,dbo.synfact_association.ASSOCIATION_PART_A
,dbo.synfact_association.ASSOCIATION_PART_B
,dbo.synfact_association.ASSOCIATION_TYPE
,dbo.synfact_association.ASSOCIATION_ID
,dbo.synfact_record.PRODUCT_ID
,dbo.synfact_record.RECORD_STATUS
You can substantially simplify your query.
I have removed the GROUP BY, which was acting as a giant DISTINCT with no aggregation. If you get duplicates, I suggest you put more thought into your join. Perhaps you need a better join condition, or a top-1-per-group.
SELECT r.RECORD_ID,
r.PART_ID,
r.RECORD_DT,
a.ASSOCIATION_PART_A,
a.ASSOCIATION_PART_B,
a.ASSOCIATION_TYPE,
a.ASSOCIATION_ID,
r.PRODUCT_ID
FROM
dbo.synfact_association AS a
INNER JOIN
dbo.synfact_record AS r ON a.RECORD_ID = r.RECORD_ID
WHERE
(r.PART_ID IN (
SELECT PART_ID
FROM dbo.synfact_record AS r1
WHERE (r1.RECORD_STATUS = 1)
AND (r1.RECORD_TYPE = 0)
)
)
AND r.PRODUCT_ID IN
(8,9,10,15,27,31,34,56,60,61,62,66,67,68)
AND (r.RECORD_ID > 499)
AND (r.RECORD_STATUS = 1);
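If duplicates do show up once the GROUP BY is gone, the top-1-per-group idea mentioned above could look roughly like the following. This is a sketch only: the rule "the highest ASSOCIATION_ID wins" is an assumption made for illustration, and the PART_ID and PRODUCT_ID filters are left as a comment to keep it short.

```sql
-- Top-1-per-group sketch: keep one association row per RECORD_ID.
-- ASSUMPTION: "highest ASSOCIATION_ID wins" is a made-up tie-break rule;
-- substitute whatever ordering identifies the row you actually want.
SELECT x.RECORD_ID,
       x.PART_ID,
       x.RECORD_DT,
       x.ASSOCIATION_PART_A,
       x.ASSOCIATION_PART_B,
       x.ASSOCIATION_TYPE,
       x.ASSOCIATION_ID,
       x.PRODUCT_ID
FROM (
    SELECT r.RECORD_ID, r.PART_ID, r.RECORD_DT,
           a.ASSOCIATION_PART_A, a.ASSOCIATION_PART_B,
           a.ASSOCIATION_TYPE, a.ASSOCIATION_ID,
           r.PRODUCT_ID,
           -- Number the rows within each RECORD_ID; rn = 1 is the one kept.
           ROW_NUMBER() OVER (PARTITION BY r.RECORD_ID
                              ORDER BY a.ASSOCIATION_ID DESC) AS rn
    FROM dbo.synfact_association AS a
    INNER JOIN dbo.synfact_record AS r ON a.RECORD_ID = r.RECORD_ID
    WHERE r.RECORD_ID > 499
      AND r.RECORD_STATUS = 1
      -- the PART_ID and PRODUCT_ID filters from the original view go here
) AS x
WHERE x.rn = 1;
```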
Based on this query alone, I would recommend the following indexes:
CREATE CLUSTERED INDEX IX_synfact_association_RECORD_ID
ON synfact_association (RECORD_ID)
-- for non clustered add: INCLUDE (ASSOCIATION_PART_A, ASSOCIATION_PART_B, ASSOCIATION_TYPE, ASSOCIATION_ID)
CREATE CLUSTERED INDEX IX_synfact_record_RECORD_ID
ON synfact_record (RECORD_STATUS, RECORD_ID)
-- for non clustered add: INCLUDE (PART_ID, RECORD_DT, PRODUCT_ID)
In this second index it may be worth swapping RECORD_ID and PART_ID.
CREATE NONCLUSTERED INDEX IX_synfact_record_RECORD_TYPE
ON synfact_record (RECORD_STATUS, RECORD_TYPE, PART_ID)
This last index is necessary for the IN clause

How to join a subquery to itself?

How can you join a subquery onto itself? I'd like to do something like the following.
SELECT
four.src AS start, four.dest AS layover, f.dest AS destination
FROM
( SELECT 1 AS src, 2 as dest union all select 2, 3 ) AS four
JOIN
four AS f
ON f.src = four.dest
However the query above gives me the error
Msg 208, Level 16, State 1, Line 1
Invalid object name 'four'.
I'd rather not have to store it as a variable or view etc. first, since this is part of a monolithic query (this is itself a subquery and it's part of a series of UNIONs), and I do not want to have to make sure that there are no impacting joins elsewhere that relate.
The force behind this change is that four used to be a simple lookup, but now for this query the values have to be calculated.
PS - this is a simplified example, in my case the subquery for four is a hundred lines long
You can make use of a CTE (Common Table Expression) in this scenario. With it, you do not need to store the result in any temporary object.
;WITH four AS (
SELECT 1 AS src, 2 as dest
union all
select 2, 3
)
SELECT F1.src AS start, F1.dest AS layover, F2.dest AS destination
FROM four F1
INNER JOIN four F2 ON F2.src = F1.dest
Use a temp table.
CREATE TABLE #Temp (src int, dest int);
INSERT INTO #Temp (src, dest)
SELECT 1 AS src, 2 AS dest UNION ALL SELECT 2, 3;
SELECT * FROM #Temp t1
INNER JOIN #Temp t2 ON t2.src = t1.dest;
You need to write it out again. The alias 'four' can only be referenced in expressions (the SELECT list and the WHERE, HAVING and ON conditions); it cannot be used as a table source in a join unless it is the name of an actual table.
SELECT
four.src AS start, four.dest AS layover, f.dest AS destination
FROM
(SELECT 1 AS src, 2 as dest union all select 2, 3 ) AS four
JOIN
(SELECT 1 AS src, 2 as dest union all select 2, 3 ) AS f
ON f.src = four.dest

MSSQL ORDER BY Passed List

I am using Lucene to perform queries on a subset of SQL data, which returns a scored list of RecordIDs, e.g. 11,5,3,25,30.
I want to use this list to retrieve a set of results from the full SQL Table by RecordIDs.
So SELECT * FROM MyFullRecord
where RecordID in (11,5,3,25,30)
I would like the retrieved list to maintain the scored order.
I can do it by using an Order by like so;
ORDER BY (CASE WHEN RecordID = 11 THEN 0
WHEN RecordID = 5 THEN 1
WHEN RecordID = 3 THEN 2
WHEN RecordID = 25 THEN 3
WHEN RecordID = 30 THEN 4
END)
I am concerned about the load on the server, especially if I am passing long lists of RecordIDs. Does anyone have experience of this, or how can I determine an optimum list length?
Are there any other ways to achieve this functionality in MSSQL?
Roger
You can record your list into a table or table variable with sorting priorities.
And then join your table with this sorting one.
DECLARE @tSortOrder TABLE (RecordID INT, SortOrder INT)
INSERT INTO @tSortOrder (RecordID, SortOrder)
SELECT 11, 1 UNION ALL
SELECT 5, 2 UNION ALL
SELECT 3, 3 UNION ALL
SELECT 25, 4 UNION ALL
SELECT 30, 5
SELECT *
FROM yourTable T
INNER JOIN @tSortOrder S ON T.RecordID = S.RecordID -- inner join keeps only the listed RecordIDs
ORDER BY S.SortOrder
Instead of creating a searched order by statement, you could create an in memory table to join. It's easier on the eyes and definitely scales better.
SQL Statement
SELECT mfr.*
FROM MyFullRecord mfr
INNER JOIN (
SELECT *
FROM (VALUES (1, 11),
(2, 5),
(3, 3),
(4, 25),
(5, 30)
) q(ID, RecordID)
) q ON q.RecordID = mfr.RecordID
ORDER BY
q.ID
Something like:
SELECT * FROM MyFullRecord where RecordID in (11,5,3,25,30)
ORDER BY
CHARINDEX(','+CAST(RecordID AS varchar)+',',
','+'11,5,3,25,30'+',')

How to improve performance of this query?

With reference to SQL Query how to summarize students record by date? I was able to get the report I wanted.
I was told that in the real world the students table will have 30 million records. I do have an index on (StudentID, Date). Any suggestions to improve the performance, or is there a better way to build the report?
Right now I have the following query
;with cte as
(
select id,
studentid,
date,
'#'+subject+';'+grade+';'+convert(varchar(10), date, 101) report
from student
)
-- insert into studentreport
select distinct
studentid,
STUFF(
(SELECT cast(t2.report as varchar(50))
FROM cte t2
where c.StudentId = t2.StudentId
order by t2.date desc
FOR XML PATH (''))
, 1, 0, '') AS report
from cte c;
Without seeing the execution plan, it's not really possible to write an optimized SQL statement so I'll make suggestions instead.
Don't use a CTE, as they often don't handle queries with large memory requirements well (at least, in my experience). Instead, stage the CTE data in a real table, either with a materialized/indexed view or with a working table (maybe a large temp table). Then execute the second select (the one after the CTE) to combine your data into an ordered list.
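A rough sketch of the staging idea, assuming the question's student table and report format. The names #student_report_stage and IX_stage_student_date are hypothetical, and whether this beats the CTE version on 30 million rows is something only a test against real data can show.

```sql
-- Stage the per-row report fragments once, instead of in a CTE.
-- ASSUMPTION: #student_report_stage is a hypothetical working table.
SELECT studentid,
       [date],
       '#' + subject + ';' + grade + ';' + CONVERT(varchar(10), [date], 101) AS report
INTO #student_report_stage
FROM student;

-- Index the staged rows in the order the concatenation reads them.
CREATE CLUSTERED INDEX IX_stage_student_date
    ON #student_report_stage (studentid, [date] DESC);

-- One row per student; the CAST leaves the column unnamed so that
-- FOR XML PATH('') concatenates the values without wrapping tags.
SELECT s.studentid,
       (SELECT CAST(t2.report AS varchar(50))
        FROM #student_report_stage AS t2
        WHERE t2.studentid = s.studentid
        ORDER BY t2.[date] DESC
        FOR XML PATH('')) AS report
FROM (SELECT DISTINCT studentid FROM #student_report_stage) AS s;

DROP TABLE #student_report_stage;
```

The STUFF(..., 1, 0, '') wrapper from the question is left out here because it removes zero characters and so has no effect on the result.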
The number of comments on your question indicates that you have a large problem (or problems). You're converting tall and skinny data (think integers, datetime2 types) into ordered lists within strings. Try to think instead in terms of storing the data in the smallest formats available and not manipulating it into strings until afterward (or never). Alternatively, give serious thought to creating an XML data field to replace the 'report' field.
If you can make it work, this is what I would do (including a test case without indexes). Your mileage may vary, but give it a try:
create table #student (id int not null, studentid int not null, date datetime not null, subject varchar(40), grade varchar(40))
insert into #student (id,studentid,date,subject,grade)
select 1, 1, getdate(), 'history', 'A-' union all
select 2, 1, dateadd(d,1,getdate()), 'computer science', 'b' union all
select 3, 1, dateadd(d,2,getdate()), 'art', 'q' union all
--
select 1, 2, getdate() , 'something', 'F' union all
select 2, 2, dateadd(d,1,getdate()), 'genetics', 'e' union all
select 3, 2, dateadd(d,2,getdate()), 'art', 'D+' union all
--
select 1, 3, getdate() , 'memory loss', 'A-' union all
select 2, 3, dateadd(d,1,getdate()), 'creative writing', 'A-' union all
select 3, 3, dateadd(d,2,getdate()), 'history of asia 101', 'A-'
go
select studentid as studentid
,(select s2.date as '@date', s2.subject as '@subject', s2.grade as '@grade'
from #student s2 where s1.studentid = s2.studentid for xml path('report'), type) as 'reports'
from (select distinct studentid from #student) s1;
I don't know how to make the output legible on here, but the resultset is 2 fields. Field 1 is an integer, field 2 is XML with one node per report. This still isn't as ideal as just sending the resultset, but it is at least one result per studentid.

SQL Query to eliminate similar entries

I am working on a problem in SQL Server 2008
I have a table with a primary key and six value columns:
PK INT
dOne SmallINT
dTwo SmallINT
dThree SmallINT
dFour SmallINT
dFive SmallINT
dSix SmallINT
The table contains around a million records. It's probably worth noting that the value in column n+1 > the value in column n, e.g. 97, 98, 99, 120, 135. I am trying to eliminate all rows which have 5 DIGITS in common (ignoring the PK), e.g.:
76, 89, 99, 102, 155, 122
11, 89, 99, 102, 155, 122
89, 99, 102, 155, 122, 130
In this case the algorithm should start at the first row and delete the second and third rows because they contain 5 matching digits. The first row persists.
I have tried to brute force the solution but finding all the duplicates for only the first record takes upwards of 25 seconds meaning processing the whole table would take... way too long (this should be a repeatable process).
I am fairly new to SQL but this is what I have come up with (I have come up with a few solutions but none were adequate... this is the latest attempt):
(I won't include all the code but I will explain the method, I can paste more if it helps)
Save the digits of record n into variables. SELECT all records which have one digit in common with record n FROM largeTable.
Insert all selected digits into #oneMatch and include [matchingOne] with the digit that matched.
Select all records which have one digit in common with record n FROM the temp table WHERE 'digit in common' != [matching]. INSERT all selected digits into #twoMatch and include [matchingOne] AND [matchingTwo]...
Repeat until inserting into #fiveMatch. Delete #fiveMatch from largeTable and move to record n+1
I am having a problem implementing this solution. How can I assign the matching variable depending on the WHERE clause?
-- SELECT all records with ONE matching field:
INSERT INTO #oneMatch (ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix, mOne)
SELECT ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix
FROM dbo.BaseCombinationsExtended
WHERE ( [dOne] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) -- mOne = dOne?
OR [dTwo] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) -- mOne = dTwo?
OR [dThree] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) -- mOne = dThree?
...
OR [dSix] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) -- mOne = dSix?
)
I am able to 'fake' the above using six queries but that is too inefficient...
Sorry for the long description. Any help would be greatly appreciated (new solution or implementation of my attempt above) as this problem has been nagging at me for a while...
Unless I miss something this should produce the correct result.
declare @T table
(
PK INT identity primary key,
dOne SmallINT,
dTwo SmallINT,
dThree SmallINT,
dFour SmallINT,
dFive SmallINT,
dSix SmallINT
)
insert into @T values
(76, 89, 99, 102, 155, 122),
(11, 89, 99, 102, 155, 122),
(89, 99, 102, 155, 122, 130)
-- q1: for every row, the six ways of leaving one column out
;with q1(PK, d1, d2, d3, d4, d5) as
(
select PK, dTwo, dThree, dFour, dFive, dSix
from @T
union all
select PK, dOne, dThree, dFour, dFive, dSix
from @T
union all
select PK, dOne, dTwo, dFour, dFive, dSix
from @T
union all
select PK, dOne, dTwo, dThree, dFive, dSix
from @T
union all
select PK, dOne, dTwo, dThree, dFour, dSix
from @T
union all
select PK, dOne, dTwo, dThree, dFour, dFive
from @T
),
-- q2: rn = 1 marks the lowest PK that produced a given five-value combination
q2 as
(
select PK,
row_number() over(partition by d1, d2, d3, d4, d5 order by PK) as rn
from q1
),
-- q3: keep only rows where all six combinations were seen first on that row
q3 as
(
select PK
from q2
where rn = 1
group by PK
having count(*) = 6
)
select T.*
from @T as T
inner join q3 as Q
on T.PK = Q.PK
I can't make any promises on performance, but you can try this. The first thing that I do is put the data into a more normalized structure.
CREATE TABLE dbo.Test_Sets_Normalized (my_id INT NOT NULL, c SMALLINT NOT NULL)
GO
INSERT INTO dbo.Test_Sets_Normalized (my_id, c)
SELECT my_id, c1 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c2 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c3 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c4 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c5 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c6 FROM dbo.Test_Sets
GO
SELECT DISTINCT
T2.my_id
FROM
(SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T1
INNER JOIN (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T2 ON T2.my_id > T1.my_id
WHERE
(
SELECT
COUNT(*)
FROM
dbo.Test_Sets_Normalized T3
INNER JOIN dbo.Test_Sets_Normalized T4 ON
T4.my_id = T2.my_id AND
T4.c = T3.c
WHERE
T3.my_id = T1.my_id) >= 5
That should get you the IDs that you need. Once you've confirmed that it does what you want, you can JOIN back to the original table and delete by IDs.
There's probably an improvement possible somewhere that doesn't require the DISTINCT. I'll give it a little more thought.
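The "delete by IDs" step could be sketched as follows, assuming the dbo.Test_Sets and dbo.Test_Sets_Normalized tables used above. The staging table #ids_to_delete is a hypothetical name; collecting the IDs first just makes it possible to inspect them before anything is removed.

```sql
-- ASSUMPTION: #ids_to_delete is a hypothetical staging table.
CREATE TABLE #ids_to_delete (my_id INT NOT NULL PRIMARY KEY);

-- Same query as above, with its result captured instead of returned.
INSERT INTO #ids_to_delete (my_id)
SELECT DISTINCT T2.my_id
FROM (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T1
INNER JOIN (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T2
        ON T2.my_id > T1.my_id
WHERE (SELECT COUNT(*)
       FROM dbo.Test_Sets_Normalized T3
       INNER JOIN dbo.Test_Sets_Normalized T4
               ON T4.my_id = T2.my_id AND T4.c = T3.c
       WHERE T3.my_id = T1.my_id) >= 5;

BEGIN TRANSACTION;

-- Join back to the original table and delete the collected IDs.
DELETE ts
FROM dbo.Test_Sets AS ts
INNER JOIN #ids_to_delete AS d ON d.my_id = ts.my_id;

-- Check the remaining row count here; use ROLLBACK instead of COMMIT
-- if the number is not what you expected.
SELECT COUNT(*) AS rows_left FROM dbo.Test_Sets;

COMMIT TRANSACTION;

DROP TABLE #ids_to_delete;
```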
Edit - the following approach might be better than N squared performance, depending on the optimizer. If all 5 columns are indexed it should only need 6 index seeks per row, which is still N * logN. It does seem a little dopey though.
You could code generate the where condition based on all the permutations of 5 matches: so the records to delete would be given by:
SELECT * FROM SillyTable ToDelete WHERE EXISTS
(
SELECT PK From SillyTable Duplicate
WHERE ( (
(Duplicate.dOne=ToDelete.dOne)
AND (Duplicate.dTwo=ToDelete.dTwo)
AND (Duplicate.dThree=ToDelete.dThree)
AND (Duplicate.dFour=ToDelete.dFour)
AND (Duplicate.dFive=ToDelete.dFive)
) OR (
(Duplicate.dOne=ToDelete.dTwo)
AND (Duplicate.dTwo=ToDelete.dThree)
AND (Duplicate.dThree=ToDelete.dFour)
AND (Duplicate.dFour=ToDelete.dFive)
AND (Duplicate.dFive=ToDelete.dSix)
) OR (
(Duplicate.dTwo=ToDelete.dOne)
AND (Duplicate.dThree=ToDelete.dTwo)
AND (Duplicate.dFour=ToDelete.dThree)
AND (Duplicate.dFive=ToDelete.dFour)
AND (Duplicate.dSix=ToDelete.dFive)
) OR (
(Duplicate.dTwo=ToDelete.dTwo)
AND (Duplicate.dThree=ToDelete.dThree)
AND (Duplicate.dFour=ToDelete.dFour)
AND (Duplicate.dFive=ToDelete.dFive)
AND (Duplicate.dSix=ToDelete.dSix)
) ...
This goes on to cover all 36 combinations (there is one non-match on each side of the join, out of 6 possible columns, so 6*6 gives you all the possibilities). I would code generate this because it's a lot of typing, and what if you want 4 out of 6 matches tomorrow, but you could hand code it I guess.
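One way the code generation might be done in T-SQL itself is sketched below. It emits one predicate per pair of skipped columns, 36 rows in total, which can then be pasted together with OR. The CTE names cols and kept, the column names pos, col and rnk, and the output column names are all made up for this illustration.

```sql
-- Generate the 36 "five columns match" predicates as text.
WITH cols (pos, col) AS (
    SELECT pos, col
    FROM (VALUES (1, 'dOne'), (2, 'dTwo'), (3, 'dThree'),
                 (4, 'dFour'), (5, 'dFive'), (6, 'dSix')) AS v (pos, col)
),
kept AS (
    -- For each skipped position, rank the five remaining columns 1..5.
    SELECT s.pos AS skipped,
           c.col,
           ROW_NUMBER() OVER (PARTITION BY s.pos ORDER BY c.pos) AS rnk
    FROM cols AS s
    INNER JOIN cols AS c ON c.pos <> s.pos
)
SELECT d.pos AS dup_skipped,
       t.pos AS del_skipped,
       -- Pair the remaining columns by rank and join them with AND;
       -- STUFF strips the leading ' AND ' (5 characters).
       '(' + STUFF((SELECT ' AND Duplicate.' + kd.col + ' = ToDelete.' + kt.col
                    FROM kept AS kd
                    INNER JOIN kept AS kt ON kt.rnk = kd.rnk
                    WHERE kd.skipped = d.pos
                      AND kt.skipped = t.pos
                    ORDER BY kd.rnk
                    FOR XML PATH('')), 1, 5, '') + ')' AS predicate
FROM cols AS d
CROSS JOIN cols AS t
ORDER BY d.pos, t.pos;
```

Pairing the kept columns by rank relies on the question's statement that values increase from left to right within a row. For the 4-out-of-6 case mentioned above, the same pattern would need two skipped positions per side.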