Find records with exact matches on a many to many relationship - sql

I have three tables that look like these:
PROD
Prod_ID|Desc
------------
P1|Foo1
P2|Foo2
P3|Foo3
P4|Foo4
...
RAM
Ram_ID|Desc
------------
R1|Bar1
R2|Bar2
R3|Bar3
R4|Bar4
...
PROD_RAM
Prod_ID|Ram_ID
------------
P1|R1
P2|R2
P3|R1
P3|R2
P3|R3
P4|R3
P5|R1
P5|R2
...
Between PROD and RAM there's a Many-To-Many relationship described by the PROD_RAM table.
Given a Ram_ID set like (R1,R3) I would like to find all the PROD that has exactly ONE or ALL of the RAM of the given set.
Given (R1,R3) should return for example P1,P4 and P5; P3 should not be returned because has R1 and R3 but also R2.
What's the fastest query to get all the PROD that has exactly ONE or ALL of the Ram_ID of a given RAM set?
EDIT:
The PROD_RAM table could contain relationship bigger than 1->3 so, "hardcoded" checks for count = 1 OR = 2 are not a viable solution.

Another solution you could try for speed would be like this
;WITH CANDIDATES AS (
SELECT pr1.Prod_ID
, pr2.Ram_ID
FROM PROD_RAM pr1
INNER JOIN PROD_RAM pr2 ON pr2.Prod_ID = pr1.Prod_ID
WHERE pr1.Ram_ID IN ('R1', 'R3')
)
SELECT *
FROM CANDIDATES
WHERE CANDIDATES.Prod_ID NOT IN (
SELECT Prod_ID
FROM CANDIDATES
WHERE Ram_ID NOT IN ('R1', 'R3')
)
or if you don't like repeating the set conditions
;WITH SUBSET (Ram_ID) AS (
SELECT 'R1'
UNION ALL SELECT 'R3'
)
, CANDIDATES AS (
SELECT pr1.Prod_ID
, pr2.Ram_ID
FROM PROD_RAM pr1
INNER JOIN PROD_RAM pr2 ON pr2.Prod_ID = pr1.Prod_ID
INNER JOIN SUBSET s ON s.Ram_ID = pr1.Ram_ID
)
, EXCLUDES AS (
SELECT Prod_ID
FROM CANDIDATES
LEFT OUTER JOIN SUBSET s ON s.Ram_ID = CANDIDATES.Ram_ID
WHERE s.Ram_ID IS NULL
)
SELECT *
FROM CANDIDATES
LEFT OUTER JOIN EXCLUDES ON EXCLUDES.Prod_ID = CANDIDATES.Prod_ID
WHERE EXCLUDES.Prod_ID IS NULL

One way to do this would be something like the following:
SELECT PROD.Prod_ID FROM PROD WHERE
(SELECT COUNT(*) FROM PROD_RAM WHERE PROD_RAM.Prod_ID = PROD.Prod_ID) > 0 AND
(SELECT COUNT(*) FROM PROD_RAM WHERE PROD_RAM.Prod_ID = PROD.Prod_ID AND PROD.Ram_ID <>
IFNULL((SELECT TOP 1 Ram_ID FROM PROD_RAM WHERE PROD_RAM.Prod_ID = PROD.Prod_ID),0)) = 0

SELECT Prod_ID
FROM
( SELECT Prod_ID
, COUNT(*) AS cntAll
, COUNT( CASE WHEN Ram_ID IN (1,3)
THEN 1
ELSE NULL
END
) AS cntGood
FROM PROD_RAM
GROUP BY Prod_ID
) AS grp
WHERE cntAll = cntGood
AND ( cntGood = 1
OR cntGood = 2 --- number of items in list (1,3)
)
Not at all sure if it's the fastest way. You'll have to try different ways to write this query (using JOINs and NOT EXISTS ) and test for speed.

Related

Problem optimizing sql query with cross apply sub query

So I have three tables:
MakerParts, that holds the primary information of a Vehicle Part:
Id
MakerId
PartNumber
Description
1
1
ABC1234
Tire
2
1
XYZ1234
Door
MakerPrices, that holds the price history variation for the parts (references MakerParts.Id on MakerPartNumberId, and the table MakerPriceUpdates on UpdateId):
Id
MakerPartNumberId
UpdateId
Price
1
1
1
9.83
2
1
2
11.23
MakerPriceUpdates, that holds the date of prices updates. This update is basically a CSV file that is uploaded to our system. One file, one line on this table, multiple prices changes on the table MakerPrices.
Id
Date
FileName
1
2019-01-09 00:00:00.000
temp.csv
2
2019-01-11 00:00:00.000
temp2.csv
This means that one part (MakerParts) may have multiple prices (MakerPrices). The date of the price change is on the table MakerPricesUpdates.
I want to select all MakerParts where the most recent price is zero, filtering by the MakerId on table MakerParts.
What I've tried:
select mp.* from MakerParts mp cross apply
(select top 1 Price from MakerPrices inner join
MakerPricesUpdates on MakerPricesUpdates.Id = MakerPrices.UpdateId where
MakerPrices.MakerPartNumberId = mp.Id order by Date desc) as p
where mp.MakerId = 1 and p.Price = 0
But that is absurdly slow (we have about 100 million lines on the MakerPrices table). I'm having a hard time optimizing this query. (the result is only two rows for the MakerId 1, and it took 2 mins to run). I also tried:
select * from (
select
mp.*,
(select top 1 Price from MakerPrices inner join
MakerPricesUpdates on MakerPricesUpdates.Id = MakerPrices.UpdateId
where MakerPrices.MakerPartNumberId = mp.Id order by Date desc) as Price
from MakerParts mp) as temp
where temp.Price = 0 and MakerId = 1
Same result, and same time. My query plan (for the first query) (no new indexes suggested by Management Studio):
I think you can avoid joining MakerPriceUpdates with makerprices since with the highest
UpdateId you can find the latest price updates. It will save you some time.
select mp.* from MakerParts mp cross apply
(select top 1 Price from MakerPrices where
MakerPrices.MakerPartNumberId = mp.Id order by MakerPrices.UpdateId desc) as p
where mp.MakerId = 1 and p.Price = 0
You can further reduced some times by avoiding sort and order by with cte and row_number() as below:
;with LatestMakerPrices as
(
select *,row_number()over(partition by MakerPartNumberId order by updateid desc)rn from MakerPrices
)
select mp.* from MakerParts mp cross apply
(select price from LatestMakerPrices lmp where lmp.MakerPartNumberId=mp.Id) as p
where mp.MakerId = 1 and p.Price = 0
Execution plan difference between query in question and my answer:
try:
WITH tab AS (
SELECT *, NULL as Price FROM MakerParts
WHERE not exists (
SELECT Id
FROM MakerPrices
WHERE MakerPrices.MakerPartNumberId = MakerParts.Id
)
)
SELECT * from tab WHERE MakerId = 2
UNION ALL
SELECT a.* , Price
FROM [dbo].[MakerParts] a
LEFT JOIN [dbo].[MakerPrices] b
ON b.MakerPartNumberId = a.Id
WHERE MakerId = 2 AND Price = 0
Try your query:
select mp.* from MakerParts mp cross apply
(select top 1 Price from MakerPrices inner join
MakerPricesUpdates on MakerPricesUpdates.Id = MakerPrices.UpdateId where
MakerPrices.MakerPartNumberId = mp.Id order by Date desc) as p
where mp.MakerId = 1 and p.Price = 0
After creating below index:
CREATE NONCLUSTERED INDEX [NCIdx_MakerPrices_MakerPartNumberId_UpdateId] ON [dbo].[MakerPrices]
(
[MakerPartNumberId] ASC,
[UpdateId] ASC
)
INCLUDE([Price])
And making ID column of MakerPricesUpdates table primary key.

Rewrite a query with GROUP BY ALL

Microsoft has deprecated GROUP BY ALL and while the query might work now, I'd like to future-proof this query for future SQL upgrades.
Currently, my query is:
SELECT qt.QueueName AS [Queue] ,
COUNT ( qt.QueueName ) AS [#ofUnprocessedEnvelopes] ,
COUNT ( CASE WHEN dq.AssignedToUserID = 0 THEN 1
ELSE NULL
END
) AS [#ofUnassignedEnvelopes] ,
MIN ( dq.DocumentDate ) AS [OldestEnvelope]
FROM dbo.VehicleReg_Documents_QueueTypes AS [qt]
LEFT OUTER JOIN dbo.VehicleReg_Documents_Queue AS [dq] ON dq.QueueID = qt.QueueTypeID
WHERE dq.IsProcessed = 0
AND dq.PageNumber = 1
GROUP BY ALL qt.QueueName
ORDER BY qt.QueueName ASC;
And the resulting data set:
<table><tbody><tr><td>Queue</td><td>#ofUnprocessedEnvelopes</td><td>#ofUnassignedEnvelopes</td><td>OldestEnvelope</td></tr><tr><td>Cancellations</td><td>0</td><td>0</td><td>NULL</td></tr><tr><td>Dealer</td><td>26</td><td>17</td><td>2018-04-06</td></tr><tr><td>Matched to Registration</td><td>93</td><td>82</td><td>2018-04-04</td></tr><tr><td>New Registration</td><td>166</td><td>140</td><td>2018-03-21</td></tr><tr><td>Remaining Documents</td><td>2</td><td>2</td><td>2018-04-04</td></tr><tr><td>Renewals</td><td>217</td><td>0</td><td>2018-04-03</td></tr><tr><td>Transfers</td><td>296</td><td>245</td><td>2018-03-30</td></tr><tr><td>Writebacks</td><td>53</td><td>46</td><td>2018-04-09</td></tr></tbody></table>
I've tried various versions using CTE's and UNION's but I cannot get result set to generate correctly - the records that have no counts will not display or I will have duplicate records displayed.
Any suggestions on how to make this work without the GROUP BY ALL?
Below is one attempt where I tried a CTE with a UNION:
;WITH QueueTypes ( QueueTypeID, QueueName )
AS ( SELECT QueueTypeID ,
QueueName
FROM dbo.VehicleReg_Documents_QueueTypes )
SELECT qt.QueueName AS [Queue] ,
COUNT ( qt.QueueName ) AS [#ofUnprocessedEnvelopes] ,
COUNT ( CASE WHEN dq.AssignedToUserID = 0 THEN 1
ELSE NULL
END
) AS [#ofUnassignedEnvelopes] ,
CONVERT ( VARCHAR (8), MIN ( dq.DocumentDate ), 1 ) AS [OldestEnvelope]
FROM QueueTypes AS qt
LEFT OUTER JOIN dbo.VehicleReg_Documents_Queue AS dq ON dq.QueueID = qt.QueueTypeID
WHERE dq.IsProcessed = 0
AND dq.PageNumber = 1
GROUP BY qt.QueueName
UNION ALL
SELECT qt.QueueName AS [Queue] ,
COUNT ( qt.QueueName ) AS [#ofUnprocessedEnvelopes] ,
COUNT ( CASE WHEN dq.AssignedToUserID = 0 THEN 1
ELSE NULL
END
) AS [#ofUnassignedEnvelopes] ,
CONVERT ( VARCHAR (8), MIN ( dq.DocumentDate ), 1 ) AS [OldestEnvelope]
FROM QueueTypes AS qt
LEFT OUTER JOIN dbo.VehicleReg_Documents_Queue AS dq ON dq.QueueID = qt.QueueTypeID
GROUP BY qt.QueueName
But the results are not close to being correct:
Your current query doesn't work as it seems to work, because while you outer join table VehicleReg_Documents_Queue you dismiss all outer joined rows in the WHERE clause, so you are where you would have been with a mere inner join. You may want to consider either moving your criteria to the ON clause or make this an inner join right away.
It is also weird that you join queue type and queue not on the queue ID or the queue type ID, but on dq.QueueID = qt.QueueTypeID. That's like joining employees and addresses on employee number matching the house number. At least that's what it looks like.
(Then why does your queue type table have a queue name? Shouldn't the queue table contain the queue name instead? But this is not about your query, but about your data model.)
GROUP BY ALL means: "Please give us all QueueNames, even when the WHERE clause dismisses them. I see two possibilities for your query:
You do want an outer join actually. Then there is no WHERE clause and you can simply make this GROUP BY qt.QueueName.
You don't want an outer join. Then we want a row per QueueName in the table, which we might not get with simply changing GROUP BY ALL qt.QueueName to GROUP BY qt.QueueName.
In that second case we want all QueueNames first and outer join your query:
select
qn.QueueName AS [Queue],
q.[#ofUnassignedEnvelopes],
q.[OldestEnvelope]
FROM (select distinct QueueName from VehicleReg_Documents_QueueTypes) qn
LEFT JOIN
(
SELECT qt.QueueName,
COUNT ( qt.QueueName ) AS [#ofUnprocessedEnvelopes] ,
COUNT ( CASE WHEN dq.AssignedToUserID = 0 THEN 1
ELSE NULL
END
) AS [#ofUnassignedEnvelopes] ,
MIN ( dq.DocumentDate ) AS [OldestEnvelope]
FROM dbo.VehicleReg_Documents_QueueTypes AS [qt]
JOIN dbo.VehicleReg_Documents_Queue AS [dq] ON dq.QueueID = qt.QueueTypeID
WHERE dq.IsProcessed = 0
AND dq.PageNumber = 1
) q ON q.QueueName = qn.QueueName
GROUP BY ALL qn.QueueName
ORDER BY qn.QueueName ASC;
I think the best corollary here for a 'GROUP BY ALL' into something more ANSI compliant would be a CASE statement. Without knowing your data, it's hard to say for sure if this is 1:1, but I'm betting it's in the ballpark.
SELECT qt.QueueName AS [Queue]
,COUNT(CASE
WHEN dq.IsProcessed = 0
AND dq.PageNumber = 1
THEN qt.QueueName
END) AS [#ofUnprocessedEnvelopes]
,COUNT(CASE
WHEN dq.AssignedToUserID = 0
AND dq.IsProcessed = 0
AND dq.PageNumber = 1
THEN 1
ELSE NULL
END) AS [#ofUnassignedEnvelopes]
,MIN(CASE
WHEN dq.IsProcessed = 0
AND dq.PageNumber = 1
THEN dq.DocumentDate
END) AS [OldestEnvelope]
FROM dbo.VehicleReg_Documents_QueueTypes AS [qt]
LEFT OUTER JOIN dbo.VehicleReg_Documents_Queue AS [dq] ON dq.QueueID = qt.QueueTypeID
GROUP BY qt.QueueName
ORDER BY qt.QueueName ASC;
That's a bit uglier because every aggregate has to have the WHERE conditions inside a case statement, but at least you are future proof.

How to join three tables with distinct

I'm trying to join three tables to pull back a list of distinct blog posts with associated assets (images etc) but I keep coming up a cropper. The three tablets are tblBlog, tblAssetLink and tblAssets. The Blog tablet hold the blog, the asset table holds the assets and the Assetlink table links the two together.
tblBlog.BID is the PK in blog, tblAssets.AID is the PK in Assets.
This query works but pulls back multiple posts for the same record. I've tried to use select distinct and group by and even union but as my knowledge is pretty poor with SQL - they all error.
I'd like to also discount any assets that are marked as deleted (tblAssets.Deleted = true) but not hide the associated Blog post (if that's not marked as deleted). If anyone can help - it would be much appreciated! Thanks.
Here's my query so far....
SELECT dbo.tblBlog.BID,
dbo.tblBlog.DateAdded,
dbo.tblBlog.PMonthName,
dbo.tblBlog.PDay,
dbo.tblBlog.Header,
dbo.tblBlog.AddedBy,
dbo.tblBlog.PContent,
dbo.tblBlog.Category,
dbo.tblBlog.Deleted,
dbo.tblBlog.Intro,
dbo.tblBlog.Tags,
dbo.tblAssets.Name,
dbo.tblAssets.Description,
dbo.tblAssets.Location,
dbo.tblAssets.Deleted AS Expr1,
dbo.tblAssetLink.Priority
FROM dbo.tblBlog
LEFT OUTER JOIN dbo.tblAssetLink
ON dbo.tblBlog.BID = dbo.tblAssetLink.BID
LEFT OUTER JOIN dbo.tblAssets
ON dbo.tblAssetLink.AID = dbo.tblAssets.AID
WHERE ( dbo.tblBlog.Deleted = 'False' )
ORDER BY dbo.tblAssetLink.Priority, tblBlog.DateAdded DESC
EDIT
Changed the Where and the order by....
Expected output:
tblBlog.BID = 123
tblBlog.DateAdded = 12/04/2015
tblBlog.Header = This is a header
tblBlog.AddedBy = Persons name
tblBlog.PContent = *text*
tblBlog.Category = Category name
tblBlog.Deleted = False
tblBlog.Intro = *text*
tblBlog.Tags = Tag, Tag, Tag
tblAssets.Name = some.jpg
tblAssets.Description = Asset desc
tblAssets.Location = Location name
tblAssets.Priority = True
Use OUTER APPLY:
DECLARE #b TABLE ( BID INT )
DECLARE #a TABLE ( AID INT )
DECLARE #ba TABLE
(
BID INT ,
AID INT ,
Priority INT
)
INSERT INTO #b
VALUES ( 1 ),
( 2 )
INSERT INTO #a
VALUES ( 1 ),
( 2 ),
( 3 ),
( 4 )
INSERT INTO #ba
VALUES ( 1, 1, 1 ),
( 1, 2, 2 ),
( 2, 1, 1 ),
( 2, 2, 2 )
SELECT *
FROM #b b
OUTER APPLY ( SELECT TOP 1
a.*
FROM #ba ba
JOIN #a a ON a.AID = ba.AID
WHERE ba.BID = b.BID
ORDER BY Priority
) o
Output:
BID AID
1 1
2 1
Something like:
SELECT b.BID ,
b.DateAdded ,
b.PMonthName ,
b.PDay ,
b.Header ,
b.AddedBy ,
b.PContent ,
b.Category ,
b.Deleted ,
b.Intro ,
b.Tags ,
o.Name ,
o.Description ,
o.Location ,
o.Deleted AS Expr1 ,
o.Priority
FROM dbo.tblBlog b
OUTER APPLY ( SELECT TOP 1
a.* ,
al.Priority
FROM dbo.tblAssetLink al
JOIN dbo.tblAssets a ON al.AID = a.AID
WHERE b.BID = al.BID
ORDER BY al.Priority
) o
WHERE b.Deleted = 'False'
You cannot join three tables unless they all have the same attribute. It would work if all tables had BID, but the second join is trying to join AID. Which wont work. They all have to have BID.
Based on your comments
i would like to get is just one asset per blog post (top one ordered
by Priority)
You can change your query as following. I suggest changing the join with dbo.tblAssetLink to filtered one, which contains only one (highest priority) link for every blog.
SELECT dbo.tblBlog.BID,
dbo.tblBlog.DateAdded,
dbo.tblBlog.PMonthName,
dbo.tblBlog.PDay,
dbo.tblBlog.Header,
dbo.tblBlog.AddedBy,
dbo.tblBlog.PContent,
dbo.tblBlog.Category,
dbo.tblBlog.Deleted,
dbo.tblBlog.Intro,
dbo.tblBlog.Tags,
dbo.tblAssets.Name,
dbo.tblAssets.Description,
dbo.tblAssets.Location,
dbo.tblAssets.Deleted AS Expr1,
dbo.tblAssetLink.Priority
FROM dbo.tblBlog
LEFT OUTER JOIN
(SELECT BID, AID,
ROW_NUMBER() OVER (PARTITION BY BID ORDER BY [Priority] DESC) as N
FROM dbo.tblAssetLink) AS filteredAssetLink
ON dbo.tblBlog.BID = filteredAssetLink.BID
LEFT OUTER JOIN dbo.tblAssets
ON filteredAssetLink.AID = dbo.tblAssets.AID
WHERE dbo.tblBlog.Deleted = 'False' AND filteredAssetLink.N = 1
ORDER BY tblBlog.DateAdded DESC

How to improve sql script performance

The following script is very slow when its run.
I have no idea how to improve the performance of the script.
Even with a view takes more than quite a lot minutes.
Any idea please share to me.
SELECT DISTINCT
( id )
FROM ( SELECT DISTINCT
ct.id AS id
FROM [Customer].[dbo].[Contact] ct
LEFT JOIN [Customer].[dbo].[Customer_ids] hnci ON ct.id = hnci.contact_id
WHERE hnci.customer_id IN (
SELECT DISTINCT
( [Customer_ID] )
FROM [Transactions].[dbo].[Transaction_Header]
WHERE actual_transaction_date > '20120218' )
UNION
SELECT DISTINCT
contact_id AS id
FROM [Customer].[dbo].[Restaurant_Attendance]
WHERE ( created > '2012-02-18 00:00:00.000'
OR modified > '2012-02-18 00:00:00.000'
)
AND ( [Fifth_Floor_London] = 1
OR [Fourth_Floor_Leeds] = 1
OR [Second_Floor_Bristol] = 1
)
UNION
SELECT DISTINCT
( ct.id )
FROM [Customer].[dbo].[Contact] ct
INNER JOIN [Customer].[dbo].[Wifinity_Devices] wfd ON ct.wifinity_uniqueID = wfd.[CustomerUniqueID]
AND startconnection > '2012-02-17'
UNION
SELECT DISTINCT
comdt.id AS id
FROM [Customer].[dbo].[Complete_dataset] comdt
LEFT JOIN [Customer].[dbo].[Aggregate_Spend_Counts] agsc ON comdt.id = agsc.contact_id
WHERE agsc.contact_id IS NULL
AND ( opt_out_Mail <> 1
OR opt_out_email <> 1
OR opt_out_SMS <> 1
OR opt_out_Mail IS NULL
OR opt_out_email IS NULL
OR opt_out_SMS IS NULL
)
AND ( address_1 IS NOT NULL
OR email IS NOT NULL
OR mobile IS NOT NULL
)
UNION
SELECT DISTINCT
( contact_id ) AS id
FROM [Customer].[dbo].[VIP_Card_Holders]
WHERE VIP_Card_number IS NOT NULL
) AS tbl
Wow, where to start...
--this distinct does nothing. Union is already distinct
--SELECT DISTINCT
-- ( id )
--FROM (
SELECT DISTINCT [Customer_ID] as ID
FROM [Transactions].[dbo].[Transaction_Header]
where actual_transaction_date > '20120218' )
UNION
SELECT
contact_id AS id
FROM [Customer].[dbo].[Restaurant_Attendance]
-- not sure that you are getting the date range you want. Should these be >=
-- if you want everything that occurred on the 18th or after you want >= '2012-02-18 00:00:00.000'
-- if you want everything that occurred on the 19th or after you want >= '2012-02-19 00:00:00.000'
-- the way you have it now, you will get everything on the 18th unless it happened exactly at midnight
WHERE ( created > '2012-02-18 00:00:00.000'
OR modified > '2012-02-18 00:00:00.000'
)
AND ( [Fifth_Floor_London] = 1
OR [Fourth_Floor_Leeds] = 1
OR [Second_Floor_Bristol] = 1
)
-- all of this does nothing because we already have every id in the contact table from the first query
-- UNION
-- SELECT
-- ( ct.id )
-- FROM [Customer].[dbo].[Contact] ct
-- INNER JOIN [Customer].[dbo].[Wifinity_Devices] wfd ON ct.wifinity_uniqueID = wfd.[CustomerUniqueID]
-- AND startconnection > '2012-02-17'
UNION
-- cleaned this up with isnull function and coalesce
SELECT
comdt.id AS id
FROM [Customer].[dbo].[Complete_dataset] comdt
LEFT JOIN [Customer].[dbo].[Aggregate_Spend_Counts] agsc ON comdt.id = agsc.contact_id
WHERE agsc.contact_id IS NULL
AND ( isnull(opt_out_Mail,0) <> 1
OR isnull(opt_out_email,0) <> 1
OR isnull(opt_out_SMS,0) <> 1
)
AND coalesce(address_1 , email, mobile) IS NOT NULL
UNION
SELECT
( contact_id ) AS id
FROM [Customer].[dbo].[VIP_Card_Holders]
WHERE VIP_Card_number IS NOT NULL
-- ) AS tbl
Where exists is generally faster than in as well.
Or conditions are generally slower as well, use more union statements instead.
And learn to use left joins correctly. If you have a where condition (other than where id is null) on the table on teh right side of a left join, it will convert to an inner join. If this is not what you want, then your code is currently giving you an incorrect result set.
See http://wiki.lessthandot.com/index.php/WHERE_conditions_on_a_LEFT_JOIN for an explanation of how to fix.
As stated in a comment optimize one at a time. See which one takes the longest and focus on that one.
union will remove duplicates so you don't need the distinct on the individual queries
On you first I would try this:
The left join is killed by the WHERE hnci.customer_id IN so you might as well have a join.
The sub-query is not efficient as cannot use an index on the IN.
The query optimizer does not know what in ( select .. ) will return so it cannot optimize use of indexes.
SELECT ct.id AS id
FROM [Customer].[dbo].[Contact] ct
JOIN [Customer].[dbo].[Customer_ids] hnci
ON ct.id = hnci.contact_id
JOIN [Transactions].[dbo].[Transaction_Header] th
on hnci.customer_id = th.[Customer_ID]
and th.actual_transaction_date > '20120218'
On that second join the query optimizer has the opportunity of which condition to apply first. Let say [Customer].[dbo].[Customer_ids].[customer_id] and [Transactions].[dbo].[Transaction_Header] each have indexes. The query optimizer has the option to apply that before [Transactions].[dbo].[Transaction_Header].[actual_transaction_date].
If [actual_transaction_date] is not indexed then for sure it would do the other ID join first.
With your in ( select ... ) the query optimizer has no option but to apply the actual_transaction_date > '20120218' first. OK some times query optimizer is smart enough to use an index inside the in outside the in but why make it hard for the query optimizer. I have found the query optimizer make better decisions if you make the decisions easier.
A join on a sub-query has the same problem. You take options away from the query optimizer. Give the query optimizer room to breathe.
try this, temptable should help you:
IF OBJECT_ID('Tempdb..#Temp1') IS NOT NULL
DROP TABLE #Temp1
--Low perfomance because of using "WHERE hnci.customer_id IN ( .... ) " - loop join must be
--and this "where" condition will apply to two tables after left join,
--so result will be same as with two inner joints but with bad perfomance
--SELECT DISTINCT
-- ct.id AS id
--INTO #temp1
--FROM [Customer].[dbo].[Contact] ct
-- LEFT JOIN [Customer].[dbo].[Customer_ids] hnci ON ct.id = hnci.contact_id
--WHERE hnci.customer_id IN (
-- SELECT DISTINCT
-- ( [Customer_ID] )
-- FROM [Transactions].[dbo].[Transaction_Header]
-- WHERE actual_transaction_date > '20120218' )
--------------------------------------------------------------------------------
--this will give the same result but with better perfomance then previouse one
--------------------------------------------------------------------------------
SELECT DISTINCT
ct.id AS id
INTO #temp1
FROM [Customer].[dbo].[Contact] ct
JOIN [Customer].[dbo].[Customer_ids] hnci ON ct.id = hnci.contact_id
JOIN ( SELECT DISTINCT
( [Customer_ID] )
FROM [Transactions].[dbo].[Transaction_Header]
WHERE actual_transaction_date > '20120218'
) T ON hnci.customer_id = T.[Customer_ID]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
INSERT INTO #temp1
( id
)
SELECT DISTINCT
contact_id AS id
FROM [Customer].[dbo].[Restaurant_Attendance]
WHERE ( created > '2012-02-18 00:00:00.000'
OR modified > '2012-02-18 00:00:00.000'
)
AND ( [Fifth_Floor_London] = 1
OR [Fourth_Floor_Leeds] = 1
OR [Second_Floor_Bristol] = 1
)
INSERT INTO #temp1
( id
)
SELECT DISTINCT
( ct.id )
FROM [Customer].[dbo].[Contact] ct
INNER JOIN [Customer].[dbo].[Wifinity_Devices] wfd ON ct.wifinity_uniqueID = wfd.[CustomerUniqueID]
AND startconnection > '2012-02-17'
INSERT INTO #temp1
( id
)
SELECT DISTINCT
comdt.id AS id
FROM [Customer].[dbo].[Complete_dataset] comdt
LEFT JOIN [Customer].[dbo].[Aggregate_Spend_Counts] agsc ON comdt.id = agsc.contact_id
WHERE agsc.contact_id IS NULL
AND ( opt_out_Mail <> 1
OR opt_out_email <> 1
OR opt_out_SMS <> 1
OR opt_out_Mail IS NULL
OR opt_out_email IS NULL
OR opt_out_SMS IS NULL
)
AND ( address_1 IS NOT NULL
OR email IS NOT NULL
OR mobile IS NOT NULL
)
INSERT INTO #temp1
( id
)
SELECT DISTINCT
( contact_id ) AS id
FROM [Customer].[dbo].[VIP_Card_Holders]
WHERE VIP_Card_number IS NOT NULL
SELECT DISTINCT
id
FROM #temp1 AS T

Is there a way to make this query more efficient performance wise?

This query takes a long time to run on MS Sql 2008 DB with 70GB of data.
If i run the 2 where clauses seperately it takes a lot less time.
EDIT - I need to change the 'select *' to 'delete' afterwards, please keep it in mind when answering. thanks :)
select *
From computers
Where Name in
(
select T2.Name
from
(
select Name
from computers
group by Name
having COUNT(*) > 1
) T3
join computers T2 on T3.Name = T2.Name
left join policyassociations PA on T2.PK = PA.EntityId
where (T2.EncryptionStatus = 0 or T2.EncryptionStatus is NULL) and
(PA.EntityType <> 1 or PA.EntityType is NULL)
)
OR
ClientId in
(
select substring(ClientID,11,100)
from computers
)
Swapping IN for EXISTS will help.
Also, as per Gordon's answer: UNION can out-perform OR.
SELECT computers.*
FROM computers
LEFT
JOIN policyassociations
ON policyassociations.entityid = computers.pk
WHERE (
computers.encryptionstatus = 0
OR computers.encryptionstatus IS NULL
)
AND (
policyassociations.entitytype <> 1
OR policyassociations.entitytype IS NULL
)
AND EXISTS (
SELECT name
FROM (
SELECT name
FROM computers
GROUP
BY name
HAVING Count(*) > 1
) As duplicate_computers
WHERE name = computers.name
)
UNION
SELECT *
FROM computers As c
WHERE EXISTS (
SELECT SubString(clientid, 11, 100)
FROM computers
WHERE SubString(clientid, 11, 100) = c.clientid
)
You've now updated your question asking to make this a delete.
Well the good news is that instead of the "OR" you just make two DELETE statements:
DELETE
FROM computers
LEFT
JOIN policyassociations
ON policyassociations.entityid = computers.pk
WHERE (
computers.encryptionstatus = 0
OR computers.encryptionstatus IS NULL
)
AND (
policyassociations.entitytype <> 1
OR policyassociations.entitytype IS NULL
)
AND EXISTS (
SELECT name
FROM (
SELECT name
FROM computers
GROUP
BY name
HAVING Count(*) > 1
) As duplicate_computers
WHERE name = computers.name
)
;
DELETE
FROM computers As c
WHERE EXISTS (
SELECT SubString(clientid, 11, 100)
FROM computers
WHERE SubString(clientid, 11, 100) = c.clientid
)
;
Some things I would look at are
1. are indexes in place?
2. 'IN' will slow your query, try replacing it with joins,
3. you should use column name, I guess 'Name' in this case, while using count(*),
4. try selecting required data only, by selecting particular columns.
Hope this helps!
or can be poorly optimized sometimes. In this case, you can just split the query into two subqueries, and combine them using union:
select *
From computers
Where Name in
(
select T2.Name
from
(
select Name
from computers
group by Name
having COUNT(*) > 1
) T3
join computers T2 on T3.Name = T2.Name
left join policyassociations PA on T2.PK = PA.EntityId
where (T2.EncryptionStatus = 0 or T2.EncryptionStatus is NULL) and
(PA.EntityType <> 1 or PA.EntityType is NULL)
)
UNION
select *
From computers
WHERE ClientId in
(
select substring(ClientID,11,100)
from computers
);
You might also be able to improve performance by replacing the subqueries with explicit joins. However, this seems like the shortest route to better performance.
EDIT:
I think the version with join's is:
select c.*
From computers c left outer join
(select c.Name
from (select c.*, count(*) over (partition by Name) as cnt
from computers c
) c left join
policyassociations PA
on T2.PK = PA.EntityId and PA.EntityType <> 1
where (c.EncryptionStatus = 0 or c.EncryptionStatus is NULL) and
c.cnt > 1
) cpa
on c.Name = cpa.Name left outer join
(select substring(ClientID, 11, 100) as name
from computers
) csub
on c.Name = csub.name
Where cpa.Name is not null or csub.Name is not null;