Performance issue with LEFT OUTER JOIN in SQL Server

In my project I need to find the tasks that differ between the old and the new revision stored in the same table.
id | task | latest_Rev
1  | A    | N
1  | B    | N
2  | C    | Y
2  | A    | Y
2  | B    | Y
Expected result:
id | task | latest_Rev
2  | C    | Y
So I tried the following query:
Select nw.*
from Rev_tmp nw with (nolock)
left outer
join rev_tmp old with (nolock)
on nw.id -1 = old.id
and nw.task = old.task
and nw.latest_rev = 'y'
where old.task is null
When my table has more than 20k records, this query takes a long time.
How can I reduce the time?
My company doesn't allow the use of subqueries.

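Use the LAG function to remove the self join:
SELECT *
FROM (SELECT *,
CASE WHEN latest_Rev = 'y' THEN Lag(latest_Rev) OVER(partition BY task ORDER BY id) ELSE NULL END AS prev_rev
FROM Rev_tmp) a
WHERE latest_Rev = 'y' -- without this filter the old-revision rows (prev_rev is NULL for them too) are returned as well
AND prev_rev IS NULL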

My answer assumes:
You can't change the indexes
You can't use subqueries
All fields are indexed separately
If you look at the query, the only condition that really reduces the result set is latest_rev = 'Y'. If you were to eliminate that condition, you'd definitely get a table scan, so we want that condition to be evaluated using an index. Unfortunately, a field that only holds 'Y' and 'N' is likely to be ignored by the optimizer because it has terrible selectivity. You might get better performance if you coax SQL Server into using it anyway. If the index on latest_rev is called idx_latest_rev, then try this:
Set transaction isolation level read uncommitted
Select nw.*
from Rev_tmp nw with (index(idx_latest_rev))
left outer
join rev_tmp old
on nw.id -1 = old.id
and nw.task = old.task
where old.task is null
and nw.latest_rev = 'y'
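To see why the latest_rev test belongs in the WHERE clause rather than in the ON clause, here is a minimal sketch that replays both join shapes against the sample rows from the question. It runs on SQLite through Python's sqlite3 module as a stand-in for SQL Server, so the index hint and the isolation level are left out, and the literal is written as 'Y' because SQLite compares strings case-sensitively by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rev_tmp (id INTEGER, task TEXT, latest_Rev TEXT)")
conn.executemany(
    "INSERT INTO Rev_tmp VALUES (?, ?, ?)",
    [(1, "A", "N"), (1, "B", "N"), (2, "C", "Y"), (2, "A", "Y"), (2, "B", "Y")],
)

# Filter on the left-hand table inside ON: a LEFT JOIN keeps every nw row,
# so the old-revision rows come back too (they simply never match).
filter_in_on = conn.execute("""
    SELECT nw.id, nw.task, nw.latest_Rev
    FROM Rev_tmp nw
    LEFT OUTER JOIN Rev_tmp old
      ON nw.id - 1 = old.id
     AND nw.task = old.task
     AND nw.latest_Rev = 'Y'
    WHERE old.task IS NULL
    ORDER BY nw.id, nw.task
""").fetchall()

# Same join with the filter moved to WHERE: only latest-revision rows survive.
filter_in_where = conn.execute("""
    SELECT nw.id, nw.task, nw.latest_Rev
    FROM Rev_tmp nw
    LEFT OUTER JOIN Rev_tmp old
      ON nw.id - 1 = old.id
     AND nw.task = old.task
    WHERE old.task IS NULL
      AND nw.latest_Rev = 'Y'
    ORDER BY nw.id, nw.task
""").fetchall()

print(filter_in_on)     # three rows, including both revision-1 rows
print(filter_in_where)  # only the row for task C
```

On the sample data, only the second form reproduces the expected result from the question.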

latest_Rev should be a bit type (the boolean equivalent), which is better for performance.
You could also add an index on the id, task, and latest_Rev columns.
You can try this query (it replaces the left outer join with NOT EXISTS):
Select *
from Rev_tmp nw
where nw.latest_rev = 'y' and not exists
(
select * from rev_tmp old
where nw.id -1 = old.id and nw.task = old.task
)
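As a rough check of the NOT EXISTS form together with the suggested composite index, the sketch below runs it on the question's sample rows. It uses SQLite through Python's sqlite3 module rather than SQL Server, the index name ix_rev_tmp_id_task is made up for illustration, and 'Y' is used because SQLite string comparison is case-sensitive by default. It says nothing about how SQL Server would actually plan the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rev_tmp (id INTEGER, task TEXT, latest_Rev TEXT)")
# Composite index on the columns used by the anti-join (hypothetical name).
conn.execute("CREATE INDEX ix_rev_tmp_id_task ON Rev_tmp (id, task, latest_Rev)")
conn.executemany(
    "INSERT INTO Rev_tmp VALUES (?, ?, ?)",
    [(1, "A", "N"), (1, "B", "N"), (2, "C", "Y"), (2, "A", "Y"), (2, "B", "Y")],
)

# Keep latest-revision rows that have no row with the same task
# in the previous revision (id - 1).
new_tasks = conn.execute("""
    SELECT nw.id, nw.task, nw.latest_Rev
    FROM Rev_tmp nw
    WHERE nw.latest_Rev = 'Y'
      AND NOT EXISTS (
          SELECT 1
          FROM Rev_tmp old
          WHERE old.id = nw.id - 1
            AND old.task = nw.task
      )
""").fetchall()

print(new_tasks)
```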

Related

Best way to compare two sets of data w/ SQL

What I have is a query that grabs a set of data. This query is run at a certain time. Then, 30 minutes later, I have another query (same syntax) that runs and grabs that same set of data. Finally, I have a third query (the query in question) that compares both sets of data. The records it pulls out are the ones that satisfy either condition: "FEDVIP_Active" was FALSE in the first data set and TRUE in the second data set, OR "UniqueID" didn't exist in the first data set, does exist in the second data set, and FEDVIP_Active is TRUE. I'm questioning the performance of the query below that does the comparison. It times out after 30 minutes. Is there anything you can see that I should be doing differently to make it run more efficiently? The two nearly identical data sets I'm comparing have around a million records each.
First query that grabs the initial set of data:
select Unique_ID, First_Name, FEDVIP_Active, Email_Primary
from Master_Subscribers_Prospects
Second query is exactly the same as the first.
Then, the third query below compares the data:
select
a.FEDVIP_Active,
a.Unique_ID,
a.First_Name,
a.Email_Primary
from
Master_Subscribers_Prospects_1 a
inner join
Master_Subscribers_Prospects_2 b
on 1 = 1
where a.FEDVIP_Active = 1 and b.FEDVIP_Active = 0 or
(b.Unique_ID not in (select Unique_ID from Master_Subscribers_Prospects_1) and b.FEDVIP_Active = 1)
If I understand correctly, you want all records from the second data set where the corresponding unique id in the first data set is not active (either by not existing or by having the flag set to not active).
I would suggest exists:
select a.*
from Master_Subscribers_Prospects_1 a
where a.FEDVIP_Active = 1 and
not exists (select 1
from Master_Subscribers_Prospects_2 b
where b.Unique_ID = a.Unique_ID and
b.FEDVIP_Active = 1
);
For performance, you want an index on Master_Subscribers_Prospects_2(Unique_ID, FEDVIP_Active).
An inner join on 1 = 1 is a disguised cross join and the number of rows a cross join produces can grow rapidly. It's the product of the number of rows in both relations involved. For performance you want to keep intermediate results as small as possible.
Also, EXISTS often performs better than IN when the subquery returns a large number of rows.
But I think you don't need IN or EXISTS at all.
Assuming unique_id identifies a record and is not null, you could left join the second table to the first one on their common unique_id values. If and only if no record exists in the first table for a given unique_id of the second table, the first table's unique_id in the join result is null, so you can check for that.
SELECT b.fedvip_active,
b.unique_id,
b.first_name,
b.email_primary
FROM master_subscribers_prospects_2 b
LEFT JOIN master_subscribers_prospects_1 a
ON b.unique_id = a.unique_id
WHERE (a.fedvip_active = 1
AND b.fedvip_active = 0)
OR (a.unique_id IS NULL
AND b.fedvip_active = 1);
For that query indexes on master_subscribers_prospects_1 (unique_id, fedvip_active) and master_subscribers_prospects_2 (unique_id, fedvip_active) might also help to speed things up.
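To make the row-count argument concrete, here is a small sketch comparing the join on 1 = 1 with the left join on unique_id. It uses SQLite through Python's sqlite3 module instead of SQL Server, and the table names snap_1 and snap_2 and their rows are made-up stand-ins for the two snapshots.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE snap_1 (unique_id INTEGER PRIMARY KEY, fedvip_active INTEGER);
    CREATE TABLE snap_2 (unique_id INTEGER PRIMARY KEY, fedvip_active INTEGER);
    INSERT INTO snap_1 VALUES (1, 0), (2, 1), (3, 1);
    INSERT INTO snap_2 VALUES (1, 1), (2, 1), (3, 0), (4, 1);
""")

# Joining on 1 = 1 pairs every row with every row: 3 * 4 intermediate rows.
cross_rows = conn.execute(
    "SELECT COUNT(*) FROM snap_1 a INNER JOIN snap_2 b ON 1 = 1"
).fetchone()[0]

# Joining on the key produces one row per snap_2 row.
left_rows = conn.execute("""
    SELECT COUNT(*)
    FROM snap_2 b
    LEFT JOIN snap_1 a ON b.unique_id = a.unique_id
""").fetchone()[0]

# The filter from the left-join answer, with explicit parentheses.
picked = [row[0] for row in conn.execute("""
    SELECT b.unique_id
    FROM snap_2 b
    LEFT JOIN snap_1 a ON b.unique_id = a.unique_id
    WHERE (a.fedvip_active = 1 AND b.fedvip_active = 0)
       OR (a.unique_id IS NULL AND b.fedvip_active = 1)
    ORDER BY b.unique_id
""")]

print(cross_rows, left_rows, picked)
```

With a million rows on each side, the same product would be on the order of 10^12 intermediate rows, which is consistent with the timeout described in the question.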
Doing an inner select in the WHERE clause is usually bad for performance.
Here is an equivalent version with a left join that might work for you:
select
a.FEDVIP_Active,
a.Unique_ID,
a.First_Name,
a.Email_Primary
from
Master_Subscribers_Prospects_1 a
inner join
Master_Subscribers_Prospects_2 b on 1 = 1
left join Master_Subscribers_Prospects_1 sa on sa.Unique_ID = b.Unique_ID
where (a.FEDVIP_Active = 1 and b.FEDVIP_Active = 0) or
(sa.Unique_ID is null and b.FEDVIP_Active = 1)

Unknown processing time of SQL statement

I have a query like this
SELECT
a.LeaseNo, RIGHT(a.LeaseNo, 1) AS idx
FROM
Leases a
LEFT OUTER JOIN
DebitNoteItems b ON a.leaseno = b.LeaseNo
WHERE
status = 'A'
AND PortfolioType = 'R'
AND b.NoteItemID IS NULL
Result set is found in less than 1 second.
Then I tried to find the records with idx = 0, so I wrote:
SELECT LeaseNo
FROM
(SELECT
a.LeaseNo, RIGHT(a.LeaseNo, 1) AS idx
FROM
Leases a
LEFT OUTER JOIN
DebitNoteItems b ON a.leaseno = b.LeaseNo
WHERE
status = 'A'
AND PortfolioType = 'R'
AND b.NoteItemID IS NULL) tmp
WHERE
tmp.idx = '0'
However, this query is very slow.
Then I tried
SELECT LeaseNo
FROM
(SELECT
a.LeaseNo, RIGHT(a.LeaseNo, 1) AS idx
FROM
Leases a
LEFT OUTER JOIN
DebitNoteItems b ON a.leaseno = b.LeaseNo
WHERE
status = 'A'
AND PortfolioType = 'R'
AND b.NoteItemID IS NULL) tmp
WHERE
tmp.idx LIKE '%0%'
This one is executed in less than 1 second.
I want to know why the second one is so much faster than the first one when only one simple condition differs (= '0' versus LIKE '%0%'). I'm not talking about a difference of a few seconds: the second query (using LIKE) takes less than 1 second, while the first (using =) takes more than a minute. In fact it was still running when I terminated it manually, and it doesn't look like it is just comparing idx in the already-queried tmp result.
Is there something wrong or inappropriate in the query?

Why is Selecting From Table Variable Far Slower than List of Integers

I have a pretty big MSSQL stored procedure that I need to conditionally check for certain IDs:
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where b.SomeID in (1,2,3,4,5)
I wanted to conditionally check the SomeID field, so I did the following:
if @enteredText = 'This'
INSERT INTO @AwesomeIDs
VALUES(1),(2),(3)
if @enteredText = 'That'
INSERT INTO @AwesomeIDs
VALUES(4),(5)
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where b.SomeID in (Select ID from @AwesomeIDs)
Nothing else has changed, yet I can't even get the latter query to grab 5 records. The top query returns 5000 records in less than 3 seconds. Why is selecting from a table variable so drastically slower?
Two other possible options you can consider
Option 1
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where
( b.SomeID IN (1,2,3) AND @enteredText = 'This')
OR
( b.SomeID IN (4,5) AND @enteredText = 'That')
Option 2
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where EXISTS (Select 1
from @AwesomeIDs
WHERE b.SomeID = ID)
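Mind you, for table variables SQL Server normally estimates that there is only ONE row in the table when it compiles the plan (the fixed 100-row estimate introduced in SQL Server 2014 applies to multi-statement table-valued functions, not to table variables), and this can affect the estimated and actual plans. But 1 row against 3 is not really a deal breaker.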
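As a sanity check that the two options above select the same IDs, here is a minimal sketch on SQLite through Python's sqlite3 module. SQLite has no table variables, so a temp table named AwesomeIDs stands in for one, BigTable is reduced to a single made-up SomeID column, and the join to LotsOfTables is omitted. It compares results only and says nothing about SQL Server's plans or timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BigTable (SomeID INTEGER)")
conn.executemany("INSERT INTO BigTable VALUES (?)", [(i,) for i in range(1, 7)])
# Temp table used in place of the table variable from the question.
conn.execute("CREATE TEMP TABLE AwesomeIDs (ID INTEGER)")


def ids_option_1(entered_text):
    """Option 1: fold the condition on the entered text into the WHERE clause."""
    rows = conn.execute("""
        SELECT b.SomeID
        FROM BigTable b
        WHERE (b.SomeID IN (1, 2, 3) AND :t = 'This')
           OR (b.SomeID IN (4, 5) AND :t = 'That')
        ORDER BY b.SomeID
    """, {"t": entered_text}).fetchall()
    return [r[0] for r in rows]


def ids_option_2(entered_text):
    """Option 2: fill the ID table conditionally, then filter with EXISTS."""
    conn.execute("DELETE FROM AwesomeIDs")
    if entered_text == "This":
        conn.executemany("INSERT INTO AwesomeIDs VALUES (?)", [(1,), (2,), (3,)])
    if entered_text == "That":
        conn.executemany("INSERT INTO AwesomeIDs VALUES (?)", [(4,), (5,)])
    rows = conn.execute("""
        SELECT b.SomeID
        FROM BigTable b
        WHERE EXISTS (SELECT 1 FROM AwesomeIDs a WHERE a.ID = b.SomeID)
        ORDER BY b.SomeID
    """).fetchall()
    return [r[0] for r in rows]


for text in ("This", "That", "Other"):
    print(text, ids_option_1(text), ids_option_2(text))
```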

COUNT (DISTINCT column_name) Discrepancy vs. COUNT (column_name) in SQL Server 2008?

I'm running into a problem that's driving me nuts.
When running the query below, I get a count of 233,769
SELECT COUNT(distinct Member_List_Link.UserID)
FROM Member_List_Link with (nolock)
INNER JOIN MasterMembers with (nolock)
ON Member_List_Link.UserID = MasterMembers.UserID
WHERE MasterMembers.Active = 1 And
Member_List_Link.GroupID = 5 AND
MasterMembers.ValidUsers = 1 AND
Member_List_Link.Status = 1
But if I run the same query without the distinct keyword, I get a count of 233,748
SELECT COUNT(Member_List_Link.UserID)
FROM Member_List_Link with (nolock)
INNER JOIN MasterMembers with (nolock)
ON Member_List_Link.UserID = MasterMembers.UserID
WHERE MasterMembers.Active = 1 And Member_List_Link.GroupID = 5
AND MasterMembers.ValidUsers = 1 AND Member_List_Link.Status = 1
To test, I recreated all the tables as temp tables and ran the queries again:
SELECT COUNT(distinct #Temp_Member_List_Link.UserID)
FROM #Temp_Member_List_Link with (nolock)
INNER JOIN #Temp_MasterMembers with (nolock)
ON #Temp_Member_List_Link.UserID = #Temp_MasterMembers.UserID
WHERE #Temp_MasterMembers.Active = 1 And
#Temp_Member_List_Link.GroupID = 5 AND
#Temp_MasterMembers.ValidUsers = 1 AND
#Temp_Member_List_Link.Status = 1
And without the distinct keyword
SELECT COUNT(#Temp_Member_List_Link.UserID)
FROM #Temp_Member_List_Link with (nolock)
INNER JOIN #Temp_MasterMembers with (nolock)
ON #Temp_Member_List_Link.UserID = #Temp_MasterMembers.UserID
WHERE #Temp_MasterMembers.Active = 1 And
#Temp_Member_List_Link.GroupID = 5 AND
#Temp_MasterMembers.ValidUsers = 1 AND
#Temp_Member_List_Link.Status = 1
On a side note, I recreated the temp tables by simply running (select * into #temp... from Member_List_Link)
And now when I check to see the difference between COUNT(column) vs. COUNT(distinct column) with these temp tables, I don't see any!
So why is there a discrepancy with the original tables?
I'm running SQL Server 2008 (Dev Edition).
UPDATE - Including statistics profile
PhysicalOp column only for the first query (without distinct)
NULL
Compute Scalar
Stream Aggregate
Clustered Index Seek
PhysicalOp column only for the second query (with distinct)
NULL
Compute Scalar
Stream Aggregate
Parallelism
Stream Aggregate
Hash Match
Hash Match
Bitmap
Parallelism
Index Seek
Parallelism
Clustered Index Scan
Rows and Executes for the 1st query (without distinct)
1 1
0 0
1 1
1 1
Rows and Executes for the 2nd query (with distinct)
Rows Executes
1 1
0 0
1 1
16 1
16 16
233767 16
233767 16
281901 16
281901 16
281901 16
234787 16
234787 16
Adding OPTION(MAXDOP 1) to the 2nd query (with distinct)
Rows Executes
1 1
0 0
1 1
233767 1
233767 1
281901 1
548396 1
And the resulting PhysicalOp
NULL
Compute Scalar
Stream Aggregate
Hash Match
Hash Match
Index Seek
Clustered Index Scan
FROM http://msdn.microsoft.com/en-us/library/ms187373.aspx
NOLOCK Is equivalent to READUNCOMMITTED. For more information, see READUNCOMMITTED later in this topic.
READUNCOMMITTED can read rows twice if they are the subject of a transaction, since both the roll-forward and roll-back rows exist within the database while the transaction is in progress.
By default all queries run as read committed, which excludes uncommitted rows.
When you insert into a temp table, the select gives you only committed rows. I believe this covers all the symptoms you are trying to explain.
I think I have the answer to your question, but tell me first: is UserID a primary key in your original table?
If yes, then the CTAS-style query used to create the temp table would not copy the primary key of the original table; it only copies NOT NULL constraints that are not part of the primary key.
So what happened is this: your original table had a primary key, so count(distinct column_name) doesn't include tuples with null values, and when you created the temp tables the primary key didn't get copied, and hence its NOT NULL constraint didn't carry over to the temp table.
Is that clear?
It's hard to reproduce this behaviour, so I'm punching in the dark here:
The WITH (NOLOCK) statement enables reading of uncommitted data. I'm guessing you've added that to not lock anything for your users? If you remove those and issue a
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
Prior to executing the query, you should get more reliable results. But then, the tables may receive locks while executing the query.
If that doesn't work, my guess is that DISTINCT uses an index to optimize. Check the query plan and rebuild indexes as necessary; that could be the source of your problem.
What result do you get with
SELECT count(*) FROM (
SELECT distinct Member_List_Link.UserID
FROM Member_List_Link with (nolock)
INNER JOIN MasterMembers with (nolock)
ON Member_List_Link.UserID = MasterMembers.UserID
WHERE MasterMembers.Active = 1 And
Member_List_Link.GroupID = 5 AND
MasterMembers.ValidUsers = 1 AND
Member_List_Link.Status = 1
) as m
and with this (the same query without the NOLOCK hints):
SELECT count(*) FROM (
SELECT distinct Member_List_Link.UserID
FROM Member_List_Link
INNER JOIN MasterMembers
ON Member_List_Link.UserID = MasterMembers.UserID
WHERE MasterMembers.Active = 1 And
Member_List_Link.GroupID = 5 AND
MasterMembers.ValidUsers = 1 AND
Member_List_Link.Status = 1
) as m
Ray, please try the following
SELECT COUNT(*)
FROM
(
SELECT Member_List_Link.UserID, ROW_NUMBER() OVER (PARTITION BY Member_List_Link.UserID ORDER BY (SELECT NULL)) N
FROM Member_List_Link with (nolock)
INNER JOIN MasterMembers with (nolock)
ON Member_List_Link.UserID = MasterMembers.UserID
WHERE MasterMembers.Active = 1 And
Member_List_Link.GroupID = 5 AND
MasterMembers.ValidUsers = 1 AND
Member_List_Link.Status = 1
) A
WHERE N = 1
When you use COUNT with DISTINCT on a column, it doesn't count rows where that column is NULL.
create table #tmp(name char(4) null)
insert into #tmp values(null)
insert into #tmp values(null)
insert into #tmp values('AAA')
Query:-
1> select count(*) from #tmp
2> go
3
1> select count(distinct name) from #tmp
2> go
1
1> select distinct name from #tmp
2> go
name
NULL
AAA
but it works in derived table
1> select count(*) from ( select distinct name from #tmp) a
2> go
2
Note:- I tested it in Sybase
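The same NULL handling can be reproduced outside Sybase. The sketch below repeats the experiment on SQLite through Python's sqlite3 module (an assumption made for convenience, since it is neither Sybase nor SQL Server) and adds plain COUNT(name), which also skips NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp (name TEXT)")
conn.executemany("INSERT INTO tmp VALUES (?)", [(None,), (None,), ("AAA",)])


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


count_star = scalar("SELECT COUNT(*) FROM tmp")                  # counts rows
count_col = scalar("SELECT COUNT(name) FROM tmp")                # skips NULLs
count_distinct = scalar("SELECT COUNT(DISTINCT name) FROM tmp")  # skips NULLs
# DISTINCT in a derived table keeps one NULL row, and COUNT(*) counts it.
count_derived = scalar("SELECT COUNT(*) FROM (SELECT DISTINCT name FROM tmp) a")

print(count_star, count_col, count_distinct, count_derived)  # 3 1 1 2
```

Note that this only explains a DISTINCT count that is lower than the plain count; in the question the DISTINCT count is the higher one, which NULL handling alone cannot produce.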

Doing an Update Ignore in SQL Server 2005

I have a table where I wish to update some of the rows. All the fields are not null. I'm doing a sub-query, and I wish to update the table with the non-Null results.
See Below for my final answer:
In MySQL, I solve this problem by doing an UPDATE IGNORE. How do I make this work in SQL Server 2005? The sub-query uses a four-table Join to find the data to insert if it exists. The Update is being run against a table that could have 90,000+ records, so I need a solution that uses SQL, rather than having the Java program that's querying the database retrieve the results and then update those fields where we've got non-Null values.
Update: My query:
UPDATE #SearchResults SET geneSymbol = (
SELECT TOP 1 symbol.name FROM
GeneSymbol AS symbol JOIN GeneConnector AS geneJoin
ON symbol.id = geneJoin.geneSymbolID
JOIN Result AS sSeq ON geneJoin.sSeqID = sSeq.id
JOIN IndelConnector AS joiner ON joiner.sSeqID = sSeq.id
WHERE joiner.indelID = #SearchResults.id ORDER BY symbol.id ASC)
WHERE isSNV = 0
If I add "AND symbol.name IS NOT NULL" to either WHERE I get a SQL error. If I run it as is I get "adding null to a non-null column" errors. :-(
Thank you all, I ended up finding this:
UPDATE #SearchResults SET geneSymbol =
ISNULL ((SELECT TOP 1 symbol.name FROM
GeneSymbol AS symbol JOIN GeneConnector AS geneJoin
ON symbol.id = geneJoin.geneSymbolID
JOIN Result AS sSeq ON geneJoin.sSeqID = sSeq.id
JOIN IndelConnector AS joiner ON joiner.sSeqID = sSeq.id
WHERE joiner.indelID = #SearchResults.id ORDER BY symbol.id ASC), ' ')
WHERE isSNV = 0
While it would be better not to do anything in the null case (so I'm going to try to understand the other answers, and see if they're faster) setting the null cases to a blank answer also works, and that's what this does.
Note: Wrapping the ISNULL (...) with () leads to really obscure (and wrong) errors.
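To illustrate the difference between filling in a blank and leaving the row alone, here is a minimal sketch on SQLite through Python's sqlite3 module. It is not the original schema: the four-table join is collapsed into a single hypothetical Lookup table, the isSNV filter is dropped, the data is made up, and COALESCE plays the role of ISNULL. The "leave the row alone" variant uses a correlated EXISTS guard, which is a swapped-in technique rather than anything from the posts above.

```python
import sqlite3

# Scalar subquery that finds the first non-null name for the current row.
LOOKUP = """(SELECT l.name FROM Lookup l
             WHERE l.indelID = SearchResults.id AND l.name IS NOT NULL
             ORDER BY l.name LIMIT 1)"""


def fresh_db():
    """Two result rows; only id 1 has a matching gene name."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE SearchResults (id INTEGER PRIMARY KEY, geneSymbol TEXT NOT NULL);
        CREATE TABLE Lookup (indelID INTEGER, name TEXT);
        INSERT INTO SearchResults VALUES (1, 'old1'), (2, 'old2');
        INSERT INTO Lookup VALUES (1, 'BRCA1');
    """)
    return conn


def rows(conn):
    return conn.execute("SELECT id, geneSymbol FROM SearchResults ORDER BY id").fetchall()


# 1) Unguarded update: the subquery yields NULL for id 2 and the NOT NULL column rejects it.
conn = fresh_db()
try:
    conn.execute(f"UPDATE SearchResults SET geneSymbol = {LOOKUP}")
    naive_failed = False
except sqlite3.IntegrityError:
    naive_failed = True

# 2) Fallback value, as in the ISNULL solution: unmatched rows become ' '.
conn = fresh_db()
conn.execute(f"UPDATE SearchResults SET geneSymbol = COALESCE({LOOKUP}, ' ')")
with_fallback = rows(conn)

# 3) EXISTS guard: unmatched rows are not touched at all.
conn = fresh_db()
conn.execute(f"UPDATE SearchResults SET geneSymbol = {LOOKUP} WHERE EXISTS {LOOKUP}")
with_guard = rows(conn)

print(naive_failed, with_fallback, with_guard)
```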
with UpdatedGenesDS as (
-- number the candidate names per indelID so that seq = 1 is the lowest symbol.id
select joiner.indelID, symbol.name, row_number() over (partition by joiner.indelID order by symbol.id asc) seq
from
GeneSymbol AS symbol JOIN GeneConnector AS geneJoin
ON symbol.id = geneJoin.geneSymbolID
JOIN Result AS sSeq ON geneJoin.sSeqID = sSeq.id
JOIN IndelConnector AS joiner ON joiner.sSeqID = sSeq.id
WHERE symbol.name is not null
)
update a
set geneSymbol = upd.name
from #SearchResults a
inner join UpdatedGenesDS upd on a.id = upd.indelID
where upd.seq = 1 and a.isSNV = 0
This handles the nulls completely, as they are all filtered out by the WHERE predicate (they can also be filtered by the join predicate if you wish). Is this what you are looking for?
Here's another option, where only those rows in #SearchResults that are successfully joined will be updated. If there are no null values in the underlying data, then the inner joins will pull in no null values, and you won't have to worry about filtering them out.
UPDATE #SearchResults
set geneSymbol = symbol.name
from #SearchResults sr
inner join IndelConnector AS joiner
on joiner.indelID = sr.id
inner join Result AS sSeq
on sSeq.id = joiner.sSeqID
inner join GeneConnector AS geneJoin
on geneJoin.sSeqID = sSeq.id
-- Get "lowest" (i.e. first if listed alphabetically) value of name for each id
inner join (select id, min(name) name
from GeneSymbol
group by id) symbol
on symbol.id = geneJoin.geneSymbolID
where isSNV = 0 -- Which table is this value from?
(There might be some syntax problems, without tables I can't debug it)