Best way to compare two sets of data w/ SQL

What I have is a query that grabs a set of data. This query is run at a certain time. Then, 30 minutes later, I have another query (same syntax) that runs and grabs that same set of data. Finally, I have a third query (the query in question) that compares both sets of data. The records it pulls out are the ones where either "FEDVIP_Active" was FALSE in the first data set and TRUE in the second data set, OR "UniqueID" didn't exist in the first data set, does exist in the second data set, and FEDVIP_Active is TRUE. I'm questioning the performance of the comparison query below: it times out after 30 minutes. Is there anything you can see that I shouldn't be doing in order to run as efficiently as possible? The two identical-ish data sets I'm comparing have around a million records each.
First query that grabs the initial set of data:
select Unique_ID, First_Name, FEDVIP_Active, Email_Primary
from Master_Subscribers_Prospects
Second query is exactly the same as the first.
Then, the third query below compares the data:
select
a.FEDVIP_Active,
a.Unique_ID,
a.First_Name,
a.Email_Primary
from
Master_Subscribers_Prospects_1 a
inner join
Master_Subscribers_Prospects_2 b
on 1 = 1
where a.FEDVIP_Active = 1 and b.FEDVIP_Active = 0 or
(b.Unique_ID not in (select Unique_ID from Master_Subscribers_Prospects_1) and b.FEDVIP_Active = 1)

If I understand correctly, you want all records from the second data set where the corresponding unique id in the first data set is not active (either by not existing or by having the flag set to not active).
I would suggest exists:
select a.*
from Master_Subscribers_Prospects_1 a
where a.FEDVIP_Active = 1 and
      not exists (select 1
                  from Master_Subscribers_Prospects_2 b
                  where b.Unique_ID = a.Unique_ID and
                        b.FEDVIP_Active = 1
                 );
For performance, you want an index on Master_Subscribers_Prospects_2(Unique_ID, FEDVIP_Active).
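As a sketch (assuming SQL Server and that no such index exists yet; the index name is made up):
-- Hypothetical index to support the NOT EXISTS lookup above
CREATE NONCLUSTERED INDEX IX_MSP2_UniqueID_FEDVIP_Active
    ON Master_Subscribers_Prospects_2 (Unique_ID, FEDVIP_Active);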

An inner join on 1 = 1 is a disguised cross join, and the number of rows a cross join produces grows rapidly: it's the product of the row counts of the two relations involved, so two tables of a million rows each yield 10^12 intermediate rows. For performance you want to keep intermediate results as small as possible.
Also, EXISTS often performs better than IN when the subquery returns a large number of rows.
But I think you don't need IN or EXISTS at all.
Assuming unique_id identifies a record and is not null, you could left join the second table to the first one on common unique_ids. Then, if and only if a unique_id from the second table has no matching record in the first table, the first table's unique_id is null in the join result, so you can check for that.
SELECT b.fedvip_active,
       b.unique_id,
       b.first_name,
       b.email_primary
FROM master_subscribers_prospects_2 b
LEFT JOIN master_subscribers_prospects_1 a
       ON b.unique_id = a.unique_id
WHERE (a.fedvip_active = 1 AND b.fedvip_active = 0)
   OR (a.unique_id IS NULL AND b.fedvip_active = 1);
For that query indexes on master_subscribers_prospects_1 (unique_id, fedvip_active) and master_subscribers_prospects_2 (unique_id, fedvip_active) might also help to speed things up.

Doing an inner select (subquery) in a WHERE clause is usually bad for performance.
Here is the same query rewritten with a left join, which might work for you.
select
a.FEDVIP_Active,
a.Unique_ID,
a.First_Name,
a.Email_Primary
from
Master_Subscribers_Prospects_1 a
inner join
Master_Subscribers_Prospects_2 b on 1 = 1
left join Master_Subscribers_Prospects_1 sa on sa.Unique_ID = b.Unique_ID
where (a.FEDVIP_Active = 1 and b.FEDVIP_Active = 0) or
(sa.Unique_ID is null and b.FEDVIP_Active = 1)

Related

SQL Server : stored procedure is slow when running two left join from one table

I have a stored procedure that runs a query to get a couple of rows from tables that are not that big, but it has two left joins on the same table and is acting slow, taking up to 300 ms with only 6 to 20 rows in each table.
How can I optimize this stored procedure?
SELECT
m.MobileNotificationID,
m.[Message] AS text,
m.TypeId AS typeId ,
m.MobileNotificationID AS recordId ,
0 badge ,
m.DeviceID,
ISNULL(users.DeviceToken, subscribers.DeviceToken) DeviceToken,
ISNULL(users.DeviceTypeID, subscribers.DeviceTypeID) DeviceTypeID,
m.Notes,
isSent = 0
--, m.SubscriberID, m.UserID
FROM
MobileNotification m
LEFT JOIN
Device users ON m.userId = users.UserID
AND users.DeviceID = m.DeviceID
LEFT JOIN
Device subscribers ON m.SubscriberID = subscribers.SubscriberId
AND subscribers.DeviceID = m.DeviceID
WHERE
IsSent = 0
AND m.DateCreated <= (SELECT GETDATE())
AND (0 = 0 OR ISNULL(users.DeviceTypeID, subscribers.DeviceTypeID) = 0)
AND (ISNULL(users.DeviceToken, '') <> '' OR
ISNULL(subscribers.DeviceToken, '') <> '')
ORDER BY
m.DateCreated DESC
A few pieces of advice:
ISNULL checks make queries much slower; try to avoid them.
To significantly improve speed, create an index on the columns that you filter on, like "IsSent" and "DateCreated", as well as the columns that you group by (a sketch is shown after this list).
Also give every table a clustered index on its id column.
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described?view=sql-server-ver15
Try to avoid two left joins on the same table if possible; in your case I think you can merge the two join conditions into one.
And finally, from my experience: sometimes it's a lot faster to perform 2 queries. Suppose you select fields from only one big table: first select just the IDs in the first query, and then in the second query select all the string fields and other calculations, filtering on the previously selected IDs.
Good luck
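A minimal sketch of the index suggestion above, assuming SQL Server (the index name is made up; adjust it to your naming conventions):
CREATE NONCLUSTERED INDEX IX_MobileNotification_IsSent_DateCreated
    ON MobileNotification (IsSent, DateCreated DESC);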

Performance Issue in Left outer join Sql server

In my project I need to find the difference in tasks between the old and new revision in the same table.
id | task | latest_Rev
---|------|-----------
 1 | A    | N
 1 | B    | N
 2 | C    | Y
 2 | A    | Y
 2 | B    | Y
Expected Result:
id | task | latest_Rev
---|------|-----------
 2 | C    | Y
So I tried the following query:
Select nw.*
from Rev_tmp nw with (nolock)
left outer
join rev_tmp old with (nolock)
on nw.id -1 = old.id
and nw.task = old.task
and nw.latest_rev = 'y'
where old.task is null
When my table has more than 20k records, this query takes a long time.
How can I reduce the time?
My company doesn't allow the use of subqueries.
Use the LAG function to remove the self join:
SELECT *
FROM (SELECT *,
             LAG(latest_Rev) OVER (PARTITION BY task ORDER BY id) AS prev_rev
      FROM Rev_tmp) a
WHERE latest_Rev = 'y'
  AND prev_rev IS NULL
My answer assumes
You can't change the indexes
You can't use subqueries
All fields are indexed separately
If you look at the query, the only value that really reduces the result set is latest_rev = 'Y'. If you were to eliminate that condition, you'd definitely get a table scan. So we want that condition to be evaluated using an index. Unfortunately, a field whose only values are 'Y' and 'N' is likely to be ignored because it will have terrible selectivity. You might get better performance if you coax SQL Server into using it anyway. If the index on latest_rev is called idx_latest_rev, then try this:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
Select nw.*
from Rev_tmp nw with (index(idx_latest_rev))
left outer
join rev_tmp old
on nw.id -1 = old.id
and nw.task = old.task
where old.task is null
and nw.latest_rev = 'y'
latest_Rev should be a bit type (the boolean equivalent); that is better for performance.
Maybe you can also add an index on the id, task, latest_Rev columns.
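For example (a sketch, assuming SQL Server; the index name is made up):
CREATE NONCLUSTERED INDEX IX_Rev_tmp_id_task_latest_Rev
    ON Rev_tmp (id, task, latest_Rev);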
You can try this query (replace left outer by not exists)
Select *
from Rev_tmp nw
where nw.latest_rev = 'y' and not exists
(
select * from rev_tmp old
where nw.id -1 = old.id and nw.task = old.task
)

SQL query: Iterate over values in table and use them in subquery

I have a simple SQL table containing some values, for example:
id | value (table 'values')
----------
0 | 4
1 | 7
2 | 9
I want to iterate over these values, and use them in a query like so:
SELECT value[0], x1
FROM (some subquery where value[0] is used)
UNION
SELECT value[1], x2
FROM (some subquery where value[1] is used)
...
etc
In order to get a result set like this:
4 | x1
7 | x2
9 | x3
It has to be in SQL as it will actually represent a database view. Of course the real query is a lot more complicated, but I tried to simplify the question while keeping the essence as much as possible.
I think I have to select from values and join the subquery, but as the value should be used in the subquery I'm lost on how to accomplish this.
Edit: I oversimplified my question; in reality I want to have 2 rows from the subquery and not only one.
Edit 2: As suggested I'm posting the real query. I simplified it a bit to make it clearer, but it's a working query and the problem is there. Note that I have hardcoded the value '2' in this query two times. I want to replace that with values from a different table, in the example table above I would want a result set of the combined results of this query with 4, 7 and 9 as values instead of the currently hardcoded 2.
SELECT x.fantasycoach_id, SUM(round_points)
FROM (
SELECT DISTINCT fc.id AS fantasycoach_id,
ffv.formation_id AS formation_id,
fpc.round_sequence AS round_sequence,
round_points,
fpc.fantasyplayer_id
FROM fantasyworld_FantasyCoach AS fc
LEFT JOIN fantasyworld_fantasyformation AS ff ON ff.id = (
SELECT MAX(fantasyworld_fantasyformationvalidity.formation_id)
FROM fantasyworld_fantasyformationvalidity
LEFT JOIN realworld_round AS _rr ON _rr.id = round_id
LEFT JOIN fantasyworld_fantasyformation AS _ff ON _ff.id = formation_id
WHERE is_valid = TRUE
AND _ff.coach_id = fc.id
AND _rr.sequence <= 2 /* HARDCODED USE OF VALUE */
)
LEFT JOIN fantasyworld_FantasyFormationPlayer AS ffp
ON ffp.formation_id = ff.id
LEFT JOIN dbcache_fantasyplayercache AS fpc
ON ffp.player_id = fpc.fantasyplayer_id
AND fpc.round_sequence = 2 /* HARDCODED USE OF VALUE */
LEFT JOIN fantasyworld_fantasyformationvalidity AS ffv
ON ffv.formation_id = ff.id
) x
GROUP BY fantasycoach_id
Edit 3: I'm using PostgreSQL.
SQL works with tables as a whole, which basically involves set operations. There is no explicit iteration, and generally no need for any. In particular, the most straightforward implementation of what you described would be this:
SELECT value, (some subquery where value is used) AS x
FROM values
Do note, however, that a correlated subquery such as that is very hard on query performance. Depending on the details of what you're trying to do, it may well be possible to structure it around a simple join, an uncorrelated subquery, or a similar, better-performing alternative.
Update:
In view of the update to the question indicating that the subquery is expected to yield multiple rows for each value in table values, contrary to the example results, it seems a better approach would be to just rewrite the subquery as the main query. If it does not already do so (and maybe even if it does) then it would join table values as another base table.
Update 2:
Given the real query now presented, this is how the values from table values could be incorporated into it:
SELECT x.fantasycoach_id, SUM(round_points) FROM
(
SELECT DISTINCT
fc.id AS fantasycoach_id,
ffv.formation_id AS formation_id,
fpc.round_sequence AS round_sequence,
round_points,
fpc.fantasyplayer_id
FROM fantasyworld_FantasyCoach AS fc
-- one row for each combination of coach and value:
CROSS JOIN values
LEFT JOIN fantasyworld_fantasyformation AS ff
ON ff.id = (
SELECT MAX(fantasyworld_fantasyformationvalidity.formation_id)
FROM fantasyworld_fantasyformationvalidity
LEFT JOIN realworld_round AS _rr
ON _rr.id = round_id
LEFT JOIN fantasyworld_fantasyformation AS _ff
ON _ff.id = formation_id
WHERE is_valid = TRUE
AND _ff.coach_id = fc.id
-- use the value obtained from values:
AND _rr.sequence <= values.value
)
LEFT JOIN fantasyworld_FantasyFormationPlayer AS ffp
ON ffp.formation_id = ff.id
LEFT JOIN dbcache_fantasyplayercache AS fpc
ON ffp.player_id = fpc.fantasyplayer_id
-- use the value obtained from values again:
AND fpc.round_sequence = values.value
LEFT JOIN fantasyworld_fantasyformationvalidity AS ffv
ON ffv.formation_id = ff.id
) x
GROUP BY fantasycoach_id
Note in particular the CROSS JOIN which forms the cross product of two tables; this is the same thing as an INNER JOIN without any join predicate, and it can be written that way if desired.
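For instance, the CROSS JOIN line in the query above could equivalently be written as an inner join with an always-true predicate (note that values is a reserved word in PostgreSQL, so depending on how the table was created the name may need to be double-quoted):
-- equivalent to: CROSS JOIN values
INNER JOIN values ON TRUE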
The overall query could be at least a bit simplified, but I do not do so because it is a working example rather than an actual production query, so it is unclear what other changes would translate to the actual application.
In the example I create two tables. See how the outer table has an alias that you use in the inner select?
SQL Fiddle Demo
SELECT T.[value], (SELECT [property] FROM Table2 P WHERE P.[value] = T.[value])
FROM Table1 T
This is a better way for performance
SELECT T.[value], P.[property]
FROM Table1 T
INNER JOIN Table2 p
on P.[value] = T.[value];
Table 2 can be a QUERY instead of a real table
Third Option
Use a CTE to calculate your values and then join back to the main table. This way you keep the subquery logic separated from your final query.
WITH cte AS (
SELECT
T.[value],
T.[value] * T.[value] as property
FROM Table1 T
)
SELECT T.[value], C.[property]
FROM Table1 T
INNER JOIN cte C
on T.[value] = C.[value];
It might be helpful to extract the computation into a function that is called in the SELECT clause and executed for each row of the result set.
Here's the documentation for CREATE FUNCTION for SQL Server. It's probably similar to whatever database system you're using, and if not you can easily Google for it.
Here's an example of creating a function and using it in a query:
CREATE FUNCTION DoComputation(@parameter1 int)
RETURNS int
AS
BEGIN
    -- Do some calculations here and return the function result.
    -- This example returns the value of @parameter1 squared.
    -- You can add additional parameters to the function definition if needed.
    DECLARE @Result int
    SET @Result = @parameter1 * @parameter1
    RETURN @Result
END
Here is an example of using the example function above in a query.
SELECT v.value, dbo.DoComputation(v.value) as ComputedValue
FROM [Values] v
ORDER BY value
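Since the question's Edit 3 mentions PostgreSQL, here is a rough equivalent sketch for that database (function and table names are assumptions based on the example above):
-- Hypothetical PostgreSQL version of the same helper function
CREATE FUNCTION do_computation(parameter1 integer)
RETURNS integer
LANGUAGE sql
AS $$
    SELECT parameter1 * parameter1;
$$;

-- Example usage ("values" is quoted because VALUES is a reserved word in PostgreSQL)
SELECT v.value, do_computation(v.value) AS computed_value
FROM "values" v
ORDER BY v.value;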

Why is Selecting From Table Variable Far Slower than List of Integers

I have a pretty big MSSQL stored procedure that I need to conditionally check for certain IDs:
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where b.SomeID in (1,2,3,4,5)
I wanted to conditionally check the SomeID field, so I did the following:
if @enteredText = 'This'
INSERT INTO @AwesomeIDs
VALUES(1),(2),(3)
if @enteredText = 'That'
INSERT INTO @AwesomeIDs
VALUES(4),(5)
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where b.SomeID in (Select ID from @AwesomeIDs)
Nothing else has changed, yet I can't even get the latter query to grab 5 records. The top query returns 5000 records in less than 3 seconds. Why is selecting from a table variable so drastically slower?
Two other possible options you can consider
Option 1
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where
( b.SomeID IN (1,2,3) AND @enteredText = 'This')
OR
( b.SomeID IN (4,5) AND @enteredText = 'That')
Option 2
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where EXISTS (Select 1
              from @AwesomeIDs
              WHERE b.SomeID = ID)
Mind you, for table variables SQL Server always assumes there is only ONE row in the table (except in SQL Server 2014, where the assumption is 100 rows), and that can affect the estimated and actual plans. But 1 row versus 3 is not really a deal breaker.
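One common workaround not mentioned above (so treat it as an extra suggestion) is to add OPTION (RECOMPILE) to the statement, which makes SQL Server compile it using the table variable's actual row count instead of the fixed guess:
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where b.SomeID in (Select ID from @AwesomeIDs)
OPTION (RECOMPILE) -- plan is built with the real cardinality of @AwesomeIDs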

Doing an Update Ignore in SQL Server 2005

I have a table where I wish to update some of the rows. All the fields are declared NOT NULL. I'm doing a sub-query, and I wish to update the table with the non-NULL results.
See Below for my final answer:
In MySQL, I solve this problem by doing an UPDATE IGNORE. How do I make this work in SQL Server 2005? The sub-query uses a four-table Join to find the data to insert if it exists. The Update is being run against a table that could have 90,000+ records, so I need a solution that uses SQL, rather than having the Java program that's querying the database retrieve the results and then update those fields where we've got non-Null values.
Update: My query:
UPDATE #SearchResults SET geneSymbol = (
SELECT TOP 1 symbol.name FROM
GeneSymbol AS symbol JOIN GeneConnector AS geneJoin
ON symbol.id = geneJoin.geneSymbolID
JOIN Result AS sSeq ON geneJoin.sSeqID = sSeq.id
JOIN IndelConnector AS joiner ON joiner.sSeqID = sSeq.id
WHERE joiner.indelID = #SearchResults.id ORDER BY symbol.id ASC)
WHERE isSNV = 0
If I add "AND symbol.name IS NOT NULL" to either WHERE I get a SQL error. If I run it as is I get "adding null to a non-null column" errors. :-(
Thank you all, I ended up finding this:
UPDATE #SearchResults SET geneSymbol =
ISNULL ((SELECT TOP 1 symbol.name FROM
GeneSymbol AS symbol JOIN GeneConnector AS geneJoin
ON symbol.id = geneJoin.geneSymbolID
JOIN Result AS sSeq ON geneJoin.sSeqID = sSeq.id
JOIN IndelConnector AS joiner ON joiner.sSeqID = sSeq.id
WHERE joiner.indelID = #SearchResults.id ORDER BY symbol.id ASC), ' ')
WHERE isSNV = 0
While it would be better not to do anything in the null case (so I'm going to try to understand the other answers and see if they're faster), setting the null cases to a blank answer also works, and that's what this does.
Note: Wrapping the ISNULL (...) with () leads to really obscure (and wrong) errors.
with UpdatedGenesDS as (
    select joiner.indelID, name,
           row_number() over (partition by joiner.indelID order by symbol.id asc) as seq
    from GeneSymbol AS symbol
    JOIN GeneConnector AS geneJoin ON symbol.id = geneJoin.geneSymbolID
    JOIN Result AS sSeq ON geneJoin.sSeqID = sSeq.id
    JOIN IndelConnector AS joiner ON joiner.sSeqID = sSeq.id
    WHERE name is not null
)
update a
set geneSymbol = upd.name
from #SearchResults a
inner join UpdatedGenesDS upd on a.id = upd.indelID
where upd.seq = 1 and isSNV = 0
This handles the nulls completely, as all of them are filtered out by the WHERE predicate (they can also be filtered by the join predicate if you wish). Is this what you are looking for?
Here's another option, where only those rows in #SearchResults that are successfully joined will be updated. If there are no null values in the underlying data, then the inner joins will pull in no null values, and you won't have to worry about filtering them out.
UPDATE #SearchResults
set geneSymbol = symbol.name
from #SearchResults sr
inner join IndelConnector AS joiner
on joiner.indelID = sr.id
inner join Result AS sSeq
on sSeq.id = joiner.sSeqID
inner join GeneConnector AS geneJoin
on geneJoin.sSeqID = sSeq.id
-- Get "lowest" (i.e. first if listed alphabetically) value of name for each id
inner join (select id, min(name) name
from GeneSymbol
group by id) symbol
on symbol.id = geneJoin.geneSymbolID
where isSNV = 0 -- Which table is this value from?
(There might be some syntax problems, without tables I can't debug it)