Getting record differences between 2 nearly identical tables - SQL

I have a process that consolidates 40+ identically structured databases down to one consolidated database, the only difference being that the consolidated database adds a project_id field to each table.
In order to be as efficient as possible, I'm trying to copy/update a record from the source databases to the consolidated database only if it's been added/changed. I delete outdated records from the consolidated database, and then copy in any non-existing records. To delete outdated/changed records I'm using a query similar to this:
DELETE a
FROM <table> a
WHERE NOT EXISTS (SELECT <primary keys>
                  FROM <source> b
                  WHERE ((<b.fields = a.fields>) OR
                         (b.fields IS NULL AND a.fields IS NULL)))
  AND a.PROJECT_ID = <project_id>
This works for the most part, but one of the tables in the source database has over 700,000 records, and this query takes over an hour to complete.
How can I make this query more efficient?

Use timestamps or, better yet, audit tables to identify the records that changed since time "X", and then save time "X" when the last sync started. We use that for interface feeds.
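A rough sketch of what that can look like (T-SQL; the sync_log table, LastUpdated column and the hard-coded project_id are illustrative, not from your schema):

DECLARE @last_sync DATETIME, @this_sync DATETIME = GETDATE();

-- when did the previous sync for this project start?
SELECT @last_sync = last_sync_time
FROM sync_log
WHERE project_id = 1;   -- <project_id>

-- only rows touched since then need to be copied/updated
SELECT *
FROM source_table
WHERE LastUpdated > @last_sync;

-- remember when this run started, for next time
UPDATE sync_log
SET last_sync_time = @this_sync
WHERE project_id = 1;   -- <project_id>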

You might want to try a LEFT JOIN with a NULL filter:
DELETE t
FROM <table> t
LEFT JOIN <source> b
    ON  (t.Field1 = b.Field1 OR (t.Field1 IS NULL AND b.Field1 IS NULL))
    AND (t.Field2 = b.Field2 OR (t.Field2 IS NULL AND b.Field2 IS NULL))
    --//...
WHERE t.PROJECT_ID = <project_id>
  AND b.PrimaryKey IS NULL --// any of the PK fields will do, but I really hope you do not use composite PKs
But if you are comparing all non-PK columns, then your query is going to suffer.
In this case it is better to add an UpdatedAt TIMESTAMP field (as DVK suggests) in both databases, which you could maintain with an AFTER UPDATE trigger; then your sync procedure would be much faster, provided you create an index that includes the PK and UpdatedAt columns.
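A minimal sketch of that setup (T-SQL; SomeTable and SomeTableID are made-up names, and UpdatedAt here is a plain DATETIME maintained by the trigger rather than a rowversion):

-- assumes a table SomeTable with primary key SomeTableID
ALTER TABLE SomeTable ADD UpdatedAt DATETIME NOT NULL DEFAULT GETDATE();
GO

CREATE TRIGGER trg_SomeTable_UpdatedAt
ON SomeTable
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- stamp only the rows that were just updated
    -- (with the default RECURSIVE_TRIGGERS OFF setting this does not re-fire the trigger)
    UPDATE t
    SET t.UpdatedAt = GETDATE()
    FROM SomeTable t
    JOIN inserted i ON i.SomeTableID = t.SomeTableID;
END;
GO

-- index the sync can seek on
CREATE INDEX IX_SomeTable_UpdatedAt ON SomeTable (SomeTableID, UpdatedAt);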

You can reorder the conditions in the WHERE clause: it has four comparisons, so put the one most likely to fail first.
If you can alter the databases/application slightly, and you'll need to do this again, a bit field that says "updated" might not be a bad addition.

I usually rewrite queries like this to avoid the NOT...
NOT IN is horrible for performance, although NOT EXISTS improves on this.
Check out this article, http://www.sql-server-pro.com/sql-where-clause-optimization.html
My suggestion...
Select your pkey column out into a working/temp table, add a flag column (INT, DEFAULT 0, NOT NULL), and index the pkey column. Set flag = 1 if the record exists in your subquery (much quicker!).
Replace the subselect in your main query with an EXISTS against (SELECT pkey FROM temptable WHERE flag = 0).
What this works out to is being able to create a list of 'not exists' values that can be used inclusively, derived from an all-inclusive set (see the sketch after the example below).
Here's our total set.
{1,2,3,4,5}
Here's the existing set
{1,3,4}
We create our working table from these two sets (technically a left outer join)
(record:exists)
{1:1, 2:0, 3:1, 4:1, 5:0}
Our set of 'not existing records'
{2,5} (SELECT pkey FROM temptable WHERE flag = 0)
Our product... and much quicker (indexes!)
{1,2,3,4,5} in {2,5} = {2,5}
{1,2,3,4,5} not in {1,3,4} = {2,5}
This can be done without a working table, but its use makes visualizing what's happening easier.
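A rough sketch of the working-table idea applied to the original delete (T-SQL; table, key and project_id names are placeholders):

-- 1. build the working table of keys, flag defaults to 0
SELECT pkey, CAST(0 AS INT) AS flag
INTO #worktable
FROM consolidated_table
WHERE project_id = 1;                        -- <project_id>

CREATE INDEX IX_worktable_pkey ON #worktable (pkey);

-- 2. flag the keys that still exist (unchanged) in the source
UPDATE w
SET w.flag = 1
FROM #worktable w
JOIN source_table s ON s.pkey = w.pkey;      -- put the full column comparison here

-- 3. delete only the rows whose key was never flagged
DELETE c
FROM consolidated_table c
JOIN #worktable w ON w.pkey = c.pkey
WHERE c.project_id = 1                       -- <project_id>
  AND w.flag = 0;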
Kris

Related

Fastest options for merging two tables in SQL Server

Consider two very large tables: Table A with 20 million rows, and Table B, which has 10 million rows and overlaps heavily with Table A. Both have an identifier column and a bunch of other data. I need to move all items from Table B into Table A, updating where they already exist.
Both tables have this structure:
- Identifier int
- Date DateTime
- Identifier A
- Identifier B
- General decimal data (maybe 10 columns)
I can get the items in Table B that are new, and get the items in Table B that need to be updated in Table A, very quickly, but I can't get an update or a delete/insert to work quickly. What options are available to merge the contents of TableB into TableA (i.e. updating existing records instead of inserting) in the shortest time?
I've tried pulling out existing records in TableB and running a large update on table A to update just those rows (i.e. an update statement per row), and performance is pretty bad, even with a good index on it.
I've also tried doing a one shot delete of the different values out of TableA that exist in TableB and performance of the delete is also poor, even with the indexes dropped.
I appreciate that this may be difficult to perform quickly, but I'm looking for other options that are available to achieve this.
Since you are dealing with two large tables, in-place updates/inserts/merges can be time-consuming operations. I would recommend using some bulk-logged loading technique to simply load the desired content into a new table and then perform a table swap:
Example using SELECT INTO:
SELECT *
INTO NewTableA
FROM (
    SELECT * FROM dbo.TableB b
    UNION ALL
    SELECT * FROM dbo.TableA a WHERE NOT EXISTS (SELECT * FROM dbo.TableB b WHERE b.id = a.id)
) d

EXEC sp_rename 'TableA', 'BackupTableA';
EXEC sp_rename 'NewTableA', 'TableA';
Simple or at least Bulk-Logged recovery is highly recommended for this approach. Also, I assume it has to be done outside business hours, since plenty of objects will have to be recreated on the new table: indexes, default constraints, the primary key, etc.
A MERGE is probably your best bet, if you want to do both inserts and updates.
MERGE #TableA AS Tgt
USING (SELECT * FROM #TableB) AS Src
    ON (Tgt.Identifier = Src.Identifier)
WHEN MATCHED THEN
    UPDATE SET Date = Src.Date, ...
WHEN NOT MATCHED THEN
    INSERT (Identifier, Date, ...)
    VALUES (Src.Identifier, Src.Date, ...);
Note that the MERGE statement must be terminated with a semicolon (;).

Count rows with column varbinary NOT NULL takes a lot of time

This query
SELECT COUNT(*)
FROM Table
WHERE [Column] IS NOT NULL
takes a lot of time. The table has 5000 rows, and the column is of type VARBINARY(MAX).
What can I do?
Your query needs to do a table scan on a column that can potentially be very large without any way to index it. There isn't much you can do to fix this without changing your approach.
One option is to split the table into two tables. The first table could have all the details you have now in it and the second table would have just the file. You can make this a 1-1 table to ensure data is not duplicated.
You would only add the binary data as needed into the second table. If it is not needed anymore, you simply delete the record. This will allow you to simply write a JOIN query to get the information you are looking for.
SELECT COUNT(*)
FROM dbo.Table1
INNER JOIN dbo.Table2
    ON Table1.Id = Table2.Id
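For reference, the split could look something like this (only a sketch; the Name and FileData columns are invented for illustration, the table names match the join above):

CREATE TABLE dbo.Table1 (                  -- the "details" table
    Id   INT           NOT NULL PRIMARY KEY,
    Name NVARCHAR(200) NOT NULL
    -- ... other detail columns ...
);

CREATE TABLE dbo.Table2 (                  -- the "file" table, 1:1 with Table1
    Id       INT            NOT NULL PRIMARY KEY
        REFERENCES dbo.Table1 (Id),
    FileData VARBINARY(MAX) NOT NULL
);

With this layout the count can even skip the join entirely (SELECT COUNT(*) FROM dbo.Table2), since a row in Table2 exists only when binary data exists.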

DB Design: Looking for performance improvement when a BIT column from every table is used in every SQL query

I recently got added to a new ASP.NET project (a web application). There have been recent performance issues with the application, and I am on a team whose current task is to optimize some slow-running stored procedures.
The database design is highly normalized. All the tables have a BIT column named [Status_ID]. In every stored procedure, for every T-SQL query, this column is involved in the WHERE condition for all tables.
Example:
Select A.Col1,
C.Info
From dbo.table1 A
Join dbo.table2 B On A.id = B.id
Left Join dbo.table21 C On C.map = B.Map
Where A.[Status_ID] = 1
And B.[Status_ID] = 1
And C.[Status_ID] = 1
And A.link > 50
In the above SQL, 3 tables are involved, and the [Status_ID] column from all 3 tables appears in the WHERE condition. This is just an example; [Status_ID] is involved like this in almost all the queries.
When I look at the execution plans of most of the SPs, there are a lot of Key Lookup (Clustered) operators involved, and most of them are looking up [Status_ID] in the respective table.
In the application, I found that it is not possible to remove these column checks from the queries. So:
Would it be a good idea to:
1. Alter all [Status_ID] columns to NOT NULL, and then add them to the PRIMARY KEY of each table? Key values 12, 13, ... would become (12, 1) and (13, 1).
2. Add the [Status_ID] column to the INCLUDE part of all the non-clustered indexes for that table?
Please share your suggestions on the above two points, as well as any others.
Thanks for reading.
If you add Status_ID to the PK you change the definition of the PK.
If you add Status_ID to the PK then you could have duplicate IDs.
And changing the Status_ID would fragment the index.
Don't do that.
The PK should be what makes the row unique.
Add a separate nonclustered index for Status_ID.
And if it is nullable then change it to NOT NULL.
This will only cut the workload in half.
Another option is to add [Status_ID] to every other non-clustered index.
But if it is the first key column, it only cuts the workload in half.
And if it is the second, it is only effective when the other key column of the index is also in the query.
Try Status_ID as a separate index
I suspect the query optimizer will be smart enough to evaluate it last since it will be the least specific index
If you don't have an index on link then add one.
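For example (using the table/column names from the sample query; index names are made up, and the filtered index at the end is an extra option not mentioned above, worth considering only if queries almost always filter on [Status_ID] = 1):

-- separate nonclustered index on the status flag
CREATE NONCLUSTERED INDEX IX_table1_StatusID ON dbo.table1 ([Status_ID]);

-- index on link, carrying [Status_ID] so the key lookup can be avoided
CREATE NONCLUSTERED INDEX IX_table1_link ON dbo.table1 (link) INCLUDE ([Status_ID]);

-- filtered-index alternative: only the "active" rows are kept in the index
CREATE NONCLUSTERED INDEX IX_table1_link_active
    ON dbo.table1 (link)
    WHERE [Status_ID] = 1;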
And try changing the query.
Sometimes this helps the query optimizer:
Select A.Col1, C.Info
From dbo.table1 A
Join dbo.table2 B
On A.id = B.id
AND A.[Status_ID] = 1
And A.link > 50
And B.[Status_ID] = 1
Left Join dbo.table21 C
On C.map = B.Map
And C.[Status_ID] = 1
Check the fragmentation of the indexes
Check the type of join
If it is using a loop join then try join hints
This query should not be performing poorly
It might be lock contention.
Try WITH (NOLOCK).
That might not be an acceptable long-term solution, but it would tell you if locks are the problem.

Slow self-join delete query

Does it get any simpler than this query?
delete a.* from matches a
inner join matches b ON (a.uid = b.matcheduid)
Yes, apparently it does... because the performance on the above query is really bad when the matches table is very large.
matches is about 220 million records. I am hoping that this DELETE query takes the size down to about 15,000 records. How can I improve the performance of the query? I have indexes on both columns. UID and MatchedUID are the only two columns in this InnoDB table, both are of type INT(10) unsigned. The query has been running for over 14 hours on my laptop (i7 processor).
Deleting so many records can take a while; I think this is as fast as it can get if you're doing it this way. If you don't want to invest in faster hardware, I suggest another approach:
If you really want to delete 220 million records so that the table then has only 15,000 records left, that's about 99.99% of all entries. Why not
Create a new table,
just insert all the records you want to survive,
and replace your old one with the new one?
Something like this might work a little bit faster:
/* creating the new table */
CREATE TABLE matches_new
SELECT a.* FROM matches a
LEFT JOIN matches b ON (a.uid = b.matcheduid)
WHERE b.matcheduid IS NULL;
/* renaming tables */
RENAME TABLE matches TO matches_old;
RENAME TABLE matches_new TO matches;
After this you just have to check and create your desired indexes, which should be rather fast if only dealing with 15,000 records.
Running EXPLAIN SELECT a.* FROM matches a INNER JOIN matches b ON (a.uid = b.matcheduid) would show whether your indexes are present and being used.
I might be setting myself up to be roasted here, but in performing a delete operation like this in the midst of a self-join, isn't the query having to recompute the join index after each deletion?
While it is clunky and brute force, you might consider either:
A. Create a temp table to store the uid's resulting from the inner join, then join to THAT, THEN perform the delete.
OR
B. Add a boolean (bit) typed column, use the join to flag each match (this operation should be FAST), and THEN use:
DELETE FROM matches WHERE YourBitFlagColumn = TRUE
Then delete the boolean column.
You probably need to batch your delete. You can do this with a recursive delete using a common table expression or just iterate it on some batch size.
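A sketch of the batched variant in MySQL (batch size is arbitrary; rerun until no rows are affected). The extra derived-table nesting is needed because MySQL will not let you delete from a table you are also selecting from directly in a subquery:

-- repeat from application code or a stored procedure until ROW_COUNT() returns 0
DELETE FROM matches
WHERE uid IN (
    SELECT uid FROM (
        SELECT a.uid
        FROM matches a
        INNER JOIN matches b ON a.uid = b.matcheduid
        LIMIT 10000                -- batch size, tune as needed
    ) AS batch
);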

How do I select differing rows in two MySQL tables with the same structure?

I have two tables, A and B, that have the same structure (about 30+ fields). Is there a short, elegant way to join these tables and only select rows where one or more columns differ? I could certainly write some script that creates the query with all the column names but maybe there is an SQL-only solution.
To put it another way: Is there a short substitute to this:
SELECT *
FROM table_a a
JOIN table_b b ON a.pkey=b.pkey
WHERE a.col1 != b.col1
OR a.col2 != b.col2
OR a.col3 != b.col3 # .. repeat for 30 columns
Taking the data into account, there is no short way. Actually this is the only solid way to do it. One thing you might need to be careful with is proper comparison of NULL values in NULL-able columns. The query with OR tends to be slow, not to mention when it spans 30 columns.
Also, your query will not include records in table_b that do not have a corresponding one in table_a. So ideally you would use a FULL JOIN.
If you need to perform this operation often, then you could introduce some kind of additional data column that gets updated always when anything in the row changes. This could be as simple as the TIMESTAMP column which gets updated with the help of UPDATE/INSERT triggers. Then when you compare, you even have a knowledge of which record is more recent. But again, this is not a bullet proof solution.
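It does not shorten the column list, but MySQL's NULL-safe equality operator <=> at least takes care of the NULL comparison problem mentioned above:

SELECT a.*, b.*
FROM table_a a
JOIN table_b b ON a.pkey = b.pkey
WHERE NOT (a.col1 <=> b.col1)
   OR NOT (a.col2 <=> b.col2)
   OR NOT (a.col3 <=> b.col3)  # .. repeat for 30 columns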
There is a standard SQL way to do this (an EXCEPT, a.k.a. MINUS, select), but MySQL (along with many other DBMSes) doesn't support it.
Failing that, you could try this:
SELECT a.* FROM a NATURAL LEFT JOIN b
WHERE b.pkcol IS NULL
According to the MySQL documentation, a NATURAL JOIN will join the two tables on all identically named columns. By filtering out the a records where the b primary key column comes back NULL, you are effectively getting only the a records with no matching b table record.
FYI: This is based on the MySQL documentation, not personal experience.
The best way I can think of is to create a temporary table with the same structure, but with a unique constraint across the 30 fields you want to check. Then insert all rows from table A into the temp table, then all rows from table B... As the rows from B go in (use INSERT IGNORE), the ones that are identical to an existing row on every column will be dropped. The result is that the B rows surviving in your temp table differ in at least one column. You can then select everything from that.