How can I efficiently compare my data with a remote database?

I need to update my contacts database in SQL Server with changes made in a remote database (also SQL Server, on a different server on the same local network). I can't make any changes to the remote database, which is a commercial product. I'm connected to the remote database using a linked server. Both tables contain around 200K rows.
My logic at this point is very simple: [simplified pseudo-SQL follows]
/* Get IDs of new contacts into local temp table */
SELECT remote.ID
INTO #NewContactIDs
FROM Remote.Contacts remote
LEFT JOIN Local.Contacts local ON remote.ID = local.ID
WHERE local.ID IS NULL;

/* Get IDs of changed contacts */
SELECT remote.ID
INTO #ChangedContactIDs
FROM Remote.Contacts remote
JOIN Local.Contacts local ON remote.ID = local.ID
WHERE local.ModifyDate < remote.ModifyDate;

/* Pull down all new or changed contacts */
SELECT ID, FirstName, LastName, Email, ...
INTO #NewOrChangedContacts
FROM Remote.Contacts remote
WHERE remote.ID IN (
    SELECT ID FROM #NewContactIDs
    UNION
    SELECT ID FROM #ChangedContactIDs
);
Of course, doing those joins and comparisons over the wire is killing me. I'm sure there's a better way - advice?

Consider maintaining a lastCompareTimestamp (the last time you did the compare) in your local system. Grab all the remote records with ModifyDate > lastCompareTimestamp and throw them into a local temp table. Work with them locally from there.
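A minimal sketch of that idea, keeping the shorthand Remote.Contacts / Local.Contacts names from the question; the dbo.SyncLog bookkeeping table is an assumption:
DECLARE @lastCompareTimestamp datetime;
SELECT @lastCompareTimestamp = LastCompareTimestamp FROM dbo.SyncLog;  -- assumed one-row bookkeeping table

/* Pull only rows modified since the last compare across the wire */
SELECT ID, FirstName, LastName, Email, ModifyDate
INTO #RemoteChanges
FROM Remote.Contacts
WHERE ModifyDate > @lastCompareTimestamp;

/* ...apply #RemoteChanges to Local.Contacts locally, then record this run... */
UPDATE dbo.SyncLog SET LastCompareTimestamp = GETDATE();
Whether the WHERE clause is evaluated on the remote server depends on the provider; wrapping the pull in OPENQUERY is one way to force the filter to run remotely.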

The last compare date is a great idea.
One other method I have had great success with is SSIS (though it has a learning curve, and might be overkill unless you do this type of thing a lot):
Make a package
Set a data source for each of the two tables. If you expect a lot of change, pull the whole tables; if you expect only incremental changes, filter by modify date. Make sure the results are ordered
Funnel both sets into a Full Outer Join
Split the results of the join into three buckets: unchanged, changed, new
Discard the unchanged records, send the new records to an insert destination, and send the changed records either to a staging table for a set-based SQL update or, for a small number of rows, to an OLE DB Command with a parameterized update statement.
OR, if on SQL Server 2008 or later, use MERGE (a sketch follows below)
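For the MERGE route, a rough sketch, assuming the new/changed remote rows have already been staged locally (for example in the #NewOrChangedContacts temp table from the question):
MERGE Local.Contacts AS target
USING #NewOrChangedContacts AS source
    ON target.ID = source.ID
WHEN MATCHED AND target.ModifyDate < source.ModifyDate THEN
    UPDATE SET FirstName  = source.FirstName,
               LastName   = source.LastName,
               Email      = source.Email,
               ModifyDate = source.ModifyDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, FirstName, LastName, Email, ModifyDate)
    VALUES (source.ID, source.FirstName, source.LastName, source.Email, source.ModifyDate);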

Related

SQL Table Comparison Taking Extended Periods of Time

I am working on building an application that dynamically takes data from different sources (files and emails, usually CSV and Excel; APIs; and other SQL databases), processes it, and moves it to a central SQL server. All the tables are uploaded to the main SQL server and processed to insert new rows into the destination table and to update rows whose data has changed. The main SQL server is a Microsoft SQL Server.
When the data is uploaded to the main server for comparison, it is stored in a temporary table which is dropped after the comparison is done. The statement I am using is created dynamically by the program so that it can adapt to different datasets. What I have been using is a NOT EXISTS query, and when I run it against a table with 380k+ rows of data it takes 2+ hours to process. I have also tried EXCEPT, but I am unable to use it because some of the tables contain text fields, which can't be compared in an EXCEPT statement. The datasets being uploaded to the server are written to and read from at different intervals, based on the schedules built into the program.
I was looking for a more efficient approach, or for improvements I might be able to make, to bring down the run time for this table. The program that manages the server runs on a different server than the SQL instance, which is part of the organization's SQL farm. I am not very experienced with SQL, so I appreciate all the help I can get. Below I have added links to the code and an example statement produced by the system when it runs the comparison.
C# Code: https://pastebin.com/8PeUvekG
SQL Statement: https://pastebin.com/zc9kshJw
INSERT INTO vewCovid19_SP
(Street_Number, Street_Dir, Street_Name, Street_Type, Apt, Municipality, County, x_cord, y_cord, address, Event_Number, latitude, longitude, Test_Type, Created_On_Date, msg)
SELECT A.Street_Number, A.Street_Dir, A.Street_Name, A.Street_Type, A.Apt, A.Municipality, A.County, A.x_cord, A.y_cord, A.address, A.Event_Number, A.latitude, A.longitude, A.Test_Type, A.Created_On_Date, A.msg
FROM #TEMP_UPLOAD A
WHERE NOT EXISTS (
    SELECT * FROM vewCovid19_SP B
    WHERE ISNULL(CONVERT(VARCHAR, A.Street_Number), 'NULL') = ISNULL(CONVERT(VARCHAR, B.Street_Number), 'NULL')
    AND ISNULL(CONVERT(VARCHAR, A.Street_Dir), 'NULL') = ISNULL(CONVERT(VARCHAR, B.Street_Dir), 'NULL')
    AND ISNULL(CONVERT(VARCHAR, A.Apt), 'NULL') = ISNULL(CONVERT(VARCHAR, B.Apt), 'NULL')
    AND ISNULL(CONVERT(VARCHAR, A.Street_Name), 'NULL') = ISNULL(CONVERT(VARCHAR, B.Street_Name), 'NULL')
    AND ISNULL(CONVERT(VARCHAR, A.Street_Type), 'NULL') = ISNULL(CONVERT(VARCHAR, B.Street_Type), 'NULL')
);
DROP TABLE #TEMP_UPLOAD;
One simple query form would be to load a new table from a UNION (which includes de-duplication).
e.g.:
INSERT INTO vewCovid19_SP_new
SELECT *
FROM vewCovid19_SP
UNION
SELECT *
FROM #TEMP_UPLOAD;
then swap the tables with ALTER TABLE ... SWITCH, or drop and sp_rename.
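A hedged sketch of the swap step using sp_rename (ALTER TABLE ... SWITCH also works, but requires matching table structures and an empty target):
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.vewCovid19_SP', 'vewCovid19_SP_old';
    EXEC sp_rename 'dbo.vewCovid19_SP_new', 'vewCovid19_SP';
COMMIT TRANSACTION;
DROP TABLE dbo.vewCovid19_SP_old;  -- once the new contents are verified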

SQL Server - Syntax around UNION and USE functions

I have a series of databases on the same server which I wish to query. I am using the same code to query each database and would like the results to appear in a single list.
I am using 'USE' to specify which database to query, followed by creating some temporary tables to group my data, before using a final SELECT statement to bring together all the data from the database.
I am then using UNION, followed by a second USE command for the next database and so on.
SQL Server is showing a syntax error on the word 'UNION' but does not give any assistance as to the source of the problem.
Is it possible that I am missing a character? At present I am not using ( or ) anywhere.
The USE statement just redirects your session to connect to a different database on the same instance; you don't actually need to switch from database to database in this manner (there are a few rare exceptions, though).
Use the 3 part notation to join your result sets. You can do this while being connected to any database.
SELECT
SomeColumn = T.SomeColumn
FROM
FirstDatabase.Schema.TableName AS T
UNION ALL
SELECT
SomeColumn = T.SomeColumn
FROM
SecondDatabase.Schema.YetAnotherTable AS T
The engine will automatically check your login's mapped user on each database and validate your permissions on the underlying tables or views.
UNION adds result sets together; you can't issue any statement other than a SELECT (such as USE) between the UNION'd queries.
You should put the database name (and schema) before the table name:
SELECT valueFromBase1
FROM database1.dbo.table1
WHERE ...
UNION
SELECT valueFromBase2
FROM database2.dbo.table2
WHERE ...

Reduce Load on SQL Server DB

I have a third-party application whose queries hit the SQL Server 2008 database to fetch data as quickly as possible (near real time). The same query can be issued by multiple users at different times.
Is there a way to store the latest result and serve the results for subsequent queries without actually hitting the database again and again for the same piece of data?
Get the results from a procedure that caches the data in a global temporary table (or switch to a permanent table if connections are regularly dropped: change tempdb..##Results to Results). Passing @refresh = 1 rebuilds the data:
CREATE PROCEDURE [getresults] (@refresh int = 0)
AS
BEGIN
    IF @refresh = 1 AND OBJECT_ID('tempdb..##Results') IS NOT NULL
        DROP TABLE ##Results;

    IF OBJECT_ID('tempdb..##Results') IS NULL
        SELECT * INTO ##Results FROM [INSERT SQL HERE];

    SELECT * FROM ##Results;
END
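Usage would then look something like this; a scheduled job could own the refresh while application calls simply read the cached copy:
EXEC dbo.getresults @refresh = 1;  -- rebuild ##Results from the source query
EXEC dbo.getresults;               -- later callers are served from ##Results without hitting the base tables
Note that a global temp table disappears once the creating session disconnects and no one else references it, which is why the permanent-table variant is suggested above.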
Can you create an indexed view for the data?
When the data is updated the view is updated as well, so when the third party makes a call the view contents are returned without needing to hit the base tables.
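A minimal sketch, with entirely hypothetical table and column names since the third-party schema isn't shown; the view must be schema-bound and use two-part names, and on editions other than Enterprise/Developer the NOEXPAND hint is needed for the index to be used:
CREATE VIEW dbo.vw_OpenOrders
WITH SCHEMABINDING
AS
SELECT OrderID, CustomerID, OrderDate, TotalAmount
FROM dbo.Orders              -- two-part names are required inside an indexed view
WHERE OrderStatus = 'Open';
GO
/* The unique clustered index is what materializes (and auto-maintains) the view */
CREATE UNIQUE CLUSTERED INDEX IX_vw_OpenOrders ON dbo.vw_OpenOrders (OrderID);
GO
SELECT OrderID, CustomerID, OrderDate, TotalAmount
FROM dbo.vw_OpenOrders WITH (NOEXPAND);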
Unfortunately, the SQL Server you are using doesn't have a result cache like, for example, MySQL's query cache. But per the documentation I just saw here: Buffer Management
Data pages which are read during a SELECT are first brought into the buffer cache. Subsequent requests reading the same data can thus be served more quickly than the initial request, without needing to access the disk.

Copy lots of table records from SQL Server Dev to Prod

I have a table with about 5 million records and I only need to move the last 1 million to production (as the other 4 million are there). What is the best way to do this so I don't have to recopy the entire table each time?
A little faster will probably be:
Insert into prod.dbo.table (column1, column2....)
Select column1, column2.... from dev.dbo.table d
where not exists (
select 1 from prod.dbo.table pc where pc.pkey = d.pkey
)
But you need to tell us if these tables are on the same server or not
Also how frequently is this run and how robust does it need to be? There are alternative solutions depending on your requirements.
Given this late-arriving gem from the OP ("no need to compare as I know the IDs > X"), you do not have to do an expensive comparison. You can just use:
Insert into prod.dbo.table (column1, column2....)
Select column1, column2.... from dev.dbo.table d
where ID > x
This will be far more efficient as you are only transferring the rows you need.
Edit: (Sorry for revising so much. I'm understanding your question better now)
INSERT INTO TblProd
SELECT * FROM TblDev
WHERE pkey NOT IN (SELECT pkey FROM TblProd);
This should only copy over records that aren't already in your target table.
Since they are on separate servers, that changes everything. In short: in order to know what is in DEV but not in PROD, you need to compare everything in DEV to everything in PROD anyway, so there is no simple way to avoid comparing huge datasets.
Some different strategies used for replication between PROD and DEV systems:
A. Backup and restore the whole database across and apply scripts afterwards to clean it up
B. Implement triggers in the PROD database that record the changes, then copy only changed records across (see the sketch after this list)
C. Identify some kind of partition or set of records that you know don't change (e.g. records more than 12 months old), and only refresh those that aren't in that dataset.
D. Copy ALL of prod into a staging table on the DEV server using SSIS. Use a very similar query to above to only insert new records across the database. Delete the staging table.
E. You might be able to find a third party SSIS component that does this efficiently. Out of the box, SSIS is inefficient at comparative updates.
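For option B, a rough sketch with entirely hypothetical names (a source table dbo.SourceTable keyed by pkey); the change log records which keys were touched so the copy job only has to move those rows. On SQL Server 2008+ the built-in Change Tracking or Change Data Capture features do the same job without hand-written triggers.
CREATE TABLE dbo.SourceTable_Changes
(
    pkey      int          NOT NULL,
    ChangedAt datetime2(3) NOT NULL DEFAULT SYSUTCDATETIME()
);
GO
CREATE TRIGGER dbo.trg_SourceTable_TrackChanges
ON dbo.SourceTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- log every inserted/updated key; the copy job joins on this table and clears it afterwards
    INSERT INTO dbo.SourceTable_Changes (pkey)
    SELECT i.pkey FROM inserted AS i;
END;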
Do you actually have an idea of what those last million records are? i.e. are they for a location or a date or something? Can you write a select to identify them?
Based on this comment:
no need to compare as I know the IDs > X will work
You can run this on the DEV server, assuming you have created a linked server called PRODSERVER on the DEV server
INSERT INTO DB.dbo.YOURTABLE (COL1, COL2, COL3...)
SELECT COL1, COL2, COL3...
FROM PRODSERVER.DB.dbo.YOURTABLE
WHERE ID > X
Look up 'SQL Server Linked Servers' for more information on how to create one.
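A minimal sketch of creating that linked server on the DEV instance; the host name and provider are assumptions for illustration:
EXEC sp_addlinkedserver
    @server     = N'PRODSERVER',
    @srvproduct = N'',
    @provider   = N'SQLNCLI',        -- or N'MSOLEDBSQL' on newer installs
    @datasrc    = N'prod-sql-host';  -- network name of the PROD instance (assumed)
EXEC sp_addlinkedsrvlogin
    @rmtsrvname = N'PRODSERVER',
    @useself    = N'TRUE';           -- pass the caller's own credentials through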
The INSERT over the linked server is fine for a one-off, but if you do this regularly you might want to make something more robust.
For example you could create a script that exports the data using BCP.EXE to a file, copies it across to DEV and imports it again. This is more reliable as it does it in one batch rather than requiring a network connection the whole time.
If the tables are on the same server, you can do something like this
I am using MySQL, so maybe the syntax will be a little different, but in my opinion the approach should be the same.
INSERT INTO newTable (columnsYouWantToCopy)
SELECT columnsYouWantToCopy
FROM oldTable WHERE clauseWhichGivesYouOnlyRecodsYouNeed
If on another server, you can do something like this:
http://dev.mysql.com/doc/refman/5.0/en/select-into.html

Problems merging local and remote tables using SAS: the 'dbkey=' option not working

I have some code to merge a local table of keys in SAS with a remote table (from a MS-SQL database).
Example code:
LIBNAME RemoteDB ODBC user=xxx password=yyy datasrc='RemoteDB' READBUFF=1500;
proc sql;
create table merged_result as
select t1.ID,
t1.OriginalInfo,
t2.RemoteInfo
from input_keys as t1
Left join RemoteDB.remoteTable (dbkey=ID) as t2
on (t1.ID = t2.ID)
order by ID;
quit;
This used to work fine (at least for 150000 rows), but doesn't now, possibly due to SAS updates. At the moment, the same code leads to SAS trying to download the entire remote table (hundreds of GB) to merge locally, which clearly isn't an option. It is obviously the dbkey= option that has stopped working. For the record, the key used to join (ID in example) is indexed in the remote table.
Instead using the dbmaster= option together with the multi_datasrc_opt=in_clause option work (in the LIBNAME statement), but only for 4500 keys and less. Trying to merge larger datasets again leads to SAS trying to download the entire remote table.
Suggestions on how to proceed?
Underwater's question indicates the implicit pass-through feature had worked previously in a manner consistent with optimized processing. After an update the implicit pass-through continues to work for his queries, albeit in a non-optimal way.
To ensure a known (explicit) equivalent near optimal processing methodology I would upload input_keys to a temp table in RemoteDB and join that remotely in pass through. This code is an example of a workable fallback whenever you are dissatisfied with the implicit decisions made by the Executor, SQL planner, and library engine.
LIBNAME tempdata oledb ... dbmstemp=yes ; * libname for remote temp tables;
* store only ids remotely;
data tempdata.id_list;
set input_keys(keep=id);
run;
* use uploaded ids in pass-through join, capture resultset and rejoin for OriginalInfo in sas;
proc sql;
connect to ... as REMOTE ...connection options...;
create table results_matched as
select
RMTJOIN.*
, LOCAL.OriginalInfo
from
(
select * from connection to remote
(
select *
from mySchema.myBigTable BIG
join tempdb.##id_list LIST
on BIG.id = LIST.id
)
) as RMTJOIN
JOIN input_keys as LOCAL
on RMTJOIN.id = LOCAL.id
;
quit;
The dbmstemp option for SQL Server connections causes new remote tables to reside in tempdb schema and be named with leading ##.
When using SQL Server, use the BULKLOAD= libname option for the highest performance. You may require a special GRANT from the database administrator in order to bulk load.