I have a table with about 5 million records and I only need to move the last 1 million to production (as the other 4 million are there). What is the best way to do this so I don't have to recopy the entire table each time?

This will probably be a little faster:
Insert into prod.dbo.table (column1, column2....)
Select column1, column2.... from dev.dbo.table d
where not exists (
select 1 from prod.dbo.table pc where pc.pkey = d.pkey
)
But you need to tell us whether these tables are on the same server or not.
Also how frequently is this run and how robust does it need to be? There are alternative solutions depending on your requirements.
Given this late-arriving gem from the OP ("no need to compare as I know the IDs > X"), you do not have to do an expensive comparison. You can just use
Insert into prod.dbo.table (column1, column2....)
Select column1, column2.... from dev.dbo.table d
where ID > x
This will be far more efficient as you are only transferring the rows you need.
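As a minimal sketch of the "ID > x" idea, the snippet below uses Python's built-in sqlite3 module as a stand-in for SQL Server so it can run anywhere. The table names dev_table and prod_table, the payload column, and the helper copy_new_rows are hypothetical; x is taken here as the highest id already present in the target, which assumes ids only ever increase.

```python
import sqlite3

def copy_new_rows(conn, last_copied_id):
    """Copy only rows with id > last_copied_id from dev_table to prod_table."""
    cur = conn.execute(
        "INSERT INTO prod_table (id, payload) "
        "SELECT id, payload FROM dev_table WHERE id > ?",
        (last_copied_id,),
    )
    conn.commit()
    return cur.rowcount  # number of rows inserted

def demo_incremental_copy():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dev_table (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute("CREATE TABLE prod_table (id INTEGER PRIMARY KEY, payload TEXT)")
    # dev holds ids 1..10, prod already holds ids 1..7
    conn.executemany("INSERT INTO dev_table VALUES (?, ?)",
                     [(i, f"row {i}") for i in range(1, 11)])
    conn.executemany("INSERT INTO prod_table VALUES (?, ?)",
                     [(i, f"row {i}") for i in range(1, 8)])
    # Assumption: the high-water mark x is the largest id already in prod
    (x,) = conn.execute("SELECT COALESCE(MAX(id), 0) FROM prod_table").fetchone()
    copied = copy_new_rows(conn, x)
    total = conn.execute("SELECT COUNT(*) FROM prod_table").fetchone()[0]
    return x, copied, total
```

Only the rows above the high-water mark are read and written, so the cost scales with the number of new rows rather than with the size of the whole table.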

Edit: (Sorry for revising so much. I'm understanding your question better now)
Insert into TblProd
Select * from TblDev where
pkey not in (select pkey from tblprod)
This should only copy over records that aren't already in your target table.

Since they are on a separate server that changes everything. In short: in order to know what isn't in dev you need to compare everything in DEV to everything in PROD anyway, so there is no simple way to avoid comparing huge datasets.
Some different strategies used for replication between PROD and DEV systems:
A. Backup and restore the whole database across and apply scripts afterwards to clean it up
B. Implement triggers in the PROD database that record the changes, then copy only changed records across
C. Identify some kind of partition or set of records that you know don't change (i.e. 12 months ago), and only refresh those that aren't in that dataset.
D. Copy ALL of prod into a staging table on the DEV server using SSIS. Use a very similar query to above to only insert new records across the database. Delete the staging table.
E. You might be able to find a third party SSIS component that does this efficiently. Out of the box, SSIS is inefficient at comparative updates.
Do you actually have an idea of what those last million records are? i.e. are they for a location or a date or something? Can you write a select to identify them?
Based on this comment:
no need to compare as I know the IDs > X will work
You can run this on the DEV server, assuming you have created a linked server called PRODSERVER on the DEV server
Look up 'SQL Server Linked Servers' for more information on how to create one.
This is fine for a one off but if you do this regularly you might want to make something more robust.
For example you could create a script that exports the data using BCP.EXE to a file, copies it across to DEV and imports it again. This is more reliable as it does it in one batch rather than requiring a network connection the whole time.
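The export, copy, import flow can be sketched roughly as follows. This is not BCP itself: it uses Python's csv and sqlite3 modules as stand-ins, and the items table and the helpers export_rows and import_rows are hypothetical. The point it illustrates is that the import runs as a single local transaction from a file, so a dropped network connection cannot leave the target half loaded.

```python
import csv
import os
import sqlite3
import tempfile

def export_rows(conn, path, min_id):
    """Write rows with id > min_id to a CSV file (stand-in for a bulk export)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        query = "SELECT id, payload FROM items WHERE id > ? ORDER BY id"
        for row in conn.execute(query, (min_id,)):
            writer.writerow(row)

def import_rows(conn, path):
    """Load the file in one transaction: either every row lands or none does."""
    with open(path, newline="") as f:
        rows = [(int(r[0]), r[1]) for r in csv.reader(f)]
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO items (id, payload) VALUES (?, ?)", rows)
    return len(rows)

def demo_file_transfer():
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    for conn in (source, target):
        conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
    source.executemany("INSERT INTO items VALUES (?, ?)",
                       [(i, f"row {i}") for i in range(1, 6)])
    target.executemany("INSERT INTO items VALUES (?, ?)",
                       [(i, f"row {i}") for i in range(1, 4)])
    target.commit()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "items_export.csv")
        export_rows(source, path, min_id=3)   # the file would be copied across here
        imported = import_rows(target, path)
    ids = [r[0] for r in target.execute("SELECT id FROM items ORDER BY id")]
    return imported, ids
```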

If the tables are on the same server, you can do something like this
I am using MySQL, so maybe the syntax will be a little bit different, but in my opinion everything should be the same.
INSERT INTO newTable (columnsYouWantToCopy)
SELECT columnsYouWantToCopy
FROM oldTable WHERE clauseWhichGivesYouOnlyRecordsYouNeed
If on another server, you can do something like this:


SQL multiple tables - very slow

I am trying to speed up a SQL Server report regarding the IBM OS/400 operating system for my sales department.
A colleague of mine (who has left the company) wrote this report and used a ton of subselects.
The report usually takes about 30 min to process and often just fails to display. I already tried to cut out some tables/rows in hopes of speeding up the process, without success (everything is needed by the sales department).
It works over all relevant data (orders, customers, articles, our order at the manufacturer, the manufacturer and so on). Any ideas?
I can't index it, due to the OS/400 system; guess it would be a new programming task for our contractor which leads to costs.
Can I use some clever joins? or somehow reduce the amount of subselects?
Are you using 4 part names in your query? That's probably your problem...
From SQL server...
-- Pull all rows from the table(s) back to MS SQL server and do the where locally on the MS SQL server
select * from LINKEDSVR.MYIBMI.MYLIB.MYTBL where locnbr = '00335';
-- Sends the statement to IBM i server for processing, only results are returned..
select * from openquery(LINKEDSVR, 'select * from MYTBL where locnbr = ''00335''');
Try running the subselects first, sending the output of each to its own table.
Update statistics on the tables. Then run the rest of the SQL, replacing what were originally subselects with the tables created in the first step.
Treat multiple layers of nesting the same way: each layer is its own insert into another table.
I've found that query optimizers have a hard time with complex SQL. Breaking-out the subqueries into separate steps often resolves this.
Between runs my preference is to leave the data intact as a reference in case debugging is needed, then truncate the tables as the first step of a run.
Responding to eraser's comments
Assuming your original query takes this general form:
select [columns]
from
(-- subquery
select [columns] from TableA
) as Subquery,
TableB
where mainquery_where_clause
-- Create a table to hold the results of your subquery:
Create Table SubQTable ([columns]) ;
-- Now run the subquery:
insert into SubQTable select [columns] from TableA ;
-- Update the data distribution statistics:
update statistics SubQTable ;
-- Now run the re-written main query:
Select [columns]
from SubQTable, TableB
where SubQTable.joincol = TableB.joincol
and mainquery_where_clause ;
I noticed some syntax issues with the SQL you posted. Looks like something got left out. But the principle of my answer remains the same. Please note that applying my suggestion may not help, as there are potentially many variables to your scenario; you mentioned subqueries, so I chose to address that.
Halfer's suggestion is a great one: edit your original question, adding the SQL code, and putting it in the "{}" supplied by the text editing tool.
I strongly suggest that you obtain the SQL execution plan and post the results.
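The staging-table approach described in this answer can be sketched end to end as below. It uses Python's sqlite3 module as a stand-in for the real servers, and the customers, orders, and order_totals_stage tables are hypothetical. SQLite's ANALYZE plays the role of the statistics update here; the SQL Server equivalent would be UPDATE STATISTICS.

```python
import sqlite3

def build_report(conn):
    # First step of a run: clear what the previous run left behind
    conn.execute("DELETE FROM order_totals_stage")
    # Run the former subquery on its own and keep its output in a table
    conn.execute(
        "INSERT INTO order_totals_stage (customer_id, total) "
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    )
    conn.commit()
    # Refresh optimizer statistics for the freshly loaded table
    conn.execute("ANALYZE order_totals_stage")
    # Main query now joins to the staging table instead of nesting a subselect
    return conn.execute(
        "SELECT c.name, s.total "
        "FROM customers c JOIN order_totals_stage s ON s.customer_id = c.id "
        "ORDER BY c.name"
    ).fetchall()

def demo_report():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
        CREATE TABLE order_totals_stage (customer_id INTEGER PRIMARY KEY, total INTEGER);
        INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
        INSERT INTO orders VALUES (1, 1, 10), (2, 1, 20), (3, 2, 5);
    """)
    return build_report(conn)
```

The staging data is left in place after the run, so it is still available for debugging until the next run truncates it.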

Is there a "Code Coverage" equivalent for SQL databases?

I have a database with many tables that get used, and many tables that are no longer used. While I could sort through each table manually to see if they are still in use, that would be a cumbersome task. Is there any software/hidden feature that can be used on a SQL Server/Oracle database that would return information like "Tables x,y,z have not been used in the past month" "Tables a,b,c have been used 17 times today"? Or possibly a way to sort tables by "Date Last Modified/Selected From"?
Or is there a better way to go about doing this? Thanks
edit: I found a "modify_date" column when executing "SELECT * FROM sys.tables ORDER BY modify_date desc", but this seems to only keep track of modifications to the table's structure, not its contents.
Replace spt_values with the table name you are interested in; the query will give you the last time it was used and what it was used by.
From here: Finding Out How Many Times A Table Is Being Used In Ad Hoc Or Procedure Calls In SQL Server 2005 And 2008
SELECT * FROM(SELECT COALESCE(OBJECT_NAME(s2.objectid),'Ad-Hoc') AS ProcName,execution_count,
(SELECT TOP 1 SUBSTRING(s2.TEXT,statement_start_offset / 2+1 ,
( (CASE WHEN statement_end_offset = -1
THEN (LEN(CONVERT(NVARCHAR(MAX),s2.TEXT)) * 2)
ELSE statement_end_offset END) - statement_start_offset) / 2+1)) AS sql_statement,
last_execution_time
FROM sys.dm_exec_query_stats AS s1
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS s2 ) x
WHERE sql_statement like '%spt_values%' -- replace here
AND sql_statement NOT like 'SELECT * FROM(SELECT coalesce(object_name(s2.objectid)%'
ORDER BY execution_count DESC
Keep in mind that if you restart the box, this will be cleared out
In Oracle you can use ASH (Active Session History) to find info about SQL that was used. You can also perform code coverage tests with the hierarchical profiler, where you can find which parts of the stored procedures are used or not used.
If you wonder about the updates on table data, you can also use DBA_TAB_MODIFICATIONS. This shows how many inserts, updates, and deletes are done on a table or table partition. As soon as new object statistics are generated, the row for the specified table is removed from DBA_TAB_MODIFICATIONS. You still have help here, since you could also have a peek in the table statistics history. This does not show anything about tables that are only queried. If you really need to know about this, you need to use ASH.
Note, for both ASH and statistics history access, you do need the diagnostics or tuning pack license. (normally you would want this anyway).
If you use triggers you can detect updates, inserts, or deletes on a table.
Detecting read access is probably more difficult.
I use a combination of static analysis in the metadata to determine tables/columns which have no dependencies and runtime traces in SQL Server to see what activity is happening.
Some more queries that might be useful for you.
select * from sys.dm_db_index_usage_stats
select * from sys.dm_db_index_operational_stats(db_id(),NULL,NULL,NULL)
select * from sys.sql_expression_dependencies /*SQL Server 2008 and later*/
The difference between what the first 2 DMVs report is explained well in this blog post.
Ed Elliott's open source tool, SQL Cover, is a good bet and has built-in support for the popular unit testing tool, tSQLt.
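To illustrate the "runtime trace" idea in a form that runs without a server, here is a rough sketch using SQLite's per-connection trace hook from Python. It is not the SQL Server DMV or Profiler mechanism discussed in this thread; attach_usage_counter and the table names are hypothetical, and matching table names with a regular expression is a deliberately crude assumption (it can miscount names that appear in string literals or comments).

```python
import re
import sqlite3
from collections import Counter

def attach_usage_counter(conn, table_names):
    """Count how many executed statements mention each table name."""
    usage = Counter()
    patterns = {
        t: re.compile(rf"\b{re.escape(t)}\b", re.IGNORECASE) for t in table_names
    }

    def on_statement(sql):
        # Called by sqlite3 with the text of each statement that runs
        for table, pattern in patterns.items():
            if pattern.search(sql):
                usage[table] += 1

    conn.set_trace_callback(on_statement)
    return usage

def demo_usage():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER)")
    conn.execute("CREATE TABLE legacy_audit (id INTEGER)")
    # Start counting only after the schema exists
    usage = attach_usage_counter(conn, ["orders", "legacy_audit"])
    conn.execute("INSERT INTO orders VALUES (1)")
    conn.execute("SELECT * FROM orders").fetchall()
    conn.execute("SELECT COUNT(*) FROM orders").fetchall()
    return usage
```

Tables whose counter stays at zero over a representative period are candidates for closer inspection, not proof that they are unused.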

Empty XML Columns during SQL Server replication

We have a merge replication setup on SQL Server that goes like this: 1 SQL server at the office, another SQL server traveling around the world. The publisher is the SQL server at the office.
In about 1% of the cases, two of our tables with a column of XML Data type (not bound to a schema) are replicated with rows containing empty XML columns. ( This only happened when data is sent from the "traveling server" back home, but then again, data seems to be changed more often there ). We only have this in prod. environment ( WAN replication ).
Things I have verified:
The row is replicated, as the last modification date on the row is refreshed but the xml column is empty. Of course it is not empty on the other SQL Server.
No conflicts are displayed in the replication conflicts UI.
It is not caused by the size of the data inside the XML Column as some are very small.
Usually, the problem occurs in batch. ( The xml column of 8-9 consecutive rows will be empty )
The problem occurs if a row was inserted OR updated. No pattern there.
The problem seems to occur when the connection is weaker, but this is pure speculation on my part. ( We've seen this problem happen more often when the server was far away as compared to when it was close by. )
Sorry if I have confused some things. I am not really a DBA, more of a DEV with knowledge of SQL, but since the application using the database keeps getting blamed for the problems ( the XML column must not be empty!! ) I have taken it to heart to try and find the problem instead of just manually patching the data each time ( What's the use of replication if you have to do that? )
If anyone could help out with this problem, or at least suggest some ways of being able to debug / investigate this it would be greatly appreciated.
I did search a lot on Google and I did find this: Hot Fix . But we do have the latest service pack and the problem seems a bit different.
fyi: We have a replication setup locally here but the problem never occurs. We will be trying a WAN simulator on it as well to see if that can help.
Edit: hot fix is now available for my issue:
After logging this issue with Microsoft, we were able to reproduce the problem without a slow link ( Big thanks to the competent escalation engineer at Microsoft ). The repro is a bit different from our scenario, but highlights the timing issue we were getting perfectly.
Create 2 tables – One parent one child (have a PK-FK relationship)
Insert 2 rows in the parent table
Set up replication – configure merge agent to run ON DEMAND
Once all is replicated:
On the PUBLISHER: delete one row from the parent table
On the SUBSCRIBER: Insert 2 rows of data that references the parentid you deleted above
Insert 5 rows of data that references the parentid that will stay in the table
Sync, Merge agent will fail, Sync again, Merge agent will succeed
Missing XML data on the publisher on the 5 rows.
Seems it is a bug that is in SQL Server 2005/2008 and 2008R2.
It will be addressed in a hot fix in 2008 and up. ( As SQL Server 2005 is no longer being altered )
You may want to start out by slapping a bandaid on this perplexing situation to buy some time to fully investigate and fix (or more likely get MS to fix it). SQL Data Compare is an excellent tool that might help.
Figured I'd put an update here as this issue got me a few gray hairs and I am somewhat closer to a solution now.
I finally had some time to work on this and managed to reproduce this issue in our test environment, using a WAN simulator and slowing down the link and injecting some random packet loss. ( to best simulate the production environment where the server is overseas on a really bad line ).
After doing some SQL tracing, and some verbose logging here are my conclusions:
When replicating a row with an XML column, the process is done in 2 steps. First an insert is done of the full row but with an empty string for the XML column. Right after, an update is done, this time with the XML column having data. Since the link is slow, in some situations a foreign key violation occurred.
In this scenario, Table2 depends on Table1. After finishing replicating Table1, and starting to replicate Table2 (enumeration of inserts/updates, which takes time on a slow link), some entries were added to Table1 and Table2. Therefore some inserts on Table2 failed because the Table1 entries were not in the database and were only going to be replicated in the next batch. The next time the replication occurred, no more foreign key violations occurred; however, when it tried to insert the row that had previously failed in Table2 ( XML column row ), the update part of it was missing ( I could see that in the SQL profiler ) and that is why the row ended up with an empty XML column after all was done.
Setting "Enforce for replication" to false on the foreign keys seems to address the problem, however I do still think that this whole process should work with the option set to true.
I logged a support call with Microsoft for this. I have sent the traces and logs to Microsoft and will see what they have to say.
I've read this article: But for me, setting this option to false is kind of a work around, no?
What do you guys think?
ps: Hope this is clear, tried to explain it the best I could. English is not my first language.

How can I efficiently compare my data with a remote database?

I need to update my contacts database in SQL Server with changes made in a remote database (also SQL Server, on a different server on the same local network). I can't make any changes to the remote database, which is a commercial product. I'm connected to the remote database using a linked server. Both tables contain around 200K rows.
My logic at this point is very simple: [simplified pseudo-SQL follows]
/* Get IDs of new contacts into local temp table */
Select remote.ID into #NewContactIDs
From Remote.Contacts remote
Left Join Local.Contacts local on remote.ID=local.ID
Where local.ID is null
/* Get IDs of changed contacts */
Select remote.ID into #ChangedContactIDs
From Remote.Contacts remote
Join Local.Contacts local on remote.ID=local.ID
Where local.ModifyDate < remote.ModifyDate
/* Pull down all new or changed contacts */
Select ID, FirstName, LastName, Email, ...
Into #NewOrChangedContacts
From Remote.Contacts remote
Where remote.ID in (
Select ID from #NewContactIDs
Union
Select ID from #ChangedContactIDs
)
Of course, doing those joins and comparisons over the wire is killing me. I'm sure there's a better way - advice?
Consider maintaining a lastCompareTimestamp (the last time you did the compare) in your local system. Grab all the remote records with ModifyDates > lastCompareTimestamp and throw them in a local temp table. Work with them locally from there.
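A minimal sketch of this timestamp approach, using two in-memory SQLite databases from Python to stand in for the remote and local servers. The contacts schema, the changed_contacts temp table, and the helper pull_changes are hypothetical, and dates are stored as ISO-8601 strings so that plain string comparison orders them correctly.

```python
import sqlite3

def pull_changes(remote, local, last_compare):
    """Fetch remote rows modified after last_compare and apply them locally."""
    # The only query that goes over the wire: a single filtered select
    rows = remote.execute(
        "SELECT id, name, modify_date FROM contacts WHERE modify_date > ?",
        (last_compare,),
    ).fetchall()
    with local:  # one local transaction
        local.execute(
            "CREATE TEMP TABLE IF NOT EXISTS changed_contacts "
            "(id INTEGER PRIMARY KEY, name TEXT, modify_date TEXT)"
        )
        local.execute("DELETE FROM changed_contacts")
        local.executemany("INSERT INTO changed_contacts VALUES (?, ?, ?)", rows)
        # New ids are inserted, existing ids are overwritten
        local.execute(
            "INSERT OR REPLACE INTO contacts (id, name, modify_date) "
            "SELECT id, name, modify_date FROM changed_contacts"
        )
    # Next watermark: newest modify_date seen, or the old one if nothing changed
    new_mark = max((r[2] for r in rows), default=last_compare)
    return len(rows), new_mark

def demo_sync():
    remote = sqlite3.connect(":memory:")
    local = sqlite3.connect(":memory:")
    ddl = "CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, modify_date TEXT)"
    remote.execute(ddl)
    local.execute(ddl)
    remote.executemany("INSERT INTO contacts VALUES (?, ?, ?)", [
        (1, "Ann", "2024-01-01"), (2, "Bob", "2024-02-10"), (3, "Cy", "2024-03-05"),
    ])
    local.executemany("INSERT INTO contacts VALUES (?, ?, ?)", [
        (1, "Ann", "2024-01-01"), (2, "Bobby", "2024-01-15"),
    ])
    local.commit()
    pulled, mark = pull_changes(remote, local, "2024-01-31")
    names = [r[0] for r in local.execute("SELECT name FROM contacts ORDER BY id")]
    return pulled, mark, names
```

The watermark is taken from the remote rows' own ModifyDate values rather than the local clock, which avoids missing rows when the two servers' clocks differ.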
The last compare date is a great idea
One other method I have had great success with is SSIS (though it has a learning curve, and might be overkill unless you do this type of thing a lot):
Make a package
Set a data source for each of the two tables. If you expect a lot of change pull the whole tables, if you expect only incremental changes then filter by mod date. Make sure the results are ordered
Funnel both sets into a Full Outer Join
Split the results of the join into three buckets: unchanged, changed, new
Discard the unchanged records, send the new records to an insert destination, and send the changed records to either a staging table for a SQL-based update, or - for few rows - an OLEDB command with a parameterized update statement.
OR, if on SQL Server 2008, use Merge
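The bucket-splitting step can be illustrated outside SSIS with a few lines of Python. This is only a sketch of the classification logic, not an SSIS package: classify is a hypothetical helper, each side is reduced to a mapping of ID to ModifyDate, and rows that exist only locally are ignored because the steps above do not mention them.

```python
def classify(local_rows, remote_rows):
    """Split remote ids into new, changed, and unchanged buckets.

    Both arguments map id -> modify_date (ISO-8601 strings compare correctly).
    """
    new, changed, unchanged = [], [], []
    for key in sorted(remote_rows):
        if key not in local_rows:
            new.append(key)            # goes to the insert destination
        elif local_rows[key] < remote_rows[key]:
            changed.append(key)        # goes to the update path
        else:
            unchanged.append(key)      # discarded
    return new, changed, unchanged

def demo_classify():
    local_rows = {1: "2024-01-01", 2: "2024-01-15"}
    remote_rows = {1: "2024-01-01", 2: "2024-02-10", 3: "2024-03-05"}
    return classify(local_rows, remote_rows)
```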

SQL query giving wrong result on linked server

I'm trying to pull user data from 2 tables, one locally and one on a linked server, but I get the wrong results when querying the remote server.
I've cut my query down to
select * from SQL2.USER.dbo.people where persId = 475785
for testing and found that when I run it I get no results even though I know the person exists.
(persId is an integer, db is SQL Server 2000 and dbo.people is a table by the way)
If I copy/ paste the query and run it on the same server as the database then it works.
It only seems to affect certain user ids as running for example
select * from SQL2.USER.dbo.people where persId = 475784
works fine for the user before the one I want.
Strangely I've found that
select * from SQL2.USER.dbo.people where persId like '475785'
also works but
select * from SQL2.USER.dbo.people where persId > 475784
brings back records with persIds starting at 22519 not 475785 as I'd expect.
Hope that made sense to somebody
Any ideas ?
Due to internal concerns about making any changes to the live people table, I've temporarily moved my database so they're both on the same server and so the linked server issue doesn't apply. Once the whole lot is migrated to a separate cluster I'll be able to investigate properly. I'll update the question once this happens and I can work my way through all the suggestions. Thanks for your help.
The fact that LIKE operates is not a major clue: LIKE forces integers to string (so you can say WHERE field LIKE '2%' and you will get all records that start with a 2, even when field is of integer type). Your incorrect comparisons would lead me to think your indexes are corrupt, but you say they work when not used via the link... however, the selected index might be different depending on the use? (I seem to recall an instance when I had duplicate indexes and only one was stale, although that was too long ago to recall the exact cause).
Nevertheless, I would try rebuilding your index using the DBCC DBREINDEX (tablename) command. If it turns out that doing so fixes your query, you may want to rebuild them all: here is a script for rebuilding them all easily.
Is dbo.people a table or a view? I've seen something similar where the underlying table schema had been changed and dropping and recreating the view fixed the problem, although the fact that the query works if run directly on the linked server does indicate something index based..
Is the linked server using the same collation? Depending on the index used, I could see something like this perhaps happening if the servers were not collation compatible, but the linked server was set up with collation compatible (which tells Sql Server it can run the query on the remote server).
I would check the following:
Check your definition on the linked server, and confirm that SQL2 is the server you expect it to be
Check and compare the execution plans both from the remote and local servers
Try linking by IP address rather than name, to ensure you have the proper machine
Put the code into a stored procedure on the remote machine, and try calling that instead
Sounds like a bug to me - I've read of some issues along these lines, but can't remember specifically what. What version of SQL Server are you running?
select * from SQL2.USER.dbo.people where persId = 475785
For a persId which fails, how does the equivalent OPENQUERY call behave?
SELECT * FROM OpenQuery(SQL2, 'SELECT * FROM USER.dbo.people WHERE persId = 475785')