Hi, I am looking for a solution to update values in table1 only if the values have changed; that is, compare against tableb and update only the changed values.
An alternative to Neil's solution is to use BINARY_CHECKSUM, store that in a column in your table, and then compare against it.
I'm not saying it's a better solution, just giving you some options.
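A rough sketch of what that could look like, assuming a RowChecksum column added to TableA (the column name is my assumption; the other names are reused from the query below):

ALTER TABLE TableA ADD RowChecksum int NULL;

-- Refresh only the rows whose stored checksum no longer matches the incoming values.
UPDATE a
SET IntCol = b.IntCol
   ,varcharCol = b.varcharCol
   ,DatetimeCol = b.DatetimeCol
   ,RowChecksum = BINARY_CHECKSUM(b.IntCol, b.varcharCol, b.DatetimeCol)
FROM TableA a
INNER JOIN TableB b ON a.pk = b.pk
WHERE a.RowChecksum IS NULL
   OR a.RowChecksum <> BINARY_CHECKSUM(b.IntCol, b.varcharCol, b.DatetimeCol);

Bear in mind that BINARY_CHECKSUM can return the same value for different inputs, so you trade a small risk of missed updates for the cheaper comparison.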
For multiple rows at a time, try:
UPDATE a
SET IntCol = b.IntCol
   ,varcharCol = b.varcharCol
   ,DatetimeCol = b.DatetimeCol
FROM TableA a
INNER JOIN (SELECT pk, IntCol, varcharCol, DatetimeCol FROM TableB
            EXCEPT
            SELECT pk, IntCol, varcharCol, DatetimeCol FROM TableA
           ) b -- only the TableB rows whose values differ from TableA
    ON a.pk = b.pk
You could use a trigger on the source table that updates the target table.
However, if there's a large volume of changes, that could slow inserts/updates on the source quite badly. In that case, I'd make the trigger insert into a third table instead. A scheduled job could then process that table and delete the records it has handled (with appropriate use of transactions, of course).
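A minimal sketch of that queue-table variant (all names here are assumptions, not from the question):

CREATE TABLE SourceChangeQueue
(
    pk int NOT NULL,
    QueuedAt datetime NOT NULL DEFAULT GETDATE()
);
GO

CREATE TRIGGER trg_SourceTable_Queue ON SourceTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Just record which rows changed; the scheduled job applies them to the
    -- target table and then deletes the processed queue rows in a transaction.
    INSERT INTO SourceChangeQueue (pk)
    SELECT pk FROM inserted;
END;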
An entirely different approach would be to move the triggering up one layer into your application, and use a message-based approach. In this situation you get the best of both worlds, because the process listening for messages will process them in order as fast as it can, leading to almost real-time updates of the target table.
So you can have your cake, and eat it.
I have a small table with only a few records (~20) that I'm using as a cache for a really expensive query that returns an array of strings. The table contains a single text column and is updated every few minutes. The array is updated in memory, and most of the time only one of the records changes.
I'm thinking of using a transaction with something like:
INSERT INTO cache(id)
SELECT unnest($1::text[])
ON CONFLICT DO NOTHING;

DELETE FROM cache
WHERE id NOT IN (SELECT unnest($1::text[]));
but I have a feeling that I might as well just delete everything and insert it again, since it's such a small table. Another option would be to try and combine the queries with a CTE or something.
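For reference, the combined version I have in mind would look roughly like this (untested, reusing the cache table and the $1 parameter from above):

WITH new_ids AS (
    SELECT unnest($1::text[]) AS id
),
deleted AS (
    -- Data-modifying CTEs run exactly once, even if the outer statement never reads them.
    DELETE FROM cache
    WHERE id NOT IN (SELECT id FROM new_ids)
)
INSERT INTO cache (id)
SELECT id FROM new_ids
ON CONFLICT DO NOTHING;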
What's the best practice? Thanks!
There is a stored procedure that needs to be modified to eliminate a call to another server.
What is the easiest and most feasible way to do this so that the final SP's execution time is faster? Preference goes to solutions that don't involve much change to the application.
E.g.:
select *
from dbo.table1 a
inner join server2.dbo.table2 b on a.id = b.id
Cross server JOINs can be problematic as the optimiser doesn't always pick the most effective solution, which may even result in the entire remote table being dragged over your network to be queried for a single row.
Replication is by far the best option, if you can justify it. This will mean you need to have a primary key on the table you want to replicate, which seems a reasonable constraint (ha!), but might become an issue with a third-party system.
If the remote table is small then it might be better to take a temporary local copy, e.g. SELECT * INTO #temp FROM server2.<database>.dbo.table2;. Then you can change your query to something like this: select * from dbo.table1 a inner join #temp b on a.id = b.id;. The temporary table is dropped automatically when your session ends, so there's no need to tidy up after yourself.
If the table is larger then you might want to do the above, but also add an index to your temporary table, e.g. CREATE INDEX ix$temp ON #temp (id);. Note that if you use a named index then you will have issues if you run the same procedure twice simultaneously, as the index name won't be unique. This isn't a problem if the execution is always in series.
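Putting those steps together inside the procedure would look roughly like this (a sketch only; <database> is the placeholder from above and the column list is whatever your real query needs):

-- Take a local copy of the small remote table...
SELECT *
INTO #temp
FROM server2.<database>.dbo.table2;

-- ...optionally index it if it's on the larger side...
CREATE INDEX ix$temp ON #temp (id);

-- ...then do the join entirely locally.
SELECT *
FROM dbo.table1 a
INNER JOIN #temp b ON a.id = b.id;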
If you have a small number of ids that you want to include then OPENQUERY might be the way to go, e.g. SELECT * FROM OPENQUERY(server2, 'SELECT * FROM table2 WHERE id IN (''1'', ''2'')');. The advantage here is that the query now runs on the remote server, so it's more likely to use an efficient query plan.
The bottom line is that if you expect to be able to JOIN a remote and local table then you will always have some level of uncertainty; even if the query runs well one day, it might suddenly decide to run a LOT slower the following day. Small things, like adding a single row of data to the remote table, can completely change the way the query is executed.
I need to be able to repeatedly process an XML file and insert large amounts of data into an Oracle database. The procedure needs to be able to create new records, or update existing ones if data already exists.
I can think of two ways to process inserting/updating 100,000 records into an Oracle database. But which is the better method? Or is there another way?
1. Attempt the INSERT. If there's no exception, the insert worked and all is good. If there is an exception, catch it and do an UPDATE instead (sketched below).
2. Look up the record first (SELECT). If it's not found, do an INSERT. If it is found, do an UPDATE.
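For illustration, method 1 usually looks something like this in PL/SQL (table, column, and bind names are placeholders, not from my actual schema):

BEGIN
    -- Try the insert first.
    INSERT INTO target_table (record_key, payload)
    VALUES (:record_key, :payload);
EXCEPTION
    WHEN DUP_VAL_ON_INDEX THEN
        -- The key already exists, so update instead.
        UPDATE target_table
        SET payload = :payload
        WHERE record_key = :record_key;
END;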
Obviously, if the Oracle table is empty then the first method saves time by forgoing the lookups. But if the file was previously imported, and someone then changes a few lines and re-imports it, the number of exceptions generated becomes huge.
The second method takes longer on an empty database because of the lookups, but it does not generate expensive exceptions during subsequent imports.
Is there a "normal" pattern for working with data like this?
Thanks!
I don't know what 'the' pattern is, but if you are generating the statement anyway, then maybe you can generate a union of SELECT ... FROM DUAL queries that contains all the data from the XML file. Then you can wrap that select in a MERGE INTO statement, so your SQL looks something like:
MERGE INTO YourTable t
USING (
    SELECT 'Val1FromXML' AS SomeKey, 'Val2FromXML' AS ExtraField, 'Val3FromXML' AS OtherField FROM DUAL
    UNION ALL
    SELECT 'Val1FromRow2' AS SomeKey, 'Val2FromXML' AS ExtraField, 'Val3FromXML' AS OtherField FROM DUAL
    ...) x ON (x.SomeKey = t.SomeKey)
WHEN MATCHED THEN
    UPDATE SET
        t.ExtraField = x.ExtraField,
        t.OtherField = x.OtherField
WHEN NOT MATCHED THEN
    INSERT (SomeKey, ExtraField, OtherField) VALUES (x.SomeKey, x.ExtraField, x.OtherField)
The advantage of this approach is that it's only one statement, so it saves the overhead of initializing a statement for each row. Also, as a single statement it will either completely fail or completely succeed, which is what you would otherwise accomplish with a transaction.
And that's a pitfall as well. For an import like this, you may want to do only a limited number of rows at a time and then commit. That way you don't lock a large part of the table for too long, and you can break the import and continue later. But fortunately, it should be pretty easy to generate a MERGE INTO statement for a limited number of rows too, by simply putting no more than, say, 500 rows in the unioned select-from-duals.
The "normal" pattern would be to wrap your file with an external table and then perform an upsert via the merge keyword.
Depending on your hardware, loading the file into a staging table via SQL*Loader can be much faster than using an external table.
Edit: I just realized you're processing a file and not trying to load it directly. GolezTrol's answer is a good way to deal with the rows you're generating. If there's a huge number of them, though, I would still recommend populating a staging table and considering loading it separately via SQL*Loader instead of with a massive SQL statement.
I'm updating a table from another table in the same database - what would be the best way to count the updates/inserts?
I can think of a couple of methods:
1. Join the tables and count (an inner join for updates, a left join with a WHERE ... IS NULL for inserts), then perform the update/insert; see the sketch after this list.
2. Use the modification date in the target table (this is maintained correctly) and do a count where the mod date has changed. This would have to be done after the update, and before and after the insert... I'm sure you get the idea.
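E.g. for method 1 I'd do something like this (table and key names are just placeholders):

-- Rows already present in the target will be updated:
SELECT COUNT(*) AS update_count
FROM TargetTable t
INNER JOIN SourceTable s ON s.id = t.id;

-- Rows missing from the target will be inserted:
SELECT COUNT(*) AS insert_count
FROM SourceTable s
LEFT JOIN TargetTable t ON t.id = s.id
WHERE t.id IS NULL;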
Currently I use method two, as I thought it might be faster not having to join the tables, and the modification timestamp data is there anyway.
What are people's thoughts on this? (I wanted to tag this best-practice, but that tag seems to have disappeared.)
EDITED: Sorry, I should have been more specific about the scenario - assume only one concurrent update (this is to update an archive/warehouse overnight), and the SSIS provider we're using won't return the number of rows updated.
I'd probably keep using the second option. If you know you're running daily (or at some other regular interval), you could just test for all modifications (updates and inserts) based on the day (or whichever datepart matches your interval) in your timestamp column.
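Something like this, as a minimal sketch (the ModifiedDate column and the nightly run are assumptions on my part):

-- Everything touched today counts as tonight's changes.
SELECT COUNT(*) AS changed_rows
FROM TargetTable
WHERE DATEDIFF(day, ModifiedDate, GETDATE()) = 0;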
This way you'll not have to rewrite your update-testing procedure when your inserts/joins/other requirements change.
(Of course, you're vulnerable to changes by other agents).
You could also introduce a 'helper column' where you set a unique update value, but that smells fishy.
I have a statement that looks something like this:
MERGE INTO someTable st
USING
(
SELECT id,field1,field2,etc FROM otherTable
) ot on st.field1=ot.field1
WHEN NOT MATCHED THEN
INSERT (field1,field2,etc)
VALUES (ot.field1,ot.field2,ot.etc)
where otherTable has an autoincrementing id field.
I would like the insertion into someTable to be in the same order as the id field of otherTable, such that the order of ids is preserved when the non-matching fields are inserted.
A quick look at the docs would appear to suggest that there is no feature to support this.
Is this possible, or is there another way to do the insertion that would fulfil my requirements?
EDIT: One approach to this would be to add an additional field to someTable that captures the ordering. I'd rather not do this if possible.
... upon reflection the approach above seems like the way to go.
I cannot speak to what the Questioner is asking for here because it doesn't make any sense.
So let's assume a different problem:
Let's say, instead, that I have a Heap-Table with no Identity-Field, but it does have a "Visited" Date field.
The Heap-Table logs Person WebPage Visits and I'm loading it into my Data Warehouse.
In this Data Warehouse I'd like to use the Surrogate-Key "WebHitID" to reference these relationships.
Let's use Merge to do the initial load of the table, then continue calling it to keep the tables in sync.
I know that if I'm inserting records into a table, then I'd prefer the IDs (that are being generated by an Identity-Field) to be sequential based on whatever Order-By I choose (let's say the "Visited" Date).
It is not uncommon to expect an Integer-ID to correlate to when it was created relative to the rest of the records in the table.
I know this is not always 100% the case, but humor me for a moment.
This is possible with Merge.
Using (what feels like a hack) TOP will allow for Sorting in our Insert:
MERGE DW.dbo.WebHit AS Target --This table has an Identity Field called WebHitID.
USING
(
    SELECT TOP 9223372036854775807 --Biggest BigInt (to be safe).
           PWV.PersonID, PWV.WebPageID, PWV.Visited
    FROM ProdDB.dbo.Person_WebPage_Visit AS PWV
    ORDER BY PWV.Visited --Works only with TOP when inside a MERGE statement.
) AS Source
   ON Source.PersonID = Target.PersonID
  AND Source.WebPageID = Target.WebPageID
  AND Source.Visited = Target.Visited
WHEN NOT MATCHED BY Target THEN --Not in Target-Table, but in Source-Table.
    INSERT (PersonID, WebPageID, Visited) --This Insert populates our WebHitID.
    VALUES (Source.PersonID, Source.WebPageID, Source.Visited)
WHEN NOT MATCHED BY Source THEN --In Target-Table, but not in Source-Table.
    DELETE --In case our WebHit log in Prod is archived/trimmed to save space.
;
You can see I opted to use TOP 9223372036854775807 (the biggest Integer there is) to pull everything.
If you have the resources to merge more than that, then you should be chunking it out.
While this screams "hacky workaround" to me, it should get you where you need to go.
I have tested this on a small sample set and verified it works.
I have not studied the performance impact of it on larger complex sets of data though, so YMMV with and without the TOP.
Following up on MikeTeeVee's answer: using TOP will allow you to ORDER BY within a sub-query; however, instead of TOP 9223372036854775807 I would go with
SELECT TOP 100 PERCENT
You're unlikely to ever reach that number, but this way it just makes more sense and looks cleaner.
Why would you care about the order of the IDs matching? What difference would that make to how you query the data? Related tables should be connected through primary and foreign keys, not the order the records were inserted in. Tables are not inherently ordered in any particular way in databases; order should come from the ORDER BY clause.
More explanation of why you want to do this might help us steer you towards an appropriate solution.