How to check if a set of rows already exists in the database and skip migrating them? - sql

I need to create a package to migrate a large amount of data from a database table into a different database table. The source table will continuously receive new data every 4-5 days, so I will run my package again and again.
I need to migrate all data from this table to another table, but I don't want to migrate the data that I have already migrated. What kind of transformation do I need to use, or what SQL command do I need to write, to do this?

The usual way this is done is by having "audit" timestamps on the source table and migrating only records updated or inserted after the last migration.
for example:
Table Sales
sale_id
sale_date
sale_amount
...............
dw_create_date
dw_update_date
Your source extraction could be something along the lines of..
select sales.sale_id,
sales.sale_date,
....
from sales
where dw_update_date > {last_migration_date}
last_migration_date is usually read from a config file or table.
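For example, a minimal sketch of the control-table pattern (the etl_control table and its columns are hypothetical names used purely for illustration):
-- read the high-water mark recorded by the previous run
select last_migration_date
from etl_control
where job_name = 'sales_migration';

-- extract only rows touched since that point
select sale_id, sale_date, sale_amount
from sales
where dw_update_date > :last_migration_date;

-- after a successful load, advance the high-water mark
update etl_control
set last_migration_date = :current_extract_time
where job_name = 'sales_migration';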
Other approaches
There are a few other approaches that you could use, but all of these have bigger performance problems as your data size grows.
1) Do a (source - target) diff to get new or changed rows in the source.
select *
from source
minus
select * from target
You could do the same using a join between source and target.
select src.*
from source src
left join target tgt on (src.id = tgt.id)
where (tgt.id is null or
src.column1 <> tgt.column1 or
src.column2 <> tgt.column2
............
)
(The tgt.id is null condition picks up rows that are brand new in the source.)
Note that neither of these approaches takes care of deletes in the source. If you want the tables to be in sync, the only way to do that is to do a (source - target) to get insert/update changes and a (target - source) to get deleted rows, and apply both to the target.
2) Insert and ignore the primary key constraint error:
This has serious issues if the data can change in the source and you want the updates propagated to the target. You'd also be querying the entire source each time. It is usually better to use Merge/Upsert along with filtered source data instead.
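For example, a minimal Merge/Upsert sketch (table and column names are illustrative, and the syntax shown is Oracle/SQL Server-style MERGE):
merge into target tgt
using (
    select sale_id, sale_date, sale_amount
    from sales
    where dw_update_date > :last_migration_date   -- filtered source data
) src
on (tgt.sale_id = src.sale_id)
when matched then
    update set tgt.sale_date = src.sale_date,
               tgt.sale_amount = src.sale_amount
when not matched then
    insert (sale_id, sale_date, sale_amount)
    values (src.sale_id, src.sale_date, src.sale_amount);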

I would assume both tables have some unique identifier, no?
Table A has:
1
2
3
4
You're moving that to Table B, but keeping the data in Table A at the same time, yes?
So you've run your job once. Now Table B has:
1
2
3
4
Table A gets updated. It now has:
1
2
3
4
5
6
7
You run your job again, but you only want to send over 5,6,7.
SELECT TableA.*
FROM TableA
LEFT OUTER JOIN TableB ON TableA.ID = TableB.ID
WHERE TableB.ID IS NULL
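That query returns only the rows of TableA that are not yet in TableB; wrapped in an INSERT it becomes the incremental load (a sketch, the column list is illustrative):
INSERT INTO TableB (ID, Col1, Col2)
SELECT TableA.ID, TableA.Col1, TableA.Col2
FROM TableA
LEFT OUTER JOIN TableB ON TableA.ID = TableB.ID
WHERE TableB.ID IS NULL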
If you have some sample data it would help. Does this give you a good idea?
See joins: http://i.stack.imgur.com/1UKp7.png

Related

How to get the differences between two - kind of - duplicated tables (sql)

Prolog:
I have two tables in two different databases, one an updated version of the other. For example, imagine that one year ago I duplicated table 1 into the new db (as table 2), and from then on I worked on table 2, never updating table 1.
I would like to compare the two tables to get the differences that have accumulated over this period of time (the tables have preserved their structure, so the comparison is meaningful).
My way of proceeding was to create a third table, copy both table 1 and table 2 into it, and then count the number of repetitions of every entry.
In my opinion this, together with a new attribute that specifies for each entry which table it came from, would do the job.
Problem:
Copying the two tables into the third table I get the (obvious) error that there are duplicate key values in a unique or primary key constraint.
How could I bypass the error, or how could I do the same job better? Any idea is appreciated.
Something like this should do what you want if A and B have the same structure, otherwise just select and rename the columns you want to compare....
SELECT
B.*
FROM
B
WHERE NOT EXISTS (SELECT 1 FROM A WHERE A.col = B.col AND ....)
If NOT EXISTS doesn't work in your DBMS, you could also use a left outer join comparing the rows' column values.
SELECT
A.*
from
A left outer join B
on A.col = B.col and ....
where B.col is null
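Alternatively, since the two tables preserved the same structure, the differences in both directions can be listed with set operations; a minimal sketch (MINUS is Oracle's keyword, use EXCEPT in SQL Server/PostgreSQL):
-- rows in table 2 that are new or changed compared to table 1
SELECT * FROM table2
MINUS
SELECT * FROM table1;

-- rows in table 1 that were removed or changed in table 2
SELECT * FROM table1
MINUS
SELECT * FROM table2;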

Limit Rows in ETL Without Date Column for Cue

We have two large tables (Clients and Contacts) which undergo an ETL process every night, being inserted into a single "People" table in the data warehouse. This table is used in many places and cannot be significantly altered without a lot of work.
The source tables are populated by third-party software; we used to assume that we could identify the rows that had been updated since last night by using the "UpdateDate" column in each, but we recently identified some rows that were not touched by the ETL because the "UpdateDate" column was not behaving as we had thought; the software company does not see this as a bug, so we have to live with this fact.
As a result, we now take all source rows, transform them into a temp staging table and then MERGE that into the data warehouse, using the MERGE to identify any changed values. We have noticed that this process is taking too long on some days and would like to limit the number of rows that the ETL process looks at, as we believe that the reason for the hold-up is principally the sheer volume of data that is examined and stored in the temp database. We can see no way to look purely at the source data and identify when each row last changed.
Here is a simplified pseudocode of the ETL stored procedure, although what the procedure actually does is not really relevant to the question (included just in case you disagree with me!)
CREATE #TempTable (ClientOrContact BIT NOT NULL, Id INT NOT NULL, [Some_Other_Columns])
INSERT #TempTable
SELECT 1 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ClientsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
INSERT #TempTable
SELECT 0 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ContactsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
ALTER #TempTable ADD PRIMARY KEY (ClientOrContact, Id);
MERGE Target_PeopleTable AS Tgt
USING (SELECT [SomeColumns] FROM #TempTable JOIN [SomeOtherTables]) AS Src
ON Tgt.ClientOrContact = Src.ClientOrContact AND Tgt.Id = Src.Id
WHEN MATCHED AND NOT EXISTS (SELECT Tgt.* INTERSECT SELECT Src.*)
THEN UPDATE SET ([All_NonKeyTargetColumns] = [All_NonKeySourceColumns])
WHEN NOT MATCHED BY Target THEN INSERT [All_TargetColumns] VALUES [All_SourceColumns]
OUTPUT $Action INTO #Changes;
RETURN COUNT(*) FROM #Changes;
GO
The source tables have about 1.5M rows each, but each day only a relatively small number of rows are inserted or updated (never deleted). There are about 50 columns in each table, of those, about 40 columns can have changed values each night. Most columns are VARCHAR and each table contains an independent incremental primary key column. We can add indexes to the source tables, but not alter them in any other way (They have already been indexed by a predecessor) The source tables and target table are on the same server, but different databases. Edit: The Target Table has a composite primary key on the ClientOrContact and Id columns, matching that shown on the temp table in the script above.
So, my question is this - please could you suggest any general possible strategies that might be useful to limit the number of rows we look at or copy across each night? If we only touched the rows that we needed to each night, we would be touching less than 1% of the data we do at the moment...
Before you try the following suggestion, one thing to check is that Target_PeopleTable has an index or primary key on the Id column. It probably does, but without schema information I am making no assumptions, and this alone might speed up the merge stage.
As you've identified if you could somehow limit the records in TempTable to just the changed rows then this could offer a performance win for the actual MERGE statement (depending on how expensive determining just the changed rows is).
As a general strategy I would consider some kind of checksum to try to identify the changed records only. The T-SQL CHECKSUM function can calculate a checksum across the required columns by passing them as a comma-separated list, or you can use the related BINARY_CHECKSUM function.
Since you cannot change the source schema, you would have to maintain a list of record ids and associated checksums in your target database, so that you can readily compare the source checksums to the target checksums from the last run in order to identify a difference.
You can then insert into the temp table only where there is a checksum difference between source and target, or where the id does not exist in the target db.
This might just be moving the performance problem to the temp insert part but I think it's worth a try.
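A minimal T-SQL sketch of the idea (the Etl_RowChecksums table and its columns are hypothetical, not part of your schema):
-- persisted in the target database, refreshed after every successful run
-- CREATE TABLE Etl_RowChecksums (ClientOrContact BIT NOT NULL, Id INT NOT NULL,
--                                RowChecksum INT NOT NULL,
--                                PRIMARY KEY (ClientOrContact, Id));

INSERT #TempTable (ClientOrContact, Id, [Some_Other_Columns])
SELECT 1, C.Id, [SomeColumns]
FROM Source_ClientsTable C
LEFT JOIN Etl_RowChecksums x
       ON x.ClientOrContact = 1 AND x.Id = C.Id
WHERE x.Id IS NULL                                          -- new row
   OR x.RowChecksum <> CHECKSUM(C.Col1, C.Col2 /* ... */);  -- changed row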
Have you considered triggers? I avoid them like the plague, but they really are the solution to some problems.
Put an INSERT/UPDATE [/DELETE?] trigger on your two source tables. Program it such that when rows are added or updated, the trigger logs the IDs of those rows in an audit table (which you'll have to create), where that table contains the ID, the type of change (update or insert, plus delete if you have to worry about those) and when the change was made. When you run the ETL, join this list of "to be merged" items with the source tables. When you're done, clear the audit table so it's reset for the next run. (Use the "added on" datetime column to make sure you don't delete rows that may have been added while the ETL was running.)
There are lots of details behind proper use and implementation, but overall this idea should do what you need.
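A rough T-SQL sketch of such a trigger (the EtlChangeLog audit table and the trigger name are hypothetical; you would need one trigger per source table):
-- CREATE TABLE EtlChangeLog (SourceTable VARCHAR(20), Id INT, ChangeType CHAR(1),
--                            AddedOn DATETIME2 DEFAULT SYSDATETIME());

CREATE TRIGGER trg_SourceClients_Audit
ON Source_ClientsTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- log the key of every inserted/updated row, marking whether it was an insert or an update
    INSERT EtlChangeLog (SourceTable, Id, ChangeType)
    SELECT 'Clients', i.Id,
           CASE WHEN EXISTS (SELECT 1 FROM deleted d WHERE d.Id = i.Id) THEN 'U' ELSE 'I' END
    FROM inserted i;
END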

Merge Statement VS Lookup Transformation

I am stuck with a problem with different views.
Present Scenario:
I am using SSIS packages to get data from Server A to Server B every 15 minutes. I created 10 packages for 10 different tables, and also created 10 staging tables for the same. In the Data Flow Task it selects data from Server A with an ID greater than the last imported ID and dumps it into a staging table (each table has its own staging table). After the Data Flow Task I am using a MERGE statement to merge records from the staging table into the destination table when the ID is NOT MATCHED.
Problem:
This takes care of all newly inserted records, but once a record has been picked up by the SSIS job and is later updated at the source, I am not able to pick it up again and grab the updated data.
Questions:
How will I be able to achieve the update without impacting the source database server too much?
Do I use a MERGE statement and select 10,000 records every single run (every 15 minutes)?
Do I use a Lookup transformation to do the updates?
Some tables have more than 2 million records and growing, so what is the best approach for them?
NOTE:
I can truncate tables in destination and reinsert complete data for the first run.
Edit:
The source has a column 'LAST_UPDATE_DATE' which I can use in my query.
If I'm understanding your statements correctly, it sounds like you're pretty close to your solution. If you currently have a merge statement that includes the insert (where the source does not match the destination), you should be able to easily include the update logic for the case where the source matches the destination.
example:
MERGE target_table AS destination_table_alias
USING (
    SELECT <column_name(s)>
    FROM source_table
) AS source_alias
ON
    source_alias.[table_identifier] = destination_table_alias.[table_identifier]
WHEN MATCHED THEN UPDATE
    SET destination_table_alias.column_name1 = source_alias.column_name1,
        destination_table_alias.column_name2 = source_alias.column_name2
WHEN NOT MATCHED THEN
    INSERT ([column_name1], [column_name2])
    VALUES (source_alias.[column_name1], source_alias.[column_name2]);
So, to your points:
Update can be achieved via the 'WHEN MATCHED' logic within the merge statement
If you have the last ID of the table that you're loading, you can include this as a filter on your select statement so that the dataset is incremental.
No Lookup transformation is needed when the 'WHEN MATCHED' clause is utilized.
For the larger tables, utilize a filter (e.g. on LAST_UPDATE_DATE) in the SELECT portion of the merge statement, as sketched below.
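A minimal sketch of that filtered source (the @LastRunTime variable is illustrative; it assumes the LAST_UPDATE_DATE column you mentioned):
MERGE target_table AS tgt
USING (
    SELECT <column_name(s)>
    FROM source_table
    WHERE LAST_UPDATE_DATE > @LastRunTime   -- only rows touched since the last run
) AS src
ON src.[table_identifier] = tgt.[table_identifier]
WHEN MATCHED THEN UPDATE
    SET tgt.column_name1 = src.column_name1
WHEN NOT MATCHED THEN
    INSERT ([table_identifier], [column_name1])
    VALUES (src.[table_identifier], src.[column_name1]);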
Hope this helps

Creating history of flows_030100.wwv_flow_activity_log

Quick Version: I have 4 tables (TableA, TableB, TableC, TableD) identical in design. TableC is a complete History of TableA & B. I want to periodically update TableC with new data from TableA & B. TableD contains a copy of the row most recently transferred from A/B to C. I need to select all records from TablesA/B that are more recent than the record in TableD. Any advice?
Long Version: I'm trying to ETL (Extract, Transform, Load) some information from a few different tables into some other tables for quicker, easier reporting... kind of like a data warehouse but within the same database (don't ask).
Basically we want to record and report on system performance. Oracle has logs for this in the tables flows_030100.wwv_flow_activity_log1$ and flows_030100.wwv_flow_activity_log2$ - I believe these tables are filled and cleared every two weeks or so...
I have created a table:
CREATE TABLE dw_log_hist AS
SELECT * FROM flows_030100.wwv_flow_activity_log WHERE 1=0
and filled it with the current information:
INSERT INTO dw_log_hist
SELECT *
FROM flows_030100.wwv_flow_activity_log1$
INSERT INTO dw_log_hist
SELECT *
FROM flows_030100.wwv_flow_activity_log2$
HOWEVER, these log files record EVERY click in the APEX screens. As such, they are continually growing.
I want to periodically update my DW_Log_Hist table with only new information (I am fully aware my history table will grow to be ridiculously sized but I'll deal with that later).
Unfortunately, these tables have no primary key, so I've had to create another table to store marker records that will tell me the latest logs I copied over -_-
CREATE TABLE dw_log_temp AS
SELECT * FROM flows_030100.wwv_flow_activity_log
WHERE time_stamp = (SELECT MAX (time_stamp)
FROM flows_030100.wwv_flow_activity_log2$)
NOW THEN after all that waffle... this is what I need your help with:
Does anyone know whether one of the log tables (wwv_flow_activity_log1$ or wwv_flow_activity_log2$) always has the latest logs? Is it a case of log1$ filling up, log2$ filling then log1$ being overwritten with log2$ so that log2$ always has the latest data? Or do they both fill up and then get filled up again?
Can anyone advise how I would go about populating the DW_Log_Hist table using the DW_Log_Temp marker records?
Conceptually it would be something like:
insert everything into dw_log_hist from activity_log1$ and activity_log2$ where the time_stamp is > (time_stamp of the record in dw_log_temp)
Super sorry for such a long post.
Got the answer :-)
A chap on Reddit helped me realise my over-complication...
insert into dw_log_hist
select *
from flows_030100.wwv_flow_activity_log1$
where time_stamp > (select max(time_stamp)
from dw_log_hist)
union
select *
from flows_030100.wwv_flow_activity_log2$
where time_stamp > (select max(time_stamp)
from dw_log_hist)
Hurrah! Always feel like such an idiot when you see the simple answer...

update data from one table to another (in a database)

DB gurus,
I am hoping someone can set me in the right direction.
I have two tables. Table A and Table B. When the system comes up, all entries from Table A are massaged and copied over to Table B (according to Table B's schema). Table A can have tens of thousands of rows.
While the system is up, Table B is kept in sync with Table A via DB change notifications.
If the system is rebooted, or my service restarted, I want to re-initialize Table B. However, I want to do this with the least possible DB updates. Specifically, I want to:
add any rows that are in Table A, but not in Table B, and
delete any rows that are not in Table A, but are in Table B
any rows that are common to Table A and Table B should be left untouched
Now, I am not a "DB guy", so I am wondering what the conventional way of doing this is.
Use exists to keep processing to a minimum.
Something along these lines, modified so the joins are correct (also verify that I didn't do something stupid and get TableA and TableB backwards from your description):
insert into TableB
select
*
from
TableA a
where
not exists (select 1 from TableB b where b.ID = a.ID)
delete from
TableB b
where
not exists (select 1 from TableA a where a.ID = b.ID)
Informix's Enterprise Replication features would do all this for you. ER works by shipping the logical logs from one server to another, and rolling them forward on the secondary.
You can configure it to be as finely-grained as you need (ie just a handful of tables).
You use the term "DB change notifications" - are you already using ER or is this some trigger-based arrangement?
If for some reason ER can't work for your configuration, I would suggest rewriting the notifications model to behave asynchronously, ie:
write notifications to a table in server 'A' that contains a timestamp or serial field
create a table on server 'B' that stores the timestamp/serial value of the last processed record
run a daemon process on server 'B' that:
compares 'A' and 'B' timestamps/serials
selects 'A' records between 'A' and 'B' timestamps
processes those records into 'B'
update 'B' timestamp/serial
sleep for appropriate time-period, and loop
So Server 'B' is responsible for ensuring its copy is in sync with 'A'. 'A' is not inconvenienced by 'B' being unavailable.
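A minimal SQL sketch of the queries behind that loop (the change_queue and sync_watermark tables are hypothetical names for illustration):
-- on server 'A': notifications written with an ever-increasing serial
-- CREATE TABLE change_queue (serial_id SERIAL, row_id INT, changed_at DATETIME YEAR TO SECOND);

-- on server 'B': remember the last serial processed
-- CREATE TABLE sync_watermark (last_serial INT);

-- daemon on 'B': fetch everything newer than the watermark
SELECT q.serial_id, q.row_id, q.changed_at
FROM change_queue q
WHERE q.serial_id > (SELECT last_serial FROM sync_watermark)
ORDER BY q.serial_id;

-- ...process those rows into B, then advance the watermark
UPDATE sync_watermark SET last_serial = :max_serial_processed;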
A simple way would be to use a historic table where you would put the changes from A that happened since the last update, and use that table to sync table B instead of doing a direct copy from A to B. Once the sync is done, you clear the historic table and start anew.
What I don't understand is how table A can be updated and not B if your service or computer is not running. Are they found on 2 different databases or servers?
Join the data from both tables on the common columns; this gives you the rows that have a match in both tables, i.e. data that is in A and in B. Then use these values (let's call this set M) with set operations, i.e. set-minus operations, to get the differences.
first requirement: A minus M
second requirement: B minus A
third requirement: M
Do you get the idea?
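A minimal sketch of those set operations, assuming the two tables are column-compatible (MINUS is the Oracle keyword; other DBMSs call it EXCEPT):
-- first requirement: rows to insert into B (in A but not in B)
SELECT * FROM TableA
MINUS
SELECT * FROM TableB;

-- second requirement: rows to delete from B (in B but not in A)
SELECT * FROM TableB
MINUS
SELECT * FROM TableA;

-- third requirement: set M, the rows common to both - leave these untouched
SELECT * FROM TableA
INTERSECT
SELECT * FROM TableB;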
I am a SQL Server guy, but since SQL Server 2008, for this kind of operation, a feature called MERGE is available.
By using the MERGE statement we can perform insert, update and delete operations in a single statement.
So I googled and found that Informix also supports the same MERGE statement, but I am not sure whether it takes care of delete too, though insert and update are taken care of. Moreover, this statement handles the transaction by itself.
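For reference, a minimal SQL Server-style sketch of a MERGE covering all three operations (the WHEN NOT MATCHED BY SOURCE ... DELETE clause is SQL Server syntax; check whether your Informix version supports an equivalent):
MERGE TableB AS b
USING TableA AS a
    ON b.ID = a.ID
WHEN MATCHED AND (b.Col1 <> a.Col1 OR b.Col2 <> a.Col2) THEN
    UPDATE SET b.Col1 = a.Col1, b.Col2 = a.Col2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, Col1, Col2) VALUES (a.ID, a.Col1, a.Col2)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;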