DB gurus,
I am hoping someone can set set me on the right direction.
I have two tables. Table A and Table B. When the system comes up, all entries from Table A are massaged and copied over to Table B (according to Table B's schema). Table A can have tens of thousands of rows.
While the system is up, Table B is kept in sync with Table A via DB change notifications.
If the system is rebooted, or my service restarted, I want to re-initialize Table B. However, I want to do this with the least possible DB updates. Specifically, I want to:
add any rows that are in Table A, but not in Table B, and
delete any rows that are not in Table A, but are in Table B
any rows that are common to Table A and Table B should be left untouched
Now, I am not a "DB guy", so I am wondering what is conventional way of doing this.
Use exists to keep processing to a minimum.
Something along these lines, modified so the joins are correct (also verify that I didn't do something stupid and get TableA and TableB backwards from your description):
insert into TableB
select
*
from
TableA a
where
not exists (select 1 from TableB b where b.ID = a.ID)
delete from
TableB b
where
not exists (select 1 from TableA a where a.ID = b.ID)
Informix's Enterprise Replication features would do all this for you. ER works by shipping the logical logs from one server to another, and rolling them forward on the secondary.
You can configure it to be as finely-grained as you need (ie just a handful of tables).
You use the term "DB change notifications" - are you already using ER or is this some trigger-based arrangement?
If for some reason ER can't work for your configuration, I would suggest rewriting the notifications model to behave asynchronously, ie:
write notifications to a table in server 'A' that contains a timestamp or serial field
create a table on server 'B' that stores the timestamp/serial value of the last processed record
run a daemon process on server 'B' that:
compares 'A' and 'B' timestamps/serials
selects 'A' records between 'A' and 'B' timestamps
processes those records into 'B'
update 'B' timestamp/serial
sleep for appropriate time-period, and loop
So Server 'B' is responsible for ensuring its copy is in sync with 'A'. 'A' is not inconvenienced by 'B' being unavailable.
A simple way would be to use a historic table where you would put the changes from A that happened since the last update, and use that table to sync the table B instead of a direct copy from A to B. Once the sync is done, you delete the whole historic table and start anew.
What I don't understand is how table A can be update and not B if your service or computer is not running. Are they found on 2 different database or server?
Join data from both tables according to comon columns and this gives you the rows that have a match in both tables, i.e. data in A and in B. Then use this values (lets call this set M) with set operations, i.e. set minus operations to get the differences.
first requirement: A minus M
second requrement: B minus A
third requirement: M
Do you get the idea?
I am a Sql Server guy but since Sql Server 2008, for this kind of operation , a feature call MERGE is available.
By using MERGE statement we can perform insert, update and delete operations in a single statement.
So I googled and found that Informix also supports the same MERGE statement but I am not sure whether it takes care of delete too or not though insert and update is being taken care off. Moreover, this statement takes care of transaction by itself
Related
I need to create a package to migrate a large amount of data from a database table into a different database table. The source table will continuously have new data in like 4,5 days so I will run my package again and again.
I need to migrate all data from this table to another table but I don't want to migrate those data that I already migrated. What kind of transformation I need to use or what SQL command I need to write to do this?
The usual way this is done is by having "audit" timestamps on the source table and migating only records updated or inserted after the last migration.
for example:
Table Sales
sale_id
sale_date
sale_amount
...............
dw_create_date
dw_update_date
Your source extraction could be something along the lines of..
select sales.sale_id,
sales.sale_date,
....
from sales
where dw_updated_date > {last_migration_date}
last_migration_date is usually read from a config file or table.
Other approaches
There are a few other approaches that you could use, but all of these have bigger performance problems as your data size grows.
1) Do a (target-source) data, to get changed rows in the souurce.
select *
from source
minus
select * from target
You could do the same using a join between source and target.
select source.*
from src
left join tgt on (src.id=tgt.id)
where (src.column1 <> tgt.column1 or
src.column2 <> tgt.column2
............
)
Note that either one of these approaches does not take care of deletes in the source. If you want the tables to be in sync, the only way to do that would be do a (source-target) to get insert/update changes and (target-source) to get deleted rows and do the same in the target.
2. Insert and ignore the primary constraint error:
This has serious issues if the data can change in the source and you want the updates propagated to the target. You'd also be querying the entire source each time. It is usually better to use Merge/Upsert along with filtered source data, instead.
I would assume both tables have some unique identifier, no?
Table A has:
1
2
3
4
You're moving that to Table B, but keeping the data in Table A at the same time, yes?
So you've run your job once. Now Table B has:
1
2
3
4
Table A gets updated. It now has:
1
2
3
4
5
6
7
You run your job again, but you only want to send over 5,6,7.
SELECT *
FROM TableA
LEFT OUTER JOIN TableB ON TableA.ID = TableB.ID
WHERE TableB.ID = NULL.
If you have some sample data it would help. Does this give you a good idea?
See joins: http://i.stack.imgur.com/1UKp7.png
So there's this table of just about 40,000 rows I am looking to update. Colleague said it's best to incrementally update the table instead of complete delete and load.
So I've tried hashing out the design and logic of a script to do this, but my inexperience is getting to me. I just don't know what's efficient and unneeded to incrementally update a table.
Currently, the warehouse looks like this: data comes from source into a table (let's call this T1) in Teradata. Then it's sent into another table (let's call this T2) in Teradata with some added fields such as timestamp. Lastly, a view is built on that last table for security reasons.
So with that laid out, I was thinking of creating a temp/volatile table with data from T1. This would have all the data up to the time the script is run with new records. Then, go through the entire table seeing if the ID (primary index) already exists in T2, and if not, add it to another temp table. Then somehow combine the second temp table with T2 and override T2 and build a view on top of that.
Does this make any sense?
There's also the possibility of records being updated. So they would already exist in T2, but have updated data in a new version of T1. I think comparing the values of all the columns from T1 to T2 would be highly inefficient, but can't think of another way to do this
A 40,000 row delete and insert should be pretty painless for any modern database. Ditto for updates.
The real reason for doing and incremental delete/update/insert is so you can log the changes and timestamp rows in the permanent table with the date/time of nsertion and/or last update. The usual technique goes something like this:
remove rows from the permanent table that don't exist in the temp table
update rows that exist in both tables
insert rows that exist in the temp table, but don't exist in the permanent table.
Looking at the Teradata docs, that would be something like this (no warranties about this being syntactically correct, since I don't have a Teradata instance to play with):
delete permanent p
where not exists ( select *
from temp t
where t.id = p.id
)
update p
from permanent p ,
temp t
set ...
where t.id = p.id
insert permanent
select ...
from temp t
where not exists ( select *
from permanent p
where p.id = t.id
)
One might note that the deletes might get a little hairy if there are dependent foreign key constraints involved.
One might also note that on the update, the where clause might get a tad...complicated if you want to check for actual changes to column values: not much point in updating a row if nothing has changed.
There's a Teradata MERGE command that you might find useful, check this post:
https://forums.teradata.com/forum/database/merge-syntax-simple-version
merge into merge_tmp as t using (select 1 as a,'stf' as b,'uuj' as c) as s
on t.a = s.a
when matched then update set c = s.c
when not matched then insert values (s.a,s.b,s.c);
If you need to match on more columns simple put an and in the on statement.
Edit: If you want to use MERGE you might also need to use a delete statement like the one in nicholas' post.
Quick Version: I have 4 tables (TableA, TableB, TableC, TableD) identical in design. TableC is a complete History of TableA & B. I want to periodically update TableC with new data from TableA & B. TableD contains a copy of the row most recently transferred from A/B to C. I need to select all records from TablesA/B that are more recent than the record in TableD. Any advice?
Long Version: I'm trying trying to ETL (Extract, Transform, Load) some information from a few different tables into some other tables for quicker, easier reporting... kind of like a data warehouse but within the same database (don't ask).
Basically we want to record and report on system performance. ORACLE have logs for this in tables flows_030100.wwv_flow_activity_log1$ and flows_030100.wwv_flow_activity_log2$ - I believe these tables are filled and cleared every two weeks or something...
I have created a table:
CREATE TABLE dw_log_hist AS
SELECT * FROM flows_030100.wwv_flow_activity_log WHERE 1=0
and filled it with the current information:
INSERT INTO dw_log_hist
SELECT *
FROM flows_030100.wwv_flow_activity_log1$
INSERT INTO dw_log_hist
SELECT *
FROM flows_030100.wwv_flow_activity_log2$
HOWEVER, these log files record EVERY click in the APEX screens. As such, they are continually growing.
I want to periodically update my DW_Log_Hist table with only new information (I am fully aware my history table will grow to be ridiculously sized but I'll deal with that later).
Unfortunately, these tables have no primary key, so I've had to create another table to store marker records that will tell me the latest logs I copied over -_-
CREATE TABLE dw_log_temp AS
SELECT * FROM flows_030100.wwv_flow_activity_log
WHERE time_stamp = (SELECT MAX (time_stamp)
FROM flows_030100.wwv_flow_activity_log2$)
NOW THEN after all that waffle... this is what I need your help with:
Does anyone know whether one of the log tables (wwv_flow_activity_log1$ or wwv_flow_activity_log2$) always has the latest logs? Is it a case of log1$ filling up, log2$ filling then log1$ being overwritten with log2$ so that log2$ always has the latest data? Or do they both fill up and then get filled up again?
Can anyone advise how I would go about populating the DW_Log_Hist table using the DW_Log_Temp marker records?
Conceptually it would be something like:
insert everything into dw_log_hist from activity_log1$ and activity_log2$ where the time_stamp is > (time_stamp of the record in dw_log_temp)
Super sorry for such a long post.
Got the answer :-)
A chap on Reddit helped me realise my over complication...
insert into dw_log_hist
select *
from flows_030100.wwv_flow_activity_log1$
where time_stamp > (select max(time_stamp)
from dw_log_hist)
union
select *
from flows_030100.wwv_flow_activity_log2$
where time_stamp > (select max(time_stamp)
from dw_log_hist)
Hurrah! Always feel like such an idiot when you see the simple answer...
I have 2 tables of identical schemea. I need to move rows older than 90 days (based on a dataTime column present in the table) from table A to table B. Here is the pseudo code for what I want to do
SET #Criteria = getdate()-90
Select * from table A
Where column X<#Criteria
Into table B
--now clean up the records we just moved to table B, in Table A
delete from table A Where column X<#Criteria
My questions are:
What is the most efficient way to do this (will select-in perform well under high volumes)? Table A will have ~180,000,000 rows in it, and will need to move ~4,000,000 rows at a time to table B.
How do I encapsulate this under one transaction so that I will not delete rows from Table A if there was an error inserting them to Table B. I just want to make sure that I don't accidentally delete a row from table A unless I have successfully written it to table B.
Are there any good SQL Server 2005 books that you recommend?
Thanks,
Chris
I think that SSIS is probably the best solution for your needs.
I think you can just use the SSIS tasks like Data Flow task to achieve your needs. There doesnt seem to be any need to create a procedure separately for the logic.
Transactions can be set for any Data Flow task using TransactionOption property. Check out this article as to how to use Transactions in SSIS
Some basic tutorials on SSIS packages and how to create them can be referred to here and here
regarding
How do I encapsulate this under one transaction so that I will not delete rows from Table A if there was an error inserting them to Table B.
you can delete all rows from A that are in B using a join. Then, if the copy to B failed, nothing will be deleted from A.
There are two Databases, Database A has a table A with columns of id, group and flag. Database B has a table B with columns of ID and flag. Table B is essentially a subset of table A where the group == 'B'.
They are updated/created in odd ways that are outside my understanding at this time, and are beyond the scope of this question (this is not the time to fix the basic setup and practices of this client).
The problem is that when the flag in Table A is updated, it is not reflected in table B, but should be. This is not a time-critical problem, so it was suggested I create a job to handle this. Maybe because it's the end of the week, or maybe because I've never written more than the most basic stored procedure (I'm a programmer, not a DBA), but I'm not sure how to go about this.
At a simplistic level, the stored procedure would be something along of the lines of
Select * in table A where group == B
Then, loop through the resultset, and for each id, update the flag.
But I'm not even sure how to loop in a stored procedure like this. Suggestions? Example code would be preferred.
Complication: Alright, this gets a little harder too. For every group, Table B is in a separate database, and we need to update this flag for all groups. So, we would have to set up a separate trigger for each group to handle each DB name.
And yes, inserts to Table B are already handled - this is just to update flag status.
Assuming that ID is a unique key, and that you can use linked servers or some such to run a query across servers, this SQL statement should work (it works for two tables on the same server).
UPDATE Table_B
SET Table_B.Flag = Table_A.Flag
FROM Table_A inner join Table_B on Table_A.id = Table_B.id
(since Table_B already contains the subset of rows from Table_A where group = B, we don't have to include this condition in our query)
If you can't use linked servers, then I might try to do it with some sort of SSIS package. Or I'd use the method described in the linked question (comments, above) to get the relevant data from Database A into a temp table etc. in Database B, and then run this query using the temp table.
UPDATE
DatabaseB.dbo.Table_B
SET
DatabaseB.dbo.Table_B.[Flag] = DatabaseA.dbo.Table_A.Flag
FROM
DatabaseA.dbo.Table_A inner join DatabaseB.dbo.Table_B B
on DatabaseA.dbo.id = DatabaseB.dbo.B.id
Complication:
For sevaral groups run one such update SQL per group.
Note you can use Flag without []. I'm using the brackets only because of syntax coloring on stackoverflow.
Create an update trigger on table A that pushes the necessary changes to B as A is modified.
Basically (syntax may not be correct, I can't check it right now). I seem to recall that the inserted table contains all of the updated rows on an update, but you may want to check this to make sure. I think the trigger is the way to go, though.
create trigger update_b_trigger
on Table_A
for update
as
begin
update Table_B
set Table_B.flag = inserted.flag
from inserted
inner join Table_B
on inserted.id = Table_B.id
and inserted.group = 'B'
and inserted.flag <> Table_B.flag
end
[EDIT] I'm assuming that inserts/deletes to Table B are already handled and it's just flag updates to Table B that need to be addressed.