I'm trying to write code for a batch import of lots of rows into the database.
Currently I bulk copy the raw data (from a .csv file) into a staging table, so that it's all at the database side. That leaves me with a staging table full of rows that identify 'contacts'. These now need to be moved into other tables of the database.
Next I copy over the rows from the staging table that I don't already have in the contacts table, and for the ones I do already have, I need to update the column named "GroupToBeAssignedTo", indicating a later operation I will perform.
I have a feeling I'm going about this wrong. The query isn't efficient, and I'm looking for advice on how I could do this better.
UPDATE [t1]
SET [t1].GroupToBeAssignedTo = [t2].GroupToBeAssignedTo
FROM Contacts [t1]
INNER JOIN ContactImportStaging [t2]
    ON [t1].UserID = [t2].UserID
    AND [t1].EmailAddress = [t2].EmailAddress
    AND [t2].GUID = #GUID
WHERE NOT EXISTS
(
    SELECT GroupID, ContactID
    FROM ContactGroupMapping
    WHERE GroupID = [t2].GroupToBeAssignedTo AND ContactID = [t1].ID
)
Might it be better to just import all the rows without checking for duplicates first and then 'clean' the data afterwards? I'm looking for suggestions on where I'm going wrong. Thanks.
EDIT: To clarify, the question is regarding MS SQL.
This answer is slightly "I wouldn't start from here", but it's the way I'd do it ;)
If you've got the Standard or Enterprise editions of MS SQL Server 2005, and you have access to SQL Server Integration Services, this kind of thing is a doddle to do with a Data Flow.
Create a data source linked to the CSV file (it's faster if it's sorted by some field)
...and another to your existing contacts table (using ORDER BY to sort it by the same field)
Do a Merge Join on their common field -- you'll need to use a Sort transformation if either of the two sources isn't already sorted
Do a Conditional Split to keep only the rows that aren't already in your table (i.e. a table-unique field is NULL because the merge join didn't find a match for that row)
Use an OLEDB Destination to insert those rows into the table.
Probably more individual steps than a single insert-with-select statement, but it saves you the staging table, and it's pretty intuitive to follow. Plus, you're probably already licensed to use it, and it's pretty easy :)
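For comparison, that single insert-with-select would look roughly like the sketch below, reusing the question's staging and contacts tables (the inserted column list is an assumption, since the question doesn't show the full Contacts schema):

-- Sketch only: insert staging rows that have no match in Contacts yet,
-- matching on UserID + EmailAddress as in the question's UPDATE
INSERT INTO Contacts (UserID, EmailAddress, GroupToBeAssignedTo)
SELECT s.UserID, s.EmailAddress, s.GroupToBeAssignedTo
FROM ContactImportStaging s
WHERE s.GUID = #GUID
  AND NOT EXISTS
  (
      SELECT *
      FROM Contacts c
      WHERE c.UserID = s.UserID AND c.EmailAddress = s.EmailAddress
  )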
Next I copy over the rows from the staging table that I don't already have in the contacts table
That seems to imply that ContactGroupMapping does not have records matching Contacts.ID, in which case you can just omit the NOT EXISTS:
UPDATE [t1]
SET [t1].GroupToBeAssignedTo = [t2].GroupToBeAssignedTo
FROM Contacts [t1]
INNER JOIN
ContactImportStaging [t2]
ON [t1].UserID = [t2].UserID
AND [t1].EmailAddress = [t2].EmailAddress
AND [t2].GUID = #GUID
Or am I missing something?
What is needed: I need 25 million records from Oracle incrementally loaded into SQL Server 2012. The package will need UPDATE, DELETE, and NEW RECORDS handling. The Oracle data source is always changing.
What I have: I've done this many times before, but never with anything past 10 million records. First I have an [Execute SQL Task] that is set to grab the result set of the [Max Modified Date]. I then have a query that only pulls data from the [ORACLE SOURCE] > [Max Modified Date] and have that lookup against my destination table.
I have the [ORACLE Source] connecting to the [Lookup-Destination table]. The lookup is set to NO CACHE mode; I get errors if I use partial or full cache mode, because I assume the [ORACLE Source] is always changing. The [Lookup] then connects to a [Conditional Split] where I would input an expression like the one below.
(REPLACENULL(ORACLE.ID,"") != REPLACENULL(Lookup.ID,""))
|| (REPLACENULL(ORACLE.CASE_NUMBER,"")
!= REPLACENULL(Lookup.CASE_NUMBER,""))
I would then have the rows that the [Conditional Split] outputs go into a staging table. I then add an [Execute SQL Task] and perform an UPDATE to the DESTINATION-TABLE with the query below:
UPDATE SD
SET SD.CASE_NUMBER = UP.CASE_NUMBER,
    SD.ID = UP.ID
FROM Destination SD
JOIN STAGING.TABLE UP
    ON UP.ID = SD.ID
Problem: This becomes very slow and takes a very long time and it just keeps running. How can I improve the time and get it to work? Should I use a cache transformation? Should I use a merge statement instead?
How would I use the REPLACENULL expression in the conditional split when it is a date column? Would I use something like:
(REPLACENULL(ORACLE.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000")
!= REPLACENULL(Lookup.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000"))
A pattern that is usually faster for larger datasets is to load the source data into a local staging table, then use a query like the one below to identify the new records:
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM TargetTable TGT
WHERE TGT.MatchKey = SRC.MatchKey
)
Then you just feed that dataset into an insert:
INSERT INTO TargetTable (column1, column2)
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM TargetTable TGT
WHERE TGT.MatchKey = SRC.MatchKey
)
Updates look like this:
UPDATE TGT
SET
    column1 = SRC.column1,
    column2 = SRC.column2,
    DTUpdated = GETDATE()
FROM TargetTable TGT
INNER JOIN StagingTable SRC
    ON TGT.MatchKey = SRC.MatchKey
Note the additional column DTUpdated. You should always have a 'last updated' column in your table to help with auditing and debugging.
This is an INSERT/UPDATE approach. There are other data load approaches, such as windowing (pick a trailing window of data to be fully deleted and reloaded), but the right approach depends on how your system works and whether you can make assumptions about the data (e.g. that posted data in the source will never be changed).
You can squash the separate INSERT and UPDATE statements into a single MERGE statement, although it gets pretty huge; I've also had performance issues with it, and there are other documented issues with MERGE.
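For reference, a minimal sketch of that MERGE, using the same hypothetical StagingTable/TargetTable names and MatchKey column as above:

MERGE TargetTable AS TGT
USING StagingTable AS SRC
    ON TGT.MatchKey = SRC.MatchKey
WHEN MATCHED THEN
    UPDATE SET column1 = SRC.column1,
               column2 = SRC.column2,
               DTUpdated = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (column1, column2)
    VALUES (SRC.column1, SRC.column2);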
Unfortunately, there's not a good way to do what you're trying to do. SSIS has some controls and documented ways to do this, but as you have found they don't work as well when you start dealing with large amounts of data.
At a previous job, we had something similar that we needed to do. We needed to update medical claims from a source system to another system, similar to your setup. For a very long time, we just truncated everything in the destination and rebuilt every night. I think we were doing this daily with more than 25M rows. If you're able to transfer all the rows from Oracle to SQL in a decent amount of time, then truncating and reloading may be an option.
We eventually had to get away from this as our volumes grew, however. We tried to do something along the lines of what you're attempting, but never got anything we were satisfied with. We ended up with a sort of non-conventional process. First, each medical claim had a unique numeric identifier. Second, whenever the medical claim was updated in the source system, there was an incremental ID on the individual claim that was also incremented.
Step one of our process was to bring over any new medical claims, or claims that had changed. We could determine this quite easily, since the unique ID and the "change ID" column were both indexed in source and destination. These records would be inserted directly into the destination table.

The second step was our "deletes", which we handled with a logical flag on the records. For actual deletes, where records existed in destination but were no longer in source, I believe it was actually fastest to do this by selecting the DISTINCT claim numbers from the source system and placing them in a temporary table on the SQL side. Then, we simply did a LEFT JOIN update to set the missing claims to logically deleted.

We did something similar with our updates: if a newer version of the claim was brought over by our original Lookup, we would logically delete the old one. Every so often we would clean up the logical deletes and actually delete them, but since the logical delete indicator was indexed, this didn't need to be done too frequently. We never saw much of a performance hit, even when the logically deleted records numbered in the tens of millions.
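A rough sketch of that logical-delete step, with hypothetical table and column names:

-- #SourceClaims holds the DISTINCT claim numbers pulled from the source system
UPDATE d
SET d.IsDeleted = 1
FROM DestinationClaims d
LEFT JOIN #SourceClaims s
    ON s.ClaimNumber = d.ClaimNumber
WHERE s.ClaimNumber IS NULL -- claim no longer exists in source
  AND d.IsDeleted = 0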
This process was always evolving as our server loads and data source volumes changed, and I suspect the same may be true for your process. Because every system and setup is different, some of the things that worked well for us may not work for you, and vice versa. I know our data center was relatively good and we were on some stupid fast flash storage, so truncating and reloading worked for us for a very, very long time. This may not be true on conventional storage, where your data interconnects are not as fast, or where your servers are not colocated.
When designing your process, keep in mind that deletes are one of the more expensive operations you can perform, followed by updates and then non-bulk inserts.
Incremental Approach using SSIS
Get Max(ID) and Max(ModifiedDate) from the destination table and store them in variables
Create a temporary staging table using an Execute SQL Task and store its name in a variable
Add a Data Flow Task with an OLEDB Source and an OLEDB Destination to pull the data from the source system and load it into the temporary table
Add two Execute SQL Tasks, one for the insert process and one for the update
Drop the temporary table
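As a sketch, the Execute SQL Task in step 2 might create the global temp table like this (the column types are assumptions based on the AdventureWorks-style columns used below; a global ##table is used so the later tasks can see it):

IF OBJECT_ID('tempdb..##salesdetails') IS NOT NULL
    DROP TABLE ##salesdetails;

CREATE TABLE ##salesdetails
(
    salesorderid          int,
    salesorderdetailid    int,
    carriertrackingnumber nvarchar(25),
    orderqty              smallint,
    productid             int,
    specialofferid        int,
    unitprice             money,
    unitpricediscount     money,
    linetotal             numeric(38, 6),
    rowguid               uniqueidentifier,
    modifieddate          datetime
);

The insert and update task bodies then look like this: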
INSERT INTO sales.salesorderdetails
(
    salesorderid,
    salesorderdetailid,
    carriertrackingnumber,
    orderqty,
    productid,
    specialofferid,
    unitprice,
    unitpricediscount,
    linetotal,
    rowguid,
    modifieddate
)
SELECT sd.salesorderid,
       sd.salesorderdetailid,
       sd.carriertrackingnumber,
       sd.orderqty,
       sd.productid,
       sd.specialofferid,
       sd.unitprice,
       sd.unitpricediscount,
       sd.linetotal,
       sd.rowguid,
       sd.modifieddate
FROM ##salesdetails AS sd
WHERE NOT EXISTS
(
    SELECT *
    FROM sales.salesorderdetails sa
    WHERE sa.salesorderdetailid = sd.salesorderdetailid
)
AND sd.salesorderdetailid > ? -- parameter: the Max(ID) captured from the destination
UPDATE sa
SET SalesOrderID = sd.salesorderid,
    CarrierTrackingNumber = sd.carriertrackingnumber,
    OrderQty = sd.orderqty,
    ProductID = sd.productid,
    SpecialOfferID = sd.specialofferid,
    UnitPrice = sd.unitprice,
    UnitPriceDiscount = sd.unitpricediscount,
    LineTotal = sd.linetotal,
    rowguid = sd.rowguid,
    ModifiedDate = sd.modifieddate
FROM sales.salesorderdetails sa
INNER JOIN ##salesdetails sd
    ON sd.salesorderdetailid = sa.salesorderdetailid
WHERE sd.modifieddate > sa.modifieddate -- the staged row is newer than the destination row
  AND sa.salesorderdetailid <= ? -- parameter: the Max(ID) captured from the destination
The entire process took 2 minutes to complete.
Incremental Process Screenshot
I am assuming you have some identity-like (PK) column in your Oracle table.
1. Get the max identity (business key) from the destination database (the SQL Server one).
2. Create two data flows:
a) Pull only data > max identity from Oracle and put it directly into the destination (these are new records).
b) Get all records < max identity with an update date > last load, and put them into a temp (staging) table (this is updated data).
3. Update the destination table with the records from the temp table (created at step b).
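A sketch of the two Oracle source queries in step 2, with hypothetical table and column names (the ? placeholders stand for the max-identity and last-load-date variables):

-- a) new records: everything past the destination's max business key
SELECT * FROM source_table WHERE id > ?

-- b) updated records: older keys whose update date is after the last load
SELECT * FROM source_table WHERE id <= ? AND update_date > ?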
I'm currently working with distributed SQLite Databases.
Each client has its own database, which it can sync with the server database.
The server receives a database from a client and updates and inserts its own rows according to the database the client sends.
While inserting new records is as easy as:
ATTACH DATABASE 'clients_uploaded_db' AS toMerge;
INSERT INTO `presentation` SELECT * FROM toMerge.presentation WHERE id NOT IN (
SELECT id FROM `presentation`
)
updating is not. I need to check whether a client record has changed (the server's presentation.changedate has a smaller value than presentation.changedate in the client db) and update if necessary.
In a DBMS such as MySQL the following would be possible, but joins on UPDATE are not possible in SQLite.
ATTACH DATABASE 'clients_uploaded_db' AS toMerge;
UPDATE presentation
INNER JOIN toMerge.presentation AS tp ON
id = tp.id
SET
label = tp.label
WHERE
tp.changedate > changedate
I've read through several SO questions, but I could only find solutions where only one row needs to be updated or where the IDs are known.
DB-Structure
server_db:
presentation (id:pk, label, changedate)
clients_uploaded_db:
presentation (id:pk, label, changedate)
TL;DR
I can't join tables in an UPDATE, but I need to make the rows of a table exactly the same as the rows of a table in another database that is made available in my query, and only when the changedate column in the second database is higher than the one in the first.
What I have done so far
Tried to join the relevant tables
Iterated programmatically through the records and updated where necessary (works fine but is a performance killer)
SQLite doesn't support a FROM clause in UPDATE, so you have to do this with correlated subqueries. It is not entirely clear what your update is supposed to do, but here is an example:
UPDATE presentation
SET label = (SELECT tp.label
             FROM toMerge.presentation tp
             WHERE presentation.id = tp.id
               AND tp.changedate > presentation.changedate)
WHERE EXISTS (SELECT 1
              FROM toMerge.presentation tp
              WHERE presentation.id = tp.id
                AND tp.changedate > presentation.changedate);

The WHERE EXISTS guard is needed so that rows without a newer client version don't have their label overwritten with NULL. Since id is the primary key, the subquery returns at most one row, so no ORDER BY/LIMIT is needed.
I'm working with relatively large data sets, ~200GB. The data comes from text files that are imported into SQL via a script; they are bulk copied into a temp table, with the normalized tables waiting to receive the data.
My question comes from the fact that I'm mostly a scripter, so my instinct would be to loop through each row and do individual checks per row to put the data where it needs to go, but I read a different post on SO saying that's really wrong for SQL.
So my question is: if I have one temp table (31 columns) that is to be normalized across 5 others, what's the best way to go about this?
Table relationship is as follows:
System - Table that contains machine information (e.g. name, domain, etc.)
File - File information (e.g. name, size, directory, etc.)
SystemFile - The many-to-many system<->file relationship table.
Metadata - File metadata (language, etc.) - has foreign key relationship to file primary key
DigitalSignature - File digital signature status - has foreign key relationship to file primary key
Thanks
I don't have any links, and I don't have enough experience with things like SSIS etc. to give a balanced view, but when doing the task you are talking about my normal process would be (a generic, simple version):
1. Look at the normalised data set and consider the least dependent components in the data being imported (e.g. order headers are created before order items).
2. Create queries that select out the data I will have; these often have this form:
select
t.x,t.y,t.z
from
temp_table as t
left outer join normalise_table as n
on t.x=n.x
and t.y=n.y
and t.z=n.z
where
n.x is null
Here temp_table may have lots of columns, but these three represent whatever normalised nugget I want to add first; the left outer join and the IS NULL check make sure I only get the new values (if merging, it's the same).
Verify that I am getting good information and that I am only getting the new rows I want. Often you have to use GROUP BY or DISTINCT on the temp data to get accurate data for inserting, something like:
select
t.x,t.y,t.z
from
(select
distinct x,y,z
from
temp_table ) as t
left outer join normalise_table as n
on t.x=n.x
and t.y=n.y
and t.z=n.z
where
n.x is null
3. Wrap that select in an insert:
insert into
normalise_table (x,y,z)
select
t.x,t.y,t.z
from
(select
distinct x,y,z
from
temp_table ) as t
left outer join normalise_table as n
on t.x=n.x
and t.y=n.y
and t.z=n.z
where
n.x is null
In this way you are inserting sets of data. The procedural part is doing this for each set to be inserted, but in general you are not iterating over rows.
BTW, T-SQL has a MERGE command for when you may or may not already have the data in the target table (and for when you want to remove keys missing from the temp tables):
http://msdn.microsoft.com/en-us/library/bb510625.aspx
Some comments on foreign keys - these tend to be more specific to the situation:
Can you identify the relationship without the primary key? This is the easiest situation to deal with.
Imagine I have inserted my xyz object into a normalised table, but it has 100 child rows (abc's) in another table (each child may have 100 children too; this would mean 10,000 rows in the de-normalised data for one xyz).
You would have to go through the same validation as before, but your final query may look something like:
insert into
normalise_table_2 (parentID,a,b,c)
select
n.id,t.a,t.b,t.c
from
(select
distinct x,y,z,a,b,c
from
temp_table ) as t
inner join normalise_table as n
on t.x=n.x
and t.y=n.y
and t.z=n.z
left outer join normalise_table_2 as n2
on n.id = n2.parentID
and t.a = n2.a
and t.b = n2.b
and t.c = n2.c
where
n2.a is null
or maybe a more readable way:
insert into normalise_table_2 (parentID,a,b,c)
select
*
from (
select distinct
n.id,t.a,t.b,t.c
from
normalise_table as n
inner join temp_table as t
on t.x = n.x
and t.y = n.y
and t.z = n.z
left outer join normalise_table_2 as n2
on t.a = n2.a
and t.b = n2.b
and t.c = n2.c
and n2.parentID = n.id
where
n2.id is null
) as x
If you are having trouble identifying the row without the id, here are some points to consider:
I often give a unique id to every row in the de-normalised/import data; this makes it easier to track what has and has not been done, not to mention paying off in other ways (e.g. when the source data has blanks that are meant to be the same as the row above).
I have created temp tables to track relationships like this as I go along.
Sometimes (especially for less consistent data) these are not temp tables, as they can be used after the fact to analyse what did and didn't import (and where it went). Sometimes I have a comments column that the update queries populate with details of any exceptions relating to the import of that row.
Sometimes you are lucky and there is some kind of source or oldId field in the target that can be used to link the de-normalised data and the normalised version (this is particularly true of system-migration-type tasks, as people often want to be able to look up items in the old system). Sometimes this can be weird and wonderful, e.g. using the updated-by or created-by field and looking for a special account that executes this particular process (though I would not particularly recommend that).
Sometimes it makes sense to update the source tables in some way, e.g. replacing identifiers there.
Sometimes you come up with ID ranges or similar that are used for import, and you break the normal rules about where IDs are generated: your import process creates the ID.
This often means shutting down all other access to the target system while the import is executed. It may sound mad, but sometimes this is the best way for very complex uploads that require a lot of preparation.
But often, when you think about it, there is a particular order you can add your data in that avoids this issue, as you will always be able to identify the correct data. I have used the above techniques to make my life easier, but I am not sure I have ever HAD to use them.
The only exception I can think of is generating IDs outside of the system, which I have had to use, but that was so that IDs would be consistent across multiple trial loads and the final production load. Also, data was coming from many sources with many people working on it; it made life easier that they could be in control of their own IDs, but it did bring other issues ;).
Generally I would try to leave the source data alone and ensure that if you re-run any of your scripts they won't have any effect. This makes the whole system much more robust and gives everyone more confidence, as you can re-import the same data, or a file that has some of the same data, and run everything again without anything breaking.
Note: I have not tested any of these queries; I just wrote them off the top of my head, so sorry if they are not totally accurate.
I've researched and realize I have a unique situation.
I have multiple tables where a column (not always the identifier column) is sequentially numbered and shouldn't have any breaks in the numbering. My goal is to make sure this stays true.
Down and Dirty
We have an 'Event' table where we randomly select a percentage of the rows and insert the rows into table 'Results'. The "ID" column from the 'Results' is passed to a bunch of delete queries.
This more or less ensures that there are missing rows in several tables.
My problem:
Figuring out an SQL query that will renumber the column I specify. I would prefer not to drop the column.
Example delete query:
delete ItemVoid
from ItemTicket
join ItemVoid
on ItemTicket.item_ticket_id = itemvoid.item_ticket_id
where itemticket.ID in (select ID
from results)
Example Tables Before:
Example Tables After:
As you can see, 2 rows were deleted from both tables based on the ID column. So now I have to figure out how to renumber the item_ticket_id and item_void_id columns so that each higher number decreases to fill the missing value, the next highest decreases in turn, and so on. Problem #2: if an item_ticket_id changes in order to stay sequential in ItemTicket, that change has to be carried through to ItemVoid's item_ticket_id.
I appreciate any advice you can give on this.
(answering an old question as it's the first search result when I was looking this up)
(MS T-SQL)
Resequencing an ID column (not an IDENTITY one) that has gaps can be done with a simple CTE that uses ROW_NUMBER() to generate the new sequence.
The UPDATE works through the CTE 'virtual table' without any extra steps, actually updating the underlying original table.
Don't worry about the ID values clashing during the update; if you wonder what happens when IDs are set to values that already exist, it doesn't suffer that problem: the original sequence is changed to the new sequence in one go.
WITH NewSequence AS
(
SELECT
ID,
ROW_NUMBER() OVER (ORDER BY ID) as ID_New
FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;
Since you are looking for advice on this, my advice is that you need to redesign this, as I see a big flaw in your design.
Instead of deleting the records and then going through the hassle of renumbering the remaining ones, use a bit flag that marks the records as inactive. Then, when you query the records, just include a WHERE clause so that only the active ones are returned:
SELECT *
FROM yourTable
WHERE Inactive = 0
Then you never have to worry about re-numbering the records. This also gives you the ability to go back and see the records that would have been deleted and you do not lose the history.
If you really want to delete the records and renumber them, then you can perform this task the following way (a rough sketch follows the list):
create a new table
Insert your original data into your new table using the new numbers
drop your old table
rename your new table with the corrected numbers
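A rough sketch of that rebuild, with hypothetical column names (SELECT ... INTO creates the renumbered copy in one step):

-- Build a renumbered copy, ordered by the old numbering
SELECT ROW_NUMBER() OVER (ORDER BY item_ticket_id) AS item_ticket_id,
       OtherColumn1,
       OtherColumn2
INTO ItemTicket_New
FROM ItemTicket;

DROP TABLE ItemTicket;
EXEC sp_rename 'ItemTicket_New', 'ItemTicket';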
As you can see there would be a lot of steps involved in re-numbering the records. You are creating much more work this way when you could just perform an UPDATE of the bit flag.
You would change your DELETE query to something similar to this:
UPDATE ItemVoid
SET InActive = 1
FROM ItemVoid
JOIN ItemTicket
on ItemVoid.item_ticket_id = ItemTicket.item_ticket_id
WHERE ItemTicket.ID IN (select ID from results)
The bit flag is much easier and that would be the method that I would recommend.
The function that you are looking for is a window function: ROW_NUMBER(), part of standard SQL (supported by SQL Server, and by MySQL 8.0+). You use it as follows:
select row_number() over (order by <col>)
from <table>
In order to use this in your case, you would delete the rows from the table, then use a WITH statement (a CTE) to recalculate the row numbers, and then assign them using an UPDATE. For transactional integrity, you might wrap the delete and update into a single transaction.
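A minimal sketch of that sequence, reusing the YourTable and Results names from earlier on this page (an assumption), with both statements in one transaction:

BEGIN TRANSACTION;

DELETE FROM YourTable
WHERE ID IN (SELECT ID FROM Results);

WITH NewSequence AS
(
    SELECT ID, ROW_NUMBER() OVER (ORDER BY ID) AS ID_New
    FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;

COMMIT TRANSACTION;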
Oracle supports similar functionality, but the syntax is a bit different. Oracle calls these functions analytic functions and they support a richer set of operations on them.
I would strongly caution you against using cursors here, since they have lousy performance. Of course, none of this will work on an identity column, since such a column cannot be modified.
I have a VB app that accesses a SQL database. I think it's running slow, and I thought maybe I didn't have the tables properly indexed. I was wondering how you would create the indexes? Here's the situation.
My main loop is
Select * from Docrec
Order by YearFiled,DocNumb
Inside this loop I have two other database hits.
Select * from Names
Where YearFiled = DocRec.YearFiled
and Volume = DocRec.Volume and Page = DocRec.Page
Order by SeqNumb
Select * from MapRec
Where FiledYear = DocRec.YearFiled
and Volume = DocRec.Volume and Page = DocRec.Page
Order by SeqNumb
Hopefully I made sense.
Try it in one query using INNER JOINs:
SELECT * FROM Docrec d
INNER JOIN Names n ON d.YearFiled = n.YearFiled AND d.Volume = n.Volume AND d.Page = n.Page
INNER JOIN MapRec m ON m.FiledYear = d.YearFiled AND m.Volume = d.Volume AND m.Page = d.Page
ORDER BY d.YearFiled, d.DocNumb
You will then have only one query against the database. The problem may be that you currently hit the database many times and get only one (or a few) rows each time.
Off the top, one thing that would help would be determining if you really need all columns.
If you don't, instead of SELECT *, select just the columns you need - that way you're not pulling as much data.
If you do, then from SQL Server Management Studio (or whatever you use to manage the SQL Server) you'll need to look at what is indexed and what isn't. The columns you tend to search on the most would be your first candidates for an index.
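For instance, here is a sketch of indexes covering the lookups from the question (the index names are made up; verify the key choices against your actual workload):

CREATE INDEX IX_Names_YearFiled_Volume_Page
    ON Names (YearFiled, Volume, Page, SeqNumb);

CREATE INDEX IX_MapRec_FiledYear_Volume_Page
    ON MapRec (FiledYear, Volume, Page, SeqNumb);

Including SeqNumb as the trailing key column also lets the ORDER BY SeqNumb in each inner query come back pre-sorted.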
Addendum
Now that I've seen your edit, it may help to look at why you're doing the queries the way you are, and see if there isn't a way to consolidate it down to one query. Without more context I'd just be guessing at more optimal queries.
In general, looping through records is a poor idea. Can you not do a set-based query that gives you everything you need in one pass?
As far as indexing goes, consider any fields that you use in ordering or WHERE clauses, and any fields that appear in joins. Primary keys are indexed as part of setting up a primary key, but foreign keys are not; often people forget that they need to index them as well.
Never use select * in a production environment. It is a poor practice. Do not ever return more data than you need.
I don't know if you need the loop. If all you are doing is grabbing the records in maprec that match docrec, and then the same for the second table, then you can do this without a loop using inner join syntax:
select columnlist from maprec m inner join docrec d on (m.filedyear = d.yearfiled and m.volume = d.volume and m.page = d.page)
and then again for the second table...
You could also trim up your queries to return only the columns needed instead of returning all if possible. This should help performance.
To create an index yourself in SQL Server 2005, go to the design of the table and select the Manage Indexes & Keys toolbar item.
You can also use the Database Engine Tuning Advisor: create a trace of your queries (using SQL Server Profiler) and the Advisor will recommend, and can create, the indexes needed to optimize your query executions.
UPDATE SINCE YOUR FIRST COMMENT TO ME:
You can still do this by running the first query and then the second and third without a loop, as I have shown above. Here's the trick: I am thinking you need to tie the first to the second and third, hence why you did a loop.
It's been a while since I have done VB6 recordsets, but I do recall the ability to filter a recordset once it has been returned from the DB. So, in this case, you could keep your loop, but instead of calling SQL every time inside the loop, you would simply filter the resulting recordset data based on the current record. You would initialize/load the second and third queries before this loop. Using the syntax I gave above will load each of those tables with everything that matches the parent table (docrec).
With this, you will still only hit the DB three times, but you retain the loop you need so the parent docrec table can be traversed and you can do work on it AND on the child tables when you have a match.
Here are a few links on ADO recordset filtering:
http://www.devguru.com/technologies/ado/QuickRef/recordset_filter.html
http://msdn.microsoft.com/en-us/library/ee275540(BTS.10).aspx
http://www.w3schools.com/ado/prop_rs_filter.asp
With all this said, I have a strange feeling that it could perhaps be solved with just a LEFT JOIN on your tables:
select * from docrec d
left join maprec m on (d.YearFiled= m.FiledYear and d.Volume = m.Volume and d.Page = m.Page)
left join names n on (d.YearFiled = n.YearFiled and d.Volume = n.Volume and d.Page = n.Page)
This will return all DocRec records AND add the MapRec and Names values where they match, or NULL where they don't.
If this fits your need, it will only hit the DB once.