Using JOINS in UPDATE statement when merging in SQLite - sql

I'm currently working with distributed SQLite Databases.
Each client has it's own database which he can sync with the server database.
The server receives a database from a client and updates and inserts it's own rows according to the database the client sends.
While inserting new records is as easy as:
ATTACH DATABASE 'clients_uploaded_db' AS toMerge;
INSERT INTO `presentation` SELECT * FROM toMerge.presentation WHERE id NOT IN (
SELECT id FROM `presentation`
)
updating is not. I need to check if a client record has changed (presentation.changedate has a smaller value than presentation.changedate in the client db) and update if necessary.
In DBMS such as MySQL the following would be possible. But using joins on UPDATEs in SQLite is not possible.
ATTACH DATABASE 'clients_uploaded_db' AS toMerge;
UPDATE presentation
INNER JOIN toMerge.presentation AS tp ON
id = tp.id
SET
label = tp.label
WHERE
tp.changedate > changedate
I've read through several SO questions but I could only find solutions where only one row needs to be updated or where the id's are known.
DB-Structure
server_db:
presentation (id:pk, label, changedate)
clients_uploaded_db:
presentation (id:pk, label, changedate)
TL;DR
I can't join tables on UPDATE but I need to make rows of a table exactly the same as the rows on a table on another database that is made available in my query, but only when the changedate col of the second database is higher than that of the first one.
What I have done so far
Tried to join the regarding tables
Iterated programatically through the records and updated if necessary (works fine but is a performance killer)

SQLite doesn't support a from clause. So you have to do this with correlated subqueries. It is not clear what your update is supposed to be doing. But here is an example
UPDATE presentation
SET label = (select tp.label
from toMerge.presentation tp
where presentation.id = tp.id and
tp.changedate > presentation.changedate
order by tp.change_date
limit 1);

Related

Best incremental load method using SSIS with over 20 million records

What is needed: I'm needing 25 million records from oracle incrementally loaded to SQL Server 2012. It will need to have an UPDATE, DELETE, NEW RECORDS feature in the package. The oracle data source is always changing.
What I have: I've done this many times before but not anything past 10 million records.First I have an [Execute SQL Task] that is set to grab the result set of the [Max Modified Date]. I then have a query that only pulls data from the [ORACLE SOURCE] > [Max Modified Date] and have that lookup against my destination table.
I have the the [ORACLE Source] connecting to the [Lookup-Destination table], the lookup is set to NO CACHE mode, I get errors if I use partial or full cache mode because I assume the [ORACLE Source] is always changing. The [Lookup] then connects to a [Conditional Split] where I would input an expression like the one below.
(REPLACENULL(ORACLE.ID,"") != REPLACENULL(Lookup.ID,""))
|| (REPLACENULL(ORACLE.CASE_NUMBER,"")
!= REPLACENULL(ORACLE.CASE_NUMBER,""))
I would then have the rows that the [Conditional Split] outputs into a staging table. I then add a [Execute SQL Task] and perform an UPDATE to the DESTINATION-TABLE with the query below:
UPDATE Destination
SET SD.CASE_NUMBER =UP.CASE_NUMBER,
SD.ID = UP.ID,
From Destination SD
JOIN STAGING.TABLE UP
ON UP.ID = SD.ID
Problem: This becomes very slow and takes a very long time and it just keeps running. How can I improve the time and get it to work? Should I use a cache transformation? Should I use a merge statement instead?
How would I use the expression REPLACENULL in the conditional split when it is a data column? would I use something like :
(REPLACENULL(ORACLE.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000")
!= REPLACENULL(Lookup.LAST_MODIFIED_DATE," 01-01-1900 00:00:00.000"))
PICTURES BELOW:
A pattern that is usually faster for larger datasets is to load the source data into a local staging table then use a query like below to identify the new records:
SELECT column1,column 2
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM TargetTable TGT
WHERE TGT.MatchKey = SRC.MatchKey
)
Then you just feed that dataset into an insert:
INSERT INTO TargetTable (column1,column 2)
SELECT column1,column 2
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM TargetTable TGT
WHERE TGT.MatchKey = SRC.MatchKey
)
Updates look like this:
UPDATE TGT
SET
column1 = SRC.column1,
column2 = SRC.column2,
DTUpdated=GETDATE()
FROM TargetTable TGT
WHERE EXISTS (
SELECT * FROM TargetTable SRC
WHERE TGT.MatchKey = SRC.MatchKey
)
Note the additional column DTUpdated. You should always have a 'last updated' column in your table to help with auditing and debugging.
This is an INSERT/UPDATE approach. There are other data load approaches such as windowing (pick a trailing window of data to be fully deleted and reloaded) but the approach depends on how your system works and whether you can make assumptions about data (i.e. posted data in the source will never be changed)
You can squash the seperate INSERT and UPDATE statements into a single MERGE statement, although it gets pretty huge, and I've had performance issues with it and there are other documented issues with MERGE
Unfortunately, there's not a good way to do what you're trying to do. SSIS has some controls and documented ways to do this, but as you have found they don't work as well when you start dealing with large amounts of data.
At a previous job, we had something similar that we needed to do. We needed to update medical claims from a source system to another system, similar to your setup. For a very long time, we just truncated everything in the destination and rebuilt every night. I think we were doing this daily with more than 25M rows. If you're able to transfer all the rows from Oracle to SQL in a decent amount of time, then truncating and reloading may be an option.
We eventually had to get away from this as our volumes grew, however. We tried to do something along the lines of what you're attempting, but never got anything we were satisfied with. We ended up with a sort of non-conventional process. First, each medical claim had a unique numeric identifier. Second, whenever the medical claim was updated in the source system, there was an incremental ID on the individual claim that was also incremented.
Step one of our process was to bring over any new medical claims, or claims that had changed. We could determine this quite easily, since the unique ID and the "change ID" column were both indexed in source and destination. These records would be inserted directly into the destination table. The second step was our "deletes", which we handled with a logical flag on the records. For actual deletes, where records existed in destination but were no longer in source, I believe it was actually fastest to do this by selecting the DISTINCT claim numbers from the source system and placing them in a temporary table on the SQL side. Then, we simply did a LEFT JOIN update to set the missing claims to logically deleted. We did something similar with our updates: if a newer version of the claim was brought over by our original Lookup, we would logically delete the old one. Every so often we would clean up the logical deletes and actually delete them, but since the logical delete indicator was indexed, this didn't need to be done too frequently. We never saw much of a performance hit, even when the logically deleted records numbered in the tens of millions.
This process was always evolving as our server loads and data source volumes changed, and I suspect the same may be true for your process. Because every system and setup is different, some of the things that worked well for us may not work for you, and vice versa. I know our data center was relatively good and we were on some stupid fast flash storage, so truncating and reloading worked for us for a very, very long time. This may not be true on conventional storage, where your data interconnects are not as fast, or where your servers are not colocated.
When designing your process, keep in mind that deletes are one of the more expensive operations you can perform, followed by updates and by non-bulk inserts, respectively.
Incremental Approach using SSIS
Get Max(ID) and Max(ModifiedDate) from Destination Table and Store them in a Variables
Create a temporary staging table using EXECUTE SQL TASK and store that temporary staging table name into the variable
Take a Data Flow Task and Use OLEDB Source and OLEDB Destination to pull the data from the Source System and load the
data into the variable of temporary tables
Take Two Execute SQL Task one for Insert Process and other for Update
Drop the Temporary Table
INSERT INTO sales.salesorderdetails
(
salesorderid,
salesorderdetailid,
carriertrackingnumber ,
orderqty,
productid,
specialofferid,
unitprice,
unitpricediscount,
linetotal ,
rowguid,
modifieddate
)
SELECT sd.salesorderid,
sd.salesorderdetailid,
sd.carriertrackingnumber,
sd.orderqty,
sd.productid ,
sd.specialofferid ,
sd.unitprice,
sd.unitpricediscount,
sd.linetotal,
sd.rowguid,
sd.modifieddate
FROM ##salesdetails AS sd WITH (nolock)
LEFT JOIN sales.salesorderdetails AS sa WITH (nolock)
ON sa.salesorderdetailid = sd.salesorderdetailid
WHERE NOT EXISTS
(
SELECT *
FROM sales.salesorderdetails sa
WHERE sa.salesorderdetailid = sd.salesorderdetailid)
AND sa.salesorderdetailid > ?
UPDATE sa
SET SalesOrderID = sd.salesorderid,
CarrierTrackingNumber = sd.carriertrackingnumber,
OrderQty = sd.orderqty,
ProductID = sd.productid,
SpecialOfferID = sd.specialofferid,
UnitPrice = sd.unitprice,
UnitPriceDiscount = sd.unitpricediscount,
LineTotal = sd.linetotal,
rowguid = sd.rowguid,
ModifiedDate = sd.modifieddate
FROM sales.salesorderdetails sa
LEFT JOIN ##salesdetails sd
ON sd.salesorderdetailid = sa.salesorderdetailid
WHERE sa.modifieddate > sd.modifieddate
AND sa.salesorderdetailid < ?
Entire Process took 2 Minutes to Complete
Incremental Process Screenshot
I am assuming you have some identity like (pk)column in your oracle table.
1 Get max identity (Business key) from Destination database (SQL server one)
2 Create two data flow
a) Pull only data >max identity from oracle and put them Destination directly .( As these are new record).
b) Get all record < max identity and update date > last load put them into temp (staging ) table (as this is updated data)
3 Update Destination table with record from temp table record (created at step b)

Updating Table Records in a Batch and Auditing it

Consider this Table:
Table: ORDER
Columns: id, order_num, order_date, order_status
This table has 1 million records. I want to update the order_status to value of '5', for a bunch (about 10,000) of order_num's that i will be reading from a input text file.
My SQL could be:
(A) update ORDER set order_status=5 where order_num in ('34343', '34454', '454545',...)
OR
(B) update ORDER set order_status=5 where order_num='34343'
I can loop over this update several times until I have covered my 10,000 order updates.
(Also note that i have few Child Tables of ORDER like ORDER_ITEMS, where similar status must be updated and information audited)
My problem is here is:
How can i Audit this update in a separate ORDER_AUDIT Table:
Order_Num: 34343 - Updated Successfully
Order_Num: 34454 - Order Not Found
Order_Num: 454545 - Updated Successfully
Order_Num: 45457 - Order Not Found
If i go for batch update as in (A), I cannot Audit at Order Level.
If i go for Single Order at at time update as in (B), I will have to loop 10,000 times - that may be quite slow - but I can Audit at Order level in this case.
Is there any other way?
First of all, build an external table over your "input text file". That way you can run a simple single UPDATE statement:
update ORDER
set order_status=5
where order_num in ( select col1 from ext_table order by col1)
Neat and efficient. (Sorting the sub-query is optional: it may improve the performance of the update but the key point is, we can treat external tables like regular tables and use the full panoply of the SELECT syntax on them.) Find out more.
Secondly use the RETURNING clause to capture the hits.
update ORDER
set order_status=5
where order_num in ( select col1 from ext_table order by col1)
returning order_num bulk collect into l_nums;
l_nums in this context is a PL/SQL collection of type number. The RETURNING clause will give you all the ORDER_NUM values for updated rows only. Find out more.
If you declare the type for l_nums as a SQL nested table object you can use it in further SQL statements for your auditing:
insert into order_audit
select 'Order_Num: '||to_char(t.column_value)||' - Updated Succesfully'
from table ( l_nums ) t
/
insert into order_audit
select 'Order_Num: '||to_char(col1)||' - Order Not Found'
from ext_table
minus
select * from table ( l_nums )
/
Notes on performance:
You don't say how many of the rows you have in the input text file will match. Perhaps you don't know (actually on re-reading it's not clear whether 10,000 is the number of rows in the file or the number of matching rows). Pl/SQL collections use private session memory, so very large collections can blow the PGA. However, you should be able to cope with ten thousand NUMBER instances without blinching.
My solution does require you to read the external table twice. This shouldn't be a problem. And it will certainly be way faster than dynamically assembling one hundred IN clauses of a thousand numbers and looping over each.
Note that update is often the slowest bulk operation known to man. There are ways of speeding them up, but those methods can get quite involved. However, if this is something you'll want to do often and performance becomes a sticking point you should read this OraFAQ article.
Use MERGE. Firstly load data into a temporary table called ORDER_UPD_TMP with only one column id. You can do it using SQLDeveloper import feature. Then use MERGE in order to udpate your base table:
MERGE INTO ORDER b
USING (
SELECT order_id
FROM ORDER_UPD_TMP
) e
ON (b.id = e.id)
WHEN MATCHED THEN
UPDATE SET b.status = 5
You can also update with a different status when records don't match. Check the documentation for more details:
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_9016.htm
I think the best way will be:
to import your file to the database first
then do few SQL UPDATE/INSERT queries in one transaction to update status for all orders and create audit records.

SQL - renumbering a sequential column to be sequential again after deletion

I've researched and realize I have a unique situation.
First off, I am not allowed to post images yet to the board since I'm a new user, so see appropriate links below
I have multiple tables where a column (not always the identifier column) is sequentially numbered and shouldn't have any breaks in the numbering. My goal is to make sure this stays true.
Down and Dirty
We have an 'Event' table where we randomly select a percentage of the rows and insert the rows into table 'Results'. The "ID" column from the 'Results' is passed to a bunch of delete queries.
This more or less ensures that there are missing rows in several tables.
My problem:
Figuring out an sql query that will renumber the column I specify. I prefer to not drop the column.
Example delete query:
delete ItemVoid
from ItemTicket
join ItemVoid
on ItemTicket.item_ticket_id = itemvoid.item_ticket_id
where itemticket.ID in (select ID
from results)
Example Tables Before:
Example Tables After:
As you can see 2 rows were delete from both tables based on the ID column. So now I gotta figure out how to renumber the item_ticket_id and the item_void_id columns where the the higher number decreases to the missing value, and the next highest one decreases, etc. Problem #2, if the item_ticket_id changes in order to be sequential in ItemTickets, then
it has to update that change in ItemVoid's item_ticket_id.
I appreciate any advice you can give on this.
(answering an old question as it's the first search result when I was looking this up)
(MS T-SQL)
To resequence an ID column (not an Identity one) that has gaps,
can be performed using only a simple CTE with a row_number() to generate a new sequence.
The UPDATE works via the CTE 'virtual table' without any extra problems, actually updating the underlying original table.
Don't worry about the ID fields clashing during the update, if you wonder what happens when ID's are set that already exist, it
doesn't suffer that problem - the original sequence is changed to the new sequence in one go.
WITH NewSequence AS
(
SELECT
ID,
ROW_NUMBER() OVER (ORDER BY ID) as ID_New
FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;
Since you are looking for advice on this, my advice is you need to redesign this as I see a big flaw in your design.
Instead of deleting the records and then going through the hassle of renumbering the remaining records, use a bit flag that will mark the records as Inactive. Then when you are querying the records, just include a WHERE clause to only include the records are that active:
SELECT *
FROM yourTable
WHERE Inactive = 0
Then you never have to worry about re-numbering the records. This also gives you the ability to go back and see the records that would have been deleted and you do not lose the history.
If you really want to delete the records and renumber them then you can perform this task the following way:
create a new table
Insert your original data into your new table using the new numbers
drop your old table
rename your new table with the corrected numbers
As you can see there would be a lot of steps involved in re-numbering the records. You are creating much more work this way when you could just perform an UPDATE of the bit flag.
You would change your DELETE query to something similar to this:
UPDATE ItemVoid
SET InActive = 1
FROM ItemVoid
JOIN ItemTicket
on ItemVoid.item_ticket_id = ItemTicket.item_ticket_id
WHERE ItemTicket.ID IN (select ID from results)
The bit flag is much easier and that would be the method that I would recommend.
The function that you are looking for is a window function. In standard SQL (SQL Server, MySQL), the function is row_number(). You use it as follows:
select row_number() over (partition by <col>)
from <table>
In order to use this in your case, you would delete the rows from the table, then use a with statement to recalculate the row numbers, and then assign them using an update. For transactional integrity, you might wrap the delete and update into a single transaction.
Oracle supports similar functionality, but the syntax is a bit different. Oracle calls these functions analytic functions and they support a richer set of operations on them.
I would strongly caution you from using cursors, since these have lousy performance. Of course, this will not work on an identity column, since such a column cannot be modified.

SQL Find Holes in Linear Time-Based Table

I have a linear datetime based table in MySQL and SQL server for a time-based data collection system. There is required to be a new entry in this table that will be timestamped exactly every 10 seconds. Sometimes, the data collection source goes down for a while and comes back up, at which point the newest data is inserted into the table, and a new thread is spawned to start back-filling in the "holes".
Is there a SQL statement I could use to find the last "hole" in the ten second values so that I would know where to start asking for backup data?
Thanks!
for SQL Server, try something like this:
SELECT TOP 1
DATEADD(second,-10,t1.YourDateColumn)
FROM YourTable t1
WHERE NOT EXISTS (SELECT 1
FROM YourTable t2
WHERE DATEADD(second,-10,t1.YourDateColumn) = t2.YourDateColumn)
ORDER BY t1.YourDateColumn DESC

My SQL insert/update statement is too inefficient

I'm trying to write code for a batch import of lots of rows into the database.
Currently I bulk copy the raw data (from a .csv file) into a staging table, so that it's all at the database side. That leaves me with a staging table full of rows that identify 'contacts'. These now need to be moved into other tables of the database.
Next I copy over the rows from the staging table that I don't already have in the contacts table, and for the ones I do already have, I need to update the column named "GroupToBeAssignedTo", indicating a later operation I will perform.
I have a feeling I'm going about this wrong. The query isn't efficient and I'm looking for advice of how I could do this better.
update [t1]
set [t1].GroupToBeAssignedTo = [t2].GroupToBeAssignedTo from Contacts [t1]
inner join ContactImportStaging [t2] on [t1].UserID = [t2].UserID AND [t1].EmailAddress = [t2].EmailAddress AND [t2].GUID = #GUID
where not exists
(
select GroupID, ContactID from ContactGroupMapping
where GroupID = [t2].GroupToBeAssignedTo AND ContactID = [t1].ID
)
Might it be better to just import all the rows without checking for duplicates first and then 'clean' the data afterwards? Looking for suggestions of where I'm going wrong. Thanks.
EDIT: To clarify, the question is regarding MS SQL.
This answer is slightly "I wouldn't start from here", but it's the way I'd do it ;)
If you've got the Standard or Enterprise editions of MS SQL Server 2005, and you have access to SQL Server Integration Services, this kind of thing is a doddle to do with a Data Flow.
Create a data source linked to the CSV file (it's faster if it's sorted by some field)
...and another to your existing contacts table (using ORDER BY to sort it by the same field)
Do a Merge Join on their common field -- you'll need to use a Sort transformation if either the two sources aren't already sorted
Do a Conditional split to focus only on rows that aren't already in your table (i.e. a table-unique field is "null", i.e. the merge join didn't actually merge for that row)
Use an OLEDB destination to input to the table.
Probably more individual steps than a single insert-with-select statement, but it'll save your staging, and it's pretty intuitive to follow. Plus, you're probably already licenced to use it, and it's pretty easy :)
Next I copy over the rows from the staging table that I don't already have in the contacts table
It seems that implies that ContactGroupMapping does not have records matching Contacts.id, in which case you can just omit the EXISTS:
UPDATE [t1]
SET [t1].GroupToBeAssignedTo = [t2].GroupToBeAssignedTo
FROM Contacts [t1]
INNER JOIN
ContactImportStaging [t2]
ON [t1].UserID = [t2].UserID
AND [t1].EmailAddress = [t2].EmailAddress
AND [t2].GUID = #GUID
Or I am missing something?