Update a table while reading it - sql

I have a database table with a Y/N flag in one column. I want to read all records where the flag is 'N' and after processing a record, set the flag to 'Y' in that record. Is it correct, and reasonable, to do this at the same time, using two separate connections? Or should I read the entire table first and update only after I'm done with the reading? What's the correct approach to this?
The database involved is Netezza, in case it matters.

You should read first then update. Not asynchronsly. If the "select" part takes a long time, you should consider doing it batches. You can use a separate connections but should be confident you've completed your read.

Depends mostly on your design and needs.
How important is the flag? What if something goes wrong when you have set all flags before you have processed them... and so on.
Why you need two connections is out of my understanding, usually you have one connection you keep open. I don't know the blocks of Netezza but some system can also be made to do select and update at the same time.
You could do:
Load a bunch, process them and then update all flags. (fastest, one fail = all fail)
Load a bunch, process one, update one flag, process next.. (quite fast and one fail don't hit em all)
Fetch and update them one by one. (will be slowest but most secure)

Related

Performance considration between Cursor with 30000 record or create 30000 update statement

Which is better?
1)A cursor that loop 30000 record and perform update one by one
2)Create a script that has 30000 update command
thanks
Both should take about the same time, mainly subject to how the CURSOR is declared.
Reason? You have 30,000 individual updates which is usually the main factor
Note that 30,000 individual UPDATES in one batch will probably fail because of batch size and compile time anyway...
SQL is a set based language and you can most likely do a single UPDATE to update all rows in one go. If you can't, it is because of 2 reasons
You need "per row" logic: this can usually be achieved by CASE expressions, UDFs etc
You don't understand sets and SQL
With more information (the SQL and logic) we could help you more...
There is a very easy way to tell: Do it and measure the time.
Other than that, having 30000 lines does not make a lot of sense when you can have just 10.
Making updates this way for reasons other than data migration or maintenance doesn't sound like wise either, and in those cases performance is not an issue - but maintenance and legibility always is.
You know, that depends on context.
It helps, though, to learn. SQL for example. You are on a low level not to see the real optimizations possible here. SQL is a lot more than just Update, Insert and simple Select statements.
1)A cursor that loop 30000 record and perform update one by one
Linear step by step processing. No way to paralellize as SQL itself has no threading mechanisms available to the user; Optimizations are one by one - i.e. the query optimizer looks at items one statement at a time.
2)Create a script that has 30000 update command
Assuming the script is external, it could split the work and run it concurrent on multiple connections, i.e. run more than one parallel.
But there is more:
Make a script that calculates the new values.
Bulk import them into a temporary table using the buld copy API
Issue ONE update statment that takes the updated values from the temporary table to the final one.
Maybe have a script that issues a merge statement for multi update? There are tons of variations there if you know the SQL api more than "update, open cursor, simple select".
I do that - though a lot more data (batches of 50.000, sometimes 4-6 at the same time). The problem being that sql bulk copy has some overhead. But I manage 75.000 inserts per second that way.
A lot depends on the business questions and the complexity of the logic - if it is simple updates then the question is: Calculated or externally driven? Multiple values by 2 = calculated, updating addresses = data driven (i.e. you need the new data from somewhere).

SQL Server - On-Insert Trigger vs. Per-Minute Job

My question is mainly concerned with "what's best for performance", but kinda "philosophically" speaking as well (if it makes a difference)... so let's jump right in.
[TableA].[ColumnB] stores a value that needs to exist in [TableC].[ColumnD]. Right off the bat, no answers involving Foreign-keys - just assume that they're "not allowed" in this environment for whatever reason.
But due to "circumstances x,y,z", [TableA].[ColumnB] sometimes gets values that do not exist in [TableC].[ColumnD], because, let's say, [TableA] gets populated from an object that exists in running code as a "serialized blob", an in-memory representation of the data, and the [ColumnB] values got populated before those values were deleted from [TableC].[ColumnD] by some other process. ANYWAY, this is for example's sake, so don't get bogged down in the "why does this condition happen", just accept that it does.
To "fix" the problem, which method is best of these two: 1. make a Trigger that fires on-INSERT on [TableA], to Update [ColumnB] to the value that it should be (and assume I have a "mapping" of bad-to-good values). Or, 2. run a scheduled-Job every hour/minute/whatever that runs Update queries to change all possible "bad" values to their corresponding "good" values.
To put it more generally, what's better for performance and/or what is best practice: a Trigger, or a periodic Scheduled-Job? In context, let's say [TableA] is typically on the order of hundreds of thousands of rows, with Inserts happening 10-100 records at-a-time, as frequently as every few minutes to as rarely as a few times per day.
On-insert.
Doing triggers is like callbacks- They're more logically sound, and they spread any lag into every query. Doing continual checks (called polling or cron-jobs), you end up with more severe moments of lag every now and then. In almost all cases, using triggers/callbacks are the better way to go as having 1ms of lag added to every query is better than 100ms of lag at seemingly random intervals.
Use of triggers is generally discouraged, but your load is light and your case seems to be a natural trigger case. Consider using instead-of trigger to avoid two operations on the same row (one insert instead of insert and update). It may be the simplest and most reliable solution (as long as you have written reliable code in the trigger that won't cause the whole operation to crash).
Since you are considering a batch job, you are not concerned with timing issues. I.e it's OK with your application that tables may be out of sync for 1 minute or even 1 hour. That's the major difference with the trigger approach, which will guarantee that tables are in sync all the time. Potential timing issues would make me uncomfortable. On the plus side, you won't be at risk of crashing the original insert operation with your trigger.
If you go this route, please consider Change Tracking feature. Change tracking will indicate which rows have been inserted since the last time you checked, so you won't have to scan the whole table for new records. Alternatively, if your TableA has an INDENITY primary or unique key, you can implement similar design without change tracking functionality.
Triggers are both best performance and practice, as they maintain referential integrity as well as allowing the server to optimise for performance.
You didn't say what version of SQL Server you were using, but if it's 2008+, you can use Change Data Capture to keep track of data changes to your "primary" table. Then, periodically, you can run a batch over the change table and do whatever processing is required over that small set.

SQL transaction affecting a big amount of rows

The situation is as follows:
A big production client/server system where one central database table has a certain column that has had NULL as default value but now has 0 as default value. But all the rows created before that change of course still have value as null and that generates a lot of unnecessary error messages in this system.
Solution is of course simple as that:
update theTable set theColumn = 0 where theColumn is null
But I guess it's gonna take a lot of time to complete this transaction? Apart from that, will there be any other issues I should think of before I do this? Will this big transaction block the whole database, or that particular table during the whole update process?
This particular table has about 550k rows and 500k of them has null value and will be affected by the above sql statement.
The impact on the performance of other connected clients depends on:
How fast the servers hardware is
How many indexes containing the column your update statement has to update
Which transaction isolation settings the other clients connect to the database
The db engine will acquire write locks, so when your clients only need read access to the table, it should not be a big problem.
500.000 records sounds not too much for me, but as i said, the time and resources the update takes depends on many factors.
Do you have a similar test system, where you can try out the update?
Another solution is to split the one big update into many small ones and call them in a loop.
When you have clients writing frequently to that table, your update statement might get blocked "forever". I have seen databases where performing the update row by row was the only way of getting the update through. But that was a table with about 200.000.000 records and about 500 very active clients!
it's gonna take a lot of time to complete this transaction
there's no definite way to say this. Depends a lot on the hardware, number of concurrent sessions, whether the table has got locks, the number of interdependent triggers et al.
Will this big transaction block the whole database, or that particular table during the whole update process
If the "whole database" is dependent on this table then it might.
will there be any other issues I should think of before I do this
If the table has been locked by other transaction - you might run into a row-lock situation. In rare cases, perhaps a dead lock situation. Best would be to ensure that no one is utilizing the table, check for any pre-exising locks and then run the statement.
Locking issues are vendor specific.
Asuming no triggers on the table, half a million rows is not much for a dediated database server even with many indexes on the table.

Postgres: How to fire multiple queries in same time?

I have one procedure which updates record values, and i want to fire it up against all records in table (over 30k records), procedure execution time is from 2 up to 10 seconds, because it depends on network load.
Now i'm doing UPDATE table SET field = procedure_name(paramns); but with that amount of records it takes up to 40 min to process all table.
Now im using 4 different connections witch fork to background and fires query with WHERE clause set to iterate over modulo of row id's to speed this up, ( WHERE id_field % 4 = ) and this works well and cuts down table populate to ~10 mins.
But i want to avoid using cron, shell jobs and multiple connections for this, i know that it can be done with libpq, but is there a way to fire up a query (4 different non-blocking queries) and do not wait till it ends execution, within single connection?
Or if anyone can point me out to some clues on how to write that function, using postgres internals, or simply in C and bound it as a stored procedure?
Cheers Darius
I've got a sure answer for this question - IF you will share with us what your ab workout is!!! I'm getting fat by the minute and I need answers myself...
OK I'll answer anyway.
If you are updating one table, on one database server, in 40 minutes 'single threaded' and in 10 minutes with 4 threads, the bottleneck is not the database server; otherwise, it would get bogged down in I/O. If you are executing a bunch of UPDATES, one call per record, the network round-trip time is killing you.
I'm pretty sure this is the case and not that it's either an I/O bottleneck on the DB or the possibility that procedure_name(paramns); is taking a long time. (If that were the procedure taking 2-10 seconds it would take like 2500 min to do 30K records). The reason I am sure is that starting 4 concurrent processed cuts the time in 1/4. So especially it is not an i/o issue on the DB server.
This might be the one excuse for putting business logic in an SP on the server. Optimization unfortunately means breaking the rules. The consequence is difficult maintenance. but, duh!!
However, the best solution would be to get this set up to use 'bulk update' queries. That might mean you have to take several strange and unintuitive steps such as this:
This will require a lot of modfication if multiple users can run it concurrently.
refactor the system so procedure_name(paramns) can get all the data it needs to process all records via a select statement. May need to use creative joins. If it's an SP of course now you are moving the logic to the client.
Use that have the program create an XML or other importable flat file format with the PK of the record to update, and the new field value or values. Write all the updates to this file instead of executing them on the DB.
have a temp table on the database that matches the layout of this flat file
run an import on the database - clear the temp table and import the file
do an update of a join of the temp table and the table to be updated, e.g., UPDATE mytbl, mytemp WHERE myPK=mytempPK SET myval=mytempnewval (use the right join syntax of course).
You can try some of these things 'by hand' first before you bother coding, to see if it's worth the speed increase.
If possible, you can still put this all in an SP!
I'm not making any guarantees, especially as I look down at my ever-fattening belly, but, this has the potential to melt your update job down to under a minute.
It is possible to update multiple rows at once. Below an example in postgres:
UPDATE
table_name
SET
column_name = temp.column_name
FROM
(VALUES
(<id1>, <value1>),
(<id2>, <value2>),
(<id3>, <value3>)
) AS temp("id", "column_name")
WHERE
table_name.id = temp.id
PHP has some functions for asynchrone queries:
pg_ send_ execute()
pg_ send_ prepare()
pg_send_query()
pg_ send_ query_ params()
No idea about other programming languages, you have to dig into the manuals.
I think you can't. Single connection can handle single query at once. It's described in libpq documentation chapter "Asynchronous Command Processing":
"After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the results. PQsendQuery cannot be called again (on the same connection) until PQgetResult has returned a null pointer, indicating that the command is done."

SQL, selecting and updating

I am trying to select 100s of rows at a DB that contains 100000s of row and update those rows afters.
the problem is I don't want to go to DB twice for this purpose since update only marks those rows as "read".
is there any way I can do this in java using simple jdbc libraries? (hopefully without using stored procedures)
update: ok here is some clarification.
there are a few instance of same application running on different servers, they all need to select 100s of "UNREAD" rows sorted according to creation_date column, read blob data within it, write it to file and ftp that file to some server. (I know prehistoric but requirements are requirements)
The read and update part is for to ensure each instance getting diffent set of data. (in order, tricks like odds and evens wont work :/)
We select data for update. the data transfers through the wire (we wait and wait) and then we update them as "READ". then release lock for reading. this entire thing takes too long. By reading and updating at the same time, I would like to reduce lock time (from time we use select for update to actual update) so that using multiple instances would increase read rows per second.
Still have ideas?
It seems to me there might be more than one way to interpret the question here.
You are selecting the rows for the
sole purpose of updating them and
not reading them.
You are selecting the rows to show
to somebody, and marking them as
read either one at a time or all as a group.
You want to select the rows and mark
them as read at the time you select
them.
Let's take Option 1 first, as that seems to be the easiest. You don't need to select the rows in order to update them, just issue an update with a WHERE clause:
update table_x
set read = 'T'
where date > sysdate-1;
Looking at option 2, you want to mark them as read when a user has read them (or a down stream system has received it, or whatever). For this, you'll probably have to do another update. If you query for the primary key, in addition to the other columns you'll need in the first select, you will probably have an easier time of updating, as the DB won't have to do table or index scans to find the rows.
In JDBC (Java) there is a facility to do a batch update, where you execute a set of updates all at once. That's worked out well when I need to perform a lot of updates that are of the exact same form.
Option 3, where you want to select and update all in one shot. I don't find much use for this, personally, but that doesn't mean others don't. I suppose some kind of stored procedure would reduce the round trips. I'm not sure what db you are working with here and can't really offer specifics.
Going to the DB isn't so bad. If you aren't returning anything 'across the wire' then an update shouldn't do you too much damage and its only a few hundred thousand rows. What is your worry?
If you're doing a SELECT in JDBC and iterating over the ResultSet to UPDATE each row, you're doing it wrong. That's an (n+1) query problem that will never perform well.
Just do an UPDATE with a WHERE clause that determines which of those rows needs to be updated. It's a single network round trip that way.
Don't be too code-centric. Let the database do the job it was designed for.
Can't you just use the same connection without closing it?