SQL, selecting and updating

I am trying to select hundreds of rows from a DB that contains hundreds of thousands of rows, and update those rows afterwards.
The problem is I don't want to go to the DB twice for this, since the update only marks those rows as "read".
Is there any way I can do this in Java using simple JDBC libraries? (hopefully without using stored procedures)
Update: OK, here is some clarification.
There are a few instances of the same application running on different servers; they all need to select hundreds of "UNREAD" rows sorted by the creation_date column, read the blob data within them, write it to a file and FTP that file to some server. (I know, prehistoric, but requirements are requirements.)
The read-and-update part is there to ensure each instance gets a different set of data. (The rows must be taken in order, so tricks like odds and evens won't work :/)
We select the data FOR UPDATE, the data transfers over the wire (we wait and wait), then we update the rows as "READ" and release the lock. This entire thing takes too long. By reading and updating at the same time, I would like to reduce the lock time (from the moment we issue the SELECT FOR UPDATE to the actual UPDATE) so that using multiple instances would increase the rows read per second.
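To make the lock window concrete, here is a rough sketch of the current flow; the table and column names are just placeholders, not our real schema:
-- each instance locks a batch of unread rows; the lock is held until commit
select id, blob_data from messages
where status = 'UNREAD'
order by creation_date
for update;

-- ... pull the blobs across the wire, write the file, ftp it ...

-- mark exactly the rows that were read, then commit to release the lock
update messages set status = 'READ'
where id in (/* the ids returned by the select above */);
commit;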
Still have ideas?

It seems to me there might be more than one way to interpret the question here:
1. You are selecting the rows for the sole purpose of updating them, not reading them.
2. You are selecting the rows to show to somebody, and marking them as read either one at a time or all as a group.
3. You want to select the rows and mark them as read at the time you select them.
Let's take Option 1 first, as that seems to be the easiest. You don't need to select the rows in order to update them, just issue an update with a WHERE clause:
update table_x
set read = 'T'
where date > sysdate-1;
Looking at option 2, you want to mark them as read when a user has read them (or a downstream system has received them, or whatever). For this, you'll probably have to do another update. If you query for the primary key in addition to the other columns you'll need in the first select, you will probably have an easier time of updating, as the DB won't have to do table or index scans to find the rows.
In JDBC (Java) there is a facility to do a batch update, where you execute a set of updates all at once. That has worked out well for me when I needed to perform a lot of updates of the exact same form.
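For example, assuming an id primary key and a payload column (both illustrative), the second trip becomes a single keyed statement, or a JDBC batch of such statements:
-- first select: fetch the primary key along with the data you display
select id, payload from table_x where read = 'F';

-- later, mark only the rows that were actually shown, keyed by primary key
update table_x set read = 'T' where id in (101, 102, 103);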
Option 3 is where you want to select and update all in one shot. I don't find much use for this personally, but that doesn't mean others don't. I suppose some kind of stored procedure would reduce the round trips. I'm not sure what DB you are working with here, so I can't really offer specifics.

Going to the DB isn't so bad. If you aren't returning anything 'across the wire', then an update shouldn't do you too much damage, and it's only a few hundred thousand rows. What is your worry?

If you're doing a SELECT in JDBC and iterating over the ResultSet to UPDATE each row, you're doing it wrong. That's an (n+1) query problem that will never perform well.
Just do an UPDATE with a WHERE clause that determines which of those rows needs to be updated. It's a single network round trip that way.
Don't be too code-centric. Let the database do the job it was designed for.

Can't you just use the same connection without closing it?

Related

How to Detect database data changes

In a SQL Server database, is there a way to detect data changes? I tried to do that using CHECKSUM & CHECKSUM_AGG, but only on a single table; all I want is to have the entire checksum (hash) for all tables.
The query I used for a single table is
select isnull(CHECKSUM_AGG(CHECKSUM(*)),0) from mytable
It sounds like you need INSERT/UPDATE/DELETE triggers.
I'm not sure that this will be efficient. As the table grows the computation will become more costly. Are you sure you need to do this every time data changes? How often are changes made? You're throwing a lot of work away with every operation. Maybe you can do it on request when needed rather than with every write operation.
You can use the 'SP_MSFOREACHTABLE' command to run a loop through all of your tables and aggregate the checksums
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) as test FROM your_table_name WITH (NOLOCK);
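A minimal sketch of that loop, collecting one checksum per table into a temporary table and then aggregating; the temp table name is my own:
-- one checksum row per table
create table #table_checksums (table_name sysname, table_checksum int);

exec sp_MSforeachtable
    'insert into #table_checksums
     select ''?'', isnull(checksum_agg(binary_checksum(*)), 0)
     from ? with (nolock);';

-- single hash covering every table
select checksum_agg(table_checksum) from #table_checksums;

drop table #table_checksums;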
You can detect changes in SQL Server with this code: there is a pool of numbers, a number is taken for each change, and the same number does not come up twice in a row.

Update a table while reading it

I have a database table with a Y/N flag in one column. I want to read all records where the flag is 'N' and after processing a record, set the flag to 'Y' in that record. Is it correct, and reasonable, to do this at the same time, using two separate connections? Or should I read the entire table first and update only after I'm done with the reading? What's the correct approach to this?
The database involved is Netezza, in case it matters.
You should read first, then update, not asynchronously. If the "select" part takes a long time, you should consider doing it in batches. You can use separate connections, but you should be confident you've completed your read.
Depends mostly on your design and needs.
How important is the flag? What if something goes wrong when you have set all flags before you have processed them... and so on.
Why you need two connections is beyond me; usually you have one connection that you keep open. I don't know the locking in Netezza, but some systems can also be made to do a select and an update at the same time.
You could do:
Load a bunch, process them, and then update all the flags. (fastest; one failure = all fail)
Load a bunch, process one, update one flag, process the next... (quite fast, and one failure doesn't hit them all; see the sketch below)
Fetch and update them one by one. (slowest, but most secure)
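A rough sketch of the second variant, assuming a table my_table with an id column and the Y/N flag in a column called flag (names are mine):
-- read the unprocessed rows
select id, payload from my_table where flag = 'N';

-- after processing a row, flip just that row
update my_table set flag = 'Y' where id = 42;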

Postgres: How to fire multiple queries at the same time?

I have one procedure which updates record values, and I want to fire it against all records in a table (over 30k records); procedure execution time is from 2 up to 10 seconds, because it depends on network load.
Now I'm doing UPDATE table SET field = procedure_name(params); but with that amount of records it takes up to 40 min to process the whole table.
Now I'm using 4 different connections which fork to the background and fire the query with a WHERE clause set to iterate over the modulo of the row ids to speed this up ( WHERE id_field % 4 = ), and this works well and cuts the table update down to ~10 mins.
But I want to avoid using cron, shell jobs and multiple connections for this. I know that it can be done with libpq, but is there a way to fire a query (4 different non-blocking queries) and not wait until it finishes execution, within a single connection?
Or can anyone point me to some clues on how to write such a function using Postgres internals, or simply write it in C and bind it as a stored procedure?
Cheers Darius
I've got a sure answer for this question - IF you will share with us what your ab workout is!!! I'm getting fat by the minute and I need answers myself...
OK I'll answer anyway.
If you are updating one table, on one database server, in 40 minutes 'single threaded' and in 10 minutes with 4 threads, the bottleneck is not the database server; otherwise, it would get bogged down in I/O. If you are executing a bunch of UPDATEs, one call per record, the network round-trip time is killing you.
I'm pretty sure this is the case, and not that it's an I/O bottleneck on the DB or that procedure_name(params) itself is taking a long time. (If it really were the procedure taking 2-10 seconds, it would take something like 2500 min to do 30K records.) The reason I am sure is that starting 4 concurrent processes cuts the time to 1/4, so it is especially unlikely to be an I/O issue on the DB server.
This might be the one excuse for putting business logic in an SP on the server. Optimization unfortunately means breaking the rules; the consequence is difficult maintenance. But, duh!!
However, the best solution would be to get this set up to use 'bulk update' queries. That might mean you have to take several strange and unintuitive steps such as this:
This will require a lot of modification if multiple users can run it concurrently.
Refactor the system so procedure_name(params) can get all the data it needs to process all records via a select statement. You may need to use creative joins. If it's an SP, of course, you are now moving the logic to the client.
Use that select to have the program create an XML or other importable flat-file format with the PK of each record to update and the new field value or values. Write all the updates to this file instead of executing them against the DB.
Have a temp table on the database that matches the layout of this flat file.
Run an import on the database - clear the temp table and import the file.
Do an update of a join of the temp table and the table to be updated, e.g., in Postgres: UPDATE mytbl SET myval = mytemp.mytempnewval FROM mytemp WHERE mytbl.myPK = mytemp.mytempPK; (see the sketch below).
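A rough sketch of the temp-table steps in Postgres, with illustrative names and psql's \copy standing in for whatever import tool you actually use ('updates.csv' is the flat file produced above):
-- staging table matching the flat file
create temp table mytemp (mytempPK integer primary key, mytempnewval text);

-- import the file the client program produced
\copy mytemp from 'updates.csv' with (format csv)

-- one set-based update instead of one call per record
update mytbl set myval = mytemp.mytempnewval
from mytemp
where mytbl.myPK = mytemp.mytempPK;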
You can try some of these things 'by hand' first before you bother coding, to see if it's worth the speed increase.
If possible, you can still put this all in an SP!
I'm not making any guarantees, especially as I look down at my ever-fattening belly, but, this has the potential to melt your update job down to under a minute.
It is possible to update multiple rows at once. Below an example in postgres:
UPDATE
table_name
SET
column_name = temp.column_name
FROM
(VALUES
(<id1>, <value1>),
(<id2>, <value2>),
(<id3>, <value3>)
) AS temp("id", "column_name")
WHERE
table_name.id = temp.id;
PHP has some functions for asynchronous queries:
pg_send_execute()
pg_send_prepare()
pg_send_query()
pg_send_query_params()
No idea about other programming languages; you'll have to dig into the manuals.
I think you can't. A single connection can handle a single query at a time. It's described in the libpq documentation, chapter "Asynchronous Command Processing":
"After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the results. PQsendQuery cannot be called again (on the same connection) until PQgetResult has returned a null pointer, indicating that the command is done."

collecting mysql statistics

What would be the easiest way to count the new records that are inserted into a database? Is it possible to include a count query in with the load query?
Or is something more complex needed, such as recording the existing last record and counting everything added after it?
edit:
I have a cron job that uses LOAD DATA INFILE in a script that is passed directly to mysql. This data is used by a PHP web application, and as part of that application I need to generate weekly reports, including how many records were inserted in the last week.
I am unable to patch MySQL or drastically change the database schema/structure, but I am able to add new tables or fields. I would prefer not to count records from the csv file and store the result in a text file or something. Instead, I would prefer to do everything from within PHP with queries.
Assuming you're using MySQL 5 or greater, you could create a trigger which fires upon inserts into a specific table. Note that an INSERT trigger also fires with the LOAD command.
Using a trigger would require you to persist the count information in a separate table. Basically you'd need to create a new table with one row/column to hold the count. The trigger would then update that value with the amount of data loaded.
Here's the MySQL manual page on triggers; the syntax is fairly straightforward. http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
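A minimal sketch, assuming the LOAD DATA INFILE target table is called my_data (the counter table and trigger names are made up here):
-- single-row counter table
create table insert_counter (row_count int not null default 0);
insert into insert_counter values (0);

-- fires once per loaded/inserted row
create trigger count_loaded_rows after insert on my_data
for each row update insert_counter set row_count = row_count + 1;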
edit
Alternatively, if you don't want to persist the data within the database, you could perform your LOAD operation within a stored procedure. This would allow you to do a SELECT COUNT(*) on the table before you begin the load and after the load is complete. You would just need to subtract the resulting values to determine how many rows were inserted during the load.
Here's the MySQL manual page on procedures.
http://dev.mysql.com/doc/refman/5.0/en/create-procedure.html
That would probably depend on what counts as new. Is it entries entered into the database in the last five or ten minutes? Or is it any record past a certain auto ID?
If you are looking at a time-based method of determining what's new, you can have a field (probably of type datetime) that records the time when the record was inserted, and to get the number you simply do a...
select count(*) from table where currentTime > 'time-you-consider-to-be-new'
If you don't want to go by recording the time, you can use an auto-increment key and simply keep track of the last inserted ID, counting the rows that come after it in any given time window. So if one hour ago the ID was 10000, then some number of records have been inserted since then; count all records with an ID greater than 10000, record the latest insert ID, and repeat whenever needed.
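For example, if the last ID you recorded was 10000 (table and column names are placeholders):
-- new rows since the last check, plus the next watermark to store
select count(*) as new_rows, max(id) as latest_id from mytable where id > 10000;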
If you are not looking at a specific table, you can use the following:
show global status like "Com_%";
This will show you statistics for every type of query. These numbers just keep on counting, so if you want to use them, record the initial number when starting to track the queries, and subtract this from your final number (but yea, that's a given).
If you are looking for pure statistics, I can recommend using Munin with the MySQL plugins.
From where do you load the data? You might consider counting the records before you insert them into the database. If it's an SQL script, you might write a quick and dirty bash script (with grep or something similar) to count the fields.
You say you can't change the structure. Does that mean you can't change the table you are inserting into, or you can't change the database at all? If you can add a table, then just create a table with 2 columns - a timestamp and the key of the table you are loading. Before you load your csv file, create another csv file with just those two columns, and load that csv after your main one.
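A sketch of that idea, with made-up names (load_audit, record_key); the second csv is loaded right after the main one, and the weekly report becomes a simple count:
-- companion table recording when each key was loaded
create table load_audit (loaded_at timestamp not null, record_key int not null);

-- run right after the main LOAD DATA INFILE
load data infile 'audit.csv' into table load_audit
fields terminated by ',';

-- weekly report
select count(*) from load_audit where loaded_at > now() - interval 7 day;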
This might be simpler than you want, but what about a Nagios monitor to track the row count? (Also consider asking around on serverfault.com; this stuff is totally up their alley.)
Perhaps you could write a small shell script that queries the database for the number of rows. You could then have a Cron job that runs every minute/hour/day etc and outputs the COUNT to a log file. Over time, you could review the log file and see the rate at which the database is growing. If you also put a date in the log file, you could review it easier over longer periods.
See if this is the kind of MySQL data collection you're interested in: http://code.google.com/p/google-mysql-tools/wiki/UserTableMonitoring.
If that is the case, Google offers a MySQL patch (to apply to a clean mysql directory source) at http://google-mysql-tools.googlecode.com/svn/trunk/mysql-patches/all.v4-mysql-5.0.37.patch.gz. You can read more about the patch at http://code.google.com/p/google-mysql-tools/wiki/Mysql5Patches.
If this is not what you're looking for, I suggest you explain yourself a little more in order for us to help you better.
Could you use a trigger on the table which inserts into a table you created, whose structure includes a timestamp?
You could then use a date calculation over a period range to find the information needed.
I don't know what version of MySQL you are using, but here is a link to the syntax for trigger creation in version 5.0: http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
Good luck,
Matt
Well, if you need exhaustive information about which rows were inserted, updated or deleted, it might make sense to create an additional audit table to store those things with a timestamp. You could do this with triggers. I would also write a stored procedure which executes as an event and erases old entries (whatever you consider old).
Refer to the link posted by Lima on how to create triggers in MySQL.
Refer to page 655 of "MySQL Cookbook" by Paul Dubois (2nd Edition) or page 158 of "SQL for smarties" by Joe Celko.
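A rough sketch of the audit idea in MySQL with illustrative names; the cleanup is written directly as an event here, and the event scheduler must be enabled for it to run:
-- audit table
create table row_audit (
    action     varchar(10) not null,
    changed_at timestamp   not null default current_timestamp
);

-- one audit row per insert and per delete on the monitored table
create trigger audit_insert after insert on my_data
for each row insert into row_audit (action) values ('INSERT');

create trigger audit_delete after delete on my_data
for each row insert into row_audit (action) values ('DELETE');

-- erase entries you consider old, once a day
create event purge_row_audit on schedule every 1 day
do delete from row_audit where changed_at < now() - interval 30 day;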
So, will the 'load' only insert new data into the table, or rewrite the whole table?
If it will load new data, then you can do a select count(*) from yourtable
once before the loading and once after the loading... the difference will show you how many new records were inserted.
If, on the other hand, you rewrite the whole table and want to find the records that differ from the previous version, then you would need a completely different approach.
Which one is it ?
Your question is a bit ambiguous, but the MySQL C API provides a function mysql_affected_rows that you can call after each query to get the number of affected rows. For an insert it returns the number of rows inserted. Be aware that for updates it returns the number of rows changed, not the number of rows that matched the WHERE clause.
If you are performing a number of queries and need to know how many were inserted the most reliable way would probably be doing a count before and after the queries.
As noted in sobbayi's answer adding a "created at" timestamp to your tables would allow you to query for records created after (or before) a given time.
UPDATE:
OK here is what you need to do to get a count before and after:
create a table for the counts:
create table row_counts (ts timestamp not null, row_count integer not null);
in your script, add the following before and after your LOAD DATA INFILE query:
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
load data infile ......
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
the row_counts table will now have the count before and after your load.
show global status like 'Com_insert';
flush status and show session status... will work for just the current connection.
see http://dev.mysql.com/doc/refman/5.1/en/server-status-variables.html#statvar_Com_xxx
Since you asked for the easiest way, I would suggest you use a trigger on insert. You could use a single-column, single-row table as a counter and update it with the trigger.

How to find number of rows inserted/deleted in MySQL

Is there a way to find out the number of rows inserted/deleted in a table in MySQL? Is this kind of statistics kept somewhere in the database? If not, what would be the best way to implement something to keep track of these statistics?
When I say how many, I mean within a certain period (last 24 hours, or since server was up, or last week etc)
When I need to keep track of deleted things, I just don't delete.
I change a column value that excludes it from normal user results.
If space is an issue, you can set the contents you no longer care about to empty.
For inserts, you can use COUNT().
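For example, with a flag and timestamp columns of my own choosing:
-- "delete" by flagging instead of removing the row
update my_data set deleted = 1, deleted_at = now() where id = 42;

-- deleted in the last 24 hours
select count(*) from my_data where deleted = 1 and deleted_at > now() - interval 1 day;

-- inserts can be counted the same way with a created_at column
select count(*) from my_data where created_at > now() - interval 1 day;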
The Binary Log contains records of all queries that update or insert data. I don't know if it stores the number of affected rows, however.
There is also a General Query Log, which tracks all queries that were run.
(Information current for MySQL 5.0. If you're using an older version ymmv)
If I want to handle logging my SQL queries, I have 2 possibilities:
Turning the MySQL Log function on
Writing my own 'trace' class
I prefer doing number 2.
Why?
Because it is more controllable. You can easily distinguish INSERT, DELETE, UPDATE and other queries.
But that is not the only advantage of your own trace class: creating trace files (so-called "logs") makes administrative tasks much easier.
You can structure the trace output, put it into a separate database, store it into some XML or JSON file.
You can order things as you want them to be.