Need to query a database for 12 million rows, process this data and then insert the filtered data into another database.
I can't just do a SELECT * from the database for obvious reasons - far too much data would be returned for my program to handle, and also this is a live database (customer order details) and I can't have the database crawl to a halt for 10 minutes while it runs my query.
I'm looking for inspiration on how to write this program. I have to process each row. I was thinking it might be best to get a count of the rows, then grab X at a time, wait for Y seconds, and repeat until the dataset is complete. This way I'm not overloading the database, and since X will be sufficiently small, each batch will fit nicely in memory.
Any other suggestions or feedback?
I'd recommend you read the docs on SELECT ... INTO OUTFILE and LOAD DATA INFILE.
These are very fast ways of dumping data to a flat file and then importing it to another database.
You could dump into the flat file, and then run an offline script to process your rows, and then once that's done import the result to the new database.
See also:
http://dev.mysql.com/doc/refman/5.1/en/select.html (search for "INTO OUTFILE")
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
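The offline processing step between the dump and the import can be a plain streaming filter over the flat file. Here is a minimal sketch in Python, assuming the dump uses MySQL's default tab-separated, newline-terminated format and that no field contains an escaped tab or newline. The function names and the `keep_row` predicate are hypothetical, not part of any MySQL tooling.

```python
def filter_dump_lines(lines, keep_row):
    """Yield only the dump lines whose fields satisfy keep_row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if keep_row(fields):
            yield line


def filter_dump_file(in_path, out_path, keep_row):
    """Stream in_path to out_path without loading the whole dump into memory."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(filter_dump_lines(src, keep_row))
```

Because the file is read line by line, memory use stays flat regardless of whether the dump holds 12 thousand or 12 million rows. The filtered file can then be loaded into the target database with LOAD DATA INFILE.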
Spreading the load over time seems the only practicable solution. Exactly how to do it depends to some extent on your schema, how records change over time in the "live database", and what consistency semantics your processing must have.
In the worst case -- any record can be changed at any time, there is nothing in the schema that lets you easily and speedily check for "recently modified, inserted, or deleted records", and you nevertheless need to be consistent in what you process -- the task is simply unfeasible, unless you can count on some special support from your relational engine and/or OS (such as volume or filesystem "snapshots", like in Linux's LVM, that let you cheaply and speedily "freeze in time" a copy of the volumes on which the DB resides, for later leisurely fetching with another, read-only, database configured to read from the snapshot volume).
But presumably you do have some constraints, something in the schema that helps with the issue, or else, one can hope, you can afford some inconsistency generated by changes in the DB happening at the same time as your processing -- some lines processed twice, some not processed, some processed in older versions and others in newer versions... unfortunately, you have told us next to nothing about any of these issues, making it essentially unfeasible to offer much more help. If you edit your question to provide a LOT more information on platform, schema, and DB usage patterns, maybe more help can be offered.
A flat file or a snapshot are both ideal.
If a flat file does not suit, or you do not have access to snapshots, then you could use a sequential id field, or create a sequential id in a temp table, and iterate using that.
Something like
declare @max_id int = 0
while exists (select * from table where seq_id > @max_id)
begin
  select top n * from table where seq_id > @max_id order by seq_id
  ... process ...
  set @max_id = (max seq_id from the batch just processed)
end
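The same loop can be driven from client code. The following is a rough sketch in Python using the standard sqlite3 module as a stand-in for the real database. The table name `source_table`, the `payload` column, and the `handle_row` callback are assumptions made for illustration, and `seq_id` is assumed to be a positive, indexed integer.

```python
import sqlite3
import time


def process_in_batches(conn, handle_row, batch_size=1000, pause_seconds=0.0):
    """Walk source_table in seq_id order, one small batch at a time."""
    max_id = 0  # highest seq_id handled so far
    total = 0
    while True:
        rows = conn.execute(
            "SELECT seq_id, payload FROM source_table "
            "WHERE seq_id > ? ORDER BY seq_id LIMIT ?",
            (max_id, batch_size),
        ).fetchall()
        if not rows:
            break  # nothing left above max_id
        for row in rows:
            handle_row(row)
        max_id = rows[-1][0]  # resume after the last row of this batch
        total += len(rows)
        time.sleep(pause_seconds)  # pause so the live database is not saturated
    return total
```

Filtering on `seq_id > max_id` rather than using an offset means each batch is an index range scan, so later batches cost about the same as earlier ones.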
If there is no sequential id then you can create a temp table that holds the order like
insert into some_temp_table
select unique_id from table order by your_ordering_scheme
then process like this
... do something with top n from table join some_temp_table on unique_id ...
delete top n from some_temp_table
this way temp_table holds the record identifiers that still need to be processed.
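As an illustration of the temp-table approach, here is a sketch in Python with sqlite3 standing in for the real database. The table `source_table`, its `payload` column, the temp table name `pending`, and the `handle_row` callback are all hypothetical, and ordering by `unique_id` is just a placeholder for whatever ordering scheme applies.

```python
import sqlite3


def process_with_queue(conn, handle_row, batch_size=500):
    """Use a temp table of pending ids as a to-do list."""
    conn.execute("CREATE TEMP TABLE pending (unique_id TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO pending SELECT unique_id FROM source_table")
    processed = 0
    while True:
        ids = [r[0] for r in conn.execute(
            "SELECT unique_id FROM pending ORDER BY unique_id LIMIT ?",
            (batch_size,))]
        if not ids:
            break  # the to-do list is empty
        marks = ",".join("?" * len(ids))
        rows = conn.execute(
            "SELECT unique_id, payload FROM source_table "
            "WHERE unique_id IN (%s)" % marks, ids).fetchall()
        for row in rows:
            handle_row(row)
            processed += 1
        # Remove the ids just handled so only unprocessed ones remain.
        conn.execute("DELETE FROM pending WHERE unique_id IN (%s)" % marks, ids)
        conn.commit()
    conn.execute("DROP TABLE pending")
    return processed
```

In this sketch the pending table is a session-local temp table. Making it a regular table instead would let an interrupted job resume from the ids that are still listed.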
You don't mention which db you are using, but I doubt any db that can hold 12 million rows would actually try to return all the data to your program at once. Your program essentially streams the data in small blocks (say 1000 rows), something that is usually handled by the database driver.
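A sketch of what consuming a large result in small blocks looks like from Python's DB-API, shown here with sqlite3. The helper name `stream_rows` is made up for this example. Whether rows are truly streamed from the server, or buffered in full on the client first, depends on the driver and cursor type, so treat this as an assumption to verify for your database.

```python
import sqlite3


def stream_rows(conn, query, params=(), chunk_size=1000):
    """Yield rows one at a time while fetching them from the cursor in chunks."""
    cursor = conn.execute(query, params)
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break  # result set exhausted
        yield from chunk
```

The caller iterates over `stream_rows(...)` as if it were one long list, while at most `chunk_size` rows are held by this helper at any moment.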
RDBMSs have different transaction isolation levels, which can be used to reduce the effort the database spends maintaining consistency guarantees and so avoid locking up the table.
Databases can also create snapshots of tables to a file for later analysis.
In your position, I would try the simplest thing first, and see how that scales (on a development copy of the db with simulated user access.)
Related
Let's say I have a database with lots of tables, but there's one big table that's being updated regularly. At any given point in time, this table contains billions of rows, and let's say that the table is updated so regularly that we can expect a 100% refresh of the table by the end of each quarter. So the volume of data being moved around is on the order of tens of billions. Because this table is changing so constantly, I want to implement PITR (point-in-time recovery), but only for this one table. I have two options:
Hack PostgreSQL's in-house PITR to apply only for one table.
Build it myself by creating a base backup, set up continuous archiving, and using a python script to execute the log of SQL statements up to a point in time (or use PostgreSQL's EXECUTE statement to loop through the archive). The big con with this is that it won't have the timeline functionality.
My problem is, I don't know if option 1 is even possible, and I don't know if option 2 even makes sense (looping through billions of rows sounds like it defeats the purpose of PITR, which is speed and convenience.) What other options do I have?
I have a table MyTable with 29,000 rows.
MyTable structure {
StudentId bigint,
....
}
The table has more than 10 columns. The database is on the hosting server.
From SSMS I execute the query:
SELECT *
FROM MyTable
Is it normal that the execution lasts more than 5 min?
First of all, retrieving all the data from a remote database is rarely a good idea. You are using a significant share of bandwidth. Hopefully, the query you are using is only for debugging purposes and will never hit production.
You did not mention if it took 5 minutes before you started receiving something or if you are receiving your data over the course of that 5 minutes, at a constant rate.
In the first situation, not receiving rows at all might indicate that a lock is in effect on your table, due to another operation.
In the latter situation, you are steadily receiving rows, but at a slow rate. Bandwidth and server load play a big part in that. To get a rough idea of the amount of data that you are downloading, run this stored procedure:
EXEC sp_spaceused 'YourTableName';
Consider that the server has to upload that data and that you have to download the data.
Binary and XML fields (also called BLOB fields) usually hold a lot of data, and you may not be able to control the amount of data users store in those fields.
Try checking the size of your variable length fields (varchar, xml and varbinary) by running a DATALENGTH on your column:
SELECT DATALENGTH(MyField) FROM MyTable
You can also get an average:
SELECT AVG(DATALENGTH(MyField)) FROM MyTable
A good idea concerning BLOB fields is to retrieve them only when needed, not when you are loading a list of data.
For example, assume an XML field stored in a PurchaseOrder table. If you wish to display the list of POs to your user, you usually don't need to retrieve that field unless the user opens a PO.
Many recent ORMs, like NHibernate, offer lazy loading for columns, along with paging, so you can retrieve a small number of rows.
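Without an ORM, the same idea can be expressed by hand: page through the narrow columns for the list view and fetch the large column only for the one record being opened. This is a sketch in Python with sqlite3. The `purchase_order` table, its `customer` and `details_xml` columns, and both function names are assumptions for illustration only.

```python
import sqlite3


def list_orders(conn, page, page_size=50):
    """Return one page of the list view without touching the large XML column."""
    return conn.execute(
        "SELECT id, customer FROM purchase_order ORDER BY id LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()


def load_order_xml(conn, order_id):
    """Fetch the heavy column only when a single order is opened."""
    row = conn.execute(
        "SELECT details_xml FROM purchase_order WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0] if row else None
```

The list query never names `details_xml`, so the cost of the large column is paid once per opened order instead of once per listed row.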
Ayende posted a rant about loading unbounded result sets two weeks ago.
You're right - the select query shouldn't take that long. It's not the number of rows. Likely it's the type of data you've got on that table/view, and perhaps the storage configuration (slow disk, filegroups config, etc).
Some ideas to consider to remedy this performance problem:
be specific in the columns that you want to retrieve. For ad-hoc queries, SELECT * is fine, but recognize that the RDBMS will work slightly harder to determine which columns are on the table/view.
gathering the values of any columns of datatype text or varbinary will take proportionally longer, depending on the data within those fields.
consider the indexes (do you have any?) on the table/view?
is this a Prod database, where more/other activity might be hitting this table?
If you edit your question, perhaps include the full table definition so that we can get a real look at what's happening with the datatypes.
I would recommend that you consider OMG Ponies's recommendation - it could be due to the bandwidth between the box and your machine, so
try to remote into the box and see how long the query takes on that machine.
If it takes almost the same amount of time, then the problem lies in the database design, the underlying hardware, or other factors (table datatypes, wrong indexes, overall load on the machine, overall hits to this table, etc.)
If it takes significantly less time, then the problem is most likely between your machine and the box. Ideally this shouldn't be a big problem, because the web server will be closer to the db server, probably on the same LAN (so it should be much faster in the real world). Also, I'm sure you wouldn't use a 'Select *' in the actual app to pick 29000 rows, so it shouldn't create much of a problem.
I have a procedure that updates record values, and I want to run it against all records in a table (over 30k records). The procedure's execution time ranges from 2 to 10 seconds, depending on network load.
Right now I'm doing UPDATE table SET field = procedure_name(paramns); but with that number of records it takes up to 40 minutes to process the whole table.
I'm currently using 4 different connections, which fork to the background and fire the query with a WHERE clause set to iterate over the modulo of the row ids to speed this up ( WHERE id_field % 4 = ), and this works well and cuts the time to populate the table down to ~10 minutes.
But I want to avoid using cron, shell jobs and multiple connections for this. I know that it can be done with libpq, but is there a way to fire off a query (4 different non-blocking queries) without waiting for it to finish executing, within a single connection?
Or can anyone point me to some clues on how to write that function using postgres internals, or simply in C, and bind it as a stored procedure?
Cheers Darius
I've got a sure answer for this question - IF you will share with us what your ab workout is!!! I'm getting fat by the minute and I need answers myself...
OK I'll answer anyway.
If you are updating one table, on one database server, in 40 minutes 'single threaded' and in 10 minutes with 4 threads, the bottleneck is not the database server; otherwise, it would get bogged down in I/O. If you are executing a bunch of UPDATES, one call per record, the network round-trip time is killing you.
I'm pretty sure this is the case, rather than an I/O bottleneck on the DB or procedure_name(paramns); itself taking a long time. (If the procedure itself took 2-10 seconds per record, it would take something like 2500 min to do 30K records.) The reason I am sure is that starting 4 concurrent processes cuts the time to 1/4. So in particular it is not an I/O issue on the DB server.
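A quick check of the arithmetic behind that estimate, as a small sketch. The function names are made up for this example, and the 5-second figure is simply a value inside the stated 2-10 second range that yields the 2500-minute estimate.

```python
def sequential_minutes(records, seconds_per_call):
    """Total minutes if every call ran one after another."""
    return records * seconds_per_call / 60


def seconds_per_record(total_minutes, records):
    """Average seconds spent per record for an observed total run time."""
    return total_minutes * 60 / records


low = sequential_minutes(30_000, 2)     # 1000.0 minutes at 2 s per record
high = sequential_minutes(30_000, 10)   # 5000.0 minutes at 10 s per record
mid = sequential_minutes(30_000, 5)     # 2500.0 minutes at 5 s per record
observed = seconds_per_record(40, 30_000)  # about 0.08 s per record
```

The observed 40 minutes works out to roughly 0.08 seconds per record, far below 2 seconds, which is why the per-record cost looks like round-trip overhead rather than the procedure's own running time.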
This might be the one excuse for putting business logic in an SP on the server. Optimization unfortunately means breaking the rules. The consequence is difficult maintenance. but, duh!!
However, the best solution would be to get this set up to use 'bulk update' queries. That might mean you have to take several strange and unintuitive steps such as this:
This will require a lot of modification if multiple users can run it concurrently.
refactor the system so procedure_name(paramns) can get all the data it needs to process all records via a select statement. May need to use creative joins. If it's an SP of course now you are moving the logic to the client.
Use that to have the program create an XML or other importable flat file format with the PK of the record to update, and the new field value or values. Write all the updates to this file instead of executing them on the DB.
have a temp table on the database that matches the layout of this flat file
run an import on the database - clear the temp table and import the file
do an update of a join of the temp table and the table to be updated, e.g., UPDATE mytbl, mytemp SET myval=mytempnewval WHERE myPK=mytempPK (use the right join syntax for your database, of course).
You can try some of these things 'by hand' first before you bother coding, to see if it's worth the speed increase.
If possible, you can still put this all in an SP!
I'm not making any guarantees, especially as I look down at my ever-fattening belly, but, this has the potential to melt your update job down to under a minute.
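The staging-table steps above can be tried out on a small scale before committing to the approach. This is a sketch in Python with sqlite3 standing in for the real server. The table and column names follow the `mytbl` / `mytemp` example from the answer, the function name is hypothetical, and a correlated subquery is used in place of join-update syntax because join syntax differs between databases.

```python
import sqlite3


def bulk_update(conn, new_values):
    """Apply many (pk, new_value) pairs with one import and one UPDATE."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS mytemp "
        "(mytemppk INTEGER PRIMARY KEY, mytempnewval TEXT)")
    conn.execute("DELETE FROM mytemp")  # clear the staging table
    # Stand-in for importing the flat file into the staging table.
    conn.executemany("INSERT INTO mytemp VALUES (?, ?)", new_values)
    cur = conn.execute(
        "UPDATE mytbl SET myval = "
        "(SELECT mytempnewval FROM mytemp WHERE mytemppk = mytbl.mypk) "
        "WHERE mypk IN (SELECT mytemppk FROM mytemp)")
    conn.commit()
    return cur.rowcount  # number of rows changed by the single UPDATE
```

All new values travel to the server in one batch and are applied by one statement, so the per-record round trip disappears.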
It is possible to update multiple rows at once. Below is an example in PostgreSQL:
UPDATE
table_name
SET
column_name = temp.column_name
FROM
(VALUES
(<id1>, <value1>),
(<id2>, <value2>),
(<id3>, <value3>)
) AS temp("id", "column_name")
WHERE
table_name.id = temp.id
PHP has some functions for asynchronous queries:
pg_send_execute()
pg_send_prepare()
pg_send_query()
pg_send_query_params()
No idea about other programming languages, you have to dig into the manuals.
I think you can't. A single connection can handle only one query at a time. This is described in the libpq documentation chapter "Asynchronous Command Processing":
"After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the results. PQsendQuery cannot be called again (on the same connection) until PQgetResult has returned a null pointer, indicating that the command is done."
I have a DB table in which each row has a randomly generated primary key, a message and a user. Each user has about 10-100 messages but there are 10k-50k users.
I write the messages daily for each user in one go. I want to throw away the old messages for each user before writing the new ones to keep the table as small as possible.
Right now I effectively do this:
delete from table where user='mk'
Then write all the messages for that user. I'm seeing a lot of contention because I have lots of threads doing this at the same time.
I do have an additional requirement to retain the most recent set of messages for each user.
I don't have access to the DB directly. I'm trying to guess at the problem based on some second hand feedback. The reason I'm focusing on this scenario is that the delete query is showing a lot of wait time (again - to the best of my knowledge) plus it's a newly added bit of functionality.
Can anyone offer any advice?
Would it be better to:
select key from table where user='mk'
Then delete individual rows from there? I'm thinking that might lead to less brutal locking.
If you do this everyday for every user, why not just delete every record from the table in a single statement? Or even
truncate table whatever reuse storage
/
edit
The reason why I suggest this approach is that the process looks like a daily batch upload of user messages preceded by a clearing out of the old messages. That is, the business rule seems to me to be "the table will hold only one day's worth of messages for any given user". If this process is done for every user then a single operation would be the most efficient.
However, if users do not get a fresh set of messages each day and there is a subsidiary rule which requires us to retain the most recent set of messages for each user then zapping the entire table would be wrong.
No, it is always better to perform a single SQL statement on a set of rows than a series of "row-by-row" (or what Tom Kyte calls "slow-by-slow") operations. When you say you are "seeing a lot of contention", what are you seeing exactly? An obvious question: is column USER indexed?
(Of course, the column name can't really be USER in an Oracle database, since it is a reserved word!)
EDIT: You have said that column USER is not indexed. This means that each delete will involve a full table scan of up to 50K*100 = 5 million rows (or at best 10K * 10 = 100,000 rows) to delete a mere 10-100 rows. Adding an index on USER may solve your problems.
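The numbers come from multiplying users by messages per user: 50,000 x 100 = 5,000,000 rows at most, and 10,000 x 10 = 100,000 rows at least. To see the effect of the index on the delete, here is a small sketch in Python using sqlite3 rather than Oracle, so the plan text is SQLite's and only illustrates the principle. The `messages` table, the index name, and the helper function are assumptions.

```python
import sqlite3


def delete_plan(conn):
    """Return SQLite's query-plan text for the per-user delete."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN DELETE FROM messages WHERE user = ?", ("mk",)
    ).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id TEXT PRIMARY KEY, user TEXT, message TEXT)")
plan_before = delete_plan(conn)  # no usable index yet: a full table scan
conn.execute("CREATE INDEX idx_messages_user ON messages (user)")
plan_after = delete_plan(conn)   # the plan now names idx_messages_user
```

On Oracle the equivalent check would be an execution plan for the delete statement, which is something the DBA can pull if you lack access.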
Are you sure you're seeing lock contention? It seems more likely that you're seeing disk contention due to too many concurrent (but unrelated) updates. The solution to that is simply to reduce the number of threads you're using: less disk contention will mean higher total throughput.
I think you need to define your requirements a bit more clearly...
For instance, if you know all of the users you want to write messages for, insert the IDs into a temp table, index it on ID, and batch delete. The threads you are firing off then do two things: write the ID of the user to one temp table, and write the message to another temp table. When the threads have finished executing, the main thread should run:
DELETE FROM Messages WHERE ID IN (SELECT TEMP_ID FROM TEMP_MEMBERS)
INSERT INTO MESSAGES SELECT * FROM TEMP_MESSAGES
I'm not familiar with Oracle syntax, but that is the way I would approach it IF the users' messages are all written in rapid succession.
Hope this helps
TALK TO YOUR DBA
He is there to help you. When we DBAs take access away from the developers for something such as this, it is assumed we will provide the support for you for that task. If your code is taking too long to complete and that time appears to be tied up in the database, your DBA will be able to look at exactly what is going on and offer suggestions or possibly even solve the problem without you changing anything.
Just glancing over your problem statement, it doesn't appear you'd be looking at contention issues, but I don't know anything about your underlying structure.
Really, talk to your DBA. He will probably enjoy looking at something fun instead of planning the latest CPU deployment.
This might speed things up:
Create a lookup table:
create table rowid_table (row_id ROWID ,user VARCHAR2(100));
create index rowid_table_ix1 on rowid_table (user);
Run a nightly job:
truncate table rowid_table;
insert /*+ append */ into rowid_table
select ROWID row_id , user
from table;
dbms_stats.gather_table_stats('SCHEMAOWNER','ROWID_TABLE');
Then when deleting the records:
delete from table
where ROWID IN (select row_id
from rowid_table
where user = 'mk');
Your own suggestion seems very sensible. Locking in small batches has two advantages:
the transactions will be smaller
locking will be limited to only a few rows at a time
Locking in batches should be a big improvement.
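A sketch of what deleting in small batches could look like, written in Python with sqlite3 as a stand-in for the real database. The `messages` table, the function name, and the batch size are assumptions. The subquery with a row limit would need to be adapted to Oracle syntax.

```python
import sqlite3


def delete_user_messages(conn, user, batch_size=20):
    """Delete one user's rows a few at a time, committing after each batch."""
    deleted = 0
    while True:
        cur = conn.execute(
            "DELETE FROM messages WHERE rowid IN "
            "(SELECT rowid FROM messages WHERE user = ? LIMIT ?)",
            (user, batch_size),
        )
        conn.commit()  # end the transaction so locks are released between batches
        if cur.rowcount == 0:
            break  # nothing left for this user
        deleted += cur.rowcount
    return deleted
```

Each commit keeps the transaction short, at the price of more statements overall, so it is worth measuring against the single-statement delete before adopting it.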
Is it possible to have a 'persistent' temp table in MS-SQL? What I mean is that I currently have a background task which generates a global temp table, which is used by a variety of other tasks (which is why I made it global). Unfortunately, if the table becomes unused, it gets deleted by SQL automatically. This is handled gracefully by my system, since it just queues it up to be rebuilt again, but ideally I would like it to be built just once a day. So, ideally I could just set some timeout parameter, like "If nothing touches this for 1 hour, then delete".
I really don't want it in my existing DB because it will cause loads more headaches related to managing the DB (fragmentation, log growth, etc), since it's effectively rollup data, only useful for a 24 hour period, and takes up more than one gigabyte of HD space.
Worst case my plan is to create another DB on the same drive as tempdb, call it something like PseudoTempDB, and just handle the dropping myself.
Any insights would be greatly appreciated!
If you create a table as tempdb.dbo.TempTable, it won't get dropped until:
a - SQL Server is restarted
b - You explicitly drop it
If you would like to have it always available, you could create that table in model, so that it will get copied to tempdb during the restart (but it will also be created on any new database you create afterwards, so you would have to delete manually) or use a startup stored procedure to have it created. There would be no way of persisting the data through restarts though.
I would go with your plan B, "create another DB on the same drive as tempdb, call it something like PseudoTempDB, and just handle the dropping myself."
How about creating a permanent table? Say, MyTable. Once every 24 hours, refresh the data like this:
Create a new table MyTableNew and populate it
Within a transaction, drop MyTable, and use sp_rename to rename MyTableNew to MyTable
This way, you're recreating the table every day.
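The build-then-swap pattern can be sketched as follows, in Python with sqlite3 standing in for SQL Server. On SQL Server the rename step would use its own rename procedure instead of ALTER TABLE ... RENAME. The column layout and the function name are assumptions for illustration.

```python
import sqlite3


def refresh_table(conn, rows):
    """Build MyTableNew, then swap it in for MyTable inside one transaction."""
    conn.execute("DROP TABLE IF EXISTS MyTableNew")
    conn.execute("CREATE TABLE MyTableNew (id INTEGER PRIMARY KEY, total REAL)")
    conn.executemany("INSERT INTO MyTableNew VALUES (?, ?)", rows)
    conn.commit()  # the slow build is finished before the swap starts
    conn.execute("BEGIN")
    try:
        conn.execute("DROP TABLE IF EXISTS MyTable")
        conn.execute("ALTER TABLE MyTableNew RENAME TO MyTable")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # keep the old table if the swap fails
        raise
```

Readers only ever see the old table or the complete new one, because the slow populate step happens before the short swap transaction begins.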
If you're worried about log files, store the table in a different database and set it to Recovery Model: Simple.
I have to admit to doing a double-take on this question: "persistent" and "temp" don't usually go together! How about a little out-of-the-box thinking? Perhaps your background task could periodically run a trivial query to keep SQL from marking the table as unused. That way, you'd take pretty direct control over creation and tear down.
After 20 years of experience dealing with all major RDBMS in existence, I can only suggest a couple of things for your consideration:
Note the oxymoronic concepts: "persistent" and "temp" are complete opposites. Choose one, and one only.
You're not doing your database any favors by writing data to the temp DB on a manual, semi-permanent, user-driven basis. Normal tablespaces (i.e. user) are already there for that purpose. The temp DB is for temporary things.
If you already know that such a table will be permanently used ("daily basis" IS permanent), then create it as a normal table on a user database/schema.
Every time that you delete and recreate the very same table you're fragmenting your whole database. You also get the perverse bonus of never giving the DB engine's optimizer a chance to assist you with even crude optimization. Instead, try truncating it. Your rollback segments will thank you for that small relief, and disk space will probably still be allocated for when you repopulate it the next day. You can force that desired behavior by specifying a separate tablespace and datafile for that table alone.
Finally, and most important: stop tormenting yourself and your DB engine over a measly 1 GB of data. You're wasting CPU and I/O cycles and adding latency, fragmentation, and so on for the sake of saving literally 0.02 cents of hardware real estate. Talk about dropping to the floor in a tuxedo to pick up a brown cent. 😂