I have a table with ~30,000,000 rows that I need to iterate through, manipulate the data for each row individually, then save the data from each row to a file on a local drive.
What is the most efficient way to loop through all the rows in the table using SQL for Oracle? I've been googling but can see no straightforward way of doing this. Please help. Keep in mind I do not know the exact number of rows, only an estimate.
EDIT FOR CLARIFICATION:
We are using Oracle 10g, I believe. The row data contains BLOB data (zipped text files and XML files) that will be read into memory, loaded into a custom object, updated/converted using .NET DOM access classes, rezipped, and stored on a local drive.
I do not have much database experience whatsoever - I planned to use straight SQL statements with ADO.Net + OracleCommands. No performance restrictions really. This is for internal use. I just want to do it the best way possible.
You need to read 30m rows from an Oracle DB and write out 30m files from the BLOB in each row (one zipped XML/text file per row?) to the file system on the local computer?
The obvious solution is to open an ADO.NET DataReader on SELECT * FROM tbl WHERE <range> so you can work in batches. Read the BLOB from the reader into your API, do your stuff, and write out the file. I would probably try to write the program so that it can run from many computers, each doing its own ranges - your bottleneck is most likely going to be the unzipping, manipulation and rezipping, since many consumers can probably stream data from that table off the server without any noticeable effect on server performance.
I doubt you'll be able to do this with set-based operations internal to the Oracle database, and I would also be thinking about the file system and how you are going to organize so many files (and whether you have the space - remember that the space taken up by a file on the file system is always rounded up to a multiple of the file system block size).
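The question is .NET/ADO.NET, but as a rough sketch of that ranged-batch idea - shown here in JDBC, with made-up names (MY_TABLE, ID, PAYLOAD) and placeholder connection details - the shape would be something like this; the same loop translates almost line for line to an OracleCommand/OracleDataReader:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class BlobRangeExport {
    public static void main(String[] args) throws Exception {
        final int rangeSize = 1000;                 // rows fetched per round trip
        final Path outDir = Paths.get("out");
        Files.createDirectories(outDir);

        // Placeholder connection details.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password")) {

            // Find the highest id once, then walk fixed-size ranges up to it,
            // so gaps in the id sequence don't end the loop early.
            long maxId;
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT MAX(id) FROM my_table")) {
                rs.next();
                maxId = rs.getLong(1);
            }

            for (long base = 0; base <= maxId; base += rangeSize) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, payload FROM my_table WHERE id >= ? AND id < ?")) {
                    ps.setLong(1, base);
                    ps.setLong(2, base + rangeSize);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            long id = rs.getLong("id");
                            // Stream the BLOB straight to disk instead of holding it all in memory;
                            // the unzip/manipulate/rezip step would slot in here.
                            try (InputStream in = rs.getBinaryStream("payload")) {
                                Files.copy(in, outDir.resolve(id + ".zip"),
                                           StandardCopyOption.REPLACE_EXISTING);
                            }
                        }
                    }
                }
            }
        }
    }
}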
My initial solution was to do something like this, as I have access to an id number (pseudocode):
int num_rows = 100;
int base = 0;
int ceiling = num_rows;

select * from MY_TABLE where id >= base and id < ceiling;
-- iterate through retrieved rows, do work

base = ceiling;
ceiling += num_rows;
select * from MY_TABLE where id >= base and id < ceiling;
-- iterate through retrieved rows, do work

-- ...and so on
But I feel that this might not be the most efficient or best way to do it...
You could try using ROWNUM queries to grab chunks until a chunk comes back empty.
This is a good article on rownum queries:
http://www.oracle.com/technetwork/issue-archive/2006/06-sep/o56asktom-086197.html
If you don't feel like reading, jump directly to the "Pagination with ROWNUM" section at the end for an example query.
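For reference, the pattern from that article wrapped in a small JDBC helper; the table and column names below are placeholders, not anything from the question, and the ORDER BY has to sit in the innermost query for the paging to be stable:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class RownumPager {
    // Fetch rows (lo, hi] of my_table ordered by id - classic pre-12c Oracle ROWNUM pagination.
    static void fetchChunk(Connection conn, long lo, long hi) throws Exception {
        String sql =
            "SELECT * FROM ( " +
            "  SELECT a.*, ROWNUM rnum FROM ( " +
            "    SELECT id, payload FROM my_table ORDER BY id " +
            "  ) a WHERE ROWNUM <= ? " +
            ") WHERE rnum > ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, hi);   // upper bound of this chunk, e.g. 2000
            ps.setLong(2, lo);   // upper bound of the previous chunk, e.g. 1000
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // unzip / manipulate / write out the row here
                }
            }
        }
    }
}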
It's always preferable to use set-based operations when working with a large number of rows.
You would then enjoy a performance benefit. After processing the data, you should be able to dump the data from the table into a file in one go.
The viability of this depends on the processing you need to perform on the rows, although it is possible in most cases to avoid using a loop. Is there some specific requirement which prevents you from processing all rows at once?
If iterating through the rows is unavoidable, bulk binding can help: FORALL for bulk DML, or BULK COLLECT for SELECT ... INTO queries.
It sounds like you need the entire row before you can do any data manipulation, since it is a BLOB. I would just use a DataAdapter.Fill and then hand the dataset over to the custom object to iterate through, do its manipulation, write the end object to disk, and then zip.
Related
I'm trying to insert a large number of records by selecting from a different table.
In the example below, the BAR table has around 1 million records and I'm trying to insert all of them into the FOO table. Is there a way I can do this efficiently without the loader API or batch inserts with jOOQ?
FYI, I'm trying to avoid loading all the records into memory, so I'm not using the loader API, which expects jOOQ Records.
dslContext
    .insertInto(FOO)
    .columns(FOO.A, FOO.B)
    .select(
        select(A, B)
        .from(BAR))
    .execute();
This isn't strictly a jOOQ problem as you'd run into the same issues when writing the equivalent query in JDBC or even in a stored procedure. Such a bulk data transfer operation is usually the most efficient way to copy data between tables using SQL. There might be other tools available that bypass the SQL layer (e.g. pg_dump), but with SQL, this is optimal.
If you don't have enough resources to run everything in one go, you could partition your data set into several chunks using different techniques:
By transferring data of individual date ranges
By transferring data of individual ID ranges
By using keyset pagination
When partitioning your data as mentioned above, also check whether you can decrease the transaction size, e.g. to 1000 rows per commit. This isn't an exact science; you'll have to find appropriate chunk and transaction sizes empirically for your specific system.
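As a sketch only, not the definitive approach: if BAR happens to have a numeric key (the question doesn't show one, so BAR.ID, the chunk size and the per-chunk commit below are all assumptions), the insert can be chunked by ID range with plain jOOQ, using org.jooq.impl.DSL:

long chunkSize = 10_000L;

// Highest key in the source table; null if BAR is empty (BAR.ID is an assumed column).
Long maxId = dslContext
    .select(DSL.max(BAR.ID))
    .from(BAR)
    .fetchOne()
    .value1();

for (long lo = 0; maxId != null && lo <= maxId; lo += chunkSize) {
    dslContext
        .insertInto(FOO)
        .columns(FOO.A, FOO.B)
        .select(
            DSL.select(BAR.A, BAR.B)
               .from(BAR)
               .where(BAR.ID.ge(lo).and(BAR.ID.lt(lo + chunkSize))))
        .execute();
    // commit here (or let auto-commit do it) so each chunk runs in its own transaction
}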
With all of these approaches, ACID is no longer guaranteed, so if your source data is modified during the move, you'll have to detect that somehow and "fix it" (e.g. by flagging rows that have been moved).
Or, just add more memory to the system.
I have two models -
ChatCurrent - (which stores the messages for the current active chats)
ChatArchive - (which archives the messages for the chats that have ended)
The reason I'm doing this is so that the ChatCurrent table always has a minimal number of entries, making queries on the table fast (I don't know if this works, please let me know if I've got this wrong).
So I basically want to copy (cut) data from the ChatCurrent model to the ChatArchive model. What would be the fastest way to do this? From what I've read online, it seems that I might have to execute a raw SQL query; if you would be kind enough to state the query as well, I'd be grateful.
Additional details -
Both the models have the same schema.
My opinion is that today there is no reason to denormalize the database in this way to improve performance. Indexes, or partitioning plus indexes, should be enough.
Also, if for semantic reasons you prefer to have two tables (models), like Chat and ChatHistory (or ChatCurrent and ChatArchive) as you say, and manage them with Django, I think the right way to keep consistency is to create a ToArchive() method on ChatCurrent. This method moves chat entries to the historical chat model. You can perform the operation in the background, e.g. run the move in a Celery task, so online users don't have to wait on the request. Inside the Celery task the fastest way to copy the data is raw SQL. Remember that you can encapsulate the SQL in a stored procedure.
Edited to include reply to your comment
You can call ToArchive() from the ChatCurrent.save() method:
class ChatCurrent(models.Model):
    closed = models.BooleanField()

    def save(self, *args, **kwargs):
        super(ChatCurrent, self).save(*args, **kwargs)
        if self.closed:
            self.ToArchive()

    def ToArchive(self):
        from django.db import connection, transaction
        cursor = connection.cursor()
        cursor.execute("insert into blah blah")  # move this chat's rows into the archive table
        transaction.commit_unless_managed()
        # self.delete()  # if needed (perhaps deleted in the raw SQL)
Try something like this:
INSERT INTO "ChatArchive" ("column1", "column2", ...)
SELECT "column1", "column2", ...
FROM "ChatCurrent" WHERE yourCondition;
and then just
DELETE FROM "ChatCurrent" WHERE yourCondition;
What you are trying to do is table partitioning.
Most databases support this feature without the need for manual bookkeeping.
Partitioning will also yield much better results than manually moving parts of the data to a different table. By using partitioning you avoid:
- Data inconsistency, which is easy to introduce because you will move records in bulk and then remove a lot of them from the source table. It's easy to make a mistake and copy only a portion of the data.
- Performance drops - moving the data around, and the associated transaction overhead, will generally negate any benefit you got from reducing the size of the ChatCurrent table.
For a really quick rundown: table partitioning allows you to tell the database that parts of the data are stored and retrieved together. This significantly speeds up queries, as the database knows it only has to look into a specific part of the data set. Example: chats from the current day, last hour, last month, etc. You can additionally store each partition on a different drive; that way you can keep your current chats on a fast SSD drive and your history on regular, slower disks.
Please refer to your database manual to know the details about how it handles partitioning.
Example for PostgreSQL: http://www.postgresql.org/docs/current/static/ddl-partitioning.html
Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory.
When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of sequential scan of that partition instead of using an index and random access reads scattered across the whole table.
Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. ALTER TABLE NO INHERIT and DROP TABLE are both far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE.
Seldom-used data can be migrated to cheaper and slower storage media.
def copyRecord(self, recordId):
    emailDetail = EmailDetail.objects.get(id=recordId)
    copyEmailDetail = CopyEmailDetail()
    for field in emailDetail.__dict__.keys():
        copyEmailDetail.__dict__[field] = emailDetail.__dict__[field]
    copyEmailDetail.save()
    logger.info("Record Copied %d" % copyEmailDetail.id)
As per the above solutions, don't copy over.
If you really want to have two separate tables to query, store your chats in a single table (and for preference, use all the database techniques mentioned here), and then have a Current and an Archive table whose objects simply point to Chat objects.
Need to query a database for 12 million rows, process this data and then insert the filtered data into another database.
I can't just do a SELECT * from the database for obvious reasons - far too much data would be returned for my program to handle, and also this is a live database (customer order details) and I can't have the database crawl to a halt for 10 minutes while it runs my query.
I'm looking for inspiration on how to write this program. I have to process each row. I was thinking it might be best to get a count of the rows, then grab X at a time, wait for Y seconds, and repeat, until the dataset is complete. This way I'm not overloading the database, and since X will be sufficiently small, everything will run nicely in memory.
Other suggestions or feedback ?
I'd recommend you read the docs on SELECT ... INTO OUTFILE and LOAD DATA INFILE.
These are very fast ways of dumping data to a flat file and then importing it to another database.
You could dump into the flat file, and then run an offline script to process your rows, and then once that's done import the result to the new database.
See also:
http://dev.mysql.com/doc/refman/5.1/en/select.html (search for "INTO OUTFILE")
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
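A rough sketch of that flow, issued over JDBC; every table, column and file path here is made up, and note that the files are written and read on the MySQL server host and require the FILE privilege:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DumpAndReload {
    public static void main(String[] args) throws Exception {
        // Dump the source rows to a flat file on the source server.
        try (Connection src = DriverManager.getConnection(
                 "jdbc:mysql://source-host/orders_db", "user", "password");
             Statement st = src.createStatement()) {
            st.execute(
                "SELECT id, customer_id, total " +
                "  INTO OUTFILE '/tmp/orders_dump.csv' " +
                "  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' " +
                "  LINES TERMINATED BY '\\n' " +
                "  FROM orders");
        }

        // ...run the offline processing script over the dump, producing orders_processed.csv...

        // Bulk-load the processed file into the target database.
        try (Connection dst = DriverManager.getConnection(
                 "jdbc:mysql://target-host/reporting_db", "user", "password");
             Statement st = dst.createStatement()) {
            st.execute(
                "LOAD DATA INFILE '/tmp/orders_processed.csv' " +
                "  INTO TABLE filtered_orders " +
                "  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' " +
                "  LINES TERMINATED BY '\\n'");
        }
    }
}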
Spreading the load over time seems the only practicable solution. Exactly how to do it depends to some extent on your schema, how records change over time in the "live database", and what consistency semantics your processing must have.
In the worst case -- any record can be changed at any time, there is nothing in the schema that lets you easily and speedily check for "recently modified, inserted, or deleted records", and you nevertheless need to be consistent in what you process -- the task is simply unfeasible, unless you can count on some special support from your relational engine and/or OS (such as volume or filesystem "snapshots", like in Linux's LVM, that let you cheaply and speedily "freeze in time" a copy of the volumes on which the DB resides, for later leisurely fetching with another, read-only, database configured to read from the snapshot volume).
But presumably you do have some constraints, something in the schema that helps with the issue, or else, one can hope, you can afford some inconsistency generated by changes in the DB happening at the same time as your processing -- some lines processed twice, some not processed, some processed in older versions and others in newer versions... unfortunately, you have told us next to nothing about any of these issues, making it essentially unfeasible to offer much more help. If you edit your question to provide a LOT more information on platform, schema, and DB usage patterns, maybe more help can be offered.
A flat file or a snapshot would both be ideal.
If a flat file does not suit, or you do not have access to snapshots, then you could use a sequential id field, or create a sequential id in a temp table, and then iterate using that.
Something like
#max_id = 0
while exists (select * from table where seq_id > #max_id)
    select top n * from table where seq_id > #max_id order by seq_id
    -- ... process ...
    set #max_id = max seq_id from the last lot
end
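A hedged JDBC version of that loop (the driver URL, table and column names are placeholders; Statement.setMaxRows keeps the chunk size dialect-neutral):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class KeysetBatcher {
    public static void main(String[] args) throws Exception {
        final int batchSize = 1000;       // X rows at a time
        final long pauseMillis = 2000;    // Y seconds between batches, to go easy on the live DB

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:yourdb://host/db", "user", "password");          // placeholder URL
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT seq_id, payload FROM source_table " +          // hypothetical names
                 "WHERE seq_id > ? ORDER BY seq_id")) {

            ps.setMaxRows(batchSize);     // cap each chunk without dialect-specific TOP/LIMIT syntax

            long maxSeen = 0;
            while (true) {
                ps.setLong(1, maxSeen);
                int rows = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        maxSeen = rs.getLong("seq_id");
                        // ... process/filter the row and queue it for the target database ...
                    }
                }
                if (rows == 0) {
                    break;                // nothing left above maxSeen: done
                }
                Thread.sleep(pauseMillis);
            }
        }
    }
}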
If there is no sequential id then you can create a temp table that holds the order like
insert into some_temp_table
select unique_id from table order by your_ordering_scheme
then process like this
... do something with top n from table join some_temp_table on unique_id ...
delete top n from some_temp_table
This way some_temp_table holds the record identifiers that still need to be processed.
You don't mention which db you are using, but I doubt any db that can hold 12 million rows would actually try to return all the data to your program at once. Your program essentially streams the data in small blocks (say 1000 rows), something that is usually handled by the database driver.
RDBMSs have different transaction isolation levels, which can be used to reduce the effort the database spends maintaining consistency guarantees and to avoid locking up the table.
Databases can also create snapshots of tables to a file for later analysis.
In your position, I would try the simplest thing first, and see how that scales (on a development copy of the db with simulated user access.)
We have a Firebird database with 1,000,000 rows that must be processed after ALL of them are loaded into RAM. To get all of them we extract the data using (select * first 1000 ...), and it takes 8 hours. What is the solution for this?
Does each of your "select * first 1000" (as you described it) do a full table scan? Look at those queries, and make sure they are using an index.
How long does it take to construct the DTO object that you are creating with each data read?
{
    int a = read.GetInt32(0);
    int b = read.GetInt32(1);
    mylist.Add(new DTO(a, b));
}
You are creating a million of these objects. If it takes 29 milliseconds to create one DTO object, then that alone will take over 8 hours to complete (29 ms × 1,000,000 ≈ 29,000 seconds ≈ 8 hours).
to load data from a table with 1.000.000 rows in C# using a firebird db takes on a Pentium 4 3Ghz at least 8 hours
Everybody's been assuming you were running a SQL query to select the records from the database. Something like
select *
from your_big_table
/
Because that really would take a few seconds. Well, a little longer to display it on a screen, but executing the actual select should be lightning fast.
But that reference to C# makes me think you're doing something else. Perhaps what you really have is an RBAR loop instantiating one million objects. I can see how that might take a little longer. But even so, eight hours? Where does the time go?
edit
My guess was right and you are instantiating 1,000,000 objects in a loop. The correct advice would be to find some other way of doing whatever it is you do once you have got all your objects in memory. Without knowing more about the details it is hard to give specifics. But it seems unlikely this is a UI thing - what user is going to peruse a million objects?
So a general observation will have to suffice: use bulk operations to implement bulk activity. SQL databases excel at handling sets. Leverage the power of SQL to process your million rows in a single set, rather than as individual rows.
If you don't find this answer helpful then you need to give us more details regarding what you're trying to achieve.
What sort of processing do you need to do that would require to load them in memory and not just process them via SQL statements?
There are two techniques I use that work depending on what I am trying to do.
Assuming there is some sort of artificial key (identity), work in batches, incrementing the last identity value processed.
BCP the data out to a text file, churn through the updates, then BCP it back in, remembering to turn off constraints and indexes before the IN step.
Take a look at this:
http://www.firebirdfaq.org/faq13/
I am trying to select hundreds of rows from a DB that contains hundreds of thousands of rows, and then update those rows afterwards.
The problem is that I don't want to go to the DB twice for this, since the update only marks those rows as "read".
Is there any way I can do this in Java using plain JDBC (hopefully without using stored procedures)?
Update: OK, here is some clarification.
There are a few instances of the same application running on different servers; they all need to select hundreds of "UNREAD" rows sorted by the creation_date column, read the blob data within them, write it to a file, and FTP that file to some server. (I know, prehistoric, but requirements are requirements.)
The read-and-update part is there to ensure each instance gets a different set of data. (The rows must be taken in order, so tricks like odds and evens won't work. :/)
We SELECT the data FOR UPDATE, the data transfers over the wire (we wait and wait), and then we update the rows as "READ" and release the lock. This entire thing takes too long. By reading and updating at the same time, I would like to reduce the lock time (from the moment we SELECT FOR UPDATE to the actual UPDATE) so that running multiple instances would increase the rows read per second.
Still have ideas?
It seems to me there might be more than one way to interpret the question here.
1. You are selecting the rows for the sole purpose of updating them, not reading them.
2. You are selecting the rows to show to somebody, and marking them as read either one at a time or all as a group.
3. You want to select the rows and mark them as read at the time you select them.
Let's take Option 1 first, as that seems to be the easiest. You don't need to select the rows in order to update them, just issue an update with a WHERE clause:
update table_x
set read = 'T'
where date > sysdate-1;
Looking at option 2, you want to mark them as read when a user has read them (or a downstream system has received it, or whatever). For this, you'll probably have to do another update. If you query for the primary key, in addition to the other columns you'll need in the first select, you will probably have an easier time of updating, as the DB won't have to do table or index scans to find the rows.
In JDBC (Java) there is a facility to do a batch update, where you execute a set of updates all at once. That has worked out well for me when I need to perform a lot of updates of the exact same form.
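A minimal sketch of such a batch update, assuming the first select collected the primary keys into a list; the table and column names below are made up:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

class MarkAsRead {
    // Mark a set of already-fetched rows as read; "messages", id and read_flag are assumed names.
    static void markRead(Connection conn, List<Long> fetchedIds) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE messages SET read_flag = 'Y' WHERE id = ?")) {
            for (long id : fetchedIds) {      // primary keys collected during the first select
                ps.setLong(1, id);
                ps.addBatch();
            }
            ps.executeBatch();                // the driver sends the whole batch together
        }
    }
}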
Option 3, where you want to select and update all in one shot. I don't find much use for this, personally, but that doesn't mean others don't. I suppose some kind of stored procedure would reduce the round trips. I'm not sure what db you are working with here and can't really offer specifics.
Going to the DB isn't so bad. If you aren't returning anything "across the wire", then an update shouldn't do you too much damage, and it's only a few hundred thousand rows. What is your worry?
If you're doing a SELECT in JDBC and iterating over the ResultSet to UPDATE each row, you're doing it wrong. That's an (n+1) query problem that will never perform well.
Just do an UPDATE with a WHERE clause that determines which of those rows needs to be updated. It's a single network round trip that way.
Don't be too code-centric. Let the database do the job it was designed for.
Can't you just use the same connection without closing it?