I have a SQL query that returns over half a million rows to process. The processing doesn't take very long, but I would like to speed it up a bit by running it in parallel. Given the code below, is there an easy way to multithread something like this?
using (SqlDataReader reader = command.ExecuteReader())
{
    while (reader.Read())
    {
        // ...process row
    }
}
It would be perfect if I could simply get one cursor at the beginning and another in the middle of the result set. That way, I could have two threads processing the records. However, SqlDataReader doesn't allow me to do that...
Any idea how I could achieve that?
Set up a producer/consumer queue, with one producer that pulls from the reader and queues records as fast as it can but does no "processing". Then have some number of consumers (how many depends on your system) dequeue and process each queued record.
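A minimal sketch of that producer/consumer shape, written in Python only as compact notation (the question itself is C#). The names run_pipeline, process_row, num_consumers and max_queued are made up for illustration, and the generator stands in for the data reader:

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream for each consumer


def run_pipeline(rows, process_row, num_consumers=4, max_queued=1000):
    """Feed rows from a single reader to several worker threads."""
    work = queue.Queue(maxsize=max_queued)  # bounded, so the reader cannot outrun memory
    results = []
    results_lock = threading.Lock()

    def consumer():
        while True:
            row = work.get()
            if row is SENTINEL:
                break
            outcome = process_row(row)
            with results_lock:
                results.append(outcome)

    workers = [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for w in workers:
        w.start()

    # Producer: only reads and enqueues, no processing here.
    for row in rows:
        work.put(row)
    for _ in workers:
        work.put(SENTINEL)  # one sentinel per consumer
    for w in workers:
        w.join()
    return results  # completion order, not input order


if __name__ == "__main__":
    fake_reader = ((i, f"name-{i}") for i in range(10))  # stand-in for reader.Read()
    print(sorted(run_pipeline(fake_reader, lambda r: r[0] * 2)))
```

The bounded queue is the main design choice: if the consumers fall behind, the producer blocks instead of buffering the whole result set in memory. Error handling is left out, so a consumer that throws would need extra care in real code.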
You shouldn't read that many rows on the client.
That being said, you can partition your query into multiple queries and execute them in parallel. That means launching multiple SqlCommands in separate threads and having each one churn through a partition of the result. The A+ question is how to partition the result, and this depends largely on your data and your query:
You can use a range of keys (eg. ID between 1 and 10000, ID between 10001 and 20000 etc)
You can use an attribute (eg. RecordTypeID IN (1,2), RecordTypeID IN (3,4) etc)
You can use a synthetic range (ie. ROW_NUMBER() BETWEEN 1 and 1000 etc), but this is very problematic to pull off right
You can use a hash (eg. BINARY_CHECKSUM(*)%10 = 0, BINARY_CHECKSUM(*)%10 = 1 etc)
You just have to be very careful that the partition queries do not overlap and block each other during execution (ie. scan the same records and acquire X locks), thus serializing each other.
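For the key-range option, the boundaries can be computed up front so that the partitions are contiguous and cannot overlap. A rough sketch in Python (used here only as notation); split_key_range and range_queries are illustrative names, and the "?" placeholder style is an assumption that depends on the database driver:

```python
def split_key_range(lo, hi, parts):
    """Split the inclusive range [lo, hi] into contiguous, non-overlapping ranges."""
    if parts < 1 or hi < lo:
        raise ValueError("need parts >= 1 and hi >= lo")
    total = hi - lo + 1
    parts = min(parts, total)  # never produce empty ranges
    base, extra = divmod(total, parts)
    ranges = []
    start = lo
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def range_queries(table, key, lo, hi, parts):
    """Build one parameterised query per range; table and key must be trusted identifiers."""
    sql = f"SELECT * FROM {table} WHERE {key} BETWEEN ? AND ?"
    return [(sql, bounds) for bounds in split_key_range(lo, hi, parts)]


if __name__ == "__main__":
    for sql, bounds in range_queries("Orders", "ID", 1, 500000, 4):
        print(sql, bounds)
```

Each (sql, bounds) pair would then be executed on its own connection in its own thread. Equal-width ranges only balance the work if the keys are roughly evenly distributed.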
Is it a simple ranged query like WHERE Id between 1 and 500000? If so you can just kick off N queries that each return 1/N of the range. But it helps to know where you are bottlenecked with the single-threaded approach. If you are doing contiguous reads from one disk spindle to fulfill the query then you should probably stick with a single thread. If the data is partitioned across spindles by some range then you can intelligently tune your queries to maximize throughput from disk (i.e. read from each disk in parallel with separate queries). If you expect all of the rows to be in memory then you can parallelize at will. But if the query is more complex then you may not be able to partition it easily without incurring a bunch of overhead. Most of the time the above options will not apply well and the producer/consumer that Joel mentioned will be the only place to parallelize. Depending on how much time you spend processing each row, this may provide only trivial gains.
Related
I have an Oracle table containing 900 million records. The table is partitioned into 24 partitions and has indexes.
I tried using a hint and set fetch_buffer to 100000:
select /*+ parallel(8) */
* from table
It takes 30 minutes to get 100 million records.
My question is:
Is there a faster way to get all 900 million records (all the data in the table)? Should I use the partitions and run 24 sequential queries? Or should I use the indexes and split my query into, for example, 10 queries?
The network is almost certainly the bottleneck here. Oracle parallelism only impacts the way the database retrieves the data, but data is still sent to the client with a single thread.
Assuming a single thread doesn't already saturate your network, you'll probably want to build a concurrent retrieval solution. It helps that the table is already partitioned, so you can read large chunks of data without re-reading anything.
I'm not sure how to do this in Scala, but you want to run multiple queries like this at the same time, to use all the client and network resources possible:
select * from table partition (p1);
select * from table partition (p2);
...
Not really an answer but too long for a comment.
Too many variables can affect this for me to give informed advice, so the following are just some general hints.
Is this over a network or local on the server? If the database is on a remote server then you are paying a heavy network price. I would suggest (if possible) running the extract on the server using the BEQUEATH protocol to avoid using the network. Once the file(s) complete, it will be quicker to compress and transfer them to the destination than to transfer the data directly from the database to a local file via JDBC row processing.
With JDBC remember to set the cursor fetch size to reduce round tripping - setFetchSize. The default value is tiny (10 I think), try something like 1000 to see how that helps.
As for the query, you are writing to a file so even though Oracle might process the query in parallel, your write to file process probably doesn't so it's a bottleneck.
My approach would be to write the Java program to operate off a range of values passed as command line parameters, and experiment to find which range size and number of concurrent instances of the Java program give optimal performance. The range will likely fall within discrete partitions so you will benefit from partition pruning (assuming the range value is on an indexed column, ideally the partition key).
Roughly speaking I would start with a range of 5m, and run a number of concurrent instances equal to the number of CPU cores - 2; this is not a scientifically derived number, just one that I tend to use as my first stab to see what happens.
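To make the starting point concrete, here is a small sketch of the range and concurrency arithmetic. It is in Python rather than Java purely for brevity, and it uses threads in one process where the answer describes separate program instances. The function extract_range is a hypothetical stand-in for "run the extract for keys low..high":

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_workers():
    """Starting point from the answer: CPU cores minus 2, but at least 1."""
    return max(1, (os.cpu_count() or 1) - 2)


def chunk_bounds(first_key, last_key, chunk_size=5_000_000):
    """Yield inclusive (low, high) key ranges of at most chunk_size keys."""
    low = first_key
    while low <= last_key:
        high = min(low + chunk_size - 1, last_key)
        yield low, high
        low = high + 1


def extract_all(first_key, last_key, extract_range, chunk_size=5_000_000, workers=None):
    """Run extract_range(low, high) for every chunk, a few chunks at a time."""
    workers = workers or default_workers()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(extract_range, lo, hi)
                   for lo, hi in chunk_bounds(first_key, last_key, chunk_size)]
        return [f.result() for f in futures]  # results in chunk order


if __name__ == "__main__":
    # Pretend extract that just reports how many keys it was asked for.
    print(extract_all(1, 12, lambda lo, hi: hi - lo + 1, chunk_size=5, workers=2))
```

Both chunk_size and workers are exposed as parameters because, as the answer says, the right values have to be found by experiment.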
I need to spool over 20 million records to a flat file. A direct select query would be time-consuming. I feel the need to generate the output in parallel based on portions of the data, i.e. having 10 select queries, each over 10% of the data, running in parallel. Then sort and merge on UNIX.
I can utilize rownum to do this, however this would be tedious, static and needs to be updated every time my rownum changes.
Is there a better alternative available?
If the data in SQL is well spread out over multiple spindles and not all on one disk, and the IO and network channels are not saturated currently, splitting into separate streams may reduce your elapsed time. It may also introduce random access on one or more source hard drives which will cripple your throughput. Reading in anything other than cluster sequence will induce disk contention.
The optimal scenario here would be for your source table to be partitioned, that each partition is on separate storage (or very well striped), and each reader process is aligned with a partition boundary.
I want to create an external application which will query one table from a large Oracle database.
The query will run daily and I am expecting to handle 30,000+ rows.
To break down the size of these rows, I would like to create a new thread/ process for each 10,000 rows that exist. So going by the above figure it would be 3 threads to process all those rows.
I don't want the threads' row sets to overlap, so I know I will need to add a column to the table to act as a range marker, a row_position.
Logic
Get row_count of data set in query parameters
Get first_row_pos
While (row_count > 10,000)
{
    Create thread with 10,000 rows starting from first_row_pos
    row_count = row_count - 10,000
    first_row_pos = first_row_pos + 10,000
}
create thread for remaining rows
all threads run their queries concurrently.
This is basic logic at the moment, however I do not know how feasible this is.
Is this a good way or is there a better way?
Can this be done through one database connection shared by all threads, or is it better to have a separate db connection for each thread?
Any other advice is welcome.
Note: I just realised a do while loop would be better if there is less than 10,000 rows in this case.
Thanks
Oracle provides a parallel hint for situations such as this, where you have a full table scan or similar problem and want to make use of multiple cores to divide the workload. Further details here.
The syntax is very simple, you specify the table (or alias) and the number of cores (I usually leave as default) e.g.:
select /*+ parallel(a, default) */ *
from table_a a
You can also use this with multiple tables e.g.
select /*+ parallel(a, default) parallel(b,default) */ *
from table_a a, table_b b
where a.some_id = b.some_id
A database connection is not thread-safe, so if you are going to query the database from several threads, you would have to have a separate connection for each of them. You can either create a connection or get them from a pool.
Before you implement your approach, take some time to analyze where the time is spent. Oracle overall is pretty good at utilizing multiple cores, and the database interaction is usually the most time-consuming part. By splitting the query in three you might actually slow things down.
If indeed your application is spending most of the time performing calculations on that data, your best approach might be loading all data in a single thread and then splitting processing into multiple threads.
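A sketch of that "read in one thread, process in several" split, in Python as notation only. The helper names batches and process_in_batches are made up, and process_batch stands for whatever calculation the application performs on a batch of rows:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(rows, size):
    """Yield successive lists of up to `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_in_batches(rows, process_batch, batch_size=10_000, workers=3):
    """Read rows on the calling thread, process each batch on a worker thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves batch order in its results
        return list(pool.map(process_batch, batches(rows, batch_size)))


if __name__ == "__main__":
    print(process_in_batches(range(25), sum, batch_size=10, workers=3))
```

Only the single reading thread touches the database connection, which sidesteps the thread-safety issue entirely. In Python specifically, CPU-heavy work would probably need a process pool instead of threads to see a speed-up.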
I am running a pig script that performs a GROUP BY and a nested FOREACH that takes hours to run due to one or two reduce tasks. For example:
B = GROUP A BY (fld1, fld2) parallel 50;
C = FOREACH B {
U = A.fld1;
DIST = DISTINCT U;
GENERATE FLATTEN(group), COUNT_STAR(DIST);
}
Upon examining the counters for the slow tasks, I realized that it looks like the two reducers are processing through a lot more data than the other tasks. Basically, my understanding is that the data is very skewed and so the tasks that are "slow" are in fact doing more work than the fast tasks. I'm just wondering how to improve performance? I hate increasing the parallelism to try to split up the work but is that the only way?
The first option is to use a custom partitioner. Check out the documentation on GROUP for more info (check out PARTITION BY, specifically). Unfortunately, you probably have to write your own custom partitioner here. In your custom partitioner, send the first huge set of keys to reducer 0, send the next set to reducer 1, then do the standard hash partitioning across what's left. What this does is lets one reducer handle the big ones exclusively, while the others get multiple sets of keys. This doesn't always solve the problem with bad skew, though.
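The routing logic described above can be sketched independently of Pig. The following Python is only an illustration of the idea; an actual partitioner for PARTITION BY would be a Java class extending Hadoop's Partitioner. The list hot_keys is an assumption: it is whatever set of oversized keys you have identified from the counters.

```python
import zlib


def make_partitioner(hot_keys, num_reducers):
    """Return a function mapping a key to a reducer index.

    hot_keys: the known-heavy keys, each of which gets a reducer to itself.
    """
    hot = {key: idx for idx, key in enumerate(hot_keys)}
    rest = num_reducers - len(hot)
    if rest < 1:
        raise ValueError("need at least one reducer for the remaining keys")

    def partition(key):
        if key in hot:
            return hot[key]  # dedicated reducer for a heavy key
        digest = zlib.crc32(repr(key).encode("utf-8"))  # stable across runs
        return len(hot) + digest % rest  # hash over the remaining reducers

    return partition


if __name__ == "__main__":
    part = make_partitioner([("", ""), ("US", "web")], num_reducers=50)
    print(part(("", "")), part(("US", "web")), part(("DE", "mobile")))
```

As the answer notes, this only isolates the heavy keys; a single key that is too large for one reducer remains slow.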
How valuable is the count for those two huge sets of data? I see huge skew a lot with things like NULL or empty string. If they aren't that valuable, filter them out before the GROUP BY.
I have to select all rows from a table with millions of rows (to preload a Coherence datagrid.) How do I split up this query into multiple queries that can be concurrently executed by multiple threads?
I first thought of getting a count of all records and doing:
SELECT ...
WHERE ROWNUM BETWEEN (packetNo * packetSize) AND ((packetNo + 1) * packetSize)
but that didn't work. Now I'm stuck.
Any help will be very appreciated.
If you have the Enterprise Edition license, the easiest way of achieving this objective is parallel query.
For one-off or ad hoc queries use the PARALLEL hint:
select /*+ parallel(your_table, 4) */ *
from your_table
/
The number in the hint is the number of slave queries you want to execute; in this case the database will run four threads.
If you want every query issued on the table to be parallelizable then permanently alter the table definition:
alter table your_table parallel (degree 4)
/
Note that the database won't always use parallel query; the optimizer will decide whether it's appropriate. Parallel query only works with full table scans or index range scans which cross multiple partitions.
There are a number of caveats. Parallel query requires us to have sufficient cores to satisfy the proposed number of threads; if we only have a single dual-core CPU setting a parallel degree of 16 isn't going to magically speed up the query. Also, we need spare CPU capacity; if the server is already CPU bound then parallel execution is only going to make things worse. Finally, the I/O and storage subsystems need to be capable of satisfying the concurrent demand; SANs can be remarkably unhelpful here.
As always in matters of performance, it is crucial to undertake some benchmarking against realistic volumes of data in a representative environment before going into production.
What if you don't have Enterprise Edition? Well, it is possible to mimic parallel execution by hand. Tom Kyte calls it "Do-It-Yourself Parallelism". I have used this technique myself, and it works well.
The key thing is to work out the total range of ROWIDs which apply to the table, and split them across multiple jobs. Unlike some of the other solutions proposed in this thread, each job only selects the rows it needs. Mr Kyte summarized the technique in an old AskTom thread, including the vital split script: find it here.
Splitting the table and starting off threads is a manual task: fine as a one-off but rather tiresome to undertake frequently. So if you are running 11g release 2 you ought to know that there is a new PL/SQL package DBMS_PARALLEL_EXECUTE which automates this for us.
Are you sure a parallel execution of the query will be faster? This will only be the case if the huge table is stored on a disk array with many disks or if it is partitioned over several disks. In all other cases, sequential access to the table will be many times faster.
If you really have to split the query, you have to split it in a way so that sequential access for each part is still possible. Please post the DDL of the table so we can give a specific answer.
If the processing of the data or the loading into the data grid is the bottleneck, then you are better off reading the data with a single process and then splitting the data before further processing it.
Assuming that reading is fast and further data processing is the bottleneck, you could for example read the data and write it into very simple text files (such as fixed length or CSV). After every 10,000 rows you start a new file and spawn a thread or process to process the just finished file.
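A rough sketch of that file-rolling idea, in Python for brevity. The name spill_and_process, the chunk_NNNNN.csv file naming and the process_file callback are all assumptions made for the example:

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def spill_and_process(rows, process_file, out_dir, rows_per_file=10_000, workers=4):
    """Write rows to CSV files of rows_per_file rows; process each file once it is closed."""
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        writer, handle, path, count, index = None, None, None, 0, 0
        for row in rows:
            if handle is None:  # start a new chunk file
                path = os.path.join(out_dir, f"chunk_{index:05d}.csv")
                handle = open(path, "w", newline="")
                writer = csv.writer(handle)
            writer.writerow(row)
            count += 1
            if count == rows_per_file:  # file is full: close it and hand it off
                handle.close()
                futures.append(pool.submit(process_file, path))
                handle, count, index = None, 0, index + 1
        if handle is not None:  # final, partially filled file
            handle.close()
            futures.append(pool.submit(process_file, path))
        return [f.result() for f in futures]


def count_lines(path):
    """Example worker: count the rows in one finished file."""
    with open(path) as f:
        return sum(1 for _ in f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(spill_and_process(((i, i * 2) for i in range(25)), count_lines, tmp, rows_per_file=10))
```

A file is only handed to a worker after it has been closed, so workers never see a half-written chunk.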
Try something like this:
select * from
( select a.*, ROWNUM rnum from
( <your_query_goes_here, with order by> ) a
where ROWNUM <= :MAX_ROW_TO_FETCH )
where rnum >= :MIN_ROW_TO_FETCH;
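Have you considered using MOD 10 on ROWNUM to pull the data one tenth at a time? ROWNUM has to be materialized in an inline view first, because a predicate such as MOD(ROWNUM,10) = 0 applied directly never matches any row: ROWNUM is only incremented for rows that pass the filter, so every candidate row is numbered 1.
SELECT *
FROM (SELECT A.*, ROWNUM rnum FROM Table A)
WHERE MOD(rnum,10) = 0;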