How to split up a massive data query into multiple queries - sql

I have to select all rows from a table with millions of rows (to preload a Coherence datagrid.) How do I split up this query into multiple queries that can be concurrently executed by multiple threads?
I first thought of getting a count of all records and doing:
SELECT ...
WHERE ROWNUM BETWEEN (packetNo * packetSize) AND ((packetNo + 1) * packetSize)
but that didn't work. Now I'm stuck.
Any help will be very appreciated.

If you have the Enterprise Edition license, the easiest way of achieving this objective is parallel query.
For one-off or ad hoc queries use the PARALLEL hint:
select /*+ parallel(your_table, 4) */ *
from your_table
/
The number in the hint is the number of slave queries you want to execute; in this case the database will run four threads.
If you want every query issued on the table to be parallelizable then permanently alter the table definition:
alter table your_table parallel (degree 4)
/
Note that the database won't always use parallel query; the optimizer will decide whether it's appropriate. Parallel query only works with full table scans or index range scans which cross multiple partitions.
There are a number of caveats. Parallel query requires us to have sufficient cores to satisfy the proposed number of threads; if we only have a single dual-core CPU setting a parallel degree of 16 isn't going to magically speed up the query. Also, we need spare CPU capacity; if the server is already CPU bound then parallel execution is only going to make things worse. Finally, the I/O and storage subsystems need to be capable of satisfying the concurrent demand; SANs can be remarkably unhelpful here.
As always in matters of performance, it is crucial to undertake some benchmarking against realistic volumes of data in a representative environment before going into production.
What if you don't have Enterprise Edition? Well, it is possible to mimic parallel execution by hand. Tom Kyte calls it "Do-It-Yourself Parallelism". I have used this technique myself, and it works well.
The key thing is to work out the total range of ROWIDs which apply to the table, and split them across multiple jobs. Unlike some of the other solutions proposed in this thread, each job selects only the rows it needs. Mr Kyte summarized the technique in an old AskTom thread, including the vital split script: find it here.
Splitting the table and starting off threads is a manual task: fine as a one-off but rather tiresome to undertake frequently. So if you are running 11g release 2 you ought to know that there is a new PL/SQL package DBMS_PARALLEL_EXECUTE which automates this for us.
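A rough sketch of the 11gR2 route (the task name, table name and chunk size below are placeholders, not anything from the question): first chunk the table, then have each loader thread claim a chunk and select just that ROWID range.
begin
  dbms_parallel_execute.create_task(task_name => 'preload_grid');
  dbms_parallel_execute.create_chunks_by_rowid(
    task_name   => 'preload_grid',
    table_owner => user,
    table_name  => 'YOUR_TABLE',
    by_row      => true,
    chunk_size  => 100000);
end;
/
-- each worker thread picks up one chunk's ROWID range ...
select chunk_id, start_rowid, end_rowid
from   user_parallel_execute_chunks
where  task_name = 'preload_grid';
-- ... and selects only the rows in that range
select * from your_table where rowid between :start_rowid and :end_rowid;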

Are you sure a parallel execution of the query will be faster? This will only be the case if the huge table is stored on a disk array with many disks, or if it is partitioned over several disks. In all other cases, a sequential scan of the table will be many times faster.
If you really have to split the query, split it in a way that still allows sequential access for each part. Please post the DDL of the table so we can give a specific answer.
If the processing of the data or the loading into the data grid is the bottleneck, then you are better off reading the data with a single process and splitting the data before processing it further.
Assuming that reading is fast and further data processing is the bottleneck, you could for example read the data and write it into very simple text files (such as fixed-length or CSV). After every 10,000 rows you start a new file and spawn a thread or process to work on the file just finished.

try with something like this:
select *
from ( select a.*, ROWNUM rnum
       from ( <your_query_goes_here, with order by> ) a
       where ROWNUM <= :MAX_ROW_TO_FETCH )
where rnum >= :MIN_ROW_TO_FETCH;
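To apply this to the chunking in the question, each thread would bind its own non-overlapping window (1-100000, 100001-200000, and so on), keeping the same deterministic ORDER BY in every query so the windows neither overlap nor skip rows.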

Have you considered using MOD 10 on ROWNUM to pull the data one tenth at a time?
SELECT A.*
FROM  (SELECT t.*, ROWNUM rnum FROM Table t) A
WHERE MOD(A.rnum, 10) = 0;

Related

The fastest way to extract all records from Oracle

I have an Oracle table containing 900 million records. The table is partitioned into 24 partitions and has indexes.
I tried using a hint and set the fetch buffer to 100000:
select /*+ parallel(8) */ *
from table
It takes 30 minutes to get 100 million records.
My question is:
is there any faster way to get all 900 million records (all the data in the table)? Should I use the partitions and run 24 sequential queries? Or should I use the indexes and split my query into 10 queries, for example?
The network is almost certainly the bottleneck here. Oracle parallelism only impacts the way the database retrieves the data, but data is still sent to the client with a single thread.
Assuming a single thread doesn't already saturate your network, you'll probably want to build a concurrent retrieval solution. It helps that the table is already partitioned, since then you can read large chunks of data without re-reading anything.
I'm not sure how to do this in Scala, but you want to run multiple queries like this at the same time, to use all the client and network resources possible:
select * from table partition (p1);
select * from table partition (p2);
...
Not really an answer but too long for a comment.
A few too many variables can impact this to give informed advice, so the following are just some general hints.
Is this over a network or local on the server? If the database is on a remote server then you are paying a heavy network price. I would suggest (if possible) running the extract on the server using the BEQUEATH protocol to avoid using the network. Once the file(s) are complete, it will be quicker to compress them and transfer them to the destination than to transfer the data directly from the database to a local file via JDBC row processing.
With JDBC, remember to set the cursor fetch size to reduce round-tripping - setFetchSize. The default value is tiny (10, I think); try something like 1000 to see how that helps.
As for the query, you are writing to a file so even though Oracle might process the query in parallel, your write to file process probably doesn't so it's a bottleneck.
My approach would be to write the Java program to operate off a range of values passed as command-line parameters, and experiment to find which range size and how many concurrent instances of the program give optimal performance. The range will likely fall within discrete partitions, so you will benefit from partition pruning (assuming the range column is indexed, ideally the partition key).
Roughly speaking I would start with a range of 5m rows, and run a number of concurrent instances matching the number of CPU cores minus 2; this is not a scientifically derived number, just one I tend to use as a first stab to see what happens.
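For illustration, each instance might then be handed a statement like the following, where the table name and the LOAD_DATE partition key are purely hypothetical placeholders:
-- each concurrent instance binds a different, non-overlapping range,
-- so partition pruning limits it to one or a few partitions
select *
from   big_table
where  load_date >= :range_start
and    load_date <  :range_end;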

Executing same simple select statement or stored procedure on SQL Azure takes long time or times-out

I have two SQL Azure instances on the Standard S2 tier (50 DTUs). When I run simple select statements on the two instances, one of them takes much more time than the other, or times out. The slower instance has more records in its tables.
Both instances have the same table schema. In the slower instance, the LogEvidence table has 1,324,928 records and the LogItem table has 649,391. In the faster instance, the LogEvidence table has 89,504 records and the LogItem table has 89,496.
Below is the simple select statement
select count(*) from logitem
The simple select statement above takes 0 s on the faster instance, while on the slower instance it takes 138 s. And if I execute any stored procedure, the slower instance takes more time or times out.
The time taken by the two instances should be almost the same.
Those simple queries perform big scans on the table and involve reading all rows. If the table has a clustered index you don't have to perform a SELECT COUNT(*) to know the number of records the table has. The following query should do that faster:
SELECT OBJECT_NAME(ps.object_id), i.name, ps.row_count
FROM   sys.dm_db_partition_stats AS ps
       INNER JOIN sys.indexes AS i
       ON  ps.index_id = i.index_id
       AND ps.object_id = i.object_id
WHERE  i.name LIKE '%logitem%'
If the table does not have an Id, add an autoid column to the table and make it the clustered index.
You can also try adding a "useless" WHERE clause like the one below to the query, and you may get better performance.
SELECT count(*)
FROM logitem
WHERE id > 0
Where Id is the autoid column.
I have some experience with Azure, and from your description I think one of the following things will help:
Since you are only doing a count, indexes play no role here. I understand the other answer suggests where id > 0, but Azure should be able to count 1M rows without hitting a 30-second timeout. For other queries, though, you do need indexes, or Azure will struggle.
Check whether your server is under maintenance. The chance is low, but it does happen to us: we are on S4 and occasionally our server just gets slow, and after 10-30 minutes it works fine again. Perhaps the underlying hardware gets into some process that slows it down.
This is the most important reason for slow execution, especially if a lot of writes and deletes happen on your server: check the database size. Azure databases get fragmented very quickly; we have to optimize ours for fragmentation every 10 days. If your bacpac is 100 MB but the database size in Azure shows something like 5-6 GB, it definitely needs optimization, because a lot of fragments have been generated. MSDN provides queries to rebuild indexes and remove fragmentation; I don't remember the URL, but a simple Google search will bring them up (a minimal sketch of the idea follows below). That should speed things up.
Azure has a feature that auto-generates indexes; check whether both tables share the same indexes. Maybe your faster instance has an index that Azure created by itself.
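Not the exact MSDN scripts referred to above, but a minimal sketch of the fragmentation check and rebuild (the table name is taken from the question; thresholds and the REBUILD-vs-REORGANIZE choice are up to you):
-- check fragmentation of the LogItem indexes
SELECT i.name, ips.avg_fragmentation_in_percent, ips.page_count
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.LogItem'), NULL, NULL, 'LIMITED') AS ips
       JOIN sys.indexes AS i
         ON i.object_id = ips.object_id AND i.index_id = ips.index_id;
-- rebuild heavily fragmented indexes (REORGANIZE is a lighter-weight option)
ALTER INDEX ALL ON dbo.LogItem REBUILD;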
You should step back and reconsider your assumption that "performance should be about the same": you have more data in one case than in the other, so you should expect the larger instance to be at least somewhat slower than the smaller one.
Now, let's go into the "why" it can be slower and how you can investigate each case:
Step 1: Look at the query plans for each case and see what you have. Likely, you will have something like:
StreamAgg <- Clustered Index Scan
(if you have other b-tree indexes, you might scan one of them and it might be faster since the index would not be as wide and thus the index will have fewer pages to scan)
Step 2: You can look at the actual execution times and resource use for each query to see why they are different. One way to do this is to run "set statistics time on", then "set statistics io on", then run your query. It will dump out extra information into SSMS when you run the query from there. (You can read about this here: https://learn.microsoft.com/en-us/sql/t-sql/statements/set-statistics-io-transact-sql?view=sql-server-2017)
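In SSMS that sequence looks roughly like this (logitem taken from the question):
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
SELECT COUNT(*) FROM logitem;
-- the Messages tab now shows logical/physical reads plus CPU and elapsed time for the scan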
If you review the output from each one, you may find reasons for why the performance is different. One possible explanation is that the amount of memory is limited in an S2 and you are just at the boundary for where all the pages fit in memory vs. not for the two examples. In that case, doing a count(*) query would need to cycle through all the pages and do much more IO than in the smaller case where they might all be in memory already.
Step 3: You can also potentially examine the query store to get insight into why one case is fast and one case is not. An overview of how to use it is here:
https://learn.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-2017
Note: it is on-by-default in SQL Azure so you can just go look at the time window when you ran the queries to get insight into what was happening at that time in your database.
Finally, you might consider ways to make the query go faster if you need it to be faster.
* creating a narrow b-tree index on the table may help for that one query (count(*) doesn't return any columns and just needs a count of rows from some non-filtered index).
* you could use a Columnstore (which requires an S3 or above for memory reasons). This kind of column-oriented index is optimized for this kind of query and would be much faster as the size of the table increases in the future.
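A rough sketch of both options, using the LogItem table and the Id column mentioned earlier (adjust names and the column list to your schema):
-- narrow b-tree index: count(*) can be satisfied from the smallest non-filtered index
CREATE NONCLUSTERED INDEX ix_logitem_id ON dbo.LogItem (Id);
-- columnstore alternative (S3 or above), better suited as the table keeps growing
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_logitem ON dbo.LogItem (Id);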
Hope that helps.

Breaking down a large number of rows into smaller queries? Parallelism

I want to create an external application which will query one table from a large Oracle database.
The query will run daily and I am expecting to handle 30,000+ rows.
To break the work down, I would like to create a new thread/process for every 10,000 rows. So, going by the figure above, that would be 3 threads to process all those rows.
I don't want the threads' row sets to overlap, so I know I will need to add a column to the table to act as a range marker, a row_position.
Logic
Get row_count of data set in query parameters
Get first_row_pos
While (row_count > 10,000)
{
Create thread with 10,000 rows starting from first_row_pos
row_count = row_count - 10,000
first_row_pos = first_row_pos + 10,000
}
create thread for remaining rows
all threads run their queries concurrently.
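For illustration, each thread would then run something like the following, where the table name is just a placeholder and row_position is the proposed range-marker column:
select *
from   my_table
where  row_position >= :first_row_pos
and    row_position <  :first_row_pos + 10000;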
This is basic logic at the moment, however I do not know how feasible this is.
Is this a good way or is there a better way?
Can this be done through one database connection shared by the threads, or is it better to have a separate DB connection for each thread?
Any other advice welcome?
Note: I just realised a do-while loop would be better if there are fewer than 10,000 rows in this case.
Thanks
Oracle provides a parallel hint for situations such as this, where you have a full table scan or similar problem and want to make use of multiple cores to divide the workload. Further details here.
The syntax is very simple: you specify the table (or its alias) and the number of cores (I usually leave it as default), e.g.:
select /*+ parallel(a, default) */ *
from table_a a
You can also use this with multiple tables e.g.
select /*+ parallel(a, default) parallel(b,default) */ *
from table_a a, table_b b
where a.some_id = b.some_id
A database connection is not thread-safe, so if you are going to query the database from several threads, you would have to have a separate connection for each of them. You can either create a connection or get them from a pool.
Before you implement your approach, take some time to analyze where the time is spent. Oracle is overall pretty good at utilizing multiple cores, and the database interaction is usually the most time-consuming part. By splitting the query in three you might actually slow things down.
If indeed your application is spending most of the time performing calculations on that data, your best approach might be loading all data in a single thread and then splitting processing into multiple threads.

libpq very slow for large (20 million record) database

I am new to SQL/RDBMS.
I have an application which adds rows with 10 columns to a PostgreSQL server using the libpq library. Right now, my server is running on the same machine as my Visual C++ application.
I have added around 15-20 million records. A simple query to get the total count takes 4-5 minutes: select count(*) from <tableName>;.
I have indexed my table on the time at which I insert the data (timecode). Most of the time I need counts with different WHERE / AND clauses added.
Is there any way to make things faster? I need to make it as fast as possible, because once the server moves onto the network things will become much slower.
Thanks
I don't think network latency will be a large factor in how long your query takes. All the processing is being done on the PostgreSQL server.
The PostgreSQL MVCC design means each row in the table - not just the index(es) - must be walked to calculate the count(*) which is an expensive operation. In your case there are a lot of rows involved.
There is a good wiki page on this topic here http://wiki.postgresql.org/wiki/Slow_Counting with suggestions.
Two suggestions from this link, one is to use an index column:
select count(index-col) from ...;
... though this only works under some circumstances.
If you have more than one index see which one has the least cost by using:
EXPLAIN ANALYZE select count(index-col) from ...;
If you can live with an approximate value, another is to use a Postgres specific function for an approximate value like:
select reltuples from pg_class where relname='mytable';
How good this approximation is depends on how often autovacuum is set to run and many other factors; see the comments.
Consider pg_relation_size('tablename') and divide it by the seconds spent in
select count(*) from tablename
That will give the throughput of your disk(s) when doing a full scan of this table. If it's too low, you want to focus on improving that in the first place.
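For example (mytable is a placeholder):
select pg_relation_size('mytable');                   -- size of the table's main fork in bytes
select pg_size_pretty(pg_relation_size('mytable'));   -- the same, human-readable
If the table were, say, 9 GB and the count took 300 seconds, that would be roughly 30 MB/s of scan throughput, which points at the disks rather than at libpq.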
Having a good I/O subsystem and well performing operating system disk cache is crucial for databases.
The default postgres configuration is meant to not consume too much resources to play nice with other applications. Depending on your hardware and the overall utilization of the machine, you may want to adjust several performance parameters way up, like shared_buffers, effective_cache_size or work_mem. See the docs for your specific version and the wiki's performance optimization page.
Also note that the speed of select count(*)-style queries has nothing to do with libpq or the network, since only one resulting row is retrieved. It happens entirely server-side.
You don't state what your data is, but normally the way to handle tables with a very large amount of data is to partition the table. http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
This will not speed up your select count(*) from <tableName>; query, and might even slow it down, but if you are normally only interested in a portion of the data in the table this can be helpful.

Would a table lock speed up an update statement in Oracle 10g enterprise?

We have a fairly wide table BaseData with some 33 million rows in it. Then we have an update query that joins it to several other tables containing all kinds of parameters; some functions are applied, there is a GROUP BY on the original Id, and then the results are written back to a few columns of the BaseData table.
This process is very slow, so I'm looking into ways of speeding it up. Most of my experience is with SQL Server, so I don't yet know this kind of Oracle internals.
One thing I suspect is that during the update Oracle creates versions of every row so any other readers can read the unaffected row. This, however, takes up considerable resources. Is there any way to have the update take a write lock on the table so it wouldn't create versions of every row?
Any other tips you guys have for large updates? We already broke it down into batches: each batch works on a separate partition of the table, and several updates are run in parallel. But it's still all much too slow.
The short answer is that no, in Oracle, taking an exclusive lock on a table won't prevent other sessions from reading it, or having to incur the work of generating a read-consistent view of the data. Similarly, in Oracle, you can't tell a session to enable "dirty reads."
Well, the first question is what's slow - is it all the work of joining and applying functions, or is it the writing back? How does a SELECT my_updated_resultset FROM BASEDATA JOIN... perform compared to your update statement? Have you verified that there's contention between the readers of BaseData and the update process? Also, is it too slow for the business, or just slower than you think it should be?
Another option to consider is to use partition exchange to perform your updates. The high level concept would be:
CREATE TABLE BASEDATA_XCHG as SELECT * FROM BASEDATA WHERE 1 = 0;
INSERT /*+ append */ INTO BASEDATA_XCHG SELECT my_updated_resultset FROM BASEDATA PARTITION (ONLY_ONE_PARTITION) JOIN...
Create all the required indexes and constraints on the BASEDATA_XCHG table.
ALTER TABLE BASEDATA EXCHANGE PARTITION ONLY_ONE_PARTITION WITH TABLE BASEDATA_XCHG
If you're updating most of the rows in a partition of BASEDATA table, don't update them - create a new table and exchange it out. Tim Gorman has an excellent paper called "Scaling to Infinity" that covers this concept in greater depth; you may wish to check it out.
In addition to Adam's answer:
Run an EXPLAIN PLAN on your update statement and check the execution plan.
Chances are that adding indexes to support your joins and WHERE conditions can speed up the query.
Oracle uses undo segments for read consistency (along with SCNs, read more here)
I'm assuming these large batch processes are running on a staging area and not a "prod" instance that is being used by a lot of various processes. If you are updating 25% or more (rough figures) of some big table, it may be better to do a CTAS (create table as select...) than attempting updates. Your CTAS would contain the update logic for the new table. Once done, add indexes/grants/etc on new table and rename new to old. You can also add a parallel hint and nologging on the CTAS to potentially speed things up even more.
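As a very rough sketch of that CTAS approach (all table, column and function names here are placeholders, not taken from the question):
-- the update logic moves into the SELECT of the CTAS
create table basedata_new parallel 8 nologging as
select b.id,
       b.untouched_col,
       some_function(p.param_val) as recalculated_col
from   basedata b
       join param_table p on p.id = b.id;
-- recreate indexes, constraints and grants on basedata_new,
-- then swap the tables by renaming them.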