Why Does Running A Simple Query In Parallel Make It Faster?

If we look at a simple query like this one:
SELECT * FROM CUSTOMER2;
We can tell by looking at it that it does just one thing: it retrieves everything from CUSTOMER2.
Now my question is, why is it that when we run it like this:
SELECT /*+ PARALLEL(CUSTOMER2, 8) */ * FROM CUSTOMER2;
The cost (according to the execution plan) goes from 581 to 81? Since it's only one task, isn't it just performed on the same thread anyway?
I can understand if there were two full table scans needing to be done as you can run those two in parallel threads so they execute at the same time. But in our case, there is only one full table scan.
So how does running it in parallel make it faster when there is nothing to run it "in parallel" with?
Lastly, when I altered my personal cluster and the one table so that anything performed on them runs in parallel, I did not see any change in cost like I did with the small statement.
This is my personal one:
SELECT AVG(s.sellprice), s.qty, s.custid
FROM CUSTOMER_saracl c, sale_saracl s
WHERE c.custid = s.custid
GROUP BY (s.qty, s.custid)
HAVING AVG(s.sellprice) >
(SELECT MIN(AVG(price))
FROM product_saracl
WHERE pname
LIKE 'FA%'
GROUP BY price);
Why would that be?
Thank you for any help, I just today learnt about parallel execution so go easy on me haha!

One very important point about relational databases is that tables represent unordered sets. That means that the pages that are scanned for a table can be scanned in any order.
Oracle actually takes advantage of this for parallel scans of a single table. There is additional overhead to bring the results back together, which is why the estimated cost is 81 and not 73 (581 / 8).
I think this documentation has good examples that explain this. Some are quite close to your query.
Note that parallelism does not just apply to reading tables. In fact, it is more commonly associated with other operations, such as joins, aggregation, and sorting.
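If you want to see this for yourself, here is a minimal sketch using the table from your question (costs and plan steps will differ on your system, and DBMS_XPLAN requires a reasonably recent Oracle version):
EXPLAIN PLAN FOR
SELECT * FROM CUSTOMER2;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);   -- serial plan: one full scan, cost ~581 in your case

EXPLAIN PLAN FOR
SELECT /*+ PARALLEL(CUSTOMER2, 8) */ * FROM CUSTOMER2;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);   -- parallel plan: PX steps and the lower cost
The PX steps in the second plan are the rowid-range granules scanned by the parallel slaves and then funnelled back to the query coordinator, which is where the extra overhead in the estimate comes from.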

Related

Oracle SQL: What is the best way to select a subset of a very large table

I have been roaming these forums for a few years and I've always found my questions had already been asked, and a fitting answer was already present.
I have a pretty generic (and maybe easy) question now though, but I haven't been able to find a thread asking the same one yet.
The situation:
I have a payment table with 10-50M records per day, a history of 10 days and hundreds of columns. About 10-20 columns are indexed. One of the indices is batch_id.
I have a batch table with considerably fewer records and columns, say 10k a day and 30 columns.
If I want to select all payments from one specific sender, I could just do this:
Select * from payments p
where p.sender_id = 'SenderA'
This runs a while, even though sender_id is also indexed. So I figure, it's better to select the batches first, then go into the payments table with the batch_id:
select * from payments p
where p.batch_id in
(select b.batch_id from batches b where b.sender_id = 'SenderA')
--and p.sender_id = 'SenderA'
Now, my questions are:
In the second script, should I uncomment the Sender_id in my where clause on the payments table? It doesn't feel very efficient to filter on sender_id twice, even though it's in different tables.
Is it better if I make it an inner join instead of a nested query?
Is it better if I make it a common table expression instead of a nested query or inner join?
I suppose it could all fit into one question: What is the best way to query this?
In the worst case the two queries should run in the same time and in the best case I would expect the first query to run quicker. If it is running slower, there is some problem elsewhere. You don't need the additional condition in the second query.
The first query will retrieve index entries for a single value, so that is going to access less blocks than the second query which has to find index entries for multiple batches (as well as executing the subquery, but that is probably not significant).
But the danger as always with Oracle is that there are a lot of factors determining which query plan the optimizer chooses. I would immediately verify that the statistics on your indexed columns are up-to-date. If they are not, this might be your problem and you don't need to read any further.
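If they are stale, a minimal sketch of refreshing them (schema and table names are placeholders for your own; exact options vary by version):
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'MY_SCHEMA',     -- placeholder schema name
    tabname => 'PAYMENTS',      -- placeholder table name
    cascade => TRUE             -- also gather statistics on the table's indexes
  );
END;
/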
The next step is to obtain a query execution plan. My guess is that this will tell you that your query is running a full-table-scan.
Whether or not Oracle chooses to perform a full-table-scan on a query such as this depends on the number of rows returned and whether Oracle thinks it is more efficient to use the index or to simply read the whole table. The threshold for flipping between the two is not a fixed number: it depends on a lot of things, one of them being a parameter called DB_FILE_MULTIBLOCK_READ_COUNT.
This is set up by Oracle and in theory it should be configured such that the transition between indexed and full-table-scan queries is smooth. In other words, at the transition point where your query is returning enough rows to just about make a full table scan more efficient, the index scan and the table scan should take roughly the same time.
Unfortunately, I have seen systems where this is way out and Oracle flips to doing full table scans far too quickly, resulting in a long query time once the number of rows gets over a certain threshold.
As I said before, first check your statistics. If that doesn't work, get a QEP and start tuning your Oracle instance.
Tuning Oracle is a very complex subject that can't be answered in full here, so I am forced to recommend links. Here is a useful page on the parameter (reducing it might help): Why Change the Oracle DB_FILE_MULTIBLOCK_READ_COUNT.
Other than that, the general Oracle performance tuning guide is here: (Oracle) Configuring a Database for Performance.
If you are still having problems, you need to progress your investigation further and then come up with a more specific question.
EDIT:
Based on your comment, your query is returning 4M rows out of the 10M-50M in the table. If it is 4M out of 10M, there is no way an index will be of any use. Even with 4M out of 50M, it is still pretty certain that a full-table-scan would be the most efficient approach.
You say that you have a lot of columns, so probably this 4M row fetch is returning a huge amount of data.
You could perhaps consider splitting off some of the columns that are not required and putting them into a child table. In particular, if you have columns containing a lot of data (e.g., some text comments or whatever) they might be better being kept outside the main table.
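As a rough sketch of that idea (table and column names here are invented for illustration, not taken from your schema):
-- Move bulky, rarely-needed columns into a child table keyed by the payment id
CREATE TABLE payment_details AS
SELECT payment_id, raw_message, free_text_comment
FROM   payments;

ALTER TABLE payment_details ADD CONSTRAINT pk_payment_details PRIMARY KEY (payment_id);

-- Then drop the bulky columns from the hot table
ALTER TABLE payments DROP (raw_message, free_text_comment);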
Remember - small is fast, not only in terms of number of rows, but also in terms of the size of each row.
SQL is a declarative language. This means that you specify what you want, not how to get it.
Check your indexes, both primary keys and "normal" ones...

Difference in PostgreSQL query execution timing

I find that the two queries given below when fired on PostgreSQL generate different query execution times:
Query1:
\timing
select s0.value,s1.value,s2.value,s3.value,s4.value
from (
select f0.subject as r0,f0.predicate as r1,f0.object as r2,f1.predicate as r3,f1.object as r4
from schemaName.facts f0,schemaName.facts f1
where f1.subject=f0.subject
) facts,schemaName.strings s0,schemaName.strings s1,schemaName.strings s2,schemaName.strings s3,schemaName.strings s4
where s0.id=facts.r0 and s1.id=facts.r1 and s2.id=facts.r2 and s3.id=facts.r3 and s4.id=facts.r4;
Query1 rewritten:
select s0.value,s1.value,s2.value,s3.value,s4.value
from schemaName.strings s0,schemaName.strings s1,schemaName.strings s2,schemaName.strings s3,schemaName.strings s4,schemaName.facts f0,schemaName.facts f1
where s0.id=f0.subject and s1.id=f0.predicate and s2.id=f0.object and s3.id=f1.predicate and s4.id=f1.object and f0.subject=f1.subject;
I am unable to understand the reason behind PostgreSQL generating different query execution times. Can someone please help me understand this?
PostgreSQL comes with two very nice commands: EXPLAIN and EXPLAIN ANALYZE. The former prints out the query plan with estimates of how long things will take, and the latter outputs the query plan while actually running the query, which allows it to place the real execution costs alongside the plan.
Postgresql uses a whole mess of criteria and heuristics to decide how best to run a query. Everything from sequential and random access costs (tunable in the configs) to statistical samplings of the data in the tables.
I've found that very often it will come up with the same query plan given two radically different-looking queries (assuming they give the same results), though I've also seen the query structure affect the plan. The best way to see what it is doing is to ask it to explain.
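For example, prefixing your first query with EXPLAIN ANALYZE (the output is specific to your data and configuration):
EXPLAIN ANALYZE
select s0.value,s1.value,s2.value,s3.value,s4.value
from (
select f0.subject as r0,f0.predicate as r1,f0.object as r2,f1.predicate as r3,f1.object as r4
from schemaName.facts f0,schemaName.facts f1
where f1.subject=f0.subject
) facts,schemaName.strings s0,schemaName.strings s1,schemaName.strings s2,schemaName.strings s3,schemaName.strings s4
where s0.id=facts.r0 and s1.id=facts.r1 and s2.id=facts.r2 and s3.id=facts.r3 and s4.id=facts.r4;
Running the same thing on the rewritten version lets you compare the two plans side by side.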
All of that said: the second run will always be faster than the first, since the data is now cached. So, if you are really trying to compare runtimes, be sure to run each query at least four times, drop the first one, and average the rest.

Issues with Oracle Query execution time

Below is my query. I use four joins to access data from three different tables; when searching for 1,000 records it takes around 5.5 seconds, but when I amp it up to 100,000 it takes what seems like an infinite amount of time (last cancelled at 7 hours).
Does anyone have any idea of what I am doing wrong? Or what could be done to speed up the query?
This query will probably end up having to be run to return millions of records; I've only limited it to 100,000 for the purpose of testing the query, and it seems to fall over at even this small amount.
For the record, I'm on Oracle 8.
CREATE TABLE co_tenancyind_batch01 AS
SELECT /*+ CHOOSE */ ou_num,
x_addr_relat,
x_mastership_flag,
x_ten_3rd_party_source
FROM s_org_ext,
s_con_addr,
s_per_org_unit,
s_contact
WHERE s_org_ext.row_id = s_con_addr.accnt_id
AND s_org_ext.row_id = s_per_org_unit.ou_id
AND s_per_org_unit.per_id = s_contact.row_id
AND x_addr_relat IS NOT NULL
AND rownum < 100000
Explain Plan in Picture: http://imgur.com/Xw9x4BA (easy to read)
Your test based on 100,000 rows is not meaningful if you are then going to run it for many millions. The optimiser knows that it can satisfy the query faster when it has a stopkey by using nested loop joins.
When you run it for a very large data set you're likely to need a different plan, with hash joins most likely. Covering indexes might help with that, but we can't tell because the selected columns are missing column aliases that tell us which table they come from. You're most likely to hit memory problems with large hash joins, which could be ameliorated with hash partitioning but there's no way the Siebel people would go for that -- you'll have to use manual memory management and monitor v$sql_workarea to see how much you really need.
(Hate the visual explain plan, by the way).
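For reference, a sketch of the kind of work-area monitoring meant above (V$SQL_WORKAREA only exists on 9i and later, and column availability varies by version):
-- Which sort/hash work areas ran fully in memory and which spilled to temp
SELECT operation_type,
       last_memory_used,
       optimal_executions,
       onepass_executions,
       multipasses_executions
FROM   v$sql_workarea
ORDER  BY last_memory_used DESC;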
First of all, can you make sure there is an index on S_CONTACT table and it is enabled ?
If it is, try the select statement with the /*+ CHOOSE */ hint and have another look at the explain plan to see whether the optimizer mode is still RULE. I believe the cost-based optimizer would give better results on this query.
If it is still RULE, try updating the database statistics and then run it again. You can use the DBMS_STATS package for that purpose; if I am not wrong, it was introduced with version 8i. Are you using 8i?
And lastly, I don't know the record counts or the cardinalities between the tables. I might have been more helpful if I knew the design.
Your dataset, looking at the last execution plan, appears to be huge. You could limit access to the base table instead of limiting the number of returned rows, like this:
CREATE TABLE co_tenancyind_batch01 AS
SELECT /*+ CHOOSE */ ou_num,
x_addr_relat,
x_mastership_flag,
x_ten_3rd_party_source
FROM s_org_ext,
s_con_addr,
s_per_org_unit,
(select * from s_contact where rownum <= 100000) cont
WHERE s_org_ext.row_id = s_con_addr.accnt_id
AND s_org_ext.row_id = s_per_org_unit.ou_id
AND s_per_org_unit.per_id = cont.row_id
AND x_addr_relat IS NOT NULL
This should improve things, but it will not be extremely quick.

How to improve query performance

I have a lot of records in the table. When I execute the following query it takes a lot of time. How can I improve the performance?
SET ROWCOUNT 10
SELECT StxnID
,Sprovider.description as SProvider
,txnID
,Request
,Raw
,Status
,txnBal
,Stxn.CreatedBy
,Stxn.CreatedOn
,Stxn.ModifiedBy
,Stxn.ModifiedOn
,Stxn.isDeleted
FROM Stxn,Sprovider
WHERE Stxn.SproviderID = SProvider.Sproviderid
AND Stxn.SProviderid = ISNULL(@pSProviderID, Stxn.SProviderid)
AND Stxn.status = ISNULL(@pStatus, Stxn.status)
AND Stxn.CreatedOn BETWEEN ISNULL(@pStartDate, getdate()-1) and ISNULL(@pEndDate, getdate())
AND Stxn.CreatedBy = ISNULL(@pSellerId, Stxn.CreatedBy)
ORDER BY StxnID DESC
The stxn table has more than 100,000 records.
The query is run from a report viewer in asp.net c#.
This is my go-to article when I'm trying to do a search query that has several search conditions which might be optional.
http://www.sommarskog.se/dyn-search-2008.html
The biggest problem with your query is the column = ISNULL(@column, column) syntax. MSSQL won't use an index for that. Consider changing it to (column = @column AND @column IS NOT NULL).
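A hedged sketch of what that can look like in practice, in the spirit of the article linked above: this variant keeps the "NULL parameter means no filter" behaviour of the original ISNULL form and adds OPTION (RECOMPILE) so the optimizer plans for the actual parameter values. It assumes the @p... names from the question are stored-procedure parameters, and the column list is abbreviated.
SELECT TOP (10)
       Stxn.StxnID, Sprovider.description AS SProvider, Stxn.Status,
       Stxn.CreatedBy, Stxn.CreatedOn
FROM   Stxn
INNER JOIN Sprovider ON Stxn.SproviderID = Sprovider.Sproviderid
WHERE  (@pSProviderID IS NULL OR Stxn.SproviderID = @pSProviderID)
AND    (@pStatus      IS NULL OR Stxn.Status      = @pStatus)
AND    (@pSellerId    IS NULL OR Stxn.CreatedBy   = @pSellerId)
AND    Stxn.CreatedOn BETWEEN ISNULL(@pStartDate, GETDATE() - 1)
                          AND ISNULL(@pEndDate,   GETDATE())
ORDER  BY Stxn.StxnID DESC
OPTION (RECOMPILE);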
You should consider using the execution plan and look for missing indexes. Also, how long does it take to execute? What is slow for you?
Maybe you could also not return so many rows, but that is just a guess. Actually we need to see your table and indexes plus the execution plan.
Check sql-tuning-tutorial
For one, use SELECT TOP () instead of SET ROWCOUNT - the optimizer will have a much better chance that way. Another suggestion is to use a proper inner join instead of the old-style table,table join syntax, which can easily end up producing a cartesian product (that is not the case here, but it happens much more easily with the old syntax). It should be:
...
FROM Stxn INNER JOIN Sprovider
ON Stxn.SproviderID = SProvider.Sproviderid
...
And if you think 100K rows is a lot, or that this volume is a reason for slowness, you're sorely mistaken. Most likely you have really poor indexing strategies in place, possibly some parameter sniffing, possibly some implicit conversions... hard to tell without understanding the data types, indexes and seeing the plan.
There are a lot of things that could impact the performance of query. Although 100k records really isn't all that many.
Items to consider (in no particular order)
Hardware:
Is SQL Server memory constrained? In other words, does it have enough RAM to do its job? If it is swapping memory to disk, then this is a sure sign that you need an upgrade.
Is the machine disk constrained? In other words, are the drives fast enough to keep up with the queries you need to run? If it's memory constrained, then disk speed becomes a larger factor.
Is the machine processor constrained? For example, when you execute the query does the processor spike for long periods of time? Or, are there already lots of other queries running that are taking resources away from yours...
Database Structure:
Do you have indexes on the columns used in your where clause? If the tables do not have indexes then it will have to do a full scan of both tables to determine which records match.
Eliminate the ISNULL function calls. If this is a direct query, have the calling code validate the parameters and set default values before executing. If it is in a stored procedure, do the checks at the top of the s'proc. Unless you are executing this WITH RECOMPILE, which allows parameter sniffing, those functions will have to be evaluated for each row.
Network:
Is the network slow between you and the server? Depending on the amount of data pulled you could be pulling GB's of data across the wire. I'm not sure what is stored in the "raw" column. The first question you need to ask here is "how much data is going back to the client?" For example, if each record is 1MB+ in size, then you'll probably have disk and network constraints at play.
General:
I'm not sure what "slow" means in your question. Does it mean that the query is taking around 1 second to process or does it mean it's taking 5 minutes? Everything is relative here.
Basically, it is going to be impossible to give a hard answer without a lot of questions asked by you. All of these will bear out if you profile the queries, understand what and how much is going back to the client and watch the interactions amongst the various parts.
Finally depending on the amount of data going back to the client there might not be a way to improve performance short of hardware changes.
Make sure Stxn.SproviderID, Stxn.status, Stxn.CreatedOn, Stxn.CreatedBy, Stxn.StxnID and SProvider.Sproviderid all have indexes defined.
(NB -- you might not need all, but it can't hurt.)
I don't see much that can be done on the query itself, but I can see things being done on the schema:
Create an index / PK on Stxn.SproviderID
Create an index / PK on SProvider.Sproviderid
Create indexes on status, CreatedOn, CreatedBy, StxnID
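For example (index names are invented for illustration; choose the columns that match the filters you actually use):
CREATE INDEX IX_Stxn_SproviderID ON Stxn (SproviderID);
CREATE INDEX IX_Stxn_Status      ON Stxn (Status);
CREATE INDEX IX_Stxn_CreatedOn   ON Stxn (CreatedOn);
CREATE INDEX IX_Stxn_CreatedBy   ON Stxn (CreatedBy);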
Something to consider: When ROWCOUNT or TOP are used with an ORDER BY clause, the entire result set is created and sorted first and then the top 10 results are returned.
How does this run without the Order By clause?

Can I trust Execution plans?

I have these Queries:
With CTE(comno) as
(select distinct comno=ErpEnterpriseId from company)
select id=Row_number() over(order by comno),comno from cte
select comno=ErpEnterpriseId,RowNo=Row_number() over (order by erpEnterpriseId) from company group by ErpEnterpriseId
SELECT erpEnterpriseId, ROW_NUMBER() OVER(ORDER BY erpEnterpriseId) AS RowNo
FROM
(
SELECT DISTINCT erpEnterpriseId
FROM Company
) x
All three of them return identical costs and actual execution plans. Why and how is that so?
It's all down to the query optimizer - it will be trying to optimize the query you enter into the most efficient execution plan (i.e. several different queries could be optimized down to the SAME statement that is estimated to be most efficient).
The main thing you should do when trying to optimise a query and find which version performs best is just to try them and compare performance. Run a SQL Profiler trace to see what the duration/reads are for each version. I usually run each version of a query 3 times to get an average to compare. Each time, clear the plan cache and data cache first to prevent skewed results.
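On SQL Server, clearing those caches between timed runs looks something like this (only on a test instance - these commands affect every session on the server):
CHECKPOINT;                -- flush dirty pages so they can be dropped
DBCC DROPCLEANBUFFERS;     -- empty the data cache
DBCC FREEPROCCACHE;        -- empty the plan cache

SET STATISTICS TIME ON;    -- then capture CPU/elapsed time for the run you are measuring
SET STATISTICS IO ON;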
It's worth having a read of this MSDN article on the optimizer.
Simple: the optimizer is probably turning all your statements into the same statement.
Just like in English, in which there are many ways to say the same thing, all three of those queries are asking for the same data. The SQL Engine (the query optimizer) knows that and is smart enough to know what you are asking.
Even more appropriately, the engine has information that you don't have (or likely don't know) - how the data is organized and indexed. It uses this information to make its own decision about what the BEST way to get the data is, and that's what it is doing.
Although there are ways to override the optimizer, unless you really know what you are doing you will probably only hurt performance. So your best option is to write the queries in whatever way makes the most sense to you (or other humans) for readability and maintainability.