I find that the two queries below, when run on PostgreSQL, produce different execution times:
Query1:
\timing
select s0.value, s1.value, s2.value, s3.value, s4.value
from (
  select f0.subject as r0, f0.predicate as r1, f0.object as r2, f1.predicate as r3, f1.object as r4
  from schemaName.facts f0, schemaName.facts f1
  where f1.subject = f0.subject
) facts, schemaName.strings s0, schemaName.strings s1, schemaName.strings s2, schemaName.strings s3, schemaName.strings s4
where s0.id = facts.r0 and s1.id = facts.r1 and s2.id = facts.r2 and s3.id = facts.r3 and s4.id = facts.r4;
Query1 rewritten:
select s0.value, s1.value, s2.value, s3.value, s4.value
from schemaName.strings s0, schemaName.strings s1, schemaName.strings s2, schemaName.strings s3, schemaName.strings s4, schemaName.facts f0, schemaName.facts f1
where s0.id = f0.subject and s1.id = f0.predicate and s2.id = f0.object and s3.id = f1.predicate and s4.id = f1.object and f0.subject = f1.subject;
I am unable to understand why PostgreSQL generates different execution times for these two queries. Can someone please help me understand this?
PostgreSQL comes with two very useful commands: EXPLAIN and EXPLAIN ANALYZE. The former prints the query plan with estimates of how long each step will take; the latter prints the query plan while actually running the query, which lets it place the real execution costs alongside the plan.
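For example, a minimal sketch using the tables from the question (any query works here):

EXPLAIN
select s0.value from schemaName.facts f0, schemaName.strings s0 where s0.id = f0.subject;

EXPLAIN ANALYZE
select s0.value from schemaName.facts f0, schemaName.strings s0 where s0.id = f0.subject;

The first prints only the estimated plan; the second runs the query and annotates each plan node with actual times and row counts.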
PostgreSQL uses a whole mess of criteria and heuristics to decide how best to run a query: everything from sequential and random page access costs (tunable in the config) to statistical samplings of the data in the tables.
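For instance, the page-cost settings can be inspected and changed per session (a sketch; the values shown are PostgreSQL's stock defaults):

SHOW seq_page_cost;          -- 1.0 by default
SHOW random_page_cost;       -- 4.0 by default
SET random_page_cost = 2.0;  -- tell the planner random I/O is relatively cheaper, e.g. on SSDs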
I've found that very often it will come up with the same query plan given two radically different-looking queries (assuming they give the same results), and I've also seen the query structure affect the plan. The best way to see what it is doing is to ask it to explain.
All of that said: the second run will always be faster than the first, since the data is now cached. So, if you are really trying to compare runtimes, be sure to run each query at least four times, drop the first one, and average the rest.
Related
I'm currently taking an SQL course and trying to understand efficiency of queries.
Given this query, what is its efficiency:
SELECT *
FROM Customers
WHERE Age = (SELECT MIN(Age)
FROM Customers)
What I'm trying to understand is whether the subquery runs once at the beginning, making the whole query O(n + n)?
Or does the subquery run every time a customer's age is checked, which makes it O(n^2)?
Thank you!
If you want to understand how the query optimizer interprets a query, you have to review the execution/explain plan, which almost every RDBMS makes available.
As noted in the comments you tell the RDBMS what you want, not how to get it.
Very often it helps to have a deeper understanding of the particular database engine being used in order to write a query in the most performant way, i.e., to be able to think like the query processor.
Like any language, there's more than one way to skin a cat, so to speak, and with SQL there is usually more than one way to write a query that results in the same output - very often many ways, depending on the complexity.
How a query execution plan gets built and executed is determined by the query optimizer at compile time and depends on many factors that vary by RDBMS: data cardinality, table size, row size, estimated number of rows, sargability, indexes, available resources, current load, concurrency, and isolation level, just to name a few.
It often helps to write queries in the most performant way by thinking what you would have to do to accomplish the same task.
In your example, you are looking for all the rows in a table where a particular value equals another value. You have chosen to find that value by first looking for the minimum age - you would only have to do this once as it's a single scalar value, so it's reasonable to assume (but not guaranteed) the database engine would do the same.
You could also approach the problem by sorting and limiting to the top qualifying row, including ties, if the syntax is supported by the RDBMS (see the sketch below).
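A sketch of that approach, assuming an RDBMS that supports the ANSI FETCH FIRST ... WITH TIES syntax (e.g. recent PostgreSQL or Oracle releases):

SELECT *
FROM Customers
ORDER BY Age
FETCH FIRST 1 ROWS WITH TIES;

This keeps every customer tied at the minimum age, which is the same result set the subquery form produces.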
Ultimately there is no black and white answer.
If we look at a simple query like this one:
SELECT * FROM CUSTOMER2;
We can tell by looking at it that it simply does one thing, retrieves everything from CUSTOMER2.
Now my question is, why is it that when we run it like this:
SELECT /*+ PARALLEL(CUSTOMER2, 8) */ * FROM CUSTOMER2;
The cost of it (according to the execution plan) goes from 581 to 81? Since it's only one task, isn't it just performed on the same thread anyway?
I can understand if there were two full table scans needing to be done as you can run those two in parallel threads so they execute at the same time. But in our case, there is only one full table scan.
So how does running it in parallel make it faster when there is nothing to run it "in parallel" with?
Lastly, when I altered my personal cluster and one of my tables so that anything performed on it runs in parallel, I did not see any change in cost like I did with the small statement.
This is my personal one:
SELECT AVG(s.sellprice), s.qty, s.custid
FROM CUSTOMER_saracl c, sale_saracl s
WHERE c.custid = s.custid
GROUP BY (s.qty, s.custid)
HAVING AVG(s.sellprice) >
       (SELECT MIN(AVG(price))
        FROM product_saracl
        WHERE pname LIKE 'FA%'
        GROUP BY price);
Why would that be?
Thank you for any help, I just today learnt about parallel execution so go easy on me haha!
One very important point about relational databases is that tables represent unordered sets. That means that the pages that are scanned for a table can be scanned in any order.
Oracle actually takes advantage of this for parallel scans of a single table: different parallel workers can scan different pages. There is additional overhead to bring the results back together, which is why the estimated cost is 81 and not about 73 (581 / 8).
I think this documentation has good examples that explain this. Some are quite close to your query.
Note that parallelism does not just apply to reading tables. In fact, it is more commonly associated with other operations, such as joins, aggregation, and sorting.
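To see the parallel operators and their costs yourself, a minimal sketch with Oracle's standard tooling:

EXPLAIN PLAN FOR
SELECT /*+ PARALLEL(CUSTOMER2, 8) */ * FROM CUSTOMER2;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

The plan output shows the PX (parallel execution) steps, such as PX BLOCK ITERATOR for splitting the scan across workers and PX SEND/PX RECEIVE for gathering the pieces back together.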
Testing query response times returns interesting results:
When executing the same query several times in a row, the response times at first get better up to a certain point; after that, each execution gets a little slower, or the times jump around inconsistently.
Running the same query sometimes with the USING INDEX hint and sometimes without it returns almost the same response-time range (as described in point 1), although the profile improves (fewer db hits when using USING INDEX).
Dropping the index and re-running the query returns the same profile as running the query with the index in place but without the USING INDEX hint.
Is there an explanation for the above results?
What is the best way to know whether the query has improved when the db hits are getting better but the response times aren't?
The best way to understand how a query executes is probably to use the PROFILE command, which will actually explain how the database goes about executing the query. This should give you feedback on what cypher does with USING INDEX hints. You can also compare different formulations of the same query to see which result in fewer dbHits.
There probably is no comprehensive answer to why the query takes a variable amount of time in various situations. You haven't provided your model, your data, or your queries. It's dependent on a whole host of factors outside of just your query, for example your data model, your cache settings, whether or not the JVM decides to garbage collect at certain points, how full your heap is, what kind of indexes you have (whether or not you use USING INDEX hints) -- and those are only the factors at the neo4j/java level. At the OS level there are many other possibilities/contingencies that make precise performance measurement difficult.
In general when I'm concerned about these things I find it's good to gather a large data sample (run the query 10,000 times) and then take an average. All of the factors that are outside of your control tend to average out in a sample like that, but if you're looking for a concrete prediction of exactly how long the next query will take, down to the millisecond, that may not be realistically possible.
I have a select query with some complex joins and where conditions, and it takes ~9 seconds to execute.
Now, the strange thing is that if I wrap the query in SELECT COUNT(1), the execution time increases dramatically.
SELECT COUNT(1) FROM
(
    SELECT .... -- initial query, executes in ~9 s
) AS q
-- executes in ~1 min
That's very strange to me, since I would expect the opposite result: the SQL Server engine should be smart enough to optimize the inner query's execution (for instance, not execute nested queries in the SELECT clause, etc.).
And that's what the execution plan comparison shows! It says the split should be 74% to 26% (the former being the initial query and the latter the one wrapped in SELECT COUNT(1)).
But that's not what really happens.
I don't know if I should post the query itself, since it's rather large (if you need it, just let me know in the comments).
Thank you!
When you use count(1) you no longer need all the columns.
This means that SQL Server can consider different execution plans using narrower indexes that do not cover all the columns used in the SELECT list of the original query.
Generally this should of course lead to a leaner, faster execution plan; however, it looks like in this case you were unlucky and it didn't.
Probably you will find a node with a large discrepancy between actual and estimated rows. This kind of thing will propagate up the plan and can lead to suboptimal strategy choices for other subtrees (e.g. suboptimal join orderings or algorithms).
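One way to surface that discrepancy (a sketch; in SSMS you can instead just enable "Include Actual Execution Plan"):

SET STATISTICS PROFILE ON;
-- run the wrapped COUNT(1) query here; each operator row in the output
-- shows Rows (actual) next to EstimateRows
SET STATISTICS PROFILE OFF;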
SELECT
/*+ INDEX(ID_BL_REF_NO REF_number_BL_idx*/ DECODE(BL_TYPE,'E',BL_ORIGIN_NAME,'I',BL_FINAL_NAME) FROM_PORT,
DECODE(BL_TYPE,'I',BL_ORIGIN_NAME,'E',BL_FINAL_NAME) TO_PORT,
(BL_VESSEL_CONNECT||'/'||BL_VOYAGE_CONNECT||'/'||BL_PORT_CONNECT) Mother_vessel_voyage_port,
SUM(BLC_SIZE) No_of_20s,
SUM(BLC_SIZE) No_of_40s,
SUM(DECODE(BLC_SIZE,'20',1,'40',2)) Teus,
SUM(BLC_GROSSWT) GrossWt,
round((BLC_GROSSWT/SUM(DECODE(BLC_SIZE,'20',1,'40',2))),2) AverageWt,
SUM(DECODE(BLF_MODE,'P',BLF_LOCAL_AMOUNT)) PREPAID,
SUM(DECODE(BLF_MODE,'C',BLF_LOCAL_AMOUNT)) COLLECT,
SUM(DECODE(BLF_MODE,'E',BLF_LOCAL_AMOUNT)) ELSEWHERE,
(SUM(DECODE(BLF_MODE,'P',BLF_LOCAL_AMOUNT)+DECODE(BLF_MODE,'C',BLF_LOCAL_AMOUNT)+DECODE(BLF_MODE,'E',BLF_LOCAL_AMOUNT))/SUM(DECODE(BLC_SIZE,'20',1,'40',2))) AVERAGE
FROM ID_BL_DETAILS,id_bl_containers,ID_BL_FREIGHT
WHERE BL_REFNO=BLC_REFNO
AND BLF_REFNO=BLC_REFNO
GROUP BY BL_VESSEL_CONNECT,BL_VOYAGE_CONNECT,BL_PORT_CONNECT,BL_ORIGIN_NAME,BL_LODPORT,BL_DISPORT,BL_FINAL_NAME,BLC_GROSSWT,BL_TYPE
Your WHERE clause contains only joins. There are no filters. This means your query needs to consider all the rows in at least one table. From this it follows that your query should execute a FULL TABLE SCAN of at least one of your tables, not an indexed read. A full table scan is the most efficient way of getting all the rows in a table.
So don't fix the syntax of your INDEX hint, get rid of it.
Next, figure out which table should drive your query. This is business logic. Probably your requirement is something like
"Summarise BL_DETAILS and BL_FREIGHT
for every row in BL_CONTAINERS."
In which case you might think you need a full table scan of BL_CONTAINERS. But if BL_FREIGHT has more rows than BL_CONTAINERS and every BLF_REF_NO matches a BL_REF_NO (i.e. there is a foreign key on BL_FREIGHT.BLF_REF_NO referencing BL_CONTAINERS.BL_REF_NO) it would probably be better to drive from BL_FREIGHT.
Note that this is true only if you are interested in BL_CONTAINERS rows which have matching BL_FREIGHT rows. But if you want to include containers which have not been used (i.e. they have no matching BL_FREIGHT records) you need to use outer joins and drive off the BL_CONTAINERS table, as in the sketch below.
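A minimal sketch of that outer-join shape, driving from the containers table (column names taken from the query above; the aggregate is just illustrative):

SELECT c.BLC_REFNO,
       SUM(f.BLF_LOCAL_AMOUNT) AS freight_amount
FROM id_bl_containers c
LEFT OUTER JOIN ID_BL_FREIGHT f ON f.BLF_REFNO = c.BLC_REFNO
GROUP BY c.BLC_REFNO;

Containers with no freight rows survive the join with a NULL freight_amount instead of disappearing from the result.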
These considerations get more complicated when you throw BL_DETAILS into the mix. Your report seems to be based around the BL_DETAILS categories (as Jeffrey observes, it is hard for us to understand your query without aliases or describes). So perhaps BL_DETAILS is the right candidate for the driving table.
As you can see, tuning requires insight into the business logic and the details of the data model. You have that local knowledge, we do not.
There are tools which can help you. Oracle has EXPLAIN PLAN which will show you how the database will execute the query. The query optimizer gets better with each release so it matters which version of the database you're using. Here is the documentation for 10g.
The important thing to note is that you need to give the database accurate statistics, in order for it to come up with a good plan. Find out more.
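For example, refreshing the optimizer statistics on one of the tables (a sketch; run this for each table involved):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'ID_BL_DETAILS');
END;
/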
Run EXPLAIN on your query and make sure the proper indexes are set up; that will improve the speed of the query.
http://www.sql.org/sql-database/postgresql/manual/sql-explain.html
Your question states that the query takes 9.968 seconds and you want it to be 0.5 seconds or less. This can only be done effectively (if possible at all) when you know where those 9.968 seconds are spent. And to know where the time in a query goes, you'll not only want to explain the statement, you'll want to trace an execution of it. The latter gives you a breakdown of how the time in your query is spent.
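A minimal sketch of one way to start tracing in Oracle (the OTN threads below walk through the full procedure):

ALTER SESSION SET SQL_TRACE = TRUE;
-- run the slow query here
ALTER SESSION SET SQL_TRACE = FALSE;
-- then format the resulting trace file with tkprof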
There are two threads on OTN that describe how you can do that.
If you want to do the bare minimum, please follow this one:
http://forums.oracle.com/forums/thread.jspa?messageID=1812597
And if you want to give full details, please follow this one:
http://forums.oracle.com/forums/thread.jspa?threadID=863295
Happy tracing!
Regards,
Rob.