I always write very long SQL statements, but I am concerned about later maintenance.
Is it better to divide one SQL statement into many statements?
example:
select a.a1, a.a2, b.b3, sum(c.c4), b.b4...b.bn
from A a
inner join B b on a.a1=b.b1
left join C c on a.a2=c.c2
group by a.a1, a.a2, b.b3, b.b4,...,b.bn
I divide into
create table temp_table as
select a.a1, a.a2, sum(c.c4) as c4_sum
from A a
left join C c on a.a2=c.c2
group by a.a1, a.a2
select temp.*, b.b3, b.b4,...b.bn
from temp_table temp
inner join B b on temp.a1=b.b1
But this requires creating a table in PL/SQL. Is there a better way?
Can many SQL statements execute faster thanks to Oracle's CHOOSE optimizer mode (soft parsing)?
Thanks for sharing your experience.
I am a fan of writing SQL as a single statement. I find that approach is better for a variety of reasons:
A single statement is easier to maintain.
I don't have to name and remember intermediate table names.
I might make a mistake and not re-build an intermediate result when the logic changes.
The optimizer has a good chance of getting the right execution plan.
That said, the optimizer is not always right. Oracle has a good optimizer, and one that makes use of statistics. On occasion, though, dividing a complex query into pieces can improve performance, under some circumstances:
The optimizer is not able to do a good job of estimating the size of the intermediate result. A table "knows" exactly how many rows it has.
You add indexes to the intermediate table.
You want to re-use results, say for inter-query optimization.
Although these can be beneficial, I myself shy away from splitting queries because of the complexity and maintainability cost. However, it can sometimes be faster, as in the sketch below.
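For illustration, here is a minimal sketch of the split approach using an Oracle global temporary table, which also sidesteps the question's concern about creating a permanent table in PL/SQL. Table and column names are taken from the question; the NUMBER types and the c4_total alias are assumptions:

-- One-time setup: a session-private temporary table, with an index on the
-- join column (the intermediate result now has known cardinality and can be
-- indexed, per the circumstances above).
CREATE GLOBAL TEMPORARY TABLE temp_agg (
    a1       NUMBER,
    a2       NUMBER,
    c4_total NUMBER
) ON COMMIT PRESERVE ROWS;
CREATE INDEX temp_agg_ix ON temp_agg (a1);

-- Per run: populate the intermediate result, then join it.
INSERT INTO temp_agg
SELECT a.a1, a.a2, SUM(c.c4)
FROM A a
LEFT JOIN C c ON a.a2 = c.c2
GROUP BY a.a1, a.a2;

SELECT t.*, b.b3, b.b4
FROM temp_agg t
INNER JOIN B b ON t.a1 = b.b1;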
It's rarely faster. You're hiding your intent from the optimizer. Generally give it one query with no user functions for optimum performance.
It won't necessarily be faster, as both are run on the Oracle server, and your PL/SQL will be compiled anyway.
If you have everything done in one single SQL statement, you leave the query optimization to Oracle, while if you write your own PL/SQL, you may have more control over how the queries are executed. But of course, if you write bad PL/SQL, it will definitely perform worse.
However, I am not sure breaking the code up really improves maintainability. Unless you can reuse the broken-out pieces elsewhere, which would improve code reuse, making it one single statement seems more logical. You can always add comments explaining as much detail as possible to make it clear to whoever reads it in the future.
So, I have an SQL query for MSSQL looking like this (simplified for readability):
SELECT ...
FROM (
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA
WHERE STATUS NOT IN (107)
GROUP BY ...
) q
WHERE q.Tdays > 0
GROUP BY ...
It works fine, but I need a comparison against another table in the inner query, so I added a left join and said comparison:
SELECT ...
FROM (
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA
LEFT JOIN OTHER_TABLE ON MY_DATA.ID=OTHER_TABLE.ID -- new JOIN
WHERE STATUS NOT IN (107) AND (DEPARTMENT_ID='SP' OR DEPARTMENT_ID='BL') -- new AND branch
GROUP BY ...
) q
WHERE q.Tdays > 0
GROUP BY ...
This query works, but is A LOT slower than the previous one. The weird thing is, commenting out the new AND branch of the WHERE clause while leaving the JOIN as it is makes it fast again. It's as if it's not joining another table that is slowing the query down, but the actual string comparisons... I am lost as to why this is so slow, or how I could speed it up... any advice would be appreciated!
Use an INNER JOIN. The outer join is being undone by the WHERE clause anyway:
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA d INNER JOIN
OTHER_TABLE ot
ON d.ID = ot.ID -- new JOIN
WHERE d.STATUS NOT IN (107) AND DEPARTMENT_ID IN ('SP', 'BL') -- new AND branch
GROUP BY ...
(The IN shouldn't make a difference; it is just easier to write.)
Next, if this still has slow performance, then look at the execution plans. Slowness at that point means that SQL Server is making a poor decision, probably on the JOIN algorithm. Normally, I fix this by forbidding nested loop joins, but there might be other solutions as well.
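For instance, here is a hedged sketch of that fix using a query-level join hint; SOME_GROUP_COL is a hypothetical stand-in for the question's elided column lists:

SELECT d.SOME_GROUP_COL, ROUND(SUM(d.TOTAL_TIME)/86400.0, 2) AS Tdays
FROM MY_DATA d INNER JOIN
     OTHER_TABLE ot
     ON d.ID = ot.ID
WHERE d.STATUS NOT IN (107) AND DEPARTMENT_ID IN ('SP', 'BL')
GROUP BY d.SOME_GROUP_COL
-- Restricts the optimizer to hash or merge joins, ruling out nested loops.
OPTION (HASH JOIN, MERGE JOIN);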
It's hard to say definitively what will or won't speed things up without seeing the execution plan. Also, understanding how fast you need it to be affects what steps you might want to (or not want to) consider taking.
What follows is admittedly somewhat vague, but these are a few things that came to mind when I thought about this. Take a look at the execution plan as Philip Couling suggested in that good link to get an idea where the pain points are, and of course, take these suggestions with a grain of salt.
You might consider adding some indexes to either or both of the tables. The execution plan might even give you suggestions on what could be useful, but off the top of my head, something on OTHER_TABLE.DEPARTMENT_ID probably wouldn't hurt.
You might be able to build potential new indexes as Filtered Indexes if those hard-coded search terms (like STATUS and DEPARTMENT_ID) are always going to be the same.
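As a hedged sketch (the index name is made up), a filtered index matching those hard-coded terms might look like:

-- Indexes only the rows that can satisfy the new AND branch.
CREATE NONCLUSTERED INDEX IX_OTHER_TABLE_SP_BL
ON OTHER_TABLE (ID)
WHERE DEPARTMENT_ID IN ('SP', 'BL');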
You could pre-calculate some of this information if it's not changing so rapidly that you need to query it fresh on every call. This comes back to how fast you need it to go, because for just about any query, you can add columns or pre-populated lookup tables to avoid doing work at run time. For example, you could add a new bit field like IsNewOrBranch or IsStatusNot107 (both somewhat egregious steps, but things which could work). Or you might pre-aggregate the data in the inner query ahead of time.
I know you simplified the query for our benefit, but that also makes it a little hard to know what's going on with the subquery and the subsequent GROUP BY being performed against it. There might be a way to avoid having to do two GROUP BYs.
Along the same vein, you might also look into splitting what you're doing into separate statements if SQL Server is having a difficult time figuring out how best to return the data. For example, you might populate a temp table or table variable with the results of your inner query, then perform your subsequent GROUP BY on that, as sketched below. While this approach isn't always useful, there are many times where trying to cram all the work into a single query will actually end up being worse than several individual, simple, optimized steps would be.
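A minimal sketch of that split, with GroupCol and SubCol as hypothetical stand-ins for the question's elided GROUP BY columns:

-- Step 1: materialize the inner aggregate into a temp table.
SELECT GroupCol, SubCol, ROUND(SUM(TOTAL_TIME)/86400.0, 2) AS Tdays
INTO #inner_result
FROM MY_DATA
WHERE STATUS NOT IN (107)
GROUP BY GroupCol, SubCol;

-- Step 2: the outer GROUP BY now runs against a small, pre-aggregated set.
SELECT GroupCol, SUM(Tdays) AS TotalDays
FROM #inner_result
WHERE Tdays > 0
GROUP BY GroupCol;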
And as Gordon Linoff suggested, there are a number of query hints which could be used to coax the execution plan into doing things a specific way. But be careful, often that way lies madness.
Your SQL is fine, and restricting your data with an additional AND clause should usually not make it slower.
As it happens, choosing a fast execution path is a hard problem, and SQL Server sometimes (albeit seldom) gets it wrong.
What you can do to help SQL Server find the best execution path is to:
make sure the statistics on your tables are up-to-date and
make sure that there is an "obviously suitable" index that SQL Server can use. SQL Server Management Studio will usually give you suggestions on missing indexes when you select the "Include Actual Execution Plan" option.
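For the first point, a minimal sketch (table names taken from the question):

-- Rebuild statistics with a full scan so the row-count estimates are current.
UPDATE STATISTICS MY_DATA WITH FULLSCAN;
UPDATE STATISTICS OTHER_TABLE WITH FULLSCAN;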
I'm using a database that requires optimized queries, and I'm wondering which of these queries is the optimized one. I used a timer, but the results are too close, so I have no clue which one to use.
QUERY 1:
SELECT A.MIG_ID_ACTEUR, A.FL_FACTURE_FSL , B.VAL_NOM,
B.VAL_PRENOM, C.VAL_CDPOSTAL, C.VAL_NOM_COMMUNE, D.CCB_ID_ACTEUR
FROM MIG_FACTURE A
INNER JOIN MIG_ACTEUR B
ON A.MIG_ID_ACTEUR= B.MIG_ID_ACTEUR
INNER JOIN MIG_ADRESSE C
ON C.MIG_ID_ADRESSE = B.MIG_ID_ADRESSE
INNER JOIN MIG_CORR_REF_ACTEUR D
ON A.MIG_ID_ACTEUR= D.MIG_ID_ACTEUR;
QUERY 2:
SELECT A.MIG_ID_ACTEUR, A.FL_FACTURE_FSL , B.VAL_NOM, B.VAL_PRENOM,
C.VAL_CDPOSTAL, C.VAL_NOM_COMMUNE, D.CCB_ID_ACTEUR
FROM MIG_FACTURE A , MIG_ACTEUR B, MIG_ADRESSE C, MIG_CORR_REF_ACTEUR D
WHERE A.MIG_ID_ACTEUR= B.MIG_ID_ACTEUR
AND C.MIG_ID_ADRESSE = B.MIG_ID_ADRESSE
AND A.MIG_ID_ACTEUR= D.MIG_ID_ACTEUR;
If you are asking whether it is more efficient to use the ANSI SQL-92 join syntax (a inner join b) or whether it is more efficient to use the older join syntax of listing the join predicates in the WHERE clause, it shouldn't matter. I'd expect the query plans for the two queries to be identical. If the query plans are identical, performance will be identical. If the plans are not identical, that would generally imply that you had encountered a bug in the database's query parsing engine.
Personally, I'd use the ANSI SQL-92 syntax (query 1), both because it is more portable when you want to do an outer join and because it generally makes the query more readable and decreases the probability that you'll accidentally leave out a join condition. That's solely a readability and maintainability consideration, though, not a performance consideration.
First things first:
"I used a timer but the results are too close" -- This is actually not a good way to test performance. Databases have caches, so the results you get back won't be comparable with a stopwatch. You have system load to contend with, caching, and a million other things that make that particular comparison worthless. Instead, use EXPLAIN to figure out the execution plan. Use SHOW PROFILES and SHOW STATUS to see where and how the queries are spending time. Check Last_query_cost. But don't check your stopwatch; that won't tell you anything.
Second: this question can't be answered with the info you provided. In point of fact, the queries are identical (verify that with EXPLAIN) and simply boil down to implicit vs. explicit join syntax. That doesn't make either one of them optimized, though. Again, you need to dig into the join itself and see whether it's making use of indexes, for example, or whether it's doing a lot of temp tables or file sorts.
Optimizing the query is a good thing... but these two are the same. A stopwatch won't help you. Use EXPLAIN, SHOW PROFILES, SHOW STATUS... not a stopwatch :-)
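For example, here is a sketch of that workflow, assuming a MySQL-style server (SHOW PROFILES and Last_query_cost are MySQL features), applied to query 1 from the question:

-- Compare estimated execution plans instead of stopwatch timings.
EXPLAIN
SELECT A.MIG_ID_ACTEUR, A.FL_FACTURE_FSL, B.VAL_NOM, B.VAL_PRENOM,
       C.VAL_CDPOSTAL, C.VAL_NOM_COMMUNE, D.CCB_ID_ACTEUR
FROM MIG_FACTURE A
INNER JOIN MIG_ACTEUR B ON A.MIG_ID_ACTEUR = B.MIG_ID_ACTEUR
INNER JOIN MIG_ADRESSE C ON C.MIG_ID_ADRESSE = B.MIG_ID_ADRESSE
INNER JOIN MIG_CORR_REF_ACTEUR D ON A.MIG_ID_ACTEUR = D.MIG_ID_ACTEUR;

-- Then profile actual execution:
SET profiling = 1;
-- ... run query 1 and query 2 here ...
SHOW PROFILES;                               -- per-query time, broken into stages
SHOW SESSION STATUS LIKE 'Last_query_cost';  -- optimizer's cost estimate for the last query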
One of my jobs is to maintain our database. We usually have performance problems when generating reports and otherwise working with that database.
When I started looking at the queries our ERP sends to the database, I saw a lot of totally needless subselect queries inside the main queries.
As I am not a member of the development team that created the program we use, they do not much like it when I criticize their code and work. Let's say they do not take my reviews as serious statements.
So I am asking you a few questions about subselects in SQL:
Do subselects take a lot more time than left outer joins?
Is there any blog, article, or anything else that recommends against using subselects?
How can I prove that a query will be faster if we avoid the subselect?
Our database server is MSSQL 2005.
"Show, Don't Tell" - Examine and compare the query plans of the queries identified using SQL Profiler. In particular, look out for table scans and bookmark lookups (you want to see index seeks as often as possible). The 'goodness of fit' of query plans depends on up-to-date statistics, what indexes are defined, and the holistic query workload.
Execution Plan Basics
Understanding More Complex Query Plans
Using SQL Server Profiler (2005 Version)
Run the queries in SQL Server Management Studio (SSMS) and turn on Query->Include Actual Execution Plan (CTRL+M)
Think yourself lucky they're only subselects (for which, in some cases, the optimiser will produce equivalent 'join plans') and not correlated sub-queries!
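To illustrate the distinction (hypothetical orders/customers tables):

-- Non-correlated subselect: the inner query runs once.
SELECT * FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE region = 'EU');

-- Correlated sub-query: the inner query references the outer row,
-- so it may be re-evaluated once per outer row.
SELECT * FROM orders o
WHERE o.amount > (SELECT AVG(i.amount)
                  FROM orders i
                  WHERE i.customer_id = o.customer_id);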
Identify a query that is performing a high number of logical reads, re-write it using your preferred technique, and then show how few logical reads it does by comparison.
Here's a tip. To get the total number of logical reads performed, wrap a query in question with:
SET STATISTICS IO ON
GO
-- Run your query here
SET STATISTICS IO OFF
GO
Run your query, and switch to the messages tab in the results pane.
If you are interested in learning more, there is no better book than SQL Server 2008 Query Performance Tuning Distilled, which covers the essential techniques for monitoring, interpreting and fixing performance issues.
One thing you can do is to load SQL Profiler and show them the cost (in terms of CPU cycles, reads and writes) of the sub-queries. It's tough to argue with cold, hard statistics.
I would also check the query plan for these queries to make sure appropriate indexes are being used, and table/index scans are being held to a minimum.
In general, I wouldn't say sub-queries are bad, if used correctly and the appropriate indexes are in place.
I'm not very familiar with MSSQL, as we are using PostgreSQL in most of our applications. However, there should be something like "EXPLAIN" which shows you the execution plan for a query. There you should be able to see the various steps that a query will perform in order to retrieve the needed data.
If you see a lot of table scans or loop joins there without any index usage, it is definitely a hint of slow query execution. With such a tool you should be able to compare the two queries (one with the join, the other without).
It is difficult to state which is the better way, because it depends heavily on the indexes the optimizer can use in each case, and, depending on the DBMS, the optimizer may be able to implicitly rewrite a subquery into a join and execute it that way.
If you really want to show which is better, you have to execute both and measure the time, CPU usage, and so on.
UPDATE:
Probably it is this one for MSSQL -->QueryPlan
From my own experience, both methods can be valid; for example, an EXISTS subselect can avoid a lot of work by breaking out early.
But most of the time, queries with a lot of subselects are written by devs who do not really understand SQL and apply their classic procedural-programmer way of thinking to queries. They don't even consider joins, and so they write some awful queries. So I prefer joins, and I always check subqueries. To be completely honest, I track slow queries, and my first move on a slow query containing subselects is to try joins. That works a lot of the time.
But there is no rule establishing that subselects are bad or slower than joins; it's just that bad SQL programmers often write subselects :-)
Do subselects take a lot more time than left outer joins?
This depends on the subselect and left outer joins.
Generally, this construct:
SELECT *
FROM mytable
WHERE mycol NOT IN
(
SELECT othercol
FROM othertable
)
is more efficient than this:
SELECT m.*
FROM mytable m
LEFT JOIN
othertable o
ON o.othercol = m.mycol
WHERE o.othercol IS NULL
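For completeness, here is the third pattern compared in the article linked below, NOT EXISTS, which unlike NOT IN also behaves predictably when othercol can contain NULLs:

SELECT m.*
FROM mytable m
WHERE NOT EXISTS
(
SELECT 1
FROM othertable o
WHERE o.othercol = m.mycol
)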
See here:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
Is there any blog, article, or anything else that recommends against using subselects?
I would steer clear of the blogs which blindly recommend to avoid subselects.
They are implemented for a reason and, believe it or not, the developers have put some effort into optimizing them.
How can I prove that a query will be faster if we avoid the subselect?
Write a query without the subselects which runs faster.
If you post your query here, we may well be able to improve it. However, a version with the subselects may turn out to be faster.
Try rewriting some of the queries to eliminate the sub-select and compare runtimes.
Share and enjoy.
Let's say that you want to select all rows from one table that have a corresponding row in another one (the data in the other table is not important; only the presence of a corresponding row matters). From what I know about DB2, this kind of query performs better when written as a correlated query with an EXISTS clause rather than an INNER JOIN. Is that the same for SQL Server? Or doesn't it make any difference whatsoever?
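For concreteness, the two shapes being compared look roughly like this (t1 and t2 are hypothetical; note that the JOIN form can return duplicate rows if t2.id is not unique):

-- Correlated EXISTS form: asks only about presence.
SELECT t1.*
FROM t1
WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.id = t1.id);

-- INNER JOIN form.
SELECT t1.*
FROM t1
INNER JOIN t2 ON t2.id = t1.id;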
I just ran a test query and the two statements ended up with the exact same execution plan. Of course, for just about any performance question I would recommend running the test in your own environment; with SQL Server Management Studio this is easy (or SQL Query Analyzer if you're running 2000). Just type both statements into a query window, select Query|Include Actual Execution Plan, and run the query. Go to the results tab and you can easily see what the plans are and which one had the higher cost.
Odd: it's normally more natural for me to write these as a correlated query first, at which point I have to go back and re-factor to use a join, because in my experience the SQL Server optimizer is more likely to get that right.
But don't take me too seriously. For all that I have 26K rep here and one of only 2 current SQL topic-specific badges, I'm actually pretty junior in terms of SQL knowledge (it's all about the volume! ;) ); certainly I'm no DBA. In practice, you will of course need to profile each method to gauge its actual performance. I would expect the optimizer to recognize what you're asking for and handle either query in the optimal way, but you never know until you check.
As everyone notes, it all boils down to the optimizer. I'd suggest writing it in whatever way feels more natural to you, then making sure the optimizer can figure out the most effective query plan (gather statistics, create an index, whatever). The SQL Server optimizer is pretty good overall, so long as you give it the information it needs to work with.
Use the join. It might not make much of a difference in performance if you have small tables, but if the "outer" table is very large then it will need to do the EXISTS sub-query for each row. If your tables are indexed on the common columns then it should be far quicker to do the INNER JOIN. BTW, if you want to find all rows that are NOT in the second table, use a LEFT JOIN and test for NULL in the second table--it is much faster than using EXISTS when you have very large tables and indexes.
Probably the best performance is with a join to a derived table (see the sketch below). EXISTS would probably be next (and might be faster). The worst performance would be with a subquery inside the SELECT list, as that tends to run row by row instead of as a set.
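A sketch of that derived-table variant, using the same hypothetical t1/t2 names as above; the DISTINCT keeps the join from multiplying rows:

SELECT t1.*
FROM t1
INNER JOIN (SELECT DISTINCT id FROM t2) d
    ON d.id = t1.id;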
However, all things being equal, and database performance being very dependent on database design, I would try out all possible methods and see which is faster in your circumstances.
Consider the following 2 queries:
select tblA.a,tblA.b,tblA.c,tblA.d
from tblA
where tblA.a not in (select tblB.a from tblB)
select tblA.a, tblA.b, tblA.c, tblA.d
from tblA left outer join tblB
on tblA.a = tblB.a
where tblB.a is null
Which will perform better? My assumption is that in general the join will be better except in cases where the subselect returns a very small result set.
RDBMSs "rewrite" queries to optimize them, so it depends on the system you're using, and I would guess they end up giving the same performance on most "good" databases.
I suggest picking the one that is clearer and easier to maintain; for my money, that's the first one. It's much easier to debug the subquery, as it can be run independently to check it for sanity.
Non-correlated sub-queries are fine. You should go with whatever best describes the data you want. As has been noted, this likely gets rewritten into the same plan, but that isn't guaranteed! What's more, if tables A and B are not 1:1 you will get duplicate tuples from the join query (since the IN clause performs an implicit DISTINCT), so it's always best to code what you want and actually think about the outcome.
Well, it depends on the datasets. From my experience, if you have a small dataset, go for NOT IN; if it's large, go for a LEFT JOIN. The NOT IN clause seems to be very slow on large datasets.
One other thing I might add is that explain plans can be misleading. I've seen several queries where the explain cost was sky high yet the query ran in under 1s. On the other hand, I've seen queries with an excellent explain plan that could run for hours.
So all in all do test on your data and see for yourself.
I second Tom's answer that you should pick the one that is easier to understand and maintain.
The query plan of any query in any database cannot be predicted because you haven't given us indexes or data distributions. The only way to predict which is faster is to run them against your database.
As a rule of thumb, I tend to use sub-selects when I do not need to include any columns from tblB in my select clause. I would definitely go for a sub-select when I want to use the 'in' predicate (and usually for the 'not in' that you included in the question), for the simple reason that these are easier to understand when you or someone else comes back to change them.
The first query will be faster in SQL Server, which I think is slightly counter-intuitive, since sub-queries seem like they should be slower. In some cases (as data volumes increase) an EXISTS may be faster than an IN.
It should be noted that these queries can produce different results if TblB.a can contain NULLs: in that case the NOT IN query returns no rows at all.
From my observations, MSSQL server produces the same query plan for these queries.
I created a simple query similar to the ones in the question on MSSQL 2005, and the explain plans were different. The first query appears to be faster. I am not a SQL expert, but the estimated explain plan showed a relative cost of 37% for query 1 and 63% for query 2. It appears that the biggest cost for query 2 is the join. Both queries had two table scans.