Related
I work with MS Access 2010 on a daily basis and I want to know if there are alternatives to a Common Table Expression as it is used in SQL Server, and how it affects performance?
For example, is it better to create a subquery or is it better to call a query from another query, which essentially is very similar..
Example:
SELECT A.field1,A.Date
FROM (SELECT * FROM TABLE B WHERE B.Date = Restriction )
or
SELECT A.field1,A.Date
FROM SavedQueryB
SavedQueryB:
SELECT * FROM TABLE B WHERE B.Date = Restriction
I feel having multiple queries makes it easier to debug and manage, but does it affect performance when the data-set is very large?
Also, I've seen some videos about implementing the queries thru VBA, however I'm not very comfortable doing it that way yet.
Essentially. What is more efficient or a better practice? Any suggestion or recommendations on better practices?
I am mostly self taught, through videos, and books, and some programming background (VB.NET)
For the simplest queries, such as those in your question, I doubt you would see a significant performance difference between a subquery and a "stacked query" (one which uses another saved query as its data source) approach. And perhaps the db engine would even use the same query plan for both. (If you're interested, you can use SHOWPLAN to examine the query plans.)
The primary performance driver for those 2 examples will be whether the db engine can use indexed retrieval to fetch the rows which satisfy the WHERE restriction. If TABLE.Date is not indexed, the query will require a full table scan. That would suck badly with a very large dataset, and the performance impact from a full scan should far overshadow any difference between a subquery and stacked query.
The situation could be different with a complex subquery as Allen Browne explains:
Complex subqueries on tables with many records can be slow to run.
Among the potential fixes for subquery performance problems, he suggests ...
Use stacked queries instead of subqueries. Create a separate saved
query for JET to execute first, and use it as an input "table" for
your main query. This pre-processing is usually (but not always)
faster than a subquery. Likewise, try performing aggregation in one
query, and then create another query that operates on the aggregated
results. This post-processing can be orders of magnitude faster than a
query that tries to do everything in a single query with subqueries.
I think your best answer will come from testing more complex real world examples. The queries in your question are so simple that conclusions drawn from them will likely not apply to those real world queries.
This is really context-dependent as a host of scenarios will decide the most efficient outcome including data types, join tables, indexes, and more. In essence and for simple queries like posted SELECT statmeents, the two queries are equivalent but the Jet/ACE's (underlying engine of MS Access) query optimizer may decide a different plan again according to structural needs of the query. Possibly, calling an external query adds a step in execution plan but then subqueries can be executed as self-contained tables then linked to main tables.
Recall SQL's general Order of Operations which differs from typed order as each step involves a virtual table (see SQL Server):
FROM clause --VT1
ON clause --VT2
WHERE clause --VT3
GROUP BY clause --VT4
HAVING clause --VT5
SELECT clause --VT6
ORDER BY clause --VT7
What can be said is for stored query objects, MS Access analyzes and caches the optimized "best plan" version. This is often the argument to use stored queries over VBA string queries which the latter was not optimized before execution. Even further, Access' query object is similar to other RDMS' view object (though Jet/ACE does have the VIEW and PROCEDURE objects). A regular discussion in the SQL world involves your very question of efficiency and best practices: views vs subqueries and usually the answer returns "it depends". So, experiment on a needs basis.
And here CTEs are considered "inline-views" denoted by WITH clause (not yet supported in JET/ACE). SQL programmers may use CTEs for readibility and maintainability as you avoid referencing same statement multiple times in body of statement. All in all, use what fits your coding rituals and project requirements, then adjust as needed.
Resources
MS Access 2007 Query Performance
Tips - with note on subquery performance
Intermediate Microsoft Jet
SQL
- with note on subqueries (just learned about the ANY/SOME/ALL)
Microsoft Jet 3.5 Performance White
Paper
- with note on query plans
I cannot recall where but there are discussion out there about this topic of nested or sub-queries. Basically they all suggest saved queries and then referenced the saved query.
From a personal experience, nested queries are extremely difficult to troubleshoot or modify later. Also, if they get too deep I have experienced a performance hit.
Allen Browne has several tips and tricks listed out here
The one place I use nest queries a lot are in the criteria of action queries. This way I do not have any joins and can limit some of the "cannot perform this operation" issue.
Finally, using query strings in VBA. I have found it be much easier to build parameter queries and then in VBA set a variable to the QueryDef and add in the parameters rather than build up a query string in VBA. So much easier to troubleshoot and modify later.
Just my two cents.
I always write very long sql, but for later maintenance.
Is one sql statement divide into many statement better?
example:
select a.a1, a.a2, b.b3, sum(c.c4), b.b4...b.bn
from A a
inner join B b on a.a1=b.b1
left join C c on a.a2=c.c2
group by a.a1, a.a2, b.b3, b.b4,...,b.bn
I divide into
create temp_table select a.a1, a.a2, sum(c.c4)
from A a
left join C c on a.a2=c.c2
group by a.a1, a.a2
select temp.*, b.b3, b.b4,...b.bn
from temp_table temp
inner join B b on temp.a1=b.b1
But it need to create table in pl/sql.Is there a better way?
Can many sql statement execute faster by Oracle's CHOOSE(soft parse)?
Thanks to experience sharing.
I am a fan of writing SQL as a single statement. I find that approach is better for a variety of reasons:
A single statement is easier to maintain.
I don't have to name and remember intermediate table names.
I might make a mistake and not re-build an intermediate result when the logic changes.
The optimizer has a good chance of getting the right execution plan.
That said, the optimizer is not always right. Oracle has a good optimizer and one that makes use of statistics. On occasion, dividing a complex query into pieces can improve performance, under some circumstances:
The optimizer is not able to do a good job of estimated the size of the intermediate result. A table "knows" exactly how many rows it has.
You add indexes to the intermediate table.
You want to re-use results, say for inter-query optimization.
Although these might be beneficial, I myself shy away because of the complexity and maintainability. However, it can sometimes be faster.
It's rarely faster. You're hiding your intent from the optimizer. Generally give it one query with no user functions for optimum performance.
It won't be necessarily faster, as both are run on Oracle server, and your PL/SQL will be compiled anyway.
If you have everything done by one single SQL, you leave the query optimization to Oracle, while if you write your own PL/SQL, you might have more control of how the queries are executed. But sure if your write bad PL/SQL, it will definitely perform worse.
However, I am not sure breaking codes it up really improve maintainability. Unless you are saying you can reuse the broken pieces in other places, which improve code reuse, I would think making it one single statement seems more logical. You can definitely add more comments to explain as much detail as possible to make it clear to whoever read it in the future.
One of my jobs it to maintain our database, usually we have troubles with lack of performance while getting reports and working whit that base.
When I start looking at queries which our ERP sending to database I see a lot of totally needlessly subselect queries inside main queries.
As I am not member of developers which is creator of program we using, they do not like much when I criticize they code and job. Let say they do not taking my review as serious statements.
So I asking you few questions about subselect in SQL
Does subselect is taking a lot of more time then left outer joins?
Does exists any blog, article or anything where I subselect is recommended not to use ?
How I can prove that if we avoid subselesct in query that query is going to be faster ?
Our database server is MSSQL2005
"Show, Don't Tell" - Examine and compare the query plans of the queries identified using SQL Profiler. Particularly look out for table scans and bookmark lookups (you want to see index seeks as often as possible). The 'goodness of fit' of query plans depends on up-to-date statistics, what indexes are defined, the holistic query workload.
Execution Plan Basics
Understanding More Complex Query Plans
Using SQL Server Profiler (2005 Version)
Run the queries in SQL Server Management Studio (SSMS) and turn on Query->Include Actual Execution Plan (CTRL+M)
Think yourself lucky they're only subselects (which in some cases the optimiser will produce equivalent 'join plans') and not correlated sub-queries!
Identify a query that is performing a high number of logical reads, re-write it using your preferred technique and then show how few logicals reads it does by comparison.
Here's a tip. To get the total number of logical reads performed, wrap a query in question with:
SET STATISTICS IO ON
GO
-- Run your query here
SET STATISTICS IO OFF
GO
Run your query, and switch to the messages tab in the results pane.
If you are interested in learning more, there is no better book than SQL Server 2008 Query Performance Tuning Distilled, which covers the essential techniques for monitoring, interpreting and fixing performance issues.
One thing you can do is to load SQL Profiler and show them the cost (in terms of CPU cycles, reads and writes) of the sub-queries. It's tough to argue with cold, hard statistics.
I would also check the query plan for these queries to make sure appropriate indexes are being used, and table/index scans are being held to a minimum.
In general, I wouldn't say sub-queries are bad, if used correctly and the appropriate indexes are in place.
I'm not very familiar with MSSQL, as we are using postrgesql in most of our applications. However there should exist something like "EXPLAIN" which shows you the execution plan for the query. There you should be able to see the various steps that a query will produce in order to retrieve the needed data.
If you see there a lot of table scans or loop join without any index usage it is definitely a hint for a slow query execution. With such a tool you should be able to compare the two queries (one with the join, the other without)
It is difficult to state which is the better ways, because it really highly depends on the indexes the optimizer can take in the various cases and depending on the DBMS the optimizer may be able to implicitly rewrite a subquery-query into a join-query and execute it.
If you really want to show which is better you have to execute both and measure the time, cpu-usage and so on.
UPDATE:
Probably it is this one for MSSQL -->QueryPlan
From my own experience both methods can be valid, as for example an EXISTS subselect can avoid a lot of treatment with an early break.
Buts most of the time queries with a lot of subselect are done by devs which do not really understand SQL and use their classic-procedural-programmer way of thinking on queries. Then they don't even think about joins, and makes some awfull queries. So I prefer joins, and I always check subqueries. To be completly honnest I track slow queries, and my first try on slow queries containing subselects is trying to do joins. Works a lot of time.
But there's no rules which can establish that subselect are bad or slower than joins, it's just that bad sql programmer often do subselects :-)
Does subselect is taking a lot of more time then left outer joins?
This depends on the subselect and left outer joins.
Generally, this construct:
SELECT *
FROM mytable
WHERE mycol NOT IN
(
SELECT othercol
FROM othertable
)
is more efficient than this:
SELECT m.*
FROM mytable m
LEFT JOIN
othertable o
ON o.othercol = m.mycol
WHERE o.othercol IS NULL
See here:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
Does exists any blog, article or anything where subselect is recommended not to use ?
I would steer clear of the blogs which blindly recommend to avoid subselects.
They are implemented for a reason and, believe it or not, the developers have put some effort into optimizing them.
How I can prove that if we avoid subselesct in query that query is going to be faster ?
Write a query without the subselects which runs faster.
If you post your query here we possibly will be able to improve it. However, a version with the subselects may turn out to be faster.
Try rewriting some of the queries to elminate the sub-select and compare runtimes.
Share and enjoy.
I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?
This is definitely wrong. You are going over the wire a second time for no reason. DBs are very fast at their problem space. Joining tables is one of those and you'll see more of a performance degradation from the second query then the join. Unless your tablespace is hundreds of millions of records, this is not a good idea.
There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But experience and testing is the only thing that will tell you that. MySQL with moderately large tables seems to be susceptable to this, IME. Performing three queries on one table can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.
I'm with you - a single SQL would be better
There's a danger of treating your SQL DBMS as if it was a ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.
It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
Select a.blah1, a.blah2, b.something From foo a Left Join foo2 b On a.foreign_key = b.key Where a.Key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
Yeah, I think it seems like what you're saying is correct.
The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!
You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.
Considering that in one database hit you have all the data you need having one single SQL statement would be better performance 99% of the time. Not sure if the connections is being creating dynamically in this case or not but if so doing so is expensive. Even if the process if reusing existing connections the DBMS is not getting optimize the queries be best way and not really making use of the relationships.
The only way I could ever see doing the calls like this for performance reasons is if the data being retrieved by the foreign key is a large amount and it is only needed in some cases. But in the sample you describe it just grabs it if it exists so this is not the case and therefore not gaining any performance.
The only "gotcha" to all of this is if the result set to work with contains a lot of joins, or even nested joins.
I've had two or three instances now where the original query I was inheriting consisted of a single query that had so a lot of joins in it and it would take the SQL a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables) and broke the query down into a lot of the smaller single select type statements and constructed the final result set in this manner.
This update dramatically fixed the response time, down to a few seconds, because it was easier to do a lot of simple "one shots" to retrieve the necessary data.
I'm not trying to object for objections sake here, but just to point out that the code may have been broken down to such a granular level to address a similar issue.
A single SQL query would lead in more performance as the SQL server (Which sometimes doesn't share the same location) just needs to handle one request, if you would use multiple SQL queries then you introduce a lot of overhead:
Executing more CPU instructions,
sending a second query to the server,
create a second thread on the server,
execute possible more CPU instructions
on the sever, destroy a second thread
on the server, send the second results
back.
There might be exceptional cases where the performance could be better, but for simple things you can't reach better performance by doing a bit more work.
Doing a simple two table join is usually the best way to go after this problem domain, however depending on the state of the tables and indexing, there are certain cases where it may be better to do the two select statements, but typically I haven't run into this problem until I started approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables to ensure you aren't scanning the disk for all records, that is the biggest performance hit a database gets (in my limited experience)
You should always try to minimize the number of query to the database when you can. Your example is perfect for only 1 query. This way you will be able later to cache more easily or to handle more request in same time because instead of always using 2-3 query that require a connexion, you will have only 1 each time.
There are many cases that will require different solutions and it isn't possible to explain all together.
Join scans both the tables and loops to match the first table record in second table. Simple select query will work faster in many cases as It only take cares for the primary/unique key(if exists) to search the data internally.
What is better as far as performance goes?
There is only one way to know: Time it.
In general, I think a single join enables the database to do a lot of optimizations, as it can see all the tables it needs to scan, overhead is reduced, and it can build up the result set locally.
Recently, I had about 100 select-statements which I changed into a JOIN in my code. With a few indexes, I was able to go from 1 minute running time to about 0.6 seconds.
Do not try to write your own join loop as a bunch of selects. Your database server has many clever algorithms for doing joins. Further, your database server can use statistics and estimated cost of access to dynamically pick a join algorithm.
The database server's join algorithm is -- usually -- better than anything you might concoct. They know more about physical I/O, caching and what-not.
This allows you to focus on your problem domain.
A single join will usually outperform multiple single selects. However, there are too many different cases that fit your question. It isn't wise to lump them together under a single simple rule.
More important, a single join will usually be easier for the next programmer to understand and to revise, provided that you and the next programmer "speak the same language" when you use SQL. I'm talking about the language of sets of tuples.
And equally important is that database physical design and query design need to focus first on the questions that will result in a ten for one speed improvement, not on a 10% speed imporvement. If you were doing thousands of simple selects versus a single join, you might get a ten for one advantage. If you are doing three or four simple selects, you won't see a big improvement one way or the other.
One thing to consider besides what has been said, is that the selects will return more data through the network than the joins probably will. If the network connection is already a bottleneck, this could make it much worse, especially if this is done frequently. That said, your best bet in any performacne situation is to test, test, test.
It all depends on how the database will optimize the joins, and the use of indexes.
I had a slow and complex query with lots of joins. Then i subdivided it into 2 or 3 less complex querys. The performance gain was astonishing.
But in the end, "it depends", you have to know where´s the bottleneck.
As has been said before, there is no right answer without context.
The answer to this is dependent on (from the top of my head):
the amount of joining
the type of joining
indexing
the amount of re-use you could have for any of the separate pieces to be joined
the amount of data to be processed
the server setup
etc.
If you are using SQL Server (I am not sure if this is available with other RDBMSs) I would suggest that you bundle an execution plan with you query results. This will give you the ability to see exactly how your query(s) are being executed and what is causing any bottlenecks.
Until you know what SQL Server is actually doing I wouldn't hazard a guess about which query is better.
If your database has lots of data .... and there are multiple joins then please use indexing for better performance.
If there are left/right outer joins in this case , then use multiple selects.
It all depends on your db size, your query, the indexes (which include primary and foreign keys also) ... One cannot reach on conclusion with yes/no on your question.