I've been digging through the Internet over and over and couldn't find any reasonable answer. What's the difference between inlining and flattening in a SQL query? I actually use both terms interchangeably; eventually they lead to the same result, a single big query rather than many atomic ones.
But maybe there is a difference in definitions? For instance, does inlining refer only to functions, while flattening means converting a subquery to a join, as stated here? Yet in another source one can find an example of a completely different transformation.
I guess there may be slight differences in how “inlining” and “flattening” are defined by people, but the way these terms are normally understood in the PostgreSQL community is that inlining means pulling the definition of a LANGUAGE sql function into the main query, while flattening means transforming a subquery or view into something other than a subquery, for example a join.
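To make that concrete, here is a rough sketch (table, column, and function names are invented for illustration; the "rewritten" forms are only what the planner might conceptually produce, not actual output):

-- Inlining: a LANGUAGE sql function pulled into the calling query
CREATE FUNCTION active_orders(cust int) RETURNS SETOF orders
LANGUAGE sql STABLE
AS $$ SELECT * FROM orders WHERE customer_id = cust AND status = 'active' $$;

SELECT * FROM active_orders(42);
-- may effectively be inlined to:
SELECT * FROM orders WHERE customer_id = 42 AND status = 'active';

-- Flattening: a subquery in FROM pulled up into the outer query
SELECT c.name, o.total
FROM customers c,
     (SELECT * FROM orders WHERE status = 'active') o
WHERE o.customer_id = c.id;
-- is typically flattened into a plain join:
SELECT c.name, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.status = 'active';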
Related
I've seen many examples of SQL with complex nested subqueries (and subsubqueries and subsubsubqueries and...). Is there ever any reason to write complex queries like this instead of using WITH and CTEs, just as one would use variables in a programming language to break up complex expressions?
In particular, is there a performance advantage?
Any query that you can write using only subqueries in the FROM clause and regular joins can use CTEs with direct substitution.
Subqueries are needed for:
Correlated subqueries (which are generally not in the FROM clause).
Lateral joins (in databases that support LATERAL or APPLY keywords in the FROM clause).
Scalar subqueries.
Sometimes, a query could be rewritten to avoid these constructs.
Subqueries in the FROM clause (except for lateral joins) can be written using CTEs.
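As a minimal sketch of that direct substitution (table and column names are made up):

-- Subquery in the FROM clause
SELECT d.dept, d.avg_salary
FROM (SELECT dept, AVG(salary) AS avg_salary
      FROM employees
      GROUP BY dept) AS d
WHERE d.avg_salary > 50000;

-- The same query rewritten with a CTE
WITH d AS (
    SELECT dept, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept
)
SELECT dept, avg_salary
FROM d
WHERE avg_salary > 50000;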
Why are subqueries used and not CTEs? The most important reason is that CTEs are a later addition to the SQL language. With the exception of recursive CTEs, they are not really needed. They are really handy when a subquery is being referenced more than one time, but one could argue that a view serves the same purpose.
As mentioned in the comments, CTEs and subqueries might be optimized differently. This could be a performance gain or loss, depending on the query and the underlying indexes and so on.
Unless your query plan tells you that the subquery performs better than the CTE, I would use a CTE instead of a subquery.
In particular, is there a performance advantage?
For the subquery vs. simple (non-recursive) CTE versions, they are probably very similar. You would have to use the profiler and the actual execution plan to spot any differences.
There are some reasons I would use a CTE:
In general, a CTE can be used recursively but a subquery cannot, which can help you build a calendar table and is especially well suited to tree structures (see the sketch after this list).
CTEs are easier to maintain and read (as @Lukasz Szozda comments), because you can break complex queries up into several CTEs and give them good names, which makes writing the main query much more comfortable.
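A minimal sketch of the recursive case, here building a small calendar table (SQL Server-style syntax; names and dates are illustrative):

WITH calendar AS (
    SELECT CAST('2023-01-01' AS date) AS dt
    UNION ALL
    SELECT DATEADD(day, 1, dt)
    FROM calendar
    WHERE dt < '2023-01-31'
)
SELECT dt
FROM calendar;
-- A plain subquery cannot reference itself, so it cannot express this.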
Without performance considerations:
CTEs are more readable as SQL code, meaning easier to maintain and debug.
Subqueries (in the FROM clause) are fine as long as they are few, small, and simple; in that case converting them to CTEs would actually make the query harder to read.
There is also the option of views, which mostly prevents SQL code duplication.
With performance considerations:
CTEs may fall apart performance-wise the more complex they become. If so, they become too risky to trust with tweaks and changes, and may lead to a more aggressive approach such as converting all CTEs to temp tables (#).
Subqueries behave about as well as views and a little better than CTEs in most cases. Still, becoming too complex may hinder performance and make optimization difficult. Eventually someone may need to tweak them, or even extract the heavier ones out to temp tables to lighten the main SELECT.
Views hold up slightly better as complexity grows, as long as they are composed of plain tables and simple views, their SQL is clean, and filters are pushed into the views' joins wherever possible. Still, joining two complex views will get you to the same situation as complex CTEs or subqueries.
I work with MS Access 2010 on a daily basis and I want to know if there are alternatives to a Common Table Expression as it is used in SQL Server, and how they affect performance.
For example, is it better to create a subquery, or is it better to call a query from another query, which is essentially very similar?
Example:
SELECT A.field1, A.Date
FROM (SELECT * FROM TableB AS B WHERE B.Date = Restriction) AS A
or
SELECT A.field1, A.Date
FROM SavedQueryB AS A
SavedQueryB:
SELECT * FROM TableB AS B WHERE B.Date = Restriction
I feel having multiple queries makes it easier to debug and manage, but does it affect performance when the data-set is very large?
Also, I've seen some videos about implementing the queries through VBA; however, I'm not very comfortable doing it that way yet.
Essentially, what is more efficient or better practice? Any suggestions or recommendations on better practices?
I am mostly self-taught through videos and books, with some programming background (VB.NET).
For the simplest queries, such as those in your question, I doubt you would see a significant performance difference between a subquery and a "stacked query" (one which uses another saved query as its data source) approach. And perhaps the db engine would even use the same query plan for both. (If you're interested, you can use SHOWPLAN to examine the query plans.)
The primary performance driver for those two examples will be whether the db engine can use indexed retrieval to fetch the rows which satisfy the WHERE restriction. If TableB.Date is not indexed, the query will require a full table scan. That would suck badly with a very large dataset, and the performance impact from a full scan should far overshadow any difference between a subquery and a stacked query.
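If that column is not indexed yet, adding an index is the obvious first step; a one-line sketch in Access DDL (using the question's table name, otherwise a placeholder):

CREATE INDEX idxDate ON TableB ([Date]);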
The situation could be different with a complex subquery as Allen Browne explains:
Complex subqueries on tables with many records can be slow to run.
Among the potential fixes for subquery performance problems, he suggests ...
Use stacked queries instead of subqueries. Create a separate saved query for JET to execute first, and use it as an input "table" for your main query. This pre-processing is usually (but not always) faster than a subquery. Likewise, try performing aggregation in one query, and then create another query that operates on the aggregated results. This post-processing can be orders of magnitude faster than a query that tries to do everything in a single query with subqueries.
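A small sketch of that pre-aggregation idea in Access SQL (the query and table names here are invented):

Saved query qryOrderTotals (aggregate first):
SELECT CustomerID, SUM(Amount) AS TotalAmount
FROM Orders
GROUP BY CustomerID;

Main query (operates on the aggregated results):
SELECT c.CustomerName, t.TotalAmount
FROM Customers AS c
INNER JOIN qryOrderTotals AS t ON t.CustomerID = c.CustomerID;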
I think your best answer will come from testing more complex real world examples. The queries in your question are so simple that conclusions drawn from them will likely not apply to those real world queries.
This is really context-dependent, as a host of factors decide the most efficient outcome, including data types, join tables, indexes, and more. In essence, for simple queries like the posted SELECT statements, the two are equivalent, but the query optimizer of Jet/ACE (the underlying engine of MS Access) may still decide on a different plan according to the structural needs of the query. Possibly, calling an external query adds a step to the execution plan, but then again subqueries can be executed as self-contained tables and then linked to the main tables.
Recall SQL's general Order of Operations which differs from typed order as each step involves a virtual table (see SQL Server):
FROM clause --VT1
ON clause --VT2
WHERE clause --VT3
GROUP BY clause --VT4
HAVING clause --VT5
SELECT clause --VT6
ORDER BY clause --VT7
What can be said is that for stored query objects, MS Access analyzes and caches the optimized "best plan" version. This is often the argument for using stored queries over VBA string queries, as the latter are not optimized before execution. Furthermore, Access' query object is similar to other RDBMSs' view object (though Jet/ACE does have VIEW and PROCEDURE objects). A regular discussion in the SQL world involves your very question of efficiency and best practices: views vs. subqueries, and usually the answer comes back "it depends". So, experiment on a needs basis.
And here CTEs are considered "inline views", denoted by the WITH clause (not yet supported in Jet/ACE). SQL programmers may use CTEs for readability and maintainability, since you avoid referencing the same statement multiple times in the body of the statement. All in all, use what fits your coding rituals and project requirements, then adjust as needed.
Resources
MS Access 2007 Query Performance Tips - with note on subquery performance
Intermediate Microsoft Jet SQL - with note on subqueries (just learned about the ANY/SOME/ALL)
Microsoft Jet 3.5 Performance White Paper - with note on query plans
I cannot recall where, but there are discussions out there about this topic of nested or sub-queries. Basically they all suggest using saved queries and then referencing the saved query.
From personal experience, nested queries are extremely difficult to troubleshoot or modify later. Also, if they get too deep I have experienced a performance hit.
Allen Browne has several tips and tricks listed out here
The one place I use nested queries a lot is in the criteria of action queries. This way I do not have any joins and can avoid some of the "cannot perform this operation" issues.
Finally, about using query strings in VBA: I have found it much easier to build parameter queries and then, in VBA, set a variable to the QueryDef and add in the parameters, rather than build up a query string in VBA. So much easier to troubleshoot and modify later.
Just my two cents.
I have a hard time figuring out what is best, or if there is any difference at all. However, I have not found any material to help my understanding of this, so I will ask this question, if not for me, then for others who might end up in the same situation.
The question is whether to aggregate a sub-query before or after a join. In my specific situation the sub-query is rather slow due to fragmented data and a bad normalization procedure.
I have a main query that is highly complex and a sub-query that is built from 3 small queries combined using UNION (which removes duplicate records).
I only need a single value from this sub-query (for each line), so at some point I will end up summing this value (together with grouping the necessary control data so I can join).
What will have the greater impact?
To sum the sub-query before the join and then join with the aggregated version, or
to leave the data raw and sum the value together with the rest of the main query?
Remember there are thousands of records that will be summed for each line, and the data is not native but built, and therefore may reside in memory (that is just a guess from the query optimizer's perspective).
Usually I keep the GROUP BY inside the subquery (referred to as an "inline view" in Oracle lingo).
This way the query is much more simple and clear.
Also I believe the execution plan is more efficient, because the data set to be aggregated is smaller and the resulting set of join keys is also smaller.
This is not a definitive answer though. If the row source that you are joining to the inline view has few matching rows, you may find that an early join reduces the aggregation effort.
The right answer is: benchmark the queries for your particular data set.
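In concrete terms, the two alternatives look roughly like this (illustrative names, not the poster's actual schema):

-- 1) Aggregate inside the subquery, then join the pre-summed result
SELECT m.id, s.total_amount
FROM main_table m
JOIN (SELECT control_key, SUM(amount) AS total_amount
      FROM detail_rows
      GROUP BY control_key) s
  ON s.control_key = m.control_key;

-- 2) Join the raw rows, then aggregate in the main query
SELECT m.id, SUM(d.amount) AS total_amount
FROM main_table m
JOIN detail_rows d ON d.control_key = m.control_key
GROUP BY m.id;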
I think in such a general case there is no right or wrong way to do it. The performance of a query like the one you describe depends on many different factors:
what kind of join are you actually doing (and what algorithm is used in the background)
is the data to be joined small enough to fit into the memory of the machine joining it?
what query optimizations are you using, i.e. what DBMS (Oracle, MsSQL, MySQL, ...)
...
For your case I simply suggest benchmarking. I'm sorry if that does not seem like a satisfactory answer, but it is the way to go in many performance questions...
So set up a simple test using both your approaches and some test data, then pick whatever is faster.
I have a sql query that I will be reusing in multiple stored procedures. The query works against multiple tables and returns an integer value based on 2 variables passed to it.
Rather than repeating the query in different stored procedures I want to share it and have 2 options:
create a view to which I can join to based on the variables and get the integer value from it.
create a function again with criteria passed to it and return integer variable
I am leaning towards option 1, but would like opinions on which is better and common practice. Which would be better performance-wise, etc. (joining to a view or calling a function)?
EDIT: The RDBMS is SQL Server
If you will always be using the same parametrised predicate to filter the results, then I'd go for a parametrised inline table-valued function. In theory this is treated the same as a view, in that they both get expanded out by the optimiser; in practice it can avoid predicate-pushing issues. An example of such a case can be seen in the second part of this article.
As Andomar points out in the comments, most of the time the query optimiser does do a good job of pushing down the predicate to where it is needed, but I'm not aware of any circumstance in which using the inline TVF will perform worse, so this seems a rational default choice between the two (very similar) constructs.
The one advantage I can see for the View would be that it would allow you to select without a filter or with different filters so is more versatile.
Inline TVFs can also be used to replace scalar UDFs for efficiency gains as in this example.
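For reference, the two constructs might look something like this (the schema is invented for the sketch; SQL Server syntax):

-- A view: no parameters, so any filter has to be applied from outside
CREATE VIEW dbo.OrderTotals AS
SELECT CustomerID, ProductID, SUM(Quantity) AS TotalQty
FROM dbo.OrderLines
GROUP BY CustomerID, ProductID;
GO

-- A parametrised inline table-valued function: the filter is part of the definition
CREATE FUNCTION dbo.GetTotalQty (@CustomerID int, @ProductID int)
RETURNS TABLE
AS RETURN
(
    SELECT SUM(Quantity) AS TotalQty
    FROM dbo.OrderLines
    WHERE CustomerID = @CustomerID
      AND ProductID = @ProductID
);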
You cannot pass variables into a view, so your only option it seems is to use a function. There are two options for this:
a SCALAR function
a TABLE-VALUED function (inline or multi-statement)
If you were returning records, then you could apply a WHERE clause from outside a not-too-complex VIEW, which can get inlined into the query within the view; but since all you are returning is a single integer value, a view won't work.
An inline TVF can be expanded by the query optimizer to work together with the outer (calling) query, so it can be faster in most cases when compared to a SCALAR function.
However, the usages are different - a SCALAR function returns a single value immediately
select dbo.scalarme(col1, col2), other from ..
whereas an inline-TVF requires you to either subquery it or CROSS APPLY against another table
select (select value from dbo.tvf(col1, col2)), other from ..
-- or
select f.value, t.other
from tbl t
cross apply dbo.tvf(col1, col2) f -- or outer apply
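For completeness, the definitions behind those calls might look roughly like this (dbo.scalarme and dbo.tvf are the placeholder names from the answer; the bodies are purely illustrative):

-- Scalar UDF: returns a single integer per call
CREATE FUNCTION dbo.scalarme (@a int, @b int)
RETURNS int
AS
BEGIN
    RETURN (SELECT COUNT(*) FROM dbo.SomeTable WHERE col1 = @a AND col2 = @b);
END;
GO

-- Inline TVF: a single-statement, parameterised "view" returning one column
CREATE FUNCTION dbo.tvf (@a int, @b int)
RETURNS TABLE
AS RETURN
(
    SELECT COUNT(*) AS value
    FROM dbo.SomeTable
    WHERE col1 = @a AND col2 = @b
);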
I'm going to give you a half-answer because I cannot be sure about what is better in terms of performance, I'm sorry. But then other people have surely got good pieces of advice on that score, I'm certain.
I will stick to your 'common practice' part of question.
So, a scalar function would seem to me a natural solution in this case. Why? You only want a value, an integer value, to be returned - that is what scalar functions are for, isn't it?
But then, if I could see a probability that later I would need more than one value, I might then consider switching to a TVF. Then again, what if you have already implemented your scalar function and used it in many places of your application and now you need a row, a column or a table of values to be returned using basically the same logic?
In my view (no pun intended), a view could become something like the greatest common divisor for both scalar and table-valued functions. The functions would only need to apply the parameters.
Now you have said that you are only planning to choose which option to use. Yet, considering the above, I still think that views can be a good choice and prove useful when scaling your application, and you could actually use both views and functions (if only that didn't upset the performance too badly) just as I have described.
One advantage a TVF has over a view is that you can force whoever calls it to target a specific index.
I’ve just found out that the execution plan performance between the following two select statements are massively different:
select * from your_large_table
where LEFT(some_string_field, 4) = '2505'
select * from your_large_table
where some_string_field like '2505%'
The execution plans are 98% and 2% respectively. Bit of a difference in speed then. I was actually shocked when I saw it.
I've always done LEFT(xxx) = 'yyy' as it reads well.
I actually found this out by checking the LINQ generated SQL against my hand crafted SQL. I assumed the LIKE command would be slower, but is in fact much much faster.
My question is: why is LEFT() slower than LIKE '2505%'? They are after all identical?
Also, is there a CPU hit by using LEFT()?
More generally speaking, you should never wrap the column in a function on the left side of a comparison in a WHERE clause. If you do, SQL won't use an index; it has to evaluate the function for every row of the table. The goal is to make sure that your WHERE clause is "sargable".
Some other examples:
Bad: Select ... WHERE isNull(FullName,'') = 'Ed Jones'
Fixed: Select ... WHERE ((FullName = 'Ed Jones') OR (FullName IS NULL))
Bad: Select ... WHERE SUBSTRING(DealerName, 1, 4) = 'Ford'
Fixed: Select ... WHERE DealerName Like 'Ford%'
Bad: Select ... WHERE DateDiff(mm,OrderDate,GetDate()) >= 30
Fixed: Select ... WHERE OrderDate < DateAdd(mm,-30,GetDate())
Bad: Select ... WHERE Year(OrderDate) = 2003
Fixed: Select ... WHERE OrderDate >= '2003-1-1' AND OrderDate < '2004-1-1'
It looks like the expression LEFT(some_string_field, 4) is evaluated for every row of a full table scan, while the "like" expression will use the index.
Optimizing "like" to use an index if it is a front-anchored pattern is a much easier optimization than analyzing arbitrary expressions involving string functions.
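One way to picture it: a front-anchored LIKE can be satisfied by an index seek, conceptually as if it had been rewritten into a simple range (a sketch, not literal optimizer output):

SELECT * FROM your_large_table
WHERE some_string_field LIKE '2505%';

-- behaves roughly like the sargable range below, which an index seek can answer:
SELECT * FROM your_large_table
WHERE some_string_field >= '2505' AND some_string_field < '2506';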
There's a huge impact from using function calls in WHERE clauses, as SQL Server must calculate the result for each row. On the other hand, LIKE is a built-in language feature which is highly optimized.
If you use a function on a column with an index, then the db no longer uses the index (at least with Oracle, anyway).
So I am guessing that your example field 'some_string_field' has an index on it which doesn't get used for the query with 'LEFT'.
Why do you say they are identical? They might solve the same problem, but their approach is different. At least it seems like that...
The query using LEFT optimizes the test, since it already knows about the length of the prefix and so on, so in a C/C++/... program, or without an index, an algorithm using LEFT to implement a certain LIKE behavior would be the fastest. But in contrast to most non-declarative languages, on a SQL database a lot of optimizations are done for you. For example, LIKE is probably implemented by first looking for the % sign, and if it is noticed that the % is the last character in the string, the query can be optimized in much the same way as you did using LEFT, but directly using an index.
So, indeed I think you were right after all, they probably are identical in their approach. The only difference being that the db server can use an index in the query using LIKE because there is not a function transforming the column value to something unknown in the WHERE clause.
What happened here is either that the RDBMS is not capable of using an index on the LEFT() predicate and is capable of using it on the LIKE, or it simply made the wrong call in which would be the more appropriate access method.
Firstly, it may be true for some RDBMSs that applying a function to a column prevents an index-based access method from being used, but that is not a universal truth, nor is there any logical reason why it needs to be. An index-based access method (such as Oracle's full index scan or fast full index scan) might be beneficial but in some cases the RDBMS is not capable of the operation in the context of a function-based predicate.
Secondly, the optimiser may simply get the arithmetic wrong in estimating the benefits of the different available access methods. Assuming that the system can perform an index-based access method, it first has to estimate the number of rows that will match the predicate, either from statistics on the table, statistics on the column, by sampling the data at parse time, or by using a heuristic rule (eg. "assume 5% of rows will match"). Then it has to assess the relative costs of a full table scan and the available index-based methods. Sometimes it will get the arithmetic wrong, sometimes the statistics will be misleading or inaccurate, and sometimes the heuristic rules will not be appropriate for the data set.
The key point is to be aware of a number of issues:
What operations can your RDBMS support?
What would be the most appropriate operation in the case you are working with?
Is the system's choice correct?
What can be done to either allow the system to perform a more efficient operation (eg. add a missing not null constraint, update the statistics etc)?
In my experience this is not a trivial task, and is often best left to experts. Or on the other hand, just post the problem to Stackoverflow -- some of us find this stuff fascinating, dog help us.
As @BradC mentioned, you shouldn't use functions in a WHERE clause if you have indexes and want to take advantage of them.
If you read the section entitled "Use LIKE instead of LEFT() or SUBSTRING() in WHERE clauses when Indexes are present" from these SQL Performance Tips, there are more examples.
It also hints at questions you'll encounter on the MCSE SQL Server 2012 exams if you're interested in taking those too. :-)