I need to write a query to fetch some data from a table, and I need to use the concat function. I ended up preparing two different queries, and I'm not sure which one will perform better when we have huge amounts of data.
Please help me understand which query will perform better, and why.
I need to concat twice, once for display and once for the where condition:
select
id, concat(boolean_value, long_value, string_value, double_value) as Value
from
table
where
concat(boolean_value, long_value, string_value, double_value) = 'XX'
Or write the above query as a subquery and add the where condition:
Select *
from
(select
id, concat(boolean_value, long_value, string_value, double_value) as Value
from
table) as output
where
Value = 'XX'
Note: this is a sample query; the actual query will have multiple joins and will need to concat multiple columns from different tables.
SQL queries describe the result set being produced, not how it is computed. You seem to be asking whether there is common subexpression elimination -- that is, whether the concat() will be performed exactly once.
You have a better chance of this with the subquery than with the version that repeats the expression. But SQL Server could be smart enough to evaluate it only once.
If you want to guarantee single evaluation, then I think cross apply does that:
select t.id, v.value
from table t cross apply
(values (concat(boolean_value, long_value, string_value, double_value))
) as v(value)
where v.value = 'XX';
I would, however, question why you are comparing the result of the concat() rather than the base columns. Comparing the base columns would allow the optimizer to take advantage of indexes, partitions, and statistics.
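The point about indexes can be sketched with a small experiment. This is only an illustration under stated assumptions: it uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in for SQL Server, the table `t` and index `idx_t_values` are hypothetical, and `||` is SQLite's concatenation operator in place of concat(). SQL Server's plans will look different, but the same principle applies.

```python
import sqlite3

def query_plans():
    """Return SQLite's plan text for a concat predicate and a base-column predicate."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE t (id INTEGER PRIMARY KEY, string_value TEXT, long_value INTEGER);
        CREATE INDEX idx_t_values ON t (string_value, long_value);
    """)
    # Predicate on the concatenated expression: no index lookup is possible,
    # so every row has to be visited.
    concat_sql = "SELECT id FROM t WHERE string_value || long_value = 'X1'"
    # Predicate on the base columns: the index can be used for a direct lookup.
    base_sql = "SELECT id FROM t WHERE string_value = 'X' AND long_value = 1"
    plans = {}
    for name, sql in (("concat", concat_sql), ("base", base_sql)):
        rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
        # The last column of each plan row holds the human-readable detail.
        plans[name] = " ".join(row[-1] for row in rows)
    conn.close()
    return plans
```

Calling `query_plans()` should report a SCAN for the concat predicate and a SEARCH using `idx_t_values` for the base-column predicate.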
You can compare the execution plans of both queries to see what SQL Server actually does (SQL Server has no EXPLAIN statement; use SET SHOWPLAN_ALL ON or the estimated execution plan in Management Studio). I would expect the first version to be preferable because, unlike the second version, it does not force SQL Server to materialize an intermediate table. As for whether SQL Server actually has to evaluate CONCAT twice in the first version, that may not even matter if the cost of the subquery is very high.
While watching Troy Hunt's fantastic course on SQLi, I've noticed that he ends up using this strategy to see if a table has a specific column:
select * from TableA order by (select top 1 some_column from TableB) desc
This expression will get executed by SQL Server, but what will it do for the order by clause? I've seen expressions being used with order by before (case when then else end), but I'm really curious to understand how SQL can process the previous query without any errors...
EDIT: Giving more info because it seems like my initial post was not clear enough:
I know this is not the best strategy for getting table or column names through SQLi (that's not what I'm asking)
I'm not interested in knowing how to protect against this (I know how to do that already)
I know that sorting by a constant value doesn't make sense (though it allows you to run these types of "boolean queries")
What I really want to know is why it works.
So, going back to the docs, the order by clause expects an order_by_expression, which is described as:
Specifies a column or expression on which to sort the query result set. A sort column can be specified as a name or column alias, or a nonnegative integer representing the position of the column in the select list.
According to the docs, an expression is:
Is a combination of symbols and operators that the SQL Server Database Engine evaluates to obtain a single data value. Simple expressions can be a single constant, variable, column, or scalar function. Operators can be used to join two or more simple expressions into a complex expression.
As @SMor demonstrated, the query does run if you replace the order by select expression with a simple select 'A':
select * from TableA order by (select 'A') desc
But this does not work:
select * from TableA order by 'A' desc
So, the question is: why is select 'A' accepted by SQL Server in the order by clause? Doesn't it produce a constant too? Since a constant is an expression, and taking into account the definition of the order by clause, shouldn't it throw an error in both cases?
Thanks.
The use of (select top 1 some_column from TableB) is an example of a scalar subquery. This is a subquery that returns exactly one column and at most one row. It can be used anywhere a literal value can be used -- and perhaps in some other places as well. Apparently, it can be used in an order by, even though SQL Server does not allow a literal value for order by.
The most common type of scalar subquery is a correlated subquery, which has a where clause that connects the subquery to the outer query. This is not an example of a correlated subquery.
In fact, this is not an example of anything useful as far as I can tell. It has one major shortcoming, which is the use of top without order by. The value returned by the subquery is indeterminate. That seems like a bad practice, and particularly bad if you are trying to teach people SQL.
And, it is probably going to be evaluated once. So the subquery would return a constant value and would not contribute much to a meaningful ordering.
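A minimal sketch of this behaviour, assuming SQLite (through Python's sqlite3 module) as a stand-in for SQL Server: SQLite uses LIMIT rather than TOP, and the tables, data, and the `column_name` parameter are hypothetical. It shows that the scalar subquery is just an expression that yields one value, so every row gets the same sort key, and that the statement fails to compile when the subquery names a column that does not exist.

```python
import sqlite3

def order_by_scalar_subquery(column_name):
    """Run ORDER BY (scalar subquery) against hypothetical tables TableA and TableB."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE TableA (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE TableB (id INTEGER PRIMARY KEY, some_column TEXT);
        INSERT INTO TableA (name) VALUES ('x'), ('y'), ('z');
        INSERT INTO TableB (some_column) VALUES ('b1'), ('b2');
    """)
    # The identifier is interpolated only to keep the demo short; it is not
    # something to do with untrusted input.
    sql = ("SELECT name FROM TableA "
           f"ORDER BY (SELECT {column_name} FROM TableB LIMIT 1) DESC")
    try:
        # The sort key is the same constant for every row, so all rows come
        # back; they are sorted here in Python only to make the result stable.
        return sorted(row[0] for row in conn.execute(sql))
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"
    finally:
        conn.close()
```

With `"some_column"` the query runs and returns all three rows; with a name such as `"missing_column"` the database rejects the statement, which is the signal the original query relies on.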
COALESCE in Postgres is a function that returns the first non-null argument.
So I used coalesce in subqueries like:
SELECT COALESCE (
( SELECT * FROM users WHERE... ORDER BY ...),
( SELECT * FROM users WHERE... ORDER BY ...),
( SELECT * FROM users WHERE... ORDER BY ...),
( SELECT * FROM users WHERE... ORDER BY ...)
);
I change the WHERE clause in each query; they contain lots of params and CASE expressions, and also different ORDER BY clauses.
This is because I always want to return something, but with priorities.
What I noticed while running EXPLAIN ANALYZE is that every query is executed even though the first one actually returns a non-null value.
I would expect the engine to run only the first query and skip the following ones if it returns a non-null value.
This could lead to bad performance.
So is this bad practice, and is it better to run the queries separately for performance reasons?
EDIT:
Sorry, you were right: I don't select *, I select only one column. I didn't post my code because I'm not interested in my particular query; it's a generic question about how the engine works. So I reproduced it in a very simple fiddle here: http://sqlfiddle.com/#!17/a8aa7/4
I may be wrong, but I think it behaves as I described: it runs all the subqueries even though the first one already returns a non-null value.
EDIT 2: OK, I only now noticed that it says "never executed". So the other two queries aren't getting executed. What confused me was the fact that they were included in the query plan.
Anyway, it's still important for my question. Is it better to run all the queries separately for performance reasons? It seems that even if the first one returns a non-null value, the other two subqueries can slow down performance.
For separate SELECT queries, I suggest using UNION ALL / LIMIT 1 instead. Based on your fiddle:
(select user_id from users order by age limit 1) -- misleading example, see below
UNION ALL
(select user_id from users where user_id=1)
UNION ALL
(select user_id from users order by user_id DESC limit 1)
LIMIT 1;
db<>fiddle here
For three reasons:
Works for any SELECT list: single expressions (your fiddle), multiple or whole row (your example in the question).
You can distinguish actual NULL values from "no row". Since user_id is the PK in the example (and hence, NOT NULL), the problem cannot surface in the example. But with an expression that can be NULL, COALESCE cannot distinguish between the two: "no row" is coerced to NULL for the purpose of the query. See:
Return a value if no record is found
Faster.
Aside, your first SELECT in the example makes this a wild-goose chase. It returns a row if there is at least one. The rest is noise in this case.
Related:
PostgreSQL combine multiple select statements
SQL - does order of OR conditions matter?
Way to try multiple SELECTs till a result is available?
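The difference between NULL and "no row" can be sketched as follows. This is an illustration only, run on SQLite through Python's sqlite3 module rather than Postgres; the `nickname` column and the data are hypothetical. Like the UNION ALL / LIMIT 1 query above, it relies on the branches being evaluated in the order written, which this SQLite example does without an ORDER BY.

```python
import sqlite3

def first_match_demo():
    """Compare COALESCE over scalar subqueries with UNION ALL / LIMIT 1."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, nickname TEXT);
        INSERT INTO users VALUES (1, NULL), (2, 'bob');
    """)
    # COALESCE: user 1 exists, but its NULL nickname looks the same as
    # "no row", so the value from the second subquery is returned.
    coalesce_row = conn.execute("""
        SELECT COALESCE(
            (SELECT nickname FROM users WHERE user_id = 1),
            (SELECT nickname FROM users WHERE user_id = 2)
        )
    """).fetchone()
    # UNION ALL / LIMIT 1: the first branch that returns a row wins,
    # even when the value in that row is NULL.
    union_row = conn.execute("""
        SELECT nickname FROM users WHERE user_id = 1
        UNION ALL
        SELECT nickname FROM users WHERE user_id = 2
        LIMIT 1
    """).fetchone()
    conn.close()
    return coalesce_row[0], union_row[0]
```

Here `first_match_demo()` returns `('bob', None)`: COALESCE skips past user 1's NULL, while the UNION ALL form returns user 1's row as found.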
Which should be the preferred and efficient way?
where @TeamId in (Team1Id, Team2Id)
or
where @TeamId=Team1Id or @TeamId=Team2Id
I am using SQL Server 2008.
Edit
When I checked the execution plans, both queries showed that they use indexes and the same execution plan.
Both are the same.
SQL Server converts this
where @TeamId in (Team1Id, Team2Id)
into
where @TeamId=Team1Id or @TeamId=Team2Id
It's better to write IN rather than OR: it is more readable and easier to write.
For the specific example you provide, of testing a variable, IN is simply syntactic sugar for multiple ORs.
However, in the related case of selecting rows of a relation, the use of a join to another relation is superior, particularly if the data field being compared is indexed or the list of comparison values grows. Such a comparison relation is easily created using a static sub-query like this:
select *
from data
join (
select Team1Id as TeamId union all
select Team2Id
) comparison on comparison.TeamId = data.TeamId
This technique of a static sub-query is widely applicable to many circumstances.
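The claim that IN over a list of columns is equivalent to a chain of ORs can be checked with a small sketch. It assumes SQLite through Python's sqlite3 module instead of SQL Server 2008; the `matches` table and its rows are hypothetical, and the named parameter `:team` plays the role of `@TeamId`.

```python
import sqlite3

def in_vs_or(team_id):
    """Return the rows matched by the IN form and by the OR form."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE matches (
            match_id INTEGER PRIMARY KEY,
            Team1Id INTEGER,
            Team2Id INTEGER
        );
        INSERT INTO matches (Team1Id, Team2Id)
        VALUES (1, 2), (2, 3), (3, 4), (4, 1);
    """)
    params = {"team": team_id}
    in_rows = conn.execute(
        "SELECT match_id FROM matches "
        "WHERE :team IN (Team1Id, Team2Id) ORDER BY match_id", params
    ).fetchall()
    or_rows = conn.execute(
        "SELECT match_id FROM matches "
        "WHERE :team = Team1Id OR :team = Team2Id ORDER BY match_id", params
    ).fetchall()
    conn.close()
    return in_rows, or_rows
```

For `team_id = 1` both forms return matches 1 and 4. This only confirms that the results agree; whether the plans are identical has to be checked in the target database, as the question's edit did.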
Hello, I took a SQL test and am dubious/curious about one question:
In which sequence are queries and sub-queries executed by the SQL engine?
The answers were:
1. primary query -> sub query -> sub sub query and so on
2. sub sub query -> sub query -> prime query
3. the whole query is interpreted at one time
4. There is no fixed sequence of interpretation, the query parser takes a decision on the fly
I chose the last answer (just supposing that it is the most reliable compared with the others).
Now the curiosity:
Where can I read about this, and, briefly, what is the mechanism behind all of that?
Thank you.
I think answer 4 is correct. There are a few considerations:
Type of subquery: is it correlated or not? Consider:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
)
Here, the subquery is not correlated to the outer query. If the number of values in t2.id is small in comparison to t1.id, it is probably most efficient to first execute the subquery, and keep the result in memory, and then scan t1 or an index on t1.id, matching against the cached values.
But if the query is:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
WHERE t2.type = t1.type
)
here the subquery is correlated - there is no way to compute the subquery unless t1.type is known. Since the value for t1.type may vary for each row of the outer query, this subquery could be executed once for each row of the outer query.
Then again, the RDBMS may be really smart and realize there are only a few possible values for t2.type. In that case, it may still use the approach used for the uncorrelated subquery if it can guess that the cost of executing the subquery once will be cheaper than doing it for each row.
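The two kinds of subquery can be compared on a tiny data set. This is a sketch assuming SQLite through Python's sqlite3 module; the rows in t1 and t2 are made up for illustration. It shows only that the two queries mean different things, not how any particular optimizer chooses to execute them.

```python
import sqlite3

def subquery_demo():
    """Return ids matched by the uncorrelated and the correlated IN subquery."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE t1 (id INTEGER, type TEXT);
        CREATE TABLE t2 (id INTEGER, type TEXT);
        INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, 'a');
        INSERT INTO t2 VALUES (1, 'a'), (2, 'a'), (3, 'b');
    """)
    # Uncorrelated: the inner SELECT does not depend on the outer row,
    # so its result could be computed once and reused.
    uncorrelated = [row[0] for row in conn.execute("""
        SELECT id FROM t1
        WHERE id IN (SELECT id FROM t2)
        ORDER BY id
    """)]
    # Correlated: the inner SELECT refers to t1.type, so its result
    # depends on which outer row is being tested.
    correlated = [row[0] for row in conn.execute("""
        SELECT id FROM t1
        WHERE id IN (SELECT id FROM t2 WHERE t2.type = t1.type)
        ORDER BY id
    """)]
    conn.close()
    return uncorrelated, correlated
```

With this data the uncorrelated query returns ids 1, 2 and 3, while the correlated one returns only id 1, because ids 2 and 3 exist in t2 under a different type.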
Option 4 is close.
SQL is declarative: you tell the query optimiser what you want and it works out the best (subject to time/"cost" etc) way of doing it. This may vary for outwardly identical queries and tables depending on statistics, data distribution, row counts, parallelism and god knows what else.
This means there is no fixed order. But it's not quite "on the fly".
Even with identical servers, schema, queries, and data, I've seen execution plans differ.
The SQL engine tries to optimise the order in which (sub)queries are executed. The part deciding about that is called a query optimizer. The query optimizer knows how many rows are in each table, which tables have indexes and on what fields. It uses that information to decide what part to execute first.
If you want something to read up on these topics, get a copy of Inside SQL Server 2008: T-SQL Querying. It has two dedicated chapters on how queries are processed logically and physically in SQL Server.
It usually depends on your DBMS, but ... I think the second answer is more plausible.
The primary query usually can't be calculated without the subquery results.
If I have a few UNION Statements as a contrived example:
SELECT * FROM xxx WHERE z = 1
UNION
SELECT * FROM xxx WHERE z = 2
UNION
SELECT * FROM xxx WHERE z = 3
What is the default order by behaviour?
The test data I'm seeing is essentially not returned in the order specified above. I.e. the data is ordered, but I want to know what the rules of precedence are.
Another thing is that in this case xxx is a View. The view joins 3 different tables together to return the results I want.
There is no default order.
Without an Order By clause the order returned is undefined. That means SQL Server can bring them back in any order it likes.
EDIT:
Based on what I have seen, without an Order By, the order that the results come back in depends on the query plan. So if there is an index that it is using, the result may come back in that order but again there is no guarantee.
In regards to adding an ORDER BY clause:
This is probably elementary to most here, but I thought I'd add this.
Sometimes you don't want the results mixed, so you want the first query's results then the second and so on. To do that I just add a dummy first column and order by that. Because of possible issues with forgetting to alias a column in unions, I usually use ordinals in the order by clause, not column names.
For example:
SELECT 1, * FROM xxx WHERE z = 'abc'
UNION ALL
SELECT 2, * FROM xxx WHERE z = 'def'
UNION ALL
SELECT 3, * FROM xxx WHERE z = 'ghi'
ORDER BY 1
The dummy ordinal column is also useful when I'm going to run two queries and I know only one is going to return any results. Then I can just check the ordinal of the returned results. This saves me from having to do multiple database calls and most empty result set checking.
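The ordinal-column trick can be tried end to end with a short sketch. It assumes SQLite through Python's sqlite3 module; the `payload` column, the `branch` alias, and the sample rows are hypothetical additions for illustration.

```python
import sqlite3

def ordered_union():
    """Return UNION ALL rows ordered by a dummy first column."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE xxx (z TEXT, payload TEXT);
        INSERT INTO xxx VALUES
            ('ghi', 'g1'), ('abc', 'a1'), ('def', 'd1'), ('abc', 'a2');
    """)
    # The constant in the first column records which branch produced the row;
    # ORDER BY 1 sorts on that column by position, so no alias is needed.
    rows = conn.execute("""
        SELECT 1 AS branch, * FROM xxx WHERE z = 'abc'
        UNION ALL
        SELECT 2, * FROM xxx WHERE z = 'def'
        UNION ALL
        SELECT 3, * FROM xxx WHERE z = 'ghi'
        ORDER BY 1
    """).fetchall()
    conn.close()
    return rows
```

The first element of each returned tuple is the branch number, so the 'abc' rows come first, then 'def', then 'ghi'. The order of rows within one branch is still unspecified unless more ORDER BY terms are added.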
Just found the actual answer.
Because UNION removes duplicates it does a DISTINCT SORT. This is done before all the UNION statements are concatenated (check out the execution plan).
To stop a sort, do a UNION ALL and this will also not remove duplicates.
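The duplicate handling of UNION versus UNION ALL can be seen in a small sketch, assuming SQLite through Python's sqlite3 module with a hypothetical `name` column and sample rows. An explicit ORDER BY is included so that the output order does not depend on how the engine removes duplicates.

```python
import sqlite3

def union_vs_union_all():
    """Return the rows produced by UNION and by UNION ALL for the same branches."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE xxx (z INTEGER, name TEXT);
        INSERT INTO xxx VALUES (2, 'b'), (1, 'a'), (1, 'a'), (3, 'c');
    """)
    template = (
        "SELECT z, name FROM xxx WHERE z = 1 {op} "
        "SELECT z, name FROM xxx WHERE z = 2 {op} "
        "SELECT z, name FROM xxx WHERE z = 3 "
        "ORDER BY z"
    )
    # UNION removes duplicate rows; UNION ALL keeps every row.
    union_rows = conn.execute(template.format(op="UNION")).fetchall()
    union_all_rows = conn.execute(template.format(op="UNION ALL")).fetchall()
    conn.close()
    return union_rows, union_all_rows
```

With this data UNION returns three rows and UNION ALL returns four, because the duplicated (1, 'a') row is kept only by UNION ALL.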
If you care what order the records are returned, you MUST use an order by.
If you leave it out, it may appear organized (based on the indexes chosen by the query plan), but the results you see today may NOT be the results you expect, and it could even change when the same query is run tomorrow.
Edit: Some good, specific examples: (all examples are MS SQL server)
Dave Pinal's blog describes how two very similar queries can show a different apparent order, because different indexes are used:
SELECT ContactID FROM Person.Contact
SELECT * FROM Person.Contact
Conor Cunningham shows how the apparent order can change when the table gets larger (if the query optimizer decides to use a parallel execution plan).
Hugo Kornelis proves that the apparent order is not always based on primary key. Here is his follow-up post with explanation.
A UNION can be deceptive with respect to result set ordering because a database will sometimes use a sort method to provide the DISTINCT that is implicit in UNION, which makes it look like the rows are deliberately ordered -- this doesn't apply to UNION ALL, for which there is no implicit distinct, of course.
However there are algorithms for the implicit distinct, such as Oracle's hash method in 10g+, for which no ordering will be applied.
As DJ says, always use an ORDER BY
It's very common to come across poorly written code that assumes table data is returned in insert order. 95% of the time the coder gets away with it and is never aware that this is a problem, because on many common databases (MSSQL, Oracle, MySQL) it often happens to work. It is of course a complete fallacy and should always be corrected when you come across it. Always, without exception, use an ORDER BY clause yourself.