SQL SELECT clause tuning - sql

Why does the sql query execute faster if I use the actual column names in the SELECT statement instead of SELECT *?

A noticeable difference at all seems odd... since I'd expect it to be a very minuscule difference and am intrigued to test it out.
Any difference might in a statement using Select * might be due to it taking extra time to find out what all of the column names are.

Because depending on the query it has to work out if there are unique names, what they all are, etc. Where as, if you specificy it, its all done for it.

Generally, the more you tell it, the less it has to calculate. This is the same for many systems.

Its possible the performance is way better when you select certain column names than Select *, one good reason, just check whether, you have used the columns which are already indexed, in this case, the optimizer will make a plan to select all the data only from index instead from actual table. But check the plan once for sure.

Related

Performance of ISNULL() in GROUP BY clause SQL

I have been refactoring some old queries recently and noticed that a lot of them repeat ISNULL() in the GROUP BY clause, where it is used in the SELECT clause. I feel in my bones that removing the ISNULL() in the GROUP BY clause will improve performance, but I can't find any documentation on whether it is actually likely to or not. Here's the sort of thing I mean:
SELECT
ISNULL(Foo,-1) AS Foo
,ISNULL(Bar,-1) AS Bar
,SUM(This) AS This
,SUM(That) AS That
FROM
dbo.ThisThatTable AS ThisThat
LEFT JOIN dbo.FooBarTable AS FooBar ON ThisThat.FooBarId = FooBar.Id
GROUP BY
ISNULL(Foo,-1)
,ISNULL(Bar,-1);
GO
The above is the way I keep coming across - When there is grouping on the Foo column, the SELECT and the GROUP BY for selected columns match exactly. The example below is a possible alternative - some possibly unnecessary ISNULL() calls have been removed, and the SELECT and GROUP BY clauses no longer match.
SELECT
ISNULL(Foo,-1) AS Foo
,ISNULL(Bar,-1) AS Bar
,SUM(This) AS This
,SUM(That) AS That
FROM
dbo.ThisThatTable AS ThisThat
LEFT JOIN dbo.FooBarTable AS FooBar ON ThisThat.FooBarId = FooBar.Id
GROUP BY
Foo
,Bar;
GO
I suppose maybe when the SELECT and GROUP BY clauses match, the optimiser only has to do the ISNULL() calculation once to know what is going on, so it might be theoretically more performative to group by the results that are actually selected? Alternatively, maybe it is better to avoid adding a second set of ISNULL() calls that don't change the granularity of the data at all... Maybe the optimiser is clever enough to realise that the NULLS in the grouping are (in this case) -1s in the selection...?
I personally would prefer removing any unnecessary functions, especially once that might affect index usage but when I look online, the references to performance are all like the answers here, about using ISNULL() in the WHERE clause, which I already know to avoid.
I also suspect that any gains are going to be vanishingly small, so this is really asking for an academic or theoretical answer, but as I work, I keep wondering and it bugs me, so I thought I would ask if anyone has any thoughts.
Non-aggregated columns in SELECT clauses generally must precisely match the ones in GROUP BY clauses. If I were you, and I were dealing with tested production code, I would not make the change you propose.
Edit the match between non-aggregated SELECT columns and GROUP BY columns is necessary for GROUP BY. If the columns in SELECT are 1:1 dependent on the columns in GROUP BY, it will work. Otherwise the results are ambiguous.
Internally, SQL does not really have two copies of each ISNULL. They are all flattened together in the internal tree used during compilation. So, this level of optimization is not useful to consider in SQL Server. A query without any ISNULL in it would probably perform a bit faster and potentially a lot faster depending on the rest of the schema and query. However, the ISNULL in the select list and the GROUP BY list are not executed twice in most cases within SQL - this level of detail can show up in showplan, but it's often below the level of detail most people would care to examine.
There are a few different aspects to consider here:
Referring to the same value multiple times in the same scope
In most situations, the optimizer is clever enough to collapse these into calculating them once. The fact that you have a GROUP BY over them makes this even more likely.
Is it faster to group when the value is guaranteed to not be null?
Possibly, although I doubt the difference is measurable.
The SELECT does not have to match exactly, it only needs to be functionally dependent on GROUP BY columns and aggregation functions. It may not be functionally dependent on any other columns.
The most important thing top consider: indexing.
This is much, much more important than the other considerations. When grouping, if you can hit an index then it will go much faster, because it can remove sorting and just use Stream Aggregate. This is not possible if you use ISNULL in the GROUP BY (barring computed columns or indexed views).
Note that your results will not be the same: the first example collapses the NULL group into the -1 group. The second example does not, so you may want to remove the ISNULL from the SELECT also, in order to differentiate them. Alternatively, put a WHERE ... IS NOT NULL instead.

Using temp table for sorting data in SQL Server

Recently, I came across a pattern (not sure, could be an anti-pattern) of sorting data in a SELECT query. The pattern is more of a verbose and non-declarative way for ordering data. The pattern is to dump relevant data from actual table into temporary table and then apply orderby on a field on the temporary table. I guess, the only reason why someone would do that is to improve the performance (which I doubt) and no other benefit.
For e.g. Let's say, there is a user table. The table might contain rows in millions. We want to retrieve all the users whose first name starts with 'G' and sorted by first name. The natural and more declarative way to implement a SQL query for this scenario is:
More natural and declarative way
SELECT * FROM Users
WHERE NAME LIKE 'G%'
ORDER BY Name
Verbose way
SELECT * INTO TempTable
FROM Users
WHERE NAME LIKE 'G%'
SELECT * FROM TempTable
ORDER BY Name
With that context, I have few questions:
Will there be any performance difference between two ways if there is no index on the first name field. If yes, which one would be better.
Will there be any performance difference between two ways if there is an index on the first name field. If yes, which one would be better.
Should not the SQL Server optimizer generate same execution plan for both the ways?
Is there any benefit in writing a verbose way from any other persective like locking/blocking?
Thanks in advance.
Reguzlarly: Anti pattern by people without an idea what they do.
SOMETIMES: ok, because SQL Server has a problem that is not resolvable otherwise - not seen that one in yeas, though.
It makes things slower because it forces the tmpddb table to be fully populated FIRST, while otherwise the query could POSSIBLY be resoled more efficiently.
last time I saw that was like 3 years ago. We got it 3 times as fast by not being smart and using a tempdb table ;)
Answers:
1: No, it still needs a table scan, obviously.
2: Possibly - depends on data amount, but an index seek by index would contain the data in order already (as the index is ordered by content).
3: no. Obviously. Query plan optimization is statement by statement. By cutting the execution in 2, the query optimizer CAN NOT merge the join into the first statement.
4: Only if you run into a query optimizer issue or a limitation of how many tables you can join - not in that degenerate case (degenerate in a technical meaning - i.e. very simplistic). BUt if you need to join MANY MANY tables it may be better to go with an interim step.
If the field you want to do an order by on is not indexed, you could put everything into a temp table and index it and then do the ordering and it might be faster. You would have to test to make sure.
There is never any benefit of the second approach that I can think of.
It means if the data is available pre-ordered SQL Server can't take advantage of this and adds an unnecessary blocking operator and additional sort to the plan.
In the case that the data is not available pre-ordered SQL Server will sort it in a work table either in memory or tempdb anyway and adding an explicit #temp table just adds an unnecessary additional step.
Edit
I suppose one case where the second approach could give an apparent benefit might be if the presence of the ORDER BY caused SQL Server to choose a different plan that turned out to be sub optimal. In which case I would resolve that in a different way by either improving statistics or by using hints/query rewrite to avoid the undesired plan.

using distinct command

using distinct command in SQL is good practice or not? is there any drawback of distinct command?
It depends entirely on what your use case is. DISTINCT is useful in certain circumstances, but it can be overused.
The drawbacks are mainly increased load on the query engine to perform the sort (since it needs to compare the resultset to itself to remove duplicates), and it can be used to mask an issue in your data - if you are getting duplicates there may be a problem with your source data.
The command itself isn't inherently good or bad. You can use a screwdriver to hammer a nail, but that doesn't mean it's a good idea, or that screwdrivers are bad in all cases.
If you need to use it regularly to get the correct output then you have a design or JOIN issue
It's perfectly valid for use otherwise.
It is a kind of aggregate though: the equivalent to a GROUP BY on all output columns. So it is an extra step is query processing
From this http://www.mindfiresolutions.com/Think-Before-Using-Distinct-Command-Arbitarily-1050.php
Sometimes it is seen if the beginners are getting some duplicates in their resultset then they are using DISTINCT. But this has its own disadvantages.
Distinct decreases the query's performance. Because the normal procedure is sorting the results and then removing rows that
are equal to the row immediately before it.
DISTINCT compares between all fields of the record. So DISTINCT increases computation .
It is part of the language, so should be used.
Is some circumstances using DISTINCT may cause a table scan where otherwise one would not occur.
You will need to test for each of your own use cases to see if there is an impact and find a workaround if the impact is unacceptable.
If you want the work to make sure the results are distinct to happen inside the SQL server on the SQL machine, then use it. If you don't mind sending extra results to the client and doing the work there (to reduce server load) then do that. It depends on your performance requirements and the characteristics of your database.
For example, if it's extremely unlikely that distinct will reduce the result set much, and you don't have the right columns indexed to make it fast, and you need to reduce SQL Server load, and you have spare cycles on the client, and it's easy to ensure distinctness on the client -- then you might want to do that.
That's a lot of ifs, ands, and mights. If you don't know -- just use it.

MIN/MAX vs ORDER BY and LIMIT

Out of the following queries, which method would you consider the better one? What are your reasons (code efficiency, better maintainability, less WTFery)...
SELECT MIN(`field`)
FROM `tbl`;
SELECT `field`
FROM `tbl`
ORDER BY `field`
LIMIT 1;
In the worst case, where you're looking at an unindexed field, using MIN() requires a single full pass of the table. Using SORT and LIMIT requires a filesort. If run against a large table, there would likely be a significant difference in percieved performance. As an anecdotal data point, MIN() took .36s while SORT and LIMIT took .84s against a 106,000 row table on my dev server.
If, however, you're looking at an indexed column, the difference is harder to notice (meaningless data point is 0.00s in both cases). Looking at the output of explain, however, it looks like MIN() is able to simply pluck the smallest value from the index ('Select tables optimized away' and 'NULL' rows) whereas the SORT and LIMIT still needs needs to do an ordered traversal of the index (106,000 rows). The actual performance impact is probably negligible.
It looks like MIN() is the way to go - it's faster in the worst case, indistinguishable in the best case, is standard SQL and most clearly expresses the value you're trying to get. The only case where it seems that using SORT and LIMIT would be desirable would be, as mson mentioned, where you're writing a general operation that finds the top or bottom N values from arbitrary columns and it's not worth writing out the special-case operation.
SELECT MIN(`field`)
FROM `tbl`;
Simply because it is ANSI compatible. Limit 1 is particular to MySql as TOP is to SQL Server.
As mson and Sean McSomething have pointed out, MIN is preferable.
One other reason where ORDER BY + LIMIT is useful is if you want to get the value of a different column than the MIN column.
Example:
SELECT some_other_field, field
FROM tbl
ORDER BY field
LIMIT 1
I think the answers depends on what you are doing.
If you have a 1 off query and the intent is as simple as you specified, select min(field) is preferable.
However, it is common to have these types of requirements change into - grab top n results, grab nth - mth results, etc.
I don't think it's too terrible an idea to commit to your chosen database. Changing dbs should not be made lightly and have to revise is the price you pay when you make this move.
Why limit yourself now, for pain you may or may not feel later on?
I do think it's good to stay ANSI as much as possible, but that's just a guideline...
Given acceptable performance I would use the first one because it is semantically closer to the intent.
If the performance was an issue, (Most modern optimizers will probalbly optimize both to the same query plan, although you have to test to verify that) then of course I would use the faster one.
user650654 said that ORDER BY with LIMIT 1 useful when one need "to get the value of a different column than the MIN column". I think, in this case we still have better performance with two single passes using MIN instead of sorting (hoping this is optimized :()
SELECT some_other_field, field
FROM tbl
WHERE field=(SELECT MIN(field) FROM tbl)

Does the way you write sql queries affect performance?

say i have a table
Id int
Region int
Name nvarchar
select * from table1 where region = 1 and name = 'test'
select * from table1 where name = 'test' and region = 1
will there be a difference in performance?
assume no indexes
is it the same with LINQ?
Because your qualifiers are, in essence, actually the same (it doesn't matter what order the where clauses are put in), then no, there's no difference between those.
As for LINQ, you will need to know what query LINQ to SQL actually emits (you can use a SQL Profiler to find out). Sometimes the query will be the simplest query you can think of, sometimes it will be a convoluted variety of such without you realizing it, because of things like dependencies on FKs or other such constraints. LINQ also wouldn't use an * for select.
The only real way to know is to find out the SQL Server Query Execution plan of both queries. To read more on the topic, go here:
SQL Server Query Execution Plan Analysis
Should it? No. SQL is a relational algebra and the DBMS should optimize irrespective of order within the statement.
Does it? Possibly. Some DBMS' may store data in a certain order (e.g., maintain a key of some sort) despite what they've been told. But, and here's the crux: you cannot rely on it.
You may need to switch DBMS' at some point in the future. Even a later version of the same DBMS may change its behavior. The only thing you should be relying on is what's in the SQL standard.
Regarding the query given: with no indexes or primary key on the two fields in question, you should assume that you'll need a full table scan for both cases. Hence they should run at the same speed.
I don't recommend the *, because the engine should look for the table scheme before executing the query. Instead use the table fields you want to avoid unnecessary overhead.
And yes, the engine optimizes your queries, but help him :) with that.
Best Regards!
For simple queries, likely there is little or no difference, but yes indeed the way you write a query can have a huge impact on performance.
In SQL Server (performance issues are very database specific), a correlated subquery will usually have poor performance compared to doing the same thing in a join to a derived table.
Other things in a query that can affect performance include using SARGable1 where clauses instead of non-SARGable ones, selecting only the fields you need and never using select * (especially not when doing a join as at least one field is repeated), using a set-bases query instead of a cursor, avoiding using a wildcard as the first character in a a like clause and on and on. There are very large books that devote chapters to more efficient ways to write queries.
1 "SARGable", for those that don't know, are stage 1 predicates in DB2 parlance (and possibly other DBMS'). Stage 1 predicates are more efficient since they're parts of indexes and DB2 uses those first.