All developers know that IN and DISTINCT create issues in SQL queries. My colleague wrote the query below, but he no longer works at my company. Please take a look at the code. How can I tune this query for better performance?
SELECT xxx
     , COUNT(DISTINCT Id) AS [Count]
FROM Test WITH (NOLOCK)
WHERE IsDeleted = 0
  AND xxx IN
  (
      SELECT CAST(value AS INT)
      FROM STRING_SPLIT(@ProductIds, ',')
  )
GROUP BY xxx
All developers know that IN and DISTINCT create issues in SQL queries.
This is not necessarily true. They can hurt performance, but sometimes they are necessary.
The IN is probably not a big deal. It gets evaluated once. If you have another way to pass in a list -- say using a temporary table -- that is better.
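For example, here is a minimal sketch of the temp-table alternative, reusing the table and column names from the question (how the caller fills #ProductIds is up to you):

CREATE TABLE #ProductIds (Id int PRIMARY KEY);
INSERT INTO #ProductIds (Id) VALUES (1), (2), (3);  -- filled by the caller

SELECT t.xxx
     , COUNT(DISTINCT t.Id) AS [Count]
FROM Test t
INNER JOIN #ProductIds p ON t.xxx = p.Id
WHERE t.IsDeleted = 0
GROUP BY t.xxx;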
The COUNT(DISTINCT id) is suspicious. I would expect id to already be unique. If so, then just use COUNT(*).
The WITH (NOLOCK) is not recommended unless you really know what you are doing. Working with data that might be inconsistent is dangerous.
I have used Sentry One Plan Explorer to help find the tuning points of queries I am having performance issues with:
https://www.sentryone.com/plan-explorer
First you need to decide what good performance is in your environment, then find the worst parts of the query and optimize those first.
Last, consider how you are storing your data, and look for places where it makes sense to add an index if needed.
Better yet, create an index on the xxx column.
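A minimal sketch of such an index, assuming the Test table from the question (the index name and the INCLUDE column are illustrative):

-- Supports the IsDeleted filter and the GROUP BY on xxx;
-- INCLUDE (Id) lets the COUNT(DISTINCT Id) be answered from the index.
CREATE NONCLUSTERED INDEX IX_Test_IsDeleted_xxx
    ON dbo.Test (IsDeleted, xxx)
    INCLUDE (Id);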
I've heard several times that you shouldn't perform COUNT(*) or SELECT * for performance reasons, but I wasn't able to dig up further information about it.
I can imagine that the database then uses all columns for the action, which could be a significant performance loss, but I'm not sure about that. Does somebody have further information about the topic?
1. On count(*) vs. count(something else)
SQL is declarative in that you specify what you want. This is different from specifying how to get what you want. That means the database engine is free to realize your query in whatever way it thinks is the most efficient. Many database optimizers rewrite your query to a less costly alternative (if such a plan is available).
Given the following table:
create table t (
    pk        int not null
  , color     int not null
  , nullable  int null
  , unique (pk)
);
create index color_ix on t (color);
...all of the following are functionally equivalent (due to the mechanics of count and nulls):
1) select count(*) from t;
2) select count(1) from t;
3) select count(pk) from t;
4) select count(color) from t;
Regardless of which form you use, the optimizer is free to rewrite the query to another form if it is more efficient. (Again, not all optimizers are sophisticated enough to do this.) The unique index on pk would be smaller (bytes occupied) than the entire table, so it would be more efficient to count the number of index entries rather than scanning the entire table.

In Oracle we have bitmap indexes, which also compress repeating strings. If we had used such an index on the color column, it would probably have been the smallest index to scan. Oracle also supports table compression, which in some cases makes the physical table smaller than a composite index.
1. TL;DR;
Your specific dbms will have its own set of tools that enables different rewriting rules and in turn execution plans. That renders the question somewhat useless (unless we talk about a specific release of a specific dbms). I recommend COUNT(*) in all cases because it requires the least cognitive effort to grasp.
2. On select a,b,c vs. select *
There are very few valid uses of SELECT * in code you write and put into production. Imagine a table which contains Blu-ray movies (yes, the movie itself is stored as a blob in this table). So you slapped together your awesomesauce abstraction layer and put SELECT * FROM movies WHERE id = ? in the getMovies(movie_id) method. I will refrain from explaining why SELECT name FROM movies will be transported across the network just a tad faster. Of course, in most realistic cases it won't have a noticeable impact.
One last point on performance: when all the referenced columns (selected, filtered) in your query exist in an index (called a covering index), the database need not touch the table at all; the query can be fully resolved by scanning the index alone. By selecting all columns you take this option away from the optimizer.
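A quick sketch of what that looks like, using a hypothetical orders table:

create table orders (
    customer_id int   not null
  , order_date  date  not null
  , total       money not null
);

-- Every column the query touches is in the index,
-- so the engine can answer it from the index alone (a covering index).
create index orders_cover_ix on orders (customer_id, order_date, total);

select customer_id, order_date, total
from orders
where customer_id = 42;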
Another thing about SELECT *, which is far more serious than performance, is that it creates an implicit dependency on the specific physical layout of the table. Let me explain. Consider the following tables:
table T1(name, id)
table T2(name, id)
The following statement...
insert into t1 select * from t2;
... will break or produce a different result if any of the following happens (a column-safe rewrite is sketched after this list):
Any of the tables' columns are rearranged, for example to T1(id, name)
T1 gets an additional not-null column
T2 gets another column
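The robust form names the columns on both sides, so it keeps working under all three changes above:

insert into t1 (name, id)
select name, id
from t2;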
2. TL;DR; When possible, explicitly specify the columns you want (eventually, you'll have to do that anyway). Also, selecting fewer columns is faster than selecting more columns. A positive side effect of explicit selects is that they give greater freedom to the optimizer.
COUNT(*) is different from COUNT(column1) !
COUNT(*) returns the number of records, and does NOT use more resources, while COUNT(column1) counts the number of records where column1 is non null.
For SELECT, it is different. SELECT * will of course request more data.
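A quick illustration of the difference, using a hypothetical table:

create table demo (column1 int null);
insert into demo values (1), (null), (3);

select count(*)       from demo;  -- 3: counts every row
select count(column1) from demo;  -- 2: NULLs are not counted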
When using count(*) the * doesn't mean "all fields". Using count(field) will count all non-null values in the field, but count(*) will always count all records even if all fields in all records are null, so it doesn't need to check the data in the fields at all.
Using select * means that you almost always return more data than you are going to use, which of course is a waste. However, perhaps more serious is the maintenance problem: if you add fields to a table, your query will return these too. That might mean that the record becomes too large to fit in the buffer, resulting in an error message.
Don't confuse the * in "COUNT(*)" with the * in "SELECT * ". They are completely unrelated but sometimes confused because it's such an odd syntax. There is nothing wrong with using COUNT(*), which just means "count rows".
SELECT * on the other hand means "select all columns". That's generally poor practice because it tightly couples your code to the database schema. That means when you change the table you probably have to change the code even if it should have been unaffected. It increases the impact of any schema change.
SELECT * may also cause a sub-optimal query plan. Either because you didn't really need all columns or because it forces the DBMS to do an extra lookup at runtime to get the list of columns.
It's absolutely true that * means "all columns", and you're right that if you have a table with an incredibly large number of columns (say 100+), these kinds of queries can be bad in terms of efficiency.
I believe that the best solution is to create database views in advance that filter the number of records involved in the count operation; then the performance impact isn't a big problem, because views can be cached.
On the other hand, the * operator should be avoided when returning records; it's far better to select only the fields you really need for the business logic at hand.
When using SELECT * it can have a performance hit. Applications which use the SELECT * syntax when they actually only need a handful of columns are transferring more data across the network than they need to consume, which is wasteful.
Also, in Microsoft SQL Server at least, there's a strange problem when you use SELECT * in a view and then add a column to the underlying table. The column headings and data returned by the view don't match each other following certain changes! See my blog post for further details of this particular problem.
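A sketch of how the problem shows up, with made-up names; sp_refreshview is the usual remedy in SQL Server:

CREATE TABLE t (a int, b int);
GO
CREATE VIEW v AS SELECT * FROM t;
GO
ALTER TABLE t ADD c int;
GO
-- The view's cached metadata is now stale: v still exposes only a and b,
-- and certain further schema changes can leave headings and data mismatched.
EXEC sp_refreshview 'v';  -- re-syncs the view's metadata with the table
GO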
How inefficient it becomes depends on the size of the database. The simplest way to describe it would be like so:
when you specifically do:
SELECT column1,column2,column3 FROM table1
MySQL knows exactly what columns it is looking for, but when you do
SELECT * FROM table1
MySQL does not know which columns you want; it knows you want all of them, but not their names, so it has to perform extra work to analyse the table and discover the columns, which uses resources.
In case of COUNT(*) it depends on database and its version. For example in modern versions of MS SQL it doesn't matter [source needed].
So the best approach in case of COUNT(*) is to measure it.
Using SELECT * is a really bad idea. * means "read all columns", which can be a heavy IO and network operation (especially for the various CHAR-type columns). Moreover, you rarely need all the columns.
I know that this topic has been beaten to death, but it seems that many articles on the Internet are often looking for the most elegant way instead of the most efficient way to solve it. Here is the problem. We are building an application where one of the common database queries will involve manipulation (SELECTs and UPDATEs) based on a user-supplied list of IDs. The table in question is expected to have hundreds of thousands of rows, and the user-provided lists of IDs can potentially be unbounded, but they will most likely be in the tens or hundreds (we may limit this for performance reasons later).
If my understanding of how databases work in general is correct, the most efficient approach is to simply use the WHERE ID IN (1, 2, 3, 4, 5, ...) construct and build the queries dynamically. The core of the problem is that the input lists of IDs are really arbitrary, so no matter how clever the database is or how cleverly we implement it, we always start with a random subset of integers, and so eventually every approach has to internally boil down to something like WHERE ID IN (1, 2, 3, 4, 5, ...) anyway.
One can find many approaches all over the web. For instance, one involves declaring a table variable, passing the list of IDs to a stored procedure as a comma-delimited string, splitting it in the stored procedure, inserting the IDs into the table variable and joining the master table to it, i.e. something like this:
-- 1. Table variable for the IDs:
DECLARE @IDS TABLE (ID int);
-- 2. Split the given string of IDs, and insert each ID into @IDS.
-- Omitted for brevity.
-- 3. Join the main table to @IDS:
SELECT MyTable.ID, MyTable.SomeColumn
FROM MyTable INNER JOIN @IDS ids ON MyTable.ID = ids.ID;
Putting the problems with string manipulation aside, I think what essentially happens in this case is that in the third step SQL Server says: "Thank you, that's nice, but I just need a list of the IDs", and it scans the table variable @IDS and then does n seeks in MyTable, where n is the number of IDs. I've done some elementary performance evaluations and inspected the query plan, and it seems that this is what happens. So the table variable, the string concatenation and splitting, and all the extra INSERTs are for nothing.
Am I correct? Or am I missing anything? Is there really some clever and more efficient way? Basically, what I’m saying is that the SQL Server has to do n index seeks no matter what and formulating the query as WHERE ID IN (1, 2, 3, 4, 5, ...) is the most straightforward way to ask for it.
Well, it depends on what's really going on. How is the user choosing these IDs?
Also, it's not just efficiency; there's also security and correctness to worry about. When and how does the user tell the database about their ID choices? How do you incorporate them into the query?
It might be much better to put the selected IDs into a separate table that you can join against (or use a WHERE EXISTS against).
I'll give you that you're not likely to do much better performance-wise than IN (1,2,3..n) for a small (user-generated) n. But you need to think about how you generate that query. Are you going to use dynamic SQL? If so, how will you secure it from injection? Will the server be able to cache the execution plan?
Also, using an extra table is often just easier. Say you're building a shopping cart for an eCommerce site. Rather than worrying about keeping track of the cart client-side or in a session, it's likely better to update the ShoppingCart table every time the user makes a selection. This also avoids the whole problem of how to safely set the parameter value for your query, because you're only making one change at a time.
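A sketch of that server-side cart with a hypothetical schema; note that each user action is a single statement with just two parameters to bind:

CREATE TABLE ShoppingCart
(
    SessionId uniqueidentifier NOT NULL,
    ProductId int NOT NULL,
    PRIMARY KEY (SessionId, ProductId)
);

-- One change at a time; no IN list to build:
INSERT INTO ShoppingCart (SessionId, ProductId)
VALUES (@SessionId, @ProductId);

-- Checkout reads the whole selection with a plain join:
SELECT p.*
FROM Products p
INNER JOIN ShoppingCart c ON p.Id = c.ProductId
WHERE c.SessionId = @SessionId;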
Don't forget the old adage (with apologies to Benjamin Franklin):
He who would trade correctness for performance deserves neither
Be careful; on many databases, IN (...) is limited to a fixed number of things in the IN clause. For example, I think it's 1000 in Oracle. That's big, but possibly worth knowing.
The IN clause does not guarantee an INDEX SEEK. I faced this problem before, using SQL Server Mobile Edition on a Pocket PC with very little memory. Replacing IN (list) with a list of OR clauses boosted my query by approximately 400%.
Another approach is to have a temp table that stores the ID's and join it against the target table, but if this operation is used too often a permanent/indexed table can help the optimizer.
For me, IN (...) is not the preferred option, for many reasons, including the limitation on the number of parameters.
Following up on a note from Jan Zich regarding the performance of various temp-table implementations, here are some numbers from the SQL execution plan:
XML solution: 99% time - xml parsing
comma-separated procedure using a UDF from CodeProject: 50% temp table scan, 50% index seek. One can argue whether this is the most optimal implementation of string parsing, but I did not want to create one myself (I will happily test another one).
CLR UDF to split string: 98% - index seek.
Here is the code for CLR UDF:
using System;
using System.Collections;
using Microsoft.SqlServer.Server;

public class SplitString
{
    // Streaming table-valued function: one row per comma-separated token.
    [SqlFunction(FillRowMethodName = "FillRow", TableDefinition = "ID int")]
    public static IEnumerable InitMethod(String inputString)
    {
        return inputString.Split(',');
    }

    // Converts each token into the ID column of the result row.
    public static void FillRow(Object obj, out int ID)
    {
        string strID = (string)obj;
        ID = Int32.Parse(strID);
    }
}
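For completeness, the function would be registered and called from T-SQL roughly like this (the assembly name MyAssembly and the table MyTable are assumptions):

CREATE FUNCTION dbo.SplitString (@list nvarchar(max))
RETURNS TABLE (ID int)
AS EXTERNAL NAME MyAssembly.SplitString.InitMethod;
GO

SELECT t.*
FROM MyTable t
INNER JOIN dbo.SplitString(@Ids) s ON t.ID = s.ID;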
So I will have to agree with Jan that the XML solution is not efficient. Therefore, if a comma-separated list is to be passed as a filter, a simple CLR UDF seems to be optimal in terms of performance.
I tested searching for 1K records in a table of 200K rows.
A table variable has issues: using a temp table with an index has benefits for statistics.
A table variable is assumed to always have one row, whereas a temp table has statistics the optimiser can use.
Parsing a CSV is easy: see the related questions.
Essentially, I would agree with your observation; SQL Server's optimizer will ultimately pick the best plan for analyzing a list of values and it will typically equate to the same plan, regardless of whether or not you are using
WHERE IN
or
WHERE EXISTS
or
JOIN someholdingtable ON ...
Obviously, there are other factors which influence plan choice (like covering indexes, etc.). The reason that people have various methods for passing in this list of values to a stored procedure is that before SQL 2008, there really was no simple way of passing in multiple values. You could do a list of parameters (WHERE IN (@param1, @param2) ...), or you could parse a string (the method you show above). As of SQL 2008, you can also pass table variables around (table-valued parameters), but the overall result is the same.
So yes, it doesn't matter how you get the list of variables to the query; however, there are other factors which may have some effect on the performance of said query once you get the list of variables in there.
Once upon a long time ago, I found that on the particular DBMS I was working with, the IN list was more efficient up to some threshold (which was, IIRC, something like 30-70), and after that, it was more efficient to use a temp table to hold the list of values and join with the temp table. (The DBMS made creating temp tables very easy, but even with the overhead of creating and populating the temp table, the queries ran faster overall.) This was with up-to-date statistics on the main data tables (but it also helped to update the statistics for the temp table too).
There is likely to be a similar effect in modern DBMS; the threshold level may well have changed (I am talking about depressingly close to twenty years ago), but you need to do your measurements and consider your strategy or strategies. Note that optimizers have improved since then - they may be able to make sensible use of bigger IN lists, or automatically convert an IN list into an anonymous temp table. But measurement will be key.
In SQL Server 2008 or later you should be looking to use table-valued parameters.
2008 makes it simple to pass a comma-separated list to SQL Server using this method.
Here is an excellent source of information and performance tests on the subject:
Arrays-in-sql-2008
Here is a great tutorial:
passing-table-valued-parameters-in-sql-server-2008
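A minimal sketch of the table-valued-parameter approach (the type and procedure names are made up):

CREATE TYPE dbo.IntList AS TABLE (ID int PRIMARY KEY);
GO

CREATE PROCEDURE dbo.GetByIds
    @Ids dbo.IntList READONLY  -- TVPs must be declared READONLY
AS
BEGIN
    SELECT t.*
    FROM MyTable t
    INNER JOIN @Ids i ON t.ID = i.ID;
END;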
For many years I used the third approach, but when I started using an OR/M it seemed to be unnecessary.
Even loading each row by ID is not as inefficient as it looks.
If problems with string manipulation are put aside, I think that:
WHERE ID=1 OR ID=2 OR ID=3 ...
is more efficient; nevertheless, I wouldn't do it.
You could compare the performance of both approaches.
To answer the question directly: there is no way to pass a (dynamic) list of arguments to an SQL Server 2005 procedure. Therefore what most people do in these cases is pass a comma-delimited list of identifiers, which I did as well.
Since SQL 2005, though, I prefer passing an XML string, which is also very easy to create on the client side (C#, Python, another SQL SP), and "native" to work with since 2005:
CREATE PROCEDURE myProc (@MyXmlAsSTR NVARCHAR(MAX)) AS BEGIN
    DECLARE @x XML
    SELECT @x = CONVERT(XML, @MyXmlAsSTR)
Then you can join your base query directly with the XML select as (not tested):
SELECT t.*
FROM myTable t
INNER JOIN @x.nodes('/ROOT/ROW') AS R(x)
    ON t.ID = R.x.value('@ID', 'INT')
when passing <ROOT><ROW ID="1"/><ROW ID="2"/></ROOT>. Just remember that XML is CaSe-SensiTive.
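An example call against the procedure above:

EXEC myProc @MyXmlAsSTR = N'<ROOT><ROW ID="1"/><ROW ID="2"/></ROOT>';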
select t.*
from (
         select id = 35 union all
         select id = 87 union all
         select id = 445 union all
         ...
         select id = 33643
     ) ids
join my_table t on t.id = ids.id
If the set of ids to search on is small, this may improve performance by permitting the query engine to do an index seek. If the optimizer judges that a table scan would be faster than, say, one hundred index seeks, then the optimizer will so instruct the query engine.
Note that query engines tend to treat
select t.*
from my_table t
where t.id in (35, 87, 445, ..., 33643)
as equivalent to
select t.*
from my_table t
where t.id = 35 or t.id = 87 or t.id = 445 or ... or t.id = 33643
and note that query engines tend not to be able to perform index seeks on queries with disjunctive search criteria. As an example, Google AppEngine datastore will not execute a query with a disjunctive search criteria at all, because it will only execute queries for which it knows how to perform an index seek.
Is there a performance difference between select * from tablename and select column1, column2 from tablename?
With select * from, the database pulls out all fields/columns, which is more than the two columns needed. So does the first query cost more time/resources?
If you do select * from, there are two performance issues:
the database has to determine which columns exist in the table
there is more data sent from the server to the client (all columns instead of only two)
The answer is YES in general! For small databases you won't see a performance difference, but with bigger databases there can be relevant differences if you use the unqualified * selector as shorthand!
In general, it's better to name each column from which you want to retrieve data!
I suggest you read the official documentation about how to optimize SELECT and other statements!
In any case, you should always test your changes. You can use a profiler to do that.
For mysql see : http://dev.mysql.com/tech-resources/articles/using-new-query-profiler.html
There is a difference, especially when the other columns are BLOB or (big) TEXT fields. If your table contains just the two columns, there is no difference.
I've checked with the profiler - and it seems that the answer is no - both queries took the same time to execute.
The table had relatively small fields so the result set wasn't bloated with large quantities of data I potentially wouldn't need.
If you have large data fields you don't need in your result set, don't include them in your query.
And while this may not make a big difference with one query run one time, if you use select * through the application, and especially if you use it when you are doing joins where you are definitely returning unneeded data, you are clearly slowing down the system for no good reason other than developer laziness. Select * should almost never be used on a production system.
Well, I guess it would depend on which type of performance you're talking about: Database or Programmer.
Speaking as someone who has had to clean up database queries that were written in the
select * from foo
format, it's a total nightmare. The database can and does change, so now the neat and tidy normalized database has become denormalized for performance reasons, and the host of sloppy queries written to grab everything are now slogging tons more data back.
If you don't need it, don't ask for it. Take a moment, think of the people who will follow after you and need to deal with your choices.
I've still got an entire section in our LIMS to uncluster, thanks for reminding me. -- Sigh
We are using SQL Server 2005, but this question can be for any RDBMS.
Which of the following is more efficient, when selecting all columns from a view?
Select * from view
or
Select col1, col2, ..., colN from view
NEVER, EVER USE "SELECT *"!!!!
This is the cardinal rule of query design!
There are multiple reasons for this. One of which is, that if your table only has three fields on it and you use all three fields in the code that calls the query, there's a great possibility that you will be adding more fields to that table as the application grows, and if your select * query was only meant to return those 3 fields for the calling code, then you're pulling much more data from the database than you need.
Another reason is performance. In query design, don't think about reusability as much as this mantra:
TAKE ALL YOU CAN EAT, BUT EAT ALL YOU TAKE.
It is best practice to select each column by name. In the future your DB schema might change to add columns that you would then not need for a particular query.
Just to clarify a point that several people have already made, the reason Select * is inefficient is because there has to be an initial call to the DB to find out exactly what fields are available, and then a second call where the query is made using explicit columns.
Feel free to use Select * when you are debugging, running casual queries or are in the early stages of developing a query, but as soon as you know your required columns, state them explicitly.
Select * is a poor programming practice. It is as likely to cause things to break as it is to save things from breaking. If you are only querying one table or view, then the efficiency gain may not be there (although it is possible if you are not intending to actually use every field).

If you have an inner join, then you have at least two fields returning the same data (the join fields), and thus you are wasting network resources to send redundant data back to the application. You won't notice this at first, but as the result sets get larger and larger, you will soon have a network pipeline that is full and doesn't need to be.

I can think of no instance where select * gains you anything. If a new column is added and you don't need to go to the code to do something with it, then by definition the column shouldn't be returned by your query. If someone drops and recreates the table with the columns in a different order, then all your queries will display information wrong or will give bad results, such as putting the price into the part number field in a new record.
Plus it is quick to drag the column names over from the object browser, so that is just pure laziness not efficiency in coding.
It depends. Inheritance of views can be a handy thing and easy to maintain (SQL Anywhere):
create view v_fruit as select F.id, S.strain from F key join S;
create view v_apples as select v_fruit.*, C.colour from v_fruit key join C;
If you're really selecting all columns, it shouldn't make any noticeable difference whether you ask for * or if you are explicit. The SQL server will parse the request the same way in pretty much the same amount of time.
Always do select col1, col2, etc. from view. There's no efficiency difference between the two methods that I know of, but using select * can be dangerous. If you modify your view definition by adding new columns, you can break a program using select *, whereas selecting a predefined set of columns (even all of them, named) will still work.
I guess it all depends on what the query optimizer does.
If I want to get every column in the row, I will generally use the SELECT * option, since I then don't have to worry should I change the underlying table structure. As well, for someone maintaining the code, seeing SELECT * tells them that this query is intended to return every column, whereas listing the columns individually does not convey the same intention.
For performance - look at the query plan (should be no difference).
For maintainability. - always supply a fieldlist (that goes for INSERT INTO too).
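For instance, with an explicit field list the INSERT keeps working even if either table later gains a column (the table names here are hypothetical):

insert into ArchiveOrders (id, total)
select id, total
from Orders
where shipped = 1;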
select
    column1
  , column2
  , column3
  ...
from YourView

This is more optimized than using:

select *
from YourView