Why would my parameterized query not use indexes correctly?

This is in SQL Server 2014, but I'm seeing the same behavior in 2008, 2012, and 2016, as well as Sybase ASE 15.7.
I have a simple query that looks like this:
SELECT myField
FROM myTable
WHERE someIndexedField = @myParam
If I run this query from SSMS (replacing @myParam with 'myValue'), the query runs in under a second, because someIndexedField is indexed, and lookups should be very fast.
However, if I create the same query as a parameterized query string in a C# program, the query takes 20 to 30 seconds. Analysis by the DBAs shows that the query plan is NOT using the index on the someIndexedField column, but it is instead doing a table scan.
Even stranger, if I do the exact same parameterized query, but instead change it slightly to this:
DECLARE @_myParam char(13)
SET @_myParam = @myParam
SELECT myField
FROM myTable
WHERE someIndexedField = @_myParam
...this version suddenly uses the index again, and performance is back up to sub-second response times. I see this same behavior in queries of various complexity, but not 100% consistently for different queries - sometimes the server DOES decide to use an index. It IS, however, consistent for any given query. I never know which queries will be affected before trying them, but if a given query has this problem, it ALWAYS has the problem. Also, a query that does not have the problem doesn't ever seem to develop it later.
Another odd thing I have noticed is that sometimes, changing the total length of the query will actually make a difference in how this behavior shows up. I had one example where adding extra carriage returns into the query (essentially double-spacing it) actually caused the server to suddenly start using indexes as expected. Literally NO CODE was changed. I was unable to pin down an exact length at which this happened, however. Also, this particular "solution" only seemed to work on Sybase ASE - I was unable to reproduce that one on SQL Server.
(Incidentally, I can also use a hint to push the server to use the appropriate index, and this also fixes the problem. However, index hints are generally not a good idea if you can avoid them, and it seems that the server should be perfectly capable of picking an index on its own, especially with such a simple query.)
What's going on here? Why is the first version running as though there were no indexes on the table? And why does simply putting the parameter into a locally defined variable suddenly cause indexes to be used?
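For reference, a minimal sketch of the index-hint workaround mentioned above, plus one commonly suggested alternative (OPTION (RECOMPILE)); the index name IX_someIndexedField is hypothetical and the other names are the placeholders from this question:
-- Index hint: forces the optimizer to use the named index.
SELECT myField
FROM myTable WITH (INDEX(IX_someIndexedField))
WHERE someIndexedField = @myParam

-- OPTION (RECOMPILE): compiles a fresh plan using the actual parameter value,
-- which often sidesteps a bad "sniffed" plan without hard-coding an index choice.
SELECT myField
FROM myTable
WHERE someIndexedField = @myParam
OPTION (RECOMPILE)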

Related

Optimization of large SQL Select query in odbc environment, comparison operators

I'm currently designing a database application that executes SQL statements on a SQL Server linked to the PCs via ODBC drivers (SQL Native Client v10, Local Network, Network Latency >1ms, executed from within an MS Access 2003 environment).
I'm dealing with a peculiar select query that is executed often and has to iterate through an indexed table with about 1.5 million entries. Currently the query structure is this:
SELECT *
FROM table1
WHERE field1 = value1
AND field2 = value2
AND textfield1 LIKE '* value3 *'
AND (field3 = value3 OR field4 = value4 OR field5 = value5)
ORDER BY indexedField1 DESC
(Simplified for readability; the real query can have up to 4 bracketed OR blocks joined by AND, and up to 31 AND-connected conditions in total.)
Currently this query takes about 2 seconds every time it is executed and returns somewhere between 1,000 and 15,000 records in typical production use. I'm looking for a way to make it execute faster, or a way to restructure it so that it runs faster.
Coworkers of mine have hinted that LIKE operators can be inefficient and that restructuring the bracketed OR conditions could bring additional performance.
Edit: Additional relevant information: the table being pulled from is VERY active; a new entry is added roughly every 1-5 minutes.
So the final questions are:
Given the parameters outlined above, is this version of the query the simplest I can get it?
Can I do anything else to speed up the query or its execution time?
General query optimisation is beyond the scope of a single answer, though there may be some help to be had on http://dba.stackexchange.com. However, you should learn how to read a query plan and figure out your bottlenecks before you start optimising.
(The way I'd do that would be to take a few typical queries and look at their estimated execution plan through a tool like SQL Server Management Studio. You may have to try to dig out the real SQL Server query that's resulted from your Access query, which your DBA might be able to help with. I'm assuming that your Access query is actually being translated into a SQL Server query and run natively on the server; if it's not then that will be your big problem!)
I'm going to assume you've indexed every column used in every predicate in your WHERE clause and still have a problem. If that's the case, the suspect is likely to be:
AND textfield1 LIKE '* value3 *'
Because that can't use an index. (It's not SARGable, because it has a wildcard at the beginning, so an index won't be any help.)
If you can't rearrange your search or pre-calculate this particular predicate, then you basically have the problem that Full-Text Searching was designed to solve by tokenising and pre-indexing the words in your text, and that will probably be the best solution.
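As a rough illustration (SQL Server syntax, using % as the wildcard where the question's Access-style query uses *; the full-text objects below are hypothetical and assume a unique key index named PK_table1 exists):
-- Not SARGable: the leading wildcard forces a scan of textfield1 in every row.
SELECT * FROM table1 WHERE textfield1 LIKE '%value3%'

-- SARGable: with only a trailing wildcard, an index on textfield1 can be used for a seek.
SELECT * FROM table1 WHERE textfield1 LIKE 'value3%'

-- Full-text alternative: tokenise and pre-index the words, then search with CONTAINS.
-- CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;
-- CREATE FULLTEXT INDEX ON table1(textfield1) KEY INDEX PK_table1;
SELECT * FROM table1 WHERE CONTAINS(textfield1, 'value3')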

Simple SQL query performance puzzles me

A quick note: We're running SQL Server 2012 in house, but the problem seems to also occur in 2008 and 2008 R2, and perhaps older versions as well.
I've been investigating a performance issue in some code of ours, and I've tracked down the problem to the following very simple query:
SELECT min(document_id)
FROM document
WHERE document_id IN
(SELECT TOP 5000 document_id FROM document WHERE document_id > 442684)
I've noticed that this query takes an absurdly long time (between 18s and 70s depending on the resources of the machine running it) to return when that final value (after greater-than) is roughly 442000 or larger. Anything below that, the query returns nearly instantly.
I've since tweaked the query to look like this:
SELECT min(t.document_id)
FROM (SELECT TOP 5000 document_id FROM document WHERE document_id > 442684) t
This returns immediately for all values of > that I've tested with.
I've solved the performance issue at hand, so I'm largely happy, but I'm still puzzling over why the original query performed so poorly for 442000 and why it runs quickly for virtually any value below that (400000, 350000, etc).
Can anyone explain this?
EDIT: Fixed the 2nd query to be min instead of max (that was a typo)
The secret to understanding performance in SQL Server (and other databases) is the execution plan. You would need to look at the execution plan for the queries to understand what is going on.
The first version of your query has a join operation: IN with a subquery is just another way to express a JOIN. SQL Server has several ways to implement joins, such as hash match, merge join, and nested loops (often driven by index lookups). The optimizer chooses the one that it thinks is best.
Without seeing the execution plans, my best guess is that the optimizer changes its mind about the best algorithm to use for the IN. In my experience, this usually means that it switched to a nested-loop algorithm from a more reasonable one.
Change your query to
SELECT min(document_id) FROM document WHERE document_id > 442684
The SELECT ... IN (SELECT TOP 5000 ...) construction is a bad idea in SQL -- it can expand into 5,000 separate equality tests. I don't know why the optimizer behaved well in the second (derived table) version.
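If you want to test that nested-loop guess yourself, here is a hedged sketch of how you might compare the variants; the OPTION (HASH JOIN) hint is purely a diagnostic to rule the nested-loop plan in or out, not a recommended fix:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Original form: IN with a subquery (the slow one).
SELECT min(document_id)
FROM document
WHERE document_id IN
(SELECT TOP 5000 document_id FROM document WHERE document_id > 442684);

-- Same query with the join strategy forced; if this is suddenly fast,
-- the nested-loop choice was the likely culprit.
SELECT min(document_id)
FROM document
WHERE document_id IN
(SELECT TOP 5000 document_id FROM document WHERE document_id > 442684)
OPTION (HASH JOIN);

-- Simplified form suggested above.
SELECT min(document_id) FROM document WHERE document_id > 442684;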

Performance in these two SQL scenarios

I had assumed the first call below was more efficient, because it only makes this check once to see if @myInputParameter is less than 5000.
If the check fails, I avoid a query altogether. However, I've seen other people code like the second example, saying it's just as efficient, if not more.
Can anyone tell me which is quicker? It seems like the second one would be much slower, especially if the call is combing through a large data set.
First call:
IF (@myInputParameter < 5000)
BEGIN
    SELECT @myCount = COUNT(1)
    FROM myTable
    WHERE someColumn = @someInputParameter
      AND someOtherColumn = 'Hello'
      --and so on
END
Second call:
SELECT @myCount = COUNT(1)
FROM myTable
WHERE someColumn = @someInputParameter
  AND someOtherColumn = 'Hello'
  AND @myInputParameter < 5000
  --and so on
Edit: I'm using SQL Server 2008 R2, but I'm really asking to get a feel for which query is "best practice" for SQL. I'm sure the difference in query-time between these two statements is a thousandth of a second, so it's not THAT critical. I'm just interested in writing better SQL code in general. Thanks
Sometimes, SQL Server is clever enough to transform the latter into the former. This manifests itself as a "startup predicate" on some plan operator like a filter or a loop-join. This causes the query to evaluate very quickly in small, constant time. I just tested this, btw.
You can't rely on this for all queries, but once you have verified through testing that it works for a particular query, I'd rely on it.
If you use OPTION (RECOMPILE) it becomes even more reliable, because this option inlines the parameter values into the query plan, causing the entire query to turn into a constant scan.
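That is, a minimal sketch of the second form with that option added (variable and column names are the placeholders from the question):
SELECT @myCount = COUNT(1)
FROM myTable
WHERE someColumn = @someInputParameter
  AND someOtherColumn = 'Hello'
  AND @myInputParameter < 5000
OPTION (RECOMPILE)
-- With RECOMPILE, the parameter value is known at compile time; when
-- @myInputParameter >= 5000 the predicate folds to FALSE and the plan
-- collapses to a constant scan instead of touching myTable.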
SQL Server is well designed. It will do literal evaluations without actually scanning the table/indexes if it doesn't need to, so I would expect the two to perform basically identically.
From a best-practice standpoint, I think one should use the IF statement, especially if it wraps multiple statements. But this really is a matter of preference to me: code that can't be executed would logically be faster than code that "should" execute without actually hitting the data.
Also, there is the possibility that SQL Server could create a bad plan and actually hit the data. I've never seen that specific scenario with literals, but I have had bad execution plans created.

Cost of logic in a query

I have a query that looks something like this:
select xmlelement("rootNode",
(case
when XH.ID is not null then
xmlelement("xhID", XH.ID)
else
xmlelement("xhID", xmlattributes('true' AS "xsi:nil"), XH.ID)
end),
(case
when XH.SER_NUM is not null then
xmlelement("serialNumber", XH.SER_NUM)
else
xmlelement("serialNumber", xmlattributes('true' AS "xsi:nil"), XH.SER_NUM)
end),
/*repeat this pattern for many more columns from the same table...*/
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
It's ugly and I don't like it, and it is also the slowest executing query (there are others of similar form, but much smaller and they aren't causing any major problems - yet). Maintenance is relatively easy as this is mostly a generated query, but my concern now is for performance. I am wondering how much of an overhead there is for all of these case expressions.
To see if there was any difference, I wrote another version of this query as:
select xmlelement("rootNode",
xmlforest(XH.ID, XH.SER_NUM,...
(I know that this query does not produce exactly the same thing; my plan was to move the logic for handling the renaming and xsi:nil attribute to XSL or maybe to PL/SQL)
I tried to get execution plans for both versions, but they are the same. I'm guessing that the logic does not get factored into the execution plan. My gut tells me the second version should execute faster, but I'd like some way to prove that (other than writing a PL/SQL test function with timing statements before and after the query and running that code over and over again to get a test sample).
Is it possible to get a good idea of how much the case-when will cost?
Also, I could write the case-when using the decode function instead. Would that perform better (than case-statements)?
Just about anything in your SELECT list, unless it is a user-defined function which reads a table or view, or a nested subselect, can usually be neglected for the purpose of analyzing your query's performance.
Open your connection properties and set the value SET STATISTICS IO on. Check out how many reads are happening. View the query plan. Are your indexes being used properly? Do you know how to analyze the plan to see?
For the purposes of performance tuning you are dealing with this statement:
SELECT *
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
How does that query perform? If it returns in markedly less time than the XML version then you need to consider the performance of the functions, but I would be astonished if that were the case (oh ho!).
Does this return one row or several? If one row then you have only two things to work with:
is XH.ID indexed and, if so, is the index being used?
does the "many more columns from the same table" indicate a problem with chained rows?
If the query returns several rows then ... Well, actually you have the same two things to work with. It's just that the emphasis is different with regard to indexes. If the index has a very poor clustering factor then it could be faster to avoid using the index in favour of a full table scan.
Beyond that you would need to look at physical problems - I/O bottlenecks, poor interconnects, a dodgy disk. The reason why your scope for tuning the query is so restricted is because - as presented - it is a single table, single column read. Most tuning is about efficient joining. Now if XH transpires to be a view over a complex query then it is a different matter.
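To check whether the index on XH.ID is actually being used (the first of the two points above), a rough Oracle sketch (assumes the plan table is installed):
EXPLAIN PLAN FOR
SELECT * FROM XH WHERE XH.ID = 'SOMETHINGOROTHER';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- Look for an INDEX UNIQUE SCAN or INDEX RANGE SCAN on the XH.ID index,
-- as opposed to a TABLE ACCESS FULL.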
You can use good old tkprof to analyze statistics, enabled via one of the many forms of ALTER SESSION that turn on stats gathering. The DBMS_PROFILER package also gathers statistics if your cursor is in a PL/SQL code block.
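A rough sketch of that approach (the trace file name and location are environment-specific, and the identifier is arbitrary):
-- In the session that runs the query:
ALTER SESSION SET TRACEFILE_IDENTIFIER = 'xh_query';
ALTER SESSION SET SQL_TRACE = TRUE;
-- ... run the XML query ...
ALTER SESSION SET SQL_TRACE = FALSE;
-- Then format the raw trace file on the server with tkprof, e.g.:
-- tkprof <your_trace_file>.trc xh_query_report.txt sys=no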

LINQ / SQL Performance problem with a very long list

In our database we have a table with more than 100,000 entries, but most of the time we only need a part of it. We do this with a very simple query.
items.AddRange(from i in this
where i.ResultID == resultID && i.AgentID == parentAgentID
orderby i.ChangeDate descending
select i);
After this query we get a list of up to 500 items, but even from this result we only need the newest item and the one after it. My coworker did this very simply with:
items[0];
items[1];
This works fine since the query result is already ordered by date, but the overall performance is very poor - it can even take several seconds.
My idea was to add a .Take(2) at the end of the query but my coworker said this will make no difference.
items.AddRange((from i in resultlist
where i.ResultID == resultID && i.AgentID == parentAgentID
orderby i.ChangeDate descending
select i).Take(2));
We haven't tried this yet and we are still looking for additional ways to speed things up, but database programming is not our strong suit and any advice would be great.
Maybe we can even make some adjustments to the database itself? We use a SQL Compact Database.
Using Take(2) should indeed make a difference, if the optimiser is reasonably smart, and particularly if the ChangeDate column is indexed. (I don't know how much optimization SQL Compact edition does, but I'd still expect limiting the results to be helpful.)
However, you shouldn't trust me or anyone else to say so. See what query is being generated in each case, and run it against the SQL profiler. See what the execution plan is. Measure the performance with various samples. Measure, measure, measure.
The problem you might be having is that the data is being pulled down to your computer and then you're doing the Take(2) on it. The part that probably takes the most time is pulling all of that data to your application. If you want SQL Server to do the work, make sure you don't access any of the result set's records until you're done composing your query statements.
Second, LINQ isn't fast at things like sorting and filtering large sets of data in application memory. It's often easier to write in LINQ, but it's almost always better to do as much of the sorting and filtering in the database as possible, rather than manipulating in-memory sets of objects.
If you really care about performance in this scenario, don't use LINQ. Just make a loop.
http://ox.no/posts/linq-vs-loop-a-performance-test
I love using LINQ-To-SQL and LINQ but it's not always the right tool for the job. If you have a lot of data and performance is critical then you don't want to use LINQ for in memory sorting and where statements.
Adding .Take(2) will make a big difference. If you only need two items then you should definitely use it and it will most certainly make a performance difference for you.
Add it and look at the SQL that is generated from it. The SQL generated will only get 2 records, which should save you time on the SQL side and also on the object instantiation side.
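Roughly speaking, the generated SQL should then resemble something like the following (an approximation only; the table name is hypothetical and the exact text depends on the LINQ provider and SQL Server Compact's dialect):
SELECT TOP (2) *
FROM [Results]
WHERE [ResultID] = @p0 AND [AgentID] = @p1
ORDER BY [ChangeDate] DESC
-- Only two rows cross the wire and only two objects are materialised.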
1 - Add an index to cover the fields you use in the query.
2 - Make sure that getting just the top 2 isn't paid for by repeating the query too frequently; try to define query criteria that let you take a batch of records at once.
3 - Try compiling your LINQ query.