How to structure a query with a large, complex where clause?


I have an SQL query that takes these parameters:
@SearchFor nvarchar(200) = null
,@SearchInLat Decimal(18,15) = null
,@SearchInLng Decimal(18,15) = null
,@SearchActivity int = null
,@SearchOffers bit = null
,@StartRow int
,@EndRow int
The variables @SearchFor, @SearchActivity and @SearchOffers can each be either null or not null. @SearchInLat and @SearchInLng must either both be null or both have values.
I'm not going to post the whole query as it's boring and hard to read, but the WHERE clause is shaped like this:
( -- filter by activity --
    (@SearchActivity IS NULL)
    OR (@SearchActivity = Activities.ActivityID)
)
AND ( -- filter by Location --
    (@SearchInLat is NULL AND @SearchInLng is NULL)
    OR ( ... )
)
AND ( -- filter by activity --
    @SearchActivity is NULL
    OR ( ... )
)
AND ( -- filter by has offers --
    @SearchOffers is NULL
    OR ( ... )
)
AND (
    ... -- more stuff
)
I have read that this is a bad way to structure a query, because SQL Server has trouble working out an efficient execution plan when there are lots of clauses like this, so I'm looking for other ways to do it.
I see two ways of doing this:
Construct the query as a string in my client application, so that the WHERE clause only contains filters for the relevant parameters. The problem with this is it means not accessing the database through stored procedures, as everything else is at the moment.
Change the stored procedure so that it examines which arguments are null, and executes child procedures depending on which arguments it is passed. The problem here is that it would mean repeating myself a lot in the definition of the procs, and thus be harder to maintain.
What should I do? Or should I just carry on as I am? I have OPTION (RECOMPILE) set for the procedures, but I've heard that this doesn't work properly in SQL Server 2005. Also, I plan to add more parameters to this proc, so I want to make sure whatever solution I choose is fairly scalable.

The answer is to use dynamic SQL (be it in the client, or in an SP using sp_executesql), but the reason why is long, so here's a link...
Dynamic Search Conditions in T-SQL
A very short version is that one size does not fit all. The optimiser creates one plan per query, and a single plan that has to cope with every combination of parameters is slow for most of them. So the solution is to continue using parameterised queries (for execution plan caching), but to have many queries, one for each type of search that can happen.
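As a minimal sketch of that idea (not the linked article's code), a procedure body could append only the filters whose parameters are non-null and still pass the values as parameters, so that each distinct query shape gets its own cached plan. The parameter names come from the question, written with T-SQL's `@` prefix; the table `dbo.Activities` and the column `HasOffers` are assumptions made for illustration.

```sql
-- Sketch only: dbo.Activities and HasOffers are assumed names, not taken from the question.
DECLARE @SearchActivity int, @SearchOffers bit;
SET @SearchActivity = 3;     -- example value; NULL means "no filter"
SET @SearchOffers   = NULL;

DECLARE @sql nvarchar(max);
SET @sql = N'SELECT a.ActivityID FROM dbo.Activities AS a WHERE 1 = 1';

-- Append a predicate only when the corresponding parameter was supplied.
IF @SearchActivity IS NOT NULL
    SET @sql = @sql + N' AND a.ActivityID = @SearchActivity';
IF @SearchOffers IS NOT NULL
    SET @sql = @sql + N' AND a.HasOffers = @SearchOffers';

-- Values are passed as parameters, never concatenated into the string.
EXEC sys.sp_executesql
     @sql,
     N'@SearchActivity int, @SearchOffers bit',
     @SearchActivity = @SearchActivity,
     @SearchOffers   = @SearchOffers;
```

Because the values travel as parameters, the generated text only varies with which filters are present, which keeps the number of cached plans bounded and avoids SQL injection through the search values.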

Perhaps an alternative might be to perform several separate select statements?
e.g.
-- filter by activity --
if @SearchActivity is not null
    insert into tmpTable (<columns>)
    select *
    from myTable
    where (@SearchActivity = Activities.ActivityID)

-- filter by Location --
if @SearchInLat is not null and @SearchInLng is not null
    insert into tmpTable (<columns>)
    select *
    from myTable
    where (latCol = @SearchInLat AND lngCol = @SearchInLng)

-- etc...
Then select from the temp table to return the final result set.
I'm not sure how this would work with respect to the optimiser and the query plans, but each individual select would be very straightforward and could utilise the indexes that you would have created on each column which should make them very quick.
Depending on your requirements it also may make sense to create a primary key on the temp table to allow you to join to it on each select (to avoid duplicates).

Look at the performance first, like others have said.
If possible, you can use IF clauses to simplify the queries based on what parameters are provided.
You could also use functions or views to encapsulate some of the code if you find you are repeating it often.

Related

T-Sql - Select query in another select query takes long time

I have a procedure with arguments, but calling it takes a very long time. I decided to check what is wrong with my query and came to the conclusion that the problem is Column IN (SELECT [...]).
Both queries return 1500 rows.
First query: 45 seconds
Second query: 0 seconds
1.
declare @FILTER_OPTION int
declare @ID_DISTRIBUTOR type_int_value
declare @ID_DATA_TYPE type_bigint_value
declare @ID_AGGREGATION_TYPE type_int_value
set @FILTER_OPTION = 8
insert into @ID_DISTRIBUTOR values (19)
insert into @ID_DATA_TYPE values (30025)
insert into @ID_AGGREGATION_TYPE values (10)
SELECT * FROM dbo.[DATA] WHERE
[ID_DISTRIBUTOR] IN (select [VALUE] from @ID_DISTRIBUTOR)
AND [ID_DATA_TYPE] IN (select [VALUE] from @ID_DATA_TYPE)
AND [ID_AGGREGATION_TYPE] IN (select [VALUE] from @ID_AGGREGATION_TYPE)
2.
select * FROM dbo.[DATA] WHERE
[ID_DISTRIBUTOR] IN (19)
AND [ID_DATA_TYPE] IN (30025)
AND [ID_AGGREGATION_TYPE] IN (10)
Why is this happening?
How should I write a stored procedure that takes an array of arguments and still runs quickly?
Edit:
Maybe it's a problem with indexes? Indexes are created on these three columns.

For such a large performance difference, I would guess that you have one or more indexes. In particular, if you have an index on (ID_DISTRIBUTOR, ID_DATA_TYPE, ID_AGGREGATION_TYPE), then the second query can make use of the index. SQL Server can recognize that the IN is really = and the query is a simple lookup.
In the first case, SQL Server doesn't "know" that the subqueries really have only one row in them. That requires a different set of optimizations. In particular, the above index cannot be used, because the IN generally optimizes differently from =.
As for what to do: first, look at the execution plans so you can see the difference between the two versions. Then, test the second version with more than one value in the IN lists.
If you can live with just one value for each comparison, then use = rather than IN.
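If each list really does hold a single value, one way to apply that advice is to copy the value into a scalar variable first and compare with `=`. This is only a sketch: the table variables are declared inline here (the question uses user-defined table types), the scalar variable names are made up for illustration, and `dbo.[DATA]` is the question's table.

```sql
-- Inline table variables standing in for the question's table-typed parameters.
DECLARE @ID_DISTRIBUTOR      TABLE ([VALUE] int);
DECLARE @ID_DATA_TYPE        TABLE ([VALUE] bigint);
DECLARE @ID_AGGREGATION_TYPE TABLE ([VALUE] int);

INSERT INTO @ID_DISTRIBUTOR      VALUES (19);
INSERT INTO @ID_DATA_TYPE        VALUES (30025);
INSERT INTO @ID_AGGREGATION_TYPE VALUES (10);

-- Hypothetical scalar copies; MIN() picks the single value deterministically
-- and leaves the variable NULL when the list is empty.
DECLARE @distributor int, @dataType bigint, @aggregationType int;
SELECT @distributor     = MIN([VALUE]) FROM @ID_DISTRIBUTOR;
SELECT @dataType        = MIN([VALUE]) FROM @ID_DATA_TYPE;
SELECT @aggregationType = MIN([VALUE]) FROM @ID_AGGREGATION_TYPE;

-- Plain equality predicates on the three indexed columns.
SELECT *
FROM dbo.[DATA]
WHERE [ID_DISTRIBUTOR]      = @distributor
  AND [ID_DATA_TYPE]        = @dataType
  AND [ID_AGGREGATION_TYPE] = @aggregationType;
```

This only fits the single-value case; with several values per list the IN form is still needed, and comparing the execution plans is the way to see what changes.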

SQL Function in column running slow

I have a computed column (backed by a function) that is causing one of my tables to be extremely slow (its output is a column in my table). I thought it might be some logical statements in my function, so I commented those out and just returned a string called 'test'. This still caused the table to be slow. I believe the SELECT statement is slowing down the function: when I comment out the SELECT statement, everything is cherry. I think I am not using functions in the correct manner.
FUNCTION [dbo].[Pend_Type](@Suspense_ID int, @Loan_ID nvarchar(10), @Suspense_Date datetime, @Investor nvarchar(10))
RETURNS nvarchar(20)
AS
BEGIN
    DECLARE @Closing_Date Datetime, @Paid_Date Datetime
    DECLARE @pendtype nvarchar(20)
    --This is the issue!!!!
    SELECT @Closing_Date = Date_Closing, @Paid_Date = Date_Paid from TABLE where Loan_ID = @Loan_ID
    SET @pendtype = 'test'
    --commented out logic
    RETURN @pendtype
END
UPDATE:
I have another computed column that does something similar and is a column in the same table. This one runs fast. Anyone see a difference in why this would be?
Declare @yOrn AS nvarchar(1)
IF((Select count(suspense_ID) From TABLE where suspense_ID = @suspenseID) = 0)
    SET @yOrn = 'N'
ELSE
    SET @yOrn = 'Y'
RETURN @yOrn

You have isolated the performance problem in the select statement:
SELECT TOP 1 @Closing_Date = Date_Closing, @Paid_Date = Date_Paid
from TABLE
where Loan_ID = @Loan_ID;
To make this run faster, create a composite index on table(Loan_ID, Date_Closing, Date_Paid).
By the way, you are using top with no order by. When multiple rows match, you can get any one of them back. Normally, top is used with order by.
EDIT:
You can create the index by issuing the following command:
create index idx_table_loan_closing_paid on table(Loan_ID, Date_Closing, Date_Paid);

Scalar functions are often executed like cursors, one row at a time; that is why they are slow and are to be avoided. I would not use the function as written but would write a set-based version instead. Incidentally, a select top 1 without an order by will not always give you the same record and is generally a poor practice. In this case I would think you would want the latest date, for instance, or the earliest one.
In this particular case I think you would be better off not using a function but using a derived table join.
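A rough sketch of what such a derived table join could look like follows. All table names are hypothetical: `dbo.Suspense` stands for the table that currently carries the computed column and `dbo.Loans` for the table the function reads (shown only as TABLE in the question). Taking the latest date per loan with MAX is also an assumption.

```sql
-- Sketch only: dbo.Suspense and dbo.Loans are assumed names.
SELECT s.Suspense_ID,
       s.Loan_ID,
       l.Date_Closing,
       l.Date_Paid
       -- the pend-type logic (commented out in the question) would become a
       -- CASE expression over l.Date_Closing and l.Date_Paid here
FROM dbo.Suspense AS s
LEFT JOIN (
    -- One row per loan, chosen deterministically (latest dates assumed).
    SELECT Loan_ID,
           MAX(Date_Closing) AS Date_Closing,
           MAX(Date_Paid)    AS Date_Paid
    FROM dbo.Loans
    GROUP BY Loan_ID
) AS l
    ON l.Loan_ID = s.Loan_ID;
```

The lookup is then done once as a set-based join rather than once per row through the scalar function.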

Optimizing stored procedure with multiple "LIKE"s

I am passing in a comma-delimited list of values that I need to compare to the database
Here is an example of the values I'm passing in:
@orgList = "1123, 223%, 54%"
To use the wildcard I think I have to do LIKE but the query runs a long time and only returns 14 rows (the results are correct, but it's just taking forever, probably because I'm using the join incorrectly)
Can I make it better?
This is what I do now:
declare @tempTable Table (SearchOrg nvarchar(max) )
insert into @tempTable
select * from dbo.udf_split(@orgList) as split
-- this splits the values at the comma and puts them in a temp table
-- then I do a join on the main table and the temp table to do a like on it....
-- but I think it's not right because it's too long.
select something
from maintable gt
join @tempTable tt on gt.org like tt.SearchOrg
where
AYEAR= ISNULL(@year, ayear)
and (AYEAR >= ISNULL(@yearR1, ayear) and ayear <= ISNULL(@yearr2, ayear))
and adate = ISNULL(@Date, adate)
and (adate >= ISNULL(@dateR1, adate) and adate <= ISNULL(@DateR2 , adate))
The final result would be all rows where the maintable.org is 1123, or starts with 223 or starts with 554
The reason for my date craziness is that sometimes the stored procedure only checks for a year, sometimes for a year range, sometimes for a specific date and sometimes for a date range... everything that's not used is passed in as null.
Maybe the problem is there?

Try something like this:
Declare @tempTable Table
(
    -- Since the column is a varchar(10), you don't want to use nvarchar here.
    SearchOrg varchar(20)
);
INSERT INTO @tempTable
SELECT * FROM dbo.udf_split(@orgList);
SELECT
    something
FROM
    maintable gt
WHERE
    some where statements go here
    And
    Exists
    (
        SELECT 1
        FROM @tempTable tt
        WHERE gt.org Like tt.SearchOrg
    )

A dynamic query like this, with optional filters and a LIKE driven by a table (!), is very hard to optimize because almost nothing is statically known. The optimizer has to create a very general plan.
You can do two things to speed this up by orders of magnitude:
Play with OPTION (RECOMPILE). If the compile times are acceptable this will at least deal with all the optional filters (but not with the LIKE table).
Do code generation and EXEC sp_executesql the code. Build a query with all LIKE clauses inlined into the SQL so that it looks like this: WHERE a LIKE @like0 OR a LIKE @like1 ... (not sure if you need OR or AND). This allows the optimizer to get rid of the join and just execute a normal predicate.
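For the three example patterns from the question, the generated statement might look roughly like the sketch below. The number of `@likeN` parameters would have to follow the number of items in the list, `gt.org` stands in for the question's unspecified select list, and the optional date filters are left out for brevity.

```sql
-- Sketch only: three patterns are hard-coded to match the question's example list.
DECLARE @sql nvarchar(max);
SET @sql = N'SELECT gt.org
FROM maintable AS gt
WHERE gt.org LIKE @like0
   OR gt.org LIKE @like1
   OR gt.org LIKE @like2;';

-- Each pattern is passed as a parameter rather than concatenated into the text.
EXEC sys.sp_executesql
     @sql,
     N'@like0 varchar(20), @like1 varchar(20), @like2 varchar(20)',
     @like0 = '1123',
     @like1 = '223%',
     @like2 = '54%';
```

With the patterns as ordinary predicates, the join against the table variable disappears from the plan.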

Your query may be difficult to optimize. Part of the question is what is in the where clause. You probably want to filter these first, and then do the join using like. Or, you can try to make the join faster, and then do a full table scan on the results.
SQL Server should optimize a like statement of the form 'abc%' -- that is, where the wildcard is at the end. (See here, for example.) So, you can start with an index on maintable.org. Fortunately, your examples meet this criterion. However, if you have '%abc' -- the wildcard comes first -- then the optimization won't work.
For the index to work best, it might also need to take into account the conditions in the where clause. In other words, adding the index is suggestive, but the rest of the query may preclude the use of the index.
And, let me add, the best solution for these types of searches is to use the full text search capability in SQL Server (see here).

Performance implications of sql 'OR' conditions when one alternative is trivial?

I'm creating a stored procedure for searching some data in my database according to some criteria input by the user.
My sql code looks like this:
Create Procedure mySearchProc
(
@IDCriteria bigint=null,
...
@MaxDateCriteria datetime=null
)
as
select Col1,...,Coln from MyTable
where (@IDCriteria is null or ID=@IDCriteria)
...
and (@MaxDateCriteria is null or Date<@MaxDateCriteria)
Edit : I've around 20 possible parameters, and each combination of n non-null parameters can happen.
Is it ok performance-wise to write this kind of code? (I'm using MS SQL Server 2008)
Would generating SQL code containing only the needed where clauses be notably faster?

OR clauses are notorious for causing performance issues mainly because they require table scans. If you can write the query without ORs you'll be better off.

where (@IDCriteria is null or ID=@IDCriteria)
and (@MaxDateCriteria is null or Date<@MaxDateCriteria)
If you write this criteria, then SQL Server will not know whether it is better to use the index for IDs or the index for Dates.
For proper optimization, it is far better to write separate queries for each case and use IF to guide you to the correct one.
IF @IDCriteria is not null and @MaxDateCriteria is not null
    --query
    WHERE ID = @IDCriteria and Date < @MaxDateCriteria
ELSE IF @IDCriteria is not null
    --query
    WHERE ID = @IDCriteria
ELSE IF @MaxDateCriteria is not null
    --query
    WHERE Date < @MaxDateCriteria
ELSE
    --query
    WHERE 1 = 1
If you expect to need different plans out of the optimizer, you need to write different queries to get them!!
Would generating SQL code containing only the needed where clauses be notably faster?
Yes - if you expect the optimizer to choose between different plans.
Edit:
DECLARE @CustomerNumber int, @CustomerName varchar(30)
SET @CustomerNumber = 123
SET @CustomerName = '123'
SELECT * FROM Customers
WHERE (CustomerNumber = @CustomerNumber OR @CustomerNumber is null)
AND (CustomerName = @CustomerName OR @CustomerName is null)
CustomerName and CustomerNumber are indexed. Optimizer says: "Clustered Index Scan with parallelization". You can't write a worse single table query.
Edit : I've around 20 possible parameters, and each combination of n non-null parameters can happen.
We had a similar "search" functionality in our database. When we looked at the actual queries issued, 99.9% of them used an AccountIdentifier. In your case, I suspect either one column is -always supplied- or one of two columns are always supplied. This would lead to 2 or 3 cases respectively.
It's not important to remove OR's from the whole structure. It is important to remove OR's from the column/s that you expect the optimizer to use to access the indexes.

So, to boil down the above comments:
Create a separate sub-procedure for each of the most popular variations of specific combinations of parameters, and within a dispatcher procedure call the appropriate one from an IF ELSE structure, the penultimate ELSE clause of which builds a query dynamically to cover the remaining cases.
Perhaps only one or two cases may be specifically coded at first, but as time goes by and particular combinations of parameters are identified as being statistically significant, implementation procedures may be written and the master IF ELSE construct extended to identify those cases and call the appropriate sub-procedure.
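A dispatcher along those lines could be sketched as follows. The names are illustrative only: `mySearchProc_Dispatch` and `mySearchProc_ByID` are hypothetical procedures, only two of the question's roughly twenty parameters are shown, and `Col1`, `ID` and `Date` follow the question's example.

```sql
-- Sketch only: mySearchProc_ByID is a hypothetical sub-procedure written and
-- tuned for the common "ID only" case.
CREATE PROCEDURE dbo.mySearchProc_Dispatch
    @IDCriteria      bigint   = NULL,
    @MaxDateCriteria datetime = NULL
AS
BEGIN
    SET NOCOUNT ON;

    IF @IDCriteria IS NOT NULL AND @MaxDateCriteria IS NULL
    BEGIN
        -- Popular combination: hand off to its dedicated procedure.
        EXEC dbo.mySearchProc_ByID @IDCriteria = @IDCriteria;
        RETURN;
    END;

    -- Remaining cases: build only the predicates that are needed.
    DECLARE @sql nvarchar(max);
    SET @sql = N'SELECT Col1 FROM dbo.MyTable WHERE 1 = 1';
    IF @IDCriteria IS NOT NULL
        SET @sql = @sql + N' AND ID = @IDCriteria';
    IF @MaxDateCriteria IS NOT NULL
        SET @sql = @sql + N' AND [Date] < @MaxDateCriteria';

    EXEC sys.sp_executesql
         @sql,
         N'@IDCriteria bigint, @MaxDateCriteria datetime',
         @IDCriteria      = @IDCriteria,
         @MaxDateCriteria = @MaxDateCriteria;
END;
```

New IF branches can be added ahead of the dynamic fallback as further parameter combinations turn out to be common.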

Regarding "Would generating SQL code containing only the needed where clauses be notably faster?"
I don't think so, because this way you effectively remove the positive effects of query plan caching.

You could perform selective queries, in order of the most common / efficient (indexed etc), parameters, and add PK(s) to a temporary table
That would create a (hopefully small!) subset of data
Then join that Temporary Table with the main table, using a full WHERE clause with
SELECT ...
FROM #TempTable AS T
JOIN dbo.MyTable AS M
ON M.ID = T.ID
WHERE (@IDCriteria IS NULL OR M.ID=@IDCriteria)
...
AND (@MaxDateCriteria IS NULL OR M.Date<@MaxDateCriteria)
style to refine the (small) subset.

What if constructs like these were replaced:
WHERE (@IDCriteria IS NULL OR @IDCriteria=ID)
AND (@MaxDateCriteria IS NULL OR Date<@MaxDateCriteria)
AND ...
with ones like these:
WHERE ID = ISNULL(@IDCriteria, ID)
AND Date < ISNULL(@MaxDateCriteria, DATEADD(millisecond, 1, Date))
AND ...
or is this just coating the same unoptimizable query in syntactic sugar?

Choosing the right index is hard for the optimizer. IMO, this is one of few cases where dynamic SQL is the best option.

This is one of the cases where I use code building or a sproc for each search option.
Since your search is so complex I'd go with code building.
You can do this either in code or with dynamic SQL.
Just be careful of SQL injection.

I suggest one step further than some of the other suggestions - think about degeneralizing at a much higher abstraction level, preferably the UI structure. Usually this seems to happen when the problem is being pondered in data mode rather than user domain mode.
In practice, I've found that almost every such query has one or more non-null, fairly selective columns that would be reasonably optimizable, if one (or more) were specified. Furthermore, these are usually reasonable assumptions that users can understand.
Example: Find Orders by Customer; or Find Orders by Date Range; or Find Orders By Salesperson.
If this pattern applies, then you can decompose your hypergeneralized query into more purposeful subqueries that also make sense to users, and you can reasonably prompt for required values (or ranges), and not worry too much about crafting efficient expressions for subsidiary columns.
You may still end up with an "All Others" category. But at least then if you provide what is essentially an open-ended Query By Example form, then users will have some idea what they're getting into. Doing what you describe really puts you in the role of trying to out-think the query optimizer, which is folly IMHO.

I'm currently working with SQL 2005, so I don't know if the 2008 optimizer acts differently. That being said, I've found that you need to do a couple of things...
Make sure that you are using OPTION (RECOMPILE) for your query (or WITH RECOMPILE on the procedure)
Use CASE statements to cause short-circuiting of the logic. At least in 2005 this is NOT done with OR statements. For example:
SELECT
    ...
FROM
    ...
WHERE
    (1 =
        CASE
            WHEN @my_column IS NULL THEN 1
            WHEN my_column = @my_column THEN 1
            ELSE 0
        END
    )
The CASE statement will cause the SQL Server optimizer to recognize that it doesn't need to continue past the first WHEN. In this example it's not a big deal, but in my search procs a non-null parameter often meant searching in another table through a subquery for existence of a matching row, which got costly. Once I made this change the search procs started running much faster.

My suggestion is to build the SQL string. You will get the best use of the indexes, and the execution plan for each generated query is reused.
DECLARE @sql nvarchar(4000);
SET @sql = N'';
-- Append each condition to what has been built so far, adding AND between conditions.
IF @param1 IS NOT NULL
    SET @sql = @sql + CASE WHEN @sql = N'' THEN N'' ELSE N' AND ' END + N'param1 = @param1';
IF @param2 IS NOT NULL
    SET @sql = @sql + CASE WHEN @sql = N'' THEN N'' ELSE N' AND ' END + N'param2 = @param2';
...
IF @paramN IS NOT NULL
    SET @sql = @sql + CASE WHEN @sql = N'' THEN N'' ELSE N' AND ' END + N'paramN = @paramN';
IF @sql <> N''
    SET @sql = N' WHERE ' + @sql;
SET @sql = N'SELECT ... FROM myTable' + @sql;
EXEC sp_executesql @sql, N'@param1 type, @param2 type, ..., @paramN type', @param1, @param2, ..., @paramN;

Each time the procedure is called with different parameters, there is a different optimal execution plan for getting the data. The problem is that SQL Server has cached one execution plan for your procedure and will reuse it even when it is sub-optimal (read: terrible) for the parameters passed.
I would recommend:
Create specific SPs for frequently run execution paths (i.e. passed parameter sets), optimised for each scenario.
Keep your main generic SP for edge cases (presuming they are rarely run) but use the WITH RECOMPILE clause to cause a new execution plan to be created each time the procedure is run.
We use OR clauses checking against NULLs for optional parameters to great effect. It works very well without the RECOMPILE option so long as the execution path is not drastically altered by passing different parameters.
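As a minimal sketch of the generic fallback, the procedure-level WITH RECOMPILE option goes between the parameter list and AS. The procedure name `mySearchProc_Generic` is hypothetical, and only two of the question's parameters are shown.

```sql
-- Sketch only: a catch-all search procedure that is recompiled on every call.
CREATE PROCEDURE dbo.mySearchProc_Generic
    @IDCriteria      bigint   = NULL,
    @MaxDateCriteria datetime = NULL
WITH RECOMPILE   -- no plan is cached; a fresh one is compiled for each execution
AS
BEGIN
    SELECT Col1
    FROM dbo.MyTable
    WHERE (@IDCriteria IS NULL OR ID = @IDCriteria)
      AND (@MaxDateCriteria IS NULL OR [Date] < @MaxDateCriteria);
END;
```

The trade-off is a compile on every call, which is why this answer reserves it for rarely run edge cases.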

Conditional Joins - Dynamic SQL

The DBA here at work is trying to turn my straightforward stored procs into a dynamic sql monstrosity. Admittedly, my stored procedure might not be as fast as they'd like, but I can't help but believe there's an adequate way to do what is basically a conditional join.
Here's an example of my stored proc:
SELECT
*
FROM
table
WHERE
(
@Filter IS NULL OR table.FilterField IN
(SELECT Value FROM dbo.udfGetTableFromStringList(@Filter, ','))
)
The UDF turns a comma delimited list of filters (for example, bank names) into a table.
Obviously, having the filter condition in the where clause isn't ideal. Any suggestions of a better way to conditionally join based on a stored proc parameter are welcome. Outside of that, does anyone have any suggestions for or against the dynamic sql approach?
Thanks

You could INNER JOIN on the table returned from the UDF instead of using it in an IN clause
Your UDF might be something like
CREATE FUNCTION [dbo].[csl_to_table] (@list varchar(8000) )
RETURNS @list_table TABLE ([id] INT)
AS
BEGIN
    DECLARE @index INT,
            @start_index INT,
            @id INT
    SELECT @index = 1
    SELECT @start_index = 1
    WHILE @index <= DATALENGTH(@list)
    BEGIN
        IF SUBSTRING(@list,@index,1) = ','
        BEGIN
            SELECT @id = CAST(SUBSTRING(@list, @start_index, @index - @start_index ) AS INT)
            INSERT @list_table ([id]) VALUES (@id)
            SELECT @start_index = @index + 1
        END
        SELECT @index = @index + 1
    END
    SELECT @id = CAST(SUBSTRING(@list, @start_index, @index - @start_index ) AS INT)
    INSERT @list_table ([id]) VALUES (@id)
    RETURN
END
and then INNER JOIN on the ids in the returned table. This UDF assumes that you're passing in INTs in your comma separated list
EDIT:
In order to handle a null or no value being passed in for @filter, the most straightforward way that I can see would be to execute a different query within the sproc based on the @filter value. I'm not certain how this affects the cached execution plan (will update if someone can confirm) or whether the end result would be faster than your original sproc; I think the answer here lies in testing.

Looks like the rewrite of the code is being addressed in another answer, but a good argument against dynamic SQL in a stored procedure is that it breaks the ownership chain.
That is, when you call a stored procedure normally, it executes under the permissions of the stored procedure owner, EXCEPT when executing dynamic SQL with the execute command: for the context of the dynamic SQL it reverts to the permissions of the caller, which may be undesirable depending on your security model.
In the end, you are probably better off compromising and rewriting it to address the concerns of the DBA while avoiding dynamic SQL.

I am not sure I understand your aversion to dynamic SQL. Perhaps it is that your UDF has nicely abstracted away some of the messiness of the problem, and you feel dynamic SQL will bring that back. Well, consider that most if not all DAL or ORM tools will rely extensively on dynamic SQL, and I think your problem could be restated as "how can I nicely abstract away the messiness of dynamic SQL".
For my part, dynamic SQL gives me exactly the query I want, and subsequently the performance and behavior I am looking for.

I don't see anything wrong with your approach. Rewriting it to use dynamic SQL to execute two different queries based on whether @Filter is null seems silly to me, honestly.
The only potential downside I can see of what you have is that it could cause some difficulty in determining a good execution plan. But if the performance is good enough as it is, there's no reason to change it.

No matter what you do (and the answers here all have good points), be sure to compare the performance and execution plans of each option.
Sometimes, hand optimization is simply pointless if it impacts your code maintainability and really produces no difference in how the code executes.
I would first simply look at changing the IN to a simple LEFT JOIN with NULL check (this doesn't get rid of your udf, but it should only get called once):
SELECT *
FROM table
LEFT JOIN dbo.udfGetTableFromStringList(@Filter, ',') AS filter
    ON table.FilterField = filter.Value
WHERE @Filter IS NULL
    OR filter.Value IS NOT NULL

It appears that you are trying to write a single query to deal with two scenarios:
1. @filter = "x,y,z"
2. @filter IS NULL
To optimise scenario 1, I would INNER JOIN on the UDF, rather than use an IN clause...
SELECT * FROM table
INNER JOIN dbo.udfGetTableFromStringList(@Filter, ',') AS filter
    ON table.FilterField = filter.Value
To optimise for scenario 2, I would NOT try to adapt the existing query. Instead I would deliberately keep the cases separate, using either an IF statement or a UNION that simulates the IF with a WHERE clause...
TSQL IF
IF (@filter IS NULL)
    SELECT * FROM table
ELSE
    SELECT * FROM table
    INNER JOIN dbo.udfGetTableFromStringList(@Filter, ',') AS filter
        ON table.FilterField = filter.Value
UNION to Simulate IF
-- table.* keeps the column list identical in both halves of the UNION ALL
SELECT table.* FROM table
INNER JOIN dbo.udfGetTableFromStringList(@Filter, ',') AS filter
    ON table.FilterField = filter.Value
UNION ALL
SELECT * FROM table WHERE @filter IS NULL
The advantage of such designs is that each case is simple, and determining which case applies is itself simple. Combining the two into a single query, however, leads to compromises such as LEFT JOINs and so introduces significant performance loss to each.
