How to optimize a SQL SELECT query with multiple left joins?

Situation:
We have a table "base1" with ~6 million rows, which holds actual customer purchases: the day of purchase plus the parameters of that purchase.
CREATE TABLE base1 (
Users_id int NOT NULL PRIMARY KEY ,
PurchaseDate date,
Parameter1 int,
Parameter2 int,
...
ParameterK int );
There is also another table, "base2", with ~90 million rows, which holds essentially the same thing, but instead of the day of purchase a weekly breakdown is used (for example: every week over 4 years for each client; even if there was no purchase in week N, the client still gets a row).
CREATE TABLE base2 (
Users_id int NOT NULL PRIMARY KEY ,
Week_start date ,
Week_end date,
Parameter1 int,
Parameter2 int,
...
ParameterN int );
The task is to run the following query:
-- a = base2 ; b, wb%% = base1
--create index idx_uid_purch_date on base1(Users_ID,Purchasedate);
SELECT a.Users_id
-- Check whether the client will make a purchase next week, and whether that purchase meets each condition
,iif(b.Users_id is not null,1,0) as User_will_buy_next_week
,iif(b.Users_id is not null and b.Parameter1 = 1,1,0) as User_will_buy_on_Condition1
-- about 12 similar iif-conditions
,iif(b.Users_id is not null and (b.Parameter1 = 1 and b.Parameter12 = 1),1,0)
as User_will_buy_on_Condition13
-- Check whether there was a purchase 1 month ago, 2 months ago, 2.5 months ago, etc.
,iif(wb1m.Users_id is null,0,1) as was_buy_1_month_ago
,iif(wb2m.Users_id is null,0,1) as was_buy_2_month_ago
,iif(wb25m.Users_id is null,0,1) as was_buy_25_month_ago
,iif(wb3m.Users_id is null,0,1) as was_buy_3_month_ago
,iif(wb6m.Users_id is null,0,1) as was_buy_6_month_ago
,iif(wb1y.Users_id is null,0,1) as was_buy_1_year_ago
,a.[Week_start]
,a.[Week_end]
into base3
FROM base2 a
-- Join for User_will_buy
left join base1 b
on a.Users_id =b.Users_id and
cast(b.[PurchaseDate] as date)>=DATEADD(dd,7,cast(a.[Week_end] as date))
and cast(b.[PurchaseDate] as date)<=DATEADD(dd,14,cast(a.[Week_end] as date))
-- Joins for was_buy
left join base1 wb1m
on a.Users_id =wb1m.Users_id
and cast(wb1m.[PurchaseDate] as date)>=DATEADD(dd,-30-4,cast(a.[Week_end] as date))
and cast(wb1m.[PurchaseDate] as date)<=DATEADD(dd,-30+4,cast(a.[Week_end] as date))
/* 4 more similar joins where different values are added in
DATEADD (dd, %%, cast (a. [Week_end] as date))
to check on the fact of purchase for a certain period */
left outer join base1 wb1y
on a.Users_id =wb1y.Users_id and
cast(wb1y.[PurchaseDate] as date)>=DATEADD(dd,-365-4,cast(a.[Week_end] as date))
and cast(wb1y.[PurchaseDate] as date)<=DATEADD(dd,-365+5,cast(a.[Week_end] as date))
Because of the huge number of joins and the rather large tables, this script runs for about 24 hours, which is unacceptably long.
Most of the time, as the execution plan shows, is spent on the "Merge Join" operators, on scanning the rows of base1 and base2, and on inserting the data into the base3 table.
Question: is it possible to optimize this query so it runs faster?
Perhaps by using a single join instead, or something similar.
Please help, I'm not smart enough for this :(
Thanks, everybody, for your answers!
UPD: maybe using a different join type (merge, loop, or hash) would help me, but I can't really test this theory. Maybe someone can tell me whether it's right or wrong ;)

You want to have all 90 million base2 rows in your result, each with additional information on base1 data. So, what the DBMS must do is a full table scan on base2 and quickly find related rows in base1.
The query with EXISTS clauses would look something like this:
select
b2.users_id,
b2.week_start,
b2.week_end,
case when exists
(
select *
from base1 b1
where b1.users_id = b2.users_id
and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
and dateadd(day, 14, cast(b2.week_end as date))
) then 1 else 0 end as user_will_buy_next_week,
case when exists
(
select *
from base1 b1
where b1.users_id = b2.users_id
and b1.parameter1 = 1
and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
and dateadd(day, 14, cast(b2.week_end as date))
) then 1 else 0 end as user_will_buy_on_condition1,
case when exists
(
select *
from base1 b1
where b1.users_id = b2.users_id
and b1.parameter1 = 1
and b1.parameter2 = 1
and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
and dateadd(day, 14, cast(b2.week_end as date))
) then 1 else 0 end as user_will_buy_on_condition13,
case when exists
(
select *
from base1 b1
where b1.users_id = b2.users_id
and b1.purchasedate between dateadd(day, -30-4, cast(b2.week_end as date))
and dateadd(day, -30+4, cast(b2.week_end as date))
) then 1 else 0 end as was_buy_1_month_ago,
...
from base2 b2;
We can easily see that this will take a long time, because all conditions must be checked per base2 row. That is 90 million rows times 7 lookups each. The only thing we can do about this is to provide an index and hope that the query will benefit from it.
create index idx1 on base1 (users_id, purchasedate, parameter1, parameter2);
We can add more indexes, so the DBMS can choose between them based on selectivity. Later we can check whether they are actually used and drop the ones that aren't (see the sketch after the index list below).
create index idx2 on base1 (users_id, parameter1, purchasedate);
create index idx3 on base1 (users_id, parameter1, parameter2, purchasedate);
create index idx4 on base1 (users_id, parameter2, parameter1, purchasedate);
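One way to check later whether these extra indexes are actually used is SQL Server's index usage statistics; a minimal sketch (the dbo schema name is an assumption):
-- Sketch: shows how often each index on base1 has been used since the last restart;
-- indexes that never show seeks, scans or lookups are candidates to drop.
SELECT i.name AS index_name,
s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
ON s.object_id = i.object_id
AND s.index_id = i.index_id
AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID('dbo.base1');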

I assume that the base1 table stores information about the current week's purchases only.
If that is true, you could drop the [PurchaseDate] column from the join conditions and use the current date as a constant instead. In that case your DATEADD functions are applied to the current date and become constants in the join conditions:
left join base1 b
on a.Users_id =b.Users_id and
DATEADD(day,-7,GETDATE())>=a.[Week_end]
and DATEADD(day,-14,GETDATE())<=a.[Week_end]
For the query above to run correctly, you should limit b.[PurchaseDate] to the current day.
Then you could run another query for the purchases made yesterday, with all the DATEADD constants in the join conditions corrected by -1.
And so on, up to 7 queries, or however many days the base1 table covers.
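A rough, hypothetical sketch of one such per-day query (shown for yesterday, i.e. offset 1, with only one indicator column for brevity; it assumes base1 holds at most the last 7 days of purchases):
-- Sketch only: @offset = 0 means today, 1 means yesterday, and so on up to 6.
DECLARE @offset int = 1;
SELECT a.Users_id
,iif(b.Users_id is not null,1,0) as User_will_buy_next_week
,a.[Week_start]
,a.[Week_end]
FROM base2 a
left join base1 b
on a.Users_id = b.Users_id
and cast(b.[PurchaseDate] as date) = cast(DATEADD(day, -@offset, GETDATE()) as date)
and DATEADD(day, -7 - @offset, cast(GETDATE() as date)) >= a.[Week_end]
and DATEADD(day, -14 - @offset, cast(GETDATE() as date)) <= a.[Week_end]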
You could also group the [PurchaseDate] values by day, recalculate the constants and do all of that in a single query, but I'm not ready to spend time creating it myself. :)

If you have a recurring expression such as DATEADD(dd,-30-4, cast(a.[Week_end] as date)), one way to make it SARGable is to create an index on the expression itself. SQL Server can't do that directly, but Postgres can:
create index ix_base2__34_days_ago on base2(DATEADD(dd,-30-4, cast([Week_end] as date)))
With such an index in place, the index on DATEADD(dd,-30-4, cast([Week_end] as date)) can be utilized by the database, so a condition like the following will be fast:
and cast(wb1m.[PurchaseDate] as date) >= DATEADD(dd,-30-4,cast(a.[Week_end] as date))
Note that casting PurchaseDate to date still yields a SARGable expression, even though CAST looks like a function: SQL Server has special handling for datetime-to-date conversion, so an index on a datetime column can be used even when you search on the date part only. This is similar to a partial string match such as WHERE lastname LIKE 'Mc%', which is SARGable even though the index is on the whole lastname column. But I digress.
To approximate an index on an expression in SQL Server, you can create a computed column for that expression, e.g.:
CREATE TABLE base2 (
Users_id int NOT NULL PRIMARY KEY ,
Week_start date ,
Week_end date,
Parameter1 int,
Parameter2 int,
Thirty4DaysAgo as DATEADD(dd,-30-4, cast([Week_end] as date))
)
..and then create index on that column:
create index ix_base2_34_days_ago on base2(Thirty4DaysAgo)
Then change your expression to:
and cast(wb1m.[PurchaseDate] as date) >= a.Thirty4DaysAgo
That is what I would have recommended before: change the old expression to use the computed column. However, upon further searching, it looks like you can keep your original code, because SQL Server can match an expression to an equivalent computed column; if there is an index on that column, your expression becomes SARGable. Your DBA can therefore optimize things behind the scenes, and your original code runs optimized without requiring any changes. So there is no need to change the following, and it will be SARGable (provided your DBA created a computed column for the recurring DATEADD expression and put an index on it):
and cast(wb1m.[PurchaseDate] as date) >= DATEADD(dd,-30-4,cast(a.[Week_end] as date))
The only downside (when compared to Postgres) is you still have the dangling computed column on your table when using SQL Server :)
Good read: https://littlekendra.com/2016/03/01/sql-servers-year-function-and-index-performance/

Why does the query optimizer select completely different query plans?

Let us have the following table in SQL Server 2016
-- generating 1M test table with four attributes
WITH x AS
(
SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)
), t1 AS
(
SELECT ones.n + 10 * tens.n + 100 * hundreds.n + 1000 * thousands.n + 10000 * tenthousands.n + 100000 * hundredthousands.n as id
FROM x ones, x tens, x hundreds, x thousands, x tenthousands, x hundredthousands
)
SELECT id,
id % 50 predicate_col,
row_number() over (partition by id % 50 order by id) join_col,
LEFT('Value ' + CAST(CHECKSUM(NEWID()) AS VARCHAR) + ' ' + REPLICATE('*', 1000), 1000) as padding
INTO TestTable
FROM t1
GO
-- setting the `id` as a primary key (therefore, creating a clustered index)
ALTER TABLE TestTable ALTER COLUMN id int not null
GO
ALTER TABLE TestTable ADD CONSTRAINT pk_TestTable_id PRIMARY KEY (id)
-- creating a non-clustered index
CREATE NONCLUSTERED INDEX ix_TestTable_predicate_col_join_col
ON TestTable (predicate_col, join_col)
GO
OK, and now when I run the following queries, which have just slightly different predicates (b.predicate_col <= 0 vs. b.predicate_col = 0), I get completely different plans.
-- Q1
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col <= 0
option (maxdop 1)
-- Q2
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col = 0
option (maxdop 1)
If I look at the query plans, it is clear that in the case of Q1 the optimizer chooses to join the key lookup together with the non-clustered index seek first and then does the final join with the non-clustered index (which is bad). The case of Q2 shows a much better solution: the optimizer joins the non-clustered indexes first and then does the final key lookup.
The question is: why is that and can I improve it somehow?
In my intuitive understanding of histograms, it should be easy to estimate the correct result for both variants of the predicate (b.predicate_col <= 0 vs. b.predicate_col = 0), so why the different query plans?
EDIT:
Actually, I do not want to change the indexes or the physical structure of the table. I would like to understand why the optimizer picks such a bad query plan in the case of Q1. Therefore, my question is precisely this:
Why does the optimizer pick such a bad query plan in the case of Q1, and can I improve it without altering the physical design?
I have checked the estimates in the query plans, and both plans have exact row-count estimates for every operator! I have checked the memo structure (OPTION (QUERYTRACEON 3604, QUERYTRACEON 8615, QUERYTRACEON 8620)) and the rules applied during compilation (OPTION (QUERYTRACEON 3604, QUERYTRACEON 8619, QUERYTRACEON 8620)), and it seems that the optimizer finishes the plan search once it hits the first plan. Is this the reason for such behaviour?
This is caused by SQL Server's inability to use Index Columns to the Right of the Inequality search.
This code produces the same issue:
SELECT * FROM TestTable WHERE predicate_col <= 0 and join_col = 1
SELECT * FROM TestTable WHERE predicate_col = 0 and join_col <= 1
Inequality predicates such as >= or <= impose a limitation: the optimiser can't use the index columns to the right of the inequality. So when you put an inequality on [predicate_col] you render the rest of the index useless; SQL Server can't make full use of the index and produces an alternative (bad) plan. [join_col] is the last column in the index, so in the second query SQL Server can still make full use of the index.
The reason SQL Server opts for the Hash Match is that it can't guarantee the order of the data coming out of table B. The inequality renders [join_col] in the index useless, so SQL Server has to prepare for unsorted data on the join, even though the row count is the same.
The only way to fix your problem (even though you don't like it) is to alter the Index so that Equality columns come before Inequality columns.
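As a sketch, such a reordered index for this particular query might look like the following; the INCLUDE list is an assumption, added only so the selected padding column does not require a key lookup:
-- Equality column (join_col) first, inequality column (predicate_col) second.
CREATE NONCLUSTERED INDEX ix_TestTable_join_col_predicate_col
ON TestTable (join_col, predicate_col)
INCLUDE (padding);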
OK, this could be answered from the statistics and histogram point of view, and also from the index structure point of view. I am going to answer it from the index structure side.
Note that you get the same result from both queries because there are no records with predicate_col < 0.
When there is a range predicate on a column of a composite index, the index columns after it are not fully utilised. There can also be many other reasons for an index not being used.
-- Q1
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col <= 0
option (maxdop 1)
If we want a plan like the one for Q2, we can create another composite index:
-- creating a non-clustered index
CREATE NONCLUSTERED INDEX ix_TestTable_predicate_col_join_col_1
ON TestTable (join_col,predicate_col)
GO
With this index we get a query plan exactly like Q2's.
Another way is to define a CHECK constraint on predicate_col:
Alter table TestTable ADD check (predicate_col>=0)
GO
This also gives the same query plan as Q2.
Whether you can create a CHECK constraint or another composite index on your real table and data is, of course, another discussion.

Avoiding repetitive conditions in the select case and where clause

I have a table say TAB1 with the following columns -
USER_ID NUMBER(5),
PHN_NO1 CHAR(20),
PHN_NO2 CHAR(20)
I have to fetch records from TAB1 into another table TAB2, selecting all records where either one or both of PHN_NO1 and PHN_NO2 are of length 10 and begin with 5.
If, in a record, only PHN_NO1 satisfies the condition and PHN_NO2 does not, then TAB2.P1 should be the same as TAB1.PHN_NO1 but TAB2.P2 should be NULL.
If neither of the two satisfies the condition, then the record should not be inserted into TAB2.
The structure of TAB2 would be as follows:
USER_ID number(5) - holding the ROWID of the record selected from TAB1
P1 char(10) - holding TAB1.PHN_NO1 if it is of length 10 and begins with 5, otherwise NULL
P2 char(10) - holding TAB1.PHN_NO2 if it is of length 10 and begins with 5, otherwise NULL
I could write the query below to achieve this, but the conditions in the CASE and WHERE are repetitive. Please suggest a better way to achieve the same result.
CREATE TABLE TAB2
AS
SELECT
USER_ID,
CASE WHEN
(LENGTH(TRIM(PHN_NO1)) = 10 AND TRIM(PHN_NO1) like '5%')
THEN
CAST(TRIM(PHN_NO1) as CHAR(10))
ELSE
CAST(NULL as CHAR(10))
END AS P1,
CASE WHEN
(LENGTH(TRIM(PHN_NO2)) = 10 AND TRIM(PHN_NO2) like '5%')
THEN
CAST(TRIM(PHN_NO2) as CHAR(10))
ELSE
CAST(NULL as CHAR(10))
END AS P2
FROM TAB1
WHERE
(LENGTH(TRIM(PHN_NO1)) = 10 AND TRIM(PHN_NO1) like '5%')
OR
(LENGTH(TRIM(PHN_NO2)) = 10 AND TRIM(PHN_NO2) like '5%')
Sure you can! You do have to use some conditions though:
INSERT INTO New_Phone
SELECT user_id, phn_no1, phn_no2
FROM (SELECT user_id,
CASE WHEN LENGTH(TRIM(phn_no1)) = 10 AND TRIM(phn_no1) like '5%'
THEN SUBSTR(phn_no1, 1, 10) ELSE NULL END phn_no1,
CASE WHEN LENGTH(TRIM(phn_no2)) = 10 AND TRIM(phn_no2) like '5%'
THEN SUBSTR(phn_no2, 1, 10) ELSE NULL END phn_no2
FROM Old_Phone) Old
WHERE phn_no1 IS NOT NULL
OR phn_no2 IS NOT NULL;
(I have a working SQL Fiddle example.)
This should work on any RDBMS. Note that, because of your data, this isn't likely to be less performant than your original (which would not have used an index, given the TRIM()). It's also not likely to be better, given that most major RDBMSs are able to re-use the results of deterministic functions per-row.
Oh, it should be noted that, internationally, phone numbers can be up to 15 digits in length (with a minimum in-country of 6 or less). Maybe use VARCHAR (and save yourself some TRIM()s, too)? And INTEGER (or BIGINT, maybe TINYINT) is more often used for surrogate ids, NUMBER is a little strange.
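For illustration, a hypothetical TAB2 definition along those lines (the 15-character length follows the international limit mentioned above; adjust to your data):
CREATE TABLE TAB2 (
USER_ID INTEGER,
P1 VARCHAR(15),
P2 VARCHAR(15)
);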

Delete with WHERE - date, time and string comparison - very slow

I have a slow performing query and was hoping someone with a bit more knowledge in sql might be able to help me improve the performance:
I have 2 tables, a Source and a Common. I load in some data which contains a Date, a Time and a String (which is a server name), plus some..
The Source table can contain 40k+ rows (it has 30-odd columns: a mix of ints, dates, times and some varchars (255)/(Max)).
I use the below query to remove any data from Common that is in source:
Delete from Common where convert(varchar(max),Date,102)+convert(varchar(max),Time,108)+[ServerName] in
(Select convert(varchar(max),[date],102)+convert(varchar(max),time,108)+ServerName from Source where sc_status < 300)
The Source Fields are in this format:
ServerName varchar(255) I.E SN1234
Date varchar(255) I.E 2012-05-22
Time varchar(255) I.E 08:12:21
The Common Fields are in this format:
ServerName varchar(255) I.E SN1234
Date date I.E 2011-08-10
Time time(7) I.E 14:25:34.0000000
Thanks
Converting both sides to strings, then concatenating them into one big string, then comparing those results is not very efficient. Only do conversions where you have to. Try this example and see how it compares:
DELETE c
FROM dbo.Common AS c
INNER JOIN dbo.Source AS s
ON s.ServerName = c.ServerName
AND CONVERT(DATE, s.[Date]) = c.[Date]
AND CONVERT(TIME(7), s.[Time]) = c.[Time]
WHERE s.sc_status < 300;
All those conversions to VARCHAR(MAX) are unnecessary and probably slowing you down. I would start with something like this instead:
DELETE c
from [Common] c
WHERE EXISTS(
SELECT 1
FROM Source
WHERE CAST([Date] AS DATE)=c.[Date]
AND CAST([Time] AS TIME(7))=c.[Time]
AND [ServerName]=c.[ServerName]
AND sc_status < 300
);
Something like
DELETE Common
FROM Common
INNER JOIN Source
ON Common.ServerName = Source.ServerName
AND Common.[Date] = CONVERT(date, Source.[Date])
AND Common.[Time] = CONVERT(time, Source.[Time])
AND Source.sc_Status < 300
If it's too slow after that, then you need some indexes, possibly on both tables.
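As a sketch, supporting indexes might look like the following; the dbo schema and the exact column order are assumptions and should be checked against the actual plan:
-- Supports seeking into Common by the join keys.
CREATE INDEX IX_Common_ServerName_Date_Time ON dbo.Common (ServerName, [Date], [Time]);
-- Supports filtering Source on sc_status without touching the wide base rows.
CREATE INDEX IX_Source_Status ON dbo.Source (sc_status) INCLUDE (ServerName, [Date], [Time]);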
Removing the unnecessary conversions will help a lot, as detailed in Aaron's answer. You might also consider creating an indexed view over the top of the log table, since you probably don't have much flexibility in that schema or in the insert DML from the log parser.
Simple example:
create table dbo.[Source] (LogId int primary key, servername varchar(255),
[date] varchar(255), [time] varchar(255));
insert into dbo.[Source]
values (1, 'SN1234', '2012-05-22', '08:12:21'),
(2, 'SN5678', '2012-05-23', '09:12:21')
go
create view dbo.vSource with schemabinding
as
select [LogId],
[servername],
[date],
[time],
[actualDateTime] = convert(datetime, [date]+' '+[time], 120)
from dbo.[Source];
go
create unique clustered index UX_Source on vSource(LogId);
create nonclustered index IX_Source on vSource(actualDateTime);
This will give you an indexed datetime column on which to seek and vastly improve your execution plans at the cost of some insert performance.
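For example, a query against the view can then seek on the persisted datetime; note that on editions other than Enterprise you generally need the NOEXPAND hint for the view's index to be used (the date range below is just illustrative):
SELECT LogId, servername, actualDateTime
FROM dbo.vSource AS v WITH (NOEXPAND)
WHERE v.actualDateTime >= '2012-05-22T00:00:00'
AND v.actualDateTime < '2012-05-23T00:00:00';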

SQL Server - query optimization with many columns

we have "Profile" table with over 60 columns like (Id, fname, lname, gender, profilestate, city, state, degree, ...).
users search other peopel on website. query is like :
WITH TempResult as (
select ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum, profile.id from Profile
where
(@a is null or a = @a) and
(@b is null or b = @b) and
...(over 60 columns)
)
SELECT profile.* FROM TempResult join profile on TempResult.id = profile.id
WHERE
(RowNum >= @FirstRow)
AND
(RowNum <= @LastRow)
By default SQL Server uses the clustered index to execute the query, but the total execution time is over 300. We tested another solution, such as a multi-column index on all the columns in the WHERE clause, but the total execution time was over 400.
Do you have any solution to bring the total execution time below 100?
We are using SQL Server 2008.
Unfortunately I don't think there is a pure SQL solution to your issue. Here are a couple alternatives:
Dynamic SQL - build up a query that only includes WHERE clause conditions for values that are actually provided. Assuming the average search actually only fills in 2-3 fields, indexes could be added and utilized (a minimal sketch follows this list).
Full Text Search - go to something more like a Google keyword search. No individual options.
Lucene (or something else) - Search outside of SQL; This is a fairly significant change though.
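A minimal sketch of the dynamic SQL idea (the city/state columns and the parameter types are illustrative, not taken from the real Profile table):
-- Only the filters whose parameters were actually supplied end up in the WHERE clause.
DECLARE @city varchar(50) = NULL,
@state varchar(50) = 'MI';
DECLARE @sql nvarchar(max) = N'SELECT id FROM Profile WHERE 1 = 1';
IF @city IS NOT NULL SET @sql += N' AND city = @city';
IF @state IS NOT NULL SET @sql += N' AND state = @state';
-- ...repeat for the other searchable columns...
EXEC sp_executesql @sql,
N'@city varchar(50), @state varchar(50)',
@city = @city, @state = @state;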
One other option, which I remember implementing in a system once: create a vertical table that includes all of the data you are searching on and build up a query against it. This is easiest to do with dynamic SQL, but could be done using table-valued parameters or a temp table in a pinch.
The idea is to make a table that looks something like this:
Profile ID
Attribute Name
Attribute Value
The table should have a unique index on (Profile ID, Attribute Name) (unique to make the search work properly, index will make it perform well).
In this table you'd have rows of data like:
(1, 'city', 'grand rapids')
(1, 'state', 'MI')
(2, 'city', 'detroit')
(2, 'state', 'MI')
Then your SQL will be something like:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
WHERE (AttributeName = 'city' AND AttributeValue = 'grand rapids')
OR (AttributeName = 'state' AND AttributeValue = 'MI')
GROUP BY ProfileID
HAVING COUNT(*) = 2
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
Like I said, you could use a temp table that has attribute name/values:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
JOIN PassedInAttributeTable ON ProfileAttributes.AttributeName = PassedInAttributeTable.AttributeName
AND ProfileAttributes.AttributeValue = PassedInAttributeTable.AttributeValue
GROUP BY ProfileID
HAVING COUNT(*) = CountOfRowsInPassedInAttributeTable -- calculate or pass in
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
As I recall, this ended up performing very well, even on fairly complicated queries (though I think we only had 12 or so columns).
As a single query, I can't think of a clever way of optimising this.
Provided that each column's check is highly selective, however, the following (very long-winded) code might prove faster, assuming each individual column has its own separate index...
WITH
filter AS (
SELECT
[a].*
FROM
(SELECT * FROM Profile WHERE @a IS NULL OR a = @a) AS [a]
INNER JOIN
(SELECT id FROM Profile WHERE b = @b UNION ALL SELECT NULL WHERE @b IS NULL) AS [b]
ON ([a].id = [b].id) OR ([b].id IS NULL)
INNER JOIN
(SELECT id FROM Profile WHERE c = @c UNION ALL SELECT NULL WHERE @c IS NULL) AS [c]
ON ([a].id = [c].id) OR ([c].id IS NULL)
.
.
.
INNER JOIN
(SELECT id FROM Profile WHERE zz = @zz UNION ALL SELECT NULL WHERE @zz IS NULL) AS [zz]
ON ([a].id = [zz].id) OR ([zz].id IS NULL)
)
, TempResult as (
SELECT
ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum,
[filter].*
FROM
[filter]
)
SELECT
*
FROM
TempResult
WHERE
(RowNum >= @FirstRow)
AND (RowNum <= @LastRow)
EDIT
Also, thinking about it, you may even get the same result just by having the 60 individual indexes. SQL Server can do INDEX MERGING...
You've got several issues, IMHO. One is that you're going to end up with a sequential scan no matter what you do.
But I think your more crucial issue here is that you have an unnecessary join:
SELECT profile.* FROM TempResult
WHERE
(RowNum >= @FirstRow)
AND
(RowNum <= @LastRow)
This is a classic "SQL filter" query problem. I've found that the typical approaches of "(@b is null or b = @b)" and its common derivatives all yield mediocre performance. The OR clause tends to be the cause.
Over the years I've done a lot of perf tuning and query optimisation. The approach I've found best is to generate dynamic SQL inside a stored proc. Most times you also need to add WITH RECOMPILE on the statement. The stored proc helps reduce the potential for SQL injection attacks, and the recompile is needed to force the selection of indexes appropriate to the parameters you are actually searching on.
Generally it is at least an order of magnitude faster.
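A lighter-weight sketch of the recompile idea, keeping the static filter pattern (behaviour depends on the SQL Server build, and the column types below are assumptions):
-- OPTION (RECOMPILE) asks for a per-execution plan, so branches whose
-- parameter is NULL can be optimised away at compile time.
DECLARE @a int = NULL, @b int = 1; -- illustrative parameters
SELECT p.id
FROM Profile AS p
WHERE (@a IS NULL OR p.a = @a)
AND (@b IS NULL OR p.b = @b)
OPTION (RECOMPILE);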
I agree you should also look at the points mentioned above, such as:
If you commonly refer to only a small subset of the columns, you could create non-clustered "covering" indexes.
Highly selective columns (i.e. those with many unique values) work best as the leading column in an index.
If many columns have a very small number of values, consider using the BIT datatype, or create your own bitmasked BIGINT to represent many columns, i.e. a form of "enumerated datatype". But be careful, as any function in the WHERE clause (like MOD or bitwise AND/OR) will prevent the optimiser from choosing an index. It works best if you know the value for each and can combine them into an equality or range query.
While it is often good to find row IDs with a small query and then join to retrieve all the other columns you want (as you are doing above), this approach can sometimes backfire: if the first part of the query does a clustered index scan, it is often faster to get the other columns you need in the select list and save the second table scan.
So it is always good to try it both ways and see what works best.
Remember to run SET STATISTICS IO ON and SET STATISTICS TIME ON before running your tests. Then you can see where the IO goes, which may help you with index selection for the most frequent combinations of parameters.
I hope this makes sense without long code samples (they are on my other machine).

sql query with conditional where only works sometimes

I'm creating a report (in Crystal Reports XI) based on a SQL stored procedure in a database. The query accepts a few parameters, and returns records within the specified date range. If parameters are passed in, they are used to determine which records to return. If one or more parameters are not passed in, that field is not used to limit the types of records returned. It's a bit complicated, so here's my WHERE clause:
WHERE ((Date > @start_date) AND (Date < @end_date))
AND (@EmployeeID IS NULL OR emp_id = @EmployeeID)
AND (@ClientID IS NULL OR client_id = @ClientID)
AND (@ProjectID IS NULL OR project_id = @ProjectID)
AND (@Group IS NULL OR group = @Group)
Now, for the problem:
The query (and report) works beautifully for old data, within the range of years 2000-2005. However, the WHERE clause is not filtering the data properly for more recent years: it only returns records where the parameter @Group is NULL (i.e. not passed in).
Any hints, tips, or leads are appreciated!
Solved!
It actually had nothing to do with the WHERE clause, after all. I had let SQL Server generate an inner join for me, which should have been a LEFT join: many records from recent years do not contain entries in the joined table (expenses), so they weren't showing up. Interestingly, the few recent records that do have entries in the expenses table have a NULL value for group, which is why I got records only when @Group was NULL.
Morals of the story: 1. Double check anything that is automatically generated; and 2. Look out for NULL values! (n8wl - thanks for giving me the hint to look closely at NULLs.)
What are the chances that your newer data (post-2005) has some rows with NULLs in emp_id, client_id, project_id, or group? If they are NULL they can't match the parameters you're passing.
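A quick way to check is to count the NULLs in the newer rows; the table name TimeEntries below is hypothetical, so substitute your real one:
-- Count post-2005 rows with a NULL in each of the filtered columns.
SELECT
SUM(CASE WHEN emp_id IS NULL THEN 1 ELSE 0 END) AS null_emp_id,
SUM(CASE WHEN client_id IS NULL THEN 1 ELSE 0 END) AS null_client_id,
SUM(CASE WHEN project_id IS NULL THEN 1 ELSE 0 END) AS null_project_id,
SUM(CASE WHEN [group] IS NULL THEN 1 ELSE 0 END) AS null_group
FROM TimeEntries
WHERE [Date] > '2005-12-31';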
Since Date and group are reserved words, you might try putting square brackets around those fields so they aren't treated as keywords. Doing so can get rid of "odd" issues like this. That would make it:
WHERE (([Date] > @start_date) AND ([Date] < @end_date))
AND (@EmployeeID IS NULL OR emp_id = @EmployeeID)
AND (@ClientID IS NULL OR client_id = @ClientID)
AND (@ProjectID IS NULL OR project_id = @ProjectID)
AND (@Group IS NULL OR [group] = @Group)