I have a table in my database where HusbandPersonId and WifePersonId are foreign keys to another table called Person, and the start/end dates indicate when the marriage starts and when it ends.
And I have this query:
SELECT DISTINCT A.WifePersonId
FROM Couple A
INNER JOIN Couple B
    ON A.WifePersonId = B.WifePersonId
   AND A.HusbandPersonId <> B.HusbandPersonId
   AND A.StartDate < B.EndDate
   AND A.EndDate > B.StartDate;
which returns any wife who is married to more than one person at the same time.
Now I would like to add an index to speed up this query.
Which index would be best, and what is the execution plan of the query before and after the index is added?
This is a homework assignment; I have searched a lot but couldn't find any helpful topic.
Can anyone help with this?
General indexing rules suggest including all fields in your join condition. The order of these fields may have an effect. Again, general rules suggest ordering the fields in order of increasing cardinality (# of unique values for each field).
So you might try:
CREATE INDEX Couple__multi ON Couple (StartDate, EndDate, WifePersonId, HusbandPersonId);
This assumes that the # of distinct start/end dates is less than the number of unique wife/husband PersonIds.
These rules are basically Indexing 101 type things. Most of the time they will get you an acceptable level of performance. It depends highly on your data and application if this is sufficient for your purposes.
Personally, I have not thought much of the SQL Performance Analyzer's suggested indexes, but I last took a serious look at them in SQL Server 2005. I'm sure there has been some improvement since.
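As for the execution-plan part of your question: capture the plan yourself before and after creating the index and compare. A minimal sketch, assuming SQL Server (other engines use EXPLAIN instead):

SET SHOWPLAN_TEXT ON;
GO
-- Run the query; SQL Server prints the estimated plan instead of executing it.
SELECT DISTINCT A.WifePersonId
FROM Couple A
INNER JOIN Couple B
    ON A.WifePersonId = B.WifePersonId
   AND A.HusbandPersonId <> B.HusbandPersonId
   AND A.StartDate < B.EndDate
   AND A.EndDate > B.StartDate;
GO
SET SHOWPLAN_TEXT OFF;
GO

Before the index you would typically see a full scan on both sides of the self-join; afterwards, the optimizer has the option of seeking or scanning the much narrower index instead.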
Hope this helps.
I have to optimize the following query with the help of indexes.
SELECT f.*
FROM first f
JOIN second s on f.attributex_id = s.id
WHERE f.attributex_id IS NOT NULL AND f.attributey_id IS NULL
ORDER BY s.month ASC LIMIT 100;
Further info:
attributex_id is a foreign key pointing to second.id
attributey_id is a foreign key pointing to another table not used in the query
Changing the query is not an option
For most entries (98%) in first, f.attributex_id IS NOT NULL will be true. The same goes for the second condition, f.attributey_id IS NULL.
I tried to add an index as follows.
CREATE INDEX index_for_first
ON first (attributex_id, attributey_id)
WHERE attributex_id IS NOT NULL AND (attributey_id IS NULL)
But the index is not used (checked via EXPLAIN ANALYZE) when executing the query. What kind of indexes would I need to optimize the query, and what am I doing wrong with the above index?
Does an index on s.month make sense, too (month is unique)?
Based on the query text and the fact that nearly all records in first satisfy the where clause, what you're essentially trying to do is
identify the 100 second records with the lowest month value
output the contents of the related records in the first table.
To achieve that you can create indexes on the following (see the sketch after this list):
second.month
first.attributex_id
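A minimal sketch of those two indexes (index names are my own placeholders; the syntax assumes PostgreSQL, which your partial-index attempt suggests you're using):

CREATE INDEX second_month_idx ON second (month);
CREATE INDEX first_attributex_id_idx ON first (attributex_id);

The first lets the engine walk second in month order and stop after 100 rows; the second turns each join lookup into an index search on first.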
Caveats
Since this query must be optimized, it's safe to say there are many rows in both tables. Since there are only 12 months in the year, the output of the query is probably not deterministic (i.e., it may return a different set of rows each time it's run, even if there is no activity in either table between runs) since many records likely share the same value for month. Adding "tie breaker" column(s) to the index on second may help, though your order by only includes month, so no guarantees. Also, if second.month can have null values, you'll need to decide whether those null values should collate first or last among values.
Also, this particular query is not the only one being run against your data. These indexes will take up disk space and incrementally slow down writes to the tables. If you have a dozen queries that perform poorly, you might fall into a trap of creating a couple indexes to help each one individually and that's not a solution that scales well.
Finally, you stated that
changing the query is not an option
Does that mean you're not allowed to change the text of the query, or the output of the query?
I personally feel like re-writing the query to select from second and then join first makes the goal of the query more obvious. The fact that your initial instinct was to add indexes to first lends credence to this idea. If the query were written as follows, it would have been more obvious that the thing to do is facilitate efficient access to the tiny set of rows in second that you're interested in:
...
from second s
join first f ...
where ...
order by s.month asc limit 100;
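Spelled out with the columns from your original query, that rewrite might look like this (purely as an illustration, since you said the text can't change):

SELECT f.*
FROM second s
JOIN first f ON f.attributex_id = s.id
WHERE f.attributex_id IS NOT NULL
  AND f.attributey_id IS NULL
ORDER BY s.month ASC
LIMIT 100;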
I'm attempting to create a view to join in hierarchical data to a normalized dataset in SQL using a profileID field.
The issue that I'm having is that my company's hierarchy data is, for lack of a better term, gapped. There are startdate and enddate fields that need to be considered in the join.
Currently I'm working with something like the following -
SELECT *
FROM dbo.datatable dt
INNER JOIN dbo.hierarchy h
    ON dt.profileid = h.profileid
   AND dt.date >= h.startdate
   AND dt.date < h.enddate;
I've got a clustered index on dt that includes date and profileid, and a clustered index on h that includes startdate, enddate, and profileid. SSMS has also suggested a couple of indexes, which I've added as well; they include a lot of the data fields.
I cannot change the format of the hierarchy, but the view is absurdly slow when I try to pull a large number of days in a SQL query. This dataset is end-user facing, so it's gotta be fast and usable.
Any tips are greatly appreciated!
In both tables, put profileId first in the index. This is because it is tested with =.
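As a sketch, with hypothetical index names (syntax assumes SQL Server, since you mention SSMS):

CREATE INDEX IX_datatable_profile_date ON dbo.datatable (profileid, date);
CREATE INDEX IX_hierarchy_profile_dates ON dbo.hierarchy (profileid, startdate, enddate);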
Alas, that is probably the only optimization you can do. Looking at a date range degenerates into using only one of the tests, and leads to scanning up to half the table. Or, after testing profileId, scanning half the rows for that profileId.
If the start-end ranges never overlap, there may be a trick to make things faster, but it will involve changes to the schema and code.
I've got some data I'd like to pull off our SQL server.
This old database does not have any primary keys associated with it, so pulling data is like querying an Excel spreadsheet (what it actually originated as years ago).
I need to run reports on this data, though.
Currently, I get a list of distinct serial numbers for a given time period, then pull all of the records for a given serial number. For a 1-month time frame, this can be 1500 to 3000 serial numbers. The serial number field is formatted as char(20), even though the serial numbers are only 15 characters long.
BEGIN UPDATE
There are typically 5 to 15 entries in this table per Serial_Number.
There are at most 10 machines writing data to this table, so identical Date_Time values are possible.
END UPDATE
This process takes a while, but between different serial numbers in the list, I am able to update the Windows Form with a Progress Bar so management knows something is happening and about how much longer to expect.
I am always trying to make this query run faster.
Now, I am thinking about pulling the data I need using a WHERE clause such as:
SELECT Col1, Col2, Col3
FROM Table1
WHERE Serial_Number IN (
    SELECT DISTINCT Serial_Number
    FROM Table1
    WHERE Date_Time BETWEEN #startDate AND #endDate
);
My question is: are there any issues I could run into with this, particularly because we have so many distinct serial numbers during a given time frame?
And, of course, you know someone in Management is going to try running a year's worth of data when they are bored! Then, they are going to try running data since Jesus was born, just because they've got nothing better to do.
Restate Question: Is there a limit to the WHERE clause's IN method that limits the number of items I can pass in?
Index Serial_Number and Date_Time in Table1 (with separate indexes, not a single compound index) and this should perform fairly well for you unless the table is really truly ginormous.
You might get a little more speed with one index on Serial_Number and the second on (Date_Time, Serial_Number). That second index covers the subquery, allowing it to be answered from the index alone.
Note: I'm suggesting plain indexes, not primary keys; unlike primary keys, indexes don't require uniqueness.
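In index form, something like this (index names are placeholders):

CREATE INDEX IX_Table1_Serial_Number ON Table1 (Serial_Number);
CREATE INDEX IX_Table1_Date_Time_Serial ON Table1 (Date_Time, Serial_Number);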
Well, in the naive case where there are no indexes (which it sounds like is your case) you're going to have to scan over all the rows in Table1 to perform the DISTINCT on Serial_Number anyway. So I'm not sure it's going to help you much.
I would highly recommend the following:
Use the execution plan to determine what's going on in your query, and
Use that information to add some relevant indexes to speed your operations.
Just from what we see here, it sounds like Date_Time would be a good candidate for a clustered index in Table1.
Edit:
To make a nonunique clustered index as I describe above, you can use the following:
CREATE CLUSTERED INDEX IX_Table1_Date_Time
ON Table1 (Date_Time)
(from http://msdn.microsoft.com/en-us/library/aa258260(v=sql.80).aspx)
This will reorder your table such that all rows are sorted in Date_Time order. Further work with the execution plan will help identify other indexes that may greatly help your performance, depending on the exact types of queries you run.
Honestly, I see no benefit to the WHERE clause as it is written.
You use an expensive inner query, but don't do anything meaningful with the results. I don't even see you getting the Serial_Number in the results anywhere. However, based on your question, it does sound like you need it.
I don't see the need for the DISTINCT keyword on Serial_Number, since it would not eliminate any duplicates from the outer query's results anyway.
What is wrong with doing this?
SELECT Serial_Number, Col1, Col2, Col3
FROM Table1
WHERE Date_Time BETWEEN #startDate AND #endDate
This should do the same thing as your original query. But it would eliminate the expensive nested query.
Just put an index on Date_Time and it should work. This would also eliminate the need for the index on Serial_Number.
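For example (index name is a placeholder):

CREATE INDEX IX_Table1_Date_Time ON Table1 (Date_Time);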
Apparently, there is no way to tell what the maximum length of the WHERE X IN (...) list can be.
For now, this is the answer.
If, at some later point in time, someone comes along and finds something to the contrary, please post that answer and I will mark it as such.
Thanks,
Joe
I'm experiencing some heavy performance issues with a query in SQLite. Currently there are around 20,000 entries in the table activity_tbl and about 40 in the table activity_data_tbl. I have indexes on both of the columns used in the query below, but they don't seem to have any effect on performance at all.
SELECT a._id, a.start_time + b.length AS time
FROM activity_tbl a
INNER JOIN activity_data_tbl b
    ON a.activity_data_id = b._data_id
WHERE time > ?
ORDER BY 2
LIMIT 1;
As you can see, I select one column and a value created by adding two columns together. I guess this is what's causing the poor performance, since the query is very fast if I just select a.start_time or b.length.
Do you guys have any suggestion for how I could optimize this?
Try putting an index on the time column. This should speed up the query.
This query is not optimizable using indexes for the filter part, since you are filtering and ordering on a calculated value. To optimize the query you will either need to filter on one of the actual table columns (start_time or length) or pre-compute the time values before querying.
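One hedged sketch of the pre-computation route, assuming you can denormalize the sum into a new column on activity_tbl and keep it in sync from your application code (the column and index names are my own):

-- One-time backfill of the combined value
ALTER TABLE activity_tbl ADD COLUMN end_time INTEGER;
UPDATE activity_tbl
SET end_time = start_time + (SELECT b.length
                             FROM activity_data_tbl b
                             WHERE b._data_id = activity_tbl.activity_data_id);
CREATE INDEX activity_tbl_end_time_idx ON activity_tbl (end_time);

-- The lookup then becomes a plain indexed range scan:
SELECT _id, end_time AS time
FROM activity_tbl
WHERE end_time > ?
ORDER BY end_time
LIMIT 1;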
The only place an index will help, and I assume you have one, is on b._data_id.
A compound index may help. According to its docs, SQLite tries to avoid accessing the table if the index has enough information. So if the engine does its homework, it will recognize that the index is enough to compute the WHERE clause value and save some time. If that does not work, only pre-computation will do.
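For example, a compound index on the activity_tbl side might look like this (the name is a placeholder):

CREATE INDEX activity_tbl_data_start_idx ON activity_tbl (activity_data_id, start_time);

With that, SQLite can in principle resolve the join and compute start_time + length without touching the table rows themselves, though whether it actually does is worth verifying with EXPLAIN QUERY PLAN.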
If you are often confronted with similar tasks, please read this: http://www.sqlite.org/rtree.html
I have a query that joins 3 tables in SQL Server 2005, but has no Where clause, so I am indexing the fields found in the join statement.
If my index is set to (Col1, Col2, Col3)
And my join is
Tbl1
INNER JOIN Tbl2
    ON Tbl1.col3 = Tbl2.col3
   AND Tbl1.col2 = Tbl2.col2
   AND Tbl1.col1 = Tbl2.col1
Does the order of the join conditions make a difference compared to the order of the index? Should I set my index to (Col3, Col2, Col1), or rearrange my join statement to use Col1, Col2, Col3?
Thanks
The SQL Server query optimiser should work it out. No need to change for the example you gave.
This is the simple answer though, and it depends on what columns you are selecting and how you are joining the 3 tables.
Note: I'd personally prefer to change the JOIN around to match a "natural" order. That is, I try to use my columns in the same order (JOIN, WHERE) that matches my keys and/or indexes. As Joel mentioned, it can help later on for troubleshooting.
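For the example in the question, that "natural" order would look like this (SELECT * stands in for whatever column list you actually use):

SELECT *
FROM Tbl1
INNER JOIN Tbl2
    ON Tbl1.col1 = Tbl2.col1
   AND Tbl1.col2 = Tbl2.col2
   AND Tbl1.col3 = Tbl2.col3;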
For querying purposes, it does not matter. You may consider alternate ordering sequences, based on the following:
possible use of the index for other queries (including some with ORDER BY .. one of these columns)
to limit index fragmentation (by using orders that tend to add records towards the end of the table and/or near non-selective parameters)
Edit: on second thoughts, having the most selective column first may help the optimizer, for example by providing it with a better estimated row yield and such... But this important issue may be getting off topic, as the OP's question was whether the order of the join conditions mattered.
If you always have a join on Col1-3, then you should build the index so that the "most distinctive" column is in the first field and the most general ones are in the last fields.
So a "Status OK" / "Status denied" field should be field 3, and an SSN or phone number should be field 1 of the index.